
Deep Nets, Bayes and the story of AI (continued)

David Barber

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Intelligent Machinery

1948 Turing and Champernowne 'paper and pencil' chess

Intelligent Machinery

1951 Prinz mate-in-two moves chess machine

1952 Strachey programs first computer draughts algorithm

Learning Machines

1951 Oettinger makes first program that 'learns'

1955 Samuel adds 'learning' to his draughts algorithm

Logical Intelligence

1968 Risch's algorithm for integration in calculus

1972 Prolog for general logical reasoning

1997 Deep Blue defeats Kasparov

Other forms of intelligence

But is this getting us to where we'd like to be?

Selfridge-Shannon film clip

Speech Recognition

Visual Processing

Natural Language modelling

Planning and decision in uncertain environments

Perhaps a different approach would be useful

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Astonishing Hypothesis Crick

"A person's mental activities are entirely due to the behaviour of nerve cells and the molecules that make them up and influence them."

Neurons

Visual Pathway

Information Processing in Brains

Figure: information flows from the Real World through layers of neurons (Layer 1, Layer 2, feature hierarchy) up to high-level concepts.

Hierarchical, Modular, Binary, Parallel, Noisy

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Artificial Neuron (Perceptron)

Figure: a perceptron combines the inputs (neuron 1 to neuron 7) through weights (weight 1 to weight 7) into a single output neuron.

Training an artificial neural network

Want to generalise to new images with high accuracy

Artificial Network

1957 Rosenblatt's perceptron

perceptron film clip

Connectionism

1960 Realised a perceptron can only solve simple tasks

1970 Decline in interest

1980 New computing power made training multilayer networks feasible

outputinputs

Each node (or 'neuron') computes a function of a weighted combination of parental nodes, $h_j = \sigma\left(\sum_i w_{ij} h_i\right)$.
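As a concrete illustration, here is a minimal numerical sketch of this node computation, assuming a logistic sigmoid for $\sigma$ (the slides do not fix a particular nonlinearity):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, one common choice for the nonlinearity sigma."""
    return 1.0 / (1.0 + np.exp(-a))

def layer_forward(h_in, W):
    """Compute h_j = sigma(sum_i w_ij h_i) for every node j in the next layer.

    h_in : vector of parental node values, shape (n_in,)
    W    : weight matrix, shape (n_in, n_out)
    """
    return sigmoid(h_in @ W)

# Toy usage: 7 input 'neurons' feeding a single output neuron.
rng = np.random.default_rng(0)
inputs = rng.random(7)
weights = rng.normal(size=(7, 1))
print(layer_forward(inputs, weights))
```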

Neural Networks and Deep Learning

Historical Problems with Neural Nets (1990s)

NNs are difficult to train (many local optima)

Particularly difficult to train a NN with a large number of layers (say larger than around 10)

'Gradient Diffusion Problem' – difficult to assign responsibility for errors to individual 'neurons'

Machine Learning (up to 2006)

A large section of the machine learning community abandoned NNs

More principled and computationally better understood techniques (SVMs and related convex methods) replaced them

Bayesian AI (1990s onwards)

From the mid 1990s there was a realisation that pattern recognition is not sufficient for all AI purposes

Uncertainty and reasoning are not naturally representable using standardfeed-forward nets

Explosion in more 'symbolic' Bayesian AI

Deep Learning

NNs have resurged in interest in the last few years (Hinton, Bengio, ...)

Also called 'deep learning'

Sense that very complex tasks (object recognition, learning complex structure in data) require going beyond simple (convex) statistical techniques

The brain uses hierarchical distributed processing, and it is likely to be for a good reason

Many problems have a hierarchical structure: images are made of parts, language is hierarchical, etc.

Why now?

New computing resources (GPU processing)

Availability of large amounts of data means that we can train nets with many parameters ($10^{10}$)

Recent evidence suggests local optima are not particularly problematic

Autoencoder

Figure: an autoencoder network mapping inputs $y_1, \ldots, y_5$ through hidden layers with a bottleneck ($h_1, \ldots, h_8$) back to reconstructions $y_1, \ldots, y_5$.

The bottleneck forces the network to try to find a low dimensional representation of the data

Useful for unsupervised learning

Autoencoder on MNIST digits (Hinton, 2006, Science)

Figure: Reconstructions using H = 30 components. From the top: original image, Autoencoder1, Autoencoder2, PCA.

60,000 training images ($28 \times 28 = 784$ pixels)

Use a form of autoencoder to find a lower (30) dimensional representation

At the time, the special layerwise training procedure was considered fundamental to the success of this approach. It is now not deemed necessary, provided we use a sensible initialisation.

Google Cats

10 million YouTube video frames (200×200 pixel images)

Use a specialised autoencoder with 9 layers (1 billion weights)

2000 computers + two weeks of computing

Examine units to see what images they most respond to

Google Autoencoder

From Nando De Freitas

Convolutional NNs

CNNs are particularly popular in image processing

Often the feature maps correspond not to macro features (such as bicycles) but to micro features

For example, in handwritten digit recognition they correspond to small constituent parts of the digits

These are then used to process the image into a representation that is better for recognition

NNs in NLP

Bag of Words

We have D words in a dictionary (aardvark, ..., zorro), so that we can relate each word to its dictionary index

We can also think of this as a Euclidean (one-hot) embedding e:

$\text{aardvark} \rightarrow e_{\text{aardvark}} = (1, 0, \ldots, 0)^{\mathsf{T}}, \qquad \text{zorro} \rightarrow e_{\text{zorro}} = (0, \ldots, 0, 1)^{\mathsf{T}}$

Word Embeddings

The idea is to replace the Euclidean embeddings e with embeddings (vectors) v that are learned

The objective is, for example, next-word prediction accuracy

These are often called 'neural language models'

NNs in NLP

Each word w in the dictionary has an associated embedding vector $v_w$. Usually around 200-dimensional vectors are used.

Consider the sentence

the cat sat on the mat

and that we wish to predict the word 'on' given the two preceding words 'cat sat' and the two succeeding words 'the mat'.

We can use a network that has inputs $v_{\text{cat}}, v_{\text{sat}}, v_{\text{the}}, v_{\text{mat}}$

The output of the network is a probability over all words in the dictionary, $p(w | v_{\text{inputs}})$. We want $p(w = \text{on} | v_{\text{cat}}, v_{\text{sat}}, v_{\text{the}}, v_{\text{mat}})$ to be high.

The overall objective is then to learn all the word embeddings and network parameters subject to predicting the word correctly based on the context
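The following is a minimal sketch of such a prediction network, assuming a toy five-word dictionary and a simple 'average the context embeddings, then softmax' architecture; the real models referenced here are larger and trained by gradient-based optimisation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]          # toy dictionary (illustrative only)
word_to_idx = {w: i for i, w in enumerate(vocab)}
D, E = len(vocab), 8                                 # vocabulary size, embedding dimension

V = rng.normal(scale=0.1, size=(D, E))               # learned word embeddings v_w
W = rng.normal(scale=0.1, size=(E, D))               # output layer mapping to word scores

def predict(context_words):
    """p(w | context): combine the context embeddings, then softmax over the dictionary."""
    h = V[[word_to_idx[w] for w in context_words]].mean(axis=0)
    scores = h @ W
    p = np.exp(scores - scores.max())
    return p / p.sum()

p = predict(["cat", "sat", "the", "mat"])
print("p(w = 'on' | context) =", p[word_to_idx["on"]])
# Training would adjust V and W, e.g. by gradient ascent on log p of the observed word.
```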

Word Embeddings

Given a word (France, for example) we can find which words w have embedding vectors closest to $v_{\text{France}}$. From Ronan Collobert (2011).

Word Embeddings

There appears to be a natural 'geometry' to the embeddings. For example, there are directions that correspond to gender:

$v_{\text{woman}} - v_{\text{man}} \approx v_{\text{aunt}} - v_{\text{uncle}}, \qquad v_{\text{woman}} - v_{\text{man}} \approx v_{\text{queen}} - v_{\text{king}}$

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France–Paris, we get the 'relationship' embedding

$v = v_{\text{Paris}} - v_{\text{France}}$

Given Italy, we can calculate $v_{\text{Italy}} + v$ and find the word in the dictionary which has the closest embedding to this (it turns out to be Rome). From Mikolov (2013).
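A hedged sketch of this nearest-embedding lookup, using tiny hand-made vectors purely for illustration (real embeddings $v_w$ come from training a neural language model):

```python
import numpy as np

def closest_word(query, embeddings, exclude=()):
    """Return the word whose embedding has highest cosine similarity to `query`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cos(candidates[w], query))

# Toy hand-made embeddings purely for illustration.
emb = {
    "France": np.array([1.0, 0.0, 0.2]), "Paris": np.array([1.0, 1.0, 0.2]),
    "Italy":  np.array([0.0, 0.0, 1.0]), "Rome":  np.array([0.0, 1.0, 1.0]),
}
v = emb["Paris"] - emb["France"]                 # the 'capital-of' direction
print(closest_word(emb["Italy"] + v, emb, exclude={"Italy", "Paris"}))  # -> Rome
```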

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for Chinese words. However, when we know that a Chinese and an English word have a similar meaning, we add a constraint that the word embeddings $v_{\text{ChineseWord}}$ and $v_{\text{EnglishWord}}$ should be close. We have only a small amount of labelled 'similar' Chinese-English words (these are the green border boxes in the above; they are standard translations of the corresponding Chinese character). We can visualise the embedding vectors in 2D (using t-SNE). See Socher (2013).

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank: consists of parsed sentences with sentiment labels (−−, −, 0, +, ++) for each node (phrase) in the tree; 215,000 labelled phrases (obtained from three humans).

Recursive Nets and Embeddings

The idea is to recursively combine embeddings such that they accurately predict the sentiment at each node

Recursive Nets and Embeddings

Training

We have a softmax classifier for each node in the tree to predict the sentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree, the inputs to the classifiers are the word embeddings

The embeddings are combined by another network g with common parameters, which forms the input to the sentiment classifier

We then learn all the embeddings, shared classifier parameters and shared combination parameters to maximise the classification accuracy
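A minimal sketch of this recursive setup follows; the tanh combination network, the embedding dimension and the five sentiment classes are illustrative assumptions (Socher's actual model is a recursive neural tensor network):

```python
import numpy as np

rng = np.random.default_rng(0)
E, C = 16, 5                                   # embedding dim, number of sentiment classes
Wg = rng.normal(scale=0.1, size=(2 * E, E))    # shared combination network g
Ws = rng.normal(scale=0.1, size=(E, C))        # shared softmax sentiment classifier

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def node_embedding(node, word_emb):
    """Leaf: the word embedding. Internal node: g applied to the children's embeddings."""
    if isinstance(node, str):
        return word_emb[node]
    left, right = node
    children = np.concatenate([node_embedding(left, word_emb),
                               node_embedding(right, word_emb)])
    return np.tanh(children @ Wg)

def node_sentiment(node, word_emb):
    """Shared classifier predicting the sentiment distribution of the phrase under `node`."""
    return softmax(node_embedding(node, word_emb) @ Ws)

# Parse tree for "not very good" as nested tuples (a toy example of a parsed phrase).
word_emb = {w: rng.normal(scale=0.1, size=E) for w in ["not", "very", "good"]}
print(node_sentiment(("not", ("very", "good")), word_emb))
```

Training would adjust Wg, Ws and the word embeddings by gradient ascent on the log probability of the labelled sentiment at every node.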

Prediction

For a new movie review, the review is first parsed using a standard grammar tree parser

This forms the tree, which can be used to recursively form the sentiment class label for the review

Currently the best sentiment classifier; see Socher (2013)

Recursive Nets and Embeddings

Figure (from Socher 2013): RNTN prediction of positive and negative (bottom right) sentences and their negation, e.g. 'Roger Dodger is one of the most compelling variations on this theme' versus '... one of the least compelling variations ...'.

Recurrent Nets

Figure: a recurrent net unrolled through time, with inputs $x_1, x_2, x_3$, hidden states $h_1, h_2, h_3$ and outputs $y_1, y_2, y_3$; the weights A (input to hidden), B (hidden to hidden) and C (hidden to output) are shared across time steps.

RNNs are used in timeseries applications

The basic idea is that the hidden units $h_t$ at time t (and possibly the output $y_t$) depend on the previous state of the network $h_{t-1}, x_{t-1}, y_{t-1}$, for inputs $x_t$ and outputs $y_t$

In the above network I 'unrolled the net through time' to give a standard NN diagram

I omitted the potential links from $x_{t-1}, y_{t-1}$ to $h_t$
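A minimal sketch of such an unrolled recurrent net, assuming (to match the repeated edge labels in the diagram) that A maps the input to the hidden units, B carries the hidden state forward and C produces the output:

```python
import numpy as np

def rnn_forward(xs, A, B, C, h0):
    """Unroll a simple recurrent net through time.

    h_t = tanh(A x_t + B h_{t-1})   (hidden state depends on the previous state)
    y_t = C h_t                     (output at time t)
    """
    h, ys = h0, []
    for x in xs:
        h = np.tanh(A @ x + B @ h)
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 5, 2, 4
A = rng.normal(scale=0.3, size=(n_hid, n_in))
B = rng.normal(scale=0.3, size=(n_hid, n_hid))
C = rng.normal(scale=0.3, size=(n_out, n_hid))
xs = rng.normal(size=(T, n_in))
print(rnn_forward(xs, A, B, C, h0=np.zeros(n_hid)).shape)   # (T, n_out)
```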

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples. The top line is real handwriting for comparison. See Alex Graves's work.

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff?

AutoDiff takes a function $f(x)$ and returns an exact value (up to machine accuracy) for the gradient

$g_i(x) \equiv \left.\frac{\partial f}{\partial x_i}\right|_{x}$

Note that this is not the same as a numerical approximation (such as central differences) for the gradient

One can show that, if done efficiently, one can always calculate the gradient in less than 5 times the time it takes to compute f(x)

Reverse Differentiation

A useful graphical representation is that the total derivative of $f$ with respect to $x$ is given by the sum over all path values from $x$ to $f$, where each path value is the product of the partial derivatives of the functions on the edges:

$\frac{df}{dx} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial g}\frac{dg}{dx}$

Figure: computation graph with nodes $x$, $g$, $f$ and edge derivatives $\partial f/\partial x$, $dg/dx$, $\partial f/\partial g$.

Example

For $f(x) = x^2 + xgh$, where $g = x^2$ and $h = xg^2$:

Figure: computation graph over the nodes $x$, $g$, $h$, $f$ with edge derivatives $\frac{\partial f}{\partial x} = 2x + gh$, $\frac{dg}{dx} = 2x$, $\frac{\partial f}{\partial g} = xh$, $\frac{\partial h}{\partial g} = 2gx$, $\frac{\partial f}{\partial h} = xg$, $\frac{\partial h}{\partial x} = g^2$.

$f'(x) = (2x + gh) + (g^2 \cdot xg) + (2x \cdot 2gx \cdot xg) + (2x \cdot xh) = 2x + 8x^7$

Reverse Differentiation

Consider

$f(x_1, x_2) = \cos(\sin(x_1 x_2))$

We can represent this computationally using an Abstract Syntax Tree (AST) with nodes $x_1, x_2, f_1, f_2, f_3$, where

$f_1(x_1, x_2) = x_1 x_2, \qquad f_2(x) = \sin(x), \qquad f_3(x) = \cos(x)$

Given values for $x_1, x_2$ we first run forwards through the tree, so that we can associate each node with an actual function value.

Reverse Differentiation

$\frac{df_3}{dx_1} = \frac{\partial f_3}{\partial f_2}\frac{df_2}{dx_1} = \underbrace{\frac{\partial f_3}{\partial f_2}\frac{df_2}{df_1}}_{df_3/df_1}\frac{df_1}{dx_1}$

Similarly,

$\frac{df_3}{dx_2} = \underbrace{\frac{\partial f_3}{\partial f_2}\frac{df_2}{df_1}}_{df_3/df_1}\frac{df_1}{dx_2}$

The two derivatives share the same computation branch, and we want to exploit this.

Reverse Differentiation

The local derivatives at the nodes of the AST are

$\frac{\partial f_1}{\partial x_1} = x_2, \qquad \frac{\partial f_1}{\partial x_2} = x_1, \qquad \frac{\partial f_2}{\partial f_1} = \cos(f_1), \qquad \frac{\partial f_3}{\partial f_2} = -\sin(f_2)$

1. Find the reverse ancestral (backwards) schedule of nodes $(f_3, f_2, f_1, x_1, x_2)$.

2. Start with the first node $n_1$ in the reverse schedule and define $t_{n_1} = 1$.

3. For the next node $n$ in the reverse schedule, find the child nodes $\text{ch}(n)$. Then define

$t_n = \sum_{c \in \text{ch}(n)} \frac{\partial f_c}{\partial f_n} t_c$

4. The total derivatives of $f$ with respect to the root nodes of the tree (here $x_1$ and $x_2$) are given by the values of $t$ at those nodes.

This is a general procedure that can be used to automatically define a subroutine to efficiently compute the gradient. It is efficient because information is collected at nodes in the tree and split between parents only when required.
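A minimal sketch of this reverse procedure applied to the example $f(x_1, x_2) = \cos(\sin(x_1 x_2))$; the hand-written schedule below is what an AutoDiff tool would generate automatically:

```python
import numpy as np

def f_and_grad(x1, x2):
    """Reverse-mode differentiation of f(x1, x2) = cos(sin(x1 * x2)),
    following the schedule above: a forward pass to get node values, then
    accumulate t_n = sum over children c of (df_c/df_n) * t_c backwards."""
    # Forward pass through the AST: associate each node with its value.
    f1 = x1 * x2
    f2 = np.sin(f1)
    f3 = np.cos(f2)

    # Reverse pass over the schedule (f3, f2, f1, x1, x2).
    t_f3 = 1.0                      # df3/df3
    t_f2 = -np.sin(f2) * t_f3       # df3/df2 * t_f3
    t_f1 = np.cos(f1) * t_f2        # df2/df1 * t_f2
    t_x1 = x2 * t_f1                # df1/dx1 * t_f1
    t_x2 = x1 * t_f1                # df1/dx2 * t_f1
    return f3, (t_x1, t_x2)

value, grad = f_and_grad(0.3, 0.7)
print(value, grad)

# Check against central differences (AutoDiff is exact, this is only approximate).
eps = 1e-6
num = (f_and_grad(0.3 + eps, 0.7)[0] - f_and_grad(0.3 - eps, 0.7)[0]) / (2 * eps)
print(num)
```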

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another, and requires complex reasoning using some form of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos – position in kitchen; snd – sound

Finding the Burglar

Figure: a sequence of slides showing the creaks and bumps observed over time on the kitchen grid, from which the burglar's position is inferred.

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int – intended key; hit – hit key

Stubby Fingers errors

Figure: matrix of error probabilities p(hit key | intended key) over the letters a–z.

Stubby Fingers language

Figure: language model transition probabilities p(int_t | int_{t−1}) over the letters a–z.

Stubby Fingers

Given the typed sequence 'cwsykcak', what is the most likely word that this corresponds to?

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition: raw signal

Figure: raw audio waveform (amplitude against time in seconds).

'Neural' representation

Figure: the corresponding 'neural' representation of the audio signal (feature index against time frame).

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho – phoneme (letter); aud – audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient-specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems, such as the Ising Model (1920), and in AI applications such as the HMM (Baum 1966, Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.)

Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects can interact, probability is a non-starter

For this reason, computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however these are typically frowned on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact with each other

Graphical Models are then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable

Uses in Industry

Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis

Google, Microsoft, Facebook: used in many places, including advertising, video game prediction, speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company/user relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

$p(x|y) \equiv \frac{p(x, y)}{p(y)} = \frac{p(y|x)\,p(x)}{p(y)} \qquad \text{(Bayes' rule)}$

Throwing darts

$p(\text{region 5}\,|\,\text{not region 20}) = \frac{p(\text{region 5}, \text{not region 20})}{p(\text{not region 20})} = \frac{p(\text{region 5})}{p(\text{not region 20})} = \frac{1/20}{19/20} = \frac{1}{19}$

Interpretation

p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation should be 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.

Battleships

Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each

Can be placed anywhere on the 10×10 grid, but cannot overlap

Let $s_1$ be the origin of ship 1 and $s_2$ the origin of ship 2

Data D is a collection of query 'hit' or 'miss' responses

$p(s_1, s_2|D) = \frac{p(D|s_1, s_2)\,p(s_1, s_2)}{p(D)}$

Let X be the matrix of pixel occupancy:

$p(X|D) = \sum_{s_1, s_2} p(X, s_1, s_2|D) = \sum_{s_1, s_2} p(X|s_1, s_2)\,p(s_1, s_2|D)$

demoBattleships.m
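demoBattleships.m itself is not reproduced here; the following is a hedged brute-force sketch of the same computation, assuming noise-free hit/miss observations and a uniform prior over valid placements:

```python
import numpy as np
from itertools import product

N, L = 10, 5   # grid size, ship length

def mask(origin, vertical):
    """Occupancy grid for a ship of length L placed at `origin`, or None if off-grid."""
    r, c = origin
    m = np.zeros((N, N), dtype=bool)
    if vertical:
        if r + L > N:
            return None
        m[r:r + L, c] = True
    else:
        if c + L > N:
            return None
        m[r, c:c + L] = True
    return m

def occupancy_posterior(observations):
    """p(pixel occupied | D) for D a dict {(row, col): hit (True) or miss (False)},
    summing over all non-overlapping placements consistent with the data."""
    post = np.zeros((N, N))
    n_consistent = 0
    for s1, s2 in product(product(range(N), range(N)), repeat=2):
        m1, m2 = mask(s1, vertical=True), mask(s2, vertical=False)
        if m1 is None or m2 is None or (m1 & m2).any():
            continue                                   # invalid or overlapping placement
        occ = m1 | m2
        if all(occ[q] == hit for q, hit in observations.items()):
            post += occ                                # uniform prior over valid placements
            n_consistent += 1
    return post / n_consistent

print(np.round(occupancy_posterior({(0, 0): False, (4, 4): True}), 2))
```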

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditional probabilities:

$p(A, B, C, D, E) = p(A)\,p(B)\,p(C|A, B)\,p(D|C)\,p(E|B, C)$

Figure: the corresponding directed acyclic graph, with parents A, B of C, parent C of D, and parents B, C of E.

Example – Part I

Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.

Choosing an ordering

Without loss of generality we can write

$p(A, R, E, B) = p(A|R, E, B)\,p(R, E, B) = p(A|R, E, B)\,p(R|E, B)\,p(E, B) = p(A|R, E, B)\,p(R|E, B)\,p(E|B)\,p(B)$

Assumptions

The alarm is not directly influenced by any report on the radio: $p(A|R, E, B) = p(A|E, B)$. The radio broadcast is not directly influenced by the burglar variable: $p(R|E, B) = p(R|E)$. Burglaries don't directly 'cause' earthquakes: $p(E|B) = p(E)$.

Therefore

$p(A, R, E, B) = p(A|E, B)\,p(R|E)\,p(E)\,p(B)$

Example – Part II: Specifying the Tables

Figure: the belief network with edges B → A, E → A and E → R.

p(A|B, E):

Alarm = 1   Burglar   Earthquake
0.9999      1         1
0.99        1         0
0.99        0         1
0.0001      0         0

p(R|E):

Radio = 1   Earthquake
1           1
0           0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.

Example – Part III: Inference

Initial Evidence The alarm is sounding

$p(B=1|A=1) = \frac{\sum_{E,R} p(B=1, E, A=1, R)}{\sum_{B,E,R} p(B, E, A=1, R)} = \frac{\sum_{E,R} p(A=1|B=1, E)\,p(B=1)\,p(E)\,p(R|E)}{\sum_{B,E,R} p(A=1|B, E)\,p(B)\,p(E)\,p(R|E)} \approx 0.99$

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives $p(B=1|A=1, R=1) \approx 0.01$

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
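A minimal sketch of this inference by brute-force enumeration over the four binary variables, using the tables above (the outputs reproduce the quoted ≈ 0.99 and ≈ 0.01):

```python
from itertools import product

# Conditional probability tables from the slides.
p_B = {1: 0.01, 0: 0.99}
p_E = {1: 0.000001, 0: 0.999999}
p_A = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}  # p(A=1 | B, E)
p_R = {1: 1.0, 0: 0.0}                                              # p(R=1 | E)

def joint(b, e, a, r):
    """p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)."""
    pa = p_A[(b, e)] if a == 1 else 1 - p_A[(b, e)]
    pr = p_R[e] if r == 1 else 1 - p_R[e]
    return pa * pr * p_E[e] * p_B[b]

def posterior_burglar(evidence):
    """p(B=1 | evidence) by summing the joint over the unobserved variables."""
    num = den = 0.0
    for b, e, a, r in product([0, 1], repeat=4):
        state = {"B": b, "E": e, "A": a, "R": r}
        if any(state[k] != v for k, v in evidence.items()):
            continue
        p = joint(b, e, a, r)
        den += p
        if b == 1:
            num += p
    return num / den

print(posterior_burglar({"A": 1}))            # ~ 0.99
print(posterior_burglar({"A": 1, "R": 1}))    # ~ 0.01
```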

Markov Models

For timeseries data $v_1, \ldots, v_T$ we need a model $p(v_{1:T})$. For causal consistency it is meaningful to consider the decomposition

$p(v_{1:T}) = \prod_{t=1}^{T} p(v_t|v_{1:t-1})$

with the convention $p(v_t|v_{1:t-1}) = p(v_1)$ for t = 1.

v1 v2 v3 v4

Independence assumptions

It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.

Markov Chain

Only the recent past is relevant

$p(v_t|v_1, \ldots, v_{t-1}) = p(v_t|v_{t-L}, \ldots, v_{t-1})$

where $L \ge 1$ is the order of the Markov chain:

$p(v_{1:T}) = p(v_1)\,p(v_2|v_1)\,p(v_3|v_2) \cdots p(v_T|v_{T-1})$

For a stationary Markov chain the transitions $p(v_t = s'|v_{t-1} = s) = f(s', s)$ are time-independent ('homogeneous')

Figure: (a) First order Markov chain; (b) Second order Markov chain.

Markov Chains

$p(v_1, \ldots, v_T) = \underbrace{p(v_1)}_{\text{initial}} \prod_{t=2}^{T} \underbrace{p(v_t|v_{t-1})}_{\text{transition}}$

State transition diagram

Nodes represent states of the variable v, and arcs non-zero elements of the transition $p(v_t|v_{t-1})$.

Figure: state transition diagram over states 1–9.

Most probable and shortest paths

Figure: the same state transition diagram over states 1–9.

The shortest (unweighted) path from state 1 to state 7 is 1−2−7.

The most probable path from state 1 to state 7 is 1−8−9−7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1−2−7 the probability of exiting state 2 into state 7 is 1/5.

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

$p(x_t = i) = \sum_j \underbrace{p(x_t = i|x_{t-1} = j)}_{M_{ij}}\, p(x_{t-1} = j)$

$p(x_t = i)$ is the frequency that we visit state i at time t, given we started from $p(x_1)$ and randomly drew samples from the transition $p(x_\tau|x_{\tau-1})$. As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution $p_1(i)$ is

$p_t = M^{t-1} p_1$

If, for $t \rightarrow \infty$, $p_\infty$ is independent of the initial distribution $p_1$, then $p_\infty$ is called the equilibrium distribution of the chain:

$p_\infty = M p_\infty$

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.

PageRank

Define the matrix

$A_{ij} = \begin{cases} 1 & \text{if website } j \text{ has a hyperlink to website } i \\ 0 & \text{otherwise} \end{cases}$

From this we can define a Markov transition matrix with elements

$M_{ij} = \frac{A_{ij}}{\sum_{i'} A_{i'j}}$

If we jump from website to website, the equilibrium distribution component $p_\infty(i)$ is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain that word is then returned, ranked according to the importance of the site.
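A minimal sketch of the equilibrium-distribution computation by power iteration on a toy link matrix (illustrative only; practical PageRank also adds a damping term to handle dangling pages and guarantee convergence):

```python
import numpy as np

def pagerank(A, n_iter=100):
    """Equilibrium distribution of the chain M_ij = A_ij / sum_i' A_i'j,
    computed by repeatedly applying p <- M p (power iteration).

    A[i, j] = 1 if website j links to website i, else 0.
    """
    M = A / A.sum(axis=0, keepdims=True)       # column-normalise: each column sums to 1
    p = np.full(A.shape[0], 1.0 / A.shape[0])  # start from a uniform distribution
    for _ in range(n_iter):
        p = M @ p
    return p                                    # satisfies p = M p at convergence

# Toy link structure over 4 websites (illustrative only).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(np.round(pagerank(A), 3))
```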

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables $h_{1:T}$. The observed (or 'visible') variables are dependent on the hidden variables through an emission $p(v_t|h_t)$. This defines a joint distribution

$p(h_{1:T}, v_{1:T}) = p(v_1|h_1)\,p(h_1) \prod_{t=2}^{T} p(v_t|h_t)\,p(h_t|h_{t-1})$

For a stationary HMM the transition $p(h_t|h_{t-1})$ and emission $p(v_t|h_t)$ distributions are constant through time

Figure: a first order hidden Markov model with 'hidden' variables $h_1, \ldots, h_4$, $\text{dom}(h_t) = \{1, \ldots, H\}$, $t = 1, \ldots, T$. The 'visible' variables $v_t$ can be either discrete or continuous.

The classical inference problems

Filtering (inferring the present): $p(h_t|v_{1:t})$

Prediction (inferring the future): $p(h_t|v_{1:s})$, $t > s$

Smoothing (inferring the past): $p(h_t|v_{1:u})$, $t < u$

Likelihood: $p(v_{1:T})$

Most likely path (Viterbi alignment): $\arg\max_{h_{1:T}} p(h_{1:T}|v_{1:T})$

For prediction, one is also often interested in $p(v_t|v_{1:s})$ for $t > s$
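As an illustration of the first of these problems, here is a minimal sketch of filtering (the forward recursion) for a discrete HMM, with illustrative numbers; each update costs $O(H^2)$ and the whole recursion is linear in T:

```python
import numpy as np

def hmm_filter(obs, p_h1, transition, emission):
    """Filtering p(h_t | v_{1:t}) for a discrete HMM.

    obs        : list of observed symbols v_1, ..., v_T (integer indices)
    p_h1       : p(h_1), shape (H,)
    transition : transition[i, j] = p(h_t = i | h_{t-1} = j), shape (H, H)
    emission   : emission[v, i]   = p(v_t = v | h_t = i),     shape (V, H)
    """
    alphas = []
    alpha = emission[obs[0]] * p_h1                 # proportional to p(h_1, v_1)
    alphas.append(alpha / alpha.sum())
    for v in obs[1:]:
        alpha = emission[v] * (transition @ alpha)  # one O(H^2) update per time step
        alphas.append(alpha / alpha.sum())
    return np.array(alphas)                         # row t is p(h_t | v_{1:t})

# Toy two-state HMM (illustrative numbers only).
p_h1 = np.array([0.5, 0.5])
transition = np.array([[0.9, 0.2], [0.1, 0.8]])
emission = np.array([[0.7, 0.1], [0.3, 0.9]])
print(hmm_filter([0, 0, 1, 1], p_h1, transition, emission))
```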

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes)

HMMs for speech recognition

$h_t$ is the phoneme at time t; $p(h_t|h_{t-1})$ – language model; $p(v_t|h_t)$ – speech signal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently, companies including Google have made big advances in speech recognition

The breakthrough is to model $p(v_t|h_t)$ as a Gaussian whose mean is some function of the phoneme, $\mu(h_t, \theta)$

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in these models

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method – much faster for inference

Variational Inference

Consider a distribution

$p(v|\theta) = \int_h p(v|h, \theta)\,p(h)$

and that we wish to learn θ to maximise the probability this model generates observed data:

$\log p(v|\theta) \ge -\int_h q(h|v, \phi)\log q(h|v, \phi) + \int_h q(h|v, \phi)\log p(v|h, \theta) + \text{const.}$

The idea is to choose a 'variational' distribution $q(h|v, \phi)$ such that we can either calculate the bound analytically or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach – see 'variational autoencoder' and also attention mechanisms (a sketch of estimating the bound follows below)

Extension to semi-supervised methods using $p(v) = \int_h \sum_c p(v|h, c)\,p(c)\,p(h)$
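A minimal sketch of estimating the (unsupervised) bound above by sampling from q, in the spirit of the variational autoencoder; the diagonal Gaussian q, standard normal prior p(h) and toy linear encoder/decoder below are illustrative assumptions standing in for deep networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(v, encode, decode_logp, n_samples=10):
    """Monte Carlo estimate of the bound

        log p(v|theta) >= E_q[ log p(v|h, theta) + log p(h) - log q(h|v, phi) ]

    with q(h|v, phi) = N(mu(v), sigma(v)^2) and a standard normal prior p(h)."""
    mu, log_sigma = encode(v)
    sigma = np.exp(log_sigma)
    bound = 0.0
    for _ in range(n_samples):
        h = mu + sigma * rng.standard_normal(mu.shape)      # reparameterised sample from q
        log_q = -0.5 * np.sum(((h - mu) / sigma) ** 2 + 2 * log_sigma + np.log(2 * np.pi))
        log_prior = -0.5 * np.sum(h ** 2 + np.log(2 * np.pi))
        bound += decode_logp(v, h) + log_prior - log_q
    return bound / n_samples

# Toy linear 'encoder' and Gaussian 'decoder' standing in for deep networks.
W_enc = rng.normal(scale=0.1, size=(4, 2))
W_dec = rng.normal(scale=0.1, size=(2, 4))
encode = lambda v: (v @ W_enc, np.zeros(2))
decode_logp = lambda v, h: -0.5 * np.sum((v - h @ W_dec) ** 2)   # log p(v|h) up to a constant
v = rng.normal(size=4)
print(elbo_estimate(v, encode, decode_logp))
```

Jointly maximising this estimate with respect to the encoder and decoder parameters is the training procedure referred to above.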

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals

The problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deep generative model)

Then learn which action to take given the low dimensional representation
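The slides do not spell out the learning algorithm (DeepMind's Atari work uses deep Q-learning); as a minimal illustration of 'learn which action to take given a state representation', here is tabular Q-learning on a toy chain world:

```python
import numpy as np

def q_learning(n_states=5, n_actions=2, n_episodes=2000,
               alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy chain world: action 1 moves right, action 0 moves left,
    and reaching the right end gives reward 1. The states stand in for a learned
    low dimensional representation of the screen."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = 0
        for _ in range(50):                                   # episode length cap
            # epsilon-greedy: mostly act greedily, sometimes explore.
            a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0
            # Q-learning update towards the one-step lookahead target.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a])
            s = s_next
            if done:
                break
    return Q

print(np.round(q_learning(), 2))   # the 'move right' action should dominate in every state
```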

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 2: David Barber - Deep Nets, Bayes and the story of AI

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Intelligent Machinery

1948 Turing and Champernowne lsquopaper and pencilrsquo chess

Intelligent Machinery

1951 Prinz mate-in-two moves chess machine

1952 Strachey programs first computer draughts algorithm

Learning Machines

1951 Oettinger makes first program that lsquolearnsrsquo

1955 Samuel adds lsquolearningrsquo to his draughts algorithm

Logical Intelligence

1968 Rischrsquos algorithm for integration in calculus

1972 Prolog for general logical reasoning

1997 Deep Blue defeats Kasparov

Other forms of intelligence

But is this getting us to where wersquod like to beSelfridge-Shannon film clip

Speech Recognition

Visual Processing

Natural Language modelling

Planning and decision in uncertain environments

Perhaps a different approach would be useful

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Astonishing Hypothesis Crick

ldquoA personrsquos mental activities are entirely due to the behaviour of nervecells and the molecules that make them up and influence themrdquo

Neurons

Visual Pathway

Information Processing in Brains

Neurons

Re

al

Wo

rld

Layer 1 Layer 2 Highminuslevel

Concepts

Feature

Hierarchical Modular Binary Parallel Noisy

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Artificial Neuron (Perceptron)

weight 7

output neuron

neuron 1neuron 2neuron 3neuron 4

neuron 7neuron 6neuron 5

inputs

weight 1

Training an artificial neural network

Want to generalise to new images with high accuracy

Artificial Network

1957 Rosenblattrsquos perceptron

perceptron film clip

Connectionism

1960 Realised a perceptron can only solve simple tasks

1970 Decline in interest

1980 New computing power made training multilayer networks feasible

outputinputs

Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(

sumi wijhi)

Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)

NNs are difficult to train (many local optima)

Particularly difficult to train a NN with a large number of layers (say largerthan around 10)

lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo

Machine Learning (up to 2006)

A large section of the machine learning community abandoned NNs

More principled and computationally better understood techniques (SVMsand related convex methods) replaced them

Bayesian AI (1990s onwards)

From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes

Uncertainty and reasoning are not naturally representable using standardfeed-forward nets

Explosion in more lsquosymbolicrsquo Bayesian AI

Deep Learning

NNs have resurged in interest in the last few years (Hinton Bengio )

Also called lsquodeep learningrsquo

Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques

The brain uses hierarchical distributed processing and it is likely to be for agood reason

Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc

Why now

New computing resources (GPU processing)

Availability of large amount of data means that we can train nets with manyparameters (1010)

Recent evidence suggests local optima are not particularly problematic

Autoencoder

y1 y2 y3 y4 y5

h1 h2 h3

h4 h5

y1 y2 y3 y4 y5

h6 h7 h8

The bottleneck forces the network to try to find a low dimensionalrepresentation of the data

Useful for unsupervised learning

Autoencoder on MNIST digits (Hinton 2006 Science)

Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA

60000 training images (28times 28 = 784 pixels)

Use a form of autoencoder to find a lower (30) dimensional representation

At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation

Google Cats

10 Million Youtube video frames (200x200 pixel images)

Use a specialised autoencoder with 9 layers (1 billion weights)

2000 computers + two weeks of computing

Examine units to see what images they most respond to

Google Autoencoder

From Nando De Freitas

Convolutional NNs

CNNs are particularly popular in image processing

Often the feature maps correspond (not to macro features such as bicycles)but micro features

For example in handwritten digit recognition they correspond to smallconstituent parts of the digits

These are used then to process the image into a representation that is betterfor recognition

NNs in NLP

Bag of Words

We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index

We can also think of this as a Euclidean embedding e

aardvarkrarr eaardvark =

100

zorrorarr ezorro =

001

Word Embeddings

Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned

Objective is for example next word prediction accuracy

These are often called lsquoneural language modelsrsquo

NNs in NLP

Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used

Consider the sentence

the cat sat on the mat

and that we wish to predict the word on given the two preceding cat sat

and two succeeding words the mat

We can use a network that has inputs vcat vsat vthe vmat

The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high

The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context

Word Embeddings

Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)

Word Embeddings

There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender

vwoman minus vman asymp vaunt minus vuncle

vwoman minus vman asymp vqueen minus vking

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France-Paris we get the lsquorelationshiprsquo embedding

v = vParis minus vFrance

Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

Figure: a belief network in which latent variables h_1, h_2 generate the visible variables v_1, ..., v_4.

It is natural to consider that objects (images for example) can be constructed on the basis of a low dimensional representation.

Note that this is a Graphical Model, not a Function.

The latent variables h can be sampled using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.
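For example, a sketch of this two-stage generation with a Gaussian prior p(h) and a Bernoulli pixel model p(v|h); all parameters below are illustrative:

import numpy as np

rng = np.random.default_rng(2)
H, D = 2, 16                       # latent dimension, number of pixels

W = rng.normal(size=(D, H))        # toy decoder parameters for p(v|h)
b = np.zeros(D)

def sample_image():
    h = rng.normal(size=H)                             # h ~ p(h) = N(0, I)
    probs = 1.0 / (1.0 + np.exp(-(W @ h + b)))         # pixel-on probabilities
    return (rng.uniform(size=D) < probs).astype(int)   # v ~ p(v|h)

print(sample_image().reshape(4, 4))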

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in these models.

Statisticians typically use sampling as an approximation.

Very popular in ML to use a variational method – much faster for inference.

Variational Inference

Consider a distribution

p(v|\theta) = \int_h p(v|h, \theta) p(h)

and that we wish to learn \theta to maximise the probability this model generates observed data.

\log p(v|\theta) \ge -\int_h q(h|v, \phi) \log q(h|v, \phi) + \int_h q(h|v, \phi) \log \left[ p(v|h, \theta) p(h) \right]

Idea is to choose a 'variational' distribution q(h|v, \phi) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound with respect to \phi and \theta.

We can parameterise p(v|h, \theta) using a deep network.

Very popular approach – see the 'variational autoencoder' and also attention mechanisms.
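A rough sketch of how the bound can be evaluated (a Monte Carlo estimate with a Gaussian q(h|v, \phi) and a toy Bernoulli decoder for p(v|h, \theta); all names and parameters are illustrative and the training loop is omitted):

import numpy as np

rng = np.random.default_rng(3)
H, D = 2, 16

# Illustrative decoder p(v|h, theta): a small network mapping h to pixel probabilities
W, b = rng.normal(size=(D, H)) * 0.5, np.zeros(D)

def log_p_v_given_h(v, h):
    probs = 1.0 / (1.0 + np.exp(-(W @ h + b)))
    return np.sum(v * np.log(probs) + (1 - v) * np.log(1 - probs))

def log_p_h(h):                        # standard normal prior p(h)
    return -0.5 * np.sum(h**2) - 0.5 * H * np.log(2 * np.pi)

def elbo(v, mu, log_var, n_samples=100):
    # q(h|v, phi) = N(mu, diag(exp(log_var))); Monte Carlo estimate of the bound
    std = np.exp(0.5 * log_var)
    total = 0.0
    for _ in range(n_samples):
        h = mu + std * rng.normal(size=H)              # h ~ q(h|v, phi)
        log_q = (-0.5 * np.sum(((h - mu) / std) ** 2)
                 - np.sum(np.log(std)) - 0.5 * H * np.log(2 * np.pi))
        total += log_p_v_given_h(v, h) + log_p_h(h) - log_q
    return total / n_samples                           # <= log p(v|theta)

v = rng.integers(0, 2, size=D)
print(elbo(v, mu=np.zeros(H), log_var=np.zeros(H)))

In practice one would maximise this estimate jointly with respect to \phi (here \mu and \log\sigma^2, or an encoder network that produces them) and the decoder parameters \theta, as in the variational autoencoder.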

Extension to semi-supervised methods using p(v) = \int_h \sum_c p(v|h, c) p(c) p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take in any state of W so as to best serve our long-term goals.

The problem is that the number of pixel states is enormous.

Need to learn a low dimensional representation of the screen (use a deep generative model).

Then learn which action to take given the low dimensional representation.
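A heavily simplified sketch of the second step (tabular Q-learning over an already-discretised low-dimensional state; the game dynamics below are a made-up stand-in, and real systems use deep function approximation rather than a table):

import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions = 20, 4        # states = discretised low-dimensional screen codes
Q = np.zeros((n_states, n_actions))
gamma, alpha, eps = 0.95, 0.1, 0.1

def step(s, a):
    # Stand-in for the game: returns (next_state, reward); purely illustrative
    return rng.integers(n_states), float(a == s % n_actions)

s = rng.integers(n_states)
for _ in range(10000):
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    # Q-learning update towards the one-step bootstrapped target
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(np.argmax(Q, axis=1))        # greedy action for each low-dimensional state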

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing.

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 3: David Barber - Deep Nets, Bayes and the story of AI

Intelligent Machinery

1948 Turing and Champernowne lsquopaper and pencilrsquo chess

Intelligent Machinery

1951 Prinz mate-in-two moves chess machine

1952 Strachey programs first computer draughts algorithm

Learning Machines

1951 Oettinger makes first program that lsquolearnsrsquo

1955 Samuel adds lsquolearningrsquo to his draughts algorithm

Logical Intelligence

1968 Rischrsquos algorithm for integration in calculus

1972 Prolog for general logical reasoning

1997 Deep Blue defeats Kasparov

Other forms of intelligence

But is this getting us to where wersquod like to beSelfridge-Shannon film clip

Speech Recognition

Visual Processing

Natural Language modelling

Planning and decision in uncertain environments

Perhaps a different approach would be useful

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Astonishing Hypothesis Crick

ldquoA personrsquos mental activities are entirely due to the behaviour of nervecells and the molecules that make them up and influence themrdquo

Neurons

Visual Pathway

Information Processing in Brains

Neurons

Re

al

Wo

rld

Layer 1 Layer 2 Highminuslevel

Concepts

Feature

Hierarchical Modular Binary Parallel Noisy

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Artificial Neuron (Perceptron)

weight 7

output neuron

neuron 1neuron 2neuron 3neuron 4

neuron 7neuron 6neuron 5

inputs

weight 1

Training an artificial neural network

Want to generalise to new images with high accuracy

Artificial Network

1957 Rosenblattrsquos perceptron

perceptron film clip

Connectionism

1960 Realised a perceptron can only solve simple tasks

1970 Decline in interest

1980 New computing power made training multilayer networks feasible

outputinputs

Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(

sumi wijhi)

Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)

NNs are difficult to train (many local optima)

Particularly difficult to train a NN with a large number of layers (say largerthan around 10)

lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo

Machine Learning (up to 2006)

A large section of the machine learning community abandoned NNs

More principled and computationally better understood techniques (SVMsand related convex methods) replaced them

Bayesian AI (1990s onwards)

From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes

Uncertainty and reasoning are not naturally representable using standardfeed-forward nets

Explosion in more lsquosymbolicrsquo Bayesian AI

Deep Learning

NNs have resurged in interest in the last few years (Hinton Bengio )

Also called lsquodeep learningrsquo

Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques

The brain uses hierarchical distributed processing and it is likely to be for agood reason

Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc

Why now

New computing resources (GPU processing)

Availability of large amount of data means that we can train nets with manyparameters (1010)

Recent evidence suggests local optima are not particularly problematic

Autoencoder

y1 y2 y3 y4 y5

h1 h2 h3

h4 h5

y1 y2 y3 y4 y5

h6 h7 h8

The bottleneck forces the network to try to find a low dimensionalrepresentation of the data

Useful for unsupervised learning

Autoencoder on MNIST digits (Hinton 2006 Science)

Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA

60000 training images (28times 28 = 784 pixels)

Use a form of autoencoder to find a lower (30) dimensional representation

At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation

Google Cats

10 Million Youtube video frames (200x200 pixel images)

Use a specialised autoencoder with 9 layers (1 billion weights)

2000 computers + two weeks of computing

Examine units to see what images they most respond to

Google Autoencoder

From Nando De Freitas

Convolutional NNs

CNNs are particularly popular in image processing

Often the feature maps correspond (not to macro features such as bicycles)but micro features

For example in handwritten digit recognition they correspond to smallconstituent parts of the digits

These are used then to process the image into a representation that is betterfor recognition

NNs in NLP

Bag of Words

We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index

We can also think of this as a Euclidean embedding e

aardvarkrarr eaardvark =

100

zorrorarr ezorro =

001

Word Embeddings

Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned

Objective is for example next word prediction accuracy

These are often called lsquoneural language modelsrsquo

NNs in NLP

Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used

Consider the sentence

the cat sat on the mat

and that we wish to predict the word on given the two preceding cat sat

and two succeeding words the mat

We can use a network that has inputs vcat vsat vthe vmat

The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high

The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context

Word Embeddings

Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)

Word Embeddings

There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender

vwoman minus vman asymp vaunt minus vuncle

vwoman minus vman asymp vqueen minus vking

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France-Paris we get the lsquorelationshiprsquo embedding

v = vParis minus vFrance

Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 4: David Barber - Deep Nets, Bayes and the story of AI

Intelligent Machinery

1951 Prinz mate-in-two moves chess machine

1952 Strachey programs first computer draughts algorithm

Learning Machines

1951 Oettinger makes first program that lsquolearnsrsquo

1955 Samuel adds lsquolearningrsquo to his draughts algorithm

Logical Intelligence

1968 Rischrsquos algorithm for integration in calculus

1972 Prolog for general logical reasoning

1997 Deep Blue defeats Kasparov

Other forms of intelligence

But is this getting us to where wersquod like to beSelfridge-Shannon film clip

Speech Recognition

Visual Processing

Natural Language modelling

Planning and decision in uncertain environments

Perhaps a different approach would be useful

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Astonishing Hypothesis Crick

ldquoA personrsquos mental activities are entirely due to the behaviour of nervecells and the molecules that make them up and influence themrdquo

Neurons

Visual Pathway

Information Processing in Brains

Neurons

Re

al

Wo

rld

Layer 1 Layer 2 Highminuslevel

Concepts

Feature

Hierarchical Modular Binary Parallel Noisy

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Artificial Neuron (Perceptron)

weight 7

output neuron

neuron 1neuron 2neuron 3neuron 4

neuron 7neuron 6neuron 5

inputs

weight 1

Training an artificial neural network

Want to generalise to new images with high accuracy

Artificial Network

1957 Rosenblattrsquos perceptron

perceptron film clip

Connectionism

1960 Realised a perceptron can only solve simple tasks

1970 Decline in interest

1980 New computing power made training multilayer networks feasible

outputinputs

Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(

sumi wijhi)

Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)

NNs are difficult to train (many local optima)

Particularly difficult to train a NN with a large number of layers (say largerthan around 10)

lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo

Machine Learning (up to 2006)

A large section of the machine learning community abandoned NNs

More principled and computationally better understood techniques (SVMsand related convex methods) replaced them

Bayesian AI (1990s onwards)

From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes

Uncertainty and reasoning are not naturally representable using standardfeed-forward nets

Explosion in more lsquosymbolicrsquo Bayesian AI

Deep Learning

NNs have resurged in interest in the last few years (Hinton Bengio )

Also called lsquodeep learningrsquo

Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques

The brain uses hierarchical distributed processing and it is likely to be for agood reason

Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc

Why now

New computing resources (GPU processing)

Availability of large amount of data means that we can train nets with manyparameters (1010)

Recent evidence suggests local optima are not particularly problematic

Autoencoder

y1 y2 y3 y4 y5

h1 h2 h3

h4 h5

y1 y2 y3 y4 y5

h6 h7 h8

The bottleneck forces the network to try to find a low dimensionalrepresentation of the data

Useful for unsupervised learning

Autoencoder on MNIST digits (Hinton 2006 Science)

Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA

60000 training images (28times 28 = 784 pixels)

Use a form of autoencoder to find a lower (30) dimensional representation

At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation

Google Cats

10 Million Youtube video frames (200x200 pixel images)

Use a specialised autoencoder with 9 layers (1 billion weights)

2000 computers + two weeks of computing

Examine units to see what images they most respond to

Google Autoencoder

From Nando De Freitas

Convolutional NNs

CNNs are particularly popular in image processing

Often the feature maps correspond (not to macro features such as bicycles)but micro features

For example in handwritten digit recognition they correspond to smallconstituent parts of the digits

These are used then to process the image into a representation that is betterfor recognition

NNs in NLP

Bag of Words

We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index

We can also think of this as a Euclidean embedding e

aardvarkrarr eaardvark =

100

zorrorarr ezorro =

001

Word Embeddings

Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned

Objective is for example next word prediction accuracy

These are often called lsquoneural language modelsrsquo

NNs in NLP

Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used

Consider the sentence

the cat sat on the mat

and that we wish to predict the word on given the two preceding cat sat

and two succeeding words the mat

We can use a network that has inputs vcat vsat vthe vmat

The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high

The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context

Word Embeddings

Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)

Word Embeddings

There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender

vwoman minus vman asymp vaunt minus vuncle

vwoman minus vman asymp vqueen minus vking

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France-Paris we get the lsquorelationshiprsquo embedding

v = vParis minus vFrance

Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

p(h_{1:T}, v_{1:T}) = p(v_1|h_1)\, p(h_1) \prod_{t=2}^{T} p(v_t|h_t)\, p(h_t|h_{t-1})

For a stationary HMM the transition p(h_t|h_{t-1}) and emission p(v_t|h_t) distributions are constant through time.

Figure: A first order hidden Markov model with 'hidden' variables dom(h_t) = \{1, \dots, H\}, t = 1, \dots, T. The 'visible' variables v_t can be either discrete or continuous.

The classical inference problems

Filtering (Inferring the present): p(h_t|v_{1:t})
Prediction (Inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (Inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): argmax_{h_{1:T}} p(h_{1:T}|v_{1:T})

For prediction, one is also often interested in p(v_t|v_{1:s}) for t > s.

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).

The algorithms are variants of 'message passing on factor graphs'.

The algorithms are guaranteed to work if the graph is singly-connected.

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).
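As an illustration of why filtering is cheap, here is a sketch (Python/numpy) of the forward (filtering) recursion for a discrete HMM; the transition and emission tables are toy numbers of my own, and the overall cost is O(T H^2) as stated above:

import numpy as np

H, V = 2, 3                              # number of hidden states, visible states
p_h1 = np.array([0.6, 0.4])              # p(h_1)
trans = np.array([[0.7, 0.4],            # trans[i, j] = p(h_t = i | h_{t-1} = j)
                  [0.3, 0.6]])
emit = np.array([[0.5, 0.1],             # emit[v, i]  = p(v_t = v | h_t = i)
                 [0.4, 0.2],
                 [0.1, 0.7]])

def filtering(obs):
    # alpha[t, i] = p(h_t = i | v_{1:t})
    alpha = np.zeros((len(obs), H))
    a = emit[obs[0]] * p_h1
    alpha[0] = a / a.sum()
    for t in range(1, len(obs)):
        a = emit[obs[t]] * (trans @ alpha[t - 1])
        alpha[t] = a / a.sum()
    return alpha

print(filtering([0, 2, 2, 1]))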

HMMs for speech recognition

h_t is the phoneme at time t; p(h_t|h_{t-1}) – language model; p(v_t|h_t) – speech signal model.

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speech recognition.

The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, \mu(h_t; \theta).

This function is a deep neural network, trained on a large amount of data.

Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
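A minimal sketch (Python/numpy) of such an emission, log p(v_t|h_t) = log N(v_t; \mu(h_t; \theta), \sigma^2 I), with \mu given by a small network; the architecture, sizes and random weights here are purely illustrative assumptions, not the systems referred to above:

import numpy as np

rng = np.random.default_rng(0)
n_phonemes, n_hidden, n_features = 40, 64, 25
W1 = rng.normal(0, 0.1, (n_hidden, n_phonemes))   # random weights, illustration only
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_features, n_hidden))
b2 = np.zeros(n_features)
sigma2 = 1.0

def mu(h):
    # mean of the emission as a (tiny) network function of the phoneme index h
    onehot = np.zeros(n_phonemes)
    onehot[h] = 1.0
    return W2 @ np.tanh(W1 @ onehot + b1) + b2

def log_emission(v, h):
    # log N(v; mu(h), sigma2 * I)
    d = v - mu(h)
    return -0.5 * (d @ d / sigma2 + n_features * np.log(2 * np.pi * sigma2))

v = rng.normal(size=n_features)          # a dummy frame of acoustic features
print(log_emission(v, h=3))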

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructed on the basis of a low dimensional representation.

Note that this is a Graphical Model, not a Function.

The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in these models.

Statisticians typically use sampling as an approximation.

Very popular in ML to use a variational method – much faster for inference.

Variational Inference
Consider a distribution

p(v|\theta) = \int_h p(v|h, \theta)\, p(h)

and that we wish to learn \theta to maximise the probability this model generates observed data:

\log p(v|\theta) \ge -\int_h q(h|v, \phi) \log q(h|v, \phi) + \int_h q(h|v, \phi) \log \left[ p(v|h, \theta)\, p(h) \right]

Idea is to choose a 'variational' distribution q(h|v, \phi) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound w.r.t. \phi and \theta.

We can parameterise p(v|h, \theta) using a deep network.

Very popular approach – see 'variational autoencoder' and also attention mechanisms.

Extension to semi-supervised method using p(v) = \int_h \sum_c p(v|h, c)\, p(c)\, p(h).
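A compact sketch (Python/numpy) of a single-sample Monte Carlo estimate of the variational bound above for a toy model: Gaussian prior p(h) = N(0, I), Bernoulli decoder p(v|h, \theta) given by a small linear network, and Gaussian q(h|v, \phi). All weights are random illustrative values; a variational autoencoder would train \theta and \phi jointly by ascending this bound:

import numpy as np

rng = np.random.default_rng(1)
Dh, Dv = 2, 5
W_dec = rng.normal(0, 0.5, (Dv, Dh))                     # decoder parameters (theta)
W_mu = rng.normal(0, 0.5, (Dh, Dv))                      # recognition parameters (phi)
W_logvar = rng.normal(0, 0.1, (Dh, Dv))

def elbo_estimate(v):
    mu, logvar = W_mu @ v, W_logvar @ v                  # q(h|v) = N(mu, diag(exp(logvar)))
    h = mu + np.exp(0.5 * logvar) * rng.normal(size=Dh)  # reparameterised sample from q
    p = 1.0 / (1.0 + np.exp(-(W_dec @ h)))               # Bernoulli means of p(v|h)
    log_p_v_given_h = np.sum(v * np.log(p) + (1 - v) * np.log(1 - p))
    log_p_h = -0.5 * np.sum(h ** 2 + np.log(2 * np.pi))
    log_q = -0.5 * np.sum((h - mu) ** 2 / np.exp(logvar) + logvar + np.log(2 * np.pi))
    return log_p_v_given_h + log_p_h - log_q             # one sample of the bound

v = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
print(elbo_estimate(v))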

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals.

The problem is that the number of pixel states is enormous.

Need to learn a low dimensional representation of the screen (use a deep generative model).

Then learn which action to take given the low dimensional representation.
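For the 'which action to take in each state' part, here is a toy sketch (Python/numpy) of tabular Q-learning on a made-up 5-state chain; the Atari systems referred to above replace the table with a deep network over a learned screen representation, so this shows only the bare idea:

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2               # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
gamma, alpha, eps = 0.9, 0.1, 0.1

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the right end
    return s_next, reward

for episode in range(500):
    s = 0
    for t in range(20):
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])   # TD update
        s = s_next

print(np.argmax(Q, axis=1))              # learned policy: move right in every state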

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing.

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning.

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io

Page 5: David Barber - Deep Nets, Bayes and the story of AI

Learning Machines

1951 Oettinger makes first program that lsquolearnsrsquo

1955 Samuel adds lsquolearningrsquo to his draughts algorithm

Logical Intelligence

1968 Rischrsquos algorithm for integration in calculus

1972 Prolog for general logical reasoning

1997 Deep Blue defeats Kasparov

Other forms of intelligence

But is this getting us to where wersquod like to beSelfridge-Shannon film clip

Speech Recognition

Visual Processing

Natural Language modelling

Planning and decision in uncertain environments

Perhaps a different approach would be useful

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Astonishing Hypothesis Crick

ldquoA personrsquos mental activities are entirely due to the behaviour of nervecells and the molecules that make them up and influence themrdquo

Neurons

Visual Pathway

Information Processing in Brains

Neurons

Re

al

Wo

rld

Layer 1 Layer 2 Highminuslevel

Concepts

Feature

Hierarchical Modular Binary Parallel Noisy

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Artificial Neuron (Perceptron)

weight 7

output neuron

neuron 1neuron 2neuron 3neuron 4

neuron 7neuron 6neuron 5

inputs

weight 1

Training an artificial neural network

Want to generalise to new images with high accuracy

Artificial Network

1957 Rosenblattrsquos perceptron

perceptron film clip

Connectionism

1960 Realised a perceptron can only solve simple tasks

1970 Decline in interest

1980 New computing power made training multilayer networks feasible

outputinputs

Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(

sumi wijhi)

Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)

NNs are difficult to train (many local optima)

Particularly difficult to train a NN with a large number of layers (say largerthan around 10)

lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo

Machine Learning (up to 2006)

A large section of the machine learning community abandoned NNs

More principled and computationally better understood techniques (SVMsand related convex methods) replaced them

Bayesian AI (1990s onwards)

From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes

Uncertainty and reasoning are not naturally representable using standardfeed-forward nets

Explosion in more lsquosymbolicrsquo Bayesian AI

Deep Learning

NNs have resurged in interest in the last few years (Hinton Bengio )

Also called lsquodeep learningrsquo

Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques

The brain uses hierarchical distributed processing and it is likely to be for agood reason

Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc

Why now

New computing resources (GPU processing)

Availability of large amount of data means that we can train nets with manyparameters (1010)

Recent evidence suggests local optima are not particularly problematic

Autoencoder

y1 y2 y3 y4 y5

h1 h2 h3

h4 h5

y1 y2 y3 y4 y5

h6 h7 h8

The bottleneck forces the network to try to find a low dimensionalrepresentation of the data

Useful for unsupervised learning

Autoencoder on MNIST digits (Hinton 2006 Science)

Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA

60000 training images (28times 28 = 784 pixels)

Use a form of autoencoder to find a lower (30) dimensional representation

At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation

Google Cats

10 Million Youtube video frames (200x200 pixel images)

Use a specialised autoencoder with 9 layers (1 billion weights)

2000 computers + two weeks of computing

Examine units to see what images they most respond to

Google Autoencoder

From Nando De Freitas

Convolutional NNs

CNNs are particularly popular in image processing

Often the feature maps correspond (not to macro features such as bicycles)but micro features

For example in handwritten digit recognition they correspond to smallconstituent parts of the digits

These are used then to process the image into a representation that is betterfor recognition

NNs in NLP

Bag of Words

We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index

We can also think of this as a Euclidean embedding e

aardvarkrarr eaardvark =

100

zorrorarr ezorro =

001

Word Embeddings

Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned

Objective is for example next word prediction accuracy

These are often called lsquoneural language modelsrsquo

NNs in NLP

Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used

Consider the sentence

the cat sat on the mat

and that we wish to predict the word on given the two preceding cat sat

and two succeeding words the mat

We can use a network that has inputs vcat vsat vthe vmat

The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high

The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context

Word Embeddings

Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)

Word Embeddings

There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender

vwoman minus vman asymp vaunt minus vuncle

vwoman minus vman asymp vqueen minus vking

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France-Paris we get the lsquorelationshiprsquo embedding

v = vParis minus vFrance

Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 6: David Barber - Deep Nets, Bayes and the story of AI

Logical Intelligence

1968 Rischrsquos algorithm for integration in calculus

1972 Prolog for general logical reasoning

1997 Deep Blue defeats Kasparov

Other forms of intelligence

But is this getting us to where wersquod like to beSelfridge-Shannon film clip

Speech Recognition

Visual Processing

Natural Language modelling

Planning and decision in uncertain environments

Perhaps a different approach would be useful

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Astonishing Hypothesis Crick

ldquoA personrsquos mental activities are entirely due to the behaviour of nervecells and the molecules that make them up and influence themrdquo

Neurons

Visual Pathway

Information Processing in Brains

Neurons

Re

al

Wo

rld

Layer 1 Layer 2 Highminuslevel

Concepts

Feature

Hierarchical Modular Binary Parallel Noisy

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Artificial Neuron (Perceptron)

weight 7

output neuron

neuron 1neuron 2neuron 3neuron 4

neuron 7neuron 6neuron 5

inputs

weight 1

Training an artificial neural network

Want to generalise to new images with high accuracy

Artificial Network

1957 Rosenblattrsquos perceptron

perceptron film clip

Connectionism

1960 Realised a perceptron can only solve simple tasks

1970 Decline in interest

1980 New computing power made training multilayer networks feasible

outputinputs

Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(

sumi wijhi)

Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)

NNs are difficult to train (many local optima)

Particularly difficult to train a NN with a large number of layers (say largerthan around 10)

lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo

Machine Learning (up to 2006)

A large section of the machine learning community abandoned NNs

More principled and computationally better understood techniques (SVMsand related convex methods) replaced them

Bayesian AI (1990s onwards)

From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes

Uncertainty and reasoning are not naturally representable using standardfeed-forward nets

Explosion in more lsquosymbolicrsquo Bayesian AI

Deep Learning

NNs have resurged in interest in the last few years (Hinton Bengio )

Also called lsquodeep learningrsquo

Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques

The brain uses hierarchical distributed processing and it is likely to be for agood reason

Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc

Why now

New computing resources (GPU processing)

Availability of large amount of data means that we can train nets with manyparameters (1010)

Recent evidence suggests local optima are not particularly problematic

Autoencoder

y1 y2 y3 y4 y5

h1 h2 h3

h4 h5

y1 y2 y3 y4 y5

h6 h7 h8

The bottleneck forces the network to try to find a low dimensionalrepresentation of the data

Useful for unsupervised learning

Autoencoder on MNIST digits (Hinton 2006 Science)

Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA

60000 training images (28times 28 = 784 pixels)

Use a form of autoencoder to find a lower (30) dimensional representation

At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation

Google Cats

10 Million Youtube video frames (200x200 pixel images)

Use a specialised autoencoder with 9 layers (1 billion weights)

2000 computers + two weeks of computing

Examine units to see what images they most respond to

Google Autoencoder

From Nando De Freitas

Convolutional NNs

CNNs are particularly popular in image processing

Often the feature maps correspond (not to macro features such as bicycles)but micro features

For example in handwritten digit recognition they correspond to smallconstituent parts of the digits

These are used then to process the image into a representation that is betterfor recognition

NNs in NLP

Bag of Words

We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index

We can also think of this as a Euclidean embedding e

aardvarkrarr eaardvark =

100

zorrorarr ezorro =

001

Word Embeddings

Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned

Objective is for example next word prediction accuracy

These are often called lsquoneural language modelsrsquo

NNs in NLP

Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used

Consider the sentence

the cat sat on the mat

and that we wish to predict the word on given the two preceding cat sat

and two succeeding words the mat

We can use a network that has inputs vcat vsat vthe vmat

The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high

The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context

Word Embeddings

Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)

Word Embeddings

There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender

vwoman minus vman asymp vaunt minus vuncle

vwoman minus vman asymp vqueen minus vking

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France-Paris we get the lsquorelationshiprsquo embedding

v = vParis minus vFrance

Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

[Figure: state transition diagram over states 1–9]

Most probable and shortest paths

[Figure: the same state transition diagram over states 1–9]

The shortest (unweighted) path from state 1 to state 7 is 1 − 2 − 7.

The most probable path from state 1 to state 7 is 1 − 8 − 9 − 7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1 − 2 − 7 the probability of exiting state 2 into state 7 is 1/5.

Equilibrium distribution

It is interesting to know how the marginal p(x_t) evolves through time:

p(x_t = i) = \sum_j \underbrace{p(x_t = i|x_{t-1} = j)}_{M_{ij}}\, p(x_{t-1} = j)

p(x_t = i) is the frequency that we visit state i at time t, given we started from p(x_1) and randomly drew samples from the transition p(x_τ|x_{τ-1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

p_t = M^{t-1} p_1

If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:

p_∞ = M p_∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
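
A small Python sketch of both views, power iteration p_t = M^{t-1} p_1 and the unit-eigenvalue eigenvector, using a made-up 3-state transition matrix:

```python
import numpy as np

# Hypothetical 3-state transition matrix; column j holds p(x_t = i | x_{t-1} = j)
M = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.30],
              [0.05, 0.10, 0.60]])

# Power iteration: p_t = M^{t-1} p_1 converges to the equilibrium distribution
p = np.array([1.0, 0.0, 0.0])
for _ in range(200):
    p = M @ p
print(p)

# Equivalently, the eigenvector of M with unit eigenvalue, normalised to sum to 1
vals, vecs = np.linalg.eig(M)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
print(v / v.sum())
```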

PageRank

Define the matrix

A_{ij} = 1 if website j has a hyperlink to website i, and 0 otherwise.

From this we can define a Markov transition matrix with elements

M_{ij} = \frac{A_{ij}}{\sum_{i'} A_{i'j}}

If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain the word is then returned, ranked according to the importance of the site.
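
A toy sketch of this construction with a made-up 4-site link matrix; this is the bare column-normalised version described above, without the damping/teleportation used in the full PageRank algorithm:

```python
import numpy as np

# Hypothetical link matrix: A[i, j] = 1 if site j links to site i
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0)          # column-normalise: M[i, j] = A[i, j] / sum_i' A[i', j]

p = np.full(4, 0.25)           # start from a uniform distribution over sites
for _ in range(100):           # iterate p <- M p until convergence
    p = M @ p
print(p)                       # equilibrium distribution = 'importance' of each site
```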

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

p(h_{1:T}, v_{1:T}) = p(v_1|h_1)\, p(h_1) \prod_{t=2}^{T} p(v_t|h_t)\, p(h_t|h_{t-1})

For a stationary HMM the transition p(h_t|h_{t-1}) and emission p(v_t|h_t) distributions are constant through time.

Figure: A first order hidden Markov model with 'hidden' variables dom(h_t) = {1, ..., H}, t = 1, ..., T. The 'visible' variables v_t can be either discrete or continuous.

The classical inference problems

Filtering (inferring the present): p(h_t|v_{1:t})
Prediction (inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): argmax_{h_{1:T}} p(h_{1:T}|v_{1:T})

For prediction, one is also often interested in p(v_t|v_{1:s}) for t > s.

Inference in Hidden Markov Models

Belief network representation of a HMM

[Figure: HMM belief network h_1 → h_2 → h_3 → h_4 with emissions h_t → v_t]

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).

The algorithms are variants of 'message passing on factor graphs'.

Algorithm guaranteed to work if the graph is singly-connected.

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).
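
As an illustration of why filtering scales linearly in T, here is a minimal sketch of the forward (alpha) recursion for a discrete HMM; the transition, emission and observation values are randomly generated placeholders:

```python
import numpy as np

H, T = 3, 5                                   # number of hidden states, sequence length
rng = np.random.default_rng(0)

# Made-up stationary HMM parameters (columns are conditioning states)
A = rng.dirichlet(np.ones(H), size=H).T       # A[i, j] = p(h_t = i | h_{t-1} = j)
B = rng.dirichlet(np.ones(4), size=H).T       # B[v, i] = p(v_t = v | h_t = i), 4 visible symbols
p1 = np.full(H, 1.0 / H)                      # p(h_1)

v = rng.integers(0, 4, size=T)                # an observed sequence (illustrative)

# Filtering: alpha_t(i) is proportional to p(h_t = i, v_{1:t}); normalise to get p(h_t | v_{1:t})
alpha = B[v[0]] * p1
alpha /= alpha.sum()
for t in range(1, T):
    alpha = B[v[t]] * (A @ alpha)             # predict with the transition, correct with the emission
    alpha /= alpha.sum()
print(alpha)                                  # p(h_T | v_{1:T})
```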

HMMs for speech recognition

h_t is the phoneme at time t; p(h_t|h_{t-1}) – language model; p(v_t|h_t) – speech signal model.

Deep Nets and HMMs

[Figure: HMM with phoneme chain h_1 → ... → h_4 and emissions to v_1, ..., v_4]

Recently companies including Google have made big advances in speech recognition.

The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t; θ).

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

[Figure: generative belief network with latent variables h_1, h_2 and visible variables v_1, ..., v_4]

It is natural to consider that objects (images for example) can be constructed on the basis of a low dimensional representation.

Note that this is a Graphical Model, not a Function.

The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.

The bad news

Inference (computing p(h|v)) and parameter learning are intractable in these models.

Statisticians typically use sampling as an approximation.

Very popular in ML to use a variational method – much faster for inference.

Variational Inference
Consider a distribution

p(v|θ) = \int_h p(v|h, θ)\, p(h)

and suppose that we wish to learn θ to maximise the probability this model generates observed data.

\log p(v|θ) \ge -\int_h q(h|v, φ) \log q(h|v, φ) + \int_h q(h|v, φ) \log p(v|h, θ) + \text{const}

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound w.r.t. φ and θ.

We can parameterise p(v|h, θ) using a deep network.

Very popular approach – see the 'variational autoencoder' and also attention mechanisms.

Extension to a semi-supervised method using p(v) = \int_h \sum_c p(v|h, c)\, p(c)\, p(h).
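
A toy sketch of evaluating this bound by Monte-Carlo sampling from q; the Gaussian q(h|v, φ), the linear-Gaussian decoder p(v|h, θ) and all numbers are illustrative assumptions, not the variational autoencoder or DRAW models mentioned here:

```python
import numpy as np

rng = np.random.default_rng(0)
v = np.array([0.5, -1.0])                      # a single observed datapoint (illustrative)

# p(h) = N(0, 1), p(v|h, theta) = N(W h, I) with theta = W (a toy decoder)
W = np.array([[1.0], [-2.0]])

# q(h|v, phi) = N(mu, sigma^2), with phi = (mu, log sigma): the variational distribution
mu, log_sigma = 0.3, -0.5

def log_gauss(x, m, s2):
    # Log density of a diagonal Gaussian with mean m and variance s2
    return -0.5 * np.sum((x - m) ** 2 / s2 + np.log(2 * np.pi * s2))

# Monte-Carlo estimate of the bound: E_q[log p(v|h, theta) + log p(h) - log q(h|v, phi)]
S = 1000
h = mu + np.exp(log_sigma) * rng.standard_normal((S, 1))
terms = [log_gauss(v, (W @ hs.reshape(1)).ravel(), 1.0)   # log p(v|h, theta)
         + log_gauss(hs, 0.0, 1.0)                        # log p(h)
         - log_gauss(hs, mu, np.exp(2 * log_sigma))       # -log q(h|v, phi)
         for hs in h]
print(np.mean(terms))   # lower bound on log p(v|theta); maximised jointly over phi and theta
```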

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals.

The problem is that the number of pixel states is enormous.

Need to learn a low dimensional representation of the screen (use a deep generative model).

Then learn which action to take given the low dimensional representation; a toy sketch of this step is given below.
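
A toy sketch of that second step, tabular Q-learning over an already-discretised state representation; the environment, reward and constants are made up for illustration, and this is not the Atari system referred to above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 4            # assume the screen has been encoded into 10 discrete codes

def step(s, a):
    # A made-up environment: taking action 3 in the last state code pays off
    s_next = rng.integers(n_states)
    reward = 1.0 if (s == n_states - 1 and a == 3) else 0.0
    return s_next, reward

Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1      # learning rate, discount, exploration rate

s = 0
for _ in range(20000):
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    # Q-learning update: move Q(s, a) towards r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    s = s_next

print(np.argmax(Q, axis=1))            # greedy action for each state code
```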

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing.

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning.

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io

Page 7: David Barber - Deep Nets, Bayes and the story of AI

Other forms of intelligence

But is this getting us to where wersquod like to beSelfridge-Shannon film clip

Speech Recognition

Visual Processing

Natural Language modelling

Planning and decision in uncertain environments

Perhaps a different approach would be useful

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Astonishing Hypothesis Crick

ldquoA personrsquos mental activities are entirely due to the behaviour of nervecells and the molecules that make them up and influence themrdquo

Neurons

Visual Pathway

Information Processing in Brains

Neurons

Re

al

Wo

rld

Layer 1 Layer 2 Highminuslevel

Concepts

Feature

Hierarchical Modular Binary Parallel Noisy

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Artificial Neuron (Perceptron)

weight 7

output neuron

neuron 1neuron 2neuron 3neuron 4

neuron 7neuron 6neuron 5

inputs

weight 1

Training an artificial neural network

Want to generalise to new images with high accuracy

Artificial Network

1957 Rosenblattrsquos perceptron

perceptron film clip

Connectionism

1960 Realised a perceptron can only solve simple tasks

1970 Decline in interest

1980 New computing power made training multilayer networks feasible

outputinputs

Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(

sumi wijhi)

Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)

NNs are difficult to train (many local optima)

Particularly difficult to train a NN with a large number of layers (say largerthan around 10)

lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo

Machine Learning (up to 2006)

A large section of the machine learning community abandoned NNs

More principled and computationally better understood techniques (SVMsand related convex methods) replaced them

Bayesian AI (1990s onwards)

From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes

Uncertainty and reasoning are not naturally representable using standardfeed-forward nets

Explosion in more lsquosymbolicrsquo Bayesian AI

Deep Learning

NNs have resurged in interest in the last few years (Hinton Bengio )

Also called lsquodeep learningrsquo

Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques

The brain uses hierarchical distributed processing and it is likely to be for agood reason

Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc

Why now

New computing resources (GPU processing)

Availability of large amount of data means that we can train nets with manyparameters (1010)

Recent evidence suggests local optima are not particularly problematic

Autoencoder

y1 y2 y3 y4 y5

h1 h2 h3

h4 h5

y1 y2 y3 y4 y5

h6 h7 h8

The bottleneck forces the network to try to find a low dimensionalrepresentation of the data

Useful for unsupervised learning

Autoencoder on MNIST digits (Hinton 2006 Science)

Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA

60000 training images (28times 28 = 784 pixels)

Use a form of autoencoder to find a lower (30) dimensional representation

At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation

Google Cats

10 Million Youtube video frames (200x200 pixel images)

Use a specialised autoencoder with 9 layers (1 billion weights)

2000 computers + two weeks of computing

Examine units to see what images they most respond to

Google Autoencoder

From Nando De Freitas

Convolutional NNs

CNNs are particularly popular in image processing

Often the feature maps correspond (not to macro features such as bicycles)but micro features

For example in handwritten digit recognition they correspond to smallconstituent parts of the digits

These are used then to process the image into a representation that is betterfor recognition

NNs in NLP

Bag of Words

We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index

We can also think of this as a Euclidean embedding e

aardvarkrarr eaardvark =

100

zorrorarr ezorro =

001

Word Embeddings

Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned

Objective is for example next word prediction accuracy

These are often called lsquoneural language modelsrsquo

NNs in NLP

Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used

Consider the sentence

the cat sat on the mat

and that we wish to predict the word on given the two preceding cat sat

and two succeeding words the mat

We can use a network that has inputs vcat vsat vthe vmat

The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high

The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context

Word Embeddings

Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)

Word Embeddings

There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender

vwoman minus vman asymp vaunt minus vuncle

vwoman minus vman asymp vqueen minus vking

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France-Paris we get the lsquorelationshiprsquo embedding

v = vParis minus vFrance

Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 8: David Barber - Deep Nets, Bayes and the story of AI

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Astonishing Hypothesis Crick

ldquoA personrsquos mental activities are entirely due to the behaviour of nervecells and the molecules that make them up and influence themrdquo

Neurons

Visual Pathway

Information Processing in Brains

Neurons

Re

al

Wo

rld

Layer 1 Layer 2 Highminuslevel

Concepts

Feature

Hierarchical Modular Binary Parallel Noisy

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Artificial Neuron (Perceptron)

weight 7

output neuron

neuron 1neuron 2neuron 3neuron 4

neuron 7neuron 6neuron 5

inputs

weight 1

Training an artificial neural network

Want to generalise to new images with high accuracy

Artificial Network

1957 Rosenblattrsquos perceptron

perceptron film clip

Connectionism

1960 Realised a perceptron can only solve simple tasks

1970 Decline in interest

1980 New computing power made training multilayer networks feasible

outputinputs

Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(

sumi wijhi)

Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)

NNs are difficult to train (many local optima)

Particularly difficult to train a NN with a large number of layers (say largerthan around 10)

lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo

Machine Learning (up to 2006)

A large section of the machine learning community abandoned NNs

More principled and computationally better understood techniques (SVMsand related convex methods) replaced them

Bayesian AI (1990s onwards)

From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes

Uncertainty and reasoning are not naturally representable using standardfeed-forward nets

Explosion in more lsquosymbolicrsquo Bayesian AI

Deep Learning

NNs have resurged in interest in the last few years (Hinton Bengio )

Also called lsquodeep learningrsquo

Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques

The brain uses hierarchical distributed processing and it is likely to be for agood reason

Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc

Why now

New computing resources (GPU processing)

Availability of large amount of data means that we can train nets with manyparameters (1010)

Recent evidence suggests local optima are not particularly problematic

Autoencoder

y1 y2 y3 y4 y5

h1 h2 h3

h4 h5

y1 y2 y3 y4 y5

h6 h7 h8

The bottleneck forces the network to try to find a low dimensionalrepresentation of the data

Useful for unsupervised learning

Autoencoder on MNIST digits (Hinton 2006 Science)

Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA

60000 training images (28times 28 = 784 pixels)

Use a form of autoencoder to find a lower (30) dimensional representation

At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation

Google Cats

10 Million Youtube video frames (200x200 pixel images)

Use a specialised autoencoder with 9 layers (1 billion weights)

2000 computers + two weeks of computing

Examine units to see what images they most respond to

Google Autoencoder

From Nando De Freitas

Convolutional NNs

CNNs are particularly popular in image processing

Often the feature maps correspond (not to macro features such as bicycles)but micro features

For example in handwritten digit recognition they correspond to smallconstituent parts of the digits

These are used then to process the image into a representation that is betterfor recognition

NNs in NLP

Bag of Words

We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index

We can also think of this as a Euclidean embedding e

aardvarkrarr eaardvark =

100

zorrorarr ezorro =

001

Word Embeddings

Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned

Objective is for example next word prediction accuracy

These are often called lsquoneural language modelsrsquo

NNs in NLP

Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used

Consider the sentence

the cat sat on the mat

and that we wish to predict the word on given the two preceding cat sat

and two succeeding words the mat

We can use a network that has inputs vcat vsat vthe vmat

The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high

The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context

Word Embeddings

Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)

Word Embeddings

There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender

vwoman minus vman asymp vaunt minus vuncle

vwoman minus vman asymp vqueen minus vking

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France-Paris we get the lsquorelationshiprsquo embedding

v = vParis minus vFrance

Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v_1, ..., v_T we need a model p(v_{1:T}). For causal consistency it is meaningful to consider the decomposition

p(v_{1:T}) = \prod_{t=1}^{T} p(v_t|v_{1:t-1})

with the convention p(v_t|v_{1:t-1}) = p(v_1) for t = 1

[Figure: cascade belief network over v1, v2, v3, v4]

Independence assumptions
It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(v_t|v_1, \dots, v_{t-1}) = p(v_t|v_{t-L}, \dots, v_{t-1})

where L ≥ 1 is the order of the Markov chain

p(v_{1:T}) = p(v_1)\, p(v_2|v_1)\, p(v_3|v_2) \cdots p(v_T|v_{T-1})

For a stationary Markov chain the transitions p(v_t = s'|v_{t-1} = s) = f(s', s) are time-independent ('homogeneous')
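
To make the stationary (homogeneous) chain concrete, here is a small Python sketch that draws a sample path from a two-state first-order chain; the transition matrix, initial distribution and names are invented for illustration.

```python
# Sampling v_1, ..., v_T from a stationary first-order Markov chain,
# with M[i, j] = p(v_t = i | v_{t-1} = j) (columns sum to one).
import numpy as np

rng = np.random.default_rng(0)
M = np.array([[0.9, 0.2],
              [0.1, 0.8]])          # column-stochastic transition matrix
p1 = np.array([0.5, 0.5])           # initial distribution p(v_1)

def sample_chain(T):
    v = [rng.choice(2, p=p1)]
    for _ in range(T - 1):
        v.append(rng.choice(2, p=M[:, v[-1]]))   # draw v_t given v_{t-1}
    return v

print(sample_chain(20))
```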

Figure: (a) First order Markov chain. (b) Second order Markov chain.

Markov Chains

p(v_1, \dots, v_T) = \underbrace{p(v_1)}_{\text{initial}} \prod_{t=2}^{T} \underbrace{p(v_t|v_{t-1})}_{\text{transition}}

State transition diagram
Nodes represent states of the variable v, and arcs the non-zero elements of the transition p(v_t|v_{t-1})

[Figure: state transition diagram over states 1 to 9]

Most probable and shortest paths

[Figure: the same state transition diagram over states 1 to 9]

The shortest (unweighted) path from state 1 to state 7 is 1 − 2 − 7

The most probable path from state 1 to state 7 is 1 − 8 − 9 − 7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1 − 2 − 7 the probability of exiting state 2 into state 7 is 1/5

Equilibrium distribution

It is interesting to know how the marginal p(x_t) evolves through time:

p(x_t = i) = \sum_j \underbrace{p(x_t = i|x_{t-1} = j)}_{M_{ij}}\, p(x_{t-1} = j)

p(x_t = i) is the frequency with which we visit state i at time t, given we started from p(x_1) and randomly drew samples from the transition p(x_\tau|x_{\tau-1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

p_t = M^{t-1} p_1

If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:

p_∞ = M p_∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix
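
A quick numerical sketch of both statements, reusing a small made-up transition matrix: repeatedly applying M to any initial distribution converges to the same p_∞, which matches the (normalised) eigenvector of M with unit eigenvalue.

```python
# Equilibrium distribution by power iteration, compared with the unit-eigenvalue eigenvector.
import numpy as np

M = np.array([[0.9, 0.2],
              [0.1, 0.8]])          # column-stochastic: M[i, j] = p(x_t = i | x_{t-1} = j)

p = np.array([1.0, 0.0])            # an arbitrary initial distribution p_1
for _ in range(1000):               # p_t = M^{t-1} p_1
    p = M @ p
print(p)                            # equilibrium distribution p_inf (here [2/3, 1/3])

w, V = np.linalg.eig(M)
v = np.real(V[:, np.argmin(np.abs(w - 1))])
print(v / v.sum())                  # same distribution, from the eigenvector with eigenvalue 1
```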

PageRank

Define the matrix

A_{ij} = \begin{cases} 1 & \text{if website } j \text{ has a hyperlink to website } i \\ 0 & \text{otherwise} \end{cases}

From this we can define a Markov transition matrix with elements

M_{ij} = \frac{A_{ij}}{\sum_{i'} A_{i'j}}

If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i

For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain word w is then returned, ranked according to the importance of the site
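
The construction above is easy to sketch in a few lines of Python: build M from a tiny made-up link matrix A and find the equilibrium distribution by repeatedly applying M. This assumes every site has at least one outgoing link and omits the damping factor used in practice.

```python
# PageRank-style 'importance' from a toy link matrix.
import numpy as np

# A[i, j] = 1 if website j links to website i (an invented 4-site web)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0, keepdims=True)    # M[i, j] = A[i, j] / sum_i' A[i', j]
p = np.full(4, 0.25)                    # start from the uniform distribution
for _ in range(1000):
    p = M @ p                           # converges to the equilibrium distribution
print(p)                                # relative 'importance' of each website
```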

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

p(h_{1:T}, v_{1:T}) = p(v_1|h_1)\, p(h_1) \prod_{t=2}^{T} p(v_t|h_t)\, p(h_t|h_{t-1})

For a stationary HMM the transition p(h_t|h_{t-1}) and emission p(v_t|h_t) distributions are constant through time

Figure: A first order hidden Markov model with 'hidden' variables dom(h_t) = {1, ..., H}, t = 1, ..., T. The 'visible' variables v_t can be either discrete or continuous.

The classical inference problems

Filtering (inferring the present): p(h_t|v_{1:t})
Prediction (inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): \arg\max_{h_{1:T}} p(h_{1:T}|v_{1:T})

For prediction, one is also often interested in p(v_t|v_{1:s}) for t > s

Inference in Hidden Markov Models

Belief network representation of an HMM:

[Figure: hidden chain h1 → h2 → h3 → h4, with emissions ht → vt]

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states)

The algorithms are variants of 'message passing on factor graphs'

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes)
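
To make the filtering statement concrete, here is a small Python sketch of the forward recursion p(h_t|v_{1:t}) for a toy discrete HMM; the particular transition, emission and initial tables are invented for illustration, and the cost is O(T H^2) as stated above.

```python
# HMM filtering by the forward recursion for discrete states and observations.
import numpy as np

A = np.array([[0.7, 0.3],
              [0.3, 0.7]])              # A[i, j] = p(h_t = i | h_{t-1} = j)
B = np.array([[0.9, 0.2],
              [0.1, 0.8]])              # B[v, i] = p(v_t = v | h_t = i)
p0 = np.array([0.5, 0.5])               # p(h_1)

def filtering(observations):
    """Return alpha[t, i] = p(h_t = i | v_{1:t})."""
    alphas = []
    alpha = B[observations[0]] * p0      # proportional to p(h_1, v_1)
    alpha /= alpha.sum()
    alphas.append(alpha)
    for v in observations[1:]:
        alpha = B[v] * (A @ alpha)       # emission times one-step prediction
        alpha /= alpha.sum()             # normalise to get the filtered posterior
        alphas.append(alpha)
    return np.array(alphas)

print(filtering([0, 0, 1, 0]))
```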

HMMs for speech recognition

h_t is the phoneme at time t; p(h_t|h_{t-1}) – language model; p(v_t|h_t) – speech signal model

Deep Nets and HMMs

[Figure: HMM h1 → h2 → h3 → h4 with emissions ht → vt]

Recently, companies including Google have made big advances in speech recognition

The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, µ(h_t, θ)

This function is a deep neural network, trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation

Note that this is a Graphical Model, not a Function

The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in these models

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method – much faster for inference

Variational Inference
Consider a distribution

p(v|\theta) = \int_h p(v|h, \theta)\, p(h)

and that we wish to learn θ to maximise the probability this model generates the observed data

\log p(v|\theta) \ge -\int_h q(h|v, \phi) \log q(h|v, \phi) + \int_h q(h|v, \phi) \log p(v|h, \theta) + \text{const.}

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently

We then jointly maximise the bound w.r.t. φ and θ

We can parameterise p(v|h, θ) using a deep network

Very popular approach – see the 'variational autoencoder' and also attention mechanisms

Extension to a semi-supervised method using p(v) = \int_h \sum_c p(v|h, c)\, p(c)\, p(h)
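
As a rough illustration of sampling the bound (not the lecture's code), the following numpy sketch draws a single reparameterised sample from a Gaussian q(h|v, φ) and evaluates a single-sample estimate of the lower bound for a toy one-layer 'decoder' mean function. All dimensions, the decoder architecture and the variable names are assumptions.

```python
# Single-sample Monte Carlo estimate of the variational lower bound for a toy model:
# p(h) = N(0, I), p(v|h) = N(mu_theta(h), I), q(h|v) = N(m, diag(exp(log_s)^2)).
import numpy as np

rng = np.random.default_rng(0)

def decoder_mean(h, W, b):
    # mu_theta(h): a tiny one-layer tanh network standing in for the deep network
    return np.tanh(h @ W) + b

def elbo_estimate(v, m, log_s, W, b):
    """E_q[log p(v|h)] + E_q[log p(h)] - E_q[log q(h|v)], estimated with one sample
    (all terms up to additive constants)."""
    eps = rng.standard_normal(m.shape)
    h = m + np.exp(log_s) * eps                               # reparameterised sample from q
    log_lik = -0.5 * np.sum((v - decoder_mean(h, W, b))**2)   # log N(v; mu(h), I)
    log_prior = -0.5 * np.sum(h**2)                           # log N(h; 0, I)
    log_q = -0.5 * np.sum(eps**2) - np.sum(log_s)             # log q(h|v) at the sample
    return log_lik + log_prior - log_q

H, V = 2, 4                                                   # toy latent and visible sizes
v = rng.standard_normal(V)
m, log_s = np.zeros(H), np.zeros(H)
W, b = 0.1 * rng.standard_normal((H, V)), np.zeros(V)
print(elbo_estimate(v, m, log_s, W, b))
```

In practice one would average many such estimates and follow their gradient w.r.t. φ and θ, which is what the reparameterisation makes possible.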

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long term goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deep generative model)

Learn then which action to take given the low dimensional representation
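
As a toy sketch of this second step (not the actual Atari system), the following Python snippet runs tabular Q-learning over a handful of discrete states standing in for an already-learned low dimensional representation; the environment, rewards and hyperparameters are all invented for illustration, and in the deep setting Q would itself be a neural network rather than a table.

```python
# Tabular Q-learning on a tiny made-up environment: learn which action to take per state.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1        # learning rate, discount, exploration rate

def step(s, a):
    # toy dynamics: action 1 moves right, action 0 stays; reward only at the last state
    s_next = min(s + a, n_states - 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

for episode in range(200):
    s = 0
    for t in range(20):
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # Q-learning update: move Q(s,a) towards r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(np.round(Q, 2))                    # the learned action values per state
```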

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve the interaction between reinforcement learning and representation learning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company, reinfer

https://reinfer.io

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 10: David Barber - Deep Nets, Bayes and the story of AI

Neurons

Visual Pathway

Information Processing in Brains

Neurons

Re

al

Wo

rld

Layer 1 Layer 2 Highminuslevel

Concepts

Feature

Hierarchical Modular Binary Parallel Noisy

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Artificial Neuron (Perceptron)

weight 7

output neuron

neuron 1neuron 2neuron 3neuron 4

neuron 7neuron 6neuron 5

inputs

weight 1

Training an artificial neural network

Want to generalise to new images with high accuracy

Artificial Network

1957 Rosenblattrsquos perceptron

perceptron film clip

Connectionism

1960 Realised a perceptron can only solve simple tasks

1970 Decline in interest

1980 New computing power made training multilayer networks feasible

outputinputs

Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(

sumi wijhi)

Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)

NNs are difficult to train (many local optima)

Particularly difficult to train a NN with a large number of layers (say largerthan around 10)

lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo

Machine Learning (up to 2006)

A large section of the machine learning community abandoned NNs

More principled and computationally better understood techniques (SVMsand related convex methods) replaced them

Bayesian AI (1990s onwards)

From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes

Uncertainty and reasoning are not naturally representable using standardfeed-forward nets

Explosion in more lsquosymbolicrsquo Bayesian AI

Deep Learning

NNs have resurged in interest in the last few years (Hinton Bengio )

Also called lsquodeep learningrsquo

Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques

The brain uses hierarchical distributed processing and it is likely to be for agood reason

Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc

Why now

New computing resources (GPU processing)

Availability of large amount of data means that we can train nets with manyparameters (1010)

Recent evidence suggests local optima are not particularly problematic

Autoencoder

y1 y2 y3 y4 y5

h1 h2 h3

h4 h5

y1 y2 y3 y4 y5

h6 h7 h8

The bottleneck forces the network to try to find a low dimensionalrepresentation of the data

Useful for unsupervised learning

Autoencoder on MNIST digits (Hinton 2006 Science)

Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA

60000 training images (28times 28 = 784 pixels)

Use a form of autoencoder to find a lower (30) dimensional representation

At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation

Google Cats

10 Million Youtube video frames (200x200 pixel images)

Use a specialised autoencoder with 9 layers (1 billion weights)

2000 computers + two weeks of computing

Examine units to see what images they most respond to

Google Autoencoder

From Nando De Freitas

Convolutional NNs

CNNs are particularly popular in image processing

Often the feature maps correspond (not to macro features such as bicycles)but micro features

For example in handwritten digit recognition they correspond to smallconstituent parts of the digits

These are used then to process the image into a representation that is betterfor recognition

NNs in NLP

Bag of Words

We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index

We can also think of this as a Euclidean embedding e

aardvarkrarr eaardvark =

100

zorrorarr ezorro =

001

Word Embeddings

Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned

Objective is for example next word prediction accuracy

These are often called lsquoneural language modelsrsquo

NNs in NLP

Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used

Consider the sentence

the cat sat on the mat

and that we wish to predict the word on given the two preceding cat sat

and two succeeding words the mat

We can use a network that has inputs vcat vsat vthe vmat

The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high

The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context

Word Embeddings

Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)

Word Embeddings

There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender

vwoman minus vman asymp vaunt minus vuncle

vwoman minus vman asymp vqueen minus vking

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France-Paris we get the lsquorelationshiprsquo embedding

v = vParis minus vFrance

Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example – Part III: Inference

Initial Evidence: the alarm is sounding.

p(B = 1|A = 1) = ∑_{E,R} p(B = 1, E, A = 1, R) / ∑_{B,E,R} p(B, E, A = 1, R)
              = ∑_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / ∑_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E)
              ≈ 0.99

Additional Evidence: the radio broadcasts an earthquake warning.

A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01.

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
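
The two numbers above can be reproduced by brute-force enumeration of the joint distribution. A minimal Python sketch (not part of the original slides) using the tables given earlier:

    import itertools

    # Conditional probability tables from the slides.
    p_B = {1: 0.01, 0: 0.99}
    p_E = {1: 0.000001, 0: 0.999999}
    p_A1 = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}   # p(A=1 | B, E)
    p_R1 = {1: 1.0, 0: 0.0}                                               # p(R=1 | E)

    def joint(a, r, e, b):
        pa = p_A1[(b, e)] if a == 1 else 1 - p_A1[(b, e)]
        pr = p_R1[e] if r == 1 else 1 - p_R1[e]
        return pa * pr * p_E[e] * p_B[b]

    def posterior_burglar(evidence):
        """p(B = 1 | evidence), summing the joint over all variables not in the evidence."""
        num = den = 0.0
        for a, r, e, b in itertools.product([0, 1], repeat=4):
            values = {'A': a, 'R': r, 'E': e, 'B': b}
            if any(values[k] != v for k, v in evidence.items()):
                continue
            p = joint(a, r, e, b)
            den += p
            if b == 1:
                num += p
        return num / den

    print(posterior_burglar({'A': 1}))            # approximately 0.99
    print(posterior_burglar({'A': 1, 'R': 1}))    # approximately 0.01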

Markov Models

For timeseries data v_1, ..., v_T we need a model p(v_{1:T}). For causal consistency it is meaningful to consider the decomposition

p(v_{1:T}) = ∏_{t=1}^{T} p(v_t|v_{1:t-1})

with the convention p(v_t|v_{1:t-1}) = p(v_1) for t = 1.

[Figure: cascade belief network over v_1, v_2, v_3, v_4]

Independence assumptions
It is often natural to assume that the influence of the immediate past is more relevant than the remote past; in Markov models only a limited number of previous observations are required to predict the future.

Markov Chain

Only the recent past is relevant

p(v_t|v_1, ..., v_{t-1}) = p(v_t|v_{t-L}, ..., v_{t-1})

where L ≥ 1 is the order of the Markov chain.

p(v_{1:T}) = p(v_1) p(v_2|v_1) p(v_3|v_2) ... p(v_T|v_{T-1})

For a stationary Markov chain the transitions p(v_t = s'|v_{t-1} = s) = f(s', s) are time-independent ('homogeneous').

Figure: (a) first-order Markov chain v_1 → v_2 → v_3 → v_4; (b) second-order Markov chain.

Markov Chains

[Figure: first-order Markov chain v_1 → v_2 → v_3 → v_4]

p(v_1, ..., v_T) = p(v_1) ∏_{t=2}^{T} p(v_t|v_{t-1})    (the first factor is the initial distribution; the remaining factors are the transitions)

State transition diagram
Nodes represent states of the variable v, and arcs non-zero elements of the transition p(v_t|v_{t-1}).

[Figure: state transition diagram over states 1, ..., 9]

Most probable and shortest paths

[Figure: state transition diagram over states 1, ..., 9 (as above)]

The shortest (unweighted) path from state 1 to state 7 is 1 − 2 − 7.

The most probable path from state 1 to state 7 is 1 − 8 − 9 − 7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1 − 2 − 7 the probability of exiting state 2 into state 7 is 1/5.
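
A minimal Python sketch of the idea (not from the original slides, and using a small hypothetical transition matrix rather than the full 9-state diagram): running Dijkstra's algorithm on edge weights −log p finds the most probable path, since minimising a sum of −log probabilities maximises the product of probabilities:

    import heapq
    import math

    # Hypothetical transition probabilities loosely following the example: state 2 has five
    # successors, so exiting 2 into 7 has probability 1/5, while the longer route through
    # states 8 and 9 is deterministic after the first step.
    P = {
        1: {2: 0.5, 8: 0.5},
        2: {3: 0.2, 4: 0.2, 5: 0.2, 6: 0.2, 7: 0.2},
        8: {9: 1.0},
        9: {7: 1.0},
    }

    def most_probable_path(P, start, goal):
        """Dijkstra on weights -log p: the returned path maximises the product of transition probabilities."""
        queue = [(0.0, start, [start])]
        visited = {}
        while queue:
            cost, state, path = heapq.heappop(queue)
            if state == goal:
                return path, math.exp(-cost)
            if state in visited and visited[state] <= cost:
                continue
            visited[state] = cost
            for nxt, p in P.get(state, {}).items():
                heapq.heappush(queue, (cost - math.log(p), nxt, path + [nxt]))
        return None, 0.0

    path, prob = most_probable_path(P, 1, 7)
    print(path, prob)   # [1, 8, 9, 7] with probability 0.5, versus 0.1 for the shorter path [1, 2, 7]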

Equilibrium distribution

It is interesting to know how the marginal p(x_t) evolves through time:

p(x_t = i) = ∑_j p(x_t = i|x_{t-1} = j) p(x_{t-1} = j) = ∑_j M_{ij} p(x_{t-1} = j),   where M_{ij} ≡ p(x_t = i|x_{t-1} = j)

p(x_t = i) is the frequency that we visit state i at time t, given we started from p(x_1) and randomly drew samples from the transition p(x_τ|x_{τ-1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

p_t = M^{t-1} p_1

If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:

p_∞ = M p_∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
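
A minimal numerical sketch (not from the original slides, with a small hypothetical 3-state transition matrix): the equilibrium distribution can be found either by repeatedly applying M (power iteration) or from the eigenvector of M with unit eigenvalue:

    import numpy as np

    # Column-stochastic transition matrix: M[i, j] = p(x_t = i | x_{t-1} = j).
    M = np.array([[0.90, 0.20, 0.10],
                  [0.05, 0.70, 0.30],
                  [0.05, 0.10, 0.60]])

    # Power iteration: p_t = M^(t-1) p_1 converges to the equilibrium distribution.
    p = np.array([1.0, 0.0, 0.0])          # initial distribution p_1
    for _ in range(1000):
        p = M @ p

    # Compare with the (normalised) eigenvector of M whose eigenvalue is 1.
    vals, vecs = np.linalg.eig(M)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    v = v / v.sum()

    print(p)   # equilibrium by iteration
    print(v)   # equilibrium from the unit-eigenvalue eigenvector (should agree)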

PageRank

Define the matrix

A_{ij} = 1 if website j has a hyperlink to website i, and 0 otherwise

From this we can define a Markov transition matrix with elements

M_{ij} = A_{ij} / ∑_{i'} A_{i'j}

If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain the word is then returned, ranked according to the importance of the site.
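
A minimal sketch (not from the original slides) with a small hypothetical link matrix; the damping factor used in the full PageRank algorithm is omitted since it is not discussed here:

    import numpy as np

    # Hypothetical link matrix: A[i, j] = 1 if website j has a hyperlink to website i.
    A = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 1],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    # Markov transition matrix: M[i, j] = A[i, j] / sum over i' of A[i', j]  (columns normalised).
    M = A / A.sum(axis=0, keepdims=True)

    # Equilibrium distribution by power iteration = 'importance' of each website.
    p = np.ones(A.shape[1]) / A.shape[1]
    for _ in range(200):
        p = M @ p

    print(p)
    print('websites ranked by importance:', np.argsort(-p))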

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

p(h_{1:T}, v_{1:T}) = p(v_1|h_1) p(h_1) ∏_{t=2}^{T} p(v_t|h_t) p(h_t|h_{t-1})

For a stationary HMM the transition p(h_t|h_{t-1}) and emission p(v_t|h_t) distributions are constant through time.

Figure: a first-order hidden Markov model with 'hidden' variables dom(h_t) = {1, ..., H}, t = 1, ..., T. The 'visible' variables v_t can be either discrete or continuous. [Graph: h_1 → h_2 → h_3 → h_4, with emissions h_t → v_t]

The classical inference problems

Filtering (inferring the present):      p(h_t|v_{1:t})
Prediction (inferring the future):      p(h_t|v_{1:s}),  t > s
Smoothing (inferring the past):         p(h_t|v_{1:u}),  t < u
Likelihood:                             p(v_{1:T})
Most likely path (Viterbi alignment):   argmax_{h_{1:T}} p(h_{1:T}|v_{1:T})

For prediction, one is also often interested in p(v_t|v_{1:s}) for t > s.

Inference in Hidden Markov Models

Belief network representation of an HMM:

[Figure: h_1 → h_2 → h_3 → h_4 with emissions h_t → v_t]

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).

The algorithms are variants of 'message passing on factor graphs'.

The algorithm is guaranteed to work if the graph is singly-connected.

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).
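
As an illustration of filtering, a minimal Python sketch of the forward algorithm for a small hypothetical HMM (not from the original slides). Each step costs O(H^2), so the total cost is linear in the length of the timeseries and quadratic in the number of hidden states:

    import numpy as np

    def filtering(transition, emission, initial, observations):
        """Forward algorithm returning alpha[t, i] = p(h_t = i | v_{1:t}).
        transition[i, j] = p(h_t = i | h_{t-1} = j); emission[v, i] = p(v_t = v | h_t = i)."""
        H = len(initial)
        alpha = np.zeros((len(observations), H))
        a = emission[observations[0]] * initial
        alpha[0] = a / a.sum()
        for t in range(1, len(observations)):
            a = emission[observations[t]] * (transition @ alpha[t - 1])
            alpha[t] = a / a.sum()     # normalising at each step keeps the filtered posterior
        return alpha

    # A hypothetical two-state HMM with three possible observation symbols.
    transition = np.array([[0.8, 0.3],
                           [0.2, 0.7]])
    emission = np.array([[0.6, 0.1],
                         [0.3, 0.3],
                         [0.1, 0.6]])
    initial = np.array([0.5, 0.5])

    print(filtering(transition, emission, initial, observations=[0, 0, 2, 2, 1]))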

HMMs for speech recognition

h_t is the phoneme at time t; p(h_t|h_{t-1}) – language model; p(v_t|h_t) – speech signal model.

Deep Nets and HMMs

[Figure: HMM belief network, h_1 → h_2 → h_3 → h_4 with emissions h_t → v_t]

Recently companies including Google have made big advances in speech recognition.

The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t; θ) (a small sketch is given below).

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
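
A minimal sketch of this emission model (hypothetical sizes and an untrained toy network, not the production system): the mean of the Gaussian over an acoustic frame is the output of a network applied to the phoneme:

    import numpy as np

    rng = np.random.default_rng(0)
    NUM_PHONEMES, DIM_V, HIDDEN = 40, 25, 64
    SIGMA2 = 0.1

    # A tiny two-layer network mapping a phoneme index to the mean of the acoustic vector.
    W1 = rng.normal(0, 0.1, (HIDDEN, NUM_PHONEMES))
    b1 = np.zeros(HIDDEN)
    W2 = rng.normal(0, 0.1, (DIM_V, HIDDEN))
    b2 = np.zeros(DIM_V)

    def mu(phoneme):
        onehot = np.zeros(NUM_PHONEMES)
        onehot[phoneme] = 1.0
        return W2 @ np.tanh(W1 @ onehot + b1) + b2

    def log_emission(v, phoneme):
        """log p(v_t | h_t): Gaussian with network mean mu(h_t; theta) and spherical covariance."""
        d = v - mu(phoneme)
        return -0.5 * (DIM_V * np.log(2 * np.pi * SIGMA2) + d @ d / SIGMA2)

    v = rng.normal(size=DIM_V)     # a dummy acoustic frame
    print(log_emission(v, phoneme=7))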

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

[Figure: generative model with latent variables h_1, h_2 and visible variables v_1, v_2, v_3, v_4]

It is natural to consider that objects (images, for example) can be constructed on the basis of a low-dimensional representation.

Note that this is a Graphical Model, not a Function.

The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h).
One cannot use an autoencoder to generate new images.

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in these models.

Statisticians typically use sampling as an approximation.

Very popular in ML to use a variational method – much faster for inference.

Variational Inference
Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and that we wish to learn θ to maximise the probability this model generates observed data. For any distribution q(h|v, φ),

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log [ p(v|h, θ) p(h) ]

Idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound with respect to φ and θ (a small numerical sketch of the bound is given below).

We can parameterise p(v|h, θ) using a deep network.

Very popular approach – see the 'variational autoencoder' and also attention mechanisms.

Extension to semi-supervised method using p(v) = ∫_h ∑_c p(v|h, c) p(c) p(h)
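
A minimal numerical sketch of the bound (not from the original slides): a standard Gaussian prior p(h), a Bernoulli decoder p(v|h, θ), a Gaussian q(h|v, φ), and a Monte Carlo estimate of the bound using samples from q. The toy linear maps below stand in for deep networks:

    import numpy as np

    rng = np.random.default_rng(0)
    DIM_V, DIM_H = 8, 2

    # Decoder p(v|h, theta): Bernoulli with logits from a small linear map.
    W_dec = rng.normal(0, 0.5, (DIM_V, DIM_H))

    # Encoder q(h|v, phi): Gaussian with mean and log-variance given by linear maps of v.
    W_mu = rng.normal(0, 0.5, (DIM_H, DIM_V))
    W_logvar = rng.normal(0, 0.5, (DIM_H, DIM_V))

    def elbo(v, num_samples=100):
        """Monte Carlo estimate of the variational lower bound on log p(v|theta)."""
        mu, logvar = W_mu @ v, W_logvar @ v
        std = np.exp(0.5 * logvar)
        total = 0.0
        for _ in range(num_samples):
            h = mu + std * rng.normal(size=DIM_H)                              # h ~ q(h|v, phi)
            logits = W_dec @ h
            log_p_v_h = np.sum(v * logits - np.log1p(np.exp(logits)))          # Bernoulli log-likelihood
            log_p_h = -0.5 * np.sum(h**2 + np.log(2 * np.pi))                  # standard Gaussian prior
            log_q_h = -0.5 * np.sum((h - mu)**2 / std**2 + logvar + np.log(2 * np.pi))
            total += log_p_v_h + log_p_h - log_q_h
        return total / num_samples

    v = (rng.random(DIM_V) > 0.5).astype(float)   # a dummy binary observation
    print(elbo(v))   # jointly maximising this quantity over theta and phi is the learning problem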

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals.

Problem is that the number of pixel states is enormous.

Need to learn a low-dimensional representation of the screen (use a deep generative model).

Then learn which action to take given the low-dimensional representation.
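
The slides do not give the algorithm itself; as a stand-in, a minimal tabular Q-learning sketch on a toy chain world illustrates the 'learn which action to take given a (low-dimensional) state' step. The Atari work itself uses deep networks over learned screen representations rather than a table:

    import numpy as np

    rng = np.random.default_rng(0)
    N_STATES, N_ACTIONS = 5, 2      # toy chain world: action 0 = move left, action 1 = move right
    alpha, gamma = 0.1, 0.9         # learning rate and discount factor

    def step(state, action):
        nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if nxt == N_STATES - 1 else 0.0    # reward for being at the right-hand end
        return nxt, reward

    Q = np.zeros((N_STATES, N_ACTIONS))
    for episode in range(2000):
        s = int(rng.integers(N_STATES))
        for _ in range(20):
            a = int(rng.integers(N_ACTIONS))            # act randomly; Q-learning is off-policy
            s2, r = step(s, a)
            # Move Q(s, a) towards the reward plus the discounted value of the best next action.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            s = s2

    print(np.argmax(Q, axis=1))   # greedy policy: should choose action 1 (move right) in every state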

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing.

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning.

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io



The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 12: David Barber - Deep Nets, Bayes and the story of AI

Information Processing in Brains

Neurons

Re

al

Wo

rld

Layer 1 Layer 2 Highminuslevel

Concepts

Feature

Hierarchical Modular Binary Parallel Noisy

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Artificial Neuron (Perceptron)

weight 7

output neuron

neuron 1neuron 2neuron 3neuron 4

neuron 7neuron 6neuron 5

inputs

weight 1

Training an artificial neural network

Want to generalise to new images with high accuracy

Artificial Network

1957 Rosenblattrsquos perceptron

perceptron film clip

Connectionism

1960 Realised a perceptron can only solve simple tasks

1970 Decline in interest

1980 New computing power made training multilayer networks feasible

outputinputs

Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(

sumi wijhi)

Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)

NNs are difficult to train (many local optima)

Particularly difficult to train a NN with a large number of layers (say largerthan around 10)

lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo

Machine Learning (up to 2006)

A large section of the machine learning community abandoned NNs

More principled and computationally better understood techniques (SVMsand related convex methods) replaced them

Bayesian AI (1990s onwards)

From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes

Uncertainty and reasoning are not naturally representable using standardfeed-forward nets

Explosion in more lsquosymbolicrsquo Bayesian AI

Deep Learning

NNs have resurged in interest in the last few years (Hinton Bengio )

Also called lsquodeep learningrsquo

Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques

The brain uses hierarchical distributed processing and it is likely to be for agood reason

Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc

Why now

New computing resources (GPU processing)

Availability of large amount of data means that we can train nets with manyparameters (1010)

Recent evidence suggests local optima are not particularly problematic

Autoencoder

y1 y2 y3 y4 y5

h1 h2 h3

h4 h5

y1 y2 y3 y4 y5

h6 h7 h8

The bottleneck forces the network to try to find a low dimensionalrepresentation of the data

Useful for unsupervised learning

Autoencoder on MNIST digits (Hinton 2006 Science)

Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA

60000 training images (28times 28 = 784 pixels)

Use a form of autoencoder to find a lower (30) dimensional representation

At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation

Google Cats

10 Million Youtube video frames (200x200 pixel images)

Use a specialised autoencoder with 9 layers (1 billion weights)

2000 computers + two weeks of computing

Examine units to see what images they most respond to

Google Autoencoder

From Nando De Freitas

Convolutional NNs

CNNs are particularly popular in image processing

Often the feature maps correspond (not to macro features such as bicycles)but micro features

For example in handwritten digit recognition they correspond to smallconstituent parts of the digits

These are used then to process the image into a representation that is betterfor recognition

NNs in NLP

Bag of Words

We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index

We can also think of this as a Euclidean embedding e

aardvarkrarr eaardvark =

100

zorrorarr ezorro =

001

Word Embeddings

Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned

Objective is for example next word prediction accuracy

These are often called lsquoneural language modelsrsquo

NNs in NLP

Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used

Consider the sentence

the cat sat on the mat

and that we wish to predict the word on given the two preceding cat sat

and two succeeding words the mat

We can use a network that has inputs vcat vsat vthe vmat

The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high

The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context

Word Embeddings

Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)

Word Embeddings

There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender

vwoman minus vman asymp vaunt minus vuncle

vwoman minus vman asymp vqueen minus vking

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France-Paris we get the lsquorelationshiprsquo embedding

v = vParis minus vFrance

Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example – Part III: Inference

Initial Evidence: The alarm is sounding

p(B = 1|A = 1) = \frac{\sum_{E,R} p(B = 1, E, A = 1, R)}{\sum_{B,E,R} p(B, E, A = 1, R)} = \frac{\sum_{E,R} p(A = 1|B = 1, E)\,p(B = 1)\,p(E)\,p(R|E)}{\sum_{B,E,R} p(A = 1|B, E)\,p(B)\,p(E)\,p(R|E)} \approx 0.99

Additional Evidence: The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1, R = 1) \approx 0.01

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
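A minimal sketch (mine, not from the slides) that reproduces both numbers by brute-force enumeration of the factorised joint p(A|B,E) p(R|E) p(E) p(B):

```python
import itertools

p_B = {1: 0.01, 0: 0.99}
p_E = {1: 0.000001, 0: 0.999999}
p_A1_given_BE = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}  # p(A=1|B,E)
p_R1_given_E = {1: 1.0, 0: 0.0}                                               # p(R=1|E)

def joint(a, r, e, b):
    pa = p_A1_given_BE[(b, e)] if a == 1 else 1 - p_A1_given_BE[(b, e)]
    pr = p_R1_given_E[e] if r == 1 else 1 - p_R1_given_E[e]
    return pa * pr * p_E[e] * p_B[b]

def p_burglar(b_query, evidence):
    """p(B = b_query | evidence), where evidence fixes a subset of {'A', 'R'}."""
    num = den = 0.0
    for a, r, e, b in itertools.product([0, 1], repeat=4):
        if any(val != {'A': a, 'R': r}[var] for var, val in evidence.items()):
            continue
        p = joint(a, r, e, b)
        den += p
        if b == b_query:
            num += p
    return num / den

print(p_burglar(1, {'A': 1}))           # approximately 0.99
print(p_burglar(1, {'A': 1, 'R': 1}))   # approximately 0.01
```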

Markov Models

For timeseries data v_1, \ldots, v_T we need a model p(v_{1:T}). For causal consistency it is meaningful to consider the decomposition

p(v_{1:T}) = \prod_{t=1}^{T} p(v_t|v_{1:t-1})

with the convention p(v_t|v_{1:t-1}) = p(v_1) for t = 1

[Figure: belief network on v1, v2, v3, v4]

Independence assumptions
It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(v_t|v_1, \ldots, v_{t-1}) = p(v_t|v_{t-L}, \ldots, v_{t-1})

where L \ge 1 is the order of the Markov chain

p(v_{1:T}) = p(v_1)\,p(v_2|v_1)\,p(v_3|v_2) \cdots p(v_T|v_{T-1})

For a stationary Markov chain the transitions p(v_t = s'|v_{t-1} = s) = f(s', s) are time-independent ('homogeneous')

[Figure: (a) First order Markov chain. (b) Second order Markov chain.]
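A small illustration (mine, with a made-up 3-state transition matrix) of sampling from a stationary first order Markov chain and evaluating the factorised probability p(v_{1:T}):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state chain: M[i, j] = p(v_t = i | v_{t-1} = j), columns sum to 1.
M = np.array([[0.8, 0.1, 0.3],
              [0.1, 0.8, 0.3],
              [0.1, 0.1, 0.4]])
p1 = np.array([1.0, 0.0, 0.0])   # initial distribution p(v_1)

def sample_chain(T):
    v = [rng.choice(3, p=p1)]
    for _ in range(T - 1):
        v.append(rng.choice(3, p=M[:, v[-1]]))
    return v

def log_prob(v):
    # log p(v_{1:T}) = log p(v_1) + sum_t log p(v_t | v_{t-1})
    lp = np.log(p1[v[0]])
    for prev, cur in zip(v[:-1], v[1:]):
        lp += np.log(M[cur, prev])
    return lp

seq = sample_chain(10)
print(seq, log_prob(seq))
```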

Markov Chains

[Figure: first order Markov chain v1 → v2 → v3 → v4]

p(v_1, \ldots, v_T) = \underbrace{p(v_1)}_{\text{initial}} \prod_{t=2}^{T} \underbrace{p(v_t|v_{t-1})}_{\text{transition}}

State transition diagram
Nodes represent states of the variable v, and arcs non-zero elements of the transition p(v_t|v_{t-1})

[Figure: state transition diagram on states 1–9]

Most probable and shortest paths

[Figure: state transition diagram on states 1–9]

The shortest (unweighted) path from state 1 to state 7 is 1−2−7

The most probable path from state 1 to state 7 is 1−8−9−7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1−2−7 the probability of exiting state 2 into state 7 is 1/5
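The slide's 9-state transition diagram is not reproduced here, so the neighbour structure below is a hypothetical one chosen to match the two paths being compared (uniform transitions, with state 2 having five successors). A most probable path can be found with Dijkstra's algorithm on edge costs −log p, since minimising the summed cost maximises the product of transition probabilities:

```python
import heapq
import math

# Hypothetical uniform transitions: probs[i][j] = p(next = j | current = i).
probs = {
    1: {2: 0.5, 8: 0.5},
    2: {3: 0.2, 4: 0.2, 5: 0.2, 6: 0.2, 7: 0.2},
    8: {1: 0.5, 9: 0.5},
    9: {8: 0.5, 7: 0.5},
    3: {}, 4: {}, 5: {}, 6: {}, 7: {},
}

def most_probable_path(start, goal):
    """Dijkstra with edge cost -log p: the cheapest path is the most probable one."""
    queue = [(0.0, start, [start])]
    best = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path, math.exp(-cost)
        if best.get(node, math.inf) < cost:
            continue
        best[node] = cost
        for nxt, p in probs[node].items():
            heapq.heappush(queue, (cost - math.log(p), nxt, path + [nxt]))
    return None, 0.0

# Returns ([1, 8, 9, 7], 0.125); the shorter path 1-2-7 has probability only 0.5 * 0.2 = 0.1.
print(most_probable_path(1, 7))
```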

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(x_t = i) = \sum_j \underbrace{p(x_t = i|x_{t-1} = j)}_{M_{ij}}\, p(x_{t-1} = j)

p(x_t = i) is the frequency that we visit state i at time t, given we started from p(x_1) and randomly drew samples from the transition p(x_\tau|x_{\tau-1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

p_t = M^{t-1} p_1

If, for t \to \infty, p_\infty is independent of the initial distribution p_1, then p_\infty is called the equilibrium distribution of the chain:

p_\infty = M p_\infty

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix
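A short numpy sketch (mine, with a made-up 3-state transition matrix) showing both views of the equilibrium distribution: iterating p_t = M^{t-1} p_1, and reading off the eigenvector of M with unit eigenvalue:

```python
import numpy as np

# Column-stochastic transition matrix: M[i, j] = p(x_t = i | x_{t-1} = j).
M = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.30],
              [0.05, 0.10, 0.60]])

# Power iteration: p_t = M^{t-1} p_1.
p = np.array([1.0, 0.0, 0.0])
for _ in range(200):
    p = M @ p
print("p_infinity by iteration:   ", p)

# Eigenvector of M with eigenvalue 1, normalised to sum to one.
vals, vecs = np.linalg.eig(M)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
print("unit-eigenvalue eigenvector:", v / v.sum())
```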

PageRank

Define the matrix

A_{ij} = \begin{cases} 1 & \text{if website } j \text{ has a hyperlink to website } i \\ 0 & \text{otherwise} \end{cases}

From this we can define a Markov transition matrix with elements

M_{ij} = \frac{A_{ij}}{\sum_{i'} A_{i'j}}

If we jump from website to website, the equilibrium distribution component p_\infty(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i

For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain word w is then returned, ranked according to the importance of the site
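A minimal PageRank-style sketch (mine; it uses a tiny made-up link matrix and omits the damping factor used in the full algorithm):

```python
import numpy as np

# A[i, j] = 1 if website j links to website i (4 hypothetical sites).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Markov transition matrix: M_ij = A_ij / sum_i' A_i'j  (normalise each column).
M = A / A.sum(axis=0, keepdims=True)

# Equilibrium distribution by power iteration: the 'importance' of each site.
p = np.full(4, 0.25)
for _ in range(1000):
    p = M @ p
print(p)   # ranks the sites by how often a random surfer visits them
```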

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

p(h_{1:T}, v_{1:T}) = p(v_1|h_1)\,p(h_1) \prod_{t=2}^{T} p(v_t|h_t)\,p(h_t|h_{t-1})

For a stationary HMM the transition p(h_t|h_{t-1}) and emission p(v_t|h_t) distributions are constant through time

[Figure: A first order hidden Markov model with 'hidden' variables dom(h_t) = {1, \ldots, H}, t = 1, \ldots, T. The 'visible' variables v_t can be either discrete or continuous.]

The classical inference problems

Filtering (inferring the present): p(h_t|v_{1:t})
Prediction (inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): \mathrm{argmax}_{h_{1:T}}\, p(h_{1:T}|v_{1:T})

For prediction one is also often interested in p(v_t|v_{1:s}) for t > s
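As an example of the first of these problems, here is a compact sketch (mine) of the filtering recursion: predict with the transition, correct with the emission, and normalise. The cost is linear in T and quadratic in the number of hidden states H:

```python
import numpy as np

def filtering(v, p_h1, trans, emis):
    """p(h_t | v_{1:t}) for all t.

    p_h1:  (H,)    initial distribution p(h_1)
    trans: (H, H)  trans[i, j] = p(h_t = i | h_{t-1} = j)
    emis:  (V, H)  emis[k, i]  = p(v_t = k | h_t = i)
    v:     list of observed symbols in {0, ..., V-1}
    """
    alpha = emis[v[0]] * p_h1
    alpha /= alpha.sum()
    posteriors = [alpha]
    for t in range(1, len(v)):
        alpha = emis[v[t]] * (trans @ alpha)   # predict with transition, correct with emission
        alpha /= alpha.sum()                    # normalise to get p(h_t | v_{1:t})
        posteriors.append(alpha)
    return np.array(posteriors)

# Tiny 2-state, 2-symbol example with made-up numbers.
p_h1 = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.2],
                  [0.1, 0.8]])
emis = np.array([[0.8, 0.3],
                 [0.2, 0.7]])
print(filtering([0, 0, 1, 1], p_h1, trans, emis))
```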

Inference in Hidden Markov Models

Belief network representation of a HMM


Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states)

The algorithms are variants of 'message passing on factor graphs'

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes)

HMMs for speech recognition

h_t is the phoneme at time t; p(h_t|h_{t-1}) – language model; p(v_t|h_t) – speech signal model

Deep Nets and HMMs

[Figure: HMM with hidden phoneme variables h1, . . . , h4 emitting v1, . . . , v4]

Recently companies including Google have made big advances in speech recognition

The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, \mu(h_t; \theta)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

[Figure: generative model with latent variables h1, h2 and visible variables v1, . . . , v4]

It is natural to consider that objects (images for example) can be constructed on the basis of a low dimensional representation

Note that this is a Graphical Model, not a Function

The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images
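A toy sketch (mine) of ancestral sampling from such a generative model, with made-up dimensions and a fixed nonlinear decoder standing in for a learned p(v|h):

```python
import numpy as np

rng = np.random.default_rng(1)

H, V = 2, 16                      # 2 latent dimensions, 16 'pixels'
W = rng.normal(size=(V, H))       # decoder weights (would be learned in practice)
b = rng.normal(size=V)
noise_std = 0.1

def sample_image():
    h = rng.normal(size=H)                      # h ~ p(h) = N(0, I)
    mean = np.tanh(W @ h + b)                   # mean of p(v|h)
    v = mean + noise_std * rng.normal(size=V)   # v ~ N(mean, noise_std^2 I)
    return h, v

h, v = sample_image()
print(h, v.reshape(4, 4).round(2))
```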

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in these models

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method – much faster for inference

Variational Inference
Consider a distribution

p(v|\theta) = \int_h p(v|h, \theta)\, p(h)

and that we wish to learn \theta to maximise the probability this model generates observed data

\log p(v|\theta) \ge -\int_h q(h|v, \phi) \log q(h|v, \phi) + \int_h q(h|v, \phi) \log p(v|h, \theta)\, p(h)

Idea is to choose a 'variational' distribution q(h|v, \phi) such that we can either calculate the bound analytically or sample it efficiently

We then jointly maximise the bound w.r.t. \phi and \theta

We can parameterise p(v|h, \theta) using a deep network

Very popular approach – see 'variational autoencoder' and also attention mechanisms

Extension to semi-supervised method using p(v) = \int_h \sum_c p(v|h, c)\,p(c)\,p(h)
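A rough numpy sketch (mine) of a single-sample Monte Carlo estimate of the bound, with a diagonal Gaussian q(h|v, φ) and a Gaussian p(v|h, θ); the linear 'encoder' and 'decoder' here are placeholders for the deep networks used in a variational autoencoder:

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 2, 5

# Toy 'decoder' parameters theta and 'encoder' parameters phi (normally neural networks).
W_dec = rng.normal(size=(V, H)); b_dec = np.zeros(V); sigma_x = 0.5
W_mu = rng.normal(size=(H, V));  W_logvar = np.zeros((H, V))

def log_normal(x, mean, var):
    # log density of a diagonal Gaussian
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def elbo_estimate(v):
    # q(h|v, phi): diagonal Gaussian whose moments come from the 'encoder'.
    mu, logvar = W_mu @ v, W_logvar @ v
    var = np.exp(logvar)
    h = mu + np.sqrt(var) * rng.normal(size=H)               # reparameterised sample h ~ q
    log_q = log_normal(h, mu, var)
    log_prior = log_normal(h, np.zeros(H), np.ones(H))       # p(h) = N(0, I)
    log_lik = log_normal(v, W_dec @ h + b_dec, sigma_x ** 2 * np.ones(V))  # p(v|h, theta)
    return log_lik + log_prior - log_q                       # single-sample bound estimate

v = rng.normal(size=V)
print(elbo_estimate(v))
```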

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take for any state of W that will be best for our long term goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deep generative model)

Then learn which action to take given the low dimensional representation
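The slides do not spell out the learning rule, so as a hedged illustration here is a tiny tabular Q-learning sketch on a made-up chain of low dimensional states; the Atari systems replace the table with a deep network acting on a learned representation of the screen:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1

def step(s, a):
    """Made-up environment: action 1 moves right, action 0 moves left; reward at the far end."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

for episode in range(500):
    s = 0
    for t in range(20):
        # epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # Q-learning update: move Q(s, a) towards r + gamma * max_a' Q(s', a').
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.round(2))   # the 'move right' action should dominate in every state
```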

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io

Artificial Neuron (Perceptron)

weight 7

output neuron

neuron 1neuron 2neuron 3neuron 4

neuron 7neuron 6neuron 5

inputs

weight 1

Training an artificial neural network

Want to generalise to new images with high accuracy

Artificial Network

1957 Rosenblattrsquos perceptron

perceptron film clip

Connectionism

1960 Realised a perceptron can only solve simple tasks

1970 Decline in interest

1980 New computing power made training multilayer networks feasible

outputinputs

Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(

sumi wijhi)

Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)

NNs are difficult to train (many local optima)

Particularly difficult to train a NN with a large number of layers (say largerthan around 10)

lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo

Machine Learning (up to 2006)

A large section of the machine learning community abandoned NNs

More principled and computationally better understood techniques (SVMsand related convex methods) replaced them

Bayesian AI (1990s onwards)

From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes

Uncertainty and reasoning are not naturally representable using standardfeed-forward nets

Explosion in more lsquosymbolicrsquo Bayesian AI

Deep Learning

NNs have resurged in interest in the last few years (Hinton Bengio )

Also called lsquodeep learningrsquo

Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques

The brain uses hierarchical distributed processing and it is likely to be for agood reason

Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc

Why now

New computing resources (GPU processing)

Availability of large amount of data means that we can train nets with manyparameters (1010)

Recent evidence suggests local optima are not particularly problematic

Autoencoder

y1 y2 y3 y4 y5

h1 h2 h3

h4 h5

y1 y2 y3 y4 y5

h6 h7 h8

The bottleneck forces the network to try to find a low dimensionalrepresentation of the data

Useful for unsupervised learning

Autoencoder on MNIST digits (Hinton 2006 Science)

Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA

60000 training images (28times 28 = 784 pixels)

Use a form of autoencoder to find a lower (30) dimensional representation

At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation

Google Cats

10 Million Youtube video frames (200x200 pixel images)

Use a specialised autoencoder with 9 layers (1 billion weights)

2000 computers + two weeks of computing

Examine units to see what images they most respond to

Google Autoencoder

From Nando De Freitas

Convolutional NNs

CNNs are particularly popular in image processing

Often the feature maps correspond (not to macro features such as bicycles)but micro features

For example in handwritten digit recognition they correspond to smallconstituent parts of the digits

These are used then to process the image into a representation that is betterfor recognition

NNs in NLP

Bag of Words

We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index

We can also think of this as a Euclidean embedding e

aardvarkrarr eaardvark =

100

zorrorarr ezorro =

001

Word Embeddings

Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned

Objective is for example next word prediction accuracy

These are often called lsquoneural language modelsrsquo

NNs in NLP

Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used

Consider the sentence

the cat sat on the mat

and that we wish to predict the word on given the two preceding cat sat

and two succeeding words the mat

We can use a network that has inputs vcat vsat vthe vmat

The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high

The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context

Word Embeddings

Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)

Word Embeddings

There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender

vwoman minus vman asymp vaunt minus vuncle

vwoman minus vman asymp vqueen minus vking

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France-Paris we get the lsquorelationshiprsquo embedding

v = vParis minus vFrance

Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1, ..., vT we need a model p(v1:T). For causal consistency it is meaningful to consider the decomposition

p(v1:T) = ∏_{t=1}^{T} p(vt|v1:t−1)

with the convention p(vt|v1:t−1) = p(v1) for t = 1.

[Figure: belief network over v1, v2, v3, v4 for this decomposition]

Independence assumptions
It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.

Markov Chain

Only the recent past is relevant

p(vt|v1, ..., vt−1) = p(vt|vt−L, ..., vt−1)

where L ≥ 1 is the order of the Markov chain. For a first order chain,

p(v1:T) = p(v1) p(v2|v1) p(v3|v2) ... p(vT|vT−1)

For a stationary Markov chain the transitions p(vt = s′|vt−1 = s) = f(s′, s) are time-independent ('homogeneous').

Figure: (a) First order Markov chain. (b) Second order Markov chain.
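For concreteness, a small sketch of sampling from a stationary (homogeneous) first order Markov chain; the transition matrix and initial distribution are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Transition matrix M[s_new, s_old] = p(v_t = s_new | v_{t-1} = s_old); columns sum to 1
M = np.array([[0.9, 0.2, 0.1],
              [0.1, 0.7, 0.3],
              [0.0, 0.1, 0.6]])
p1 = np.array([1.0, 0.0, 0.0])  # initial distribution p(v1)

def sample_chain(T):
    v = [rng.choice(3, p=p1)]
    for _ in range(T - 1):
        v.append(rng.choice(3, p=M[:, v[-1]]))  # draw v_t given v_{t-1}
    return v

print(sample_chain(20))
```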

Markov Chains

p(v1, ..., vT) = p(v1) ∏_{t=2}^{T} p(vt|vt−1)

where p(v1) is the initial distribution and p(vt|vt−1) the transition.

State transition diagram
Nodes represent states of the variable v and arcs non-zero elements of the transition p(vt|vt−1).

[Figure: state transition diagram on states 1, ..., 9]

Most probable and shortest paths

[Figure: state transition diagram on states 1, ..., 9 (as above)]

The shortest (unweighted) path from state 1 to state 7 is 1 − 2 − 7.

The most probable path from state 1 to state 7 is 1 − 8 − 9 − 7 (assuming uniform transition probabilities). The latter path is longer but more probable since, for the path 1 − 2 − 7, the probability of exiting state 2 into state 7 is 1/5.

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) = Σ_j p(xt = i|xt−1 = j) p(xt−1 = j),    with Mij ≡ p(xt = i|xt−1 = j)

p(xt = i) is the frequency that we visit state i at time t, given we started from p(x1) and randomly drew samples from the transition p(xτ|xτ−1). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p1(i) is

pt = M^(t−1) p1

If, for t → ∞, p∞ is independent of the initial distribution p1, then p∞ is called the equilibrium distribution of the chain:

p∞ = M p∞

The equilibrium distribution is proportional to the eigenvector of the transition matrix with unit eigenvalue.
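A small sketch of computing the equilibrium distribution both ways, by repeatedly applying M and via the unit-eigenvalue eigenvector (the column-stochastic transition matrix is an arbitrary assumption):

```python
import numpy as np

# Column-stochastic transition matrix: M[i, j] = p(x_t = i | x_{t-1} = j)
M = np.array([[0.9, 0.2, 0.1],
              [0.1, 0.7, 0.3],
              [0.0, 0.1, 0.6]])

# Route 1: iterate p_t = M p_{t-1} until it stops changing
p = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    p = M @ p
print(p)

# Route 2: eigenvector of M with eigenvalue 1, normalised to sum to 1
vals, vecs = np.linalg.eig(M)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
print(v / v.sum())
```

Both routes agree on p∞, as the fixed-point equation p∞ = M p∞ suggests.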

PageRank

Define the matrix

Aij = 1 if website j has a hyperlink to website i, and 0 otherwise

From this we can define a Markov transition matrix with elements

Mij = Aij / Σ_{i′} Ai′j

If we jump from website to website, the equilibrium distribution component p∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i, a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain the word is then returned, ranked according to the importance of the site.
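A toy sketch of this ranking idea (the link structure and word lists are invented; real PageRank also uses a damping/teleportation term not discussed here):

```python
import numpy as np

# A[i, j] = 1 if site j links to site i (toy 4-site web)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0, keepdims=True)  # M[i, j] = A[i, j] / sum_i' A[i', j]

p = np.ones(4) / 4
for _ in range(200):                  # power iteration towards p_inf = M p_inf
    p = M @ p
importance = p                        # 'importance' of each site

# Inverse word list: word -> sites containing it (toy data)
index = {'cat': [0, 2], 'dog': [1, 2, 3]}
query = 'dog'
ranked = sorted(index[query], key=lambda i: -importance[i])
print(ranked, importance)
```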

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h1:T. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(vt|ht). This defines a joint distribution

p(h1:T, v1:T) = p(v1|h1) p(h1) ∏_{t=2}^{T} p(vt|ht) p(ht|ht−1)

For a stationary HMM the transition p(ht|ht−1) and emission p(vt|ht) distributions are constant through time.

Figure: a first order hidden Markov model with 'hidden' variables dom(ht) = {1, ..., H}, t = 1, ..., T. The 'visible' variables vt can be either discrete or continuous.

The classical inference problems

Filtering (inferring the present): p(ht|v1:t)
Prediction (inferring the future): p(ht|v1:s), t > s
Smoothing (inferring the past): p(ht|v1:u), t < u
Likelihood: p(v1:T)
Most likely path (Viterbi alignment): argmax_{h1:T} p(h1:T|v1:T)

For prediction one is also often interested in p(vt|v1:s) for t > s.
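A minimal sketch of the filtering recursion for a discrete HMM (often called the forward algorithm); the transition, emission and example observations are arbitrary assumptions:

```python
import numpy as np

H = 3                                   # number of hidden states
A = np.array([[0.8, 0.1, 0.2],          # A[i, j] = p(h_t = i | h_{t-1} = j)
              [0.1, 0.8, 0.2],
              [0.1, 0.1, 0.6]])
B = np.array([[0.9, 0.2, 0.1],          # B[v, i] = p(v_t = v | h_t = i)
              [0.1, 0.8, 0.3],
              [0.0, 0.0, 0.6]])
prior = np.ones(H) / H                  # p(h_1)
obs = [0, 0, 1, 2, 1]                   # observed symbols v_{1:T}

def filtering(obs):
    """Return alpha[t, i] = p(h_t = i | v_{1:t}) for each t."""
    alphas = []
    alpha = B[obs[0]] * prior           # proportional to p(h_1, v_1)
    alpha /= alpha.sum()
    alphas.append(alpha)
    for v in obs[1:]:
        alpha = B[v] * (A @ alpha)      # predict with A, correct with the emission
        alpha /= alpha.sum()
        alphas.append(alpha)
    return np.array(alphas)

print(np.round(filtering(obs), 3))
```

Each update costs O(H²), which is why filtering scales linearly with T and quadratically with the number of hidden states.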

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).

The algorithms are variants of 'message passing on factor graphs'.

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).

HMMs for speech recognition

ht is the phoneme at time t; p(ht|ht−1): language model; p(vt|ht): speech signal model.

Deep Nets and HMMs

[Figure: HMM with hidden phonemes h1, ..., h4 and observed audio v1, ..., v4]

Recently, companies including Google have made big advances in speech recognition.

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is some function of the phoneme, μ(ht; θ).

This function is a deep neural network trained on a large amount of data
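As a rough sketch of such an emission model, with a small two-layer network standing in for the deep net (all dimensions and parameters are invented; real systems differ in many details):

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, K = 40, 25, 128      # number of phonemes, acoustic feature dim, hidden width (assumptions)

# A small network mapping a one-hot phoneme to the mean of the emission Gaussian
W1, b1 = rng.normal(0, 0.1, (K, H)), np.zeros(K)
W2, b2 = rng.normal(0, 0.1, (D, K)), np.zeros(D)

def mu(h):
    """Mean of p(v_t | h_t = h): a network function mu(h; theta)."""
    e = np.zeros(H); e[h] = 1.0
    return W2 @ np.tanh(W1 @ e + b1) + b2

def log_emission(v, h, sigma2=1.0):
    """log N(v; mu(h), sigma2 * I), used as log p(v_t | h_t) inside the HMM."""
    d = v - mu(h)
    return -0.5 * (d @ d) / sigma2 - 0.5 * D * np.log(2 * np.pi * sigma2)

v = rng.normal(size=D)
print(log_emission(v, h=3))
```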

Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

[Figure: generative model with latent variables h1, h2 and visible variables v1, v2, v3, v4]

It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation.

Note that this is a Graphical Model, not a Function.

The latent variables h can be sampled from p(h), and then an image sampled from p(v|h).
One cannot use an autoencoder to generate new images.
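A sketch of this ancestral sampling for a toy linear-Gaussian generative model (the prior, decoder and dimensions are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
H, D = 2, 16 * 16          # latent dimension and image size (assumptions)

# Toy decoder parameters defining p(v | h) = N(v; W h + b, sigma^2 I)
W = rng.normal(0, 1.0, (D, H))
b = np.zeros(D)
sigma = 0.1

def sample_image():
    h = rng.normal(size=H)                        # h ~ p(h) = N(0, I)
    v = W @ h + b + sigma * rng.normal(size=D)    # v ~ p(v | h)
    return v.reshape(16, 16)

img = sample_image()
print(img.shape)
```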

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in these models.

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method - much faster for inference.

Variational Inference
Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and that we wish to learn θ to maximise the probability this model generates observed data:

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) + const.

Idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h, θ) using a deep network.

Very popular approach - see 'variational autoencoder' and also attention mechanisms.

Extension to semi-supervised methods using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h)
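A compact sketch of a single-sample Monte Carlo estimate of the bound above, with a Gaussian q(h|v, φ) and a Bernoulli p(v|h, θ), in the spirit of the variational autoencoder (all architectures, dimensions and parameter names are assumptions; no training loop is shown):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 784, 20, 200            # image, latent and hidden sizes (assumptions)

# Toy encoder q(h|v, phi) = N(mu_phi(v), diag(sig_phi(v)^2)) and decoder p(v|h, theta)
We, Wm, Ws = rng.normal(0, 0.01, (K, D)), rng.normal(0, 0.01, (H, K)), rng.normal(0, 0.01, (H, K))
Wd = rng.normal(0, 0.01, (D, H))

def elbo_estimate(v):
    e = np.tanh(We @ v)
    mu, log_sig = Wm @ e, Ws @ e                     # q parameters phi(v)
    h = mu + np.exp(log_sig) * rng.normal(size=H)    # reparameterised sample h ~ q
    logits = Wd @ h
    p = 1.0 / (1.0 + np.exp(-logits))                # Bernoulli p(v|h, theta)
    log_lik = np.sum(v * np.log(p + 1e-9) + (1 - v) * np.log(1 - p + 1e-9))
    # KL(q || p(h)) for Gaussian q and standard normal prior, in closed form
    kl = 0.5 * np.sum(np.exp(2 * log_sig) + mu**2 - 1.0 - 2 * log_sig)
    return log_lik - kl                              # single-sample bound on log p(v|theta)

v = (rng.random(D) > 0.5).astype(float)              # a fake binary 'image'
print(elbo_estimate(v))
```

In practice this estimate would be maximised jointly over φ and θ with stochastic gradients, which is exactly the joint maximisation of the bound described above.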

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long term goals.

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deep generative model).

Then learn which action to take given the low dimensional representation, as sketched below.
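As a generic sketch of that second step, here is textbook Q-learning with a linear value function on a low dimensional feature vector; this is not the specific method used for Atari, and the environment, features and hyperparameters are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
F, A = 8, 4                      # feature dimension and number of actions (assumptions)
W = np.zeros((A, F))             # linear Q(s, a) = W[a] . phi(s)
gamma, alpha, eps = 0.99, 0.01, 0.1

def phi(screen):
    """Stand-in for the learned low dimensional representation of the screen."""
    return rng.normal(size=F)    # placeholder: a real system would use the generative model

def step(state, action):
    """Placeholder environment: returns (next_screen, reward, done)."""
    return None, rng.normal(), rng.random() < 0.05

for episode in range(10):
    s = phi(None)
    done = False
    while not done:
        a = rng.integers(A) if rng.random() < eps else int(np.argmax(W @ s))
        screen2, r, done = step(s, a)
        s2 = phi(screen2)
        target = r if done else r + gamma * np.max(W @ s2)   # TD target
        W[a] += alpha * (target - W[a] @ s) * s              # Q-learning update
        s = s2
```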

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing.

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning.

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io


Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 17: David Barber - Deep Nets, Bayes and the story of AI

Connectionism

1960 Realised a perceptron can only solve simple tasks

1970 Decline in interest

1980 New computing power made training multilayer networks feasible

outputinputs

Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(

sumi wijhi)

Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)

NNs are difficult to train (many local optima)

Particularly difficult to train a NN with a large number of layers (say largerthan around 10)

lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo

Machine Learning (up to 2006)

A large section of the machine learning community abandoned NNs

More principled and computationally better understood techniques (SVMsand related convex methods) replaced them

Bayesian AI (1990s onwards)

From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes

Uncertainty and reasoning are not naturally representable using standardfeed-forward nets

Explosion in more lsquosymbolicrsquo Bayesian AI

Deep Learning

NNs have resurged in interest in the last few years (Hinton Bengio )

Also called lsquodeep learningrsquo

Sense that very complex tasks (object recognition, learning complex structure in data) require going beyond simple (convex) statistical techniques

The brain uses hierarchical, distributed processing, and it is likely to be so for a good reason

Many problems have a hierarchical structure: images are made of parts, language is hierarchical, etc.

Why now

New computing resources (GPU processing)

Availability of large amounts of data means that we can train nets with many parameters (around 10^10)

Recent evidence suggests local optima are not particularly problematic

Autoencoder

[Figure: an autoencoder with visible units y1 … y5, hidden layers with a low dimensional bottleneck, and reconstructed outputs y1 … y5.]

The bottleneck forces the network to try to find a low dimensional representation of the data

Useful for unsupervised learning

Autoencoder on MNIST digits (Hinton 2006 Science)

Figure: Reconstructions using H = 30 components. From the top: original image, Autoencoder 1, Autoencoder 2, PCA.

60,000 training images (28 × 28 = 784 pixels)

Use a form of autoencoder to find a lower (30) dimensional representation

At the time, the special layerwise training procedure was considered fundamental to the success of this approach. It is now not deemed necessary, provided we use a sensible initialisation.

Google Cats

10 million YouTube video frames (200 × 200 pixel images)

Use a specialised autoencoder with 9 layers (1 billion weights)

2000 computers + two weeks of computing

Examine units to see what images they most respond to

Google Autoencoder

From Nando De Freitas

Convolutional NNs

CNNs are particularly popular in image processing

Often the feature maps correspond not to macro features (such as bicycles) but to micro features

For example, in handwritten digit recognition they correspond to small constituent parts of the digits

These are then used to process the image into a representation that is better for recognition

NNs in NLP

Bag of Words

We have D words in a dictionary {aardvark, …, zorro}, so that we can relate each word to its dictionary index

We can also think of this as a Euclidean embedding e

aardvark → e_aardvark = (1, 0, …, 0)ᵀ,  zorro → e_zorro = (0, 0, …, 1)ᵀ

Word Embeddings

Idea is to replace the Euclidean embeddings e with embeddings (vectors) v that are learned

Objective is for example next word prediction accuracy

These are often called lsquoneural language modelsrsquo

NNs in NLP

Each word w in the dictionary has an associated embedding vector v_w. Usually around 200-dimensional vectors are used.

Consider the sentence

the cat sat on the mat

and that we wish to predict the word 'on' given the two preceding words 'cat sat'

and the two succeeding words 'the mat'

We can use a network that has inputs v_cat, v_sat, v_the, v_mat

The output of the network is a probability over all words in the dictionary, p(w | v_inputs). We want p(w = on | v_cat, v_sat, v_the, v_mat) to be high.

The overall objective is then to learn all the word embeddings and network parameters, subject to predicting the word correctly based on the context

Word Embeddings

Given a word (France, for example) we can find which words w have embedding vectors closest to v_France. From Ronan Collobert (2011).

Word Embeddings

There appears to be a natural 'geometry' to the embeddings. For example, there are directions that correspond to gender:

v_woman − v_man ≈ v_aunt − v_uncle

v_woman − v_man ≈ v_queen − v_king

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France–Paris we get the 'relationship' embedding

v = v_Paris − v_France

Given Italy we can calculate v_Italy + v and find the word in the dictionary which has the closest embedding to this (it turns out to be Rome). From Mikolov (2013).

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for Chinese words.

However, when we know that a Chinese and an English word have a similar meaning, we add a constraint that the word embeddings v_ChineseWord and v_EnglishWord should be close.

We have only a small amount of labelled 'similar' Chinese–English words (these are the green-border boxes in the figure; they are standard translations of the corresponding Chinese character).

We can visualise the embedding vectors in 2D (using t-SNE). See Socher (2013).

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank: consists of parsed sentences with sentiment labels (−−, −, 0, +, ++) for each node (phrase) in the tree; 215,000 labelled phrases (obtained from three humans).

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predict the sentiment at each node

Recursive Nets and Embeddings: Training

We have a softmax classifier for each node in the tree to predict the sentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree, the inputs to the classifiers are the word embeddings

The embeddings are combined by another network g with common parameters, which forms the input to the sentiment classifier

We then learn all the embeddings, shared classifier parameters and shared combination parameters to maximise the classification accuracy

Prediction

For a new movie review, the review is first parsed using a standard grammar tree parser

This forms the tree, which can be used to recursively form the sentiment class label for the review

Currently the best sentiment classifier: Socher (2013)

Recursive Nets and Embeddings

[Figure (from Socher 2013): RNTN predictions of positive and negative (bottom right) sentences and their negation, shown as sentiment-labelled parse trees for examples such as 'Roger Dodger is one of the most compelling variations on this theme' and 'Roger Dodger is one of the least compelling variations on this theme'.]

Recurrent Nets

[Figure: a recurrent network unrolled through time, with inputs x1, x2, x3, hidden states h1, h2, h3 and outputs y1, y2, y3; the weight matrices A, B, C are shared across time steps.]

RNNs are used in timeseries applications

The basic idea is that the hidden units h_t at time t (and possibly the output y_t) depend on the previous state of the network h_{t−1}, x_{t−1}, y_{t−1}, for inputs x_t and outputs y_t

In the above network I 'unrolled the net through time' to give a standard NN diagram

I omitted the potential links from x_{t−1}, y_{t−1} to h_t

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples. The top line is real handwriting for comparison. See Alex Graves's work.

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machine accuracy) for the gradient

g_i(x) ≡ ∂f/∂x_i evaluated at x

Note that this is not the same as a numerical approximation (such as central differences) for the gradient

One can show that, if done efficiently, one can always calculate the gradient in less than 5 times the time it takes to compute f(x)

Reverse Differentiation

A useful graphical representation is that the total derivative of f with respect to x is given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the functions on the edges:

df/dx = ∂f/∂x + (∂f/∂g)(dg/dx)

[Figure: nodes x, g, f with edges labelled ∂f/∂x, dg/dx and ∂f/∂g.]

Example

For f(x) = x² + xgh, where g = x² and h = xg²:

[Figure: the graph on x, g, h and f, with edges labelled ∂f/∂x = 2x + gh, dg/dx = 2x, ∂f/∂g = xh, ∂h/∂g = 2gx, ∂f/∂h = xg and ∂h/∂x = g².]

f′(x) = (2x + gh) + (g² · xg) + (2x · 2gx · xg) + (2x · xh) = 2x + 8x⁷

Reverse Differentiation

Consider

f(x₁, x₂) = cos(sin(x₁ x₂))

We can represent this computationally using an Abstract Syntax Tree (AST)

[Figure: the AST with leaves x₁, x₂ feeding f₁, then f₂, then f₃.]

f₁(x₁, x₂) = x₁ x₂,  f₂(x) = sin(x),  f₃(x) = cos(x)

Given values for x₁, x₂ we first run forwards through the tree, so that we can associate each node with an actual function value

Reverse Differentiation

df₃/dx₁ = (∂f₃/∂f₂)(df₂/dx₁) = (∂f₃/∂f₂)(df₂/df₁) · (df₁/dx₁), where (∂f₃/∂f₂)(df₂/df₁) = df₃/df₁

Similarly,

df₃/dx₂ = (∂f₃/∂f₂)(df₂/df₁) · (df₁/dx₂) = (df₃/df₁)(df₁/dx₂)

The two derivatives share the same computation branch, and we want to exploit this

Reverse Differentiation

For this example the local derivatives are

∂f₁/∂x₁ = x₂,  ∂f₁/∂x₂ = x₁,  ∂f₂/∂f₁ = cos(f₁),  ∂f₃/∂f₂ = −sin(f₂)

1. Find the reverse ancestral (backwards) schedule of nodes (f₃, f₂, f₁, x₁, x₂)

2. Start with the first node n₁ in the reverse schedule and define t_{n₁} = 1

3. For the next node n in the reverse schedule, find the child nodes ch(n). Then define

t_n = Σ_{c ∈ ch(n)} (∂f_c/∂f_n) t_c

4. The total derivatives of f with respect to the root nodes of the tree (here x₁ and x₂) are given by the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutine to efficiently compute the gradient. It is efficient because information is collected at nodes in the tree and split between parents only when required.

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another, and requires complex reasoning using some form of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos – position in the kitchen; snd – sound

Finding the Burglar

[Figure: kitchen grids showing the observed 'creak' and 'bump' signals at each timestep, from which the burglar's position is inferred.]

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int – intended key; hit – key actually hit

Stubby Fingers: errors

[Figure: a matrix over the letters a–z giving the probability of the hit key given the intended key, with values roughly between 0.05 and 0.55.]

Stubby Fingers: language

[Figure: a matrix over the letters a–z giving the letter-transition probabilities of the language model, with values roughly between 0 and 0.9.]

Stubby Fingers

Given the typed sequence 'cwsykcak', what is the most likely word that this corresponds to? (A decoding sketch follows the steps below.)

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition: raw signal

[Figure: a raw audio waveform, amplitude roughly −0.2 to 0.3, over about 0.9 seconds.]

'neural' representation

[Figure: the corresponding time–frequency ('neural') representation, roughly 25 channels over 80 frames.]

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho – phoneme (letter); aud – audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems, such as the Ising Model (1920), and in AI applications, such as the HMM (Baum 1966; Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.)

Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented

Without introducing strong structural limitations on how these objects can interact, probability is a non-starter

For this reason, computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however, these are typically frowned upon by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact with each other

Graphical Models are then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph

The computational complexity of operations can often be related to the structure of the graph

Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable

Uses in Industry

Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis

Google, Microsoft, Facebook: used in many places, including advertising, video game prediction and speech recognition

Used to estimate the inherent desirability of products in consumer retail

Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company–user relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or, more briefly, the probability of x given y) is defined as

p(x|y) ≡ p(x, y) / p(y) = p(y|x) p(x) / p(y)   (Bayes' rule)

Throwing darts

p(region 5 | not region 20) = p(region 5, not region 20) / p(not region 20) = p(region 5) / p(not region 20) = (1/20) / (19/20) = 1/19

Interpretation

p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation is 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.

Battleships

Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each

They can be placed anywhere on the 10×10 grid but cannot overlap

Let s1 be the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1, s2|D) = p(D|s1, s2) p(s1, s2) / p(D)

Let X be the matrix of pixel occupancy

p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2) p(s1, s2|D)

demoBattleships.m

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditional probabilities

p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|C) p(E|B, C)

[Figure: the DAG with A and B the parents of C; C the parent of D; and B, C the parents of E.]

Example – Part I

Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.

Choosing an ordering

Without loss of generality we can write

p(A, R, E, B) = p(A|R, E, B) p(R, E, B)
= p(A|R, E, B) p(R|E, B) p(E, B)
= p(A|R, E, B) p(R|E, B) p(E|B) p(B)

Assumptions

The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B)

The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E)

Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E)

Therefore

p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)

Example – Part II: Specifying the Tables

[Figure: the DAG B → A ← E → R.]

p(A = 1|B, E):
  B = 1, E = 1: 0.9999
  B = 1, E = 0: 0.99
  B = 0, E = 1: 0.99
  B = 0, E = 0: 0.0001

p(R = 1|E):
  E = 1: 1
  E = 0: 0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.

Example – Part III: Inference

Initial Evidence: the alarm is sounding

p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
= Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E) ≈ 0.99

Additional Evidence: the radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.

Markov Models

For timeseries data v1, …, vT we need a model p(v_{1:T}). For causal consistency it is meaningful to consider the decomposition

p(v_{1:T}) = ∏_{t=1}^{T} p(v_t | v_{1:t−1})

with the convention p(v_t | v_{1:t−1}) = p(v_1) for t = 1

[Figure: the belief network on v1, v2, v3, v4 implied by this decomposition.]

Independence assumptions

It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(v_t | v_1, …, v_{t−1}) = p(v_t | v_{t−L}, …, v_{t−1})

where L ≥ 1 is the order of the Markov chain. For L = 1,

p(v_{1:T}) = p(v_1) p(v_2|v_1) p(v_3|v_2) ⋯ p(v_T|v_{T−1})

For a stationary Markov chain the transitions p(v_t = s′ | v_{t−1} = s) = f(s′, s) are time-independent ('homogeneous')

[Figure: (a) first order Markov chain; (b) second order Markov chain, drawn on v1, v2, v3, v4.]

Markov Chains

p(v_1, …, v_T) = p(v_1) ∏_{t=2}^{T} p(v_t|v_{t−1}),  with p(v_1) the initial distribution and p(v_t|v_{t−1}) the transition

State transition diagram

Nodes represent states of the variable v, and arcs non-zero elements of the transition p(v_t|v_{t−1})

[Figure: a state transition diagram on states 1–9.]

Most probable and shortest paths

[Figure: the same state transition diagram on states 1–9.]

The shortest (unweighted) path from state 1 to state 7 is 1−2−7

The most probable path from state 1 to state 7 is 1−8−9−7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1−2−7 the probability of exiting state 2 into state 7 is 1/5.

Equilibrium distribution

It is interesting to know how the marginal p(x_t) evolves through time:

p(x_t = i) = Σ_j p(x_t = i | x_{t−1} = j) p(x_{t−1} = j),  where M_ij ≡ p(x_t = i | x_{t−1} = j)

p(x_t = i) is the frequency with which we visit state i at time t, given we started from p(x_1) and randomly drew samples from the transition p(x_τ|x_{τ−1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

p_t = M^{t−1} p_1

If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:

p_∞ = M p_∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.

PageRank

Define the matrix

A_ij = 1 if website j has a hyperlink to website i, and 0 otherwise

From this we can define a Markov transition matrix with elements

M_ij = A_ij / Σ_{i′} A_{i′j}

If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain the word is then returned, ranked according to the importance of the site.

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

p(h_{1:T}, v_{1:T}) = p(v_1|h_1) p(h_1) ∏_{t=2}^{T} p(v_t|h_t) p(h_t|h_{t−1})

For a stationary HMM the transition p(h_t|h_{t−1}) and emission p(v_t|h_t) distributions are constant through time

[Figure: a first order hidden Markov model with 'hidden' variables dom(h_t) = {1, …, H}, t = 1, …, T. The 'visible' variables v_t can be either discrete or continuous.]

The classical inference problems

Filtering (inferring the present): p(h_t|v_{1:t})
Prediction (inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): argmax_{h_{1:T}} p(h_{1:T}|v_{1:T})

For prediction, one is also often interested in p(v_t|v_{1:s}) for t > s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states)

The algorithms are variants of 'message passing on factor graphs'

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes)

HMMs for speech recognition

h_t is the phoneme at time t; p(h_t|h_{t−1}) – language model; p(v_t|h_t) – speech signal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speech recognition

The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t; θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation

Note that this is a Graphical Model, not a Function

The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.

The bad news

Inference (computing p(h|v)) and parameter learning are intractable in these models

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method – much faster for inference

Variational Inference

Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and that we wish to learn θ to maximise the probability this model generates observed data:

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) + const

Idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently

We then jointly maximise the bound w.r.t. φ and θ

We can parameterise p(v|h, θ) using a deep network

Very popular approach – see 'variational autoencoder' and also attention mechanisms

Extension to semi-supervised methods using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long term goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deep generative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company, reinfer

https://reinfer.io


Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)

NNs are difficult to train (many local optima)

Particularly difficult to train a NN with a large number of layers (say largerthan around 10)

lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo

Machine Learning (up to 2006)

A large section of the machine learning community abandoned NNs

More principled and computationally better understood techniques (SVMsand related convex methods) replaced them

Bayesian AI (1990s onwards)

From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes

Uncertainty and reasoning are not naturally representable using standardfeed-forward nets

Explosion in more lsquosymbolicrsquo Bayesian AI

Deep Learning

NNs have resurged in interest in the last few years (Hinton Bengio )

Also called lsquodeep learningrsquo

Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques

The brain uses hierarchical distributed processing and it is likely to be for agood reason

Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc

Why now

New computing resources (GPU processing)

Availability of large amount of data means that we can train nets with manyparameters (1010)

Recent evidence suggests local optima are not particularly problematic

Autoencoder

y1 y2 y3 y4 y5

h1 h2 h3

h4 h5

y1 y2 y3 y4 y5

h6 h7 h8

The bottleneck forces the network to try to find a low dimensionalrepresentation of the data

Useful for unsupervised learning

Autoencoder on MNIST digits (Hinton 2006 Science)

Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA

60000 training images (28times 28 = 784 pixels)

Use a form of autoencoder to find a lower (30) dimensional representation

At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation

Google Cats

10 Million Youtube video frames (200x200 pixel images)

Use a specialised autoencoder with 9 layers (1 billion weights)

2000 computers + two weeks of computing

Examine units to see what images they most respond to

Google Autoencoder

From Nando De Freitas

Convolutional NNs

CNNs are particularly popular in image processing

Often the feature maps correspond (not to macro features such as bicycles)but micro features

For example in handwritten digit recognition they correspond to smallconstituent parts of the digits

These are used then to process the image into a representation that is betterfor recognition

NNs in NLP

Bag of Words

We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index

We can also think of this as a Euclidean embedding e

aardvarkrarr eaardvark =

100

zorrorarr ezorro =

001

Word Embeddings

Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned

Objective is for example next word prediction accuracy

These are often called lsquoneural language modelsrsquo

NNs in NLP

Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used

Consider the sentence

the cat sat on the mat

and that we wish to predict the word on given the two preceding cat sat

and two succeeding words the mat

We can use a network that has inputs vcat vsat vthe vmat

The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high

The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context

Word Embeddings

Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)

Word Embeddings

There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender

vwoman minus vman asymp vaunt minus vuncle

vwoman minus vman asymp vqueen minus vking

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France-Paris we get the lsquorelationshiprsquo embedding

v = vParis minus vFrance

Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website, the equilibrium distribution component p∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain the word is then returned, ranked according to the importance of the site.
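Putting the last two slides together, a toy PageRank is only a few lines: column-normalise the link matrix A to get M, then find its equilibrium distribution. A sketch under simplifying assumptions (Python/NumPy, hypothetical 4-site link matrix; the real algorithm also adds a damping/teleportation term, omitted here):

```python
import numpy as np

# Hypothetical link matrix: A[i, j] = 1 if site j links to site i
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0, keepdims=True)   # M[i, j] = A[i, j] / sum_i' A[i', j]

p = np.full(4, 0.25)                   # start from the uniform distribution
for _ in range(1000):
    p = M @ p                          # power iteration towards p_inf
print(p)                               # 'importance' p_inf(i) of each site
```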

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h1:T. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(vt|ht). This defines a joint distribution

p(h1:T, v1:T) = p(v1|h1) p(h1) ∏_{t=2}^{T} p(vt|ht) p(ht|ht−1)

For a stationary HMM the transition p(ht|ht−1) and emission p(vt|ht) distributions are constant through time.

Figure: A first order hidden Markov model with 'hidden' variables dom(ht) = {1, . . . , H}, t = 1:T. The 'visible' variables vt can be either discrete or continuous.

The classical inference problems

Filtering (inferring the present): p(ht|v1:t)
Prediction (inferring the future): p(ht|v1:s), t > s
Smoothing (inferring the past): p(ht|v1:u), t < u
Likelihood: p(v1:T)
Most likely path (Viterbi alignment): argmax_{h1:T} p(h1:T|v1:T)

For prediction one is also often interested in p(vt|v1:s) for t > s.
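Filtering, for example, is a single forward sweep using the recursion α_t(h) ∝ p(vt|ht = h) Σ_{h′} p(ht = h|ht−1 = h′) α_{t−1}(h′). A minimal sketch for a discrete HMM (Python/NumPy, hypothetical tables; an illustration of the recursion rather than code from the talk):

```python
import numpy as np

def filtering(v, p_init, trans, emis):
    """Return alpha[t, h] = p(h_t = h | v_{1:t}) for a discrete HMM.

    p_init[h]    = p(h_1 = h)
    trans[h, g]  = p(h_t = h | h_{t-1} = g)
    emis[x, h]   = p(v_t = x | h_t = h)
    """
    T, H = len(v), len(p_init)
    alpha = np.zeros((T, H))
    a = emis[v[0]] * p_init
    alpha[0] = a / a.sum()
    for t in range(1, T):
        a = emis[v[t]] * (trans @ alpha[t - 1])
        alpha[t] = a / a.sum()          # normalise at each step for numerical stability
    return alpha

# Hypothetical 2-state, 2-symbol example
p_init = np.array([0.5, 0.5])
trans = np.array([[0.8, 0.3],
                  [0.2, 0.7]])
emis = np.array([[0.9, 0.2],
                 [0.1, 0.8]])
print(filtering([0, 0, 1, 0], p_init, trans, emis))
```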

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).

The algorithms are variants of 'message passing on factor graphs'.

Algorithm guaranteed to work if the graph is singly connected.

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).

HMMs for speech recognition

ht is the phoneme at time t; p(ht|ht−1) – language model; p(vt|ht) – speech signal model.

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speech recognition.

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is some function of the phoneme, µ(ht, θ).

This function is a deep neural network trained on a large amount of data.

Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructed on the basis of a low dimensional representation.

Note that this is a Graphical Model, not a Function.

The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in these models.

Statisticians typically use sampling as an approximation.

Very popular in ML to use a variational method – much faster for inference.

Variational Inference
Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and that we wish to learn θ to maximise the probability this model generates observed data.

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) p(h)

Idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound w.r.t. φ and θ.

We can parameterise p(v|h, θ) using a deep network.

Very popular approach – see 'variational autoencoder' and also attention mechanisms.

Extension to semi-supervised method using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h).
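To make the bound concrete, here is a minimal sketch (Python/NumPy) that Monte Carlo estimates the right-hand side for an assumed toy model — h ~ N(0,1), v|h ~ N(θh, 1), with a Gaussian q(h|v, φ) — and compares it with the exact log p(v|θ). The model and the q parameters are illustrative assumptions, not those used in the talk; in a variational autoencoder both would be parameterised by deep networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def elbo(v, theta, mu, sigma, n=100_000):
    # Monte Carlo estimate of  E_q[ log p(v|h,theta) + log p(h) - log q(h|v,phi) ]
    h = mu + sigma * rng.standard_normal(n)
    return np.mean(log_normal(v, theta * h, 1.0)      # log p(v|h, theta)
                   + log_normal(h, 0.0, 1.0)          # log p(h)
                   - log_normal(h, mu, sigma ** 2))   # log q(h|v, phi)

v, theta = 1.3, 2.0
print(elbo(v, theta, mu=0.4, sigma=0.5))              # a lower bound on ...
print(log_normal(v, 0.0, 1.0 + theta ** 2))           # ... the exact log p(v|theta)
```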

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals.

The problem is that the number of pixel states is enormous.

Need to learn a low dimensional representation of the screen (use a deep generative model).

Then learn which action to take given the low dimensional representation.
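The 'which action to take' part is typically learned with updates such as Q-learning. A minimal tabular sketch (Python; the environment, states and rewards are hypothetical placeholders, and this is a generic update rather than the specific deep RL method used for Atari, where a neural network replaces the table):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 4          # states = indices of a learned low-dim representation
Q = np.zeros((n_states, n_actions))  # Q[s, a] ~ long-term value of action a in state s
alpha, gamma, eps = 0.1, 0.99, 0.1

def step(s, a):
    """Hypothetical environment: returns (next_state, reward). Placeholder only."""
    return int(rng.integers(n_states)), float(rng.random() < 0.1)

s = 0
for _ in range(10_000):
    # epsilon-greedy action selection
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    # Q-learning update: move Q[s, a] towards r + gamma * max_a' Q[s_next, a']
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```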

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing.

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning.

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company, reinfer:

https://reinfer.io

Autoencoder

y1 y2 y3 y4 y5

h1 h2 h3

h4 h5

y1 y2 y3 y4 y5

h6 h7 h8

The bottleneck forces the network to try to find a low dimensionalrepresentation of the data

Useful for unsupervised learning

Autoencoder on MNIST digits (Hinton 2006 Science)

Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA

60000 training images (28times 28 = 784 pixels)

Use a form of autoencoder to find a lower (30) dimensional representation

At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation

Google Cats

10 Million Youtube video frames (200x200 pixel images)

Use a specialised autoencoder with 9 layers (1 billion weights)

2000 computers + two weeks of computing

Examine units to see what images they most respond to

Google Autoencoder

From Nando De Freitas

Convolutional NNs

CNNs are particularly popular in image processing

Often the feature maps correspond (not to macro features such as bicycles)but micro features

For example in handwritten digit recognition they correspond to smallconstituent parts of the digits

These are used then to process the image into a representation that is betterfor recognition

NNs in NLP

Bag of Words

We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index

We can also think of this as a Euclidean embedding e

aardvarkrarr eaardvark =

100

zorrorarr ezorro =

001

Word Embeddings

Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned

Objective is for example next word prediction accuracy

These are often called lsquoneural language modelsrsquo

NNs in NLP

Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used

Consider the sentence

the cat sat on the mat

and that we wish to predict the word on given the two preceding cat sat

and two succeeding words the mat

We can use a network that has inputs vcat vsat vthe vmat

The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high

The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context

Word Embeddings

Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)

Word Embeddings

There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender

vwoman minus vman asymp vaunt minus vuncle

vwoman minus vman asymp vqueen minus vking

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France-Paris we get the lsquorelationshiprsquo embedding

v = vParis minus vFrance

Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long term goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deep generative model)

Then learn which action to take given the low dimensional representation (a minimal sketch follows below)
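The slides do not name a specific algorithm; one standard choice for learning which action to take on the low dimensional state is Q-learning, sketched below with a table in place of a deep network and with hypothetical sizes.

import numpy as np

n_states, n_actions = 50, 4                        # hypothetical low-dimensional state and action counts
alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def choose_action(s):
    # epsilon-greedy: mostly exploit the current value estimates, occasionally explore
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(Q[s].argmax())

def update(s, a, r, s_next):
    # move Q(s, a) towards the reward plus the discounted value of the best next action
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])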

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company, reinfer

https://reinfer.io

Page 23: David Barber - Deep Nets, Bayes and the story of AI

Google Autoencoder

From Nando De Freitas

Convolutional NNs

CNNs are particularly popular in image processing

Often the feature maps correspond not to macro features (such as bicycles) but to micro features.

For example, in handwritten digit recognition they correspond to small constituent parts of the digits.

These are then used to process the image into a representation that is better for recognition.

NNs in NLP

Bag of Words

We have D words in a dictionary {aardvark, ..., zorro}, so that we can relate each word with its dictionary index

We can also think of this as a Euclidean embedding e

aardvark → e_aardvark = (1, 0, ..., 0)^T

zorro → e_zorro = (0, 0, ..., 1)^T

Word Embeddings

Idea is to replace the Euclidean embeddings e with embeddings (vectors) v that are learned

The objective is, for example, next word prediction accuracy

These are often called 'neural language models'

NNs in NLP

Each word w in the dictionary has an associated embedding vector v_w. Usually around 200 dimensional vectors are used.

Consider the sentence

the cat sat on the mat

and that we wish to predict the word 'on' given the two preceding words 'cat sat'

and the two succeeding words 'the mat'.

We can use a network that has inputs v_cat, v_sat, v_the, v_mat.

The output of the network is a probability over all words in the dictionary, p(w | v_inputs). We want p(w = on | v_cat, v_sat, v_the, v_mat) to be high.

The overall objective is then to learn all the word embeddings and network parameters, subject to predicting the word correctly based on the context (a minimal sketch follows below).
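A minimal sketch of this context-prediction objective: sum the context embeddings, score every dictionary word and apply a softmax. The dictionary size, the additive combination and the separate output weights are illustrative assumptions, not details from the slides.

import numpy as np

D_words, d_emb = 10000, 200
rng = np.random.default_rng(0)
V = rng.normal(scale=0.01, size=(D_words, d_emb))   # learned word embeddings v_w
U = rng.normal(scale=0.01, size=(D_words, d_emb))   # output scoring weights

def p_word_given_context(context_ids):
    c = V[context_ids].sum(axis=0)                  # combine v_cat, v_sat, v_the, v_mat
    scores = U @ c
    scores -= scores.max()                          # for numerical stability
    p = np.exp(scores)
    return p / p.sum()                              # p(w | v_inputs) over the whole dictionary

# training maximises log p(w = on | context), i.e. minimises the cross-entropy
p = p_word_given_context([17, 42, 7, 99])           # hypothetical dictionary indices for the context words
loss = -np.log(p[3])                                # 3 standing in for the index of 'on'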

Word Embeddings

Given a word (France, for example) we can find which words w have embedding vectors closest to v_France. From Ronan Collobert (2011)

Word Embeddings

There appears to be a natural 'geometry' to the embeddings. For example, there are directions that correspond to gender:

v_woman − v_man ≈ v_aunt − v_uncle

v_woman − v_man ≈ v_queen − v_king

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France-Paris, we get the 'relationship' embedding

v = v_Paris − v_France

Given Italy, we can calculate v_Italy + v and find the word in the dictionary which has the closest embedding to this (it turns out to be Rome). From Mikolov (2013). A small nearest-neighbour lookup sketch follows below.
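A sketch of that lookup, assuming an embedding matrix V (one row per word) and a word list aligned with its rows; both are hypothetical stand-ins for learned embeddings.

import numpy as np

def closest_word(query, V, words, exclude=()):
    # cosine similarity between the query vector and every embedding row
    sims = V @ query / (np.linalg.norm(V, axis=1) * np.linalg.norm(query) + 1e-12)
    for w in exclude:
        sims[words.index(w)] = -np.inf              # don't just return the query words themselves
    return words[int(np.argmax(sims))]

# v = V[words.index('Paris')] - V[words.index('France')]
# closest_word(V[words.index('Italy')] + v, V, words, exclude=('Italy', 'Paris', 'France'))  # 'Rome'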

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for Chinese words.

However, when we know that a Chinese and an English word have a similar meaning, we add a constraint that the word embeddings v_ChineseWord and v_EnglishWord should be close.

We have only a small amount of labelled 'similar' Chinese-English words (these are the green border boxes in the above; they are standard translations of the corresponding Chinese character).

We can visualise the embedding vectors in 2D (using t-SNE). See Socher (2013).

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank: consists of parsed sentences with sentiment labels (−−, −, 0, +, ++) for each node (phrase) in the tree; 215,000 labelled phrases (obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predict the sentiment at each node

Recursive Nets and Embeddings: Training

We have a softmax classifier for each node in the tree to predict the sentiment of the phrase beneath this node in the tree.

The weights of this classifier are shared across all nodes.

At the leaf nodes at the bottom of the tree, the inputs to the classifiers are the word embeddings.

The embeddings are combined by another network g, with common parameters, which forms the input to the sentiment classifier (a small sketch follows below).

We then learn all the embeddings, shared classifier parameters and shared combination parameters to maximise the classification accuracy.

Prediction

For a new movie review, the review is first parsed using a standard grammar tree parser.

This forms the tree, which can be used to recursively form the sentiment class label for the review.

Currently the best sentiment classifier Socher (2013)
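A rough sketch of the shared combination network g and the shared softmax classifier described above. The embedding size and the tanh combination are illustrative assumptions; the architecture in Socher (2013) is more elaborate (a tensor-based combination).

import numpy as np

d, n_classes = 30, 5                                # illustrative embedding size; 5 sentiment classes
rng = np.random.default_rng(0)
W_g = rng.normal(scale=0.1, size=(d, 2 * d))        # shared combination parameters
W_s = rng.normal(scale=0.1, size=(n_classes, d))    # shared softmax classifier

def combine(left, right):
    # network g: map the two child embeddings to the parent phrase embedding
    return np.tanh(W_g @ np.concatenate([left, right]))

def sentiment(node_embedding):
    s = W_s @ node_embedding
    s -= s.max()
    p = np.exp(s)
    return p / p.sum()                              # distribution over --, -, 0, +, ++

# leaf embeddings are the word vectors; internal nodes use combine() bottom-up,
# and every node's embedding is fed to sentiment() during training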

Recursive Nets and Embeddings

Figure: predictions of positive and negative sentiment for example sentences and their negations (bottom right). From Socher (2013).

Recurrent Nets

[Figure: an RNN unrolled through time, with inputs x1, x2, x3, hidden units h1, h2, h3, outputs y1, y2, y3, and shared weight matrices A, B, C.]

RNNs are used in timeseries applications.

The basic idea is that the hidden units h_t at time t (and possibly the output y_t) depend on the previous state of the network h_{t−1}, x_{t−1}, y_{t−1}, for inputs x_t and outputs y_t.

In the above network I 'unrolled the net through time' to give a standard NN diagram.

I omitted the potential links from x_{t−1}, y_{t−1} to h_t.
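
A minimal sketch of such a recurrent update, with assumed sizes and random parameters; the tanh hidden units and linear readout are common choices, not necessarily those used in the handwriting model below.

```python
import numpy as np

Dx, Dh, Dy = 3, 5, 2                          # assumed input, hidden and output sizes
rng = np.random.default_rng(2)
A = 0.1 * rng.standard_normal((Dh, Dx))       # input -> hidden
B = 0.1 * rng.standard_normal((Dh, Dh))       # hidden -> hidden (carries the past forward)
C = 0.1 * rng.standard_normal((Dy, Dh))       # hidden -> output

def rnn_step(h_prev, x_t):
    """h_t depends on the current input and on the previous hidden state."""
    h_t = np.tanh(A @ x_t + B @ h_prev)
    y_t = C @ h_t
    return h_t, y_t

h = np.zeros(Dh)
xs = [rng.standard_normal(Dx) for _ in range(4)]   # a short input timeseries
for t, x in enumerate(xs, 1):
    h, y = rnn_step(h, x)
    print(f"t={t}  y_t={y.round(3)}")
```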

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples. The top line is real handwriting, for comparison. See Alex Graves's work.

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machine accuracy) for the gradient

g_i(x) ≡ ∂f/∂x_i evaluated at x

Note that this is not the same as a numerical approximation (such as central differences) for the gradient.

One can show that, if done efficiently, one can always calculate the gradient in less than 5 times the time it takes to compute f(x).

Reverse Differentiation

A useful graphical representation is that the total derivative of f with respect to x is given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the functions on the edges:

df/dx = ∂f/∂x + (∂f/∂g)(dg/dx)

[Figure: nodes x, g, f with edges labelled dg/dx (x → g), ∂f/∂g (g → f) and ∂f/∂x (x → f).]

Example

For f(x) = x² + xgh, where g = x² and h = xg²:

[Figure: nodes x, g, h, f with edge labels ∂f/∂x = 2x + gh, dg/dx = 2x, ∂f/∂g = xh, ∂h/∂g = 2gx, ∂f/∂h = xg, ∂h/∂x = g².]

f′(x) = (2x + gh) + (g² · xg) + (2x · 2gx · xg) + (2x · xh) = 2x + 8x⁷
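
A quick numeric sanity check of this path-sum result; central differences (the numerical approximation mentioned above) are used here only to verify the analytic answer 2x + 8x⁷.

```python
def f(x):
    g = x ** 2
    h = x * g ** 2
    return x ** 2 + x * g * h          # = x^2 + x^8

def analytic_grad(x):
    return 2 * x + 8 * x ** 7          # the result of the path-sum rule above

x, eps = 1.3, 1e-5
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)   # central differences
print(analytic_grad(x), numeric)                  # the two values agree to high precision
```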

Reverse Differentiation

Consider

f(x1, x2) = cos(sin(x1 x2))

We can represent this computationally using an Abstract Syntax Tree (AST):

[Figure: AST with leaves x1, x2 feeding f1, then f1 feeding f2, then f2 feeding f3.]

f1(x1, x2) = x1 x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1, x2, we first run forwards through the tree so that we can associate each node with an actual function value.

Reverse Differentiation

[Figure: the same AST, with x1, x2 feeding f1, f1 feeding f2 and f2 feeding f3.]

df3/dx1 = (∂f3/∂f2)(df2/dx1) = (∂f3/∂f2)(df2/df1) · (df1/dx1),   where (∂f3/∂f2)(df2/df1) = df3/df1

Similarly,

df3/dx2 = (∂f3/∂f2)(df2/df1) · (df1/dx2) = (df3/df1)(df1/dx2)

The two derivatives share the same computation branch and we want to exploit this.

Reverse Differentiation

[Figure: the AST again, annotated with the local partial derivatives:]

∂f1/∂x1 = x2,   ∂f1/∂x2 = x1,   ∂f2/∂f1 = cos(f1),   ∂f3/∂f2 = − sin(f2)

1. Find the reverse ancestral (backwards) schedule of nodes (f3, f2, f1, x1, x2).

2. Start with the first node n1 in the reverse schedule and define t_n1 = 1.

3. For the next node n in the reverse schedule, find the child nodes ch(n). Then define

t_n = Σ_{c ∈ ch(n)} (∂f_c/∂f_n) t_c

4. The total derivatives of f with respect to the root nodes of the tree (here x1 and x2) are given by the values of t at those nodes.

This is a general procedure that can be used to automatically define a subroutine to efficiently compute the gradient. It is efficient because information is collected at nodes in the tree and split between parents only when required.
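
A minimal sketch of this reverse pass for f(x1, x2) = cos(sin(x1 x2)), written out by hand rather than with an AD library; the variable names mirror the AST above, and the gradient is checked against central differences.

```python
import math

def f_and_grad(x1, x2):
    # Forward pass: associate every node of the AST with a value.
    f1 = x1 * x2
    f2 = math.sin(f1)
    f3 = math.cos(f2)

    # Reverse pass over the schedule (f3, f2, f1, x1, x2): t_n = sum over children c of (df_c/df_n) * t_c.
    t_f3 = 1.0
    t_f2 = -math.sin(f2) * t_f3      # local derivative df3/df2 = -sin(f2)
    t_f1 = math.cos(f1) * t_f2       # df2/df1 = cos(f1)
    t_x1 = x2 * t_f1                 # df1/dx1 = x2
    t_x2 = x1 * t_f1                 # df1/dx2 = x1
    return f3, (t_x1, t_x2)

x1, x2 = 0.7, -1.2
val, grad = f_and_grad(x1, x2)

eps = 1e-6
num1 = (f_and_grad(x1 + eps, x2)[0] - f_and_grad(x1 - eps, x2)[0]) / (2 * eps)
num2 = (f_and_grad(x1, x2 + eps)[0] - f_and_grad(x1, x2 - eps)[0]) / (2 * eps)
print(grad, (num1, num2))            # the reverse-mode and numerical gradients agree
```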

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another, and requires complex reasoning using some form of internal model.

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos – position in the kitchen; snd – sound

Finding the Burglar

[Three successive slides: the kitchen grid over time with the observed sound sequence (creaks and bumps) at each time step; the grid figures themselves are not recoverable from this extraction.]

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int – intended key; hit – hit key

Stubby Fingers errors

[Figure: a 26 × 26 matrix over the letters a–z (intended key against hit key); colour scale roughly 0.05–0.55.]

Stubby Fingers language

[Figure: a 26 × 26 matrix over the letters a–z; colour scale 0–0.9.]

Stubby Fingers

Given the typed sequence cwsykcak, what is the most likely word that this corresponds to?

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition: raw signal

[Figure: raw audio waveform; amplitude roughly −0.2 to 0.3 over a time axis of 0–0.9 s.]

'neural' representation

[Figure: the corresponding 'neural' representation of the signal.]

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho – phoneme (letter); aud – audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems, such as the Ising Model (1920), and in AI applications, such as the HMM (Baum 1966; Stratonovich 1960).

The need for structure

We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.).

Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented.

Without introducing strong structural limitations about how these objects can interact, probability is a non-starter.

For this reason, computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however, these are typically frowned upon by purists.

Graphical Models

We can use graphs to represent how objects can probabilistically interact with each other.

Graphical Models are then a marriage between Graph and Probability theory.

Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph.

The computational complexity of operations can often be related to the structure of the graph.

Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science.

Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable.

Uses in Industry

Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model).

Hospitals: use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis.

Google, Microsoft, Facebook: used in many places, including advertising, video game prediction and speech recognition.

Used to estimate the inherent desirability of products in consumer retail.

Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company–user relationship.

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

p(x|y) ≡ p(x, y) / p(y) = p(y|x) p(x) / p(y)   (Bayes' rule)

Throwing darts:

p(region 5 | not region 20) = p(region 5, not region 20) / p(not region 20)
                            = p(region 5) / p(not region 20)
                            = (1/20) / (19/20)
                            = 1/19

Interpretation: p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation should be 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.

Battleships

Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each.

They can be placed anywhere on the 10×10 grid, but cannot overlap.

Let s1 be the origin of ship 1 and s2 the origin of ship 2.

The data D is a collection of query 'hit' or 'miss' responses.

p(s1, s2|D) = p(D|s1, s2) p(s1, s2) / p(D)

Let X be the matrix of pixel occupancy. Then

p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2) p(s1, s2|D)

demoBattleships.m (a rough sketch of the same computation is given below)
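
The following is a rough Python sketch of that posterior computation (an assumption-laden stand-in, not the demoBattleships.m demo itself): enumerate all non-overlapping placements (s1, s2), keep those consistent with the observed hit/miss data, and average the occupancy maps under a uniform prior. The example queries and the noise-free likelihood are assumptions.

```python
import numpy as np

G, L = 10, 5                       # grid size and ship length

def cells_v(r, c): return {(r + i, c) for i in range(L)}   # pixels of a vertical ship with origin (r, c)
def cells_h(r, c): return {(r, c + i) for i in range(L)}   # pixels of a horizontal ship with origin (r, c)

placements1 = [(r, c) for r in range(G - L + 1) for c in range(G)]   # ship 1 (vertical) origins
placements2 = [(r, c) for r in range(G) for c in range(G - L + 1)]   # ship 2 (horizontal) origins

# Observed data D: queried pixels and their hit/miss outcome (made-up example queries).
D = {(4, 4): True, (0, 0): False, (9, 9): False}

post = np.zeros((G, G))            # accumulates the sum over s1, s2 of p(X|s1, s2) p(s1, s2|D)
Z = 0.0
for s1 in placements1:
    for s2 in placements2:
        if cells_v(*s1) & cells_h(*s2):
            continue                                        # ships cannot overlap
        occ = cells_v(*s1) | cells_h(*s2)
        if not all((q in occ) == hit for q, hit in D.items()):
            continue                                        # inconsistent with the hit/miss data
        X = np.zeros((G, G))
        for (r, c) in occ:
            X[r, c] = 1.0
        post += X                                           # uniform prior over valid placements
        Z += 1.0

print((post / Z).round(2))          # p(pixel occupied | D)
```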

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents.

The joint distribution is obtained by taking the product of the conditional probabilities:

p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|C) p(E|B, C)

[Figure: the corresponding DAG – A and B are parents of C, C is a parent of D, and B and C are parents of E.]

Example – Part I: Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.

Choosing an ordering

Without loss of generality, we can write

p(A, R, E, B) = p(A|R, E, B) p(R, E, B)
             = p(A|R, E, B) p(R|E, B) p(E, B)
             = p(A|R, E, B) p(R|E, B) p(E|B) p(B)

Assumptions:

The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B).

The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E).

Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E).

Therefore

p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)

Example – Part II: Specifying the Tables

[Figure: the DAG with B → A, E → A and E → R.]

p(A = 1|B, E):

Burglar  Earthquake  Alarm = 1
1        1           0.9999
1        0           0.99
0        1           0.99
0        0           0.0001

p(R = 1|E):

Earthquake  Radio = 1
1           1
0           0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.

Example – Part III: Inference

Initial Evidence: the alarm is sounding.

p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
             = Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E) ≈ 0.99

Additional Evidence: the radio broadcasts an earthquake warning.

A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01.

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake. (A brute-force check of these numbers is sketched below.)
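
A short sketch that checks these posteriors by brute-force enumeration over the four binary variables, using the tables specified above.

```python
from itertools import product

pB = {1: 0.01, 0: 0.99}
pE = {1: 0.000001, 0: 0.999999}
pA1 = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}   # p(A=1 | B, E)
pR1 = {1: 1.0, 0: 0.0}                                               # p(R=1 | E)

def joint(b, e, a, r):
    """p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)."""
    pa = pA1[(b, e)] if a == 1 else 1.0 - pA1[(b, e)]
    pr = pR1[e] if r == 1 else 1.0 - pR1[e]
    return pB[b] * pE[e] * pa * pr

def posterior_burglar(evidence):
    """p(B=1 | evidence), where evidence fixes some of the variables A, R."""
    num = den = 0.0
    for b, e, a, r in product([0, 1], repeat=4):
        assign = {"B": b, "E": e, "A": a, "R": r}
        if any(assign[k] != v for k, v in evidence.items()):
            continue
        p = joint(b, e, a, r)
        den += p
        if b == 1:
            num += p
    return num / den

print(posterior_burglar({"A": 1}))            # approximately 0.99
print(posterior_burglar({"A": 1, "R": 1}))    # approximately 0.01
```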

Markov Models

For timeseries data v_1, . . . , v_T we need a model p(v_{1:T}). For causal consistency, it is meaningful to consider the decomposition

p(v_{1:T}) = Π_{t=1}^{T} p(v_t|v_{1:t−1})

with the convention p(v_t|v_{1:t−1}) = p(v_1) for t = 1.

v1 v2 v3 v4

Independence assumptions: it is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.

Markov Chain

Only the recent past is relevant:

p(v_t|v_1, . . . , v_{t−1}) = p(v_t|v_{t−L}, . . . , v_{t−1})

where L ≥ 1 is the order of the Markov chain. For a first order chain,

p(v_{1:T}) = p(v_1) p(v_2|v_1) p(v_3|v_2) · · · p(v_T|v_{T−1})

For a stationary Markov chain the transitions p(v_t = s′|v_{t−1} = s) = f(s′, s) are time-independent ('homogeneous').

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v_1, . . . , v_T) = p(v_1) Π_{t=2}^{T} p(v_t|v_{t−1}),   with p(v_1) the initial distribution and p(v_t|v_{t−1}) the transition

State transition diagram: nodes represent states of the variable v, and arcs the non-zero elements of the transition p(v_t|v_{t−1}).

[Figure: a state transition diagram on the states 1–9.]

Most probable and shortest paths

[Figure: the same state transition diagram on the states 1–9.]

The shortest (unweighted) path from state 1 to state 7 is 1 − 2 − 7.

The most probable path from state 1 to state 7 is 1 − 8 − 9 − 7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1 − 2 − 7 the probability of exiting state 2 into state 7 is only 1/5.

Equilibrium distribution

It is interesting to know how the marginal p(x_t) evolves through time:

p(x_t = i) = Σ_j p(x_t = i|x_{t−1} = j) p(x_{t−1} = j),   where M_ij ≡ p(x_t = i|x_{t−1} = j)

p(x_t = i) is the frequency that we visit state i at time t, given we started from p(x_1) and randomly drew samples from the transition p(x_τ|x_{τ−1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

p_t = M^{t−1} p_1

If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:

p_∞ = M p_∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.

PageRank

Define the matrix

A_ij = 1 if website j has a hyperlink to website i, and 0 otherwise.

From this we can define a Markov transition matrix with elements

M_ij = A_ij / Σ_{i′} A_{i′j}

If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i (a small power-iteration sketch is given below).

For each website i, a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain the word is then returned, ranked according to the importance of the site.
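
A minimal sketch of computing the equilibrium distribution by power iteration for a made-up 4-site link matrix; the link structure is an assumption purely for illustration.

```python
import numpy as np

# A[i, j] = 1 if site j links to site i (made-up example links).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0, keepdims=True)      # column-normalise: M[i, j] = A[i, j] / sum_i' A[i', j]

p = np.ones(4) / 4                         # start from a uniform distribution
for _ in range(200):                       # p_t = M^(t-1) p_1
    p = M @ p

print(p.round(3))                          # equilibrium distribution p_inf = M p_inf ('importance')
```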

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

p(h_{1:T}, v_{1:T}) = p(v_1|h_1) p(h_1) Π_{t=2}^{T} p(v_t|h_t) p(h_t|h_{t−1})

For a stationary HMM the transition p(h_t|h_{t−1}) and emission p(v_t|h_t) distributions are constant through time.

[Figure: a first order hidden Markov model with 'hidden' variables dom(h_t) = {1, . . . , H}, t = 1, . . . , T. The 'visible' variables v_t can be either discrete or continuous.]

The classical inference problems

Filtering (inferring the present): p(h_t|v_{1:t})
Prediction (inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): argmax_{h_{1:T}} p(h_{1:T}|v_{1:T})

For prediction, one is also often interested in p(v_t|v_{1:s}) for t > s.

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states). A toy filtering recursion is sketched below.

The algorithms are variants of 'message passing on factor graphs'.

The algorithm is guaranteed to work if the graph is singly-connected.

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).
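
A hedged sketch of the filtering recursion α_t(h) ∝ p(v_t|h) Σ_{h′} p(h|h′) α_{t−1}(h′) for a small discrete HMM; the transition and emission tables are random made-up examples.

```python
import numpy as np

H, V = 3, 2                                    # assumed numbers of hidden states and observation symbols
rng = np.random.default_rng(3)
p_h1 = np.ones(H) / H                          # p(h_1)
trans = rng.dirichlet(np.ones(H), size=H).T    # trans[i, j] = p(h_t = i | h_{t-1} = j)
emit = rng.dirichlet(np.ones(V), size=H).T     # emit[v, h]  = p(v_t = v | h_t = h)

def filtering(observations):
    """Return p(h_t | v_{1:t}) for each t."""
    alpha = emit[observations[0]] * p_h1
    alpha /= alpha.sum()
    out = [alpha]
    for v in observations[1:]:
        alpha = emit[v] * (trans @ alpha)      # predict with the transition, correct with the emission
        alpha /= alpha.sum()
        out.append(alpha)
    return out

for t, a in enumerate(filtering([0, 1, 1, 0]), 1):
    print(f"p(h_{t} | v_1:{t}) =", a.round(2))
```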

HMMs for speech recognition

h_t is the phoneme at time t; p(h_t|h_{t−1}) – language model; p(v_t|h_t) – speech signal model.

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently, companies including Google have made big advances in speech recognition.

The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, µ(h_t; θ).

This function is a deep neural network, trained on a large amount of data.

There is a goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation.

Note that this is a Graphical Model, not a Function.

The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.

The bad news

Inference (computing p(h|v)) and parameter learning are intractable in these models.

Statisticians typically use sampling as an approximation.

It is very popular in ML to use a variational method – much faster for inference.

Variational Inference

Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and suppose that we wish to learn θ to maximise the probability that this model generates the observed data. A variational lower bound is

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) + const.

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound with respect to φ and θ.

We can parameterise p(v|h, θ) using a deep network.

Very popular approach – see the 'variational autoencoder' and also attention mechanisms. (A toy Monte-Carlo estimate of such a bound is sketched below.)

Extension to a semi-supervised method using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h)
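
A toy sketch of estimating such a bound by Monte Carlo for a one-dimensional linear-Gaussian model; all densities and parameter values are assumptions chosen so that the exact log-likelihood is available for comparison.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, v = 2.0, 1.5                       # assumed model parameter and a single observed datapoint

def log_N(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# p(h) = N(0, 1) and p(v|h, theta) = N(theta*h, 1), so exactly log p(v|theta) = log N(v; 0, theta^2 + 1).
exact = log_N(v, 0.0, theta ** 2 + 1.0)

def elbo(q_mean, q_var, n_samples=100_000):
    """Monte-Carlo estimate of E_q[log p(v|h) + log p(h) - log q(h)] with q(h|v) = N(q_mean, q_var)."""
    h = q_mean + np.sqrt(q_var) * rng.standard_normal(n_samples)
    return np.mean(log_N(v, theta * h, 1.0) + log_N(h, 0.0, 1.0) - log_N(h, q_mean, q_var))

# The exact posterior p(h|v) is Gaussian with these moments; using it as q makes the bound tight.
post_var = 1.0 / (1.0 + theta ** 2)
post_mean = post_var * theta * v
print(exact, elbo(post_mean, post_var), elbo(0.0, 1.0))   # tight bound vs a looser choice of q
```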

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals.

The problem is that the number of pixel states is enormous.

We need to learn a low dimensional representation of the screen (using a deep generative model).

Then learn which action to take given the low dimensional representation.
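
As a hedged illustration of 'learning which action to take' once the state is already low dimensional and discrete, here is a minimal tabular Q-learning loop on a made-up chain environment; the deep RL systems referred to above replace the table with a neural network, so this is an analogy, not their method.

```python
import numpy as np

n_states, n_actions = 5, 2            # assumed tiny discrete state and action spaces
rng = np.random.default_rng(5)
Q = np.zeros((n_states, n_actions))   # estimated long-term value of each action in each state

def step(s, a):
    """Toy chain environment: action 1 moves right, action 0 moves left; reward 1 at the right end."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

gamma, alpha, eps = 0.9, 0.1, 0.1
s = 0
for _ in range(5000):
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
    s_next, r = step(s, a)
    # Q-learning update: move Q(s, a) towards the reward plus the discounted best value of the next state.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = 0 if s_next == n_states - 1 else s_next      # restart the episode once the goal is reached

print(Q.round(2))                     # the 'move right' action should dominate in every state
```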

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state-of-the-art results in Speech Recognition, Image Analysis and Game Playing.

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve the interaction between reinforcement learning and representation learning.

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io

Page 24: David Barber - Deep Nets, Bayes and the story of AI

Convolutional NNs

CNNs are particularly popular in image processing

Often the feature maps correspond (not to macro features such as bicycles)but micro features

For example in handwritten digit recognition they correspond to smallconstituent parts of the digits

These are used then to process the image into a representation that is betterfor recognition

NNs in NLP

Bag of Words

We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index

We can also think of this as a Euclidean embedding e

aardvarkrarr eaardvark =

100

zorrorarr ezorro =

001

Word Embeddings

Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned

Objective is for example next word prediction accuracy

These are often called lsquoneural language modelsrsquo

NNs in NLP

Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used

Consider the sentence

the cat sat on the mat

and that we wish to predict the word on given the two preceding cat sat

and two succeeding words the mat

We can use a network that has inputs vcat vsat vthe vmat

The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high

The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context

Word Embeddings

Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)

Word Embeddings

There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender

vwoman minus vman asymp vaunt minus vuncle

vwoman minus vman asymp vqueen minus vking

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France-Paris we get the lsquorelationshiprsquo embedding

v = vParis minus vFrance

Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 25: David Barber - Deep Nets, Bayes and the story of AI

NNs in NLP

Bag of Words

We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index

We can also think of this as a Euclidean embedding e

aardvarkrarr eaardvark =

100

zorrorarr ezorro =

001

Word Embeddings

Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned

Objective is for example next word prediction accuracy

These are often called lsquoneural language modelsrsquo

NNs in NLP

Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used

Consider the sentence

the cat sat on the mat

and that we wish to predict the word on given the two preceding cat sat

and two succeeding words the mat

We can use a network that has inputs vcat vsat vthe vmat

The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high

The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context

Word Embeddings

Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)

Word Embeddings

There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender

vwoman minus vman asymp vaunt minus vuncle

vwoman minus vman asymp vqueen minus vking

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France-Paris we get the lsquorelationshiprsquo embedding

v = vParis minus vFrance

Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition: raw signal

[Figure: raw speech waveform, amplitude against time.]

'neural' representation

[Figure: the corresponding 'neural' representation of the signal (a grid of feature values over time).]

Speech Recognition

[Diagram: a Markov chain pho1 → pho2 → pho3 → pho4, with each pho_t emitting an audio observation aud_t.]

pho – phoneme (letter); aud – audio signal (neural representation)

Medical Diagnosis

[Diagram: a belief network connecting diseases (tumour, flu, meningitis) to findings (headache, fever, appetite, x-ray).]

Combine known medical knowledge with patient-specific information.

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability?

Probability is a logical calculus of uncertainty.

Natural framework to use in models of physical systems, such as the Ising Model (1920), and in AI applications, such as the HMM (Baum 1966; Stratonovich 1960).

The need for structure

We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.).

Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented.

Without introducing strong structural limitations about how these objects can interact, probability is a non-starter.

For this reason computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties; however, these are typically frowned upon by purists.

Graphical Models

We can use graphs to represent how objects can probabilistically interact with each other.

Graphical Models are then a marriage between Graph and Probability theory.

Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph.

The computational complexity of operations can often be related to the structure of the graph.

Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science.

Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable.

Uses in Industry

Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model).

Hospitals: use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis.

Google, Microsoft, Facebook: used in many places, including advertising, video game prediction, speech recognition.

Used to estimate the inherent desirability of products in consumer retail.

Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company/user relationship.

Conditional Probability and Bayes' Rule

The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

$p(x|y) \equiv \frac{p(x, y)}{p(y)} = \frac{p(y|x)\,p(x)}{p(y)} \quad \text{(Bayes' rule)}$

Throwing darts

$p(\text{region 5}\,|\,\text{not region 20}) = \frac{p(\text{region 5}, \text{not region 20})}{p(\text{not region 20})} = \frac{p(\text{region 5})}{p(\text{not region 20})} = \frac{1/20}{19/20} = \frac{1}{19}$

Interpretation

p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation should be 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.

Battleships

Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each.

They can be placed anywhere on the 10×10 grid, but cannot overlap.

Let s1 be the origin of ship 1 and s2 the origin of ship 2.

The data D is a collection of query 'hit' or 'miss' responses.

$p(s_1, s_2|D) = \frac{p(D|s_1, s_2)\,p(s_1, s_2)}{p(D)}$

Let X be the matrix of pixel occupancy:

$p(X|D) = \sum_{s_1, s_2} p(X, s_1, s_2|D) = \sum_{s_1, s_2} p(X|s_1, s_2)\,p(s_1, s_2|D)$

demoBattleships.m
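demoBattleships.m is not reproduced here, but a small Python sketch of the same posterior computation, under stated assumptions (10×10 grid, one vertical and one horizontal 5-pixel ship, uniform prior over non-overlapping placements, noise-free hit/miss responses), looks as follows.

import numpy as np

G, L = 10, 5                                        # grid size, ship length

def placements():
    """All (mask1, mask2) pairs: ship 1 vertical, ship 2 horizontal, non-overlapping."""
    vert, horz = [], []
    for r in range(G):
        for c in range(G):
            if r + L <= G:
                m = np.zeros((G, G), dtype=bool); m[r:r + L, c] = True; vert.append(m)
            if c + L <= G:
                m = np.zeros((G, G), dtype=bool); m[r, c:c + L] = True; horz.append(m)
    return [(m1, m2) for m1 in vert for m2 in horz if not (m1 & m2).any()]

def occupancy_posterior(data):
    """data: list of ((row, col), hit) responses. Returns p(X_ij = 1 | D)."""
    occupancy = np.zeros((G, G))
    total = 0.0
    for m1, m2 in placements():
        occupied = m1 | m2
        # With noise-free responses the likelihood is 1 if the placement agrees
        # with every query and 0 otherwise; the prior over placements is uniform.
        if all(occupied[q] == hit for q, hit in data):
            occupancy += occupied
            total += 1
    return occupancy / total

p_X = occupancy_posterior([((4, 4), True), ((0, 0), False)])   # p(pixel occupied | D)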

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents.

The joint distribution is obtained by taking the product of the conditional probabilities:

$p(A, B, C, D, E) = p(A)\,p(B)\,p(C|A, B)\,p(D|C)\,p(E|B, C)$

[Diagram: a DAG in which A and B are parents of C, C is the parent of D, and B and C are parents of E.]

Example – Part I

Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.

Choosing an ordering

Without loss of generality, we can write

$p(A, R, E, B) = p(A|R, E, B)\,p(R, E, B)$
$= p(A|R, E, B)\,p(R|E, B)\,p(E, B)$
$= p(A|R, E, B)\,p(R|E, B)\,p(E|B)\,p(B)$

Assumptions:

The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B).

The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E).

Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E).

Therefore

$p(A, R, E, B) = p(A|E, B)\,p(R|E)\,p(E)\,p(B)$

Example – Part II: Specifying the Tables

[Diagram: the belief network B → A ← E → R.]

p(A = 1|B, E):

Burglar  Earthquake  p(Alarm = 1 | B, E)
1        1           0.9999
1        0           0.99
0        1           0.99
0        0           0.0001

p(R = 1|E):

Earthquake  p(Radio = 1 | E)
1           1
0           0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.

Example – Part III: Inference

Initial Evidence: the alarm is sounding.

$p(B = 1|A = 1) = \frac{\sum_{E,R} p(B = 1, E, A = 1, R)}{\sum_{B,E,R} p(B, E, A = 1, R)} = \frac{\sum_{E,R} p(A = 1|B = 1, E)\,p(B = 1)\,p(E)\,p(R|E)}{\sum_{B,E,R} p(A = 1|B, E)\,p(B)\,p(E)\,p(R|E)} \approx 0.99$

Additional Evidence: the radio broadcasts an earthquake warning.

A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01.

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
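Since the tables above fully specify the distribution, the two posteriors can be checked by brute-force enumeration; a short Python sketch, using exactly the numbers from the slides, is:

import itertools

p_B = {1: 0.01, 0: 0.99}
p_E = {1: 0.000001, 0: 0.999999}
p_A1 = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}   # p(A=1 | B, E)
p_R1 = {1: 1.0, 0: 0.0}                                               # p(R=1 | E)

def joint(a, r, e, b):
    pa = p_A1[(b, e)] if a == 1 else 1 - p_A1[(b, e)]
    pr = p_R1[e] if r == 1 else 1 - p_R1[e]
    return pa * pr * p_E[e] * p_B[b]

def p_burglar(evidence):
    """p(B = 1 | evidence), where evidence is a dict over a subset of {'A', 'R'}."""
    num = den = 0.0
    for a, r, e, b in itertools.product([0, 1], repeat=4):
        if any({'A': a, 'R': r}[k] != v for k, v in evidence.items()):
            continue
        p = joint(a, r, e, b)
        den += p
        if b == 1:
            num += p
    return num / den

print(p_burglar({'A': 1}))          # approximately 0.99
print(p_burglar({'A': 1, 'R': 1}))  # approximately 0.01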

Markov Models

For timeseries data v_1, ..., v_T we need a model p(v_{1:T}). For causal consistency, it is meaningful to consider the decomposition

$p(v_{1:T}) = \prod_{t=1}^{T} p(v_t|v_{1:t-1})$

with the convention p(v_t|v_{1:t-1}) = p(v_1) for t = 1.

[Diagram: belief network for this decomposition over v_1, v_2, v_3, v_4.]

Independence assumptions

It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.

Markov Chain

Only the recent past is relevant:

$p(v_t|v_1, ..., v_{t-1}) = p(v_t|v_{t-L}, ..., v_{t-1})$

where L ≥ 1 is the order of the Markov chain. For a first order chain,

$p(v_{1:T}) = p(v_1)\,p(v_2|v_1)\,p(v_3|v_2) \cdots p(v_T|v_{T-1})$

For a stationary Markov chain the transitions p(v_t = s'|v_{t-1} = s) = f(s', s) are time-independent ('homogeneous').

Figure: (a) First order Markov chain. (b) Second order Markov chain.

Markov Chains

$p(v_1, ..., v_T) = \underbrace{p(v_1)}_{\text{initial}} \prod_{t=2}^{T} \underbrace{p(v_t|v_{t-1})}_{\text{transition}}$

State transition diagram

Nodes represent states of the variable v, and arcs the non-zero elements of the transition p(v_t|v_{t-1}).

[Diagram: a state transition graph on states 1–9.]

Most probable and shortest paths

[Diagram: the same state transition graph on states 1–9.]

The shortest (unweighted) path from state 1 to state 7 is 1–2–7.

The most probable path from state 1 to state 7 is 1–8–9–7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1–2–7 the probability of exiting state 2 into state 7 is 1/5.

Equilibrium distribution

It is interesting to know how the marginal p(x_t) evolves through time:

$p(x_t = i) = \sum_j \underbrace{p(x_t = i|x_{t-1} = j)}_{M_{ij}}\, p(x_{t-1} = j)$

p(x_t = i) is the frequency that we visit state i at time t, given we started from p(x_1) and randomly drew samples from the transition p(x_τ|x_{τ-1}).

As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

$p_t = M^{t-1} p_1$

If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:

$p_\infty = M p_\infty$

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.

PageRank

Define the matrix

$A_{ij} = \begin{cases} 1 & \text{if website } j \text{ has a hyperlink to website } i \\ 0 & \text{otherwise} \end{cases}$

From this we can define a Markov transition matrix with elements

$M_{ij} = \frac{A_{ij}}{\sum_{i'} A_{i'j}}$

If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i, a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain the word is then returned, ranked according to the importance of the site.
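A minimal sketch of the ranking computation, on a made-up 5-site link graph (the damping factor used in practice is omitted):

import numpy as np

A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [1, 1, 0, 1, 1],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)   # A[i, j] = 1 if site j links to site i

M = A / A.sum(axis=0, keepdims=True)           # column-stochastic transition matrix

p = np.full(5, 1 / 5)                          # start from the uniform distribution
for _ in range(200):
    p = M @ p                                  # power iteration towards p_inf = M p_inf
print(p)                                       # equilibrium distribution: site 'importance'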

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

$p(h_{1:T}, v_{1:T}) = p(v_1|h_1)\,p(h_1) \prod_{t=2}^{T} p(v_t|h_t)\,p(h_t|h_{t-1})$

For a stationary HMM the transition p(h_t|h_{t-1}) and emission p(v_t|h_t) distributions are constant through time.

Figure: A first order hidden Markov model with 'hidden' variables dom(h_t) = {1, ..., H}, t = 1, ..., T. The 'visible' variables v_t can be either discrete or continuous.

The classical inference problems

Filtering (inferring the present): p(h_t|v_{1:t})
Prediction (inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): argmax_{h_{1:T}} p(h_{1:T}|v_{1:T})

For prediction, one is also often interested in p(v_t|v_{1:s}) for t > s.

Inference in Hidden Markov Models

Belief network representation of a HMM:

[Diagram: hidden chain h_1 → h_2 → h_3 → h_4 with emissions v_1, v_2, v_3, v_4.]

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).

The algorithms are variants of 'message passing on factor graphs'.

The algorithms are guaranteed to work if the graph is singly-connected.

There has been a huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).

HMMs for speech recognition

h_t is the phoneme at time t; p(h_t|h_{t-1}) – language model; p(v_t|h_t) – speech signal model.

Deep Nets and HMMs

[Diagram: hidden chain h_1, ..., h_4 with observations v_1, ..., v_4.]

Recently, companies including Google have made big advances in speech recognition.

The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t; θ).

This function is a deep neural network, trained on a large amount of data.

There is a goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

[Diagram: latent variables h_1, h_2 with children v_1, v_2, v_3, v_4.]

It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation.

Note that this is a Graphical Model, not a Function.

The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h).

One cannot use an autoencoder to generate new images.

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in these models.

Statisticians typically use sampling as an approximation.

It is very popular in ML to use a variational method, which is much faster for inference.

Variational Inference

Consider a distribution

$p(v|\theta) = \int_h p(v|h, \theta)\,p(h)$

and that we wish to learn θ to maximise the probability this model generates observed data:

$\log p(v|\theta) \ge -\int_h q(h|v, \phi)\log q(h|v, \phi) + \int_h q(h|v, \phi)\log\left(p(v|h, \theta)\,p(h)\right)$

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound with respect to φ and θ.

We can parameterise p(v|h, θ) using a deep network.

This is a very popular approach – see the 'variational autoencoder', and also attention mechanisms.

Extension to a semi-supervised method using $p(v) = \int_h \sum_c p(v|h, c)\,p(c)\,p(h)$.
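To make the bound concrete, here is a tiny numerical sketch (an added illustration; the linear-Gaussian model, the Gaussian q and all numbers are assumptions, chosen so that log p(v|θ) is available in closed form for comparison).

import numpy as np

# Toy model: p(h) = N(0, 1), p(v | h, theta) = N(theta * h, 1), q(h | v, phi) = N(mu, sigma^2).
rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def bound(v, theta, mu, sigma, n_samples=1000):
    """Monte Carlo estimate of E_q[ log p(v|h,theta) + log p(h) - log q(h|v,phi) ]."""
    h = mu + sigma * rng.standard_normal(n_samples)      # reparameterised samples from q
    terms = (log_normal(v, theta * h, 1.0)               # log p(v | h, theta)
             + log_normal(h, 0.0, 1.0)                   # log p(h)
             - log_normal(h, mu, sigma ** 2))            # - log q(h | v, phi)
    return terms.mean()

v, theta = 1.3, 0.8
exact = log_normal(v, 0.0, 1.0 + theta ** 2)             # closed-form log p(v | theta)
# The estimate lies below the exact log-likelihood (up to Monte Carlo error);
# jointly maximising the bound over mu, sigma (i.e. phi) and theta tightens and raises it.
print(bound(v, theta, mu=0.2, sigma=1.2), "<=", exact)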

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take in any state of W that will be best for our long term goals.

The problem is that the number of pixel states is enormous.

We need to learn a low dimensional representation of the screen (using a deep generative model).

We then learn which action to take given the low dimensional representation (a toy sketch follows below).
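The sketch below illustrates only the last step, on a made-up environment: tabular Q-learning on a 5-state chain whose right-hand end gives reward 1. It stands in for 'learn which action to take given the (low dimensional) state' and is not the Atari system described above.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9

for episode in range(500):
    s = 0
    for step in range(20):
        a = int(rng.integers(n_actions))                  # random exploration policy
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning is off-policy: move Q(s, a) towards r + gamma * max_a' Q(s', a').
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
        if r == 1.0:
            break

print(Q[:-1].argmax(axis=1))    # greedy action per non-terminal state: all 1 (move right)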

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period.

There is renewed interest and hope in creating AI.

Combine new computational power with suitable hierarchical representations.

Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing.

Challenges

Improve understanding of optimisation for deep learning.

Learn how to more efficiently exploit computational resources.

Learn how to exploit massive databases.

Improve the interaction between reinforcement learning and representation learning.

Marry the non-symbolic (neural) with the symbolic (Bayesian reasoning).

The emphasis is on scalability.

Feel free to contact me at UCL or at my AI company, reinfer.

https://reinfer.io


NNs in NLP

Each word w in the dictionary has an associated embedding vector v_w. Usually around 200 dimensional vectors are used.

Consider the sentence

the cat sat on the mat

and that we wish to predict the word 'on' given the two preceding words 'cat', 'sat' and the two succeeding words 'the', 'mat'.

We can use a network that has inputs v_cat, v_sat, v_the, v_mat.

The output of the network is a probability over all words in the dictionary, p(w|v_inputs). We want p(w = on|v_cat, v_sat, v_the, v_mat) to be high.

The overall objective is then to learn all the word embeddings and network parameters, subject to predicting the word correctly based on the context.
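A minimal sketch of this word-prediction network (the tiny vocabulary, the 8-dimensional embeddings and the single linear-plus-softmax layer are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, D = len(vocab), 8                       # vocabulary size, embedding dimension
embed = rng.normal(0, 0.1, (V, D))         # one embedding vector v_w per word
W = rng.normal(0, 0.1, (V, 4 * D))         # maps the concatenated context to word scores

def predict(context_words):
    """p(w | context words): softmax over the dictionary."""
    x = np.concatenate([embed[vocab.index(w)] for w in context_words])
    scores = W @ x
    p = np.exp(scores - scores.max())
    return p / p.sum()

p = predict(["cat", "sat", "the", "mat"])   # the context around the missing word
print(p[vocab.index("on")])                 # training adjusts embed and W to make this high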

Word Embeddings

Given a word (France, for example) we can find which words w have embedding vectors closest to v_France. From Ronan Collobert (2011).

Word Embeddings

There appears to be a natural 'geometry' to the embeddings. For example, there are directions that correspond to gender:

$v_{woman} - v_{man} \approx v_{aunt} - v_{uncle}$
$v_{woman} - v_{man} \approx v_{queen} - v_{king}$

From Mikolov (2013).

Word Embeddings: Analogies

Given a relationship France–Paris, we get the 'relationship' embedding

$v = v_{Paris} - v_{France}$

Given Italy, we can calculate v_Italy + v and find the word in the dictionary which has the closest embedding to this (it turns out to be Rome). From Mikolov (2013).
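The analogy computation is a nearest-neighbour search in embedding space; a sketch with made-up 2-dimensional embeddings (real ones are learned and around 200 dimensional):

import numpy as np

emb = {
    "France": np.array([1.0, 0.0]), "Paris": np.array([1.0, 1.0]),
    "Italy":  np.array([2.0, 0.1]), "Rome":  np.array([2.0, 1.1]),
    "queen":  np.array([5.0, 3.0]),
}

def closest(vector, exclude=()):
    """Word whose embedding is nearest (Euclidean) to the given vector."""
    candidates = [(np.linalg.norm(v - vector), w) for w, v in emb.items() if w not in exclude]
    return min(candidates)[1]

v = emb["Paris"] - emb["France"]                                         # the 'relationship' direction
print(closest(emb["Italy"] + v, exclude=("Italy", "Paris", "France")))   # -> Rome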

Word Embeddings: Constrained Embeddings

We can learn embeddings for English words and embeddings for Chinese words.

However, when we know that a Chinese and an English word have a similar meaning, we add a constraint that the word embeddings v_ChineseWord and v_EnglishWord should be close.

We have only a small amount of labelled 'similar' Chinese-English words (these are the green-border boxes in the figure; they are standard translations of the corresponding Chinese character).

We can visualise the embedding vectors in 2D (using t-SNE). See Socher (2013).

Word Embeddings: Constrained Embeddings

[Figure: 2D t-SNE visualisation of the constrained English/Chinese embeddings.]

Recursive Nets and Embeddings

Stanford Sentiment Treebank: consists of parsed sentences with sentiment labels (−−, −, 0, +, ++) for each node (phrase) in the tree; 215,000 labelled phrases (obtained from three humans).

Recursive Nets and Embeddings

The idea is to recursively combine embeddings such that they accurately predict the sentiment at each node.

Recursive Nets and Embeddings: Training

We have a softmax classifier for each node in the tree to predict the sentiment of the phrase beneath this node in the tree.

The weights of this classifier are shared across all nodes.

At the leaf nodes at the bottom of the tree, the inputs to the classifiers are the word embeddings.

The embeddings are combined by another network g with common parameters, which forms the input to the sentiment classifier.

We then learn all the embeddings, shared classifier parameters and shared combination parameters to maximise the classification accuracy.

Prediction

For a new movie review, the review is first parsed using a standard grammar tree parser.

This forms the tree, which can be used to recursively form the sentiment class label for the review.

Currently the best sentiment classifier: Socher (2013).
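A minimal sketch of the recursive composition and shared classifier described above (the dimensions, the tanh combination and the untrained random parameters are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
D, C = 8, 5                                   # embedding dimension; 5 sentiment classes
W_g = rng.normal(0, 0.1, (D, 2 * D))          # shared combination network g
W_c = rng.normal(0, 0.1, (C, D))              # shared softmax classifier

def combine(h_left, h_right):
    """Combine two child embeddings into the parent phrase embedding."""
    return np.tanh(W_g @ np.concatenate([h_left, h_right]))

def sentiment(h):
    """Class distribution (--, -, 0, +, ++) for the phrase embedding h."""
    scores = W_c @ h
    p = np.exp(scores - scores.max())
    return p / p.sum()

word = {w: rng.normal(0, 0.1, D) for w in ["not", "very", "good"]}
root = combine(word["not"], combine(word["very"], word["good"]))   # tree: (not (very good))
print(sentiment(root))   # sentiment at the root node (parameters here are untrained)

Training would adjust the word embeddings, W_g and W_c jointly so that the predicted distribution at every labelled node matches its treebank label.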

Recursive Nets and Embeddings

[Figure (Socher 2013): predicted positive and negative sentiment for example sentences and their negations, e.g. 'Roger Dodger is one of the most compelling variations on this theme' versus '... one of the least compelling variations on this theme'.]

Recurrent Nets

[Diagram: an RNN unrolled through time, with inputs x_1, x_2, x_3, hidden states h_1, h_2, h_3, outputs y_1, y_2, y_3, and shared weight matrices A, B, C, with B the recurrent (hidden-to-hidden) weights.]

RNNs are used in timeseries applications.

The basic idea is that the hidden units at time t, h_t (and possibly the output y_t), depend on the previous state of the network h_{t-1}, x_{t-1}, y_{t-1}, for inputs x_t and outputs y_t.

In the above network I 'unrolled the net through time' to give a standard NN diagram.

I omitted the potential links from x_{t-1}, y_{t-1} to h_t.
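A minimal sketch of the recurrence (the tanh/linear choices and the dimensions are assumptions; the matrix names loosely follow the shared weights in the unrolled diagram above):

import numpy as np

rng = np.random.default_rng(0)
Dx, Dh, Dy = 3, 5, 2
A = rng.normal(0, 0.3, (Dh, Dx))      # input -> hidden weights (shared across time)
B = rng.normal(0, 0.3, (Dh, Dh))      # hidden -> hidden (recurrent) weights
C = rng.normal(0, 0.3, (Dy, Dh))      # hidden -> output weights

def rnn_forward(xs):
    h = np.zeros(Dh)
    ys = []
    for x in xs:                      # unrolling the net through time
        h = np.tanh(A @ x + B @ h)    # h_t depends on x_t and h_{t-1}
        ys.append(C @ h)              # y_t is read off h_t
    return ys

ys = rnn_forward([rng.normal(size=Dx) for _ in range(4)])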

Handwriting Generation using a RNN

Some training examples:

[Figure: examples of real handwriting used for training.]

Handwriting Generation using a RNN

Some generated examples. The top line is real handwriting, for comparison. See Alex Graves' work.

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPUs).

Much larger datasets.

AutoDiff.

What is AutoDiff?

AutoDiff takes a function f(x) and returns an exact value (up to machine accuracy) for the gradient

$g_i(x) \equiv \left.\frac{\partial f}{\partial x_i}\right|_{x}$

Note that this is not the same as a numerical approximation (such as central differences) for the gradient.

One can show that, if done efficiently, one can always calculate the gradient in less than 5 times the time it takes to compute f(x).
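As a quick added check of that distinction, compare a central-difference approximation with the exact derivative of the earlier example f(x) = x^2 + x^8, whose derivative is 2x + 8x^7:

def f(x):
    return x ** 2 + x ** 8

def central_difference(f, x, eps=1e-5):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 1.3
exact = 2 * x + 8 * x ** 7
approx = central_difference(f, x)
print(exact, approx, abs(exact - approx))   # the numerical estimate is close, but not exact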

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 27: David Barber - Deep Nets, Bayes and the story of AI

Word Embeddings

Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)

Word Embeddings

There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender

vwoman minus vman asymp vaunt minus vuncle

vwoman minus vman asymp vqueen minus vking

From Mikolov (2013)

Word Embeddings Analogies

Given a relationship France-Paris we get the lsquorelationshiprsquo embedding

v = vParis minus vFrance

Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4

Figure: A first order hidden Markov model with 'hidden' variables dom(h_t) = {1, . . . , H}, t = 1, . . . , T. The 'visible' variables v_t can be either discrete or continuous.
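Given toy tables (invented here) for p(h_1), p(h_t|h_{t-1}) and p(v_t|h_t), the joint distribution above can be evaluated directly; a minimal sketch:

    import numpy as np

    p_h1  = np.array([0.6, 0.3, 0.1])            # p(h_1)
    trans = np.array([[0.8, 0.1, 0.2],           # trans[i, j] = p(h_t = i | h_{t-1} = j)
                      [0.1, 0.7, 0.2],
                      [0.1, 0.2, 0.6]])
    emit  = np.array([[0.9, 0.4, 0.1],           # emit[v, h] = p(v_t = v | h_t = h)
                      [0.1, 0.6, 0.9]])

    def log_joint(h, v):
        # log p(h_{1:T}, v_{1:T}) = log p(h_1) + log p(v_1|h_1)
        #                           + sum_{t=2}^T [ log p(h_t|h_{t-1}) + log p(v_t|h_t) ]
        lp = np.log(p_h1[h[0]]) + np.log(emit[v[0], h[0]])
        for t in range(1, len(h)):
            lp += np.log(trans[h[t], h[t-1]]) + np.log(emit[v[t], h[t]])
        return lp

    print(log_joint(h=[0, 0, 1, 2], v=[0, 0, 1, 1]))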

The classical inference problems

Filtering (inferring the present): p(h_t | v_{1:t})
Prediction (inferring the future): p(h_t | v_{1:s}), t > s
Smoothing (inferring the past): p(h_t | v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): argmax_{h_{1:T}} p(h_{1:T} | v_{1:T})

For prediction one is also often interested in p(v_t | v_{1:s}) for t > s.
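As an illustration, filtering can be computed with the standard forward recursion; a sketch using the same toy parameters as above (normalising at each step gives p(h_t|v_{1:t}) directly):

    import numpy as np

    def filtering(v, p_h1, trans, emit):
        # Return alpha[t, i] = p(h_t = i | v_{1:t}) for a discrete HMM.
        # trans[i, j] = p(h_t=i | h_{t-1}=j), emit[k, i] = p(v_t=k | h_t=i).
        T, H = len(v), len(p_h1)
        alpha = np.zeros((T, H))
        a = emit[v[0]] * p_h1                         # unnormalised p(h_1, v_1)
        alpha[0] = a / a.sum()
        for t in range(1, T):
            a = emit[v[t]] * (trans @ alpha[t - 1])   # predict with the transition, correct with the emission
            alpha[t] = a / a.sum()                    # normalise; O(H^2) per step, O(T H^2) overall
        return alpha

    # Toy parameters (same convention as the joint-probability sketch above).
    p_h1  = np.array([0.6, 0.3, 0.1])
    trans = np.array([[0.8, 0.1, 0.2], [0.1, 0.7, 0.2], [0.1, 0.2, 0.6]])
    emit  = np.array([[0.9, 0.4, 0.1], [0.1, 0.6, 0.9]])
    print(filtering([0, 0, 1, 1], p_h1, trans, emit))

Smoothing and Viterbi have the same O(T H^2) structure, with an additional backward pass and with the sum replaced by a max respectively.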

Inference in Hidden Markov Models

Belief network representation of an HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).

The algorithms are variants of 'message passing on factor graphs'.

The algorithms are guaranteed to work if the graph is singly connected.

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).

HMMs for speech recognition

h_t is the phoneme at time t; p(h_t|h_{t-1}) – language model; p(v_t|h_t) – speech signal model.

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speech recognition.

The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t; θ).

This function is a deep neural network, trained on a large amount of data.

There is a goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
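A minimal sketch of this emission model only (not Google's actual system; the network weights are random and untrained here, and the feature dimensions are made up): a small feedforward network maps a one-hot encoding of the phoneme h_t to the Gaussian mean μ(h_t; θ), and the emission is the corresponding Gaussian log density.

    import numpy as np

    rng = np.random.default_rng(0)
    H, D, hidden = 10, 4, 32        # number of phonemes, acoustic-feature dimension, hidden width (toy)

    # A small feedforward network for the emission mean mu(h_t; theta) (randomly initialised here).
    W1, b1 = rng.normal(0, 0.1, (hidden, H)), np.zeros(hidden)
    W2, b2 = rng.normal(0, 0.1, (D, hidden)), np.zeros(D)

    def mu(h):
        one_hot = np.zeros(H)
        one_hot[h] = 1.0
        return W2 @ np.tanh(W1 @ one_hot + b1) + b2

    def log_emission(v, h, sigma=1.0):
        # log N(v | mu(h; theta), sigma^2 I)
        d = v - mu(h)
        return -0.5 * (d @ d) / sigma**2 - 0.5 * D * np.log(2 * np.pi * sigma**2)

    v_t = rng.normal(size=D)        # a stand-in acoustic feature vector
    print(log_emission(v_t, h=3))

Training θ (by maximising the likelihood of a large transcribed corpus) is the part not shown here.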

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation.

Note that this is a Graphical Model, not a Function.

The latent variables h can be sampled from p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.
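A minimal sketch of such ancestral sampling, assuming a Gaussian p(h) and a factorised Bernoulli p(v|h) whose probabilities come from a small (here random, untrained) network; every call fantasises a new 4×4 binary 'image':

    import numpy as np

    rng = np.random.default_rng(1)
    H, D = 2, 16                      # latent dimension and 'image' dimension (toy 4x4 image)

    # Decoder: p(v|h) is a product of Bernoullis with probabilities given by a small network.
    W, b = rng.normal(0, 1.0, (D, H)), np.zeros(D)

    def sample_image():
        h = rng.normal(size=H)                       # h ~ p(h) = N(0, I)
        probs = 1 / (1 + np.exp(-(W @ h + b)))       # p(v_d = 1 | h)
        return (rng.random(D) < probs).astype(int).reshape(4, 4)

    print(sample_image())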

The bad news

Inference (computing p(h|v)) and parameter learning are intractable in these models.

Statisticians typically use sampling as an approximation.

It is very popular in ML to use a variational method – much faster for inference.

Variational Inference

Consider a distribution

p(v|θ) = \int_h p(v|h, θ) p(h)

and that we wish to learn θ to maximise the probability this model generates observed data.

For any distribution q(h|v, φ), Jensen's inequality gives the lower bound

log p(v|θ) ≥ −\int_h q(h|v, φ) log q(h|v, φ) + \int_h q(h|v, φ) log p(v|h, θ) + const.

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound w.r.t. φ and θ.

We can parameterise p(v|h, θ) using a deep network.

This is a very popular approach – see the 'variational autoencoder' and also attention mechanisms.
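A sketch of the bound for a single datapoint, assuming a Gaussian q(h|v, φ) and the Bernoulli decoder of the previous sketch: it forms a Monte Carlo estimate of E_q[log p(v|h, θ) + log p(h) − log q(h|v, φ)], which is the quantity one would then maximise jointly in φ and θ (for example with the reparameterisation trick, as in the variational autoencoder; the optimisation itself is not shown).

    import numpy as np

    rng = np.random.default_rng(2)
    H, D = 2, 16

    # theta: decoder parameters for p(v|h); phi: (mean, log-std) of q(h|v, phi) for one datapoint.
    W, b = rng.normal(0, 1.0, (D, H)), np.zeros(D)
    mu_q, log_sig_q = np.zeros(H), np.zeros(H)

    def elbo(v, n_samples=100):
        # Monte Carlo estimate of E_q[ log p(v|h) + log p(h) - log q(h|v) ] <= log p(v)
        sig = np.exp(log_sig_q)
        total = 0.0
        for _ in range(n_samples):
            eps = rng.normal(size=H)
            h = mu_q + sig * eps                                  # h ~ q(h|v, phi)
            p = 1 / (1 + np.exp(-(W @ h + b)))                    # decoder probabilities
            log_pvh = np.sum(v * np.log(p) + (1 - v) * np.log(1 - p))
            log_ph = -0.5 * np.sum(h**2) - 0.5 * H * np.log(2 * np.pi)
            log_q = -0.5 * np.sum(eps**2) - 0.5 * H * np.log(2 * np.pi) - np.sum(log_sig_q)
            total += log_pvh + log_ph - log_q
        return total / n_samples

    v = rng.integers(0, 2, size=D)    # a stand-in binary 'image'
    print(elbo(v))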

Extension to a semi-supervised method using p(v) = \int_h \sum_c p(v|h, c) p(c) p(h).

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long term goals.

The problem is that the number of pixel states is enormous.

We need to learn a low dimensional representation of the screen (using a deep generative model).

Then learn which action to take given the low dimensional representation.
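One standard way to 'learn which action to take' is Q-learning. Below is a minimal tabular sketch on a made-up five-state chain world, standing in for the learned low dimensional representation; a deep RL system would replace the table with a network and this toy environment with the game.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions = 5, 2            # low dimensional (discretised) states; actions: left/right
    Q = np.zeros((n_states, n_actions))
    gamma, alpha, eps = 0.9, 0.1, 0.1

    def step(s, a):
        # Made-up chain environment: moving right towards state 4 earns reward 1.
        s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s2 == n_states - 1 else 0.0
        return s2, r

    for episode in range(500):
        s = 0
        for t in range(20):
            a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
            s2, r = step(s, a)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])   # Q-learning update
            s = s2

    print(np.argmax(Q, axis=1))   # learned greedy action in each state (should mostly be 'right')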

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing.

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning.

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company, reinfer.

https://reinfer.io



Word Embeddings Analogies

Given a relationship France-Paris we get the lsquorelationshiprsquo embedding

v = vParis minus vFrance

Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 30: David Barber - Deep Nets, Bayes and the story of AI

Word Embeddings Constrained Embeddings

We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)

Word Embeddings Constrained Embeddings

Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse Differentiation

A useful graphical representation is that the total derivative of f with respect to x is given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the functions on the edges:

$\frac{df}{dx} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial g}\frac{dg}{dx}$

[Figure: a graph with a direct edge from x to f labelled ∂f/∂x, and an indirect path x → g → f with edges labelled dg/dx and ∂f/∂g.]

Example

For $f(x) = x^2 + xgh$, where $g = x^2$ and $h = xg^2$:

[Figure: the graph with nodes x, g, h, f and edge labels ∂f/∂x = 2x + gh, dg/dx = 2x, ∂f/∂g = xh, ∂h/∂g = 2gx, ∂f/∂h = xg, ∂h/∂x = g².]

$f'(x) = (2x + gh) + (g^2 \cdot xg) + (2x \cdot 2gx \cdot xg) + (2x \cdot xh) = 2x + 8x^7$
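As a quick numerical check (a sketch, not part of the slides): since f(x) = x^2 + xgh reduces to x^2 + x^8, the path-sum result 2x + 8x^7 can be compared against a central-difference approximation.

def f(x):
    g = x ** 2
    h = x * g ** 2
    return x ** 2 + x * g * h        # equals x^2 + x^8

def fprime_paths(x):
    return 2 * x + 8 * x ** 7        # the sum over all path values, as above

def fprime_central(x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)   # numerical approximation

print(fprime_paths(1.3), fprime_central(1.3))      # the two values should agree closely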

Reverse Differentiation

Consider

$f(x_1, x_2) = \cos(\sin(x_1 x_2))$

We can represent this computationally using an Abstract Syntax Tree (AST):

[Figure: the AST with leaves x1 and x2 feeding f1, which feeds f2, which feeds f3.]

$f_1(x_1, x_2) = x_1 x_2, \qquad f_2(x) = \sin(x), \qquad f_3(x) = \cos(x)$

Given values for x1, x2, we first run forwards through the tree so that we can associate each node with an actual function value.

Reverse Differentiation

[Figure: the same AST over x1, x2, f1, f2, f3.]

$\frac{df_3}{dx_1} = \frac{\partial f_3}{\partial f_2}\frac{df_2}{dx_1} = \underbrace{\frac{\partial f_3}{\partial f_2}\frac{df_2}{df_1}}_{df_3/df_1}\frac{df_1}{dx_1}$

Similarly

$\frac{df_3}{dx_2} = \underbrace{\frac{\partial f_3}{\partial f_2}\frac{df_2}{df_1}}_{df_3/df_1}\frac{df_1}{dx_2}$

The two derivatives share the same computation branch and we want to exploit this.

Reverse Differentiation

[Figure: the same AST, annotated with the local partial derivatives.]

$\frac{\partial f_1}{\partial x_1} = x_2, \quad \frac{\partial f_1}{\partial x_2} = x_1, \quad \frac{\partial f_2}{\partial f_1} = \cos(f_1), \quad \frac{\partial f_3}{\partial f_2} = -\sin(f_2)$

1. Find the reverse ancestral (backwards) schedule of nodes: (f3, f2, f1, x1, x2).

2. Start with the first node n1 in the reverse schedule and define t_{n_1} = 1.

3. For the next node n in the reverse schedule, find the child nodes ch(n). Then define

$t_n = \sum_{c \in \mathrm{ch}(n)} \frac{\partial f_c}{\partial f_n}\, t_c$

4. The total derivatives of f with respect to the root nodes of the tree (here x1 and x2) are given by the values of t at those nodes.

This is a general procedure that can be used to automatically define a subroutine to efficiently compute the gradient. It is efficient because information is collected at nodes in the tree and split between parents only when required.
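A minimal Python sketch of this reverse pass, hard-coded for the AST of f(x1, x2) = cos(sin(x1 x2)) above (the variable names are my own):

import math

def f_and_grad(x1, x2):
    # forward pass: associate each node with its value
    f1 = x1 * x2
    f2 = math.sin(f1)
    f3 = math.cos(f2)
    # local partial derivatives on the edges of the AST
    df1_dx1, df1_dx2 = x2, x1
    df2_df1 = math.cos(f1)
    df3_df2 = -math.sin(f2)
    # reverse pass: t_n = sum over children c of (∂f_c/∂f_n) * t_c
    t_f3 = 1.0
    t_f2 = df3_df2 * t_f3
    t_f1 = df2_df1 * t_f2
    t_x1 = df1_dx1 * t_f1            # total derivative df3/dx1
    t_x2 = df1_dx2 * t_f1            # total derivative df3/dx2
    return f3, (t_x1, t_x2)

print(f_and_grad(0.7, -1.2))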

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another, and requires complex reasoning using some form of internal model.

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos – position in kitchen; snd – sound.

Finding the Burglar

[Figure, shown over several animation steps: the kitchen grid with the observed sequence of creaks and bumps at each time step, from which the burglar's position is inferred.]

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int – intended key; hit – hit key.

Stubby Fingers errors

[Figure: the error model p(hit | int) – for each intended key a–z, the probability of each actual hit key a–z; values on the colour scale roughly 0.05–0.55.]

Stubby Fingers language

[Figure: the language model – a matrix over the letters a–z (letter-to-letter transition probabilities); values on the colour scale roughly 0–0.9.]

Stubby Fingers

Given the typed sequence cwsykcak, what is the most likely word that this corresponds to? (A simplified scoring sketch follows the steps below.)

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word
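A simplified sketch of the idea: rather than extracting the 200 most likely hidden sequences as on the slide, this scores dictionary candidates directly under an assumed error model and an assumed letter-transition language model; all tables and names below are placeholders.

import math

def log_score(word, typed, p_first, p_trans, p_hit):
    # log p(int = word) + log p(hit = typed | int = word) for a first-order
    # letter model; p_first, p_trans, p_hit are assumed probability tables.
    if len(word) != len(typed):
        return -math.inf
    score = math.log(p_first[word[0]]) + math.log(p_hit[word[0]][typed[0]])
    for prev, cur, obs in zip(word, word[1:], typed[1:]):
        score += math.log(p_trans[prev][cur]) + math.log(p_hit[cur][obs])
    return score

# usage (with a word list and the tables defined elsewhere):
# best = max(dictionary, key=lambda w: log_score(w, "cwsykcak", p_first, p_trans, p_hit))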

Speech Recognition raw signal

[Figure: the raw audio waveform, amplitude roughly −0.2 to 0.3, over 0–0.9 seconds.]

'neural' representation

[Figure: the corresponding 'neural' representation – roughly 25 channels over about 80 time frames.]

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho – phoneme (letter); aud – audio signal (neural representation).

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems, such as the Ising Model (1920), and in AI applications, such as the HMM (Baum 1966; Stratonovich 1960).

The need for structure

We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.).

Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented.

Without introducing strong structural limitations about how these objects can interact, probability is a non-starter.

For this reason, computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however, these are typically frowned on by purists.

Graphical Models

We can use graphs to represent how objects can probabilistically interact with each other.

Graphical Models are then a marriage between Graph and Probability theory.

Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph.

The computational complexity of operations can often be related to the structure of the graph.

Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science.

Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable.

Uses in Industry

Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model).

Hospitals use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis.

Google, Microsoft, Facebook: used in many places, including advertising, video game prediction, speech recognition.

Used to estimate the inherent desirability of products in consumer retail.

Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company/user relationship.

Conditional Probability and Bayes' Rule

The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

$p(x|y) \equiv \frac{p(x, y)}{p(y)} = \frac{p(y|x)p(x)}{p(y)} \qquad \text{(Bayes' rule)}$

Throwing darts

$p(\text{region 5}\,|\,\text{not region 20}) = \frac{p(\text{region 5},\ \text{not region 20})}{p(\text{not region 20})} = \frac{p(\text{region 5})}{p(\text{not region 20})} = \frac{1/20}{19/20} = \frac{1}{19}$

Interpretation

p(A = a|B = b) should not be interpreted as 'given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation is 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.

Battleships

Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each.

They can be placed anywhere on the 10×10 grid, but cannot overlap.

Let s1 be the origin of ship 1 and s2 the origin of ship 2.

Data D is a collection of query 'hit' or 'miss' responses.

$p(s_1, s_2|D) = \frac{p(D|s_1, s_2)\,p(s_1, s_2)}{p(D)}$

Let X be the matrix of pixel occupancy:

$p(X|D) = \sum_{s_1, s_2} p(X, s_1, s_2|D) = \sum_{s_1, s_2} p(X|s_1, s_2)\,p(s_1, s_2|D)$

demoBattleships.m
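A minimal sketch of these two equations in Python (my own code, not the demoBattleships.m demo), assuming a uniform prior over non-overlapping placements and noise-free hit/miss responses:

import numpy as np

G = 10                                        # grid size
def occupancy(s1, s2):
    X = np.zeros((G, G), dtype=bool)
    (r1, c1), (r2, c2) = s1, s2
    X[r1:r1 + 5, c1] = True                   # vertical ship of length 5
    X[r2, c2:c2 + 5] = True                   # horizontal ship of length 5
    return X

# enumerate all non-overlapping placements (s1, s2)
placements = []
for r1 in range(G - 4):
    for c1 in range(G):
        for r2 in range(G):
            for c2 in range(G - 4):
                X = occupancy((r1, c1), (r2, c2))
                if X.sum() == 10:             # the two ships do not overlap
                    placements.append(X)

D = [(2, 3, True), (5, 5, False)]             # example 'hit'/'miss' query responses

# p(s1, s2 | D) is uniform over the placements consistent with D;
# p(X | D) then follows by averaging the occupancy over those placements.
consistent = [X for X in placements if all(X[r, c] == hit for r, c, hit in D)]
pX = np.mean(consistent, axis=0)              # p(pixel occupied | D)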

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents.

The joint distribution is obtained by taking the product of the conditional probabilities:

$p(A, B, C, D, E) = p(A)\,p(B)\,p(C|A, B)\,p(D|C)\,p(E|B, C)$

[Figure: the corresponding DAG, with edges A→C, B→C, C→D, B→E and C→E.]

Example – Part I

Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.

Choosing an ordering

Without loss of generality, we can write

$p(A, R, E, B) = p(A|R, E, B)\,p(R, E, B)$

$\qquad = p(A|R, E, B)\,p(R|E, B)\,p(E, B)$

$\qquad = p(A|R, E, B)\,p(R|E, B)\,p(E|B)\,p(B)$

Assumptions

The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B).

The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E).

Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E).

Therefore

$p(A, R, E, B) = p(A|E, B)\,p(R|E)\,p(E)\,p(B)$

Example – Part II: Specifying the Tables

[Figure: the DAG with edges B→A, E→A and E→R.]

p(A|B, E):

Alarm = 1   Burglar   Earthquake
0.9999      1         1
0.99        1         0
0.99        0         1
0.0001      0         0

p(R|E):

Radio = 1   Earthquake
1           1
0           0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.

Example – Part III: Inference

Initial Evidence: the alarm is sounding.

$p(B = 1|A = 1) = \frac{\sum_{E,R} p(B = 1, E, A = 1, R)}{\sum_{B,E,R} p(B, E, A = 1, R)} = \frac{\sum_{E,R} p(A = 1|B = 1, E)\,p(B = 1)\,p(E)\,p(R|E)}{\sum_{B,E,R} p(A = 1|B, E)\,p(B)\,p(E)\,p(R|E)} \approx 0.99$

Additional Evidence: the radio broadcasts an earthquake warning.

A similar calculation gives $p(B = 1|A = 1, R = 1) \approx 0.01$.

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
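This calculation is small enough to check by brute-force enumeration over the four binary variables; a short plain-Python sketch using the tables above:

pB = {1: 0.01, 0: 0.99}
pE = {1: 0.000001, 0: 0.999999}
pA1 = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}   # p(A=1 | B, E)
pR1 = {1: 1.0, 0: 0.0}                                                # p(R=1 | E)

def joint(b, e, a, r):
    pa = pA1[(b, e)] if a == 1 else 1 - pA1[(b, e)]
    pr = pR1[e] if r == 1 else 1 - pR1[e]
    return pB[b] * pE[e] * pa * pr

# p(B=1 | A=1): sum out E and R
num = sum(joint(1, e, 1, r) for e in (0, 1) for r in (0, 1))
den = sum(joint(b, e, 1, r) for b in (0, 1) for e in (0, 1) for r in (0, 1))
print(num / den)      # approximately 0.99

# p(B=1 | A=1, R=1)
num = sum(joint(1, e, 1, 1) for e in (0, 1))
den = sum(joint(b, e, 1, 1) for b in (0, 1) for e in (0, 1))
print(num / den)      # approximately 0.01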

Markov Models

For timeseries data v1, ..., vT we need a model p(v_{1:T}). For causal consistency, it is meaningful to consider the decomposition

$p(v_{1:T}) = \prod_{t=1}^{T} p(v_t|v_{1:t-1})$

with the convention $p(v_t|v_{1:t-1}) = p(v_1)$ for t = 1.

[Figure: a belief network over v1, v2, v3, v4.]

Independence assumptions

It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.

Markov Chain

Only the recent past is relevant

$p(v_t|v_1, \ldots, v_{t-1}) = p(v_t|v_{t-L}, \ldots, v_{t-1})$

where L ≥ 1 is the order of the Markov chain.

$p(v_{1:T}) = p(v_1)\,p(v_2|v_1)\,p(v_3|v_2) \cdots p(v_T|v_{T-1})$

For a stationary Markov chain the transitions $p(v_t = s'|v_{t-1} = s) = f(s', s)$ are time-independent ('homogeneous').

[Figure: (a) First order Markov chain; (b) Second order Markov chain.]
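A small sketch of sampling from a stationary first-order chain; the 3-state transition matrix below is an assumed toy example, not from the lecture.

import numpy as np

rng = np.random.default_rng(0)
M = np.array([[0.8, 0.1, 0.3],
              [0.1, 0.8, 0.3],
              [0.1, 0.1, 0.4]])       # M[s', s] = p(v_t = s' | v_{t-1} = s); columns sum to 1
p1 = np.array([1.0, 0.0, 0.0])        # p(v_1)

def sample_chain(T):
    v = [rng.choice(3, p=p1)]
    for _ in range(T - 1):
        v.append(rng.choice(3, p=M[:, v[-1]]))   # draw v_t given v_{t-1}
    return v

print(sample_chain(10))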

Markov Chains

[Figure: a first order Markov chain over v1, v2, v3, v4.]

$p(v_1, \ldots, v_T) = \underbrace{p(v_1)}_{\text{initial}} \prod_{t=2}^{T} \underbrace{p(v_t|v_{t-1})}_{\text{transition}}$

State transition diagram

Nodes represent states of the variable v, and arcs the non-zero elements of the transition p(v_t|v_{t-1}).

[Figure: a state transition diagram over states 1–9.]

Most probable and shortest paths

[Figure: the same state transition diagram over states 1–9.]

The shortest (unweighted) path from state 1 to state 7 is 1−2−7.

The most probable path from state 1 to state 7 is 1−8−9−7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1−2−7 the probability of exiting state 2 into state 7 is 1/5.

Equilibrium distribution

It is interesting to know how the marginal p(x_t) evolves through time:

$p(x_t = i) = \sum_j \underbrace{p(x_t = i|x_{t-1} = j)}_{M_{ij}}\,p(x_{t-1} = j)$

p(x_t = i) is the frequency that we visit state i at time t, given we started from p(x_1) and randomly drew samples from the transition p(x_τ|x_{τ-1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

$p_t = M^{t-1}p_1$

If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:

$p_\infty = Mp_\infty$

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.

PageRank

Define the matrix

$A_{ij} = \begin{cases} 1 & \text{if website } j \text{ has a hyperlink to website } i \\ 0 & \text{otherwise} \end{cases}$

From this we can define a Markov transition matrix with elements

$M_{ij} = \frac{A_{ij}}{\sum_{i'} A_{i'j}}$

If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i, a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain the word is then returned, ranked according to the importance of the site.
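A minimal sketch of the equilibrium/PageRank computation for an assumed toy link matrix, using power iteration p ← Mp:

import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # assumed toy links: A[i, j] = 1 if site j links to site i

M = A / A.sum(axis=0, keepdims=True)        # Markov transition matrix M[i, j]
p = np.ones(len(A)) / len(A)                # start from a uniform distribution
for _ in range(200):
    p = M @ p                               # p_t = M p_{t-1}
print(p)                                    # approximate equilibrium distribution ('importance')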

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

$p(h_{1:T}, v_{1:T}) = p(v_1|h_1)\,p(h_1)\prod_{t=2}^{T} p(v_t|h_t)\,p(h_t|h_{t-1})$

For a stationary HMM the transition p(h_t|h_{t-1}) and emission p(v_t|h_t) distributions are constant through time.

[Figure: a first order hidden Markov model with 'hidden' variables dom(h_t) = {1, ..., H}, t = 1, ..., T. The 'visible' variables v_t can be either discrete or continuous.]

The classical inference problems

Filtering (inferring the present): p(h_t|v_{1:t})

Prediction (inferring the future): p(h_t|v_{1:s}), t > s

Smoothing (inferring the past): p(h_t|v_{1:u}), t < u

Likelihood: p(v_{1:T})

Most likely path (Viterbi alignment): $\arg\max_{h_{1:T}} p(h_{1:T}|v_{1:T})$

For prediction, one is also often interested in p(v_t|v_{1:s}) for t > s.
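For filtering, the standard forward recursion can be written in a few lines; the transition, emission and initial distributions below are assumed toy values.

import numpy as np

A = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.3],
              [0.1, 0.2, 0.6]])        # A[i, j] = p(h_t = i | h_{t-1} = j)
B = np.array([[0.9, 0.3, 0.5],
              [0.1, 0.7, 0.5]])        # B[v, h] = p(v_t = v | h_t = h)
p1 = np.array([0.5, 0.3, 0.2])         # p(h_1)

def filtering(observations):
    # returns p(h_t | v_{1:t}) for each t
    alpha = B[observations[0]] * p1
    filtered = [alpha / alpha.sum()]
    for v in observations[1:]:
        alpha = B[v] * (A @ filtered[-1])   # propagate through the transition, weight by the emission
        filtered.append(alpha / alpha.sum())
    return filtered

print(filtering([0, 1, 1, 0]))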

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).

The algorithms are variants of 'message passing on factor graphs'.

The algorithms are guaranteed to work if the graph is singly-connected.

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).

HMMs for speech recognition

h_t is the phoneme at time t; p(h_t|h_{t-1}) – language model; p(v_t|h_t) – speech signal model.

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently, companies including Google have made big advances in speech recognition.

The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t; θ).

This function is a deep neural network, trained on a large amount of data.

There is a goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation.

Note that this is a Graphical Model, not a Function.

The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in these models.

Statisticians typically use sampling as an approximation.

It is very popular in ML to use a variational method – much faster for inference.

Variational Inference

Consider a distribution

$p(v|\theta) = \int_h p(v|h, \theta)\,p(h)$

and that we wish to learn θ to maximise the probability this model generates observed data:

$\log p(v|\theta) \ge -\int_h q(h|v, \phi)\log q(h|v, \phi) + \int_h q(h|v, \phi)\log p(v|h, \theta) + \text{const.}$

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound w.r.t. φ and θ.

We can parameterise p(v|h, θ) using a deep network.

Very popular approach – see the 'variational autoencoder' and also attention mechanisms.

Extension to semi-supervised methods using $p(v) = \int_h \sum_c p(v|h, c)\,p(c)\,p(h)$.
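A toy sketch of the bound, with my own model choices (p(h) = N(0,1), p(v|h,θ) = N(θh, 1), q(h|v,φ) = N(μ, s²)), estimated from a single reparameterised sample as in the variational autoencoder; here the log p(h) term is kept explicitly rather than absorbed into the constant.

import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def elbo_estimate(v, theta, mu, s):
    eps = rng.standard_normal()
    h = mu + s * eps                            # sample h ~ q(h|v, phi) via reparameterisation
    return (log_normal(v, theta * h, 1.0)       # log p(v|h, theta)
            + log_normal(h, 0.0, 1.0)           # log p(h)
            - log_normal(h, mu, s ** 2))        # - log q(h|v, phi)

print(elbo_estimate(v=0.8, theta=1.5, mu=0.4, s=0.7))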

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals.

The problem is that the number of pixel states is enormous.

We need to learn a low dimensional representation of the screen (use a deep generative model).

Then learn which action to take given the low dimensional representation.
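As a heavily simplified illustration of 'learning which action to take': tabular Q-learning on an assumed five-state chain, rather than deep RL over learned screen representations; everything below is my own toy construction.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2                      # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)   # reward at the right end

for episode in range(500):
    s = 0
    for _ in range(20):
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])   # Q-learning update
        s = s_next

print(np.argmax(Q, axis=1))                     # learned greedy policy (should prefer 'right')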

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve the interaction between reinforcement learning and representation learning.

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io



Recursive Nets and Embeddings

Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)

Recursive Nets and Embeddings

Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

p(h_{1:T}, v_{1:T}) = p(v_1|h_1) p(h_1) \prod_{t=2}^{T} p(v_t|h_t) p(h_t|h_{t-1})

For a stationary HMM the transition p(h_t|h_{t-1}) and emission p(v_t|h_t) distributions are constant through time.

Figure: A first order hidden Markov model with 'hidden' variables dom(h_t) = \{1, \dots, H\}, t = 1, \dots, T. The 'visible' variables v_t can be either discrete or continuous.
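To make the joint distribution concrete, here is a minimal Python sketch (assumed, with random stand-in tables) of ancestral sampling from a stationary HMM with discrete hidden and visible states:

```python
# Ancestral sampling from p(h_{1:T}, v_{1:T}) = p(h_1) p(v_1|h_1) prod_t p(h_t|h_{t-1}) p(v_t|h_t).
import numpy as np

rng = np.random.default_rng(0)
H, V, T = 3, 4, 10
p_h1 = np.full(H, 1.0 / H)                      # p(h_1)
trans = rng.dirichlet(np.ones(H), size=H).T     # trans[h', h] = p(h_t = h' | h_{t-1} = h)
emit = rng.dirichlet(np.ones(V), size=H).T      # emit[v, h]   = p(v_t = v  | h_t = h)

h = rng.choice(H, p=p_h1)
hs, vs = [h], [rng.choice(V, p=emit[:, h])]
for t in range(1, T):
    h = rng.choice(H, p=trans[:, h])            # h_t ~ p(h_t | h_{t-1})
    hs.append(h)
    vs.append(rng.choice(V, p=emit[:, h]))      # v_t ~ p(v_t | h_t)

print(hs)
print(vs)
```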

The classical inference problems

Filtering (inferring the present): p(h_t|v_{1:t})
Prediction (inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): \arg\max_{h_{1:T}} p(h_{1:T}|v_{1:T})

For prediction one is also often interested in p(v_t|v_{1:s}) for t > s.
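As an example of how cheap these computations are, here is a minimal sketch (assumed, not from the slides) of the forward (filtering) recursion, which also yields the likelihood; smoothing and Viterbi follow the same pattern with a backward pass.

```python
# Forward recursion for a discrete HMM: alpha_t(h) is proportional to p(h_t = h, v_{1:t});
# normalising gives the filtered posterior p(h_t | v_{1:t}) and accumulates log p(v_{1:T}).
import numpy as np

def forward(obs, p_h1, trans, emit):
    # trans[h', h] = p(h_t = h' | h_{t-1} = h);  emit[v, h] = p(v_t = v | h_t = h)
    alpha = p_h1 * emit[obs[0]]
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    filtered = [alpha]
    for v in obs[1:]:
        alpha = emit[v] * (trans @ alpha)       # predict with the transition, correct with the emission
        loglik += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
        filtered.append(alpha)
    return np.array(filtered), loglik

p_h1 = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.4],
                  [0.3, 0.6]])
emit = np.array([[0.9, 0.2],
                 [0.1, 0.8]])
post, ll = forward([0, 0, 1], p_h1, trans, emit)
print(post[-1], ll)     # filtered p(h_3 | v_{1:3}) and log likelihood log p(v_{1:3})
```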

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).

The algorithms are variants of 'message passing on factor graphs'.

The algorithms are guaranteed to work if the graph is singly-connected.

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).

HMMs for speech recognition

h_t is the phoneme at time t; p(h_t|h_{t-1}) – language model; p(v_t|h_t) – speech signal model.

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speech recognition.

The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, \mu(h_t; \theta).

This function is a deep neural network trained on a large amount of data.

Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
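A minimal sketch of such an emission model (assumed; the weights below are random stand-ins for a trained network, and the architecture is illustrative only):

```python
# Emission p(v_t | h_t) = N(v_t; mu(h_t; theta), sigma^2 I), with mu given by a small
# neural network applied to a one-hot encoding of the phoneme.
import numpy as np

H, D, hidden = 10, 5, 32                  # number of phonemes, signal dimension, hidden units
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(hidden, H)), np.zeros(hidden)
W2, b2 = rng.normal(size=(D, hidden)), np.zeros(D)
sigma2 = 0.1

def mu(h):
    one_hot = np.eye(H)[h]
    return W2 @ np.tanh(W1 @ one_hot + b1) + b2

def log_emission(v, h):
    d = v - mu(h)
    return -0.5 * (d @ d) / sigma2 - 0.5 * D * np.log(2 * np.pi * sigma2)

v = rng.normal(size=D)                    # a dummy frame of the acoustic representation
print([log_emission(v, h) for h in range(H)])   # scores that plug into the HMM recursions
```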

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructed on the basis of a low dimensional representation.

Note that this is a Graphical Model, not a Function.

The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.
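A minimal sketch of this ancestral sampling (assumed; the 'decoder' here is a random linear map into Bernoulli pixel probabilities, standing in for a trained model):

```python
# Generate an image by sampling h ~ p(h) = N(0, I) and then v ~ p(v | h).
import numpy as np

rng = np.random.default_rng(1)
latent_dim, n_pixels = 2, 64
W, b = rng.normal(size=(n_pixels, latent_dim)), np.zeros(n_pixels)

def sample_image():
    h = rng.normal(size=latent_dim)               # h ~ p(h)
    mean = 1.0 / (1.0 + np.exp(-(W @ h + b)))     # p(v_i = 1 | h)
    return (rng.random(n_pixels) < mean).astype(int)

print(sample_image().reshape(8, 8))               # one fantasised 8x8 binary 'image'
```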

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in these models.

Statisticians typically use sampling as an approximation.

Very popular in ML to use a variational method – much faster for inference.

Variational Inference

Consider a distribution

p(v|\theta) = \int_h p(v|h, \theta) p(h)

and that we wish to learn \theta to maximise the probability this model generates observed data. The log likelihood is lower bounded by

\log p(v|\theta) \geq -\int_h q(h|v, \phi) \log q(h|v, \phi) + \int_h q(h|v, \phi) \log p(v|h, \theta) p(h)

The idea is to choose a 'variational' distribution q(h|v, \phi) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach – see the 'variational autoencoder' and also attention mechanisms.

Extension to semi-supervised methods using p(v) = \int_h \sum_c p(v|h, c) p(c) p(h)
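To make the bound concrete, here is a minimal Python sketch (assumed, untrained, single Monte Carlo sample) of estimating it with a Gaussian q(h|v, \phi) produced by an 'encoder' and a Bernoulli p(v|h, \theta) produced by a 'decoder', in the spirit of the variational autoencoder:

```python
# Single-sample estimate of the lower bound on log p(v|theta):
#   log p(v|h, theta) + log p(h) - log q(h|v, phi),  with h drawn from q via reparameterisation.
import numpy as np

rng = np.random.default_rng(0)
latent_dim, n_pixels = 2, 64
W_dec = rng.normal(size=(n_pixels, latent_dim))        # decoder parameters theta (random stand-ins)
W_m = rng.normal(size=(latent_dim, n_pixels))          # encoder mean parameters phi
W_s = rng.normal(size=(latent_dim, n_pixels)) * 0.01   # encoder log-std parameters phi

def elbo(v):
    m, log_s = W_m @ v, W_s @ v
    h = m + np.exp(log_s) * rng.normal(size=latent_dim)          # h ~ q(h|v, phi), reparameterised
    logits = W_dec @ h
    log_p_v_h = np.sum(v * logits - np.log1p(np.exp(logits)))    # Bernoulli log p(v|h, theta)
    log_p_h = -0.5 * np.sum(h ** 2) - 0.5 * latent_dim * np.log(2 * np.pi)
    log_q = (-0.5 * np.sum(((h - m) / np.exp(log_s)) ** 2)
             - np.sum(log_s) - 0.5 * latent_dim * np.log(2 * np.pi))
    return log_p_v_h + log_p_h - log_q

v = (rng.random(n_pixels) < 0.5).astype(float)    # a dummy binary 'image'
print(elbo(v))   # in training, this estimate is maximised jointly over phi and theta
```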

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long term goals.

The problem is that the number of pixel states is enormous.

We need to learn a low dimensional representation of the screen (use a deep generative model).

Then learn which action to take given the low dimensional representation.
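The 'learn which action to take' step can be illustrated with a minimal tabular Q-learning sketch on a toy one-dimensional world (assumed; deep RL replaces the table with a neural network acting on the learned screen representation):

```python
# Tabular Q-learning: learn Q(s, a), the long term value of taking action a in state s,
# then act greedily with respect to it.
import numpy as np

n_states, n_actions = 5, 2            # states 0..4; actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

def step(s, a):
    s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    reward = 1.0 if s2 == n_states - 1 else 0.0    # reward for reaching the right-hand end
    return s2, reward

for episode in range(500):
    s = 0
    for _ in range(20):
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r = step(s, a)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])   # Q-learning update
        s = s2

print(np.argmax(Q, axis=1))    # learned policy: prefers moving right in every state
```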

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing.

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning.

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 34: David Barber - Deep Nets, Bayes and the story of AI

Recursive Nets and EmbeddingsTraining

We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree

The weights of this classifier are shared across all nodes

At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings

The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier

We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy

Prediction

For a new movie review the review is first parsed using a standard grammartree parser

This forms the tree which can be used to recursively form the sentiment classlabel for the review

Currently the best sentiment classifier Socher (2013)

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 35: David Barber - Deep Nets, Bayes and the story of AI

Recursive Nets and Embeddingsotilde otilde

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

otilde

otilde

eth

middotshy

otilde

eth

plusmnsup2raquo

otilde

eth

plusmnordm

otilde

otilde

eth

notcedilraquo

otilde

otilde

eth

sup3plusmnshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

yen

eth

eth

Icircplusmnsup1raquoreg

eth

Uumlplusmnfrac14sup1raquoreg

yen

yen

eth

middotshy

yen

eth

plusmnsup2raquo

yen

eth

plusmnordm

yen

yen

eth

notcedilraquo

yen

yen

yen

acuteraquoiquestshynot

otilde

frac12plusmnsup3degraquoacuteacutemiddotsup2sup1

eth

ordfiquestregmiddotiquestnotmiddotplusmnsup2shy

eth

eth

plusmnsup2

eth

eth

notcedilmiddotshy

eth

notcedilraquosup3raquo

eth

ograve

otilde

eth

times

otilde

otilde

otilde

acutemiddotmicroraquofrac14

eth

eth

eth

raquoordfraquoregsect

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

times

yen

yen

eth

eth

frac14middotfrac14

eth

sup2ugravenot

eth

eth

acutemiddotmicroraquo

eth

eth

eth

iquest

eth

eth

shymiddotsup2sup1acuteraquo

eth

sup3middotsup2laquonotraquo

eth

eth

plusmnordm

eth

eth

notcedilmiddotshy

eth

eth

ograve

yen

eth

timesnot

yen

yen

eth

eth

ugraveshy

eth

paralaquoshynot

yen

otilde

middotsup2frac12regraquofrac14middotfrac34acutesect

yen yen

frac14laquoacuteacute

eth

ograve

eth

eth

timesnot

eth

eth

eth

eth

eth

ugraveshy

otilde

yen

sup2plusmnnot

yen yen

frac14laquoacuteacute

eth

ograve

Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve

Recurrent Nets

x1 x2 x3

h1 h2 h3

y1 y2 y3

A A A

C C C

B B

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt

In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram

I omitted the potential links from xtminus1 ytminus1 to ht

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

'neural' representation

(Figure: time–channel representation of the signal, roughly 25 channels over 80 frames.)

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho – phoneme (letter); aud – audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems, such as the Ising Model (1920), and in AI applications, such as the HMM (Baum 1966; Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.)

Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects can interact, probability is a non-starter

For this reason computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however these are typically frowned upon by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact with each other

Graphical Models are then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph

The computational complexity of operations can often be related to the structure of the graph

Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable

Uses in Industry

Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis

Google, Microsoft, Facebook: used in many places, including advertising, video game prediction and speech recognition

Used to estimate the inherent desirability of products in consumer retail

Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company–user relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

p(x|y) ≡ p(x, y) / p(y) = p(y|x) p(x) / p(y)   (Bayes' rule)

Throwing darts

p(region 5 | not region 20) = p(region 5, not region 20) / p(not region 20)
                            = p(region 5) / p(not region 20)
                            = (1/20) / (19/20)
                            = 1/19
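A quick numerical illustration of this calculation (my own addition, assuming a fair board with 20 equally likely regions):

    # Fair dartboard: 20 equally likely regions
    p = {region: 1/20 for region in range(1, 21)}

    p_not_20 = sum(v for r, v in p.items() if r != 20)
    p_5_and_not_20 = p[5]                 # region 5 is automatically 'not region 20'
    print(p_5_and_not_20 / p_not_20)      # 0.0526... = 1/19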

Interpretation

p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation should be 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.

Battleships

Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each

Can be placed anywhere on the 10×10 grid, but cannot overlap

Let s1 be the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query 'hit' or 'miss' responses

p(s1, s2|D) = p(D|s1, s2) p(s1, s2) / p(D)

Let X be the matrix of pixel occupancy. Then

p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2) p(s1, s2|D)

demoBattleships.m
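The demo referred to is a MATLAB script; as a rough stand-in (my own sketch, with made-up hit/miss observations, not the original demoBattleships.m), the same posterior pixel-occupancy calculation can be written in a few lines of Python by enumerating all legal ship placements:

    import numpy as np

    N, L = 10, 5

    def cells(origin, vertical):
        r, c = origin
        return {(r + i, c) for i in range(L)} if vertical else {(r, c + i) for i in range(L)}

    # All legal, non-overlapping placements (uniform prior over s1, s2)
    placements = []
    for s1 in [(r, c) for r in range(N - L + 1) for c in range(N)]:
        for s2 in [(r, c) for r in range(N) for c in range(N - L + 1)]:
            c1, c2 = cells(s1, True), cells(s2, False)
            if not c1 & c2:
                placements.append(c1 | c2)

    # Hypothetical data D: queried pixel -> hit (True) / miss (False)
    D = {(4, 4): True, (0, 0): False, (7, 2): False}

    # Posterior is uniform over the placements consistent with D
    consistent = [occ for occ in placements if all((q in occ) == hit for q, hit in D.items())]

    # Marginal pixel occupancy p(X_ij = 1 | D)
    pX = np.zeros((N, N))
    for occ in consistent:
        for r, c in occ:
            pX[r, c] += 1
    print(np.round(pX / len(consistent), 2))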

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditional probabilities:

p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|C) p(E|B, C)

(Figure: DAG with parents A, B → C, C → D and B, C → E.)

Example – Part I

Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes

Choosing an ordering

Without loss of generality, we can write

p(A, R, E, B) = p(A|R, E, B) p(R, E, B)
             = p(A|R, E, B) p(R|E, B) p(E, B)
             = p(A|R, E, B) p(R|E, B) p(E|B) p(B)

Assumptions

The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B)
The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E)
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E)

Therefore

p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)

Example – Part II: Specifying the Tables

(Figure: DAG with edges B → A, E → A and E → R.)

p(A = 1|B, E):

Alarm = 1   Burglar   Earthquake
0.9999      1         1
0.99        1         0
0.99        0         1
0.0001      0         0

p(R = 1|E):

Radio = 1   Earthquake
1           1
0           0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution

Example – Part III: Inference

Initial Evidence: The alarm is sounding

p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
             = Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E) ≈ 0.99

Additional Evidence: The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake
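Both numbers can be reproduced by brute-force enumeration of the joint distribution; the following short Python sketch (my own addition, using the tables above) does exactly that:

    from itertools import product

    pB = {1: 0.01, 0: 0.99}
    pE = {1: 0.000001, 0: 0.999999}
    pA1 = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}  # p(A=1 | B, E)
    pR1 = {1: 1.0, 0: 0.0}                                              # p(R=1 | E)

    def joint(a, r, e, b):
        pa = pA1[(b, e)] if a == 1 else 1.0 - pA1[(b, e)]
        pr = pR1[e] if r == 1 else 1.0 - pR1[e]
        return pa * pr * pE[e] * pB[b]

    def p_burglar_given(evidence):
        num = den = 0.0
        for a, r, e, b in product([0, 1], repeat=4):
            states = {'A': a, 'R': r}
            if any(states[k] != v for k, v in evidence.items()):
                continue
            p = joint(a, r, e, b)
            den += p
            if b == 1:
                num += p
        return num / den

    print(p_burglar_given({'A': 1}))            # approximately 0.99
    print(p_burglar_given({'A': 1, 'R': 1}))    # approximately 0.01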

Markov Models

For timeseries data v1, . . . , vT we need a model p(v1:T ). For causal consistency it is meaningful to consider the decomposition

p(v1:T ) = ∏_{t=1}^{T} p(vt|v1:t−1)

with the convention p(vt|v1:t−1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptions

It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1, . . . , vt−1) = p(vt|vt−L, . . . , vt−1)

where L ≥ 1 is the order of the Markov chain

p(v1:T ) = p(v1) p(v2|v1) p(v3|v2) · · · p(vT |vT−1)

For a stationary Markov chain the transitions p(vt = s′|vt−1 = s) = f(s′, s) are time-independent ('homogeneous')

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1, . . . , vT ) = p(v1) ∏_{t=2}^{T} p(vt|vt−1), where p(v1) is the initial distribution and p(vt|vt−1) the transition

State transition diagram

Nodes represent states of the variable v, and arcs non-zero elements of the transition p(vt|vt−1)

(Figure: state transition diagram on states 1–9.)

Most probable and shortest paths

(Figure: the same transition diagram on states 1–9.)

The shortest (unweighted) path from state 1 to state 7 is 1 − 2 − 7

The most probable path from state 1 to state 7 is 1 − 8 − 9 − 7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1 − 2 − 7 the probability of exiting state 2 into state 7 is 1/5

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) = Σ_j p(xt = i|xt−1 = j) p(xt−1 = j),   with Mij ≡ p(xt = i|xt−1 = j)

p(xt = i) is the frequency that we visit state i at time t, given we started from p(x1) and randomly drew samples from the transition p(xτ|xτ−1). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p1(i) is

pt = M^{t−1} p1

If, for t → ∞, p∞ is independent of the initial distribution p1, then p∞ is called the equilibrium distribution of the chain:

p∞ = M p∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix
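As a small illustration (my own sketch, with a made-up 3-state transition matrix), the equilibrium distribution can be found either by repeatedly applying M or from the eigenvector with unit eigenvalue:

    import numpy as np

    # Columns sum to 1: M[i, j] = p(x_t = i | x_{t-1} = j)
    M = np.array([[0.5, 0.2, 0.1],
                  [0.3, 0.6, 0.4],
                  [0.2, 0.2, 0.5]])

    # Power iteration: p_t = M^{t-1} p_1
    p = np.array([1.0, 0.0, 0.0])
    for _ in range(100):
        p = M @ p
    print(p)                                        # equilibrium distribution

    # Eigenvector with eigenvalue 1, normalised to sum to 1
    vals, vecs = np.linalg.eig(M)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    print(v / v.sum())                              # same distribution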

PageRank

Define the matrix

Aij = 1 if website j has a hyperlink to website i, and 0 otherwise

From this we can define a Markov transition matrix with elements

Mij = Aij / Σ_{i′} Ai′j

If we jump from website to website, the equilibrium distribution component p∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i

For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain that word is then returned, ranked according to the importance of the site
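A minimal sketch of this (my own, on a tiny made-up link graph, and omitting the damping factor used in practice): build M from A by column normalisation and rank sites by the equilibrium distribution.

    import numpy as np

    # A[i, j] = 1 if site j links to site i (4 hypothetical sites)
    A = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 1],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    M = A / A.sum(axis=0, keepdims=True)    # M_ij = A_ij / sum_i' A_i'j

    p = np.full(4, 0.25)
    for _ in range(200):                    # power iteration to the equilibrium distribution
        p = M @ p
    print(np.argsort(-p), np.round(p, 3))   # sites ranked by 'importance' p_inf(i)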

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h1:T. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(vt|ht). This defines a joint distribution

p(h1:T, v1:T ) = p(v1|h1) p(h1) ∏_{t=2}^{T} p(vt|ht) p(ht|ht−1)

For a stationary HMM the transition p(ht|ht−1) and emission p(vt|ht) distributions are constant through time

(Figure: a first order hidden Markov model with 'hidden' variables dom(ht) = {1, . . . , H}, t = 1 : T. The 'visible' variables vt can be either discrete or continuous.)

The classical inference problems

Filtering (inferring the present): p(ht|v1:t)
Prediction (inferring the future): p(ht|v1:s), t > s
Smoothing (inferring the past): p(ht|v1:u), t < u
Likelihood: p(v1:T )
Most likely path (Viterbi alignment): argmax_{h1:T} p(h1:T |v1:T )

For prediction one is also often interested in p(vt|v1:s) for t > s
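To make filtering concrete, here is a short Python sketch of my own (with small made-up transition and emission tables) of the standard forward recursion, which computes p(ht|v1:t) by alternating a prediction step with a correction by the emission:

    import numpy as np

    trans = np.array([[0.7, 0.2, 0.1],     # trans[i, j] = p(h_t = i | h_{t-1} = j)
                      [0.2, 0.6, 0.3],
                      [0.1, 0.2, 0.6]])
    emit = np.array([[0.9, 0.3, 0.5],      # emit[v, h] = p(v_t = v | h_t = h)
                     [0.1, 0.7, 0.5]])
    prior = np.array([1/3, 1/3, 1/3])      # p(h_1)

    def filtering(observations):
        alpha = emit[observations[0]] * prior
        alpha /= alpha.sum()               # p(h_1 | v_1)
        out = [alpha]
        for v in observations[1:]:
            alpha = emit[v] * (trans @ alpha)   # predict with the transition, correct with the emission
            alpha /= alpha.sum()                # p(h_t | v_{1:t})
            out.append(alpha)
        return out

    for t, a in enumerate(filtering([0, 0, 1, 1, 0]), start=1):
        print(t, np.round(a, 3))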

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states)

The algorithms are variants of 'message passing on factor graphs'

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes)

HMMs for speech recognition

ht is the phoneme at time t; p(ht|ht−1) – language model; p(vt|ht) – speech signal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speech recognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is some function of the phoneme, μ(ht, θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructed on the basis of a low dimensional representation

Note that this is a Graphical Model, not a Function

The latent variables h can be sampled from using p(h) and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in these models

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method – much faster for inference

Variational Inference

Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and that we wish to learn θ to maximise the probability this model generates observed data:

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) + const

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach – see 'variational autoencoder' and also attention mechanisms

Extension to semi-supervised method using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h)
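As a rough numerical illustration of the bound (my own sketch, for a toy model with a standard normal prior p(h), Gaussian likelihood p(v|h, θ) = N(v; θh, 1) and a Gaussian q(h|v, φ); none of this is from the talk), the bound can be estimated by sampling from q:

    import numpy as np

    rng = np.random.default_rng(0)

    def elbo(v, theta, q_mean, q_logvar, n_samples=100000):
        # Monte Carlo estimate of E_q[ log p(v|h,theta) + log p(h) - log q(h|v) ]
        std = np.exp(0.5 * q_logvar)
        h = q_mean + std * rng.standard_normal(n_samples)
        log_lik = -0.5 * (v - theta * h) ** 2 - 0.5 * np.log(2 * np.pi)   # log N(v; theta*h, 1)
        log_prior = -0.5 * h ** 2 - 0.5 * np.log(2 * np.pi)               # log N(h; 0, 1)
        log_q = -0.5 * ((h - q_mean) / std) ** 2 - 0.5 * np.log(2 * np.pi) - 0.5 * q_logvar
        return np.mean(log_lik + log_prior - log_q)

    v, theta = 2.0, 1.5
    print(elbo(v, theta, q_mean=0.9, q_logvar=-1.2))   # lower-bounds log p(v|theta)

Maximising this estimate jointly over (q_mean, q_logvar) and θ is the 'jointly maximise the bound w.r.t. φ and θ' step above; in a variational autoencoder the q parameters are themselves produced by a deep network.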

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long term goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deep generative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io


Recurrent Nets

(Figure: an RNN unrolled through time, with inputs x1, x2, x3, hidden units h1, h2, h3 and outputs y1, y2, y3; the weight matrices A, B, C are shared across time steps, with B connecting ht−1 to ht.)

RNNs are used in timeseries applications

The basic idea is that the hidden units at time ht (and possibly output yt) depend on the previous state of the network ht−1, xt−1, yt−1, for inputs xt and outputs yt

In the above network I 'unrolled the net through time' to give a standard NN diagram

I omitted the potential links from xt−1, yt−1 to ht
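A minimal sketch of the unrolled computation (my own illustration, using the letters A, B, C loosely for the three shared weight matrices and omitting the links from xt−1, yt−1 to ht): ht = tanh(A xt + B ht−1), yt = C ht.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid, n_out, T = 4, 8, 2, 5

    A = 0.1 * rng.standard_normal((n_hid, n_in))    # input  -> hidden, shared over time
    B = 0.1 * rng.standard_normal((n_hid, n_hid))   # hidden -> hidden, shared over time
    C = 0.1 * rng.standard_normal((n_out, n_hid))   # hidden -> output, shared over time

    def rnn_forward(xs):
        h = np.zeros(n_hid)
        ys = []
        for x in xs:                       # unroll through time
            h = np.tanh(A @ x + B @ h)     # h_t depends on x_t and h_{t-1}
            ys.append(C @ h)
        return ys

    xs = [rng.standard_normal(n_in) for _ in range(T)]
    for t, y in enumerate(rnn_forward(xs), start=1):
        print(t, np.round(y, 3))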

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples. Top line is real handwriting, for comparison. See Alex Graves' work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machine accuracy) for the gradient

gi(x) ≡ ∂f/∂xi, evaluated at x

Note that this is not the same as a numerical approximation (such as central differences) for the gradient

One can show that, if done efficiently, one can always calculate the gradient in less than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 37: David Barber - Deep Nets, Bayes and the story of AI

Handwriting Generation using a RNN

Some training examples

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 38: David Barber - Deep Nets, Bayes and the story of AI

Handwriting Generation using a RNN

Some generated examples Top line is real handwriting for comparison See AlexGraversquos work

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.)

Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects can interact, probability is a non-starter

For this reason, computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however, these are typically frowned upon by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact with each other

Graphical Models are then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph

The computational complexity of operations can often be related to the structure of the graph

Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable

Uses in Industry

Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis

Google, Microsoft, Facebook: used in many places, including advertising, video game prediction and speech recognition

Used to estimate the inherent desirability of products in consumer retail

Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company/user relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

p(x|y) \equiv \frac{p(x, y)}{p(y)} = \frac{p(y|x)\,p(x)}{p(y)} \qquad \text{(Bayes' rule)}

Throwing darts

p(\text{region 5} \,|\, \text{not region 20}) = \frac{p(\text{region 5}, \text{not region 20})}{p(\text{not region 20})} = \frac{p(\text{region 5})}{p(\text{not region 20})} = \frac{1/20}{19/20} = \frac{1}{19}

Interpretation

p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation should be 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.

Battleships

Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each

They can be placed anywhere on the 10×10 grid, but cannot overlap

Let s1 be the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query 'hit' or 'miss' responses

p(s_1, s_2 | D) = \frac{p(D | s_1, s_2)\, p(s_1, s_2)}{p(D)}

Let X be the matrix of pixel occupancy:

p(X | D) = \sum_{s_1, s_2} p(X, s_1, s_2 | D) = \sum_{s_1, s_2} p(X | s_1, s_2)\, p(s_1, s_2 | D)

demoBattleships.m
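A minimal sketch of this calculation (not the demoBattleships.m demo itself, whose code is not shown here) might look as follows in Python, assuming a uniform prior over non-overlapping placements and noise-free hit/miss responses:

import numpy as np

L = 5       # ship length
GRID = 10   # grid side

def placements(vertical):
    # All occupancy masks for a length-5 ship, vertical or horizontal.
    masks = []
    rows = range(GRID - L + 1) if vertical else range(GRID)
    cols = range(GRID) if vertical else range(GRID - L + 1)
    for r in rows:
        for c in cols:
            m = np.zeros((GRID, GRID), dtype=bool)
            if vertical:
                m[r:r + L, c] = True
            else:
                m[r, c:c + L] = True
            masks.append(m)
    return masks

def posterior_occupancy(data):
    # data: list of ((row, col), hit) query responses; returns p(X_ij occupied | D).
    post = np.zeros((GRID, GRID))
    weight = 0.0
    for m1 in placements(vertical=True):
        for m2 in placements(vertical=False):
            if (m1 & m2).any():                         # ships cannot overlap
                continue
            occ = m1 | m2
            if all(occ[q] == hit for q, hit in data):   # p(D|s1,s2) is 1 or 0 here
                post += occ
                weight += 1.0
    return post / weight

p_occ = posterior_occupancy([((4, 4), True), ((0, 0), False)])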

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents.

The joint distribution is obtained by taking the product of the conditional probabilities:

p(A, B, C, D, E) = p(A)\, p(B)\, p(C|A, B)\, p(D|C)\, p(E|B, C)

[Figure: the corresponding DAG over the nodes A, B, C, D, E]

Example – Part I

Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.

Choosing an ordering

Without loss of generality, we can write

p(A, R, E, B) = p(A|R, E, B)\, p(R, E, B)

= p(A|R, E, B)\, p(R|E, B)\, p(E, B)

= p(A|R, E, B)\, p(R|E, B)\, p(E|B)\, p(B)

Assumptions

The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B)

The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E)

Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E)

Therefore

p(A, R, E, B) = p(A|E, B)\, p(R|E)\, p(E)\, p(B)

Example – Part II: Specifying the Tables

[Figure: the DAG with edges B → A, E → A and E → R]

p(A|B, E):

Alarm = 1   Burglar   Earthquake
0.9999      1         1
0.99        1         0
0.99        0         1
0.0001      0         0

p(R|E):

Radio = 1   Earthquake
1           1
0           0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.

Example – Part III: Inference

Initial Evidence: the alarm is sounding.

p(B = 1|A = 1) = \frac{\sum_{E,R} p(B = 1, E, A = 1, R)}{\sum_{B,E,R} p(B, E, A = 1, R)} = \frac{\sum_{E,R} p(A = 1|B = 1, E)\, p(B = 1)\, p(E)\, p(R|E)}{\sum_{B,E,R} p(A = 1|B, E)\, p(B)\, p(E)\, p(R|E)} \approx 0.99

Additional Evidence: the radio broadcasts an earthquake warning.

A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
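The two numbers above can be reproduced by brute-force enumeration over the joint p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B); a small sketch using the table values from the slides:

import itertools

p_B = {1: 0.01, 0: 0.99}
p_E = {1: 0.000001, 0: 0.999999}
p_A1_given_BE = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}  # p(A=1|B,E)
p_R1_given_E = {1: 1.0, 0: 0.0}                                               # p(R=1|E)

def joint(a, r, e, b):
    pa = p_A1_given_BE[(b, e)] if a == 1 else 1 - p_A1_given_BE[(b, e)]
    pr = p_R1_given_E[e] if r == 1 else 1 - p_R1_given_E[e]
    return pa * pr * p_E[e] * p_B[b]

def posterior_burglar(evidence):
    # p(B = 1 | evidence), where evidence fixes some of the variables A and R.
    num = den = 0.0
    for a, r, e, b in itertools.product([0, 1], repeat=4):
        if evidence.get("A", a) != a or evidence.get("R", r) != r:
            continue
        p = joint(a, r, e, b)
        den += p
        if b == 1:
            num += p
    return num / den

print(posterior_burglar({"A": 1}))           # approx 0.99
print(posterior_burglar({"A": 1, "R": 1}))   # approx 0.01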

Markov Models

For timeseries data v_1, ..., v_T we need a model p(v_{1:T}). For causal consistency it is meaningful to consider the decomposition

p(v_{1:T}) = \prod_{t=1}^{T} p(v_t | v_{1:t-1})

with the convention p(v_t | v_{1:t-1}) = p(v_1) for t = 1.

v1 v2 v3 v4

Independence assumptions

It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.

Markov Chain

Only the recent past is relevant:

p(v_t | v_1, ..., v_{t-1}) = p(v_t | v_{t-L}, ..., v_{t-1})

where L ≥ 1 is the order of the Markov chain. For a first order chain (L = 1),

p(v_{1:T}) = p(v_1)\, p(v_2|v_1)\, p(v_3|v_2) \cdots p(v_T|v_{T-1})

For a stationary Markov chain the transitions p(v_t = s' | v_{t-1} = s) = f(s', s) are time-independent ('homogeneous').

Figure: (a) First order Markov chain. (b) Second order Markov chain.

Markov Chains

v1 v2 v3 v4

p(v_1, ..., v_T) = \underbrace{p(v_1)}_{\text{initial}} \prod_{t=2}^{T} \underbrace{p(v_t|v_{t-1})}_{\text{transition}}

State transition diagram

Nodes represent states of the variable v, and arcs non-zero elements of the transition p(v_t|v_{t-1}).

[Figure: state transition diagram over states 1–9]

Most probable and shortest paths

[Figure: the same state transition diagram over states 1–9]

The shortest (unweighted) path from state 1 to state 7 is 1−2−7.

The most probable path from state 1 to state 7 is 1−8−9−7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1−2−7 the probability of exiting state 2 into state 7 is 1/5.

Equilibrium distribution

It is interesting to know how the marginal p(x_t) evolves through time:

p(x_t = i) = \sum_j \underbrace{p(x_t = i | x_{t-1} = j)}_{M_{ij}} \; p(x_{t-1} = j)

p(x_t = i) is the frequency that we visit state i at time t, given we started from p(x_1) and randomly drew samples from the transition p(x_\tau | x_{\tau-1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

p_t = M^{t-1} p_1

If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:

p_∞ = M p_∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.

PageRank

Define the matrix

A_{ij} = \begin{cases} 1 & \text{if website } j \text{ has a hyperlink to website } i \\ 0 & \text{otherwise} \end{cases}

From this we can define a Markov transition matrix with elements

M_{ij} = \frac{A_{ij}}{\sum_{i'} A_{i'j}}

If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain the word is then returned, ranked according to the importance of the site.
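As a quick illustration, here is a sketch that builds M from a small link matrix A and finds the equilibrium distribution both by power iteration and as the unit-eigenvalue eigenvector. The 4-site link structure is invented purely for the example.

import numpy as np

# Toy link matrix: A[i, j] = 1 if site j links to site i (made up for illustration).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0, keepdims=True)        # M[i, j] = A[i, j] / sum_i' A[i', j]

# Power iteration: p_t = M^(t-1) p_1 converges to the equilibrium distribution
# p_inf = M p_inf (assuming the chain has a unique equilibrium distribution).
p = np.ones(4) / 4
for _ in range(1000):
    p = M @ p
print(p)                                    # relative 'importance' of each site

# Equivalently, p_inf is proportional to the eigenvector of M with unit eigenvalue.
vals, vecs = np.linalg.eig(M)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
print(v / v.sum())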

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

p(h_{1:T}, v_{1:T}) = p(v_1|h_1)\, p(h_1) \prod_{t=2}^{T} p(v_t|h_t)\, p(h_t|h_{t-1})

For a stationary HMM the transition p(h_t|h_{t-1}) and emission p(v_t|h_t) distributions are constant through time.

Figure: a first order hidden Markov model with 'hidden' variables dom(h_t) = {1, ..., H}, t = 1, ..., T. The 'visible' variables v_t can be either discrete or continuous.

The classical inference problems

Filtering (inferring the present): p(h_t | v_{1:t})

Prediction (inferring the future): p(h_t | v_{1:s}), for t > s

Smoothing (inferring the past): p(h_t | v_{1:u}), for t < u

Likelihood: p(v_{1:T})

Most likely path (Viterbi alignment): \arg\max_{h_{1:T}} p(h_{1:T} | v_{1:T})

For prediction, one is also often interested in p(v_t | v_{1:s}) for t > s.
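As an example of how simple these recursions are, here is a minimal sketch of filtering for a discrete HMM via the standard forward recursion. The two-state transition, emission and initial distributions are made up for illustration; the burglar and stubby-fingers examples use exactly this computation with their own tables.

import numpy as np

trans = np.array([[0.8, 0.2],      # trans[i, j] = p(h_t = i | h_{t-1} = j)
                  [0.2, 0.8]])
emit = np.array([[0.9, 0.3],       # emit[v, i] = p(v_t = v | h_t = i)
                 [0.1, 0.7]])
init = np.array([0.5, 0.5])        # p(h_1)

def filtering(observations):
    # Returns p(h_t | v_{1:t}) for each t, given a list of observed symbol indices.
    alpha = init * emit[observations[0]]      # proportional to p(h_1, v_1)
    alpha = alpha / alpha.sum()
    filtered = [alpha]
    for v in observations[1:]:
        alpha = emit[v] * (trans @ alpha)     # forward recursion
        alpha = alpha / alpha.sum()           # normalise to get p(h_t | v_{1:t})
        filtered.append(alpha)
    return np.array(filtered)

print(filtering([0, 0, 1, 1]))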

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states)

The algorithms are variants of 'message passing on factor graphs'

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes)

HMMs for speech recognition

h_t is the phoneme at time t; p(h_t|h_{t-1}) – language model; p(v_t|h_t) – speech signal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently, companies including Google have made big advances in speech recognition

The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t; θ)

This function is a deep neural network, trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation

Note that this is a Graphical Model, not a Function

The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images
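A minimal sketch of this 'fantasising' (ancestral sampling) step, using a standard Gaussian prior p(h) and, purely for illustration, a random linear map in place of the deep network that would normally parameterise the mean of p(v|h):

import numpy as np

rng = np.random.default_rng(0)
latent_dim, obs_dim = 2, 16
W = rng.normal(size=(obs_dim, latent_dim))   # stand-in 'decoder' parameters
b = rng.normal(size=obs_dim)

def sample_image(noise_std=0.1):
    h = rng.normal(size=latent_dim)                     # h ~ p(h)
    mean = W @ h + b                                    # mean of p(v|h); a deep net in practice
    return mean + noise_std * rng.normal(size=obs_dim)  # v ~ p(v|h)

fantasy = sample_image()   # a newly generated ('fantasised') observation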

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in these models

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method – much faster for inference

Variational Inference

Consider a distribution

p(v|\theta) = \int_h p(v|h, \theta)\, p(h)

and suppose that we wish to learn θ to maximise the probability that this model generates the observed data. Then

\log p(v|\theta) \;\ge\; -\int_h q(h|v, \phi) \log q(h|v, \phi) \;+\; \int_h q(h|v, \phi) \log \big[ p(v|h, \theta)\, p(h) \big]
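This is the standard variational lower bound; for completeness, it follows in one line from Jensen's inequality (for any q(h|v, φ) with appropriate support):

\log p(v|\theta) = \log \int_h q(h|v,\phi)\, \frac{p(v|h,\theta)\, p(h)}{q(h|v,\phi)}
\;\ge\; \int_h q(h|v,\phi) \log \frac{p(v|h,\theta)\, p(h)}{q(h|v,\phi)}
= \mathbb{E}_{q}\big[\log p(v|h,\theta)\big] - \mathrm{KL}\big(q(h|v,\phi)\,\big\|\,p(h)\big)

with equality when q(h|v, φ) = p(h|v, θ); the gap in the bound is KL(q(h|v, φ) || p(h|v, θ)).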

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently

We then jointly maximise the bound w.r.t. φ and θ

We can parameterise p(v|h, θ) using a deep network

Very popular approach – see the 'variational autoencoder' and also attention mechanisms

Extension to a semi-supervised method using p(v) = \int_h \sum_c p(v|h, c)\, p(c)\, p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long term goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deep generative model)

Then learn which action to take given the low dimensional representation (a toy sketch of this action-learning step follows)
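In deep RL the action-value function is represented by a neural network (as in DQN), but the underlying idea can be sketched with tabular Q-learning on a tiny made-up MDP, standing in for the low dimensional representation of the screen:

import numpy as np

# Hypothetical toy MDP: 5 states in a row, actions 0 = left, 1 = right,
# reward 1 for reaching the right-most state. Purely illustrative.
rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for episode in range(500):
    s = 0
    for t in range(50):
        # epsilon-greedy action selection
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r = step(s, a)
        # Q-learning update towards the one-step bootstrap target
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2
        if r > 0:
            break

policy = Q.argmax(axis=1)   # greedy action per state; here it learns to always go right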

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state-of-the-art results in Speech Recognition, Image Analysis and Game Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 39: David Barber - Deep Nets, Bayes and the story of AI

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 40: David Barber - Deep Nets, Bayes and the story of AI

Reasons research in deep learning has exploded

Much greater compute power (GPU)

Much larger datasets

AutoDiff

What is AutoDiff

AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient

gi(x) equivpart

partxif

∥∥∥∥x

Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient

One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 41: David Barber - Deep Nets, Bayes and the story of AI

Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges

df

dx=partf

partx+partf

partg

dg

dx

x

f

gpartfpartx

dgdx

partfpartg

Example

For f(x) = x2 + xgh where g =x2 and h = xg2

x

f

gh2x+ gh

2x

xh

2gx

xg

g2

f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7

Reverse DifferentiationConsider

f(x1 x2) = cos (sin(x1x2))

We can represent this computationally using an Abstract Syntax Tree (AST)

x1 x2

f1

f2

f3

f1(x1 x2) = x1x2

f2(x) = sin(x)

f3(x) = cos(x)

Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value

Reverse Differentiation

x1 x2

f1

f2

f3

df3dx1

=partf3partf2

df2dx1

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx1

Similarly

df3dx2

=partf3partf2

df2df1︸ ︷︷ ︸

df3df1

df1dx2

The two derivatives share the same computation branch andwe want to exploit this

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each.

They can be placed anywhere on the 10×10 grid, but cannot overlap.

Let s1 be the origin of ship 1 and s2 the origin of ship 2.

The data D is a collection of query 'hit' or 'miss' responses.

p(s1, s2|D) = p(D|s1, s2) p(s1, s2) / p(D)

Let X be the matrix of pixel occupancy:

p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2) p(s1, s2|D)

demoBattleships.m
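A minimal sketch of this posterior computation (not the demoBattleships.m code itself), assuming a uniform prior over non-overlapping placements and noise-free hit/miss answers; the queried pixels in D below are invented for illustration:

```python
import numpy as np

GRID, LENGTH = 10, 5

def occupancy(s, vertical):
    """Binary 10x10 occupancy map for a ship with origin s = (row, col)."""
    X = np.zeros((GRID, GRID), dtype=bool)
    r, c = s
    if vertical:
        X[r:r + LENGTH, c] = True
    else:
        X[r, c:c + LENGTH] = True
    return X

# all legal origins for the vertical (ship 1) and horizontal (ship 2) ship
origins1 = [(r, c) for r in range(GRID - LENGTH + 1) for c in range(GRID)]
origins2 = [(r, c) for r in range(GRID) for c in range(GRID - LENGTH + 1)]

# observed data D: queried pixel -> hit (True) / miss (False); invented example
D = {(0, 0): False, (4, 5): True}

posterior_map = np.zeros((GRID, GRID))   # accumulates sum_{s1,s2} p(X|s1,s2) p(s1,s2|D)
weight = 0.0
for s1 in origins1:
    X1 = occupancy(s1, vertical=True)
    for s2 in origins2:
        X2 = occupancy(s2, vertical=False)
        if (X1 & X2).any():              # ships may not overlap: p(s1, s2) = 0
            continue
        X = X1 | X2
        if any(X[q] != hit for q, hit in D.items()):
            continue                     # p(D|s1, s2) = 0 for inconsistent placements
        posterior_map += X               # uniform prior: each consistent pair has equal weight
        weight += 1.0

posterior_map /= weight                  # p(pixel occupied | D)
print(posterior_map.round(2))
```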

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents.

The joint distribution is obtained by taking the product of the conditional probabilities:

p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|C) p(E|B, C)

[Figure: the corresponding DAG, with edges A→C, B→C, C→D, B→E, C→E]

Example – Part I: Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.

Choosing an ordering: Without loss of generality, we can write

p(A, R, E, B) = p(A|R, E, B) p(R, E, B)
             = p(A|R, E, B) p(R|E, B) p(E, B)
             = p(A|R, E, B) p(R|E, B) p(E|B) p(B)

Assumptions:

The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B).
The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E).
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E).

Therefore

p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)

Example – Part II: Specifying the Tables

[Figure: DAG with edges B→A, E→A, E→R]

p(A = 1|B, E):

Burglar  Earthquake  p(Alarm = 1)
1        1           0.9999
1        0           0.99
0        1           0.99
0        0           0.0001

p(R = 1|E):

Earthquake  p(Radio = 1)
1           1
0           0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.

Example – Part III: Inference

Initial Evidence: the alarm is sounding.

p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
             = Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E) ≈ 0.99

Additional Evidence: the radio broadcasts an earthquake warning.

A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01.

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
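These two numbers can be reproduced by brute-force enumeration of the joint p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B) using the tables above; a minimal sketch:

```python
from itertools import product

pB = {1: 0.01, 0: 0.99}                     # p(B)
pE = {1: 0.000001, 0: 0.999999}             # p(E)
pA = {(1, 1): 0.9999, (1, 0): 0.99,         # p(A=1 | B, E)
      (0, 1): 0.99, (0, 0): 0.0001}
pR = {1: 1.0, 0: 0.0}                       # p(R=1 | E)

def joint(a, r, e, b):
    """p(A=a, R=r, E=e, B=b) = p(A|E,B) p(R|E) p(E) p(B)."""
    pa = pA[(b, e)] if a == 1 else 1 - pA[(b, e)]
    pr = pR[e] if r == 1 else 1 - pR[e]
    return pa * pr * pE[e] * pB[b]

def posterior_burglar(evidence):
    """p(B=1 | evidence), where evidence maps 'A'/'R' to observed values."""
    num = den = 0.0
    for a, r, e, b in product([0, 1], repeat=4):
        if any(val != {'A': a, 'R': r}[var] for var, val in evidence.items()):
            continue                         # skip states inconsistent with the evidence
        p = joint(a, r, e, b)
        den += p
        num += p * (b == 1)
    return num / den

print(posterior_burglar({'A': 1}))            # approx 0.99
print(posterior_burglar({'A': 1, 'R': 1}))    # approx 0.01
```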

Markov Models

For timeseries data v1, . . . , vT we need a model p(v1:T). For causal consistency, it is meaningful to consider the decomposition

p(v1:T) = ∏_{t=1}^{T} p(vt | v1:t−1)

with the convention p(vt | v1:t−1) = p(v1) for t = 1.

[Figure: belief network over v1, v2, v3, v4]

Independence assumptions: It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.

Markov Chain

Only the recent past is relevant:

p(vt | v1, . . . , vt−1) = p(vt | vt−L, . . . , vt−1)

where L ≥ 1 is the order of the Markov chain. For a first order chain (L = 1),

p(v1:T) = p(v1) p(v2|v1) p(v3|v2) · · · p(vT |vT−1)

For a stationary Markov chain the transitions p(vt = s′ | vt−1 = s) = f(s′, s) are time-independent ('homogeneous').

[Figure: (a) First order Markov chain; (b) Second order Markov chain]
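A stationary first order chain is fully specified by p(v1) and the transition matrix; a minimal sampling sketch (the two-state transition matrix below is an invented example):

```python
import numpy as np

rng = np.random.default_rng(0)

p1 = np.array([0.5, 0.5])            # initial distribution p(v1)
M = np.array([[0.9, 0.2],            # M[i, j] = p(v_t = i | v_{t-1} = j); columns sum to 1
              [0.1, 0.8]])

def sample_chain(T):
    v = [rng.choice(2, p=p1)]
    for _ in range(T - 1):
        v.append(rng.choice(2, p=M[:, v[-1]]))   # draw v_t given v_{t-1}
    return v

print(sample_chain(10))
```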

Markov Chains

[Figure: first order Markov chain v1 → v2 → v3 → v4]

p(v1, . . . , vT) = p(v1) ∏_{t=2}^{T} p(vt | vt−1)

where p(v1) is the initial distribution and p(vt | vt−1) the transition.

State transition diagram: nodes represent states of the variable v, and arcs the non-zero elements of the transition p(vt | vt−1).

[Figure: state transition diagram on states 1–9]

Most probable and shortest paths

[Figure: the same state transition diagram on states 1–9]

The shortest (unweighted) path from state 1 to state 7 is 1 − 2 − 7.

The most probable path from state 1 to state 7 is 1 − 8 − 9 − 7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1 − 2 − 7 the probability of exiting state 2 into state 7 is 1/5.

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time:

p(xt = i) = Σ_j p(xt = i | xt−1 = j) p(xt−1 = j) = Σ_j Mij p(xt−1 = j)

where Mij ≡ p(xt = i | xt−1 = j) is the transition matrix.

p(xt = i) is the frequency that we visit state i at time t, given we started from p(x1) and randomly drew samples from the transition p(xτ | xτ−1). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p1(i) is

pt = M^{t−1} p1

If, for t → ∞, p∞ is independent of the initial distribution p1, then p∞ is called the equilibrium distribution of the chain:

p∞ = M p∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
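A small numerical check of this statement, using an invented two-state transition matrix: power iteration pt = M pt−1 agrees with the unit-eigenvalue eigenvector of M.

```python
import numpy as np

M = np.array([[0.9, 0.2],      # M[i, j] = p(x_t = i | x_{t-1} = j); columns sum to 1
              [0.1, 0.8]])

p = np.array([1.0, 0.0])       # arbitrary initial distribution p_1
for _ in range(100):           # p_t = M^{t-1} p_1
    p = M @ p
print(p)                       # approx [2/3, 1/3]

evals, evecs = np.linalg.eig(M)
v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
print(v / v.sum())             # eigenvector with unit eigenvalue, normalised
```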

PageRank

Define the matrix

Aij = 1 if website j has a hyperlink to website i, and 0 otherwise.

From this we can define a Markov transition matrix with elements

Mij = Aij / Σ_{i′} Ai′j

If we jump from website to website, the equilibrium distribution component p∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i, a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain the word is then returned, ranked according to the importance of the site.
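A minimal PageRank-style sketch under these definitions, with an invented 4-site link matrix (production systems also add damping, which is not discussed here):

```python
import numpy as np

# A[i, j] = 1 if website j links to website i
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0, keepdims=True)   # M[i, j] = A[i, j] / sum_i' A[i', j]

p = np.full(4, 0.25)                   # start from a uniform distribution
for _ in range(200):                   # power iteration towards p_inf = M p_inf
    p = M @ p

print(p)                               # 'importance' of each website
```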

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h1:T. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(vt|ht). This defines a joint distribution

p(h1:T, v1:T) = p(v1|h1) p(h1) ∏_{t=2}^{T} p(vt|ht) p(ht|ht−1)

For a stationary HMM the transition p(ht|ht−1) and emission p(vt|ht) distributions are constant through time.

[Figure: a first order hidden Markov model with 'hidden' variables dom(ht) = {1, . . . , H}, t = 1, . . . , T. The 'visible' variables vt can be either discrete or continuous.]

The classical inference problems

Filtering (inferring the present): p(ht | v1:t)
Prediction (inferring the future): p(ht | v1:s), t > s
Smoothing (inferring the past): p(ht | v1:u), t < u
Likelihood: p(v1:T)
Most likely path (Viterbi alignment): argmax_{h1:T} p(h1:T | v1:T)

For prediction, one is also often interested in p(vt | v1:s) for t > s.

Inference in Hidden Markov Models

Belief network representation of a HMM:

[Figure: hidden chain h1 → h2 → h3 → h4 with emissions ht → vt]

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).

The algorithms are variants of 'message passing on factor graphs'.

The algorithms are guaranteed to work if the graph is singly-connected.

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).
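As an illustration, a minimal sketch of the filtering (forward) recursion α_t(h) ∝ p(vt|h) Σ_{h′} p(h|h′) α_{t−1}(h′) for a discrete HMM; the initial, transition and emission tables below are invented:

```python
import numpy as np

def filtering(v, p_h1, p_trans, p_emit):
    """p(h_t | v_{1:t}) for each t.

    p_h1[h]       : p(h_1 = h)
    p_trans[h, g] : p(h_t = h | h_{t-1} = g)
    p_emit[x, h]  : p(v_t = x | h_t = h)
    """
    alphas = []
    alpha = p_emit[v[0]] * p_h1                  # proportional to p(h_1 | v_1)
    alpha /= alpha.sum()
    alphas.append(alpha)
    for x in v[1:]:
        alpha = p_emit[x] * (p_trans @ alpha)    # emission times one-step prediction
        alpha /= alpha.sum()                     # normalise to get p(h_t | v_{1:t})
        alphas.append(alpha)
    return np.array(alphas)

# invented 2-state, 2-symbol example
p_h1 = np.array([0.5, 0.5])
p_trans = np.array([[0.8, 0.3],
                    [0.2, 0.7]])
p_emit = np.array([[0.9, 0.2],
                   [0.1, 0.8]])
print(filtering([0, 0, 1, 0], p_h1, p_trans, p_emit))
```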

HMMs for speech recognition

ht is the phoneme at time t; p(ht|ht−1) – language model; p(vt|ht) – speech signal model.

Deep Nets and HMMs

[Figure: the same HMM belief network, with phonemes h1:4 and audio observations v1:4]

Recently companies including Google have made big advances in speech recognition.

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is some function of the phoneme, μ(ht; θ).

This function is a deep neural network trained on a large amount of data.

Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

[Figure: generative model with latent variables h1, h2 and visible variables v1, . . . , v4]

It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation.

Note that this is a Graphical Model, not a Function.

The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.

The bad news

Inference (computing p(h|v)) and parameter learning are intractable in these models.

Statisticians typically use sampling as an approximation.

Very popular in ML to use a variational method – much faster for inference.

Variational Inference

Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and suppose that we wish to learn θ to maximise the probability this model generates observed data. A variational lower bound is

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) p(h)

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound with respect to φ and θ.

We can parameterise p(v|h, θ) using a deep network.

Very popular approach – see 'variational autoencoder' and also attention mechanisms.

Extension to semi-supervised methods using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h)
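A minimal single-sample sketch of this bound for a toy model: q(h|v, φ) is taken to be a diagonal Gaussian and the decoder mean is linear (a deep network would replace it); all of these choices are illustrative assumptions, not the method of any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 5, 2                                   # observed and latent dimensions

# decoder p(v|h, theta): Gaussian with mean W h + b (a deep net would replace this)
W, b, sigma_v = rng.normal(size=(D, H)), np.zeros(D), 0.5

def log_gauss(x, mean, std):
    return -0.5 * np.sum(((x - mean) / std) ** 2 + np.log(2 * np.pi * std ** 2))

def elbo_one_sample(v, mu_q, log_std_q):
    """Single-sample estimate of the variational lower bound on log p(v|theta)."""
    std_q = np.exp(log_std_q)
    eps = rng.normal(size=H)
    h = mu_q + std_q * eps                    # reparameterised sample from q(h|v, phi)
    log_p_v_given_h = log_gauss(v, W @ h + b, sigma_v)
    log_p_h = log_gauss(h, np.zeros(H), 1.0)  # prior p(h) = N(0, I)
    log_q_h = log_gauss(h, mu_q, std_q)
    return log_p_v_given_h + log_p_h - log_q_h   # -log q + log p(v|h) p(h)

v = rng.normal(size=D)
print(elbo_one_sample(v, mu_q=np.zeros(H), log_std_q=np.zeros(H)))
```

In practice both φ (the parameters of q) and θ (the decoder) would be updated by gradient ascent on this estimate, averaged over data points.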

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals.

The problem is that the number of pixel states is enormous.

Need to learn a low dimensional representation of the screen (use a deep generative model).

Then learn which action to take given the low dimensional representation.
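The slides do not specify the learning algorithm; purely as an illustration, a minimal tabular Q-learning update over an assumed small discrete state space (standing in for the learned low dimensional representation) and a placeholder environment:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1       # learning rate, discount, exploration

def step(state, action):
    """Placeholder environment: returns (next_state, reward); a real game
    emulator plus a learned encoder would sit here."""
    return rng.integers(n_states), rng.normal()

state = 0
for _ in range(1000):
    if rng.random() < epsilon:               # epsilon-greedy action choice
        action = int(rng.integers(n_actions))
    else:
        action = int(np.argmax(Q[state]))
    next_state, reward = step(state, action)
    td_target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (td_target - Q[state, action])
    state = next_state
```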

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing.

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning.

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company, reinfer:

https://reinfer.io

Reverse Differentiation

x1 x2

f1

f2

f3

partf1partx1

= x2partf1partx2

= x1

partf2partf1

= cos(f1)

partf3partf2

= minus sin(f2)

1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)

2 Start with the first node n1 in the reverseschedule and define tn1 = 1

3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define

tn =sum

cisinch(n)

partfcpartfn

tc

4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes

This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required

Limitations of forward reasoning

World Representation

Recognising patterns (perceptron style) is only one form of intelligence

Solving chess problems is another and requires complex reasoning using someform of internal model

The world is noisy and information may be conflicting

Recognised that new approaches are required

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of an HMM: a hidden chain h_1 → h_2 → h_3 → h_4 with an emission edge h_t → v_t at each time step.

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).

The algorithms are variants of 'message passing on factor graphs'.

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).
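
As a concrete example of the linear-time filtering mentioned above, a single forward sweep (the 'forward algorithm') computes p(h_t|v_{1:t}) and the likelihood; the sketch below uses made-up tables and costs O(T H^2) operations:

import numpy as np

ph1 = np.array([0.6, 0.3, 0.1])                 # p(h_1)
trans = np.array([[0.7, 0.2, 0.1],
                  [0.2, 0.6, 0.3],
                  [0.1, 0.2, 0.6]])             # trans[i, j] = p(h_t = i | h_{t-1} = j)
emit = np.array([[0.8, 0.1, 0.3],
                 [0.1, 0.7, 0.3],
                 [0.1, 0.2, 0.4]])              # emit[v, h] = p(v_t = v | h_t = h)
v = [0, 0, 1, 2, 1, 0]                          # an observed sequence

alpha = emit[v[0], :] * ph1                     # p(h_1, v_1)
loglik = 0.0
for t in range(len(v)):
    if t > 0:
        alpha = emit[v[t], :] * (trans @ alpha) # unnormalised p(h_t, v_t | v_{1:t-1})
    loglik += np.log(alpha.sum())               # accumulates log p(v_t | v_{1:t-1})
    alpha = alpha / alpha.sum()                 # filtered posterior p(h_t | v_{1:t})
    print(t, alpha)
print('log p(v_{1:T}) =', loglik)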

HMMs for speech recognition

h_t is the phoneme at time t; p(h_t|h_{t-1}) – language model; p(v_t|h_t) – speech signal model.

Deep Nets and HMMs

(The same HMM structure: hidden chain h_1, ..., h_4 with emissions to v_1, ..., v_4.)

Recently companies including Google have made big advances in speech recognition.

The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t; θ).

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
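
Schematically, the emission described above is p(v_t|h_t) = N(v_t; μ(h_t; θ), σ² I) with μ a network. The sketch below is a toy stand-in, not the production architecture; the one-hot phoneme encoding, layer sizes and fixed variance are assumptions:

import numpy as np

rng = np.random.default_rng(0)
H, D, hidden = 10, 13, 32                       # number of phonemes, feature dimension, hidden units
W1, b1 = rng.normal(0, 0.1, (hidden, H)), np.zeros(hidden)
W2, b2 = rng.normal(0, 0.1, (D, hidden)), np.zeros(D)
sigma2 = 0.5

def mu(h):
    x = np.zeros(H); x[h] = 1.0                 # one-hot encoding of the phoneme
    return W2 @ np.tanh(W1 @ x + b1) + b2       # mean of the emission

def log_emission(v, h):
    d = v - mu(h)                               # log N(v; mu(h), sigma^2 I)
    return -0.5 * (D * np.log(2 * np.pi * sigma2) + d @ d / sigma2)

v = rng.normal(size=D)                          # a dummy frame of acoustic features
print(log_emission(v, h=3))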

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

(Graphical model: latent variables h_1, h_2 each connected to the visible variables v_1, ..., v_4.)

It is natural to consider that objects (images for example) can be constructed on the basis of a low dimensional representation.

Note that this is a Graphical Model, not a Function.

The latent variables h can be sampled from p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in these models.

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method – much faster for inference.

Variational Inference
Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and that we wish to learn θ to maximise the probability this model generates observed data.

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) + const

Idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h, θ) using a deep network.

Very popular approach – see 'variational autoencoder' and also attention mechanisms.

Extension to semi-supervised method using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h)
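
As a minimal sketch of the bound above (single data point, linear Gaussian decoder, arbitrary parameter values; a real variational autoencoder would instead use deep networks for both p(v|h, θ) and q(h|v, φ) and optimise by gradient ascent), the bound can be estimated by Monte Carlo with reparameterised samples from q:

import numpy as np

rng = np.random.default_rng(0)
Dh, Dv = 2, 5
W = rng.normal(0, 0.5, (Dv, Dh))                # decoder: p(v|h, theta) = N(W h, I)
v = rng.normal(size=Dv)                         # an 'observed' data point

mu_q, log_std_q = np.zeros(Dh), np.zeros(Dh)    # variational parameters phi of q(h|v, phi)

def log_gauss(x, m, var):                       # diagonal Gaussian log density
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - m) ** 2 / var)

S, bound = 100, 0.0
for _ in range(S):
    eps = rng.normal(size=Dh)
    h = mu_q + np.exp(log_std_q) * eps          # reparameterised sample from q(h|v, phi)
    bound += (log_gauss(v, W @ h, 1.0)          #   E_q[ log p(v|h, theta) ]
              + log_gauss(h, 0.0, 1.0)          # + E_q[ log p(h) ]
              - log_gauss(h, mu_q, np.exp(2 * log_std_q)))  # - E_q[ log q(h|v, phi) ]
print('bound estimate:', bound / S)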

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals.

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deep generative model).

Then learn which action to take given the low dimensional representation (a toy sketch follows below).
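
As a toy illustration of the 'best action for long-term goals' part only (tabular Q-learning on a small chain world; this is not the deep-network approach used for Atari, where the screen would first be mapped to a learned low dimensional representation):

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2                      # actions: 0 = left, 1 = right; reward on reaching the right end
Q = np.zeros((n_states, n_actions))
gamma, alpha, eps = 0.9, 0.1, 0.3

for episode in range(500):
    s = 0
    while s != n_states - 1:
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q[:-1], axis=1))                # learned policy: move right in every non-terminal state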

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing.

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning.

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 46: David Barber - Deep Nets, Bayes and the story of AI

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 47: David Barber - Deep Nets, Bayes and the story of AI

Limitations of forward reasoning

World Representation

Models help us to fantasise about the world

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 48: David Barber - Deep Nets, Bayes and the story of AI

Models

Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B).
The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E).
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E).

Therefore

p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)

Example – Part II: Specifying the Tables

Figure: the belief network for the example, with edges B→A, E→A and E→R.

p(A|B, E):

Alarm = 1   Burglar   Earthquake
0.9999      1         1
0.99        1         0
0.99        0         1
0.0001      0         0

p(R|E):

Radio = 1   Earthquake
1           1
0           0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and the graphical structure fully specify the distribution.

Example – Part III: Inference

Initial Evidence: the alarm is sounding.

p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)

= Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E) ≈ 0.99

Additional Evidence: the radio broadcasts an earthquake warning.

A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01.

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
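These two posteriors can be checked by brute-force enumeration of the joint. A small Python sketch, assuming only the tables above (this is not how the slides compute it, but it reproduces the same numbers):

# Tables from the slides (1 = true, 0 = false)
pB1 = 0.01                                  # p(B = 1)
pE1 = 0.000001                              # p(E = 1)
pA1 = {(1, 1): 0.9999, (1, 0): 0.99,        # p(A = 1 | B, E)
       (0, 1): 0.99,   (0, 0): 0.0001}
pR1 = {1: 1.0, 0: 0.0}                      # p(R = 1 | E)

def joint(b, e, a, r):
    """p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)"""
    pb = pB1 if b else 1 - pB1
    pe = pE1 if e else 1 - pE1
    pa = pA1[(b, e)] if a else 1 - pA1[(b, e)]
    pr = pR1[e] if r else 1 - pR1[e]
    return pa * pr * pe * pb

def posterior_burglar(evidence):
    """p(B = 1 | evidence); evidence is a dict over a subset of {'A', 'R'}."""
    num = den = 0.0
    for b in (0, 1):
        for e in (0, 1):
            for a in (0, 1):
                for r in (0, 1):
                    vals = {'A': a, 'R': r}
                    if any(vals[k] != v for k, v in evidence.items()):
                        continue
                    p = joint(b, e, a, r)
                    den += p
                    num += p * b
    return num / den

print(posterior_burglar({'A': 1}))           # ≈ 0.99
print(posterior_burglar({'A': 1, 'R': 1}))   # ≈ 0.01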

Markov Models

For timeseries data v1, . . . , vT we need a model p(v1:T). For causal consistency it is meaningful to consider the decomposition

p(v1:T) = ∏_{t=1}^{T} p(vt | v1:t−1)

with the convention p(vt | v1:t−1) = p(v1) for t = 1.

v1 v2 v3 v4

Independence assumptions
It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.

Markov Chain

Only the recent past is relevant

p(vt | v1, . . . , vt−1) = p(vt | vt−L, . . . , vt−1)

where L ≥ 1 is the order of the Markov chain. For a first order chain (L = 1),

p(v1:T) = p(v1) p(v2|v1) p(v3|v2) · · · p(vT|vT−1)

For a stationary Markov chain the transitions p(vt = s′ | vt−1 = s) = f(s′, s) are time-independent ('homogeneous').

Figure: (a) first order Markov chain v1 → v2 → v3 → v4; (b) second order Markov chain.
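A minimal numpy sketch of sampling from, and scoring sequences under, a stationary first order chain; the three-state transition matrix is made up, with the convention M[i, j] = p(vt = i | vt−1 = j) used on the equilibrium slide later.

import numpy as np

rng = np.random.default_rng(0)

# Made-up homogeneous chain on 3 states; columns of M sum to 1
M = np.array([[0.8, 0.1, 0.3],
              [0.1, 0.6, 0.3],
              [0.1, 0.3, 0.4]])
p1 = np.array([1.0, 0.0, 0.0])          # initial distribution p(v1)

def sample_chain(T):
    """Draw v1, ..., vT from p(v1) prod_t p(vt | vt-1)."""
    v = [rng.choice(3, p=p1)]
    for _ in range(T - 1):
        v.append(rng.choice(3, p=M[:, v[-1]]))
    return v

def log_joint(v):
    """log p(v1:T) for a first order chain."""
    return np.log(p1[v[0]]) + sum(np.log(M[v[t], v[t - 1]]) for t in range(1, len(v)))

v = sample_chain(10)
print(v, log_joint(v))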

Markov Chains

v1 v2 v3 v4

p(v1, . . . , vT) = p(v1) ∏_{t=2}^{T} p(vt | vt−1)

where p(v1) is the initial distribution and p(vt | vt−1) the transition.

State transition diagram
Nodes represent states of the variable v, and arcs the non-zero elements of the transition p(vt | vt−1).

Figure: an example transition diagram on nine states, labelled 1–9.

Most probable and shortest paths

Figure: the same nine-state transition diagram.

The shortest (unweighted) path from state 1 to state 7 is 1 − 2 − 7.

The most probable path from state 1 to state 7 is 1 − 8 − 9 − 7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1 − 2 − 7 the probability of exiting state 2 into state 7 is 1/5.

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time:

p(xt = i) = Σ_j p(xt = i | xt−1 = j) p(xt−1 = j),   with Mij ≡ p(xt = i | xt−1 = j)

p(xt = i) is the frequency with which we visit state i at time t, given we started from p(x1) and randomly drew samples from the transition p(xτ | xτ−1). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p1(i) is

pt = M^{t−1} p1

If, for t → ∞, p∞ is independent of the initial distribution p1, then p∞ is called the equilibrium distribution of the chain:

p∞ = M p∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
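A quick numpy check of both statements, on a made-up three-state transition matrix whose columns sum to one:

import numpy as np

M = np.array([[0.8, 0.1, 0.3],
              [0.1, 0.6, 0.3],
              [0.1, 0.3, 0.4]])

# Power iteration: pt = M^{t-1} p1 converges to the equilibrium distribution
p = np.array([1.0, 0.0, 0.0])
for _ in range(200):
    p = M @ p
print(p)

# Equivalently, take the eigenvector of M with eigenvalue 1 and normalise it
vals, vecs = np.linalg.eig(M)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
print(v / v.sum())                     # matches the power-iteration result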

PageRank

Define the matrix

Aij = 1 if website j has a hyperlink to website i, and 0 otherwise.

From this we can define a Markov transition matrix with elements

Mij = Aij / Σ_{i′} Ai′j

If we jump from website to website, the equilibrium distribution component p∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i, a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain the word is then returned, ranked according to the importance of the site.
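A toy sketch of this construction on a random, made-up link matrix; the handling of sites with no outgoing links, and the omission of the damping factor used in practice, are simplifications.

import numpy as np

rng = np.random.default_rng(1)

n = 6
A = (rng.random((n, n)) < 0.4).astype(float)   # A[i, j] = 1 if site j links to site i (made up)
np.fill_diagonal(A, 0)
A[:, A.sum(axis=0) == 0] = 1.0 / n             # hypothetical fix for sites with no outgoing links

M = A / A.sum(axis=0)                          # Mij = Aij / sum_i' Ai'j

p = np.full(n, 1.0 / n)
for _ in range(200):                           # power iteration towards p_inf = M p_inf
    p = M @ p
print(np.round(p, 3))                          # 'importance' of each site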

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h1:T. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(vt | ht). This defines a joint distribution

p(h1:T, v1:T) = p(v1|h1) p(h1) ∏_{t=2}^{T} p(vt|ht) p(ht|ht−1)

For a stationary HMM the transition p(ht|ht−1) and emission p(vt|ht) distributions are constant through time.

Figure: a first order hidden Markov model with 'hidden' variables dom(ht) = {1, . . . , H}, t = 1, . . . , T. The 'visible' variables vt can be either discrete or continuous.

The classical inference problems

Filtering (inferring the present): p(ht | v1:t)
Prediction (inferring the future): p(ht | v1:s), t > s
Smoothing (inferring the past): p(ht | v1:u), t < u
Likelihood: p(v1:T)
Most likely path (Viterbi alignment): argmax_{h1:T} p(h1:T | v1:T)

For prediction, one is also often interested in p(vt | v1:s) for t > s.

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).

The algorithms are variants of 'message passing on factor graphs'.

The algorithms are guaranteed to work if the graph is singly connected.

There has been a huge research effort in the last 15 years to apply message passing for approximate inference in multiply connected graphs (e.g. low-density parity-check codes).
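As an illustration of the filtering problem above, a minimal numpy sketch of the forward recursion for a discrete HMM; the transition, emission and observation sequence are made up.

import numpy as np

def filtering(v, p1, M, E):
    """Forward (filtering) recursion for a discrete HMM.

    v  : observed sequence (integers)
    p1 : p(h1), shape (H,)
    M  : transition, M[i, j] = p(ht = i | ht-1 = j)
    E  : emission,   E[k, i] = p(vt = k | ht = i)
    Returns alphas[t, i] = p(ht = i | v1:t).
    """
    H = len(p1)
    alphas = np.zeros((len(v), H))
    a = E[v[0]] * p1                    # proportional to p(h1, v1)
    alphas[0] = a / a.sum()
    for t in range(1, len(v)):
        a = E[v[t]] * (M @ alphas[t - 1])
        alphas[t] = a / a.sum()
    return alphas

# Tiny made-up example: 2 hidden states, 2 possible observations
p1 = np.array([0.5, 0.5])
M = np.array([[0.9, 0.2],
              [0.1, 0.8]])
E = np.array([[0.8, 0.3],
              [0.2, 0.7]])
print(filtering([0, 0, 1, 1], p1, M, E))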

HMMs for speech recognition

ht is the phoneme at time t; p(ht | ht−1) is the language model; p(vt | ht) is the speech signal model.

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently, companies including Google have made big advances in speech recognition.

The breakthrough is to model p(vt | ht) as a Gaussian whose mean is some function of the phoneme, μ(ht, θ).

This function is a deep neural network, trained on a large amount of data.

There is a goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
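A toy numpy sketch of such an emission, a Gaussian whose mean is produced by a small untrained network; the dimensions, architecture and isotropic variance are all assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)

H, D, n_hidden = 10, 13, 32        # number of phonemes, feature dimension, hidden units (made up)

# A small multilayer perceptron mapping a one-hot phoneme to an emission mean mu(h; theta)
W1 = rng.normal(0, 0.1, (n_hidden, H))
W2 = rng.normal(0, 0.1, (D, n_hidden))

def mu(h):
    """Mean of the Gaussian emission for phoneme index h."""
    onehot = np.eye(H)[h]
    return W2 @ np.tanh(W1 @ onehot)

def log_emission(v, h, sigma=1.0):
    """log N(v; mu(h), sigma^2 I): the neural-network Gaussian emission density."""
    d = v - mu(h)
    return -0.5 * (d @ d) / sigma**2 - 0.5 * D * np.log(2 * np.pi * sigma**2)

v = rng.normal(size=D)             # a dummy acoustic feature vector
print([round(log_emission(v, h), 2) for h in range(H)])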

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation.

Note that this is a Graphical Model, not a Function.

The latent variables h can be sampled from p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.

The bad news

Inference (computing p(h|v)) and parameter learning are intractable in these models.

Statisticians typically use sampling as an approximation.

It is very popular in ML to use a variational method, which is much faster for inference.

Variational Inference
Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and suppose that we wish to learn θ to maximise the probability that this model generates the observed data. Then

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log [p(v|h, θ) p(h)]

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound with respect to φ and θ.

We can parameterise p(v|h, θ) using a deep network.

This is a very popular approach – see the 'variational autoencoder' and also attention mechanisms.
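A minimal numpy sketch of this bound for a toy linear-Gaussian model, estimated by sampling from q; the model, the Gaussian form of q and all numbers are assumptions, and in a variational autoencoder both p(v|h, θ) and q(h|v, φ) would instead be deep networks trained by gradient ascent on this quantity.

import numpy as np

rng = np.random.default_rng(0)
D, H = 5, 2                                   # made-up data and latent dimensions

# Generative model: p(h) = N(0, I),  p(v|h, theta) = N(W h, I)
W = rng.normal(size=(D, H))

def log_p_v_given_h(v, h):
    d = v - W @ h
    return -0.5 * d @ d - 0.5 * D * np.log(2 * np.pi)

def log_p_h(h):
    return -0.5 * h @ h - 0.5 * H * np.log(2 * np.pi)

def log_q(h, m, s):                           # q(h|v, phi) = N(m, s^2 I)
    d = (h - m) / s
    return -0.5 * d @ d - 0.5 * H * np.log(2 * np.pi) - H * np.log(s)

def elbo(v, m, s, n_samples=1000):
    """Monte Carlo estimate of the variational lower bound on log p(v|theta)."""
    total = 0.0
    for _ in range(n_samples):
        h = m + s * rng.normal(size=H)        # h ~ q(h|v, phi)
        total += log_p_v_given_h(v, h) + log_p_h(h) - log_q(h, m, s)
    return total / n_samples

v = rng.normal(size=D)
print(elbo(v, m=np.zeros(H), s=1.0))          # a stochastic lower bound on log p(v|theta)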

Extension to a semi-supervised method using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long term goals.

The problem is that the number of pixel states is enormous.

We need to learn a low dimensional representation of the screen (using a deep generative model).

Then learn which action to take given the low dimensional representation, as in the sketch below.
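A very rough sketch of that second step: tabular Q-learning over a stand-in discrete encoding of the screen. This is not the DeepMind algorithm (which uses a deep Q-network), and the encoder, environment and reward here are fake placeholders.

import numpy as np

rng = np.random.default_rng(0)

n_codes, n_actions = 20, 4            # made-up sizes: discrete screen codes and actions

def encode(screen):
    """Stand-in for a learned low-dimensional representation of the raw pixels."""
    return hash(screen.tobytes()) % n_codes

Q = np.zeros((n_codes, n_actions))    # action values over the low-dimensional codes

def q_update(s, a, reward, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step on the encoded states."""
    Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])

# Dummy interaction loop with a fake environment
screen = rng.integers(0, 255, size=(4, 4))
for _ in range(100):
    s = encode(screen)
    a = rng.integers(n_actions) if rng.random() < 0.1 else Q[s].argmax()
    screen = rng.integers(0, 255, size=(4, 4))        # pretend the action changed the screen
    reward = float(rng.random() < 0.05)               # pretend sparse reward
    q_update(s, a, reward, encode(screen))
print(Q[:3])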

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition, Image Analysis and Game Playing.

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve the interaction between reinforcement learning and representation learning.

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company, reinfer:

https://reinfer.io


Models

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos – position in kitchen; snd – sound

Finding the Burglar

Figure: a sequence of kitchen-grid panels, one per timestep, labelled with the observed 'creak' and 'bump' sounds, from which the burglar's position is inferred.

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int – intended key; hit – hit key

Stubby Fingers errors

Figure: the typing-error matrix over the letters a–z, shown as an image (colour scale roughly 0.05–0.55).

Stubby Fingers language

Figure: the language (letter-transition) matrix over the letters a–z, shown as an image (colour scale roughly 0–0.9).

Stubby Fingers

Given the typed sequence 'cwsykcak', what is the most likely word that this corresponds to?

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word
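A rough Python sketch of this procedure; the transition and emission matrices are made-up stand-ins for the two matrices shown above, the k-best list is approximated with a beam search, and the dictionary is a tiny placeholder (so with these fake numbers the filtered list may well be empty).

import numpy as np

letters = 'abcdefghijklmnopqrstuvwxyz'
rng = np.random.default_rng(0)
trans = rng.dirichlet(np.ones(26), size=26).T       # trans[i, j] = p(int_t = i | int_{t-1} = j), made up
emit = np.full((26, 26), 0.5 / 25) + np.eye(26) * (0.5 - 0.5 / 25)   # emit[k, i] = p(hit = k | int = i)

def k_best(typed, k=200):
    """Approximate the k most likely intended sequences with a beam search."""
    obs = [letters.index(c) for c in typed]
    beams = [((i,), np.log(1 / 26) + np.log(emit[obs[0], i])) for i in range(26)]
    beams = sorted(beams, key=lambda b: -b[1])[:k]
    for o in obs[1:]:
        new = [(seq + (i,), lp + np.log(trans[i, seq[-1]]) + np.log(emit[o, i]))
               for seq, lp in beams for i in range(26)]
        beams = sorted(new, key=lambda b: -b[1])[:k]
    return [''.join(letters[i] for i in seq) for seq, _ in beams]

dictionary = {'cystic', 'mistake', 'awake'}          # stand-in for a real word list
candidates = k_best('cwsykcak')
print([w for w in candidates if w in dictionary])    # keep only proper English words, in likelihood order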

Speech Recognition raw signal

Figure: a raw speech waveform, amplitude roughly in the range −0.2 to 0.3, plotted against Time from 0 to 0.9.

'neural' representation

Figure: the corresponding 'neural' representation of the signal, shown as an image of roughly 25 rows by 80 time frames.

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho – phoneme (letter); aud – audio signal (neural representation)

Medical Diagnosis

Figure: a belief network in which the disease variables (tumour, flu, meningitis) are parents of the symptom variables (headache, fever, appetite, x-ray).

Combine known medical knowledge with patient-specific information.

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 50: David Barber - Deep Nets, Bayes and the story of AI

Burglar Problem

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 51: David Barber - Deep Nets, Bayes and the story of AI

Creaks and Bumps

Creak Bump

Burglar Model

pos1 pos2 pos3 pos4

snd1 snd2 snd3 snd4

pos - position in kitchensnd ndash sound

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 54: David Barber - Deep Nets, Bayes and the story of AI

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 55: David Barber - Deep Nets, Bayes and the story of AI

Finding the Burglar

creak creak

bump

creak

bump bump

creak

bump bump bump

creak

bump

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int - intended keyhit ndash hit key

Stubby Fingers errors

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz

005

01

015

02

025

03

035

04

045

05

055

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io

Stubby Fingers

Stubby Fingers

int1 int2 int3 int4

hit1 hit2 hit3 hit4

int – intended key; hit – hit key

Stubby Fingers errors

[Figure: error model p(hit|int) – for each intended key a–z, the distribution over actually hit keys a–z]

Stubby Fingers language

[Figure: language model p(int_t|int_{t-1}) – first order letter transition matrix over a–z]

Stubby Fingers

Given the typed sequence 'cwsykcak', what is the most likely word that this corresponds to?

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word
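
A sketch of the decoding step. The slides keep a 200-best list and filter it with a dictionary; for brevity this only finds the single most likely hidden sequence with the Viterbi recursion, and the letter-transition matrix A, error matrix B and initial distribution pi are uniform placeholders that would in practice be estimated from data.

    import numpy as np

    def viterbi(logA, logB, logpi, obs):
        """argmax_h p(h_{1:T} | v_{1:T}) for an HMM.
        logA[i, j] = log p(h_t = i | h_{t-1} = j), logB[k, i] = log p(v_t = k | h_t = i)."""
        H, T = logpi.shape[0], len(obs)
        delta = logpi + logB[obs[0]]                 # best log-probability ending in each state
        back = np.zeros((T, H), dtype=int)
        for t in range(1, T):
            scores = logA + delta                    # scores[i, j]: arrive in i from j
            back[t] = np.argmax(scores, axis=1)
            delta = scores[np.arange(H), back[t]] + logB[obs[t]]
        path = [int(np.argmax(delta))]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    letters = 'abcdefghijklmnopqrstuvwxyz'
    typed = [letters.index(c) for c in 'cwsykcak']
    A = np.full((26, 26), 1 / 26)     # p(int_t | int_{t-1}), placeholder
    B = np.full((26, 26), 1 / 26)     # p(hit | int), placeholder
    pi = np.full(26, 1 / 26)
    best = viterbi(np.log(A), np.log(B), np.log(pi), typed)
    print(''.join(letters[i] for i in best))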

Speech Recognition: raw signal

[Figure: raw audio waveform – amplitude against time (seconds)]

'neural' representation

[Figure: 'neural' (filterbank-style) representation of the audio – roughly 25 channels over 80 time frames]

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho – phoneme (letter); aud – audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the Ising Model (1920) and in AI applications such as the HMM (Baum 1966; Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.)

Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects can interact, probability is a non-starter

For this reason computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however these are typically frowned on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact with each other

Graphical Models are then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph

The computational complexity of operations can often be related to the structure of the graph

Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable

Uses in Industry

Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis

Google, Microsoft, Facebook: used in many places, including advertising, video game prediction, speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company/user relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

p(x|y) ≡ p(x, y) / p(y) = p(y|x) p(x) / p(y)    (Bayes' rule)

Throwing darts

p(region 5|not region 20) = p(region 5, not region 20) / p(not region 20)
                          = p(region 5) / p(not region 20)
                          = (1/20) / (19/20)
                          = 1/19

Interpretation
p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation should be 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.

Battleships

Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each

Can be placed anywhere on the 10×10 grid, but cannot overlap

Let s1 be the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query 'hit' or 'miss' responses

p(s1, s2|D) = p(D|s1, s2) p(s1, s2) / p(D)

Let X be the matrix of pixel occupancy:

p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2) p(s1, s2|D)

demoBattleships.m
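
demoBattleships.m refers to a MATLAB demo (presumably from the BRML toolbox accompanying Barber's book); the following is a rough Python re-sketch of the same computation under the stated assumptions (5-pixel ships, no overlap, noiseless hit/miss answers, uniform prior over valid placements). The example data D is made up.

    import numpy as np

    G = 10
    def cells_v(r, c): return [(r + k, c) for k in range(5)]   # vertical ship 1
    def cells_h(r, c): return [(r, c + k) for k in range(5)]   # horizontal ship 2
    placements1 = [(r, c) for r in range(G - 4) for c in range(G)]
    placements2 = [(r, c) for r in range(G) for c in range(G - 4)]

    D = [((2, 3), True), ((5, 5), False)]    # hypothetical query responses: (pixel, hit?)

    post = np.zeros((len(placements1), len(placements2)))   # unnormalised p(s1, s2 | D)
    for i, s1 in enumerate(placements1):
        for j, s2 in enumerate(placements2):
            occ = set(cells_v(*s1)) | set(cells_h(*s2))
            if len(occ) < 10:                                # overlapping ships: excluded
                continue
            if all((q in occ) == hit for q, hit in D):       # noiseless likelihood
                post[i, j] = 1.0
    post /= post.sum()

    pX = np.zeros((G, G))                                    # p(pixel occupied | D)
    for i, s1 in enumerate(placements1):
        for j, s2 in enumerate(placements2):
            if post[i, j] > 0:
                for (r, c) in set(cells_v(*s1)) | set(cells_h(*s2)):
                    pX[r, c] += post[i, j]
    print(pX.round(2))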

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents.

The joint distribution is obtained by taking the product of the conditional probabilities:

p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|C) p(E|B,C)

[Figure: DAG with edges A→C, B→C, C→D, B→E, C→E]

Example – Part I
Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.

Choosing an ordering
Without loss of generality we can write

p(A,R,E,B) = p(A|R,E,B) p(R,E,B)
           = p(A|R,E,B) p(R|E,B) p(E,B)
           = p(A|R,E,B) p(R|E,B) p(E|B) p(B)

Assumptions

The alarm is not directly influenced by any report on the radio: p(A|R,E,B) = p(A|E,B)
The radio broadcast is not directly influenced by the burglar variable: p(R|E,B) = p(R|E)
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E)

Therefore

p(A,R,E,B) = p(A|E,B) p(R|E) p(E) p(B)

Example – Part II: Specifying the Tables

[Figure: belief network with edges B→A, E→A, E→R]

p(A = 1|B,E):

  Burglar  Earthquake  p(Alarm = 1)
  1        1           0.9999
  1        0           0.99
  0        1           0.99
  0        0           0.0001

p(R = 1|E):

  Earthquake  p(Radio = 1)
  1           1
  0           0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.

Example – Part III: Inference

Initial Evidence: The alarm is sounding.

p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
               = Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E)
               ≈ 0.99

Additional Evidence: The radio broadcasts an earthquake warning.

A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
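
A small sketch that reproduces these numbers by brute-force enumeration of the joint p(A,R,E,B) using the tables above; exact inference is trivial here because the network has only four binary variables.

    # conditional probability tables from the slides
    pB = {1: 0.01, 0: 0.99}
    pE = {1: 0.000001, 0: 0.999999}
    pA1 = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}   # p(A=1 | B, E)
    pR1 = {1: 1.0, 0: 0.0}                                               # p(R=1 | E)

    def joint(a, r, e, b):
        pa = pA1[(b, e)] if a == 1 else 1 - pA1[(b, e)]
        pr = pR1[e] if r == 1 else 1 - pR1[e]
        return pa * pr * pE[e] * pB[b]

    def p_burglar(evidence):
        """p(B = 1 | evidence), where evidence fixes some of A, R to 0/1."""
        num = den = 0.0
        for a in (0, 1):
            for r in (0, 1):
                for e in (0, 1):
                    for b in (0, 1):
                        vals = {'A': a, 'R': r, 'E': e, 'B': b}
                        if any(vals[k] != v for k, v in evidence.items()):
                            continue
                        p = joint(a, r, e, b)
                        den += p
                        num += p if b == 1 else 0.0
        return num / den

    print(p_burglar({'A': 1}))            # approx 0.99
    print(p_burglar({'A': 1, 'R': 1}))    # approx 0.01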

Markov Models

For timeseries data v_1, ..., v_T we need a model p(v_{1:T}). For causal consistency it is meaningful to consider the decomposition

p(v_{1:T}) = ∏_{t=1}^{T} p(v_t|v_{1:t-1})

with the convention p(v_t|v_{1:t-1}) = p(v_1) for t = 1.

v1 v2 v3 v4

Independence assumptions
It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.

Markov Chain

Only the recent past is relevant:

p(v_t|v_1, ..., v_{t-1}) = p(v_t|v_{t-L}, ..., v_{t-1})

where L ≥ 1 is the order of the Markov chain:

p(v_{1:T}) = p(v_1) p(v_2|v_1) p(v_3|v_2) ... p(v_T|v_{T-1})

For a stationary Markov chain the transitions p(v_t = s'|v_{t-1} = s) = f(s', s) are time-independent ('homogeneous').

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure: (a) First order Markov chain. (b) Second order Markov chain.

Markov Chains

v1 v2 v3 v4

p(v_1, ..., v_T) = p(v_1) ∏_{t=2}^{T} p(v_t|v_{t-1})

where p(v_1) is the initial distribution and p(v_t|v_{t-1}) the transition.

State transition diagram
Nodes represent states of the variable v, and arcs non-zero elements of the transition p(v_t|v_{t-1}).

[Figure: state transition diagram over states 1–9]

Most probable and shortest paths

[Figure: the same state transition diagram over states 1–9]

The shortest (unweighted) path from state 1 to state 7 is 1 - 2 - 7.

The most probable path from state 1 to state 7 is 1 - 8 - 9 - 7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1 - 2 - 7 the probability of exiting state 2 into state 7 is 1/5.

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(x_t = i) = Σ_j p(x_t = i|x_{t-1} = j) p(x_{t-1} = j), where M_{ij} ≡ p(x_t = i|x_{t-1} = j)

p(x_t = i) is the frequency that we visit state i at time t, given we started from p(x_1) and randomly drew samples from the transition p(x_τ|x_{τ-1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

p_t = M^{t-1} p_1

If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:

p_∞ = M p_∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
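
A small numerical sketch: repeatedly applying M to an arbitrary initial distribution converges to the equilibrium distribution (for a suitably well-behaved chain), which matches the unit-eigenvalue eigenvector. The 3-state transition matrix is made up.

    import numpy as np

    # columns sum to 1: M[i, j] = p(x_t = i | x_{t-1} = j)   (made-up chain)
    M = np.array([[0.5, 0.2, 0.1],
                  [0.3, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])

    p = np.array([1.0, 0.0, 0.0])          # arbitrary initial distribution p_1
    for _ in range(100):
        p = M @ p                          # p_t = M p_{t-1}
    print(p)                               # approximate equilibrium distribution

    vals, vecs = np.linalg.eig(M)          # cross-check via the eigenvector with eigenvalue 1
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    print(v / v.sum())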

PageRank

Define the matrix

A_{ij} = 1 if website j has a hyperlink to website i, and 0 otherwise.

From this we can define a Markov transition matrix with elements

M_{ij} = A_{ij} / Σ_{i'} A_{i'j}

If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain word w is then returned, ranked according to the importance of the site.
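
A sketch of the construction on a tiny made-up web of four sites: build M by normalising the columns of the link matrix A, then find its equilibrium distribution as above.

    import numpy as np

    # A[i, j] = 1 if website j links to website i (hypothetical 4-site web)
    A = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 1],
                  [1, 1, 0, 0],
                  [0, 0, 1, 0]], dtype=float)

    M = A / A.sum(axis=0, keepdims=True)   # M_ij = A_ij / sum_i' A_i'j

    p = np.full(4, 0.25)
    for _ in range(200):
        p = M @ p
    print(p)                               # relative 'importance' of each website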

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

p(h_{1:T}, v_{1:T}) = p(v_1|h_1) p(h_1) ∏_{t=2}^{T} p(v_t|h_t) p(h_t|h_{t-1})

For a stationary HMM the transition p(h_t|h_{t-1}) and emission p(v_t|h_t) distributions are constant through time.

v1 v2 v3 v4

h1 h2 h3 h4

Figure: A first order hidden Markov model with 'hidden' variables dom(h_t) = {1, ..., H}, t = 1, ..., T. The 'visible' variables v_t can be either discrete or continuous.
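
A sketch of ancestral sampling from this joint distribution for a discrete-output stationary HMM; the transition, emission and initial distributions are made-up placeholders.

    import numpy as np

    rng = np.random.default_rng(4)
    H, V, T = 3, 4, 10
    A = np.array([[0.8, 0.1, 0.2],
                  [0.1, 0.8, 0.2],
                  [0.1, 0.1, 0.6]])             # A[i, j] = p(h_t = i | h_{t-1} = j)
    B = rng.dirichlet(np.ones(V), size=H).T     # B[k, i] = p(v_t = k | h_t = i)
    pi = np.array([0.6, 0.3, 0.1])

    h = rng.choice(H, p=pi)                     # h_1 ~ p(h_1)
    hs, vs = [h], [rng.choice(V, p=B[:, h])]    # v_1 ~ p(v_1 | h_1)
    for t in range(1, T):
        h = rng.choice(H, p=A[:, h])            # h_t ~ p(h_t | h_{t-1})
        hs.append(h)
        vs.append(rng.choice(V, p=B[:, h]))     # v_t ~ p(v_t | h_t)
    print(hs, vs)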

The classical inference problems

Filtering (inferring the present): p(h_t|v_{1:t})
Prediction (inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): argmax_{h_{1:T}} p(h_{1:T}|v_{1:T})

For prediction one is also often interested in p(v_t|v_{1:s}) for t > s.

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Page 59: David Barber - Deep Nets, Bayes and the story of AI

Stubby Fingers language

a b c d e f g h i j k l m n o p q r s t u v w x y z

abcdefghijkl

mnopqrstuvwxyz 0

01

02

03

04

05

06

07

08

09

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and that we wish to learn θ to maximise the probability this model generates observed data

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log [p(v|h, θ) p(h)]

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h, θ) using a deep network

Very popular approach – see 'variational autoencoder' and also attention mechanisms

Extension to semi-supervised method using p(v) = ∫_h ∑_c p(v|h, c) p(c) p(h)
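A minimal single-sample Monte Carlo estimate of the bound, with a Gaussian q(h|v, φ) and the reparameterisation trick; 'encode' and 'decode' stand in for deep networks and are assumptions of this sketch.

```python
# One-sample estimate of the variational bound
#   E_q[ log p(v|h, theta) + log p(h) - log q(h|v, phi) ]
# with Gaussian q(h|v, phi), standard normal prior p(h) and Gaussian p(v|h, theta).
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mean, var):
    """Sum of independent Gaussian log densities."""
    return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var))

def elbo_estimate(v, encode, decode):
    mean_h, var_h = encode(v)                      # q(h|v, phi) = N(mean_h, var_h)
    h = mean_h + np.sqrt(var_h) * rng.standard_normal(mean_h.shape)  # reparameterised sample
    mean_v = decode(h)                             # p(v|h, theta) = N(mean_v, I)
    return (log_gauss(v, mean_v, 1.0)              # reconstruction term log p(v|h, theta)
            + log_gauss(h, 0.0, 1.0)               # prior term log p(h)
            - log_gauss(h, mean_h, var_h))         # minus log q(h|v, phi)
```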

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deep generative model)

Then learn which action to take given the low dimensional representation (a sketch follows below)

Tetris

Google
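As a hedged sketch of the two-step idea above: tabular Q-learning on top of a low-dimensional state code. The 'encode' and 'env' objects are assumed to exist with the interfaces given in the comments, and DeepMind's Atari agents in fact used a deep Q-network rather than this tabular update.

```python
# Q-learning over a learned low-dimensional state code (sketch with assumed interfaces).
# Assumed: env.reset() -> screen; env.step(a) -> (screen, reward, done);
#          encode(screen) -> a hashable low-dimensional state code.
import numpy as np
from collections import defaultdict

def q_learning(env, encode, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    rng = np.random.default_rng(0)
    Q = defaultdict(lambda: np.zeros(n_actions))       # Q[state code][action]
    for _ in range(episodes):
        s, done = encode(env.reset()), False
        while not done:
            # epsilon-greedy choice of action
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
            screen, r, done = env.step(a)
            s2 = encode(screen)
            target = r + (0.0 if done else gamma * np.max(Q[s2]))   # one-step target
            Q[s][a] += alpha * (target - Q[s][a])                   # TD update
            s = s2
    return Q
```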

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 60: David Barber - Deep Nets, Bayes and the story of AI

Stubby Fingers

Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to

List the 200 most likely hidden sequences

Discard those that are not in a standard English dictionary

Take the most likely proper English word as the intended typed word

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 61: David Barber - Deep Nets, Bayes and the story of AI

Speech Recognition raw signal

0 01 02 03 04 05 06 07 08 09minus02

minus015

minus01

minus005

0

005

01

015

02

025

03

Time

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 62: David Barber - Deep Nets, Bayes and the story of AI

lsquoneuralrsquo representation

10 20 30 40 50 60 70 80

5

10

15

20

25

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 63: David Barber - Deep Nets, Bayes and the story of AI

Speech Recognition

pho1 pho2 pho3 pho4

aud1 aud2 aud3 aud4

pho phoneme (letter)aud audio signal (neural representation)

Medical Diagnosis

tumour flu meningitis

headache fever appetite x-ray

Combine known medical knowledge with patient specific information

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Probability

Why Probability

Probability is a logical calculus of uncertainty

Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)

The need for structure

We often want to make a probabilistic description of many objects (electronspins neurons customers etc )

Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented

Without introducing strong structural limitations about how these objects caninteract probability is a non-starter

For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists

Graphical Models

We can use graphs to represent how objects can probabilistically interact witheach other

Graphical Models and then a marriage between Graph and Probability theory

Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph

The computational complexity of operations can often be related to thestructure of the graph

Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science

Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable

Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long term goals.

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deep generative model).

Then learn which action to take given the low dimensional representation.
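
As a rough illustration of this second step only, here is a tabular Q-learning sketch over a discretised low-dimensional representation; the env (gym-like reset/step interface) and encode function are hypothetical, and this is not the deep Q-network approach actually used for Atari.

  import numpy as np

  rng = np.random.default_rng(0)

  def q_learning(env, encode, n_states, n_actions,
                 episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
      # env.reset() -> observation, env.step(a) -> (observation, reward, done)
      # encode(observation) -> integer in [0, n_states): the low dimensional screen representation
      Q = np.zeros((n_states, n_actions))
      for _ in range(episodes):
          s, done = encode(env.reset()), False
          while not done:
              # epsilon-greedy action selection
              a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
              obs, r, done = env.step(a)
              s2 = encode(obs)
              target = r + (0.0 if done else gamma * Q[s2].max())
              Q[s, a] += alpha * (target - Q[s, a])
              s = s2
      return Q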

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing.

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning.

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io


Uses in Industry

Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)

Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis

Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition

Used to estimate inherent desirability of products in consumer retail

Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditional probabilities

p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|C) p(E|B, C)

[Figure: the corresponding DAG, with edges A→C, B→C, C→D, B→E, C→E]

Example – Part I
Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.

Choosing an ordering
Without loss of generality we can write

p(A, R, E, B) = p(A|R, E, B) p(R, E, B)
= p(A|R, E, B) p(R|E, B) p(E, B)
= p(A|R, E, B) p(R|E, B) p(E|B) p(B)

Assumptions

The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B)
The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E)
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E)

Therefore

p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)

Example – Part II: Specifying the Tables

[Figure: the DAG with edges B→A, E→A, E→R]

p(A|B, E):

Alarm = 1   Burglar   Earthquake
0.9999      1         1
0.99        1         0
0.99        0         1
0.0001      0         0

p(R|E):

Radio = 1   Earthquake
1           1
0           0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.

Example – Part III: Inference

Initial Evidence: The alarm is sounding

p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
= Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E)
≈ 0.99

Additional Evidence: The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
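
Both numbers can be reproduced by brute-force enumeration over the binary variables; the following self-contained Python sketch uses exactly the tables above (variable names are our own):

    import itertools

    p_B = {1: 0.01, 0: 0.99}                       # p(B)
    p_E = {1: 0.000001, 0: 0.999999}               # p(E)
    p_A1 = {(1, 1): 0.9999, (1, 0): 0.99,          # p(A = 1 | B, E)
            (0, 1): 0.99, (0, 0): 0.0001}
    p_R1 = {1: 1.0, 0: 0.0}                        # p(R = 1 | E)

    def joint(a, r, e, b):
        """p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)."""
        pa = p_A1[(b, e)] if a == 1 else 1 - p_A1[(b, e)]
        pr = p_R1[e] if r == 1 else 1 - p_R1[e]
        return pa * pr * p_E[e] * p_B[b]

    def posterior_burglar(evidence):
        """p(B = 1 | evidence) by summing the joint over all consistent states."""
        num = den = 0.0
        for a, r, e, b in itertools.product([0, 1], repeat=4):
            state = {'A': a, 'R': r, 'E': e, 'B': b}
            if any(state[k] != val for k, val in evidence.items()):
                continue
            p = joint(a, r, e, b)
            den += p
            num += p if b == 1 else 0.0
        return num / den

    print(posterior_burglar({'A': 1}))             # approx 0.99
    print(posterior_burglar({'A': 1, 'R': 1}))     # approx 0.01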

Markov Models

For timeseries data v1, ..., vT we need a model p(v1:T). For causal consistency it is meaningful to consider the decomposition

p(v1:T) = ∏_{t=1}^{T} p(vt|v1:t−1)

with the convention p(vt|v1:t−1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptions
It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1, ..., vt−1) = p(vt|vt−L, ..., vt−1)

where L ≥ 1 is the order of the Markov chain

p(v1:T) = p(v1) p(v2|v1) p(v3|v2) · · · p(vT|vT−1)

For a stationary Markov chain the transitions p(vt = s′|vt−1 = s) = f(s′, s) are time-independent ('homogeneous')

Figure: (a) First order Markov chain. (b) Second order Markov chain.
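
For instance, a stationary first-order chain can be sampled from and scored directly; the two-state transition values below are toy numbers of our own:

    import numpy as np

    rng = np.random.default_rng(0)
    p1 = np.array([0.6, 0.4])                 # initial distribution p(v_1)
    f = np.array([[0.8, 0.3],                 # f[s', s] = p(v_t = s' | v_{t-1} = s)
                  [0.2, 0.7]])

    T = 6
    v = [rng.choice(2, p=p1)]
    for t in range(1, T):
        v.append(rng.choice(2, p=f[:, v[-1]]))    # sample v_t given v_{t-1}

    # log p(v_1:T) = log p(v_1) + sum_t log p(v_t | v_{t-1})
    logp = np.log(p1[v[0]]) + sum(np.log(f[v[t], v[t - 1]]) for t in range(1, T))
    print(v, logp)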

Markov Chains

v1 v2 v3 v4

p(v1, ..., vT) = p(v1) ∏_{t=2}^{T} p(vt|vt−1), with p(v1) the initial distribution and p(vt|vt−1) the transition

State transition diagram
Nodes represent states of the variable v and arcs non-zero elements of the transition p(vt|vt−1)

[Figure: state transition diagram on states 1-9]

Most probable and shortest paths

[Figure: the same state transition diagram on states 1-9]

The shortest (unweighted) path from state 1 to state 7 is 1-2-7

The most probable path from state 1 to state 7 is 1-8-9-7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1-2-7 the probability of exiting state 2 into state 7 is 1/5

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) = Σ_j p(xt = i|xt−1 = j) p(xt−1 = j) = Σ_j Mij p(xt−1 = j)

p(xt = i) is the frequency that we visit state i at time t, given we started from p(x1) and randomly drew samples from the transition p(xτ|xτ−1). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p1(i) is

pt = M^(t−1) p1

If, for t → ∞, p∞ is independent of the initial distribution p1, then p∞ is called the equilibrium distribution of the chain:

p∞ = M p∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix
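
A small numerical check of these statements (the 3-state column-stochastic matrix below is made up for the example):

    import numpy as np

    # M[i, j] = p(x_t = i | x_{t-1} = j); columns sum to one
    M = np.array([[0.90, 0.20, 0.10],
                  [0.05, 0.70, 0.30],
                  [0.05, 0.10, 0.60]])

    p = np.array([1.0, 0.0, 0.0])       # initial distribution p_1
    for _ in range(100):                # p_t = M^(t-1) p_1 by repeated multiplication
        p = M @ p
    print(p)                            # numerically at the equilibrium distribution

    # The same distribution from the eigenvector of M with eigenvalue 1
    eigvals, eigvecs = np.linalg.eig(M)
    v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    print(v / v.sum())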

PageRank

Define the matrix

Aij = 1 if website j has a hyperlink to website i, and 0 otherwise

From this we can define a Markov transition matrix with elements

Mij = Aij / Σ_{i′} Ai′j

If we jump from website to website, the equilibrium distribution component p∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain word w is then returned, ranked according to the importance of the site.
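
A minimal sketch of this construction on a toy 4-site web (the link matrix is invented, and no damping factor is used, unlike full PageRank):

    import numpy as np

    # A[i, j] = 1 if website j has a hyperlink to website i
    A = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 1],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    M = A / A.sum(axis=0, keepdims=True)    # M_ij = A_ij / sum_i' A_i'j

    p = np.full(4, 0.25)                    # start from a uniform distribution over sites
    for _ in range(200):                    # iterate to the equilibrium distribution
        p = M @ p

    print(p)                                # 'importance' p_inf(i) of each site
    print(np.argsort(-p))                   # sites ranked by importance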

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h1:T. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(vt|ht). This defines a joint distribution

p(h1:T, v1:T) = p(v1|h1) p(h1) ∏_{t=2}^{T} p(vt|ht) p(ht|ht−1)

For a stationary HMM the transition p(ht|ht−1) and emission p(vt|ht) distributions are constant through time

Figure: A first order hidden Markov model with 'hidden' variables dom(ht) = {1, ..., H}, t = 1, ..., T. The 'visible' variables vt can be either discrete or continuous.

The classical inference problems

Filtering (inferring the present): p(ht|v1:t)
Prediction (inferring the future): p(ht|v1:s), t > s
Smoothing (inferring the past): p(ht|v1:u), t < u
Likelihood: p(v1:T)
Most likely path (Viterbi alignment): argmax_{h1:T} p(h1:T|v1:T)

For prediction one is also often interested in p(vt|v1:s) for t > s
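
For concreteness, here is a short sketch of the filtering recursion for a discrete-emission HMM (all probabilities are toy values chosen for the example; alpha_t(h) is the unnormalised filtered posterior p(h_t, v_1:t)):

    import numpy as np

    p_h1 = np.array([0.5, 0.5])               # p(h_1)
    trans = np.array([[0.7, 0.3],              # trans[i, j] = p(h_t = i | h_{t-1} = j)
                      [0.3, 0.7]])
    emit = np.array([[0.9, 0.2],               # emit[v, h] = p(v_t = v | h_t = h)
                     [0.1, 0.8]])
    v = [0, 0, 1, 0]                           # an observed sequence v_1:T

    alpha = emit[v[0]] * p_h1                  # alpha_1(h) = p(v_1|h) p(h)
    for t in range(1, len(v)):
        alpha = emit[v[t]] * (trans @ alpha)   # alpha_t(h) = p(v_t|h) sum_h' p(h|h') alpha_{t-1}(h')

    print(alpha / alpha.sum())                 # filtering: p(h_T | v_1:T)
    print(alpha.sum())                         # likelihood: p(v_1:T)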

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states)

The algorithms are variants of 'message passing on factor graphs'

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes)

HMMs for speech recognition

ht is the phoneme at time t; p(ht|ht−1) – language model; p(vt|ht) – speech signal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speech recognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is some function of the phoneme, μ(ht; θ)

This function is a deep neural network trained on a large amount of data
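
A minimal sketch of such an emission model, with a tiny one-hidden-layer network standing in for the deep net (all sizes, weights and the fixed variance are illustrative assumptions):

    import numpy as np

    H, D, K = 10, 3, 16                       # phonemes, acoustic feature dim, hidden units
    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(K, H)), rng.normal(size=(D, K))
    sigma2 = 1.0                              # fixed emission variance

    def mu(h):
        """Network mean mu(h; theta) for phoneme h, fed in as a one-hot vector."""
        x = np.zeros(H)
        x[h] = 1.0
        return W2 @ np.tanh(W1 @ x)

    def log_emission(v, h):
        """log p(v_t | h_t = h) under a Gaussian with mean mu(h) and covariance sigma2 * I."""
        d = v - mu(h)
        return -0.5 * (d @ d) / sigma2 - 0.5 * D * np.log(2 * np.pi * sigma2)

    v_t = rng.normal(size=D)                  # a dummy acoustic feature vector
    print([round(log_emission(v_t, h), 2) for h in range(H)])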

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructed on the basis of a low-dimensional representation

Note that this is a Graphical Model, not a Function

The latent variables h can be sampled using p(h), and then an image sampled from p(v|h).
One cannot use an autoencoder to generate new images.
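
That generative process is just ancestral sampling; a tiny sketch with an arbitrary linear-Gaussian choice for p(v|h):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 2))                # parameters of p(v|h)
    h = rng.normal(size=2)                     # sample the low-dimensional latent h ~ p(h)
    v = W @ h + 0.1 * rng.normal(size=4)       # then sample the 'image' v ~ p(v|h)
    print(h, v)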

The bad news

Inference (computing p(h|v)) and parameter learning are intractable in these models

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method – much faster for inference

Variational Inference
Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and that we wish to learn θ to maximise the probability this model generates observed data. Then

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) + const

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently

We then jointly maximise the bound w.r.t. φ and θ

We can parameterise p(v|h, θ) using a deep network

Very popular approach – see 'variational autoencoder' and also attention mechanisms

Extension to semi-supervised methods using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h)
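
As a rough numerical illustration of the bound (not the slides' own code): with a Gaussian q(h|v, φ) it can be estimated by Monte Carlo. Here the 'deep network' for p(v|h, θ) is replaced by a simple linear map with unit-variance Gaussian noise, and all parameter values are made up.

    import numpy as np

    rng = np.random.default_rng(1)
    Dh, Dv = 2, 5
    W = rng.normal(size=(Dv, Dh))                  # decoder parameters theta: mean of p(v|h) is W h
    v = rng.normal(size=Dv)                        # an 'observed' data point

    mu_q, log_var_q = np.zeros(Dh), np.zeros(Dh)   # variational parameters phi of q(h|v, phi)

    def log_gauss(x, mean, var):
        """Log density of a diagonal Gaussian."""
        return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var))

    n_samples, bound = 1000, 0.0
    for _ in range(n_samples):
        eps = rng.normal(size=Dh)
        h = mu_q + np.exp(0.5 * log_var_q) * eps           # h ~ q(h|v, phi), reparameterised
        log_p_v_given_h = log_gauss(v, W @ h, 1.0)         # log p(v|h, theta)
        log_p_h = log_gauss(h, 0.0, 1.0)                   # log p(h), standard normal prior
        log_q = log_gauss(h, mu_q, np.exp(log_var_q))      # log q(h|v, phi)
        bound += (log_p_v_given_h + log_p_h - log_q) / n_samples

    print('lower bound estimate on log p(v|theta):', bound)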

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 69: David Barber - Deep Nets, Bayes and the story of AI

Conditional Probability and Bayesrsquo Rule

The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as

p(x|y) equiv p(x y)

p(y)=p(y|x)p(x)

p(y)(Bayesrsquo rule)

Throwing darts

p(region 5|not region 20) =p(region 5 not region 20)

p(not region 20)

=p(region 5)

p(not region 20)=

120

1920=

1

19

Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 70: David Barber - Deep Nets, Bayes and the story of AI

Battleships

Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each

Can be placed anywhere on the 10times10 grid but cannot overlap

Let s1 is the origin of ship 1 and s2 the origin of ship 2

Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses

p(s1 s2|D) =p(D|s1 s2)p(s1 s2)

p(D)Let X be the matrix of pixel occupancy

p(X|D) =sums1s2

p(X s1 s2|D) =sums1s2

p(X|s1 s2)p(s1 s2|D)

demoBattleshipsm

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 71: David Barber - Deep Nets, Bayes and the story of AI

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 72: David Barber - Deep Nets, Bayes and the story of AI

Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents

The joint distribution is obtained by taking the product of the conditionalprobabilities

p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)

p(E|BC)

A B

C

DE

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 73: David Barber - Deep Nets, Bayes and the story of AI

Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes

Choosing an orderingWithout loss of generality we can write

p(AREB) = p(A|REB)p(REB)

= p(A|REB)p(R|EB)p(EB)

= p(A|REB)p(R|EB)p(E|B)p(B)

Assumptions

The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)

Therefore

p(AREB) = p(A|EB)p(R|E)p(E)p(B)

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t | h_t). This defines a joint distribution

p(h_{1:T}, v_{1:T}) = p(v_1 | h_1) p(h_1) \prod_{t=2}^{T} p(v_t | h_t) p(h_t | h_{t-1})

For a stationary HMM the transition p(h_t | h_{t-1}) and emission p(v_t | h_t) distributions are constant through time.

Figure: A first order hidden Markov model with 'hidden' variables dom(h_t) = {1, . . . , H}, t = 1, . . . , T. The 'visible' variables v_t can be either discrete or continuous.

The classical inference problems

Filtering (inferring the present): p(h_t | v_{1:t})
Prediction (inferring the future): p(h_t | v_{1:s}), t > s
Smoothing (inferring the past): p(h_t | v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): argmax_{h_{1:T}} p(h_{1:T} | v_{1:T})

For prediction, one is also often interested in p(v_t | v_{1:s}) for t > s.
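
Filtering, for example, follows the standard forward recursion α_t(h_t) ∝ p(v_t | h_t) Σ_{h_{t-1}} p(h_t | h_{t-1}) α_{t-1}(h_{t-1}). A minimal sketch for a discrete HMM (Python/NumPy; the parameter matrices are placeholders):

```python
import numpy as np

def filtering(v, p_h1, trans, emis):
    """Forward recursion for a discrete HMM.
    p_h1[h]      = p(h_1 = h)
    trans[h, h'] = p(h_t = h | h_{t-1} = h')
    emis[x, h]   = p(v_t = x | h_t = h)
    Returns alphas[t, h] = p(h_t = h | v_{1:t}).
    """
    alphas = []
    alpha = emis[v[0]] * p_h1                    # proportional to p(h_1, v_1)
    alpha /= alpha.sum()
    alphas.append(alpha)
    for x in v[1:]:
        alpha = emis[x] * (trans @ alpha)        # p(v_t|h_t) * sum_h' p(h_t|h') alpha(h')
        alpha /= alpha.sum()                     # normalise -> p(h_t | v_{1:t})
        alphas.append(alpha)
    return np.array(alphas)

# Placeholder 2-state, 2-symbol HMM
p_h1 = np.array([0.5, 0.5])
trans = np.array([[0.8, 0.3],
                  [0.2, 0.7]])                   # columns sum to 1
emis = np.array([[0.9, 0.2],
                 [0.1, 0.8]])                    # rows: symbols, columns: states
print(filtering([0, 0, 1, 0], p_h1, trans, emis))
```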

Inference in Hidden Markov Models

Belief network representation of a HMM


Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).

The algorithms are variants of 'message passing on factor graphs'.

The algorithms are guaranteed to work if the graph is singly-connected.

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).

HMMs for speech recognition

h_t is the phoneme at time t; p(h_t | h_{t-1}) – language model; p(v_t | h_t) – speech signal model.

Deep Nets and HMMs


Recently, companies including Google have made big advances in speech recognition.

The breakthrough is to model p(v_t | h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t; θ).

This function is a deep neural network trained on a large amount of data

There is a goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
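
A rough sketch of that emission model (Python/NumPy): a phoneme index selects an embedding, a small network maps it to the mean of a Gaussian over the acoustic frame, and the emission term is the corresponding log-density. The network here is tiny and randomly initialised, and the 40 phonemes / 13-dimensional frames are placeholder sizes, not a trained system:

```python
import numpy as np

rng = np.random.default_rng(1)
n_phonemes, d_acoustic, d_hidden = 40, 13, 32     # placeholder sizes

# Placeholder parameters theta of the mean network mu(h; theta)
embed = rng.normal(size=(n_phonemes, d_hidden))
W1, b1 = rng.normal(size=(d_hidden, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_acoustic, d_hidden)), np.zeros(d_acoustic)
log_sigma2 = 0.0                                  # shared (log) variance

def mu(h):
    """Mean of p(v_t | h_t = h): a small network applied to the phoneme embedding."""
    x = np.tanh(W1 @ embed[h] + b1)
    return W2 @ x + b2

def log_emission(v, h):
    """log N(v; mu(h), sigma^2 I) -- the HMM emission term."""
    d = v - mu(h)
    return -0.5 * (d_acoustic * (np.log(2 * np.pi) + log_sigma2)
                   + d @ d / np.exp(log_sigma2))

v_frame = rng.normal(size=d_acoustic)             # a fake acoustic frame
print(log_emission(v_frame, h=3))
```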

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

[Figure: generative belief network with latent variables h_1, h_2 and visible variables v_1, . . . , v_4]

It is natural to consider that objects (images for example) can be constructed on the basis of a low dimensional representation.

Note that this is a Graphical Model, not a Function.

The latent variables h can be sampled using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.
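
Generating from such a model is just ancestral sampling: draw h from p(h), then draw v from p(v|h). A minimal sketch with a linear-Gaussian decoder (all parameters made up, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d_latent, d_visible = 2, 4                       # two latents, four 'pixels', as in the figure

W = rng.normal(size=(d_visible, d_latent))       # placeholder decoder parameters
b = np.zeros(d_visible)

def sample_image():
    """Ancestral sampling: h ~ p(h) = N(0, I), then v ~ p(v|h) = N(Wh + b, 0.1^2 I)."""
    h = rng.normal(size=d_latent)                # sample the low dimensional representation
    v = W @ h + b + 0.1 * rng.normal(size=d_visible)
    return h, v

h, v = sample_image()
print("latent:", h)
print("visible:", v)
```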

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in these models.

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method – much faster for inference.

Variational Inference

Consider a distribution

p(v | θ) = \int_h p(v | h, θ) p(h)

and that we wish to learn θ to maximise the probability that this model generates the observed data.

\log p(v | \theta) \geq -\int_h q(h | v, \phi) \log q(h | v, \phi) + \int_h q(h | v, \phi) \log p(v | h, \theta) + \text{const}

The idea is to choose a 'variational' distribution q(h | v, φ) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v | h, θ) using a deep network.

Very popular approach – see the 'variational autoencoder' and also attention mechanisms.

Extension to a semi-supervised method using p(v) = \int_h \sum_c p(v | h, c) p(c) p(h)
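
As a concrete, heavily simplified instance of this bound, the sketch below uses a Gaussian q(h | v, φ), a linear-Gaussian p(v | h, θ) and a single Monte-Carlo sample of the bound for one data point. The dimensions and 'networks' are placeholders, not the variational autoencoder or DRAW architectures themselves:

```python
import numpy as np

rng = np.random.default_rng(3)
d_v, d_h = 4, 2

# Placeholder 'networks': q(h|v,phi) = N(A v, exp(s) I), p(v|h,theta) = N(W h, I), p(h) = N(0, I)
A, s = rng.normal(size=(d_h, d_v)) * 0.1, np.zeros(d_h)      # phi
W = rng.normal(size=(d_v, d_h)) * 0.1                        # theta

def elbo_sample(v):
    """One-sample Monte-Carlo estimate of the variational lower bound on log p(v)."""
    mu_q, var_q = A @ v, np.exp(s)
    h = mu_q + np.sqrt(var_q) * rng.normal(size=d_h)          # h ~ q(h|v,phi)
    log_q = -0.5 * np.sum(np.log(2 * np.pi * var_q) + (h - mu_q) ** 2 / var_q)
    log_prior = -0.5 * np.sum(np.log(2 * np.pi) + h ** 2)
    diff = v - W @ h
    log_lik = -0.5 * np.sum(np.log(2 * np.pi) + diff ** 2)
    return log_lik + log_prior - log_q    # E_q[log p(v|h) + log p(h) - log q(h|v)]

print(elbo_sample(rng.normal(size=d_v)))
```

Jointly increasing this estimate with respect to (A, s) and W is the maximisation over φ and θ described above.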

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals.

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deep generative model).

Then learn which action to take given the low dimensional representation.
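
The 'which action to take' part can, in the simplest case, be tabular Q-learning over the (already low dimensional) state representation. The sketch below is a generic illustration with a made-up random environment standing in for the game, not the method used in the Atari work:

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions = 10, 4          # pretend: learned screen codes / joystick moves
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1

def step(s, a):
    """Stand-in environment: random next state; reward 1 only for action 0 in state 0."""
    return rng.integers(n_states), 1.0 if (s == 0 and a == 0) else 0.0

s = rng.integers(n_states)
for _ in range(5000):
    # epsilon-greedy action selection
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    # Q-learning update towards the long-term (discounted) return
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    s = s_next

print(Q[0])                          # action 0 should look best in state 0
```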

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing.

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve the interaction between reinforcement learning and representation learning.

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 74: David Barber - Deep Nets, Bayes and the story of AI

Example ndash Part II Specifying the Tables

B

A

E

R

p(A|BE)

Alarm = 1 Burglar Earthquake09999 1 1

099 1 0099 0 1

00001 0 0

p(R|E)

Radio = 1 Earthquake1 10 0

The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 75: David Barber - Deep Nets, Bayes and the story of AI

Example Part III Inference

Initial Evidence The alarm is sounding

p(B = 1|A = 1) =

sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)

=

sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum

BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099

Additional Evidence The radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001

Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 76: David Barber - Deep Nets, Bayes and the story of AI

Markov Models

For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition

p(v1T ) =

Tprodt=1

p(vt|v1tminus1)

with the convention p(vt|v1tminus1) = p(v1) for t = 1

v1 v2 v3 v4

Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 77: David Barber - Deep Nets, Bayes and the story of AI

Markov Chain

Only the recent past is relevant

p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)

where L ge 1 is the order of the Markov chain

p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)

For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)

v1 v2 v3 v4

(a)

v1 v2 v3 v4

(b)

Figure (a) First order Markov chain (b) Second order Markov chain

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 78: David Barber - Deep Nets, Bayes and the story of AI

Markov Chains

v1 v2 v3 v4

p(v1 vT ) = p(v1)︸ ︷︷ ︸initial

Tprodt=2

p(vt|vtminus1)︸ ︷︷ ︸Transition

State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)

1 2

34

56

7

8 9

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 79: David Barber - Deep Nets, Bayes and the story of AI

Most probable and shortest paths

1 2

34

56

7

8 9

The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7

The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM

h1 h2 h3 h4

v1 v2 v3 v4

Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)

The algorithms are variants of lsquomessage passing on factor graphsrsquo

Algorithm guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)

HMMs for speech recognition

ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model

Deep Nets and HMMs

h1 h2 h3 h4

v1 v2 v3 v4

Recently companies including Google have made big advances in speechrecognition

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)

This function is a deep neural network trained on a large amount of data

Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

h1 h2

v1 v2 v3 v4

It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation

Note that this is a Graphical Model not a Function

The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in thesemodels

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method ndash much faster for inference

Variational InferenceConsider a distribution

p(v|θ) =inth

p(v|h θ)p(h)

and that we wish to learn θ to maximise the probability this model generatesobserved data

log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +

inth

q(h|v φ)p(v|h θ) + const

Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently

We then jointly maximise the bound wrt φ and θ

We can parameterise p(v|h θ) using a deep network

Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms

Extension to semi-supervised method using p(v) =inth

sumc p(v|h c)p(c)p(h)

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals

Problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deepgenerative model)

Learn then which action to take given the low dimensional representation

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state of the art results in Speech Recognition Image AnalysisGame Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representationlearning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

httpsreinferio

  • History of the AI dream
  • How do brains work
  • Connectionism
  • AutoDiff
  • Fantasy Machines
  • Probability
  • Directed Graphical Models
  • Variational Generative Models
  • Reinforcement Learning
  • Outlook
Page 80: David Barber - Deep Nets, Bayes and the story of AI

Equilibrium distribution

It is interesting to know how the marginal p(xt) evolves through time

p(xt = i) =sumj

p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij

p(xtminus1 = j)

p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is

pt = Mtminus1p1

If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain

pinfin = Mpinfin

The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix

PageRank

Define the matrix

Aij =

1 if website j has a hyperlink to website i0 otherwise

From this we can define a Markov transition matrix with elements

Mij =Aijsumiprime Aiprimej

If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i

For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site

Hidden Markov Models

The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution

p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2

p(vt|ht)p(ht|htminus1)

For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time

v1 v2 v3 v4

h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous

The classical inference problems

Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax

h1T

p(h1T |v1T )

For prediction one is also often interested in p(vt|v1s) for t gt s

Inference in Hidden Markov Models

Belief network representation of a HMM:

[Figure: belief network with hidden chain h_1 → h_2 → h_3 → h_4 and emissions v_1, v_2, v_3, v_4]

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the time series (but quadratically with the number of hidden states)

The algorithms are variants of 'message passing on factor graphs'

The algorithms are guaranteed to work if the graph is singly-connected

Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes)
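For example, filtering is the forward recursion sketched below for a discrete HMM (a minimal NumPy sketch with hypothetical array names; each step costs O(H²), the whole pass O(T H²)). The discarded normalisers multiply to give the likelihood p(v_{1:T}); smoothing and Viterbi follow the same pattern with an extra backward pass.

```python
import numpy as np

def hmm_filter(obs, p_h1, trans, emit):
    """Return p(h_t | v_{1:t}) for t = 1..T.

    obs   : sequence of observed symbol indices v_1, ..., v_T
    p_h1  : (H,)   initial distribution p(h_1)
    trans : (H, H) transition matrix, trans[i, j] = p(h_t = i | h_{t-1} = j)
    emit  : (V, H) emission matrix,   emit[v, j]  = p(v_t = v | h_t = j)
    """
    alpha = emit[obs[0]] * p_h1
    alpha = alpha / alpha.sum()              # p(h_1 | v_1)
    filtered = [alpha]
    for v in obs[1:]:
        alpha = emit[v] * (trans @ alpha)    # one O(H^2) message-passing step
        alpha = alpha / alpha.sum()          # p(h_t | v_{1:t})
        filtered.append(alpha)
    return np.stack(filtered)
```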

HMMs for speech recognition

h_t is the phoneme at time t; p(h_t|h_{t-1}) – language model; p(v_t|h_t) – speech signal model

Deep Nets and HMMs

[Figure: belief network with hidden chain h_1 → h_2 → h_3 → h_4 and emissions v_1, v_2, v_3, v_4]

Recently companies including Google have made big advances in speech recognition

The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t, θ)

This function is a deep neural network, trained on a large amount of data

Gold rush at the moment to find similar breakthrough applications of deep networks in reasoning systems
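A schematic of the emission model described above (not any production system): a small network μ(h_t; θ) maps a phoneme identity to the mean of a Gaussian over the acoustic frame. The layer sizes and the fixed isotropic variance are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_phonemes, frame_dim, n_hidden = 40, 13, 64        # made-up sizes
W1 = rng.normal(0.0, 0.1, (n_hidden, n_phonemes))   # θ: weights of a small two-layer net
W2 = rng.normal(0.0, 0.1, (frame_dim, n_hidden))
sigma2 = 1.0                                         # fixed emission variance (assumption)

def mu(h):
    """Mean of p(v_t | h_t = h): a neural-network function of the phoneme."""
    one_hot = np.zeros(n_phonemes)
    one_hot[h] = 1.0
    return W2 @ np.tanh(W1 @ one_hot)

def log_emission(v, h):
    """log p(v_t | h_t) = log N(v_t; μ(h_t, θ), σ² I)."""
    d = v - mu(h)
    return -0.5 * (d @ d) / sigma2 - 0.5 * frame_dim * np.log(2 * np.pi * sigma2)
```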

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Generative Model

[Figure: generative model with latent variables h_1, h_2 and visible variables v_1, v_2, v_3, v_4]

It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation

Note that this is a Graphical Model, not a Function

The latent variables h can be sampled from p(h), and then an image sampled from p(v|h)

One cannot use an autoencoder to generate new images

The bad news

Inference (computing p(h|v) and parameter learning) is intractable in these models

Statisticians typically use sampling as an approximation

Very popular in ML to use a variational method – much faster for inference

Variational Inference

Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and that we wish to learn θ to maximise the probability this model generates the observed data

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) + const

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently

We then jointly maximise the bound with respect to φ and θ

We can parameterise p(v|h, θ) using a deep network

Very popular approach – see the 'variational autoencoder' and also attention mechanisms

Extension to a semi-supervised method using p(v) = ∫_h ∑_c p(v|h, c) p(c) p(h)
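A minimal sketch of estimating the bound by sampling from q, the idea behind the variational autoencoder. Here the Gaussian encoder/decoder functions and the standard-normal prior p(h) are assumptions for the example, and the prior term is written explicitly rather than folded into the constant. Gradients of this estimate with respect to φ and θ (through the reparameterised sample h) are what one would ascend to maximise the bound jointly.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    """Log density of a diagonal Gaussian, summed over dimensions."""
    d = x - mean
    return -0.5 * np.sum(d * d / var + np.log(2 * np.pi * var))

def elbo_estimate(v, encode, decode, n_samples=16):
    """Monte Carlo estimate of the bound on log p(v|θ).

    encode(v) -> (mu_q, var_q): Gaussian q(h|v, φ)   (assumed form)
    decode(h) -> (mu_p, var_p): Gaussian p(v|h, θ), e.g. a deep network
    The prior p(h) is taken to be a standard normal (assumption).
    """
    mu_q, var_q = encode(v)
    total = 0.0
    for _ in range(n_samples):
        h = mu_q + np.sqrt(var_q) * rng.standard_normal(mu_q.shape)   # h ~ q(h|v, φ)
        mu_p, var_p = decode(h)
        total += (log_normal(v, mu_p, var_p)        #  log p(v|h, θ)
                  + log_normal(h, 0.0, 1.0)         # + log p(h)
                  - log_normal(h, mu_q, var_q))     # - log q(h|v, φ)
    return total / n_samples
```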

DRAW

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Reinforcement Learning

Can we teach computers to play Atari video games?

Deep Reinforcement Learning

Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals

The problem is that the number of pixel states is enormous

Need to learn a low dimensional representation of the screen (use a deep generative model)

Then learn which action to take given the low dimensional representation (a minimal sketch follows below)
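A minimal sketch of the second step, assuming the screen has already been mapped to a small set of discrete codes by some learned encoder (hypothetical here): tabular Q-learning over those codes. Deep RL systems replace the table with a network, but the update has the same shape.

```python
import numpy as np

rng = np.random.default_rng(0)

n_codes, n_actions = 256, 4              # made-up sizes: screen codes and joystick actions
Q = np.zeros((n_codes, n_actions))       # estimated long-term value of action a in code s
alpha, gamma, eps = 0.1, 0.99, 0.1       # learning rate, discount, exploration

def choose_action(code):
    """Epsilon-greedy action on the low-dimensional representation."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[code]))

def q_update(code, action, reward, next_code):
    """One-step Q-learning: move Q towards reward + discounted best future value."""
    target = reward + gamma * np.max(Q[next_code])
    Q[code, action] += alpha * (target - Q[code, action])
```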

Tetris

Google

Table of Contents

History of the AI dream

How do brains work

Connectionism

AutoDiff

Fantasy Machines

Probability

Directed Graphical Models

Variational Generative Models

Reinforcement Learning

Outlook

Outlook

Machine Learning is in a boom period

Renewed interest and hope in creating AI

Combine new computational power with suitable hierarchical representations

Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing

Challenges

Improve understanding of optimisation for deep learning

Learn how to more efficiently exploit computational resources

Learn how to exploit massive databases

Improve interaction between reinforcement learning and representation learning

Marry non-symbolic (neural) with symbolic (Bayesian reasoning)

Emphasis is on scalability

Feel free to contact me at UCL or at my AI company reinfer

https://reinfer.io
