Deep Nets, Bayes and the story of AI (continued)
David Barber
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Intelligent Machinery
1948 Turing and Champernowne 'paper and pencil' chess
Intelligent Machinery
1951 Prinz mate-in-two moves chess machine
1952 Strachey programs first computer draughts algorithm
Learning Machines
1951 Oettinger makes first program that 'learns'
1955 Samuel adds 'learning' to his draughts algorithm
Logical Intelligence
1968 Risch's algorithm for integration in calculus
1972 Prolog for general logical reasoning
1997 Deep Blue defeats Kasparov
Other forms of intelligence
But is this getting us to where we'd like to be? Selfridge-Shannon film clip
Speech Recognition
Visual Processing
Natural Language modelling
Planning and decision in uncertain environments
Perhaps a different approach would be useful
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Astonishing Hypothesis (Crick)
"A person's mental activities are entirely due to the behaviour of nerve cells and the molecules that make them up and influence them."
Neurons
Visual Pathway
Information Processing in Brains
[Figure: information flows from the Real World through feature layers (Layer 1, Layer 2) to high-level concepts.]
Hierarchical, Modular, Binary, Parallel, Noisy
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Artificial Neuron (Perceptron)
[Figure: a perceptron; input neurons 1–7 feed a single output neuron through weights 1–7.]
Training an artificial neural network
Want to generalise to new images with high accuracy
Artificial Network
1957 Rosenblatt's perceptron
perceptron film clip
Connectionism
1960 Realised a perceptron can only solve simple tasks
1970 Decline in interest
1980 New computing power made training multilayer networks feasible
outputinputs
Each node (or 'neuron') computes a function of a weighted combination of parental nodes, $h_j = \sigma\left(\sum_i w_{ij} h_i\right)$
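To make the node computation concrete, here is a minimal sketch (mine, not from the talk) in Python/numpy, taking σ to be the logistic function; the seven inputs and the weights are arbitrary placeholder values:

```python
import numpy as np

def sigma(x):
    # logistic activation function
    return 1.0 / (1.0 + np.exp(-x))

def node_output(h_parents, w):
    # h_j = sigma(sum_i w_ij * h_i): a weighted combination of parental nodes
    return sigma(h_parents @ w)

rng = np.random.default_rng(0)
h = rng.random(7)            # activations of seven parental 'neurons'
w = rng.standard_normal(7)   # weights 1..7 into the output neuron
print(node_output(h, w))
```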
Neural Networks and Deep Learning
Historical Problems with Neural Nets (1990s)
NNs are difficult to train (many local optima)
Particularly difficult to train a NN with a large number of layers (say larger than around 10)
'Gradient Diffusion Problem' – difficult to assign responsibility of errors to individual 'neurons'
Machine Learning (up to 2006)
A large section of the machine learning community abandoned NNs
More principled and computationally better understood techniques (SVMs and related convex methods) replaced them
Bayesian AI (1990s onwards)
From the mid 1990s there was a realisation that pattern recognition is not sufficient for all AI purposes
Uncertainty and reasoning are not naturally representable using standard feed-forward nets
Explosion in more 'symbolic' Bayesian AI
Deep Learning
NNs have resurged in interest in the last few years (Hinton, Bengio, ...)
Also called 'deep learning'
Sense that very complex tasks (object recognition, learning complex structure in data) require going beyond simple (convex) statistical techniques
The brain uses hierarchical distributed processing, and it is likely to be for a good reason
Many problems have a hierarchical structure: images are made of parts, language is hierarchical, etc.
Why now?
New computing resources (GPU processing)
Availability of large amounts of data means that we can train nets with many parameters ($10^{10}$)
Recent evidence suggests local optima are not particularly problematic
Autoencoder
[Figure: an autoencoder; inputs $y_1, \dots, y_5$ are encoded through hidden layers ($h_1, h_2, h_3$, then a bottleneck $h_4, h_5$) and decoded ($h_6, h_7, h_8$) back to reconstructions $y_1, \dots, y_5$.]
The bottleneck forces the network to try to find a low dimensional representation of the data
Useful for unsupervised learning
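As a rough illustration of the bottleneck idea, a minimal sketch (my own; the data, layer sizes and learning rate are arbitrary assumptions, and real autoencoders are deeper) of training a tiny autoencoder by gradient descent on the reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.random((100, 5))                 # 100 data points y with D = 5 components
D, H = 5, 3                              # visible dimension and bottleneck dimension

W1 = 0.1 * rng.standard_normal((D, H))   # encoder weights
W2 = 0.1 * rng.standard_normal((H, D))   # decoder weights

for step in range(2000):
    Hcode = np.tanh(Y @ W1)              # low dimensional code h
    Yhat = Hcode @ W2                    # reconstruction of y
    err = Yhat - Y
    # gradients of the (halved) mean squared reconstruction error
    gW2 = Hcode.T @ err / len(Y)
    gW1 = Y.T @ ((err @ W2.T) * (1 - Hcode**2)) / len(Y)
    W1 -= 0.5 * gW1
    W2 -= 0.5 * gW2

print(np.mean(err**2))                   # reconstruction error after training
```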
Autoencoder on MNIST digits (Hinton 2006 Science)
Figure: Reconstructions using H = 30 components. From the top: original image, Autoencoder 1, Autoencoder 2, PCA.
60,000 training images ($28 \times 28 = 784$ pixels)
Use a form of autoencoder to find a lower (30) dimensional representation
At the time, the special layerwise training procedure was considered fundamental to the success of this approach. Now not deemed necessary, provided we use a sensible initialisation
Google Cats
10 million YouTube video frames ($200 \times 200$ pixel images)
Use a specialised autoencoder with 9 layers (1 billion weights)
2000 computers + two weeks of computing
Examine units to see what images they most respond to
Google Autoencoder
From Nando De Freitas
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond not to macro features (such as bicycles) but to micro features
For example, in handwritten digit recognition they correspond to small constituent parts of the digits
These are then used to process the image into a representation that is better for recognition
NNs in NLP
Bag of Words
We have D words in a dictionary (aardvark, ..., zorro) so that we can relate each word with its dictionary index
We can also think of this as a Euclidean embedding e
aardvark $\to e_{aardvark} = (1, 0, \dots, 0)^{\mathsf{T}}$, ..., zorro $\to e_{zorro} = (0, \dots, 0, 1)^{\mathsf{T}}$
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) v that are learned
The objective is, for example, next-word prediction accuracy
These are often called 'neural language models'
NNs in NLP
Each word w in the dictionary has an associated embedding vector $v_w$. Usually around 200-dimensional vectors are used
Consider the sentence
the cat sat on the mat
and that we wish to predict the word 'on' given the two preceding words 'cat sat' and the two succeeding words 'the mat'
We can use a network that has inputs $v_{cat}, v_{sat}, v_{the}, v_{mat}$
The output of the network is a probability over all words in the dictionary, $p(w|v_{\text{inputs}})$. We want $p(w = \text{on}|v_{cat}, v_{sat}, v_{the}, v_{mat})$ to be high
The overall objective is then to learn all the word embeddings and network parameters subject to predicting the word correctly based on the context
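To make the setup concrete, a minimal sketch (assumptions: a toy five-word dictionary, 8-dimensional embeddings instead of the ~200 used in practice, and a single softmax layer; training is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["the", "cat", "sat", "on", "mat"]     # toy dictionary
idx = {w: i for i, w in enumerate(words)}
D, E = len(words), 8                           # dictionary size, embedding dimension

V = 0.1 * rng.standard_normal((D, E))          # learnable word embeddings v_w
W = 0.1 * rng.standard_normal((4 * E, D))      # maps the concatenated context to word scores

def p_word_given_context(context):
    # concatenate the four context embeddings, then apply a softmax layer
    x = np.concatenate([V[idx[w]] for w in context])
    scores = x @ W
    e = np.exp(scores - scores.max())
    return e / e.sum()

p = p_word_given_context(["cat", "sat", "the", "mat"])
print(dict(zip(words, np.round(p, 3))))        # training would push p("on") up
```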
Word Embeddings
Given a word (France, for example) we can find which words w have embedding vectors closest to $v_{France}$. From Ronan Collobert (2011)
Word Embeddings
There appears to be a natural 'geometry' to the embeddings. For example, there are directions that correspond to gender:
$v_{woman} - v_{man} \approx v_{aunt} - v_{uncle}$
$v_{woman} - v_{man} \approx v_{queen} - v_{king}$
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France–Paris, we get the 'relationship' embedding
$$v = v_{Paris} - v_{France}$$
Given Italy, we can calculate $v_{Italy} + v$ and find the word in the dictionary which has the closest embedding to this (it turns out to be Rome). From Mikolov (2013)
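A minimal sketch of this nearest-embedding lookup (the embeddings below are random placeholders; with trained vectors one would hope $v_{Italy} + v$ lands near $v_{Rome}$):

```python
import numpy as np

def closest_word(v, emb, exclude=()):
    # return the word whose embedding has the highest cosine similarity to v
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], v))

rng = np.random.default_rng(0)
words = ["France", "Paris", "Italy", "Rome", "Germany", "Berlin"]
emb = {w: rng.standard_normal(200) for w in words}   # placeholder, untrained embeddings

v = emb["Paris"] - emb["France"]                     # the 'relationship' direction
print(closest_word(emb["Italy"] + v, emb, exclude={"Italy", "Paris", "France"}))
```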
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for Chinese words.
However, when we know that a Chinese and an English word have a similar meaning, we add a constraint that the word embeddings $v_{ChineseWord}$ and $v_{EnglishWord}$ should be close.
We have only a small amount of labelled 'similar' Chinese–English words (these are the green border boxes in the above; they are standard translations of the corresponding Chinese character).
We can visualise the embedding vectors in 2D (using t-SNE). See Socher (2013).
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank: consists of parsed sentences with sentiment labels (−−, −, 0, +, ++) for each node (phrase) in the tree; 215,000 labelled phrases (obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and Embeddings
Training
We have a softmax classifier for each node in the tree to predict the sentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree, the inputs to the classifiers are the word embeddings
The embeddings are combined by another network g with common parameters, which forms the input to the sentiment classifier
We then learn all the embeddings, shared classifier parameters and shared combination parameters to maximise the classification accuracy
Prediction
For a new movie review, the review is first parsed using a standard grammar tree parser
This forms the tree, which can be used to recursively form the sentiment class label for the review
Currently the best sentiment classifier. Socher (2013)
Recursive Nets and Embeddings

[Figure: parse trees with a predicted sentiment label at each node, for positive and negative example sentences and their negations; from Socher (2013).]
Recurrent Nets
[Figure: an RNN unrolled through time; inputs $x_1, x_2, x_3$, hidden states $h_1, h_2, h_3$ and outputs $y_1, y_2, y_3$, with weight matrices A, B, C shared across timesteps.]
RNNs are used in timeseries applications
The basic idea is that the hidden units at time t (and possibly the output $y_t$) depend on the previous state of the network $h_{t-1}, x_{t-1}, y_{t-1}$, for inputs $x_t$ and outputs $y_t$
In the above network I 'unrolled the net through time' to give a standard NN diagram
I omitted the potential links from $x_{t-1}, y_{t-1}$ to $h_t$
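A minimal sketch of reading the unrolled diagram as code (my labelling assumption: A maps the previous hidden state, B the current input and C the hidden-to-output readout; all sizes are arbitrary):

```python
import numpy as np

def rnn_step(h_prev, x_t, A, B, C):
    # h_t depends on the previous hidden state and the current input;
    # y_t is read out from h_t, with weights shared across time
    h_t = np.tanh(A @ h_prev + B @ x_t)
    y_t = C @ h_t
    return h_t, y_t

rng = np.random.default_rng(0)
Hdim, Xdim, Ydim, T = 4, 3, 2, 5
A = 0.5 * rng.standard_normal((Hdim, Hdim))
B = 0.5 * rng.standard_normal((Hdim, Xdim))
C = 0.5 * rng.standard_normal((Ydim, Hdim))

h = np.zeros(Hdim)
for t in range(T):                        # unroll the net through time
    h, y = rnn_step(h, rng.standard_normal(Xdim), A, B, C)
    print(t, y)
```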
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples. The top line is real handwriting, for comparison. See Alex Graves's work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff?
AutoDiff takes a function f(x) and returns an exact value (up to machine accuracy) for the gradient

$$g_i(x) \equiv \left.\frac{\partial f}{\partial x_i}\right|_{x}$$
Note that this is not the same as a numerical approximation (such as central differences) for the gradient
One can show that, if done efficiently, one can always calculate the gradient in less than 5 times the time it takes to compute f(x)
Reverse Differentiation

A useful graphical representation is that the total derivative of f with respect to x is given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the functions on the edges:

$$\frac{df}{dx} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial g}\frac{dg}{dx}$$

[Figure: graph from x to f, directly with edge $\partial f/\partial x$, and via g with edges $dg/dx$ and $\partial f/\partial g$.]
Example

For $f(x) = x^2 + xgh$ where $g = x^2$ and $h = xg^2$:

[Figure: computation graph from x to f, directly and via g and h, with edge derivatives $2x + gh$ ($x \to f$), $2x$ ($x \to g$), $g^2$ ($x \to h$), $2gx$ ($g \to h$), $xh$ ($g \to f$) and $xg$ ($h \to f$).]

$$f'(x) = (2x + gh) + (g^2 \cdot xg) + (2x \cdot 2gx \cdot xg) + (2x \cdot xh) = 2x + 8x^7$$
Reverse Differentiation

Consider

$$f(x_1, x_2) = \cos(\sin(x_1 x_2))$$

We can represent this computationally using an Abstract Syntax Tree (AST):

[Figure: AST with leaves $x_1, x_2$ feeding node $f_1$, which feeds $f_2$, which feeds $f_3$.]

$$f_1(x_1, x_2) = x_1 x_2, \qquad f_2(x) = \sin(x), \qquad f_3(x) = \cos(x)$$

Given values for $x_1, x_2$, we first run forwards through the tree so that we can associate each node with an actual function value
Reverse Differentiation

$$\frac{df_3}{dx_1} = \frac{\partial f_3}{\partial f_2}\frac{df_2}{dx_1} = \underbrace{\frac{\partial f_3}{\partial f_2}\frac{df_2}{df_1}}_{df_3/df_1}\frac{df_1}{dx_1}$$

Similarly,

$$\frac{df_3}{dx_2} = \underbrace{\frac{\partial f_3}{\partial f_2}\frac{df_2}{df_1}}_{df_3/df_1}\frac{df_1}{dx_2}$$

The two derivatives share the same computation branch, and we want to exploit this
Reverse Differentiation

The node derivatives are

$$\frac{\partial f_1}{\partial x_1} = x_2, \quad \frac{\partial f_1}{\partial x_2} = x_1, \quad \frac{\partial f_2}{\partial f_1} = \cos(f_1), \quad \frac{\partial f_3}{\partial f_2} = -\sin(f_2)$$

1. Find the reverse ancestral (backwards) schedule of nodes $(f_3, f_2, f_1, x_1, x_2)$
2. Start with the first node $n_1$ in the reverse schedule and define $t_{n_1} = 1$
3. For the next node n in the reverse schedule, find the child nodes $\mathrm{ch}(n)$. Then define
$$t_n = \sum_{c \in \mathrm{ch}(n)} \frac{\partial f_c}{\partial f_n} t_c$$
4. The total derivatives of f with respect to the root nodes of the tree (here $x_1$ and $x_2$) are given by the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutine to efficiently compute the gradient. It is efficient because information is collected at nodes in the tree and split between parents only when required
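A minimal sketch of the reverse schedule applied by hand to the AST above, checked against central differences (the numerical inputs are arbitrary):

```python
import numpy as np

def reverse_diff(x1, x2):
    # forward pass: associate each node of f(x1, x2) = cos(sin(x1 * x2)) with a value
    f1 = x1 * x2
    f2 = np.sin(f1)
    f3 = np.cos(f2)
    # backward pass: t_n = sum over children c of (df_c/df_n) * t_c
    t_f3 = 1.0
    t_f2 = -np.sin(f2) * t_f3    # df3/df2
    t_f1 = np.cos(f1) * t_f2     # df2/df1; this branch is shared by both derivatives
    t_x1 = x2 * t_f1             # df1/dx1
    t_x2 = x1 * t_f1             # df1/dx2
    return f3, (t_x1, t_x2)

f, grad = reverse_diff(0.3, 0.7)
print(f, grad)
eps = 1e-6                       # compare with a central-difference approximation
print((reverse_diff(0.3 + eps, 0.7)[0] - reverse_diff(0.3 - eps, 0.7)[0]) / (2 * eps))
```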
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another, and requires complex reasoning using some form of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
[Figure: belief network; burglar positions $pos_1, \dots, pos_4$ with emitted sounds $snd_1, \dots, snd_4$.]
pos – position in kitchen; snd – sound
Finding the Burglar

[Figure, shown over three slides: the kitchen grid at successive timesteps, with the observed creaks and bumps and the inferred distribution over the burglar's position.]
Stubby Fingers
Stubby Fingers
[Figure: belief network; intended keys $int_1, \dots, int_4$ with hit keys $hit_1, \dots, hit_4$.]
int – intended key; hit – hit key
Stubby Fingers: errors

[Figure: error matrix $p(hit|int)$ over the keys a–z; values range roughly from 0.05 to 0.55.]
Stubby Fingers: language

[Figure: letter transition matrix $p(int_t|int_{t-1})$ over the keys a–z; values range from 0 to 0.9.]
Stubby Fingers
Given the typed sequence 'cwsykcak', what is the most likely word that this corresponds to?
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
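As a simplification of this N-best procedure, a minimal sketch that scores candidate dictionary words directly (the per-key error model, its parameter and the tiny dictionary are placeholder assumptions; a real run would use the error and language matrices above and a full English dictionary):

```python
import numpy as np

def p_emit(intended, hit, p_correct=0.7):
    # placeholder error model: the intended key is hit with probability p_correct,
    # otherwise any one of the 25 other keys uniformly
    return p_correct if hit == intended else (1 - p_correct) / 25

def log_score(word, typed):
    # log p(hit_{1:T} | int_{1:T} = word), assuming a uniform prior over candidates
    return sum(np.log(p_emit(a, b)) for a, b in zip(word, typed))

typed = "cwsykcak"
dictionary = ["carrycot", "casebook", "cashdesk"]   # placeholder English dictionary
candidates = [w for w in dictionary if len(w) == len(typed)]
print(max(candidates, key=lambda w: log_score(w, typed)))
```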
Speech Recognition: raw signal

[Figure: raw audio waveform; amplitude against time in seconds.]

'neural' representation

[Figure: time–frequency ('neural') representation of the same signal.]
Speech Recognition
[Figure: belief network; phonemes $pho_1, \dots, pho_4$ with audio signals $aud_1, \dots, aud_4$.]
pho – phoneme (letter); aud – audio signal (neural representation)
Medical Diagnosis
[Figure: belief network; diseases (tumour, flu, meningitis) as parents of findings (headache, fever, appetite, x-ray).]
Combine known medical knowledge with patient-specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability?
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the Ising Model (1920) and in AI applications such as the HMM (Baum 1966, Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.)
Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects can interact, probability is a non-starter
For this reason, computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however, these are typically frowned upon by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact with each other
Graphical Models are then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph
The computational complexity of operations can often be related to the structure of the graph
Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable
Uses in Industry
Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis
Google, Microsoft, Facebook: used in many places, including advertising, video game prediction and speech recognition
Used to estimate the inherent desirability of products in consumer retail
Microsoft and others: attempt to go beyond simple A/B testing by using Graphical Models to model the whole company–user relationship
Conditional Probability and Bayes' Rule

The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

$$p(x|y) \equiv \frac{p(x, y)}{p(y)} = \frac{p(y|x)p(x)}{p(y)} \quad \text{(Bayes' rule)}$$
Throwing darts

$$p(\text{region 5}\,|\,\text{not region 20}) = \frac{p(\text{region 5},\, \text{not region 20})}{p(\text{not region 20})} = \frac{p(\text{region 5})}{p(\text{not region 20})} = \frac{1/20}{19/20} = \frac{1}{19}$$
Interpretation

p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation should be 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'
Battleships
Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each
They can be placed anywhere on the $10 \times 10$ grid, but cannot overlap
Let $s_1$ be the origin of ship 1 and $s_2$ the origin of ship 2
Data D is a collection of query 'hit' or 'miss' responses
$$p(s_1, s_2|D) = \frac{p(D|s_1, s_2)p(s_1, s_2)}{p(D)}$$

Let X be the matrix of pixel occupancy:

$$p(X|D) = \sum_{s_1, s_2} p(X, s_1, s_2|D) = \sum_{s_1, s_2} p(X|s_1, s_2)p(s_1, s_2|D)$$

demo: demoBattleships.m
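A minimal Python sketch of this posterior by brute-force enumeration (a stand-in for the demoBattleships.m demo; the query data D below is made up):

```python
import numpy as np
from itertools import product

G, L = 10, 5                     # grid size and ship length

def cells(origin, vertical):
    r, c = origin
    return [(r + i, c) if vertical else (r, c + i) for i in range(L)]

def valid(s1, s2):
    c1, c2 = cells(s1, True), cells(s2, False)
    inside = all(0 <= r < G and 0 <= c < G for r, c in c1 + c2)
    return inside and not set(c1) & set(c2)   # on the grid and non-overlapping

D = [((4, 4), True), ((0, 0), False)]         # made-up 'hit'/'miss' query responses

def consistent(s1, s2):
    occupied = set(cells(s1, True)) | set(cells(s2, False))
    return all((q in occupied) == hit for q, hit in D)

# uniform prior over valid placements; posterior p(s1, s2 | D) by enumeration
post = {(s1, s2): 1.0
        for s1, s2 in product(product(range(G), repeat=2), repeat=2)
        if valid(s1, s2) and consistent(s1, s2)}
Z = sum(post.values())

X = np.zeros((G, G))                          # marginal occupancy p(X | D)
for (s1, s2), w in post.items():
    for r, c in set(cells(s1, True)) | set(cells(s2, False)):
        X[r, c] += w / Z
print(np.round(X, 2))
```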
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditional probabilities:

$$p(A, B, C, D, E) = p(A)p(B)p(C|A, B)p(D|C)p(E|B, C)$$

[Figure: DAG with roots A and B; C a child of A and B; D a child of C; E a child of B and C.]
Example – Part I

Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes
Choosing an ordering

Without loss of generality, we can write

$$p(A, R, E, B) = p(A|R, E, B)p(R, E, B) = p(A|R, E, B)p(R|E, B)p(E, B) = p(A|R, E, B)p(R|E, B)p(E|B)p(B)$$

Assumptions:

The alarm is not directly influenced by any report on the radio: $p(A|R, E, B) = p(A|E, B)$
The radio broadcast is not directly influenced by the burglar variable: $p(R|E, B) = p(R|E)$
Burglaries don't directly 'cause' earthquakes: $p(E|B) = p(E)$

Therefore

$$p(A, R, E, B) = p(A|E, B)p(R|E)p(E)p(B)$$
Example – Part II: Specifying the Tables

[Figure: DAG with B and E parents of A, and E parent of R.]

p(A|B, E):

Alarm = 1   Burglar   Earthquake
0.9999      1         1
0.99        1         0
0.99        0         1
0.0001      0         0

p(R|E):

Radio = 1   Earthquake
1           1
0           0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution
Example Part III: Inference

Initial Evidence: The alarm is sounding

$$p(B = 1|A = 1) = \frac{\sum_{E,R} p(B = 1, E, A = 1, R)}{\sum_{B,E,R} p(B, E, A = 1, R)} = \frac{\sum_{E,R} p(A = 1|B = 1, E)p(B = 1)p(E)p(R|E)}{\sum_{B,E,R} p(A = 1|B, E)p(B)p(E)p(R|E)} \approx 0.99$$

Additional Evidence: The radio broadcasts an earthquake warning

A similar calculation gives $p(B = 1|A = 1, R = 1) \approx 0.01$

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake
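A minimal sketch that reproduces these numbers by brute-force enumeration over the tables above:

```python
from itertools import product

pB = {1: 0.01, 0: 0.99}
pE = {1: 0.000001, 0: 0.999999}
pA1 = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}  # p(A=1 | B, E)
pR1 = {1: 1.0, 0: 0.0}                                              # p(R=1 | E)

def posterior_burglar(r=None):
    # p(B=1 | A=1) if r is None, else p(B=1 | A=1, R=r), by summing the joint
    num = den = 0.0
    for b, e in product((0, 1), repeat=2):
        for rr in ((0, 1) if r is None else (r,)):
            joint = pB[b] * pE[e] * pA1[b, e] * (pR1[e] if rr == 1 else 1 - pR1[e])
            den += joint
            if b == 1:
                num += joint
    return num / den

print(posterior_burglar())      # ~0.99: the alarm alone suggests a burglary
print(posterior_burglar(r=1))   # ~0.01: the earthquake report explains the alarm away
```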
Markov Models
For timeseries data $v_1, \dots, v_T$ we need a model $p(v_{1:T})$. For causal consistency it is meaningful to consider the decomposition

$$p(v_{1:T}) = \prod_{t=1}^{T} p(v_t|v_{1:t-1})$$

with the convention $p(v_t|v_{1:t-1}) = p(v_1)$ for $t = 1$.

[Figure: cascade belief network on $v_1, \dots, v_4$.]
Independence assumptions

It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future
Markov Chain
Only the recent past is relevant
$$p(v_t|v_1, \dots, v_{t-1}) = p(v_t|v_{t-L}, \dots, v_{t-1})$$

where $L \ge 1$ is the order of the Markov chain:

$$p(v_{1:T}) = p(v_1)p(v_2|v_1)p(v_3|v_2) \cdots p(v_T|v_{T-1})$$

For a stationary Markov chain the transitions $p(v_t = s'|v_{t-1} = s) = f(s', s)$ are time-independent ('homogeneous')
[Figure: (a) first-order Markov chain; (b) second-order Markov chain.]
Markov Chains
[Figure: first-order Markov chain $v_1 \to v_2 \to v_3 \to v_4$.]

$$p(v_1, \dots, v_T) = \underbrace{p(v_1)}_{\text{initial}} \prod_{t=2}^{T} \underbrace{p(v_t|v_{t-1})}_{\text{transition}}$$

State transition diagram: nodes represent states of the variable v, and arcs non-zero elements of the transition $p(v_t|v_{t-1})$

[Figure: state transition diagram on states 1–9.]
Most probable and shortest paths
[Figure: state transition diagram on states 1–9.]

The shortest (unweighted) path from state 1 to state 7 is 1−2−7

The most probable path from state 1 to state 7 is 1−8−9−7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1−2−7 the probability of exiting state 2 into state 7 is 1/5
Equilibrium distribution
It is interesting to know how the marginal $p(x_t)$ evolves through time:

$$p(x_t = i) = \sum_j \underbrace{p(x_t = i|x_{t-1} = j)}_{M_{ij}} p(x_{t-1} = j)$$

$p(x_t = i)$ is the frequency that we visit state i at time t, given we started from $p(x_1)$ and randomly drew samples from the transition $p(x_\tau|x_{\tau-1})$. As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution $p_1(i)$ is

$$p_t = M^{t-1} p_1$$

If, for $t \to \infty$, $p_\infty$ is independent of the initial distribution $p_1$, then $p_\infty$ is called the equilibrium distribution of the chain:

$$p_\infty = M p_\infty$$

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix
PageRank
Define the matrix

$$A_{ij} = \begin{cases} 1 & \text{if website } j \text{ has a hyperlink to website } i \\ 0 & \text{otherwise} \end{cases}$$

From this we can define a Markov transition matrix with elements

$$M_{ij} = \frac{A_{ij}}{\sum_{i'} A_{i'j}}$$

If we jump from website to website, the equilibrium distribution component $p_\infty(i)$ is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain that word is then returned, ranked according to the importance of the site.
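A minimal sketch of computing the equilibrium distribution by repeatedly applying M (power iteration), on a made-up four-site link matrix:

```python
import numpy as np

# A[i, j] = 1 if website j has a hyperlink to website i (a made-up web of 4 sites)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
M = A / A.sum(axis=0)            # M_ij = A_ij / sum_i' A_i'j (column-normalised)

p = np.full(4, 0.25)             # any initial distribution p_1
for _ in range(200):             # p_t = M^{t-1} p_1 approaches p_inf = M p_inf
    p = M @ p
print(p)                         # relative 'importance' of each site
```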
Hidden Markov Models
The HMM defines a Markov chain on hidden (or 'latent') variables $h_{1:T}$. The observed (or 'visible') variables are dependent on the hidden variables through an emission $p(v_t|h_t)$. This defines a joint distribution

$$p(h_{1:T}, v_{1:T}) = p(v_1|h_1)p(h_1) \prod_{t=2}^{T} p(v_t|h_t)p(h_t|h_{t-1})$$

For a stationary HMM the transition $p(h_t|h_{t-1})$ and emission $p(v_t|h_t)$ distributions are constant through time.

[Figure: a first-order hidden Markov model with 'hidden' variables $\mathrm{dom}(h_t) = \{1, \dots, H\}$, $t = 1, \dots, T$. The 'visible' variables $v_t$ can be either discrete or continuous.]
The classical inference problems
Filtering (inferring the present): $p(h_t|v_{1:t})$
Prediction (inferring the future): $p(h_t|v_{1:s})$, $t > s$
Smoothing (inferring the past): $p(h_t|v_{1:u})$, $t < u$
Likelihood: $p(v_{1:T})$
Most likely path (Viterbi alignment): $\arg\max_{h_{1:T}} p(h_{1:T}|v_{1:T})$

For prediction, one is also often interested in $p(v_t|v_{1:s})$ for $t > s$
Inference in Hidden Markov Models
Belief network representation of an HMM:

[Figure: hidden chain $h_1 \to h_2 \to h_3 \to h_4$ with emissions $v_1, \dots, v_4$.]

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states)
The algorithms are variants of 'message passing on factor graphs'
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes)
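A minimal sketch of one such efficient recursion, filtering via the forward algorithm, for a toy two-state HMM (the transition and emission tables are made up):

```python
import numpy as np

def filtering(v, p1, T, E):
    # p(h_t | v_{1:t}): T[i, j] = p(h_t=i | h_{t-1}=j), E[k, i] = p(v_t=k | h_t=i)
    alpha = p1 * E[v[0]]
    alpha /= alpha.sum()
    out = [alpha.copy()]
    for vt in v[1:]:
        alpha = E[vt] * (T @ alpha)   # predict with the transition, weight by the emission
        alpha /= alpha.sum()          # normalise to get the filtered posterior
        out.append(alpha.copy())
    return np.array(out)              # linear in T, quadratic in the number of states

T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
E = np.array([[0.7, 0.1],             # p(v=0 | h=0), p(v=0 | h=1)
              [0.3, 0.9]])            # p(v=1 | h=0), p(v=1 | h=1)
print(filtering([0, 1, 1, 0], np.array([0.5, 0.5]), T, E))
```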
HMMs for speech recognition
$h_t$ is the phoneme at time t; $p(h_t|h_{t-1})$ – language model; $p(v_t|h_t)$ – speech signal model
Deep Nets and HMMs
[Figure: HMM belief network; hidden phonemes $h_1, \dots, h_4$ emitting observations $v_1, \dots, v_4$.]
Recently, companies including Google have made big advances in speech recognition
The breakthrough is to model $p(v_t|h_t)$ as a Gaussian whose mean is some function of the phoneme, $\mu(h_t; \theta)$
This function is a deep neural network, trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
[Figure: generative belief network; latent variables $h_1, h_2$ as parents of visible variables $v_1, \dots, v_4$.]

It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation
Note that this is a Graphical Model, not a Function
The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v)) and parameter learning are intractable in these models
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational Inference

Consider a distribution

$$p(v|\theta) = \int_h p(v|h, \theta)p(h)$$

and that we wish to learn θ to maximise the probability this model generates observed data:

$$\log p(v|\theta) \ge -\int_h q(h|v, \phi) \log q(h|v, \phi) + \int_h q(h|v, \phi) \log p(v|h, \theta) + \text{const}$$

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently
We then jointly maximise the bound with respect to φ and θ
We can parameterise p(v|h, θ) using a deep network
Very popular approach – see the 'variational autoencoder' and also attention mechanisms
Extension to semi-supervised method using $p(v) = \int_h \sum_c p(v|h, c)p(c)p(h)$
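A minimal sketch (entirely illustrative: a factorised Gaussian q, a standard normal prior, a Gaussian likelihood whose mean comes from a tiny one-layer network, and normalisation constants dropped throughout) of a Monte Carlo estimate of this bound:

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_mean(h, theta):
    # p(v | h, theta): Gaussian whose mean is a tiny 'deep' network of h
    W1, W2 = theta
    return W2 @ np.tanh(W1 @ h)

def elbo_estimate(v, phi, theta, n_samples=200):
    mu, log_sigma = phi                       # q(h | v, phi) = N(mu, diag(sigma^2))
    sigma = np.exp(log_sigma)
    total = 0.0
    for _ in range(n_samples):
        h = mu + sigma * rng.standard_normal(mu.shape)             # sample h ~ q
        log_p_v = -0.5 * np.sum((v - decoder_mean(h, theta))**2)   # log p(v|h,theta) + const
        log_p_h = -0.5 * np.sum(h**2)                              # log p(h) + const
        log_q = -0.5 * np.sum(((h - mu) / sigma)**2) - np.sum(log_sigma)
        total += log_p_v + log_p_h - log_q
    return total / n_samples                  # lower-bounds log p(v|theta), up to constants

Hdim, Vdim = 2, 4
theta = (0.1 * rng.standard_normal((3, Hdim)), 0.1 * rng.standard_normal((Vdim, 3)))
phi = (np.zeros(Hdim), np.zeros(Hdim))
print(elbo_estimate(rng.standard_normal(Vdim), phi, theta))
```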
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games?
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take for any state of W that will be best for our long-term goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deep generative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis and Game Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representation learning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
https://reinfer.io
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Intelligent Machinery
1948 Turing and Champernowne lsquopaper and pencilrsquo chess
Intelligent Machinery
1951 Prinz mate-in-two moves chess machine
1952 Strachey programs first computer draughts algorithm
Learning Machines
1951 Oettinger makes first program that lsquolearnsrsquo
1955 Samuel adds lsquolearningrsquo to his draughts algorithm
Logical Intelligence
1968 Rischrsquos algorithm for integration in calculus
1972 Prolog for general logical reasoning
1997 Deep Blue defeats Kasparov
Other forms of intelligence
But is this getting us to where wersquod like to beSelfridge-Shannon film clip
Speech Recognition
Visual Processing
Natural Language modelling
Planning and decision in uncertain environments
Perhaps a different approach would be useful
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Astonishing Hypothesis Crick
ldquoA personrsquos mental activities are entirely due to the behaviour of nervecells and the molecules that make them up and influence themrdquo
Neurons
Visual Pathway
Information Processing in Brains
Neurons
Re
al
Wo
rld
Layer 1 Layer 2 Highminuslevel
Concepts
Feature
Hierarchical Modular Binary Parallel Noisy
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Artificial Neuron (Perceptron)
weight 7
output neuron
neuron 1neuron 2neuron 3neuron 4
neuron 7neuron 6neuron 5
inputs
weight 1
Training an artificial neural network
Want to generalise to new images with high accuracy
Artificial Network
1957 Rosenblattrsquos perceptron
perceptron film clip
Connectionism
1960 Realised a perceptron can only solve simple tasks
1970 Decline in interest
1980 New computing power made training multilayer networks feasible
outputinputs
Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(
sumi wijhi)
Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)
NNs are difficult to train (many local optima)
Particularly difficult to train a NN with a large number of layers (say largerthan around 10)
lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo
Machine Learning (up to 2006)
A large section of the machine learning community abandoned NNs
More principled and computationally better understood techniques (SVMsand related convex methods) replaced them
Bayesian AI (1990s onwards)
From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes
Uncertainty and reasoning are not naturally representable using standardfeed-forward nets
Explosion in more lsquosymbolicrsquo Bayesian AI
Deep Learning
NNs have resurged in interest in the last few years (Hinton Bengio )
Also called lsquodeep learningrsquo
Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques
The brain uses hierarchical distributed processing and it is likely to be for agood reason
Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc
Why now
New computing resources (GPU processing)
Availability of large amount of data means that we can train nets with manyparameters (1010)
Recent evidence suggests local optima are not particularly problematic
Autoencoder
y1 y2 y3 y4 y5
h1 h2 h3
h4 h5
y1 y2 y3 y4 y5
h6 h7 h8
The bottleneck forces the network to try to find a low dimensionalrepresentation of the data
Useful for unsupervised learning
Autoencoder on MNIST digits (Hinton 2006 Science)
Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA
60000 training images (28times 28 = 784 pixels)
Use a form of autoencoder to find a lower (30) dimensional representation
At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation
Google Cats
10 Million Youtube video frames (200x200 pixel images)
Use a specialised autoencoder with 9 layers (1 billion weights)
2000 computers + two weeks of computing
Examine units to see what images they most respond to
Google Autoencoder
From Nando De Freitas
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond (not to macro features such as bicycles)but micro features
For example in handwritten digit recognition they correspond to smallconstituent parts of the digits
These are used then to process the image into a representation that is betterfor recognition
NNs in NLP
Bag of Words
We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index
We can also think of this as a Euclidean embedding e
aardvarkrarr eaardvark =
100
zorrorarr ezorro =
001
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned
Objective is for example next word prediction accuracy
These are often called lsquoneural language modelsrsquo
NNs in NLP
Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used
Consider the sentence
the cat sat on the mat
and that we wish to predict the word on given the two preceding cat sat
and two succeeding words the mat
We can use a network that has inputs vcat vsat vthe vmat
The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high
The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context
Word Embeddings
Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)
Word Embeddings
There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender
vwoman minus vman asymp vaunt minus vuncle
vwoman minus vman asymp vqueen minus vking
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis minus vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing.
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representation learning.
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
https://reinfer.io
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Intelligent Machinery
1951 Prinz mate-in-two moves chess machine
1952 Strachey programs first computer draughts algorithm
Learning Machines
1951 Oettinger makes first program that lsquolearnsrsquo
1955 Samuel adds lsquolearningrsquo to his draughts algorithm
Logical Intelligence
1968 Rischrsquos algorithm for integration in calculus
1972 Prolog for general logical reasoning
1997 Deep Blue defeats Kasparov
Other forms of intelligence
But is this getting us to where wersquod like to beSelfridge-Shannon film clip
Speech Recognition
Visual Processing
Natural Language modelling
Planning and decision in uncertain environments
Perhaps a different approach would be useful
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Astonishing Hypothesis Crick
ldquoA personrsquos mental activities are entirely due to the behaviour of nervecells and the molecules that make them up and influence themrdquo
Neurons
Visual Pathway
Information Processing in Brains
Neurons
Re
al
Wo
rld
Layer 1 Layer 2 Highminuslevel
Concepts
Feature
Hierarchical Modular Binary Parallel Noisy
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Artificial Neuron (Perceptron)
weight 7
output neuron
neuron 1neuron 2neuron 3neuron 4
neuron 7neuron 6neuron 5
inputs
weight 1
Training an artificial neural network
Want to generalise to new images with high accuracy
Artificial Network
1957 Rosenblattrsquos perceptron
perceptron film clip
Connectionism
1960 Realised a perceptron can only solve simple tasks
1970 Decline in interest
1980 New computing power made training multilayer networks feasible
outputinputs
Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(
sumi wijhi)
Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)
NNs are difficult to train (many local optima)
Particularly difficult to train a NN with a large number of layers (say largerthan around 10)
lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo
Machine Learning (up to 2006)
A large section of the machine learning community abandoned NNs
More principled and computationally better understood techniques (SVMsand related convex methods) replaced them
Bayesian AI (1990s onwards)
From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes
Uncertainty and reasoning are not naturally representable using standardfeed-forward nets
Explosion in more lsquosymbolicrsquo Bayesian AI
Deep Learning
NNs have resurged in interest in the last few years (Hinton Bengio )
Also called lsquodeep learningrsquo
Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques
The brain uses hierarchical distributed processing and it is likely to be for agood reason
Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc
Why now
New computing resources (GPU processing)
Availability of large amount of data means that we can train nets with manyparameters (1010)
Recent evidence suggests local optima are not particularly problematic
Autoencoder
y1 y2 y3 y4 y5
h1 h2 h3
h4 h5
y1 y2 y3 y4 y5
h6 h7 h8
The bottleneck forces the network to try to find a low dimensionalrepresentation of the data
Useful for unsupervised learning
Autoencoder on MNIST digits (Hinton 2006 Science)
Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA
60000 training images (28times 28 = 784 pixels)
Use a form of autoencoder to find a lower (30) dimensional representation
At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation
Google Cats
10 Million Youtube video frames (200x200 pixel images)
Use a specialised autoencoder with 9 layers (1 billion weights)
2000 computers + two weeks of computing
Examine units to see what images they most respond to
Google Autoencoder
From Nando De Freitas
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond (not to macro features such as bicycles)but micro features
For example in handwritten digit recognition they correspond to smallconstituent parts of the digits
These are used then to process the image into a representation that is betterfor recognition
NNs in NLP
Bag of Words
We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index
We can also think of this as a Euclidean embedding e
aardvarkrarr eaardvark =
100
zorrorarr ezorro =
001
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned
Objective is for example next word prediction accuracy
These are often called lsquoneural language modelsrsquo
NNs in NLP
Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used
Consider the sentence
the cat sat on the mat
and that we wish to predict the word on given the two preceding cat sat
and two succeeding words the mat
We can use a network that has inputs vcat vsat vthe vmat
The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high
The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context
Word Embeddings
Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)
Word Embeddings
There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender
vwoman minus vman asymp vaunt minus vuncle
vwoman minus vman asymp vqueen minus vking
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis minus vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing.
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representation learning.
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company, reinfer: https://reinfer.io
Logical Intelligence
1968 Rischrsquos algorithm for integration in calculus
1972 Prolog for general logical reasoning
1997 Deep Blue defeats Kasparov
Other forms of intelligence
But is this getting us to where wersquod like to beSelfridge-Shannon film clip
Speech Recognition
Visual Processing
Natural Language modelling
Planning and decision in uncertain environments
Perhaps a different approach would be useful
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Astonishing Hypothesis Crick
ldquoA personrsquos mental activities are entirely due to the behaviour of nervecells and the molecules that make them up and influence themrdquo
Neurons
Visual Pathway
Information Processing in Brains
Neurons
Re
al
Wo
rld
Layer 1 Layer 2 Highminuslevel
Concepts
Feature
Hierarchical Modular Binary Parallel Noisy
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Artificial Neuron (Perceptron)
weight 7
output neuron
neuron 1neuron 2neuron 3neuron 4
neuron 7neuron 6neuron 5
inputs
weight 1
Training an artificial neural network
Want to generalise to new images with high accuracy
Artificial Network
1957 Rosenblattrsquos perceptron
perceptron film clip
Connectionism
1960 Realised a perceptron can only solve simple tasks
1970 Decline in interest
1980 New computing power made training multilayer networks feasible
outputinputs
Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(
sumi wijhi)
Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)
NNs are difficult to train (many local optima)
Particularly difficult to train a NN with a large number of layers (say largerthan around 10)
lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo
Machine Learning (up to 2006)
A large section of the machine learning community abandoned NNs
More principled and computationally better understood techniques (SVMsand related convex methods) replaced them
Bayesian AI (1990s onwards)
From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes
Uncertainty and reasoning are not naturally representable using standardfeed-forward nets
Explosion in more lsquosymbolicrsquo Bayesian AI
Deep Learning
NNs have resurged in interest in the last few years (Hinton Bengio )
Also called lsquodeep learningrsquo
Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques
The brain uses hierarchical distributed processing and it is likely to be for agood reason
Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc
Why now
New computing resources (GPU processing)
Availability of large amount of data means that we can train nets with manyparameters (1010)
Recent evidence suggests local optima are not particularly problematic
Autoencoder
y1 y2 y3 y4 y5
h1 h2 h3
h4 h5
y1 y2 y3 y4 y5
h6 h7 h8
The bottleneck forces the network to try to find a low dimensionalrepresentation of the data
Useful for unsupervised learning
Autoencoder on MNIST digits (Hinton 2006 Science)
Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA
60000 training images (28times 28 = 784 pixels)
Use a form of autoencoder to find a lower (30) dimensional representation
At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation
Google Cats
10 Million Youtube video frames (200x200 pixel images)
Use a specialised autoencoder with 9 layers (1 billion weights)
2000 computers + two weeks of computing
Examine units to see what images they most respond to
Google Autoencoder
From Nando De Freitas
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond (not to macro features such as bicycles)but micro features
For example in handwritten digit recognition they correspond to smallconstituent parts of the digits
These are used then to process the image into a representation that is betterfor recognition
NNs in NLP
Bag of Words
We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index
We can also think of this as a Euclidean embedding e
aardvarkrarr eaardvark =
100
zorrorarr ezorro =
001
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned
Objective is for example next word prediction accuracy
These are often called lsquoneural language modelsrsquo
NNs in NLP
Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used
Consider the sentence
the cat sat on the mat
and that we wish to predict the word on given the two preceding cat sat
and two succeeding words the mat
We can use a network that has inputs vcat vsat vthe vmat
The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high
The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context
Word Embeddings
Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)
Word Embeddings
There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender
vwoman minus vman asymp vaunt minus vuncle
vwoman minus vman asymp vqueen minus vking
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis minus vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve the interaction between reinforcement learning and representation learning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
https://reinfer.io
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Astonishing Hypothesis Crick
ldquoA personrsquos mental activities are entirely due to the behaviour of nervecells and the molecules that make them up and influence themrdquo
Neurons
Visual Pathway
Information Processing in Brains
Neurons
Re
al
Wo
rld
Layer 1 Layer 2 Highminuslevel
Concepts
Feature
Hierarchical Modular Binary Parallel Noisy
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Artificial Neuron (Perceptron)
weight 7
output neuron
neuron 1neuron 2neuron 3neuron 4
neuron 7neuron 6neuron 5
inputs
weight 1
Training an artificial neural network
Want to generalise to new images with high accuracy
Artificial Network
1957 Rosenblattrsquos perceptron
perceptron film clip
Connectionism
1960 Realised a perceptron can only solve simple tasks
1970 Decline in interest
1980 New computing power made training multilayer networks feasible
outputinputs
Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(
sumi wijhi)
Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)
NNs are difficult to train (many local optima)
Particularly difficult to train a NN with a large number of layers (say largerthan around 10)
lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo
Machine Learning (up to 2006)
A large section of the machine learning community abandoned NNs
More principled and computationally better understood techniques (SVMsand related convex methods) replaced them
Bayesian AI (1990s onwards)
From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes
Uncertainty and reasoning are not naturally representable using standardfeed-forward nets
Explosion in more lsquosymbolicrsquo Bayesian AI
Deep Learning
NNs have resurged in interest in the last few years (Hinton Bengio )
Also called lsquodeep learningrsquo
Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques
The brain uses hierarchical distributed processing and it is likely to be for agood reason
Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc
Why now
New computing resources (GPU processing)
Availability of large amount of data means that we can train nets with manyparameters (1010)
Recent evidence suggests local optima are not particularly problematic
Autoencoder
y1 y2 y3 y4 y5
h1 h2 h3
h4 h5
y1 y2 y3 y4 y5
h6 h7 h8
The bottleneck forces the network to try to find a low dimensionalrepresentation of the data
Useful for unsupervised learning
Autoencoder on MNIST digits (Hinton 2006 Science)
Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA
60000 training images (28times 28 = 784 pixels)
Use a form of autoencoder to find a lower (30) dimensional representation
At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation
Google Cats
10 Million Youtube video frames (200x200 pixel images)
Use a specialised autoencoder with 9 layers (1 billion weights)
2000 computers + two weeks of computing
Examine units to see what images they most respond to
Google Autoencoder
From Nando De Freitas
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond (not to macro features such as bicycles)but micro features
For example in handwritten digit recognition they correspond to smallconstituent parts of the digits
These are used then to process the image into a representation that is betterfor recognition
NNs in NLP
Bag of Words
We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index
We can also think of this as a Euclidean embedding e
aardvarkrarr eaardvark =
100
zorrorarr ezorro =
001
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned
Objective is for example next word prediction accuracy
These are often called lsquoneural language modelsrsquo
NNs in NLP
Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used
Consider the sentence
the cat sat on the mat
and that we wish to predict the word on given the two preceding cat sat
and two succeeding words the mat
We can use a network that has inputs vcat vsat vthe vmat
The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high
The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context
Word Embeddings
Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)
Word Embeddings
There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender
vwoman minus vman asymp vaunt minus vuncle
vwoman minus vman asymp vqueen minus vking
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis minus vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Information Processing in Brains
Neurons
Re
al
Wo
rld
Layer 1 Layer 2 Highminuslevel
Concepts
Feature
Hierarchical Modular Binary Parallel Noisy
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Artificial Neuron (Perceptron)
weight 7
output neuron
neuron 1neuron 2neuron 3neuron 4
neuron 7neuron 6neuron 5
inputs
weight 1
Training an artificial neural network
Want to generalise to new images with high accuracy
Artificial Network
1957 Rosenblattrsquos perceptron
perceptron film clip
Connectionism
1960 Realised a perceptron can only solve simple tasks
1970 Decline in interest
1980 New computing power made training multilayer networks feasible
outputinputs
Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(
sumi wijhi)
Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)
NNs are difficult to train (many local optima)
Particularly difficult to train a NN with a large number of layers (say largerthan around 10)
lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo
Machine Learning (up to 2006)
A large section of the machine learning community abandoned NNs
More principled and computationally better understood techniques (SVMsand related convex methods) replaced them
Bayesian AI (1990s onwards)
From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes
Uncertainty and reasoning are not naturally representable using standardfeed-forward nets
Explosion in more lsquosymbolicrsquo Bayesian AI
Deep Learning
NNs have resurged in interest in the last few years (Hinton Bengio )
Also called lsquodeep learningrsquo
Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques
The brain uses hierarchical distributed processing and it is likely to be for agood reason
Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc
Why now
New computing resources (GPU processing)
Availability of large amount of data means that we can train nets with manyparameters (1010)
Recent evidence suggests local optima are not particularly problematic
Autoencoder
y1 y2 y3 y4 y5
h1 h2 h3
h4 h5
y1 y2 y3 y4 y5
h6 h7 h8
The bottleneck forces the network to try to find a low dimensionalrepresentation of the data
Useful for unsupervised learning
Autoencoder on MNIST digits (Hinton 2006 Science)
Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA
60000 training images (28times 28 = 784 pixels)
Use a form of autoencoder to find a lower (30) dimensional representation
At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation
Google Cats
10 Million Youtube video frames (200x200 pixel images)
Use a specialised autoencoder with 9 layers (1 billion weights)
2000 computers + two weeks of computing
Examine units to see what images they most respond to
Google Autoencoder
From Nando De Freitas
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond (not to macro features such as bicycles)but micro features
For example in handwritten digit recognition they correspond to smallconstituent parts of the digits
These are used then to process the image into a representation that is betterfor recognition
NNs in NLP
Bag of Words
We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index
We can also think of this as a Euclidean embedding e
aardvarkrarr eaardvark =
100
zorrorarr ezorro =
001
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned
Objective is for example next word prediction accuracy
These are often called lsquoneural language modelsrsquo
NNs in NLP
Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used
Consider the sentence
the cat sat on the mat
and that we wish to predict the word on given the two preceding cat sat
and two succeeding words the mat
We can use a network that has inputs vcat vsat vthe vmat
The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high
The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context
Word Embeddings
Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)
Word Embeddings
There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender
vwoman minus vman asymp vaunt minus vuncle
vwoman minus vman asymp vqueen minus vking
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis minus vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Artificial Neuron (Perceptron)
weight 7
output neuron
neuron 1neuron 2neuron 3neuron 4
neuron 7neuron 6neuron 5
inputs
weight 1
Training an artificial neural network
Want to generalise to new images with high accuracy
Artificial Network
1957 Rosenblattrsquos perceptron
perceptron film clip
Connectionism
1960 Realised a perceptron can only solve simple tasks
1970 Decline in interest
1980 New computing power made training multilayer networks feasible
outputinputs
Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(
sumi wijhi)
Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)
NNs are difficult to train (many local optima)
Particularly difficult to train a NN with a large number of layers (say largerthan around 10)
lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo
Machine Learning (up to 2006)
A large section of the machine learning community abandoned NNs
More principled and computationally better understood techniques (SVMsand related convex methods) replaced them
Bayesian AI (1990s onwards)
From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes
Uncertainty and reasoning are not naturally representable using standardfeed-forward nets
Explosion in more lsquosymbolicrsquo Bayesian AI
Deep Learning
NNs have resurged in interest in the last few years (Hinton Bengio )
Also called lsquodeep learningrsquo
Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques
The brain uses hierarchical distributed processing and it is likely to be for agood reason
Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc
Why now
New computing resources (GPU processing)
Availability of large amount of data means that we can train nets with manyparameters (1010)
Recent evidence suggests local optima are not particularly problematic
Autoencoder
y1 y2 y3 y4 y5
h1 h2 h3
h4 h5
y1 y2 y3 y4 y5
h6 h7 h8
The bottleneck forces the network to try to find a low dimensionalrepresentation of the data
Useful for unsupervised learning
Autoencoder on MNIST digits (Hinton 2006 Science)
Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA
60000 training images (28times 28 = 784 pixels)
Use a form of autoencoder to find a lower (30) dimensional representation
At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation
Google Cats
10 Million Youtube video frames (200x200 pixel images)
Use a specialised autoencoder with 9 layers (1 billion weights)
2000 computers + two weeks of computing
Examine units to see what images they most respond to
Google Autoencoder
From Nando De Freitas
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond (not to macro features such as bicycles)but micro features
For example in handwritten digit recognition they correspond to smallconstituent parts of the digits
These are used then to process the image into a representation that is betterfor recognition
NNs in NLP
Bag of Words
We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index
We can also think of this as a Euclidean embedding e
aardvarkrarr eaardvark =
100
zorrorarr ezorro =
001
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned
Objective is for example next word prediction accuracy
These are often called lsquoneural language modelsrsquo
NNs in NLP
Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used
Consider the sentence
the cat sat on the mat
and that we wish to predict the word on given the two preceding cat sat
and two succeeding words the mat
We can use a network that has inputs vcat vsat vthe vmat
The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high
The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context
Word Embeddings
Given a word (France, for example) we can find which words w have embedding vectors closest to v_France. From Ronan Collobert (2011)
Word Embeddings
There appears to be a natural 'geometry' to the embeddings. For example, there are directions that correspond to gender:
v_woman − v_man ≈ v_aunt − v_uncle
v_woman − v_man ≈ v_queen − v_king
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France–Paris, we get the 'relationship' embedding
v = v_Paris − v_France
Given Italy, we can calculate v_Italy + v and find the word in the dictionary which has the closest embedding to this (it turns out to be Rome). From Mikolov (2013)
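A sketch of this analogy lookup with toy 2-dimensional vectors (the numbers are invented purely for illustration):

```python
import numpy as np

# Toy embedding table; real models learn ~200-dimensional vectors.
emb = {
    "France": np.array([1.0, 0.0]), "Paris": np.array([1.0, 1.0]),
    "Italy":  np.array([0.0, 0.2]), "Rome":  np.array([0.1, 1.1]),
}

def closest(query, exclude):
    """Word whose embedding has the highest cosine similarity to `query`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], query))

v = emb["Paris"] - emb["France"]                 # the 'capital-of' direction
print(closest(emb["Italy"] + v, exclude={"Italy", "Paris", "France"}))  # Rome
```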
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for Chinese words. However, when we know that a Chinese and an English word have a similar meaning, we add a constraint that the word embeddings v_ChineseWord and v_EnglishWord should be close. We have only a small amount of labelled 'similar' Chinese–English words (these are the green border boxes in the figure; they are standard translations of the corresponding Chinese character). We can visualise the embedding vectors in 2D (using t-SNE). See Socher (2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank: consists of parsed sentences with sentiment labels (−−, −, 0, +, ++) for each node (phrase) in the tree; 215,000 labelled phrases (obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and Embeddings: Training
We have a softmax classifier for each node in the tree to predict the sentiment of the phrase beneath that node
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree, the inputs to the classifiers are the word embeddings
The embeddings are combined by another network g, with common parameters, which forms the input to the sentiment classifier
We then learn all the embeddings, shared classifier parameters and shared combination parameters to maximise the classification accuracy, as in the sketch below
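A minimal sketch of the recursion, assuming a simple concatenate-and-squash combination g (the actual model in Socher (2013) is a more elaborate tensor network; all shapes and weights here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
K, C = 4, 5                        # embedding dim, sentiment classes (-- .. ++)
Wg = rng.normal(size=(K, 2 * K))   # shared combination network g
Ws = rng.normal(size=(C, K))       # shared softmax sentiment classifier

def combine(v_left, v_right):
    """Parent phrase embedding from its two children (shared parameters)."""
    return np.tanh(Wg @ np.concatenate([v_left, v_right]))

def sentiment(v):
    """Softmax class distribution for the phrase embedding v."""
    s = Ws @ v
    e = np.exp(s - s.max())
    return e / e.sum()

v_not, v_dull = rng.normal(size=K), rng.normal(size=K)  # leaf word embeddings
v_phrase = combine(v_not, v_dull)                       # embed 'not dull'
print(sentiment(v_phrase))    # training fits Wg, Ws and the leaf embeddings
```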
Prediction
For a new movie review, the review is first parsed using a standard grammar tree parser
This forms the tree, which can be used to recursively form the sentiment class label for the review
Currently the best sentiment classifier; see Socher (2013)
Recursive Nets and Embeddings

[Figure: RNTN prediction of positive and negative (bottom right) sentences and their negation, e.g. 'Roger Dodger is one of the most compelling variations on this theme.' versus 'Roger Dodger is one of the least compelling variations on this theme.']
Recurrent Nets
[Figure: an RNN unrolled through time, with inputs x1, x2, x3, hidden units h1, h2, h3, outputs y1, y2, y3 and shared weight matrices A, B, C.]
RNNs are used in timeseries applications
The basic idea is that the hidden units h_t at time t (and possibly the output y_t) depend on the previous state of the network h_{t−1}, x_{t−1}, y_{t−1}, for inputs x_t and outputs y_t
In the above network I 'unrolled the net through time' to give a standard NN diagram
I omitted the potential links from x_{t−1}, y_{t−1} to h_t
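A minimal numpy sketch of one common parameterisation, matching the A, B, C labels in the diagram (the sizes and the tanh nonlinearity are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
Dx, Dh, Dy = 3, 5, 2                    # input, hidden, output sizes
A = rng.normal(size=(Dh, Dx)) * 0.1     # input -> hidden
B = rng.normal(size=(Dh, Dh)) * 0.1     # hidden -> hidden (the recurrence)
C = rng.normal(size=(Dy, Dh)) * 0.1     # hidden -> output

def rnn(xs):
    """Unroll h_t = tanh(A x_t + B h_{t-1}), y_t = C h_t through time."""
    h = np.zeros(Dh)
    ys = []
    for x in xs:
        h = np.tanh(A @ x + B @ h)
        ys.append(C @ h)
    return ys

print(rnn([rng.normal(size=Dx) for _ in range(4)]))
```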
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples. The top line is real handwriting, for comparison. See Alex Graves's work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff?
AutoDiff takes a function f(x) and returns an exact value (up to machine accuracy) for the gradient

g_i(x) ≡ ∂f/∂x_i |_x
Note that this is not the same as a numerical approximation (such as central differences) for the gradient
One can show that, if done efficiently, one can always calculate the gradient in less than 5 times the time it takes to compute f(x)
Reverse Differentiation

A useful graphical representation: the total derivative of f with respect to x is given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the functions on the edges:

df/dx = ∂f/∂x + (∂f/∂g)(dg/dx)

[Diagram: node x connects directly to f with edge label ∂f/∂x, and via g with edge labels dg/dx (x → g) and ∂f/∂g (g → f).]
Example

For f(x) = x² + xgh, where g = x² and h = xg²:

[Diagram: edge labels x → f: 2x + gh; x → g: 2x; x → h: g²; g → h: 2gx; g → f: xh; h → f: xg.]

f′(x) = (2x + gh) + (g² · xg) + (2x · 2gx · xg) + (2x · xh) = 2x + 8x⁷
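A quick numerical sanity check of this result, comparing central differences against the path-sum answer:

```python
# Check f'(x) = 2x + 8x^7 for f(x) = x^2 + xgh, i.e. f(x) = x^2 + x^8
# (substituting g = x^2 and h = x g^2 = x^5).
def f(x):
    g = x ** 2
    h = x * g ** 2
    return x ** 2 + x * g * h

x, eps = 1.3, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)   # central differences
exact = 2 * x + 8 * x ** 7
print(numeric, exact)    # agree to around 1e-5
```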
Reverse Differentiation

Consider

f(x1, x2) = cos(sin(x1 x2))

We can represent this computationally using an Abstract Syntax Tree (AST):

[Diagram: leaves x1, x2 feed f1; f1 feeds f2; f2 feeds f3.]

f1(x1, x2) = x1 x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1, x2, we first run forwards through the tree so that we can associate each node with an actual function value
Reverse Differentiation

df3/dx1 = (∂f3/∂f2)(df2/dx1) = (∂f3/∂f2)(df2/df1) · df1/dx1

Similarly,

df3/dx2 = (∂f3/∂f2)(df2/df1) · df1/dx2

The shared factor (∂f3/∂f2)(df2/df1) = df3/df1 means the two derivatives share the same computation branch, and we want to exploit this
Reverse Differentiation

The local derivatives on the edges of the tree are

∂f1/∂x1 = x2,  ∂f1/∂x2 = x1
∂f2/∂f1 = cos(f1)
∂f3/∂f2 = −sin(f2)
1. Find the reverse ancestral (backwards) schedule of nodes: (f3, f2, f1, x1, x2).
2. Start with the first node n1 in the reverse schedule and define t_{n1} = 1.
3. For the next node n in the reverse schedule, find the child nodes ch(n). Then define
   t_n = Σ_{c ∈ ch(n)} (∂f_c/∂f_n) t_c
4. The total derivatives of f with respect to the root nodes of the tree (here x1 and x2) are given by the values of t at those nodes.
This is a general procedure that can be used to automatically define a subroutine to efficiently compute the gradient. It is efficient because information is collected at nodes in the tree and split between parents only when required
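A hand-unrolled sketch of the four steps for f(x1, x2) = cos(sin(x1 x2)); a general implementation would build the schedule from the AST, but here it is written out explicitly:

```python
import math

# Forward pass: associate every node with its value.
x1, x2 = 0.7, -1.2
f1 = x1 * x2
f2 = math.sin(f1)
f3 = math.cos(f2)

# Local partial derivatives on each edge of the tree.
d = {
    ("f3", "f2"): -math.sin(f2),
    ("f2", "f1"): math.cos(f1),
    ("f1", "x1"): x2,
    ("f1", "x2"): x1,
}

# Reverse schedule (f3, f2, f1, x1, x2): t_n sums contributions of children.
t = {"f3": 1.0}
t["f2"] = d[("f3", "f2")] * t["f3"]
t["f1"] = d[("f2", "f1")] * t["f2"]
t["x1"] = d[("f1", "x1")] * t["f1"]   # = df3/dx1
t["x2"] = d[("f1", "x2")] * t["f1"]   # = df3/dx2

# Check against the closed form -sin(sin(x1 x2)) cos(x1 x2) * {x2, x1}.
print(t["x1"], -math.sin(math.sin(x1 * x2)) * math.cos(x1 * x2) * x2)
print(t["x2"], -math.sin(math.sin(x1 * x2)) * math.cos(x1 * x2) * x1)
```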
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another, and requires complex reasoning using some form of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos – position in kitchen; snd – sound
Finding the Burglar

[Figure, shown across three slides: the inferred position of the burglar in the kitchen, updated as the sequence of creak and bump observations unfolds.]
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int – intended key; hit – hit key
Stubby Fingers: errors

[Figure: matrix of p(hit|intended) over the letters a–z; error probabilities range from roughly 0.05 to 0.55.]
Stubby Fingers: language

[Figure: letter-transition matrix of the language model over a–z; probabilities range from 0 to 0.9.]
Stubby Fingers
Given the typed sequence cwsykcak, what is the most likely word that this corresponds to?
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
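The core computation behind this is an HMM decoding recursion. A sketch of the single-best version (the 200-best list generalises the same recursion by keeping N candidates per state; the tables below are illustrative, not the slide's letter model):

```python
import numpy as np

def viterbi(obs, p_init, p_trans, p_emit):
    """Most likely hidden sequence argmax_h p(h_{1:T} | v_{1:T}) for an HMM.

    p_init[h], p_trans[h_t, h_{t-1}] and p_emit[v, h] are probability tables.
    """
    logd = np.log(p_init) + np.log(p_emit[obs[0]])     # delta_1(h)
    back = []
    for v in obs[1:]:
        cand = np.log(p_trans) + logd[None, :]         # [h_t, h_{t-1}]
        back.append(cand.argmax(axis=1))               # best predecessor
        logd = cand.max(axis=1) + np.log(p_emit[v])
    # Backtrack from the best final state.
    h = [int(logd.argmax())]
    for b in reversed(back):
        h.append(int(b[h[-1]]))
    return h[::-1]

# Tiny 2-state, 2-symbol example.
p_init = np.array([0.6, 0.4])
p_trans = np.array([[0.7, 0.3], [0.3, 0.7]])           # p(h_t | h_{t-1})
p_emit = np.array([[0.9, 0.2], [0.1, 0.8]])            # p(v | h)
print(viterbi([0, 0, 1, 0], p_init, p_trans, p_emit))
```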
Speech Recognition: raw signal

[Figure: raw audio waveform; amplitude roughly −0.2 to 0.3 over 0 to 0.9 seconds.]
'neural' representation

[Figure: time–frequency ('neural') representation of the signal, around 25 channels over 80 frames.]
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho – phoneme (letter); aud – audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient-specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability?
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems, such as the Ising Model (1920), and in AI applications such as the HMM (Baum 1966; Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.)
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects can interact, probability is a non-starter
For this reason, computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however, these are typically frowned upon by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact with each other
Graphical Models are then a marriage between graph theory and probability theory
Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable
Uses in Industry
Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis
Google, Microsoft, Facebook: used in many places, including advertising, video game prediction and speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company–user relationship
Conditional Probability and Bayes' Rule
The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

p(x|y) ≡ p(x, y)/p(y) = p(y|x) p(x)/p(y)    (Bayes' rule)
Throwing darts

p(region 5 | not region 20) = p(region 5, not region 20) / p(not region 20)
                            = p(region 5) / p(not region 20)
                            = (1/20) / (19/20)
                            = 1/19
Interpretation

p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation should be 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.
Battleships
Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each
Can be placed anywhere on the 10 × 10 grid, but cannot overlap
Let s1 be the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query 'hit' or 'miss' responses
p(s1, s2|D) = p(D|s1, s2) p(s1, s2) / p(D)

Let X be the matrix of pixel occupancy:

p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2) p(s1, s2|D)
demoBattleships.m
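A brute-force sketch of this computation, assuming noise-free hit/miss responses (so the likelihood of the data is 1 for consistent ship placements and 0 otherwise); this is an illustration, not the referenced demo:

```python
import numpy as np

G, L = 10, 5                       # grid size, ship length

def cells(s, vertical):
    r, c = s
    return [(r + i, c) if vertical else (r, c + i) for i in range(L)]

def placements(vertical):
    lim = G - L + 1
    return [(r, c) for r in range(lim if vertical else G)
                   for c in range(G if vertical else lim)]

D = {(4, 4): True, (0, 0): False}  # query pixel -> hit?

# Uniform prior over non-overlapping placements, zeroed where inconsistent.
post = {}
for s1 in placements(True):
    for s2 in placements(False):
        occ = set(cells(s1, True)) | set(cells(s2, False))
        if len(occ) < 2 * L:                       # ships overlap: prior 0
            continue
        if all((q in occ) == hit for q, hit in D.items()):
            post[(s1, s2)] = 1.0                   # consistent with the data
Z = sum(post.values())

# Marginal occupancy p(X_ij = occupied | D) by summing over (s1, s2).
X = np.zeros((G, G))
for (s1, s2), w in post.items():
    for cell in set(cells(s1, True)) | set(cells(s2, False)):
        X[cell] += w / Z
print(X.round(2))
```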
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents.
The joint distribution is obtained by taking the product of the conditional probabilities:
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|C) p(E|B, C)
[Diagram: DAG with edges A → C, B → C, C → D, B → E, C → E.]
Example – Part I

Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.
Choosing an ordering

Without loss of generality, we can write

p(A, R, E, B) = p(A|R, E, B) p(R, E, B)
             = p(A|R, E, B) p(R|E, B) p(E, B)
             = p(A|R, E, B) p(R|E, B) p(E|B) p(B)
Assumptions
The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B)
The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E)
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E)
Therefore
p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)
Example – Part II: Specifying the Tables
[Diagram: B → A ← E, with E → R.]
p(A|B, E):

Alarm = 1   Burglar   Earthquake
0.9999      1         1
0.99        1         0
0.99        0         1
0.0001      0         0

p(R|E):

Radio = 1   Earthquake
1           1
0           0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.
Example – Part III: Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
             = Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E) ≈ 0.99
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01
Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake
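These numbers can be reproduced by brute-force summation over the joint distribution; a sketch:

```python
pB, pE = 0.01, 1e-6
pA1 = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}  # p(A=1|B,E)
pR1 = {1: 1.0, 0: 0.0}                                              # p(R=1|E)

def joint(b, e, a, r):
    """p(A,R,E,B) = p(A|E,B) p(R|E) p(E) p(B) for binary states."""
    p = (pB if b else 1 - pB) * (pE if e else 1 - pE)
    p *= pA1[(b, e)] if a else 1 - pA1[(b, e)]
    p *= pR1[e] if r else 1 - pR1[e]
    return p

def p_burglar_given(a, r=None):
    """p(B=1 | A=a[, R=r]) by summing the joint over the hidden variables."""
    rs = [0, 1] if r is None else [r]
    num = sum(joint(1, e, a, rr) for e in (0, 1) for rr in rs)
    den = sum(joint(b, e, a, rr) for b in (0, 1) for e in (0, 1) for rr in rs)
    return num / den

print(p_burglar_given(a=1))       # ~0.99
print(p_burglar_given(a=1, r=1))  # ~0.01
```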
Markov Models
For timeseries data v_1, …, v_T we need a model p(v_{1:T}). For causal consistency it is meaningful to consider the decomposition

p(v_{1:T}) = ∏_{t=1}^{T} p(v_t|v_{1:t−1})

with the convention p(v_t|v_{1:t−1}) = p(v_1) for t = 1.
v1 v2 v3 v4
Independence assumptions

It is often natural to assume that the influence of the immediate past is more relevant than the remote past; in Markov models only a limited number of previous observations are required to predict the future.
Markov Chain
Only the recent past is relevant:

p(v_t|v_1, …, v_{t−1}) = p(v_t|v_{t−L}, …, v_{t−1})

where L ≥ 1 is the order of the Markov chain. For a first order chain (L = 1),

p(v_{1:T}) = p(v_1) p(v_2|v_1) p(v_3|v_2) ⋯ p(v_T|v_{T−1})

For a stationary Markov chain the transitions p(v_t = s′|v_{t−1} = s) = f(s′, s) are time-independent ('homogeneous').
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure: (a) First order Markov chain. (b) Second order Markov chain.
Markov Chains
v1 v2 v3 v4
p(v_1, …, v_T) = p(v_1) ∏_{t=2}^{T} p(v_t|v_{t−1}), with p(v_1) the initial distribution and p(v_t|v_{t−1}) the transition
State transition diagram

Nodes represent states of the variable v, and arcs non-zero elements of the transition p(v_t|v_{t−1})
[Diagram: state transition graph on states 1–9.]
Most probable and shortest paths
[Diagram: the same state transition graph on states 1–9.]
The shortest (unweighted) path from state 1 to state 7 is 1–2–7
The most probable path from state 1 to state 7 is 1–8–9–7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1–2–7 the probability of exiting state 2 into state 7 is 1/5
Equilibrium distribution
It is interesting to know how the marginal p(x_t) evolves through time:

p(x_t = i) = Σ_j p(x_t = i|x_{t−1} = j) p(x_{t−1} = j) ≡ Σ_j M_ij p(x_{t−1} = j)
p(x_t = i) is the frequency with which we visit state i at time t, given we started from p(x_1) and randomly drew samples from the transition p(x_τ|x_{τ−1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

p_t = M^{t−1} p_1

If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:

p_∞ = M p_∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
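A small sketch: iterating p_t = M p_{t−1} and comparing with the unit-eigenvalue eigenvector (the 2-state transition matrix is invented for illustration):

```python
import numpy as np

M = np.array([[0.9, 0.5],
              [0.1, 0.5]])        # columns are p(x_t = i | x_{t-1} = j)

p = np.array([1.0, 0.0])          # some initial distribution p_1
for _ in range(100):
    p = M @ p                     # p_t = M p_{t-1}
print(p)                          # -> [0.833..., 0.166...]

vals, vecs = np.linalg.eig(M)
v = vecs[:, np.argmax(vals.real)].real
print(v / v.sum())                # same answer from the unit eigenvector
```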
PageRank
Define the matrix

A_ij = 1 if website j has a hyperlink to website i, and 0 otherwise.

From this we can define a Markov transition matrix with elements

M_ij = A_ij / Σ_{i′} A_{i′j}
If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i
For each website i, a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain that word is then returned, ranked according to the importance of the site
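A toy PageRank sketch on an invented 4-site web, building M from A and power-iterating to the equilibrium distribution:

```python
import numpy as np

# A[i, j] = 1 if site j links to site i (links invented for illustration).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0, keepdims=True)   # M_ij = A_ij / sum_i' A_i'j

p = np.full(4, 0.25)                   # start uniform, iterate p <- M p
for _ in range(200):
    p = M @ p
print(p)   # equilibrium distribution = 'importance' of each site
```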
Hidden Markov Models
The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

p(h_{1:T}, v_{1:T}) = p(v_1|h_1) p(h_1) ∏_{t=2}^{T} p(v_t|h_t) p(h_t|h_{t−1})

For a stationary HMM the transition p(h_t|h_{t−1}) and emission p(v_t|h_t) distributions are constant through time.
v1 v2 v3 v4
h1 h2 h3 h4

Figure: A first order hidden Markov model with 'hidden' variables dom(h_t) = {1, …, H}, t = 1, …, T. The 'visible' variables v_t can be either discrete or continuous.
The classical inference problems
Filtering (inferring the present): p(h_t|v_{1:t})
Prediction (inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): argmax_{h_{1:T}} p(h_{1:T}|v_{1:T})

For prediction, one is also often interested in p(v_t|v_{1:s}) for t > s
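A sketch of filtering via the standard alpha recursion (the tables are invented for illustration); smoothing and Viterbi have the same linear-in-T structure:

```python
import numpy as np

def filtering(obs, p_init, p_trans, p_emit):
    """Filtered posteriors p(h_t | v_{1:t}) for all t."""
    a = p_emit[obs[0]] * p_init              # alpha_1(h) ∝ p(v_1|h) p(h_1)
    out = [a / a.sum()]
    for v in obs[1:]:
        a = p_emit[v] * (p_trans @ a)        # propagate, then correct
        out.append(a / a.sum())
    return out

p_init = np.array([0.5, 0.5])
p_trans = np.array([[0.8, 0.4], [0.2, 0.6]])   # p(h_t | h_{t-1}) in columns
p_emit = np.array([[0.9, 0.3], [0.1, 0.7]])    # p(v | h), rows indexed by v
print(filtering([0, 1, 1], p_init, p_trans, p_emit))
```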
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states)
The algorithms are variants of 'message passing on factor graphs'
The algorithms are guaranteed to work if the graph is singly-connected
There has been a huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes)
HMMs for speech recognition
h_t is the phoneme at time t; p(h_t|h_{t−1}) – language model; p(v_t|h_t) – speech signal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently, companies including Google have made big advances in speech recognition
The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t; θ)
This function is a deep neural network, trained on a large amount of data
There is a goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation
Note that this is a Graphical Model, not a Function
The latent variables h can be sampled using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images
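A sketch of that ancestral sampling, with a toy decoder standing in for p(v|h) (the Gaussian forms and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
H, V = 2, 4                            # latent and visible dimensions
W = rng.normal(size=(V, H))            # decoder weights (illustrative)

def sample_image(n):
    """Ancestral sampling: h ~ p(h) = N(0, I), then v ~ p(v|h)."""
    h = rng.normal(size=(n, H))                      # sample the latent cause
    mean = np.tanh(h @ W.T)                          # decoder network mean
    return mean + 0.1 * rng.normal(size=(n, V))      # v ~ N(mean, 0.1^2 I)

print(sample_image(3))    # three 'fantasised' observations
```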
The bad news
Inference (computing p(h|v)) and parameter learning are intractable in these models
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method – much faster for inference
Variational Inference

Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and that we wish to learn θ to maximise the probability this model generates observed data:

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) + const
The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h, θ) using a deep network
Very popular approach – see the 'variational autoencoder' and also attention mechanisms
Extension to a semi-supervised method using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h)
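A sketch of the bound for a toy model where everything is one-dimensional and Gaussian, estimating it by sampling from q (reparameterised, as in the variational autoencoder); the exact log p(v|θ) is available here for comparison:

```python
import numpy as np

rng = np.random.default_rng(4)

def elbo_estimate(v, theta, mu, log_sigma, n=1000):
    """Monte-Carlo estimate of the variational bound for the toy model
    p(h) = N(0,1), p(v|h) = N(theta*h, 1), q(h|v) = N(mu, sigma^2)."""
    sigma = np.exp(log_sigma)
    h = mu + sigma * rng.normal(size=n)      # samples from q (reparameterised)
    log_p_v_h = -0.5 * (v - theta * h) ** 2 - 0.5 * np.log(2 * np.pi)
    log_p_h = -0.5 * h ** 2 - 0.5 * np.log(2 * np.pi)
    log_q = -0.5 * ((h - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    return np.mean(log_p_v_h + log_p_h - log_q)

# The bound approaches log p(v|theta) as q approaches the true posterior.
v, theta = 1.5, 0.8
print(elbo_estimate(v, theta, mu=0.7, log_sigma=-0.3))
print(-0.5 * v**2 / (1 + theta**2) - 0.5 * np.log(2 * np.pi * (1 + theta**2)))
```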
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games?
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take in any state of W that will be best for our long-term goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deep generative model)
Then learn which action to take given the low dimensional representation, as sketched below
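The talk leaves the action-learning step unspecified; one standard choice is Q-learning on the encoded state. A toy sketch in which every piece (the encoder, the environment and the rewards) is an invented stand-in:

```python
import numpy as np

rng = np.random.default_rng(5)

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1

def encode(screen):
    """Hypothetical encoder mapping raw pixels to a discrete code.
    In deep RL this would be a learned generative/feature model."""
    return int(screen.sum() * 10) % n_states

def step(s, a):
    """Toy environment dynamics and reward (illustrative only)."""
    return (s + a + 1) % n_states, float(s == n_states - 1)

s = encode(rng.random(16))          # 'pixels' -> low-dimensional state
for _ in range(5000):
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    s2, r = step(s, a)
    # Q-learning update towards the one-step bootstrapped return.
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    s = s2
print(Q)
```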
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis and Game Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representation learning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer:
https://reinfer.io
Artificial Neuron (Perceptron)
weight 7
output neuron
neuron 1neuron 2neuron 3neuron 4
neuron 7neuron 6neuron 5
inputs
weight 1
Training an artificial neural network
Want to generalise to new images with high accuracy
Artificial Network
1957 Rosenblattrsquos perceptron
perceptron film clip
Connectionism
1960 Realised a perceptron can only solve simple tasks
1970 Decline in interest
1980 New computing power made training multilayer networks feasible
outputinputs
Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(
sumi wijhi)
Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)
NNs are difficult to train (many local optima)
Particularly difficult to train a NN with a large number of layers (say largerthan around 10)
lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo
Machine Learning (up to 2006)
A large section of the machine learning community abandoned NNs
More principled and computationally better understood techniques (SVMsand related convex methods) replaced them
Bayesian AI (1990s onwards)
From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes
Uncertainty and reasoning are not naturally representable using standardfeed-forward nets
Explosion in more lsquosymbolicrsquo Bayesian AI
Deep Learning
NNs have resurged in interest in the last few years (Hinton Bengio )
Also called lsquodeep learningrsquo
Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques
The brain uses hierarchical distributed processing and it is likely to be for agood reason
Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc
Why now
New computing resources (GPU processing)
Availability of large amount of data means that we can train nets with manyparameters (1010)
Recent evidence suggests local optima are not particularly problematic
Autoencoder
y1 y2 y3 y4 y5
h1 h2 h3
h4 h5
y1 y2 y3 y4 y5
h6 h7 h8
The bottleneck forces the network to try to find a low dimensionalrepresentation of the data
Useful for unsupervised learning
Autoencoder on MNIST digits (Hinton 2006 Science)
Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA
60000 training images (28times 28 = 784 pixels)
Use a form of autoencoder to find a lower (30) dimensional representation
At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation
Google Cats
10 Million Youtube video frames (200x200 pixel images)
Use a specialised autoencoder with 9 layers (1 billion weights)
2000 computers + two weeks of computing
Examine units to see what images they most respond to
Google Autoencoder
From Nando De Freitas
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond (not to macro features such as bicycles)but micro features
For example in handwritten digit recognition they correspond to smallconstituent parts of the digits
These are used then to process the image into a representation that is betterfor recognition
NNs in NLP
Bag of Words
We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index
We can also think of this as a Euclidean embedding e
aardvarkrarr eaardvark =
100
zorrorarr ezorro =
001
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned
Objective is for example next word prediction accuracy
These are often called lsquoneural language modelsrsquo
NNs in NLP
Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used
Consider the sentence
the cat sat on the mat
and that we wish to predict the word on given the two preceding cat sat
and two succeeding words the mat
We can use a network that has inputs vcat vsat vthe vmat
The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high
The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context
Word Embeddings
Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)
Word Embeddings
There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender
vwoman minus vman asymp vaunt minus vuncle
vwoman minus vman asymp vqueen minus vking
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis minus vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Training an artificial neural network
Want to generalise to new images with high accuracy
Artificial Network
1957 Rosenblattrsquos perceptron
perceptron film clip
Connectionism
1960 Realised a perceptron can only solve simple tasks
1970 Decline in interest
1980 New computing power made training multilayer networks feasible
outputinputs
Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(
sumi wijhi)
Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)
NNs are difficult to train (many local optima)
Particularly difficult to train a NN with a large number of layers (say largerthan around 10)
lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo
Machine Learning (up to 2006)
A large section of the machine learning community abandoned NNs
More principled and computationally better understood techniques (SVMsand related convex methods) replaced them
Bayesian AI (1990s onwards)
From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes
Uncertainty and reasoning are not naturally representable using standardfeed-forward nets
Explosion in more lsquosymbolicrsquo Bayesian AI
Deep Learning
NNs have resurged in interest in the last few years (Hinton Bengio )
Also called lsquodeep learningrsquo
Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques
The brain uses hierarchical distributed processing and it is likely to be for agood reason
Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc
Why now
New computing resources (GPU processing)
Availability of large amount of data means that we can train nets with manyparameters (1010)
Recent evidence suggests local optima are not particularly problematic
Autoencoder
y1 y2 y3 y4 y5
h1 h2 h3
h4 h5
y1 y2 y3 y4 y5
h6 h7 h8
The bottleneck forces the network to try to find a low dimensionalrepresentation of the data
Useful for unsupervised learning
Autoencoder on MNIST digits (Hinton 2006 Science)
Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA
60000 training images (28times 28 = 784 pixels)
Use a form of autoencoder to find a lower (30) dimensional representation
At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation
Google Cats
10 Million Youtube video frames (200x200 pixel images)
Use a specialised autoencoder with 9 layers (1 billion weights)
2000 computers + two weeks of computing
Examine units to see what images they most respond to
Google Autoencoder
From Nando De Freitas
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond (not to macro features such as bicycles)but micro features
For example in handwritten digit recognition they correspond to smallconstituent parts of the digits
These are used then to process the image into a representation that is betterfor recognition
NNs in NLP
Bag of Words
We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index
We can also think of this as a Euclidean embedding e
aardvarkrarr eaardvark =
100
zorrorarr ezorro =
001
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned
Objective is for example next word prediction accuracy
These are often called lsquoneural language modelsrsquo
NNs in NLP
Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used
Consider the sentence
the cat sat on the mat
and that we wish to predict the word on given the two preceding cat sat
and two succeeding words the mat
We can use a network that has inputs vcat vsat vthe vmat
The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high
The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context
Word Embeddings
Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)
Word Embeddings
There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender
vwoman minus vman asymp vaunt minus vuncle
vwoman minus vman asymp vqueen minus vking
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis minus vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddings

[Figure: RNTN predictions of positive and negative (bottom right) sentences and their negation, e.g. 'Roger Dodger is one of the most compelling variations on this theme' vs. 'Roger Dodger is one of the least compelling variations on this theme'. From Socher (2013).]
Recurrent Nets
[Figure: RNN unrolled through time; inputs x1, x2, x3, hidden units h1, h2, h3, outputs y1, y2, y3, with shared weight matrices A (input to hidden), B (hidden to hidden) and C (hidden to output)]
RNNs are used in timeseries applications
The basic idea is that the hidden units at time t, h_t (and possibly the output y_t), depend on the previous state of the network h_{t−1}, x_{t−1}, y_{t−1}, for inputs x_t and outputs y_t.
In the above network I 'unrolled the net through time' to give a standard NN diagram.
I omitted the potential links from x_{t−1}, y_{t−1} to h_t.
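A minimal sketch of the unrolled recurrence, with assumed shapes: the same weights A, B, C are reused at every time step.

```python
# h_t depends on x_t and h_{t-1} through shared weights A, B; C reads out y_t.
import numpy as np

rng = np.random.default_rng(3)
Dx, Dh, Dy, T = 3, 5, 2, 4
A = rng.normal(scale=0.3, size=(Dh, Dx))   # input -> hidden
B = rng.normal(scale=0.3, size=(Dh, Dh))   # hidden -> next hidden (recurrence)
C = rng.normal(scale=0.3, size=(Dy, Dh))   # hidden -> output

def unroll(xs):
    h = np.zeros(Dh)
    ys = []
    for x in xs:                      # same A, B, C at every time step
        h = np.tanh(A @ x + B @ h)    # h_t = f(x_t, h_{t-1})
        ys.append(C @ h)              # y_t read off the hidden state
    return ys

xs = [rng.normal(size=Dx) for _ in range(T)]
print(unroll(xs))
```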
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples. Top line is real handwriting, for comparison. See Alex Graves's work.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff?

AutoDiff takes a function f(x) and returns an exact value (up to machine accuracy) for the gradient

g_i(x) ≡ ∂f/∂x_i, evaluated at x

Note that this is not the same as a numerical approximation (such as central differences) for the gradient.
One can show that, if done efficiently, one can always calculate the gradient in less than 5 times the time it takes to compute f(x).
Reverse Differentiation

A useful graphical representation is that the total derivative of f with respect to x is given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the functions on the edges:

df/dx = ∂f/∂x + (∂f/∂g)(dg/dx)

[Figure: nodes x, g, f; edges labelled ∂f/∂x (x→f), dg/dx (x→g), ∂f/∂g (g→f)]
Example
For f(x) = x^2 + xgh, where g = x^2 and h = xg^2:

[Figure: computation graph with nodes x, g, h, f; edge labels 2x + gh (x→f), 2x (x→g), xh (g→f), 2gx (g→h), xg (h→f), g^2 (x→h)]

f′(x) = (2x + gh) + (g^2 · xg) + (2x · 2gx · xg) + (2x · xh) = 2x + 8x^7
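A quick numeric sanity check (an illustrative addition): the path-sum result agrees with a central-difference estimate.

```python
# Verify f'(x) = 2x + 8x^7 for f(x) = x^2 + x*g*h with g = x^2, h = x*g^2.
def f(x):
    g = x ** 2
    h = x * g ** 2
    return x ** 2 + x * g * h          # equals x^2 + x^8

x, eps = 1.3, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)   # central differences
exact = 2 * x + 8 * x ** 7
print(numeric, exact)                  # agree to numerical accuracy
```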
Reverse Differentiation

Consider

f(x1, x2) = cos(sin(x1 x2))

We can represent this computationally using an Abstract Syntax Tree (AST):

[Figure: AST with leaves x1, x2 feeding f1, then f2, then f3]

f1(x1, x2) = x1 x2
f2(x) = sin(x)
f3(x) = cos(x)

Given values for x1, x2, we first run forwards through the tree so that we can associate each node with an actual function value.
Reverse Differentiation

[Figure: the same AST]

df3/dx1 = (∂f3/∂f2) df2/dx1 = (∂f3/∂f2)(df2/df1) df1/dx1, where (∂f3/∂f2)(df2/df1) = df3/df1

Similarly,

df3/dx2 = (∂f3/∂f2)(df2/df1) df1/dx2

The two derivatives share the same computation branch, and we want to exploit this.
Reverse Differentiation

[Figure: the same AST, annotated with the local derivatives]

∂f1/∂x1 = x2,  ∂f1/∂x2 = x1,  ∂f2/∂f1 = cos(f1),  ∂f3/∂f2 = −sin(f2)

1. Find the reverse ancestral (backwards) schedule of nodes (f3, f2, f1, x1, x2).
2. Start with the first node n1 in the reverse schedule and define t_{n1} = 1.
3. For the next node n in the reverse schedule, find the child nodes ch(n). Then define

t_n = Σ_{c ∈ ch(n)} (∂f_c/∂f_n) t_c

4. The total derivatives of f with respect to the root nodes of the tree (here x1 and x2) are given by the values of t at those nodes.

This is a general procedure that can be used to automatically define a subroutine to efficiently compute the gradient. It is efficient because information is collected at nodes in the tree and split between parents only when required.
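A hedged sketch of this reverse schedule for f(x1, x2) = cos(sin(x1 x2)), written out by hand rather than with an AutoDiff library.

```python
# One forward sweep to value the nodes, one reverse sweep to accumulate t_n.
import math

def grad_f(x1, x2):
    # forward pass: associate every AST node with its value
    f1 = x1 * x2
    f2 = math.sin(f1)
    f3 = math.cos(f2)
    # reverse pass over the schedule (f3, f2, f1, x1, x2),
    # t_n = sum over children c of (df_c/df_n) * t_c, starting from t_{f3} = 1
    t3 = 1.0
    t2 = -math.sin(f2) * t3        # ∂f3/∂f2 = -sin(f2)
    t1 = math.cos(f1) * t2         # ∂f2/∂f1 = cos(f1)
    tx1 = x2 * t1                  # ∂f1/∂x1 = x2
    tx2 = x1 * t1                  # ∂f1/∂x2 = x1
    return f3, (tx1, tx2)

value, (g1, g2) = grad_f(0.7, -1.2)
print(value, g1, g2)               # exact gradient from a single reverse sweep
```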
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another, and requires complex reasoning using some form of internal model.
The world is noisy and information may be conflicting.
It was recognised that new approaches are required.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos – position in kitchen; snd – sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int – intended key; hit – hit key
Stubby Fingers: errors

[Figure: error matrix p(hit | intended) over the alphabet a–z]
Stubby Fingers: language

[Figure: letter-transition (language model) matrix over the alphabet a–z]
Stubby Fingers
Given the typed sequence cwsykcak, what is the most likely word that this corresponds to?
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
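A hedged noisy-channel sketch of this procedure, with a toy error model, toy dictionary and flat word prior; the slides' error and letter-transition matrices would replace these.

```python
# Score each candidate word w by p(typed | w) * p(w) and take the argmax.
def p_hit(hit, intended):
    # assumed error model: the intended key is struck with probability 0.6,
    # any other key with a small uniform leak
    return 0.6 if hit == intended else 0.4 / 25

def score(word, typed):
    # p(typed | word) * p(word), with a flat word prior for simplicity
    if len(word) != len(typed):
        return 0.0
    s = 1.0
    for h, i in zip(typed, word):
        s *= p_hit(h, i)
    return s

dictionary = ["the", "two", "who"]
typed = "thw"
print(max(dictionary, key=lambda w: score(w, typed)))   # -> 'the'
```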
Speech Recognition: raw signal

[Figure: raw audio waveform, amplitude vs. time (seconds)]
'neural' representation

[Figure: time–frequency ('neural') representation of the signal]
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho – phoneme (letter); aud – audio signal (neural representation)
Medical Diagnosis
[Graph: diseases (tumour, flu, meningitis) with arrows to symptoms (headache, fever, appetite, x-ray)]

Combine known medical knowledge with patient-specific information.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems, such as the Ising Model (1920), and in AI applications, such as the HMM (Baum 1966; Stratonovich 1960).
The need for structure
We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.).
Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented.
Without introducing strong structural limitations about how these objects can interact, probability is a non-starter.
For this reason, computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however, these are typically frowned upon by purists.
Graphical Models
We can use graphs to represent how objects can probabilistically interact with each other.
Graphical Models are then a marriage between Graph and Probability theory.
Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph.
The computational complexity of operations can often be related to the structure of the graph.
Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science.
Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable.
Uses in Industry
Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model).
Hospitals use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis.
Google, Microsoft, Facebook: used in many places, including advertising, video game prediction, speech recognition.
Used to estimate the inherent desirability of products in consumer retail.
Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company–user relationship.
Conditional Probability and Bayes' Rule

The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

p(x|y) ≡ p(x, y)/p(y) = p(y|x) p(x)/p(y)    (Bayes' rule)
Throwing darts
p(region 5 | not region 20) = p(region 5, not region 20)/p(not region 20)
                            = p(region 5)/p(not region 20)
                            = (1/20)/(19/20) = 1/19
Interpretation

p(A = a|B = b) should not be interpreted as 'given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation is 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.
Battleships
Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each.
Each can be placed anywhere on the 10×10 grid, but they cannot overlap.
Let s1 be the origin of ship 1 and s2 the origin of ship 2.
The data D is a collection of query 'hit' or 'miss' responses.

p(s1, s2|D) = p(D|s1, s2) p(s1, s2)/p(D)

Let X be the matrix of pixel occupancy:

p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2) p(s1, s2|D)
demoBattleships.m
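The slides refer to a MATLAB demo; here is a hedged Python sketch of the same posterior computation, with made-up query responses.

```python
# Posterior pixel occupancy p(X|D) under a uniform prior over valid placements.
import numpy as np

N, L = 10, 5
def cells(s, vertical):
    r, c = s
    return [(r + i, c) if vertical else (r, c + i) for i in range(L)]

def valid(s1, s2):
    c1, c2 = cells(s1, True), cells(s2, False)
    inside = all(0 <= r < N and 0 <= c < N for r, c in c1 + c2)
    return inside and not set(c1) & set(c2)          # ships may not overlap

# D: queried pixel -> 'hit' or 'miss' (illustrative observations)
D = {(4, 4): "hit", (0, 0): "miss"}

def consistent(s1, s2):
    occupied = set(cells(s1, True)) | set(cells(s2, False))
    return all((q in occupied) == (r == "hit") for q, r in D.items())

post = np.zeros((N, N))
total = 0
for s1 in [(r, c) for r in range(N) for c in range(N)]:
    for s2 in [(r, c) for r in range(N) for c in range(N)]:
        if valid(s1, s2) and consistent(s1, s2):
            total += 1
            for cell in set(cells(s1, True)) | set(cells(s2, False)):
                post[cell] += 1.0
print(post / total)    # probability each pixel is occupied, given the data
```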
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents.
The joint distribution is obtained by taking the product of the conditional probabilities:

p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|C) p(E|B, C)

[Graph: A and B are parents of C; C is a parent of D and E; B is also a parent of E]
Example – Part I

Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.
Choosing an ordering

Without loss of generality, we can write

p(A, R, E, B) = p(A|R, E, B) p(R, E, B)
             = p(A|R, E, B) p(R|E, B) p(E, B)
             = p(A|R, E, B) p(R|E, B) p(E|B) p(B)
Assumptions
The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B).
The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E).
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E).

Therefore

p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)
Example – Part II: Specifying the Tables

[Graph: B → A ← E; E → R]

p(A|B, E):

Alarm = 1   Burglar   Earthquake
0.9999      1         1
0.99        1         0
0.99        0         1
0.0001      0         0

p(R|E):

Radio = 1   Earthquake
1           1
0           0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.
Example – Part III: Inference

Initial Evidence: the alarm is sounding.

p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
             = Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E) ≈ 0.99

Additional Evidence: the radio broadcasts an earthquake warning.

A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01.

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
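These numbers can be reproduced by brute-force enumeration of the joint; a minimal sketch:

```python
# Enumerate p(A,R,E,B) = p(A|E,B) p(R|E) p(E) p(B) with the tables above.
from itertools import product

pB = {1: 0.01, 0: 0.99}
pE = {1: 0.000001, 0: 0.999999}
pA1 = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}  # p(A=1|B,E)
pR1 = {1: 1.0, 0: 0.0}                                               # p(R=1|E)

def joint(b, e, a, r):
    pa = pA1[(b, e)] if a == 1 else 1.0 - pA1[(b, e)]
    pr = pR1[e] if r == 1 else 1.0 - pR1[e]
    return pB[b] * pE[e] * pa * pr

def p_burglar(radio=None):
    num = den = 0.0
    for b, e, r in product([0, 1], repeat=3):
        if radio is not None and r != radio:
            continue
        w = joint(b, e, 1, r)          # clamp the evidence A = 1
        den += w
        num += w * b
    return num / den

print(p_burglar())          # p(B=1 | A=1)       ~ 0.99
print(p_burglar(radio=1))   # p(B=1 | A=1, R=1)  ~ 0.01
```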
Markov Models
For timeseries data v1, . . . , vT we need a model p(v1:T). For causal consistency, it is meaningful to consider the decomposition

p(v1:T) = Π_{t=1}^{T} p(vt|v1:t−1)

with the convention p(vt|v1:t−1) = p(v1) for t = 1.

[Chain: v1 → v2 → v3 → v4]

Independence assumptions

It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.
Markov Chain
Only the recent past is relevant:

p(vt|v1, . . . , vt−1) = p(vt|vt−L, . . . , vt−1)

where L ≥ 1 is the order of the Markov chain. For L = 1,

p(v1:T) = p(v1) p(v2|v1) p(v3|v2) · · · p(vT|vT−1)

For a stationary Markov chain the transitions p(vt = s′|vt−1 = s) = f(s′, s) are time-independent ('homogeneous').
[Figure: (a) first order Markov chain v1 → v2 → v3 → v4; (b) second order Markov chain]
Markov Chains
[Chain: v1 → v2 → v3 → v4]

p(v1, . . . , vT) = p(v1) Π_{t=2}^{T} p(vt|vt−1)

with p(v1) the initial distribution and p(vt|vt−1) the transition.
State transition diagram

Nodes represent states of the variable v, and arcs non-zero elements of the transition p(vt|vt−1).

[Figure: state transition diagram over states 1–9]
Most probable and shortest paths
[Figure: the same state transition diagram over states 1–9]

The shortest (unweighted) path from state 1 to state 7 is 1−2−7.
The most probable path from state 1 to state 7 is 1−8−9−7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1−2−7 the probability of exiting state 2 into state 7 is 1/5.
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time:

p(xt = i) = Σ_j p(xt = i|xt−1 = j) p(xt−1 = j),  with Mij ≡ p(xt = i|xt−1 = j)

p(xt = i) is the frequency with which we visit state i at time t, given we started from p(x1) and randomly drew samples from the transition p(xτ|xτ−1). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p1(i) is

pt = M^{t−1} p1

If, for t → ∞, p∞ is independent of the initial distribution p1, then p∞ is called the equilibrium distribution of the chain:

p∞ = M p∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
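A tiny sketch with a toy 2×2 transition matrix: iterating pt = M pt−1 converges to the same distribution as the unit-eigenvalue eigenvector.

```python
# Power iteration vs. eigenvector for the equilibrium distribution.
import numpy as np

M = np.array([[0.9, 0.5],     # M[i, j] = p(x_t = i | x_{t-1} = j); columns sum to 1
              [0.1, 0.5]])
p = np.array([1.0, 0.0])      # arbitrary initial distribution p_1
for _ in range(50):
    p = M @ p                 # p_t = M p_{t-1}
print(p)                      # -> [0.833..., 0.166...]

vals, vecs = np.linalg.eig(M)
v = np.real(vecs[:, np.argmax(np.real(vals))])   # eigenvalue 1 eigenvector
print(v / v.sum())            # the same equilibrium distribution
```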
PageRank
Define the matrix

Aij = 1 if website j has a hyperlink to website i, 0 otherwise

From this we can define a Markov transition matrix with elements

Mij = Aij / Σ_{i′} A_{i′j}

If we jump from website to website, the equilibrium distribution component p∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.
For each website i, a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain that word is then returned, ranked according to the importance of the site.
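A hedged sketch of this construction on an assumed toy link matrix:

```python
# Build M from the link matrix A and find the importance ranking.
import numpy as np

# A[i, j] = 1 if site j links to site i (toy 4-site web graph)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)
M = A / A.sum(axis=0)         # M[i, j] = A[i, j] / sum_i' A[i', j]

p = np.full(4, 0.25)
for _ in range(100):          # power iteration towards the equilibrium
    p = M @ p
print(np.argsort(-p), p)      # sites ranked by 'importance' p_inf(i)
```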
Hidden Markov Models
The HMM defines a Markov chain on hidden (or 'latent') variables h1:T. The observed (or 'visible') variables depend on the hidden variables through an emission p(vt|ht). This defines a joint distribution

p(h1:T, v1:T) = p(v1|h1) p(h1) Π_{t=2}^{T} p(vt|ht) p(ht|ht−1)

For a stationary HMM, the transition p(ht|ht−1) and emission p(vt|ht) distributions are constant through time.

[Figure: a first order hidden Markov model with 'hidden' variables dom(ht) = {1, . . . , H}, t = 1:T. The 'visible' variables vt can be either discrete or continuous.]
The classical inference problems
Filtering (inferring the present): p(ht|v1:t)
Prediction (inferring the future): p(ht|v1:s), t > s
Smoothing (inferring the past): p(ht|v1:u), t < u
Likelihood: p(v1:T)
Most likely path (Viterbi alignment): argmax_{h1:T} p(h1:T|v1:T)

For prediction, one is also often interested in p(vt|v1:s) for t > s.
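A hedged sketch of filtering via the standard forward recursion, with randomly generated toy transition and emission tables:

```python
# Forward recursion: alpha_t(h) proportional to p(v_t|h) sum_h' p(h|h') alpha_{t-1}(h').
import numpy as np

H = 3
rng = np.random.default_rng(4)
trans = rng.dirichlet(np.ones(H), size=H).T   # trans[i, j] = p(h_t=i | h_{t-1}=j)
emit = rng.dirichlet(np.ones(5), size=H).T    # emit[v, h]  = p(v_t=v | h_t=h)
prior = np.full(H, 1.0 / H)

def filtering(vs):
    alpha = prior * emit[vs[0]]
    alpha /= alpha.sum()
    posts = [alpha]
    for v in vs[1:]:
        alpha = emit[v] * (trans @ alpha)     # predict, then correct
        alpha /= alpha.sum()                  # normalise -> p(h_t | v_{1:t})
        posts.append(alpha)
    return posts

print(filtering([0, 3, 1, 4]))                # one posterior per time step
```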
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).
The algorithms are variants of 'message passing on factor graphs'.
The algorithm is guaranteed to work if the graph is singly-connected.
There has been a huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).
HMMs for speech recognition
ht is the phoneme at time t; p(ht|ht−1) – language model; p(vt|ht) – speech signal model.
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently, companies including Google have made big advances in speech recognition.
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is some function of the phoneme, μ(ht; θ).
This function is a deep neural network, trained on a large amount of data.
There is a goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation.
Note that this is a Graphical Model, not a Function.
The latent variables h can be sampled from p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.
The bad news
Inference (computing p(h|v)) and parameter learning are intractable in these models.
Statisticians typically use sampling as an approximation.
It is very popular in ML to use a variational method – much faster for inference.
Variational Inference

Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and that we wish to learn θ to maximise the probability that this model generates the observed data:

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) + const.

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.
We then jointly maximise the bound w.r.t. φ and θ.
We can parameterise p(v|h, θ) using a deep network.
This is a very popular approach – see the 'variational autoencoder' and also attention mechanisms.
Extension to a semi-supervised method using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h).
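A hedged sketch of the bound for an assumed toy linear-Gaussian model, where the exact log-likelihood is available for comparison; the Monte Carlo estimate of the bound never exceeds it.

```python
# Toy model: p(h) = N(0,1), p(v|h,theta) = N(theta*h, 1), q(h|v,phi) = N(mu, s^2).
import numpy as np

rng = np.random.default_rng(5)

def elbo(v, theta, mu, s, n=10000):
    h = rng.normal(mu, s, size=n)                       # h ~ q(h|v, phi)
    log_q = -0.5 * ((h - mu) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)
    log_p_h = -0.5 * h ** 2 - 0.5 * np.log(2 * np.pi)
    log_p_v = -0.5 * (v - theta * h) ** 2 - 0.5 * np.log(2 * np.pi)
    return np.mean(log_p_v + log_p_h - log_q)           # lower-bounds log p(v|theta)

# exact log p(v|theta) for this model: v ~ N(0, theta^2 + 1)
v, theta = 1.5, 0.8
exact = -0.5 * v ** 2 / (theta ** 2 + 1) - 0.5 * np.log(2 * np.pi * (theta ** 2 + 1))
print(exact, elbo(v, theta, mu=0.5, s=0.7))             # bound <= exact
```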
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals.
The problem is that the number of pixel states is enormous.
We need to learn a low dimensional representation of the screen (use a deep generative model).
Then learn which action to take given the low dimensional representation.
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing.
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve the interaction between reinforcement learning and representation learning.
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
https://reinfer.io
Artificial Network
1957 Rosenblattrsquos perceptron
perceptron film clip
Connectionism
1960 Realised a perceptron can only solve simple tasks
1970 Decline in interest
1980 New computing power made training multilayer networks feasible
outputinputs
Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(
sumi wijhi)
Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)
NNs are difficult to train (many local optima)
Particularly difficult to train a NN with a large number of layers (say largerthan around 10)
lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo
Machine Learning (up to 2006)
A large section of the machine learning community abandoned NNs
More principled and computationally better understood techniques (SVMsand related convex methods) replaced them
Bayesian AI (1990s onwards)
From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes
Uncertainty and reasoning are not naturally representable using standardfeed-forward nets
Explosion in more lsquosymbolicrsquo Bayesian AI
Deep Learning
NNs have resurged in interest in the last few years (Hinton Bengio )
Also called lsquodeep learningrsquo
Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques
The brain uses hierarchical distributed processing and it is likely to be for agood reason
Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc
Why now
New computing resources (GPU processing)
Availability of large amount of data means that we can train nets with manyparameters (1010)
Recent evidence suggests local optima are not particularly problematic
Autoencoder
y1 y2 y3 y4 y5
h1 h2 h3
h4 h5
y1 y2 y3 y4 y5
h6 h7 h8
The bottleneck forces the network to try to find a low dimensionalrepresentation of the data
Useful for unsupervised learning
Autoencoder on MNIST digits (Hinton 2006 Science)
Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA
60000 training images (28times 28 = 784 pixels)
Use a form of autoencoder to find a lower (30) dimensional representation
At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation
Google Cats
10 Million Youtube video frames (200x200 pixel images)
Use a specialised autoencoder with 9 layers (1 billion weights)
2000 computers + two weeks of computing
Examine units to see what images they most respond to
Google Autoencoder
From Nando De Freitas
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond (not to macro features such as bicycles)but micro features
For example in handwritten digit recognition they correspond to smallconstituent parts of the digits
These are used then to process the image into a representation that is betterfor recognition
NNs in NLP
Bag of Words
We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index
We can also think of this as a Euclidean embedding e
aardvarkrarr eaardvark =
100
zorrorarr ezorro =
001
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned
Objective is for example next word prediction accuracy
These are often called lsquoneural language modelsrsquo
NNs in NLP
Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used
Consider the sentence
the cat sat on the mat
and that we wish to predict the word on given the two preceding cat sat
and two succeeding words the mat
We can use a network that has inputs vcat vsat vthe vmat
The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high
The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context
Word Embeddings
Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)
Word Embeddings
There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender
vwoman minus vman asymp vaunt minus vuncle
vwoman minus vman asymp vqueen minus vking
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis minus vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Connectionism
1960 Realised a perceptron can only solve simple tasks
1970 Decline in interest
1980 New computing power made training multilayer networks feasible
outputinputs
Each node (or lsquoneuronrsquo) computes a function of a weighted combination ofparental nodes hj = σ(
sumi wijhi)
Neural Networks and Deep LearningHistorical Problems with Neural Nets (1990s)
NNs are difficult to train (many local optima)
Particularly difficult to train a NN with a large number of layers (say largerthan around 10)
lsquoGradient Diffusion Problemrsquo ndash difficult to assign responsibility of errors toindividual lsquoneuronsrsquo
Machine Learning (up to 2006)
A large section of the machine learning community abandoned NNs
More principled and computationally better understood techniques (SVMsand related convex methods) replaced them
Bayesian AI (1990s onwards)
From mid 1990s there was a realisation that pattern recognition is notsufficient for all AI purposes
Uncertainty and reasoning are not naturally representable using standardfeed-forward nets
Explosion in more lsquosymbolicrsquo Bayesian AI
Deep Learning
NNs have resurged in interest in the last few years (Hinton Bengio )
Also called lsquodeep learningrsquo
Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques
The brain uses hierarchical distributed processing and it is likely to be for agood reason
Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc
Why now
New computing resources (GPU processing)
Availability of large amount of data means that we can train nets with manyparameters (1010)
Recent evidence suggests local optima are not particularly problematic
Autoencoder
y1 y2 y3 y4 y5
h1 h2 h3
h4 h5
y1 y2 y3 y4 y5
h6 h7 h8
The bottleneck forces the network to try to find a low dimensionalrepresentation of the data
Useful for unsupervised learning
Autoencoder on MNIST digits (Hinton 2006 Science)
Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA
60000 training images (28times 28 = 784 pixels)
Use a form of autoencoder to find a lower (30) dimensional representation
At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation
Google Cats
10 Million Youtube video frames (200x200 pixel images)
Use a specialised autoencoder with 9 layers (1 billion weights)
2000 computers + two weeks of computing
Examine units to see what images they most respond to
Google Autoencoder
From Nando De Freitas
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond (not to macro features such as bicycles)but micro features
For example in handwritten digit recognition they correspond to smallconstituent parts of the digits
These are used then to process the image into a representation that is betterfor recognition
NNs in NLP
Bag of Words
We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index
We can also think of this as a Euclidean embedding e
aardvarkrarr eaardvark =
100
zorrorarr ezorro =
001
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned
Objective is for example next word prediction accuracy
These are often called lsquoneural language modelsrsquo
NNs in NLP
Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used
Consider the sentence
the cat sat on the mat
and that we wish to predict the word on given the two preceding cat sat
and two succeeding words the mat
We can use a network that has inputs vcat vsat vthe vmat
The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high
The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context
Word Embeddings
Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)
Word Embeddings
There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender
vwoman minus vman asymp vaunt minus vuncle
vwoman minus vman asymp vqueen minus vking
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis minus vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse Differentiation
Consider

f(x1, x2) = cos(sin(x1 x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
[AST: x1, x2 → f1 → f2 → f3]

f1(x1, x2) = x1 x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1, x2, we first run forwards through the tree so that we can associate each node with an actual function value
Reverse Differentiation
[AST: x1, x2 → f1 → f2 → f3]

df3/dx1 = (∂f3/∂f2)(df2/dx1) = (∂f3/∂f2)(df2/df1)(df1/dx1), where (∂f3/∂f2)(df2/df1) = df3/df1

Similarly,

df3/dx2 = (∂f3/∂f2)(df2/df1)(df1/dx2)

The two derivatives share the same computation branch, df3/df1, and we want to exploit this
Reverse Differentiation
[AST: x1, x2 → f1 → f2 → f3, with local derivatives]

∂f1/∂x1 = x2,  ∂f1/∂x2 = x1,  ∂f2/∂f1 = cos(f1),  ∂f3/∂f2 = -sin(f2)

1. Find the reverse ancestral (backwards) schedule of nodes (f3, f2, f1, x1, x2)
2. Start with the first node n1 in the reverse schedule and define t_{n1} = 1
3. For the next node n in the reverse schedule, find the child nodes ch(n). Then define

t_n = Σ_{c ∈ ch(n)} (∂f_c/∂f_n) t_c

4. The total derivatives of f with respect to the root nodes of the tree (here x1 and x2) are given by the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutine to efficiently compute the gradient. It is efficient because information is collected at nodes in the tree and split between parents only when required
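For the cos(sin(x1 x2)) example above, the whole procedure fits in a few lines; this is a hand-coded sketch of the reverse schedule rather than a general implementation:

import math

def value_and_grad(x1, x2):
    # forward pass: associate each node with an actual function value
    f1 = x1 * x2
    f2 = math.sin(f1)
    f3 = math.cos(f2)
    # reverse pass over the schedule (f3, f2, f1, x1, x2); t_n accumulates df3/dn
    t_f3 = 1.0
    t_f2 = -math.sin(f2) * t_f3  # ∂f3/∂f2 · t_f3
    t_f1 = math.cos(f1) * t_f2   # ∂f2/∂f1 · t_f2: the shared branch df3/df1, computed once
    t_x1 = x2 * t_f1             # ∂f1/∂x1 · t_f1
    t_x2 = x1 * t_f1             # ∂f1/∂x2 · t_f1
    return f3, (t_x1, t_x2)

print(value_and_grad(0.3, 0.7))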
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another, and requires complex reasoning using some form of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchen; snd - sound
Finding the Burglar
[Figure: kitchen grid showing the inferred burglar position given the observed 'creak' and 'bump' sounds at each timestep]
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended key; hit - hit key
Stubby Fingers: errors

[Figure: keystroke error matrix p(hit|int) over the letters a-z, probability scale 0.05 to 0.55]
Stubby Fingers: language

[Figure: letter-transition matrix of the language model over the letters a-z, probability scale 0 to 0.9]
Stubby Fingers
Given the typed sequence cwsykcak, what is the most likely word that this corresponds to?
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
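A sketch of the same idea, though scoring same-length dictionary words directly rather than listing the 200 most likely hidden sequences; the error and language-model tables here are invented stand-ins for the matrices plotted above:

import numpy as np

letters = 'abcdefghijklmnopqrstuvwxyz'
idx = {c: i for i, c in enumerate(letters)}

# invented stand-ins: a diagonally dominant error model and a random language model
error = np.full((26, 26), 0.01) + 0.7 * np.eye(26)  # rows: intended key, columns: hit key
error /= error.sum(axis=1, keepdims=True)           # normalise to p(hit | int)
lang = np.random.default_rng(1).dirichlet(np.ones(26), size=26)  # p(int_t | int_{t-1})

def score(word, typed):
    """Joint probability of an intended word and the observed keystrokes."""
    s = 1.0 / 26                                    # uniform distribution over the first letter
    for t, (w, v) in enumerate(zip(word, typed)):
        if t > 0:
            s *= lang[idx[word[t - 1]], idx[w]]     # language-model transition
        s *= error[idx[w], idx[v]]                  # keystroke error model
    return s

typed = 'cwsykcak'
dictionary = ['casually', 'cossacks', 'keyboard']   # toy same-length dictionary
print(max(dictionary, key=lambda w: score(w, typed)))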
Speech Recognition: raw signal

[Figure: raw audio waveform, amplitude roughly -0.2 to 0.3, against time 0 to 0.9 s]
'neural' representation

[Figure: time-frequency representation of the same signal]
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho - phoneme (letter); aud - audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient-specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems, such as the Ising Model (1920), and in AI applications, such as the HMM (Baum 1966; Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.)
Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented
Without introducing strong structural limitations on how these objects can interact, probability is a non-starter
For this reason computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties; however, these are typically frowned upon by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact with each other
Graphical Models are then a marriage between Graph theory and Probability theory
Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph
The computational complexity of operations can often be related to the structure of the graph
Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty, and are therefore widely applicable
Uses in Industry
Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model)
Hospitals: use Belief Nets to encode knowledge about diseases and symptoms, to aid medical diagnosis
Google, Microsoft, Facebook: used in many places, including advertising, video game prediction and speech recognition
Used to estimate the inherent desirability of products in consumer retail
Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company-user relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

p(x|y) ≡ p(x, y)/p(y) = p(y|x)p(x)/p(y)    (Bayes' rule)
Throwing darts
p(region 5 | not region 20) = p(region 5, not region 20)/p(not region 20) = p(region 5)/p(not region 20) = (1/20)/(19/20) = 1/19
Interpretation
p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation should be 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'
Battleships
Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each
Each can be placed anywhere on the 10×10 grid, but they cannot overlap
Let s1 be the origin of ship 1 and s2 the origin of ship 2
The data D is a collection of query 'hit' or 'miss' responses
p(s1, s2|D) = p(D|s1, s2) p(s1, s2) / p(D)

Let X be the matrix of pixel occupancy:

p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2) p(s1, s2|D)

demoBattleships.m
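demoBattleships.m itself is not reproduced here; the computation can be sketched in a few lines, assuming a uniform prior over valid (non-overlapping) placements and noiseless query responses:

import numpy as np

def placements():
    """All non-overlapping placements: ship 1 vertical, ship 2 horizontal, length 5."""
    for r1 in range(6):
        for c1 in range(10):
            m1 = np.zeros((10, 10), bool)
            m1[r1:r1 + 5, c1] = True
            for r2 in range(10):
                for c2 in range(6):
                    m2 = np.zeros((10, 10), bool)
                    m2[r2, c2:c2 + 5] = True
                    if not (m1 & m2).any():
                        yield m1 | m2

D = {(0, 0): False, (4, 3): True}                 # example query responses: (row, col) -> hit?

post, n = np.zeros((10, 10)), 0
for X in placements():                            # sum over s1, s2
    if all(X[q] == hit for q, hit in D.items()):  # p(D|s1, s2) is 1 or 0 for noiseless queries
        post += X
        n += 1
print((post / n).round(2))                        # p(X_ij occupied | D)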
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditional probabilities:

p(A, B, C, D, E) = p(A)p(B)p(C|A, B)p(D|C)p(E|B, C)

[DAG: A→C, B→C, B→E, C→D, C→E]
Example - Part I
Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes
Choosing an ordering
Without loss of generality, we can write

p(A, R, E, B) = p(A|R, E, B)p(R, E, B)
             = p(A|R, E, B)p(R|E, B)p(E, B)
             = p(A|R, E, B)p(R|E, B)p(E|B)p(B)

Assumptions:

The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B)
The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E)
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E)

Therefore

p(A, R, E, B) = p(A|E, B)p(R|E)p(E)p(B)
Example - Part II: Specifying the Tables

[DAG: B→A, E→A, E→R]

p(A = 1|B, E):
B = 1, E = 1: 0.9999
B = 1, E = 0: 0.99
B = 0, E = 1: 0.99
B = 0, E = 0: 0.0001

p(R = 1|E):
E = 1: 1
E = 0: 0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution
Example - Part III: Inference
Initial Evidence: The alarm is sounding
p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
             = Σ_{E,R} p(A = 1|B = 1, E)p(B = 1)p(E)p(R|E) / Σ_{B,E,R} p(A = 1|B, E)p(B)p(E)p(R|E) ≈ 0.99
Additional Evidence: The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01
Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake
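Both numbers are easy to verify by brute-force enumeration of the tables above (a sketch):

import itertools

pB = {1: 0.01, 0: 0.99}
pE = {1: 0.000001, 0: 0.999999}
pA1 = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}  # p(A=1 | B,E)
pR1 = {1: 1.0, 0: 0.0}                                              # p(R=1 | E)

def p(a, r, e, b):
    """Joint p(A=a, R=r, E=e, B=b) = p(A|E,B) p(R|E) p(E) p(B)."""
    pa = pA1[(b, e)] if a == 1 else 1 - pA1[(b, e)]
    pr = pR1[e] if r == 1 else 1 - pR1[e]
    return pa * pr * pE[e] * pB[b]

# p(B=1 | A=1): sum out E and R
num = sum(p(1, r, e, 1) for r, e in itertools.product([0, 1], repeat=2))
den = sum(p(1, r, e, b) for r, e, b in itertools.product([0, 1], repeat=3))
print(num / den)  # ~0.99

# p(B=1 | A=1, R=1): also condition on the radio report
num = sum(p(1, 1, e, 1) for e in [0, 1])
den = sum(p(1, 1, e, b) for e, b in itertools.product([0, 1], repeat=2))
print(num / den)  # ~0.01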
Markov Models
For timeseries data v1, …, vT we need a model p(v_{1:T}). For causal consistency it is meaningful to consider the decomposition

p(v_{1:T}) = ∏_{t=1}^{T} p(v_t|v_{1:t-1})

with the convention p(v_t|v_{1:t-1}) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptions
It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(v_t|v_1, …, v_{t-1}) = p(v_t|v_{t-L}, …, v_{t-1})

where L ≥ 1 is the order of the Markov chain. For a first order chain,

p(v_{1:T}) = p(v1)p(v2|v1)p(v3|v2) ⋯ p(vT|v_{T-1})

For a stationary Markov chain the transitions p(v_t = s'|v_{t-1} = s) = f(s', s) are time-independent ('homogeneous')
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1, …, vT) = p(v1) ∏_{t=2}^{T} p(v_t|v_{t-1})    (initial × transition)

State transition diagram
Nodes represent states of the variable v, and arcs represent non-zero elements of the transition p(v_t|v_{t-1})

[Figure: state-transition diagram on states 1 to 9]
Most probable and shortest paths
[Figure: the same state-transition diagram on states 1 to 9]

The shortest (unweighted) path from state 1 to state 7 is 1-2-7
The most probable path from state 1 to state 7 is 1-8-9-7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1-2-7 the probability of exiting state 2 into state 7 is 1/5
Equilibrium distribution
It is interesting to know how the marginal p(x_t) evolves through time:

p(x_t = i) = Σ_j p(x_t = i|x_{t-1} = j) p(x_{t-1} = j),    M_ij ≡ p(x_t = i|x_{t-1} = j)

p(x_t = i) is the frequency that we visit state i at time t, given we started from p(x1) and randomly drew samples from the transition p(x_τ|x_{τ-1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p1(i) is

p_t = M^{t-1} p1

If, for t → ∞, p_∞ is independent of the initial distribution p1, then p_∞ is called the equilibrium distribution of the chain:

p_∞ = M p_∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix
PageRank
Define the matrix

A_ij = 1 if website j has a hyperlink to website i, and 0 otherwise

From this we can define a Markov transition matrix with elements

M_ij = A_ij / Σ_{i'} A_{i'j}

If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i
For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites containing that word is then returned, ranked according to the importance of the site
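A sketch of both ideas together: build M from a small invented link matrix and find the equilibrium distribution by repeatedly applying the transition (power iteration):

import numpy as np

# invented toy web: A[i, j] = 1 if website j links to website i
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 0]], float)
M = A / A.sum(axis=0, keepdims=True)  # M_ij = A_ij / sum_i' A_i'j

p = np.full(4, 0.25)                  # any initial distribution p1
for _ in range(200):
    p = M @ p                         # p_t = M^(t-1) p1
print(p)                              # equilibrium distribution p = M p: the 'importance' of each site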
Hidden Markov Models
The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

p(h_{1:T}, v_{1:T}) = p(v1|h1)p(h1) ∏_{t=2}^{T} p(v_t|h_t)p(h_t|h_{t-1})

For a stationary HMM, the transition p(h_t|h_{t-1}) and emission p(v_t|h_t) distributions are constant through time
v1 v2 v3 v4
h1 h2 h3 h4

Figure: a first order hidden Markov model with 'hidden' variables dom(h_t) = {1, …, H}, t = 1, …, T. The 'visible' variables v_t can be either discrete or continuous
The classical inference problems
Filtering (inferring the present): p(h_t|v_{1:t})
Prediction (inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): argmax_{h_{1:T}} p(h_{1:T}|v_{1:T})

For prediction, one is also often interested in p(v_t|v_{1:s}) for t > s
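Filtering, for example, is a single forward sweep; a minimal sketch for discrete states, with transition[i, j] = p(h_t = i|h_{t-1} = j) and emission[v, h] = p(v_t = v|h_t = h):

import numpy as np

def filtering(obs, p1, transition, emission):
    """Return p(h_t | v_{1:t}) for each t."""
    alpha = emission[obs[0]] * p1                        # correct the prior with the first observation
    alphas = [alpha / alpha.sum()]
    for v in obs[1:]:
        alpha = emission[v] * (transition @ alphas[-1])  # predict one step, then correct
        alphas.append(alpha / alpha.sum())
    return np.array(alphas)

# toy 2-state, 2-symbol HMM
transition = np.array([[0.9, 0.2], [0.1, 0.8]])
emission = np.array([[0.8, 0.3], [0.2, 0.7]])
print(filtering([0, 0, 1, 1], np.array([0.5, 0.5]), transition, emission))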
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states)
The algorithms are variants of 'message passing on factor graphs'
The algorithms are guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes)
HMMs for speech recognition
h_t is the phoneme at time t; p(h_t|h_{t-1}): language model; p(v_t|h_t): speech signal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speech recognition
The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t; θ)
This function is a deep neural network, trained on a large amount of data
There is a goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation
Note that this is a Graphical Model, not a Function
The latent variables h can be sampled from using p(h), and an image then sampled from p(v|h). One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v)) and parameter learning are intractable in these models
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method: much faster for inference
Variational Inference
Consider a distribution

p(v|θ) = ∫_h p(v|h, θ)p(h)

and that we wish to learn θ to maximise the probability this model generates observed data. The bound is

log p(v|θ) ≥ -∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log [p(v|h, θ)p(h)]
The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently
We then jointly maximise the bound with respect to φ and θ
We can parameterise p(v|h, θ) using a deep network
Very popular approach: see the 'variational autoencoder' and also attention mechanisms
Extension to a semi-supervised method using p(v) = ∫_h Σ_c p(v|h, c)p(c)p(h)
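As a toy instance of the bound, take p(h) = N(0, 1), a 'decoder' p(v|h, θ) = N(θh, 1) and a Gaussian q(h|v, φ) with φ = (μ, σ); the bound can then be estimated by sampling from q (all numbers below are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def elbo(v, theta, mu, log_sigma, n_samples=10000):
    """Monte-Carlo estimate of the bound for a single datapoint v."""
    sigma = np.exp(log_sigma)
    h = mu + sigma * rng.standard_normal(n_samples)     # samples from q(h|v, phi)
    log_p_v_given_h = -0.5 * (v - theta * h) ** 2 - 0.5 * np.log(2 * np.pi)
    log_p_h = -0.5 * h ** 2 - 0.5 * np.log(2 * np.pi)
    log_q = -0.5 * ((h - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    return np.mean(log_p_v_given_h + log_p_h - log_q)   # the two integrals of the bound, sampled

print(elbo(v=1.5, theta=0.8, mu=0.9, log_sigma=-0.5))   # maximise this jointly over theta and (mu, sigma)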
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games?
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals
The problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deep generative model)
Then learn which action to take given the low dimensional representation
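The Atari systems are far more involved, but the core of 'deciding which action is best for long-term goals' can be sketched with tabular Q-learning on a toy 5-state chain (everything here is illustrative, not the Atari setup):

import numpy as np

# chain world: states 0..4, actions 0 (left) / 1 (right), reward 1 on reaching state 4
n_states, gamma, alpha = 5, 0.9, 0.5
Q = np.ones((n_states, 2))                      # optimistic initialisation drives exploration

for episode in range(200):
    s = 0
    for _ in range(100):                        # step cap keeps episodes finite
        a = int(Q[s].argmax())                  # greedy action under current value estimates
        s2 = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s2 == n_states - 1 else 0.0
        target = r + (0.0 if s2 == n_states - 1 else gamma * Q[s2].max())
        Q[s, a] += alpha * (target - Q[s, a])   # TD update toward the long-term return
        if s2 == n_states - 1:
            break
        s = s2

print(Q[:-1].argmax(axis=1))                    # learned policy: move right in every non-terminal state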
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis and Game Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representation learning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company, reinfer:
https://reinfer.io
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Deep Learning
NNs have resurged in interest in the last few years (Hinton Bengio )
Also called lsquodeep learningrsquo
Sense that very complex tasks (object recognition learning complex structurein data) requires going beyond simple (convex) statistical techniques
The brain uses hierarchical distributed processing and it is likely to be for agood reason
Many problems have a hierarchical structure images are made of partslanguage is hierarchical etc
Why now
New computing resources (GPU processing)
Availability of large amount of data means that we can train nets with manyparameters (1010)
Recent evidence suggests local optima are not particularly problematic
Autoencoder
y1 y2 y3 y4 y5
h1 h2 h3
h4 h5
y1 y2 y3 y4 y5
h6 h7 h8
The bottleneck forces the network to try to find a low dimensionalrepresentation of the data
Useful for unsupervised learning
Autoencoder on MNIST digits (Hinton 2006 Science)
Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA
60000 training images (28times 28 = 784 pixels)
Use a form of autoencoder to find a lower (30) dimensional representation
At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation
Google Cats
10 Million Youtube video frames (200x200 pixel images)
Use a specialised autoencoder with 9 layers (1 billion weights)
2000 computers + two weeks of computing
Examine units to see what images they most respond to
Google Autoencoder
From Nando De Freitas
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond (not to macro features such as bicycles)but micro features
For example in handwritten digit recognition they correspond to smallconstituent parts of the digits
These are used then to process the image into a representation that is betterfor recognition
NNs in NLP
Bag of Words
We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index
We can also think of this as a Euclidean embedding e
aardvarkrarr eaardvark =
100
zorrorarr ezorro =
001
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned
Objective is for example next word prediction accuracy
These are often called lsquoneural language modelsrsquo
NNs in NLP
Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used
Consider the sentence
the cat sat on the mat
and that we wish to predict the word on given the two preceding cat sat
and two succeeding words the mat
We can use a network that has inputs vcat vsat vthe vmat
The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high
The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context
Word Embeddings
Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)
Word Embeddings
There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender
vwoman minus vman asymp vaunt minus vuncle
vwoman minus vman asymp vqueen minus vking
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis minus vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact with each other
Graphical Models are then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph
The computational complexity of operations can often be related to the structure of the graph
Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable
Uses in Industry
Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis
Google, Microsoft, Facebook: used in many places, including advertising, video game prediction and speech recognition
Used to estimate the inherent desirability of products in consumer retail
Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company/user relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as
p(x|y) ≡ p(x, y)/p(y) = p(y|x) p(x)/p(y)    (Bayes' rule)
Throwing darts
p(region 5 | not region 20) = p(region 5, not region 20) / p(not region 20)
                            = p(region 5) / p(not region 20)
                            = (1/20) / (19/20)
                            = 1/19
Interpretation: p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation is 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.
Battleships
Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each
Can be placed anywhere on the 10×10 grid, but cannot overlap
Let s1 be the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1, s2|D) = p(D|s1, s2) p(s1, s2) / p(D)
Let X be the matrix of pixel occupancy:
p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2) p(s1, s2|D)
demoBattleships.m
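demoBattleships.m is not reproduced here; the following is a minimal Python sketch of the same computation, assuming a uniform prior over non-overlapping placements and noise-free hit/miss answers.

    import numpy as np

    G, L = 10, 5  # grid size, ship length

    def cells(origin, vertical):
        r, c = origin
        return [(r + k, c) if vertical else (r, c + k) for k in range(L)]

    # Enumerate all joint placements (ship 1 vertical, ship 2 horizontal), no overlap.
    placements = []
    for r1 in range(G - L + 1):
        for c1 in range(G):
            for r2 in range(G):
                for c2 in range(G - L + 1):
                    s1, s2 = cells((r1, c1), True), cells((r2, c2), False)
                    if not set(s1) & set(s2):
                        placements.append((s1, s2))

    def posterior(data):
        # data: list of ((row, col), hit_bool) query responses; p(D|s1,s2) is 0 or 1
        occupancy, n = np.zeros((G, G)), 0
        for s1, s2 in placements:
            occupied = set(s1) | set(s2)
            if all((q in occupied) == hit for q, hit in data):
                n += 1
                for r, c in occupied:
                    occupancy[r, c] += 1
        return occupancy / n  # p(X_ij = 1 | D)

    p = posterior([((0, 0), False), ((5, 5), True)])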
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditional probabilities
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|C) p(E|B, C)
(Figure: the corresponding DAG on nodes A, B, C, D, E.)
Example – Part I
Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.
Choosing an ordering
Without loss of generality, we can write
p(A, R, E, B) = p(A|R, E, B) p(R, E, B)
             = p(A|R, E, B) p(R|E, B) p(E, B)
             = p(A|R, E, B) p(R|E, B) p(E|B) p(B)
Assumptions:
The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B)
The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E)
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E)
Therefore
p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)
Example – Part II: Specifying the Tables
(Figure: the DAG with edges B → A, E → A, E → R.)
p(A = 1|B, E):
  B  E  p(A = 1|B, E)
  1  1  0.9999
  1  0  0.99
  0  1  0.99
  0  0  0.0001
p(R = 1|E):
  E  p(R = 1|E)
  1  1
  0  0
The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.
Example – Part III: Inference
Initial Evidence: The alarm is sounding.
p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
              = Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E) ≈ 0.99
Additional Evidence: The radio broadcasts an earthquake warning.
A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01
Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
Markov Models
For timeseries data v_1, ..., v_T we need a model p(v_{1:T}). For causal consistency it is meaningful to consider the decomposition
p(v_{1:T}) = ∏_{t=1}^{T} p(v_t|v_{1:t−1})
with the convention p(v_t|v_{1:t−1}) = p(v_1) for t = 1.
v1 v2 v3 v4
Independence assumptions
It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.
Markov Chain
Only the recent past is relevant
p(v_t|v_1, ..., v_{t−1}) = p(v_t|v_{t−L}, ..., v_{t−1})
where L ≥ 1 is the order of the Markov chain. For a first order chain,
p(v_{1:T}) = p(v_1) p(v_2|v_1) p(v_3|v_2) ... p(v_T|v_{T−1})
For a stationary Markov chain the transitions p(v_t = s′|v_{t−1} = s) = f(s′, s) are time-independent ('homogeneous').
(Figure: (a) a first order Markov chain v1 → v2 → v3 → v4; (b) a second order Markov chain.)
Markov Chains
v1 v2 v3 v4
p(v_1, ..., v_T) = p(v_1) ∏_{t=2}^{T} p(v_t|v_{t−1})   (initial term × transition terms)
State transition diagram
Nodes represent states of the variable v, and arcs the non-zero elements of the transition p(v_t|v_{t−1}).
(Figure: a state transition diagram on states 1–9.)
Most probable and shortest paths
(Figure: the same state transition diagram on states 1–9.)
The shortest (unweighted) path from state 1 to state 7 is 1−2−7
The most probable path from state 1 to state 7 is 1−8−9−7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1−2−7 the probability of exiting state 2 into state 7 is 1/5
Equilibrium distribution
It is interesting to know how the marginal p(x_t) evolves through time:
p(x_t = i) = Σ_j p(x_t = i|x_{t−1} = j) p(x_{t−1} = j),  with M_ij ≡ p(x_t = i|x_{t−1} = j)
p(x_t = i) is the frequency with which we visit state i at time t, given that we started from p(x_1) and randomly drew samples from the transition p(x_τ|x_{τ−1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is
p_t = M^{t−1} p_1
If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:
p_∞ = M p_∞
The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
PageRank
Define the matrix
A_ij = 1 if website j has a hyperlink to website i, and 0 otherwise
From this we can define a Markov transition matrix with elements
M_ij = A_ij / Σ_{i′} A_{i′j}
If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.
For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites containing that word is then returned, ranked according to the importance of the site.
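A minimal sketch of both ideas, using a small invented link matrix: build M from A, then run the power iteration p_t = M p_{t−1} until it settles.

    import numpy as np

    A = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 0]], dtype=float)   # A[i, j] = 1 if site j links to site i
    M = A / A.sum(axis=0, keepdims=True)     # M[i, j] = A[i, j] / sum_i' A[i', j]

    p = np.full(3, 1 / 3)                    # any initial distribution p_1
    for _ in range(100):
        p = M @ p                            # converges to p_inf with M p_inf = p_inf
    print(p)                                 # the 'importance' of each site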
Hidden Markov Models
The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution
p(h_{1:T}, v_{1:T}) = p(v_1|h_1) p(h_1) ∏_{t=2}^{T} p(v_t|h_t) p(h_t|h_{t−1})
For a stationary HMM the transition p(h_t|h_{t−1}) and emission p(v_t|h_t) distributions are constant through time.
Figure: a first order hidden Markov model with 'hidden' variables dom(h_t) = {1, ..., H}, t = 1, ..., T. The 'visible' variables v_t can be either discrete or continuous.
The classical inference problems
Filtering (inferring the present): p(h_t|v_{1:t})
Prediction (inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): argmax_{h_{1:T}} p(h_{1:T}|v_{1:T})
For prediction, one is also often interested in p(v_t|v_{1:s}) for t > s.
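Filtering, for example, reduces to a simple forward recursion, α_t(h) ∝ p(v_t|h_t = h) Σ_{h′} p(h_t = h|h_{t−1} = h′) α_{t−1}(h′); a minimal sketch, with assumed table shapes:

    import numpy as np

    def filtering(trans, emit, p1, observations):
        # trans[i, j] = p(h_t = i | h_{t-1} = j); emit[v, h] = p(v_t = v | h_t = h)
        alpha = emit[observations[0]] * p1        # alpha_1(h) ∝ p(v_1|h) p(h_1)
        alpha /= alpha.sum()
        filtered = [alpha.copy()]
        for v in observations[1:]:
            alpha = emit[v] * (trans @ alpha)     # propagate, then weight by emission
            alpha /= alpha.sum()                  # normalise to get p(h_t | v_{1:t})
            filtered.append(alpha.copy())
        return np.array(filtered)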
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states)
The algorithms are variants of 'message passing on factor graphs'
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes)
HMMs for speech recognition
h_t is the phoneme at time t; p(h_t|h_{t−1}) – language model; p(v_t|h_t) – speech signal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speech recognition
The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t; θ)
This function is a deep neural network, trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructed on the basis of a low dimensional representation
Note that this is a Graphical Model, not a Function
The latent variables h can be sampled from p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v)) and parameter learning are intractable in these models
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method – much faster for inference
Variational Inference
Consider a distribution
p(v|θ) = ∫_h p(v|h, θ) p(h)
and suppose that we wish to learn θ to maximise the probability that this model generates the observed data. Then
log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log [p(v|h, θ) p(h)]
The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.
We then jointly maximise the bound with respect to φ and θ.
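For reference, this bound is the standard Jensen's inequality step; a sketch in LaTeX form:

\log p(v|\theta) = \log \int_h q(h|v,\phi)\, \frac{p(v|h,\theta)\, p(h)}{q(h|v,\phi)}
\geq -\int_h q(h|v,\phi) \log q(h|v,\phi) + \int_h q(h|v,\phi) \log \left[ p(v|h,\theta)\, p(h) \right]

with equality when q(h|v,\phi) = p(h|v,\theta).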
We can parameterise p(v|h θ) using a deep network
Very popular approach – see the 'variational autoencoder' and also attention mechanisms
Extension to a semi-supervised method using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deep generative model)
Then learn which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve the interaction between reinforcement learning and representation learning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
https://reinfer.io
Autoencoder
(Figure: an autoencoder network: inputs y1, ..., y5 pass through hidden layers with a narrow bottleneck and back out to reconstructions y1, ..., y5.)
The bottleneck forces the network to try to find a low dimensional representation of the data
Useful for unsupervised learning
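A minimal numpy sketch of such a bottleneck network, trained by gradient descent on the squared reconstruction error; all sizes and data below are illustrative only (784 → 30 → 784, as in the MNIST example that follows).

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((100, 784))               # stand-in for a batch of images
    W1 = rng.normal(0, 0.01, (784, 30))      # encoder weights
    W2 = rng.normal(0, 0.01, (30, 784))      # decoder weights

    for _ in range(200):
        H = np.tanh(X @ W1)                  # low dimensional code h
        Y = H @ W2                           # reconstruction
        dY = 2 * (Y - X) / len(X)            # d loss / d Y, loss = mean ||Y - X||^2
        dW2 = H.T @ dY
        dH = dY @ W2.T * (1 - H**2)          # backprop through tanh
        dW1 = X.T @ dH
        W1 -= 0.1 * dW1                      # gradient descent step
        W2 -= 0.1 * dW2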
Autoencoder on MNIST digits (Hinton 2006 Science)
Figure: Reconstructions using H = 30 components. From the top: original image, Autoencoder (1), Autoencoder (2), PCA.
60,000 training images (28×28 = 784 pixels)
Use a form of autoencoder to find a lower (30) dimensional representation
At the time, the special layerwise training procedure was considered fundamental to the success of this approach. Now not deemed necessary, provided we use a sensible initialisation
Google Cats
10 million YouTube video frames (200×200 pixel images)
Use a specialised autoencoder with 9 layers (1 billion weights)
2000 computers + two weeks of computing
Examine units to see what images they most respond to
Google Autoencoder
From Nando De Freitas
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond not to macro features (such as bicycles) but to micro features
For example, in handwritten digit recognition they correspond to small constituent parts of the digits
These are then used to process the image into a representation that is better for recognition
NNs in NLP
Bag of Words
We have D words in a dictionary {aardvark, ..., zorro}, so that we can relate each word with its dictionary index
We can also think of this as a Euclidean embedding e:
aardvark → e_aardvark = (1, 0, ..., 0)ᵀ
zorro → e_zorro = (0, ..., 0, 1)ᵀ
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) v that are learned
The objective is, for example, next-word prediction accuracy
These are often called lsquoneural language modelsrsquo
NNs in NLP
Each word w in the dictionary has an associated embedding vector v_w. Usually around 200-dimensional vectors are used
Consider the sentence
the cat sat on the mat
and suppose that we wish to predict the word 'on' given the two preceding words 'cat sat' and the two succeeding words 'the mat'
We can use a network that has inputs v_cat, v_sat, v_the, v_mat
The output of the network is a probability over all words in the dictionary, p(w|v_inputs). We want p(w = on|v_cat, v_sat, v_the, v_mat) to be high
The overall objective is then to learn all the word embeddings and network parameters, subject to predicting the word correctly based on the context
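A minimal sketch of such a predictor, averaging the four context embeddings and applying a softmax output layer; the sizes and dictionary indices below are hypothetical.

    import numpy as np

    D, dim = 10000, 200                       # dictionary size, embedding size
    rng = np.random.default_rng(0)
    V = rng.normal(0, 0.01, (D, dim))         # learnable word embeddings v_w
    W = rng.normal(0, 0.01, (dim, D))         # learnable output layer

    def predict(context_ids):                 # e.g. ids of [cat, sat, the, mat]
        h = V[context_ids].mean(axis=0)       # combine the context embeddings
        logits = h @ W
        e = np.exp(logits - logits.max())
        return e / e.sum()                    # p(w | v_inputs) over all D words

    p_w = predict([11, 42, 7, 99])            # hypothetical dictionary indices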
Word Embeddings
Given a word (France, for example) we can find which words w have embedding vectors closest to v_France. From Ronan Collobert (2011)
Word Embeddings
There appears to be a natural 'geometry' to the embeddings. For example, there are directions that correspond to gender:
v_woman − v_man ≈ v_aunt − v_uncle
v_woman − v_man ≈ v_queen − v_king
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France–Paris, we get the 'relationship' embedding
v = v_Paris − v_France
Given Italy, we can calculate v_Italy + v and find the word in the dictionary which has the closest embedding to this (it turns out to be Rome). From Mikolov (2013)
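A minimal sketch of the analogy lookup, assuming an embedding matrix V with one row per dictionary word:

    import numpy as np

    def analogy(V, words, a, b, c):
        # words: list of words; row i of V is the embedding of words[i].
        # Returns the word closest (by cosine similarity) to v_b - v_a + v_c,
        # e.g. analogy(V, words, 'France', 'Paris', 'Italy') should give 'Rome'.
        idx = {w: i for i, w in enumerate(words)}
        v = V[idx[b]] - V[idx[a]] + V[idx[c]]
        sims = (V @ v) / (np.linalg.norm(V, axis=1) * np.linalg.norm(v) + 1e-9)
        for i in np.argsort(-sims):           # skip the query words themselves
            if words[i] not in (a, b, c):
                return words[i]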
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for Chinese words.
However, when we know that a Chinese and an English word have a similar meaning, we add a constraint that the word embeddings v_ChineseWord and v_EnglishWord should be close.
We have only a small amount of labelled 'similar' Chinese-English words (these are the green border boxes in the above; they are standard translations of the corresponding Chinese character).
We can visualise the embedding vectors in 2D (using t-SNE). See Socher (2013).
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank: consists of parsed sentences with sentiment labels (−−, −, 0, +, ++) for each node (phrase) in the tree; 215,000 labelled phrases (obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predict the sentiment at each node
Recursive Nets and Embeddings: Training
We have a softmax classifier for each node in the tree to predict the sentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes, at the bottom of the tree, the inputs to the classifiers are the word embeddings
The embeddings are combined by another network g, with common parameters, which forms the input to the sentiment classifier
We then learn all the embeddings, shared classifier parameters and shared combination parameters to maximise the classification accuracy
Prediction
For a new movie review, the review is first parsed using a standard grammar tree parser
This forms the tree, which can be used to recursively form the sentiment class label for the review
Currently the best sentiment classifier; see Socher (2013)
Recursive Nets and Embeddings
(Figure: RNTN prediction of positive and negative (bottom right) sentences and their negation, on example sentences such as 'Roger Dodger is one of the most compelling variations on this theme'. From Socher (2013).)
Recurrent Nets
(Figure: an RNN unrolled through time: inputs x1, x2, x3, hidden units h1, h2, h3 and outputs y1, y2, y3, with the weight matrices A, B, C shared across time steps.)
RNNs are used in timeseries applications
The basic idea is that the hidden units at time t (and possibly the output y_t) depend on the previous state of the network h_{t−1}, x_{t−1}, y_{t−1}, for inputs x_t and outputs y_t
In the above network I 'unrolled the net through time' to give a standard NN diagram
I omitted the potential links from x_{t−1}, y_{t−1} to h_t
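A minimal sketch of this unrolled computation, with the shared weight matrices named A, B, C as in the diagram (sizes are illustrative):

    import numpy as np

    def rnn_forward(xs, A, B, C, h0):
        h, ys = h0, []
        for x in xs:                       # the same A, B, C are reused at every step
            h = np.tanh(A @ x + B @ h)     # hidden state carries the past forward
            ys.append(C @ h)               # output y_t
        return ys

    dim_x, dim_h, dim_y = 3, 5, 2
    rng = np.random.default_rng(0)
    A, B, C = (rng.normal(size=s) for s in [(dim_h, dim_x), (dim_h, dim_h), (dim_y, dim_h)])
    ys = rnn_forward([rng.normal(size=dim_x) for _ in range(4)], A, B, C, np.zeros(dim_h))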
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples. The top line is real handwriting, for comparison. See Alex Graves's work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff?
AutoDiff takes a function f(x) and returns an exact value (up to machine accuracy) for the gradient
g_i(x) ≡ ∂f/∂x_i, evaluated at x
Note that this is not the same as a numerical approximation (such as central differences) for the gradient
One can show that, if done efficiently, one can always calculate the gradient in less than 5 times the time it takes to compute f(x)
Reverse Differentiation
A useful graphical representation is that the total derivative of f with respect to x is given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the functions on the edges:
df/dx = ∂f/∂x + (∂f/∂g)(dg/dx)
(Figure: nodes x, g, f, with edge labels ∂f/∂x on x → f, dg/dx on x → g, and ∂f/∂g on g → f.)
Example
For f(x) = x² + xgh, where g = x² and h = xg²:
(Figure: nodes x, g, h, f, with edge labels 2x + gh on x → f, 2x on x → g, g² on x → h, 2gx on g → h, xh on g → f, and xg on h → f.)
f′(x) = (2x + gh) + (g² · xg) + (2x · 2gx · xg) + (2x · xh) = 2x + 8x⁷
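The path-sum result can be checked directly, for example with sympy:

    import sympy as sp

    # Verify that f'(x) = 2x + 8x^7 for f = x^2 + x*g*h with g = x^2, h = x*g^2.
    x = sp.symbols('x')
    g = x**2
    h = x * g**2
    f = x**2 + x * g * h
    print(sp.expand(sp.diff(f, x)))   # 8*x**7 + 2*x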
Reverse Differentiation
Consider
f(x1, x2) = cos(sin(x1 x2))
We can represent this computationally using an Abstract Syntax Tree (AST):
(Figure: x1, x2 → f1 → f2 → f3.)
f1(x1, x2) = x1 x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1, x2 we first run forwards through the tree, so that we can associate each node with an actual function value.
Reverse Differentiation
(Figure: the same AST, x1, x2 → f1 → f2 → f3.)
df3/dx1 = (∂f3/∂f2)(df2/dx1) = (∂f3/∂f2)(df2/df1)(df1/dx1)
Similarly,
df3/dx2 = (∂f3/∂f2)(df2/df1)(df1/dx2)
The two derivatives share the same computation branch, df3/df1 = (∂f3/∂f2)(df2/df1), and we want to exploit this.
Reverse Differentiation
The local partial derivatives are
∂f1/∂x1 = x2,  ∂f1/∂x2 = x1,  ∂f2/∂f1 = cos(f1),  ∂f3/∂f2 = −sin(f2)
1. Find the reverse ancestral (backwards) schedule of nodes (f3, f2, f1, x1, x2).
2. Start with the first node n1 in the reverse schedule and define t_{n1} = 1.
3. For the next node n in the reverse schedule, find the child nodes ch(n). Then define
   t_n = Σ_{c ∈ ch(n)} (∂f_c/∂f_n) t_c
4. The total derivatives of f with respect to the root nodes of the tree (here x1 and x2) are given by the values of t at those nodes.
This is a general procedure that can be used to automatically define a subroutine to efficiently compute the gradient. It is efficient because information is collected at nodes in the tree and split between parents only when required.
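A minimal sketch of this procedure for the running example f = cos(sin(x1 x2)): a forward pass stores the node values, and a single reverse sweep accumulates t_n = Σ_{c ∈ ch(n)} (∂f_c/∂f_n) t_c.

    import math

    def f_and_grad(x1, x2):
        # Forward pass: associate each AST node with its value.
        f1 = x1 * x2
        f2 = math.sin(f1)
        f3 = math.cos(f2)
        # Reverse sweep over the schedule (f3, f2, f1, x1, x2).
        t = {'f3': 1.0}
        t['f2'] = -math.sin(f2) * t['f3']   # df3/df2 = -sin(f2)
        t['f1'] = math.cos(f1) * t['f2']    # df2/df1 = cos(f1)
        t['x1'] = x2 * t['f1']              # df1/dx1 = x2
        t['x2'] = x1 * t['f1']              # df1/dx2 = x1
        return f3, t['x1'], t['x2']         # value and the two total derivatives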
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Autoencoder on MNIST digits (Hinton 2006 Science)
Figure Reconstructions using H = 30 components From the Top Original imageAutoencoder1 Autoencoder2 PCA
60000 training images (28times 28 = 784 pixels)
Use a form of autoencoder to find a lower (30) dimensional representation
At the time the special layerwise training procedure was consideredfundamental to the success of this approach Now not deemed necessaryprovided we use a sensible initialisation
Google Cats
10 Million Youtube video frames (200x200 pixel images)
Use a specialised autoencoder with 9 layers (1 billion weights)
2000 computers + two weeks of computing
Examine units to see what images they most respond to
Google Autoencoder
From Nando De Freitas
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond (not to macro features such as bicycles)but micro features
For example in handwritten digit recognition they correspond to smallconstituent parts of the digits
These are used then to process the image into a representation that is betterfor recognition
NNs in NLP
Bag of Words
We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index
We can also think of this as a Euclidean embedding e
aardvarkrarr eaardvark =
100
zorrorarr ezorro =
001
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned
Objective is for example next word prediction accuracy
These are often called lsquoneural language modelsrsquo
NNs in NLP
Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used
Consider the sentence
the cat sat on the mat
and that we wish to predict the word on given the two preceding cat sat
and two succeeding words the mat
We can use a network that has inputs vcat vsat vthe vmat
The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high
The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context
Word Embeddings
Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)
Word Embeddings
There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender
vwoman minus vman asymp vaunt minus vuncle
vwoman minus vman asymp vqueen minus vking
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis minus vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (inferring the present): p(h_t | v_{1:t})
Prediction (inferring the future): p(h_t | v_{1:s}), t > s
Smoothing (inferring the past): p(h_t | v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): \argmax_{h_{1:T}} p(h_{1:T} | v_{1:T})

For prediction, one is also often interested in p(v_t | v_{1:s}) for t > s.
Inference in Hidden Markov Models
Belief network representation of a HMM
(Figure: first order chain h_1 \to h_2 \to h_3 \to h_4, with each h_t emitting v_t.)
Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states); see the sketch below.
The algorithms are variants of 'message passing on factor graphs'.
The algorithms are guaranteed to work if the graph is singly connected.
Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).
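A minimal sketch of filtering (the classical forward algorithm) for a discrete HMM; the tables and function names here are my own made-up stand-ins. Each update costs O(H^2) and the whole pass is linear in T, matching the scaling quoted above.

import numpy as np

def filtering(v, p_h1, p_trans, p_emit):
    """Forward algorithm: return p(h_t | v_{1:t}) for each t.

    p_h1[i]       = p(h_1 = i)
    p_trans[i, j] = p(h_t = i | h_{t-1} = j)
    p_emit[k, i]  = p(v_t = k | h_t = i)
    """
    alpha = p_emit[v[0]] * p_h1
    alpha /= alpha.sum()
    filtered = [alpha]
    for vt in v[1:]:
        alpha = p_emit[vt] * (p_trans @ alpha)   # O(H^2) per step
        alpha /= alpha.sum()                     # normalise to get p(h_t | v_{1:t})
        filtered.append(alpha)
    return np.array(filtered)

# Made-up two-state example with binary observations.
p_h1 = np.array([0.5, 0.5])
p_trans = np.array([[0.9, 0.2],
                    [0.1, 0.8]])     # columns sum to 1
p_emit = np.array([[0.8, 0.3],
                   [0.2, 0.7]])      # rows: observation value; columns: hidden state
print(filtering([0, 0, 1, 0], p_h1, p_trans, p_emit))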
HMMs for speech recognition
h_t is the phoneme at time t; p(h_t|h_{t-1}) – language model; p(v_t|h_t) – speech signal model.
Deep Nets and HMMs
(Figure: HMM with hidden phonemes h_1, \ldots, h_4 emitting signals v_1, \ldots, v_4.)
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, \mu(h_t; \theta).
This function is a deep neural network trained on a large amount of data
There is a gold rush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
(Figure: generative belief network with latent variables h_1, h_2 as parents of the visible variables v_1, \ldots, v_4.)
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model, not a function.
The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.
The bad news
Inference (computing p(h|v)) and parameter learning are intractable in these models.
Statisticians typically use sampling as an approximation
It is very popular in ML to use a variational method – much faster for inference.
Variational Inference

Consider a distribution

p(v|\theta) = \int_h p(v|h, \theta) p(h)

and suppose that we wish to learn \theta to maximise the probability that this model generates the observed data.
\log p(v|\theta) \ge -\int_h q(h|v, \phi) \log q(h|v, \phi) + \int_h q(h|v, \phi) \log p(v|h, \theta) + \text{const}
The idea is to choose a 'variational' distribution q(h|v, \phi) such that we can either calculate the bound analytically or sample it efficiently.
We then jointly maximise the bound with respect to \phi and \theta.
We can parameterise p(v|h, \theta) using a deep network.
Very popular approach – see the 'variational autoencoder' and also attention mechanisms.
Extension to a semi-supervised method using p(v) = \int_h \sum_c p(v|h, c) p(c) p(h).
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals.
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Then learn which action to take given the low-dimensional representation (see the sketch below).
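The slides leave the decision-making method unspecified; purely as an illustration, here is a minimal tabular Q-learning sketch, assuming the screen has already been mapped to a small discrete representation. The environment dynamics here are a hypothetical stand-in.

import numpy as np

n_states, n_actions = 10, 4            # toy discretised representation (hypothetical)
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.99, 0.1     # learning rate, discount, exploration
rng = np.random.default_rng(0)

def step(s, a):
    """Hypothetical toy dynamics: the action moves the state; reward in state 9."""
    s_next = (s + a + 1) % n_states
    return s_next, float(s_next == 9)

s = 0
for _ in range(10000):
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
    s_next, r = step(s, a)
    # Q-learning: move Q(s,a) towards r + gamma * max_a' Q(s',a')
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next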
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing.
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve the interaction between reinforcement learning and representation learning.
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer:
https://reinfer.io
Google Cats
10 million YouTube video frames (200×200 pixel images)
Use a specialised autoencoder with 9 layers (1 billion weights)
2000 computers + two weeks of computing
Examine units to see what images they most respond to
Google Autoencoder
From Nando De Freitas
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond not to macro features (such as bicycles) but to micro features.
For example, in handwritten digit recognition they correspond to small constituent parts of the digits.
These are then used to process the image into a representation that is better for recognition.
NNs in NLP
Bag of Words
We have D words in a dictionary {aardvark, \ldots, zorro}, so that we can relate each word with its dictionary index.
We can also think of this as a Euclidean embedding e
aardvark \to e_{aardvark} = (1, 0, \ldots, 0)^{\mathsf{T}}

zorro \to e_{zorro} = (0, \ldots, 0, 1)^{\mathsf{T}}
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned
The objective is, for example, next-word prediction accuracy.
These are often called 'neural language models'.
NNs in NLP
Each word w in the dictionary has an associated embedding vector v_w. Usually around 200-dimensional vectors are used.
Consider the sentence
the cat sat on the mat
and that we wish to predict the word 'on' given the two preceding words 'cat sat' and the two succeeding words 'the mat'.
We can use a network that has inputs v_{cat}, v_{sat}, v_{the}, v_{mat}.
The output of the network is a probability over all words in the dictionary, p(w | v_{inputs}). We want p(w = on | v_{cat}, v_{sat}, v_{the}, v_{mat}) to be high.
The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context
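A minimal sketch of this setup: tiny dictionary, a single linear layer and a softmax. All sizes and word indices are illustrative, and training by gradient ascent on log p(target | context) is omitted.

import numpy as np

rng = np.random.default_rng(0)
D, d, C = 50, 20, 4                 # dictionary size, embedding dim, context words
V = rng.normal(0, 0.1, (D, d))      # learned word embeddings v_w
W = rng.normal(0, 0.1, (D, C * d))  # network mapping the context to word scores

def predict(context_ids):
    """p(w | v_inputs): softmax over all dictionary words."""
    x = np.concatenate([V[i] for i in context_ids])   # e.g. v_cat, v_sat, v_the, v_mat
    scores = W @ x
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Gradient ascent on log p(target | context) would update both W and the rows
# of V used in the context -- embeddings and network are learned jointly.
p = predict([3, 17, 5, 8])          # made-up word indices
print(p.shape, p.sum())             # (50,) 1.0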
Word Embeddings
Given a word (France, for example) we can find which words w have embedding vectors closest to v_{France}. From Ronan Collobert (2011).
Word Embeddings
There appears to be a natural 'geometry' to the embeddings. For example, there are directions that correspond to gender:

v_{woman} - v_{man} \approx v_{aunt} - v_{uncle}
v_{woman} - v_{man} \approx v_{queen} - v_{king}
From Mikolov (2013)
Word Embeddings: Analogies
Given a relationship France–Paris, we get the 'relationship' embedding

v = v_{Paris} - v_{France}

Given Italy, we can calculate v_{Italy} + v and find the word in the dictionary with the closest embedding to this (it turns out to be Rome). From Mikolov (2013).
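The lookup itself is just a nearest-neighbour search in embedding space; here is a sketch with random stand-in embeddings (a trained model would be needed for the answer to actually be Rome).

import numpy as np

rng = np.random.default_rng(0)
words = ['France', 'Paris', 'Italy', 'Rome', 'cat', 'mat']   # toy vocabulary
V = rng.normal(0, 1, (len(words), 50))                       # stand-in embeddings
idx = {w: i for i, w in enumerate(words)}

def closest_word(vec, exclude=()):
    """Word whose embedding has the highest cosine similarity to vec."""
    sims = V @ vec / (np.linalg.norm(V, axis=1) * np.linalg.norm(vec))
    return max((w for w in words if w not in exclude), key=lambda w: sims[idx[w]])

v = V[idx['Paris']] - V[idx['France']]                       # relationship embedding
print(closest_word(V[idx['Italy']] + v, exclude={'Italy', 'Paris', 'France'}))
# With real trained embeddings this returns 'Rome'; with random stand-ins
# the answer is of course meaningless.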
Word Embeddings: Constrained Embeddings

We can learn embeddings for English words and embeddings for Chinese words.
However, when we know that a Chinese and an English word have a similar meaning, we add a constraint that the word embeddings v_{ChineseWord} and v_{EnglishWord} should be close.
We have only a small amount of labelled 'similar' Chinese–English words (these are the green border boxes in the figure; they are standard translations of the corresponding Chinese character).
We can visualise the embedding vectors in 2D (using t-SNE). See Socher (2013).
Word Embeddings: Constrained Embeddings

(Figure: 2D t-SNE visualisation of the jointly embedded Chinese and English words.)
Recursive Nets and Embeddings
Stanford Sentiment Treebank: parsed sentences with a sentiment label (−−, −, 0, +, ++) for each node (phrase) in the tree; 215,000 labelled phrases (obtained from three humans).
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and Embeddings: Training
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with common parameters, which forms the input to the sentiment classifier.
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier; see Socher (2013). A toy sketch of the recursive computation follows.
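A minimal sketch of the recursive scheme on a toy parse tree, with a shared combiner g and a shared softmax classifier. The weights are random and the tanh combiner is my assumption; this is for illustration only.

import numpy as np

rng = np.random.default_rng(0)
d, n_classes = 8, 5                       # embedding dim; sentiment classes --,-,0,+,++
Wg = rng.normal(0, 0.1, (d, 2 * d))       # shared combination network g
Wc = rng.normal(0, 0.1, (n_classes, d))   # shared softmax classifier

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sentiment(node, embed):
    """node: word string (leaf) or (left, right) tuple; returns (vector, class probs)."""
    if isinstance(node, str):
        h = embed[node]                               # leaf input: the word embedding
    else:
        hl, _ = sentiment(node[0], embed)
        hr, _ = sentiment(node[1], embed)
        h = np.tanh(Wg @ np.concatenate([hl, hr]))    # combine the children with g
    return h, softmax(Wc @ h)                         # per-node sentiment prediction

embed = {w: rng.normal(0, 0.1, d) for w in ['not', 'very', 'good']}
tree = ('not', ('very', 'good'))                      # toy parse of "not very good"
vec, probs = sentiment(tree, embed)
print(probs)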
Recursive Nets and Embeddings

(Figure: RNTN predictions of positive and negative (bottom right) sentences and their negations, e.g. 'Roger Dodger is one of the most compelling variations on this theme' versus '... one of the least compelling variations ...'; from Socher (2013).)
Recurrent Nets
(Figure: an RNN unrolled through time, with inputs x_1, x_2, x_3, hidden units h_1, h_2, h_3, outputs y_1, y_2, y_3, and shared weight matrices A, B, C.)
RNNs are used in timeseries applications
The basic idea is that the hidden units at time t (and possibly the output y_t) depend on the previous state of the network h_{t-1}, x_{t-1}, y_{t-1}, for inputs x_t and outputs y_t.
In the above network I 'unrolled the net through time' to give a standard NN diagram.
I omitted the potential links from x_{t-1}, y_{t-1} to h_t. A minimal sketch of the unrolled computation follows.
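A minimal sketch of the forward pass. Which letter labels which connection in the diagram is my guess, and tanh hidden units with linear outputs are an assumption.

import numpy as np

rng = np.random.default_rng(0)
dx, dh, dy, T = 3, 5, 2, 4
A = rng.normal(0, 0.5, (dh, dh))   # h_{t-1} -> h_t (shared across time)
B = rng.normal(0, 0.5, (dh, dx))   # x_t     -> h_t
C = rng.normal(0, 0.5, (dy, dh))   # h_t     -> y_t

x = rng.normal(0, 1, (T, dx))
h = np.zeros(dh)
for t in range(T):
    h = np.tanh(A @ h + B @ x[t])  # hidden state depends on h_{t-1} and x_t
    y = C @ h
    print(t, y)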
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples. The top line is real handwriting, for comparison. See Alex Graves's work.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff?
AutoDiff takes a function f(x) and returns an exact value (up to machine accuracy) for the gradient

g_i(x) \equiv \left. \frac{\partial f}{\partial x_i} \right|_{x}
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse Differentiation

A useful graphical representation: the total derivative of f with respect to x is given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the functions on the edges:

\frac{df}{dx} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial g} \frac{dg}{dx}

(Figure: graph with nodes x, g, f; edge x \to f labelled \partial f/\partial x, edge x \to g labelled dg/dx, edge g \to f labelled \partial f/\partial g.)
Example

For f(x) = x^2 + x g h, where g = x^2 and h = x g^2:

(Figure: path diagram with edge labels x \to f: 2x + gh; x \to g: 2x; g \to f: xh; g \to h: 2gx; x \to h: g^2; h \to f: xg.)

f'(x) = (2x + gh) + (g^2 \cdot xg) + (2x \cdot 2gx \cdot xg) + (2x \cdot xh) = 2x + 8x^7
Reverse Differentiation

Consider

f(x_1, x_2) = \cos(\sin(x_1 x_2))

We can represent this computationally using an Abstract Syntax Tree (AST):

(Figure: AST with leaves x_1, x_2 feeding f_1, which feeds f_2, which feeds f_3.)

f_1(x_1, x_2) = x_1 x_2
f_2(x) = \sin(x)
f_3(x) = \cos(x)
Given values for x_1, x_2, we first run forwards through the tree so that we can associate each node with an actual function value.
Reverse Differentiation

\frac{df_3}{dx_1} = \frac{\partial f_3}{\partial f_2} \frac{df_2}{dx_1} = \underbrace{\frac{\partial f_3}{\partial f_2} \frac{df_2}{df_1}}_{df_3/df_1} \frac{df_1}{dx_1}

Similarly,

\frac{df_3}{dx_2} = \underbrace{\frac{\partial f_3}{\partial f_2} \frac{df_2}{df_1}}_{df_3/df_1} \frac{df_1}{dx_2}

The two derivatives share the same computation branch, and we want to exploit this.
Reverse Differentiation

The local derivatives on the edges of the tree are

\frac{\partial f_1}{\partial x_1} = x_2, \quad \frac{\partial f_1}{\partial x_2} = x_1, \quad \frac{\partial f_2}{\partial f_1} = \cos(f_1), \quad \frac{\partial f_3}{\partial f_2} = -\sin(f_2)

1. Find the reverse ancestral (backwards) schedule of nodes (f_3, f_2, f_1, x_1, x_2).
2. Start with the first node n_1 in the reverse schedule and define t_{n_1} = 1.
3. For the next node n in the reverse schedule, find the child nodes ch(n). Then define
   t_n = \sum_{c \in ch(n)} \frac{\partial f_c}{\partial f_n} t_c
4. The total derivatives of f with respect to the root nodes of the tree (here x_1 and x_2) are given by the values of t at those nodes.
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
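Following the schedule above on f(x_1, x_2) = cos(sin(x_1 x_2)), here is a minimal sketch, checked against central differences (which, as noted earlier, give only an approximation):

import numpy as np

# Forward pass: associate each node of the AST with its value.
x1, x2 = 0.7, -1.3
f1 = x1 * x2            # f1 = x1 x2
f2 = np.sin(f1)
f3 = np.cos(f2)

# Local partial derivatives on the edges of the tree.
df1_dx1 = x2
df1_dx2 = x1
df2_df1 = np.cos(f1)
df3_df2 = -np.sin(f2)

# Reverse schedule (f3, f2, f1, x1, x2): each node's t-value sums
# (edge partial) * (child t) over its children in the tree.
t_f3 = 1.0
t_f2 = df3_df2 * t_f3
t_f1 = df2_df1 * t_f2        # shared branch df3/df1, computed once
t_x1 = df1_dx1 * t_f1        # total derivative df3/dx1
t_x2 = df1_dx2 * t_f1        # total derivative df3/dx2

# Check against a central-difference approximation.
eps = 1e-6
num = (np.cos(np.sin((x1 + eps) * x2)) - np.cos(np.sin((x1 - eps) * x2))) / (2 * eps)
print(t_x1, num)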
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Burglar Problem
Creaks and Bumps

(Figure: illustrations labelled 'Creak' and 'Bump'.)
Burglar Model
(Figure: HMM with positions pos_1, \ldots, pos_4 emitting sounds snd_1, \ldots, snd_4.)

pos – position in kitchen; snd – sound
Finding the Burglar

(Figure: the kitchen grid shown over successive time steps, with the observed creak/bump sequence and the inferred burglar position at each step.)
Stubby Fingers
(Figure: HMM with intended keys int_1, \ldots, int_4 emitting hit keys hit_1, \ldots, hit_4.)

int – intended key; hit – hit key
Stubby Fingers: errors

(Figure: heat map of the error model p(hit|int) over the keys a–z, with values ranging from about 0.05 to 0.55.)
Stubby Fingers: language

(Figure: heat map of the first order language model p(int_t | int_{t-1}) over the keys a–z, with values ranging from 0 to about 0.9.)
Stubby Fingers
Given the typed sequence 'cwsykcak', what is the most likely word that this corresponds to?
List the 200 most likely hidden sequences (see the Viterbi sketch below).
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
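A sketch of the most-likely-sequence computation (Viterbi) behind the first step, on a tiny three-key alphabet; the tables are made-up stand-ins for the error and language models plotted above.

import numpy as np

def viterbi(hits, p_int1, p_lang, p_err):
    """argmax over int_{1:T} of p(int_{1:T}, hit_{1:T}) for a first order HMM.

    p_int1[i]    = p(int_1 = i)
    p_lang[i, j] = p(int_t = i | int_{t-1} = j)
    p_err[k, i]  = p(hit_t = k | int_t = i)
    """
    logd = np.log(p_err[hits[0]]) + np.log(p_int1)
    back = []
    for k in hits[1:]:
        scores = np.log(p_lang) + logd          # scores[i, j]: come from j, go to i
        back.append(scores.argmax(axis=1))
        logd = np.log(p_err[k]) + scores.max(axis=1)
    path = [int(logd.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]

# Three 'keys' 0, 1, 2 with a noisy hit model (made-up numbers).
p_int1 = np.array([0.4, 0.3, 0.3])
p_lang = np.array([[0.6, 0.2, 0.3],
                   [0.2, 0.6, 0.3],
                   [0.2, 0.2, 0.4]])   # columns sum to 1
p_err = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])
print(viterbi([0, 1, 1, 2], p_int1, p_lang, p_err))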
Speech Recognition: raw signal

(Figure: raw speech waveform; amplitude between −0.2 and 0.3, plotted against time from 0 to 0.9 seconds.)
'Neural' representation

(Figure: the corresponding time–frequency ('neural') representation of the signal.)
Speech Recognition
(Figure: HMM with phonemes pho_1, \ldots, pho_4 emitting audio features aud_1, \ldots, aud_4.)

pho – phoneme (letter); aud – audio signal ('neural' representation)
Medical Diagnosis
(Figure: belief network in which the diseases tumour, flu and meningitis are parents of the symptoms and tests headache, fever, appetite and x-ray.)
Combine known medical knowledge with patient-specific information.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
A natural framework to use in models of physical systems, such as the Ising model (1920), and in AI applications, such as the HMM (Baum 1966; Stratonovich 1960).
The need for structure
We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.).
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason, computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however, these are typically frowned upon by purists.
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models are then a marriage between graph theory and probability theory.
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model).
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google, Microsoft, Facebook: used in many places, including advertising, video game prediction and speech recognition.
Used to estimate inherent desirability of products in consumer retail
Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company/user relationship.
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

p(x|y) \equiv \frac{p(x, y)}{p(y)} = \frac{p(y|x) p(x)}{p(y)} \quad \text{(Bayes' rule)}
Throwing darts

p(\text{region 5} \,|\, \text{not region 20}) = \frac{p(\text{region 5}, \text{not region 20})}{p(\text{not region 20})} = \frac{p(\text{region 5})}{p(\text{not region 20})} = \frac{1/20}{19/20} = \frac{1}{19}
Interpretation: p(A = a|B = b) should not be interpreted as 'given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation is 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.
Battleships
Assume there are 2 ships, one vertical (ship 1) and one horizontal (ship 2), of 5 pixels each.
They can be placed anywhere on the 10×10 grid, but cannot overlap.
Let s_1 be the origin of ship 1 and s_2 the origin of ship 2.
Data D is a collection of query 'hit' or 'miss' responses.
p(s_1, s_2|D) = \frac{p(D|s_1, s_2) p(s_1, s_2)}{p(D)}

Let X be the matrix of pixel occupancy:

p(X|D) = \sum_{s_1, s_2} p(X, s_1, s_2|D) = \sum_{s_1, s_2} p(X|s_1, s_2) p(s_1, s_2|D)
demoBattleships.m
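The MATLAB demo is not reproduced here; a minimal Python sketch of the same computation (enumerate the legal origins, weight each by the data likelihood, then form the pixel marginals) might look like the following. The observations in D are made up.

import numpy as np
from itertools import product

G = 10                                   # grid size

def cells1(s1):
    r, c = s1
    return {(r + k, c) for k in range(5)}   # ship 1: vertical, 5 pixels

def cells2(s2):
    r, c = s2
    return {(r, c + k) for k in range(5)}   # ship 2: horizontal, 5 pixels

# Enumerate legal, non-overlapping placements of the two origins.
placements = [(s1, s2)
              for s1 in product(range(G - 4), range(G))
              for s2 in product(range(G), range(G - 4))
              if not cells1(s1) & cells2(s2)]

# Data D: queried pixel -> hit (True) / miss (False); made-up observations.
D = {(2, 3): True, (0, 0): False}

def likelihood(s1, s2):
    occ = cells1(s1) | cells2(s2)
    return float(all((q in occ) == hit for q, hit in D.items()))

w = np.array([likelihood(s1, s2) for s1, s2 in placements])
post = w / w.sum()                       # p(s1, s2 | D) under a uniform prior

# p(X | D): marginal pixel occupancy, summing over placements.
X = np.zeros((G, G))
for p_, (s1, s2) in zip(post, placements):
    for cell in cells1(s1) | cells2(s2):
        X[cell] += p_
print(X.round(2))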
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated with it the conditional probability of the node given its parents.
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|C) p(E|B, C)

(Figure: the corresponding DAG, with A and B the parents of C, C the parent of D, and B and C the parents of E.)
Example – Part I

Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.
Choosing an ordering

Without loss of generality, we can write

p(A, R, E, B) = p(A|R, E, B) p(R, E, B)
= p(A|R, E, B) p(R|E, B) p(E, B)
= p(A|R, E, B) p(R|E, B) p(E|B) p(B)
Assumptions:

The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B).
The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E).
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E).

Therefore

p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)
Example – Part II: Specifying the Tables

(Figure: DAG with B and E the parents of A, and E the parent of R.)

p(A = 1|B, E):

B = 1, E = 1: 0.9999
B = 1, E = 0: 0.99
B = 0, E = 1: 0.99
B = 0, E = 0: 0.0001

p(R = 1|E):

E = 1: 1
E = 0: 0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.
Example – Part III: Inference

Initial evidence: the alarm is sounding.

p(B = 1|A = 1) = \frac{\sum_{E,R} p(B = 1, E, A = 1, R)}{\sum_{B,E,R} p(B, E, A = 1, R)} = \frac{\sum_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E)}{\sum_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E)} \approx 0.99
Additional evidence: the radio broadcasts an earthquake warning.

A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01.
Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
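These numbers are easy to verify by direct enumeration over the four binary variables; a short Python check of both posteriors:

from itertools import product

pB, pE = 0.01, 0.000001

def pA(b, e):                      # p(A = 1 | B, E) from the table above
    return {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}[(b, e)]

def pR(r, e):                      # p(R | E): the radio reports iff there is an earthquake
    return 1.0 if r == e else 0.0

def joint(b, e, a, r):
    pb = pB if b else 1 - pB
    pe = pE if e else 1 - pE
    pa = pA(b, e) if a else 1 - pA(b, e)
    return pa * pR(r, e) * pe * pb

# p(B = 1 | A = 1): marginalise E and R.
num = sum(joint(1, e, 1, r) for e, r in product([0, 1], repeat=2))
den = sum(joint(b, e, 1, r) for b, e, r in product([0, 1], repeat=3))
print(num / den)                   # approx 0.99

# p(B = 1 | A = 1, R = 1): the earthquake report 'explains away' the alarm.
num = sum(joint(1, e, 1, 1) for e in [0, 1])
den = sum(joint(b, e, 1, 1) for b, e in product([0, 1], repeat=2))
print(num / den)                   # approx 0.01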
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Google Autoencoder
From Nando De Freitas
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond (not to macro features such as bicycles)but micro features
For example in handwritten digit recognition they correspond to smallconstituent parts of the digits
These are used then to process the image into a representation that is betterfor recognition
NNs in NLP
Bag of Words
We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index
We can also think of this as a Euclidean embedding e
aardvarkrarr eaardvark =
100
zorrorarr ezorro =
001
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned
Objective is for example next word prediction accuracy
These are often called lsquoneural language modelsrsquo
NNs in NLP
Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used
Consider the sentence
the cat sat on the mat
and that we wish to predict the word on given the two preceding cat sat
and two succeeding words the mat
We can use a network that has inputs vcat vsat vthe vmat
The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high
The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context
Word Embeddings
Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)
Word Embeddings
There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender
vwoman minus vman asymp vaunt minus vuncle
vwoman minus vman asymp vqueen minus vking
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis minus vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond (not to macro features such as bicycles)but micro features
For example in handwritten digit recognition they correspond to smallconstituent parts of the digits
These are used then to process the image into a representation that is betterfor recognition
NNs in NLP
Bag of Words
We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index
We can also think of this as a Euclidean embedding e
aardvarkrarr eaardvark =
100
zorrorarr ezorro =
001
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned
Objective is for example next word prediction accuracy
These are often called lsquoneural language modelsrsquo
NNs in NLP
Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used
Consider the sentence
the cat sat on the mat
and that we wish to predict the word on given the two preceding cat sat
and two succeeding words the mat
We can use a network that has inputs vcat vsat vthe vmat
The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high
The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context
Word Embeddings
Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)
Word Embeddings
There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender
vwoman minus vman asymp vaunt minus vuncle
vwoman minus vman asymp vqueen minus vking
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis minus vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddings

[Figure (from Socher (2013)): RNTN predictions of positive and negative (bottom right) sentences and their negations.]
Recurrent Nets
[Diagram: an RNN unrolled through time, with inputs x_1, x_2, x_3, hidden units h_1, h_2, h_3, outputs y_1, y_2, y_3, and weight matrices A, B, C shared across time steps.]
RNNs are used in timeseries applications
The basic idea is that the hidden units h_t at time t (and possibly the output y_t) depend on the previous state of the network h_{t−1}, x_{t−1}, y_{t−1}, for inputs x_t and outputs y_t.
In the above network I 'unrolled the net through time' to give a standard NN diagram.
I omitted the potential links from x_{t−1}, y_{t−1} to h_t. (A small sketch of the forward pass follows.)
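A sketch of the unrolled computation in numpy; the exact roles of the weight matrices A, B, C in the figure are an assumption here (input-to-hidden, hidden-to-hidden and hidden-to-output respectively):

```python
# Forward pass of a simple RNN: h_t = tanh(A x_t + B h_{t-1}), y_t = C h_t.
import numpy as np

rng = np.random.default_rng(3)
Dx, Dh, Dy, T = 4, 8, 3, 5
A = rng.normal(scale=0.3, size=(Dh, Dx))   # input -> hidden
B = rng.normal(scale=0.3, size=(Dh, Dh))   # hidden -> hidden (recurrence)
C = rng.normal(scale=0.3, size=(Dy, Dh))   # hidden -> output

x = rng.normal(size=(T, Dx))               # an input timeseries
h = np.zeros(Dh)
for t in range(T):
    h = np.tanh(A @ x[t] + B @ h)          # new hidden state
    y = C @ h                              # output at time t
    print(t, y)
```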
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples. Top line is real handwriting, for comparison. See Alex Graves's work.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff?
AutoDiff takes a function f(x) and returns an exact value (up to machine accuracy) for the gradient

g_i(x) ≡ (∂f/∂x_i)|_x

Note that this is not the same as a numerical approximation (such as central differences) for the gradient.
One can show that, if done efficiently, one can always calculate the gradient in less than 5 times the time it takes to compute f(x).
Reverse Differentiation
A useful graphical representation is that the total derivative of f with respect to x is given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the functions on the edges:

df/dx = ∂f/∂x + (∂f/∂g)(dg/dx)

[Diagram: nodes x, g, f; edge x → f labelled ∂f/∂x, edge x → g labelled dg/dx, edge g → f labelled ∂f/∂g.]
Example
For f(x) = x² + xgh, where g = x² and h = xg²:

[Diagram: nodes x, g, h, f; edge labels: x → f: 2x + gh; x → g: 2x; x → h: g²; g → h: 2gx; g → f: xh; h → f: xg.]

f′(x) = (2x + gh) + (g² · xg) + (2x · 2gx · xg) + (2x · xh) = 2x + 8x⁷
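The example can be checked mechanically, for instance with sympy: substituting g = x² and h = xg² gives f = x² + x⁸, whose derivative matches the path sum:

```python
# Symbolic check of the path-sum example.
import sympy as sp

x = sp.symbols('x')
g = x**2
h = x * g**2
f = x**2 + x * g * h
print(sp.expand(f))             # x**8 + x**2
print(sp.diff(f, x))            # 8*x**7 + 2*x
```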
Reverse Differentiation
Consider

f(x_1, x_2) = cos(sin(x_1 x_2))
We can represent this computationally using an Abstract Syntax Tree (AST):

[AST diagram: x_1, x_2 → f_1 → f_2 → f_3]

f_1(x_1, x_2) = x_1 x_2
f_2(x) = sin(x)
f_3(x) = cos(x)
Given values for x_1, x_2, we first run forwards through the tree so that we can associate each node with an actual function value.
Reverse Differentiation

df_3/dx_1 = (∂f_3/∂f_2)(df_2/dx_1) = (∂f_3/∂f_2)(df_2/df_1) · (df_1/dx_1),  where (∂f_3/∂f_2)(df_2/df_1) = df_3/df_1

Similarly,

df_3/dx_2 = (∂f_3/∂f_2)(df_2/df_1) · (df_1/dx_2) = (df_3/df_1)(df_1/dx_2)
The two derivatives share the same computation branch, and we want to exploit this.
Reverse Differentiation

[AST diagram as before: x_1, x_2 → f_1 → f_2 → f_3]

∂f_1/∂x_1 = x_2,  ∂f_1/∂x_2 = x_1,  ∂f_2/∂f_1 = cos(f_1),  ∂f_3/∂f_2 = −sin(f_2)
1. Find the reverse ancestral (backwards) schedule of nodes (f_3, f_2, f_1, x_1, x_2).
2. Start with the first node n_1 in the reverse schedule and define t_{n_1} = 1.
3. For the next node n in the reverse schedule, find the child nodes ch(n). Then define

t_n = Σ_{c ∈ ch(n)} (∂f_c/∂f_n) t_c

4. The total derivatives of f with respect to the root nodes of the tree (here x_1 and x_2) are given by the values of t at those nodes.
This is a general procedure that can be used to automatically define a subroutine to efficiently compute the gradient. It is efficient because information is collected at nodes in the tree and split between parents only when required; a sketch in code is given below.
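A sketch of this procedure in Python for the running example f(x_1, x_2) = cos(sin(x_1 x_2)); the node names follow the AST above:

```python
# Reverse pass: each t_n accumulates sum over children c of (df_c/df_n) * t_c.
import math

def grad_f(x1, x2):
    # Forward pass: evaluate every node in the AST.
    f1 = x1 * x2
    f2 = math.sin(f1)
    f3 = math.cos(f2)

    # Local partial derivatives on the edges of the tree.
    partial = {
        ('f3', 'f2'): -math.sin(f2),
        ('f2', 'f1'): math.cos(f1),
        ('f1', 'x1'): x2,
        ('f1', 'x2'): x1,
    }
    children = {'f2': ['f3'], 'f1': ['f2'], 'x1': ['f1'], 'x2': ['f1']}

    # Reverse schedule (f3, f2, f1, x1, x2), starting with t_{f3} = 1.
    t = {'f3': 1.0}
    for n in ['f2', 'f1', 'x1', 'x2']:
        t[n] = sum(partial[(c, n)] * t[c] for c in children[n])
    return f3, t['x1'], t['x2']

val, g1, g2 = grad_f(0.3, 0.7)
# Check df/dx1 against the chain rule done by hand:
print(g1, -math.sin(math.sin(0.3 * 0.7)) * math.cos(0.3 * 0.7) * 0.7)
```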
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another, and requires complex reasoning using some form of internal model.
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
[Images: 'Creak' and 'Bump'.]
Burglar Model
[Diagram: Markov chain pos_1 → pos_2 → pos_3 → pos_4 with emissions pos_t → snd_t.]

pos – position in kitchen; snd – sound
Finding the Burglar

[Figure (three slides): a grid of kitchen positions over time, annotated with the observed sounds (creaks and bumps) at each step; successive slides show the posterior over the burglar's position being updated as more sounds arrive.]
Stubby Fingers
Stubby Fingers
[Diagram: Markov chain int_1 → int_2 → int_3 → int_4 with emissions int_t → hit_t.]

int – intended key; hit – hit key
Stubby Fingers: errors

[Figure: the key-error distribution p(hit | int) over the letters a–z, shown as a matrix; colour scale from 0.05 to 0.55.]
Stubby Fingers: language

[Figure: the letter-transition (language) model over a–z, shown as a matrix; colour scale from 0 to 0.9.]
Stubby Fingers
Given the typed sequence cwsykcak, what is the most likely word that this corresponds to?
List the 200 most likely hidden sequences.
Discard those that are not in a standard English dictionary.
Take the most likely proper English word as the intended typed word. (A simplified sketch of this idea follows.)
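A heavily simplified sketch of the idea: rather than listing the 200 most likely hidden sequences, score every dictionary word of the right length directly under a toy error model. The neighbour map, the error probabilities and the three-word dictionary are all illustrative assumptions:

```python
# Score dictionary words by the log-probability of producing the typed keys.
import math

# Partial QWERTY neighbour map (illustrative).
neighbours = {'c': 'xvdf', 'w': 'qeas', 's': 'adwez', 'y': 'tugh',
              'k': 'jlim', 'a': 'qwsz'}

def log_p(hit, intended):
    """Toy error model: correct key with prob 0.7, a nearby key otherwise."""
    if hit == intended:
        return math.log(0.7)
    if intended in neighbours.get(hit, '') or hit in neighbours.get(intended, ''):
        return math.log(0.05)
    return math.log(1e-6)          # unlikely, but not impossible

typed = "cwsykcak"
dictionary = ["castaway", "disliked", "breakage"]   # stand-in for a real dictionary
best = max((w for w in dictionary if len(w) == len(typed)),
           key=lambda w: sum(log_p(h, i) for h, i in zip(typed, w)))
print(best)
```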
Speech Recognition: raw signal

[Figure: raw audio waveform; amplitude (≈ −0.2 to 0.3) against time (0 to 0.9 seconds).]
'Neural' representation

[Figure: spectrogram-like representation; ≈ 25 channels over ≈ 80 time frames.]
Speech Recognition
[Diagram: Markov chain pho_1 → pho_2 → pho_3 → pho_4 with emissions pho_t → aud_t.]

pho – phoneme (letter); aud – audio signal (neural representation)
Medical Diagnosis
[Diagram: belief network with diseases (tumour, flu, meningitis) as parents of symptoms (headache, fever, appetite, x-ray).]

Combine known medical knowledge with patient-specific information.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability?
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems, such as the Ising Model (1920), and in AI applications, such as the HMM (Baum 1966; Stratonovich 1960).
The need for structure
We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.).
Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented.
Without introducing strong structural limitations on how these objects can interact, probability is a non-starter.
For this reason, computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however, these are typically frowned upon by purists.
Graphical Models
We can use graphs to represent how objects can probabilistically interact with each other.
Graphical Models are then a marriage between Graph and Probability theory.
Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph.
The computational complexity of operations can often be related to the structure of the graph.
Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science.
Graphical Models are used to perform reasoning under uncertainty, and are therefore widely applicable.
Uses in Industry
Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model).
Hospitals: use Belief Nets to encode knowledge about diseases and symptoms, to aid medical diagnosis.
Google, Microsoft, Facebook: used in many places, including advertising, video game prediction, speech recognition.
Used to estimate the inherent desirability of products in consumer retail.
Microsoft and others: attempt to go beyond simple A/B testing by using Graphical Models to model the whole company–user relationship.
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

p(x|y) ≡ p(x, y)/p(y) = p(y|x) p(x)/p(y)    (Bayes' rule)
Throwing darts
p(region 5 | not region 20) = p(region 5, not region 20) / p(not region 20)
                            = p(region 5) / p(not region 20)
                            = (1/20) / (19/20) = 1/19
Interpretation: p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation is 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.
Battleships
Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each.
Each can be placed anywhere on the 10×10 grid, but they cannot overlap.
Let s_1 be the origin of ship 1 and s_2 the origin of ship 2.
The data D is a collection of query 'hit' or 'miss' responses.

p(s_1, s_2|D) = p(D|s_1, s_2) p(s_1, s_2) / p(D)

Let X be the matrix of pixel occupancy:

p(X|D) = Σ_{s_1,s_2} p(X, s_1, s_2|D) = Σ_{s_1,s_2} p(X|s_1, s_2) p(s_1, s_2|D)

demoBattleships.m
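The deck's demo is in MATLAB; below is a Python sketch of the same posterior. Placements inconsistent with the data get zero weight, and averaging the occupancy maps of the consistent placements gives p(X|D) under a uniform prior. The two example queries are made up:

```python
# Enumerate non-overlapping placements of the two ships and keep those
# consistent with the observed hit/miss queries (deterministic likelihood).
import numpy as np

G, L = 10, 5                       # grid size, ship length

def cells(origin, vertical):
    r, c = origin
    return [(r + i, c) if vertical else (r, c + i) for i in range(L)]

def placements(vertical):
    rmax = G - L if vertical else G - 1
    cmax = G - 1 if vertical else G - L
    return [(r, c) for r in range(rmax + 1) for c in range(cmax + 1)]

data = {(4, 4): True, (0, 0): False}   # queried pixel -> hit? (illustrative)

post = np.zeros((G, G)); n = 0
for s1 in placements(True):            # ship 1: vertical
    for s2 in placements(False):       # ship 2: horizontal
        occ = set(cells(s1, True)) | set(cells(s2, False))
        if len(occ) < 2 * L:           # ships overlap: prior probability zero
            continue
        if any((q in occ) != hit for q, hit in data.items()):
            continue                   # inconsistent with the data
        for (r, c) in occ:
            post[r, c] += 1.0
        n += 1
post /= n                              # p(pixel occupied | D), uniform prior
print(np.round(post, 2))
```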
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents.
The joint distribution is obtained by taking the product of the conditional probabilities:

p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|C) p(E|B, C)

[Diagram: DAG with A → C, B → C, C → D, B → E, C → E.]
Example – Part I: Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.
Choosing an ordering
Without loss of generality, we can write

p(A, R, E, B) = p(A|R, E, B) p(R, E, B)
             = p(A|R, E, B) p(R|E, B) p(E, B)
             = p(A|R, E, B) p(R|E, B) p(E|B) p(B)
Assumptions
The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B).
The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E).
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E).
Therefore
p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)
Example – Part II: Specifying the Tables

[Diagram: B → A ← E, with E → R.]

p(A = 1|B, E):
  Burglar  Earthquake  Alarm = 1
  1        1           0.9999
  1        0           0.99
  0        1           0.99
  0        0           0.0001

p(R = 1|E):
  Earthquake  Radio = 1
  1           1
  0           0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.
Example – Part III: Inference
Initial Evidence: the alarm is sounding.

p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
             = Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E) ≈ 0.99

Additional Evidence: the radio broadcasts an earthquake warning.
A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01.
Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake. (A brute-force check of these numbers is sketched below.)
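These numbers are easy to check by enumerating the joint distribution of the four binary variables:

```python
# Brute-force inference in the burglar belief net, using the tables above.
import itertools

pB = {1: 0.01, 0: 0.99}
pE = {1: 0.000001, 0: 0.999999}
pA1 = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}  # p(A=1|B,E)
pR1 = {1: 1.0, 0: 0.0}                                              # p(R=1|E)

def joint(a, r, e, b):
    pa = pA1[(b, e)] if a == 1 else 1.0 - pA1[(b, e)]
    pr = pR1[e] if r == 1 else 1.0 - pR1[e]
    return pa * pr * pE[e] * pB[b]

def p_burglar(a=None, r=None):
    """p(B = 1 | observed evidence), summing out the unobserved variables."""
    num = den = 0.0
    for aa, rr, e, b in itertools.product([0, 1], repeat=4):
        if (a is not None and aa != a) or (r is not None and rr != r):
            continue
        w = joint(aa, rr, e, b)
        den += w
        num += w * b
    return num / den

print(p_burglar(a=1))        # ~0.99
print(p_burglar(a=1, r=1))   # ~0.01
```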
Markov Models
For timeseries data v_1, …, v_T we need a model p(v_{1:T}). For causal consistency, it is meaningful to consider the decomposition

p(v_{1:T}) = Π_{t=1}^T p(v_t|v_{1:t−1})

with the convention p(v_t|v_{1:t−1}) = p(v_1) for t = 1.

[Diagram: cascade graph on v_1, v_2, v_3, v_4, with edges from every earlier variable to every later one.]
Independence assumptions
It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.
Markov Chain
Only the recent past is relevant:

p(v_t|v_1, …, v_{t−1}) = p(v_t|v_{t−L}, …, v_{t−1})

where L ≥ 1 is the order of the Markov chain. For L = 1,

p(v_{1:T}) = p(v_1) p(v_2|v_1) p(v_3|v_2) ⋯ p(v_T|v_{T−1})

For a stationary Markov chain the transitions p(v_t = s′|v_{t−1} = s) = f(s′, s) are time-independent ('homogeneous').
[Figure: (a) first-order Markov chain v_1 → v_2 → v_3 → v_4; (b) second-order Markov chain.]
Markov Chains
[Diagram: first-order chain v_1 → v_2 → v_3 → v_4.]

p(v_1, …, v_T) = p(v_1) Π_{t=2}^T p(v_t|v_{t−1}),  with p(v_1) the initial distribution and p(v_t|v_{t−1}) the transition.
State transition diagram
Nodes represent states of the variable v, and arcs the non-zero elements of the transition p(v_t|v_{t−1}).

[Figure: state transition diagram on states 1–9.]
Most probable and shortest paths
[Figure: the same state transition diagram on states 1–9.]
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1−8−9−7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1−2−7 the probability of exiting state 2 into state 7 is 1/5.
Equilibrium distribution
It is interesting to know how the marginal p(x_t) evolves through time:

p(x_t = i) = Σ_j M_ij p(x_{t−1} = j),  where M_ij ≡ p(x_t = i|x_{t−1} = j)
p(x_t = i) is the frequency that we visit state i at time t, given we started from p(x_1) and randomly drew samples from the transition p(x_τ|x_{τ−1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

p_t = M^{t−1} p_1

If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:

p_∞ = M p_∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
PageRank
Define the matrix
A_ij = { 1 if website j has a hyperlink to website i; 0 otherwise }

From this we can define a Markov transition matrix with elements

M_ij = A_ij / Σ_{i′} A_{i′j}
If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.
For each website i, a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain that word is then returned, ranked according to the importance of the site. (A sketch of the equilibrium calculation follows.)
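A sketch of the equilibrium calculation on a tiny, made-up link graph; power iteration converges to the unit-eigenvalue eigenvector:

```python
# PageRank-style equilibrium: column-normalise the adjacency matrix and
# power-iterate p_t = M p_{t-1} until it stops changing.
import numpy as np

# A[i, j] = 1 if site j links to site i (4 illustrative websites).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0)            # Markov transition matrix, columns sum to 1
p = np.full(4, 0.25)             # start from the uniform distribution
for _ in range(100):
    p = M @ p                    # p_t = M^(t-1) p_1
print(p)                         # equilibrium: the 'importance' of each site
```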
Hidden Markov Models
The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

p(h_{1:T}, v_{1:T}) = p(v_1|h_1) p(h_1) Π_{t=2}^T p(v_t|h_t) p(h_t|h_{t−1})
For a stationary HMM, the transition p(h_t|h_{t−1}) and emission p(v_t|h_t) distributions are constant through time.

[Figure: a first-order hidden Markov model with 'hidden' variables dom(h_t) = {1, …, H}, t = 1, …, T. The 'visible' variables v_t can be either discrete or continuous.]
The classical inference problems
Filtering (inferring the present): p(h_t|v_{1:t})
Prediction (inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): argmax_{h_{1:T}} p(h_{1:T}|v_{1:T})

For prediction, one is also often interested in p(v_t|v_{1:s}) for t > s.
Inference in Hidden Markov Models
Belief network representation of a HMM
[Diagram: the HMM belief network, h_1 → h_2 → h_3 → h_4 with emissions h_t → v_t.]

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).
The algorithms are variants of 'message passing on factor graphs'.
The algorithm is guaranteed to work if the graph is singly connected.
Huge research effort in the last 15 years to apply message passing for approximate inference in multiply connected graphs (e.g. low-density parity-check codes). (A filtering sketch follows.)
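As an illustration, a sketch of filtering (the forward algorithm) in numpy; the transition, emission and observation sequence are made-up toy values:

```python
# Forward algorithm: alpha_t(h) ∝ p(h_t, v_{1:t}); normalising gives p(h_t|v_{1:t}).
# Each step costs O(H^2), so the whole pass is linear in T.
import numpy as np

H = 3
trans = np.array([[0.7, 0.2, 0.1],     # trans[i, j] = p(h_t = i | h_{t-1} = j)
                  [0.2, 0.6, 0.3],
                  [0.1, 0.2, 0.6]])
emit = np.array([[0.9, 0.3, 0.5],      # emit[v, h] = p(v_t = v | h_t = h)
                 [0.1, 0.7, 0.5]])
prior = np.full(H, 1.0 / H)

obs = [0, 1, 1, 0]
alpha = emit[obs[0]] * prior
alpha /= alpha.sum()                   # p(h_1 | v_1)
for v in obs[1:]:
    alpha = emit[v] * (trans @ alpha)  # predict with the chain, correct with v
    alpha /= alpha.sum()               # filtered posterior at this time step
print(alpha)                           # p(h_T | v_{1:T})
```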
HMMs for speech recognition
h_t is the phoneme at time t; p(h_t|h_{t−1}) – language model; p(v_t|h_t) – speech signal model.
Deep Nets and HMMs
[Diagram: the same HMM structure, h_1 → … → h_4 with emissions to v_1, …, v_4.]

Recently, companies including Google have made big advances in speech recognition.
The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t; θ).
This function is a deep neural network, trained on a large amount of data.
There is a goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
[Diagram: latent variables h_1, h_2 with edges to visible variables v_1, …, v_4.]

It is natural to consider that objects (images, for example) can be constructed on the basis of a low-dimensional representation.
Note that this is a Graphical Model, not a Function.
The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h).
One cannot use an autoencoder to generate new images.
The bad news
Inference (computing p(h|v)) and parameter learning are intractable in these models.
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method – much faster for inference.
Variational Inference
Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and that we wish to learn θ to maximise the probability that this model generates the observed data. The log-likelihood has a lower bound:

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) + const.
Idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.
We then jointly maximise the bound w.r.t. φ and θ.
We can parameterise p(v|h, θ) using a deep network.
Very popular approach – see the 'variational autoencoder', and also attention mechanisms.
Extension to a semi-supervised method using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h). (A toy numerical check of the bound is sketched below.)
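A toy numerical check of the bound, for a model where everything is Gaussian and log p(v) is available in closed form (p(h) = N(0, 1), p(v|h) = N(h, 1), q(h|v, φ) = N(m, s²), all illustrative assumptions); the Monte Carlo estimate of the bound stays below the exact log-likelihood:

```python
# ELBO check: E_q[log p(v,h) - log q(h)] <= log p(v), with p(v) = N(0, 2) here.
import numpy as np

rng = np.random.default_rng(4)
v = 1.3
m, s = 0.5, 0.8                           # variational parameters phi

h = rng.normal(m, s, size=100000)         # samples from q(h|v, phi)
log_q = -0.5 * np.log(2 * np.pi * s**2) - (h - m)**2 / (2 * s**2)
log_p = (-0.5 * np.log(2 * np.pi) - h**2 / 2          # log p(h)
         - 0.5 * np.log(2 * np.pi) - (v - h)**2 / 2)  # log p(v|h)

bound = np.mean(log_p - log_q)
exact = -0.5 * np.log(2 * np.pi * 2) - v**2 / 4       # log N(v; 0, 2)
print(bound, "<=", exact)
```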
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games?
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals.
Problem is that the number of pixel states is enormous
Need to learn a low-dimensional representation of the screen (use a deep generative model).
Then learn which action to take given the low-dimensional representation.
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing.
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve the interaction between reinforcement learning and representation learning.
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company, reinfer:
https://reinfer.io
NNs in NLP
Bag of Words
We have D words in a dictionary aardvark zorro so that we can relateeach word with its dictionary index
We can also think of this as a Euclidean embedding e
aardvarkrarr eaardvark =
100
zorrorarr ezorro =
001
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) vthat are learned
Objective is for example next word prediction accuracy
These are often called lsquoneural language modelsrsquo
NNs in NLP
Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used
Consider the sentence
the cat sat on the mat
and that we wish to predict the word on given the two preceding cat sat
and two succeeding words the mat
We can use a network that has inputs vcat vsat vthe vmat
The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high
The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context
Word Embeddings
Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)
Word Embeddings
There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender
vwoman minus vman asymp vaunt minus vuncle
vwoman minus vman asymp vqueen minus vking
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis minus vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
NNs in NLP
Each word w in the dictionary has an associated embedding vector vwUsually around 200 dimensional vectors are used
Consider the sentence
the cat sat on the mat
and that we wish to predict the word on given the two preceding cat sat
and two succeeding words the mat
We can use a network that has inputs vcat vsat vthe vmat
The output of the network is a probability over all words in the dictionaryp(w| vinputs)We want p(w = on|vcatvsatvthevmat) to be high
The overall objective is then to learn all the word embeddings and networkparameters subject to predicting the word correctly based on the context
Word Embeddings
Given a word (France for example) we can find which words w have embeddingvectors closest to vFrance From Ronan Collabert (2011)
Word Embeddings
There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender
vwoman minus vman asymp vaunt minus vuncle
vwoman minus vman asymp vqueen minus vking
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis minus vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each
Can be placed anywhere on the 10×10 grid, but cannot overlap
Let s1 be the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query 'hit' or 'miss' responses
p(s1, s2|D) = p(D|s1, s2) p(s1, s2) / p(D)

Let X be the matrix of pixel occupancy:

p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2) p(s1, s2|D)
demoBattleships.m
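demoBattleships.m itself is not reproduced here, but the posterior is small enough to compute exactly. A sketch, assuming noise-free responses (so p(D|s1, s2) is 1 for placements consistent with every hit/miss query and 0 otherwise) and a uniform prior over legal placements:

```python
import numpy as np

def cells(origin, vertical):
    r, c = origin
    return [(r + i, c) if vertical else (r, c + i) for i in range(5)]

def legal(s1, s2):
    c1, c2 = cells(s1, True), cells(s2, False)
    on_grid = all(0 <= r < 10 and 0 <= c < 10 for r, c in c1 + c2)
    return on_grid and not set(c1) & set(c2)     # must not overlap

def posterior_occupancy(data):   # data: list of ((row, col), hit_boolean)
    X, total = np.zeros((10, 10)), 0.0
    origins = [(r, c) for r in range(10) for c in range(10)]
    for s1 in origins:
        for s2 in origins:
            if not legal(s1, s2):
                continue                          # zero prior weight
            occ = set(cells(s1, True)) | set(cells(s2, False))
            if all((q in occ) == hit for q, hit in data):   # 0/1 likelihood
                total += 1.0
                for r, c in occ:
                    X[r, c] += 1.0
    return X / total    # p(pixel occupied | D), summing over s1, s2

print(posterior_occupancy([((0, 0), False), ((5, 5), True)]).round(2))
```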
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has an associated conditional probability of that node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|C) p(E|B, C)

[figure: DAG over the nodes A, B, C, D, E]
Example – Part I
Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.
Choosing an ordering
Without loss of generality we can write

p(A, R, E, B) = p(A|R, E, B) p(R, E, B)
             = p(A|R, E, B) p(R|E, B) p(E, B)
             = p(A|R, E, B) p(R|E, B) p(E|B) p(B)
Assumptions
The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B)
The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E)
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E)
Therefore
p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)
Example – Part II: Specifying the Tables

[figure: belief network with edges B → A, E → A, E → R]

p(A = 1|B, E):

  Burglar  Earthquake  p(Alarm = 1)
  1        1           0.9999
  1        0           0.99
  0        1           0.99
  0        0           0.0001

p(R = 1|E):

  Earthquake  p(Radio = 1)
  1           1
  0           0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.
Example – Part III: Inference

Initial Evidence: the alarm is sounding

p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
             = Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E)
             ≈ 0.99
Additional Evidence: the radio broadcasts an earthquake warning

A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
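Both numbers can be checked by brute-force enumeration of the sixteen joint states, using the tables above. A sketch:

```python
from itertools import product

pB = {1: 0.01, 0: 0.99}
pE = {1: 0.000001, 0: 0.999999}
pA = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}  # p(A=1|B,E)
pR = {1: 1.0, 0: 0.0}                                              # p(R=1|E)

def joint(b, e, a, r):
    pa = pA[(b, e)] if a == 1 else 1 - pA[(b, e)]
    pr = pR[e] if r == 1 else 1 - pR[e]
    return pB[b] * pE[e] * pa * pr

def posterior_burglar(evidence):     # e.g. {'A': 1} or {'A': 1, 'R': 1}
    num = den = 0.0
    for b, e, a, r in product([0, 1], repeat=4):
        state = {'B': b, 'E': e, 'A': a, 'R': r}
        if any(state[k] != v for k, v in evidence.items()):
            continue                 # inconsistent with the evidence
        p = joint(b, e, a, r)
        den += p
        num += p * (b == 1)
    return num / den

print(posterior_burglar({'A': 1}))           # approx 0.99
print(posterior_burglar({'A': 1, 'R': 1}))   # approx 0.01
```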
Markov Models
For timeseries data v1, . . . , vT we need a model p(v1:T). For causal consistency it is meaningful to consider the decomposition

p(v1:T) = ∏_{t=1}^{T} p(vt|v1:t−1)

with the convention p(vt|v1:t−1) = p(v1) for t = 1.

[figure: chain v1 → v2 → v3 → v4]
Independence assumptions
It is often natural to assume that the influence of the immediate past is more relevant than the remote past; in Markov models only a limited number of previous observations are required to predict the future.
Markov Chain
Only the recent past is relevant
p(vt|v1, . . . , vt−1) = p(vt|vt−L, . . . , vt−1)

where L ≥ 1 is the order of the Markov chain. For a first order (L = 1) chain,

p(v1:T) = p(v1) p(v2|v1) p(v3|v2) · · · p(vT|vT−1)
For a stationary Markov chain the transitions p(vt = s′|vt−1 = s) = f(s′, s) are time-independent ('homogeneous')
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1, . . . , vT) = p(v1) ∏_{t=2}^{T} p(vt|vt−1)

with p(v1) the initial distribution and p(vt|vt−1) the transition.
State transition diagram
Nodes represent states of the variable v, and arcs non-zero elements of the transition p(vt|vt−1).

[figure: state transition diagram over states 1–9]
Most probable and shortest paths
[figure: the same state transition diagram over states 1–9]

The shortest (unweighted) path from state 1 to state 7 is 1−2−7.
The most probable path from state 1 to state 7 is 1−8−9−7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1−2−7 the probability of exiting state 2 into state 7 is 1/5.
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) = Σ_j M_ij p(xt−1 = j),  where M_ij ≡ p(xt = i|xt−1 = j)

p(xt = i) is the frequency with which we visit state i at time t, given we started from p(x1) and randomly drew samples from the transition p(xτ|xτ−1). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p1(i) is

pt = M^{t−1} p1

If, for t → ∞, p∞ is independent of the initial distribution p1, then p∞ is called the equilibrium distribution of the chain:

p∞ = M p∞

The equilibrium distribution is thus proportional to the eigenvector with unit eigenvalue of the transition matrix.
PageRank
Define the matrix

A_ij = 1 if website j has a hyperlink to website i, and 0 otherwise

From this we can define a Markov transition matrix with elements

M_ij = A_ij / Σ_{i′} A_{i′j}
If we jump from website to website, the equilibrium distribution component p∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.
For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain word w is then returned, ranked according to the importance of the site.
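A sketch of both ideas with a made-up four-site link matrix: build M by normalising the columns of A, then find p∞ by power iteration (equivalently, the unit-eigenvalue eigenvector of M). The real PageRank also mixes in a small uniform 'teleport' probability so the equilibrium distribution exists for any link structure.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],    # A[i, j] = 1 if site j links to site i
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
M = A / A.sum(axis=0, keepdims=True)   # M[i, j] = A[i, j] / sum_i' A[i', j]

p = np.full(4, 0.25)       # any initial distribution p1
for _ in range(200):       # p_t = M^(t-1) p_1
    p = M @ p
print(p)                   # equilibrium p_infinity: the 'importance' of each site
```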
Hidden Markov Models
The HMM defines a Markov chain on hidden (or 'latent') variables h1:T. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(vt|ht). This defines a joint distribution

p(h1:T, v1:T) = p(v1|h1) p(h1) ∏_{t=2}^{T} p(vt|ht) p(ht|ht−1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
[figure: a first order hidden Markov model, with 'hidden' variables dom(ht) = {1, . . . , H}, t = 1:T; the 'visible' variables vt can be either discrete or continuous]
The classical inference problems
Filtering (inferring the present): p(ht|v1:t)
Prediction (inferring the future): p(ht|v1:s), t > s
Smoothing (inferring the past): p(ht|v1:u), t < u
Likelihood: p(v1:T)
Most likely path (Viterbi alignment): argmax_{h1:T} p(h1:T|v1:T)

For prediction, one is also often interested in p(vt|v1:s) for t > s.
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states)
The algorithms are variants of 'message passing on factor graphs'
Algorithms guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes)
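As an illustration, filtering is a single forward sweep; a sketch with small arbitrary matrices (the linear-in-T, quadratic-in-H cost is visible as one matrix-vector product per step):

```python
import numpy as np

def filtering(p_init, p_trans, p_emit, observations):
    """Return p(h_t | v_{1:t}) for each t."""
    alpha = p_init * p_emit[:, observations[0]]
    alpha /= alpha.sum()
    out = [alpha]
    for v in observations[1:]:
        alpha = p_emit[:, v] * (p_trans @ alpha)   # predict, then correct
        alpha /= alpha.sum()                       # normalise at each step
        out.append(alpha)
    return np.array(out)

p_init = np.array([0.5, 0.5])
p_trans = np.array([[0.9, 0.2],       # p_trans[i, j] = p(h_t = i | h_{t-1} = j)
                    [0.1, 0.8]])
p_emit = np.array([[0.7, 0.2, 0.1],   # p_emit[i, v] = p(v_t = v | h_t = i)
                   [0.1, 0.3, 0.6]])
print(filtering(p_init, p_trans, p_emit, [0, 0, 2, 1]).round(3))
```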
HMMs for speech recognition
ht is the phoneme at time t; p(ht|ht−1) – language model; p(vt|ht) – speech signal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently, companies including Google have made big advances in speech recognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is some function of the phoneme, μ(ht; θ)
This function is a deep neural network, trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation
Note that this is a Graphical Model, not a Function
The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v)) and parameter learning are intractable in these models
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational Inference
Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and that we wish to learn θ to maximise the probability this model generates observed data.

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) + const
Idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently
We then jointly maximise the bound w.r.t. φ and θ
We can parameterise p(v|h, θ) using a deep network
Very popular approach – see the 'variational autoencoder' and also attention mechanisms
Extension to a semi-supervised method, using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h)
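A minimal sketch of this setup in PyTorch, assuming a Gaussian q(h|v, φ), a N(0, I) prior p(h) and Bernoulli pixels for p(v|h, θ); the sizes and architecture are illustrative only, not those of any model in the talk:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, v_dim=784, h_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(v_dim, 200), nn.ReLU())
        self.mu = nn.Linear(200, h_dim)        # q(h|v, phi): mean
        self.logvar = nn.Linear(200, h_dim)    # q(h|v, phi): log variance
        self.dec = nn.Sequential(nn.Linear(h_dim, 200), nn.ReLU(),
                                 nn.Linear(200, v_dim))   # p(v|h, theta)

    def elbo(self, v):
        e = self.enc(v)
        mu, logvar = self.mu(e), self.logvar(e)
        h = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # sample q(h|v)
        logits = self.dec(h)
        # E_q[log p(v|h, theta)], one-sample estimate with Bernoulli pixels
        rec = -nn.functional.binary_cross_entropy_with_logits(
            logits, v, reduction='sum')
        # KL(q(h|v) || N(0, I)) in closed form for a Gaussian q
        kl = -0.5 * torch.sum(1 + logvar - mu ** 2 - logvar.exp())
        return rec - kl      # the bound, maximised jointly w.r.t. phi and theta

vae = VAE()
v = torch.rand(8, 784).round()   # stand-in binary 'images'
(-vae.elbo(v)).backward()        # gradients flow to both phi and theta
```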
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take in any state of W that will be best for our long-term goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deep generative model)
Then learn which action to take, given the low dimensional representation
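The 'learn which action to take' step can be sketched with tabular Q-learning on a toy five-state chain, where the tabular state stands in for the learned low dimensional representation (deep RL replaces the table with a network):

```python
import numpy as np

n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:        # rightmost state is terminal
        # epsilon-greedy action choice
        a = rng.integers(n_actions) if rng.random() < 0.1 else int(Q[s].argmax())
        s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s2 == n_states - 1 else 0.0   # reward only at the goal
        # update towards the discounted long-term return
        Q[s, a] += 0.1 * (r + 0.9 * Q[s2].max() - Q[s, a])
        s = s2

print(Q.argmax(axis=1))   # learned policy: move right in every non-terminal state
```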
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representation learning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
https://reinfer.io
Word Embeddings
Given a word (France, for example) we can find which words w have embedding vectors closest to vFrance. From Ronan Collobert (2011).
Word Embeddings
There appears to be a natural 'geometry' to the embeddings. For example, there are directions that correspond to gender:

vwoman − vman ≈ vaunt − vuncle
vwoman − vman ≈ vqueen − vking
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis − vFrance

Given Italy we can calculate vItaly + v and find the word in the dictionary which has the closest embedding to this (it turns out to be Rome). From Mikolov (2013).
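A sketch of the analogy computation with made-up 3-dimensional embeddings, engineered so the arithmetic works out exactly; real embeddings have hundreds of dimensions and are learned from text:

```python
import numpy as np

emb = {
    "paris":  np.array([0.9, 0.1, 0.2]),
    "france": np.array([0.8, 0.0, 0.7]),
    "italy":  np.array([0.7, 0.1, 0.8]),
    "rome":   np.array([0.8, 0.2, 0.3]),
    "banana": np.array([0.0, 0.9, 0.4]),
}

def closest(v, exclude=()):
    """Word whose embedding has the highest cosine similarity with v."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(v, emb[w]))

v = emb["paris"] - emb["france"]   # the 'capital-of' direction
print(closest(emb["italy"] + v, exclude=("italy", "paris", "france")))  # rome
```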
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for Chinese words.
However, when we know that a Chinese and an English word have a similar meaning, we add a constraint that the word embeddings vChineseWord and vEnglishWord should be close.
We have only a small amount of labelled 'similar' Chinese-English words (these are the green border boxes in the above; they are standard translations of the corresponding Chinese character).
We can visualise the embedding vectors in 2D (using t-SNE). See Socher (2013).
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank: consists of parsed sentences with sentiment labels (−−, −, 0, +, ++) for each node (phrase) in the tree; 215,000 labelled phrases (obtained from three humans).
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predict the sentiment at each node

Recursive Nets and Embeddings: Training
We have a softmax classifier at each node in the tree to predict the sentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree, the inputs to the classifiers are the word embeddings
The embeddings are combined by another network g with common parameters, which forms the input to the sentiment classifier
We then learn all the embeddings, shared classifier parameters, and shared combination parameters to maximise the classification accuracy
Prediction
For a new movie review, the review is first parsed using a standard grammar tree parser
This forms the tree, which can be used to recursively form the sentiment class label for the review
Currently the best sentiment classifier: Socher (2013)
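A sketch of the recursive forward pass with random placeholder weights; learning the shared parameters (by backpropagating the node classification errors) is omitted:

```python
import numpy as np

d, n_classes = 4, 5
rng = np.random.default_rng(0)
Wg = rng.standard_normal((d, 2 * d)) * 0.1       # shared combiner g
Ws = rng.standard_normal((n_classes, d)) * 0.1   # shared softmax classifier

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

def node_vector(tree, emb, preds):
    """tree is a word (leaf) or a (left, right) pair; appends each node's
    predicted sentiment distribution to preds and returns its embedding."""
    if isinstance(tree, str):
        h = emb[tree]                                # leaf: word embedding
    else:
        hl = node_vector(tree[0], emb, preds)
        hr = node_vector(tree[1], emb, preds)
        h = np.tanh(Wg @ np.concatenate([hl, hr]))   # g combines the children
    preds.append((tree, softmax(Ws @ h)))            # classify this phrase
    return h

emb = {w: rng.standard_normal(d) for w in ["not", "a", "great", "movie"]}
preds = []
node_vector(("not", ("a", ("great", "movie"))), emb, preds)
for phrase, p in preds:
    print(phrase, p.round(2))
```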
Recursive Nets and Embeddings

[figure: predicted sentiment over parse trees, showing prediction of positive and negative (bottom right) sentences and their negation, e.g. 'Roger Dodger is one of the most compelling variations on this theme' / 'Roger Dodger is one of the least compelling variations on this theme', 'I liked every single minute of this film' / 'I didn't like a single minute of this film', and 'It's just incredibly dull' / 'It's not incredibly dull']
Recurrent Nets
[figure: RNN unrolled through time, with inputs x1, x2, x3, hidden units h1, h2, h3, outputs y1, y2, y3, and shared weight matrices A (input→hidden), C (hidden→output), B (hidden→hidden)]
RNNs are used in timeseries applications
The basic idea is that the hidden units ht at time t (and possibly the output yt) depend on the previous state of the network ht−1, xt−1, yt−1, for inputs xt and outputs yt
In the above network I 'unrolled the net through time' to give a standard NN diagram
I omitted the potential links from xt−1, yt−1 to ht
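A sketch of the forward pass of the unrolled net in numpy, assuming A, B and C act on the input, recurrent and output connections respectively; the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dh, dy, T = 3, 5, 2, 4
A = rng.standard_normal((dh, dx)) * 0.1   # input  -> hidden
B = rng.standard_normal((dh, dh)) * 0.1   # hidden -> hidden (recurrence)
C = rng.standard_normal((dy, dh)) * 0.1   # hidden -> output

x = rng.standard_normal((T, dx))
h = np.zeros(dh)
for t in range(T):
    h = np.tanh(A @ x[t] + B @ h)   # h_t depends on x_t and h_{t-1}
    y = C @ h                       # output y_t
    print(t, y.round(3))
```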
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples. The top line is real handwriting, for comparison. See Alex Graves's work.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machine accuracy) for the gradient

g_i(x) ≡ ∂f/∂x_i, evaluated at x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse Differentiation
A useful graphical representation is that the total derivative of f with respect to x is given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the functions on the edges:

df/dx = ∂f/∂x + (∂f/∂g)(dg/dx)

[figure: graph with nodes x, g, f and edges labelled ∂f/∂x (x→f), dg/dx (x→g), ∂f/∂g (g→f)]
Example

For f(x) = x^2 + xgh, where g = x^2 and h = xg^2:

[figure: graph with nodes x, g, h, f; edge labels 2x + gh (x→f), 2x (x→g), xh (g→f), 2gx (g→h), xg (h→f), g^2 (x→h)]

f′(x) = (2x + gh) + (g^2 · xg) + (2x · 2gx · xg) + (2x · xh) = 2x + 8x^7
Reverse Differentiation

Consider

f(x1, x2) = cos(sin(x1 x2))

We can represent this computationally using an Abstract Syntax Tree (AST):

[figure: AST with leaves x1, x2 feeding f1, then f2, then f3]

f1(x1, x2) = x1 x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation

[figure: the same AST, with leaves x1, x2 feeding f1, then f2, then f3]

df3/dx1 = (∂f3/∂f2)(df2/dx1) = (∂f3/∂f2)(df2/df1)(df1/dx1), where (∂f3/∂f2)(df2/df1) = df3/df1

Similarly,

df3/dx2 = (∂f3/∂f2)(df2/df1)(df1/dx2) = (df3/df1)(df1/dx2)

The two derivatives share the same computation branch, and we want to exploit this.
Reverse Differentiation
[figure: the same AST]

∂f1/∂x1 = x2,  ∂f1/∂x2 = x1,  ∂f2/∂f1 = cos(f1),  ∂f3/∂f2 = −sin(f2)

1. Find the reverse ancestral (backwards) schedule of nodes (f3, f2, f1, x1, x2).
2. Start with the first node n1 in the reverse schedule and define t_{n1} = 1.
3. For the next node n in the reverse schedule, find the child nodes ch(n). Then define

   t_n = Σ_{c ∈ ch(n)} (∂f_c/∂f_n) t_c

4. The total derivatives of f with respect to the root nodes of the tree (here x1 and x2) are given by the values of t at those nodes.
This is a general procedure that can be used to automatically define a subroutine to efficiently compute the gradient. It is efficient because information is collected at nodes in the tree and split between parents only when required.
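Applying the schedule by hand to f(x1, x2) = cos(sin(x1 x2)) gives the following sketch; the reverse pass accumulates t_n = Σ_c (∂f_c/∂f_n) t_c, and the result is checked against central differences:

```python
import math

def f_and_grad(x1, x2):
    # forward pass: associate each AST node with its value
    f1 = x1 * x2
    f2 = math.sin(f1)
    f3 = math.cos(f2)
    # backward pass in reverse ancestral order: f3, f2, f1, x1, x2
    t_f3 = 1.0
    t_f2 = -math.sin(f2) * t_f3    # df3/df2 = -sin(f2)
    t_f1 = math.cos(f1) * t_f2     # df2/df1 = cos(f1)
    t_x1 = x2 * t_f1               # df1/dx1 = x2
    t_x2 = x1 * t_f1               # df1/dx2 = x1
    return f3, (t_x1, t_x2)

val, grad = f_and_grad(0.3, 0.7)
eps = 1e-6                         # central-difference check
num = (f_and_grad(0.3 + eps, 0.7)[0] - f_and_grad(0.3 - eps, 0.7)[0]) / (2 * eps)
print(grad[0], num)                # the two agree to high accuracy
```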
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another, and requires complex reasoning using some form of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
[figure: panels labelled 'Creak' and 'Bump']
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos – position in kitchen; snd – sound
Finding the Burglar
[figure: kitchen grid with the observed sequence of creaks and bumps; the slide is repeated three times as the inference is animated]
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int – intended key; hit – hit key
Stubby Fingers errors
[figure: error model, a matrix of p(hit|int) over the letters a–z]
Stubby Fingers language
[figure: language model, a matrix of letter-transition probabilities over a–z]
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Word Embeddings
There appears to be a natural lsquogeometryrsquo to the embeddings For example thereare directions that correspond to gender
vwoman minus vman asymp vaunt minus vuncle
vwoman minus vman asymp vqueen minus vking
From Mikolov (2013)
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis minus vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Word Embeddings Analogies
Given a relationship France-Paris we get the lsquorelationshiprsquo embedding
v = vParis minus vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary whichhas closest embedding to this (it turns out to be Rome) From Mikolov (2013)
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predict the sentiment at each node
Recursive Nets and Embeddings: Training
We have a softmax classifier for each node in the tree to predict the sentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree, the inputs to the classifiers are the word embeddings
The embeddings are combined by another network g with common parameters, which forms the input to the sentiment classifier
We then learn all the embeddings, shared classifier parameters and shared combination parameters to maximise the classification accuracy
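A minimal sketch of the shared combiner g and shared softmax classifier, with a tanh combination layer assumed in the spirit of Socher (2013); sizes and initialisation are illustrative:

import numpy as np

d, n_classes = 10, 5                                 # embedding size; 5 sentiment classes
rng = np.random.default_rng(0)
Wg = rng.normal(scale=0.1, size=(d, 2 * d))          # shared combination network g
Ws = rng.normal(scale=0.1, size=(n_classes, d))      # shared softmax classifier weights

def combine(left, right):
    # parent embedding from the two child embeddings (assumed tanh combiner)
    return np.tanh(Wg @ np.concatenate([left, right]))

def sentiment_probs(node_vec):
    z = Ws @ node_vec
    e = np.exp(z - z.max())
    return e / e.sum()

def node_embedding(tree):
    # a tree is either a leaf word vector or a (left, right) pair of subtrees
    if isinstance(tree, np.ndarray):
        return tree
    left, right = tree
    return combine(node_embedding(left), node_embedding(right))

# example: ((w1, w2), w3) with random stand-in 'word embeddings'
w1, w2, w3 = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
print(sentiment_probs(node_embedding(((w1, w2), w3))))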
Prediction
For a new movie review, the review is first parsed using a standard grammar tree parser
This forms the tree, which can be used to recursively form the sentiment class label for the review
Currently the best sentiment classifier; see Socher (2013)
Recursive Nets and Embeddings
[Figure, from Socher (2013): RNTN prediction of positive and negative (bottom right) sentences and their negation; example sentence pair: 'Roger Dodger is one of the most compelling variations on this theme' vs. 'Roger Dodger is one of the least compelling variations on this theme']
Recurrent Nets
[Diagram: an RNN unrolled through time; inputs x1, x2, x3 feed hidden units h1, h2, h3 through shared weights A, successive hidden units are coupled through shared weights B, and outputs y1, y2, y3 are read out through shared weights C]
RNNs are used in timeseries applications
The basic idea is that the hidden units at time t (and possibly the output y_t) depend on the previous state of the network h_{t−1}, x_{t−1}, y_{t−1}, for inputs x_t and outputs y_t
In the above network I 'unrolled the net through time' to give a standard NN diagram
I omitted the potential links from x_{t−1}, y_{t−1} to h_t
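A minimal unrolling matching the diagram, with shared input, hidden-to-hidden and output weights A, B, C; all sizes are illustrative:

import numpy as np

Dx, Dh, Dy, T = 3, 4, 2, 3
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(Dh, Dx))   # input weights, shared across time
B = rng.normal(scale=0.1, size=(Dh, Dh))   # hidden-to-hidden weights
C = rng.normal(scale=0.1, size=(Dy, Dh))   # hidden-to-output weights

xs = rng.normal(size=(T, Dx))
h = np.zeros(Dh)
for t in range(T):
    h = np.tanh(A @ xs[t] + B @ h)         # h_t depends on x_t and h_{t-1}
    y = C @ h                              # y_t read out from h_t
    print(t, y)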
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples. The top line is real handwriting, for comparison. See Alex Graves' work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machine accuracy) for the gradient
g_i(x) ≡ ∂f/∂x_i, evaluated at x
Note that this is not the same as a numerical approximation (such as central differences) for the gradient
One can show that, if done efficiently, one can always calculate the gradient in less than 5 times the time it takes to compute f(x)
Reverse Differentiation
A useful graphical representation is that the total derivative of f with respect to x is given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the functions on the edges:
df/dx = ∂f/∂x + (∂f/∂g)(dg/dx)
[Diagram: nodes x, g, f; edge x → f labelled ∂f/∂x, edge x → g labelled dg/dx, edge g → f labelled ∂f/∂g]
Example
For f(x) = x² + xgh, where g = x² and h = xg²:
[Diagram: AST over x, g, h, f; edge values x → f: 2x + gh, x → g: 2x, x → h: g², g → h: 2gx, g → f: xh, h → f: xg]
f′(x) = (2x + gh) + (g² · xg) + (2x · 2gx · xg) + (2x · xh) = 2x + 8x⁷
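The result is easy to check numerically; central differences only approximate the gradient, whereas the path-sum value is exact:

# Checking the path-sum result f'(x) = 2x + 8x^7 against central differences
def f(x):
    g = x ** 2
    h = x * g ** 2
    return x ** 2 + x * g * h        # = x^2 + x^8

x, delta = 1.3, 1e-5
numeric = (f(x + delta) - f(x - delta)) / (2 * delta)
exact = 2 * x + 8 * x ** 7
print(numeric, exact)                # agree to high accuracy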
Reverse Differentiation
Consider
f(x1, x2) = cos(sin(x1 x2))
We can represent this computationally using an Abstract Syntax Tree (AST):
[Diagram: x1, x2 → f1 → f2 → f3]
f1(x1, x2) = x1 x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1, x2, we first run forwards through the tree so that we can associate each node with an actual function value
Reverse Differentiation
[Diagram: the same AST x1, x2 → f1 → f2 → f3]
df3/dx1 = (∂f3/∂f2)(df2/dx1) = (∂f3/∂f2)(df2/df1) · (df1/dx1), where (∂f3/∂f2)(df2/df1) = df3/df1
Similarly,
df3/dx2 = (∂f3/∂f2)(df2/df1) · (df1/dx2) = (df3/df1) · (df1/dx2)
The two derivatives share the same computation branch and we want to exploit this
Reverse Differentiation
[Diagram: the same AST x1, x2 → f1 → f2 → f3]
∂f1/∂x1 = x2, ∂f1/∂x2 = x1, ∂f2/∂f1 = cos(f1), ∂f3/∂f2 = −sin(f2)
1. Find the reverse ancestral (backwards) schedule of nodes (f3, f2, f1, x1, x2)
2. Start with the first node n1 in the reverse schedule and define t_{n1} = 1
3. For the next node n in the reverse schedule, find the child nodes ch(n). Then define
t_n = Σ_{c ∈ ch(n)} (∂f_c/∂f_n) t_c
4. The total derivatives of f with respect to the root nodes of the tree (here x1 and x2) are given by the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutine to efficiently compute the gradient. It is efficient because information is collected at nodes in the tree and split between parents only when required
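A direct transcription of this schedule for f = cos(sin(x1 x2)), checked against the closed form; a sketch of the idea, not a general AutoDiff library:

import numpy as np

x1, x2 = 0.7, -1.2

# forward pass through the AST, storing node values
f1 = x1 * x2
f2 = np.sin(f1)
f3 = np.cos(f2)

# local partial derivatives on the edges
d_f3_f2 = -np.sin(f2)
d_f2_f1 = np.cos(f1)
d_f1_x1 = x2
d_f1_x2 = x1

# reverse schedule: t(f3) = 1, then each node sums (local partial) * t(child)
t_f3 = 1.0
t_f2 = d_f3_f2 * t_f3
t_f1 = d_f2_f1 * t_f2            # the branch shared by both derivatives
t_x1 = d_f1_x1 * t_f1            # = df3/dx1
t_x2 = d_f1_x2 * t_f1            # = df3/dx2

# check against d/dxi cos(sin(x1 x2))
print(t_x1, -np.sin(np.sin(x1 * x2)) * np.cos(x1 * x2) * x2)
print(t_x2, -np.sin(np.sin(x1 * x2)) * np.cos(x1 * x2) * x1)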
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another, and requires complex reasoning using some form of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
[Diagram: HMM with positions pos1 → pos2 → pos3 → pos4 emitting sounds snd1, snd2, snd3, snd4]
pos – position in kitchen; snd – sound
Finding the Burglar
[Figure, shown over three slides: sequences of observed 'creak' and 'bump' sounds on the kitchen grid; the posterior over the burglar's position is updated as the evidence arrives]
Stubby Fingers
Stubby Fingers
[Diagram: HMM with intended keys int1 → int2 → int3 → int4 emitting hit keys hit1, hit2, hit3, hit4]
int – intended key; hit – hit key
Stubby Fingers: errors
[Figure: confusion matrix p(hit|intended) over the letters a–z; colour scale 0.05–0.55]
Stubby Fingers: language
[Figure: letter transition matrix of the language model over a–z; colour scale 0–0.9]
Stubby Fingers
Given the typed sequence 'cwsykcak', what is the most likely word that this corresponds to?
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
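A hedged sketch of this recipe: score candidate intended words under a noisy-typing model and keep the best dictionary word. The error model, bigram model and word list below are uniform placeholders, not the lecture's learned tables:

import numpy as np

letters = 'abcdefghijklmnopqrstuvwxyz'
idx = {c: i for i, c in enumerate(letters)}

p_hit = np.full((26, 26), 0.5 / 25)     # p(hit|intended): half the mass on wrong keys...
np.fill_diagonal(p_hit, 0.5)            # ...half on the intended key (assumption)
p_trans = np.full((26, 26), 1.0 / 26)   # letter bigram language model (uniform placeholder)

def log_score(word, typed):
    # log p(intended word, typed sequence) under the HMM
    if len(word) != len(typed):
        return -np.inf
    s = np.log(1.0 / 26) + np.log(p_hit[idx[word[0]], idx[typed[0]]])
    for prev, cur, t in zip(word, word[1:], typed[1:]):
        s += np.log(p_trans[idx[prev], idx[cur]]) + np.log(p_hit[idx[cur], idx[t]])
    return s

dictionary = ['chemical', 'causally', 'baseball']   # stand-in for a real word list
typed = 'cwsykcak'
print(max(dictionary, key=lambda w: log_score(w, typed)))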
Speech Recognition: raw signal
[Figure: raw audio waveform, amplitude between −0.2 and 0.3, plotted against time 0–0.9 s]
'neural' representation
[Figure: time–frequency ('neural') representation of the same signal, about 25 channels over about 80 frames]
Speech Recognition
[Diagram: HMM with phonemes pho1 → pho2 → pho3 → pho4 emitting audio signals aud1, aud2, aud3, aud4]
pho – phoneme (letter); aud – audio signal (neural representation)
Medical Diagnosis
[Diagram: belief network with diseases tumour, flu, meningitis as parents of symptoms headache, fever, appetite, x-ray]
Combine known medical knowledge with patient-specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the Ising Model (1920) and in AI applications such as the HMM (Baum 1966; Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.)
Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented
Without introducing strong structural limitations on how these objects can interact, probability is a non-starter
For this reason computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however, these are typically frowned upon by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact with each other
Graphical Models are then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph
The computational complexity of operations can often be related to the structure of the graph
Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable
Uses in Industry
Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model)
Hospitals: use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis
Google, Microsoft, Facebook: used in many places, including advertising, video game prediction, speech recognition
Used to estimate the inherent desirability of products in consumer retail
Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company/user relationship
Conditional Probability and Bayes' Rule
The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as
p(x|y) ≡ p(x, y) / p(y) = p(y|x) p(x) / p(y)   (Bayes' rule)
Throwing darts
p(region 5|not region 20) = p(region 5, not region 20) / p(not region 20)
= p(region 5) / p(not region 20) = (1/20) / (19/20) = 1/19
Interpretation
p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation should be 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'
Battleships
Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each
They can be placed anywhere on the 10×10 grid, but cannot overlap
Let s1 be the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query 'hit' or 'miss' responses
p(s1, s2|D) = p(D|s1, s2) p(s1, s2) / p(D)
Let X be the matrix of pixel occupancy:
p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2) p(s1, s2|D)
demoBattleships.m
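A small Python re-creation in the spirit of demoBattleships.m (which is MATLAB): enumerate all non-overlapping placements, keep those consistent with the hit/miss data, and average occupancy to get p(X|D). The query data here is invented for illustration:

import numpy as np
from itertools import product

G, L = 10, 5
def cells(origin, vertical):
    r, c = origin
    return [(r + i, c) for i in range(L)] if vertical else [(r, c + i) for i in range(L)]

placements = []
for r1, c1, r2, c2 in product(range(G), repeat=4):
    if r1 + L <= G and c2 + L <= G:                  # ship 1 vertical, ship 2 horizontal
        s1, s2 = cells((r1, c1), True), cells((r2, c2), False)
        if not set(s1) & set(s2):                    # ships cannot overlap
            placements.append(s1 + s2)

data = {(4, 4): True, (0, 0): False}                 # query -> hit? (illustrative)
post, n = np.zeros((G, G)), 0
for occ in placements:
    occ_set = set(occ)
    if all((q in occ_set) == hit for q, hit in data.items()):
        for (r, c) in occ_set:
            post[r, c] += 1
        n += 1
post /= n                                            # p(pixel occupied | D)
print(post.round(2))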
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditional probabilities:
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|C) p(E|B, C)
[Diagram: DAG with edges A → C, B → C, C → D, B → E, C → E]
Example – Part I
Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes
Choosing an ordering
Without loss of generality, we can write
p(A, R, E, B) = p(A|R, E, B) p(R, E, B)
= p(A|R, E, B) p(R|E, B) p(E, B)
= p(A|R, E, B) p(R|E, B) p(E|B) p(B)
Assumptions
The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B)
The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E)
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E)
Therefore
p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)
Example – Part II: Specifying the Tables
[Diagram: belief network B → A ← E, E → R]
p(A = 1|B, E):
  B = 1, E = 1: 0.9999
  B = 1, E = 0: 0.99
  B = 0, E = 1: 0.99
  B = 0, E = 0: 0.0001
p(R = 1|E):
  E = 1: 1
  E = 0: 0
The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution
Example – Part III: Inference
Initial Evidence: The alarm is sounding
p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
= Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E) ≈ 0.99
Additional Evidence: The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01
Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake
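Both numbers can be reproduced by brute-force enumeration of the joint p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B) using the tables above:

from itertools import product

pB = {1: 0.01, 0: 0.99}
pE = {1: 0.000001, 0: 0.999999}
pA = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}   # p(A=1|B,E)
pR = {1: 1.0, 0: 0.0}                                               # p(R=1|E)

def joint(a, r, e, b):
    pa = pA[(b, e)] if a == 1 else 1 - pA[(b, e)]
    pr = pR[e] if r == 1 else 1 - pR[e]
    return pa * pr * pE[e] * pB[b]

def posterior_B(evidence):
    # evidence is a dict over the observed variables, e.g. {'A': 1, 'R': 1}
    num = den = 0.0
    for a, r, e, b in product([0, 1], repeat=4):
        if any(val != {'A': a, 'R': r}[k] for k, val in evidence.items()):
            continue
        p = joint(a, r, e, b)
        den += p
        if b == 1:
            num += p
    return num / den

print(posterior_B({'A': 1}))            # ~0.99
print(posterior_B({'A': 1, 'R': 1}))    # ~0.01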
Markov Models
For timeseries data v_1, ..., v_T we need a model p(v_{1:T}). For causal consistency it is meaningful to consider the decomposition
p(v_{1:T}) = Π_{t=1}^{T} p(v_t|v_{1:t−1})
with the convention p(v_t|v_{1:t−1}) = p(v_1) for t = 1
[Diagram: cascade of edges from all earlier variables into each v_t, over v1, v2, v3, v4]
Independence assumptions
It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(v_t|v_1, ..., v_{t−1}) = p(v_t|v_{t−L}, ..., v_{t−1})
where L ≥ 1 is the order of the Markov chain. For a first order (L = 1) chain,
p(v_{1:T}) = p(v_1) p(v_2|v_1) p(v_3|v_2) · · · p(v_T|v_{T−1})
For a stationary Markov chain the transitions p(v_t = s′|v_{t−1} = s) = f(s′, s) are time-independent ('homogeneous')
[Diagram: (a) chain v1 → v2 → v3 → v4; (b) the same chain with additional edges v_{t−2} → v_t]
Figure: (a) First order Markov chain. (b) Second order Markov chain
Markov Chains
[Diagram: first order chain v1 → v2 → v3 → v4]
p(v_1, ..., v_T) = p(v_1) Π_{t=2}^{T} p(v_t|v_{t−1}), with initial distribution p(v_1) and transition p(v_t|v_{t−1})
State transition diagram
Nodes represent states of the variable v and arcs non-zero elements of the transition p(v_t|v_{t−1})
[Diagram: directed graph on states 1–9]
Most probable and shortest paths
[Diagram: the same directed graph on states 1–9]
The shortest (unweighted) path from state 1 to state 7 is 1−2−7
The most probable path from state 1 to state 7 is 1−8−9−7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1−2−7 the probability of exiting state 2 into state 7 is 1/5
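Since the most probable path maximises a product of transition probabilities, it is the shortest path under edge weights −log p. A sketch with the edge set read loosely off the diagram (an assumption) and uniform transitions out of each state:

import heapq
from math import log

edges = {1: [2, 8], 2: [1, 3, 4, 5, 7], 3: [2, 4], 4: [2, 3, 5], 5: [2, 4, 6],
         6: [5], 7: [2, 9], 8: [1, 9], 9: [7, 8]}

def most_probable_path(src, dst):
    # Dijkstra on weights -log p(next|cur), with uniform p = 1/out-degree
    pq, best = [(0.0, src, [src])], {}
    while pq:
        cost, s, path = heapq.heappop(pq)
        if s == dst:
            return path, cost
        if s in best and best[s] <= cost:
            continue
        best[s] = cost
        for nxt in edges[s]:
            heapq.heappush(pq, (cost - log(1.0 / len(edges[s])), nxt, path + [nxt]))

path, cost = most_probable_path(1, 7)
print(path)   # favours 1-8-9-7: exiting state 2 costs log 5, exiting 8 or 9 only log 2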
Equilibrium distribution
It is interesting to know how the marginal p(x_t) evolves through time:
p(x_t = i) = Σ_j M_{ij} p(x_{t−1} = j), where M_{ij} ≡ p(x_t = i|x_{t−1} = j)
p(x_t = i) is the frequency with which we visit state i at time t, given we started from p(x_1) and randomly drew samples from the transition p(x_τ|x_{τ−1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is
p_t = M^{t−1} p_1
If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:
p_∞ = M p_∞
The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix
PageRank
Define the matrix
A_{ij} = 1 if website j has a hyperlink to website i, and 0 otherwise
From this we can define a Markov transition matrix with elements
M_{ij} = A_{ij} / Σ_{i′} A_{i′j}
If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i
For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites containing that word is then returned, ranked according to the importance of the site
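The equilibrium distribution can be found by power iteration, p_t = M^{t−1} p_1; the 4-site link matrix below is a made-up example:

import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # A[i, j] = 1 if site j links to site i
M = A / A.sum(axis=0)                       # column-normalise: M[i, j] = A[i, j] / sum_i' A[i', j]

p = np.full(4, 0.25)                        # any initial distribution p_1
for _ in range(100):
    p = M @ p                               # converges to p_inf = M p_inf
print(p)                                    # the 'importance' of each website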
Hidden Markov Models
The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution
p(h_{1:T}, v_{1:T}) = p(v_1|h_1) p(h_1) Π_{t=2}^{T} p(v_t|h_t) p(h_t|h_{t−1})
For a stationary HMM the transition p(h_t|h_{t−1}) and emission p(v_t|h_t) distributions are constant through time
[Diagram: chain h1 → h2 → h3 → h4 with emissions h_t → v_t]
Figure: A first order hidden Markov model with 'hidden' variables dom(h_t) = {1, ..., H}, t = 1, ..., T. The 'visible' variables v_t can be either discrete or continuous
The classical inference problems
Filtering (inferring the present): p(h_t|v_{1:t})
Prediction (inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): argmax_{h_{1:T}} p(h_{1:T}|v_{1:T})
For prediction, one is also often interested in p(v_t|v_{1:s}) for t > s
Inference in Hidden Markov Models
Belief network representation of a HMM
[Diagram: belief network h1 → h2 → h3 → h4 with emissions h_t → v_t]
Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states)
The algorithms are variants of 'message passing on factor graphs'
The algorithms are guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes)
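A minimal forward-filtering recursion for a discrete HMM, showing the linear-in-T, quadratic-in-H cost; the transition, emission and observations below are randomly generated placeholders:

import numpy as np

H, V = 3, 4
rng = np.random.default_rng(1)
trans = rng.dirichlet(np.ones(H), size=H).T   # trans[i, j] = p(h_t = i | h_{t-1} = j)
emit = rng.dirichlet(np.ones(V), size=H).T    # emit[v, h]  = p(v_t = v | h_t = h)
prior = np.full(H, 1.0 / H)
obs = [0, 2, 1, 3]

alpha = prior * emit[obs[0]]
alpha /= alpha.sum()                          # p(h_1 | v_1)
for v in obs[1:]:
    alpha = emit[v] * (trans @ alpha)         # predict with the transition, correct with the emission
    alpha /= alpha.sum()                      # p(h_t | v_{1:t})
print(alpha)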
HMMs for speech recognition
h_t is the phoneme at time t; p(h_t|h_{t−1}) – language model; p(v_t|h_t) – speech signal model
Deep Nets and HMMs
[Diagram: HMM with hidden phonemes h1–h4 emitting audio features v1–v4]
Recently companies including Google have made big advances in speech recognition
The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t; θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
[Diagram: latent variables h1, h2 with edges to visibles v1, v2, v3, v4]
It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation
Note that this is a Graphical Model, not a Function
The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h)
One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in these models
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method – much faster for inference
Variational Inference
Consider a distribution
p(v|θ) = ∫_h p(v|h, θ) p(h)
and that we wish to learn θ to maximise the probability this model generates observed data. A variational lower bound is
log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) + const.
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Word Embeddings Constrained Embeddings
We can learn embeddings for English words and embeddings for ChinesewordsHowever when we know that a Chinese and English word have a similarmeaning we add a constraint that the word embeddings vChineseWord andvEnglishWord should be closeWe have only a small amount of labelled lsquosimilarrsquo Chinese-English words(these are the green border boxes in the above they are standard translationsof the corresponding Chinese character)We can visualise in 2D (using t-SNE) the embedding vectors See Socher(2013)
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Word Embeddings Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank Consists of parsed sentences with sentiment labels(minusminusminus 0+++) for each node (phrase) in the tree 215000 labelled phrases(obtained from three humans)
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain

Only the recent past is relevant:

p(v_t|v_1, ..., v_{t−1}) = p(v_t|v_{t−L}, ..., v_{t−1})

where L ≥ 1 is the order of the Markov chain. For a first order (L = 1) chain,

p(v_{1:T}) = p(v1) p(v2|v1) p(v3|v2) ... p(vT|v_{T−1})

For a stationary Markov chain the transitions p(v_t = s′|v_{t−1} = s) = f(s′, s) are time-independent ('homogeneous').

Figure: (a) first order Markov chain; (b) second order Markov chain.
Markov Chains

p(v1, ..., vT) = p(v1) [initial] × Π_{t=2}^{T} p(v_t|v_{t−1}) [transition]

State transition diagram
Nodes represent states of the variable v, and arcs non-zero elements of the transition p(v_t|v_{t−1}).

(Figure: state transition diagram on states 1–9.)
Most probable and shortest paths

(Figure: the same state transition diagram on states 1–9.)

The shortest (unweighted) path from state 1 to state 7 is 1−2−7.
The most probable path from state 1 to state 7 is 1−8−9−7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1−2−7 the probability of exiting state 2 into state 7 is 1/5.
Equilibrium distribution

It is interesting to know how the marginal p(x_t) evolves through time:

p(x_t = i) = Σ_j p(x_t = i|x_{t−1} = j) p(x_{t−1} = j),   with M_ij ≡ p(x_t = i|x_{t−1} = j)

p(x_t = i) is the frequency that we visit state i at time t, given we started from p(x1) and randomly drew samples from the transition p(x_τ|x_{τ−1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p1(i) is

p_t = M^{t−1} p_1

If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:

p_∞ = M p_∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
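A small sketch of these statements, for an assumed toy two-state chain with column-stochastic M (M[i, j] = p(x_t = i | x_{t−1} = j)):

import numpy as np

M = np.array([[0.9, 0.5],
              [0.1, 0.5]])
p = np.array([1.0, 0.0])           # start deterministically in state 1
for _ in range(100):               # power iteration: p_t = M^(t-1) p_1
    p = M @ p
print(p)                           # equilibrium (about [0.833, 0.167])

# Same answer from the eigenvector with unit eigenvalue:
vals, vecs = np.linalg.eig(M)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
print(v / v.sum())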
PageRank
Define the matrix

A_ij = 1 if website j has a hyperlink to website i, 0 otherwise

From this we can define a Markov transition matrix with elements

M_ij = A_ij / Σ_{i′} A_{i′j}

If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.
For each website i, a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain the word is then returned, ranked according to the importance of the site.
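A toy version of this construction on an assumed four-site link graph (real PageRank adds a damping term and handles pages with no out-links; here every column of A is non-zero so M is well defined):

import numpy as np

# A[i, j] = 1 if site j links to site i
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)
M = A / A.sum(axis=0, keepdims=True)   # column-normalise: M_ij = A_ij / sum_i' A_i'j

p = np.ones(4) / 4
for _ in range(200):                   # power iteration to the equilibrium p_inf
    p = M @ p
print(np.argsort(-p), p.round(3))      # sites ranked by 'importance'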
Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

p(h_{1:T}, v_{1:T}) = p(v1|h1) p(h1) Π_{t=2}^{T} p(v_t|h_t) p(h_t|h_{t−1})

For a stationary HMM the transition p(h_t|h_{t−1}) and emission p(v_t|h_t) distributions are constant through time.

Figure: a first order hidden Markov model with 'hidden' variables dom(h_t) = {1, ..., H}, t = 1, ..., T. The 'visible' variables v_t can be either discrete or continuous.
The classical inference problems:

Filtering (inferring the present): p(h_t|v_{1:t})
Prediction (inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): argmax_{h_{1:T}} p(h_{1:T}|v_{1:T})

For prediction, one is also often interested in p(v_t|v_{1:s}) for t > s.
Inference in Hidden Markov Models
Belief network representation of a HMM:

(Figure: chain h1 → h2 → h3 → h4 with emissions h_t → v_t.)
Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).
The algorithms are variants of 'message passing on factor graphs'.
The algorithm is guaranteed to work if the graph is singly-connected.
There has been a huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).
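A sketch of the filtering recursion (the forward algorithm) for a discrete HMM, with assumed toy transition and emission tables; the cost is O(T H²), matching the scaling just described:

import numpy as np

H, V = 3, 4
rng = np.random.default_rng(0)
trans = rng.dirichlet(np.ones(H), size=H).T   # trans[i, j] = p(h_t=i | h_{t-1}=j)
emit = rng.dirichlet(np.ones(V), size=H).T    # emit[v, h]  = p(v_t=v | h_t=h)
prior = np.ones(H) / H

def filtering(obs):
    alphas = []
    alpha = emit[obs[0]] * prior              # correct on the first observation
    alpha /= alpha.sum()
    alphas.append(alpha)
    for v in obs[1:]:
        alpha = emit[v] * (trans @ alpha)     # predict one step, then correct
        alpha /= alpha.sum()                  # normalised: p(h_t | v_{1:t})
        alphas.append(alpha)
    return np.array(alphas)

print(filtering([0, 2, 3, 1]))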
HMMs for speech recognition

h_t is the phoneme at time t; p(h_t|h_{t−1}) – language model; p(v_t|h_t) – speech signal model.
Deep Nets and HMMs
(Figure: the same HMM chain, h_t → v_t.)

Recently, companies including Google have made big advances in speech recognition.
The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t; θ).
This function is a deep neural network, trained on a large amount of data.
There is a goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
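A rough sketch of such an emission model, with an assumed small random network and arbitrary sizes (real systems are trained, far larger, and operate on acoustic features):

import numpy as np

H, D, hidden = 40, 25, 64           # phonemes, acoustic dims, hidden units
rng = np.random.default_rng(1)
W1, b1 = rng.normal(0, 0.1, (hidden, H)), np.zeros(hidden)
W2, b2 = rng.normal(0, 0.1, (D, hidden)), np.zeros(D)
sigma2 = 0.1

def mu(h):
    # neural-network mean mu(h_t; theta), here from a one-hot phoneme code
    one_hot = np.eye(H)[h]
    return W2 @ np.tanh(W1 @ one_hot + b1) + b2

def log_emission(v, h):
    # log N(v; mu(h), sigma^2 I)
    d = v - mu(h)
    return -0.5 * (d @ d) / sigma2 - 0.5 * D * np.log(2 * np.pi * sigma2)

v = rng.normal(size=D)
print([round(log_emission(v, h), 2) for h in range(3)])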
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model

(Figure: latents h1, h2 with edges to visibles v1, v2, v3, v4.)

It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation.
Note that this is a Graphical Model, not a Function.
The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h).
One cannot use an autoencoder to generate new images.
The bad news
Inference (computing p(h|v)) and parameter learning are intractable in these models.
Statisticians typically use sampling as an approximation.
It is very popular in ML to use a variational method – much faster for inference.
Variational Inference
Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and suppose that we wish to learn θ to maximise the probability that this model generates the observed data. Then

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log [p(v|h, θ) p(h)]

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.
We then jointly maximise the bound w.r.t. φ and θ.
We can parameterise p(v|h, θ) using a deep network.
Very popular approach – see 'variational autoencoder' and also attention mechanisms.
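A minimal illustration of the bound for a toy model where everything is Gaussian, so the exact marginal likelihood is available for comparison; in a variational autoencoder, m and s would be produced by an encoder network and θ would parameterise a decoder network:

import numpy as np

# Toy model: p(h) = N(0, 1), p(v|h, theta) = N(theta * h, 1),
# variational q(h|v, phi) = N(m, s^2). All numbers are illustrative.
rng = np.random.default_rng(2)

def log_normal(x, mean, var):
    return -0.5 * ((x - mean) ** 2 / var + np.log(2 * np.pi * var))

def elbo(v, theta, m, s, n_samples=10000):
    h = m + s * rng.normal(size=n_samples)       # samples from q(h|v, phi)
    log_joint = log_normal(v, theta * h, 1.0) + log_normal(h, 0.0, 1.0)
    log_q = log_normal(h, m, s ** 2)
    return np.mean(log_joint - log_q)            # Monte Carlo lower bound

v, theta = 1.3, 0.8
# Exact marginal for this linear-Gaussian model: v ~ N(0, theta^2 + 1)
print(elbo(v, theta, m=0.6, s=0.8))              # close to, and below, ...
print(log_normal(v, 0.0, theta ** 2 + 1.0))      # ... the exact log p(v|theta)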
Extension to a semi-supervised method using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h).
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games?
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long term goals.
The problem is that the number of pixel states is enormous.
We need to learn a low dimensional representation of the screen (use a deep generative model).
Then learn which action to take, given the low dimensional representation.
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve the interaction between reinforcement learning and representation learning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer:
https://reinfer.io
Recursive Nets and Embeddings
Stanford Sentiment Treebank: consists of parsed sentences with sentiment labels (−−, −, 0, +, ++) for each node (phrase) in the tree; 215,000 labelled phrases (obtained from three humans).
Recursive Nets and Embeddings
The idea is to recursively combine embeddings such that they accurately predict the sentiment at each node.
Recursive Nets and Embeddings
Training
We have a softmax classifier for each node in the tree, to predict the sentiment of the phrase beneath this node in the tree.
The weights of this classifier are shared across all nodes.
At the leaf nodes, at the bottom of the tree, the inputs to the classifiers are the word embeddings.
The embeddings are combined by another network g, with common parameters, which forms the input to the sentiment classifier.
We then learn all the embeddings, shared classifier parameters and shared combination parameters to maximise the classification accuracy.
Prediction
For a new movie review, the review is first parsed using a standard grammar tree parser.
This forms the tree, which can then be used to recursively form the sentiment class label for the review.
Currently the best sentiment classifier: Socher (2013).
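A sketch of this recursive construction, with random parameters and a simple concatenate-and-tanh combination g (the actual model of Socher (2013) uses a tensor-based combination and is, of course, trained rather than random):

import numpy as np

d, classes = 10, 5
rng = np.random.default_rng(3)
Wg = rng.normal(0, 0.1, (d, 2 * d))        # shared combination network g
Wc = rng.normal(0, 0.1, (classes, d))      # shared sentiment classifier

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def node_vector(tree, embeddings, predictions):
    if isinstance(tree, str):               # leaf: a word embedding
        vec = embeddings[tree]
    else:                                   # internal node: (left, right)
        left = node_vector(tree[0], embeddings, predictions)
        right = node_vector(tree[1], embeddings, predictions)
        vec = np.tanh(Wg @ np.concatenate([left, right]))
    predictions.append(softmax(Wc @ vec))   # sentiment predicted at every node
    return vec

embeddings = {w: rng.normal(size=d) for w in ['not', 'very', 'good']}
preds = []
node_vector(('not', ('very', 'good')), embeddings, preds)
print(len(preds), preds[-1].round(2))       # one prediction per tree node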
Recursive Nets and Embeddings

(Figure, from Socher (2013), garbled in extraction: RNTN prediction of positive and negative (bottom right) sentences and their negation. The example trees include 'Roger Dodger is one of the most compelling variations on this theme' / 'Roger Dodger is one of the least compelling variations on this theme', 'I liked every single minute of this film' / 'I didn't like a single minute of this film', and 'It's just incredibly dull' / 'It's not dull'.)
Recurrent Nets
(Figure: RNN unrolled through time, with inputs x1, x2, x3, hidden units h1, h2, h3, outputs y1, y2, y3, and shared weights A, B, C.)

RNNs are used in timeseries applications.
The basic idea is that the hidden units at time t (and possibly output y_t) depend on the previous state of the network h_{t−1}, x_{t−1}, y_{t−1}, for inputs x_t and outputs y_t.
In the above network I 'unrolled the net through time' to give a standard NN diagram.
I omitted the potential links from x_{t−1}, y_{t−1} to h_t.
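A minimal unrolled RNN in this spirit; reading the diagram's shared weights as A: input to hidden, B: hidden to hidden, C: hidden to output is my assumption, and all sizes are arbitrary:

import numpy as np

dx, dh, dy, T = 4, 8, 3, 5
rng = np.random.default_rng(4)
A = rng.normal(0, 0.3, (dh, dx))
B = rng.normal(0, 0.3, (dh, dh))
C = rng.normal(0, 0.3, (dy, dh))

h = np.zeros(dh)
for t in range(T):
    x = rng.normal(size=dx)       # input at time t (random placeholder)
    h = np.tanh(A @ x + B @ h)    # hidden state carries information from the past
    y = C @ h                     # output at time t
    print(t, y.round(2))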
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples. Top line is real handwriting, for comparison. See Alex Graves's work.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff?

AutoDiff takes a function f(x) and returns an exact value (up to machine accuracy) for the gradient

g_i(x) ≡ ∂f/∂x_i, evaluated at x

Note that this is not the same as a numerical approximation (such as central differences) for the gradient.
One can show that, if done efficiently, one can always calculate the gradient in less than 5 times the time it takes to compute f(x).
Reverse Differentiation
A useful graphical representation is that the total derivative of f with respect to x is given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the functions on the edges:

df/dx = ∂f/∂x + (∂f/∂g)(dg/dx)

(Figure: nodes x, g, f with edges labelled ∂f/∂x, dg/dx and ∂f/∂g.)
Example

For f(x) = x² + x g h, where g = x² and h = x g²:

(Figure: nodes x, g, h, f with edge labels 2x + gh (x → f), 2x (x → g), xh (g → f), 2gx (g → h), xg (h → f), g² (x → h).)

f′(x) = (2x + gh) + (g² · xg) + (2x · 2gx · xg) + (2x · xh) = 2x + 8x⁷
Reverse Differentiation
Consider

f(x1, x2) = cos(sin(x1 x2))

We can represent this computationally using an Abstract Syntax Tree (AST):

(Figure: tree with leaves x1, x2 feeding f1, then f2, then f3.)

f1(x1, x2) = x1 x2
f2(x) = sin(x)
f3(x) = cos(x)

Given values for x1, x2, we first run forwards through the tree, so that we can associate each node with an actual function value.
Reverse Differentiation

(Figure: the same AST.)

df3/dx1 = (∂f3/∂f2)(df2/dx1) = (∂f3/∂f2)(df2/df1) (df1/dx1), where (∂f3/∂f2)(df2/df1) = df3/df1

Similarly,

df3/dx2 = (∂f3/∂f2)(df2/df1) (df1/dx2) = (df3/df1)(df1/dx2)

The two derivatives share the same computation branch, and we want to exploit this.
Reverse Differentiation

For the AST above:

∂f1/∂x1 = x2,  ∂f1/∂x2 = x1
∂f2/∂f1 = cos(f1)
∂f3/∂f2 = −sin(f2)

1. Find the reverse ancestral (backwards) schedule of nodes (f3, f2, f1, x1, x2).
2. Start with the first node n1 in the reverse schedule and define t_{n1} = 1.
3. For the next node n in the reverse schedule, find the child nodes ch(n). Then define

t_n = Σ_{c ∈ ch(n)} (∂f_c/∂f_n) t_c

4. The total derivatives of f with respect to the root nodes of the tree (here x1 and x2) are given by the values of t at those nodes.

This is a general procedure that can be used to automatically define a subroutine to efficiently compute the gradient. It is efficient because information is collected at nodes in the tree and split between parents only when required.
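Run on the AST above, the procedure looks as follows; this is a hand-unrolled sketch, whereas a real AutoDiff system would build the reverse schedule automatically from the computation graph:

import math

def f_and_grad(x1, x2):
    # forward pass: associate each AST node with a value
    f1 = x1 * x2
    f2 = math.sin(f1)
    f3 = math.cos(f2)
    # reverse schedule: f3, f2, f1, x1, x2, accumulating t_n = sum_c (df_c/df_n) t_c
    t_f3 = 1.0
    t_f2 = -math.sin(f2) * t_f3        # df3/df2
    t_f1 = math.cos(f1) * t_f2         # df2/df1 -- the branch shared by both inputs
    t_x1 = x2 * t_f1                   # df1/dx1
    t_x2 = x1 * t_f1                   # df1/dx2
    return f3, (t_x1, t_x2)

val, grad = f_and_grad(0.7, 1.1)
print(val, grad)

# sanity check against central differences (the numerical approximation
# mentioned earlier, which AutoDiff is *not*)
eps = 1e-6
num = (f_and_grad(0.7 + eps, 1.1)[0] - f_and_grad(0.7 - eps, 1.1)[0]) / (2 * eps)
print(num)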
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another, and requires complex reasoning using some form of internal model.
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
(Figure: HMM with hidden positions pos1 → pos2 → pos3 → pos4 and emitted sounds snd_t.)
pos – position in kitchen; snd – sound
Finding the Burglar

(Figure, shown over several animation frames: a grid of 'creak' and 'bump' observations accumulating over time, from which the burglar's position in the kitchen is inferred.)
Stubby Fingers
Stubby Fingers
(Figure: HMM with intended keys int1 → int2 → int3 → int4 and hit keys hit_t.)
int – intended key; hit – hit key
Stubby Fingers: errors

(Figure: 26×26 heat map of the error model p(hit key | intended key) over the letters a–z; colour scale roughly 0.05–0.55.)
Stubby Fingers: language

(Figure: 26×26 heat map of the letter-to-letter transition probabilities p(int_t | int_{t−1}) over the letters a–z; colour scale roughly 0–0.9.)
Stubby Fingers
Given the typed sequence cwsykcak, what is the most likely word that this corresponds to?
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word (a sketch of this recipe follows).
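A sketch of the recipe, with beam search standing in for exact N-best enumeration, and random stand-in tables for the two heat maps above; with the real error and language models the dictionary filter would recover the intended word, whereas here it will typically return nothing:

import numpy as np

letters = 'abcdefghijklmnopqrstuvwxyz'
K = len(letters)
rng = np.random.default_rng(5)
trans = rng.dirichlet(np.ones(K), size=K)  # trans[i, j] = p(int_t=j | int_{t-1}=i)
emit = rng.dirichlet(np.ones(K), size=K)   # emit[i, j]  = p(hit=j  | int=i)

def nbest(typed, n=200):
    beams = [((), 0.0)]                    # (intended letters so far, log prob)
    for ch in typed:
        v = letters.index(ch)
        scored = []
        for seq, lp in beams:
            for h in range(K):
                lt = np.log(trans[seq[-1], h]) if seq else -np.log(K)
                scored.append((seq + (h,), lp + lt + np.log(emit[h, v])))
        scored.sort(key=lambda x: -x[1])
        beams = scored[:n]                 # keep the n most likely partial sequences
    return [''.join(letters[h] for h in seq) for seq, lp in beams]

candidates = nbest('cwsykcak')
dictionary = {'casually', 'cosmetic'}      # stand-in for a real English dictionary
print(candidates[:3])
print([w for w in candidates if w in dictionary])   # likely empty with random tables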
Speech Recognition: raw signal

(Figure: raw audio waveform, amplitude roughly −0.2 to 0.3 over about 0.9 seconds.)

'neural' representation

(Figure: spectrogram-like 'neural' representation, roughly 25 channels over 80 frames.)
Speech Recognition
(Figure: HMM with phonemes pho1 → pho2 → pho3 → pho4 and audio observations aud_t.)
pho – phoneme (letter); aud – audio signal (neural representation)
Medical Diagnosis
(Figure: belief network with diseases tumour, flu, meningitis as parents of the symptoms/tests headache, fever, appetite, x-ray.)
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems, such as the Ising Model (1920), and in AI applications, such as the HMM (Baum 1966; Stratonovich 1960).
The need for structure
We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.).
Typically, the representational and computational cost of probabilistic models grows exponentially with the number of objects represented.
Without introducing strong structural limitations on how these objects can interact, probability is a non-starter.
For this reason, computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however, these are typically frowned upon by purists.
Graphical Models
We can use graphs to represent how objects can probabilistically interact with each other.
Graphical Models are then a marriage between Graph and Probability theory.
Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph.
The computational complexity of operations can often be related to the structure of the graph.
Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science.
Graphical Models are used to perform reasoning under uncertainty, and are therefore widely applicable.
Uses in Industry
Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model).
Hospitals: use Belief Nets to encode knowledge about diseases and symptoms, to aid medical diagnosis.
Google, Microsoft, Facebook: used in many places, including advertising, video game prediction, speech recognition.
Used to estimate the inherent desirability of products in consumer retail.
Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company/user relationship.
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predictthe sentiment at each node
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Recursive Nets and EmbeddingsTraining
We have a softmax classifier for each node in the tree to predict thesentiment of the phrase beneath this node in the tree
The weights of this classifier are shared across all nodes
At the leaf nodes at the bottom of the tree the inputs to the classifiers arethe word embeddings
The embeddings are combined by another network g with commonparameters which forms the input to the sentiment classifier
We then learn all the embeddings shared classifier parameters and sharedcombination parameters to maximise the classification accuracy
Prediction
For a new movie review the review is first parsed using a standard grammartree parser
This forms the tree which can be used to recursively form the sentiment classlabel for the review
Currently the best sentiment classifier Socher (2013)
Recursive Nets and Embeddingsotilde otilde
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
otilde
otilde
eth
middotshy
otilde
eth
plusmnsup2raquo
otilde
eth
plusmnordm
otilde
otilde
eth
notcedilraquo
otilde
otilde
eth
sup3plusmnshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
yen
eth
eth
Icircplusmnsup1raquoreg
eth
Uumlplusmnfrac14sup1raquoreg
yen
yen
eth
middotshy
yen
eth
plusmnsup2raquo
yen
eth
plusmnordm
yen
yen
eth
notcedilraquo
yen
yen
yen
acuteraquoiquestshynot
otilde
frac12plusmnsup3degraquoacuteacutemiddotsup2sup1
eth
ordfiquestregmiddotiquestnotmiddotplusmnsup2shy
eth
eth
plusmnsup2
eth
eth
notcedilmiddotshy
eth
notcedilraquosup3raquo
eth
ograve
otilde
eth
times
otilde
otilde
otilde
acutemiddotmicroraquofrac14
eth
eth
eth
raquoordfraquoregsect
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
times
yen
yen
eth
eth
frac14middotfrac14
eth
sup2ugravenot
eth
eth
acutemiddotmicroraquo
eth
eth
eth
iquest
eth
eth
shymiddotsup2sup1acuteraquo
eth
sup3middotsup2laquonotraquo
eth
eth
plusmnordm
eth
eth
notcedilmiddotshy
eth
eth
ograve
yen
eth
timesnot
yen
yen
eth
eth
ugraveshy
eth
paralaquoshynot
yen
otilde
middotsup2frac12regraquofrac14middotfrac34acutesect
yen yen
frac14laquoacuteacute
eth
ograve
eth
eth
timesnot
eth
eth
eth
eth
eth
ugraveshy
otilde
yen
sup2plusmnnot
yen yen
frac14laquoacuteacute
eth
ograve
Uacutemiddotsup1laquoregraquo ccedilaelig IcircOgraveIgraveOgrave degregraquofrac14middotfrac12notmiddotplusmnsup2 plusmnordm degplusmnshymiddotnotmiddotordfraquo iquestsup2frac14 sup2raquosup1iquestnotmiddotordfraquo oslashfrac34plusmnnotnotplusmnsup3 regmiddotsup1cedilnotdivide shyraquosup2notraquosup2frac12raquoshy iquestsup2frac14 notcedilraquomiddotreg sup2raquosup1iquestnotmiddotplusmnsup2ograve
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers: language

[Figure: letter transition matrix p(int_t | int_{t-1}) for English over the letters a-z, with probabilities ranging from 0 to 0.9]
Stubby Fingers
Given the typed sequence 'cwsykcak', what is the most likely word that this corresponds to?
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition: raw signal

[Figure: raw audio waveform, amplitude against time (seconds)]
'neural' representation

[Figure: time-frequency representation of the same audio signal]
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho – phoneme (letter); aud – audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient-specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems, such as the Ising Model (1920), and in AI applications, such as the HMM (Baum 1966; Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.)
Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects can interact, probability is a non-starter
For this reason computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however these are typically frowned on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models are then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph
The computational complexity of operations can often be related to the structure of the graph
Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable
Uses in Industry
Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis
Google, Microsoft, Facebook: used in many places, including advertising, video game prediction, speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company/user relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
$p(x|y) \equiv \frac{p(x, y)}{p(y)} = \frac{p(y|x)p(x)}{p(y)} \quad \text{(Bayes' rule)}$
Throwing darts
$p(\text{region 5}|\text{not region 20}) = \frac{p(\text{region 5}, \text{not region 20})}{p(\text{not region 20})} = \frac{p(\text{region 5})}{p(\text{not region 20})} = \frac{1/20}{19/20} = \frac{1}{19}$
Interpretation
p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation is 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.
Battleships
Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each
Can be placed anywhere on the 10×10 grid, but cannot overlap
Let s1 be the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query 'hit' or 'miss' responses
$p(s_1, s_2|D) = \frac{p(D|s_1, s_2)p(s_1, s_2)}{p(D)}$

Let X be the matrix of pixel occupancy:

$p(X|D) = \sum_{s_1, s_2} p(X, s_1, s_2|D) = \sum_{s_1, s_2} p(X|s_1, s_2)p(s_1, s_2|D)$

demoBattleships.m
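A Python sketch in the spirit of demoBattleships.m (the data D below is an assumed example, and the prior p(s1, s2) is taken uniform over valid placements): enumerate every non-overlapping pair of placements, keep those consistent with the hit/miss data, and average the pixel occupancy.

```python
import itertools
import numpy as np

G, L = 10, 5                                    # grid size, ship length
D = {(0, 0): 'miss', (4, 5): 'hit'}             # (row, col) -> query response

def cells(r, c, vertical):
    return {(r + i, c) if vertical else (r, c + i) for i in range(L)}

consistent = []
for r1, c1 in itertools.product(range(G - L + 1), range(G)):
    ship1 = cells(r1, c1, True)                 # vertical ship 1
    for r2, c2 in itertools.product(range(G), range(G - L + 1)):
        ship2 = cells(r2, c2, False)            # horizontal ship 2
        if ship1 & ship2:
            continue                            # ships cannot overlap
        occ = ship1 | ship2
        # p(D|s1, s2) is 1 if every response matches the placement, else 0
        if all((q in occ) == (resp == 'hit') for q, resp in D.items()):
            consistent.append(occ)

X = np.zeros((G, G))                            # p(pixel occupied | D)
for occ in consistent:
    for r, c in occ:
        X[r, c] += 1.0 / len(consistent)
print(X.round(2))
```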
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditional probabilities
$p(A, B, C, D, E) = p(A)p(B)p(C|A, B)p(D|C)p(E|B, C)$

[Figure: DAG with edges A → C, B → C, C → D, B → E, C → E]
Example – Part I
Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.
Choosing an ordering
Without loss of generality, we can write

$p(A, R, E, B) = p(A|R, E, B)p(R, E, B)$
$= p(A|R, E, B)p(R|E, B)p(E, B)$
$= p(A|R, E, B)p(R|E, B)p(E|B)p(B)$
Assumptions
The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B)
The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E)
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E)
Therefore
$p(A, R, E, B) = p(A|E, B)p(R|E)p(E)p(B)$
Example – Part II: Specifying the Tables

[Figure: DAG with edges B → A, E → A, E → R]

p(A = 1|B, E):

Alarm = 1   Burglar   Earthquake
0.9999      1         1
0.99        1         0
0.99        0         1
0.0001      0         0

p(R = 1|E):

Radio = 1   Earthquake
1           1
0           0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.
Example – Part III: Inference
Initial Evidence: The alarm is sounding.
$p(B = 1|A = 1) = \frac{\sum_{E,R} p(B = 1, E, A = 1, R)}{\sum_{B,E,R} p(B, E, A = 1, R)} = \frac{\sum_{E,R} p(A = 1|B = 1, E)p(B = 1)p(E)p(R|E)}{\sum_{B,E,R} p(A = 1|B, E)p(B)p(E)p(R|E)} \approx 0.99$
Additional Evidence: The radio broadcasts an earthquake warning.
A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01.
Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
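These numbers can be checked by brute-force enumeration over the four binary variables; a small sketch using the tables above (my code, not the talk's):

```python
import itertools

pB = {1: 0.01,     0: 0.99}
pE = {1: 0.000001, 0: 0.999999}
pA = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}  # p(A=1|B,E)
pR = {1: 1.0, 0: 0.0}                                              # p(R=1|E)

def joint(b, e, a, r):
    pa = pA[b, e] if a == 1 else 1 - pA[b, e]
    pr = pR[e] if r == 1 else 1 - pR[e]
    return pB[b] * pE[e] * pa * pr   # p(A,R,E,B) = p(A|E,B)p(R|E)p(E)p(B)

def posterior_burglar(**evidence):
    num = den = 0.0
    for b, e, a, r in itertools.product([0, 1], repeat=4):
        state = dict(B=b, E=e, A=a, R=r)
        if any(state[k] != v for k, v in evidence.items()):
            continue                 # sum only over states matching the evidence
        p = joint(b, e, a, r)
        den += p
        if b == 1:
            num += p
    return num / den

print(posterior_burglar(A=1))        # approximately 0.99
print(posterior_burglar(A=1, R=1))   # approximately 0.01
```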
Markov Models
For time-series data v_1, ..., v_T we need a model p(v_{1:T}). For causal consistency it is meaningful to consider the decomposition
$p(v_{1:T}) = \prod_{t=1}^{T} p(v_t|v_{1:t-1})$

with the convention $p(v_t|v_{1:t-1}) = p(v_1)$ for $t = 1$.
v1 v2 v3 v4
Independence assumptions
It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.
Markov Chain
Only the recent past is relevant:

$p(v_t|v_1, \ldots, v_{t-1}) = p(v_t|v_{t-L}, \ldots, v_{t-1})$

where $L \ge 1$ is the order of the Markov chain. For $L = 1$,

$p(v_{1:T}) = p(v_1)p(v_2|v_1)p(v_3|v_2) \cdots p(v_T|v_{T-1})$

For a stationary Markov chain the transitions $p(v_t = s'|v_{t-1} = s) = f(s', s)$ are time-independent ('homogeneous'); a small sampling sketch follows.
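A sketch of sampling such a homogeneous chain (the 3-state transition matrix below is an assumed example):

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.30],
              [0.05, 0.10, 0.60]])   # M[s2, s1] = p(v_t = s2 | v_{t-1} = s1)

v = [0]                              # v_1
for t in range(1, 50):
    v.append(rng.choice(3, p=M[:, v[-1]]))  # draw v_t given v_{t-1}
print(v)
```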
[Figure: (a) First-order Markov chain v_1 → v_2 → v_3 → v_4. (b) Second-order Markov chain.]
Markov Chains
[Figure: first-order Markov chain v_1 → v_2 → v_3 → v_4]

$p(v_1, \ldots, v_T) = \underbrace{p(v_1)}_{\text{initial}} \prod_{t=2}^{T} \underbrace{p(v_t|v_{t-1})}_{\text{transition}}$
State transition diagram
Nodes represent states of the variable v, and arcs non-zero elements of the transition p(v_t|v_{t-1}).

[Figure: state transition diagram on states 1-9]
Most probable and shortest paths
[Figure: the same state transition diagram on states 1-9]
The shortest (unweighted) path from state 1 to state 7 is 1-2-7.
The most probable path from state 1 to state 7 is 1-8-9-7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1-2-7 the probability of exiting state 2 into state 7 is 1/5.
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
$p(x_t = i) = \sum_j \underbrace{p(x_t = i|x_{t-1} = j)}_{M_{ij}} p(x_{t-1} = j)$
p(x_t = i) is the frequency that we visit state i at time t, given we started from p(x_1) and randomly drew samples from the transition p(x_τ|x_{τ-1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

$p_t = M^{t-1} p_1$
If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:

$p_\infty = M p_\infty$

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
PageRank
Define the matrix
$A_{ij} = \begin{cases} 1 & \text{if website } j \text{ has a hyperlink to website } i \\ 0 & \text{otherwise} \end{cases}$
From this we can define a Markov transition matrix with elements
$M_{ij} = \frac{A_{ij}}{\sum_{i'} A_{i'j}}$
If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.
For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain the word is then returned, ranked according to the importance of the site.
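A toy power-iteration sketch (the 4-site link matrix A below is an illustrative assumption): repeatedly applying M to any initial distribution converges to the equilibrium p_∞.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)  # A[i, j] = 1 if site j links to site i

M = A / A.sum(axis=0)          # column-normalise: M[i, j] = p(next = i | current = j)
p = np.full(4, 0.25)           # any initial distribution p_1
for _ in range(100):           # p_t = M^(t-1) p_1
    p = M @ p
print(p)                       # equilibrium p_inf = M p_inf: the site 'importance'
```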
Hidden Markov Models
The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution
$p(h_{1:T}, v_{1:T}) = p(v_1|h_1)p(h_1) \prod_{t=2}^{T} p(v_t|h_t)p(h_t|h_{t-1})$
For a stationary HMM, the transition p(h_t|h_{t-1}) and emission p(v_t|h_t) distributions are constant through time.
[Figure: hidden chain h_1 → h_2 → h_3 → h_4 emitting v_1, ..., v_4]

Figure: A first-order hidden Markov model with 'hidden' variables dom(h_t) = {1, ..., H}, t = 1, ..., T. The 'visible' variables v_t can be either discrete or continuous.
The classical inference problems
Filtering (Inferring the present): p(h_t|v_{1:t})
Prediction (Inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (Inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): $\mathrm{argmax}_{h_{1:T}} p(h_{1:T}|v_{1:T})$
For prediction, one is also often interested in p(v_t|v_{1:s}) for t > s.
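As a concrete instance, a forward-recursion sketch of filtering (toy transition and emission matrices assumed; my illustration, not the talk's code):

```python
import numpy as np

H, T = 3, 4                        # number of hidden states, sequence length
p1 = np.full(H, 1.0 / H)           # p(h_1)
Ptra = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.1, 0.8]])  # Ptra[i, j] = p(h_t = i | h_{t-1} = j)
Pemi = np.array([[0.9, 0.2, 0.1],
                 [0.1, 0.8, 0.9]])  # Pemi[v, i] = p(v_t = v | h_t = i)
v = [0, 0, 1, 1]                   # observed symbols

alpha = Pemi[v[0]] * p1            # alpha_1(i) propto p(v_1|h_1 = i) p(h_1 = i)
alpha /= alpha.sum()
for t in range(1, T):
    alpha = Pemi[v[t]] * (Ptra @ alpha)  # propagate one step, then weight by emission
    alpha /= alpha.sum()           # normalise: alpha_t(i) = p(h_t = i | v_{1:t})
print(alpha)                       # filtered posterior p(h_T | v_{1:T})
```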
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the time-series (but quadratically with the number of hidden states)
The algorithms are variants of 'message passing on factor graphs'
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes)
HMMs for speech recognition
h_t is the phoneme at time t; p(h_t|h_{t-1}) – language model; p(v_t|h_t) – speech signal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speech recognition
The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t, θ) (sketched below)
This function is a deep neural network trained on a large amount of data
There is a goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems
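A sketch of such an emission model, with assumed toy sizes and random (untrained) weights; a real system would train the network on data, this only shows the scoring:

```python
import numpy as np

# p(v_t | h_t) = N(v_t; mu(h_t, theta), sigma^2 I), mu a small neural network
rng = np.random.default_rng(0)
n_pho, n_hid, n_obs = 10, 32, 25
W1 = 0.1 * rng.standard_normal((n_hid, n_pho))
W2 = 0.1 * rng.standard_normal((n_obs, n_hid))

def mu(h):                               # h: phoneme index -> predicted mean
    return W2 @ np.tanh(W1 @ np.eye(n_pho)[h])

def log_emission(v, h, sigma2=1.0):      # log N(v; mu(h), sigma^2 I)
    d = v - mu(h)
    return -0.5 * (n_obs * np.log(2 * np.pi * sigma2) + d @ d / sigma2)

v = rng.standard_normal(n_obs)           # one frame of the audio representation
print(max(range(n_pho), key=lambda h: log_emission(v, h)))  # best-scoring phoneme
```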
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images, for example) can be constructed on the basis of a low-dimensional representation
Note that this is a Graphical Model, not a Function
The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h)
One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v)) and parameter learning are intractable in these models
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method – much faster for inference
Variational Inference
Consider a distribution

$p(v|\theta) = \int_h p(v|h, \theta)p(h)$
and that we wish to learn θ to maximise the probability this model generates the observed data.
$\log p(v|\theta) \ge -\int_h q(h|v, \phi) \log q(h|v, \phi) + \int_h q(h|v, \phi) \log\left(p(v|h, \theta)p(h)\right)$
Idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently
We then jointly maximise the bound with respect to φ and θ (a toy Monte-Carlo sketch follows below)
We can parameterise p(v|h, θ) using a deep network
Very popular approach – see the 'variational autoencoder' and also attention mechanisms
Extension to semi-supervised method using

$p(v) = \int_h \sum_c p(v|h, c)p(c)p(h)$
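A toy Monte-Carlo estimate of the bound for a fully Gaussian 1-D model; all the densities below are my illustrative assumptions, chosen so the exact log-likelihood is available for comparison:

```python
import numpy as np

# Assumed model: p(h) = N(0,1), p(v|h,theta) = N(theta*h, 1), q(h|v,phi) = N(phi*v, 1)
rng = np.random.default_rng(0)

def log_normal(x, mean, var=1.0):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def bound(v, theta, phi, n=100000):
    h = phi * v + rng.standard_normal(n)       # samples h ~ q(h|v, phi)
    log_q = log_normal(h, phi * v)             # log q(h|v, phi)
    log_p = log_normal(v, theta * h) + log_normal(h, 0.0)  # log p(v|h) + log p(h)
    return np.mean(log_p - log_q)              # <= log p(v|theta)

v = 1.5
print(bound(v, theta=1.0, phi=0.5))            # maximise jointly over theta and phi
print(log_normal(v, 0.0, var=2.0))             # exact log p(v|theta=1): v ~ N(0, 2)
```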
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games?
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals
Problem is that the number of pixel states is enormous
Need to learn a low-dimensional representation of the screen (use a deep generative model)
Then learn which action to take given the low-dimensional representation (a toy tabular sketch follows below)
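A minimal tabular Q-learning sketch on a tiny chain world; this is an illustrative stand-in for learning which action to take given a (low-dimensional) state, not the method used for Atari:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2            # states 0..4; actions 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))   # state-action values
alpha, gamma, eps = 0.1, 0.9, 0.5     # learning rate, discount, exploration

for episode in range(200):
    s = 0
    for step in range(200):           # state 4 is terminal with reward 1
        greedy = int(Q[s].argmax())
        a = rng.integers(n_actions) if rng.random() < eps else greedy
        s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s2 == n_states - 1 else 0.0
        # move Q(s, a) towards the bootstrapped target r + gamma max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2
        if s == n_states - 1:
            break

print(Q.argmax(axis=1)[:-1])          # greedy action in non-terminal states: 'right'
```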
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representation learning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer:
https://reinfer.io
Recursive Nets and Embeddings

[Figure: prediction of positive and negative (bottom right) sentences and their negation]
Recurrent Nets
[Figure: RNN unrolled through time: inputs x_1, x_2, x_3, hidden units h_1, h_2, h_3, outputs y_1, y_2, y_3, with shared weight matrices A (input→hidden), B (hidden→hidden), C (hidden→output)]
RNNs are used in time-series applications
The basic idea is that the hidden units h_t at time t (and possibly the output y_t) depend on the previous state of the network h_{t-1}, x_{t-1}, y_{t-1}, for inputs x_t and outputs y_t
In the above network I 'unrolled the net through time' to give a standard NN diagram (a minimal forward-pass sketch follows below)
I omitted the potential links from x_{t-1}, y_{t-1} to h_t
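A minimal forward pass matching the unrolled diagram; sizes and random weights are assumptions, with A, B, C the shared input-to-hidden, hidden-to-hidden and hidden-to-output matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
nx, nh, ny, T = 3, 4, 2, 5
A = rng.standard_normal((nh, nx))   # input -> hidden
B = rng.standard_normal((nh, nh))   # hidden -> hidden (previous time step)
C = rng.standard_normal((ny, nh))   # hidden -> output

x = rng.standard_normal((T, nx))    # input sequence x_1..x_T
h = np.zeros(nh)                    # initial hidden state
for t in range(T):
    h = np.tanh(A @ x[t] + B @ h)   # h_t depends on x_t and h_{t-1}
    y = C @ h                       # output y_t
    print(t, y)
```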
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples. The top line is real handwriting, for comparison. See Alex Graves' work.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Recurrent Nets
x1 x2 x3
h1 h2 h3
y1 y2 y3
A A A
C C C
B B
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)depend on the previous state of the network htminus1 xtminus1 ytminus1 for inputs xt andoutputs yt
In the above network I lsquounrolled the net through timersquo to give a standard NNdiagram
I omitted the potential links from xtminus1 ytminus1 to ht
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Handwriting Generation using a RNN
Some training examples
Handwriting Generation using a RNN
Some generated examples Top line is real handwriting for comparison See AlexGraversquos work
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis

[Belief network: diseases (tumour, flu, meningitis) as parents of findings (headache, fever, appetite, x-ray)]

Combine known medical knowledge with patient-specific information.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability

Why Probability?

Probability is a logical calculus of uncertainty.

Natural framework to use in models of physical systems, such as the Ising Model (1920), and in AI applications, such as the HMM (Baum 1966; Stratonovich 1960).
The need for structure

We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.).

Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented.

Without introducing strong structural limitations on how these objects can interact, probability is a non-starter.

For this reason, computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however, these are typically frowned upon by purists.
Graphical Models

We can use graphs to represent how objects can probabilistically interact with each other.

Graphical Models are then a marriage between Graph and Probability theory.

Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph.

The computational complexity of operations can often be related to the structure of the graph.

Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science.

Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable.
Uses in Industry

Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model).

Hospitals: use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis.

Google, Microsoft, Facebook: used in many places, including advertising, video game prediction, speech recognition.

Used to estimate the inherent desirability of products in consumer retail.

Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company/user relationship.
Conditional Probability and Bayes' Rule

The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

p(x|y) \equiv \frac{p(x, y)}{p(y)} = \frac{p(y|x)p(x)}{p(y)} \quad \text{(Bayes' rule)}
Throwing darts

p(\text{region 5} \mid \text{not region 20}) = \frac{p(\text{region 5}, \text{not region 20})}{p(\text{not region 20})} = \frac{p(\text{region 5})}{p(\text{not region 20})} = \frac{1/20}{19/20} = \frac{1}{19}
Interpretation

p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation is 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.
Battleships

Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each.

Each can be placed anywhere on the 10×10 grid, but they cannot overlap.

Let s_1 be the origin of ship 1 and s_2 the origin of ship 2.

The data D is a collection of query 'hit' or 'miss' responses.

p(s_1, s_2|D) = \frac{p(D|s_1, s_2)p(s_1, s_2)}{p(D)}

Let X be the matrix of pixel occupancy:

p(X|D) = \sum_{s_1, s_2} p(X, s_1, s_2|D) = \sum_{s_1, s_2} p(X|s_1, s_2)p(s_1, s_2|D)

demoBattleships.m
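demoBattleships.m refers to a MATLAB demo; a small Python re-sketch of the same posterior computation (deterministic hit/miss likelihood, uniform prior over valid placements — my assumptions, matching the slide's setup) might look like this:

import numpy as np

N, L = 10, 5   # grid size and ship length

def cells(origin, vertical):
    r, c = origin
    return {(r + i, c) if vertical else (r, c + i) for i in range(L)}

v_origins = [(r, c) for r in range(N - L + 1) for c in range(N)]   # ship 1 (vertical)
h_origins = [(r, c) for r in range(N) for c in range(N - L + 1)]   # ship 2 (horizontal)

def posterior_occupancy(data):
    # data: {(row, col): True for a hit, False for a miss}
    occ, weight = np.zeros((N, N)), 0.0
    for s1 in v_origins:
        c1 = cells(s1, True)
        for s2 in h_origins:
            c2 = cells(s2, False)
            if c1 & c2:
                continue                    # ships cannot overlap
            both = c1 | c2
            if any((cell in both) != hit for cell, hit in data.items()):
                continue                    # placement inconsistent with D
            for (r, c) in both:             # accumulate occupancy, uniform prior
                occ[r, c] += 1.0
            weight += 1.0
    return occ / weight                     # p(X_rc = 1 | D)

print(posterior_occupancy({(0, 0): False, (4, 4): True}).round(2))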
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents.

The joint distribution is obtained by taking the product of the conditional probabilities:

p(A, B, C, D, E) = p(A)p(B)p(C|A, B)p(D|C)p(E|B, C)

[DAG: A \to C \leftarrow B, C \to D, B \to E \leftarrow C]
Example – Part I

Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.
Choosing an ordering

Without loss of generality, we can write

p(A, R, E, B) = p(A|R, E, B)p(R, E, B)
             = p(A|R, E, B)p(R|E, B)p(E, B)
             = p(A|R, E, B)p(R|E, B)p(E|B)p(B)

Assumptions:

The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B).
The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E).
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E).

Therefore

p(A, R, E, B) = p(A|E, B)p(R|E)p(E)p(B)
Example – Part II: Specifying the Tables

[DAG: B \to A \leftarrow E, E \to R]

p(A = 1|B, E):

Burglar B   Earthquake E   Alarm = 1
1           1              0.9999
1           0              0.99
0           1              0.99
0           0              0.0001

p(R = 1|E):

Earthquake E   Radio = 1
1              1
0              0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.
Example – Part III: Inference

Initial evidence: the alarm is sounding.

p(B = 1|A = 1) = \frac{\sum_{E,R} p(B = 1, E, A = 1, R)}{\sum_{B,E,R} p(B, E, A = 1, R)} = \frac{\sum_{E,R} p(A = 1|B = 1, E)p(B = 1)p(E)p(R|E)}{\sum_{B,E,R} p(A = 1|B, E)p(B)p(E)p(R|E)} \approx 0.99

Additional evidence: the radio broadcasts an earthquake warning.

A similar calculation gives p(B = 1|A = 1, R = 1) \approx 0.01.

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
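These numbers can be reproduced by brute-force enumeration of the joint distribution; a short Python check, using the table values from the previous slide:

import itertools

p_B = {1: 0.01, 0: 0.99}
p_E = {1: 0.000001, 0: 0.999999}
pA1 = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}  # p(A=1 | B, E)
pR1 = {1: 1.0, 0: 0.0}                                              # p(R=1 | E)

def joint(b, e, a, r):
    pa = pA1[b, e] if a == 1 else 1 - pA1[b, e]
    pr = pR1[e] if r == 1 else 1 - pR1[e]
    return p_B[b] * p_E[e] * pa * pr        # p(A|B,E) p(R|E) p(E) p(B)

def p_burglar(evidence):
    num = den = 0.0
    for b, e, a, r in itertools.product((0, 1), repeat=4):
        state = {'B': b, 'E': e, 'A': a, 'R': r}
        if all(state[k] == v for k, v in evidence.items()):
            p = joint(b, e, a, r)
            den += p
            if b == 1:
                num += p
    return num / den

print(p_burglar({'A': 1}))          # approx 0.99
print(p_burglar({'A': 1, 'R': 1}))  # approx 0.01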
Markov Models

For timeseries data v_1, \dots, v_T, we need a model p(v_{1:T}). For causal consistency, it is meaningful to consider the decomposition

p(v_{1:T}) = \prod_{t=1}^{T} p(v_t|v_{1:t-1})

with the convention p(v_t|v_{1:t-1}) = p(v_1) for t = 1.

[Figure: cascade belief network on v_1, v_2, v_3, v_4]

Independence assumptions: it is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.
Markov Chain

Only the recent past is relevant:

p(v_t|v_1, \dots, v_{t-1}) = p(v_t|v_{t-L}, \dots, v_{t-1})

where L \ge 1 is the order of the Markov chain. For L = 1,

p(v_{1:T}) = p(v_1)p(v_2|v_1)p(v_3|v_2) \cdots p(v_T|v_{T-1})

For a stationary Markov chain the transitions p(v_t = s'|v_{t-1} = s) = f(s', s) are time-independent ('homogeneous').

Figure: (a) first order Markov chain; (b) second order Markov chain.
Markov Chains

p(v_1, \dots, v_T) = \underbrace{p(v_1)}_{\text{initial}} \prod_{t=2}^{T} \underbrace{p(v_t|v_{t-1})}_{\text{transition}}

State transition diagram: nodes represent states of the variable v, and arcs the non-zero elements of the transition p(v_t|v_{t-1}).

[Figure: transition diagram on states 1–9]
Most probable and shortest paths

[Figure: the same transition diagram on states 1–9]

The shortest (unweighted) path from state 1 to state 7 is 1 − 2 − 7.

The most probable path from state 1 to state 7 is 1 − 8 − 9 − 7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1 − 2 − 7 the probability of exiting state 2 into state 7 is 1/5.
Equilibrium distribution

It is interesting to know how the marginal p(x_t) evolves through time:

p(x_t = i) = \sum_j \underbrace{p(x_t = i|x_{t-1} = j)}_{M_{ij}} p(x_{t-1} = j)

p(x_t = i) is the frequency with which we visit state i at time t, given that we started from p(x_1) and randomly drew samples from the transition p(x_\tau|x_{\tau-1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

p_t = M^{t-1} p_1

If, for t \to \infty, p_\infty is independent of the initial distribution p_1, then p_\infty is called the equilibrium distribution of the chain:

p_\infty = M p_\infty

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
PageRank

Define the matrix

A_{ij} = 1 if website j has a hyperlink to website i, and 0 otherwise.

From this we can define a Markov transition matrix with elements

M_{ij} = \frac{A_{ij}}{\sum_{i'} A_{i'j}}

If we jump from website to website, the equilibrium distribution component p_\infty(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i, a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain that word is then returned, ranked according to the importance of the site.
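A toy sketch of both ideas in numpy (the link matrix here is invented for illustration): build M from the adjacency matrix A, then power-iterate to the equilibrium distribution, whose components rank the sites:

import numpy as np

# A[i, j] = 1 if site j links to site i (a made-up 4-site web)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0)        # column-normalise: M[i, j] = p(next = i | current = j)

p = np.full(4, 0.25)         # any initial distribution p_1
for _ in range(200):         # p_t = M^{t-1} p_1 converges to p_inf
    p = M @ p
print(p)                     # equilibrium 'importance' of each site: p = M p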
Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

p(h_{1:T}, v_{1:T}) = p(v_1|h_1)p(h_1) \prod_{t=2}^{T} p(v_t|h_t)p(h_t|h_{t-1})

For a stationary HMM, the transition p(h_t|h_{t-1}) and emission p(v_t|h_t) distributions are constant through time.

Figure: a first order hidden Markov model with 'hidden' variables dom(h_t) = \{1, \dots, H\}, t = 1, \dots, T. The 'visible' variables v_t can be either discrete or continuous.
The classical inference problems

Filtering (inferring the present): p(h_t|v_{1:t})
Prediction (inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): \mathrm{argmax}_{h_{1:T}} p(h_{1:T}|v_{1:T})

For prediction, one is also often interested in p(v_t|v_{1:s}) for t > s.
Inference in Hidden Markov Models

[Figure: belief network representation of an HMM: chain h_1 \to h_2 \to h_3 \to h_4, each h_t emitting v_t]

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).

The algorithms are variants of 'message passing on factor graphs'.

The algorithms are guaranteed to work if the graph is singly connected.

There has been a huge research effort in the last 15 years to apply message passing for approximate inference in multiply connected graphs (e.g. low-density parity-check codes).
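For example, filtering is a single left-to-right sweep (the forward algorithm); a minimal numpy sketch, with made-up 2-state transition and emission tables:

import numpy as np

def filtering(M, E, p1, obs):
    # M[i, j] = p(h_t = i | h_{t-1} = j); E[k, i] = p(v_t = k | h_t = i)
    alpha = E[obs[0]] * p1                # p(h_1, v_1)
    alphas = [alpha / alpha.sum()]        # normalised: p(h_1 | v_1)
    for v in obs[1:]:
        alpha = E[v] * (M @ alphas[-1])   # transition step, then emission correction
        alphas.append(alpha / alpha.sum())
    return np.array(alphas)               # row t is p(h_t | v_{1:t})

M = np.array([[0.9, 0.2],                 # toy transition (columns sum to 1)
              [0.1, 0.8]])
E = np.array([[0.8, 0.3],                 # toy emission over 2 observation symbols
              [0.2, 0.7]])
p1 = np.array([0.5, 0.5])
print(filtering(M, E, p1, [0, 0, 1, 1]))

Each step costs O(H^2), so the sweep is linear in T and quadratic in the number of hidden states, as claimed above.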
HMMs for speech recognition

h_t is the phoneme at time t; p(h_t|h_{t-1}) – language model; p(v_t|h_t) – speech signal model.
Deep Nets and HMMs

[Figure: HMM with hidden phoneme states h_1, \dots, h_4 and observed audio v_1, \dots, v_4]

Recently, companies including Google have made big advances in speech recognition.

The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function \mu(h_t; \theta) of the phoneme.

This function is a deep neural network, trained on a large amount of data.

There is a gold rush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model

[Figure: belief network with latent variables h_1, h_2 as parents of visible variables v_1, \dots, v_4]

It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation.

Note that this is a Graphical Model, not a Function.

The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h).

One cannot use an autoencoder to generate new images.
The bad news

Inference (computing p(h|v)) and parameter learning are intractable in these models.

Statisticians typically use sampling as an approximation.

It is very popular in ML to use a variational method – much faster for inference.
Variational Inference

Consider a distribution

p(v|\theta) = \int_h p(v|h, \theta)p(h)

and suppose that we wish to learn \theta to maximise the probability that this model generates the observed data. Then

\log p(v|\theta) \ge -\int_h q(h|v, \phi)\log q(h|v, \phi) + \int_h q(h|v, \phi)\log\left(p(v|h, \theta)p(h)\right)

The idea is to choose a 'variational' distribution q(h|v, \phi) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound with respect to \phi and \theta.

We can parameterise p(v|h, \theta) using a deep network.

Very popular approach – see the 'variational autoencoder' and also attention mechanisms.

Extension to a semi-supervised method using p(v) = \int_h \sum_c p(v|h, c)p(c)p(h).
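As a tiny worked instance of the bound (a one-dimensional linear-Gaussian model invented for illustration, chosen so that \log p(v|\theta) is also available exactly), here is a Monte Carlo estimate with a reparameterised Gaussian q:

import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

theta = 1.5        # model: p(h) = N(0, 1), p(v|h, theta) = N(theta * h, 1)
v = 2.0            # a single observed datapoint
m, s = 0.8, 0.7    # variational parameters phi of q(h|v, phi) = N(m, s^2)

h = m + s * rng.standard_normal(100_000)      # reparameterised samples from q
elbo = np.mean(log_gauss(v, theta * h, 1.0)   #  E_q[ log p(v|h, theta) ]
               + log_gauss(h, 0.0, 1.0)       # + E_q[ log p(h) ]
               - log_gauss(h, m, s ** 2))     # - E_q[ log q(h|v, phi) ]

exact = log_gauss(v, 0.0, theta ** 2 + 1.0)   # here log p(v|theta) is closed-form
print(elbo, exact)                            # elbo <= exact; equal iff q = p(h|v)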
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games?
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long term goals.
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
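A minimal flavour of the 'learn which action to take' step, with tabular Q-learning on a toy 5-state corridor (the environment and all constants are invented for illustration; deep RL replaces the table with a network over the learned low dimensional representation):

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2                  # corridor states 0..4; actions: left, right
Q = np.zeros((n_states, n_actions))
gamma, lr, eps = 0.95, 0.1, 0.2

def choose(q):                              # epsilon-greedy with random tie-breaking
    if rng.random() < eps or q[0] == q[1]:
        return int(rng.integers(n_actions))
    return int(q.argmax())

for episode in range(500):
    s = 0
    for _ in range(1000):                   # cap episode length
        a = choose(Q[s])
        s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s2 == n_states - 1 else 0.0   # reward only at the right end
        # Bellman update towards r + gamma * max_a' Q(s2, a')
        Q[s, a] += lr * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2
        if s == n_states - 1:
            break

print(Q.argmax(axis=1))   # learned policy: action 1 (right) in states 0..3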
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing.
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve the interaction between reinforcement learning and representation learning.
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer: https://reinfer.io
Handwriting Generation using a RNN

Some generated examples. The top line is real handwriting, for comparison. See Alex Graves's work.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff?

AutoDiff takes a function f(x) and returns an exact value (up to machine accuracy) for the gradient

g_i(x) \equiv \left.\frac{\partial f}{\partial x_i}\right|_{x}

Note that this is not the same as a numerical approximation (such as central differences) for the gradient.

One can show that, if done efficiently, one can always calculate the gradient in less than 5 times the time it takes to compute f(x).
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Reasons research in deep learning has exploded
Much greater compute power (GPU)
Much larger datasets
AutoDiff
What is AutoDiff
AutoDiff takes a function f(x) and returns an exact value (up to machineaccuracy) for the gradient
gi(x) equivpart
partxif
∥∥∥∥x
Note that this is not the same as a numerical approximation (such as centraldifferences) for the gradient
One can show that if done efficiently one can always calculate the gradient inless than 5 times the time it takes to compute f(x)
Reverse DifferentiationA useful graphical representation is that the total derivative of f with respect to xis given by the sum over all path values from x to f where each path value is theproduct of the partial derivatives of the functions on the edges
df
dx=partf
partx+partf
partg
dg
dx
x
f
gpartfpartx
dgdx
partfpartg
Example
For f(x) = x2 + xgh where g =x2 and h = xg2
x
f
gh2x+ gh
2x
xh
2gx
xg
g2
f prime(x) = (2x+ gh) + (g2xg) + (2x2gxxg) + (2xxh) = 2x+ 8x7
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an ordering
Without loss of generality, we can write

p(A, R, E, B) = p(A|R, E, B) p(R, E, B)
             = p(A|R, E, B) p(R|E, B) p(E, B)
             = p(A|R, E, B) p(R|E, B) p(E|B) p(B)
Assumptions
The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B).
The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E).
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E).
Therefore
p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)
Example – Part II: Specifying the Tables
[Figure: DAG with edges B→A, E→A and E→R]
p(A = 1|B, E):

Burglar  Earthquake  Alarm = 1
1        1           0.9999
1        0           0.99
0        1           0.99
0        0           0.0001
p(R = 1|E):

Earthquake  Radio = 1
1           1
0           0
The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.
Example – Part III: Inference
Initial Evidence: The alarm is sounding.
p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
             = Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E)
             ≈ 0.99
Additional Evidence: The radio broadcasts an earthquake warning.
A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01.
Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
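Since the network is tiny, these numbers can be verified by brute-force enumeration of the joint. A minimal Python sketch using the tables above (the code layout is mine, not from the slides):

import itertools

pB = {1: 0.01, 0: 0.99}                 # p(B = 1) = 0.01
pE = {1: 1e-6, 0: 1 - 1e-6}             # p(E = 1) = 0.000001
pA = {(1, 1): 0.9999, (1, 0): 0.99,     # p(A = 1 | B, E)
      (0, 1): 0.99, (0, 0): 0.0001}
pR = {1: 1.0, 0: 0.0}                   # p(R = 1 | E)

def joint(b, e, a, r):
    pa = pA[(b, e)] if a == 1 else 1 - pA[(b, e)]
    pr = pR[e] if r == 1 else 1 - pR[e]
    return pB[b] * pE[e] * pa * pr      # p(A,R,E,B) = p(A|E,B) p(R|E) p(E) p(B)

def p_burglar(evidence):
    """p(B = 1 | evidence), where evidence is e.g. {'A': 1} or {'A': 1, 'R': 1}."""
    num = den = 0.0
    for b, e, a, r in itertools.product([0, 1], repeat=4):
        state = {'B': b, 'E': e, 'A': a, 'R': r}
        if any(state[k] != v for k, v in evidence.items()):
            continue                    # sum only over states matching the evidence
        p = joint(b, e, a, r)
        den += p
        if b == 1:
            num += p
    return num / den

print(p_burglar({'A': 1}))              # ~0.99
print(p_burglar({'A': 1, 'R': 1}))      # ~0.01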
Markov Models
For timeseries data v1, ..., vT we need a model p(v1:T). For causal consistency it is meaningful to consider the decomposition

p(v1:T) = ∏_{t=1}^{T} p(vt|v1:t−1)

with the convention p(vt|v1:t−1) = p(v1) for t = 1.

[Figure: cascade belief network over v1, v2, v3, v4]
Independence assumptions
It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.
Markov Chain
Only the recent past is relevant
p(vt|v1, ..., vt−1) = p(vt|vt−L, ..., vt−1)

where L ≥ 1 is the order of the Markov chain. For a first order chain (L = 1),

p(v1:T) = p(v1) p(v2|v1) p(v3|v2) ··· p(vT|vT−1)

For a stationary Markov chain the transitions p(vt = s′|vt−1 = s) = f(s′, s) are time-independent ('homogeneous').
[Figure: (a) First order Markov chain; (b) Second order Markov chain]
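As a small illustration, the sketch below samples from a made-up homogeneous three-state chain and evaluates the log-probability of the sampled sequence:

import math
import random

# M[j] is the distribution of the next state given current state j,
# i.e. M[j][i] = p(v_t = i | v_{t-1} = j); rows sum to 1 (homogeneous chain).
M = [[0.9, 0.1, 0.0],
     [0.2, 0.7, 0.1],
     [0.0, 0.3, 0.7]]
p1 = [1.0, 0.0, 0.0]                     # initial distribution p(v_1)

def sample_chain(T, rng=random.Random(0)):
    v = rng.choices(range(3), weights=p1)[0]
    seq = [v]
    for _ in range(T - 1):
        v = rng.choices(range(3), weights=M[v])[0]   # draw v_t | v_{t-1}
        seq.append(v)
    return seq

def log_prob(seq):
    # log p(v_1) + sum_t log p(v_t | v_{t-1})
    lp = math.log(p1[seq[0]])
    for a, b in zip(seq, seq[1:]):
        lp += math.log(M[a][b])
    return lp

seq = sample_chain(10)
print(seq, log_prob(seq))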
Markov Chains
[Figure: first order Markov chain v1→v2→v3→v4]

p(v1, ..., vT) = p(v1) ∏_{t=2}^{T} p(vt|vt−1),  with p(v1) the initial distribution and p(vt|vt−1) the transition.
State transition diagram: nodes represent states of the variable v, and arcs non-zero elements of the transition p(vt|vt−1).
[Figure: state transition diagram over states 1–9]
Most probable and shortest paths
[Figure: the same state transition diagram over states 1–9]
The shortest (unweighted) path from state 1 to state 7 is 1−2−7.
The most probable path from state 1 to state 7 is 1−8−9−7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1−2−7 the probability of exiting state 2 into state 7 is 1/5.
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) = Σ_j p(xt = i|xt−1 = j) p(xt−1 = j),  where Mij ≡ p(xt = i|xt−1 = j)
p(xt = i) is the frequency that we visit state i at time t, given we started from p(x1) and randomly drew samples from the transition p(xτ|xτ−1). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p1(i) is

pt = M^{t−1} p1
If, for t→∞, p∞ is independent of the initial distribution p1, then p∞ is called the equilibrium distribution of the chain:

p∞ = M p∞
The equilibrium distribution is proportional to the eigenvector of the transition matrix with unit eigenvalue.
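A quick numerical check of this statement, using power iteration on an arbitrary 3×3 column-stochastic matrix and comparing with the unit-eigenvalue eigenvector:

import numpy as np

# Column-stochastic transition matrix: M[i, j] = p(x_t = i | x_{t-1} = j).
M = np.array([[0.9, 0.2, 0.0],
              [0.1, 0.7, 0.3],
              [0.0, 0.1, 0.7]])

p = np.array([1.0, 0.0, 0.0])   # any initial distribution p_1
for _ in range(1000):           # p_t = M^{t-1} p_1
    p = M @ p
print(p)                        # the equilibrium distribution, satisfying p = M p

# Cross-check: eigenvector of M with eigenvalue 1, normalised to sum to 1.
w, V = np.linalg.eig(M)
v = np.real(V[:, np.argmax(np.real(w))])
print(v / v.sum())              # matches the power-iteration result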
PageRank
Define the matrix
Aij = 1 if website j has a hyperlink to website i, and 0 otherwise.
From this we can define a Markov transition matrix with elements

Mij = Aij / Σ_{i′} A_{i′j}
If we jump from website to website, the equilibrium distribution component p∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.
For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain that word is then returned, ranked according to the importance of the site.
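A toy sketch of this construction, assuming a four-site web in which every site has at least one outgoing link (practical PageRank also adds a damping factor, not discussed here):

import numpy as np

# Hypothetical 4-site web: A[i, j] = 1 if site j links to site i.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0)           # M[i, j] = A[i, j] / sum_i' A[i', j]
p = np.full(4, 0.25)            # start from the uniform distribution
for _ in range(500):            # power iteration towards p = M p
    p = M @ p
print(p)                        # 'importance' of each site; rank sites by this score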
Hidden Markov Models
The HMM defines a Markov chain on hidden (or 'latent') variables h1:T. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(vt|ht). This defines a joint distribution

p(h1:T, v1:T) = p(v1|h1) p(h1) ∏_{t=2}^{T} p(vt|ht) p(ht|ht−1)
For a stationary HMM the transition p(ht|ht−1) and emission p(vt|ht) distributions are constant through time.
[Figure: a first order hidden Markov model. The 'hidden' variables have dom(ht) = {1, ..., H}, t = 1, ..., T; the 'visible' variables vt can be either discrete or continuous.]
The classical inference problems
Filtering (inferring the present): p(ht|v1:t)
Prediction (inferring the future): p(ht|v1:s), t > s
Smoothing (inferring the past): p(ht|v1:u), t < u
Likelihood: p(v1:T)
Most likely path (Viterbi alignment): argmax_{h1:T} p(h1:T|v1:T)
For prediction, one is also often interested in p(vt|v1:s) for t > s.
Inference in Hidden Markov Models
Belief network representation of a HMM:

[Figure: chain h1→h2→h3→h4 with emissions ht→vt]
Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).
The algorithms are variants of 'message passing on factor graphs'.
Algorithms are guaranteed to work if the graph is singly-connected.
Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).
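For concreteness, a minimal sketch of filtering via the forward algorithm on an invented stationary HMM; each step costs O(H²), giving the linear-in-T, quadratic-in-H scaling quoted above:

import numpy as np

H, V = 3, 2
trans = np.array([[0.8, 0.1, 0.1],   # trans[i, j] = p(h_t = j | h_{t-1} = i)
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])
emit = np.array([[0.9, 0.1],         # emit[i, v] = p(v_t = v | h_t = i)
                 [0.5, 0.5],
                 [0.1, 0.9]])
ph1 = np.full(H, 1.0 / H)            # prior p(h_1)

def filtering(obs):
    """Forward algorithm: returns p(h_t | v_{1:t}) for each t."""
    alpha = ph1 * emit[:, obs[0]]
    alpha /= alpha.sum()
    out = [alpha]
    for v in obs[1:]:
        alpha = emit[:, v] * (trans.T @ alpha)   # predict, then correct
        alpha /= alpha.sum()                     # normalise for numerical stability
        out.append(alpha)
    return np.array(out)

print(filtering([0, 0, 1, 1, 1]))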
HMMs for speech recognition
ht is the phoneme at time t; p(ht|ht−1) – language model; p(vt|ht) – speech signal model.
Deep Nets and HMMs
[Figure: HMM with hidden phonemes h1, ..., h4 and visible acoustic frames v1, ..., v4]
Recently, companies including Google have made big advances in speech recognition.
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is some function of the phoneme, μ(ht; θ).
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
[Figure: generative model with latent variables h1, h2 and visible variables v1, ..., v4]
It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation.
Note that this is a Graphical Model, not a Function.
The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h).
One cannot use an autoencoder to generate new images.
The bad news
Inference (computing p(h|v)) and parameter learning are intractable in these models.
Statisticians typically use sampling as an approximation.
Very popular in ML to use a variational method – much faster for inference.
Variational Inference
Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and that we wish to learn θ to maximise the probability this model generates observed data. Then

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) + const.
Idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.
We then jointly maximise the bound w.r.t. φ and θ.
We can parameterise p(v|h, θ) using a deep network.
Very popular approach – see 'variational autoencoder' and also attention mechanisms.
Extension to semi-supervised method using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h).
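A minimal numerical sketch of the bound for a toy model with p(h) = N(0, 1), p(v|h, θ) = N(θ + h, 1) and a Gaussian q(h|v, φ). Everything is Gaussian, so the exact log-likelihood is available in closed form; with q set to the true posterior, the Monte Carlo estimate of the bound recovers it up to sampling noise:

import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def elbo(v, theta, mu_phi, sigma_phi, n_samples=10_000):
    # Sample h ~ q(h|v, phi) = N(mu_phi, sigma_phi^2), then estimate
    # E_q[ log p(v|h,theta) + log p(h) - log q(h|v,phi) ].
    h = mu_phi + sigma_phi * rng.standard_normal(n_samples)
    return np.mean(log_gauss(v, theta + h, 1.0)
                   + log_gauss(h, 0.0, 1.0)
                   - log_gauss(h, mu_phi, sigma_phi))

v = 2.0
# True posterior for theta = 0 is N(v/2, 1/2), so this choice makes the bound tight:
print(elbo(v, theta=0.0, mu_phi=1.0, sigma_phi=np.sqrt(0.5)))
print(log_gauss(v, 0.0, np.sqrt(2.0)))   # exact log p(v|theta) = log N(v; theta, sqrt(2))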
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games?
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals.
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deep generative model).
Then learn which action to take given the low dimensional representation.
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing.
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representation learning.
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer:
https://reinfer.io
Reverse Differentiation
A useful graphical representation is that the total derivative of f with respect to x is given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the functions on the edges:

df/dx = ∂f/∂x + (∂f/∂g)(dg/dx)

[Figure: two paths from x to f – a direct edge labelled ∂f/∂x, and a path through g with edges dg/dx and ∂f/∂g]
Example
For f(x) = x² + x g h, where g = x² and h = x g²:

[Figure: computation graph from x to f with edge labels 2x + gh (x→f), 2x (x→g), g² (x→h), xh (g→f), 2gx (g→h) and xg (h→f)]
f′(x) = (2x + gh) + (g² · xg) + (2x · 2gx · xg) + (2x · xh) = 2x + 8x⁷
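This path-sum result is easy to sanity-check with a central finite difference:

def f(x):
    g = x**2
    h = x * g**2          # h = x^5, so f(x) = x^2 + x^8
    return x**2 + x * g * h

def fprime(x):
    return 2 * x + 8 * x**7   # the path-sum result

x, eps = 1.3, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
print(numeric, fprime(x))     # agree to several significant figures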
Reverse Differentiation
Consider

f(x1, x2) = cos(sin(x1 x2))

We can represent this computationally using an Abstract Syntax Tree (AST):

[Figure: AST with leaves x1, x2 and internal nodes f1, f2, f3]

f1(x1, x2) = x1 x2
f2(x) = sin(x)
f3(x) = cos(x)

Given values for x1, x2, we first run forwards through the tree so that we can associate each node with an actual function value.
Reverse Differentiation

df3/dx1 = (∂f3/∂f2) (df2/dx1) = (∂f3/∂f2) (df2/df1) (df1/dx1),  where (∂f3/∂f2)(df2/df1) = df3/df1

Similarly,

df3/dx2 = (∂f3/∂f2) (df2/df1) (df1/dx2)

The two derivatives share the same computation branch, and we want to exploit this.
Reverse Differentiation

∂f1/∂x1 = x2,  ∂f1/∂x2 = x1,  ∂f2/∂f1 = cos(f1),  ∂f3/∂f2 = −sin(f2)

1. Find the reverse ancestral (backwards) schedule of nodes (f3, f2, f1, x1, x2).
2. Start with the first node n1 in the reverse schedule and define t_{n1} = 1.
3. For the next node n in the reverse schedule, find the child nodes ch(n). Then define

t_n = Σ_{c ∈ ch(n)} (∂f_c/∂f_n) t_c

4. The total derivatives of f with respect to the root nodes of the tree (here x1 and x2) are given by the values of t at those nodes.
This is a general procedure that can be used to automatically define a subroutine to efficiently compute the gradient. It is efficient because information is collected at nodes in the tree and split between parents only when required.
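A sketch of this procedure for the running example f3 = cos(sin(x1 x2)), with the tree structure hard-coded for clarity; t accumulates exactly as in step 3:

import math

# Forward pass: associate each node with its value.
x1, x2 = 0.7, -1.2
f1 = x1 * x2
f2 = math.sin(f1)
f3 = math.cos(f2)

# Local partial derivatives on the edges of the tree.
d = {('f3', 'f2'): -math.sin(f2),
     ('f2', 'f1'): math.cos(f1),
     ('f1', 'x1'): x2,
     ('f1', 'x2'): x1}

# ch(n): the nodes that consume n (closer to the root f3).
children = {'f2': ['f3'], 'f1': ['f2'], 'x1': ['f1'], 'x2': ['f1']}

# Reverse pass over the schedule (f3, f2, f1, x1, x2), with t_{f3} = 1.
t = {'f3': 1.0}
for n in ['f2', 'f1', 'x1', 'x2']:
    t[n] = sum(d[(c, n)] * t[c] for c in children[n])

print(t['x1'], t['x2'])
# Check against the closed form df3/dx1 = -sin(sin(x1 x2)) cos(x1 x2) x2:
print(-math.sin(math.sin(x1 * x2)) * math.cos(x1 * x2) * x2)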
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another, and requires complex reasoning using some form of internal model.
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
[Figure: HMM with hidden positions pos1, ..., pos4 and observed sounds snd1, ..., snd4]
pos – position in kitchen; snd – sound
Finding the Burglar

[Figure: grid of creak/bump observations over time; the slide repeats as an animation, updating the inferred burglar position after each observation]
Stubby Fingers

[Figure: HMM with intended keys int1, ..., int4 and observed hit keys hit1, ..., hit4]
int – intended key; hit – hit key
Stubby Fingers: errors

[Figure: error matrix p(hit|int) over the letters a–z; colour scale from 0.05 to 0.55]
Stubby Fingers: language

[Figure: letter transition matrix of the language model over a–z; colour scale from 0 to 0.9]
Stubby Fingers
Given the typed sequence cwsykcak, what is the most likely word that this corresponds to?
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
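The heart of this pipeline is most-likely-sequence inference in the HMM. Below is a minimal Viterbi sketch over a three-letter alphabet with invented transition and emission tables (the slides' model uses the full alphabet and the matrices shown above):

import math

letters = 'abc'
# trans[a][b] = p(int_t = b | int_{t-1} = a): toy language model.
trans = {'a': {'a': 0.1, 'b': 0.6, 'c': 0.3},
         'b': {'a': 0.4, 'b': 0.2, 'c': 0.4},
         'c': {'a': 0.5, 'b': 0.4, 'c': 0.1}}
# emit[a][b] = p(hit = b | int = a): most mass on the intended key.
emit = {'a': {'a': 0.8, 'b': 0.1, 'c': 0.1},
        'b': {'a': 0.1, 'b': 0.8, 'c': 0.1},
        'c': {'a': 0.1, 'b': 0.1, 'c': 0.8}}

def viterbi(hits):
    """Most likely intended sequence given the typed (hit) sequence."""
    delta = {s: math.log(1 / len(letters)) + math.log(emit[s][hits[0]])
             for s in letters}
    back = []
    for v in hits[1:]:
        new, ptr = {}, {}
        for s in letters:
            prev = max(letters, key=lambda r: delta[r] + math.log(trans[r][s]))
            new[s] = delta[prev] + math.log(trans[prev][s]) + math.log(emit[s][v])
            ptr[s] = prev
        delta, back = new, back + [ptr]
    # Trace back the best path from the best final state.
    s = max(letters, key=lambda r: delta[r])
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    return ''.join(reversed(path))

print(viterbi('abcb'))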
Speech Recognition: raw signal

[Figure: raw speech waveform, amplitude against time (0–0.9 s)]
'neural' representation

[Figure: time–frequency ('neural') representation of the speech signal]
Speech Recognition

[Figure: HMM with hidden phonemes pho1, ..., pho4 and observed audio aud1, ..., aud4]
pho – phoneme (letter); aud – audio signal (neural representation)
Medical Diagnosis
[Figure: belief network with diseases tumour, flu, meningitis and symptoms headache, fever, appetite, x-ray]
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Reverse DifferentiationConsider
f(x1 x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST)
x1 x2
f1
f2
f3
f1(x1 x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1 x2 we first run forwards through the tree so that we canassociate each node with an actual function value
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Reverse Differentiation
x1 x2
f1
f2
f3
df3dx1
=partf3partf2
df2dx1
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx1
Similarly
df3dx2
=partf3partf2
df2df1︸ ︷︷ ︸
df3df1
df1dx2
The two derivatives share the same computation branch andwe want to exploit this
Reverse Differentiation
x1 x2
f1
f2
f3
partf1partx1
= x2partf1partx2
= x1
partf2partf1
= cos(f1)
partf3partf2
= minus sin(f2)
1 Find the reverse ancestral (backwards) scheduleof nodes (f3 f2 f1 x1 x2)
2 Start with the first node n1 in the reverseschedule and define tn1 = 1
3 For the next node n in the reverse schedule findthe child nodes ch (n) Then define
tn =sum
cisinch(n)
partfcpartfn
tc
4 The total derivatives of f with respect to theroot nodes of the tree (here x1 and x2) are givenby the values of t at those nodes
This is a general procedure that can be used to automatically define a subroutineto efficiently compute the gradient It is efficient because information is collectedat nodes in the tree and split between parents only when required
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational Inference
Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and suppose that we wish to learn θ to maximise the probability that this model generates the observed data:

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log [p(v|h, θ) p(h)]

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.
We then jointly maximise the bound with respect to φ and θ.
We can parameterise p(v|h, θ) using a deep network.
Very popular approach – see the 'variational autoencoder' and also attention mechanisms.
Extension to semi-supervised learning using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h).
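A minimal numpy sketch of a single-sample estimate of this bound, in the variational-autoencoder style: q(h|v, φ) is Gaussian and the 'deep network' decoder is reduced to one linear layer for brevity. All parameter names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(v, phi_mu, phi_logvar, theta_W, theta_b):
    """One-sample Monte Carlo estimate of the bound for a single binary
    observation v, with Gaussian q(h|v, phi) and a Bernoulli decoder."""
    eps = rng.standard_normal(phi_mu.shape)
    h = phi_mu + np.exp(0.5 * phi_logvar) * eps            # h ~ q(h|v, phi)
    p = 1.0 / (1.0 + np.exp(-(theta_W @ h + theta_b)))     # decoder 'network'
    log_p_v_given_h = np.sum(v * np.log(p) + (1 - v) * np.log(1 - p))
    log_p_h = -0.5 * np.sum(h**2 + np.log(2 * np.pi))      # N(0, I) prior
    log_q_h = -0.5 * np.sum((h - phi_mu)**2 / np.exp(phi_logvar)
                            + phi_logvar + np.log(2 * np.pi))
    return log_p_v_given_h + log_p_h - log_q_h             # <= log p(v|theta)

# example shapes: 2-dim latent, 4 binary pixels
v = np.array([1.0, 0.0, 0.0, 1.0])
print(elbo_estimate(v, np.zeros(2), np.zeros(2),
                    rng.standard_normal((4, 2)), np.zeros(4)))
```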
DRAW (a recurrent variational autoencoder with attention that generates images step by step)
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games?
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take in each state of W that will be best for our long-term goals.
The problem is that the number of pixel states is enormous.
We need to learn a low-dimensional representation of the screen (using a deep generative model).
Then learn which action to take given the low-dimensional representation, as in the sketch below.
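The full Atari pipeline is beyond a slide, but the 'learn which action to take' step can be illustrated with tabular Q-learning on an invented toy chain world (the states standing in for a learned low-dimensional representation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2            # toy chain; action 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1     # learning rate, discount, exploration

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)   # reward at the right end

for episode in range(500):
    s = 0
    for t in range(20):
        # epsilon-greedy action choice
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s2, r = step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])   # TD update
        s = s2

print(Q.argmax(axis=1))   # learned action per state: should prefer 'right'
```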
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing.
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representation learning.
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
https://reinfer.io
Reverse Differentiation
(Figure: computation graph with inputs x_1, x_2 and nodes f_1 = x_1 x_2, f_2 = sin(f_1), f_3 = cos(f_2), as implied by the derivatives below.)

∂f_1/∂x_1 = x_2,  ∂f_1/∂x_2 = x_1
∂f_2/∂f_1 = cos(f_1)
∂f_3/∂f_2 = −sin(f_2)

1. Find the reverse ancestral (backwards) schedule of nodes (f_3, f_2, f_1, x_1, x_2).
2. Start with the first node n_1 in the reverse schedule and define t_{n_1} = 1.
3. For the next node n in the reverse schedule, find the child nodes ch(n). Then define

t_n = Σ_{c ∈ ch(n)} (∂f_c/∂f_n) t_c

4. The total derivatives of f with respect to the root nodes of the tree (here x_1 and x_2) are given by the values of t at those nodes.
This is a general procedure that can be used to automatically define a subroutine to efficiently compute the gradient. It is efficient because information is collected at nodes in the tree and split between parents only when required.
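A sketch of this procedure in Python, applied to the graph above (f_1 = x_1 x_2, f_2 = sin f_1, f_3 = cos f_2, as implied by the listed derivatives), with a finite-difference check:

```python
import numpy as np

def grad_f3(x1, x2):
    # forward pass: evaluate every node
    f1 = x1 * x2
    f2 = np.sin(f1)
    f3 = np.cos(f2)
    # reverse pass: t_n = sum over children c of (df_c/df_n) * t_c
    t_f3 = 1.0                       # first node of the reverse schedule
    t_f2 = -np.sin(f2) * t_f3        # df3/df2 = -sin(f2)
    t_f1 = np.cos(f1) * t_f2         # df2/df1 = cos(f1)
    t_x1 = x2 * t_f1                 # df1/dx1 = x2
    t_x2 = x1 * t_f1                 # df1/dx2 = x1
    return f3, (t_x1, t_x2)          # value and total derivatives at the roots

# check the reverse-mode gradient against a finite difference
v, (g1, g2) = grad_f3(0.7, -1.3)
h = 1e-6
num = (grad_f3(0.7 + h, -1.3)[0] - v) / h
print(g1, num)                       # should agree to ~1e-5
```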
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another, and requires complex reasoning using some form of internal model.
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Burglar Problem
Creaks and Bumps
(Figure panels: 'creak' and 'bump' sound examples.)
Burglar Model
(Figure: HMM with hidden positions pos_1:4 and observed sounds snd_1:4.)
pos – position in kitchen; snd – sound
Finding the Burglar
(Figure: a sequence of kitchen-grid snapshots showing the filtered distribution over the burglar's position as 'creak' and 'bump' observations arrive over time.)
Stubby Fingers
(Figure: HMM with hidden intended keys int_1:4 and observed hit keys hit_1:4.)
int – intended key; hit – hit key
Stubby Fingers: errors
(Figure: the error model p(hit|int) as a matrix over the keys a–z.)
Stubby Fingers: language
(Figure: the language model p(int_t|int_{t-1}) as a matrix over the keys a–z.)
Stubby Fingers
Given the typed sequence 'cwsykcak', what is the most likely word that this corresponds to? A simple recipe (see the sketch below):
List the 200 most likely hidden sequences.
Discard those that are not in a standard English dictionary.
Take the most likely proper English word as the intended typed word.
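The recipe above uses an N-best list; as a simpler sketch of the same machinery, here is plain Viterbi decoding (the single most likely intended sequence) for a discrete HMM. The tables pi, A and B stand for the language and error models and are assumed given.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden sequence argmax_{h_{1:T}} p(h_{1:T} | v_{1:T}).

    pi : (H,)   p(int_1)
    A  : (H, H) A[i, j] = p(int_t = i | int_{t-1} = j)   (language model)
    B  : (V, H) B[v, i] = p(hit = v | int = i)           (error model)
    Assumes strictly positive entries so the logs are finite.
    """
    T, H = len(obs), len(pi)
    logd = np.log(pi) + np.log(B[obs[0]])
    back = np.zeros((T, H), dtype=int)
    for t in range(1, T):
        scores = np.log(A) + logd[None, :]     # scores[i, j]: arrive at i from j
        back[t] = scores.argmax(axis=1)
        logd = np.log(B[obs[t]]) + scores.max(axis=1)
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):              # backtrack the best predecessors
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```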
Speech Recognition: raw signal
(Figure: raw audio waveform, amplitude against time in seconds.)
'neural' representation
(Figure: the corresponding 'neural' feature representation of the signal.)
Speech Recognition
(Figure: HMM with hidden phonemes pho_1:4 and observed audio features aud_1:4.)
pho – phoneme (letter); aud – audio signal (neural representation)
Medical Diagnosis
(Figure: belief network in which the diseases tumour, flu and meningitis are parents of the observations headache, fever, appetite and x-ray.)
Combine known medical knowledge with patient-specific information.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems, such as the Ising Model (1920), and in AI applications, such as the HMM (Baum 1966; Stratonovich 1960).
The need for structure
We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.).
Typically, the representational and computational cost of probabilistic models grows exponentially with the number of objects represented.
Without introducing strong structural limitations on how these objects can interact, probability is a non-starter.
For this reason, computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however, these are typically frowned upon by purists.
Graphical Models
We can use graphs to represent how objects can probabilistically interact with each other.
Graphical Models are then a marriage between graph theory and probability theory.
Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph.
The computational complexity of these operations can often be related to the structure of the graph.
Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science.
Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable.
Uses in Industry
Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model).
Hospitals: use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis.
Google, Microsoft, Facebook: used in many places, including advertising, video game prediction, speech recognition.
Used to estimate the inherent desirability of products in consumer retail.
Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company/user relationship.
Conditional Probability and Bayes' Rule
The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

p(x|y) ≡ p(x, y)/p(y) = p(y|x) p(x)/p(y)   (Bayes' rule)
Throwing darts

p(region 5 | not region 20) = p(region 5, not region 20)/p(not region 20) = p(region 5)/p(not region 20) = (1/20)/(19/20) = 1/19
Interpretation
p(A = a|B = b) should not be interpreted as 'given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation is 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.
Battleships
Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each.
They can be placed anywhere on the 10×10 grid, but cannot overlap.
Let s1 be the origin of ship 1 and s2 the origin of ship 2.
The data D is a collection of query 'hit' or 'miss' responses.

p(s1, s2|D) = p(D|s1, s2) p(s1, s2)/p(D)

Let X be the matrix of pixel occupancy:

p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2) p(s1, s2|D)

demoBattleships.m
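A small numpy sketch of this posterior, assuming a uniform prior over non-overlapping placements and noise-free hit/miss answers (so p(D|s1, s2) is 0 or 1); this is roughly what a demo like demoBattleships.m computes:

```python
import numpy as np

G, L = 10, 5                                   # grid size, ship length

def board(s1, s2):
    X = np.zeros((G, G), dtype=bool)
    X[s1[0]:s1[0] + L, s1[1]] = True           # ship 1: vertical from s1
    X[s2[0], s2[1]:s2[1] + L] = True           # ship 2: horizontal from s2
    return X

placements = [(s1, s2)
              for s1 in [(r, c) for r in range(G - L + 1) for c in range(G)]
              for s2 in [(r, c) for r in range(G) for c in range(G - L + 1)]
              if board(s1, s2).sum() == 2 * L]  # keep non-overlapping pairs

def posterior_occupancy(D):
    """p(X_ij occupied | D) by summing over placements consistent with D."""
    post, n = np.zeros((G, G)), 0
    for s1, s2 in placements:
        X = board(s1, s2)
        if all(X[q] == hit for q, hit in D):   # p(D | s1, s2) is 0 or 1
            post += X
            n += 1
    return post / n

# two example queries: a hit at (4, 4) and a miss at (0, 0)
print(posterior_occupancy([((4, 4), True), ((0, 0), False)]).round(2))
```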
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents.
The joint distribution is obtained by taking the product of the conditional probabilities:
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|C) p(E|B, C)

(Figure: the corresponding DAG, with A, B → C, C → D and B, C → E.)
Example – Part I
Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.
Choosing an ordering
Without loss of generality, we can write

p(A, R, E, B) = p(A|R, E, B) p(R, E, B)
             = p(A|R, E, B) p(R|E, B) p(E, B)
             = p(A|R, E, B) p(R|E, B) p(E|B) p(B)
Assumptions
The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B).
The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E).
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E).
Therefore

p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)
Example – Part II: Specifying the Tables

(Figure: the DAG B → A ← E, E → R.)

p(A = 1|B, E):
Burglar  Earthquake  Alarm = 1
1        1           0.9999
1        0           0.99
0        1           0.99
0        0           0.0001

p(R = 1|E):
Earthquake  Radio = 1
1           1
0           0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.
Example – Part III: Inference

Initial evidence: the alarm is sounding.

p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
             = Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E) ≈ 0.99

Additional evidence: the radio broadcasts an earthquake warning.
A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01.
Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
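The calculation is easy to verify by brute-force enumeration over the eight joint states; a short Python check using the tables above:

```python
import itertools

p_B = {1: 0.01, 0: 0.99}
p_E = {1: 0.000001, 0: 0.999999}
p_R_given_E = {1: {1: 1.0, 0: 0.0}, 0: {1: 0.0, 0: 1.0}}    # p(R | E)
p_A1_given_BE = {(1, 1): 0.9999, (1, 0): 0.99,
                 (0, 1): 0.99, (0, 0): 0.0001}              # p(A = 1 | B, E)

def joint(a, r, e, b):
    pa = p_A1_given_BE[(b, e)] if a == 1 else 1 - p_A1_given_BE[(b, e)]
    return pa * p_R_given_E[e][r] * p_E[e] * p_B[b]

def p_burglar(r_evidence=None):
    num = den = 0.0
    for b, e, r in itertools.product([0, 1], repeat=3):
        if r_evidence is not None and r != r_evidence:
            continue                                         # condition on R
        p = joint(1, r, e, b)                                # A = 1 observed
        den += p
        num += p if b == 1 else 0.0
    return num / den

print(p_burglar())               # p(B=1 | A=1)        ~ 0.99
print(p_burglar(r_evidence=1))   # p(B=1 | A=1, R=1)   ~ 0.01
```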
Markov Models
For timeseries data v_1, ..., v_T we need a model p(v_{1:T}). For causal consistency, it is meaningful to consider the decomposition

p(v_{1:T}) = ∏_{t=1}^{T} p(v_t|v_{1:t-1})

with the convention p(v_t|v_{1:t-1}) = p(v_1) for t = 1.

(Figure: belief network for the full decomposition, in which each v_t depends on all earlier variables.)

Independence assumptions
It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.
Markov Chain
Only the recent past is relevant:

p(v_t|v_1, ..., v_{t-1}) = p(v_t|v_{t-L}, ..., v_{t-1})

where L ≥ 1 is the order of the Markov chain. For a first-order chain,

p(v_{1:T}) = p(v_1) p(v_2|v_1) p(v_3|v_2) ... p(v_T|v_{T-1})

For a stationary Markov chain the transitions p(v_t = s'|v_{t-1} = s) = f(s', s) are time-independent ('homogeneous').

(Figure: (a) first-order Markov chain; (b) second-order Markov chain.)
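Under the stationarity assumption, the transition table can be estimated by simple counting; a quick sketch (the toy data are invented for illustration):

```python
import numpy as np

def fit_first_order(seq, n_states):
    """ML estimate of a stationary first-order transition matrix:
    count consecutive pairs, then normalise each column."""
    C = np.zeros((n_states, n_states))
    for j, i in zip(seq[:-1], seq[1:]):
        C[i, j] += 1                                       # transition j -> i
    col = np.maximum(C.sum(axis=0, keepdims=True), 1)      # avoid 0/0
    return C / col                  # M[i, j] = p(v_t = i | v_{t-1} = j)

seq = [0, 1, 2, 1, 0, 1, 2, 2, 1, 0]
print(fit_first_order(seq, 3))
```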
Markov Chains
(Figure: first-order chain v_1 → v_2 → v_3 → v_4.)

p(v_1, ..., v_T) = p(v_1) ∏_{t=2}^{T} p(v_t|v_{t-1})   (initial distribution × transitions)

State transition diagram
Nodes represent states of the variable v, and arcs non-zero elements of the transition p(v_t|v_{t-1}).

(Figure: a state-transition diagram on the states 1–9.)
Most probable and shortest paths
(Figure: the same state-transition diagram on states 1–9.)

The shortest (unweighted) path from state 1 to state 7 is 1–2–7.
The most probable path from state 1 to state 7 is 1–8–9–7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1–2–7 the probability of exiting state 2 into state 7 is only 1/5.
Equilibrium distribution
It is interesting to know how the marginal p(x_t) evolves through time:

p(x_t = i) = Σ_j p(x_t = i|x_{t-1} = j) p(x_{t-1} = j),  with M_ij ≡ p(x_t = i|x_{t-1} = j)

p(x_t = i) is the frequency with which we visit state i at time t, given we started from p(x_1) and randomly drew samples from the transition p(x_τ|x_{τ-1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

p_t = M^{t-1} p_1

If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:

p_∞ = M p_∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
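A quick power-iteration sketch: repeatedly applying M to any initial distribution converges to p∞ when the equilibrium exists, and the fixed point satisfies p∞ = M p∞. The example matrix is invented.

```python
import numpy as np

def equilibrium(M, tol=1e-12, max_iter=10000):
    p = np.full(M.shape[0], 1.0 / M.shape[0])   # any initial distribution p_1
    for _ in range(max_iter):
        p_new = M @ p                           # p_t = M p_{t-1}
        if np.abs(p_new - p).max() < tol:
            break
        p = p_new
    return p

M = np.array([[0.9, 0.5],
              [0.1, 0.5]])     # columns sum to 1: M[i, j] = p(i | j)
p = equilibrium(M)
print(p, M @ p)                # at equilibrium, p and Mp agree (~[0.833, 0.167])
```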
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence
Solving chess problems is another and requires complex reasoning using someform of internal model
The world is noisy and information may be conflicting
Recognised that new approaches are required
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website, the equilibrium distribution component p∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain word w is then returned, ranked according to the importance of the site.
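A toy, self-contained sketch of this construction (the 3-site link matrix is made up for illustration; real PageRank also adds damping to handle dangling pages, omitted here):

import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 1, 0]], dtype=float)   # A[i, j] = 1 if site j links to site i
M = A / A.sum(axis=0)                    # column-normalise: M[i, j] = p(next = i | current = j)

p = np.ones(3) / 3
for _ in range(100):                     # power iteration towards the equilibrium
    p = M @ p
print(p / p.sum())                       # relative 'importance' of each site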
Hidden Markov Models
The HMM defines a Markov chain on hidden (or 'latent') variables h1:T. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(vt|ht). This defines a joint distribution
p(h1:T, v1:T) = p(v1|h1) p(h1) ∏_{t=2}^{T} p(vt|ht) p(ht|ht−1)
For a stationary HMM the transition p(ht|ht−1) and emission p(vt|ht) distributions are constant through time.
Figure: a first-order hidden Markov model with 'hidden' variables dom(ht) = {1, . . . , H}, t = 1, . . . , T. The 'visible' variables vt can be either discrete or continuous.
The classical inference problems
Filtering (inferring the present): p(ht|v1:t)
Prediction (inferring the future): p(ht|v1:s), t > s
Smoothing (inferring the past): p(ht|v1:u), t < u
Likelihood: p(v1:T)
Most likely path (Viterbi alignment): argmax_{h1:T} p(h1:T|v1:T)

For prediction one is also often interested in p(vt|v1:s) for t > s.
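As an illustration, filtering follows a simple forward recursion: αt ∝ p(vt|ht) Σ_{h_{t−1}} p(ht|h_{t−1}) α_{t−1}. A minimal sketch for a discrete HMM (the array conventions and toy numbers are my own, not the slides'):

import numpy as np

def filtering(trans, emis, obs, p1):
    # trans[i, j] = p(h_t = i | h_{t-1} = j); emis[v, h] = p(v_t = v | h_t = h); p1[h] = p(h_1)
    alpha = emis[obs[0]] * p1
    alpha /= alpha.sum()
    out = [alpha]
    for v in obs[1:]:
        alpha = emis[v] * (trans @ alpha)
        alpha /= alpha.sum()               # normalised: p(h_t | v_{1:t})
        out.append(alpha)
    return np.array(out)

trans = np.array([[0.8, 0.3],
                  [0.2, 0.7]])
emis = np.array([[0.9, 0.1],
                 [0.1, 0.9]])
print(filtering(trans, emis, [0, 0, 1], np.array([0.5, 0.5])))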
Inference in Hidden Markov Models
Belief network representation of an HMM: [Figure: chain h1 → h2 → h3 → h4 with emissions ht → vt]
Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).

The algorithms are variants of 'message passing on factor graphs'.

The algorithms are guaranteed to work if the graph is singly-connected.

There has been a huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).
HMMs for speech recognition
ht is the phoneme at time t; p(ht|ht−1) – language model; p(vt|ht) – speech signal model.
Deep Nets and HMMs
[Figure: HMM h1 → h2 → h3 → h4 with emissions ht → vt]
Recently companies including Google have made big advances in speech recognition.
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is some function of the phoneme, μ(ht; θ).
This function is a deep neural network trained on a large amount of data
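A toy sketch of this emission model, with one hidden layer standing in for 'deep' and made-up sizes (real systems are far larger):

import numpy as np

def neural_mean(h_onehot, W1, W2):
    # mu(h; theta): a small neural network applied to a one-hot phoneme encoding
    return W2 @ np.tanh(W1 @ h_onehot)

def log_emission(v, h_onehot, W1, W2, sigma2=1.0):
    # log N(v; mu(h), sigma2 * I)
    mu = neural_mean(h_onehot, W1, W2)
    return -0.5 * np.sum((v - mu) ** 2) / sigma2 - 0.5 * v.size * np.log(2 * np.pi * sigma2)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 40)), rng.normal(size=(25, 8))  # 40 phonemes, 25-dim audio features
print(log_emission(rng.normal(size=25), np.eye(40)[3], W1, W2))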
There is a goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
[Figure: latent variables h1, h2 with edges to v1, . . . , v4]
It is natural to consider that objects (images, for example) can be constructed on the basis of a low-dimensional representation.

Note that this is a Graphical Model, not a Function.

The latent variables h can be sampled using p(h), and an image then sampled from p(v|h). One cannot use an autoencoder to generate new images.
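This sampling procedure is ancestral sampling: draw the parent, then the child. A linear-Gaussian toy version (the decoder weights and noise scale are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
H, V = 2, 4
W = rng.normal(size=(V, H))              # illustrative decoder weights

h = rng.normal(size=H)                   # h ~ p(h) = N(0, I)
v = rng.normal(loc=W @ h, scale=0.1)     # v ~ p(v|h) = N(W h, 0.1^2 I)
print(v)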
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in these models.

Statisticians typically use sampling as an approximation.

It is very popular in ML to use a variational method – much faster for inference.
Variational Inference
Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and suppose that we wish to learn θ to maximise the probability that this model generates the observed data. A variational lower bound is

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) + const.
The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound w.r.t. φ and θ.

We can parameterise p(v|h, θ) using a deep network.

This is a very popular approach – see the 'variational autoencoder', and also attention mechanisms.
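For intuition, here is a single-sample, reparameterised estimate of this bound for a one-dimensional toy model; the choices q(h|v,φ) = N(μφ, σφ²), p(h) = N(0,1) and p(v|h,θ) = N(θh, 1) are mine, not the slides':

import numpy as np

def elbo_estimate(v, mu_phi, log_sig_phi, theta, rng):
    eps = rng.normal()
    h = mu_phi + np.exp(log_sig_phi) * eps      # reparameterised sample h ~ q(h|v, phi)
    log_q = -0.5 * ((h - mu_phi) / np.exp(log_sig_phi)) ** 2 - log_sig_phi - 0.5 * np.log(2 * np.pi)
    log_prior = -0.5 * h ** 2 - 0.5 * np.log(2 * np.pi)
    log_lik = -0.5 * (v - theta * h) ** 2 - 0.5 * np.log(2 * np.pi)
    return log_lik + log_prior - log_q          # unbiased single-sample bound estimate

rng = np.random.default_rng(0)
print(elbo_estimate(v=1.3, mu_phi=0.5, log_sig_phi=-1.0, theta=1.0, rng=rng))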
Extension to semi-supervised methods using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h).
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games?
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take in any state of W that will be best for our long-term goals.

The problem is that the number of pixel states is enormous.

We need to learn a low-dimensional representation of the screen (using a deep generative model).

Then learn which action to take given the low-dimensional representation.
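A minimal sketch of the second step, tabular Q-learning over a discrete low-dimensional state; `env` and `encode` are assumed stand-ins (an environment with reset/step, and a learned encoder mapping pixels to a small state index), not the slides' actual method:

import numpy as np

def q_learning(env, encode, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1, rng=np.random.default_rng(0)):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = encode(env.reset()), False
        while not done:
            # epsilon-greedy action choice
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            obs, r, done = env.step(a)
            s2 = encode(obs)
            # temporal-difference update towards the long-term return
            Q[s, a] += alpha * (r + gamma * Q[s2].max() * (not done) - Q[s, a])
            s = s2
    return Q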
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing.
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representation learning.
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
https://reinfer.io
Models
Burglar Problem
Creaks and Bumps
[Figure panels: 'creak', 'bump']
Burglar Model
[Figure: HMM pos1 → pos2 → pos3 → pos4 with emissions post → sndt]
pos – position in kitchen; snd – sound
Finding the Burglar

[Figure: the filtered distribution over the burglar's kitchen position, updated as the observed sequence of creaks and bumps arrives]
Stubby Fingers

[Figure: HMM int1 → int2 → int3 → int4 with emissions intt → hitt]
int – intended key; hit – hit key
Stubby Fingers: errors

[Figure: emission matrix p(hit|int) over the keys a–z; values range from about 0.05 to 0.55]
Stubby Fingers: language

[Figure: transition matrix p(intt|intt−1) over the keys a–z; values range from 0 to about 0.9]
Stubby Fingers
Given the typed sequence 'cwsykcak', what is the most likely word that this corresponds to?
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
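A sketch of this decode-and-filter procedure, using beam search as an approximation to listing the K most likely hidden sequences (the toy tables and two-letter 'dictionary' below are invented):

def top_k_sequences(obs, states, trans, emis, p1, k=200):
    # beams: list of (hidden sequence, joint probability), kept sorted, best first
    beams = sorted((((s,), p1[s] * emis[s][obs[0]]) for s in states),
                   key=lambda b: -b[1])[:k]
    for v in obs[1:]:
        beams = [(seq + (s2,), p * trans[seq[-1]][s2] * emis[s2][v])
                 for seq, p in beams for s2 in states]
        beams = sorted(beams, key=lambda b: -b[1])[:k]
    return beams

def best_word(beams, dictionary):
    for seq, p in beams:                  # beams are sorted by probability
        word = "".join(seq)
        if word in dictionary:
            return word
    return None

states = ("a", "b")
trans = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.4, "b": 0.6}}
emis = {"a": {"x": 0.9, "y": 0.1}, "b": {"x": 0.2, "y": 0.8}}
p1 = {"a": 0.5, "b": 0.5}
print(best_word(top_k_sequences("xy", states, trans, emis, p1, k=4), {"ab", "ba"}))  # 'ab'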
Speech Recognition: raw signal

[Figure: raw audio waveform, amplitude against time in seconds]
'Neural' representation

[Figure: 'neural' representation of the same signal (time–feature image)]
Speech Recognition
[Figure: HMM pho1 → pho2 → pho3 → pho4 with emissions phot → audt]
pho – phoneme (letter); aud – audio signal (neural representation)
Medical Diagnosis
[Figure: belief network with diseases (tumour, flu, meningitis) as parents of symptoms (headache, fever, appetite, x-ray)]
Combine known medical knowledge with patient-specific information.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems, such as the Ising Model (1920), and in AI applications such as the HMM (Baum 1966; Stratonovich 1960).
The need for structure
We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.).

Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented.

Without introducing strong structural limitations on how these objects can interact, probability is a non-starter.

For this reason, computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however, these are typically frowned upon by purists.
Graphical Models
We can use graphs to represent how objects can probabilistically interact with each other.

Graphical Models are then a marriage between Graph and Probability theory.

Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph.

The computational complexity of operations can often be related to the structure of the graph.

Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science.

Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable.
Uses in Industry
Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model).

Hospitals: use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis.

Google, Microsoft, Facebook: used in many places, including advertising, video game prediction and speech recognition.

Used to estimate the inherent desirability of products in consumer retail.

Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company–user relationship.
Conditional Probability and Bayes' Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) ≡ p(x, y)/p(y) = p(y|x) p(x)/p(y)   (Bayes' rule)
Throwing darts
p(region 5|not region 20) = p(region 5, not region 20)/p(not region 20) = p(region 5)/p(not region 20) = (1/20)/(19/20) = 1/19
Interpretation
p(A = a|B = b) should not be interpreted as 'given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation is 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.
Battleships
Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each.

Each can be placed anywhere on the 10×10 grid, but they cannot overlap.

Let s1 be the origin of ship 1 and s2 the origin of ship 2.

The data D is a collection of query 'hit' or 'miss' responses.

p(s1, s2|D) = p(D|s1, s2) p(s1, s2)/p(D)

Let X be the matrix of pixel occupancy:

p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2) p(s1, s2|D)
demoBattleships.m
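The slides' demo is MATLAB (demoBattleships.m); the same posterior can be sketched in a few lines of Python by enumerating all consistent placements under a uniform prior (the example queries at the end are invented):

import numpy as np

def placements():
    for i in range(6):                    # vertical ship: rows i..i+4 of column j
        for j in range(10):
            s1 = {(i + k, j) for k in range(5)}
            for a in range(10):           # horizontal ship: columns b..b+4 of row a
                for b in range(6):
                    s2 = {(a, b + k) for k in range(5)}
                    if not s1 & s2:       # ships cannot overlap
                        yield s1 | s2

def occupancy_posterior(data):
    # data: {(row, col): True for 'hit', False for 'miss'}
    post, total = np.zeros((10, 10)), 0
    for occ in placements():
        if all((cell in occ) == hit for cell, hit in data.items()):
            for cell in occ:
                post[cell] += 1           # uniform prior over consistent placements
            total += 1
    return post / total                   # p(X_ij = 1 | D)

print(occupancy_posterior({(0, 0): False, (4, 5): True}).round(2))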
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents.

The joint distribution is obtained by taking the product of the conditional probabilities:

p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|C) p(E|B, C)
[Figure: DAG with edges A → C, B → C, C → D, B → E, C → E; node E carries the table p(E|B, C)]
Example – Part I
Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.
Choosing an ordering
Without loss of generality, we can write

p(A, R, E, B) = p(A|R, E, B) p(R, E, B)
             = p(A|R, E, B) p(R|E, B) p(E, B)
             = p(A|R, E, B) p(R|E, B) p(E|B) p(B)
Assumptions
The alarm is not directly influenced by any report on the radio: p(A|R,E,B) = p(A|E,B)
The radio broadcast is not directly influenced by the burglar variable: p(R|E,B) = p(R|E)
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E)
Therefore
p(A,R,E,B) = p(A|E,B) p(R|E) p(E) p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Creaks and Bumps
Creak Bump
Burglar Model
pos1 pos2 pos3 pos4
snd1 snd2 snd3 snd4
pos - position in kitchensnd ndash sound
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time:

p(xt = i) = Σ_j M_ij p(xt−1 = j),   where M_ij ≡ p(xt = i|xt−1 = j)

p(xt = i) is the frequency that we visit state i at time t, given we started from p(x1) and randomly drew samples from the transition p(xτ|xτ−1). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p1 is

pt = M^{t−1} p1

If, for t → ∞, p∞ is independent of the initial distribution p1, then p∞ is called the equilibrium distribution of the chain:

p∞ = M p∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
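To illustrate pt = M^{t−1} p1 converging to p∞, here is a small numpy sketch with an invented 3-state transition matrix (columns sum to one, following the M_ij convention above):

```python
import numpy as np

# Hypothetical transition matrix: column j holds p(x_t = . | x_{t-1} = j).
M = np.array([[0.5, 0.2, 0.1],
              [0.3, 0.6, 0.2],
              [0.2, 0.2, 0.7]])

# Power iteration: repeatedly applying M washes out the initial distribution.
p = np.array([1.0, 0.0, 0.0])        # arbitrary initial distribution p1
for _ in range(200):
    p = M @ p
print(p)                             # the equilibrium distribution p_inf

# Equivalently: the eigenvector of M with eigenvalue 1, normalised to sum to 1.
vals, vecs = np.linalg.eig(M)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
print(v / v.sum())
```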
PageRank
Define the matrix
A_ij = 1 if website j has a hyperlink to website i, and 0 otherwise.

From this we can define a Markov transition matrix with elements

M_ij = A_ij / Σ_{i′} A_{i′j}

If we jump from website to website, the equilibrium distribution component p∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i, a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain the word is then returned, ranked according to the importance of the site.
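A sketch of this construction on an invented four-website link matrix (the production algorithm also adds a damping factor to handle dangling and disconnected pages, omitted here):

```python
import numpy as np

# A[i, j] = 1 if website j has a hyperlink to website i (toy data).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0)     # M_ij = A_ij / sum_i' A_i'j (column-normalise)
p = np.full(4, 0.25)      # start from a uniform distribution over sites
for _ in range(100):      # power iteration to the equilibrium distribution
    p = M @ p
print(p)                  # p_inf(i): the 'importance' of website i
```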
Hidden Markov Models
The HMM defines a Markov chain on hidden (or 'latent') variables h1:T. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(vt|ht). This defines a joint distribution

p(h1:T, v1:T) = p(v1|h1) p(h1) ∏_{t=2}^{T} p(vt|ht) p(ht|ht−1)

For a stationary HMM the transition p(ht|ht−1) and emission p(vt|ht) distributions are constant through time.

[Figure: a first order hidden Markov model with 'hidden' variables dom(ht) = {1, . . . , H}, t = 1, . . . , T. The 'visible' variables vt can be either discrete or continuous.]
The classical inference problems
Filtering (inferring the present): p(ht|v1:t)
Prediction (inferring the future): p(ht|v1:s), t > s
Smoothing (inferring the past): p(ht|v1:u), t < u
Likelihood: p(v1:T)
Most likely path (Viterbi alignment): argmax_{h1:T} p(h1:T|v1:T)

For prediction, one is also often interested in p(vt|v1:s) for t > s.
Inference in Hidden Markov Models
Belief network representation of a HMM:

[Figure: hidden chain h1 → h2 → h3 → h4 with emissions ht → vt]

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).

The algorithms are variants of 'message passing on factor graphs'.

The algorithms are guaranteed to work if the graph is singly-connected.

There has been a huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).
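As an example of these message passing recursions, here is a minimal numpy sketch of the filtering (forward) pass; the matrix conventions are assumptions of this sketch, not fixed by the slides:

```python
import numpy as np

def filtering(v, p1, T, E):
    """Forward pass for a discrete HMM.
    v  : observed symbols (list of ints)
    p1 : initial distribution p(h1), shape (H,)
    T  : transition matrix, T[i, j] = p(h_t = i | h_{t-1} = j)
    E  : emission matrix,  E[k, i] = p(v_t = k | h_t = i)
    Returns the filtered posteriors p(h_t | v_{1:t}) and log p(v_{1:T})."""
    alphas, loglik, prior = [], 0.0, p1
    for vt in v:
        f = E[vt] * prior          # unnormalised p(h_t | v_{1:t})
        loglik += np.log(f.sum())  # f.sum() = p(v_t | v_{1:t-1})
        alpha = f / f.sum()
        alphas.append(alpha)
        prior = T @ alpha          # one-step prediction p(h_{t+1} | v_{1:t})
    return np.array(alphas), loglik

# Toy two-state example (numbers invented for illustration).
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
E = np.array([[0.8, 0.3],
              [0.2, 0.7]])
print(filtering([0, 0, 1], np.array([0.5, 0.5]), T, E))
```

Each update costs O(H²), which is the quadratic scaling in the number of hidden states noted above.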
HMMs for speech recognition
ht is the phoneme at time t; p(ht|ht−1) – language model; p(vt|ht) – speech signal model.
Deep Nets and HMMs
[Figure: hidden chain h1 → h2 → h3 → h4 with emissions ht → vt]

Recently, companies including Google have made big advances in speech recognition.

The breakthrough is to model p(vt|ht) as a Gaussian whose mean is some function of the phoneme, μ(ht; θ).

This function is a deep neural network, trained on a large amount of data.

There is a goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
[Figure: generative model with latent variables h1, h2 as parents of visibles v1, . . . , v4]

It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation.

Note that this is a Graphical Model, not a Function: the latent variables h can be sampled from p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.
The bad news
Inference (computing p(h|v)) and parameter learning are intractable in these models.

Statisticians typically use sampling as an approximation.

It is very popular in ML to use a variational method – much faster for inference.
Variational Inference
Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and suppose that we wish to learn θ to maximise the probability this model generates observed data. Then

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) + const.

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.

We then jointly maximise the bound w.r.t. φ and θ.

We can parameterise p(v|h, θ) using a deep network.

This is a very popular approach – see the 'variational autoencoder' and also attention mechanisms.

There is an extension to a semi-supervised method using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h).
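To make the bound concrete, here is a toy Monte Carlo check in numpy for a one-dimensional linear-Gaussian model, where the exact marginal likelihood is available in closed form; the model and the Gaussian q are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def elbo(v, theta, m, s, n=100_000):
    """Monte Carlo estimate of the variational lower bound on log p(v|theta),
    for p(h) = N(0, 1), p(v|h) = N(theta*h, 1), q(h|v) = N(m, s^2)."""
    h = m + s * rng.standard_normal(n)              # samples from q(h|v)
    return np.mean(log_normal(v, theta * h, 1.0)    #  E_q[log p(v|h, theta)]
                   + log_normal(h, 0.0, 1.0)        # +E_q[log p(h)]
                   - log_normal(h, m, s ** 2))      # -E_q[log q(h|v)]

v, theta = 1.5, 2.0
exact = log_normal(v, 0.0, theta ** 2 + 1.0)  # log p(v|theta) = log N(v; 0, theta^2+1)
pv = 1.0 / (1.0 + theta ** 2)                 # exact posterior variance
print(elbo(v, theta, m=theta * v * pv, s=np.sqrt(pv)), exact)  # bound is tight here
print(elbo(v, theta, m=0.0, s=1.0))                            # a worse q: strictly lower
```

With q set to the exact posterior the bound matches log p(v|θ); any other q gives a smaller value, which is what makes joint maximisation over (φ, θ) sensible.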
DRAW (Deep Recurrent Attentive Writer)
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games?
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals.

The problem is that the number of pixel states is enormous.

We need to learn a low dimensional representation of the screen (using a deep generative model).

Then learn which action to take given the low dimensional representation.
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing.
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve the interaction between reinforcement learning and representation learning.
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer:
https://reinfer.io
Burglar Model
[Figure: HMM with hidden positions pos1 → pos2 → pos3 → pos4 and observed sounds snd1, . . . , snd4]

pos – position in kitchen; snd – sound
Finding the Burglar

[Figure: filtering on the kitchen grid, with the observed 'creak' and 'bump' sounds marked on squares over time]
Stubby Fingers

[Figure: HMM with hidden intended keys int1 → int2 → int3 → int4 and observed hit keys hit1, . . . , hit4]

int – intended key; hit – hit key
Stubby Fingers: errors

[Figure: heatmap of the key-error distribution p(hit|int) over keys a–z; colour scale 0.05–0.55]
Stubby Fingers: language

[Figure: heatmap of the letter language model (transitions between keys a–z); colour scale 0–0.9]
Stubby Fingers
Given the typed sequence cwsykcak, what is the most likely word that this corresponds to?
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
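The slides use a 200-best list; as a simpler sketch of the same machinery, here is a 1-best Viterbi decoder in numpy (conventions as in the filtering sketch above; probabilities are assumed strictly positive so the logs stay finite):

```python
import numpy as np

def viterbi(v, p1, T, E):
    """argmax_h p(h_{1:T} | v_{1:T}) for a discrete HMM.
    T[i, j] = p(h_t = i | h_{t-1} = j);  E[k, i] = p(v_t = k | h_t = i)."""
    n = len(v)
    delta = np.log(p1) + np.log(E[v[0]])       # best log-prob ending in each state
    back = np.zeros((n, len(p1)), dtype=int)   # backpointers
    for t in range(1, n):
        scores = np.log(T) + delta             # scores[i, j]: come from j, land in i
        back[t] = scores.argmax(axis=1)
        delta = np.log(E[v[t]]) + scores.max(axis=1)
    path = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):              # trace the backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

For the typing model, the states would be the 26 intended keys, T the letter language model, and E the stubby-fingers error matrix; Viterbi gives the single best intended sequence, which the slides generalise to a 200-best list filtered through a dictionary.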
Speech Recognition: raw signal

[Figure: raw audio waveform, amplitude against time (s)]
'Neural' representation

[Figure: the 'neural' representation of the signal, 25 channels over 80 time frames]
Speech Recognition
[Figure: HMM with hidden phonemes pho1 → pho2 → pho3 → pho4 and observed audio aud1, . . . , aud4]

pho – phoneme (letter); aud – audio signal (neural representation)
Medical Diagnosis
[Figure: belief network with diseases (tumour, flu, meningitis) as parents of symptoms and tests (headache, fever, appetite, x-ray)]

Combine known medical knowledge with patient-specific information.
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Finding the Burglar
creak creak
bump
creak
bump bump
creak
bump bump bump
creak
bump
Stubby Fingers
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
[graph: v1 → v2 → v3 → v4]

p(v_1, ..., v_T) = p(v_1) ∏_{t=2}^{T} p(v_t|v_{t−1})    (initial term × transition terms)

State transition diagram
Nodes represent states of the variable v, and arcs non-zero elements of the transition p(v_t|v_{t−1}).

[state transition diagram over states 1–9]
Most probable and shortest paths
[state transition diagram over states 1–9]

The shortest (unweighted) path from state 1 to state 7 is 1–2–7.
The most probable path from state 1 to state 7 is 1–8–9–7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1–2–7 the probability of exiting state 2 into state 7 is only 1/5.
Equilibrium distribution
It is interesting to know how the marginal p(x_t) evolves through time:

p(x_t = i) = Σ_j p(x_t = i|x_{t−1} = j) p(x_{t−1} = j),    with M_ij ≡ p(x_t = i|x_{t−1} = j)

p(x_t = i) is the frequency with which we visit state i at time t, given we started from p(x_1) and randomly drew samples from the transition p(x_τ|x_{τ−1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

p_t = M^{t−1} p_1

If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:

p_∞ = M p_∞

The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
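Both characterisations of p_∞ are easy to check numerically; a small sketch (the 3-state transition matrix is invented for illustration):

    import numpy as np

    M = np.array([[0.9, 0.2, 0.1],
                  [0.05, 0.7, 0.3],
                  [0.05, 0.1, 0.6]])

    # Power iteration: p_t = M^{t-1} p_1 converges to the equilibrium distribution.
    p = np.array([1.0, 0.0, 0.0])            # any initial distribution p_1
    for _ in range(1000):
        p = M @ p
    print(p)

    # Equivalently, the eigenvector of M with eigenvalue 1, normalised to sum to 1.
    vals, vecs = np.linalg.eig(M)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    print(v / v.sum())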
PageRank

Define the matrix

A_ij = 1 if website j has a hyperlink to website i, and 0 otherwise.

From this we can define a Markov transition matrix with elements

M_ij = A_ij / Σ_{i′} A_{i′j}

If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.

For each website i, a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain word w is then returned, ranked according to the importance of the site.
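A sketch on a toy link graph (the adjacency matrix is invented, and the damping term used in practice is omitted):

    import numpy as np

    # A[i, j] = 1 if site j links to site i.
    A = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 1],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    M = A / A.sum(axis=0, keepdims=True)    # column-normalise: M_ij = A_ij / sum_i' A_i'j

    p = np.full(4, 0.25)                    # start uniform
    for _ in range(200):                    # power iterate to the equilibrium p_inf
        p = M @ p
    print(p)                                # 'importance' of each site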
Hidden Markov Models

The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution

p(h_{1:T}, v_{1:T}) = p(v_1|h_1) p(h_1) ∏_{t=2}^{T} p(v_t|h_t) p(h_t|h_{t−1})

For a stationary HMM the transition p(h_t|h_{t−1}) and emission p(v_t|h_t) distributions are constant through time.

Figure: a first-order hidden Markov model with 'hidden' variables dom(h_t) = {1, ..., H}, t = 1, ..., T. The 'visible' variables v_t can be either discrete or continuous.
The classical inference problems

Filtering (inferring the present): p(h_t|v_{1:t})
Prediction (inferring the future): p(h_t|v_{1:s}), t > s
Smoothing (inferring the past): p(h_t|v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): argmax_{h_{1:T}} p(h_{1:T}|v_{1:T})

For prediction, one is also often interested in p(v_t|v_{1:s}) for t > s.
Inference in Hidden Markov Models

Belief network representation of a HMM:

[graph: chain h1 → h2 → h3 → h4, with emissions h_t → v_t]

Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).
The algorithms are variants of 'message passing on factor graphs'.
The algorithms are guaranteed to work if the graph is singly connected.
There has been a huge research effort in the last 15 years to apply message passing for approximate inference in multiply connected graphs (e.g. low-density parity-check codes).
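A minimal sketch of the filtering recursion (the forward algorithm) for discrete states and observations; the transition and emission tables here are randomly generated placeholders:

    import numpy as np

    H, V = 3, 4                                   # number of hidden states, observations
    rng = np.random.default_rng(1)
    p1 = np.full(H, 1.0 / H)                      # p(h_1)
    trans = rng.dirichlet(np.ones(H), size=H).T   # trans[i, j] = p(h_t = i | h_{t-1} = j)
    emit = rng.dirichlet(np.ones(V), size=H).T    # emit[v, h] = p(v_t = v | h_t = h)

    def filtering(obs):
        # Returns p(h_t | v_{1:t}) for each t; cost is linear in the sequence
        # length and quadratic in the number of hidden states, as stated above.
        alpha = emit[obs[0]] * p1                 # proportional to p(h_1 | v_1)
        out = [alpha / alpha.sum()]
        for v in obs[1:]:
            alpha = emit[v] * (trans @ out[-1])   # predict with trans, correct with emit
            out.append(alpha / alpha.sum())
        return np.array(out)

    print(filtering([0, 2, 3, 1]))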
HMMs for speech recognition
h_t is the phoneme at time t; p(h_t|h_{t−1}) – language model; p(v_t|h_t) – speech signal model.
Deep Nets and HMMs

[graph: chain h1 → h2 → h3 → h4, with emissions h_t → v_t]

Recently, companies including Google have made big advances in speech recognition.
The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function of the phoneme, μ(h_t; θ).
This function is a deep neural network, trained on a large amount of data.
There is a goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
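As a rough sketch of such an emission model (the network architecture and dimensions are invented for illustration, not taken from the systems mentioned):

    import numpy as np

    H, D = 40, 13                                 # phonemes, acoustic feature dimension
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(64, H)) * 0.1, np.zeros(64)
    W2, b2 = rng.normal(size=(D, 64)) * 0.1, np.zeros(D)

    def mu(h):
        # Deep-net mean mu(h; theta): one-hot phoneme -> hidden layer -> mean vector.
        x = np.zeros(H)
        x[h] = 1.0
        return W2 @ np.tanh(W1 @ x + b1) + b2

    def log_emission(v, h, sigma2=1.0):
        # log p(v_t | h_t) = log N(v_t; mu(h_t), sigma2 * I)
        d = v - mu(h)
        return -0.5 * (D * np.log(2 * np.pi * sigma2) + d @ d / sigma2)

    print(log_emission(rng.normal(size=D), h=7))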
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
[graph: latents h1, h2 each with edges to v1, v2, v3, v4]

It is natural to consider that objects (images, for example) can be constructed on the basis of a low-dimensional representation.
Note that this is a graphical model, not a function.
The latent variables h can be sampled from p(h), and an image then sampled from p(v|h). One cannot use an autoencoder to generate new images.
The bad news

Inference (computing p(h|v)) and parameter learning are intractable in these models.
Statisticians typically use sampling as an approximation.
It is very popular in ML to use a variational method – much faster for inference.
Variational Inference

Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and suppose that we wish to learn θ to maximise the probability that this model generates the observed data. A variational lower bound is

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ)p(h)

The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.
We then jointly maximise the bound with respect to φ and θ.
We can parameterise p(v|h, θ) using a deep network.
This is a very popular approach – see the 'variational autoencoder', and also attention mechanisms.
An extension to a semi-supervised method uses p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h).
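A minimal sketch of estimating this bound by sampling, for a Gaussian prior p(h) = N(0, I), a factorised Gaussian q(h|v, φ), and a toy linear 'decoder' for p(v|h, θ) (all names and dimensions here are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    Dh, Dv = 2, 5
    W = rng.normal(size=(Dv, Dh)) * 0.5           # decoder: p(v|h) = N(W h, I)

    def elbo(v, mu_q, log_var_q, n_samples=100):
        # log p(v) >= E_q[ log p(v|h) + log p(h) - log q(h|v) ]
        total = 0.0
        for _ in range(n_samples):
            eps = rng.normal(size=Dh)
            h = mu_q + np.exp(0.5 * log_var_q) * eps          # reparameterised sample
            log_p_v_h = -0.5 * np.sum((v - W @ h) ** 2) - 0.5 * Dv * np.log(2 * np.pi)
            log_p_h = -0.5 * np.sum(h ** 2) - 0.5 * Dh * np.log(2 * np.pi)
            log_q = -0.5 * np.sum((h - mu_q) ** 2 / np.exp(log_var_q)) \
                    - 0.5 * np.sum(log_var_q) - 0.5 * Dh * np.log(2 * np.pi)
            total += log_p_v_h + log_p_h - log_q
        return total / n_samples

    # In practice one maximises this jointly over phi = (mu_q, log_var_q) and theta = W.
    v = rng.normal(size=Dv)
    print(elbo(v, mu_q=np.zeros(Dh), log_var_q=np.zeros(Dh)))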
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games?
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take in any state of W so as to best serve our long-term goals.
The problem is that the number of pixel states is enormous.
We need to learn a low-dimensional representation of the screen (using a deep generative model).
We then learn which action to take given the low-dimensional representation, as in the sketch below.
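The action-selection step can be illustrated with tabular Q-learning over such a compact state space (a generic sketch with an invented toy environment – not the deep RL method actually used for Atari):

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions = 16, 4
    Q = np.zeros((n_states, n_actions))
    alpha, gamma, eps = 0.1, 0.95, 0.1

    def step(s, a):
        # Hypothetical toy environment: random transition, reward only in the last state.
        s_next = int(rng.integers(n_states))
        return s_next, float(s_next == n_states - 1)

    s = 0
    for _ in range(10000):
        # Epsilon-greedy exploration over the learned action values.
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # Q-learning update: move Q(s,a) towards r + gamma * max_a' Q(s',a').
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

    print(Q.max(axis=1))                  # value of each state under the greedy policy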
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing.
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve the interaction between reinforcement learning and representation learning.
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company, reinfer: https://reinfer.io
Stubby Fingers

[graph: chain int1 → int2 → int3 → int4, with emissions int_t → hit_t]

int – intended key; hit – hit key.
Stubby Fingers: errors

[plot: key confusion matrix p(hit|int) over a–z (intended) × a–z (hit); probability scale 0.05–0.55]
Stubby Fingers: language

[plot: letter transition matrix p(int_t|int_{t−1}) over a–z × a–z; probability scale 0–0.9]
Stubby Fingers

Given the typed sequence 'cwsykcak', what is the most likely word that this corresponds to?

List the 200 most likely hidden sequences.
Discard those that are not in a standard English dictionary.
Take the most likely proper English word as the intended typed word.
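A sketch of this recipe, using a beam search to approximate the N most likely hidden sequences under the HMM above (the transition/emission tables and the dictionary are placeholder assumptions):

    import heapq
    import numpy as np

    # Assumed inputs over the 26 letters: trans[i, j] = p(int_t = i | int_{t-1} = j),
    # emit[o, i] = p(hit = o | int = i), p1[i] = p(int_1 = i), plus a word list.
    rng = np.random.default_rng(0)
    p1 = np.full(26, 1 / 26)
    trans = rng.dirichlet(np.ones(26), size=26).T   # placeholder language model
    emit = rng.dirichlet(np.ones(26), size=26).T    # placeholder error model
    dictionary = {'bayesian', 'networks'}           # placeholder 8-letter word list

    def n_best(observed, n=200, beam=500):
        # Beam search over hidden letter sequences, keeping the `beam` best prefixes.
        obs = [ord(c) - ord('a') for c in observed]
        beams = [(np.log(p1[h]) + np.log(emit[obs[0], h]), (h,)) for h in range(26)]
        for o in obs[1:]:
            ext = [(lp + np.log(trans[h, seq[-1]]) + np.log(emit[o, h]), seq + (h,))
                   for lp, seq in beams for h in range(26)]
            beams = heapq.nlargest(beam, ext)
        return heapq.nlargest(n, beams)

    def decode(observed):
        candidates = n_best(observed)
        for lp, seq in candidates:
            word = ''.join(chr(h + ord('a')) for h in seq)
            if word in dictionary:
                return word                          # most likely dictionary word
        # Fall back to the single most likely sequence if no dictionary word is found.
        return ''.join(chr(h + ord('a')) for h in candidates[0][1])

    print(decode('cwsykcak'))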
Speech Recognition: raw signal

[plot: raw audio amplitude (≈ −0.2 to 0.3) against time, 0–0.9 s]

'neural' representation

[plot: feature-channel representation, ≈ 25 channels over ≈ 80 frames]
Speech Recognition

[graph: chain pho1 → pho2 → pho3 → pho4, with emissions pho_t → aud_t]

pho – phoneme (letter); aud – audio signal (neural representation).
Medical Diagnosis

[graph: disease nodes (tumour, flu, meningitis) with edges to symptom nodes (headache, fever, appetite, x-ray)]

Combine known medical knowledge with patient-specific information.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Stubby Fingers
int1 int2 int3 int4
hit1 hit2 hit3 hit4
int - intended keyhit ndash hit key
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Stubby Fingers errors
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz
005
01
015
02
025
03
035
04
045
05
055
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Stubby Fingers language
a b c d e f g h i j k l m n o p q r s t u v w x y z
abcdefghijkl
mnopqrstuvwxyz 0
01
02
03
04
05
06
07
08
09
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that thiscorresponds to
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time:
p(xt = i) = Σ_j p(xt = i|xt−1 = j) p(xt−1 = j),  defining Mij ≡ p(xt = i|xt−1 = j)
p(xt = i) is the frequency that we visit state i at time t, given we started from p(x1) and randomly drew samples from the transition p(xτ|xτ−1). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p1(i) is
pt = M^{t−1} p1
If, for t → ∞, p∞ is independent of the initial distribution p1, then p∞ is called the equilibrium distribution of the chain:
p∞ = M p∞
The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
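A minimal numerical check, reusing the made-up two-state chain from above: power iteration and the unit-eigenvalue eigenvector agree on p∞:

import numpy as np

M = np.array([[0.9, 0.5],
              [0.1, 0.5]])                 # columns sum to 1

p = np.array([1.0, 0.0])                   # any initial distribution p_1
for _ in range(100):
    p = M @ p                              # p_t = M^{t-1} p_1
print(p)                                   # ~ [0.833, 0.167]

vals, vecs = np.linalg.eig(M)
v = np.real(vecs[:, np.argmax(np.real(vals))])   # eigenvector for eigenvalue 1
print(v / v.sum())                         # the same equilibrium distribution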
PageRank
Define the matrix
Aij = 1 if website j has a hyperlink to website i, and 0 otherwise.
From this we can define a Markov transition matrix with elements
Mij = Aij / Σ_{i′} Ai′j
If we jump from website to website, the equilibrium distribution component p∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.
For each website i, a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites containing that word is then returned, ranked according to the importance of the site.
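A minimal sketch with a hypothetical four-site link matrix (the production PageRank algorithm also adds a damping factor, omitted here):

import numpy as np

# Hypothetical links: A[i, j] = 1 if site j links to site i
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0, keepdims=True)    # M_ij = A_ij / sum_i' A_i'j

p = np.full(4, 0.25)
for _ in range(200):                    # iterate towards p_inf = M p_inf
    p = M @ p
print(p)                                # the 'importance' of each site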
Hidden Markov Models
The HMM defines a Markov chain on hidden (or 'latent') variables h1:T. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(vt|ht). This defines a joint distribution
p(h1:T, v1:T) = p(v1|h1) p(h1) ∏_{t=2}^T p(vt|ht) p(ht|ht−1)
For a stationary HMM, the transition p(ht|ht−1) and emission p(vt|ht) distributions are constant through time.
Figure: A first order hidden Markov model with 'hidden' variables dom(ht) = {1, ..., H}, t = 1, ..., T. The 'visible' variables vt can be either discrete or continuous.
The classical inference problems
Filtering (inferring the present): p(ht|v1:t)
Prediction (inferring the future): p(ht|v1:s), t > s
Smoothing (inferring the past): p(ht|v1:u), t < u
Likelihood: p(v1:T)
Most likely path (Viterbi alignment): argmax_{h1:T} p(h1:T|v1:T)
For prediction, one is also often interested in p(vt|v1:s) for t > s.
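As a concrete instance of the first problem, a minimal sketch of filtering (the forward recursion) for a small discrete HMM; the particular transition and emission tables here are made up:

import numpy as np

init = np.array([0.6, 0.3, 0.1])            # p(h_1)
trans = np.array([[0.7, 0.2, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.2, 0.7]])          # trans[i, j] = p(h_t = i | h_{t-1} = j)
emit = np.array([[0.9, 0.4, 0.1],
                 [0.1, 0.6, 0.9]])           # emit[v, h] = p(v_t = v | h_t = h)

def filtering(obs):
    """p(h_t | v_{1:t}) for each t, plus the log-likelihood log p(v_{1:T})."""
    alpha = emit[obs[0]] * init
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    posteriors = [alpha]
    for v in obs[1:]:
        alpha = emit[v] * (trans @ alpha)    # propagate, then weight by evidence
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
        posteriors.append(alpha)
    return np.array(posteriors), loglik

posts, loglik = filtering([0, 0, 1, 1])
print(posts[-1], loglik)

The per-step normalisation keeps the recursion numerically stable and yields the likelihood as a by-product, which is why the cost is linear in T.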
Inference in Hidden Markov Models
Belief network representation of a HMM:
[Figure: chain h1 → h2 → h3 → h4, with emissions ht → vt]
Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).
The algorithms are variants of 'message passing on factor graphs'.
The algorithms are guaranteed to work if the graph is singly-connected.
Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).
HMMs for speech recognition
ht is the phoneme at time t; p(ht|ht−1) – language model; p(vt|ht) – speech signal model.
Deep Nets and HMMs
[Figure: HMM chain h1, ..., h4 with emissions v1, ..., v4]
Recently, companies including Google have made big advances in speech recognition.
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is some function of the phoneme, μ(ht; θ).
This function is a deep neural network, trained on a large amount of data.
Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
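A minimal sketch of such an emission model, with a tiny random two-layer network standing in for the deep net μ(h; θ); all sizes and weights here are illustrative, not the trained systems referred to above:

import numpy as np

rng = np.random.default_rng(0)
H, D = 10, 4                                  # number of phonemes, feature dim
W1 = rng.normal(size=(16, H))                 # theta: weights of a 2-layer net
W2 = rng.normal(size=(D, 16))

def mu(h):
    return W2 @ np.tanh(W1 @ np.eye(H)[h])    # mean as a function of phoneme h

def log_emission(v, h):                       # log N(v; mu(h), I)
    r = v - mu(h)
    return -0.5 * (r @ r + D * np.log(2 * np.pi))

v = rng.normal(size=D)                        # a dummy acoustic feature vector
print(max(range(H), key=lambda h: log_emission(v, h)))  # most likely phoneme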
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
[Figure: belief network with latent variables h1, h2 as parents of pixels v1, ..., v4]
It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation.
Note that this is a Graphical Model, not a Function.
The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h).
One cannot use an autoencoder to generate new images.
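A minimal sketch of this two-step ('ancestral') sampling, for a toy model with a hypothetical weight matrix W and Bernoulli pixels:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))                  # hypothetical model parameters

h = rng.standard_normal(2)                   # h ~ p(h) = N(0, I)
pix = 1 / (1 + np.exp(-W @ h))               # p(v_i = 1 | h)
v = rng.binomial(1, pix)                     # v ~ p(v|h), a binary 'image'
print(h, v)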
The bad news
Inference (computing p(h|v)) and parameter learning are intractable in these models.
Statisticians typically use sampling as an approximation.
Very popular in ML to use a variational method – much faster for inference.
Variational Inference
Consider a distribution
p(v|θ) = ∫_h p(v|h, θ) p(h)
and suppose that we wish to learn θ to maximise the probability that this model generates the observed data. The variational lower bound is
log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log [p(v|h, θ) p(h)]
The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.
We then jointly maximise the bound w.r.t. φ and θ.
We can parameterise p(v|h, θ) using a deep network.
Very popular approach – see the 'variational autoencoder', and also attention mechanisms.
Extension to a semi-supervised method, using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h).
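A minimal sketch of estimating the bound by sampling, for a toy model p(h) = N(0, 1), p(v|h, θ) = N(θh, 1) and a Gaussian q(h|v, φ) with φ = (m, log s); all of these modelling choices are illustrative:

import numpy as np

def elbo_estimate(v, theta, m, log_s, n=10000, seed=0):
    rng = np.random.default_rng(seed)
    s = np.exp(log_s)
    h = m + s * rng.standard_normal(n)            # reparameterised draws from q
    log_p_v_h = -0.5 * ((v - theta * h) ** 2 + np.log(2 * np.pi))
    log_p_h = -0.5 * (h ** 2 + np.log(2 * np.pi))
    log_q = -0.5 * (((h - m) / s) ** 2 + np.log(2 * np.pi)) - log_s
    return np.mean(log_p_v_h + log_p_h - log_q)   # Monte-Carlo lower bound

print(elbo_estimate(v=1.3, theta=0.8, m=0.6, log_s=-0.5))

In practice one would follow gradients of this estimate w.r.t. (θ, m, log s), which is exactly the joint maximisation over θ and φ described above.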
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games?
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take, for any state of W, that will be best for our long-term goals.
The problem is that the number of pixel states is enormous.
Need to learn a low dimensional representation of the screen (use a deep generative model).
Then learn which action to take, given the low dimensional representation.
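A minimal sketch of the 'learn which action' step, using tabular Q-learning on a made-up five-state world whose states stand in for the low dimensional representation:

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2                 # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1

for episode in range(500):
    s = 0
    while s != n_states - 1:               # rightmost state: reward 1, episode ends
        greedy = int(Q[s].argmax())
        a = int(rng.integers(n_actions)) if rng.random() < eps else greedy
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
print(Q)                                   # 'move right' emerges as the best action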
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing.
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representation learning.
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company, reinfer:
https://reinfer.io
Stubby Fingers
Given the typed sequence 'cwsykcak', what is the most likely word that this corresponds to?
List the 200 most likely hidden sequences.
Discard those that are not in a standard English dictionary.
Take the most likely proper English word as the intended typed word.
Speech Recognition: raw signal
[Figure: raw audio waveform, amplitude against Time (s)]
'neural' representation
[Figure: time-frequency 'neural' representation of the same signal]
Speech Recognition
[Figure: HMM with phoneme chain pho1 → pho2 → pho3 → pho4 emitting audio aud1, aud2, aud3, aud4]
pho: phoneme (letter); aud: audio signal (neural representation)
Medical Diagnosis
[Figure: belief network with diseases (tumour, flu, meningitis) as parents of symptoms and tests (headache, fever, appetite, x-ray)]
Combine known medical knowledge with patient-specific information.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems, such as the Ising Model (1920), and in AI applications, such as the HMM (Baum 1966; Stratonovich 1960).
The need for structure
We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.).
Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented.
Without introducing strong structural limitations about how these objects can interact, probability is a non-starter.
For this reason, computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however, these are typically frowned upon by purists.
Graphical Models
We can use graphs to represent how objects can probabilistically interact with each other.
Graphical Models are then a marriage between Graph and Probability theory.
Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph.
The computational complexity of operations can often be related to the structure of the graph.
Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science.
Graphical Models are used to perform reasoning under uncertainty, and are therefore widely applicable.
Uses in Industry
Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model).
Hospitals: use Belief Nets to encode knowledge about diseases and symptoms, to aid medical diagnosis.
Google, Microsoft, Facebook: used in many places, including advertising, video game prediction, speech recognition.
Used to estimate the inherent desirability of products in consumer retail.
Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company/user relationship.
Conditional Probability and Bayes' Rule
The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as
p(x|y) ≡ p(x, y) / p(y) = p(y|x) p(x) / p(y)   (Bayes' rule)
Throwing darts:
p(region 5 | not region 20) = p(region 5, not region 20) / p(not region 20)
= p(region 5) / p(not region 20)
= (1/20) / (19/20)
= 1/19
Interpretation
p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation should be 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.
Battleships
Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each.
Each can be placed anywhere on the 10×10 grid, but they cannot overlap.
Let s1 be the origin of ship 1 and s2 the origin of ship 2.
The data D is a collection of query 'hit' or 'miss' responses.
p(s1, s2|D) = p(D|s1, s2) p(s1, s2) / p(D)
Let X be the matrix of pixel occupancy:
p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2) p(s1, s2|D)
demoBattleships.m
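In the spirit of demoBattleships.m, a minimal Python sketch: enumerate all legal placements, keep those consistent with the (here hypothetical) hit/miss data, and average the occupancy maps:

import numpy as np
from itertools import product

def cells(origin, vertical):
    r, c = origin
    return {(r + k, c) if vertical else (r, c + k) for k in range(5)}

vert = list(product(range(6), range(10)))      # origins keeping ship 1 on the grid
horz = list(product(range(10), range(6)))      # origins keeping ship 2 on the grid

D = [((4, 4), True), ((0, 0), False)]          # hypothetical 'hit'/'miss' data

post = np.zeros((10, 10))                      # accumulates p(X | D)
n_consistent = 0
for s1 in vert:
    occ1 = cells(s1, True)
    for s2 in horz:
        occ2 = cells(s2, False)
        if occ1 & occ2:                        # ships cannot overlap
            continue
        occ = occ1 | occ2
        if all((pix in occ) == hit for pix, hit in D):   # p(D|s1,s2) is 0 or 1
            n_consistent += 1                  # uniform prior over placements
            for r, c in occ:
                post[r, c] += 1.0
print(post / n_consistent)                     # marginal pixel-occupancy posterior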
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Speech Recognition raw signal
0 01 02 03 04 05 06 07 08 09minus02
minus015
minus01
minus005
0
005
01
015
02
025
03
Time
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
lsquoneuralrsquo representation
10 20 30 40 50 60 70 80
5
10
15
20
25
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Speech Recognition
pho1 pho2 pho3 pho4
aud1 aud2 aud3 aud4
pho phoneme (letter)aud audio signal (neural representation)
Medical Diagnosis
tumour flu meningitis
headache fever appetite x-ray
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images, for example) can be constructed on the basis of a low dimensional representation.
Note that this is a Graphical Model, not a Function.
The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images.
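For concreteness, ancestral sampling in a deliberately simple stand-in model (a linear-Gaussian p(h) and p(v|h); a real image model would use a deep network for the mean):

    import numpy as np

    rng = np.random.default_rng(1)
    h_dim, v_dim, sigma = 2, 4, 0.1
    W = rng.normal(size=(v_dim, h_dim))          # invented model parameters

    h = rng.normal(size=h_dim)                   # sample the latent: h ~ p(h) = N(0, I)
    v = W @ h + sigma * rng.normal(size=v_dim)   # then the 'image': v ~ p(v|h)
    print(v)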
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in these models.
Statisticians typically use sampling as an approximation.
Very popular in ML to use a variational method – much faster for inference.
Variational Inference
Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)

and that we wish to learn θ to maximise the probability this model generates observed data:

log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) + const

Idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.
We then jointly maximise the bound w.r.t. φ and θ.
We can parameterise p(v|h, θ) using a deep network.
Very popular approach – see 'variational autoencoder' and also attention mechanisms.
Extension to semi-supervised method using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h).
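A Monte Carlo sketch of the bound for the toy linear-Gaussian model above, with a Gaussian q(h|v, φ) (here μ_q and log σ_q stand in for the variational parameters; in a variational autoencoder they would be outputs of a second network and optimised by gradient ascent):

    import numpy as np

    rng = np.random.default_rng(2)
    h_dim, v_dim, s = 2, 4, 0.1
    W = rng.normal(size=(v_dim, h_dim))   # invented model parameters
    v = rng.normal(size=v_dim)            # a stand-in 'observed' data point

    def bound_estimate(mu_q, log_std_q, n=1000):
        # E_q[log p(v|h) + log p(h) - log q(h|v)], estimated with n samples from q.
        std = np.exp(log_std_q)
        h = mu_q + std * rng.normal(size=(n, h_dim))
        log_q = -0.5 * np.sum(((h - mu_q) / std) ** 2 + 2 * log_std_q
                              + np.log(2 * np.pi), axis=1)
        log_ph = -0.5 * np.sum(h ** 2 + np.log(2 * np.pi), axis=1)
        r = v - h @ W.T
        log_pvh = (-0.5 * np.sum(r ** 2, axis=1) / s ** 2
                   - 0.5 * v_dim * np.log(2 * np.pi * s ** 2))
        return np.mean(log_pvh + log_ph - log_q)

    print(bound_estimate(np.zeros(h_dim), np.zeros(h_dim)))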
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games?
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take in any state of W that will be best for our long-term goals.
Problem is that the number of pixel states is enormous.
Need to learn a low dimensional representation of the screen (use a deep generative model).
Then learn which action to take given the low dimensional representation (a minimal sketch follows).
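A tabular Q-learning sketch on an invented 5-state corridor; deep reinforcement learning replaces the table with a network acting on the learned screen representation:

    import numpy as np

    rng = np.random.default_rng(3)
    n_states, n_actions = 5, 2       # actions: 0 = left, 1 = right; state 4 = goal
    Q = np.zeros((n_states, n_actions))
    alpha, gamma, eps = 0.1, 0.9, 0.1

    for episode in range(500):
        s = 0
        while s != 4:
            # epsilon-greedy action choice
            a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
            s2 = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s2 == 4 else 0.0
            # move Q(s, a) towards r + gamma * max_a' Q(s', a')
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            s = s2

    print(Q)   # the 'right' column dominates: always head towards the goal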
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing.
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representation learning.
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
https://reinfer.io
Medical Diagnosis
Figure: a belief network relating diseases (tumour, flu, meningitis) to findings (headache, fever, appetite, x-ray).
Combine known medical knowledge with patient specific information
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the Ising Model (1920) and in AI applications such as the HMM (Baum 1966; Stratonovich 1960).
The need for structure
We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.).
Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented.
Without introducing strong structural limitations about how these objects can interact, probability is a non-starter.
For this reason computationally 'simpler' alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however, these are typically frowned upon by purists.
Graphical Models
We can use graphs to represent how objects can probabilistically interact with each other.
Graphical Models are then a marriage between Graph and Probability theory.
Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph.
The computational complexity of operations can often be related to the structure of the graph.
Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science.
Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable.
Uses in Industry
Microsoft: used to estimate the skill distribution of players in online games (the world's largest graphical model).
Hospitals use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis.
Google, Microsoft, Facebook: used in many places, including advertising, video game prediction, speech recognition.
Used to estimate the inherent desirability of products in consumer retail.
Microsoft and others: attempts to go beyond simple A/B testing by using Graphical Models to model the whole company/user relationship.
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

p(x|y) ≡ p(x, y)/p(y) = p(y|x) p(x)/p(y)   (Bayes' rule)
Throwing darts
p(region 5|not region 20) = p(region 5, not region 20)/p(not region 20)
= p(region 5)/p(not region 20)
= (1/20)/(19/20)
= 1/19
Interpretation
p(A = a|B = b) should not be interpreted as 'given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation should be 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.
Battleships
Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each.
Each can be placed anywhere on the 10×10 grid, but they cannot overlap.
Let s1 be the origin of ship 1 and s2 the origin of ship 2.
Data D is a collection of query 'hit' or 'miss' responses.
p(s1, s2|D) = p(D|s1, s2) p(s1, s2)/p(D)

Let X be the matrix of pixel occupancy:

p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2) p(s1, s2|D)

demoBattleships.m
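The original demo is MATLAB (demoBattleships.m); a Python sketch of the same posterior computation, with an invented set of query responses D:

    import numpy as np
    from itertools import product

    def cells(s, vertical):
        # The 5 pixels occupied by a ship with origin s = (row, col).
        r, c = s
        return [(r + k, c) for k in range(5)] if vertical else [(r, c + k) for k in range(5)]

    placements1 = list(product(range(6), range(10)))   # vertical ship origins
    placements2 = list(product(range(10), range(6)))   # horizontal ship origins
    D = {(4, 4): True, (0, 0): False}                  # invented hit/miss data

    # Uniform prior over non-overlapping placements; keep those consistent with D.
    consistent = []
    for s1 in placements1:
        for s2 in placements2:
            occ = set(cells(s1, True)) | set(cells(s2, False))
            if len(occ) == 10 and all((pix in occ) == hit for pix, hit in D.items()):
                consistent.append(occ)

    # Marginal occupancy p(X_ij = 1 | D): average occupancy over the posterior.
    pX = np.zeros((10, 10))
    for occ in consistent:
        for pix in occ:
            pX[pix] += 1 / len(consistent)
    print(np.round(pX, 2))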
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents.
The joint distribution is obtained by taking the product of the conditional probabilities:

p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|C) p(E|B, C)

Figure: the corresponding DAG, with edges A→C, B→C, C→D, B→E, C→E.
Example – Part I
Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.
Choosing an ordering
Without loss of generality, we can write

p(A, R, E, B) = p(A|R, E, B) p(R, E, B)
= p(A|R, E, B) p(R|E, B) p(E, B)
= p(A|R, E, B) p(R|E, B) p(E|B) p(B)
Assumptions
The alarm is not directly influenced by any report on the radio: p(A|R, E, B) = p(A|E, B)
The radio broadcast is not directly influenced by the burglar variable: p(R|E, B) = p(R|E)
Burglaries don't directly 'cause' earthquakes: p(E|B) = p(E)
Therefore
p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)
Example – Part II: Specifying the Tables

Figure: belief network with edges B→A, E→A, E→R.

p(A = 1|B, E):
B = 1, E = 1: 0.9999
B = 1, E = 0: 0.99
B = 0, E = 1: 0.99
B = 0, E = 0: 0.0001

p(R = 1|E): 1 if E = 1, 0 if E = 0.

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.
Example – Part III: Inference
Initial Evidence: the alarm is sounding.
p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
= Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E) ≈ 0.99
Additional Evidence: the radio broadcasts an earthquake warning.
A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01.
Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
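These two posteriors can be checked by brute-force enumeration; a sketch using the tables above:

    from itertools import product

    # Tables from the slides: p(A=1|B,E), p(R=1|E), p(B=1), p(E=1).
    pA1 = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}
    pR1 = {1: 1.0, 0: 0.0}
    pB1, pE1 = 0.01, 0.000001

    def joint(a, r, e, b):
        # p(A=a, R=r, E=e, B=b) = p(A|E,B) p(R|E) p(E) p(B)
        pa = pA1[(b, e)] if a else 1 - pA1[(b, e)]
        pr = pR1[e] if r else 1 - pR1[e]
        pe = pE1 if e else 1 - pE1
        pb = pB1 if b else 1 - pB1
        return pa * pr * pe * pb

    def p_burglar(a, r=None):
        # p(B=1 | A=a[, R=r]) by enumerating the remaining variables.
        num = den = 0.0
        for rr, e, b in product([0, 1], repeat=3):
            if r is not None and rr != r:
                continue
            p = joint(a, rr, e, b)
            den += p
            num += p if b else 0.0
        return num / den

    print(p_burglar(a=1))        # ~ 0.99
    print(p_burglar(a=1, r=1))   # ~ 0.01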
Markov Models
For timeseries data v_1, …, v_T we need a model p(v_{1:T}). For causal consistency it is meaningful to consider the decomposition

p(v_{1:T}) = ∏_{t=1}^T p(v_t|v_{1:t−1})

with the convention p(v_t|v_{1:t−1}) = p(v_1) for t = 1.
v1 v2 v3 v4
Independence assumptions
It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.
Markov Chain
Only the recent past is relevant:

p(v_t|v_1, …, v_{t−1}) = p(v_t|v_{t−L}, …, v_{t−1})

where L ≥ 1 is the order of the Markov chain. For L = 1:

p(v_{1:T}) = p(v_1) p(v_2|v_1) p(v_3|v_2) ⋯ p(v_T|v_{T−1})

For a stationary Markov chain the transitions p(v_t = s′|v_{t−1} = s) = f(s′, s) are time-independent ('homogeneous').
Figure: (a) first order Markov chain; (b) second order Markov chain.
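A short sketch of drawing a trajectory from a homogeneous first order chain (the 2-state transition matrix is invented):

    import numpy as np

    rng = np.random.default_rng(4)
    M = np.array([[0.9, 0.2],    # M[s2, s1] = p(v_t = s2 | v_{t-1} = s1)
                  [0.1, 0.8]])
    p1 = np.array([0.5, 0.5])

    v = [rng.choice(2, p=p1)]                    # v_1 ~ p(v_1)
    for t in range(1, 20):
        v.append(rng.choice(2, p=M[:, v[-1]]))   # v_t ~ p(v_t | v_{t-1})
    print(v)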
Markov Chains
v1 v2 v3 v4
p(v_1, …, v_T) = p(v_1) ∏_{t=2}^T p(v_t|v_{t−1})

where p(v_1) is the initial distribution and p(v_t|v_{t−1}) the transition.
State transition diagram
Nodes represent states of the variable v, and arcs the non-zero elements of the transition p(v_t|v_{t−1}).
Figure: state transition diagram on states 1, …, 9.
Most probable and shortest paths
Figure: the same state transition diagram on states 1, …, 9.
The shortest (unweighted) path from state 1 to state 7 is 1−2−7.
The most probable path from state 1 to state 7 is 1−8−9−7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1−2−7 the probability of exiting state 2 into state 7 is 1/5.
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Probability
Why Probability
Probability is a logical calculus of uncertainty
Natural framework to use in models of physical systems such as the IsingModel (1920) and in AI applications such as the HMM (Baum 1966Stratonovich 1960)
The need for structure
We often want to make a probabilistic description of many objects (electronspins neurons customers etc )
Typically the representational and computational cost of probabilistic modelsgrows exponentially with the number of objects represented
Without introducing strong structural limitations about how these objects caninteract probability is a non-starter
For this reason computationally lsquosimplerrsquo alternatives (such as fuzzy logic)were introduced to try to avoid some of these difficulties ndash however these aretypically frowed on by purists
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Graphical Models
We can use graphs to represent how objects can probabilistically interact witheach other
Graphical Models and then a marriage between Graph and Probability theory
Many of the quantities that we would like to compute in a probabilitydistribution can then be related to operations on the graph
The computational complexity of operations can often be related to thestructure of the graph
Graphical Models are now used as a standard framework in EngineeringStatistics and Computer Science
Graphical Models are used to perform reasoning under uncertainty and aretherefore widely applicable
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Uses in Industry
Microsoft used to estimate the skill distribution of players in online games(the worlds largest graphical model)
Hospitals use Belief Nets to encode knowledge about diseases and symptomsto aid medical diagnosis
Google Microsoft Facebook used in many places including advertisingvideo game prediction speech recognition
Used to estimate inherent desirability of products in consumer retail
Microsoft and others Attempt to go beyond simple AB testing by usesGraphical Models to model the whole companyuser relationship
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example – Part III: Inference
Initial Evidence: The alarm is sounding.
p(B = 1 | A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
                 = Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E)
                 ≈ 0.99
Additional Evidence: The radio broadcasts an earthquake warning.
A similar calculation gives p(B = 1 | A = 1, R = 1) ≈ 0.01.
Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
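These two numbers can be checked by brute-force enumeration over the joint distribution; a minimal sketch using the tables above:

```python
import itertools

# Conditional probability tables from the slides.
pB = {1: 0.01, 0: 0.99}
pE = {1: 0.000001, 0: 0.999999}
pA1 = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}  # p(A=1|B,E)
pR1 = {1: 1.0, 0: 0.0}                                              # p(R=1|E)

def joint(a, r, e, b):
    pa = pA1[(b, e)] if a == 1 else 1 - pA1[(b, e)]
    pr = pR1[e] if r == 1 else 1 - pR1[e]
    return pa * pr * pE[e] * pB[b]

def posterior_burglar(evidence):
    """p(B = 1 | evidence); evidence maps variable name -> observed value."""
    num = den = 0.0
    for a, r, e, b in itertools.product([0, 1], repeat=4):
        state = dict(A=a, R=r, E=e, B=b)
        if any(state[k] != v for k, v in evidence.items()):
            continue
        p = joint(a, r, e, b)
        den += p
        num += p * (b == 1)
    return num / den

print(posterior_burglar({"A": 1}))          # ~0.99
print(posterior_burglar({"A": 1, "R": 1}))  # ~0.01
```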
Markov Models
For timeseries data v_1, ..., v_T we need a model p(v_{1:T}). For causal consistency, it is meaningful to consider the decomposition
p(v_{1:T}) = Π_{t=1}^{T} p(v_t | v_{1:t−1})

with the convention p(v_t | v_{1:t−1}) = p(v_1) for t = 1.

[Figure: cascade belief network over v_1, ..., v_4]
Independence assumptions
It is often natural to assume that the influence of the immediate past is more relevant than the remote past, and in Markov models only a limited number of previous observations are required to predict the future.
Markov Chain
Only the recent past is relevant
p(v_t | v_1, ..., v_{t−1}) = p(v_t | v_{t−L}, ..., v_{t−1})

where L ≥ 1 is the order of the Markov chain

p(v_{1:T}) = p(v_1) p(v_2|v_1) p(v_3|v_2) ··· p(v_T|v_{T−1})
For a stationary Markov chain, the transitions p(v_t = s′ | v_{t−1} = s) = f(s′, s) are time-independent ('homogeneous').
[Figure: (a) first-order Markov chain; (b) second-order Markov chain]
Markov Chains
[Figure: first-order Markov chain v_1 → v_2 → v_3 → v_4]

p(v_1, ..., v_T) = p(v_1) Π_{t=2}^{T} p(v_t | v_{t−1})

with p(v_1) the initial distribution and p(v_t | v_{t−1}) the transition.
State transition diagram
Nodes represent states of the variable v, and arcs non-zero elements of the transition p(v_t | v_{t−1}).
[Figure: state transition diagram on states 1–9]
Most probable and shortest paths
[Figure: the same state transition diagram on states 1–9]
The shortest (unweighted) path from state 1 to state 7 is 1 − 2 − 7.
The most probable path from state 1 to state 7 is 1 − 8 − 9 − 7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1 − 2 − 7 the probability of exiting state 2 into state 7 is 1/5.
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(x_t = i) = Σ_j M_ij p(x_{t−1} = j), where M_ij = p(x_t = i | x_{t−1} = j)
p(x_t = i) is the frequency with which we visit state i at time t, given that we started from p(x_1) and randomly drew samples from the transition p(x_τ | x_{τ−1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

p_t = M^{t−1} p_1

If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:

p_∞ = M p_∞
The equilibrium distribution is proportional to the eigenvector of the transition matrix with unit eigenvalue.
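A short sketch of both routes to the equilibrium distribution, on an assumed 3-state transition matrix (columns sum to one):

```python
import numpy as np

# M[i, j] = p(x_t = i | x_{t-1} = j); an assumed toy chain.
M = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.30],
              [0.05, 0.10, 0.60]])

# Route 1: power up the chain, p_t = M^(t-1) p_1.
p = np.array([1.0, 0.0, 0.0])  # initial distribution p_1
for _ in range(200):
    p = M @ p
print(p)

# Route 2: the eigenvector of M with unit eigenvalue, normalised.
eigvals, eigvecs = np.linalg.eig(M)
v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
print(v / v.sum())  # matches the powered-up distribution
```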
PageRank
Define the matrix
A_ij = 1 if website j has a hyperlink to website i, and 0 otherwise
From this we can define a Markov transition matrix with elements
M_ij = A_ij / Σ_{i′} A_{i′j}
If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.
For each website i, a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain the word is then returned, ranked according to the importance of the site.
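A minimal PageRank sketch along these lines; the 4-site link matrix A is an assumption for illustration:

```python
import numpy as np

# A[i, j] = 1 if website j has a hyperlink to website i -- assumed toy web.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0)    # M_ij = A_ij / sum_i' A_i'j (column-normalised)

rank = np.full(4, 0.25)  # start from a uniform distribution
for _ in range(100):     # power iteration towards p_inf = M p_inf
    rank = M @ rank
print(rank)              # the 'importance' of each site
```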
Hidden Markov Models
The HMM defines a Markov chain on hidden (or 'latent') variables h_{1:T}. The observed (or 'visible') variables are dependent on the hidden variables through an emission p(v_t|h_t). This defines a joint distribution
p(h_{1:T}, v_{1:T}) = p(v_1|h_1) p(h_1) Π_{t=2}^{T} p(v_t|h_t) p(h_t|h_{t−1})
For a stationary HMM, the transition p(h_t|h_{t−1}) and emission p(v_t|h_t) distributions are constant through time.
[Figure: a first-order hidden Markov model, with 'hidden' variables dom(h_t) = {1, ..., H}, t = 1, ..., T. The 'visible' variables v_t can be either discrete or continuous.]
The classical inference problems
Filtering (Inferring the present): p(h_t | v_{1:t})
Prediction (Inferring the future): p(h_t | v_{1:s}), t > s
Smoothing (Inferring the past): p(h_t | v_{1:u}), t < u
Likelihood: p(v_{1:T})
Most likely path (Viterbi alignment): argmax_{h_{1:T}} p(h_{1:T} | v_{1:T})
For prediction, one is also often interested in p(v_t | v_{1:s}) for t > s.
Inference in Hidden Markov Models
Belief network representation of a HMM
[Figure: belief network of an HMM, with Markov chain h_1 → h_2 → h_3 → h_4 and emissions h_t → v_t]
Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states).
The algorithms are variants of 'message passing on factor graphs'.
The algorithms are guaranteed to work if the graph is singly-connected.
Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (e.g. low-density parity-check codes).
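As an illustration of the linear-in-T, quadratic-in-H cost, below is a sketch of the forward (filtering) recursion for a small discrete HMM; the transition, emission and observation sequence are assumptions:

```python
import numpy as np

H = 3  # number of hidden states
trans = np.array([[0.8, 0.1, 0.1],   # trans[i, j] = p(h_t = i | h_{t-1} = j)
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])
emit = np.array([[0.9, 0.2, 0.5],    # emit[v, h] = p(v_t = v | h_t = h)
                 [0.1, 0.8, 0.5]])
p_h1 = np.full(H, 1 / 3)             # prior p(h_1)

def filtering(obs):
    """Return p(h_t | v_{1:t}) for each t; cost O(T H^2)."""
    alpha = emit[obs[0]] * p_h1
    alpha /= alpha.sum()
    out = [alpha]
    for v in obs[1:]:
        alpha = emit[v] * (trans @ alpha)  # predict, then correct
        alpha /= alpha.sum()               # normalise to a distribution
        out.append(alpha)
    return np.array(out)

print(filtering([0, 0, 1, 1, 1]))
```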
HMMs for speech recognition
h_t is the phoneme at time t; p(h_t|h_{t−1}) – language model; p(v_t|h_t) – speech signal model.
Deep Nets and HMMs
[Figure: HMM belief network, as above]
Recently, companies including Google have made big advances in speech recognition.
The breakthrough is to model p(v_t|h_t) as a Gaussian whose mean is some function µ(h_t, θ) of the phoneme.
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
[Figure: generative belief network with latent variables h_1, h_2 and visible variables v_1, ..., v_4]
It is natural to consider that objects (images, for example) can be constructed on the basis of a low-dimensional representation.
Note that this is a Graphical Model, not a Function.
The latent variables h can be sampled from using p(h), and then an image sampled from p(v|h).
One cannot use an autoencoder to generate new images.
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in these models.
Statisticians typically use sampling as an approximation.
Very popular in ML to use a variational method – much faster for inference.
Variational Inference
Consider a distribution

p(v|θ) = ∫_h p(v|h, θ) p(h)
and that we wish to learn θ to maximise the probability that this model generates the observed data.
log p(v|θ) ≥ −∫_h q(h|v, φ) log q(h|v, φ) + ∫_h q(h|v, φ) log p(v|h, θ) + const
The idea is to choose a 'variational' distribution q(h|v, φ) such that we can either calculate the bound analytically or sample it efficiently.
We then jointly maximise the bound w.r.t. φ and θ.
We can parameterise p(v|h, θ) using a deep network.
Very popular approach – see the 'variational autoencoder' and also attention mechanisms.
Extension to a semi-supervised method using p(v) = ∫_h Σ_c p(v|h, c) p(c) p(h)
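A Monte-Carlo sketch of the bound on a toy linear-Gaussian model, where the exact log-likelihood is available for comparison; the model p(h) = N(0, 1), p(v|h, θ) = N(θh, 1) and the Gaussian q(h|v, φ) = N(µ, σ²) are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def elbo(v, theta, mu, sigma, n_samples=100_000):
    h = mu + sigma * rng.standard_normal(n_samples)  # h ~ q(h|v, phi)
    # E_q[ log p(v|h, theta) + log p(h) - log q(h|v, phi) ]
    return np.mean(log_normal(v, theta * h, 1.0)
                   + log_normal(h, 0.0, 1.0)
                   - log_normal(h, mu, sigma ** 2))

v, theta = 1.5, 1.0
print(elbo(v, theta, mu=0.0, sigma=1.0))             # a loose bound
print(elbo(v, theta, mu=v / 2, sigma=np.sqrt(0.5)))  # q = exact posterior
print(log_normal(v, 0.0, theta ** 2 + 1.0))          # exact log p(v|theta)
```

With q set to the exact posterior N(v/2, 1/2) the bound is tight, matching the exact log p(v|θ); other settings of (µ, σ) give strictly smaller values.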
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games?
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take, in any state of W, that will be best for our long-term goals.
The problem is that the number of pixel states is enormous.
We need to learn a low-dimensional representation of the screen (using a deep generative model).
Then learn which action to take given the low-dimensional representation (a minimal sketch follows below).
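A minimal tabular Q-learning sketch of that second step, assuming the screen has already been encoded into a small discrete state z by a learned generative model; the toy environment and all names here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 4          # z = encode(screen) assumed given
Q = np.zeros((n_states, n_actions))  # action values on the low-dim states
alpha, gamma, eps = 0.1, 0.99, 0.1

def env_step(z, a):
    """Stand-in for the game: returns (next_state, reward)."""
    z_next = (z + (1 if a == 0 else -1)) % n_states
    return z_next, float(z_next == 0)

z = 3
for _ in range(5000):
    # epsilon-greedy action selection
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[z].argmax())
    z_next, r = env_step(z, a)
    # temporal-difference update towards r + gamma * max_a' Q(z', a')
    Q[z, a] += alpha * (r + gamma * Q[z_next].max() - Q[z, a])
    z = z_next

print(Q.argmax(axis=1))  # greedy action in each encoded state
```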
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representation learning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company, reinfer:
https://reinfer.io
Conditional Probability and Bayesrsquo Rule
The probability of event x conditioned on knowing event y (or more shortly theprobability of x given y) is defined as
p(x|y) equiv p(x y)
p(y)=p(y|x)p(x)
p(y)(Bayesrsquo rule)
Throwing darts
p(region 5|not region 20) =p(region 5 not region 20)
p(not region 20)
=p(region 5)
p(not region 20)=
120
1920=
1
19
Interpretationp(A = a|B = b) should not be interpreted as lsquoGiven the event B = b has occurredp(A = a|B = b) is the probability of the event A = a occurringrsquo The correctinterpretation should be lsquop(A = a|B = b) is the probability of A being in state aunder the constraint that B is in state brsquo
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Battleships
Assume there are 2 ships 1 vertical (ship 1) and 1 horizontal (ship 2) of 5pixels each
Can be placed anywhere on the 10times10 grid but cannot overlap
Let s1 is the origin of ship 1 and s2 the origin of ship 2
Data D is a collection of query lsquohitrsquo or lsquomissrsquo responses
p(s1 s2|D) =p(D|s1 s2)p(s1 s2)
p(D)Let X be the matrix of pixel occupancy
p(X|D) =sums1s2
p(X s1 s2|D) =sums1s2
p(X|s1 s2)p(s1 s2|D)
demoBattleshipsm
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated theconditional probability of the node given its parents
The joint distribution is obtained by taking the product of the conditionalprobabilities
p(ABCDE) = p(A)p(B)p(C|AB)p(D|C)p(E|BC)
p(E|BC)
A B
C
DE
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Example ndash Part ISallyrsquos burglar Alarm is sounding Has she been Burgled or was the alarmtriggered by an Earthquake She turns the car Radio on for news of earthquakes
Choosing an orderingWithout loss of generality we can write
p(AREB) = p(A|REB)p(REB)
= p(A|REB)p(R|EB)p(EB)
= p(A|REB)p(R|EB)p(E|B)p(B)
Assumptions
The alarm is not directly influenced by any report on the radiop(A|REB) = p(A|EB)The radio broadcast is not directly influenced by the burglar variablep(R|EB) = p(R|E)Burglaries donrsquot directly lsquocausersquo earthquakes p(E|B) = p(E)
Therefore
p(AREB) = p(A|EB)p(R|E)p(E)p(B)
Example ndash Part II Specifying the Tables
B
A
E
R
p(A|BE)
Alarm = 1 Burglar Earthquake09999 1 1
099 1 0099 0 1
00001 0 0
p(R|E)
Radio = 1 Earthquake1 10 0
The remaining tables are p(B = 1) = 001 and p(E = 1) = 0000001 The tablesand graphical structure fully specify the distribution
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deep generative model).
Then learn which action to take given the low dimensional representation.
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing.
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representation learning.
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
https://reinfer.io
Example – Part II: Specifying the Tables
(Belief network: B → A ← E and E → R; Burglar, Alarm, Earthquake, Radio.)
p(A|B,E):

Alarm = 1   Burglar   Earthquake
0.9999      1         1
0.99        1         0
0.99        0         1
0.0001      0         0
p(R|E):

Radio = 1   Earthquake
1           1
0           0
The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.
Example – Part III: Inference
Initial Evidence: the alarm is sounding.
p(B = 1 | A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
                 = Σ_{E,R} p(A = 1 | B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1 | B, E) p(B) p(E) p(R|E) ≈ 0.99
Additional Evidence: the radio broadcasts an earthquake warning.
A similar calculation gives p(B = 1 | A = 1, R = 1) ≈ 0.01.
Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
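The example is small enough to verify by brute-force enumeration; a sketch using the tables above:

```python
# Conditional probability tables from the slides (1 = true, 0 = false).
p_B = {1: 0.01, 0: 0.99}
p_E = {1: 0.000001, 0: 0.999999}
p_A1 = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}  # p(A=1 | B, E)
p_R1 = {1: 1.0, 0: 0.0}                                              # p(R=1 | E)

def joint(b, e, a, r):
    """p(B=b, E=e, A=a, R=r) = p(A|B,E) p(R|E) p(B) p(E)."""
    pa = p_A1[(b, e)] if a == 1 else 1.0 - p_A1[(b, e)]
    pr = p_R1[e] if r == 1 else 1.0 - p_R1[e]
    return pa * pr * p_B[b] * p_E[e]

def p_burglar_given(a, r=None):
    """p(B=1 | A=a) or p(B=1 | A=a, R=r), summing out the unobserved variables."""
    rs = (0, 1) if r is None else (r,)
    num = sum(joint(1, e, a, rr) for e in (0, 1) for rr in rs)
    den = sum(joint(b, e, a, rr) for b in (0, 1) for e in (0, 1) for rr in rs)
    return num / den

print(p_burglar_given(a=1))        # ~0.99: the alarm alone suggests a burglary
print(p_burglar_given(a=1, r=1))   # ~0.01: the earthquake report 'explains away' the alarm
```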
Markov Models
For timeseries data v_1, . . . , v_T we need a model p(v_{1:T}). For causal consistency it is meaningful to consider the decomposition
p(v_{1:T}) = ∏_{t=1}^{T} p(v_t | v_{1:t-1})

with the convention p(v_t | v_{1:t-1}) = p(v_1) for t = 1.
Independence assumptions

It is often natural to assume that the influence of the immediate past is more relevant than the remote past; in Markov models only a limited number of previous observations are required to predict the future.
Markov Chain
Only the recent past is relevant
p(v_t | v_1, . . . , v_{t-1}) = p(v_t | v_{t-L}, . . . , v_{t-1})

where L ≥ 1 is the order of the Markov chain. For a first-order chain,

p(v_{1:T}) = p(v_1) p(v_2|v_1) p(v_3|v_2) · · · p(v_T | v_{T-1})
For a stationary Markov chain the transitions p(v_t = s′ | v_{t-1} = s) = f(s′, s) are time-independent ('homogeneous').
Figure: (a) First-order Markov chain. (b) Second-order Markov chain.
Markov Chains
p(v_1, . . . , v_T ) = p(v_1) ∏_{t=2}^{T} p(v_t | v_{t-1})

with p(v_1) the initial distribution and p(v_t | v_{t-1}) the transition.
State transition diagram: nodes represent states of the variable v, and arcs non-zero elements of the transition p(v_t | v_{t-1}). (Figure: a transition diagram on states 1, . . . , 9.)
Most probable and shortest paths
The shortest (unweighted) path from state 1 to state 7 is 1 − 2 − 7.
The most probable path from state 1 to state 7 is 1 − 8 − 9 − 7 (assuming uniform transition probabilities). The latter path is longer but more probable, since for the path 1 − 2 − 7 the probability of exiting state 2 into state 7 is only 1/5.
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(x_t = i) = Σ_j p(x_t = i | x_{t-1} = j) p(x_{t-1} = j) = Σ_j M_{ij} p(x_{t-1} = j)
p(x_t = i) is the frequency with which we visit state i at time t, given we started from p(x_1) and randomly drew samples from the transition p(x_τ | x_{τ−1}). As we repeatedly sample a new state from the chain, the distribution at time t for an initial distribution p_1(i) is

p_t = M^{t−1} p_1
If, for t → ∞, p_∞ is independent of the initial distribution p_1, then p_∞ is called the equilibrium distribution of the chain:

p_∞ = M p_∞
The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
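Both statements are easy to check numerically; a sketch with an invented 2-state transition matrix:

```python
import numpy as np

M = np.array([[0.9, 0.2],
              [0.1, 0.8]])   # column-stochastic: M[i, j] = p(x_t = i | x_{t-1} = j)

p = np.array([1.0, 0.0])     # some initial distribution p_1
for _ in range(100):
    p = M @ p                # iterate p_t = M^{t-1} p_1
print(p)                     # converges to the equilibrium distribution, here [2/3, 1/3]

vals, vecs = np.linalg.eig(M)
v = np.real(vecs[:, np.argmax(np.real(vals))])   # eigenvector with eigenvalue 1
print(v / v.sum())           # the same distribution, after normalisation
```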
PageRank
Define the matrix

A_{ij} = 1 if website j has a hyperlink to website i, and 0 otherwise.
From this we can define a Markov transition matrix with elements

M_{ij} = A_{ij} / Σ_{i′} A_{i′j}
If we jump from website to website, the equilibrium distribution component p_∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the 'importance' of website i.
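A sketch on an invented four-site link graph: build A, column-normalise it into M, and power-iterate to the importance scores. (Real PageRank additionally mixes in a small uniform 'damping' term, omitted here as on the slide.)

```python
import numpy as np

# A[i, j] = 1 if website j links to website i (an invented 4-site example).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0)   # M[i, j] = A[i, j] / sum_i' A[i', j]

p = np.full(4, 0.25)    # start the random surfer from a uniform distribution
for _ in range(200):
    p = M @ p           # hop along links until the distribution settles
print(p)                # equilibrium component p_inf(i) = importance of site i
```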
For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an 'inverse' list of which websites contain word w. When a user searches for word w, the list of websites that contain the word is then returned, ranked according to the importance of the site.
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Example Part III Inference
Initial Evidence The alarm is sounding
p(B = 1|A = 1) =
sumER p(B = 1 EA = 1 R)sumBER p(BEA = 1 R)
=
sumER p(A = 1|B = 1 E)p(B = 1)p(E)p(R|E)sum
BER p(A = 1|BE)p(B)p(E)p(R|E)asymp 099
Additional Evidence The radio broadcasts an earthquake warning
A similar calculation gives p(B = 1|A = 1 R = 1) asymp 001
Initially because the alarm sounds Sally thinks that shersquos been burgledHowever this probability drops dramatically when she hears that there hasbeen an earthquake
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Markov Models
For timeseries data v1 vT we need a model p(v1T ) For causal consistency itis meaningful to consider the decomposition
p(v1T ) =
Tprodt=1
p(vt|v1tminus1)
with the convention p(vt|v1tminus1) = p(v1) for t = 1
v1 v2 v3 v4
Independence assumptionsIt is often natural to assume that the influence of the immediate past is morerelevant than the remote past and in Markov models only a limited number ofprevious observations are required to predict the future
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Markov Chain
Only the recent past is relevant
p(vt|v1 vtminus1) = p(vt|vtminusL vtminus1)
where L ge 1 is the order of the Markov chain
p(v1T ) = p(v1)p(v2|v1)p(v3|v2) p(vT |vTminus1)
For a stationary Markov chain the transitions p(vt = sprime|vtminus1 = s) = f(sprime s) aretime-independent (lsquohomogeneousrsquo)
v1 v2 v3 v4
(a)
v1 v2 v3 v4
(b)
Figure (a) First order Markov chain (b) Second order Markov chain
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Markov Chains
v1 v2 v3 v4
p(v1 vT ) = p(v1)︸ ︷︷ ︸initial
Tprodt=2
p(vt|vtminus1)︸ ︷︷ ︸Transition
State transition diagramNodes represent states of the variable v and arcs non-zero elements of thetransition p(vt|vtminus1)
1 2
34
56
7
8 9
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Most probable and shortest paths
1 2
34
56
7
8 9
The shortest (unweighted) path from state 1 to state 7 is 1minus 2minus 7
The most probable path from state 1 to state 7 is 1minus 8minus 9minus 7 (assuminguniform transition probabilities) The latter path is longer but more probablesince for the path 1minus 2minus 7 the probability of exiting state 2 into state 7 is15
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time
p(xt = i) =sumj
p(xt = i|xtminus1 = j)︸ ︷︷ ︸Mij
p(xtminus1 = j)
p(xt = i) is the frequency that we visit state i at time t given we startedfrom p(x1) and randomly drew samples from the transition p(xτ |xτminus1)As we repeatedly sample a new state from the chain the distribution at timet for an initial distribution p1(i) is
pt = Mtminus1p1
If for trarrinfin pinfin is independent of the initial distribution p1 then pinfin iscalled the equilibrium distribution of the chain
pinfin = Mpinfin
The equil distribution is proportional to the eigenvector with unit eigenvalueof the transition matrix
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in thesemodels
Statisticians typically use sampling as an approximation
Very popular in ML to use a variational method ndash much faster for inference
Variational InferenceConsider a distribution
p(v|θ) =inth
p(v|h θ)p(h)
and that we wish to learn θ to maximise the probability this model generatesobserved data
log p(v|θ) ge minusintq(h|v φ) log q(h|v φ) +
inth
q(h|v φ)p(v|h θ) + const
Idea is to choose a lsquovariationalrsquo distribution q(h|v φ) such that we can eithercalculate analytically the bound or sample it efficiently
We then jointly maximise the bound wrt φ and θ
We can parameterise p(v|h θ) using a deep network
Very popular approach ndash see lsquovariational autoencoderrsquo and also attentionmechanisms
Extension to semi-supervised method using p(v) =inth
sumc p(v|h c)p(c)p(h)
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A we need todecide which action to taken for any state of W that will be best for our longterm goals
Problem is that the number of pixel states is enormous
Need to learn a low dimensional representation of the screen (use a deepgenerative model)
Learn then which action to take given the low dimensional representation
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state of the art results in Speech Recognition Image AnalysisGame Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representationlearning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company reinfer
httpsreinferio
PageRank
Define the matrix
Aij =
1 if website j has a hyperlink to website i0 otherwise
From this we can define a Markov transition matrix with elements
Mij =Aijsumiprime Aiprimej
If we jump from website to website the equilibrium distribution componentpinfin(i) is the relative number of times we will visit website i This has anatural interpretation as the lsquoimportancersquo of website i
For each website i a list of words associated with that website is collectedAfter doing this for all websites one can make an lsquoinversersquo list of whichwebsites contain word w When a user searches for word w the list ofwebsites that contain word is then returned ranked according to theimportance of the site
Hidden Markov Models
The HMM defines a Markov chain on hidden (or lsquolatentrsquo) variables h1T Theobserved (or lsquovisiblersquo) variables are dependent on the hidden variables through anemission p(vt|ht) This defines a joint distribution
p(h1T v1T ) = p(v1|h1)p(h1)Tprodt=2
p(vt|ht)p(ht|htminus1)
For a stationary HMM the transition p(ht|htminus1) and emission p(vt|ht) distributionsare constant through time
v1 v2 v3 v4
h1 h2 h3 h4 Figure A first order hidden Markov modelwith lsquohiddenrsquo variablesdom(ht) = 1 H t = 1 T Thelsquovisiblersquo variables vt can be either discrete orcontinuous
The classical inference problems
Filtering (Inferring the present) p(ht|v1t)Prediction (Inferring the future) p(ht|v1s) t gt sSmoothing (Inferring the past) p(ht|v1u) t lt uLikelihood p(v1T )Most likely path (Viterbi alignment) argmax
h1T
p(h1T |v1T )
For prediction one is also often interested in p(vt|v1s) for t gt s
Inference in Hidden Markov Models
Belief network representation of a HMM
h1 h2 h3 h4
v1 v2 v3 v4
Filtering Smoothing and Viterbi are all computationally efficient scalinglinearly with the length of the timeseries (but quadratically with the numberof hidden states)
The algorithms are variants of lsquomessage passing on factor graphsrsquo
Algorithm guaranteed to work if the graph is singly-connected
Huge research effort in the last 15 years to apply message passing forapproximate inference in multiply-connected graphs (eg low-densityparity-check codes)
HMMs for speech recognition
ht is the phoneme at time t p(ht|htminus1) ndash language model p(vt|ht) ndash speechsignal model
Deep Nets and HMMs
h1 h2 h3 h4
v1 v2 v3 v4
Recently companies including Google have made big advances in speechrecognition
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is somefunction of the phoneme micro(ht θ)
This function is a deep neural network trained on a large amount of data
Goldrush at the moment to find similar breakthrough applications of deepnetworks in reasoning systems
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
h1 h2
v1 v2 v3 v4
It is natural to consider that objects (images for example) can be constructedon the basis of a low dimensional representation
Note that this is a Graphical Model not a Function
The latent variables h can be sampled from using p(h) and then an imagesampled from p(v|h)One cannot use an autoencoder to generate new images
The bad news
Inference (computing $p(h|v)$) and parameter learning are intractable in these models.
Statisticians typically use sampling as an approximation.
It is very popular in ML to use a variational method, which is much faster for inference.
Variational Inference
Consider a distribution
$$p(v|\theta) = \int_h p(v|h, \theta)\, p(h)$$
and suppose that we wish to learn $\theta$ to maximise the probability that this model generates the observed data.
By Jensen's inequality, for any distribution $q(h|v, \phi)$,
$$\log p(v|\theta) \ge -\int_h q(h|v, \phi) \log q(h|v, \phi) + \int_h q(h|v, \phi) \log p(v|h, \theta) + \text{const.}$$
The idea is to choose a 'variational' distribution $q(h|v, \phi)$ such that we can either calculate the bound analytically or sample it efficiently (see the sketch after this list).
We then jointly maximise the bound with respect to $\phi$ and $\theta$.
We can parameterise $p(v|h, \theta)$ using a deep network.
This is a very popular approach; see the 'variational autoencoder' and also attention mechanisms.
Extension to a semi-supervised method using $p(v) = \int_h \sum_c p(v|h, c)\, p(c)\, p(h)$.
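A minimal sketch of a single-sample Monte Carlo estimate of the bound, assuming a diagonal-Gaussian $q(h|v,\phi)$, a standard Gaussian prior $p(h)$, and a Bernoulli decoder $p(v|h,\theta)$; all weights are invented placeholders rather than a trained variational autoencoder. In expectation the returned value lower-bounds $\log p(v|\theta)$, and the reparameterised sample is what makes gradient-based maximisation over $\phi$ possible.

```python
import numpy as np

rng = np.random.default_rng(2)
D, K = 8, 2          # visible dim, latent dim

# Placeholder parameters: theta (decoder) and phi (encoder); learned in practice.
W_dec = rng.normal(0, 0.5, (D, K))
W_mu  = rng.normal(0, 0.1, (K, D))
W_ls  = rng.normal(0, 0.1, (K, D))

def elbo_estimate(v):
    """Single-sample Monte Carlo estimate of the variational lower bound."""
    mu, log_s = W_mu @ v, W_ls @ v                   # q(h|v, phi): diagonal Gaussian
    eps = rng.normal(size=K)
    h = mu + np.exp(log_s) * eps                     # reparameterised sample h ~ q
    p = 1 / (1 + np.exp(-(W_dec @ h)))               # p(v|h, theta): Bernoulli means
    log_pv_h = np.sum(v * np.log(p) + (1 - v) * np.log(1 - p))
    log_ph = -0.5 * np.sum(h**2) - 0.5 * K * np.log(2 * np.pi)
    log_q = np.sum(-log_s - 0.5 * np.log(2 * np.pi) - 0.5 * eps**2)
    return log_pv_h + log_ph - log_q                 # E[.] >= log p(v|theta)

v = (rng.random(D) < 0.5).astype(float)              # a fake binary observation
print(elbo_estimate(v))
```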
DRAW
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games?
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to decide which action to take in any state of W that will be best for our long-term goals.
The problem is that the number of pixel states is enormous.
We need to learn a low-dimensional representation of the screen (using a deep generative model).
Then learn which action to take given the low-dimensional representation; see the sketch below.
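A toy sketch of the second step: tabular Q-learning over indices of an assumed, already-discretised low-dimensional representation. The environment dynamics here are invented purely for illustration; systems like DQN instead represent Q itself with a deep network over the learned representation.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions = 10, 4          # states = discretised low-dim screen codes
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.99, 0.1   # learning rate, discount, exploration rate

def step(s, a):
    """Stand-in environment: returns (next_state, reward). Purely illustrative."""
    s_next = (s + a) % n_states
    return s_next, float(s_next == 0)

s = int(rng.integers(n_states))
for _ in range(5000):
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    # Q-learning update towards the one-step bootstrapped target.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(np.argmax(Q, axis=1))          # greedy action for each representation index
```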
Tetris
Table of Contents
History of the AI dream
How do brains work
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period
Renewed interest and hope in creating AI
Combine new computational power with suitable hierarchical representations
Impressive state-of-the-art results in Speech Recognition, Image Analysis, Game Playing
Challenges
Improve understanding of optimisation for deep learning
Learn how to more efficiently exploit computational resources
Learn how to exploit massive databases
Improve interaction between reinforcement learning and representation learning
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability
Feel free to contact me at UCL or at my AI company, reinfer:
https://reinfer.io