1
Graphical model software for machine learning
Kevin Murphy
University of British Columbia
December, 2005
2
Outline
• Discriminative models for iid data
• Beyond iid data: conditional random fields
• Beyond supervised learning: generative models
• Beyond optimization: Bayesian models
3
Supervised learning as Bayesian inference
[Figure: graphical model with training nodes (X_1,Y_1) … (X_N,Y_N) (plate over n) and test nodes (X_*, Y_*); panels labelled Training and Testing. Prediction of Y_* is posed as inference in this model.]
4
Supervised learning as optimization
[Figure: the same model as on the previous slide; here the shared parameters are fit by optimization during Training and then used to predict Y_* at Testing.]
5
Example: logistic regression
• Let y_n ∈ {1,…,C} be given by a softmax
• Maximize conditional log likelihood
• “Max margin” solution
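As a concrete illustration (a minimal sketch, not code from any of the packages discussed later; the function name softmaxLoglik and its interface are made up), the conditional log-likelihood and its gradient for the softmax model can be written in matlab as:

function [ll, grad] = softmaxLoglik(W, X, y)
  % Softmax conditional log-likelihood and its gradient for multi-class
  % logistic regression.  X is N-by-D, y holds labels in 1..C, W is C-by-D.
  % (Illustrative sketch.)
  N = size(X, 1);
  C = size(W, 1);
  S = X * W';                                    % N-by-C scores
  S = S - max(S, [], 2) * ones(1, C);            % subtract row max for stability
  P = exp(S) ./ (sum(exp(S), 2) * ones(1, C));   % N-by-C class probabilities
  Y = full(sparse(1:N, y(:)', 1, N, C));         % one-hot encoding of labels
  ll = sum(sum(Y .* log(P)));                    % conditional log-likelihood
  grad = (Y - P)' * X;                           % C-by-D: features - expected features
end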
6
Outline
• Discriminative models for iid data
• Beyond iid data: conditional random fields
• Beyond supervised learning: generative models
• Beyond optimization: Bayesian models
7
1D chain CRFs for sequence labeling
[Chain-structured model: labels Y_n1, Y_n2, …, Y_nm linked in a chain, all conditioned on the input X_n.]
A 1D conditional random field (CRF) is an extension of logistic regression to the case where the output labels are sequences, y_n ∈ {1,…,C}^m:

  p(y | x) = (1/Z(x)) Π_i φ_i(y_i, x) Π_{ij} ψ_ij(y_i, y_j, x)

where φ_i is the local evidence and ψ_ij is the edge potential.
8
2D Lattice CRFs for pixel labeling
A conditional random field (CRF) is a discriminative model of P(y|x). The edge potentials ψ_ij are image-dependent.
9
2D Lattice MRFs for pixel labeling
A Markov Random Field (MRF) is an undirected graphical model. Here we model correlation between pixel labels using ψ_ij(y_i, y_j). We also have a per-pixel generative model of observations P(x_i | y_i):

  P(x, y) = (1/Z) Π_i P(x_i | y_i) Π_{ij} ψ_ij(y_i, y_j)

where P(x_i | y_i) is the local evidence, ψ_ij is the potential function, and Z is the partition function.
10
Tree-structured CRFs
• Used in parts-based object detection
• Y_i is the location of part i in the image
[Tree-structured model over parts: eyeL, eyeR, nose, mouth.]
Fischler & Elschlager, “The representation and matching of pictorial structures”, PAMI’73
Felzenszwalb & Huttenlocher, “Pictorial Structures for Object Recognition”, IJCV’05
11
General CRFs
• In general, the graph may have arbitrary structure
• e.g. for collective web page classification, nodes = URLs, edges = hyperlinks
• The potentials are in general defined on cliques, not just edges
12
Factor graphs
• Square nodes = factors (potentials)
• Round nodes = random variables
• Graph structure = bipartite
13
Potential functions
• For the local evidence, we can use a discriminative classifier (trained iid)
• For the edge compatibilities, we can use a maxent / log-linear form with pre-defined features (see the sketch below)
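A minimal matlab sketch of such a log-linear edge potential (edgePotential and featFn are assumed, illustrative names; a real feature function would also depend on the input x):

function psi = edgePotential(w, featFn, C)
  % Log-linear (maxent) edge potential: psi(a,b) = exp(w' * f(a,b)).
  % featFn(a, b) returns a feature vector for the label pair (a, b).
  % (Illustrative sketch.)
  psi = zeros(C, C);
  for a = 1:C
    for b = 1:C
      psi(a, b) = exp(w' * featFn(a, b));
    end
  end
end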
14
Restricted potential functions
• For some applications (esp. in vision), we often use a Potts model of the form

  ψ_ij(y_i, y_j) = exp(-λ · 1[y_i ≠ y_j])

• We can generalize this to ordered labels (e.g. a discretization of continuous states), for instance with a truncated-linear penalty on |y_i - y_j|
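A matlab sketch of these restricted potentials (illustrative; lambda and dmax are assumed smoothing parameters):

function [pottsPsi, truncPsi] = smoothingPotentials(C, lambda, dmax)
  % Potts and truncated-linear edge potentials over C labels.
  % lambda is the smoothing strength; dmax truncates the linear penalty.
  % (Illustrative sketch.)
  [A, B] = meshgrid(1:C, 1:C);
  pottsPsi = exp(-lambda * (A ~= B));              % penalize any disagreement equally
  truncPsi = exp(-lambda * min(abs(A - B), dmax)); % penalty grows with |y_i - y_j|, then saturates
end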
15
16
Learning CRFs
• If the log-likelihood is

  ℓ(w) = Σ_n [ Σ_c w^T f_c(y_{n,c}, x_n) - log Z(x_n, w) ]

  (sum over cliques c; the parameters w are tied across cliques)

• then the gradient is

  ∇ℓ(w) = Σ_n Σ_c [ f_c(y_{n,c}, x_n) - E_{p(y_c | x_n, w)} f_c(y_c, x_n) ]

Gradient = features – expected features
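For a pairwise CRF with tied edge parameters, the per-case gradient can be assembled from the observed clique features and the pairwise marginals returned by inference; a matlab sketch with illustrative names (edgeFeat, edgeMarg, etc. are assumptions, not a toolbox API):

function g = crfGradientOneCase(edges, yobs, edgeFeat, edgeMarg, F)
  % Gradient contribution of one training case for a pairwise CRF with
  % tied edge parameters: observed features minus expected features.
  % edges is E-by-2 (node indices), yobs the observed labels,
  % edgeFeat(a,b) an F-by-1 feature vector, edgeMarg{e}(a,b) the pairwise
  % marginal p(y_i=a, y_j=b | x) from sum-product.  (Illustrative sketch.)
  g = zeros(F, 1);
  C = size(edgeMarg{1}, 1);
  for e = 1:size(edges, 1)
    i = edges(e, 1);  j = edges(e, 2);
    g = g + edgeFeat(yobs(i), yobs(j));                % observed feature counts
    for a = 1:C
      for b = 1:C
        g = g - edgeMarg{e}(a, b) * edgeFeat(a, b);    % expected feature counts
      end
    end
  end
end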
17
Learning CRFs
• Given the gradient ∇ℓ, one can find the global optimum using first- or second-order optimization methods, such as
  – Conjugate gradient
  – Limited-memory BFGS
  – Stochastic meta-descent (SMD)?
• The bottleneck is computing the expected features needed for the gradient
18
Exact inference
• For 1D chains, one can compute P(y_i, y_{i+1} | x) exactly in O(N K^2) time using belief propagation (BP = the forwards-backwards algorithm)
• For restricted potentials (e.g. ψ_ij(y_i, y_j) = f(y_i - y_j)), one can do this in O(N K) time using FFT-like tricks
• This can be generalized to trees.
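A matlab sketch of the forwards-backwards recursions for a length-T chain with a tied K-by-K edge potential (the generic O(T K^2) version, not the restricted-potential O(T K) variant; all names are illustrative):

function [nodeMarg, pairMarg] = chainSumProduct(phi, psi)
  % Forwards-backwards (sum-product) on a 1D chain.
  % phi is T-by-K local evidence, psi is a K-by-K tied edge potential.
  % Returns node marginals p(y_t|x) and pairwise marginals p(y_t,y_{t+1}|x).
  % Messages are normalized at each step for numerical stability.
  [T, K] = size(phi);
  alpha = zeros(T, K);  beta = zeros(T, K);
  alpha(1, :) = phi(1, :) / sum(phi(1, :));
  for t = 2:T                                        % forwards pass
    a = (alpha(t-1, :) * psi) .* phi(t, :);
    alpha(t, :) = a / sum(a);
  end
  beta(T, :) = ones(1, K) / K;
  for t = T-1:-1:1                                   % backwards pass
    b = (beta(t+1, :) .* phi(t+1, :)) * psi';
    beta(t, :) = b / sum(b);
  end
  nodeMarg = alpha .* beta;
  nodeMarg = nodeMarg ./ (sum(nodeMarg, 2) * ones(1, K));
  pairMarg = cell(T-1, 1);
  for t = 1:T-1
    M = (alpha(t, :)' * (beta(t+1, :) .* phi(t+1, :))) .* psi;
    pairMarg{t} = M / sum(M(:));
  end
end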
19
Sum-product vs max-product
• We use sum-product to compute the marginal probabilities needed for learning
• We use max-product to find the most probable assignment (Viterbi decoding)
• We can also compute max-marginals
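A matching max-product (Viterbi) sketch in matlab, again with illustrative names:

function ystar = chainViterbi(phi, psi)
  % Max-product (Viterbi) decoding on a 1D chain: most probable labeling.
  % phi is T-by-K local evidence, psi is a K-by-K edge potential.
  % Works in log space to avoid underflow.  (Illustrative sketch.)
  [T, K] = size(phi);
  logPhi = log(phi + eps);  logPsi = log(psi + eps);
  delta = zeros(T, K);  back = zeros(T, K);
  delta(1, :) = logPhi(1, :);
  for t = 2:T
    [best, arg] = max(delta(t-1, :)' * ones(1, K) + logPsi, [], 1);
    delta(t, :) = best + logPhi(t, :);
    back(t, :) = arg;                      % remember the best predecessor
  end
  ystar = zeros(T, 1);
  [~, ystar(T)] = max(delta(T, :));
  for t = T-1:-1:1                         % trace back the best path
    ystar(t) = back(t+1, ystar(t+1));
  end
end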
20
Complexity of exact inference
In general, the running time is O(N K^w), where w is the treewidth of the graph; this is the size of the maximal clique of the triangulated graph (assuming an optimal elimination ordering).
For chains and trees, w = 2.
For n × n lattices, w = O(n).
21
Approximate sum-product (N = num nodes, K = num states, I = num iterations)

  Algorithm                  Potential (pairwise)   Time
  BP (exact iff tree)        General                O(N K^2 I)
  BP+FFT (exact iff tree)    Restricted             O(N K I)
  Generalized BP             General                O(N K^(2c) I), c = cluster size
  Gibbs                      General                O(N K I)
  Swendsen-Wang              General                O(N K I)
  Mean field                 General                O(N K I)
22
Approximate max-product (N = num nodes, K = num states, I = num iterations)

  Algorithm                          Potential (pairwise)   Time
  BP (exact iff tree)                General                O(N K^2 I)
  BP+DT (exact iff tree)             Restricted             O(N K I)
  Generalized BP                     General                O(N K^(2c) I), c = cluster size
  Graph-cuts (exact iff K=2)         Restricted             O(N^2 K I) [?]
  ICM (iterated conditional modes)   General                O(N K I)
  SLS (stochastic local search)      General                O(N K I)
23
Learning intractable CRFs
• We can use approximate inference and hope the gradient is “good enough”.
  – If we use max-product, we are doing “Viterbi training” (cf. the perceptron rule)
• Or we can use other techniques, such as pseudo-likelihood, which does not need inference.
24
Pseudo-likelihood
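Pseudo-likelihood replaces the conditional likelihood with a product of per-node full conditionals (a standard form, shown here for reference):

  ℓ_PL(w) = Σ_n Σ_i log p(y_{n,i} | y_{n,N(i)}, x_n, w)

where N(i) are the neighbours of node i. Each factor normalizes over only the K states of a single node, so no global inference is needed; pseudo-likelihood is consistent but typically less statistically efficient than maximum likelihood.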
25
Software for inference and learning in 1D CRFs
• Various packages
  – Mallet (McCallum et al) – Java
  – Crf.sourceforge.net (Sarawagi, Cohen) – Java
  – My code – matlab (just a toy, not integrated with BNT)
  – Ben Taskar says he will soon release his Max Margin Markov net code (which uses LP for inference and QP for learning)
• Nothing standard; emphasis on NLP apps
26
Software for inference in general CRFs/ MRFs
• Max-product: C++ code for GC, BP, TRP and ICM (for Lattice2) by Rick Szeliski et al
  – “A comparative study of energy minimization methods for MRFs”, Rick Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, Carsten Rother
• Sum-product for Gaussian MRFs: GMRFlib, C code by Håvard Rue (exact inference)
• Sum-product: various other ad hoc pieces
  – My matlab BP code (MRF2)
  – Rivasseau’s C++ code for BP, Gibbs, tree-sampling (factor graphs)
  – Meltzer’s C++ code for BP, GBP, Gibbs, MF (Lattice2)
27
Software for learning general MRFs/CRFs
• Hardly any!
  – Parise’s matlab code (approx gradient, pseudo likelihood, CD, etc)
  – My matlab code (IPF, approx gradient – just a toy, not integrated with BNT)
28
Structure of ideal toolbox
[Block diagram: a generator/GUI/file produces trainData and testData; a learnEngine trains a model; the model plus an infEngine answers queries, returning a probDist or N-best list; a decisionEngine combines this with utilities to make decisions; outputs are visualized/summarized to assess performance.]
29
Structure of BNT
[The same block diagram, annotated with what BNT provides: models = graphs + CPDs; learnEngines = EM, structural EM; infEngines = BP, jtree, MCMC, variable elimination; queries = node ids (cell arrays); probDist = arrays, Gaussians, or samples (N-best with N=1 gives the MAP); decision making via LIMIDs solved with jtree/varElim, returning a policy; data generation/GUI by LeRay and Shan.]
30
Outline
• Discriminative models for iid data
• Beyond iid data: conditional random fields
• Beyond supervised learning: generative models
• Beyond optimization: Bayesian models
31
Unsupervised learning: why?
• Labeling data is time-consuming.
• Often not clear what label to use.
• Complex objects often not describable with a single discrete label.
• Humans learn without labels.
• Want to discover novel patterns/ structure.
32
Unsupervised learning: what?
• Clusters (eg GMM)
• Low dim manifolds (eg PCA)
• Graph structure (eg biology, social networks)
• “Features” (eg maxent models of language and texture)
• “Objects” (eg sprite models in vision)
33
Unsupervised learning of objects from video
Frey and Jojic; Williams and Titsias; et al.
34
Unsupervised learning: issues
• Objective function not as obvious as in supervised learning. Usually try to maximize likelihood (measure of data compression).
• Local minima (non convex objective).
• Uses inference as a subroutine (can be slow, though no worse than for discriminative learning)
35
Unsupervised learning: how?
• Construct a generative model (eg a Bayes net).
• Perform inference.
• May have to use approximations such as maximum likelihood and BP.
• Cannot use max likelihood for model selection…
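As a small end-to-end example of this recipe (construct a generative model, fit it by approximate maximum likelihood), here is a matlab sketch of EM for a Gaussian mixture with spherical covariances; it is an illustrative implementation, not code from BNT or any other toolbox:

function [mu, sigma2, mix] = gmmEM(X, C, numIter)
  % EM for a Gaussian mixture with spherical covariances.
  % X is N-by-D, C the number of clusters.  (Illustrative sketch.)
  [N, D] = size(X);
  mu = X(randperm(N, C), :);              % initialize means at random data points
  sigma2 = ones(C, 1) * var(X(:));        % per-cluster spherical variance
  mix = ones(C, 1) / C;                   % mixing weights
  for iter = 1:numIter
    % E step: responsibilities r(n,c) = p(z_n = c | x_n)
    logR = zeros(N, C);
    for c = 1:C
      d2 = sum((X - ones(N, 1) * mu(c, :)).^2, 2);
      logR(:, c) = log(mix(c)) - 0.5 * D * log(2 * pi * sigma2(c)) - d2 / (2 * sigma2(c));
    end
    logR = logR - max(logR, [], 2) * ones(1, C);   % stabilize before exponentiating
    R = exp(logR);
    R = R ./ (sum(R, 2) * ones(1, C));
    % M step: re-estimate parameters from weighted data
    Nc = sum(R, 1)';
    mix = Nc / N;
    for c = 1:C
      mu(c, :) = (R(:, c)' * X) / Nc(c);
      d2 = sum((X - ones(N, 1) * mu(c, :)).^2, 2);
      sigma2(c) = (R(:, c)' * d2) / (D * Nc(c));
    end
  end
end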
36
A comparison of BN software
www.ai.mit.edu/~murphyk/Software/Bayes/bnsoft.html
37
Popular BN software
• BNT (matlab)
• Intel’s PNL (C++)
• Hugin (commercial)
• Netica (commercial)
• GMTk (free .exe from Jeff Bilmes)
38
Outline
• Discriminative models for iid data
• Beyond iid data: conditional random fields
• Beyond supervised learning: generative models
• Beyond optimization: Bayesian models
39
Bayesian inference: why?
• It is optimal.
• It can easily incorporate prior knowledge (esp. useful for small n, large p problems).
• It properly reports confidence in output (useful for combining estimates, and for risk-averse applications).
• It separates models from algorithms.
40
Bayesian inference: how?
• Since we want to integrate, we cannot use max-product.
• Since the unknown parameters are continuous, we cannot use sum-product.
• But we can use EP (expectation propagation), which is similar to BP.
• We can also use variational inference.
• Or MCMC (eg Gibbs sampling).
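As one concrete MCMC option (random-walk Metropolis rather than the Gibbs sampler mentioned above), a matlab sketch; logPost, stepSize, and the other names are assumptions, not a toolbox API:

function samples = metropolisSample(logPost, theta0, numSamples, stepSize)
  % Random-walk Metropolis sampler: draws from a posterior given only an
  % unnormalized log-posterior handle logPost.  (Illustrative sketch;
  % in practice one would discard an initial burn-in and tune stepSize.)
  D = numel(theta0);
  samples = zeros(numSamples, D);
  theta = theta0(:)';
  lp = logPost(theta);
  for s = 1:numSamples
    prop = theta + stepSize * randn(1, D);   % symmetric Gaussian proposal
    lpProp = logPost(prop);
    if log(rand) < lpProp - lp               % accept with prob min(1, ratio)
      theta = prop;  lp = lpProp;
    end
    samples(s, :) = theta;
  end
end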
41
General purpose Bayesian software
• BUGS (Gibbs sampling)
• VIBES (variational message passing)
• Minka and Winn’s toolbox (infer.net)
42
Structure of ideal Bayesian toolbox
[The same block diagram as the ideal toolbox on slide 28: generator/GUI/file → trainData/testData; learnEngine → model; model + infEngine → probDist over queries; decisionEngine + utilities → decision; visualize/summarize → performance.]