Bayes Nets
• Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering -- uncertainty and complexity -- and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity -- a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data. The graph theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly-interacting sets of variables as well as a data structure that lends itself naturally to the design of efficient general-purpose algorithms.
• Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism -- examples include mixture models, factor analysis, hidden Markov models, Kalman filters and Ising models. The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism. This view has many advantages -- in particular, specialized techniques that have been developed in one field can be transferred between research communities and exploited more widely. Moreover, the graphical model formalism provides a natural framework for the design of new systems.
-- Michael Jordan
Graphical Models
• Representation:
  – Efficiently represent a joint probability distribution
• Inference:
  – Infer hidden states of the system, given data
• Learning:
  – Estimate the parameters and structure of the model from data
• Applications
Example 1
• Pearl's (1988, p. 49) example: Sherlock Holmes is at work.
• His neighbor Mr. Watson, a practical joker, has called to say that Holmes's alarm at home has sounded.
• Should Sherlock rush home?
• If the alarm really has sounded, it may be because of a burglary or because of an earthquake. If he hears a radio report of an earthquake, his degree of confidence that there was a burglary will diminish.
What is a Bayes (belief) net?
Compact representation of joint probability distributions via conditional independence
Qualitative part: directed acyclic graph (DAG)
• Nodes - random vars.
• Edges - direct influence
Quantitative part: set of conditional probability distributions
[Figure: DAG over Earthquake, Radio, Burglary, Alarm, Call, with the CPT $P(A \mid E, B)$ for the family of Alarm; its rows read 0.9/0.1, 0.2/0.8, 0.01/0.99, 0.9/0.1. Figure from N. Friedman]
Together: define a unique distribution in a factored form
$P(B, E, A, C, R) = P(B)\,P(E)\,P(A \mid B, E)\,P(R \mid E)\,P(C \mid A)$
What is a Bayes net?
[Figure: DAG over Earthquake, Radio, Burglary, Alarm, Call]
A node is conditionally independent of its non-descendants given its parents, e.g.
$C \perp R, B, E \mid A$
Hence: from $2^5 - 1 = 31$ parameters to $1 + 1 + 2 + 4 + 2 = 10$
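The parameter count can be checked with a short sketch. It assumes every variable is binary and reads the parent sets off the factorization P(B) P(E) P(A | B, E) P(R | E) P(C | A); the function names are illustrative only.

```python
# Parent sets read off the factorization of the burglary network.
# Assumption: every variable is binary, so a node with k parents needs
# 2**k free parameters (one P(node=1 | parent configuration) per row).
parents = {
    "B": [],
    "E": [],
    "A": ["B", "E"],
    "R": ["E"],
    "C": ["A"],
}


def full_joint_params(n_vars):
    """Free parameters of an unrestricted joint over n binary variables."""
    return 2 ** n_vars - 1


def factored_params(parent_sets):
    """Free parameters of the factored form: one CPT row per parent configuration."""
    return sum(2 ** len(ps) for ps in parent_sets.values())


print(full_joint_params(len(parents)))  # 31
print(factored_params(parents))         # 10
```

The gap widens quickly: the full joint grows as 2^n, while the factored form grows with the size of the largest family.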
Why are Bayes nets useful?
- Graph structure supports
  - Modular representation of knowledge
  - Local, distributed algorithms for inference and learning
  - Intuitive (possibly causal) interpretation
- Factored representation may have exponentially fewer parameters than full joint $P(X_1, \ldots, X_n)$ =>
  - lower sample complexity (less data for learning)
  - lower time complexity (less time for inference)
What can Bayes nets be used for?
• Posterior probabilities
  – Probability of any event given any evidence
• Most likely explanation
  – Scenario that explains evidence
• Rational decision making
  – Maximize expected utility
  – Value of Information
• Effect of intervention
  – Causal analysis
[Figure: the burglary network (Earthquake, Radio, Burglary, Alarm, Call), with Radio and Call marked, illustrating the explaining away effect. Figure from N. Friedman]
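Explaining away can be made concrete by brute-force enumeration over the burglary network. The sketch below is only an illustration: every probability in it is an assumed, made-up value (the slides do not give a full set of CPTs), and `posterior_burglary` is a hypothetical helper name.

```python
from itertools import product

# All numbers are illustrative assumptions, not values from the slides.
P_B = 0.01                      # P(Burglary = 1)
P_E = 0.02                      # P(Earthquake = 1)
P_A = {(1, 1): 0.95, (1, 0): 0.9, (0, 1): 0.2, (0, 0): 0.01}  # P(A=1 | B, E)
P_R = {1: 0.9, 0: 0.001}        # P(Radio = 1 | E)
P_C = {1: 0.8, 0: 0.05}         # P(Call = 1 | A)


def bern(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one


def joint(b, e, a, r, c):
    """Factored joint P(B) P(E) P(A | B, E) P(R | E) P(C | A)."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_R[e], r) * bern(P_C[a], c))


def posterior_burglary(evidence):
    """P(B = 1 | evidence), summing the joint over all consistent worlds."""
    names = ("b", "e", "a", "r", "c")
    num = den = 0.0
    for values in product((0, 1), repeat=len(names)):
        world = dict(zip(names, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(**world)
        den += p
        if world["b"] == 1:
            num += p
    return num / den


p_call = posterior_burglary({"c": 1})
p_call_radio = posterior_burglary({"c": 1, "r": 1})
print(p_call, p_call_radio)  # the second value is smaller
```

With these assumed numbers, hearing the radio report makes an earthquake the likely cause of the alarm, so the posterior probability of a burglary drops.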
Example 2
Wet example cont’d
A real Bayes net: Alarm
Domain: Monitoring Intensive-Care Patients
• 37 variables
• 509 parameters …instead of $2^{37}$
[Figure: the ALARM network, with nodes including PCWP, CO, HR, BP, HREKG, HRSAT, CATECHOL, SAO2, EXPCO2, ARTCO2, VENTALV, VENTLUNG, DISCONNECT, MINVOLSET, VENTMACH, KINKEDTUBE, INTUBATION, PULMEMBOLUS, PAP, SHUNT, ANAPHYLAXIS, PVSAT, FIO2, PRESS, TPR, LVFAILURE, HYPOVOLEMIA, CVP. Figure from N. Friedman]
More real-world BN applications
• “Microsoft’s competitive advantage lies in its expertise in Bayesian networks” -- Bill Gates, quoted in LA Times, 1996
• MS Answer Wizards, (printer) troubleshooters
• Medical diagnosis
• Genetic pedigree analysis
• Speech recognition (HMMs)
• Gene sequence/expression analysis
• Turbocodes (channel coding)
Dealing with time
• In many systems, data arrives sequentially
• Dynamic Bayes nets (DBNs) can be used to model such time-series (sequence) data
• Special cases of DBNs include
  – State-space models
  – Hidden Markov models (HMMs)
State-space model (SSM)/Linear Dynamical System (LDS)
[Figure: chain X1 → X2 → X3 of “true” states, each emitting a noisy observation Y1, Y2, Y3]
Example: LDS for 2D tracking
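[Figure: the LDS chain X1, X2, X3 with observations Y1, Y2, Y3, alongside a graph over the individual state components (X1, X2) and observation components (y1, y2)]
Sparse linear Gaussian systems ⇒ sparse graphs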
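The link between sparse matrices and sparse graphs can be sketched for a 2D tracker. The constant-velocity model below is an assumption (the slide's figure does not survive), as are the state layout, the time step `DT`, and the helper name `edges`: each nonzero entry of the transition or observation matrix becomes one edge between scalar components.

```python
DT = 1.0  # assumed time step

# Hypothetical state layout: 2D position and velocity; observations are noisy positions.
STATE = ["x", "y", "vx", "vy"]
OBS = ["ox", "oy"]

# Assumed constant-velocity dynamics: X[t] = A X[t-1] + noise.
A = [
    [1, 0, DT, 0],
    [0, 1, 0, DT],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
# Observation model: Y[t] = C X[t] + noise (only positions are observed).
C = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
]


def edges(matrix, src_names, dst_names, src_tag, dst_tag):
    """One directed edge per nonzero weight: column index = parent, row index = child."""
    return [
        (f"{src_names[j]}{src_tag}", f"{dst_names[i]}{dst_tag}")
        for i, row in enumerate(matrix)
        for j, weight in enumerate(row)
        if weight != 0
    ]


transition_edges = edges(A, STATE, STATE, "[t-1]", "[t]")
observation_edges = edges(C, STATE, OBS, "[t]", "[t]")

print(len(transition_edges), "of", len(STATE) ** 2, "possible transition edges")
print(len(observation_edges), "of", len(STATE) * len(OBS), "possible observation edges")
```

Under these assumptions only 6 of 16 transition edges and 2 of 8 observation edges are present, so the component-level graph is much sparser than the fully connected one.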
Hidden Markov model (HMM)
[Figure: chain X1 → X2 → X3 of hidden states (phones/words) linked by a transition matrix, each emitting an observation Y1, Y2, Y3 (acoustic signal) with Gaussian observations]
Sparse transition matrix ⇒ sparse graph
Probabilistic graphical models
[Figure: taxonomy. Probabilistic models contain graphical models, which split into directed and undirected models]
• Directed (Bayesian belief nets): Alarm network, State-space models, HMMs, Naïve Bayes classifier, PCA/ICA
• Undirected (Markov nets): Markov Random Field, Boltzmann machine, Ising model, Max-ent model, Log-linear models
Many pattern recognition methods are instances of graphical models
Inference
• Posterior probabilities
  – Probability of any event given any evidence
• Most likely explanation
  – Scenario that explains evidence
• Rational decision making
  – Maximize expected utility
  – Value of Information
• Effect of intervention
  – Causal analysis
[Figure: the burglary network (Earthquake, Radio, Burglary, Alarm, Call), with Radio and Call marked, illustrating the explaining away effect. Figure from N. Friedman]
Wet inference
• Q: Is grass wet due to sprinkler or rain?
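The question can be answered by enumeration once a network and CPTs are fixed. The slide gives neither, so the sketch below assumes the commonly used sprinkler network (Cloudy → Sprinkler, Cloudy → Rain, Sprinkler and Rain → WetGrass) with textbook-style numbers; treat both the structure and the values as assumptions.

```python
from itertools import product

# Assumed structure and CPT values; none of these come from the slides.
P_C = 0.5                                   # P(Cloudy = 1)
P_S = {0: 0.5, 1: 0.1}                      # P(Sprinkler = 1 | Cloudy)
P_R = {0: 0.2, 1: 0.8}                      # P(Rain = 1 | Cloudy)
P_W = {(0, 0): 0.0, (1, 0): 0.9,
       (0, 1): 0.9, (1, 1): 0.99}           # P(WetGrass = 1 | Sprinkler, Rain)


def bern(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one


def joint(c, s, r, w):
    """Factored joint P(C) P(S | C) P(R | C) P(W | S, R)."""
    return (bern(P_C, c) * bern(P_S[c], s)
            * bern(P_R[c], r) * bern(P_W[(s, r)], w))


def posterior(query, evidence):
    """P(query = 1 | evidence) by summing the joint over all consistent worlds."""
    names = ("c", "s", "r", "w")
    num = den = 0.0
    for values in product((0, 1), repeat=len(names)):
        world = dict(zip(names, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(**world)
        den += p
        if world[query] == 1:
            num += p
    return num / den


print(posterior("s", {"w": 1}))  # about 0.430
print(posterior("r", {"w": 1}))  # about 0.708
```

Under these assumed numbers, rain is the more likely explanation for wet grass. Enumeration costs time exponential in the number of variables, which is what the more efficient algorithms aim to avoid.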
Efficient Evaluation
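Kalman filtering as a Belief Network
[Figure: chain X1 → X2 → X3 with observations Y1, Y2, Y3]
Estimate $P(X_t \mid y_{1:t})$ from $P(X_{t-1} \mid y_{1:t-1})$ and $y_t$
• Predict: $P(X_t \mid y_{1:t-1}) = \int P(X_t \mid X_{t-1})\, P(X_{t-1} \mid y_{1:t-1})\, dX_{t-1}$
• Update: $P(X_t \mid y_{1:t}) \propto P(y_t \mid X_t)\, P(X_t \mid y_{1:t-1})$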
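For a scalar linear-Gaussian model the predict and update steps have a closed form. The sketch below assumes the model x_t = a x_{t-1} + w with w ~ N(0, q) and y_t = c x_t + v with v ~ N(0, r); the parameter names and default values are illustrative, not taken from the slides.

```python
def kalman_step(mean, var, y, a=1.0, c=1.0, q=0.1, r=1.0):
    """One predict/update cycle for a scalar linear-Gaussian state-space model.

    mean, var describe the Gaussian P(X[t-1] | y[1:t-1]);
    the return value describes the Gaussian P(X[t] | y[1:t]).
    """
    # Predict: push the previous belief through the dynamics.
    mean_pred = a * mean
    var_pred = a * a * var + q

    # Update: weigh the prediction against the new observation y.
    gain = var_pred * c / (c * c * var_pred + r)
    mean_new = mean_pred + gain * (y - c * mean_pred)
    var_new = (1.0 - gain * c) * var_pred
    return mean_new, var_new


# Usage: filter a short, made-up observation sequence.
mean, var = 0.0, 1.0
for y in [1.2, 0.9, 1.1]:
    mean, var = kalman_step(mean, var, y)
    print(round(mean, 3), round(var, 3))
```

Each step only needs the previous belief and the current observation, which is the recursion stated above.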
Belief Propagation
[Figure: two passes over a tree. Collect: messages are sent toward the root. Distribute: messages are sent out from the root. Figure from P. Green]
Generalization of forwards-backwards algo. / RTS smoother from chains to trees - linear time, two-pass algorithm
aka Pearl’s algorithm, sum-product algorithm
Message passing view of forwards algorithm
[Figure: chain X_{t-1} → X_t → X_{t+1} with observations Y_{t-1}, Y_t, Y_{t+1}; messages labeled a_{t|t-1}, b_t, b_{t+1}]
Forwards-backwards algorithm
[Figure: chain X_{t-1} → X_t → X_{t+1} with observations Y_{t-1}, Y_t, Y_{t+1}; messages labeled a_{t|t-1}, b_t, b_t]
Discrete analog of RTS smoother
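A minimal sketch of the two-pass algorithm for a discrete HMM follows. The argument layout (`prior`, `trans`, `obs_lik`) and the per-step normalization are choices made for this illustration, not something the slides specify.

```python
def forward_backward(prior, trans, obs_lik):
    """Filtered and smoothed state marginals for a discrete HMM.

    prior[i]      = P(X_1 = i)
    trans[i][j]   = P(X_{t+1} = j | X_t = i)
    obs_lik[t][i] = P(y_t | X_t = i) for the observed y_t
    Returns (alpha, gamma) with alpha[t][i] = P(X_t = i | y_{1:t})
    and gamma[t][i] = P(X_t = i | y_{1:T}).
    """
    T, K = len(obs_lik), len(prior)

    # Forward pass: predict, then weight by the local evidence and normalize.
    alpha = []
    pred = list(prior)
    for t in range(T):
        unnorm = [pred[i] * obs_lik[t][i] for i in range(K)]
        z = sum(unnorm)
        alpha.append([u / z for u in unnorm])
        pred = [sum(alpha[t][i] * trans[i][j] for i in range(K)) for j in range(K)]

    # Backward pass: beta[t][i] is proportional to P(y_{t+1:T} | X_t = i).
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        unnorm = [
            sum(trans[i][j] * obs_lik[t + 1][j] * beta[t + 1][j] for j in range(K))
            for i in range(K)
        ]
        z = sum(unnorm)
        beta[t] = [u / z for u in unnorm]

    # Combine both messages into smoothed marginals.
    gamma = []
    for t in range(T):
        unnorm = [alpha[t][i] * beta[t][i] for i in range(K)]
        z = sum(unnorm)
        gamma.append([u / z for u in unnorm])
    return alpha, gamma


# Usage with made-up numbers: two states, three observations.
alpha, gamma = forward_backward(
    prior=[0.6, 0.4],
    trans=[[0.7, 0.3], [0.2, 0.8]],
    obs_lik=[[0.9, 0.2], [0.1, 0.6], [0.5, 0.5]],
)
print(gamma)
```

Both passes touch each time step once, so the cost is linear in the sequence length. At the final step the smoothed and filtered marginals coincide, since no future evidence remains.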
BP: parallel, distributed version
[Figure: a graph over X1, X2, X3, X4 shown at two stages of parallel message passing, Stage 1 and Stage 2]
Inference in general graphs
• BP is only guaranteed to be correct for trees
• A general graph should be converted to a junction tree, by clustering nodes
• Computational complexity is exponential in the size of the resulting clusters (exact inference is NP-hard in general)
Approximate inference
• Why?
  – To avoid exponential complexity of exact inference in discrete loopy graphs
  – Because messages cannot be computed in closed form (even for trees) in the non-linear/non-Gaussian case
• How?
  – Deterministic approximations: loopy BP, mean field, structured variational, etc.
  – Stochastic approximations: MCMC (Gibbs sampling), likelihood weighting, particle filtering, etc.
- Algorithms make different speed/accuracy tradeoffs
- Should provide the user with a choice of algorithms
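As one example of a stochastic approximation, the sketch below runs likelihood weighting on a sprinkler-style network. The structure (Cloudy → Sprinkler, Cloudy → Rain, Sprinkler and Rain → WetGrass), the CPT values, and the function name are assumptions for illustration; the evidence is WetGrass = 1 and the query is Rain.

```python
import random

# Assumed structure and CPT values; none of these come from the slides.
P_C = 0.5                                   # P(Cloudy = 1)
P_S = {0: 0.5, 1: 0.1}                      # P(Sprinkler = 1 | Cloudy)
P_R = {0: 0.2, 1: 0.8}                      # P(Rain = 1 | Cloudy)
P_W = {(0, 0): 0.0, (1, 0): 0.9,
       (0, 1): 0.9, (1, 1): 0.99}           # P(WetGrass = 1 | Sprinkler, Rain)


def likelihood_weighting(n_samples, seed=0):
    """Estimate P(Rain = 1 | WetGrass = 1).

    Non-evidence variables are sampled from their CPTs in topological order;
    each sample is weighted by the likelihood of the evidence given its parents.
    """
    rng = random.Random(seed)
    weighted_rain = 0.0
    total_weight = 0.0
    for _ in range(n_samples):
        c = 1 if rng.random() < P_C else 0
        s = 1 if rng.random() < P_S[c] else 0
        r = 1 if rng.random() < P_R[c] else 0
        weight = P_W[(s, r)]          # evidence WetGrass = 1 is clamped, not sampled
        total_weight += weight
        weighted_rain += weight * r
    return weighted_rain / total_weight


print(likelihood_weighting(20000))
```

With these assumed CPTs the exact answer is 0.4581 / 0.6471, roughly 0.708; the estimate approaches it as the sample count grows, which illustrates the speed/accuracy tradeoff.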
Learning
• Parameter estimation
• Model selection (structure learning)
Parameter learning
[Figure: iid data (table below) used to estimate Conditional Probability Tables (CPTs). Figure from M. Jordan]
X1  X2  X3  X4  X5  X6
0   1   0   0   0   0
1   1   1   0   1   1
…
1   ?   1   1   ?   1
If some values are missing (latent variables), we must use some method (e.g. gradient descent or EM) to compute the (locally) maximum likelihood estimates
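When every value is observed, the maximum likelihood CPT entries are just normalized counts. The sketch below assumes binary variables and rows stored as dicts; `fit_cpt` and the toy data are hypothetical. It does not handle missing values, which is the case that calls for EM or gradient methods.

```python
from collections import Counter


def fit_cpt(data, child, parents):
    """ML estimate of P(child = 1 | parents) from fully observed binary rows.

    data is a list of dicts mapping variable name -> 0 or 1.
    Returns a dict keyed by the tuple of parent values seen in the data.
    """
    ones = Counter()
    totals = Counter()
    for row in data:
        key = tuple(row[p] for p in parents)
        totals[key] += 1
        ones[key] += row[child]
    return {key: ones[key] / totals[key] for key in totals}


# Usage on made-up data: A is on in 3 of 4 rows where B=1, E=0, and never otherwise.
data = (
    [{"B": 1, "E": 0, "A": 1}] * 3
    + [{"B": 1, "E": 0, "A": 0}]
    + [{"B": 0, "E": 0, "A": 0}] * 4
)
print(fit_cpt(data, "A", ["B", "E"]))  # {(1, 0): 0.75, (0, 0): 0.0}
```

Parent configurations that never occur in the data get no entry here; a prior (pseudo-counts) would be one way to fill them, which this sketch leaves out.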
Structure learning (which nodes are connected)
[Figure: gene expression data used to infer a genetic pathway. Figure from N. Friedman]
Structure learning
• Learning the optimal structure is NP-hard (except for trees)
• Hence use heuristic search through space of DAGs or PDAGs or node orderings
• Search algorithms: hill climbing, simulated annealing, GAs
• Scoring function is often marginal likelihood, or an approximation like BIC/MDL or AIC
Structural complexity penalty
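A BIC-style score makes the penalty explicit: the maximized log-likelihood minus (d / 2) log N, where d is the number of free parameters and N the number of data cases. The sketch below assumes binary variables and fully observed data; `bic_score` and the toy data sets are illustrative.

```python
import math
from collections import Counter


def bic_score(data, parents):
    """BIC of a candidate DAG on fully observed binary data.

    data: list of dicts mapping variable name -> 0 or 1.
    parents: dict mapping each variable to its list of parents.
    Score = max log-likelihood - 0.5 * (number of free parameters) * log(N).
    """
    n = len(data)
    loglik = 0.0
    n_params = 0
    for child, ps in parents.items():
        totals = Counter()
        ones = Counter()
        for row in data:
            key = tuple(row[p] for p in ps)
            totals[key] += 1
            ones[key] += row[child]
        # Log-likelihood at the ML estimate (counts / totals); zero counts add nothing.
        for key, total in totals.items():
            for count in (ones[key], total - ones[key]):
                if count > 0:
                    loglik += count * math.log(count / total)
        n_params += 2 ** len(ps)  # structural complexity: one parameter per CPT row
    return loglik - 0.5 * n_params * math.log(n)


# Made-up data: in `copied`, B always equals A; in `independent`, they are unrelated.
copied = [{"A": v, "B": v} for v in (0, 1) for _ in range(50)]
independent = [{"A": a, "B": b} for a in (0, 1) for b in (0, 1) for _ in range(25)]

with_edge = {"A": [], "B": ["A"]}
no_edge = {"A": [], "B": []}

print(bic_score(copied, with_edge) > bic_score(copied, no_edge))            # True
print(bic_score(independent, with_edge) > bic_score(independent, no_edge))  # False
```

The extra edge wins only when the gain in likelihood outweighs the penalty, so a search procedure using this score prefers the simpler graph when the data show no dependence.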