Bayes Nets - University of Minnesota (vision.psych.umn.edu/users/schrater/schrater_lab/courses/...)

Bayes Nets

• Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering -- uncertainty and complexity -- and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity -- a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data. The graph theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly-interacting sets of variables as well as a data structure that lends itself naturally to the design of efficient general-purpose algorithms.

• Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism -- examples include mixture models, factor analysis, hidden Markov models, Kalman filters and Ising models. The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism. This view has many advantages -- in particular, specialized techniques that have been developed in one field can be transferred between research communities and exploited more widely. Moreover, the graphical model formalism provides a natural framework for the design of new systems.

Michael Jordan

Graphical Models

• Representation:
  – Efficiently represent a joint probability distribution
• Inference:
  – Infer hidden states of the system, given data
• Learning:
  – Estimate the parameters and structure of the model from data
• Applications

Example 1

• Pearl's (1988, p. 49) example: Sherlock Holmes is at work.
• His neighbor Mr. Watson, a practical joker, has called to say that Holmes's home alarm has sounded.
• Should Sherlock rush home?
• If the alarm really has sounded, it may be because of a burglary or because of an earthquake. If he hears a radio report of an earthquake, his degree of confidence that there was a burglary will diminish.

What is a Bayes (belief) net?

Compact representation of joint probability distributions via conditional independence

Qualitative part: directed acyclic graph (DAG)
• Nodes - random vars.
• Edges - direct influence

Quantitative part: set of conditional probability distributions

[Figure from N. Friedman: DAG over Earthquake, Radio, Burglary, Alarm and Call, with edges Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Call. Beside the family of Alarm is the table P(A | E,B), one row per setting of (E, B), with entries (0.9, 0.1), (0.2, 0.8), (0.01, 0.99), (0.9, 0.1); the row labels are not legible.]

Together: define a unique distribution in a factored form

P(B, E, A, C, R) = P(B) P(E) P(A | B,E) P(R | E) P(C | A)
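The factored form can be checked numerically. The sketch below is illustrative only: the factorization is the one on the slide, but every number is an assumption (the P(A | B,E) entries reuse the slide's values with an assumed pairing of rows to parent settings, and the other tables are invented for the example).

```python
from itertools import product

# All numbers are illustrative assumptions; only the factorization is from the slide.
P_B = 0.01                                   # P(B = true)
P_E = 0.02                                   # P(E = true)
P_A = {(True, True): 0.9, (True, False): 0.9,    # P(A = true | B, E), keyed by (b, e)
       (False, True): 0.2, (False, False): 0.01}
P_R = {True: 0.95, False: 0.001}             # P(R = true | E)
P_C = {True: 0.7, False: 0.05}               # P(C = true | A)

def bern(p_true, value):
    """Probability of a binary value given P(value = true)."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, c, r):
    """P(B,E,A,C,R) = P(B) P(E) P(A|B,E) P(R|E) P(C|A)."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_R[e], r) * bern(P_C[a], c))

# Summing the product of local tables over all 2^5 assignments gives 1,
# so the local tables together define a valid joint distribution.
total = sum(joint(*x) for x in product([True, False], repeat=5))
```

Because each factor is itself a normalized conditional distribution, the product is normalized without any global constant.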

What is a Bayes net?

[Figure: the same network, with edges Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Call]

A node is conditionally independent of its ancestors given its parents, e.g.

C ⊥ R, B, E | A

Hence: from 2^5 – 1 = 31 parameters to 1+1+2+4+2 = 10
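The parameter count can be reproduced with a few lines. This is a minimal sketch assuming all five variables are binary, so a node with k parents needs 2^k free parameters; the `parents` dictionary is a hypothetical encoding of the burglary network.

```python
# Parent sets of the burglary network (B, E, A, R, C), all variables assumed binary.
parents = {"B": [], "E": [], "A": ["B", "E"], "R": ["E"], "C": ["A"]}

def full_joint_params(n_vars):
    """Free parameters of an unrestricted joint over n binary variables."""
    return 2 ** n_vars - 1

def factored_params(parent_sets):
    """One free parameter per parent configuration of each binary node."""
    return sum(2 ** len(ps) for ps in parent_sets.values())

n_full = full_joint_params(len(parents))   # 2^5 - 1
n_factored = factored_params(parents)      # 1 + 1 + 4 + 2 + 2
```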

Why are Bayes nets useful?

- Graph structure supports
  - Modular representation of knowledge
  - Local, distributed algorithms for inference and learning
  - Intuitive (possibly causal) interpretation
- Factored representation may have exponentially fewer parameters than full joint P(X1,…,Xn) =>
  - lower sample complexity (less data for learning)
  - lower time complexity (less time for inference)

What can Bayes nets be used for?

• Posterior probabilities
  – Probability of any event given any evidence
• Most likely explanation
  – Scenario that explains evidence
• Rational decision making
  – Maximize expected utility
  – Value of Information
• Effect of intervention
  – Causal analysis

[Figure: the Earthquake / Radio / Burglary / Alarm / Call network, with Radio and Call labeled separately]

Explaining away effect
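Explaining away can be illustrated by brute-force enumeration over the joint. The sketch below is not an efficient inference method, and all probabilities are assumed values chosen for illustration; `posterior` is a hypothetical helper name.

```python
from itertools import product

# Assumed, illustrative parameters for the burglary network.
P_A = {(True, True): 0.9, (True, False): 0.9,    # P(A = true | B, E), keyed by (b, e)
       (False, True): 0.2, (False, False): 0.01}

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(w):
    """P(B) P(E) P(A|B,E) P(R|E) P(C|A) for an assignment w (dict of booleans)."""
    return (bern(0.01, w["B"]) * bern(0.02, w["E"])
            * bern(P_A[(w["B"], w["E"])], w["A"])
            * bern(0.95 if w["E"] else 0.001, w["R"])
            * bern(0.7 if w["A"] else 0.05, w["C"]))

def posterior(query, evidence):
    """P(query = true | evidence), summing the joint over all consistent worlds."""
    num = den = 0.0
    for values in product([True, False], repeat=5):
        w = dict(zip("BEACR", values))
        if all(w[k] == v for k, v in evidence.items()):
            p = joint(w)
            den += p
            if w[query]:
                num += p
    return num / den

p_call = posterior("B", {"C": True})                    # Watson called
p_call_radio = posterior("B", {"C": True, "R": True})   # ... and the radio reports a quake
```

With these assumed numbers the call raises the probability of a burglary well above its prior of 0.01 (to roughly 0.10), and the radio report then lowers it again (to roughly 0.04): the earthquake explains away the alarm.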

Example 2

Wet example cont’d

A real Bayes net: Alarm

Domain: Monitoring Intensive-Care Patients
• 37 variables
• 509 parameters … instead of 2^37

[Figure: the ALARM network. Node labels: PCWP, CO, HRBP, HREKG, HRSAT, ERRCAUTER, HR, HISTORY, CATECHOL, SAO2, EXPCO2, ARTCO2, VENTALV, VENTLUNG, VENTTUBE, DISCONNECT, MINVOLSET, VENTMACH, KINKEDTUBE, INTUBATION, PULMEMBOLUS, PAP, SHUNT, ANAPHYLAXIS, MINVOL, PVSAT, FIO2, PRESS, INSUFFANESTH, TPR, LVFAILURE, ERRLOWOUTPUT, STROKEVOLUME, LVEDVOLUME, HYPOVOLEMIA, CVP, BP]


More real-world BN applications

• “Microsoft’s competitive advantage lies in its expertise in Bayesian networks” -- Bill Gates, quoted in LA Times, 1996
• MS Answer Wizards, (printer) troubleshooters
• Medical diagnosis
• Genetic pedigree analysis
• Speech recognition (HMMs)
• Gene sequence/expression analysis
• Turbocodes (channel coding)

Dealing with time

• In many systems, data arrives sequentially
• Dynamic Bayes nets (DBNs) can be used to model such time-series (sequence) data
• Special cases of DBNs include
  – State-space models
  – Hidden Markov models (HMMs)

State-space model (SSM)/Linear Dynamical System (LDS)

[Figure: chain X1 → X2 → X3 of “true” states, each Xt emitting a noisy observation Yt]

Example: LDS for 2D tracking

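[Figure: the chain X1 → X2 → X3 with observations Y1, Y2, Y3, expanded into state components X1, X2 and observation components y1, y2 at each time step; observed points are marked o]

Sparse linear Gaussian systems => sparse graphs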
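The link between sparse matrices and sparse graphs can be made concrete. The sketch below assumes a constant-velocity model with state (x, y, vx, vy) and unit time step; the slide does not specify the model, so the matrices `A` and `C` and the helper names are assumptions for illustration.

```python
import random

# Assumed constant-velocity dynamics: next state = A * state + noise.
A = [[1, 0, 1, 0],   # x  <- x + vx
     [0, 1, 0, 1],   # y  <- y + vy
     [0, 0, 1, 0],   # vx <- vx
     [0, 0, 0, 1]]   # vy <- vy
C = [[1, 0, 0, 0],   # only the position is observed
     [0, 1, 0, 0]]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def simulate(s0, steps, q_std=0.0, r_std=0.0, seed=0):
    """Sample states and noisy position observations from the LDS."""
    rng = random.Random(seed)
    s, states, observations = list(s0), [], []
    for _ in range(steps):
        s = [v + rng.gauss(0.0, q_std) for v in matvec(A, s)]
        states.append(s)
        observations.append([v + rng.gauss(0.0, r_std) for v in matvec(C, s)])
    return states, observations

def edges(m, src, dst):
    """Each nonzero matrix entry corresponds to one edge src[j] -> dst[i]."""
    return [(src[j], dst[i]) for i, row in enumerate(m)
            for j, a in enumerate(row) if a != 0]

names = ["x", "y", "vx", "vy"]
dynamics_edges = edges(A, names, [n + "'" for n in names])
states, observations = simulate([0, 0, 1, 2], steps=3)
```

Here `A` has 6 nonzero entries out of 16, so the graph between consecutive time slices has 6 edges rather than 16.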

Hidden Markov model (HMM)

[Figure: chain X1 → X2 → X3 of hidden states (phones/words) governed by a transition matrix, each Xt emitting an observation Yt (acoustic signal) with Gaussian observations]

Sparse transition matrix => sparse graph

Probabilistic graphical models

Probabilistic models
  Graphical models
    Directed (Bayesian belief nets): Alarm network, state-space models, HMMs, Naïve Bayes classifier, PCA/ICA
    Undirected (Markov nets): Markov Random Field, Boltzmann machine, Ising model, max-ent model, log-linear models

Many pattern recognition methods are instances of graphical models

Inference

• Posterior probabilities
  – Probability of any event given any evidence
• Most likely explanation
  – Scenario that explains evidence
• Rational decision making
  – Maximize expected utility
  – Value of Information
• Effect of intervention
  – Causal analysis

[Figure: the Earthquake / Radio / Burglary / Alarm / Call network, with Radio and Call labeled separately]

Explaining away effect

Wet inference

• Q: Is grass wet due to sprinkler or rain?

Efficient Evaluation

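Kalman filtering as a Belief Network

[Figure: chain X1 → X2 → X3 with observations Y1, Y2, Y3]

Estimate P(X_t | y_{1:t}) from P(X_{t-1} | y_{1:t-1}) and y_t

• Predict: P(X_t | y_{1:t-1}) = ∫ P(X_t | X_{t-1}) P(X_{t-1} | y_{1:t-1}) dX_{t-1} (a sum over X_{t-1} for discrete states)
• Update: P(X_t | y_{1:t}) ∝ P(y_t | X_t) P(X_t | y_{1:t-1})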
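The predict/update recursion is easiest to see for a discrete state, where the marginalization over the previous state is a finite sum. The sketch below is that discrete analogue, not the Kalman filter itself (which carries Gaussian means and covariances); the function names and the two-state numbers are assumptions for illustration.

```python
def predict(belief, transition):
    """P(X_t | y_1:t-1): push the previous belief through the transition model.

    transition[i][j] is P(X_t = j | X_t-1 = i).
    """
    n = len(belief)
    return [sum(transition[i][j] * belief[i] for i in range(n)) for j in range(n)]

def update(predicted, likelihood):
    """P(X_t | y_1:t): reweight by P(y_t | X_t) and renormalize."""
    unnorm = [l * p for l, p in zip(likelihood, predicted)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Illustrative two-state example (all numbers assumed).
belief = [0.5, 0.5]                       # P(X_t-1 | y_1:t-1)
transition = [[0.9, 0.1], [0.2, 0.8]]
likelihood = [0.7, 0.1]                   # P(y_t | X_t = j) for the observed y_t

predicted = predict(belief, transition)   # [0.55, 0.45] up to rounding
filtered = update(predicted, likelihood)
```

Calling `predict` then `update` once per time step gives the filtering recursion; the normalizer `z` is the one-step predictive likelihood P(y_t | y_1:t-1).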

Belief Propagation

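[Figure from P. Green: two passes over a tree. Collect: messages flow from the leaves to the root. Distribute: messages flow from the root back to the leaves.]

Generalization of the forwards-backwards algorithm / RTS smoother from chains to trees: a linear-time, two-pass algorithm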

aka Pearl’s algorithm, sum-product algorithm

Message passing view of forwards algorithm

[Figure: chain X_{t-1} → X_t → X_{t+1} with observations Y_{t-1}, Y_t, Y_{t+1}; messages labeled a_{t|t-1}, b_t and b_{t+1}]

Forwards-backwards algorithm

[Figure: chain X_{t-1} → X_t → X_{t+1} with observations Y_{t-1}, Y_t, Y_{t+1}; messages labeled a_{t|t-1}, b_t and b_t]

Discrete analog of RTS smoother
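A compact sketch of the two-pass algorithm for a discrete chain follows. It assumes the observations have already been converted to per-step likelihood vectors, and it normalizes every message to avoid underflow; `forwards_backwards` and its argument names are hypothetical.

```python
def forwards_backwards(prior, transition, likelihoods):
    """Filtered and smoothed marginals for a discrete-state chain.

    prior[i]          = P(X_1 = i)
    transition[i][j]  = P(X_t+1 = j | X_t = i)
    likelihoods[t][i] = P(y_t | X_t = i) for the observed y_t
    """
    n, T = len(prior), len(likelihoods)

    def normalize(v):
        z = sum(v)
        return [x / z for x in v]

    # Forwards pass: alpha[t] = P(X_t | y_1:t)
    alpha, pred = [], prior
    for t in range(T):
        alpha.append(normalize([likelihoods[t][j] * pred[j] for j in range(n)]))
        pred = [sum(transition[i][j] * alpha[t][i] for i in range(n))
                for j in range(n)]

    # Backwards pass: beta[t][i] proportional to P(y_t+1:T | X_t = i)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = normalize([
            sum(transition[i][j] * likelihoods[t + 1][j] * beta[t + 1][j]
                for j in range(n))
            for i in range(n)])

    # Combine: gamma[t] = P(X_t | y_1:T)
    gamma = [normalize([a * b for a, b in zip(alpha[t], beta[t])])
             for t in range(T)]
    return alpha, gamma
```

Each pass touches every time step once, so the cost is linear in the chain length (and quadratic in the number of states per step).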

BP: parallel, distributed version

[Figure: a four-node graph over X1, X2, X3, X4, shown twice: Stage 1 and Stage 2 of message passing]

Inference in general graphs

• BP is only guaranteed to be correct for trees
• A general graph should be converted to a junction tree, by clustering nodes
• Computational complexity is exponential in the size of the resulting clusters (NP-hard)

Approximate inference

• Why?
  – To avoid the exponential complexity of exact inference in discrete loopy graphs
  – Because messages cannot be computed in closed form (even for trees) in the non-linear/non-Gaussian case
• How?
  – Deterministic approximations: loopy BP, mean field, structured variational, etc.
  – Stochastic approximations: MCMC (Gibbs sampling), likelihood weighting, particle filtering, etc.

- Algorithms make different speed/accuracy tradeoffs
- Should provide the user with a choice of algorithms
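As one example of a stochastic approximation, here is a minimal likelihood-weighting sketch for the burglary network, estimating P(B = true | C = true). All probabilities are assumed values for illustration. Radio is left out because it is neither evidence nor an ancestor of the query or the evidence, so it marginalizes away.

```python
import random

# Assumed, illustrative parameters.
P_A = {(True, True): 0.9, (True, False): 0.9,    # P(A = true | B, E), keyed by (b, e)
       (False, True): 0.2, (False, False): 0.01}

def likelihood_weighting(n_samples, seed=0):
    """Estimate P(B = true | C = true).

    Non-evidence nodes are sampled from their CPTs in topological order;
    the evidence node C is not sampled but contributes the weight P(C = true | a).
    """
    rng = random.Random(seed)
    weighted_b = total = 0.0
    for _ in range(n_samples):
        b = rng.random() < 0.01            # sample B from P(B)
        e = rng.random() < 0.02            # sample E from P(E)
        a = rng.random() < P_A[(b, e)]     # sample A from P(A | b, e)
        w = 0.7 if a else 0.05             # weight by P(C = true | a)
        total += w
        if b:
            weighted_b += w
    return weighted_b / total

estimate = likelihood_weighting(100_000)
```

The estimate converges to the exact posterior as the sample count grows; more samples buy accuracy at the cost of time, which is the speed/accuracy tradeoff named above.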

Learning

• Parameter estimation
• Model selection (structure learning)

Parameter learning

[Figure from M. Jordan: a graphical model over X1 … X6 with its Conditional Probability Tables (CPTs), estimated from iid data]

iid data:

X1 X2 X3 X4 X5 X6
0  1  0  0  0  0
1  1  1  0  1  1
…
1  ?  1  1  ?  1

If some values are missing (latent variables), we must use some method (e.g. gradient descent or EM) to compute the (locally) maximum likelihood estimates
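When every row is fully observed, the maximum likelihood estimate of a CPT is just a ratio of counts. The sketch below covers only that complete-data case for binary variables (rows with missing values would need EM or a gradient method instead); `fit_cpt` and the toy data are hypothetical.

```python
from collections import Counter

def fit_cpt(rows, child, parents):
    """Maximum likelihood estimate of P(child = 1 | parents) from complete data.

    rows is a list of dicts mapping variable name -> 0 or 1.
    Returns {parent_values_tuple: estimated probability}.
    """
    parent_counts = Counter()
    child_counts = Counter()
    for row in rows:
        key = tuple(row[p] for p in parents)
        parent_counts[key] += 1
        child_counts[key] += row[child]    # counts rows where child == 1
    return {key: child_counts[key] / parent_counts[key] for key in parent_counts}

# Toy complete data set (assumed for illustration).
data = [{"X1": 0, "X2": 1}, {"X1": 0, "X2": 0},
        {"X1": 1, "X2": 1}, {"X1": 1, "X2": 1}]
cpt_x2 = fit_cpt(data, "X2", ["X1"])
cpt_x1 = fit_cpt(data, "X1", [])
```

Parent configurations that never occur in the data get no entry; in practice a prior (pseudo-counts) is often added to avoid that and to avoid estimates of exactly 0 or 1.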

Structure learning (which nodes are connected)

[Figure: gene expression data (input) and the genetic pathway learned from it]

Structure learning

• Learning the optimal structure is NP-hard (except for trees)
• Hence use heuristic search through the space of DAGs or PDAGs or node orderings
• Search algorithms: hill climbing, simulated annealing, GAs
• Scoring function is often marginal likelihood, or an approximation like BIC/MDL or AIC

Structural complexity penalty
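The role of the complexity penalty can be seen in a small BIC computation. The sketch assumes binary variables and complete data, and uses BIC = log-likelihood at the ML estimate minus (d/2) log N, where d is the number of free parameters; the function name and the toy data sets are assumptions for illustration.

```python
import math
from collections import Counter

def bic(rows, parents):
    """BIC score of a DAG (given as child -> list of parents) on complete binary data."""
    n = len(rows)
    score = 0.0
    for child, ps in parents.items():
        parent_counts = Counter(tuple(r[p] for p in ps) for r in rows)
        joint_counts = Counter((tuple(r[p] for p in ps), r[child]) for r in rows)
        # Maximized log-likelihood: sum of count * log(ML conditional probability).
        score += sum(c * math.log(c / parent_counts[k])
                     for (k, _), c in joint_counts.items())
        # Structural complexity penalty: 2^|parents| free parameters for this node.
        score -= 0.5 * math.log(n) * 2 ** len(ps)
    return score

no_edge = {"X1": [], "X2": []}
with_edge = {"X1": [], "X2": ["X1"]}

# X2 copies X1: the extra edge is worth its penalty.
dependent = [{"X1": v, "X2": v} for v in (0, 1) for _ in range(20)]
# X1 and X2 unrelated: the edge adds parameters but no likelihood.
independent = [{"X1": a, "X2": b} for a in (0, 1) for b in (0, 1) for _ in range(10)]
```

On the `dependent` data the structure with the edge scores higher; on the `independent` data the two structures fit equally well and the penalty favors the one without the edge.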
