Probabilistic Graphical Models
Lecture 1: Introduction and Probability Basics
M. Jaeger
Aalborg University
Introduction
Introduction Example 1: SLAM
Simultaneous Localization and Mapping: learn a map of the environment and locate the current position
Introduction Example 1: SLAM cont.
A probabilistic graphical model for the SLAM problem:
[Diagram: graphical model with nodes cont_t, pos_t, sens_t for t = 0, . . . , 3, and a single map node]
cont_t: control input at time t

pos_t: position at time t

sens_t: sensor reading at time t
map: map of the environment
◮ Determine most probable position given map, controls, and sensor readings
◮ Determine most probable map given position, controls, and sensor readings
S. Thrun, W. Burgard, and D. Fox: A probabilistic approach to concurrent mapping and localization for mobile robots. Autonomous Robots, 5(3-4), 253-271, 1998.
Introduction Example 2: Image Segmentation
(source: http://pubs.niaaa.nih.gov/publications/arh313/243-246.htm)
◮ Divide image into small number of regions representing structurally similar areas
Introduction Example 2: Image Segmentation cont.
A PGM for image segmentation:
[Diagram: grid of neighboring segment nodes seg_{i,j} (shown: seg_{5,7}, seg_{5,8}, seg_{6,7}, seg_{6,8}), each connected to the color node rgb_{i,j} of its pixel]
seg_{i,j}: segment index of pixel (i, j)
rgb_{i,j}: color value of pixel (i, j)
◮ Determine most probable segmentation given the color values
Y. Zhang, M. Brady, and S. Smith: Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Transactions on Medical Imaging, 20(1), 45-57, 2001.
Introduction Example 3: Statistical Semantics
Given a collection of texts:
Goal: automatically learn semantic descriptors for documents and words that support document clustering, text understanding, information retrieval, ...
Introduction Example 3: Statistical Semantics
The Probabilistic Latent Semantic Analysis (PLSA) model:
[Diagram: Document → Topic → Word]
Word occurrences in documents are observed. Topics are latent attributes of word occurrences in documents.
T. Hofmann: Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, 42, 177-196, 2001.
Introduction Example 4: Bioinformatics
Micro-array gene expression data:
Which genes are expressed under which conditions? Which are co-regulated, or functionally dependent?
Introduction Example 4: Bioinformatics
Bayesian network showing dependencies among gene expression levels:
[Diagram: Bayesian network over gene expression variables DBG5, LKI3, FFR3, TSW2, IUJ8, ERK3, AA7, RDE6, NQO2, JSW5, PLR9, BDO4, MNW7, KID1]
N. Friedman, M. Linial, I. Nachman, and D. Pe’er: Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7(3-4), 601-620, 2000.
Introduction Statistical Machine Learning
Common ground in the 4 examples:
◮ Learn a probabilistic model from data (using statistical learning techniques)
◮ Apply probabilistic inference algorithms to use models for prediction (classification, regression), structure analysis (clustering, segmentation)
Advantages of probabilistic/statistical methods:
◮ Principled quantification of prediction uncertainties
◮ Robust and principled techniques to deal with incomplete information, missing data
Introduction Probabilistic Graphical Models
Need: probabilistic models that
◮ can represent models for high-dimensional state spaces
◮ support efficient learning and inference techniques
Probabilistic Graphical Models
◮ support a structured specification of high-dimensional distributions in terms of low-dimensional factors
◮ structured representation can be exploited for efficient learning and inference algorithms (sometimes ...)
◮ graphical representation gives human-friendly design and description possibilities
Introduction This Course: Objective and Contents
Objective
◮ Provide a general introduction to the principles and techniques of probabilistic graphical models
◮ Enable understanding of scientific papers that use PGMs in a specific scientific context
Contents
1: Introduction and Probability Basics
2: Bayesian Networks – Syntax and Semantics
3: Bayesian Networks – Inference
4: Approximate Inference
5: Markov Networks
6: Parameter Learning
7: Structure Learning
8: Temporal Models
9: Latent Variable Models
10: Beyond Graphs
Introduction Literature
D. Koller and N. Friedman: Probabilistic Graphical Models. MIT Press, 2009

C. M. Bishop: Pattern Recognition and Machine Learning. Springer, 2006

K. Murphy: Machine Learning: A Probabilistic Perspective. MIT Press, 2012
Probability Basics
Probability Basics Probabilities as frequencies
The probability of tossing an even number with a die is 1/2:
Frequency of even numbers: 16/30 = 0.5333. In a sequence of 100 tosses, the frequency of ‘even’ is (expected to be) even closer to 0.5.
The set of possible outcomes is called the sample space:
S = {1, 2, 3, 4, 5, 6}.
A subset of S is called an event:
Event           Subset of S
even            {2, 4, 6}
1               {1}
multiple of 3   {3, 6}
Probability Basics The laws of frequency probabilities
Notation
- S: the sample space
- A, B, . . .: events
- Pf(A): the frequency of occurrences of A in a (large) set of observed outcomes (the probability of A)
Laws: (frequency) probabilities obey the following rules:
◮ 0 ≤ Pf(A) ≤ 1
◮ Pf(S) = 1
◮ If A ∩ B = ∅, then Pf(A) + Pf(B) = Pf(A ∪ B)
Probability Basics Probabilities as Beliefs
Measuring a subjective belief: let A be any proposition whose truth value is currently unknown, but will later become known (e.g.: “it will rain tomorrow”, “a Democratic president will be elected in 2016”, “Chievo Verona will score a goal vs. AC Fiorentina on 31/05/15”). Consider a betting ticket:
Ticket: GLOBAL GAMBLING INC. shall pay to the owner of this ticket $1 if A happens.
How much are you willing to pay for this ticket? (At least $0!)
For how much are you willing to sell this ticket? (Certainly for $1!)
What is the price at which you would just as well buy or sell? (In between $0 and $1!)
Probability Basics Ticket trading and Dutch books
Consider two agents: the elicitor (E) and the subject (S). Both E and S are in possession of tickets for various propositions A, B, . . . Now:
◮ E asks S for a price for tickets for each of the propositions A, B, . . .
◮ After S has set prices for all propositions, S must be ready to either buy from E or sell to E tickets at these prices.
The price set by S for proposition A, denoted Pb(A), is a measure of S’s belief that A will happen (S’s subjective probability for A).
E can make a Dutch Book against S if S has set prices Pb(A), Pb(B), . . . for some propositions such that E can make a combination of buying/selling deals with S, so that E will gain from these deals (and S will lose) under all possible combinations of outcomes for the propositions involved.
Probability Basics Dutch book theorem
Proposition   S's price Pb   E decides to
A             0.4            buy
B             0.3            buy
A ∨ B         0.8            sell
E's gain from the deals (= S's losses):

Outcome     gain from buying/selling   payout, tickets bought   payout, ticket sold   total
A ∧ B       0.1                        +2                       −1                    1.1
A ∧ ¬B      0.1                        +1                       −1                    0.1
¬A ∧ B      0.1                        +1                       −1                    0.1
¬A ∧ ¬B     0.1                        0                        0                     0.1
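To check the arithmetic, here is a minimal sketch in plain Python (no probability machinery; the prices and deals are those from the table) that recomputes E's gain for every possible outcome:

```python
from itertools import product

# S's prices and the deals E chose (from the table above).
# deal = +1: E buys the ticket (pays the price, receives the payout);
# deal = -1: E sells the ticket (receives the price, pays the payout).
prices = {"A": 0.4, "B": 0.3, "A or B": 0.8}
deals = {"A": +1, "B": +1, "A or B": -1}

for a, b in product([True, False], repeat=2):
    payout = {"A": int(a), "B": int(b), "A or B": int(a or b)}
    gain = sum(deals[p] * (payout[p] - prices[p]) for p in prices)
    print(f"A={a!s:5} B={b!s:5}  E's gain: {gain:+.1f}")
# E gains in all four outcomes: a Dutch book against S.
```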
Dutch Book Theorem (de Finetti)
E can make a Dutch book against S if and only if S's prices do not obey
◮ 0 ≤ Pb(A) ≤ 1
◮ Pb(S) = 1 (S the sure proposition)
◮ If A ∧ B is an impossible proposition, then Pb(A) + Pb(B) = Pb(A ∨ B)
Literature: http://plato.stanford.edu/entries/dutch-book/
Probability Basics Beliefs and Frequencies
◮ Frequencies and rational beliefs follow the same laws
◮ We need only one probability calculus to deal with both
◮ Differences:
  ◮ We can have beliefs about everything!
  ◮ Frequency probabilities need to be based on repeatable sampling/observation procedures.
From frequencies to beliefs
Pf(max. temperature on Aug. 1 > 25°C) = 0.35  ⇒  Pb(it will be warmer than 25°C on 01/08/2010) = 0.35.
This kind of reasoning pattern is called direct inference. However, it is not always possible to base subjective probabilities on well-defined frequencies (what is the probability that global warming will raise the sea levels more than 1 m by the year 2100?).
Probability Basics Probabilities: the Mathematical Model
A probability distribution on a finite sample space S is a function P() that assigns to every event A ⊆ S a number P(A) ∈ [0, 1] (the probability of A), such that
◮ P(S) = 1
◮ If A ∩ B = ∅, then P(A) + P(B) = P(A ∪ B) (finite additivity)
[Venn diagram: sample space S with disjoint events A, B and overlapping events C, D]

P(A ∪ B) = P(A) + P(B)
P(C ∪ D) = P(C) + P(D) − P(C ∩ D)
Infinite Sample Spaces
◮ Probabilities are only assigned to measurable subsets A ⊆ S
◮ Countable additivity: for Ai (i ∈ N) pairwise disjoint,
  P(∪_{i∈N} Ai) = Σ_{i∈N} P(Ai)
Probability Basics Conditional probabilities
For two events A and B we define the conditional probability of A given B:
P(A | B) = P(A ∩ B) / P(B)
Examples:

P({4} | {2, 4, 6}) = P({4}) / P({2, 4, 6}) = (1/6) / (3/6) = 1/3

P(even | {4, 5, 6}) = P({4, 6}) / P({4, 5, 6}) = 2/3
P(zero | roulette wheel is fair) = P(zero ∩ roulette wheel is fair) / P(roulette wheel is fair) (= ?/?) = 1/37.
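The die examples can be verified mechanically. A minimal sketch, assuming a fair die so that every outcome has probability 1/6:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
P = {s: Fraction(1, 6) for s in S}  # fair die: uniform distribution on S

def prob(event):
    return sum(P[s] for s in event)

def cond(A, B):
    return prob(A & B) / prob(B)  # P(A | B) = P(A ∩ B) / P(B)

print(cond({4}, {2, 4, 6}))        # 1/3
print(cond({2, 4, 6}, {4, 5, 6}))  # 2/3
```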
Probability Basics Fundamental rules
Notation: in probability expressions, A, B stands for the intersection A ∩ B, or the conjunction “A and B”.

The fundamental rule
P(A, B) = P(A | B) P(B)
The fundamental rule, conditioned
P(A, B | C) = P(A | B, C) P(B | C)
Bayes’s rule
P(B | A) = P(A | B) P(B) / P(A)
Bayes’s rule, conditioned
P(B | A, C) = P(A | B, C) P(B | C) / P(A | C)
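Bayes's rule is just the fundamental rule applied in both orders: P(A, B) = P(A | B) P(B) and P(A, B) = P(B | A) P(A); equating the two right-hand sides and dividing by P(A) gives the rule, and conditioning everything on C gives the conditioned version.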
Random Variables and Independence
RVs and Independence Random Variables
Sample spaces are usually defined by random variables, given by a name X and a value space Val(X):
X                     Val(X)
Temperature           R
CO2level2100          R
USPresident2016       {democratic, republican, other}
MiddleEastPeace2020   {yes, no}
Population            N
A set of random variables X1, . . . ,Xn defines the sample space
S = Val(X1) × · · · × Val(Xn)
A (marginal) event of the form Xi = xi is the subset
Val(X1)× · · · × {xi} × · · · × Val(Xn) ⊆ S
RVs and Independence Terminology and Notation
Terminology
◮ Discrete random variable: RV with finite value space (in probability theory also: countably infinite value space)
◮ Continuous random variable: RV with value space R
Notation
◮ Upper case letters X, Y, Z for (generic) random variables, corresponding lower case letters x, y, z for their values
◮ Boldface X = (X1, . . . , Xk), x = (x1, . . . , xk) for tuples of random variables and values
◮ X = x stands for the component-wise assignment Xi = xi (i = 1, . . . , k)
Terminology
The joint distribution of RVs X is a distribution P on the sample space
S = Val(X ) = Val(X1)× · · · × Val(Xn)
The marginal distribution of Xi is P restricted to events of the form Xi = xi .
RVs and Independence Tables and Marginals
Tabular specification of a joint distribution for discrete RVs (contingency table):
           USPres16
MEP2020    dem     rep     oth
yes        0.12    0.10    0.08
no         0.23    0.30    0.17
Marginal distributions:
           USPres16
MEP2020    dem     rep     oth     (marginal)
yes        0.12    0.10    0.08    0.30
no         0.23    0.30    0.17    0.70
(marginal) 0.35    0.40    0.25
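As a sketch of the mechanics (assuming numpy; the variable names are ad hoc), the marginals are simply the row and column sums of the joint table:

```python
import numpy as np

# Joint P(MEP2020, USPres16): rows yes/no, columns dem/rep/oth
joint = np.array([[0.12, 0.10, 0.08],
                  [0.23, 0.30, 0.17]])

p_mep2020 = joint.sum(axis=1)   # [0.30, 0.70]
p_uspres16 = joint.sum(axis=0)  # [0.35, 0.40, 0.25]
print(p_mep2020, p_uspres16)
```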
RVs and Independence 3-way Tables
Tabular representation of the joint distribution P(A, B, C) of three variables A, B, C with Val(A) = {a1, a2}, Val(B) = {b1, b2, b3}, Val(C) = {c1, c2}.
            B = b1            B = b2            B = b3
        c1      c2        c1      c2        c1      c2      A marginal
a1      0.10    0.04      0.02    0.17      0.09    0.13    0.55
a2      0.07    0.11      0.04    0.09      0.12    0.02    0.45

B marginal: 0.32 (b1), 0.32 (b2), 0.36 (b3); C marginal: 0.44 (c1), 0.56 (c2)
or
A, B, C       P
a1, b1, c1    0.10
a1, b1, c2    0.04
...           ...
a2, b3, c2    0.02
RVs and Independence Conditional Distributions
Conditional distribution of A given B = b1 and C = c2:
P(A | B = b1, C = c2):

P(A = a1 | B = b1, C = c2) = P(A = a1, B = b1, C = c2) / P(B = b1, C = c2) = 0.04 / 0.15 = 0.2666
P(A = a2 | B = b1, C = c2) = P(A = a2, B = b1, C = c2) / P(B = b1, C = c2) = 0.11 / 0.15 = 0.7333
Conditional distribution of A given B and C:
P(A | B, C):

            B = b1            B = b2            B = b3
        c1      c2        c1      c2        c1      c2
a1      0.588   0.2666    0.333   0.654     0.428   0.866
a2      0.411   0.7333    0.666   0.346     0.571   0.133
Again a function on Val(A,B,C)!
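Computationally, conditioning is just slicing the joint table and renormalizing. A sketch with the numbers above (assuming numpy; axis order A, B, C):

```python
import numpy as np

# P(A, B, C): axis 0 = A (a1, a2), axis 1 = B (b1, b2, b3), axis 2 = C (c1, c2)
joint = np.array([[[0.10, 0.04], [0.02, 0.17], [0.09, 0.13]],
                  [[0.07, 0.11], [0.04, 0.09], [0.12, 0.02]]])

s = joint[:, 0, 1]                 # P(A, B=b1, C=c2) = [0.04, 0.11]
print(s / s.sum())                 # P(A | B=b1, C=c2) ≈ [0.267, 0.733]

# Full conditional P(A | B, C): normalize each (B, C) column over the A-axis
print(joint / joint.sum(axis=0, keepdims=True))
```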
RVs and Independence Basic Rules for Variables
According to the fundamental rule: for all a ∈ Val(A) and all b ∈ Val(B):
P(A = a,B = b) = P(A = a | B = b) · P(B = b)
Can write this as an equation between functions P(A,B),P(A | B),P(B) on Val(A) × Val(B):
P(A,B) = P(A | B) · P(B)
Similarly, conditioned version of Fundamental rule:
P(A,B | C) = P(A | B,C) · P(B | C)
and Bayes’ rule (conditioned):
P(B | A) = P(A | B) · P(B) / P(A)

P(B | A, C) = P(A | B, C) · P(B | C) / P(A | C)
Independence
Independence Example: Football statistics
Results for Bayern München and SC Freiburg in seasons 2001/02 and 2003/04 (not counting the matches München vs. Freiburg):
Val(München) = Val(Freiburg) = {Win, Draw, Loss}
2001/02
München: LWDWWWWWWWWLDLDLDLWLDWWWDWDDWWWW
Freiburg: WLLDDWLDWDWLLLDDLWDDLLDLLLLLLWLW

2003/04
München: WDWWLDWWDWLWWDDWDWLWWWDDWWWLWWLL
Freiburg: LDDWDWLWLLLWWLWLWLLDWLDDWDLLLWLD
Summary:
             Freiburg
München      W     D     L     (row total)
W            12    9     15    36
D            3     4     9     16
L            6     4     2     12
(col total)  21    17    26
Independence Independence of Outcomes
Counts normalized to probabilities:
P(München, Freiburg), with the product of marginals P(München) · P(Freiburg) in parentheses below each entry:

                   Freiburg
München            W                D                L                München marginal
W                  .1875 (.1845)    .1406 (.1494)    .2344 (.2284)    .5625
D                  .0468 (.0820)    .0625 (.0664)    .1406 (.1015)    .25
L                  .0937 (.0615)    .0625 (.0498)    .0312 (.0761)    .1875
Freiburg marginal  .3281            .2656            .4062
Explanation: The outcome of Freiburg’s game is independent of the outcome of München’s game,therefore the probabilities of combinations of outcomes are the product of the probabilities for theindividual outcomes.
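A sketch (assuming numpy) of the computation behind this table: normalize the counts, take the marginals, and compare the joint with their outer product:

```python
import numpy as np

# Match counts (München x Freiburg), rows/cols in order W, D, L
counts = np.array([[12, 9, 15],
                   [3, 4, 9],
                   [6, 4, 2]])
joint = counts / counts.sum()       # empirical P(München, Freiburg)

p_m = joint.sum(axis=1)             # [.5625, .25, .1875]
p_f = joint.sum(axis=0)             # [.3281, .2656, .4062]
print(np.outer(p_m, p_f).round(4))  # product of marginals, cf. the joint above
```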
Independence Independent Variables
Let P be a joint distribution of variables A1, . . . , An. The variables Ai and Ak are independent (according to P) if for all ai,j ∈ Val(Ai) and ak,h ∈ Val(Ak):
Written as equation for distributions:
P(Ai ,Ak ) = P(Ai ) · P(Ak ).
◮ Similar for 3 or more variables
◮ A set of RVs is independent if every finite subset is independent
Independence Example
Pairwise independence does not imply independence:
◮ Random variables Xi (i = 1, . . . , n; n ≥ 2) and B with Val(Xi) = Val(B) = {0, 1}
◮ The Xi represent a sequence of independent coin tosses:
  P(X = x) = (1/2)^n for all x ∈ {0, 1}^n
◮ B is a parity bit:
  P(B = Σ_{i=1}^{n} Xi mod 2) = 1
Then for all i and all x ∈ {0, 1}:
P(B = 1 | Xi = x) = P(B = 1) = 0.5
But for all x ∈ {0, 1}^n:
0.5 = P(B = 1) ≠ P(B = 1 | X = x) ∈ {0, 1}
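A sketch in plain Python that enumerates this construction for n = 3 and confirms the claim:

```python
from itertools import product

n = 3
# All 2^n equally likely toss sequences, each paired with its parity bit B
outcomes = [(x, sum(x) % 2) for x in product([0, 1], repeat=n)]

# Pairwise: P(B = 1 | X_i = v) = 0.5 for every i and v
for i in range(n):
    for v in (0, 1):
        bits = [b for x, b in outcomes if x[i] == v]
        print(i, v, sum(bits) / len(bits))  # always 0.5

# Jointly: given the full vector x, B is determined, so P(B=1 | X=x) is 0 or 1
print(dict(outcomes))
```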
Independence Equivalent Formulations of Independence
The following are equivalent for two variables A,B:
P(A, B) = P(A) P(B)
P(A | B) = P(A)
P(B | A) = P(B)
M =             W                     D                     L
P(M):           .5625                 .25                   .1875
P(M | F = W):   .1875/.3281 = .571    .0468/.3281 = .143    .0937/.3281 = .285
P(M | F = D):   .1406/.2656 = .529    .0625/.2656 = .235    .0625/.2656 = .235
P(M | F = L):   .2344/.4062 = .577    .1406/.4062 = .346    .0312/.4062 = .077
Knowing the outcome of Freiburg’s game does not change the probabilities for München’s game (in fact this does not hold exactly here, but the discrepancies between, e.g., P(M = L | F = L) and P(M = L) can be explained by the small number of games from which the frequencies were computed).
Independence Compact Specifications by Independence
Independence properties can greatly simplify the specification of a distribution:
             F = W    F = D    F = L    M marginal
M = W          ·        ·        ·      .5625
M = D          ·        ·        ·      .25
M = L          ·        ·        ·      .1875
F marginal   .3281    .2656    .4062

M and F are independent, so the joint entries need not be specified: each is the product of its row and column marginals.
Conditional Independence
Conditional Independence Example
Joint distribution for variables
Sex: Val(Sex) = {male, female}
Hair length: Val(Hair length) = {long, short}
Stature: Val(Stature) = {≥ 1.68, ≤ 1.68}
                 Sex = male           Sex = female
Stature          long      short      long      short
≥ 1.68           0.0441    0.3969     0.2142    0.0918
≤ 1.68           0.0049    0.0441     0.1428    0.0612
P(Hair length,Stature), P(Hair length), P(Stature), P(Hair length) · P(Stature):
Stature          long               short              Stature marginal
≥ 1.68           0.2583 (0.3032)    0.4887 (0.4437)    0.747
≤ 1.68           0.1477 (0.1027)    0.1053 (0.1502)    0.253
Hair marginal    0.406              0.594

(In parentheses: the product of marginals P(Hair length) · P(Stature).)
Hair length and Stature are not independent.
Conditional Independence Example Continued
P(Hair length, Stature | Sex = female), P(Hair length | Sex = female), P(Stature | Sex = female), and P(Hair length | Sex = female) · P(Stature | Sex = female):
Stature          long           short          Stature marginal
≥ 1.68           0.42 (0.42)    0.18 (0.18)    0.6
≤ 1.68           0.28 (0.28)    0.12 (0.12)    0.4
Hair marginal    0.7            0.3

(In parentheses: the product of the conditional marginals.)
Hair length and Stature are independent given Sex = female.
Also: Hair length and Stature are independent given Sex = male.
Hence: Hair length and Stature are independent given Sex.
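A sketch (assuming numpy) verifying the factorization for Sex = female; the same check succeeds for Sex = male:

```python
import numpy as np

# P(Stature, Hair length | Sex=female): rows >=1.68 / <=1.68, cols long / short
cond_f = np.array([[0.42, 0.18],
                   [0.28, 0.12]])

p_stature = cond_f.sum(axis=1)  # [0.6, 0.4]
p_hair = cond_f.sum(axis=0)     # [0.7, 0.3]
print(np.allclose(cond_f, np.outer(p_stature, p_hair)))  # True
```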
Conditional Independence Conditionally Independent Variables
Let P be a joint distribution of variables A = A1, . . . , An, B = B1, . . . , Bm, C = C1, . . . , Ck. The variables A are conditionally independent of the variables B given C, if
P(A,B | C) = P(A | C) · P(B | C)
Equivalently:
P(A | B, C) = P(A | C)
Conditional Independence The Chain Rule
For any joint distribution P of variables A1, . . . ,An:
P(A1, . . . , An) = P(An | A1, . . . , An−1) · P(An−1 | A1, . . . , An−2) · · · P(A2 | A1) · P(A1).
Proof by repeated application of the fundamental rule.
For i = 1, . . . , n let Pa(Ai ) be a subset of A1, . . . ,Ai−1, such that
P(Ai | A1, . . . ,Ai−1) = P(Ai | Pa(Ai )).
Then:
P(A1, . . . , An) = ∏_{i=1}^{n} P(Ai | Pa(Ai)).
This is also called the chain rule for Bayesian networks.
Conditional Independence Example
Chain rule applied to Sex,Hair length, Stature:
P(Sex,Hair length,Stature) = P(Stature | Hair length,Sex) · P(Hair length | Sex) · P(Sex).
Using conditional independence:
P(Sex,Hair length,Stature) = P(Stature | Sex) · P(Hair length | Sex) · P(Sex).
Graphical representation:
[Diagram: Sex → Hair length, Sex → Stature]
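A sketch (assuming numpy) that rebuilds the full 2×2×2 joint table of the earlier example from the three factors of this network; the parameter values are read off the earlier tables:

```python
import numpy as np

# Factors of the network Sex -> Hair length, Sex -> Stature
p_sex = np.array([0.49, 0.51])        # P(Sex): male, female
p_hair = np.array([[0.1, 0.9],        # P(Hair | Sex=male): long, short
                   [0.7, 0.3]])       # P(Hair | Sex=female)
p_stature = np.array([[0.9, 0.1],     # P(Stature | Sex=male): >=1.68, <=1.68
                      [0.6, 0.4]])    # P(Stature | Sex=female)

# Chain rule with Pa(Hair) = Pa(Stature) = {Sex}:
# P(Sex, Hair, Stature) = P(Sex) * P(Hair | Sex) * P(Stature | Sex)
joint = p_sex[:, None, None] * p_hair[:, :, None] * p_stature[:, None, :]
print(joint)  # e.g. joint[0, 0, 0] = 0.49 * 0.1 * 0.9 = 0.0441, matching the table
```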