Bayesian Statistics
AI Friends Seminar
Ganguk Hwang
Department of Mathematical Sciences, KAIST
AI Friends Seminar Ganguk Hwang Bayesian Statistics January 2019 1 / 37
Introduction
Bayesian Inference
Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available.

Bayesian updating is particularly important in the dynamic analysis of a sequence of data.

Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law.
(in Wikipedia)
Thomas Bayes and Bayes’ Theorem
Thomas Bayes (1701-1761) was an English statistician, philosopher and Presbyterian minister who is known for formulating a specific case of the theorem that bears his name: Bayes' theorem.

Bayes never published what would become his most famous accomplishment; his notes were edited and published after his death by Richard Price.
(in Wikipedia)

Bayes' Theorem
For a partition $E_1, E_2, \ldots, E_N$ of the sample space $\Omega$ and an event $F$,

$$P(E_i \mid F) = \frac{P(E_i \cap F)}{P(F)} = \frac{P(F \mid E_i)\, P(E_i)}{\sum_{j=1}^{N} P(F \mid E_j)\, P(E_j)}.$$
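As a quick numerical sanity check, Bayes' theorem can be evaluated on a two-event partition. The disease-screening numbers below are illustrative assumptions, not figures from the slides:

```python
# Hypothetical two-event partition: E1 = "has disease", E2 = "healthy",
# and F = "test positive". All probabilities are made up for illustration.
p_e = [0.01, 0.99]          # priors P(E_1), P(E_2)
p_f_given_e = [0.95, 0.05]  # likelihoods P(F|E_1), P(F|E_2)

# Denominator of Bayes' theorem: P(F) = sum_j P(F|E_j) P(E_j)
p_f = sum(pf * pe for pf, pe in zip(p_f_given_e, p_e))

# Posterior P(E_1|F) = P(F|E_1) P(E_1) / P(F)
posterior = p_f_given_e[0] * p_e[0] / p_f
print(round(posterior, 4))  # → 0.161
```

Even a 95%-accurate test yields a posterior of only about 16% here, because the prior P(E_1) is small.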
Distribution Function and Parameter
The distribution function $F(x)$ of a random variable $X$ is defined by

$$F(x) = P(X \le x).$$

The parameter of a distribution function is the value that determines the distribution.

For instance, when we consider a Bernoulli random variable $X$ with $P(X = 1) = p$ and $P(X = 0) = 1 - p$, the value $p$ determines the distribution of $X$ and is the parameter of the distribution of $X$.
Likelihood Function
Likelihood Function
Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) random variables with parameter $\theta$. The probability mass (or density) function of $X_i$ is given by $p(x \mid \theta)$. Then the likelihood $L(x_1, x_2, \ldots, x_n \mid \theta)$ is defined by

$$L(x_1, x_2, \ldots, x_n \mid \theta) = p(x_1 \mid \theta)\, p(x_2 \mid \theta) \cdots p(x_n \mid \theta).$$
Maximum Likelihood Estimator (MLE) of θ
The MLE is the value of θ that maximizes the likelihood function.
Here, we consider the likelihood function as a function of θ.
To compute the MLE, we usually use log(L(x1, x2, · · · , xn|θ)).
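The log-likelihood maximization can be sketched numerically for Bernoulli data; the sample below is made up, and the grid search is only a check against the closed-form answer (the sample mean):

```python
import math

def bernoulli_log_likelihood(xs, theta):
    """log L(x_1, ..., x_n | theta) for i.i.d. Bernoulli observations."""
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in xs)

xs = [1, 0, 1, 1, 0, 1, 1]  # toy sample (assumption)

# Maximize the log-likelihood over a grid of theta values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=lambda t: bernoulli_log_likelihood(xs, t))
print(theta_mle, sum(xs) / len(xs))  # grid maximizer vs closed form 5/7
```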
Bayesian models
In Bayesian models we use Bayes' rule to obtain unknown probabilities.

In a Bayesian model with two random variables $X$ and $Y$, the following are initially given:

- The data generating distribution: $X \mid Y \sim p(x \mid y)$
- The prior distribution: $Y \sim p(y)$

We then obtain a sample value $X = x$ and want to estimate the distribution (the posterior distribution) of $Y$ given $X = x$, i.e., $p(y \mid x)$:

$$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{\int p(x, y)\, dy}.$$

We frequently use $p(y \mid x) \propto p(x \mid y)\, p(y)$.
Consider a random variable $X$ having distribution function $F(x)$ with unknown parameter $\theta$.

In Bayesian models, the unknown parameter $\theta$ is considered stochastic. So we believe that $\theta \sim p(\theta)$, where $p(\theta)$ is called a prior distribution, before sampling. After sampling, using Bayes' rule we obtain a so-called posterior distribution as follows:

$$p(\theta \mid x) = \frac{p(x, \theta)}{p(x)} = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x, \theta)\, d\theta}.$$
- $P(\theta)$ is the prior: our belief about $\theta$ without considering the data (evidence) $D$.
- $P(\theta \mid D)$ is the posterior: our refined belief about $\theta$ given the evidence $D$.
- $P(D \mid \theta)$ is the likelihood: the probability of obtaining the data $D$ as generated with parameter $\theta$.
- $P(D)$ is the evidence: the probability of the data, determined by considering all possible values of $\theta$.
Figure: Concept of Bayesian models
Maximum A Posteriori Estimator
Consider $X \sim p(x \mid \theta)$ with a prior distribution $p(\theta)$. Here, $\theta$ is the parameter of the distribution of $X$.

The Maximum A Posteriori (MAP) estimator is given by

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \log p(\theta \mid X).$$

Note that $p(\theta \mid X)$ is the posterior distribution of $\theta$ and

$$p(\theta \mid X) = \frac{p(\theta, X)}{p(X)}.$$

Since $p(X)$ is not a function of $\theta$, we have

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \log p(\theta, X).$$
Maximum A Posteriori Estimator
Recall that the Maximum Likelihood Estimator (MLE) is defined by

$$\hat{\theta}_{MLE} = \arg\max_{\theta} \log p(X \mid \theta)$$

and the MAP estimator is given by

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \log p(\theta, X) = \arg\max_{\theta} \log\left[p(X \mid \theta)\, p(\theta)\right].$$
So, the MAP estimator can use the prior information, but the MLE cannot.
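For Bernoulli data with a Beta(a, b) prior, both estimators have closed forms: the posterior is Beta(k + a, n - k + b), whose mode gives the MAP estimate. The sample and hyperparameters below are illustrative assumptions:

```python
xs = [1, 1, 1, 0]  # toy sample: 3 successes in 4 trials (assumption)
a, b = 2.0, 2.0    # Beta prior hyperparameters (assumption)

n, k = len(xs), sum(xs)
theta_mle = k / n                          # argmax_theta log p(X|theta)
theta_map = (k + a - 1) / (n + a + b - 2)  # mode of the Beta(k+a, n-k+b) posterior
print(theta_mle, theta_map)  # 0.75 vs ~0.667: the prior pulls the estimate toward 1/2
```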
Bayesian model: an example
Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) Bernoulli random variables where

$$P(X_1 = 1 \mid \theta) = \theta, \qquad P(X_1 = 0 \mid \theta) = 1 - \theta.$$

Here, $\theta$ is usually called the parameter of the Bernoulli distribution. First, observe that

$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid \theta) = \prod_{i=1}^{n} P(X_i = x_i \mid \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{\sum_{i=1}^{n} x_i}\, (1 - \theta)^{\,n - \sum_{i=1}^{n} x_i}.$$
The prior distribution of $\theta$ is given by a uniform distribution over $[0, 1]$, i.e.,

$$p(\theta) = 1 \quad \text{for } 0 \le \theta \le 1.$$

After obtaining sample values $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$, we are interested in the posterior distribution of $\theta$:

$$p(\theta \mid X_1 = x_1, \ldots, X_n = x_n) \propto P(X_1 = x_1, \ldots, X_n = x_n \mid \theta)\, p(\theta) \propto \theta^{\sum_{i=1}^{n} x_i}\, (1 - \theta)^{\,n - \sum_{i=1}^{n} x_i}.$$
Considering the normalization constant, we obtain

$$p(\theta \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{\Gamma(n + 2)}{\Gamma\!\left(1 + \sum_{i=1}^{n} x_i\right) \Gamma\!\left(1 + n - \sum_{i=1}^{n} x_i\right)}\; \theta^{\sum_{i=1}^{n} x_i}\, (1 - \theta)^{\,n - \sum_{i=1}^{n} x_i},$$

which is a Beta distribution with parameters $a = \sum_{i=1}^{n} x_i + 1$ and $b = n - \sum_{i=1}^{n} x_i + 1$.

c.f.

$$\text{Beta}(a, b): \quad p(\theta) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, \theta^{a-1} (1 - \theta)^{b-1}$$

Note that, when $a = b = 1$, $\text{Beta}(a, b)$ is in fact the uniform distribution over $[0, 1]$. That is, the prior and the posterior of $\theta$ are both Beta distributions.
Example 1
Suppose we want to buy something from Amazon.com, and there are two sellers offering it for the same price. Seller 1 has 90 positive reviews and 10 negative reviews. Seller 2 has 2 positive reviews and 0 negative reviews. Who should we buy from? (from Machine Learning by K.P. Murphy)

Let $\theta_1$ and $\theta_2$ be the unknown reliabilities of the two sellers. Since we don't know much about them, we endow them both with uniform priors, $\theta_i \sim \text{Beta}(1, 1)$, $i = 1, 2$. The posteriors are

$$p(\theta_1 \mid D_1) = \text{Beta}(91, 11), \qquad p(\theta_2 \mid D_2) = \text{Beta}(3, 1).$$
Hence,

$$P(\theta_1 > \theta_2 \mid D_1, D_2) = \int_0^1 \!\! \int_0^1 \mathbb{I}_{\{\theta_1 > \theta_2\}}\, \text{Beta}(\theta_1 \mid 91, 11)\, \text{Beta}(\theta_2 \mid 3, 1)\, d\theta_1\, d\theta_2 = 0.710.$$

We conclude that we are better off buying from seller 1.
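The double integral above is easy to approximate by Monte Carlo: draw from each posterior independently and count how often θ1 exceeds θ2. The seed and sample size below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000
theta1 = rng.beta(91, 11, size=n_samples)  # draws from p(theta_1 | D_1)
theta2 = rng.beta(3, 1, size=n_samples)    # draws from p(theta_2 | D_2)

prob = np.mean(theta1 > theta2)  # Monte Carlo estimate of P(theta_1 > theta_2)
print(round(prob, 3))  # close to the quoted 0.710
```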
Example 2
Figure: Bayesian Experiment for Bernoulli distribution with p = 0.7
More generally, if we use $\text{Beta}(a, b)$ as a prior distribution of $\theta$, we have

$$\begin{aligned}
p(\theta \mid X_1 = x_1, \ldots, X_n = x_n)
&\propto P(X_1 = x_1, \ldots, X_n = x_n \mid \theta)\, p(\theta) \\
&\propto P(X_1 = x_1, \ldots, X_n = x_n \mid \theta)\, \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \theta^{a-1} (1 - \theta)^{b-1} \\
&\propto \theta^{a + \sum_{i=1}^{n} x_i - 1}\, (1 - \theta)^{\,b + n - \sum_{i=1}^{n} x_i - 1}.
\end{aligned}$$

Hence, we see that

$$p(\theta \mid X_1 = x_1, \ldots, X_n = x_n) = \text{Beta}\!\left(a + \sum_{i=1}^{n} x_i,\; b + n - \sum_{i=1}^{n} x_i\right).$$

In this case, the Beta distribution is a conjugate prior.

From now on, we use the notation $p(\theta \mid x_1, x_2, \ldots, x_n)$ or $p(\theta \mid X)$ for simplicity.
More on Conjugate Distributions
In Bayesian probability theory, if the posterior distribution $p(\theta \mid x)$ is in the same probability distribution family as the prior distribution $p(\theta)$, the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood function.

As shown before, if we use a conjugate prior, we obtain a closed-form expression for the posterior. This also makes explicit how the likelihood function updates the prior distribution.
Conjugate Prior for Normal distribution
Assume that $X$ is distributed according to a normal distribution with unknown mean $\mu$ and variance $1/\tau$ (or precision $\tau$), i.e.,

$$X \sim \mathcal{N}(\mu, \tau^{-1}),$$

and that the prior distribution on $(\mu, \tau)$ is a Normal-Gamma distribution,

$$(\mu, \tau) \sim \text{NormalGamma}(\mu_0, \lambda_0, \alpha_0, \beta_0),$$

for which the density $p(\mu, \tau)$ satisfies

$$p(\mu, \tau) \propto \tau^{\alpha_0 - \frac{1}{2}} \exp\left[-\beta_0 \tau\right] \exp\left[-\frac{\lambda_0 \tau (\mu - \mu_0)^2}{2}\right].$$
Suppose that

$$X_1, \ldots, X_n \mid \mu, \tau \;\sim\; \text{i.i.d. } \mathcal{N}(\mu, \tau^{-1}),$$

i.e., the components of $X = (X_1, \ldots, X_n)$ are conditionally independent given $\mu, \tau$, and the conditional distribution given $\mu, \tau$ is normal with expectation $\mu$ and variance $1/\tau$.

The posterior distribution of $\mu$ and $\tau$ given this dataset $X$ is

$$P(\tau, \mu \mid X) \propto L(X \mid \tau, \mu)\, p(\tau, \mu),$$

where $L$ is the likelihood of the data given the parameters.
Since the data are i.i.d., the likelihood of $X$ is given by

$$\begin{aligned}
L(X \mid \tau, \mu) &\propto \prod_{i=1}^{n} \tau^{1/2} \exp\left[-\frac{\tau}{2}(x_i - \mu)^2\right] \\
&\propto \tau^{n/2} \exp\left[-\frac{\tau}{2} \sum_{i=1}^{n} (x_i - \mu)^2\right] \\
&\propto \tau^{n/2} \exp\left[-\frac{\tau}{2} \sum_{i=1}^{n} (x_i - \bar{x} + \bar{x} - \mu)^2\right] \\
&\propto \tau^{n/2} \exp\left[-\frac{\tau}{2} \sum_{i=1}^{n} \left((x_i - \bar{x})^2 + (\bar{x} - \mu)^2\right)\right] \\
&\propto \tau^{n/2} \exp\left[-\frac{\tau}{2} \left((n-1)s^2 + n(\bar{x} - \mu)^2\right)\right],
\end{aligned}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$. (The cross terms $2(x_i - \bar{x})(\bar{x} - \mu)$ vanish when summed, since $\sum_{i}(x_i - \bar{x}) = 0$.)
So, the posterior distribution of the parameters is proportional to the prior times the likelihood as given below.

$$\begin{aligned}
P(\tau, \mu \mid X) &\propto L(X \mid \tau, \mu)\, p(\tau, \mu) \\
&\propto \tau^{n/2} \exp\left[-\frac{\tau}{2}\left((n-1)s^2 + n(\bar{x} - \mu)^2\right)\right] \times \tau^{\alpha_0 - \frac{1}{2}} \exp\left[-\beta_0 \tau\right] \exp\left[-\frac{\lambda_0 \tau (\mu - \mu_0)^2}{2}\right] \\
&\propto \tau^{\frac{n}{2} + \alpha_0 - \frac{1}{2}} \exp\left[-\tau\left(\frac{1}{2}(n-1)s^2 + \beta_0\right)\right] \times \exp\left[-\frac{\tau}{2}\left(n(\bar{x} - \mu)^2 + \lambda_0 (\mu - \mu_0)^2\right)\right].
\end{aligned}$$
The last exponential term is simplified as follows:

$$\begin{aligned}
n(\bar{x} - \mu)^2 + \lambda_0 (\mu - \mu_0)^2
&= (n + \lambda_0)\mu^2 - 2(n\bar{x} + \lambda_0 \mu_0)\mu + n\bar{x}^2 + \lambda_0 \mu_0^2 \\
&= (n + \lambda_0)\left(\mu^2 - 2\,\frac{n\bar{x} + \lambda_0 \mu_0}{n + \lambda_0}\,\mu\right) + n\bar{x}^2 + \lambda_0 \mu_0^2 \\
&= (n + \lambda_0)\left(\mu - \frac{n\bar{x} + \lambda_0 \mu_0}{n + \lambda_0}\right)^2 + n\bar{x}^2 + \lambda_0 \mu_0^2 - \frac{(n\bar{x} + \lambda_0 \mu_0)^2}{n + \lambda_0} \\
&= (n + \lambda_0)\left(\mu - \frac{n\bar{x} + \lambda_0 \mu_0}{n + \lambda_0}\right)^2 + \frac{\lambda_0 n (\bar{x} - \mu_0)^2}{n + \lambda_0}.
\end{aligned}$$
Combining all the results yields

$$\begin{aligned}
P(\tau, \mu \mid X)
&\propto \tau^{\frac{n}{2} + \alpha_0 - \frac{1}{2}} \exp\left[-\tau\left(\frac{1}{2}(n-1)s^2 + \beta_0\right)\right] \\
&\quad\times \exp\left[-\frac{\tau}{2}\left((n + \lambda_0)\left(\mu - \frac{n\bar{x} + \lambda_0 \mu_0}{n + \lambda_0}\right)^2 + \frac{\lambda_0 n (\bar{x} - \mu_0)^2}{n + \lambda_0}\right)\right] \\
&\propto \tau^{\frac{n}{2} + \alpha_0 - \frac{1}{2}} \exp\left[-\tau\left(\beta_0 + \frac{1}{2}\left((n-1)s^2 + \frac{\lambda_0 n (\bar{x} - \mu_0)^2}{n + \lambda_0}\right)\right)\right] \\
&\quad\times \exp\left[-\frac{\tau}{2}(n + \lambda_0)\left(\mu - \frac{n\bar{x} + \lambda_0 \mu_0}{n + \lambda_0}\right)^2\right].
\end{aligned}$$
That is, the posterior is exactly in the same form as a Normal-Gamma distribution, i.e.,

$$P(\tau, \mu \mid X) = \text{NormalGamma}(\mu_1, \lambda_1, \alpha_1, \beta_1)$$

where

$$\mu_1 = \frac{n\bar{x} + \lambda_0 \mu_0}{n + \lambda_0}, \qquad \lambda_1 = n + \lambda_0, \qquad \alpha_1 = \alpha_0 + \frac{n}{2}, \qquad \beta_1 = \beta_0 + \frac{1}{2}\left((n-1)s^2 + \frac{\lambda_0 n (\bar{x} - \mu_0)^2}{n + \lambda_0}\right).$$
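The Normal-Gamma update rule translates directly into code; a minimal sketch, with toy data and hyperparameters chosen purely for illustration:

```python
import numpy as np

def normal_gamma_posterior(x, mu0, lam0, alpha0, beta0):
    """Posterior NormalGamma(mu1, lam1, alpha1, beta1) for i.i.d. N(mu, 1/tau) data."""
    n = len(x)
    xbar = x.mean()
    s2 = x.var(ddof=1)  # sample variance with (n-1) denominator
    mu1 = (n * xbar + lam0 * mu0) / (n + lam0)
    lam1 = n + lam0
    alpha1 = alpha0 + n / 2
    beta1 = beta0 + 0.5 * ((n - 1) * s2 + lam0 * n * (xbar - mu0) ** 2 / (n + lam0))
    return mu1, lam1, alpha1, beta1

x = np.array([1.2, 0.8, 1.1, 0.9, 1.0])  # toy sample (assumption)
mu1, lam1, alpha1, beta1 = normal_gamma_posterior(x, mu0=0.0, lam0=1.0, alpha0=1.0, beta0=1.0)
print(mu1, lam1, alpha1, beta1)  # mu1 = 5/6 interpolates between xbar = 1 and mu0 = 0
```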
Wishart Distribution
Suppose $X$ is a $p \times n$ matrix, each column of which is independently drawn from a $p$-variate normal distribution with zero mean:

$$\mathbf{x}_i = (x_{1i}, \ldots, x_{pi})^\top \sim \mathcal{N}_p(\mathbf{0}, \Sigma), \quad 1 \le i \le n.$$

Then the Wishart distribution is the probability distribution of the $p \times p$ random matrix

$$S = \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^\top, \qquad S \sim W_p(\Sigma, n).$$

The positive integer $n$ is the degrees of freedom. For $n \ge p$, the matrix $S$ is invertible with probability 1 if $\Sigma$ is invertible. If $p = \Sigma = 1$, then this distribution is a chi-squared distribution with $n$ degrees of freedom.
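The $p = \Sigma = 1$ special case is easy to check by simulation: a sum of n squared standard normals should have the chi-squared mean n and variance 2n. The seed and sample sizes below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000

# Each row gives one draw of S = sum_i x_i^2 with x_i ~ N(0, 1), i.e. S ~ W_1(1, n).
x = rng.standard_normal((trials, n))
s = (x ** 2).sum(axis=1)
print(s.mean(), s.var())  # close to n = 5 and 2n = 10
```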
The probability density function of $S \sim W_p(\Sigma, n)$ is given by

$$f(S) = \frac{|S|^{\frac{n-p-1}{2}}}{|\Sigma|^{\frac{n}{2}}\, 2^{\frac{np}{2}}\, \Gamma_p\!\left(\frac{n}{2}\right)} \exp\left(-\frac{1}{2}\,\mathrm{tr}(\Sigma^{-1} S)\right)$$

where $\Gamma_p(x) = \pi^{\frac{p(p-1)}{4}} \prod_{j=1}^{p} \Gamma\!\left(x + \frac{1-j}{2}\right)$.
Inverse-Wishart Distribution
The Inverse-Wishart distribution is a probability distribution on positivedefinite matrices.
We say that $T$ follows the Inverse-Wishart distribution with parameters $\Psi$ and $m$, denoted by $T \sim \text{InvW}_p(\Psi, m)$, if its inverse $T^{-1}$ follows $W_p(\Psi^{-1}, m)$.
The probability density function of $T \sim \text{InvW}_p(\Psi, m)$ is given by

$$f(T) = \frac{|\Psi|^{\frac{m}{2}}}{|T|^{\frac{m+p+1}{2}}\, 2^{\frac{mp}{2}}\, \Gamma_p\!\left(\frac{m}{2}\right)} \exp\left(-\frac{1}{2}\,\mathrm{tr}(\Psi T^{-1})\right)$$

where $\Gamma_p(x) = \pi^{\frac{p(p-1)}{4}} \prod_{j=1}^{p} \Gamma\!\left(x + \frac{1-j}{2}\right)$.
We now show that the Inverse-Wishart distribution is a conjugate prior for the covariance matrix parameter of a multivariate normal distribution. Consider

$$X = (\mathbf{x}_1, \ldots, \mathbf{x}_n), \quad \mathbf{x}_i \sim \mathcal{N}_p(\mathbf{0}, \Sigma), \quad \Sigma \sim \text{InvW}_p(\Psi, m).$$

Then the posterior distribution of $\Sigma$ satisfies

$$\begin{aligned}
f(\Sigma \mid X) &\propto f(X \mid \Sigma)\, g(\Sigma \mid \Psi, m) \\
&\propto (2\pi)^{-\frac{np}{2}} |\Sigma|^{-\frac{n}{2}} \exp\left(-\frac{1}{2} \sum_{i=1}^{n} \mathbf{x}_i^\top \Sigma^{-1} \mathbf{x}_i\right) \times \frac{|\Psi|^{\frac{m}{2}}}{|\Sigma|^{\frac{m+p+1}{2}}\, 2^{\frac{mp}{2}}\, \Gamma_p\!\left(\frac{m}{2}\right)} \exp\left(-\frac{1}{2}\,\mathrm{tr}(\Psi \Sigma^{-1})\right) \\
&\propto |\Sigma|^{-\frac{n}{2} - \frac{m+p+1}{2}} \exp\left(-\frac{1}{2} \sum_{i=1}^{n} \mathbf{x}_i^\top \Sigma^{-1} \mathbf{x}_i - \frac{1}{2}\,\mathrm{tr}(\Psi \Sigma^{-1})\right).
\end{aligned}$$
Noting that

$$\sum_{i=1}^{n} \mathbf{x}_i^\top \Sigma^{-1} \mathbf{x}_i = \mathrm{tr}\left(\sum_{i=1}^{n} \mathbf{x}_i^\top \Sigma^{-1} \mathbf{x}_i\right) = \sum_{i=1}^{n} \mathrm{tr}(\mathbf{x}_i^\top \Sigma^{-1} \mathbf{x}_i) = \sum_{i=1}^{n} \mathrm{tr}(\mathbf{x}_i \mathbf{x}_i^\top \Sigma^{-1}) = \mathrm{tr}(S \Sigma^{-1}), \quad \text{where } S = \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^\top,$$

we obtain

$$f(\Sigma \mid X) \propto |\Sigma|^{-\frac{n+m+p+1}{2}} \exp\left(-\frac{1}{2}\,\mathrm{tr}(S\Sigma^{-1} + \Psi\Sigma^{-1})\right) \propto |\Sigma|^{-\frac{n+m+p+1}{2}} \exp\left(-\frac{1}{2}\,\mathrm{tr}\big((S + \Psi)\Sigma^{-1}\big)\right),$$

i.e., the posterior is $\text{InvW}_p(\Psi + S, m + n)$.
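The conjugate update for the covariance matrix (scale $\Psi + S$, degrees of freedom $m + n$) is a two-line computation; a minimal sketch in which the data and hyperparameters are illustrative assumptions:

```python
import numpy as np

def invwishart_posterior(X, Psi, m):
    """Posterior (scale, df) for Sigma ~ InvW_p(Psi, m) given columns x_i ~ N_p(0, Sigma)."""
    p, n = X.shape
    S = X @ X.T            # scatter matrix S = sum_i x_i x_i^T
    return Psi + S, m + n  # posterior is InvW_p(Psi + S, m + n)

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 50))  # toy data: 50 columns in dimension p = 2 (assumption)
Psi_post, m_post = invwishart_posterior(X, np.eye(2), m=4)
print(m_post)  # → 54
```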
Bayesian Decision Theory
The mode of a posterior distribution, e.g., the MAP estimator, is often a very poor choice as a summary because the mode is usually quite untypical of the distribution, unlike the mean and the median.

In this case we use Bayesian decision theory, where a loss function $L(\theta, \hat{\theta})$ is considered. Here, $L(\theta, \hat{\theta})$ is the loss we incur if the truth is $\theta$ and our estimate is $\hat{\theta}$.

Given a loss function and a posterior distribution, we seek

$$\hat{\theta} = \arg\min_{\hat{\theta}} E\left[L(\theta, \hat{\theta}) \mid D\right].$$

By choosing different loss functions we obtain estimators other than the MAP estimator.
- $L(\theta, \hat{\theta}) = \mathbb{I}_{\{\hat{\theta} \ne \theta\}}$: the MAP estimator,
  $$E[L(\theta, \hat{\theta}) \mid D] = P(\hat{\theta} \ne \theta \mid D) = 1 - p(\hat{\theta} \mid D).$$
- $L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$: the posterior mean,
  $$E[L(\theta, \hat{\theta}) \mid D] = E[(\theta - \hat{\theta})^2 \mid D] = E\big[(\theta - E[\theta \mid D])^2 \mid D\big] + (E[\theta \mid D] - \hat{\theta})^2.$$
- $L(\theta, \hat{\theta}) = |\theta - \hat{\theta}|$: the posterior median,
  $$E[L(\theta, \hat{\theta}) \mid D] = E[|\theta - \hat{\theta}| \mid D].$$
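These characterizations are easy to verify on posterior samples: the sample mean approximately minimizes squared loss and the sample median approximately minimizes absolute loss. The Beta(3, 1) posterior below is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.beta(3, 1, size=100_000)  # samples from an (assumed) skewed posterior

mean_est = theta.mean()        # minimizer of E[(theta - a)^2 | D] over a
median_est = np.median(theta)  # minimizer of E[|theta - a| | D] over a
print(mean_est, median_est)  # near the exact mean 3/4 and exact median 2^(-1/3) ≈ 0.794
```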
References
K.P. Murphy, Machine Learning, The MIT Press, 2012.