Bayesian Statistics
AI Friends Seminar
Ganguk Hwang
Department of Mathematical Sciences, KAIST
AI Friends Seminar Ganguk Hwang Bayesian Statistics January 2019 1 / 37
Introduction
Bayesian Inference
Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available.

Bayesian updating is particularly important in the dynamic analysis of a sequence of data.

Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law.
(in Wikipedia)
Thomas Bayes and Bayes’ Theorem
Thomas Bayes (1701-1761) was an English statistician, philosopher and Presbyterian minister who is known for formulating a specific case of the theorem that bears his name: Bayes' theorem.

Bayes never published what would become his most famous accomplishment; his notes were edited and published after his death by Richard Price.
(in Wikipedia)

Bayes' Theorem
For a partition $E_1, E_2, \ldots, E_N$ of the sample space $\Omega$ and an event $F$,

$$P(E_i \mid F) = \frac{P(E_i \cap F)}{P(F)} = \frac{P(F \mid E_i)\, P(E_i)}{\sum_{j=1}^{N} P(F \mid E_j)\, P(E_j)}.$$
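As a quick numerical sanity check, Bayes' theorem can be evaluated on a two-event partition. The disease-screening numbers below are illustrative assumptions, not figures from the slides:

```python
# Hypothetical two-event partition: E1 = "has disease", E2 = "healthy",
# and F = "test positive". All probabilities are made up for illustration.
p_e = [0.01, 0.99]          # priors P(E_1), P(E_2)
p_f_given_e = [0.95, 0.05]  # likelihoods P(F|E_1), P(F|E_2)

# Denominator of Bayes' theorem: P(F) = sum_j P(F|E_j) P(E_j)
p_f = sum(pf * pe for pf, pe in zip(p_f_given_e, p_e))

# Posterior P(E_1|F) = P(F|E_1) P(E_1) / P(F)
posterior = p_f_given_e[0] * p_e[0] / p_f
print(round(posterior, 4))  # → 0.161
```

Even a 95%-accurate test yields a posterior of only about 16% here, because the prior P(E_1) is small.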
Distribution Function and Parameter
The distribution function $F(x)$ of a random variable $X$ is defined by

$$F(x) = P(X \le x).$$

The parameter of a distribution function is the value that determines the distribution.

For instance, when we consider a Bernoulli random variable $X$ with $P(X = 1) = p$ and $P(X = 0) = 1 - p$, the value $p$ determines the distribution of $X$ and is the parameter of the distribution of $X$.
Likelihood Function
Likelihood Function
Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) random variables with parameter $\theta$. The probability mass (or density) function of $X_i$ is given by $p(x \mid \theta)$. Then the likelihood $L(x_1, x_2, \ldots, x_n \mid \theta)$ is defined by

$$L(x_1, x_2, \ldots, x_n \mid \theta) = p(x_1 \mid \theta)\, p(x_2 \mid \theta) \cdots p(x_n \mid \theta).$$
Maximum Likelihood Estimator (MLE) of θ
The MLE is the value of θ that maximizes the likelihood function.
Here, we consider the likelihood function as a function of θ.
To compute the MLE, we usually use log(L(x1, x2, · · · , xn|θ)).
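The log-likelihood maximization can be sketched numerically for Bernoulli data; the sample below is made up, and the grid search is only a check against the closed-form answer (the sample mean):

```python
import math

def bernoulli_log_likelihood(xs, theta):
    """log L(x_1, ..., x_n | theta) for i.i.d. Bernoulli observations."""
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in xs)

xs = [1, 0, 1, 1, 0, 1, 1]  # toy sample (assumption)

# Maximize the log-likelihood over a grid of theta values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=lambda t: bernoulli_log_likelihood(xs, t))
print(theta_mle, sum(xs) / len(xs))  # grid maximizer vs closed form 5/7
```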
Bayesian models
In Bayesian models we use Bayes' rule to obtain unknown probabilities.

In a Bayesian model with two random variables $X$ and $Y$, the following are initially given:

- The data generating distribution: $X \mid Y \sim p(x \mid y)$
- The prior distribution: $Y \sim p(y)$

We then obtain a sample value $X = x$ and want to estimate the distribution (the posterior distribution) of $Y$ given $X = x$, i.e., $p(y \mid x)$:

$$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{\int p(x, y)\, dy}.$$

We frequently use $p(y \mid x) \propto p(x \mid y)\, p(y)$.
Consider a random variable $X$ having distribution function $F(x)$ with unknown parameter $\theta$.

In Bayesian models, the unknown parameter $\theta$ is considered stochastic. So we believe that $\theta \sim p(\theta)$, where $p(\theta)$ is called a prior distribution, before sampling. After sampling, using Bayes' rule we obtain a so-called posterior distribution as follows:

$$p(\theta \mid x) = \frac{p(x, \theta)}{p(x)} = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x, \theta)\, d\theta}.$$
- $P(\theta)$ is the prior: our belief about $\theta$ without considering the data (evidence) $D$.
- $P(\theta \mid D)$ is the posterior: our refined belief about $\theta$ given the evidence $D$.
- $P(D \mid \theta)$ is the likelihood: the probability of obtaining the data $D$ as generated with parameter $\theta$.
- $P(D)$ is the evidence: the probability of the data, determined by considering all possible values of $\theta$.
Figure: Concept of Bayesian models
Maximum A Posteriori Estimator
Consider $X \sim p(x \mid \theta)$ with a prior distribution $p(\theta)$. Here, $\theta$ is the parameter of the distribution of $X$.

The Maximum A Posteriori (MAP) estimator is given by

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \log p(\theta \mid X).$$

Note that $p(\theta \mid X)$ is the posterior distribution of $\theta$ and

$$p(\theta \mid X) = \frac{p(\theta, X)}{p(X)}.$$

Since $p(X)$ is not a function of $\theta$, we have

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \log p(\theta, X).$$
Maximum A Posteriori Estimator
Recall that the Maximum Likelihood Estimator (MLE) is defined by

$$\hat{\theta}_{MLE} = \arg\max_{\theta} \log p(X \mid \theta)$$

and the MAP estimator is given by

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \log p(\theta, X) = \arg\max_{\theta} \log\left[p(X \mid \theta)\, p(\theta)\right].$$
So, the MAP estimator can use the prior information, but the MLE cannot.
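For Bernoulli data with a Beta(a, b) prior, both estimators have closed forms: the posterior is Beta(k + a, n - k + b), whose mode gives the MAP estimate. The sample and hyperparameters below are illustrative assumptions:

```python
xs = [1, 1, 1, 0]  # toy sample: 3 successes in 4 trials (assumption)
a, b = 2.0, 2.0    # Beta prior hyperparameters (assumption)

n, k = len(xs), sum(xs)
theta_mle = k / n                          # argmax_theta log p(X|theta)
theta_map = (k + a - 1) / (n + a + b - 2)  # mode of the Beta(k+a, n-k+b) posterior
print(theta_mle, theta_map)  # 0.75 vs ~0.667: the prior pulls the estimate toward 1/2
```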
Bayesian model: an example
Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) Bernoulli random variables where

$$P(X_1 = 1 \mid \theta) = \theta, \qquad P(X_1 = 0 \mid \theta) = 1 - \theta.$$

Here, $\theta$ is usually called the parameter of the Bernoulli distribution. First, observe that

$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid \theta) = \prod_{i=1}^{n} P(X_i = x_i \mid \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{\sum_{i=1}^{n} x_i}\, (1 - \theta)^{\,n - \sum_{i=1}^{n} x_i}.$$
The prior distribution of $\theta$ is given by a uniform distribution over $[0, 1]$, i.e.,

$$p(\theta) = 1 \quad \text{for } 0 \le \theta \le 1.$$

After obtaining sample values $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$, we are interested in the posterior distribution of $\theta$:

$$p(\theta \mid X_1 = x_1, \ldots, X_n = x_n) \propto P(X_1 = x_1, \ldots, X_n = x_n \mid \theta)\, p(\theta) \propto \theta^{\sum_{i=1}^{n} x_i}\, (1 - \theta)^{\,n - \sum_{i=1}^{n} x_i}.$$
Considering the normalization constant, we obtain

$$p(\theta \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{\Gamma(n + 2)}{\Gamma\!\left(1 + \sum_{i=1}^{n} x_i\right) \Gamma\!\left(1 + n - \sum_{i=1}^{n} x_i\right)}\; \theta^{\sum_{i=1}^{n} x_i}\, (1 - \theta)^{\,n - \sum_{i=1}^{n} x_i},$$

which is a Beta distribution with parameters $a = \sum_{i=1}^{n} x_i + 1$ and $b = n - \sum_{i=1}^{n} x_i + 1$.

c.f.

$$\text{Beta}(a, b): \quad p(\theta) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, \theta^{a-1} (1 - \theta)^{b-1}$$

Note that, when $a = b = 1$, $\text{Beta}(a, b)$ is in fact the uniform distribution over $[0, 1]$. That is, the prior and the posterior of $\theta$ are both Beta distributions.
Example 1
Suppose we want to buy something from Amazon.com, and there are two sellers offering it for the same price. Seller 1 has 90 positive reviews and 10 negative reviews. Seller 2 has 2 positive reviews and 0 negative reviews. Who should we buy from? (from Machine Learning by K.P. Murphy)

Let $\theta_1$ and $\theta_2$ be the unknown reliabilities of the two sellers. Since we don't know much about them, we endow them both with uniform priors, $\theta_i \sim \text{Beta}(1, 1)$, $i = 1, 2$. The posteriors are

$$p(\theta_1 \mid D_1) = \text{Beta}(91, 11), \qquad p(\theta_2 \mid D_2) = \text{Beta}(3, 1).$$
Hence,

$$P(\theta_1 > \theta_2 \mid D_1, D_2) = \int_0^1 \!\! \int_0^1 \mathbb{I}_{\{\theta_1 > \theta_2\}}\, \text{Beta}(\theta_1 \mid 91, 11)\, \text{Beta}(\theta_2 \mid 3, 1)\, d\theta_1\, d\theta_2 = 0.710.$$

We conclude that we are better off buying from seller 1.
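The double integral above is easy to approximate by Monte Carlo: draw from each posterior independently and count how often θ1 exceeds θ2. The seed and sample size below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000
theta1 = rng.beta(91, 11, size=n_samples)  # draws from p(theta_1 | D_1)
theta2 = rng.beta(3, 1, size=n_samples)    # draws from p(theta_2 | D_2)

prob = np.mean(theta1 > theta2)  # Monte Carlo estimate of P(theta_1 > theta_2)
print(round(prob, 3))  # close to the quoted 0.710
```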
Example 2
Figure: Bayesian Experiment for Bernoulli distribution with p = 0.7
More generally, if we use $\text{Beta}(a, b)$ as a prior distribution of $\theta$, we have

$$\begin{aligned}
p(\theta \mid X_1 = x_1, \ldots, X_n = x_n)
&\propto P(X_1 = x_1, \ldots, X_n = x_n \mid \theta)\, p(\theta) \\
&\propto P(X_1 = x_1, \ldots, X_n = x_n \mid \theta)\, \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \theta^{a-1} (1 - \theta)^{b-1} \\
&\propto \theta^{a + \sum_{i=1}^{n} x_i - 1}\, (1 - \theta)^{\,b + n - \sum_{i=1}^{n} x_i - 1}.
\end{aligned}$$

Hence, we see that

$$p(\theta \mid X_1 = x_1, \ldots, X_n = x_n) = \text{Beta}\!\left(a + \sum_{i=1}^{n} x_i,\; b + n - \sum_{i=1}^{n} x_i\right).$$

In this case, the Beta distribution is a conjugate prior.

From now on, we use the notation $p(\theta \mid x_1, x_2, \ldots, x_n)$ or $p(\theta \mid X)$ for simplicity.
More on Conjugate Distributions
In Bayesian probability theory, if the posterior distribution $p(\theta \mid x)$ is in the same probability distribution family as the prior distribution $p(\theta)$, the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood function.

As shown before, if we use a conjugate prior, we obtain a closed-form expression for the posterior. This also makes explicit how the likelihood function updates the prior distribution.
Conjugate Prior for Normal distribution
Assume that $X$ is distributed according to a normal distribution with unknown mean $\mu$ and variance $1/\tau$ (or precision $\tau$), i.e.,

$$X \sim \mathcal{N}(\mu, \tau^{-1}),$$

and that the prior distribution on $(\mu, \tau)$ is a Normal-Gamma distribution,

$$(\mu, \tau) \sim \text{NormalGamma}(\mu_0, \lambda_0, \alpha_0, \beta_0),$$

for which the density $p(\mu, \tau)$ satisfies

$$p(\mu, \tau) \propto \tau^{\alpha_0 - \frac{1}{2}} \exp\left[-\beta_0 \tau\right] \exp\left[-\frac{\lambda_0 \tau (\mu - \mu_0)^2}{2}\right].$$
Suppose that

$$X_1, \ldots, X_n \mid \mu, \tau \;\sim\; \text{i.i.d. } \mathcal{N}(\mu, \tau^{-1}),$$

i.e., the components of $X = (X_1, \ldots, X_n)$ are conditionally independent given $\mu, \tau$, and the conditional distribution given $\mu, \tau$ is normal with expectation $\mu$ and variance $1/\tau$.

The posterior distribution of $\mu$ and $\tau$ given this dataset $X$ is

$$P(\tau, \mu \mid X) \propto L(X \mid \tau, \mu)\, p(\tau, \mu),$$

where $L$ is the likelihood of the data given the parameters.
Since the data are i.i.d., the likelihood of $X$ is given by

$$\begin{aligned}
L(X \mid \tau, \mu) &\propto \prod_{i=1}^{n} \tau^{1/2} \exp\left[-\frac{\tau}{2}(x_i - \mu)^2\right] \\
&\propto \tau^{n/2} \exp\left[-\frac{\tau}{2} \sum_{i=1}^{n} (x_i - \mu)^2\right] \\
&\propto \tau^{n/2} \exp\left[-\frac{\tau}{2} \sum_{i=1}^{n} (x_i - \bar{x} + \bar{x} - \mu)^2\right] \\
&\propto \tau^{n/2} \exp\left[-\frac{\tau}{2} \sum_{i=1}^{n} \left((x_i - \bar{x})^2 + (\bar{x} - \mu)^2\right)\right] \\
&\propto \tau^{n/2} \exp\left[-\frac{\tau}{2} \left((n-1)s^2 + n(\bar{x} - \mu)^2\right)\right],
\end{aligned}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$. (The cross terms $2(x_i - \bar{x})(\bar{x} - \mu)$ vanish when summed, since $\sum_{i}(x_i - \bar{x}) = 0$.)
So, the posterior distribution of the parameters is proportional to the prior times the likelihood as given below.

$$\begin{aligned}
P(\tau, \mu \mid X) &\propto L(X \mid \tau, \mu)\, p(\tau, \mu) \\
&\propto \tau^{n/2} \exp\left[-\frac{\tau}{2}\left((n-1)s^2 + n(\bar{x} - \mu)^2\right)\right] \times \tau^{\alpha_0 - \frac{1}{2}} \exp\left[-\beta_0 \tau\right] \exp\left[-\frac{\lambda_0 \tau (\mu - \mu_0)^2}{2}\right] \\
&\propto \tau^{\frac{n}{2} + \alpha_0 - \frac{1}{2}} \exp\left[-\tau\left(\frac{1}{2}(n-1)s^2 + \beta_0\right)\right] \times \exp\left[-\frac{\tau}{2}\left(n(\bar{x} - \mu)^2 + \lambda_0 (\mu - \mu_0)^2\right)\right].
\end{aligned}$$
The last exponential term is simplified as follows:

$$\begin{aligned}
n(\bar{x} - \mu)^2 + \lambda_0 (\mu - \mu_0)^2
&= (n + \lambda_0)\mu^2 - 2(n\bar{x} + \lambda_0 \mu_0)\mu + n\bar{x}^2 + \lambda_0 \mu_0^2 \\
&= (n + \lambda_0)\left(\mu^2 - 2\,\frac{n\bar{x} + \lambda_0 \mu_0}{n + \lambda_0}\,\mu\right) + n\bar{x}^2 + \lambda_0 \mu_0^2 \\
&= (n + \lambda_0)\left(\mu - \frac{n\bar{x} + \lambda_0 \mu_0}{n + \lambda_0}\right)^2 + n\bar{x}^2 + \lambda_0 \mu_0^2 - \frac{(n\bar{x} + \lambda_0 \mu_0)^2}{n + \lambda_0} \\
&= (n + \lambda_0)\left(\mu - \frac{n\bar{x} + \lambda_0 \mu_0}{n + \lambda_0}\right)^2 + \frac{\lambda_0 n (\bar{x} - \mu_0)^2}{n + \lambda_0}.
\end{aligned}$$
Combining all the results yields

$$\begin{aligned}
P(\tau, \mu \mid X)
&\propto \tau^{\frac{n}{2} + \alpha_0 - \frac{1}{2}} \exp\left[-\tau\left(\frac{1}{2}(n-1)s^2 + \beta_0\right)\right] \\
&\quad\times \exp\left[-\frac{\tau}{2}\left((n + \lambda_0)\left(\mu - \frac{n\bar{x} + \lambda_0 \mu_0}{n + \lambda_0}\right)^2 + \frac{\lambda_0 n (\bar{x} - \mu_0)^2}{n + \lambda_0}\right)\right] \\
&\propto \tau^{\frac{n}{2} + \alpha_0 - \frac{1}{2}} \exp\left[-\tau\left(\beta_0 + \frac{1}{2}\left((n-1)s^2 + \frac{\lambda_0 n (\bar{x} - \mu_0)^2}{n + \lambda_0}\right)\right)\right] \\
&\quad\times \exp\left[-\frac{\tau}{2}(n + \lambda_0)\left(\mu - \frac{n\bar{x} + \lambda_0 \mu_0}{n + \lambda_0}\right)^2\right].
\end{aligned}$$
That is, the posterior is exactly in the same form as a Normal-Gamma distribution, i.e.,

$$P(\tau, \mu \mid X) = \text{NormalGamma}(\mu_1, \lambda_1, \alpha_1, \beta_1)$$

where

$$\mu_1 = \frac{n\bar{x} + \lambda_0 \mu_0}{n + \lambda_0}, \qquad \lambda_1 = n + \lambda_0, \qquad \alpha_1 = \alpha_0 + \frac{n}{2}, \qquad \beta_1 = \beta_0 + \frac{1}{2}\left((n-1)s^2 + \frac{\lambda_0 n (\bar{x} - \mu_0)^2}{n + \lambda_0}\right).$$
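The Normal-Gamma update rule translates directly into code; a minimal sketch, with toy data and hyperparameters chosen purely for illustration:

```python
import numpy as np

def normal_gamma_posterior(x, mu0, lam0, alpha0, beta0):
    """Posterior NormalGamma(mu1, lam1, alpha1, beta1) for i.i.d. N(mu, 1/tau) data."""
    n = len(x)
    xbar = x.mean()
    s2 = x.var(ddof=1)  # sample variance with (n-1) denominator
    mu1 = (n * xbar + lam0 * mu0) / (n + lam0)
    lam1 = n + lam0
    alpha1 = alpha0 + n / 2
    beta1 = beta0 + 0.5 * ((n - 1) * s2 + lam0 * n * (xbar - mu0) ** 2 / (n + lam0))
    return mu1, lam1, alpha1, beta1

x = np.array([1.2, 0.8, 1.1, 0.9, 1.0])  # toy sample (assumption)
mu1, lam1, alpha1, beta1 = normal_gamma_posterior(x, mu0=0.0, lam0=1.0, alpha0=1.0, beta0=1.0)
print(mu1, lam1, alpha1, beta1)  # mu1 = 5/6 interpolates between xbar = 1 and mu0 = 0
```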
Wishart Distribution
Suppose $X$ is a $p \times n$ matrix, each column of which is independently drawn from a $p$-variate normal distribution with zero mean:

$$\mathbf{x}_i = (x_{1i}, \ldots, x_{pi})^\top \sim \mathcal{N}_p(\mathbf{0}, \Sigma), \quad 1 \le i \le n.$$

Then the Wishart distribution is the probability distribution of the $p \times p$ random matrix

$$S = \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^\top, \qquad S \sim W_p(\Sigma, n).$$

The positive integer $n$ is the degrees of freedom. For $n \ge p$, the matrix $S$ is invertible with probability 1 if $\Sigma$ is invertible. If $p = \Sigma = 1$, then this distribution is a chi-squared distribution with $n$ degrees of freedom.
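The $p = \Sigma = 1$ special case is easy to check by simulation: a sum of n squared standard normals should have the chi-squared mean n and variance 2n. The seed and sample sizes below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000

# Each row gives one draw of S = sum_i x_i^2 with x_i ~ N(0, 1), i.e. S ~ W_1(1, n).
x = rng.standard_normal((trials, n))
s = (x ** 2).sum(axis=1)
print(s.mean(), s.var())  # close to n = 5 and 2n = 10
```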
The probability density function of $S \sim W_p(\Sigma, n)$ is given by

$$f(S) = \frac{|S|^{\frac{n-p-1}{2}}}{|\Sigma|^{\frac{n}{2}}\, 2^{\frac{np}{2}}\, \Gamma_p\!\left(\frac{n}{2}\right)} \exp\left(-\frac{1}{2}\,\mathrm{tr}(\Sigma^{-1} S)\right)$$

where $\Gamma_p(x) = \pi^{\frac{p(p-1)}{4}} \prod_{j=1}^{p} \Gamma\!\left(x + \frac{1-j}{2}\right)$.
Inverse-Wishart Distribution
The Inverse-Wishart distribution is a probability distribution on positivedefinite matrices.
We say that $T$ follows the Inverse-Wishart distribution with parameters $\Psi$ and $m$, denoted by $T \sim \text{InvW}_p(\Psi, m)$, if its inverse $T^{-1}$ follows $W_p(\Psi^{-1}, m)$.
The probability density function of $T \sim \text{InvW}_p(\Psi, m)$ is given by

$$f(T) = \frac{|\Psi|^{\frac{m}{2}}}{|T|^{\frac{m+p+1}{2}}\, 2^{\frac{mp}{2}}\, \Gamma_p\!\left(\frac{m}{2}\right)} \exp\left(-\frac{1}{2}\,\mathrm{tr}(\Psi T^{-1})\right)$$

where $\Gamma_p(x) = \pi^{\frac{p(p-1)}{4}} \prod_{j=1}^{p} \Gamma\!\left(x + \frac{1-j}{2}\right)$.
We now show that the Inverse-Wishart distribution is a conjugate prior for the covariance matrix parameter of a multivariate normal distribution. Consider

$$X = (\mathbf{x}_1, \ldots, \mathbf{x}_n), \quad \mathbf{x}_i \sim \mathcal{N}_p(\mathbf{0}, \Sigma), \quad \Sigma \sim \text{InvW}_p(\Psi, m).$$

Then the posterior distribution of $\Sigma$ satisfies

$$\begin{aligned}
f(\Sigma \mid X) &\propto f(X \mid \Sigma)\, g(\Sigma \mid \Psi, m) \\
&\propto (2\pi)^{-\frac{np}{2}} |\Sigma|^{-\frac{n}{2}} \exp\left(-\frac{1}{2} \sum_{i=1}^{n} \mathbf{x}_i^\top \Sigma^{-1} \mathbf{x}_i\right) \times \frac{|\Psi|^{\frac{m}{2}}}{|\Sigma|^{\frac{m+p+1}{2}}\, 2^{\frac{mp}{2}}\, \Gamma_p\!\left(\frac{m}{2}\right)} \exp\left(-\frac{1}{2}\,\mathrm{tr}(\Psi \Sigma^{-1})\right) \\
&\propto |\Sigma|^{-\frac{n}{2} - \frac{m+p+1}{2}} \exp\left(-\frac{1}{2} \sum_{i=1}^{n} \mathbf{x}_i^\top \Sigma^{-1} \mathbf{x}_i - \frac{1}{2}\,\mathrm{tr}(\Psi \Sigma^{-1})\right).
\end{aligned}$$
Noting that

$$\sum_{i=1}^{n} \mathbf{x}_i^\top \Sigma^{-1} \mathbf{x}_i = \mathrm{tr}\left(\sum_{i=1}^{n} \mathbf{x}_i^\top \Sigma^{-1} \mathbf{x}_i\right) = \sum_{i=1}^{n} \mathrm{tr}(\mathbf{x}_i^\top \Sigma^{-1} \mathbf{x}_i) = \sum_{i=1}^{n} \mathrm{tr}(\mathbf{x}_i \mathbf{x}_i^\top \Sigma^{-1}) = \mathrm{tr}(S \Sigma^{-1}), \quad \text{where } S = \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^\top,$$

we obtain

$$f(\Sigma \mid X) \propto |\Sigma|^{-\frac{n+m+p+1}{2}} \exp\left(-\frac{1}{2}\,\mathrm{tr}(S\Sigma^{-1} + \Psi\Sigma^{-1})\right) \propto |\Sigma|^{-\frac{n+m+p+1}{2}} \exp\left(-\frac{1}{2}\,\mathrm{tr}\big((S + \Psi)\Sigma^{-1}\big)\right),$$

i.e., the posterior is $\text{InvW}_p(\Psi + S, m + n)$.
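The conjugate update for the covariance matrix (scale $\Psi + S$, degrees of freedom $m + n$) is a two-line computation; a minimal sketch in which the data and hyperparameters are illustrative assumptions:

```python
import numpy as np

def invwishart_posterior(X, Psi, m):
    """Posterior (scale, df) for Sigma ~ InvW_p(Psi, m) given columns x_i ~ N_p(0, Sigma)."""
    p, n = X.shape
    S = X @ X.T            # scatter matrix S = sum_i x_i x_i^T
    return Psi + S, m + n  # posterior is InvW_p(Psi + S, m + n)

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 50))  # toy data: 50 columns in dimension p = 2 (assumption)
Psi_post, m_post = invwishart_posterior(X, np.eye(2), m=4)
print(m_post)  # → 54
```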
Bayesian Decision Theory
The mode of a posterior distribution, e.g., the MAP estimator, is often a very poor choice as a summary because the mode is usually quite untypical of the distribution, unlike the mean and the median.

In this case we use Bayesian decision theory, where a loss function $L(\theta, \hat{\theta})$ is considered. Here, $L(\theta, \hat{\theta})$ is the loss we incur if the truth is $\theta$ and our estimate is $\hat{\theta}$.

Given a loss function and a posterior distribution, we seek

$$\hat{\theta} = \arg\min_{\hat{\theta}} E\left[L(\theta, \hat{\theta}) \mid D\right].$$

By choosing different loss functions we obtain estimators other than the MAP estimator.
- $L(\theta, \hat{\theta}) = \mathbb{I}_{\{\hat{\theta} \ne \theta\}}$: the MAP estimator,
  $$E[L(\theta, \hat{\theta}) \mid D] = P(\hat{\theta} \ne \theta \mid D) = 1 - p(\hat{\theta} \mid D).$$
- $L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$: the posterior mean,
  $$E[L(\theta, \hat{\theta}) \mid D] = E[(\theta - \hat{\theta})^2 \mid D] = E\big[(\theta - E[\theta \mid D])^2 \mid D\big] + (E[\theta \mid D] - \hat{\theta})^2.$$
- $L(\theta, \hat{\theta}) = |\theta - \hat{\theta}|$: the posterior median,
  $$E[L(\theta, \hat{\theta}) \mid D] = E[|\theta - \hat{\theta}| \mid D].$$
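These characterizations are easy to verify on posterior samples: the sample mean approximately minimizes squared loss and the sample median approximately minimizes absolute loss. The Beta(3, 1) posterior below is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.beta(3, 1, size=100_000)  # samples from an (assumed) skewed posterior

mean_est = theta.mean()        # minimizer of E[(theta - a)^2 | D] over a
median_est = np.median(theta)  # minimizer of E[|theta - a| | D] over a
print(mean_est, median_est)  # near the exact mean 3/4 and exact median 2^(-1/3) ≈ 0.794
```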
References
K.P. Murphy, Machine Learning, The MIT Press, 2012.