
Bayesian Inference

• Reading Assignments

R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (Sections 2.1, 2.4-2.6, 3.1-3.2, hard copy).

S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach (Chapter 14, hard copy).

S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (Chapter 3, hard copy).

• Case Studies

H. Schneiderman and T. Kanade, "A Statistical Method for 3D Object Detection Applied to Faces and Cars", Computer Vision and Pattern Recognition Conference, pp. 45-51, 1998 (on-line).

K. Sung and T. Poggio, "Example-based learning for view-based human face detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 39-51, 1998 (on-line).

A. Madabhushi and J. Aggarwal, "A Bayesian approach to human activity recognition", 2nd International Workshop on Visual Surveillance, pp. 25-30, June 1999 (hard copy).

M. Jones and J. Rehg, "Statistical color models with application to skin detection", Technical Report, Compaq Research Labs (on-line).

J. Yang and A. Waibel, "A Real-time Face Tracker", Proceedings of WACV'96, 1996 (on-line).

C. Stauffer and E. Grimson, "Adaptive background mixture models for real-time tracking", IEEE Computer Vision and Pattern Recognition Conference, Vol. 2, pp. 246-252, 1998 (on-line).


• Why bother about probabilities?

- Accounting for uncertainty is a crucial component in decision making (e.g., classification) because of ambiguity in our measurements.

- Probability theory is the proper mechanism for accounting for uncertainty.

- Need to take into account reasonable preferences about the state of the world, for example:

"If the fish was caught in the Atlantic Ocean, then it is more likely to be salmon than sea bass"

- We will discuss techniques for building probabilistic models and for extracting information from a probabilistic model.

• Probabilistic Inference

- If we could define all possible values for the probability distribution, then we could read off any probability we were interested in.

- In general, it is not practical to define all possible entries for the joint probability function.

- Probabilistic inference consists of computing probabilities that are not explicitly stored by the reasoning system (e.g., marginals, conditionals).

• Belief

- The conditional probability of an event given some evidence.

We may not know for sure what affects a particular patient, but we believe that there is, say, an 80% chance (i.e., a probability of 0.8) that the patient has a cavity if he or she has a toothache.


• Bayes rule

- Very often we want to compute the value of P(hypothesis/evidence).

- Bayes' rule provides a way of computing a conditional probability from its inverse conditional probability:

P(B/A) = P(A/B) P(B) / P(A)

- The denominator P(A) can be considered as a normalization constant (it can be selected so that the values P(B/A) sum to 1 over all B).
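As a quick illustration (not from the notes; the numbers below are made up), a short Python sketch applying Bayes' rule to two hypotheses B1 and B2, with P(A) obtained by normalization:

    # Hypothetical priors P(B) and likelihoods P(A/B) for two hypotheses
    prior = {"B1": 0.3, "B2": 0.7}
    likelihood = {"B1": 0.9, "B2": 0.2}

    # P(A) acts as a normalization constant: sum over B of P(A/B) P(B)
    p_A = sum(likelihood[b] * prior[b] for b in prior)

    posterior = {b: likelihood[b] * prior[b] / p_A for b in prior}
    print(posterior)   # the posterior values sum to 1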

• An example: separate sea bass from salmon

- Some definitions.

State of nature ω (random variable): ω1 for sea bass, ω2 for salmon.

Probabilities P(ω1) and P(ω2): prior knowledge of how likely it is to get a sea bass or a salmon (priors).

Probability density function p(x): how frequently we will measure a pattern with feature value x (e.g., x is a lightness measurement) (evidence).

Conditional probability density function (pdf) p(x/ωj): how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj (likelihood).

Conditional probability P(ωj/x): the probability that the fish belongs to class ωj given measurement x (posterior).


- Decision rule using priors only

Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

P(error) = min[P(ω1), P(ω2)]

- Classification can be improved by using additional information (i.e., lightness measurements).

- Decision rule using conditional pdf

- The joint pdf of finding a pattern in category ωj having feature value x is:

p(x, ωj) = P(ωj/x)p(x) = p(x/ωj)P(ωj)

- The Bayes’ formula is:

P(ωj/x) = p(x/ωj)P(ωj) / p(x) = (likelihood × prior) / evidence

where p(x) = p(x/ω1)P(ω1) + p(x/ω2)P(ω2) is essentially a scale factor.

Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2

(or) Decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2


[Figure: class-conditional densities p(x/ωj) and the resulting posteriors P(ωj/x), assuming P(ω1) = 2/3 and P(ω2) = 1/3]

• Probability of error

P(error/x) = P(ω1/x) if we decide ω2, and P(error/x) = P(ω2/x) if we decide ω1

(or) P(error/x) = min[P(ω1/x), P(ω2/x)]

- Does the above decision rule minimize the probability of error?

P(error) = ∫_{−∞}^{∞} P(error, x) dx = ∫_{−∞}^{∞} P(error/x) p(x) dx

Yes: the rule picks the smaller of P(ω1/x) and P(ω2/x) at every x, so P(error/x) is minimized pointwise and the integral is minimized as well.


• Where do the probabilities come from?

- The Bayes decision rule is optimal if the pmf or pdf is known.

- There are two competitive answers to the above question:

(1) Relative frequency (objective) approach.

Probabilities can only come from experiments.

(2) Bayesian (subjective) approach.

Probabilities may reflect degree of belief and can be based on opinion as well as experiments.

Example (objective): classify cars on UNR campus into two categories:

(1) C1: more than $50K
(2) C2: less than $50K

* Suppose we use one feature x: height of car

* From Bayes rule, we can compute our belief:

P(C1/x) = P(x/C1)P(C1) / P(x)

* Need to estimate P(x/C1), P(x/C2), P(C1), and P(C2)

* Determine prior probabilities

(1) ask drivers at the gate how much their car cost
(2) measure the height of the car

* Suppose we end up with 1209 samples: #C1=221, #C2=988


* P(C1) = 221/1209 = 0.183 and P(C2) = 1 − P(C1) = 0.817

* Determine class conditional probabilities (discretize car height into bins and use a normalized histogram)


* Calculate the posterior probability for each bin using the Bayes rule:

P(C1/x = 1.0) = P(x = 1/C1)P(C1) / P(x = 1) = P(x = 1/C1)P(C1) / [P(x = 1/C1)P(C1) + P(x = 1/C2)P(C2)] = 0.438
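The values P(x = 1/C1) and P(x = 1/C2) come from the normalized height histograms referred to above; since those histograms are not reproduced here, the numbers below are hypothetical stand-ins (with the actual values, the posterior works out to the 0.438 above). A minimal Python sketch of the whole procedure:

    # Counts from the notes: 221 cars over $50K (C1), 988 under $50K (C2)
    n_C1, n_C2 = 221, 988
    P_C1 = n_C1 / (n_C1 + n_C2)      # ~0.183
    P_C2 = 1.0 - P_C1                # ~0.817

    # Hypothetical class-conditional probabilities for the height bin x = 1.0
    # (in the notes these come from the normalized height histograms)
    P_x_given_C1 = 0.20
    P_x_given_C2 = 0.05

    # Bayes rule for this bin; the denominator is the total probability of the bin
    evidence = P_x_given_C1 * P_C1 + P_x_given_C2 * P_C2
    P_C1_given_x = P_x_given_C1 * P_C1 / evidence
    print(round(P_C1_given_x, 3))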

• Functional structure of a general statistical classifier

assign x to ωi if gi(x) > gj(x) for all j ≠ i

(discriminant functions)


• Minimum error-rate case

gi(x) = P(ωi/x)

• Is the choice of gi unique?

- Replacing gi(x) with f(gi(x)), where f() is monotonically increasing, does not change the classification results.

gi(x) = p(x/ωi)P(ωi) / p(x)

gi(x) = p(x/ωi)P(ωi)

gi(x) = ln p(x/ωi) + ln P(ωi)
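To make this concrete, here is a small Python check (illustrative only; the 1-D class parameters are made up, and the priors are the 2/3 and 1/3 used in the fish example) that the posterior form, the product p(x/ωi)P(ωi), and its logarithm all select the same class:

    import numpy as np

    def gauss_pdf(x, mu, sigma):
        # 1-D Gaussian density
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

    priors = np.array([2/3, 1/3])            # P(ω1), P(ω2)
    mus, sigmas = [2.0, 4.0], [1.0, 1.0]      # hypothetical class parameters

    x = 3.2
    like = np.array([gauss_pdf(x, m, s) for m, s in zip(mus, sigmas)])

    g_product = like * priors                     # p(x/ωi) P(ωi)
    g_posterior = g_product / g_product.sum()     # P(ωi/x)
    g_log = np.log(like) + np.log(priors)         # ln p(x/ωi) + ln P(ωi)

    # A monotonically increasing f() preserves the argmax, so all three agree
    print(np.argmax(g_product), np.argmax(g_posterior), np.argmax(g_log))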

• Decision regions/boundaries

- Decision rules divide the feature space into decision regions R1, ..., Rc.

- The boundaries of the decision regions are the decision boundaries.


• Discriminant functions for the Gaussian density

- Assume the following discriminant function:

gi(x) = ln p(x/ωi) + ln P(ωi)

- If p(x/ωi) ~ N(µi, Σi), then

gi(x) = −(1/2)(x − µi)^t Σi^−1 (x − µi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)

Case 1: Σi = σ^2 I

(1) features are uncorrelated
(2) each feature has the same variance

- If we disregard (d/2) ln 2π and (1/2) ln |Σi| (constants):

gi(x) = −||x − µi||^2 / (2σ^2) + ln P(ωi)

where ||x − µi||^2 = (x − µi)^t (x − µi); the ln P(ωi) term favors the a priori more likely category.

- Expanding the above expression:

gi(x) = −(1/(2σ^2)) [x^t x − 2 µi^t x + µi^t µi] + ln P(ωi)

- Disregarding x^t x (constant), we get a linear discriminant:

gi(x) = wi^t x + wi0

where wi = (1/σ^2) µi, and wi0 = −(1/(2σ^2)) µi^t µi + ln P(ωi)


- The decision boundary is determined by hyperplanes; setting gi(x) = gj(x):

w^t (x − x0) = 0

where w = µi − µj, and x0 = (1/2)(µi + µj) − [σ^2 / ||µi − µj||^2] ln[P(ωi)/P(ωj)] (µi − µj)

- Some comments about this hyperplane (a numerical sketch follows below):

* It passes through x0.
* It is orthogonal to the line linking the means.
* What happens when P(ωi) = P(ωj)? (Then x0 lies halfway between the means.)
* If P(ωi) ≠ P(ωj), x0 shifts away from the more likely mean.
* If σ is very small, the position of the boundary is insensitive to P(ωi) and P(ωj).
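As a sanity check on these formulas (purely illustrative; the means, σ^2, and priors below are made up), a short Python sketch for Case 1 computes w and x0 and verifies that the two discriminants agree at x0, i.e., that x0 lies on the decision boundary:

    import numpy as np

    # Hypothetical 2-D classes with Σi = σ^2 I
    mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 1.0])
    sigma2 = 0.5
    P_i, P_j = 0.6, 0.4

    def g(x, mu, prior):
        # Case 1 discriminant: -||x - µ||^2 / (2σ^2) + ln P(ω)
        d = x - mu
        return -(d @ d) / (2 * sigma2) + np.log(prior)

    # Hyperplane parameters from the notes
    diff = mu_i - mu_j
    w = diff
    x0 = 0.5 * (mu_i + mu_j) - (sigma2 / (diff @ diff)) * np.log(P_i / P_j) * diff

    # gi(x0) equals gj(x0), so x0 is on the boundary w^t (x - x0) = 0
    print(np.isclose(g(x0, mu_i, P_i), g(x0, mu_j, P_j)))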


- Minimum distance classifier: when P(ωi) is the same for all c classes.

gi(x) = −||x − µi||^2 / (2σ^2)

"Case 2:" Σi = Σ

- The clusters have hyperellipsoidal shape and the same size (each centered at its mean µi).

- If we disregard (d/2) ln 2π and (1/2) ln |Σi| (constants):

gi(x) = −(1/2)(x − µi)^t Σ^−1 (x − µi) + ln P(ωi)

Minimum distance classifier using the Mahalanobis distance: when P(ωi) is the same for all c classes.

gi(x) = −(1/2)(x − µi)^t Σ^−1 (x − µi)

- Expanding the above expression and disregarding the quadratic term:

gi(x) = wi^t x + wi0   (linear discriminant)

where wi = Σ^−1 µi, and wi0 = −(1/2) µi^t Σ^−1 µi + ln P(ωi)


- The decision boundary is determined by hyperplanes; setting gi(x) = gj(x):

w^t (x − x0) = 0

where w = Σ^−1 (µi − µj) and x0 = (1/2)(µi + µj) − [ln[P(ωi)/P(ωj)] / ((µi − µj)^t Σ^−1 (µi − µj))] (µi − µj)

- We can make a number of comments about this hyperplane:

* It passes through x0.
* It is NOT orthogonal to the line linking the means.
* What happens when P(ωi) = P(ωj)? (Then x0 lies halfway between the means.)
* If P(ωi) ≠ P(ωj), x0 shifts away from the more likely mean.


Case 3: Σi = arbitrary

- The clusters have different shapes and sizes (each centered at its mean µi).

- If we disregard (d/2) ln 2π (constant):

gi(x) = x^t Wi x + wi^t x + wi0   (quadratic discriminant)

where Wi = −(1/2) Σi^−1, wi = Σi^−1 µi, and wi0 = −(1/2) µi^t Σi^−1 µi − (1/2) ln |Σi| + ln P(ωi)

- The decision boundary is determined by hyperquadrics; setting gi(x) = gj(x).

- Decision regions can be disconnected.
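For completeness, here is a hedged Python sketch of the Case 3 discriminant (the class parameters below are made up); it builds Wi, wi, and wi0 for each class and assigns a point to the class with the largest gi(x):

    import numpy as np

    # Hypothetical parameters (mean, covariance, prior) for two classes
    params = [
        (np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 0.5]]), 0.5),
        (np.array([2.0, 1.0]), np.array([[0.6, -0.2], [-0.2, 1.2]]), 0.5),
    ]

    def g_quadratic(x, mu, Sigma, prior):
        # gi(x) = x^t Wi x + wi^t x + wi0, with the Case 3 definitions
        Sinv = np.linalg.inv(Sigma)
        W = -0.5 * Sinv
        w = Sinv @ mu
        w0 = -0.5 * mu @ Sinv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
        return x @ W @ x + w @ x + w0

    x = np.array([1.0, 0.5])
    scores = [g_quadratic(x, mu, S, p) for mu, S, p in params]
    print(np.argmax(scores))    # index of the winning class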


• Practical difficulties

- In practice, we do not know P(ω i) or p(x/ω i)

- We are supposed to design our classifier using a set of training data.

• Possible solutions

(1) Estimate P(ω i) and p(x/ω i) using the training data.

- Usually, the estimation of P(ω i) is not very difficult.

- Estimating p(x/ω i) from training data poses serious difficulties:

insufficient number of samples

dimensionality of x is large

(2) Assume that p(x/ω i) has a parametric form (e.g., Gaussian)

- In this case, we just need to estimate some parameters (e.g., µ, Σ)

• Main methods for parameter estimation

Maximum Likelihood: It assumes that the parameters are fixed; the best estimate of their value is defined to be the one that maximizes the probability of obtaining the samples actually observed (i.e., the training data).

Bayesian Estimation: It assumes that the parameters are random variables having some known a priori distribution; observation of the samples (i.e., the training data) converts this to a posterior density, which is then used to estimate the parameters.


• Maximum Likelihood (ML) Estimation

Assumptions

- The training data is divided into c sets D1, D2, ..., Dc (i.e., c classes).

- The data in each set are drawn independently.

- p(x/ωj) is the class-conditional density of class j, which has a known parametric form with parameters θj (e.g., θj = (µj, Σj) for Gaussian data).

Problem

- Given D = {x1, x2, ..., xn}, estimate θ.

- The same procedure will be applied to each data set Dj (i.e., we will solve c separate problems).

ML approach

- The ML estimate is the value θ̂ that maximizes p(D/θ) (i.e., the θ that best supports the training data, maximizing the probability of the observed data).

p(D/θ) = p(x1, x2, ..., xn/θ)

(Note: p(D/θ) is viewed as a function of θ only; it is not a density in θ, since D is fixed.)

- Since the data are drawn independently, the above probability can be written as follows:

p(D/θ) = Π_{k=1}^{n} p(xk/θ)


Example: Let us assume we have two coins, one of type I (fair) and one of type II (unfair). Suppose P(h/I) = P(t/I) = 0.5 and P(h/II) = 0.8, P(t/II) = 0.2. We observe a series of flips of a single coin, and we wish to know what type of coin we are dealing with. Suppose we observe four heads and one tail in sequence:

P(hhhht/I) = P(h/I)P(h/I)P(h/I)P(h/I)P(t/I) = 0.03125

P(hhhht/II) = P(h/II)P(h/II)P(h/II)P(h/II)P(t/II) = 0.08192

Using the ML approach, the coin is of type II (we assume that P(I) = P(II) = 0.5).
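The arithmetic above can be checked with a couple of lines of Python:

    # Likelihood of the sequence h h h h t under each coin type
    p_I = 0.5**4 * 0.5     # 0.03125
    p_II = 0.8**4 * 0.2    # 0.08192
    print("ML choice:", "I" if p_I > p_II else "II")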

• How can we find the maximum?

∇θ p(D/θ) = 0   (i.e., find the solutions and check the sign of the second derivative)

- It is easier to consider ln p(D/θ):

∇θ ln p(D/θ) = 0,  or  Σ_{k=1}^{n} ∇θ ln p(xk/θ) = 0

- The solution θ̂ maximizes p(D/θ) or ln p(D/θ):

θ̂ = arg max_θ ln p(D/θ)


• ML - Gaussian case: Unknown µ

- Consider ln p(x/µ), where p(x/µ) ~ N(µ, Σ):

ln p(x/µ) = −(1/2)(x − µ)^t Σ^−1 (x − µ) − (d/2) ln 2π − (1/2) ln |Σ|

- Setting x = xk :

ln p(xk/µ) = −(1/2)(xk − µ)^t Σ^−1 (xk − µ) − (d/2) ln 2π − (1/2) ln |Σ|

- Computing the gradient, we have:

∇θ ln p(xk/µ) = Σ^−1 (xk − µ)

- Setting ∇θ ln p(D/µ) = 0, we have:

Σ_{k=1}^{n} Σ^−1 (xk − µ) = 0

- The solution µ̂ is given by

µ̂ = (1/n) Σ_{k=1}^{n} xk

- The maximum likelihood estimate is simply the sample mean.


• ML - Gaussian case: Unknown µ and Σ

- Let us consider the 1D Gaussian p(x) ~ N(µ, σ^2) (i.e., θ = (θ1, θ2) = (µ, σ^2))

ln p(xk/θ) = −(1/2) ln(2πθ2) − (1/(2θ2)) (xk − θ1)^2

- Computing ∇θ ln p(xk/θ) we have:

∂ ln p(xk/θ)/∂θ1 = (1/θ2)(xk − θ1)

∂ ln p(xk/θ)/∂θ2 = −1/(2θ2) + (xk − θ1)^2 / (2θ2^2)

- Setting ∇θ ln p(D/θ) = 0, we have:

Σ_{k=1}^{n} (1/θ2)(xk − θ1) = 0

−Σ_{k=1}^{n} 1/(2θ2) + Σ_{k=1}^{n} (xk − θ1)^2 / (2θ2^2) = 0

- The solutions θ̂1 = µ̂ and θ̂2 = σ̂^2 are:

µ̂ = (1/n) Σ_{k=1}^{n} xk

σ̂^2 = (1/n) Σ_{k=1}^{n} (xk − µ̂)^2

- In the general case (multivariate Gaussian), the solutions are:

µ̂ = (1/n) Σ_{k=1}^{n} xk

Σ̂ = (1/n) Σ_{k=1}^{n} (xk − µ̂)(xk − µ̂)^t
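A quick Python sketch (with synthetic data, not from the notes) showing that these ML estimates are simply the sample mean and the (biased, 1/n) sample covariance:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[1.0, -2.0],
                                cov=[[2.0, 0.5], [0.5, 1.0]], size=500)
    n = X.shape[0]

    mu_hat = X.mean(axis=0)              # (1/n) Σ xk
    diff = X - mu_hat
    Sigma_hat = (diff.T @ diff) / n      # (1/n) Σ (xk - µ̂)(xk - µ̂)^t

    print(mu_hat)
    print(Sigma_hat)   # note the division by n (not n-1): the ML estimate is biased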


• Maximum a posteriori estimators (MAP)

- Maximize p(θ/D) (or, equivalently, p(D/θ) p(θ)):

p(θ/D) = p(D/θ) p(θ) / p(D) = [Π_{k=1}^{n} p(xk/θ)] p(θ) / p(D)

maximize p(θ/D) or, equivalently, [Π_{k=1}^{n} p(xk/θ)] p(θ)

- MAP is equivalent to maximum likelihood when p(θ) is uniform (i.e., all values of θ are equally likely a priori).

Example: Assuming P(I) = 0.75 and P(II) = 0.25 in the previous example, we have the following estimates:

P(hhhht/I)P(I) = 0.0234375

P(hhhht/II)P(II) = 0.02048

Using the MAP approach, the coin is of type I
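Continuing the Python check from the ML example, weighting each likelihood by its prior flips the decision:

    # Same likelihoods as before, now weighted by the coin-type priors
    p_I, p_II = 0.5**4 * 0.5, 0.8**4 * 0.2
    prior_I, prior_II = 0.75, 0.25
    print(p_I * prior_I, p_II * prior_II)    # 0.0234375 vs 0.02048
    print("MAP choice:", "I" if p_I * prior_I > p_II * prior_II else "II")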

Example: θ = µ and p(µ) ~ N(µ0, σµ^2)

∂/∂µ [Σ_{k=1}^{n} ln p(xk/θ) + ln p(θ)] = 0

Σ_{k=1}^{n} (1/σ^2)(xk − µ) − (1/σµ^2)(µ − µ0) = 0,  or

µ̂ = [µ0 + (σµ^2/σ^2) Σ_{k=1}^{n} xk] / [1 + n σµ^2/σ^2]

- If σµ^2/σ^2 >> 1, then µ̂ ≈ (1/n) Σ_{k=1}^{n} xk (same as the ML estimate)

- What if n → ∞?
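One way to see the answer is a small Python sketch (synthetic data; µ0, σµ^2, and σ^2 below are arbitrary) comparing the MAP estimate above with the ML estimate (the sample mean): as n grows, the data overwhelm the prior and the two estimates agree.

    import numpy as np

    rng = np.random.default_rng(1)
    mu_true, sigma2 = 3.0, 1.0      # data are drawn from N(mu_true, sigma2)
    mu0, sigma_mu2 = 0.0, 0.5       # prior on µ: N(mu0, sigma_mu2)

    for n in (5, 50, 5000):
        x = rng.normal(mu_true, np.sqrt(sigma2), size=n)
        ml = x.mean()
        map_est = (mu0 + (sigma_mu2 / sigma2) * x.sum()) / (1 + n * sigma_mu2 / sigma2)
        print(n, round(ml, 3), round(map_est, 3))   # MAP approaches ML as n grows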


• Maximum likelihood and model correctness

- If the model chosen for p(x/θ) is the correct one, maximum likelihood will give very good results.

- If the model is wrong, maximum likelihood will give poor results.

• Density estimation

- Bayesian inference is based on estimating the density function.

p(y/x) = p(x/y) p(y) / p(x)

- There are three types of models for density estimation:

* parametric
* non-parametric
* semi-parametric

• Parametric models

- This model assumes that the density function has a particular parametric form (e.g., Gaussian).

- This is appropriate if knowledge about the problem domain suggests a specific functional form.

- Maximum likelihood estimation is usually used to estimate the parameters of the model.


• Non-parametric models

- This approach makes as few assumptions about the form of the density as possible.

- Non-parametric methods perform poorly unless huge data sets are available.

Parzen windows

- The density function p(x) is estimated by averaging M kernel functions, each of which is determined by one of the M data points.

- The kernel functions are usually symmetric and unimodal (e.g., Gaussian of fixed variance):

p(x) = (1/M) Σ_{m=1}^{M} [1/(2πσ^2)^{N/2}] exp(−||x − xm||^2 / (2σ^2))

- A disadvantage of this approach is that the number of kernel functions and parameters grows with the size of the data.
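A minimal Python sketch of the Parzen estimator (assuming Gaussian kernels of fixed variance σ^2 and made-up 1-D data):

    import numpy as np

    def parzen_density(x, samples, sigma2):
        # Average of M Gaussian kernels, one centered at each data point
        samples = np.asarray(samples, dtype=float)
        M, N = len(samples), 1                     # N = dimensionality of x
        norm = (2 * np.pi * sigma2) ** (N / 2)
        return np.sum(np.exp(-(x - samples) ** 2 / (2 * sigma2))) / (M * norm)

    data = [1.2, 1.9, 2.1, 3.5, 3.6, 3.8]          # hypothetical 1-D samples
    print(parzen_density(2.0, data, sigma2=0.25))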

Histogram

- A histogram quantizes the data feature space into regular bins of equal volume.

- The density function is approximated based on the fraction of data that fall in each bin.

* If the number of bins K is too small, the histogram will not be able to model the distribution.
* If K is too large, lots of data is needed to populate the histogram.

- Like Parzen windows, histograms scale poorly with the dimensionality of the feature space.


• Semi-parametric models

- Particularly useful for estimating density functions of unknown structure from limited data.

* The number of parameters can be varied depending upon the nature of the true probability density.

* The number of parameters is not determined by the size of the data set.

Mixtures

- It is defined as a weighted sum of K components, where each component is a parametric density function p(x/k):

p(x) = Σ_{k=1}^{K} p(x/k) πk

- The component densities p(x/k) are usually taken to be of the same parametric form (e.g., Gaussians).


- The weights πk are the mixing parameters and they sum to unity:

Σ_{k=1}^{K} πk = 1

- πk could be regarded as the prior probability of an observation being generated by the k-th mixture component.

- Assuming Gaussian mixtures, the following parameters need to be estimated:

(1) the means of the Gaussians
(2) the covariance matrices of the Gaussians
(3) the mixing parameters

- Maximum-likelihood estimation cannot be carried out in a closed analytic form, as it can for a single Gaussian.

- There exists an iterative learning algorithm, the Expectation-Maximization (EM) algorithm, which attempts to maximize the likelihood.
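To give a feel for how the iteration looks, here is a bare-bones 1-D EM sketch under simple assumptions (synthetic data, crude initialization, a fixed number of iterations): the E-step computes each component's responsibility for each sample, and the M-step re-estimates the means, variances, and mixing weights from those responsibilities.

    import numpy as np

    def em_gmm_1d(x, K, iters=50):
        rng = np.random.default_rng(0)
        mu = rng.choice(x, size=K, replace=False)      # crude initialization
        var = np.full(K, x.var())
        pi = np.full(K, 1.0 / K)
        for _ in range(iters):
            # E-step: responsibility of component k for each sample
            dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
            resp = dens * pi
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: re-estimate parameters from the weighted samples
            Nk = resp.sum(axis=0)
            mu = (resp * x[:, None]).sum(axis=0) / Nk
            var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
            pi = Nk / len(x)
        return mu, var, pi

    x = np.concatenate([np.random.default_rng(1).normal(0.0, 1.0, 300),
                        np.random.default_rng(2).normal(5.0, 0.5, 200)])
    print(em_gmm_1d(x, K=2))

In practice the number of components K must also be chosen, for example by fitting several values of K and comparing the resulting likelihoods on held-out data.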
