
Bayesian Inference

• Reading Assignments

R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (Sections 2.1, 2.4-2.6, 3.1-3.2, hard copy).

S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach (Chapter 14, hard copy).

S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (Chapter 3, hard copy).

• Case Studies

H. Schneiderman and T. Kanade, "A Statistical Method for 3D Object Detection Applied to Faces and Cars", Computer Vision and Pattern Recognition Conference, pp. 45-51, 1998 (on-line).

K. Sung and T. Poggio, "Example-based learning for view-based human face detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 39-51, 1998 (on-line).

A. Madabhushi and J. Aggarwal, "A Bayesian approach to human activity recognition", 2nd International Workshop on Visual Surveillance, pp. 25-30, June 1999 (hard copy).

M. Jones and J. Rehg, "Statistical color models with application to skin detection", Technical Report, Compaq Research Labs (on-line).

J. Yang and A. Waibel, "A Real-time Face Tracker", Proceedings of WACV'96, 1996 (on-line).

C. Stauffer and E. Grimson, "Adaptive background mixture models for real-time tracking", IEEE Computer Vision and Pattern Recognition Conference, Vol. 2, pp. 246-252, 1998 (on-line).


• Why bother about probabilities?

- Accounting for uncertainty is a crucial component in decision making (e.g., classification) because of ambiguity in our measurements.

- Probability theory is the proper mechanism for accounting for uncertainty.

- Need to take into account reasonable preferences about the state of the world, for example:

"If the fish was caught in the Atlantic Ocean, then it is more likely to be salmon than sea bass"

- We will discuss techniques for building probabilistic models and for extracting information from a probabilistic model.

• Probabilistic Inference

- If we could define all possible values for the probability distribution, then we could read off any probability we were interested in.

- In general, it is not practical to define all possible entries for the joint probability function.

- Probabilistic inference consists of computing probabilities that are not explicitly stored by the reasoning system (e.g., marginals, conditionals).

• Belief

- The conditional probability of an event given some evidence.

We may not know for sure what affects a particular patient, but we believe that there is, say, an 80% chance (i.e., a probability of 0.8) that the patient has a cavity if he or she has a toothache.


• Bayes rule

- Very often we want to compute the value of P(hypothesis/evidence).

- Bayes' rule provides a way of computing a conditional probability from its inverse conditional probability:

P(B/A) = P(A/B) P(B) / P(A)

- The denominator P(A) can be considered as a normalization constant (it can be selected so that the values P(B/A) sum to 1 over all B).
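As a quick illustration (not from the notes; the numbers below are made up), a short Python sketch applying Bayes' rule to two hypotheses B1 and B2, with P(A) obtained by normalization:

    # Hypothetical priors P(B) and likelihoods P(A/B) for two hypotheses
    prior = {"B1": 0.3, "B2": 0.7}
    likelihood = {"B1": 0.9, "B2": 0.2}

    # P(A) acts as a normalization constant: sum over B of P(A/B) P(B)
    p_A = sum(likelihood[b] * prior[b] for b in prior)

    posterior = {b: likelihood[b] * prior[b] / p_A for b in prior}
    print(posterior)   # the posterior values sum to 1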

• An example: separate sea bass from salmon

- Some definitions.

State of nature ω (random variable): ω1 for sea bass, ω2 for salmon.

Probabilities P(ω1) and P(ω2): prior knowledge of how likely it is to get a sea bass or a salmon (priors).

Probability density function p(x): how frequently we will measure a pattern with feature value x (e.g., x is a lightness measurement) (evidence).

Conditional probability density function (pdf) p(x/ωj): how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj (likelihood).

Conditional probability P(ωj/x): the probability that the fish belongs to class ωj given measurement x (posterior).


- Decision rule using priors only

Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

P(error) = min[P(ω1), P(ω2)]

- Classification can be improved by using additional information (i.e., lightness measurements).

- Decision rule using conditional pdf

- The joint pdf of finding a pattern in category ωj having feature value x is:

p(x, ωj) = P(ωj/x)p(x) = p(x/ωj)P(ωj)

- The Bayes’ formula is:

P(ωj/x) = p(x/ωj)P(ωj) / p(x) = (likelihood × prior) / evidence

where p(x) = p(x/ω1)P(ω1) + p(x/ω2)P(ω2) is essentially a scale factor.

Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2

(or) Decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2


[Figure: class-conditional densities p(x/ωj) and the resulting posteriors P(ωj/x), assuming P(ω1) = 2/3 and P(ω2) = 1/3]

• Probability of error

P(error/x) = P(ω1/x) if we decide ω2, and P(error/x) = P(ω2/x) if we decide ω1

(or) P(error/x) = min[P(ω1/x), P(ω2/x)]

- Does the above decision rule minimize the probability of error?

P(error) = ∫_{−∞}^{∞} P(error, x) dx = ∫_{−∞}^{∞} P(error/x) p(x) dx

Yes: the rule picks the smaller of P(ω1/x) and P(ω2/x) at every x, so P(error/x) is minimized pointwise and the integral is minimized as well.


• Where do the probabilities come from?

- The Bayes decision rule is optimal if the pmf or pdf is known.

- There are two competitive answers to the above question:

(1) Relative frequency (objective) approach.

Probabilities can only come from experiments.

(2) Bayesian (subjective) approach.

Probabilities may reflect degree of belief and can be based on opinion as well as experiments.

Example (objective): classify cars on UNR campus into two categories:

(1) C1: more than $50K
(2) C2: less than $50K

* Suppose we use one feature x: height of car

* From Bayes rule, we can compute our belief:

P(C1/x) = P(x/C1)P(C1) / P(x)

* Need to estimate P(x/C1), P(x/C2), P(C1), and P(C2)

* Determine prior probabilities

(1) ask drivers at the gate how much their car cost
(2) measure the height of the car

* Suppose we end up with 1209 samples: #C1=221, #C2=988


* P(C1) = 221/1209 = 0.183 and P(C2) = 1 − P(C1) = 0.817

* Determine class conditional probabilities (discretize car height into bins and use a normalized histogram)


* Calculate the posterior probability for each bin using the Bayes rule:

P(C1/x = 1.0) = P(x = 1/C1)P(C1) / P(x = 1) = P(x = 1/C1)P(C1) / [P(x = 1/C1)P(C1) + P(x = 1/C2)P(C2)] = 0.438
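The values P(x = 1/C1) and P(x = 1/C2) come from the normalized height histograms referred to above; since those histograms are not reproduced here, the numbers below are hypothetical stand-ins (with the actual values, the posterior works out to the 0.438 above). A minimal Python sketch of the whole procedure:

    # Counts from the notes: 221 cars over $50K (C1), 988 under $50K (C2)
    n_C1, n_C2 = 221, 988
    P_C1 = n_C1 / (n_C1 + n_C2)      # ~0.183
    P_C2 = 1.0 - P_C1                # ~0.817

    # Hypothetical class-conditional probabilities for the height bin x = 1.0
    # (in the notes these come from the normalized height histograms)
    P_x_given_C1 = 0.20
    P_x_given_C2 = 0.05

    # Bayes rule for this bin; the denominator is the total probability of the bin
    evidence = P_x_given_C1 * P_C1 + P_x_given_C2 * P_C2
    P_C1_given_x = P_x_given_C1 * P_C1 / evidence
    print(round(P_C1_given_x, 3))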

• Functional structure of a general statistical classifier

assign x to ωi if gi(x) > gj(x) for all j ≠ i

(discriminant functions)


• Minimum error-rate case

gi(x) = P(ωi/x)

• Is the choice of gi unique?

- Replacing gi(x) with f(gi(x)), where f() is monotonically increasing, does not change the classification results.

gi(x) = p(x/ωi)P(ωi) / p(x)

gi(x) = p(x/ωi)P(ωi)

gi(x) = ln p(x/ωi) + ln P(ωi)
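To make this concrete, here is a small Python check (illustrative only; the 1-D class parameters are made up, and the priors are the 2/3 and 1/3 used in the fish example) that the posterior form, the product p(x/ωi)P(ωi), and its logarithm all select the same class:

    import numpy as np

    def gauss_pdf(x, mu, sigma):
        # 1-D Gaussian density
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

    priors = np.array([2/3, 1/3])            # P(ω1), P(ω2)
    mus, sigmas = [2.0, 4.0], [1.0, 1.0]      # hypothetical class parameters

    x = 3.2
    like = np.array([gauss_pdf(x, m, s) for m, s in zip(mus, sigmas)])

    g_product = like * priors                     # p(x/ωi) P(ωi)
    g_posterior = g_product / g_product.sum()     # P(ωi/x)
    g_log = np.log(like) + np.log(priors)         # ln p(x/ωi) + ln P(ωi)

    # A monotonically increasing f() preserves the argmax, so all three agree
    print(np.argmax(g_product), np.argmax(g_posterior), np.argmax(g_log))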

• Decision regions/boundaries

- Decision rules divide the feature space into decision regions R1, ..., Rc.

- The boundaries of the decision regions are the decision boundaries.


• Discriminant functions for the Gaussian density

- Assume the following discriminant function:

gi(x) = ln p(x/ωi) + ln P(ωi)

- If p(x/ωi) ~ N(µi, Σi), then

gi(x) = −(1/2)(x − µi)^t Σi^−1 (x − µi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)

Case 1: Σi = σ^2 I

(1) features are uncorrelated
(2) each feature has the same variance

- If we disregard (d/2) ln 2π and (1/2) ln |Σi| (constants):

gi(x) = −||x − µi||^2 / (2σ^2) + ln P(ωi)

where ||x − µi||^2 = (x − µi)^t (x − µi); the ln P(ωi) term favors the a priori more likely category.

- Expanding the above expression:

gi(x) = −(1/(2σ^2)) [x^t x − 2 µi^t x + µi^t µi] + ln P(ωi)

- Disregarding x^t x (constant), we get a linear discriminant:

gi(x) = wi^t x + wi0

where wi = (1/σ^2) µi, and wi0 = −(1/(2σ^2)) µi^t µi + ln P(ωi)


- The decision boundary is determined by hyperplanes; setting gi(x) = gj(x):

w^t (x − x0) = 0

where w = µi − µj, and x0 = (1/2)(µi + µj) − [σ^2 / ||µi − µj||^2] ln[P(ωi)/P(ωj)] (µi − µj)

- Some comments about this hyperplane (a numerical sketch follows below):

* It passes through x0.
* It is orthogonal to the line linking the means.
* What happens when P(ωi) = P(ωj)? (Then x0 lies halfway between the means.)
* If P(ωi) ≠ P(ωj), x0 shifts away from the more likely mean.
* If σ is very small, the position of the boundary is insensitive to P(ωi) and P(ωj).
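As a sanity check on these formulas (purely illustrative; the means, σ^2, and priors below are made up), a short Python sketch for Case 1 computes w and x0 and verifies that the two discriminants agree at x0, i.e., that x0 lies on the decision boundary:

    import numpy as np

    # Hypothetical 2-D classes with Σi = σ^2 I
    mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 1.0])
    sigma2 = 0.5
    P_i, P_j = 0.6, 0.4

    def g(x, mu, prior):
        # Case 1 discriminant: -||x - µ||^2 / (2σ^2) + ln P(ω)
        d = x - mu
        return -(d @ d) / (2 * sigma2) + np.log(prior)

    # Hyperplane parameters from the notes
    diff = mu_i - mu_j
    w = diff
    x0 = 0.5 * (mu_i + mu_j) - (sigma2 / (diff @ diff)) * np.log(P_i / P_j) * diff

    # gi(x0) equals gj(x0), so x0 is on the boundary w^t (x - x0) = 0
    print(np.isclose(g(x0, mu_i, P_i), g(x0, mu_j, P_j)))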


- Minimum distance classifier: when P(ωi) is the same for all c classes.

gi(x) = −||x − µi||^2 / (2σ^2)

"Case 2:" Σi = Σ

- The clusters have hyperellipsoidal shape and the same size (each centered at its mean µi).

- If we disregard (d/2) ln 2π and (1/2) ln |Σi| (constants):

gi(x) = −(1/2)(x − µi)^t Σ^−1 (x − µi) + ln P(ωi)

Minimum distance classifier using the Mahalanobis distance: when P(ωi) is the same for all c classes.

gi(x) = −(1/2)(x − µi)^t Σ^−1 (x − µi)

- Expanding the above expression and disregarding the quadratic term:

gi(x) = wi^t x + wi0   (linear discriminant)

where wi = Σ^−1 µi, and wi0 = −(1/2) µi^t Σ^−1 µi + ln P(ωi)


- The decision boundary is determined by hyperplanes; setting gi(x) = gj(x):

w^t (x − x0) = 0

where w = Σ^−1 (µi − µj) and x0 = (1/2)(µi + µj) − [ln[P(ωi)/P(ωj)] / ((µi − µj)^t Σ^−1 (µi − µj))] (µi − µj)

- We can make a number of comments about this hyperplane:

* It passes through x0.
* It is NOT orthogonal to the line linking the means.
* What happens when P(ωi) = P(ωj)? (Then x0 lies halfway between the means.)
* If P(ωi) ≠ P(ωj), x0 shifts away from the more likely mean.


Case 3: Σi = arbitrary

- The clusters have different shapes and sizes (each centered at its mean µi).

- If we disregard (d/2) ln 2π (constant):

gi(x) = x^t Wi x + wi^t x + wi0   (quadratic discriminant)

where Wi = −(1/2) Σi^−1, wi = Σi^−1 µi, and wi0 = −(1/2) µi^t Σi^−1 µi − (1/2) ln |Σi| + ln P(ωi)

- The decision boundary is determined by hyperquadrics; setting gi(x) = gj(x).

- Decision regions can be disconnected.
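For completeness, here is a hedged Python sketch of the Case 3 discriminant (the class parameters below are made up); it builds Wi, wi, and wi0 for each class and assigns a point to the class with the largest gi(x):

    import numpy as np

    # Hypothetical parameters (mean, covariance, prior) for two classes
    params = [
        (np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 0.5]]), 0.5),
        (np.array([2.0, 1.0]), np.array([[0.6, -0.2], [-0.2, 1.2]]), 0.5),
    ]

    def g_quadratic(x, mu, Sigma, prior):
        # gi(x) = x^t Wi x + wi^t x + wi0, with the Case 3 definitions
        Sinv = np.linalg.inv(Sigma)
        W = -0.5 * Sinv
        w = Sinv @ mu
        w0 = -0.5 * mu @ Sinv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
        return x @ W @ x + w @ x + w0

    x = np.array([1.0, 0.5])
    scores = [g_quadratic(x, mu, S, p) for mu, S, p in params]
    print(np.argmax(scores))    # index of the winning class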


• Practical difficulties

- In practice, we do not know P(ω i) or p(x/ω i)

- We are supposed to design our classifier using a set of training data.

• Possible solutions

(1) Estimate P(ω i) and p(x/ω i) using the training data.

- Usually, the estimation of P(ω i) is not very difficult.

- Estimating p(x/ω i) from training data poses serious difficulties:

insufficient number of samples

dimensionality of x is large

(2) Assume that p(x/ω i) has a parametric form (e.g., Gaussian)

- In this case, we just need to estimate some parameters (e.g., µ, Σ)

• Main methods for parameter estimation

Maximum Likelihood: It assumes that the parameters are fixed; the best estimate of their value is defined to be the one that maximizes the probability of obtaining the samples actually observed (i.e., the training data).

Bayesian Estimation: It assumes that the parameters are random variables having some known a priori distribution; observation of the samples (i.e., the training data) converts this to a posterior density, which is then used to estimate the parameters.


• Maximum Likelihood (ML) Estimation

Assumptions

- The training data is divided into c sets D1, D2, ..., Dc (i.e., c classes).

- The data in each set are drawn independently.

- p(x/ωj) is the class-conditional density of class j, which has a known parametric form with parameters θj (e.g., θj = (µj, Σj) for Gaussian data).

Problem

- Given D = {x1, x2, ..., xn}, estimate θ.

- The same procedure will be applied to each data set Dj (i.e., we will solve c separate problems).

ML approach

- The ML estimate is the value θ̂ that maximizes p(D/θ) (i.e., the θ that best supports the training data, maximizing the probability of the observed data).

p(D/θ) = p(x1, x2, ..., xn/θ)

(Note: p(D/θ) is viewed as a function of θ only; it is not a density in θ, since D is fixed.)

- Since the data are drawn independently, the above probability can be written as follows:

p(D/θ) = Π_{k=1}^{n} p(xk/θ)


Example: Let us assume we have two coins, one of type I (fair) and one of type II (unfair). Suppose P(h/I) = P(t/I) = 0.5 and P(h/II) = 0.8, P(t/II) = 0.2. We observe a series of flips of a single coin, and we wish to know what type of coin we are dealing with. Suppose we observe four heads and one tail in sequence:

P(hhhht/I) = P(h/I)P(h/I)P(h/I)P(h/I)P(t/I) = 0.03125

P(hhhht/II) = P(h/II)P(h/II)P(h/II)P(h/II)P(t/II) = 0.08192

Using the ML approach, the coin is of type II (we assume that P(I) = P(II) = 0.5).
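The arithmetic above can be checked with a couple of lines of Python:

    # Likelihood of the sequence h h h h t under each coin type
    p_I = 0.5**4 * 0.5     # 0.03125
    p_II = 0.8**4 * 0.2    # 0.08192
    print("ML choice:", "I" if p_I > p_II else "II")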

• How can we find the maximum?

∇θ p(D/θ) = 0   (i.e., find the solutions and check the sign of the second derivative)

- It is easier to consider ln p(D/θ):

∇θ ln p(D/θ) = 0,  or  Σ_{k=1}^{n} ∇θ ln p(xk/θ) = 0

- The solution θ̂ maximizes p(D/θ) or ln p(D/θ):

θ̂ = arg max_θ ln p(D/θ)


• ML - Gaussian case: Unknown µ

- Consider ln p(x/µ), where p(x/µ) ~ N(µ, Σ):

ln p(x/µ) = −(1/2)(x − µ)^t Σ^−1 (x − µ) − (d/2) ln 2π − (1/2) ln |Σ|

- Setting x = xk :

ln p(xk/µ) = −(1/2)(xk − µ)^t Σ^−1 (xk − µ) − (d/2) ln 2π − (1/2) ln |Σ|

- Computing the gradient, we have:

∇θ ln p(xk/µ) = Σ^−1 (xk − µ)

- Setting ∇θ ln p(D/µ) = 0, we have:

Σ_{k=1}^{n} Σ^−1 (xk − µ) = 0

- The solution µ̂ is given by

µ̂ = (1/n) Σ_{k=1}^{n} xk

- The maximum likelihood estimate is simply the sample mean.


• ML - Gaussian case: Unknown µ and Σ

- Let us consider the 1D Gaussian p(x) ~ N(µ, σ^2) (i.e., θ = (θ1, θ2) = (µ, σ^2))

ln p(xk/θ) = −(1/2) ln(2πθ2) − (1/(2θ2)) (xk − θ1)^2

- Computing ∇θ ln p(xk/θ) we have:

∂ ln p(xk/θ)/∂θ1 = (1/θ2)(xk − θ1)

∂ ln p(xk/θ)/∂θ2 = −1/(2θ2) + (xk − θ1)^2 / (2θ2^2)

- Setting ∇θ ln p(D/θ) = 0, we have:

Σ_{k=1}^{n} (1/θ2)(xk − θ1) = 0

−Σ_{k=1}^{n} 1/(2θ2) + Σ_{k=1}^{n} (xk − θ1)^2 / (2θ2^2) = 0

- The solutions θ̂1 = µ̂ and θ̂2 = σ̂^2 are:

µ̂ = (1/n) Σ_{k=1}^{n} xk

σ̂^2 = (1/n) Σ_{k=1}^{n} (xk − µ̂)^2

- In the general case (multivariate Gaussian), the solutions are:

µ̂ = (1/n) Σ_{k=1}^{n} xk

Σ̂ = (1/n) Σ_{k=1}^{n} (xk − µ̂)(xk − µ̂)^t
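A quick Python sketch (with synthetic data, not from the notes) showing that these ML estimates are simply the sample mean and the (biased, 1/n) sample covariance:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[1.0, -2.0],
                                cov=[[2.0, 0.5], [0.5, 1.0]], size=500)
    n = X.shape[0]

    mu_hat = X.mean(axis=0)              # (1/n) Σ xk
    diff = X - mu_hat
    Sigma_hat = (diff.T @ diff) / n      # (1/n) Σ (xk - µ̂)(xk - µ̂)^t

    print(mu_hat)
    print(Sigma_hat)   # note the division by n (not n-1): the ML estimate is biased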


• Maximum a posteriori estimators (MAP)

- Maximize p(θ/D) (or, equivalently, p(D/θ) p(θ)):

p(θ/D) = p(D/θ) p(θ) / p(D) = [Π_{k=1}^{n} p(xk/θ)] p(θ) / p(D)

maximize p(θ/D) or, equivalently, [Π_{k=1}^{n} p(xk/θ)] p(θ)

- MAP is equivalent to maximum likelihood when p(θ) is uniform (i.e., all values of θ are equally likely a priori).

Example: Assuming P(I) = 0.75 and P(II) = 0.25 in the previous example, we have the following estimates:

P(hhhht/I)P(I) = 0.0234375

P(hhhht/II)P(II) = 0.02048

Using the MAP approach, the coin is of type I
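Continuing the Python check from the ML example, weighting each likelihood by its prior flips the decision:

    # Same likelihoods as before, now weighted by the coin-type priors
    p_I, p_II = 0.5**4 * 0.5, 0.8**4 * 0.2
    prior_I, prior_II = 0.75, 0.25
    print(p_I * prior_I, p_II * prior_II)    # 0.0234375 vs 0.02048
    print("MAP choice:", "I" if p_I * prior_I > p_II * prior_II else "II")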

Example: θ = µ and p(µ) ~ N(µ0, σµ^2)

∂/∂µ [Σ_{k=1}^{n} ln p(xk/θ) + ln p(θ)] = 0

Σ_{k=1}^{n} (1/σ^2)(xk − µ) − (1/σµ^2)(µ − µ0) = 0,  or

µ̂ = [µ0 + (σµ^2/σ^2) Σ_{k=1}^{n} xk] / [1 + n σµ^2/σ^2]

- If σµ^2/σ^2 >> 1, then µ̂ ≈ (1/n) Σ_{k=1}^{n} xk (same as the ML estimate)

- What if n → ∞?
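One way to see the answer is a small Python sketch (synthetic data; µ0, σµ^2, and σ^2 below are arbitrary) comparing the MAP estimate above with the ML estimate (the sample mean): as n grows, the data overwhelm the prior and the two estimates agree.

    import numpy as np

    rng = np.random.default_rng(1)
    mu_true, sigma2 = 3.0, 1.0      # data are drawn from N(mu_true, sigma2)
    mu0, sigma_mu2 = 0.0, 0.5       # prior on µ: N(mu0, sigma_mu2)

    for n in (5, 50, 5000):
        x = rng.normal(mu_true, np.sqrt(sigma2), size=n)
        ml = x.mean()
        map_est = (mu0 + (sigma_mu2 / sigma2) * x.sum()) / (1 + n * sigma_mu2 / sigma2)
        print(n, round(ml, 3), round(map_est, 3))   # MAP approaches ML as n grows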


• Maximum likelihood and model correctness

- If the model chosen for p(x/θ) is the correct one, maximum likelihood will give very good results.

- If the model is wrong, maximum likelihood will give poor results.

• Density estimation

- Bayesian inference is based on estimating the density function.

p(y/x) = p(x/y) p(y) / p(x)

- There are three types of models for density estimation:

* parametric
* non-parametric
* semi-parametric

• Parametric models

- This model assumes that the density function has a particular parametric form (e.g., Gaussian).

- This is appropriate if knowledge about the problem domain suggests a specific functional form.

- Maximum likelihood estimation is usually used to estimate the parameters of the model.


• Non-parametric models

- This approach makes as few assumptions about the form of the density as possible.

- Non-parametric methods perform poorly unless huge data sets are available.

Parzen windows

- The density function p(x) is estimated by averaging M kernel functions, each of which is determined by one of the M data points.

- The kernel functions are usually symmetric and unimodal (e.g., Gaussian of fixed variance):

p(x) = (1/M) Σ_{m=1}^{M} [1/(2πσ^2)^{N/2}] exp(−||x − xm||^2 / (2σ^2))

- A disadvantage of this approach is that the number of kernel functions and parameters grows with the size of the data.
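A minimal Python sketch of the Parzen estimator (assuming Gaussian kernels of fixed variance σ^2 and made-up 1-D data):

    import numpy as np

    def parzen_density(x, samples, sigma2):
        # Average of M Gaussian kernels, one centered at each data point
        samples = np.asarray(samples, dtype=float)
        M, N = len(samples), 1                     # N = dimensionality of x
        norm = (2 * np.pi * sigma2) ** (N / 2)
        return np.sum(np.exp(-(x - samples) ** 2 / (2 * sigma2))) / (M * norm)

    data = [1.2, 1.9, 2.1, 3.5, 3.6, 3.8]          # hypothetical 1-D samples
    print(parzen_density(2.0, data, sigma2=0.25))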

Histogram

- A histogram quantizes the data feature space into regular bins of equal volume.

- The density function is approximated based on the fraction of data that fall in each bin.

* If the number of bins K is too small, the histogram will not be able to model the distribution.
* If K is too large, lots of data is needed to populate the histogram.

- Like Parzen windows, histograms scale poorly with the dimensionality of the feature space.


• Semi-parametric models

- Particularly useful for estimating density functions of unknown structure from limited data.

* The number of parameters can be varied depending upon the nature of the true probability density.

* The number of parameters is not determined by the size of the data set.

Mixtures

- It is defined as a weighted sum of K components, where each component is a parametric density function p(x/k):

p(x) = Σ_{k=1}^{K} p(x/k) πk

- The component densities p(x/k) are usually taken to be of the same parametric form (e.g., Gaussians).


- The weights πk are the mixing parameters and they sum to unity:

Σ_{k=1}^{K} πk = 1

- πk could be regarded as the prior probability of an observation being generated by the k-th mixture component.

- Assuming Gaussian mixtures, the following parameters need to be estimated:

(1) the means of the Gaussians
(2) the covariance matrices of the Gaussians
(3) the mixing parameters

- Maximum-likelihood estimation cannot be carried out in a closed analytic form, as it can for a single Gaussian.

- There exists an iterative learning algorithm, the Expectation-Maximization (EM) algorithm, which attempts to maximize the likelihood.
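To give a feel for how the iteration looks, here is a bare-bones 1-D EM sketch under simple assumptions (synthetic data, crude initialization, a fixed number of iterations): the E-step computes each component's responsibility for each sample, and the M-step re-estimates the means, variances, and mixing weights from those responsibilities.

    import numpy as np

    def em_gmm_1d(x, K, iters=50):
        rng = np.random.default_rng(0)
        mu = rng.choice(x, size=K, replace=False)      # crude initialization
        var = np.full(K, x.var())
        pi = np.full(K, 1.0 / K)
        for _ in range(iters):
            # E-step: responsibility of component k for each sample
            dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
            resp = dens * pi
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: re-estimate parameters from the weighted samples
            Nk = resp.sum(axis=0)
            mu = (resp * x[:, None]).sum(axis=0) / Nk
            var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
            pi = Nk / len(x)
        return mu, var, pi

    x = np.concatenate([np.random.default_rng(1).normal(0.0, 1.0, 300),
                        np.random.default_rng(2).normal(5.0, 0.5, 200)])
    print(em_gmm_1d(x, K=2))

In practice the number of components K must also be chosen, for example by fitting several values of K and comparing the resulting likelihoods on held-out data.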
