2D1431 Machine Learning Bayesian Learning


Page 1:

2D1431 Machine Learning

Bayesian Learning

Page 2:

Outline

- Bayes theorem
- Maximum likelihood (ML) hypothesis
- Maximum a posteriori (MAP) hypothesis
- Naïve Bayes classifier
- Bayes optimal classifier
- Bayesian belief networks
- Expectation maximization (EM) algorithm

Page 3:

Handwritten character classification

Page 4:

Gray level pictures: object classification

Page 5:

Gray level pictures: human action classification

Page 6:

Literature & Software

- T. Mitchell, "Machine Learning": chapter 6
- S. Russell & P. Norvig, "Artificial Intelligence – A Modern Approach": chapters 14+15
- R.O. Duda, P.E. Hart, D.G. Stork, "Pattern Classification, 2nd ed.": chapters 2+3
- David Heckerman, "A Tutorial on Learning with Bayesian Belief Networks": http://ftp.research.microsoft.com/pub/tr/tr-95-06.pdf
- Bayes Net Toolbox for Matlab (free), Kevin Murphy: http://www.cs.berkeley.edu/~murphyk/Bayes/bnt.html

Page 7:

Bayes Theorem

P(h|D) = P(D|h) P(h) / P(D)

- P(D): prior probability of the data D (the evidence)
- P(h): prior probability of the hypothesis h (the prior)
- P(h|D): posterior probability of the hypothesis h given the data D (the posterior)
- P(D|h): probability of the data D given the hypothesis h (the likelihood of the data)

Page 8:

Bayes Theorem

P(h|D) = P(D|h) P(h) / P(D)

posterior = likelihood x prior / evidence

By observing the data D we convert the prior probability P(h) into the a posteriori probability (posterior) P(h|D).

The posterior is the probability that h holds after the data D has been observed.

The evidence P(D) can be viewed merely as a scale factor that guarantees that the posterior probabilities sum to one.

Page 9:

Choosing Hypotheses

P(h|D) = P(D|h) P(h) / P(D)

Generally we want the most probable hypothesis given the training data, the maximum a posteriori (MAP) hypothesis hMAP:

hMAP = argmax_{h∈H} P(h|D)
     = argmax_{h∈H} P(D|h) P(h) / P(D)
     = argmax_{h∈H} P(D|h) P(h)

If all hypothesis priors are equally likely, P(hi) = P(hj), then one can choose the maximum likelihood (ML) hypothesis:

hML = argmax_{h∈H} P(D|h)

Page 10:

Bayes Theorem Example

A patient takes a lab test and the result is positive. The test returns a correct positive (⊕) result in 98% of the cases in which the disease is actually present, and a correct negative (⊖) result in 97% of the cases in which the disease is not present. Furthermore, 0.8% of the entire population have the disease.

Hypotheses: disease, ¬disease
priors P(h): P(disease) = 0.008, P(¬disease) = 0.992
likelihoods P(D|h): P(⊕|disease) = 0.98, P(⊖|disease) = 0.02
                    P(⊕|¬disease) = 0.03, P(⊖|¬disease) = 0.97

Maximum posterior argmax P(h|D):
P(disease|⊕) ∝ P(⊕|disease) P(disease) = 0.0078
P(¬disease|⊕) ∝ P(⊕|¬disease) P(¬disease) = 0.0298
P(disease|⊕) = 0.0078 / (0.0078 + 0.0298) = 0.21
P(¬disease|⊕) = 0.0298 / (0.0078 + 0.0298) = 0.79
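A quick sanity check of these numbers (a minimal sketch; the variable names are mine, the probabilities are the ones on the slide):

```python
# Posterior for the lab-test example above.
p_disease = 0.008
p_no_disease = 1 - p_disease
p_pos_given_disease = 0.98       # correct positive rate
p_pos_given_no_disease = 0.03    # false positive rate

# Unnormalized posteriors: likelihood x prior
joint_disease = p_pos_given_disease * p_disease            # ~0.0078
joint_no_disease = p_pos_given_no_disease * p_no_disease   # ~0.0298

# The evidence P(+) is the scale factor that makes the posteriors sum to one
evidence = joint_disease + joint_no_disease
print(joint_disease / evidence)      # ~0.21  P(disease | +)
print(joint_no_disease / evidence)   # ~0.79  P(not disease | +)
```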

Page 11:

Basic Formulas for Probabilities

- Product rule: P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
- Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
- Theorem of total probability: if A1, A2, …, An are mutually exclusive events with Σi P(Ai) = 1, then
  P(B) = Σi P(B|Ai) P(Ai)

Page 12:

Bayes Theorem Example

The hypothesis is a two-dimensional Gaussian with means μ1, μ2 and a shared standard deviation σ:

P(x1, x2 | μ1, μ2, σ) = 1/(2πσ²) exp( −Σi (xi − μi)² / (2σ²) )

h = {μ1, μ2, σ}

D = {x1, …, xm}

Page 13:

Gaussian Probability Function

P(D | μ1, μ2, σ) = Πm P(xm | μ1, μ2, σ)

Maximum likelihood hypothesis hML:
hML = argmax_{μ1,μ2,σ} P(D | μ1, μ2, σ)

Trick: maximize the log-likelihood
log P(D | μ1, μ2, σ) = Σm log P(xm | μ1, μ2, σ)
                     = Σm log( 1/(2πσ²) exp( −Σi (xm,i − μi)² / (2σ²) ) )
                     = −M log(2πσ²) − Σm Σi (xm,i − μi)² / (2σ²)

Page 14:

Gaussian Probability Function

∂ log P(D | μ1, μ2, σ) / ∂μi = 0
⇒ Σm (xm,i − μi) = 0
⇒ μi,ML = 1/M Σm xm,i = E[xm,i]

∂ log P(D | μ1, μ2, σ) / ∂σ = 0
⇒ σ²ML = Σm Σi (xm,i − μi)² / (2M) = E[ Σi (xm,i − μi)² ] / 2

Maximum likelihood hypothesis hML = {μi,ML, σML}
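A minimal numerical check of these estimators (the data and random seed are made up for illustration; only the formulas come from the slide):

```python
import numpy as np

# Closed-form ML estimates for the 2-D Gaussian with a shared sigma.
rng = np.random.default_rng(0)
X = rng.normal(loc=[0.2, -0.1], scale=1.4, size=(500, 2))   # M = 500 samples

mu_ml = X.mean(axis=0)                                      # mu_i = (1/M) sum_m x_m,i
sigma_ml = np.sqrt(((X - mu_ml) ** 2).sum(axis=1).mean() / 2)  # sigma^2 = (1/2M) sum_m sum_i (x_m,i - mu_i)^2

print(mu_ml, sigma_ml)   # close to the generating parameters
```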

Page 15:

Maximum Likelihood Hypothesis

μML = (0.20, −0.14), σML = 1.42

Page 16:

Bayes Decision Rule

[Figure: x = examples of class c1, modeled by a Gaussian with parameters {μ1, σ1}; o = examples of class c2, modeled by a Gaussian with parameters {μ2, σ2}]

Page 17:

Bayes Decision Rule

Assume we have two Gaussian distributions associated with two separate classes c1, c2:

P(x | ci) = P(x | μi, σi) = 1/(2πσi²) exp( −Σj (xj − μi,j)² / (2σi²) )

Bayes decision rule (maximum posterior probability):
Decide c1 if P(c1|x) > P(c2|x), otherwise decide c2.
If P(c1) = P(c2), use the maximum likelihood P(x|ci);
else use the maximum posterior P(ci|x) ∝ P(x|ci) P(ci).

Page 18:

Bayes Decision Rule

[Figure: decision regions for classes c1 and c2]

Page 19:

Two-Category Case

Discriminant function: if g(x) > 0 then decide c1, else c2

g(x) = P(c1|x) − P(c2|x)
     ∝ P(x|c1) P(c1) − P(x|c2) P(c2)

g(x) = log P(c1|x) − log P(c2|x)
     = log P(x|c1)/P(x|c2) + log P(c1)/P(c2)

For Gaussian probability functions with identical σi:
g(x) = (x − μ2)²/(2σ²) − (x − μ1)²/(2σ²) + log P(c1) − log P(c2)
The decision surface g(x) = 0 is a line/hyperplane.
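A small sketch of this discriminant for one-dimensional inputs (the means, sigma and priors below are made-up illustration values, not from the slides):

```python
import math

mu1, mu2, sigma = 0.0, 2.0, 1.0
p_c1, p_c2 = 0.5, 0.5

def g(x):
    # g(x) > 0  ->  decide c1, otherwise c2
    return ((x - mu2) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2) \
           + math.log(p_c1) - math.log(p_c2)

print("c1" if g(0.5) > 0 else "c2")   # c1: 0.5 is closer to mu1
print("c1" if g(1.8) > 0 else "c2")   # c2: 1.8 is closer to mu2
```

With equal priors the rule reduces to picking the nearer mean, which is why the decision surface is linear.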

Page 20:

Learning a Real Valued Function

Consider a real-valued target function f and noisy training examples <xi, di> with

di = f(xi) + ei

where ei is a random variable drawn from a Gaussian distribution with zero mean.

The maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors:

hML = argmin_{h∈H} Σi (di − h(xi))²

[Figure: target function f, noisy examples, and the hypothesis hML]

Page 21:

Learning a Real Valued Function

hML = argmax_{h∈H} P(D|h)
    = argmax_{h∈H} Πi p(di|h)
    = argmax_{h∈H} Πi (2πσ²)^(−1/2) exp( −(di − h(xi))² / (2σ²) )

Maximizing the logarithm log P(D|h):
hML = argmax_{h∈H} Σi [ −½ log(2πσ²) − (di − h(xi))² / (2σ²) ]
    = argmax_{h∈H} Σi −(di − h(xi))²
    = argmin_{h∈H} Σi (di − h(xi))²
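A minimal sketch of this equivalence: under Gaussian noise, the ML hypothesis is simply the least-squares fit. Here the hypothesis space H is restricted to linear functions h(x) = w0 + w1·x (my choice for illustration), and the data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
d = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=x.shape)   # noisy targets d_i = f(x_i) + e_i

# argmin_h sum_i (d_i - h(x_i))^2 via linear least squares
A = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(A, d, rcond=None)
print(w)   # approximately [0.5, 2.0]
```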

Page 22:

Learning to Predict Probabilities

Predicting the survival probability of a patient. Training examples <xi, di>, where di is 0 or 1. Objective: train a neural network to output the probability h(xi) = p(di = 1) given xi.

Maximum likelihood hypothesis:
hML = argmax_{h∈H} Σi [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]

i.e. maximize the cross entropy between di and h(xi).

Weight update rule for the synapses wk to the output neuron h(xi):
wk ← wk + η Σi (di − h(xi)) xi,k

Compare to the standard BP weight update rule:
wk ← wk + η Σi h(xi)(1 − h(xi)) (di − h(xi)) xi,k
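A hedged sketch of the first update rule for a single sigmoid output unit (the data, learning rate, and initial weights are made up; only the update formula is from the slide):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))                       # inputs x_i with 3 features
d = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)     # binary targets d_i
w = np.zeros(3)
eta = 0.01

def h(X, w):
    return 1.0 / (1.0 + np.exp(-X @ w))             # network output = p(d=1 | x)

for _ in range(50):
    # w_k <- w_k + eta * sum_i (d_i - h(x_i)) * x_i,k
    w = w + eta * X.T @ (d - h(X, w))

print(w)   # weights grow along the direction that separates the two classes
```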

Page 23:

Most Probable Classification

So far we sought the most probable hypothesis hMAP. But what is the most probable classification of a new instance x given the data D? hMAP(x) is not necessarily the most probable classification, although it is often a sufficiently good approximation of it.

Consider three possible hypotheses:
P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
Given a new instance x: h1(x) = +, h2(x) = −, h3(x) = −

hMAP(x) = h1(x) = +
Most probable classification: P(+) = P(h1|D) = 0.4, P(−) = P(h2|D) + P(h3|D) = 0.6

Page 24:

Bayes Optimal Classifier

cmax = argmax_{cj∈C} Σ_{hi∈H} P(cj|hi) P(hi|D)

Example:
P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
P(+|h1) = 1, P(−|h1) = 0
P(+|h2) = 0, P(−|h2) = 1
P(+|h3) = 0, P(−|h3) = 1

therefore
Σ_{hi∈H} P(+|hi) P(hi|D) = 0.4
Σ_{hi∈H} P(−|hi) P(hi|D) = 0.6
argmax_{cj∈C} Σ_{hi∈H} P(cj|hi) P(hi|D) = −
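The same weighted vote in code (a minimal sketch using the slide's numbers; since each hypothesis predicts a single class, P(c|hi) is 1 or 0 here):

```python
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}     # P(h_i | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}    # class each hypothesis assigns to x

votes = {"+": 0.0, "-": 0.0}
for hyp, p_h in posteriors.items():
    votes[predictions[hyp]] += p_h                 # sum_i P(c | h_i) P(h_i | D)

print(votes)                        # {'+': 0.4, '-': 0.6}
print(max(votes, key=votes.get))    # '-'
```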

Page 25:

MAP vs. Bayes Method

The maximum a posteriori hypothesis estimates a single point hMAP in the hypothesis space H. The Bayes method instead estimates and uses the complete distribution P(h|D).

The difference appears when the MAP or the Bayes method is used for inference on unseen instances and one compares the resulting distributions P(x|D):

MAP:   P(x|D) = hMAP(x), with hMAP = argmax_{h∈H} P(h|D)
Bayes: P(x|D) = Σ_{hi∈H} P(x|hi) P(hi|D)

For reasonable prior distributions P(h), the MAP and Bayes solutions are equivalent in the asymptotic limit of infinite training data D.

Page 26:

Naïve Bayes Classifier

- popular, simple learning algorithm
- applicable when a moderate or large training set is available
- assumption: the attributes that describe instances are conditionally independent given the classification (in practice it works surprisingly well even if the assumption is violated)

Applications:
- diagnosis
- text classification (20 newsgroups, 1000 documents per newsgroup, classification accuracy 89%)

Page 27:

Naïve Bayes Classifier

Assume a discrete target function f: X → C, where each instance x is described by attributes <a1, a2, …, an>. The most probable value of f(x) is:

cMAP = argmax_{cj∈C} P(cj | <a1, a2, …, an>)
     = argmax_{cj∈C} P(<a1, a2, …, an> | cj) P(cj) / P(<a1, a2, …, an>)
     = argmax_{cj∈C} P(<a1, a2, …, an> | cj) P(cj)

Naïve Bayes assumption: P(<a1, a2, …, an> | cj) = Πi P(ai|cj)

cNB = argmax_{cj∈C} P(cj) Πi P(ai|cj)

Page 28:

Naïve Bayes Learning Algorithm

Naïve_Bayes_Learn(examples):
  for each target value cj: estimate P(cj)
  for each value ai of each attribute a: estimate P(ai|cj)

Classify_New_Instance(x):
  cNB = argmax_{cj∈C} P(cj) Π_{ai∈x} P(ai|cj)

Page 29:

Naïve Bayes Example

Consider PlayTennis and the new instance
<Outlook=sunny, Temp=cool, Humidity=high, Wind=strong>

Compute cNB = argmax_{cj∈C} P(cj) Π_{ai∈x} P(ai|cj)

PlayTennis: (9+, 5−)  →  P(yes) = 9/14, P(no) = 5/14
Wind=strong: (3+, 3−)  →  P(strong|yes) = 3/9, P(strong|no) = 3/5
…
P(yes) P(sun|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.005
P(no)  P(sun|no)  P(cool|no)  P(high|no)  P(strong|no)  = 0.021
→ cNB = no
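The same computation in code. Only P(yes), P(no) and the Wind=strong estimates are given explicitly above; the remaining conditionals are the standard PlayTennis counts from Mitchell's book, included here for illustration:

```python
p_yes, p_no = 9/14, 5/14
cond_yes = {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9}
cond_no  = {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5}

x = ["sunny", "cool", "high", "strong"]
score_yes, score_no = p_yes, p_no
for a in x:
    score_yes *= cond_yes[a]      # P(yes) * prod_i P(a_i | yes)
    score_no  *= cond_no[a]       # P(no)  * prod_i P(a_i | no)

print(round(score_yes, 4), round(score_no, 4))    # ~0.0053 vs ~0.0206
print("yes" if score_yes > score_no else "no")    # -> no
```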

Page 30:

Estimating Probabilities

What if none (nc = 0) of the training instances with target value cj have attribute value ai?

Then P(ai|cj) = nc/n = 0 and P(cj) Π_{ai∈x} P(ai|cj) = 0.

Solution: Bayesian estimate for P(ai|cj):

P(ai|cj) = (nc + mp) / (n + m)

- n: number of training examples for which c = cj
- nc: number of examples for which c = cj and a = ai
- p: prior estimate of P(ai|cj)
- m: weight given to the prior (number of "virtual" examples)
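A minimal sketch of this estimate (the counts and prior below are illustrative values, not from the slides):

```python
def bayes_estimate(n_c, n, p, m):
    """P(a_i | c_j) = (n_c + m*p) / (n + m)"""
    return (n_c + m * p) / (n + m)

# An attribute value never seen with this class (n_c = 0), uniform prior p = 1/3,
# and m = 3 "virtual" examples: the estimate stays positive instead of collapsing to 0.
print(bayes_estimate(n_c=0, n=9, p=1/3, m=3))   # 0.0833...
```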

Page 31:

Bayesian Belief Networks

- the naïve assumption of conditional independence is too restrictive
- but modeling the full joint probability distribution is intractable due to lack of data
- Bayesian belief networks describe conditional independence among subsets of variables
- they allow combining prior knowledge about causal relationships among variables with observed data

Page 32:

Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z, that is, if

∀ xi, yj, zk: P(X=xi | Y=yj, Z=zk) = P(X=xi | Z=zk)

or more compactly P(X|Y,Z) = P(X|Z).

Example: Thunder is conditionally independent of Rain given Lightning:
P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Notice: P(Thunder | Rain) ≠ P(Thunder)

Naïve Bayes uses conditional independence to justify:
P(X,Y|Z) = P(X|Y,Z) P(Y|Z) = P(X|Z) P(Y|Z)

Page 33:

Bayesian Belief Network

The network represents a set of conditional independence assertions: each node is conditionally independent of its non-descendants, given its immediate predecessors (directed acyclic graph).

[Figure: belief network over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, Forestfire]

Conditional probability table for Campfire:

        S,B    S,¬B   ¬S,B   ¬S,¬B
  C     0.4    0.1    0.8    0.2
  ¬C    0.6    0.9    0.2    0.8

Page 34:

Bayesian Belief Network

The network represents the joint probability distribution over all variables, e.g. P(Storm, BusTourGroup, Lightning, Campfire, Thunder, Forestfire):

P(y1, …, yn) = Π_{i=1}^{n} P(yi | Parents(Yi))

The joint distribution is fully defined by the graph plus the conditional probabilities P(yi | Parents(Yi)).

[Figure: same network and conditional probability table as on page 33; the table gives P(C|S,B) = P(Campfire | Storm, BusTourGroup)]
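A hedged sketch of the factored joint for a slice of this network. Only the Campfire table comes from the slide; the priors for Storm and BusTourGroup are made-up illustration values, and the other nodes are omitted:

```python
p_storm = {True: 0.1, False: 0.9}    # hypothetical prior P(Storm)
p_bus = {True: 0.3, False: 0.7}      # hypothetical prior P(BusTourGroup)
p_campfire = {                       # P(Campfire=True | Storm, BusTourGroup), from the slide
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def joint(storm, bus, campfire):
    # product over nodes of P(y_i | Parents(Y_i))
    p_c = p_campfire[(storm, bus)]
    p_c = p_c if campfire else 1 - p_c
    return p_storm[storm] * p_bus[bus] * p_c

print(joint(storm=True, bus=False, campfire=True))   # 0.1 * 0.7 * 0.1 = 0.007
```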

Page 35:

Expectation Maximization EM

When to use:
- the data is only partially observable
- unsupervised clustering: the target value is unobservable
- supervised learning: some instance attributes are unobservable

Applications:
- training Bayesian belief networks
- unsupervised clustering
- learning hidden Markov models

Page 36:

Generating Data from Mixture of Gaussians

Each instance x is generated by
1. choosing one of the k Gaussians at random, and
2. generating an instance according to that Gaussian.

Page 37:

EM for Estimating k Means

Given:
- instances from X generated by a mixture of k Gaussians
- the means <μ1, …, μk> of the k Gaussians are unknown
- we don't know which instance xi was generated by which Gaussian

Determine: the maximum likelihood estimates of <μ1, …, μk>

Think of the full description of each instance as yi = <xi, zi1, zi2>, where zij is 1 if xi was generated by the j-th Gaussian; xi is observable, zij is unobservable.

Page 38:

EM for Estimating k Means

EM algorithm: pick a random initial h = <μ1, μ2>, then iterate:

E step: Calculate the expected value E[zij] of each hidden variable zij, assuming that the current hypothesis h = <μ1, μ2> holds:

E[zij] = p(x=xi | μ=μj) / Σ_{n=1}^{2} p(x=xi | μ=μn)
       = exp( −(xi−μj)² / (2σ²) ) / Σ_{n=1}^{2} exp( −(xi−μn)² / (2σ²) )

M step: Calculate a new maximum likelihood hypothesis h' = <μ1', μ2'>, assuming that the value taken on by each hidden variable zij is its expected value E[zij] calculated in the E step. Then replace h = <μ1, μ2> by h' = <μ1', μ2'>:

μj ← Σ_{i=1}^{m} E[zij] xi / Σ_{i=1}^{m} E[zij]
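A minimal sketch of this two-means EM loop (the data, σ, and initial means are made up for illustration; only the E-step and M-step formulas come from the slide):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])   # 1-D mixture
sigma = 1.0
mu = np.array([0.0, 1.0])            # initial hypothesis h = <mu1, mu2>

for _ in range(50):
    # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)), normalized over j
    w = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
    w = w / w.sum(axis=1, keepdims=True)
    # M step: mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij]
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)

print(mu)   # close to the true means (-2, 3)
```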

Page 39:

EM Algorithm

Converges to a local maximum of the likelihood and provides estimates of the hidden variables zij.

In fact, it finds a local maximum of E[ln P(Y|h)]:
- Y is the complete data (observable plus unobservable variables)
- the expected value is taken over the possible values of the unobserved variables in Y

Page 40:

General EM Problem

Given:
- observed data X = {x1, …, xm}
- unobserved data Z = {z1, …, zm}
- a parameterized probability distribution P(Y|h), where Y = {y1, …, ym} is the full data, yi = <xi, zi>, and h are the parameters

Determine: h that (locally) maximizes E[ln P(Y|h)]

Applications:
- training Bayesian belief networks
- unsupervised clustering
- hidden Markov models

Page 41:

General EM Method

Define the likelihood function Q(h'|h), which calculates Y = X ∪ Z using the observed X and the current parameters h to estimate Z:

Q(h'|h) = E[ ln P(Y|h') | h, X ]

EM algorithm:

Estimation (E) step: Calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:
Q(h'|h) = E[ ln P(Y|h') | h, X ]

Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function:
h ← argmax_{h'∈H} Q(h'|h)