Bayes Optimal Classifier and Naïve Bayes
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
September 17th, 2007
Classification
• Learn: h : X → Y
  • X – features
  • Y – target classes
• Suppose you know P(Y|X) exactly; how should you classify?
  • Bayes classifier (see the rule below):
• Why?
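A minimal statement of the decision rule, assuming the standard definition of the Bayes classifier:

$$ h_{\text{Bayes}}(x) \;=\; \arg\max_{y}\; P(Y = y \mid X = x) $$

It predicts the most probable class under the true conditional distribution P(Y|X).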
Optimal classification
• Theorem: the Bayes classifier h_Bayes is optimal!
• That is:
• Proof (sketched below):
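A sketch of the standard argument, under 0/1 loss:

For any classifier $h$,
$$ P\big(h(X) \neq Y\big) \;=\; \mathbb{E}_{X}\!\left[\, 1 - P\big(Y = h(X) \mid X\big) \,\right], $$
and for every $x$ the term $1 - P(Y = h(x) \mid X = x)$ is minimized by choosing $h(x) = \arg\max_y P(Y = y \mid X = x) = h_{\text{Bayes}}(x)$. Hence
$$ P\big(h_{\text{Bayes}}(X) \neq Y\big) \;\le\; P\big(h(X) \neq Y\big) \quad \text{for every classifier } h. $$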
Bayes Rule
P(Y | X) = P(X | Y) P(Y) / P(X)
Which is shorthand for: for each value y of Y and x of X,
P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x)
How hard is it to learn the optimal classifier?
• Data = a set of labeled examples ⟨x, y⟩
• How do we represent these? How many parameters? (counted below)
  • Prior, P(Y):
    • Suppose Y is composed of k classes
  • Likelihood, P(X|Y):
    • Suppose X is composed of n binary features
• Complex model → High variance with limited data!!!
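Under these assumptions (k classes, n binary features, a full table for P(X|Y)), the standard counts are:

• Prior P(Y): $k - 1$ independent parameters.
• Likelihood P(X|Y): $2^n - 1$ parameters per class, i.e. $k\,(2^n - 1)$ in total; with just n = 30 binary features that is already over a billion parameters per class.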
Conditional Independence
• X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z (written out below)
• e.g.,
• Equivalent to:
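Written out (standard definitions; the thunder/lightning example is the usual illustration, stated here as an assumption):

$$ \forall x, y, z:\;\; P(X = x \mid Y = y, Z = z) \;=\; P(X = x \mid Z = z) $$
e.g., $P(\text{Thunder} \mid \text{Rain}, \text{Lightning}) = P(\text{Thunder} \mid \text{Lightning})$
Equivalent to: $P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$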
What if features are independent?
• Predict 10701Grade
• From two conditionally independent features
  • HomeworkGrade
  • ClassAttendance
The Naïve Bayes assumption
• Naïve Bayes assumption:
  • Features are independent given class (see below):
• More generally:
• How many parameters now?
  • Suppose X is composed of n binary features
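Written out, the assumption and the resulting parameter count (standard form):

$$ P(X_1, X_2 \mid Y) \;=\; P(X_1 \mid Y)\, P(X_2 \mid Y) $$
More generally:
$$ P(X_1, \ldots, X_n \mid Y) \;=\; \prod_{i=1}^{n} P(X_i \mid Y) $$
With n binary features and k classes, each $P(X_i \mid Y = y)$ needs one parameter, so about $n\,k$ parameters (plus $k - 1$ for the prior) instead of $k\,(2^n - 1)$.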
The Naïve Bayes Classifier
• Given:
  • Prior P(Y)
  • n conditionally independent features X given the class Y
  • For each Xi, we have likelihood P(Xi|Y)
• Decision rule (see below):
• If assumption holds, NB is optimal classifier!
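The decision rule in standard form:

$$ y^{*} \;=\; h_{\text{NB}}(x) \;=\; \arg\max_{y}\; P(Y = y) \prod_{i=1}^{n} P(X_i = x_i \mid Y = y) $$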
MLE for the parameters of NB
• Given dataset:
  • Count(A=a,B=b) ← number of examples where A=a and B=b
• MLE for NB, simply:
  • Prior: P(Y=y) = Count(Y=y) / N  (N = number of examples)
  • Likelihood: P(Xi=xi | Y=y) = Count(Xi=xi, Y=y) / Count(Y=y)
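A minimal sketch of these counting estimates in Python (function names and data layout are my own, assuming discrete features):

    from collections import Counter, defaultdict

    def mle_naive_bayes(X, y):
        """MLE for naive Bayes with discrete features.
        X: list of feature vectors (lists/tuples); y: list of class labels."""
        N = len(y)
        class_counts = Counter(y)                                # Count(Y=y)
        prior = {c: class_counts[c] / N for c in class_counts}   # P(Y=y) = Count(Y=y)/N

        joint_counts = defaultdict(Counter)                      # (i, y) -> counts over values of Xi
        for xs, c in zip(X, y):
            for i, xi in enumerate(xs):
                joint_counts[(i, c)][xi] += 1                    # Count(Xi=xi, Y=y)

        # P(Xi=xi | Y=y) = Count(Xi=xi, Y=y) / Count(Y=y)
        likelihood = {
            (i, c): {v: cnt / class_counts[c] for v, cnt in counter.items()}
            for (i, c), counter in joint_counts.items()
        }
        return prior, likelihood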
Subtleties of NB classifier 1 – Violating the NB assumption
• Usually, features are not conditionally independent:
• Actual probabilities P(Y|X) often biased towards 0 or 1
• Nonetheless, NB is the single most used classifier out there
  • NB often performs well, even when assumption is violated
  • [Domingos & Pazzani ’96] discuss some conditions for good performance
Subtleties of NB classifier 2 – Insufficient training data
• What if you never see a training instance where X1=a when Y=b?
  • e.g., Y={SpamEmail}, X1={‘Enlargement’}
  • P(X1=a | Y=b) = 0
• Thus, no matter what the values X2,…,Xn take:
  • P(Y=b | X1=a, X2,…,Xn) = 0
• What now???
MAP for Beta distribution
• MAP: use the most likely parameter (see below):
• Beta prior equivalent to extra thumbtack flips
  • As N → ∞, prior is “forgotten”
• But, for small sample size, prior is important!
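In symbols, for the thumbtack/coin example with a Beta(β_H, β_T) prior (the notation here is an assumption, following the usual treatment):

$$ \hat{\theta}_{\text{MAP}} \;=\; \arg\max_{\theta}\, P(\theta \mid \mathcal{D}) \;=\; \frac{N_H + \beta_H - 1}{N_H + N_T + \beta_H + \beta_T - 2} $$

The prior acts like $\beta_H + \beta_T - 2$ extra flips; as $N = N_H + N_T \to \infty$ the data dominate, while for small N the prior matters.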
Bayesian learning for NB parameters – a.k.a. smoothing
• Dataset of N examples
• Prior
  • “distribution” Q(Xi,Y), Q(Y)
  • m “virtual” examples
• MAP estimate (one common form below)
  • P(Xi|Y)
• Now, even if you never observe a feature/class, posterior probability never zero
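One common way to write the smoothed (MAP) estimate, assuming the m virtual examples are distributed according to the prior Q (the exact notation is mine):

$$ \hat{P}(X_i = x_i \mid Y = y) \;=\; \frac{\mathrm{Count}(X_i = x_i, Y = y) \,+\, m\,Q(X_i = x_i, Y = y)}{\mathrm{Count}(Y = y) \,+\, m\,Q(Y = y)} $$

With a uniform Q and a suitable choice of m this reduces to the familiar add-one (Laplace) smoothing; the numerator is never zero, so no posterior collapses to zero.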
Text classification
• Classify e-mails
  • Y = {Spam, NotSpam}
• Classify news articles
  • Y = {what is the topic of the article?}
• Classify webpages
  • Y = {Student, professor, project, …}
• What about the features X?
  • The text!
Features X are the entire document – Xi is the ith word in the article
NB for Text classification
• P(X|Y) is huge!!!
  • Article at least 1000 words, X={X1,…,X1000}
  • Xi represents ith word in document, i.e., the domain of Xi is the entire vocabulary, e.g., Webster Dictionary (or more), 10,000 words, etc.
• NB assumption helps a lot!!!
  • P(Xi=xi|Y=y) is just the probability of observing word xi in a document on topic y
Bag of words model
• Typical additional assumption – position in document doesn’t matter: P(Xi=xi|Y=y) = P(Xk=xi|Y=y)
• “Bag of words” model – order of words on the page ignored
• Sounds really silly, but often works very well!
When the lecture is over, remember to wake up the person sitting next to you in the lecture room.
in is lecture lecture next over person remember room sitting the the the to to up wake when you
Bag of Words Approach
aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, …, gas 1, …, oil 1, …, Zaire 0
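A small sketch of this word-count representation in Python (illustrative code, not from the lecture):

    import re
    from collections import Counter

    def bag_of_words(text):
        """Map a document to word counts; word order and position are discarded."""
        return Counter(re.findall(r"[a-z']+", text.lower()))

    counts = bag_of_words("When the lecture is over, remember to wake up "
                          "the person sitting next to you in the lecture room.")
    print(counts["the"])       # 3
    print(counts["lecture"])   # 2
    print(counts["aardvark"])  # 0 -- a Counter returns 0 for unseen words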
NB with Bag of Words for text classification
• Learning phase:
  • Prior P(Y)
    • Count how many documents you have from each topic (+ prior)
  • P(Xi|Y)
    • For each topic, count how many times you saw word in documents of this topic (+ prior)
• Test phase:
  • For each document
    • Use naïve Bayes decision rule (both phases sketched below)
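A minimal sketch of both phases, assuming each document is given as a list of word tokens and using add-one smoothing for the "(+ prior)" counts (function and variable names are my own):

    import math
    from collections import Counter, defaultdict

    def train_nb(docs, labels):
        """Learning phase: count documents per topic (prior) and words per topic (likelihood)."""
        doc_counts = Counter(labels)                   # for the prior P(Y)
        word_counts = defaultdict(Counter)             # topic -> word counts, for P(Xi|Y)
        vocab = set()
        for words, y in zip(docs, labels):
            word_counts[y].update(words)
            vocab.update(words)
        return doc_counts, word_counts, vocab

    def classify_nb(words, doc_counts, word_counts, vocab):
        """Test phase: naive Bayes decision rule, computed in log space."""
        n_docs = sum(doc_counts.values())
        best, best_score = None, -math.inf
        for y in doc_counts:
            score = math.log(doc_counts[y] / n_docs)   # log P(Y=y)
            total = sum(word_counts[y].values())
            for w in words:
                # add-one smoothing so unseen words never zero out the posterior
                score += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
            if score > best_score:
                best, best_score = y, score
        return best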
Twenty News Groups results
Learning curve for Twenty News Groups
What if we have continuous Xi?
E.g., character recognition: Xi is the ith pixel
Gaussian Naïve Bayes (GNB):
Sometimes assume variance:
  • is independent of Y (i.e., σi),
  • or independent of Xi (i.e., σk),
  • or both (i.e., σ)
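Under GNB, each class-conditional feature distribution is a Gaussian (standard form; the subscripts μ_ik, σ_ik are my notation for feature i and class k):

$$ P(X_i = x \mid Y = y_k) \;=\; \frac{1}{\sigma_{ik}\sqrt{2\pi}}\, \exp\!\left( -\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^2} \right) $$

with one mean and variance per feature and class, optionally tied across classes, features, or both as listed above.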
Estimating Parameters: Y discrete, Xi continuous
Maximum likelihood estimates (superscript j indexes the jth training example):
δ(x) = 1 if x is true, else 0
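The standard ML estimates, using the δ notation above:

$$ \hat{\mu}_{ik} \;=\; \frac{\sum_j x_i^{\,j}\, \delta(y^j = y_k)}{\sum_j \delta(y^j = y_k)}, \qquad \hat{\sigma}_{ik}^2 \;=\; \frac{\sum_j \big(x_i^{\,j} - \hat{\mu}_{ik}\big)^2\, \delta(y^j = y_k)}{\sum_j \delta(y^j = y_k)} $$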
Example: GNB for classifying mental states
• ~1 mm resolution
• ~2 images per sec.
• 15,000 voxels/image
• non-invasive, safe
• measures Blood Oxygen Level Dependent (BOLD) response
• Typical impulse response: ~10 sec
[Mitchell et al.]
Brain scans can track activation with precision and sensitivity
[Mitchell et al.]
Gaussian Naïve Bayes: Learned µ_{voxel,word} for P(BrainActivity | WordCategory = {People, Animal})
[Mitchell et al.]
Learned Bayes Models – Means for P(BrainActivity | WordCategory)
People words / Animal words (figure panels)
Pairwise classification accuracy: 85% [Mitchell et al.]
What you need to know about Naïve Bayes
• Optimal decision using Bayes Classifier
• Naïve Bayes classifier
  • What’s the assumption
  • Why we use it
  • How do we learn it
• Why is Bayesian estimation important
• Text classification
  • Bag of words model
• Gaussian NB
  • Features are still conditionally independent
  • Each feature has a Gaussian distribution given class