Bayes Optimal Classifier and Naïve Bayes
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
September 17th, 2007
Classification
• Learn: h : X → Y
  • X – features
  • Y – target classes
• Suppose you know P(Y|X) exactly; how should you classify?
  • Bayes classifier (see the rule below):
• Why?
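A minimal statement of the decision rule, assuming the standard definition of the Bayes classifier:

$$ h_{\text{Bayes}}(x) \;=\; \arg\max_{y}\; P(Y = y \mid X = x) $$

It predicts the most probable class under the true conditional distribution P(Y|X).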
Optimal classification
• Theorem: the Bayes classifier h_Bayes is optimal!
• That is:
• Proof (sketched below):
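A sketch of the standard argument, under 0/1 loss:

For any classifier $h$,
$$ P\big(h(X) \neq Y\big) \;=\; \mathbb{E}_{X}\!\left[\, 1 - P\big(Y = h(X) \mid X\big) \,\right], $$
and for every $x$ the term $1 - P(Y = h(x) \mid X = x)$ is minimized by choosing $h(x) = \arg\max_y P(Y = y \mid X = x) = h_{\text{Bayes}}(x)$. Hence
$$ P\big(h_{\text{Bayes}}(X) \neq Y\big) \;\le\; P\big(h(X) \neq Y\big) \quad \text{for every classifier } h. $$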
Bayes Rule
P(Y | X) = P(X | Y) P(Y) / P(X)
Which is shorthand for: for each value y of Y and x of X,
P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x)
How hard is it to learn the optimal classifier?
• Data = a set of labeled examples ⟨x, y⟩
• How do we represent these? How many parameters? (counted below)
  • Prior, P(Y):
    • Suppose Y is composed of k classes
  • Likelihood, P(X|Y):
    • Suppose X is composed of n binary features
• Complex model → High variance with limited data!!!
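Under these assumptions (k classes, n binary features, a full table for P(X|Y)), the standard counts are:

• Prior P(Y): $k - 1$ independent parameters.
• Likelihood P(X|Y): $2^n - 1$ parameters per class, i.e. $k\,(2^n - 1)$ in total; with just n = 30 binary features that is already over a billion parameters per class.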
Conditional Independence
• X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z (written out below)
• e.g.,
• Equivalent to:
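Written out (standard definitions; the thunder/lightning example is the usual illustration, stated here as an assumption):

$$ \forall x, y, z:\;\; P(X = x \mid Y = y, Z = z) \;=\; P(X = x \mid Z = z) $$
e.g., $P(\text{Thunder} \mid \text{Rain}, \text{Lightning}) = P(\text{Thunder} \mid \text{Lightning})$
Equivalent to: $P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$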
What if features are independent?
• Predict 10701Grade
• From two conditionally independent features
  • HomeworkGrade
  • ClassAttendance
The Naïve Bayes assumption
• Naïve Bayes assumption:
  • Features are independent given class (see below):
• More generally:
• How many parameters now?
  • Suppose X is composed of n binary features
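Written out, the assumption and the resulting parameter count (standard form):

$$ P(X_1, X_2 \mid Y) \;=\; P(X_1 \mid Y)\, P(X_2 \mid Y) $$
More generally:
$$ P(X_1, \ldots, X_n \mid Y) \;=\; \prod_{i=1}^{n} P(X_i \mid Y) $$
With n binary features and k classes, each $P(X_i \mid Y = y)$ needs one parameter, so about $n\,k$ parameters (plus $k - 1$ for the prior) instead of $k\,(2^n - 1)$.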
The Naïve Bayes Classifier
• Given:
  • Prior P(Y)
  • n conditionally independent features X given the class Y
  • For each Xi, we have likelihood P(Xi|Y)
• Decision rule (see below):
• If assumption holds, NB is optimal classifier!
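The decision rule in standard form:

$$ y^{*} \;=\; h_{\text{NB}}(x) \;=\; \arg\max_{y}\; P(Y = y) \prod_{i=1}^{n} P(X_i = x_i \mid Y = y) $$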
MLE for the parameters of NB
• Given dataset:
  • Count(A=a,B=b) ← number of examples where A=a and B=b
• MLE for NB, simply:
  • Prior: P(Y=y) = Count(Y=y) / N  (N = number of examples)
  • Likelihood: P(Xi=xi | Y=y) = Count(Xi=xi, Y=y) / Count(Y=y)
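A minimal sketch of these counting estimates in Python (function names and data layout are my own, assuming discrete features):

    from collections import Counter, defaultdict

    def mle_naive_bayes(X, y):
        """MLE for naive Bayes with discrete features.
        X: list of feature vectors (lists/tuples); y: list of class labels."""
        N = len(y)
        class_counts = Counter(y)                                # Count(Y=y)
        prior = {c: class_counts[c] / N for c in class_counts}   # P(Y=y) = Count(Y=y)/N

        joint_counts = defaultdict(Counter)                      # (i, y) -> counts over values of Xi
        for xs, c in zip(X, y):
            for i, xi in enumerate(xs):
                joint_counts[(i, c)][xi] += 1                    # Count(Xi=xi, Y=y)

        # P(Xi=xi | Y=y) = Count(Xi=xi, Y=y) / Count(Y=y)
        likelihood = {
            (i, c): {v: cnt / class_counts[c] for v, cnt in counter.items()}
            for (i, c), counter in joint_counts.items()
        }
        return prior, likelihood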
Subtleties of NB classifier 1 – Violating the NB assumption
• Usually, features are not conditionally independent:
• Actual probabilities P(Y|X) often biased towards 0 or 1
• Nonetheless, NB is the single most used classifier out there
  • NB often performs well, even when assumption is violated
  • [Domingos & Pazzani ’96] discuss some conditions for good performance
Subtleties of NB classifier 2 – Insufficient training data
• What if you never see a training instance where X1=a when Y=b?
  • e.g., Y={SpamEmail}, X1={‘Enlargement’}
  • P(X1=a | Y=b) = 0
• Thus, no matter what the values X2,…,Xn take:
  • P(Y=b | X1=a, X2,…,Xn) = 0
• What now???
MAP for Beta distribution
• MAP: use the most likely parameter (see below):
• Beta prior equivalent to extra thumbtack flips
  • As N → ∞, prior is “forgotten”
• But, for small sample size, prior is important!
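In symbols, for the thumbtack/coin example with a Beta(β_H, β_T) prior (the notation here is an assumption, following the usual treatment):

$$ \hat{\theta}_{\text{MAP}} \;=\; \arg\max_{\theta}\, P(\theta \mid \mathcal{D}) \;=\; \frac{N_H + \beta_H - 1}{N_H + N_T + \beta_H + \beta_T - 2} $$

The prior acts like $\beta_H + \beta_T - 2$ extra flips; as $N = N_H + N_T \to \infty$ the data dominate, while for small N the prior matters.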
Bayesian learning for NB parameters – a.k.a. smoothing
• Dataset of N examples
• Prior
  • “distribution” Q(Xi,Y), Q(Y)
  • m “virtual” examples
• MAP estimate (one common form below)
  • P(Xi|Y)
• Now, even if you never observe a feature/class, posterior probability never zero
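One common way to write the smoothed (MAP) estimate, assuming the m virtual examples are distributed according to the prior Q (the exact notation is mine):

$$ \hat{P}(X_i = x_i \mid Y = y) \;=\; \frac{\mathrm{Count}(X_i = x_i, Y = y) \,+\, m\,Q(X_i = x_i, Y = y)}{\mathrm{Count}(Y = y) \,+\, m\,Q(Y = y)} $$

With a uniform Q and a suitable choice of m this reduces to the familiar add-one (Laplace) smoothing; the numerator is never zero, so no posterior collapses to zero.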
Text classification
• Classify e-mails
  • Y = {Spam, NotSpam}
• Classify news articles
  • Y = {what is the topic of the article?}
• Classify webpages
  • Y = {Student, professor, project, …}
• What about the features X?
  • The text!
Features X are the entire document – Xi is the ith word in the article
NB for Text classification
• P(X|Y) is huge!!!
  • Article at least 1000 words, X={X1,…,X1000}
  • Xi represents ith word in document, i.e., the domain of Xi is the entire vocabulary, e.g., Webster Dictionary (or more), 10,000 words, etc.
• NB assumption helps a lot!!!
  • P(Xi=xi|Y=y) is just the probability of observing word xi in a document on topic y
Bag of words model
• Typical additional assumption – position in document doesn’t matter: P(Xi=xi|Y=y) = P(Xk=xi|Y=y)
• “Bag of words” model – order of words on the page ignored
• Sounds really silly, but often works very well!
When the lecture is over, remember to wake up the person sitting next to you in the lecture room.
in is lecture lecture next over person remember room sitting the the the to to up wake when you
Bag of Words Approach
aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, …, gas 1, …, oil 1, …, Zaire 0
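A small sketch of this word-count representation in Python (illustrative code, not from the lecture):

    import re
    from collections import Counter

    def bag_of_words(text):
        """Map a document to word counts; word order and position are discarded."""
        return Counter(re.findall(r"[a-z']+", text.lower()))

    counts = bag_of_words("When the lecture is over, remember to wake up "
                          "the person sitting next to you in the lecture room.")
    print(counts["the"])       # 3
    print(counts["lecture"])   # 2
    print(counts["aardvark"])  # 0 -- a Counter returns 0 for unseen words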
NB with Bag of Words for text classification
• Learning phase:
  • Prior P(Y)
    • Count how many documents you have from each topic (+ prior)
  • P(Xi|Y)
    • For each topic, count how many times you saw word in documents of this topic (+ prior)
• Test phase:
  • For each document
    • Use naïve Bayes decision rule (both phases sketched below)
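A minimal sketch of both phases, assuming each document is given as a list of word tokens and using add-one smoothing for the "(+ prior)" counts (function and variable names are my own):

    import math
    from collections import Counter, defaultdict

    def train_nb(docs, labels):
        """Learning phase: count documents per topic (prior) and words per topic (likelihood)."""
        doc_counts = Counter(labels)                   # for the prior P(Y)
        word_counts = defaultdict(Counter)             # topic -> word counts, for P(Xi|Y)
        vocab = set()
        for words, y in zip(docs, labels):
            word_counts[y].update(words)
            vocab.update(words)
        return doc_counts, word_counts, vocab

    def classify_nb(words, doc_counts, word_counts, vocab):
        """Test phase: naive Bayes decision rule, computed in log space."""
        n_docs = sum(doc_counts.values())
        best, best_score = None, -math.inf
        for y in doc_counts:
            score = math.log(doc_counts[y] / n_docs)   # log P(Y=y)
            total = sum(word_counts[y].values())
            for w in words:
                # add-one smoothing so unseen words never zero out the posterior
                score += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
            if score > best_score:
                best, best_score = y, score
        return best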
Twenty News Groups results
Learning curve for Twenty News Groups
What if we have continuous Xi?
E.g., character recognition: Xi is the ith pixel
Gaussian Naïve Bayes (GNB):
Sometimes assume variance:
  • is independent of Y (i.e., σi),
  • or independent of Xi (i.e., σk),
  • or both (i.e., σ)
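Under GNB, each class-conditional feature distribution is a Gaussian (standard form; the subscripts μ_ik, σ_ik are my notation for feature i and class k):

$$ P(X_i = x \mid Y = y_k) \;=\; \frac{1}{\sigma_{ik}\sqrt{2\pi}}\, \exp\!\left( -\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^2} \right) $$

with one mean and variance per feature and class, optionally tied across classes, features, or both as listed above.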
Estimating Parameters: Y discrete, Xi continuous
Maximum likelihood estimates (superscript j indexes the jth training example):
δ(x) = 1 if x is true, else 0
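The standard ML estimates, using the δ notation above:

$$ \hat{\mu}_{ik} \;=\; \frac{\sum_j x_i^{\,j}\, \delta(y^j = y_k)}{\sum_j \delta(y^j = y_k)}, \qquad \hat{\sigma}_{ik}^2 \;=\; \frac{\sum_j \big(x_i^{\,j} - \hat{\mu}_{ik}\big)^2\, \delta(y^j = y_k)}{\sum_j \delta(y^j = y_k)} $$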
Example: GNB for classifying mental states
• ~1 mm resolution
• ~2 images per sec.
• 15,000 voxels/image
• non-invasive, safe
• measures Blood Oxygen Level Dependent (BOLD) response
• Typical impulse response: ~10 sec
[Mitchell et al.]
Brain scans can track activation with precision and sensitivity
[Mitchell et al.]
Gaussian Naïve Bayes: Learned µ_{voxel,word} for P(BrainActivity | WordCategory = {People, Animal})
[Mitchell et al.]
Learned Bayes Models – Means for P(BrainActivity | WordCategory)
People words / Animal words (figure panels)
Pairwise classification accuracy: 85% [Mitchell et al.]
What you need to know about Naïve Bayes
• Optimal decision using Bayes Classifier
• Naïve Bayes classifier
  • What’s the assumption
  • Why we use it
  • How do we learn it
• Why is Bayesian estimation important
• Text classification
  • Bag of words model
• Gaussian NB
  • Features are still conditionally independent
  • Each feature has a Gaussian distribution given class