CSE 446: Naïve Bayes Winter 2012 Dan Weld Some slides from Carlos Guestrin, Luke Zettlemoyer & Dan Klein


Page 1:

CSE 446: Naïve BayesWinter 2012

Dan Weld

Some slides from Carlos Guestrin, Luke Zettlemoyer & Dan Klein

Page 2:


Today

• Gaussians
• Naïve Bayes
• Text Classification

Page 3:


Long Ago
• Random variables, distributions
• Marginal, joint & conditional probabilities
• Sum rule, product rule, Bayes rule
• Independence, conditional independence

Page 4:

Last Time

                                 Prior      Hypothesis
Maximum Likelihood Estimate      Uniform    The most likely
Maximum A Posteriori Estimate    Any        The most likely
Bayesian Estimate                Any        Weighted combination

Page 5:

Bayesian Learning

Use Bayes rule:

    P(θ | Data) = P(Data | θ) P(θ) / P(Data)

Or equivalently:

    P(θ | Data) ∝ P(Data | θ) P(θ)

where P(θ) is the prior, P(Data | θ) is the data likelihood, P(Data) is the normalization constant, and P(θ | Data) is the posterior.

Page 6:


Conjugate Priors?

Page 7:

Those Pesky Distributions


                                  Discrete                       Continuous
                                  Binary {0, 1}    M values
Single event                      Bernoulli
Sequence (N trials), N = H + T    Binomial         Multinomial
Conjugate prior                   Beta             Dirichlet

Page 8:

What about continuous variables?

• Billionaire says: If I am measuring a continuous variable, what can you do for me?

• You say: Let me tell you about Gaussians…

Page 9:

Some properties of Gaussians

• Affine transformations of a Gaussian (multiplying by a scalar and adding a constant) are Gaussian
  – X ~ N(μ, σ²)
  – Y = aX + b  ⇒  Y ~ N(aμ + b, a²σ²)

• Sum of independent Gaussians is Gaussian
  – X ~ N(μX, σ²X)
  – Y ~ N(μY, σ²Y)
  – Z = X + Y  ⇒  Z ~ N(μX + μY, σ²X + σ²Y)

• Easy to differentiate, as we will see soon!
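To make these properties concrete, here is a minimal NumPy sketch (an added illustration, not from the original slides) that samples from a Gaussian and checks the stated means and variances empirically; the particular values μ = 2, σ = 3, a = 0.5, b = 1 are arbitrary.

```python
# Minimal sketch: empirically check the affine-transformation and sum properties of Gaussians.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0
a, b = 0.5, 1.0

x = rng.normal(mu, sigma, size=1_000_000)
y = a * x + b                      # should be ~ N(a*mu + b, a^2 * sigma^2)
print(y.mean(), a * mu + b)        # sample mean vs. theoretical mean
print(y.var(),  a**2 * sigma**2)   # sample variance vs. theoretical variance

# Sum of *independent* Gaussians: Z = X + W ~ N(mu_x + mu_w, var_x + var_w)
w = rng.normal(-1.0, 2.0, size=1_000_000)
z = x + w
print(z.mean(), mu + (-1.0))
print(z.var(),  sigma**2 + 2.0**2)
```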

Page 10:

Learning a Gaussian

• Collect a bunch of data
  – Hopefully, i.i.d. samples
  – e.g., exam scores

• Learn parameters
  – Mean: μ
  – Variance: σ²

  Xi = exam score of the ith student:

      i    Exam Score
      0    85
      1    95
      2    100
      3    12
      …    …
      99   89

Page 11:

MLE for Gaussian:

• Prob. of i.i.d. samples D = {x1, …, xN}:

    P(D | μ, σ) = ∏i=1..N (1 / (σ√(2π))) exp(−(xi − μ)² / (2σ²))

• Log-likelihood of data:

    ln P(D | μ, σ) = −N ln(σ√(2π)) − Σi=1..N (xi − μ)² / (2σ²)

Page 12:

Your second learning algorithm: MLE for the mean of a Gaussian

• What's the MLE for the mean?
  Set the derivative of the log-likelihood with respect to μ to zero:

      d/dμ ln P(D | μ, σ) = (1/σ²) Σi (xi − μ) = 0   ⇒   μ_MLE = (1/N) Σi xi

Page 13:

MLE for the variance
• Again, set the derivative to zero:

      d/dσ ln P(D | μ, σ) = −N/σ + (1/σ³) Σi (xi − μ)² = 0   ⇒   σ²_MLE = (1/N) Σi (xi − μ_MLE)²

Page 14:

Learning Gaussian parameters

• MLE:

      μ_MLE = (1/N) Σi xi
      σ²_MLE = (1/N) Σi (xi − μ_MLE)²

  (Note: the MLE for the variance is biased; the unbiased estimate divides by N − 1 instead of N.)
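As an added illustration (not from the slides), the closed-form MLEs above can be computed directly; the scores below are placeholders standing in for the exam-score table.

```python
# Minimal sketch: MLE for a Gaussian's mean and variance from i.i.d. samples.
import numpy as np

scores = np.array([85, 95, 100, 12, 89], dtype=float)  # hypothetical exam scores

mu_mle = scores.mean()                      # (1/N) * sum_i x_i
var_mle = ((scores - mu_mle) ** 2).mean()   # (1/N) * sum_i (x_i - mu)^2  (biased MLE)

print(mu_mle, var_mle)
# Note: np.var(scores) uses the same 1/N convention (ddof=0), matching the MLE.
```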

Page 15:

Bayesian learning of Gaussian parameters

• Conjugate priors
  – Mean: Gaussian prior
  – Variance: Wishart Distribution

Page 16:

Supervised Learning of Classifiers: Find f

• Given: Training set {(xi, yi) | i = 1 … n}
• Find: A good approximation to f : X → Y

Examples: what are X and Y?
• Spam Detection
  – Map email to {Spam, Ham}
• Digit recognition
  – Map pixels to {0,1,2,3,4,5,6,7,8,9}
• Stock Prediction
  – Map new, historic prices, etc. to ℝ (the real numbers)

The first two are classification problems.

Page 17:


Bayesian Categorization

• Let the set of categories be {c1, c2, …, cn}
• Let E be the description of an instance.
• Determine the category of E by computing, for each ci:

    P(ci | E) = P(ci) P(E | ci) / P(E)

• P(E) can be ignored, since it is a factor common to all categories:

    P(ci | E) ∝ P(ci) P(E | ci)

Page 18:

Text classification

• Classify e-mails
  – Y = {Spam, NotSpam}

• Classify news articles
  – Y = {what is the topic of the article?}

• Classify webpages
  – Y = {Student, Professor, Project, …}

• What to use for features, X?

Page 19:

Example: Spam Filter

• Input: email
• Output: spam/ham
• Setup:

– Get a large collection of example emails, each labeled “spam” or “ham”

– Note: someone has to hand label all this data!

– Want to learn to predict labels of new, future emails

• Features: The attributes used to make the ham / spam decision
  – Words: FREE!
  – Text Patterns: $dd, CAPS
  – For email specifically, semantic features: SenderInContacts
  – …

Dear Sir.

First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …

TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.

99 MILLION EMAIL ADDRESSES FOR ONLY $99

Ok, Iknow this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.

Page 20:

Features: X is the word sequence in the document; Xi is the ith word in the article.

Page 21:

Features for Text Classification
• X is the sequence of words in a document
• X (and hence P(X|Y)) is huge!!!
  – An article has at least 1000 words: X = {X1, …, X1000}
  – Xi represents the ith word in the document, i.e., the domain of Xi is the entire vocabulary, e.g., Webster's Dictionary (or more): 10,000 words, etc.
    • 10,000^1000 = 10^4000 possible values for X
    • Atoms in the universe: ≈ 10^80
  – We may have a problem…

Page 22:

Bag of Words Model

Typical additional assumption:
– Position in document doesn't matter:
  • P(Xi = xi | Y = y) = P(Xk = xi | Y = y)
    (all positions have the same distribution)
– Ignore the order of words

When the lecture is over, remember to wake up the person sitting next to you in the lecture room.

Page 23:

Bag of Words Model

in is lecture lecture next over person remember room sitting the the the to to up wake when you

Typical additional assumption:
– Position in document doesn't matter:
  • P(Xi = xi | Y = y) = P(Xk = xi | Y = y)
    (all positions have the same distribution)
– Ignore the order of words
– Sounds really silly, but often works very well!
– From now on:
  • Xi = Boolean: "wordi is in document"
  • X = X1 … Xn

Page 24:

Bag of Words Approach

    aardvark   0
    about      2
    all        2
    Africa     1
    apple      0
    anxious    0
    ...
    gas        1
    ...
    oil        1
    Zaire      0
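A minimal sketch (an added illustration; the vocabulary, document text, and whitespace tokenizer are simplified assumptions) of producing a word-count vector like the one above:

```python
# Minimal sketch: turn a document into a bag-of-words count vector.
from collections import Counter

# vocabulary kept lowercase to match the lowercased tokens
vocabulary = ["aardvark", "about", "all", "africa", "apple", "anxious", "gas", "oil", "zaire"]

def bag_of_words(text, vocab):
    """Count how many times each vocabulary word appears, ignoring order and position."""
    tokens = text.lower().split()           # naive whitespace tokenizer
    counts = Counter(tokens)
    return {word: counts.get(word, 0) for word in vocab}

doc = "All about the oil and gas reserves in Africa and all of its neighbors"
print(bag_of_words(doc, vocabulary))
```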

Page 25:


Bayesian Categorization

• Need to know:
  – Priors: P(yi)
  – Conditionals: P(X | yi)

    P(yi | X) ∝ P(yi) P(X | yi)

• P(yi) is easily estimated from data.
  – If ni of the examples in D are in class yi, then P(yi) = ni / |D|

• Conditionals:
  – X = X1 … Xn
  – Estimate P(X1 … Xn | yi)

• Problem: too many possible instances to estimate!
  – (exponential in n)
  – Even with the bag of words assumption!

Page 26:


Need to Simplify Somehow

• Too many probabilities
  – P(x1, x2, x3 | yi)

      P(x1, x2, x3 | spam)
      P(¬x1, x2, x3 | spam)
      P(x1, ¬x2, x3 | spam)
      …
      P(¬x1, ¬x2, ¬x3 | spam)

• Can we assume some are the same?
  – P(x1, x2) = P(x1) P(x2) ?

Page 27:

Conditional Independence
• X is conditionally independent of Y given Z, if the probability distribution governing X is independent of the value of Y, given the value of Z:

    (∀ x, y, z)  P(X = x | Y = y, Z = z) = P(X = x | Z = z)

• e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

• Equivalent to: P(X, Y | Z) = P(X | Z) P(Y | Z)

Page 28:

Naïve Bayes

• Naïve Bayes assumption:
  – Features are independent given the class:

      P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)

  – More generally:

      P(X1 … Xn | Y) = ∏i P(Xi | Y)

• How many parameters now?
  – Suppose X is composed of n binary features: O(n) parameters per class, versus O(2^n) for the full joint P(X1 … Xn | Y).

Page 29:

The Naïve Bayes Classifier
• Given:
  – Prior P(Y)
  – n conditionally independent features X given the class Y
  – For each Xi, we have the likelihood P(Xi | Y)

• Decision rule:

    y* = h_NB(x) = argmax_y P(y) ∏i P(xi | y)

  (Graphical model: class Y with children X1, X2, …, Xn.)

Page 30:

MLE for the parameters of NB
• Given the dataset, count occurrences
• MLE for discrete NB is simply:
  – Prior:       P(Y = y) = Count(Y = y) / N
  – Likelihood:  P(Xi = x | Y = y) = Count(Xi = x, Y = y) / Count(Y = y)
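A minimal sketch (an added illustration with a hypothetical toy dataset) of those counting estimates for boolean features:

```python
# Minimal sketch: MLE for discrete (boolean-feature) Naive Bayes by counting.
from collections import defaultdict

# toy labeled data: (feature dict, label); purely hypothetical
data = [
    ({"free": 1, "meeting": 0}, "spam"),
    ({"free": 1, "meeting": 0}, "spam"),
    ({"free": 0, "meeting": 1}, "ham"),
    ({"free": 0, "meeting": 1}, "ham"),
]

label_counts = defaultdict(int)      # Count(Y = y)
feature_counts = defaultdict(int)    # Count(X_i = x, Y = y)

for features, y in data:
    label_counts[y] += 1
    for name, value in features.items():
        feature_counts[(name, value, y)] += 1

N = len(data)
prior = {y: c / N for y, c in label_counts.items()}              # P(Y = y)

def likelihood(name, value, y):
    return feature_counts[(name, value, y)] / label_counts[y]    # P(X_i = x | Y = y)

print(prior)
print(likelihood("free", 1, "spam"))
```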

Page 31:

Subtleties of NB Classifier 1: Violating the NB Assumption

• Usually, features are not conditionally independent:

    P(X1 … Xn | Y) ≠ ∏i P(Xi | Y)

• Actual probabilities P(Y|X) are often biased towards 0 or 1
• Nonetheless, NB is the single most used classifier out there
  – NB often performs well, even when the assumption is violated
  – [Domingos & Pazzani '96] discuss some conditions for good performance

Page 32:

Subtleties of NB classifier 2: Overfitting

Page 33:

For Binary Features: We already know the answer!

• MAP: use the most likely parameter:

    θ̂_MAP = argmax_θ P(θ | D) = (α_H + β_H − 1) / (α_H + α_T + β_H + β_T − 2)

  (α_H, α_T = observed counts; β_H, β_T = Beta prior hyperparameters)

• Beta prior is equivalent to extra observations for each feature
• As N → ∞, the prior is "forgotten"
• But for small sample sizes, the prior is important!
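As a worked sketch (an added illustration, assuming a Beta(β_H, β_T) prior on the heads probability and the MAP formula above):

```python
# Minimal sketch: MAP estimate for a Bernoulli parameter under a Beta(beta_h, beta_t) prior.
# theta_MAP = (n_heads + beta_h - 1) / (N + beta_h + beta_t - 2)
def map_estimate(n_heads, n_tails, beta_h=2.0, beta_t=2.0):
    n = n_heads + n_tails
    return (n_heads + beta_h - 1.0) / (n + beta_h + beta_t - 2.0)

print(map_estimate(3, 0))        # small sample: the prior pulls the estimate away from 1.0
print(map_estimate(3000, 1000))  # large sample: the prior is essentially "forgotten"
```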

Page 34:


That's Great for the Binomial Case

• Works for Spam / Ham
• What about multiple classes?
  – E.g., given a Wikipedia page, predicting its type

Page 35:

Multinomials: Laplace Smoothing

• Laplace's estimate:
  – Pretend you saw every outcome k extra times:

      P_LAP,k(x) = (count(x) + k) / (N + k |X|)

  – What's Laplace with k = 0? (the MLE)
  – k is the strength of the prior
  – Can derive this as a MAP estimate for a multinomial with a Dirichlet prior

• Laplace for conditionals:
  – Smooth each condition independently:

      P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k |X|)

  Example observations: H H T
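A minimal sketch (an added illustration) of Laplace's add-k estimate, matching the H H T example:

```python
# Minimal sketch: Laplace (add-k) smoothing for a discrete distribution.
from collections import Counter

def laplace_estimate(observations, outcomes, k=1.0):
    """P_LAP,k(x) = (count(x) + k) / (N + k * |outcomes|); k=0 recovers the MLE."""
    counts = Counter(observations)
    n = len(observations)
    return {x: (counts[x] + k) / (n + k * len(outcomes)) for x in outcomes}

flips = ["H", "H", "T"]
print(laplace_estimate(flips, ["H", "T"], k=0))  # MLE: {'H': 2/3, 'T': 1/3}
print(laplace_estimate(flips, ["H", "T"], k=1))  # Laplace: {'H': 3/5, 'T': 2/5}
```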

Page 36:


Naïve Bayes for Text

• Modeled as generating a bag of words for a document in a given category by repeatedly sampling with replacement from a vocabulary V = {w1, w2,…wm} based on the probabilities P(wj | ci).

• Smooth probability estimates with Laplace m-estimates, assuming a uniform distribution over all words (p = 1/|V|) and m = |V|
  – Equivalent to a virtual sample of seeing each word in each category exactly once.

Page 37:


Easy to Implement

• But…

• If you do… it probably won’t work…

Page 38:

Probabilities: Important Detail!

Any more potential problems here?

    P(spam | X1 … Xn) = ∏i P(spam | Xi)

• We are multiplying lots of small numbers → danger of underflow!

    0.5^57 ≈ 7 × 10^-18

• Solution? Use logs and add!

    p1 · p2 = e^(log(p1) + log(p2))

• Always keep in log form
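A minimal sketch (an added illustration) of scoring a class in log space:

```python
# Minimal sketch: score a class in log space to avoid underflow.
import math

def log_score(log_prior, log_likelihoods):
    """log P(y) + sum_i log P(x_i | y) -- never exponentiate until/unless you must."""
    return log_prior + sum(log_likelihoods)

# 57 factors of 0.5 are already tiny if multiplied directly (0.5**57 ~ 7e-18), and the
# product eventually underflows to 0.0 for longer documents; adding logs stays well-scaled.
probs = [0.5] * 57
naive_product = math.prod(probs)
log_sum = sum(math.log(p) for p in probs)

print(naive_product)   # ~7e-18
print(log_sum)         # about -39.5, perfectly representable
```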

Page 39:


Naïve Bayes Posterior Probabilities

• Classification results of naïve Bayes
  – i.e., the class with maximum posterior probability
  – Usually fairly accurate (?!?!?)

• However, due to the inadequacy of the conditional independence assumption…
  – Actual posterior-probability estimates not accurate.
  – Output probabilities generally very close to 0 or 1.

Page 40:

NB with Bag of Words for text classification

• Learning phase:
  – Prior P(Y)
    • Count how many documents come from each topic (prior)
  – P(Xi | Y)
    • For each topic, count how many times you saw each word in documents of this topic (+ prior); remember this distribution is shared across all positions i

• Test phase:
  – For each document
    • Use the naïve Bayes decision rule

Page 41:

Twenty News Groups results

Page 42:

Learning curve for Twenty News Groups

Page 43:

What if we have continuous Xi?
E.g., character recognition: Xi is the ith pixel

Gaussian Naïve Bayes (GNB):

    P(Xi = x | Y = yk) = (1 / (σik √(2π))) exp(−(x − μik)² / (2σik²))

Sometimes we assume the variance
• is independent of Y (i.e., σi),
• or independent of Xi (i.e., σk),
• or both (i.e., σ)

Page 44:

Estimating Parameters: Y discrete, Xi continuous

Maximum likelihood estimates:
• Mean:

    μik = ( Σj Xi^j δ(Yj = yk) ) / ( Σj δ(Yj = yk) )

• Variance:

    σik² = ( Σj (Xi^j − μik)² δ(Yj = yk) ) / ( Σj δ(Yj = yk) )

where j indexes the jth training example and δ(x) = 1 if x is true, else 0.
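A minimal sketch (an added illustration with hypothetical two-feature data) of these per-class, per-feature estimates and the resulting GNB prediction:

```python
# Minimal sketch: Gaussian Naive Bayes parameter estimation (per class, per feature).
import numpy as np

def gnb_fit(X, y):
    """Return per-class priors, means, and variances; X is (n_examples, n_features)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    means = {c: X[y == c].mean(axis=0) for c in classes}
    variances = {c: X[y == c].var(axis=0) for c in classes}   # MLE (1/N_c) convention
    return priors, means, variances

def gnb_log_likelihood(x, mean, var):
    """sum_i log N(x_i; mu_i, sigma_i^2) for one class."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var))

# hypothetical 2-feature data
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 0.5], [3.3, 0.4]])
y = np.array([0, 0, 1, 1])
priors, means, variances = gnb_fit(X, y)

x_new = np.array([1.1, 1.9])
scores = {c: np.log(priors[c]) + gnb_log_likelihood(x_new, means[c], variances[c]) for c in priors}
print(max(scores, key=scores.get))   # predicted class
```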

Page 45:

Example: GNB for classifying mental states

fMRI:
• ~1 mm resolution
• ~2 images per second
• 15,000 voxels/image
• Non-invasive, safe
• Measures the Blood Oxygen Level Dependent (BOLD) response

[Figure: typical BOLD impulse response, ~10 sec]

[Mitchell et al.]

Page 46:

Brain scans can track activation with precision and sensitivity

[Mitchell et al.]

Page 47:

Gaussian Naïve Bayes: Learned μvoxel,word

P(BrainActivity | WordCategory = {People, Animal})   [Mitchell et al.]

Page 48:

Learned Bayes Models – Means for P(BrainActivity | WordCategory)

[Figure: mean activation images for People words vs. Animal words]

Pairwise classification accuracy: 85%   [Mitchell et al.]

Page 49:


Page 50:

Bayes Classifier is Optimal!
• Learn: h : X → Y
  – X – features
  – Y – target classes

• Suppose you know the true P(Y|X):
  – Bayes classifier:

      h_Bayes(x) = argmax_y P(Y = y | X = x)

• Why?

Page 51:

Optimal classification

• Theorem: – Bayes classifier hBayes is optimal!

– Why?

Page 52:

What you need to know about Naïve Bayes

• Naïve Bayes classifier
  – What's the assumption
  – Why we use it
  – How do we learn it
  – Why Bayesian estimation is important

• Text classification
  – Bag of words model

• Gaussian NB
  – Features are still conditionally independent
  – Each feature has a Gaussian distribution given the class

• Optimal decision using Bayes Classifier

Page 53:


Text Naïve Bayes Algorithm (Train)

Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C:
    Let Di be the subset of documents in D in category ci
    P(ci) = |Di| / |D|
    Let Ti be the concatenation of all the documents in Di
    Let ni be the total number of word occurrences in Ti
    For each word wj ∈ V:
        Let nij be the number of occurrences of wj in Ti
        Let P(wj | ci) = (nij + 1) / (ni + |V|)
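A minimal Python rendering of the training pseudocode above (an added sketch, not the course's reference implementation; tokenization is simplified to lowercased whitespace splitting):

```python
# Minimal sketch of training: priors P(c_i) and Laplace-smoothed word likelihoods P(w_j | c_i).
from collections import Counter, defaultdict

def train_text_nb(documents):
    """documents: list of (text, category) pairs. Returns priors, word likelihoods, vocabulary."""
    vocab = set()
    docs_per_class = Counter()
    word_counts = defaultdict(Counter)       # word_counts[c][w] = n_ij

    for text, c in documents:
        docs_per_class[c] += 1
        tokens = text.lower().split()
        vocab.update(tokens)
        word_counts[c].update(tokens)

    n_docs = len(documents)
    priors = {c: docs_per_class[c] / n_docs for c in docs_per_class}   # P(c_i) = |D_i| / |D|
    likelihoods = {}
    for c in docs_per_class:
        n_i = sum(word_counts[c].values())                             # total word occurrences in T_i
        likelihoods[c] = {w: (word_counts[c][w] + 1) / (n_i + len(vocab)) for w in vocab}
    return priors, likelihoods, vocab
```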

Page 54:


Text Naïve Bayes Algorithm (Test)

Given a test document X
Let n be the number of word occurrences in X
Return the category:

    argmax_{ci ∈ C}  P(ci) ∏_{i=1..n} P(ai | ci)

where ai is the word occurring in the ith position in X
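A matching sketch of the test rule, computed in log space as recommended earlier (again an added illustration; `train_text_nb` is the sketch above):

```python
# Minimal sketch of the test procedure: argmax_c log P(c) + sum_i log P(a_i | c).
import math

def classify_text_nb(text, priors, likelihoods, vocab):
    tokens = [w for w in text.lower().split() if w in vocab]   # ignore out-of-vocabulary words
    scores = {}
    for c in priors:
        scores[c] = math.log(priors[c]) + sum(math.log(likelihoods[c][w]) for w in tokens)
    return max(scores, key=scores.get)

# usage with the trainer sketched above:
# priors, likelihoods, vocab = train_text_nb([("buy cheap pills now", "spam"),
#                                             ("meeting agenda for monday", "ham")])
# print(classify_text_nb("cheap meeting pills", priors, likelihoods, vocab))
```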

Page 55:


Naïve Bayes Time Complexity

• Training Time: O(|D| Ld + |C||V|), where Ld is the average length of a document in D.
  – Assumes V and all Di, ni, and nij are pre-computed in O(|D| Ld) time during one pass through all of the data.
  – Generally just O(|D| Ld), since usually |C||V| < |D| Ld

• Test Time: O(|C| Lt), where Lt is the average length of a test document.

• Very efficient overall; linearly proportional to the time needed to just read in all the data.

Page 56:

Multi-Class Categorization

• Pick the category with max probability
• Create many 1-vs-other classifiers
• Use a hierarchical approach (wherever a hierarchy is available):

    Entity
      Person
        Scientist
        Artist
      Location
        City
        County
        Country
