
UVA CS 4501: Machine Learning
Lecture 17: Naïve Bayes Classifier for Text Classification

Dr. Yanjun Qi
University of Virginia
Department of Computer Science

Where are we? → Three major sections for classification

•  We can divide the large variety of classification approaches into roughly three major types:
   1. Discriminative: directly estimate a decision rule/boundary, e.g., SVM, neural networks, decision trees
   2. Generative: build a generative statistical model, e.g., naïve Bayes classifier, Bayesian networks
   3. Instance-based classifiers: use observations directly (no model), e.g., k-nearest neighbors


Today: Naïve Bayes Classifier for Text
•  Dictionary-based vector space representation of a text article
•  Multivariate Bernoulli vs. Multinomial
•  Multivariate Bernoulli naïve Bayes classifier
   –  Testing
   –  Training with maximum likelihood estimation for estimating parameters
•  Multinomial naïve Bayes classifier
   –  Testing
   –  Training with maximum likelihood estimation for estimating parameters
•  Multinomial naïve Bayes classifier as conditional stochastic language models (extra)


Text document classification, e.g., spam email filtering

•  Input: document D
•  Output: the predicted class C, where c is from {c1, ..., cL}

•  E.g., spam filtering task: classify email as 'Spam' or 'Other'.

P(C = spam | D)


From:""<takworlld@hotmail.com>Subject:realestateistheonlyway...gemoalvgkayAnyonecanbuyrealestatewithnomoneydownStoppayingrentTODAY!ChangeyourlifeNOWbytakingasimplecourse!ClickBelowtoorder:http://www.wholesaledaily.com/sales/nmd.htm

Naive Bayes is Not So Naive

• Naive Bayes won 1st and 2nd place in KDD-CUP 97 competition out of 16 systems

Goal: Financial services industry direct mail response prediction: Predict if the recipient of mail will actually respond to the advertisement – 750,000 records.

• A good dependable baseline for text classification (but not the best)!

•  For most text categorization tasks, there are many relevant features and many irrelevant ones.

Text classification tasks
•  Input: document D
•  Output: the predicted class C, where c is from {c1, ..., cL}

Text classification examples:
•  Classify email as 'Spam' or 'Other'.
•  Classify web pages as 'Student', 'Faculty', or 'Other'.
•  Classify news stories into topics: 'Sports', 'Politics', ...
•  Classify business names by industry.
•  Classify movie reviews as 'Favorable', 'Unfavorable', or 'Neutral'.
•  ... and many more.


Text Categorization / Classification (Sec. 13.1)

•  Given:
   –  A representation of a text document d
      (Issue: how to represent text documents – usually some type of high-dimensional space, e.g., bag of words)
   –  A fixed set of output classes: C = {c1, c2, ..., cJ}

The bag of words representation

I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.

f( <movie review text> ) = c

The bag of words representation

f( great: 2, love: 2, recommend: 1, laugh: 1, happy: 1, ... ) = c

Representing text: a list of words → dictionary

Common refinements: remove stop words, stemming, collapsing multiple occurrences of words into one, ...


'Bag of words' representation of text

(Same movie review text as above.)

word         frequency
great        2
love         2
recommend    1
laugh        1
happy        1
...          ...

Bag of words representation: represent text as a vector of word frequencies.

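To make the two dictionary-based representations concrete, here is a minimal sketch, assuming a toy dictionary and a simplified review string (both illustrative assumptions, with stop-word removal and stemming skipped): it builds the word-frequency vector and the word-presence (Boolean) vector for one document.

# Minimal sketch: dictionary-based bag-of-words features (frequency and Boolean).
# The dictionary and the review string below are toy assumptions for illustration.
from collections import Counter
import re

dictionary = ["great", "love", "recommend", "laugh", "happy", "hate"]

def tokenize(text):
    # Lowercase and keep alphabetic tokens only (a simple stand-in for real preprocessing).
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text, vocab):
    counts = Counter(tokenize(text))
    freq_vector = [counts[w] for w in vocab]       # multinomial-style counts
    bool_vector = [counts[w] > 0 for w in vocab]   # Bernoulli-style presence/absence
    return freq_vector, bool_vector

review = "great great love love recommend laugh happy"
print(bag_of_words(review, dictionary))
# -> ([2, 2, 1, 1, 1, 0], [True, True, True, True, True, False])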


Another 'bag of words' representation of text → each dictionary word as a Boolean

Bag of words representation: represent text as a vector of Booleans indicating whether each word exists in the document or not.


(Same movie review text as above.)

word         Boolean
great        Yes
love         Yes
recommend    Yes
laugh        Yes
happy        Yes
hate         No
...          ...

Bag of words

•  What simplifying assumption are we making?
   We assume word order is not important.


Bag of words representation of a collection of documents: a document-term matrix X,
e.g., X(i, j) = frequency of word j in document i.


Unknown Words
•  How to handle words in the test corpus that did not occur in the training data, i.e., out-of-vocabulary (OOV) words?

•  Train a model that includes an explicit symbol for an unknown word (<UNK>).
   –  Choose a vocabulary in advance and replace other (i.e., not-in-vocabulary) words in the corpus with <UNK>.
   –  Very often, <UNK> is also used to replace rare words.
   –  Or: replace the first occurrence of each word in the training data with <UNK>.
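A small sketch of the fixed-vocabulary-plus-<UNK> strategy above, assuming a toy token list and a frequency cutoff (both illustrative assumptions):

# Minimal sketch: build a fixed vocabulary and map OOV / rare words to <UNK>.
from collections import Counter

def build_vocab(train_tokens, min_count=2):
    # Keep words seen at least `min_count` times; everything else will become <UNK>.
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c >= min_count} | {"<UNK>"}

def replace_oov(tokens, vocab):
    return [t if t in vocab else "<UNK>" for t in tokens]

train = ["the", "boy", "likes", "the", "dog", "the", "garden"]
vocab = build_vocab(train, min_count=2)                       # {"the", "<UNK>"} with this toy data
print(replace_oov(["the", "cat", "likes", "milk"], vocab))    # ['the', '<UNK>', '<UNK>', '<UNK>']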

Today: Naïve Bayes Classifier for Text (recap) — next: which probability model for the 'bag of words'? Multivariate Bernoulli vs. Multinomial.

'Bag of words' → what probability model?

Pr(D = d | C = ci) = ?

(For the same movie-review example: what distribution should model its word vector?)


Naïve probabilistic models of text documents: two candidate models

Pr(D | C = c) = ?

Multinomial distribution:             Pr(W1 = n1, W2 = n2, ..., Wk = nk | C = c)
Multivariate Bernoulli distribution:  Pr(W1 = true, W2 = false, ..., Wk = true | C = c)

Text Classification with Naïve Bayes Classifier

•  Multinomial vs Multivariate Bernoulli?

•  Multinomial model is almost always more effective in text applications!

(Adapted from Manning's text categorization tutorial.)

Experiment: Multinomial vs. multivariate Bernoulli

•  McCallum & Nigam (1998) did some experiments to see which is better
•  Task: determine if a university web page is {student, faculty, other_stuff}
•  Train on ~5,000 hand-labeled web pages (Cornell, Washington, U. Texas, Wisconsin)
•  Crawl and classify a new site (CMU)

Multinomial vs. multivariate Bernoulli (results figure omitted)

Today: Naïve Bayes Classifier for Text (recap) — next: the multivariate Bernoulli naïve Bayes classifier (testing, and training with maximum likelihood estimation).

Model 1: Multivariate Bernoulli

•  One Boolean feature Xw for each word w in the dictionary
•  Xw = true in document d if w appears in d
•  Naive Bayes assumption: given the document's topic class label, the appearance of one word in the document tells us nothing about the chances that another word appears.

Pr(W1 = true, W2 = false, ..., Wk = true | C = c)



Model 1: Multivariate Bernoulli Naïve Bayes Classifier

•  Conditional independence assumption: features (word presence/absence) are independent of each other given the class variable.
•  The multivariate Bernoulli model is appropriate for binary feature variables.

(Same word/Boolean table as above.)

Model 1: Multivariate Bernoulli

(Graphical model: class node C = Flu with child word nodes W1 ... W6, e.g., runny nose, sinus, fever, muscle-ache, running, ...)

Pr(W1 = true, W2 = false, ..., Wk = true | C = c)
   = P(W1 = true | C) · P(W2 = false | C) · ... · P(Wk = true | C)

This factorization is what makes the model naïve.

Review: Bernoulli distribution, e.g., coin flips

•  You flip a coin
   –  Heads with probability p
   –  Binary random variable
   –  Bernoulli trial with success probability p
•  You flip k coins
   –  How many heads would you expect?
   –  Number of heads X: a discrete random variable
   –  Binomial distribution with parameters k and p

The parameters to estimate: Pr(Wi = true | C = c), one Bernoulli parameter per word i and class c.

Review: Deriving the Maximum Likelihood Estimate for a Bernoulli

Minimize the negative log-likelihood → MLE parameter estimation:

$$-\ell(p) = -\log L(p) = -\log\left[p^{x}(1-p)^{n-x}\right]
 = -\log(p^{x}) - \log\left((1-p)^{n-x}\right)
 = -x\log(p) - (n-x)\log(1-p)$$

$$\hat{p} = \frac{x}{n}$$

The proportion of positives, i.e., the relative frequency of a binary event.
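Filling in the intermediate step: setting the derivative of the negative log-likelihood to zero gives the relative-frequency estimate.

$$\frac{d}{dp}\Big[-x\log p-(n-x)\log(1-p)\Big] = -\frac{x}{p}+\frac{n-x}{1-p}=0
\;\Longrightarrow\; x(1-p)=(n-x)p \;\Longrightarrow\; \hat{p}=\frac{x}{n}$$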

Parameter estimation

•  Multivariate Bernoulli model:

   P̂(wi = true | cj) = fraction of documents of class cj in which word wi appears

•  Smoothing to avoid overfitting.
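A minimal sketch of this estimation step, assuming documents are given as token lists with class labels (toy data and names are illustrative assumptions); add-one smoothing is shown as one common choice.

# Minimal sketch: estimating the multivariate Bernoulli parameters with add-one smoothing.
# P_hat(w = true | c) = (# docs of class c containing w + 1) / (# docs of class c + 2)
from collections import defaultdict

def train_bernoulli_nb(docs, labels, vocab):
    doc_count = defaultdict(int)                              # N_c: number of documents per class
    word_doc_count = defaultdict(lambda: defaultdict(int))    # docs of class c containing word w
    for tokens, c in zip(docs, labels):
        doc_count[c] += 1
        for w in set(tokens) & set(vocab):                    # presence only, not counts
            word_doc_count[c][w] += 1
    prior = {c: doc_count[c] / len(docs) for c in doc_count}
    cond = {c: {w: (word_doc_count[c][w] + 1) / (doc_count[c] + 2) for w in vocab}
            for c in doc_count}
    return prior, cond

docs = [["great", "love", "recommend"], ["hate", "boring"], ["love", "great"]]
labels = ["pos", "neg", "pos"]
vocab = ["great", "love", "recommend", "laugh", "happy", "hate", "boring"]
prior, cond = train_bernoulli_nb(docs, labels, vocab)
print(prior["pos"], cond["pos"]["great"])   # 2/3 and (2 + 1) / (2 + 2) = 0.75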

Testing stage: look-up operations (classification looks up the already-estimated probabilities for each dictionary word, present or absent, and combines them with the class prior).

Underflow prevention: log space

•  Multiplying lots of probabilities, which are between 0 and 1, can result in floating-point underflow.
•  Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
•  The class with the highest final un-normalized log probability score is still the most probable.
•  Note that the model is now just a max of a sum of weights:

$$c_{NB} = \arg\max_{c_j \in C} \left\{ \log P(c_j) + \sum_{i \in \text{dictionary}} \log P(x_i \mid c_j) \right\}$$
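A minimal sketch of the look-up/testing stage in log space for the multivariate Bernoulli model, assuming toy priors and word probabilities (all numbers below are illustrative assumptions). Note that absent dictionary words contribute log(1 − P(w | c)).

# Minimal sketch: Bernoulli NB classification by summing log-probabilities (avoids underflow).
import math

priors = {"pos": 0.67, "neg": 0.33}                     # assumed toy values
vocab = ["great", "love", "hate"]
cond = {"pos": {"great": 0.75, "love": 0.75, "hate": 0.25},
        "neg": {"great": 0.25, "love": 0.25, "hate": 0.75}}

def predict_bernoulli_nb(tokens, priors, cond, vocab):
    present = set(tokens)
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in vocab:
            p = cond[c][w]
            # Present words contribute log p; absent dictionary words contribute log(1 - p).
            score += math.log(p) if w in present else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get), scores

label, scores = predict_bernoulli_nb(["great", "happy"], priors, cond, vocab)
print(label)   # 'pos'  (words outside the dictionary, like "happy" here, are ignored)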

Today: Naïve Bayes Classifier for Text (recap) — next: the multinomial naïve Bayes classifier (testing, and training with maximum likelihood estimation).

Model 2: Multinomial Naïve Bayes – 'bag of words' representation of text

The word-frequency vector can be represented with a multinomial distribution:
•  Words are like colored balls; there are K possible types of them (i.e., from a dictionary of K words).
•  A document contains N words; each word occurs ni times (like a bag of N colored balls).

word         frequency
great        2
love         2
recommend    1
laugh        1
happy        1
...          ...

Pr(W1 = n1, ..., Wk = nk | C = c)



Multinomial distribution

•  The multinomial distribution is a generalization of the binomial distribution.
•  The binomial distribution counts successes of an event (for example, heads in coin tosses).
   –  Parameters: N (number of trials), p (the probability of success of the event).
•  The multinomial counts the number of occurrences of each of a set of events (for example, how many times each side of a die comes up in a set of rolls).
   –  Parameters: N (number of trials), θ1 ... θk (the probability of success for each category).

A binomial distribution is the multinomial distribution with k = 2 and θ1 = p, θ2 = 1 − θ1.


Multinomial Distribution for Text Classification

•  W1, W2, ..., Wk are variables (the word counts).

$$P(W_1 = n_1, \ldots, W_k = n_k \mid c, N, \theta_{1,c}, \ldots, \theta_{k,c})
 = \frac{N!}{n_1!\, n_2! \cdots n_k!}\; \theta_{1,c}^{\,n_1}\, \theta_{2,c}^{\,n_2} \cdots \theta_{k,c}^{\,n_k}$$

with $\sum_{i=1}^{k} n_i = N$ and $\sum_{i=1}^{k} \theta_{i,c} = 1$.

•  N! / (n1! n2! ... nk!) is the number of possible orderings of the N balls (order-invariant selections); note the word events are assumed independent.
•  A binomial distribution is the multinomial distribution with k = 2 and θ1 = p, θ2 = 1 − θ1.
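To make the formula concrete, here is a small sketch that evaluates the multinomial pmf for a document with word counts (2, 2, 1); the class-conditional probabilities θ are assumed toy numbers.

# Minimal sketch: evaluate N!/(n1!...nk!) * prod_i theta_i^{n_i}.
from math import factorial, prod

def multinomial_pmf(counts, thetas):
    n_total = sum(counts)
    coeff = factorial(n_total)
    for n in counts:
        coeff //= factorial(n)          # exact integer division: multinomial coefficient
    return coeff * prod(t ** n for t, n in zip(thetas, counts))

counts = [2, 2, 1]                      # e.g., great: 2, love: 2, recommend: 1
thetas = [0.3, 0.3, 0.4]                # assumed class-conditional word probabilities (sum to 1)
print(multinomial_pmf(counts, thetas))  # 30 * 0.3^2 * 0.3^2 * 0.4 = 0.0972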

Model 2: Multinomial Naïve Bayes – 'bag of words' – TESTING stage

$$\arg\max_{c} P(W_1 = n_1, \ldots, W_k = n_k, c)
 = \arg\max_{c} \left\{ p(c)\,\theta_{1,c}^{\,n_1}\, \theta_{2,c}^{\,n_2} \cdots \theta_{k,c}^{\,n_k} \right\}$$

word         frequency
great        2
love         2
recommend    1
laugh        1
happy        1
...          ...
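A minimal sketch of this testing-stage argmax, computed in log space; the class priors and θ values below are assumed toy numbers, not estimates from real data.

# Minimal sketch: pick argmax_c p(c) * prod_i theta_{i,c}^{n_i}, in log space.
import math

def score(counts, prior_c, theta_c):
    # counts and theta_c are aligned lists over the dictionary
    return math.log(prior_c) + sum(n * math.log(t) for n, t in zip(counts, theta_c))

counts = [2, 2, 1]                           # great: 2, love: 2, recommend: 1
classes = {
    "pos": (0.5, [0.3, 0.3, 0.4]),           # (prior, class-conditional thetas) -- toy values
    "neg": (0.5, [0.1, 0.1, 0.8]),
}
scores = {c: score(counts, p, theta) for c, (p, theta) in classes.items()}
print(max(scores, key=scores.get))           # 'pos'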

Today: Naïve Bayes Classifier for Text (recap) — next: training the multinomial naïve Bayes classifier with maximum likelihood estimation.


Deriving the Maximum Likelihood Estimate for the multinomial distribution

LIKELIHOOD (a function of θ):

$$\arg\max_{\theta_1,\ldots,\theta_k} P(d_1, \ldots, d_T \mid \theta_1, \ldots, \theta_k)
 = \arg\max_{\theta_1,\ldots,\theta_k} \prod_{t=1}^{T} P(d_t \mid \theta_1, \ldots, \theta_k)$$
$$= \arg\max_{\theta_1,\ldots,\theta_k} \prod_{t=1}^{T} \frac{N_{d_t}!}{n_{1,d_t}!\, n_{2,d_t}! \cdots n_{k,d_t}!}\; \theta_1^{\,n_{1,d_t}} \theta_2^{\,n_{2,d_t}} \cdots \theta_k^{\,n_{k,d_t}}$$
$$= \arg\max_{\theta_1,\ldots,\theta_k} \prod_{t=1}^{T} \theta_1^{\,n_{1,d_t}} \theta_2^{\,n_{2,d_t}} \cdots \theta_k^{\,n_{k,d_t}}
 \quad \text{s.t.} \;\; \sum_{i=1}^{k} \theta_i = 1$$


Deriving the Maximum Likelihood Estimate for the multinomial distribution (continued)

$$\arg\max_{\theta_1,\ldots,\theta_k} \log L(\theta)
 = \arg\max_{\theta_1,\ldots,\theta_k} \log\left( \prod_{t=1}^{T} \theta_1^{\,n_{1,d_t}} \theta_2^{\,n_{2,d_t}} \cdots \theta_k^{\,n_{k,d_t}} \right)$$
$$= \arg\max_{\theta_1,\ldots,\theta_k} \sum_{t=1}^{T} n_{1,d_t} \log\theta_1 + \sum_{t=1}^{T} n_{2,d_t} \log\theta_2 + \cdots + \sum_{t=1}^{T} n_{k,d_t} \log\theta_k
 \quad \text{s.t.} \;\; \sum_{i=1}^{k} \theta_i = 1$$

Solving this constrained optimization gives the MLE estimator:

$$\theta_i = \frac{\sum_{t=1}^{T} n_{i,d_t}}{\sum_{t=1}^{T} n_{1,d_t} + \sum_{t=1}^{T} n_{2,d_t} + \cdots + \sum_{t=1}^{T} n_{k,d_t}}
 = \frac{\sum_{t=1}^{T} n_{i,d_t}}{\sum_{t=1}^{T} N_{d_t}}$$

→ i.e., we can create a mega-document by concatenating all documents d_1 to d_T, and use the relative frequency of word w_i in the mega-document.
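A minimal sketch of the mega-document estimator, assuming a toy corpus of token lists for a single class (data and names are illustrative assumptions):

# Minimal sketch: MLE theta_i = relative frequency of word i in the concatenated "mega-document".
from collections import Counter

def mle_theta(docs, vocab):
    mega = Counter()
    for tokens in docs:                  # concatenate all documents (of one class)
        mega.update(t for t in tokens if t in vocab)
    total = sum(mega.values())           # sum over t of N_{d_t}
    return {w: mega[w] / total for w in vocab}

docs_of_class = [["great", "great", "love"], ["love", "recommend"]]
vocab = ["great", "love", "recommend"]
print(mle_theta(docs_of_class, vocab))   # {'great': 0.4, 'love': 0.4, 'recommend': 0.2}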




Parameter estimation

•  Multinomial model:

   P̂(Xi = w | cj) = fraction of times word w appears across all documents of class cj

•  Can create a mega-document for class j by concatenating all documents of this class
•  Use the frequency of w in the mega-document

Multinomial: learning algorithm for parameter estimation with MLE

•  From the training corpus, extract the Vocabulary
•  Calculate the required P(cj) and P(wk | cj) terms. For each cj in C do:
   –  docs_j ← subset of documents for which the target class is cj
   –  P(cj) ← |docs_j| / |total # documents|
   –  Text_j ← a single document containing all of docs_j (its length is n_j)
   –  For each word wk in Vocabulary:
      •  n_{k,j} ← number of occurrences of wk in Text_j
      •  P(wk | cj) ← (n_{k,j} + α) / (n_j + α·|Vocabulary|),   e.g., α = 1

(P(wk | cj) is the smoothed relative frequency with which word wk appears across all documents of class cj.)

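A minimal sketch of the learning algorithm above with α = 1 Laplace smoothing; the toy corpus and function names are illustrative assumptions.

# Minimal sketch: multinomial NB training with add-alpha (Laplace) smoothing.
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    vocab = sorted({w for d in docs for w in d})            # extract Vocabulary from the corpus
    class_docs = defaultdict(list)
    for d, c in zip(docs, labels):
        class_docs[c].append(d)
    priors, cond = {}, {}
    for c, c_docs in class_docs.items():
        priors[c] = len(c_docs) / len(docs)                  # |docs_j| / |total # documents|
        text_j = Counter(w for d in c_docs for w in d)       # mega-document Text_j
        n_j = sum(text_j.values())                           # length of Text_j
        cond[c] = {w: (text_j[w] + alpha) / (n_j + alpha * len(vocab)) for w in vocab}
    return priors, cond, vocab

docs = [["great", "love", "love"], ["hate", "boring"], ["great", "recommend"]]
labels = ["pos", "neg", "pos"]
priors, cond, vocab = train_multinomial_nb(docs, labels)
print(priors["pos"], cond["pos"]["love"])   # 2/3 and (2 + 1) / (5 + 5) = 0.3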

Naive Bayes is Not So Naive

•  Naïve Bayes took first and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms.
   Goal: financial services industry direct mail response prediction model: predict whether the recipient of mail will actually respond to the advertisement – 750,000 records.
•  Robust to irrelevant features: irrelevant features cancel each other out without affecting results; decision trees, in contrast, can suffer heavily from this.
•  Very good in domains with many equally important features: decision trees suffer from fragmentation in such cases, especially with little data.
•  A good dependable baseline for text classification (but not the best)!
•  Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes-optimal classifier for the problem.
•  Very fast: learning requires one pass of counting over the data; testing is linear in the number of attributes and the document-collection size.
•  Low storage requirements.

References

•  Prof. Andrew Moore's review tutorial
•  Prof. Ke Chen's NB slides
•  Prof. Carlos Guestrin's recitation slides
•  Prof. Raymond J. Mooney and Jimmy Lin's slides about language models
•  Prof. Manning's textCat tutorial

Today: Naïve Bayes Classifier for Text (recap) — extra material: the multinomial naïve Bayes classifier as a conditional stochastic language model.

The Big Picture

Model (i.e., the data-generating process)  ⇄  Observed data
   –  Probability: model → data
   –  Estimation / learning / inference / data mining: data → model

But how to specify a model? Build a generative model that approximates how the data is produced.

Model 2: Multinomial Naïve Bayes – 'bag of words' representation of text

word         frequency
grain(s)     3
oilseed(s)   2
total        3
wheat        1
maize        1
soybean      1
tonnes       1
...          ...

The word-frequency vector can be represented with a multinomial distribution: words are like colored balls, with K possible types (a dictionary of K words); a document contains N words, each word occurring ni times (a bag of N colored balls).

Pr(W1 = n1, ..., Wk = nk | C = c)

WHY is this naïve???

$$P(W_1 = n_1, \ldots, W_k = n_k \mid c, N, \theta_1, \ldots, \theta_k)
 = \frac{N!}{n_1!\, n_2! \cdots n_k!}\; \theta_1^{\,n_1} \theta_2^{\,n_2} \cdots \theta_k^{\,n_k}$$

(The multinomial coefficient can normally be left out in practical calculations.)

Main question: WHY is the multinomial model of text a naïve probability model?

Multinomial Naïve Bayes as → a generative model that approximates how a text string is produced

•  Stochastic language models: model the probability of generating strings (each word in turn, following the sequential ordering in the string) in the language (commonly, all strings over the dictionary ∑).
   –  E.g., a unigram model:

      Model C_1:
      the     0.2
      a       0.1
      boy     0.01
      dog     0.01
      said    0.03
      likes   0.02

      the     boy     likes   the     dog
      0.2     0.01    0.02    0.2     0.01

      Multiply all five terms: P(s | C_1) = 0.00000008


Multinomial Naïve Bayes as Conditional Stochastic Language Models

•  Model the conditional probability of generating any string from two possible models:

   word      Model C1   Model C2
   the       0.2        0.2
   boy       0.01       0.0001
   said      0.0001     0.03
   likes     0.0001     0.02
   black     0.0001     0.1
   dog       0.0005     0.01
   garden    0.01       0.0001

   query:    dog      boy      likes    black    the
   P(·|C1):  0.0005   0.01     0.0001   0.0001   0.2
   P(·|C2):  0.01     0.0001   0.02     0.1      0.2

P(d | C2) P(C2) > P(d | C1) P(C1)   (with P(C2) = P(C1))
→ d is more likely to be from class C2.
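A minimal sketch of this comparison, using the word probabilities listed in the tables above; the query string itself and the equal priors are assumptions for the example.

# Minimal sketch: score a string under two class-conditional unigram language models (log space).
import math

model_c1 = {"the": 0.2, "boy": 0.01, "said": 0.0001, "likes": 0.0001,
            "black": 0.0001, "dog": 0.0005, "garden": 0.01}
model_c2 = {"the": 0.2, "boy": 0.0001, "said": 0.03, "likes": 0.02,
            "black": 0.1, "dog": 0.01, "garden": 0.0001}

def log_score(words, model, log_prior):
    return log_prior + sum(math.log(model[w]) for w in words)

query = ["the", "boy", "likes", "black", "dog"]   # assumed query string
log_prior = math.log(0.5)                          # slide assumes P(C1) = P(C2)
s1 = log_score(query, model_c1, log_prior)
s2 = log_score(query, model_c2, log_prior)
print("more likely class:", "C2" if s2 > s1 else "C1")   # C2, matching the slide's conclusion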

A Physical Metaphor

•  Colored balls are randomly drawn from a bag (with replacement).

A string of words is modeled like a sequence of drawn balls:
P( w1 w2 w3 w4 ) = P(w1) P(w2) P(w3) P(w4)

Unigram language model → more generally: generating a language string from a probabilistic model

•  Unigram language models
•  Also could be bigram (or, generally, n-gram) language models

Chain rule:
   P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w1 w2) P(w4 | w1 w2 w3)

NAÏVE (unigram): conditionally independent at each position of the string:
   P(w1 w2 w3 w4) = P(w1) P(w2) P(w3) P(w4)

Easy. Effective!


Multinomial Naïve Bayes = a class-conditional unigram language model

•  Think of Xi as the word at the i-th position in the document string.
•  Effectively, the probability of each class is computed with a class-specific unigram language model.

(Graphical model: class node with child position nodes X1 X2 X3 X4 X5 X6.)

Example (class-specific unigram language model):
   the     boy     likes   the     dog
   0.2     0.01    0.02    0.2     0.01

Using Multinomial Naive Bayes Classifiers to Classify Text: basic method

•  Attributes are text positions; values are words.

$$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i} P(x_i \mid c_j)
 = \arg\max_{c_j \in C} P(c_j)\, P(x_1 = \text{"the"} \mid c_j) \cdots P(x_n = \text{"the"} \mid c_j)$$

•  Still too many possibilities
•  Use the same parameters for a word across positions
•  The result is the bag-of-words model (over word tokens)

Multinomial Naïve Bayes: Classifying Step

•  Positions ← all word positions in the current document that contain tokens found in Vocabulary
•  Return cNB, where

$$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in \text{positions}} P(x_i \mid c_j)$$

Easy to implement; no need to construct the bag-of-words vector explicitly!

This equals Pr(W1 = n1, ..., Wk = nk | C = cj), leaving out the multinomial coefficient.

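A minimal sketch of this classifying step: iterate over the word positions in the document (no bag-of-words vector built explicitly). The priors and class-conditional probabilities below are assumed toy values.

# Minimal sketch: multinomial NB classification by looping over word positions, in log space.
import math

def predict_multinomial_nb(tokens, priors, cond, vocab):
    vocab_set = set(vocab)
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in tokens:                       # positions whose token is in the Vocabulary
            if w in vocab_set:
                score += math.log(cond[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

priors = {"pos": 0.67, "neg": 0.33}            # assumed toy values
vocab = ["great", "love", "hate"]
cond = {"pos": {"great": 0.4, "love": 0.4, "hate": 0.2},
        "neg": {"great": 0.2, "love": 0.2, "hate": 0.6}}
print(predict_multinomial_nb(["love", "great", "unknownword"], priors, cond, vocab))   # 'pos'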

Multinomial Bayes: Time Complexity

(T: number of documents; |V|: dictionary size; |C|: number of classes)

•  Training time: O(T·Ld + |C||V|), where Ld is the average length of a document in D.
   –  Assumes V and all Di, ni, and n_{k,j} are pre-computed in O(T·Ld) time during one pass through all of the data.
   –  |C||V| = complexity of computing all probability values (loop over words and classes).
   –  Generally just O(T·Ld), since usually |C||V| < T·Ld.
•  Test time: O(|C| Lt), where Lt is the average length of a test document.
   –  Very efficient overall; linearly proportional to the time needed just to read in all the words.
   –  Plus, robust in practice.
