29
Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob Eisenstein

Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

  • Upload
    others

  • View
    13

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

TextClassification&LinearModelsCMSC723/LING723/INST725

MarineCarpuat

Slidescredit:DanJurafsky &JamesMartin,JacobEisenstein

Page 2: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

Logistics/Reminders

• Homework1– dueThursdaySep7by12pm.

• Project1comingup

• Thursdaylecturetime:projectset-upofficehourinCSIC1121

Page 3: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

Recap:WordMeaning

2coreissuesfromanNLPperspective

• Semanticsimilarity:giventwowords,howsimilararetheyinmeaning?• Keyconcepts:vectorsemantics,PPMIanditsvariants,cosinesimilarity

• Wordsensedisambiguation:givenawordthathasmorethanonemeaning,whichoneisusedinaspecificcontext?• Keyconcepts:wordsense,WordNetandsenseinventories,unsuperviseddisambiguation(Lesk),superviseddisambiguation

Page 4: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

Today

• Textclassificationproblems• andtheirevaluation

• Linearclassifiers• Features&Weights• Bagofwords• NaïveBayes

Page 5: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

Textclassification

Page 6: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

Isthisspam?From: "Fabian Starr“ <[email protected]>Subject: Hey! Sofware for the funny prices!

Get the great discounts on popular software today for PC and Macintoshhttp://iiled.org/Cj4Lmx70-90% Discounts from retail price!!!All sofware is instantly available to download - No Need Wait!

Page 7: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

Whatisthesubjectofthisarticle?

• Antogonists andInhibitors• BloodSupply• Chemistry• DrugTherapy• Embryology• Epidemiology• …

MeSH SubjectCategoryHierarchy

?

MEDLINE Article

Page 8: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

TextClassification

• Assigningsubjectcategories,topics,orgenres• Spamdetection• Authorshipidentification• Age/genderidentification• LanguageIdentification• Sentimentanalysis• …

Page 9: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

TextClassification:definition

• Input:• adocumentd• afixedsetofclassesY= {y1,y2,…,yJ}

• Output:apredictedclassy Î Y

Page 10: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

ClassificationMethods:Hand-codedrules

• Rulesbasedoncombinationsofwordsorotherfeatures• spam:black-list-addressOR(“dollars”AND“havebeenselected”)

• Accuracycanbehigh• Ifrulescarefullyrefinedbyexpert

• Butbuildingandmaintainingtheserulesisexpensive

Page 11: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

ClassificationMethods:SupervisedMachineLearning

• Input• adocumentd• afixedsetofclassesY= {y1,y2,…,yJ}• a trainingsetofm hand-labeleddocuments(d1,y1),....,(dm,ym)

• Output• alearnedclassifierdà y

Page 12: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

Aside:gettingexamplesforsupervisedlearning

• Humanannotation• Byexpertsornon-experts(crowdsourcing)• Founddata

• Howdoweknowhowgoodaclassifieris?• Compareclassifierpredictionswithhumanannotation• Onheldout testexamples• Evaluationmetrics:accuracy,precision,recall

Page 13: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

The2-by-2contingencytable

correct notcorrectselected tp fp

notselected fn tn

Page 14: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

Precisionandrecall

• Precision:%ofselecteditemsthatarecorrectRecall:%ofcorrectitemsthatareselected

correct notcorrectselected tp fp

notselected fn tn

Page 15: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

Acombinedmeasure:F

• AcombinedmeasurethatassessestheP/RtradeoffisFmeasure(weightedharmonicmean):

• PeopleusuallyusebalancedF1measure• i.e.,withb =1(thatis,a =½):

F =2PR/(P+R)

RPPR

RP

F+

+=

−+= 2

2 )1(1)1(1

1ββ

αα

Page 16: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

LinearClassifiers

Page 17: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

Bagofwords

Page 18: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

Definingfeatures

Page 19: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

Definingfeatures

Page 20: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

Linearclassification

Page 21: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

LinearModelsforClassification

Featurefunction

representation

Weights

Page 22: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

Howcanwelearnweights?

• Byhand

• Probability• e.g.,Naïve Bayes

• Discriminativetraining• e.g.,perceptron,supportvectormachines

Page 23: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

GenerativeStoryforMultinomialNaïveBayes

• Ahypotheticalstochasticprocessdescribinghowtrainingexamplesaregenerated

Page 24: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

PredictionwithNaïveBayesScore(x,y)

Page 25: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

PredictionwithNaïveBayesScore(x,y)

Page 26: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

ParameterEstimation

• “countandnormalize”• Parametersofamultinomialdistribution

• Relativefrequencyestimator• Formally:thisisthemaximumlikelihoodestimate

• SeeCIMLforderivation

Page 27: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

Smoothing(addalpha/Laplace)

Page 28: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

NaïveBayesrecap

Page 29: Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Dan Jurafsky& James Martin, Jacob Eisenstein

Today

• Textclassificationproblems• andtheirevaluation

• Linearclassifiers• Features&Weights• Bagofwords• NaïveBayes