Intro to Information Extraction: Sentiment Lexicon Induction as an Example
Many slides adapted from Ellen Riloff and Dan Jurafsky
What is Information Extraction?
• Information extraction (IE) is an umbrella term for NLP tasks that involve extracting pieces of information from text and assigning some meaning to the information.
• Many IE applications aim to turn unstructured text into a “structured” representation.
• IE problems typically involve:
– identifying text snippets to extract
– assigning semantic meaning to entities or concepts
– finding relations between entities or concepts
IE Applications
• Biological Processes (Genomics)
• Clinical Medicine
• Question Answering / Web Search
• Query Expansion / Semantic Sets
• Extracting Entity Profiles
• Tracking Events (Violent, Diseases, Business, etc.)
• Tracking Opinions (Political, Product Reputation, Financial Prediction, On-line Reviews, etc.)
General Techniques
• Syntactic Analysis
– Phrase Identification
– Feature Extraction
• Semantic Analysis
• Statistical Measures
• Machine Learning
– Supervised & Weakly Supervised
• Graph Algorithms
Named Entity Recognition (NER)
Mars One announced Monday that it has picked 1,058 aspiring spaceflyers to move on to the next round in its search for the first humans to live and die on the Red Planet.
The Wall Street Journal reports that Google plans to partner with Toyota to develop Android software for their hybrid cars.
NER typically involves extracting and labeling certain types of entities, such as proper names and dates.
Domain-specific NER
Biomedical systems must recognize genes and proteins:
IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.
Clinical medical systems must recognize problems and treatments:
Adrenal-sparing surgery is safe and effective, and may become the treatment of choice in patients with hereditary phaeochromocytoma.
Semantic Class Identification
Mars One announced Monday that it has picked 1,058 aspiring spaceflyers to move on to the next round in its search for the first humans to live and die on the Red Planet.
The Wall Street Journal reports that Google plans to partner with Toyota to develop Android software for their hybrid cars.
Semantic Lexicon Induction
• Although some general semantic dictionaries exist (e.g., WordNet), domain-specific applications often have specialized vocabulary.
• Semantic Lexicon Induction techniques learn lists of words that belong to a semantic class.
Vehicles: car, jeep, helicopter, bike, tricycle, scooter, …
Animals: tiger, zebra, wolverine, platypus, echidna, …
Symptoms: cough, sneeze, pain, pu/pd, elevated bp, …
Products: camera, laptop, iPad, tablet, GPS device, …
Domain-specific Vocabulary
A 14yo m/n doxy owned by a reputable breeder is being treated for IBD with pred.
doxy = ANIMAL, breeder = HUMAN, IBD = DISEASE, pred = DRUG
Domain-specific meanings: lab, mix, m/n = ANIMAL
Semantic Taxonomy Induction
• Ideally, we want semantic concepts to be organized in a taxonomy, to support generalization but to distinguish different subtypes.
Animal
  Mammal
    Feline
      Lion, Panthera Leo
      Tiger, Panthera Tigris, Felis Tigris
      Cougar, Mountain Lion, Puma, Panther, Catamount
    Canine
      Wolf, Canis Lupus
      Coyote, Prairie Wolf, Brush Wolf, American Jackal
      Dog, Puppy, Canis Lupus Familiaris, Mongrel
Challenges in Taxonomy Induction
• But there are often many ways to organize a conceptual space!
• Strict hierarchies are rare in real data: graphs/networks are more realistic than tree structures.
• For example, animals could be subcategorized based on:
– carnivore vs. herbivore
– water-dwelling vs. land-dwelling
– wild vs. pets vs. agricultural
– physical characteristics (e.g., baleen vs. toothed whales)
– habitat (e.g., arctic vs. desert)
Relation Extraction
In Salzburg, little Mozart grew up in a loving middle-class environment.
Birthplace(Mozart, Salzburg)
Steve Ballmer is an American businessman who has been serving as the CEO of Microsoft since January 2000.
Employed-By(Steve Ballmer, Microsoft)
CEO(Steve Ballmer, Microsoft)
Relations for Web Search
Paraphrasing
• Relations can often be expressed with a multitude of different expressions.
• Paraphrasing systems try to explicitly learn phrases that represent the same type of relation.
• Examples:
– X was born in Y
– Y is the birthplace of X
– X’s birthplace is Y
– X’s hometown is Y
– X grew up in Y
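A minimal sketch of how the paraphrases above could be mapped to a single relation, assuming hand-written regular expressions rather than learned patterns. The `NAME` expression (capitalized word sequences), the `PATTERNS` list, and the `extract_birthplace` helper are all hypothetical names introduced for illustration; a real paraphrasing system would learn its patterns from data.

```python
import re

# Assumption: an entity is a run of capitalized words (a crude stand-in for NER).
NAME = r"([A-Z]\w*(?: [A-Z]\w*)*)"

# Each entry pairs a surface pattern with the roles of its two captured groups.
PATTERNS = [
    (re.compile(rf"{NAME} was born in {NAME}"), ("X", "Y")),
    (re.compile(rf"{NAME} is the birthplace of {NAME}"), ("Y", "X")),
    (re.compile(rf"{NAME}['’]s birthplace is {NAME}"), ("X", "Y")),
    (re.compile(rf"{NAME}['’]s hometown is {NAME}"), ("X", "Y")),
    (re.compile(rf"{NAME} grew up in {NAME}"), ("X", "Y")),
]

def extract_birthplace(sentence):
    """Return ("Birthplace", X, Y) if any paraphrase pattern matches, else None."""
    for regex, roles in PATTERNS:
        match = regex.search(sentence)
        if match:
            slots = dict(zip(roles, match.groups()))
            return ("Birthplace", slots["X"], slots["Y"])
    return None

print(extract_birthplace("Mozart was born in Salzburg."))
print(extract_birthplace("Salzburg is the birthplace of Mozart."))
```

Both calls yield the same tuple even though the argument order differs on the surface, which is the point of treating the expressions as paraphrases of one relation.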
Event Extraction
Goal: extract facts about events from unstructured documents
Example: extracting information about terrorism events in news articles:
December 29, Pakistan - The U.S. embassy in Islamabad was damaged this morning by a car bomb. Three diplomats were injured in the explosion. Al Qaeda has claimed responsibility for the attack.
EVENT
Type: bombing
Target: U.S. embassy
Location: Islamabad, Pakistan
Date: December 29
Weapon: car bomb
Victim: three diplomats
Perpetrator: Al Qaeda
Event Extraction
Another example: extracting information about disease outbreak events.
Document Text
New Jersey, February 26. An outbreak of swine flu has been confirmed in Mercer County, NJ. Five teenage boys appear to have contracted the deadly virus from an unknown source. The CDC is investigating the cases and is taking measures to prevent the spread. . .
Event
Disease: swine flu
Location: Mercer County, NJ
Victim: Five teenage boys
Date: February 26
Status: confirmed
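One way to picture the output of event extraction is as a record with one slot per role. The sketch below fills such a record by hand with the values from the disease outbreak example; the `OutbreakEvent` class and its field names are assumptions made for illustration, not a standard schema, and no extraction is actually performed.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class OutbreakEvent:
    """Hypothetical template for a disease outbreak event; unfilled slots stay None."""
    disease: str
    location: Optional[str] = None
    victim: Optional[str] = None
    date: Optional[str] = None
    status: Optional[str] = None

# Slot values copied from the swine flu example.
event = OutbreakEvent(
    disease="swine flu",
    location="Mercer County, NJ",
    victim="Five teenage boys",
    date="February 26",
    status="confirmed",
)

# Keep only the slots the (imagined) extractor managed to fill.
filled = {slot: value for slot, value in asdict(event).items() if value is not None}
print(filled)
```

Making every slot except the event trigger optional reflects the fact that many documents mention only some of the roles.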
Large-Scale IE from the Web
• Some researchers have been developing IE systems for large-scale extraction of facts and relations from the Web.
• These systems exploit the massive amount of text and redundancy available on the Web and use weakly supervised, iterative learning to harvest information for automated knowledge base construction.
• The KnowItAll project at UW and the NELL project at CMU are well-known research efforts pursuing this work.
Opinion Extraction
I just bought a Powershot a few days ago. I took some pictures using the camera. Colors are so beautiful even when flash is used. Also easy to grip since the body has a grip handle. [Kobayashi et al., 2007]
Source: <writer>
Target: Powershot
Aspect: pictures, colors
Evaluation: beautiful, easy to grip
Opinion Extraction from News [Wilson & Wiebe, 2009]
Italian senator Renzo Gubert praised the Chinese Government’s efforts.
Source: Italian senator Renzo Gubert
Target: the Chinese Government
Evaluation: praised (POSITIVE)
African observers generally approved of his victory while Western governments denounced it.
Source: African observers
Target: his victory
Evaluation: approved (POSITIVE)
Source: Western governments
Target: it (his victory)
Evaluation: denounced (NEGATIVE)
Summary
• Information extraction systems frequently rely on low-level NLP tools for basic language analysis, often in a pipeline architecture.
• There are a wide variety of applications for IE, including both broad-coverage and domain-specific applications.
• Some IE tasks are relatively well-understood (e.g., named entity recognition), while others are still quite challenging!
• We’ve only scratched the surface of possible IE tasks … nearly endless possibilities.
Turney Algorithm to learn a Sentiment Lexicon
1. Extract a phrasal lexicon from reviews
2. Learn polarity of each phrase
3. Rate a review by the average polarity of its phrases
Turney (2002): Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews
Extract two-word phrases with adjectives
First Word          Second Word             Third Word (not extracted)
JJ                  NN or NNS               anything
RB, RBR, or RBS     JJ                      not NN nor NNS
JJ                  JJ                      not NN nor NNS
NN or NNS           JJ                      not NN nor NNS
RB, RBR, or RBS     VB, VBD, VBN, or VBG    anything
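The tag patterns for phrase extraction can be sketched as a small filter over a POS-tagged sentence. This is a rough illustration that assumes the input is already tagged with Penn Treebank tags, given as a list of `(word, tag)` pairs; the function names `matches_pattern` and `extract_phrases` are hypothetical, and the five rules follow the patterns in Turney (2002).

```python
NOUN = {"NN", "NNS"}
ADVERB = {"RB", "RBR", "RBS"}
VERB = {"VB", "VBD", "VBN", "VBG"}

def matches_pattern(first, second, third):
    """Check one (first, second, third) tag triple; third is None at sentence end."""
    if first == "JJ" and second in NOUN:
        return True                      # JJ + noun, any third word
    if first in ADVERB and second == "JJ" and third not in NOUN:
        return True                      # adverb + JJ, not followed by a noun
    if first == "JJ" and second == "JJ" and third not in NOUN:
        return True                      # JJ + JJ, not followed by a noun
    if first in NOUN and second == "JJ" and third not in NOUN:
        return True                      # noun + JJ, not followed by a noun
    if first in ADVERB and second in VERB:
        return True                      # adverb + verb, any third word
    return False

def extract_phrases(tagged):
    """Return the two-word phrases whose tags match one of the patterns."""
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        if matches_pattern(t1, t2, t3):
            phrases.append(f"{w1} {w2}")
    return phrases

sentence = [("the", "DT"), ("online", "JJ"), ("service", "NN"),
            ("is", "VBZ"), ("very", "RB"), ("handy", "JJ"), (".", ".")]
print(extract_phrases(sentence))
```

The third-word check only inspects the tag; the third word itself is never included in the extracted phrase.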
How to measure polarity of a phrase?
• Positive phrases co-occur more with “excellent”
• Negative phrases co-occur more with “poor”
• But how to measure co-occurrence?
Pointwise Mutual Information
• Pointwise mutual information:
– How much more do events x and y co-occur than if they were independent?
$$\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$
• PMI between two words:
– How much more do two words co-occur than if they were independent?
$$\mathrm{PMI}(\mathit{word}_1, \mathit{word}_2) = \log_2 \frac{P(\mathit{word}_1, \mathit{word}_2)}{P(\mathit{word}_1)\,P(\mathit{word}_2)}$$
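As a quick numerical check of the PMI definition, the sketch below computes it directly from probabilities. The function name `pmi` and the probability values are made up for illustration; they are chosen as powers of two so the results are exact.

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information in bits: log2 of P(x,y) / (P(x) P(y))."""
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical probabilities with P(x) = 0.25 and P(y) = 0.5.
independent = pmi(0.125, 0.25, 0.5)   # joint equals the product: PMI is 0
attracted = pmi(0.25, 0.25, 0.5)      # joint is twice the product: PMI is 1 bit
repelled = pmi(0.0625, 0.25, 0.5)     # joint is half the product: PMI is -1 bit
print(independent, attracted, repelled)
```

A PMI of zero means the two events co-occur exactly as often as independence predicts; positive values mean they co-occur more often, negative values less often.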
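How to Estimate Pointwise Mutual Information
• Query a search engine (AltaVista):
– P(word) is estimated by hits(word)/N
– P(word1, word2) is estimated by hits(word1 NEAR word2)/N
– (Strictly, the denominator for the NEAR estimate should be kN rather than N: a corpus of N words contains about N pairs of adjacent words, but about kN pairs of words that are within k words of each other. For simplicity we just use N here and in the polarity formula.)
$$\mathrm{PMI}(\mathit{word}_1, \mathit{word}_2) = \log_2 \frac{\frac{1}{N}\,\mathrm{hits}(\mathit{word}_1 \text{ NEAR } \mathit{word}_2)}{\frac{1}{N}\,\mathrm{hits}(\mathit{word}_1)\cdot\frac{1}{N}\,\mathrm{hits}(\mathit{word}_2)}$$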
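Does phrase appear more with “poor” or “excellent”?
$$
\begin{aligned}
\mathrm{Polarity}(\mathit{phrase}) &= \mathrm{PMI}(\mathit{phrase}, \text{``excellent''}) - \mathrm{PMI}(\mathit{phrase}, \text{``poor''}) \\
&= \log_2 \frac{\frac{1}{N}\,\mathrm{hits}(\mathit{phrase} \text{ NEAR ``excellent''})}{\frac{1}{N}\,\mathrm{hits}(\mathit{phrase})\cdot\frac{1}{N}\,\mathrm{hits}(\text{``excellent''})} - \log_2 \frac{\frac{1}{N}\,\mathrm{hits}(\mathit{phrase} \text{ NEAR ``poor''})}{\frac{1}{N}\,\mathrm{hits}(\mathit{phrase})\cdot\frac{1}{N}\,\mathrm{hits}(\text{``poor''})} \\
&= \log_2 \left( \frac{\mathrm{hits}(\mathit{phrase} \text{ NEAR ``excellent''})}{\mathrm{hits}(\mathit{phrase})\,\mathrm{hits}(\text{``excellent''})} \cdot \frac{\mathrm{hits}(\mathit{phrase})\,\mathrm{hits}(\text{``poor''})}{\mathrm{hits}(\mathit{phrase} \text{ NEAR ``poor''})} \right) \\
&= \log_2 \left( \frac{\mathrm{hits}(\mathit{phrase} \text{ NEAR ``excellent''})\,\mathrm{hits}(\text{``poor''})}{\mathrm{hits}(\mathit{phrase} \text{ NEAR ``poor''})\,\mathrm{hits}(\text{``excellent''})} \right)
\end{aligned}
$$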
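The polarity score can be sketched in code to confirm that hits(phrase) and N cancel out of the final expression. All hit counts below are invented for illustration, and the function names are hypothetical. The `smoothing` term is an addition not shown on the slides: Turney (2002) adds 0.01 to the NEAR hit counts to avoid division by zero, and it is set to 0 here when comparing the two forms.

```python
import math

def pmi_from_hits(hits_near, hits_a, hits_b, n):
    """PMI estimated from search-engine hit counts, with N as the normalizer."""
    return math.log2((hits_near / n) / ((hits_a / n) * (hits_b / n)))

def polarity(hits_near_excellent, hits_near_poor,
             hits_excellent, hits_poor, smoothing=0.01):
    """Simplified polarity: hits(phrase) and N have cancelled out."""
    numerator = (hits_near_excellent + smoothing) * hits_poor
    denominator = (hits_near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)

# Hypothetical counts, not taken from any real search engine.
N = 1_000_000
h_phrase, h_excellent, h_poor = 100, 1000, 500
near_excellent, near_poor = 80, 20

long_form = (pmi_from_hits(near_excellent, h_phrase, h_excellent, N)
             - pmi_from_hits(near_poor, h_phrase, h_poor, N))
short_form = polarity(near_excellent, near_poor, h_excellent, h_poor, smoothing=0.0)
print(long_form, short_form)
```

Because only the four counts in the simplified form matter, the phrase's own frequency never needs to be queried.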
Phrases from a thumbs-up review
Phrase                    POS tags    Polarity
online service            JJ NN        2.8
online experience         JJ NN        2.3
direct deposit            JJ NN        1.3
local branch              JJ NN        0.42
…
low fees                  JJ NNS       0.33
true service              JJ NN       -0.73
other bank                JJ NN       -0.85
inconveniently located    RB VBN      -1.5
Average                                0.32
Phrases from a thumbs-down review
Phrase                    POS tags    Polarity
direct deposits           JJ NNS       5.8
online web                JJ NN        1.9
very handy                RB JJ        1.4
…
virtual monopoly          JJ NN       -2.0
lesser evil               RBR JJ      -2.3
other problems            JJ NNS      -2.8
low funds                 JJ NNS      -6.8
unethical practices       JJ NNS      -8.5
Average                               -1.2
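The final step, rating a review by the average polarity of its phrases, might look like the sketch below. The function name `rate_review` and the example polarity values are hypothetical; the two tables elide some rows, so their averages cannot be recomputed from the rows shown, and the code instead uses made-up numbers. The zero threshold follows Turney (2002), where a review is recommended if the average is positive.

```python
def rate_review(phrase_polarities):
    """Average the phrase polarities and map the sign to a thumbs-up/down label."""
    if not phrase_polarities:
        raise ValueError("review has no extracted phrases")
    average = sum(phrase_polarities) / len(phrase_polarities)
    label = "thumbs up" if average > 0 else "thumbs down"
    return average, label

# Hypothetical phrase polarities for two reviews.
print(rate_review([2.0, -1.0, 0.5]))
print(rate_review([1.0, -4.0, -3.0]))
```

A single strongly negative phrase can outweigh several mildly positive ones, which is how a review containing "direct deposits" at 5.8 can still average out negative.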
Results of Turney algorithm
• 410 reviews from Epinions
– 170 (41%) negative
– 240 (59%) positive
• Majority class baseline: 59%
• Turney algorithm: 74%
• Phrases rather than words
• Learns domain-specific information
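The majority class baseline is simply the accuracy of always predicting the more frequent label. A short arithmetic check using the review counts reported above (variable names are illustrative):

```python
positive, negative = 240, 170
total = positive + negative

# Always guessing the larger class is right on exactly that class's share.
baseline_accuracy = max(positive, negative) / total
turney_accuracy = 0.74  # accuracy reported for the Turney algorithm

print(f"baseline: {baseline_accuracy:.1%}")
print(f"gain over baseline: {turney_accuracy - baseline_accuracy:.1%}")
```

So the 74% accuracy is roughly 15 percentage points better than always answering "positive".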