Tagging / Parsing
Taylor Berg-Kirkpatrick – CMU
Slides: Dan Klein – UC Berkeley
Algorithms for NLP
So How Well Does It Work?
§ Choose the most common tag
§ 90.3% with a bad unknown-word model
§ 93.7% with a good one
§ TnT (Brants, 2000):
§ A carefully smoothed trigram tagger
§ Suffix trees for emissions
§ 96.7% on WSJ text (SOTA is 97+%)
§ Noise in the data
§ Many errors in the training and test corpora
§ Probably about 2% guaranteed error from noise (on this data)
NN NN NN   chief executive officer
JJ NN NN   chief executive officer
JJ JJ NN   chief executive officer
NN JJ NN   chief executive officer

DT NN IN NN VBD NNS VBD
The average of interbank offered rates plummeted …
Overview: Accuracies
§ Roadmap of (known / unknown) accuracies:
§ Most freq tag:     ~90% / ~50%
§ Trigram HMM:       ~95% / ~55%
§ TnT (HMM++):       96.2% / 86.0%
§ Maxent P(t|w):     93.7% / 82.6%
§ MEMM tagger:       96.9% / 86.9%
§ State-of-the-art:  97+% / 89+%
§ Upper bound:       ~98%
Most errors are on unknown words.
Common Errors
§ Common errors [from Toutanova & Manning 00]:

NN/JJ NN          official knowledge
VBD RP/IN DT NN   made up the story
RB VBD/VBN NNS    recently sold shares
Richer Features
Better Features
§ Can do surprisingly well just looking at a word by itself:
§ Word             the: the → DT
§ Lowercased word  Importantly: importantly → RB
§ Prefixes         unfathomable: un- → JJ
§ Suffixes         Surprisingly: -ly → RB
§ Capitalization   Meridian: CAP → NNP
§ Word shapes      35-year: d-x → JJ
§ Then build a maxent (or whatever) model to predict the tag
§ Maxent P(t|w): 93.7% / 82.6%
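As a concrete illustration, here is a minimal sketch (in Python; the feature names are illustrative, not canonical) of the kind of word-internal feature extractor such a maxent model might use:

import re

def word_features(word):
    """Extract word-internal features for a single token (a sketch)."""
    # Word shape: collapse character classes, e.g. "35-year" -> "d-x"
    shape = re.sub(r'[A-Z]+', 'X', word)
    shape = re.sub(r'[a-z]+', 'x', shape)
    shape = re.sub(r'[0-9]+', 'd', shape)
    return {
        'word=' + word: 1.0,
        'lower=' + word.lower(): 1.0,
        'prefix=' + word[:2]: 1.0,
        'suffix=' + word[-2:]: 1.0,
        'capitalized': 1.0 if word[0].isupper() else 0.0,
        'shape=' + shape: 1.0,
    }

# e.g. word_features("Surprisingly") fires a 'suffix=ly' feature,
# which pushes the tagger toward RB.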
Why Local Context is Useful
§ Lots of rich local information!

PRP  VBD  IN  RB   IN  PRP VBD     .
They left as  soon as  he  arrived .    (correct tag for the first “as”: RB)

NNP       NNS   VBD      VBN        .
Intrinsic flaws remained undetected .   (correct tag for “Intrinsic”: JJ)

§ We could fix this with a feature that looked at the next word
§ We could fix this by linking capitalized words to their lowercase versions
§ Solution: discriminative sequence models (MEMMs, CRFs)
§ Reality check:
§ Taggers are already pretty good on newswire text…
§ What the world needs is taggers that work on other text!
Sequence-Free Tagging?
§ What about looking at a word and its environment, but no sequence information?
§ Add in previous / next word     the __
§ Previous / next word shapes     X __ X
§ Crude entity detection          __ ….. (Inc.|Co.)
§ Phrasal verb in sentence?       put …… __
§ Conjunctions of these things
§ All features except sequence: 96.6% / 86.8%
§ Uses lots of features: >200K
Named Entity Recognition
§ Other sequence tasks use similar models
§ Example: named entity recognition (NER)

Local Context
        Prev   Cur    Next
State   ???    LOC    ???
Word    at     Grace  Road
Tag     IN     NNP    NNP
Sig     x      Xx     Xx

Tim Boon has signed a contract extension with Leicestershire which will keep him at Grace Road .
PER PER  O   O      O O        O         O    ORG           O     O    O    O   O  LOC   LOC  O
MEMM Taggers
§ Idea: left-to-right local decisions, condition on previous tags and also the entire input
§ Train up P(t_i | w, t_{i-1}, t_{i-2}) as a normal maxent model, then use it to score sequences
§ This is referred to as an MEMM tagger [Ratnaparkhi 96]
§ Beam search effective! (Why?)
§ What about beam size 1?
§ Subtle issues with local normalization (cf. Lafferty et al 01)
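A minimal sketch of beam-search decoding for such a locally normalized tagger, assuming a hypothetical scoring function memm_logprob(tag, prev_tags, words, i) trained as above:

def beam_decode(words, tagset, memm_logprob, beam_size=5):
    # Each hypothesis is (score, tag_sequence); start with the empty sequence.
    beam = [(0.0, [])]
    for i in range(len(words)):
        candidates = []
        for score, tags in beam:
            for tag in tagset:
                # Local decision, conditioned on previous tags and the whole input
                s = score + memm_logprob(tag, tags[-2:], words, i)
                candidates.append((s, tags + [tag]))
        # Keep only the top-scoring hypotheses (beam size 1 = greedy decoding)
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
    return beam[0][1]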
NER Features

Local Context
        Prev   Cur    Next
State   Other  LOC    ???
Word    at     Grace  Road
Tag     IN     NNP    NNP
Sig     x      Xx     Xx

Feature Type           Feature    PERS    LOC
Previous word          at         -0.73   0.94
Current word           Grace       0.03   0.00
First char of word     G           0.45  -0.04
Current POS tag        NNP         0.47   0.45
Prev and cur tags      IN NNP     -0.10   0.14
Previous state         Other      -0.70  -0.92
Current signature      Xx          0.80   0.46
Prev state, cur sig    O-Xx        0.68   0.37
Prev-cur-next sig      x-Xx-Xx    -0.69   0.37
P. state - p-cur sig   O-x-Xx     -0.20   0.82
…
Total:                            -0.58   2.68
Feature Weights
Because of the regularization term, the more common prefix features have larger weights even though entire-word features are more specific.
Decoding
§ Decoding MEMM taggers:
§ Just like decoding HMMs, with different local scores
§ Viterbi, beam search, posterior decoding
§ Viterbi algorithm (HMMs):
§ Viterbi algorithm (MEMMs):
§ General:
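In standard form, the recurrences for the three cases are (a sketch, writing δ_i(s) for the best score of any tag sequence ending in state s at position i):

δ_i(s) = max_{s'} P(s | s') P(w_i | s) δ_{i-1}(s')      (HMM)
δ_i(s) = max_{s'} P(s | s', w, i) δ_{i-1}(s')           (MEMM)
δ_i(s) = max_{s'} φ_i(s', s) δ_{i-1}(s')                (general: any local score φ)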
Conditional Random Fields (and Friends)
Maximum Entropy II
§ Remember: the maximum entropy objective
§ Problem: lots of features allow a perfect fit to the training set
§ Regularization (compare to smoothing)
Derivative for Maximum Entropy

∂L/∂λ_n = (total count of feature n in correct candidates)
        − (expected count of feature n in predicted candidates)
        − regularization term (big weights are bad)
Perceptron Review

Perceptron
§ Linear models:
§ … that decompose along the sequence
§ … allow us to predict with the Viterbi algorithm
§ … which means we can train with the perceptron algorithm (or related updates, like MIRA)
[Collins 01]
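A minimal sketch of the structured perceptron update under these assumptions, with a hypothetical viterbi_decode and a feature counter over whole tag sequences:

def perceptron_train(data, weights, viterbi_decode, features, epochs=5):
    # data: list of (words, gold_tags); features(words, tags) returns a count dict
    for _ in range(epochs):
        for words, gold in data:
            guess = viterbi_decode(words, weights)  # best sequence under current weights
            if guess != gold:
                # Promote gold features, demote predicted features
                for f, v in features(words, gold).items():
                    weights[f] = weights.get(f, 0.0) + v
                for f, v in features(words, guess).items():
                    weights[f] = weights.get(f, 0.0) - v
    return weights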
Conditional Random Fields
§ Make a maxent model over entire taggings
§ MEMM
§ CRF
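In sketch form, the contrast is between per-position local normalization and a single global normalizer:

MEMM:  P(t | w) = Π_i P(t_i | t_{i-1}, w) = Π_i exp(λ·f(t_i, t_{i-1}, w, i)) / Z(t_{i-1}, w, i)
CRF:   P(t | w) = exp(Σ_i λ·f(t_i, t_{i-1}, w, i)) / Z(w)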
CRFs
§ Like any maxent model, the derivative is:
§ So all we need is to be able to compute the expectation of each feature (for example the number of times the label pair DT-NN occurs, or the number of times NN-interest occurs) under the model distribution
§ Critical quantity: counts of posterior marginals:
Computing Posterior Marginals
§ How many (expected) times is word w tagged with s?
§ How to compute that marginal?

[State lattice / trellis over “START Fed raises interest rates END”, with states ^, N, V, J, D, $ at each position]
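The marginal in question has the standard forward–backward form (a sketch, with α and β the forward and backward path sums over the lattice):

P(t_i = s | w) = α_i(s) β_i(s) / Σ_{s'} α_i(s') β_i(s')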
Global Discriminative Taggers
§ Newer, higher-powered discriminative sequence models
§ CRFs (also perceptrons, M3Ns)
§ Do not decompose training into independent local regions
§ Can be deathly slow to train – require repeated inference on the training set
§ Differences tend not to be too important for POS tagging
§ Differences more substantial on other sequence tasks
§ However: one issue worth knowing about in local models
§ “Label bias” and other explaining-away effects
§ MEMM taggers’ local scores can be near one without having both good “transitions” and “emissions”
§ This means that often evidence doesn’t flow properly
§ Why isn’t this a big deal for POS tagging?
§ Also: in decoding, condition on predicted, not gold, histories
Transformation-Based Learning
§ [Brill 95] presents a transformation-based tagger
§ Label the training set with the most frequent tags

DT  MD  VBD VBD    .
The can was rusted .

§ Add transformation rules which reduce training mistakes
§ MD → NN : DT __
§ VBD → VBN : VBD __ .
§ Stop when no transformations do sufficient good
§ Does this remind anyone of anything?
§ Probably the most widely used tagger (esp. outside NLP)
§ … but definitely not the most accurate: 96.6% / 82.0%
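A minimal sketch of applying learned transformations of this contextual form (the rule encoding is illustrative):

def apply_rules(tags, rules):
    # rules: (from_tag, to_tag, prev_tag) triples, e.g. ('MD', 'NN', 'DT'),
    # applied in the order they were learned, left to right over the sequence.
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags

# e.g. apply_rules(['DT', 'MD', 'VBD', 'VBD', '.'], [('MD', 'NN', 'DT')])
# retags "can" as NN in "The can was rusted."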
Learned Transformations
§ What gets learned? [from Brill 95]
EngCG Tagger
§ English constraint grammar tagger
§ [Tapanainen and Voutilainen 94]
§ Something else you should know about
§ Hand-written and knowledge-driven
§ “Don’t guess if you know” (general point about modeling more structure!)
§ Tag set doesn’t make all of the hard distinctions of the standard tag set (e.g. JJ/NN)
§ They get stellar accuracies: 99% on their tag set
§ Linguistic representation matters…
§ … but it’s easier to win when you make up the rules
Domain Effects
§ Accuracies degrade outside of domain
§ Up to triple error rate
§ Usually make the most errors on the things you care about in the domain (e.g. protein names)
§ Open questions
§ How to effectively exploit unlabeled data from a new domain (what could we gain?)
§ How to best incorporate domain lexica in a principled way (e.g. UMLS specialist lexicon, ontologies)
Unsupervised Tagging

Unsupervised Tagging?
§ AKA part-of-speech induction
§ Task:
§ Raw sentences in
§ Tagged sentences out
§ Obvious thing to do:
§ Start with a (mostly) uniform HMM
§ Run EM
§ Inspect results
EM for HMMs: Process
§ Alternate between recomputing distributions over hidden variables (the tags) and reestimating parameters
§ Crucial step: we want to tally up how many (fractional) counts of each kind of transition and emission we have under current params:
§ Same quantities we needed to train a CRF!
EM for HMMs: Quantities
§ Total path values (correspond to probabilities here):

The State Lattice / Trellis
[State lattice over “START Fed raises interest rates END”, with states ^, N, V, J, D, $ at each position]
EM for HMMs: Process
§ From these quantities, can compute expected transitions:
§ And emissions:
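As a sketch, with forward scores α and backward scores β over the lattice and Z the total score of all paths, these expected counts take the standard Baum-Welch form:

count(s → s') = Σ_i α_i(s) P(s'|s) P(w_{i+1}|s') β_{i+1}(s') / Z
count(s → w)  = Σ_{i : w_i = w} α_i(s) β_i(s) / Z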
Merialdo: Setup
§ Some (discouraging) experiments [Merialdo 94]
§ Setup:
§ You know the set of allowable tags for each word
§ Fix k training examples to their true labels
§ Learn P(w|t) on these examples
§ Learn P(t|t-1,t-2) on these examples
§ On n examples, re-estimate with EM
§ Note: we know allowed tags but not frequencies

Merialdo: Results
Distributional Clustering

president   the __ of
president   the __ said
governor    the __ of
governor    the __ appointed
said        sources __ ◊
said        president __ that
reported    sources __ ◊

Clusters: {president, governor}, {said, reported}, {the, a}

◊ the president said that the downturn was over ◊

[Finch and Chater 92, Schuetze 93, many others]
Distributional Clustering
§ Three main variants on the same idea:
§ Pairwise similarities and heuristic clustering
§ E.g. [Finch and Chater 92]
§ Produces dendrograms
§ Vector space methods
§ E.g. [Schuetze 93]
§ Models of ambiguity
§ Probabilistic methods
§ Various formulations, e.g. [Lee and Pereira 99]
Nearest Neighbors

Dendrograms
Vector Space Version
§ [Schuetze 93] clusters words as points in R^n
§ Vectors too sparse, use SVD to reduce:

[Figure: M (rows = words w, columns = context counts) factored as M ≈ U S Vᵀ; keep the reduced rows of U]

Cluster these 50-200 dim vectors instead.
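A minimal sketch of that reduction with numpy (dimensions illustrative):

import numpy as np

def reduce_contexts(M, k=100):
    """M: (n_words x n_contexts) count matrix. Returns one k-dim vector per word."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    # Keep the top-k singular dimensions; rows are the reduced word vectors
    return U[:, :k] * S[:k]

# Cluster the rows of reduce_contexts(M) with any clustering algorithm.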
A Probabilistic Version?

P(S, C) = Π_i P(w_i | c_i) P(c_i | c_{i-1})

P(S, C) = Π_i P(c_i) P(w_i | c_i) P(w_{i-1}, w_{i+1} | c_i)

◊ the president said that the downturn was over ◊
c1  c2        c3   c4   c5  c6       c7  c8
What Else?
§ Various newer ideas:
§ Context distributional clustering [Clark 00]
§ Morphology-driven models [Clark 03]
§ Contrastive estimation [Smith and Eisner 05]
§ Feature-rich induction [Haghighi and Klein 06]
§ Also:
§ What about ambiguous words?
§ Using wider context signatures has been used for learning synonyms (what’s wrong with this approach?)
§ Can extend these ideas for grammar induction (later)
Computing Marginals

P(state s at time t | sentence) = (sum of all paths through s at t) / (sum of all paths)

Forward Scores
Backward Scores
Total Scores
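A minimal sketch of computing these scores for an HMM-style lattice, assuming trans and emit dicts of local scores with 'START' and 'END' boundary states:

def forward_backward(words, states, trans, emit):
    n = len(words)
    # Forward: total score of all paths ending in state s after position i
    alpha = [{s: 0.0 for s in states} for _ in range(n)]
    for s in states:
        alpha[0][s] = trans['START'][s] * emit[s][words[0]]
    for i in range(1, n):
        for s in states:
            alpha[i][s] = sum(alpha[i-1][sp] * trans[sp][s] for sp in states) * emit[s][words[i]]
    # Backward: total score of all path completions after position i, given state s
    beta = [{s: 0.0 for s in states} for _ in range(n)]
    for s in states:
        beta[n-1][s] = trans[s]['END']
    for i in range(n - 2, -1, -1):
        for s in states:
            beta[i][s] = sum(trans[s][sn] * emit[sn][words[i+1]] * beta[i+1][sn] for sn in states)
    z = sum(alpha[n-1][s] * beta[n-1][s] for s in states)  # sum of all paths
    # Posterior marginal: P(state s at position i | sentence)
    return [{s: alpha[i][s] * beta[i][s] / z for s in states} for i in range(n)]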
Syntax
Parse Trees
The move followed a round of similar increases by other lenders, reflecting a continuing decline in that market
Phrase Structure Parsing
§ Phrase structure parsing organizes syntax into constituents or brackets
§ In general, this involves nested trees
§ Linguists can, and do, argue about details
§ Lots of ambiguity
§ Not the only kind of syntax…

[Example tree over “new art critics write reviews with computers”, with nodes S, VP, NP, N’, NP, NP, PP]
Constituency Tests
§ How do we know what nodes go in the tree?
§ Classic constituency tests:
§ Substitution by proform
§ Question answers
§ Semantic grounds
§ Coherence
§ Reference
§ Idioms
§ Dislocation
§ Conjunction
§ Cross-linguistic arguments, too
Conflicting Tests
§ Constituency isn’t always clear
§ Units of transfer:
§ think about ~ penser à
§ talk about ~ hablar de
§ Phonological reduction:
§ I will go → I’ll go
§ I want to go → I wanna go
§ à le centre → au centre
§ Coordination
§ He went to and came from the store.

La vélocité des ondes sismiques (“the velocity of seismic waves”)
Classical NLP: Parsing
§ Write symbolic or logical rules:
§ Use deduction systems to prove parses from words
§ Minimal grammar on “Fed raises” sentence: 36 parses
§ Simple 10-rule grammar: 592 parses
§ Real-size grammar: many millions of parses
§ This scaled very badly, didn’t yield broad-coverage tools
Grammar (CFG)
ROOT → S
S → NP VP
NP → DT NN
NP → NN NNS
NP → NP PP
VP → VBP NP
VP → VBP NP PP
PP → IN NP

Lexicon
NN → interest
NNS → raises
VBP → interest
VBZ → raises
…
Ambiguities

Ambiguities: PP Attachment

Attachments
§ I cleaned the dishes from dinner
§ I cleaned the dishes with detergent
§ I cleaned the dishes in my pajamas
§ I cleaned the dishes in the sink
Syntactic Ambiguities I
§ Prepositional phrases: They cooked the beans in the pot on the stove with handles.
§ Particle vs. preposition: The puppy tore up the staircase.
§ Complement structures: The tourists objected to the guide that they couldn’t hear. She knows you like the back of her hand.
§ Gerund vs. participial adjective: Visiting relatives can be boring. Changing schedules frequently confused passengers.
Syntactic Ambiguities II
§ Modifier scope within NPs: impractical design requirements; plastic cup holder
§ Multiple gap constructions: The chicken is ready to eat. The contractors are rich enough to sue.
§ Coordination scope: Small rats and mice can squeeze into holes or cracks in the wall.
Dark Ambiguities
§ Dark ambiguities: most analyses are shockingly bad (meaning, they don’t have an interpretation you can get your mind around)
§ This analysis corresponds to the correct parse of “This will panic buyers!”
§ Unknown words and new usages
§ Solution: we need mechanisms to focus attention on the best ones; probabilistic techniques do this
Ambiguities as Trees
PCFGs

Probabilistic Context-Free Grammars
§ A context-free grammar is a tuple <N, T, S, R>
§ N : the set of non-terminals
§ Phrasal categories: S, NP, VP, ADJP, etc.
§ Parts-of-speech (pre-terminals): NN, JJ, DT, VB
§ T : the set of terminals (the words)
§ S : the start symbol
§ Often written as ROOT or TOP
§ Not usually the sentence non-terminal S
§ R : the set of rules
§ Of the form X → Y1 Y2 … Yk, with X, Yi ∈ N
§ Examples: S → NP VP, VP → VP CC VP
§ Also called rewrites, productions, or local trees
§ A PCFG adds:
§ A top-down production probability per rule P(Y1 Y2 … Yk | X)
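As a sketch of what the added probabilities buy: the probability of a tree T is the product of the probabilities of the rules used to build it,

P(T) = Π_{(X → α) ∈ T} P(α | X)

where the rule probabilities for each fixed left-hand side X sum to one.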
Treebank Sentences

Treebank Grammars
§ Need a PCFG for broad coverage parsing.
§ Can take a grammar right off the trees (doesn’t work well):
§ Better results by enriching the grammar (e.g., lexicalization).
§ Can also get state-of-the-art parsers without lexicalization.
ROOT → S          1
S → NP VP .       1
NP → PRP          1
VP → VBD ADJP     1
…..
[Figure: example treebank tree fragments, with DET, ADJ, NOUN, and PLURAL NOUN expansions of NP, plus NP–CONJ–NP and NP–PP structures]
Treebank Grammar Scale
§ Treebank grammars can be enormous
§ As FSAs, the raw grammar has ~10K states, excluding the lexicon
§ Better parsers usually make the grammars larger, not smaller
Chomsky Normal Form
§ Chomsky normal form:
§ All rules of the form X → Y Z or X → w
§ In principle, this is no limitation on the space of (P)CFGs
§ N-ary rules introduce new non-terminals
§ Unaries / empties are “promoted”
§ In practice it’s kind of a pain:
§ Reconstructing n-aries is easy
§ Reconstructing unaries is trickier
§ The straightforward transformations don’t preserve tree scores
§ Makes parsing algorithms simpler!

VP → VBD NP PP PP becomes binary via intermediate symbols such as [VP → VBD NP •] and [VP → VBD NP PP •]
CKY Parsing

A Recursive Parser
§ Will this parser work?
§ Why or why not?
§ Memory requirements?

bestScore(X, i, j, s)
  if (j == i+1)
    return tagScore(X, s[i])
  else
    return max over k and rules X → Y Z of
      score(X → Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k, j, s)
A Memoized Parser
§ One small change:

bestScore(X, i, j, s)
  if (scores[X][i][j] == null)
    if (j == i+1)
      score = tagScore(X, s[i])
    else
      score = max over k and rules X → Y Z of
        score(X → Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k, j, s)
    scores[X][i][j] = score
  return scores[X][i][j]

§ Can also organize things bottom-up
A Bottom-Up Parser (CKY)

bestScore(s)
  for (i : [0, n-1])
    for (X : tags[s[i]])
      score[X][i][i+1] = tagScore(X, s[i])
  for (diff : [2, n])
    for (i : [0, n-diff])
      j = i + diff
      for (X → Y Z : rule)
        for (k : [i+1, j-1])
          score[X][i][j] = max(score[X][i][j],
                               score(X → Y Z) * score[Y][i][k] * score[Z][k][j])
[Diagram: X over span (i, j) built from Y over (i, k) and Z over (k, j)]
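The same bottom-up loop as a runnable Python sketch, assuming a grammar given as plain dicts and lists (names illustrative):

from collections import defaultdict

def cky(words, tag_probs, rules):
    """words: list of tokens.
    tag_probs: dict word -> {tag: score}.
    rules: list of (X, Y, Z, prob) binary rules.
    Returns score[(X, i, j)] = best score of an X over span [i, j)."""
    n = len(words)
    score = defaultdict(float)  # missing entries score 0.0
    for i, w in enumerate(words):
        for tag, p in tag_probs.get(w, {}).items():
            score[(tag, i, i + 1)] = p
    for diff in range(2, n + 1):            # span length
        for i in range(0, n - diff + 1):
            j = i + diff
            for X, Y, Z, p in rules:
                for k in range(i + 1, j):   # split point
                    s = p * score[(Y, i, k)] * score[(Z, k, j)]
                    if s > score[(X, i, j)]:
                        score[(X, i, j)] = s
    return score

Backpointers are omitted for brevity; storing the best (k, rule) per cell lets you reconstruct the parse tree.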
Unary Rules
§ Unary rules?

bestScore(X, i, j, s)
  if (j == i+1)
    return tagScore(X, s[i])
  else
    return max of
      max over k and rules X → Y Z of
        score(X → Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k, j, s)
      max over rules X → Y of
        score(X → Y) * bestScore(Y, i, j, s)
CNF + Unary Closure
§ We need unaries to be non-cyclic
§ Can address by pre-calculating the unary closure
§ Rather than having zero or more unaries, always have exactly one
§ Alternate unary and binary layers
§ Reconstruct unary chains afterwards
[Tree figures: binary layers such as VP → VBD NP and NP → DT NN, and unary chains involving S, SBAR, and VP]
Alternating Layers

bestScoreU(X, i, j, s)
  if (j == i+1)
    return tagScore(X, s[i])
  else
    return max over rules X → Y of
      score(X → Y) * bestScoreB(Y, i, j, s)

bestScoreB(X, i, j, s)
  return max over k and rules X → Y Z of
    score(X → Y Z) * bestScoreU(Y, i, k, s) * bestScoreU(Z, k, j, s)
Analysis

Memory
§ How much memory does this require?
§ Have to store the score cache
§ Cache size: |symbols| * n² doubles
§ For the plain treebank grammar:
§ X ~ 20K, n = 40, double ~ 8 bytes = ~256MB
§ Big, but workable.
§ Pruning: Beams
§ score[X][i][j] can get too large (when?)
§ Can keep beams (truncated maps score[i][j]) which only store the best few scores for the span [i,j]
§ Pruning: Coarse-to-Fine
§ Use a smaller grammar to rule out most X[i,j]
§ Much more on this later…
Time: Theory
§ How much time will it take to parse?
§ For each diff (<= n)
§ For each i (<= n)
§ For each rule X → Y Z
§ For each split point k: do constant work
§ Total time: |rules| * n³
§ Something like 5 sec for an unoptimized parse of a 20-word sentence
Time: Practice
§ Parsing with the vanilla treebank grammar (~20K rules; not an optimized parser!):
§ Observed exponent: 3.6
§ Why’s it worse in practice?
§ Longer sentences “unlock” more of the grammar
§ All kinds of systems issues don’t scale
Same-Span Reachability
[Diagram: which categories are reachable over the same span. One large group: ADJP, ADVP, FRAG, INTJ, NP, PP, PRN, QP, S, SBAR, UCP, VP, WHNP. Separate: TOP, LST, CONJP, WHADJP, WHADVP, WHPP, NX, NAC, SBARQ, SINV, RRC, SQ, X, PRT]
Rule State Reachability
§ Many states are more likely to match larger spans!

Example: NP CC •     NP over [0, n-1], CC over [n-1, n]: 1 alignment
Example: NP CC NP •  NP over [0, n-k-1], CC, then NP over [n-k, n]: n alignments
Efficient CKY
§ Lots of tricks to make CKY efficient
§ Some of them are little engineering details:
§ E.g., first choose k, then enumerate through the Y:[i,k] which are non-zero, then loop through rules by left child.
§ Optimal layout of the dynamic program depends on grammar, input, even system details.
§ Another kind is more important (and interesting):
§ Many X[i,j] can be suppressed on the basis of the input string
§ We’ll see this next class as figures-of-merit, A* heuristics, coarse-to-fine, etc.
Agenda-Based Parsing

Agenda-Based Parsing
§ Agenda-based parsing is like graph search (but over a hypergraph)
§ Concepts:
§ Numbering: we number fenceposts between words
§ “Edges” or items: spans with labels, e.g. PP[3,5], represent the sets of trees over those words rooted at that label (cf. search states)
§ A chart: records edges we’ve expanded (cf. closed set)
§ An agenda: a queue which holds edges (cf. a fringe or open set)

0 critics 1 write 2 reviews 3 with 4 computers 5    (e.g. PP[3,5])
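A minimal sketch of the overall agenda loop under these definitions; the unary and binary successor functions are assumed, as spelled out on the following slides:

def agenda_parse(word_items, unary_successors, binary_successors):
    # word_items: initial edges such as ('critics', 0, 1); scores omitted for brevity
    agenda = list(word_items)   # open set (here a simple stack)
    chart = set()               # closed set
    while agenda:
        edge = agenda.pop()
        if edge in chart:
            continue
        chart.add(edge)
        # Graph successors: unary projections of this edge
        for new in unary_successors(edge):
            if new not in chart:
                agenda.append(new)
        # Hypergraph successors: combinations with edges already in the chart
        for new in binary_successors(edge, chart):
            if new not in chart:
                agenda.append(new)
    return chart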
Word Items
§ Building an item for the first time is called discovery. Items go into the agenda on discovery.
§ To initialize, we discover all word items (with score 1.0).

0 critics 1 write 2 reviews 3 with 4 computers 5

AGENDA: critics[0,1], write[1,2], reviews[2,3], with[3,4], computers[4,5]
CHART: [EMPTY]
Unary Projection
§ When we pop a word item, the lexicon tells us the tag item successors (and scores) which go on the agenda

0 critics 1 write 2 reviews 3 with 4 computers 5
critics[0,1] → NNS[0,1]; write[1,2] → VBP[1,2]; reviews[2,3] → NNS[2,3]; with[3,4] → IN[3,4]; computers[4,5] → NNS[4,5]
Item Successors
§ When we pop items off of the agenda:
§ Graph successors: unary projections (NNS → critics, NP → NNS)
  Y[i,j] with X → Y forms X[i,j]
§ Hypergraph successors: combine with items already in our chart
  Y[i,j] and Z[j,k] with X → Y Z form X[i,k]
§ Enqueue / promote resulting items (if not in chart already)
§ Record backtraces as appropriate
§ Stick the popped edge in the chart (closed set)
§ Queries a chart must support:
§ Is edge X[i,j] in the chart? (What score?)
§ What edges with label Y end at position j?
§ What edges with label Z start at position i?
An Example

0 critics 1 write 2 reviews 3 with 4 computers 5

Discovery order (tags first, then phrases):
NNS[0,1]  VBP[1,2]  NNS[2,3]  IN[3,4]  NNS[4,5]
NP[0,1]   NP[2,3]   NP[4,5]
VP[1,2]   S[0,2]    ROOT[0,2]
PP[3,5]   VP[1,3]   S[0,3]    ROOT[0,3]
NP[2,5]   VP[1,5]   S[0,5]    ROOT[0,5]
Empty Elements
§ Sometimes we want to posit nodes in a parse tree that don’t contain any pronounced words:

I want you to parse this sentence
I want [ ] to parse this sentence

§ These are easy to add to an agenda-based parser!
§ For each position i, add the “word” edge e[i,i]
§ Add rules like NP → e to the grammar
§ That’s it!

0 I 1 like 2 to 3 parse 4 empties 5    (with an e[i,i] edge at every fencepost)
UCS / A*
§ With weighted edges, order matters
§ Must expand optimal parse from bottom up (subparses first)
§ CKY does this by processing smaller spans before larger ones
§ UCS pops items off the agenda in order of decreasing Viterbi score
§ A* search also well defined
§ You can also speed up the search without sacrificing optimality
§ Can select which items to process first
§ Can do with any “figure of merit” [Charniak 98]
§ If your figure-of-merit is a valid A* heuristic, no loss of optimality [Klein and Manning 03]
(Speech) Lattices
§ There was nothing magical about words spanning exactly one position.
§ When working with speech, we generally don’t know how many words there are, or where they break.
§ We can represent the possibilities as a lattice and parse these just as easily.

[Lattice figure: alternative paths through “I / Ivan / eyes”, “saw / ’ve / awe”, “a / an / of / van”]
Learning PCFGs

Treebank PCFGs
§ Use PCFGs for broad coverage parsing
§ Can take a grammar right off the trees (doesn’t work well):

ROOT → S          1
S → NP VP .       1
NP → PRP          1
VP → VBD ADJP     1
…..

Model      F1
Baseline   72.0
[Charniak 96]
Conditional Independence?
§ Not every NP expansion can fill every NP slot
§ A grammar with symbols like “NP” won’t be context-free
§ Statistically, conditional independence too strong
Non-Independence
§ Independence assumptions are often too strong.
§ Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects).
§ Also: the subject and object expansions are correlated!

Expansion   All NPs   NPs under S   NPs under VP
NP PP       11%       9%            23%
DT NN       9%        9%            7%
PRP         6%        21%           4%
Grammar Refinement
§ Example: PP attachment

Grammar Refinement
§ Structure Annotation [Johnson ’98, Klein & Manning ’03]
§ Lexicalization [Collins ’99, Charniak ’00]
§ Latent Variables [Matsuzaki et al. 05, Petrov et al. ’06]
Structural Annotation

The Game of Designing a Grammar
§ Annotation refines base treebank symbols to improve statistical fit of the grammar
§ Structural annotation

Typical Experimental Setup
§ Corpus: Penn Treebank, WSJ
  Training: sections 02-21
  Development: section 22 (here, first 20 files)
  Test: section 23
§ Accuracy – F1: harmonic mean of per-node labeled precision and recall.
§ Here: also size – number of symbols in grammar.
Vertical Markovization
§ Vertical Markov order: rewrites depend on past k ancestor nodes. (cf. parent annotation)
Order 1 … Order 2
[Charts: F1 rises from ~72% to ~79% as vertical Markov order goes 1, 2v, 2, 3v, 3; number of grammar symbols grows toward ~25000 over the same orders]
Horizontal Markovization
Order 1 … Order ∞
[Charts: F1 ranges over ~70-74% for horizontal Markov orders 0, 1, 2v, 2, inf; number of grammar symbols ranges up to ~12000]
Unary Splits
§ Problem: unary rewrites are used to transmute categories so a high-probability rule can be used.
§ Solution: mark unary rewrite sites with -U

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K
Tag Splits
§ Problem: Treebank tags are too coarse.
§ Example: sentential, PP, and other prepositions are all marked IN.
§ Partial solution:
§ Subdivide the IN tag.

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K
A Fully Annotated (Unlex) Tree

Some Test Set Results
§ Beats “first generation” lexicalized parsers.
§ Lots of room to improve – more complex models next.
Parser          LP     LR     F1     CB     0 CB
Magerman 95     84.9   84.6   84.7   1.26   56.6
Collins 96      86.3   85.8   86.0   1.14   59.9
Unlexicalized   86.9   85.7   86.3   1.10   60.3
Charniak 97     87.4   87.5   87.4   1.00   62.1
Collins 99      88.7   88.6   88.6   0.90   67.1
Efficient Parsing for Structural Annotation

Grammar Projections
Coarse grammar: NP → DT N’
Fine grammar: NP^S → DT^NP N’[…DT]^NP
Note: X-Bar grammars are projections with rules like XP → Y X’ or XP → X’ Y or X’ → X
Coarse-to-Fine Pruning
§ For each coarse chart item X[i,j], compute its posterior probability; e.g. consider the span 5 to 12:
coarse:  … QP NP VP …
refined: (the split versions of each coarse symbol)
§ Prune an item whenever its coarse posterior is < threshold
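A sketch of the pruning quantity, using inside and outside scores from the coarse grammar over a sentence of length n:

P(X[i,j] | sentence) = outside(X, i, j) * inside(X, i, j) / inside(ROOT, 0, n)

The refined versions of X[i,j] are skipped when this falls below the threshold.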
Computing (Max-)Marginals

Inside and Outside Scores

Pruning with A*
§ You can also speed up the search without sacrificing optimality
§ For agenda-based parsers:
§ Can select which items to process first
§ Can do with any “figure of merit” [Charniak 98]
§ If your figure-of-merit is a valid A* heuristic, no loss of optimality [Klein and Manning 03]
A* Parsing

Lexicalization

The Game of Designing a Grammar
§ Annotation refines base treebank symbols to improve statistical fit of the grammar
§ Structural annotation [Johnson ’98, Klein and Manning 03]
§ Head lexicalization [Collins ’99, Charniak ’00]

Problems with PCFGs
§ If we do no annotation, these trees differ only in one rule:
§ VP → VP PP
§ NP → NP PP
§ Parse will go one way or the other, regardless of words
§ We addressed this in one way with unlexicalized grammars (how?)
§ Lexicalization allows us to be sensitive to specific words
Problems with PCFGs
§ What’s different between basic PCFG scores here?
§ What (lexical) correlations need to be scored?
Lexicalized Trees
§ Add “headwords” to each phrasal node
§ Syntactic vs. semantic heads
§ Headship not in (most) treebanks
§ Usually use head rules, e.g.:
§ NP:
§ Take leftmost NP
§ Take rightmost N*
§ Take rightmost JJ
§ Take right child
§ VP:
§ Take leftmost VB*
§ Take leftmost VP
§ Take left child
Lexicalized PCFGs?
§ Problem: we now have to estimate probabilities like
§ Never going to get these atomically off of a treebank
§ Solution: break up the derivation into smaller steps
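A sketch of the kind of quantity meant (the particular rule and head word here are illustrative):

P(VP → VBD[saw] NP[her] PP[with] | VP, saw)

Estimating such a rule as one atomic event is hopeless; the derivation steps below decompose it into a head choice, a complement bag, and per-child generation.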
Lexical Derivation Steps
§ A derivation of a local tree [Collins 99]:
§ Choose a head tag and word
§ Choose a complement bag
§ Generate children (incl. adjuncts)
§ Recursively derive children
Lexicalized CKY

bestScore(X, i, j, h)
  if (j == i+1)
    return tagScore(X, s[i])
  else
    return max of
      max over k, h’, X → Y Z of
        score(X[h] → Y[h] Z[h’]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, h’)
      max over k, h’, X → Y Z of
        score(X[h] → Y[h’] Z[h]) * bestScore(Y, i, k, h’) * bestScore(Z, k, j, h)

[Diagram: X[h] over (i, j) from Y[h] over (i, k) and Z[h’] over (k, j); dotted items like (VP → VBD •)[saw] combine with NP[her] to form (VP → VBD…NP •)[saw]]
Efficient Parsing for Lexical Grammars

Quartic Parsing
§ Turns out, you can do (a little) better [Eisner 99]
§ Gives an O(n⁴) algorithm
§ Still prohibitive in practice if not pruned

[Diagram: replace the Y[h] Z[h’] combination with a headless Y[h] Z step, dropping one free head index]
Pruning with Beams
§ The Collins parser prunes with per-cell beams [Collins 99]
§ Essentially, run the O(n⁵) CKY
§ Remember only a few hypotheses for each span <i,j>.
§ If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)
§ Keeps things more or less cubic (and in practice is more like linear!)
§ Also: certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Pruning with a PCFG
§ The Charniak parser prunes using a two-pass, coarse-to-fine approach [Charniak 97+]
§ First, parse with the base grammar
§ For each X:[i,j] calculate P(X | i, j, s)
§ This isn’t trivial, and there are clever speedups
§ Second, do the full O(n⁵) CKY
§ Skip any X:[i,j] which had low (say, < 0.0001) posterior
§ Avoids almost all work in the second phase!
§ Charniak et al 06: can use more passes
§ Petrov et al 07: can use many more passes
Results
§ Some results:
§ Collins 99 – 88.6 F1 (generative lexical)
§ Charniak and Johnson 05 – 89.7 / 91.3 F1 (generative lexical / reranked)
§ Petrov et al 06 – 90.7 F1 (generative unlexical)
§ McClosky et al 06 – 92.1 F1 (gen + rerank + self-train)
§ However:
§ Bilexical counts rarely make a difference (why?)
§ Gildea 01 – removing bilexical counts costs < 0.5 F1
Latent Variable PCFGs

The Game of Designing a Grammar
§ Annotation refines base treebank symbols to improve statistical fit of the grammar
§ Parent annotation [Johnson ’98]
§ Head lexicalization [Collins ’99, Charniak ’00]
§ Automatic clustering?

Latent Variable Grammars
[Figure: a parse tree and sentence paired with grammar parameters; one observed tree corresponds to many latent derivations]
Learning Latent Annotations
§ EM algorithm:
§ Brackets are known
§ Base categories are known
§ Only induce subcategories
§ Just like Forward-Backward for HMMs (forward/backward passes over the tree)

[Figure: tree with latent subcategories X1…X7 over “He was right .”]
Refinement of the DT tag
DT → DT-1, DT-2, DT-3, DT-4

Hierarchical refinement
Hierarchical Estimation Results
[Chart: parsing accuracy (F1) vs. total number of grammar symbols (100-1700), rising from ~74 to ~90]

Model                   F1
Flat Training           87.3
Hierarchical Training   88.4
Refinement of the “,” tag
§ Splitting all categories equally is wasteful:

Adaptive Splitting
§ Want to split complex categories more
§ Idea: split everything, roll back splits which were least useful

Adaptive Splitting Results
Model              F1
Previous           88.4
With 50% Merging   89.5
Number of Phrasal Subcategories
[Bar chart: subcategories per phrasal category (0-40); NP, VP, and PP receive the most splits; categories like INTJ, NAC, UCP, WHADVP, SBARQ, RRC, WHADJP, X, ROOT, LST receive few or none]
Number of Lexical Subcategories
[Bar chart: subcategories per POS tag (0-70); NNP, JJ, NNS, and NN receive the most splits; tags like FW, RBS, TO, $, UH, SYM, RP, LS, # receive few or none]
Learned Splits
§ Proper nouns (NNP):
NNP-14   Oct.   Nov.       Sept.
NNP-12   John   Robert     James
NNP-2    J.     E.         L.
NNP-1    Bush   Noriega    Peters
NNP-15   New    San        Wall
NNP-3    York   Francisco  Street
§ Personal pronouns (PRP):
PRP-0    It     He         I
PRP-1    it     he         they
PRP-2    it     them       him

Learned Splits
§ Relative adverbs (RBR):
RBR-0    further   lower    higher
RBR-1    more      less     More
RBR-2    earlier   Earlier  later
§ Cardinal numbers (CD):
CD-7     one       two      Three
CD-4     1989      1990     1988
CD-11    million   billion  trillion
CD-0     1         50       100
CD-3     1         30       31
CD-9     78        58       34
Final Results (Accuracy)

                                            ≤ 40 words F1   all F1
ENG   Charniak & Johnson ‘05 (generative)   90.1            89.6
      Split / Merge                         90.6            90.1
GER   Dubey ‘05                             76.3            –
      Split / Merge                         80.8            80.1
CHN   Chiang et al. ‘02                     80.0            76.6
      Split / Merge                         86.3            83.4

Still higher numbers from reranking / self-training methods
Efficient Parsing for Hierarchical Grammars

Coarse-to-Fine Inference
§ Example: PP attachment

Hierarchical Pruning
coarse:         … QP NP VP …
split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight: … … …

Bracket Posteriors
Parse times with successively refined pruning passes: 1621 min → 111 min → 35 min → 15 min (no search error)
Projection-Based A*
[Figure: the full lexicalized structure over “Factory payrolls fell in Sept.” (S:fell, VP:fell, NP:payrolls, PP:in) is scored via two projections: π_SYNTACTIC (the unlexicalized S, VP, NP, PP structure) and π_SEMANTIC (the head dependencies among payrolls, fell, in)]
A* Speedup
§ Total time dominated by calculation of A* tables in each projection … O(n³)
[Chart: time (sec) vs. sentence length (0-40) for the combined phase, dependency phase, and PCFG phase]
Breaking Up the Symbols
§ We can relax independence assumptions by encoding dependencies into the PCFG symbols:
§ Parent annotation [Johnson 98]
§ Marking possessive NPs
§ What are the most useful “features” to encode?
Other Tag Splits
§ UNARY-DT: mark demonstratives as DT^U (“the X” vs. “those”)
§ UNARY-RB: mark phrasal adverbs as RB^U (“quickly” vs. “very”)
§ TAG-PA: mark tags with non-canonical parents (“not” is an RB^VP)
§ SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
§ SPLIT-CC: separate “but” and “&” from other conjunctions
§ SPLIT-%: “%” gets its own tag.

Annotation   F1     Size
UNARY-DT     80.4   8.1K
UNARY-RB     80.5   8.1K
TAG-PA       81.2   8.5K
SPLIT-AUX    81.6   9.0K
SPLIT-CC     81.7   9.1K
SPLIT-%      81.8   9.3K