
Algorithms for NLP
Tagging / Parsing
Taylor Berg-Kirkpatrick, CMU
Slides: Dan Klein, UC Berkeley

So How Well Does It Work?
§ Choose the most common tag
  § 90.3% with a bad unknown word model
  § 93.7% with a good one
§ TnT (Brants, 2000):
  § A carefully smoothed trigram tagger
  § Suffix trees for emissions
  § 96.7% on WSJ text (SOTA is 97+%)
§ Noise in the data
  § Many errors in the training and test corpora
  § Probably about 2% guaranteed error from noise (on this data)

Examples of annotation inconsistency:
  NN NN NN    chief executive officer
  JJ NN NN    chief executive officer
  JJ JJ NN    chief executive officer
  NN JJ NN    chief executive officer

  DT  NN      IN NN        VBD     NNS   VBD
  The average of interbank offered rates plummeted …

Overview: Accuracies
§ Roadmap of (known / unknown) word accuracies:
  § Most freq tag:     ~90% / ~50%
  § Trigram HMM:       ~95% / ~55%
  § TnT (HMM++):       96.2% / 86.0%
  § Maxent P(t|w):     93.7% / 82.6%
  § MEMM tagger:       96.9% / 86.9%
  § State-of-the-art:  97+% / 89+%
  § Upper bound:       ~98%
Most errors are on unknown words.

Common Errors
§ Common errors [from Toutanova & Manning 00]:

  NN/JJ NN           official knowledge
  VBD RP/IN DT NN    made up the story
  RB VBD/VBN NNS     recently sold shares

Richer Features

Better Features
§ Can do surprisingly well just looking at a word by itself:
  § Word             the: the → DT
  § Lowercased word  Importantly: importantly → RB
  § Prefixes         unfathomable: un- → JJ
  § Suffixes         Surprisingly: -ly → RB
  § Capitalization   Meridian: CAP → NNP
  § Word shapes      35-year: d-x → JJ
§ Then build a maxent (or whatever) model to predict the tag
§ Maxent P(t|w): 93.7% / 82.6%
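As a concrete illustration of these word-level features, here is a minimal Python sketch; the feature-name strings and the run-collapsing word shape are assumptions for illustration, not the course's actual implementation.

  import re

  def word_features(word):
      # Word shape: digits -> d, lowercase -> x, uppercase -> X, runs collapsed,
      # so "35-year" becomes "d-x" and "Meridian" becomes "Xx".
      shape = re.sub(r"\d", "d", word)
      shape = re.sub(r"[a-z]", "x", shape)
      shape = re.sub(r"[A-Z]", "X", shape)
      shape = re.sub(r"(.)\1+", r"\1", shape)
      return {
          "word=" + word: 1.0,
          "lower=" + word.lower(): 1.0,
          "prefix2=" + word[:2]: 1.0,        # e.g. "un" for "unfathomable"
          "suffix2=" + word[-2:]: 1.0,       # e.g. "ly" for "Surprisingly"
          "capitalized": 1.0 if word[:1].isupper() else 0.0,
          "shape=" + shape: 1.0,
      }

  # These feature dictionaries would then feed a maxent classifier P(t | w).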

Why Local Context is Useful
§ Lots of rich local information!
§ We could fix this with a feature that looked at the next word
§ We could fix this by linking capitalized words to their lowercase versions
§ Solution: discriminative sequence models (MEMMs, CRFs)
§ Reality check:
  § Taggers are already pretty good on newswire text…
  § What the world needs is taggers that work on other text!

  PRP  VBD  IN RB   IN PRP VBD     .
  They left as soon as he  arrived .

  NNP       NNS   VBD      VBN        .
  Intrinsic flaws remained undetected .

Sequence-Free Tagging?
§ What about looking at a word and its environment, but no sequence information?
§ Add in previous / next word       the __
§ Previous / next word shapes       X __ X
§ Crude entity detection            __ ….. (Inc.|Co.)
§ Phrasal verb in sentence?         put …… __
§ Conjunctions of these things
§ All features except sequence: 96.6% / 86.8%
§ Uses lots of features: > 200K

Named Entity Recognition
§ Other sequence tasks use similar models
§ Example: named entity recognition (NER)

Local Context:
         Prev   Cur    Next
  State  ???    LOC    ???
  Word   at     Grace  Road
  Tag    IN     NNP    NNP
  Sig    x      Xx     Xx

Example sentence with labels:
  Tim Boon has signed a contract extension with Leicestershire which will keep him at Grace Road .
  PER PER O   O      O O        O         O    ORG            O     O    O    O   O  LOC   LOC  O

MEMM Taggers
§ Idea: left-to-right local decisions, condition on previous tags and also the entire input
§ Train up P(t_i | w, t_{i-1}, t_{i-2}) as a normal maxent model, then use it to score sequences
§ This is referred to as an MEMM tagger [Ratnaparkhi 96]
§ Beam search is effective! (Why?)
§ What about beam size 1?
§ Subtle issues with local normalization (cf. Lafferty et al 01)
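As a sketch of what left-to-right decoding with such a model looks like, here is a minimal beam search in Python. The function local_log_prob(tag, words, i, prev1, prev2) stands in for a trained maxent model P(t_i | w, t_{i-1}, t_{i-2}); its name and signature are assumptions for illustration. Beam size 1 recovers plain greedy decoding.

  def beam_decode(words, tagset, local_log_prob, beam_size=5):
      # Each hypothesis is (log score, tag sequence so far).
      beam = [(0.0, [])]
      for i in range(len(words)):
          candidates = []
          for score, tags in beam:
              prev1 = tags[-1] if tags else "<s>"
              prev2 = tags[-2] if len(tags) > 1 else "<s>"
              for tag in tagset:
                  s = score + local_log_prob(tag, words, i, prev1, prev2)
                  candidates.append((s, tags + [tag]))
          # Keep only the best few partial taggings (the beam).
          candidates.sort(key=lambda c: c[0], reverse=True)
          beam = candidates[:beam_size]
      return beam[0][1]   # best complete tag sequence found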

NER Features

Local Context:
         Prev    Cur    Next
  State  Other   LOC    ???
  Word   at      Grace  Road
  Tag    IN      NNP    NNP
  Sig    x       Xx     Xx

Feature weights for the word "Grace" in this context:

  Feature Type           Feature    PERS    LOC
  Previous word          at         -0.73   0.94
  Current word           Grace       0.03   0.00
  First char of word     G           0.45  -0.04
  Current POS tag        NNP         0.47   0.45
  Prev and cur tags      IN NNP     -0.10   0.14
  Previous state         Other      -0.70  -0.92
  Current signature      Xx          0.80   0.46
  Prev state, cur sig    O-Xx        0.68   0.37
  Prev-cur-next sig      x-Xx-Xx    -0.69   0.37
  P. state - p-cur sig   O-x-Xx     -0.20   0.82
  …
  Total:                            -0.58   2.68

Feature Weights
Because of the regularization term, the more common prefixes have larger weights even though entire-word features are more specific.

Decoding
§ Decoding MEMM taggers:
  § Just like decoding HMMs, but with different local scores
  § Viterbi, beam search, posterior decoding
§ Viterbi algorithm (HMMs):
§ Viterbi algorithm (MEMMs):
§ General:
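The recurrences on this slide are images in the source PDF; as a reconstruction of the standard forms, with \delta_i(t) the best score of any tag sequence ending in tag t at position i:

  \delta_i(t) = \max_{t'} \; \delta_{i-1}(t') \, P(t \mid t') \, P(w_i \mid t)    (HMM)
  \delta_i(t) = \max_{t'} \; \delta_{i-1}(t') \, P(t \mid t', w, i)               (MEMM)
  \delta_i(t) = \max_{t'} \; \delta_{i-1}(t') \, \phi_i(t', t, w)                 (general, for any local score \phi)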

Conditional Random Fields (and Friends)

Maximum Entropy II
§ Remember: the maximum entropy objective
§ Problem: lots of features allow a perfect fit to the training set
§ Regularization (compare to smoothing)

Derivative for Maximum Entropy
Callouts on the (pictured) derivative:
§ Big weights are bad
§ Total count of feature n in correct candidates
§ Expected count of feature n in predicted candidates
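The derivative itself is an image in the source; as a reconstruction of the standard L2-regularized maxent gradient that these callouts annotate:

  \frac{\partial}{\partial \lambda_n} \Big( \sum_i \log P(y_i \mid x_i; \lambda) - \sum_n \frac{\lambda_n^2}{2\sigma^2} \Big)
    = \underbrace{\sum_i f_n(x_i, y_i)}_{\text{actual count}}
      - \underbrace{\sum_i \sum_y P(y \mid x_i; \lambda) \, f_n(x_i, y)}_{\text{expected count}}
      - \underbrace{\frac{\lambda_n}{\sigma^2}}_{\text{regularization}}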

Perceptron Review

Perceptron
§ Linear model:
  § … that decompose along the sequence
  § … allow us to predict with the Viterbi algorithm
  § … which means we can train with the perceptron algorithm (or related updates, like MIRA)
[Collins 01]

Conditional Random Fields
§ Make a maxent model over entire taggings
  § MEMM
  § CRF

CRFs
§ Like any maxent model, the derivative is:
§ So all we need is to be able to compute the expectation of each feature (for example, the number of times the label pair DT-NN occurs, or the number of times NN-interest occurs) under the model distribution
§ Critical quantity: counts of posterior marginals:

Computing Posterior Marginals
§ How many (expected) times is word w tagged with s?
§ How to compute that marginal?

[Figure: the state lattice over tags ^, N, V, J, D, $ for "START Fed raises interest rates END"]

Global Discriminative Taggers
§ Newer, higher-powered discriminative sequence models
  § CRFs (also perceptrons, M3Ns)
  § Do not decompose training into independent local regions
  § Can be deathly slow to train: require repeated inference on the training set
§ Differences tend not to be too important for POS tagging
§ Differences are more substantial on other sequence tasks
§ However: one issue worth knowing about in local models
  § "Label bias" and other explaining-away effects
  § MEMM taggers' local scores can be near one without having both good "transitions" and "emissions"
  § This means that often evidence doesn't flow properly
  § Why isn't this a big deal for POS tagging?
  § Also: in decoding, condition on predicted, not gold, histories

Transformation-Based Learning
§ [Brill 95] presents a transformation-based tagger
§ Label the training set with the most frequent tags:

  DT  MD  VBD  VBD    .
  The can was  rusted .

§ Add transformation rules which reduce training mistakes:
  § MD → NN : DT __
  § VBD → VBN : VBD __ .
§ Stop when no transformation does sufficient good
§ Does this remind anyone of anything?
§ Probably the most widely used tagger (esp. outside NLP)
§ … but definitely not the most accurate: 96.6% / 82.0%
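A minimal sketch of this learning loop in Python, using only a single "previous tag" rule template; the data format and the template are illustrative assumptions, not Brill's full system.

  from collections import Counter

  def most_frequent_tags(tagged_sents):
      counts = {}
      for sent in tagged_sents:
          for word, tag in sent:
              counts.setdefault(word, Counter())[tag] += 1
      return {w: c.most_common(1)[0][0] for w, c in counts.items()}

  def apply_rule(tags, rule):
      # rule = (from_tag, to_tag, prev_tag): retag where the left neighbor matches
      from_tag, to_tag, prev_tag = rule
      out = list(tags)
      for i in range(1, len(tags)):
          if tags[i] == from_tag and tags[i - 1] == prev_tag:
              out[i] = to_tag
      return out

  def errors(tags, gold):
      return sum(t != g for t, g in zip(tags, gold))

  def learn_transformations(tagged_sents, max_rules=20):
      lexicon = most_frequent_tags(tagged_sents)
      gold = [[t for _, t in s] for s in tagged_sents]
      current = [[lexicon[w] for w, _ in s] for s in tagged_sents]
      rules = []
      for _ in range(max_rules):
          # Candidate rules: every (wrong tag -> gold tag : previous tag) seen at an error
          candidates = {(c[i], g[i], c[i - 1])
                        for c, g in zip(current, gold)
                        for i in range(1, len(c)) if c[i] != g[i]}
          best, best_gain = None, 0
          for rule in candidates:
              gain = sum(errors(c, g) - errors(apply_rule(c, rule), g)
                         for c, g in zip(current, gold))
              if gain > best_gain:
                  best, best_gain = rule, gain
          if best is None:          # stop when no transformation does sufficient good
              break
          rules.append(best)
          current = [apply_rule(c, best) for c in current]
      return rules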

Learned Transformations
§ What gets learned? [from Brill 95]

EngCG Tagger
§ English constraint grammar tagger
  § [Tapanainen and Voutilainen 94]
  § Something else you should know about
  § Hand-written and knowledge-driven
  § "Don't guess if you know" (a general point about modeling more structure!)
  § Its tag set doesn't make all of the hard distinctions that the standard tag set makes (e.g. JJ/NN)
§ They get stellar accuracies: 99% on their tag set
§ Linguistic representation matters…
§ … but it's easier to win when you make up the rules

Domain Effects
§ Accuracies degrade outside of domain
  § Up to triple the error rate
  § Usually make the most errors on the things you care about in the domain (e.g. protein names)
§ Open questions
  § How to effectively exploit unlabeled data from a new domain (what could we gain?)
  § How to best incorporate domain lexica in a principled way (e.g. the UMLS specialist lexicon, ontologies)

Unsupervised Tagging

Unsupervised Tagging?
§ AKA part-of-speech induction
§ Task:
  § Raw sentences in
  § Tagged sentences out
§ Obvious thing to do:
  § Start with a (mostly) uniform HMM
  § Run EM
  § Inspect results

EM for HMMs: Process
§ Alternate between recomputing distributions over hidden variables (the tags) and re-estimating parameters
§ Crucial step: we want to tally up how many (fractional) counts of each kind of transition and emission we have under current params:
§ Same quantities we needed to train a CRF!

EM for HMMs: Quantities
§ Total path values (correspond to probabilities here):

The State Lattice / Trellis
[Figure: the state lattice over tags ^, N, V, J, D, $ for "START Fed raises interest rates END"]

EM for HMMs: Process
§ From these quantities, we can compute expected transitions:
§ And emissions:
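The formulas on this slide are images in the source; as a reconstruction of the standard quantities, with \alpha and \beta the forward and backward scores and Z the total score of all paths:

  \mathbb{E}[\text{count}(t \to t')] = \sum_i \frac{\alpha_i(t) \, P(t' \mid t) \, P(w_{i+1} \mid t') \, \beta_{i+1}(t')}{Z}

  \mathbb{E}[\text{count}(t \to w)] = \sum_{i : w_i = w} \frac{\alpha_i(t) \, \beta_i(t)}{Z}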

Merialdo: Setup
§ Some (discouraging) experiments [Merialdo 94]
§ Setup:
  § You know the set of allowable tags for each word
  § Fix k training examples to their true labels
    § Learn P(w|t) on these examples
    § Learn P(t|t-1,t-2) on these examples
  § On n examples, re-estimate with EM
§ Note: we know allowed tags but not frequencies

Merialdo: Results

Distributional Clustering

Example words and their contexts:
  president   the __ of
  president   the __ said
  governor    the __ of
  governor    the __ appointed
  said        sources __ ¨
  said        president __ that
  reported    sources __ ¨

Words grouped by context: {president, governor}, {said, reported}, {the, a}

¨ the president said that the downturn was over ¨

[Finch and Chater 92, Schütze 93, many others]

Distributional Clustering
§ Three main variants on the same idea:
  § Pairwise similarities and heuristic clustering
    § E.g. [Finch and Chater 92]
    § Produces dendrograms
  § Vector space methods
    § E.g. [Schütze 93]
    § Models of ambiguity
  § Probabilistic methods
    § Various formulations, e.g. [Lee and Pereira 99]

Nearest Neighbors

Dendrograms

Vector Space Version
§ [Schütze 93] clusters words as points in R^n
§ Vectors too sparse, use SVD to reduce:

[Figure: the word-by-context count matrix M is factored by SVD as M ≈ U S V; each word w is then represented by its row of the reduced matrix]

Cluster these 50-200 dim vectors instead.
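A minimal numpy sketch of this reduction; the matrix layout (one row of context counts per word) is an assumption for illustration.

  import numpy as np

  def reduced_word_vectors(count_matrix, dim=100):
      # count_matrix: (num_words, num_contexts) co-occurrence counts, one row per word.
      U, S, Vt = np.linalg.svd(count_matrix, full_matrices=False)
      # Keep the top `dim` singular dimensions; cluster these dense vectors
      # (e.g. with k-means) instead of the sparse raw counts.
      return U[:, :dim] * S[:dim]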

A Probabilistic Version?

  P(S, C) = \prod_i P(c_i \mid c_{i-1}) \, P(w_i \mid c_i)

  P(S, C) = \prod_i P(c_i) \, P(w_i \mid c_i) \, P(w_{i-1}, w_{i+1} \mid c_i)

  ¨ the president said that the downturn was over ¨
     c1  c2        c3   c4   c5  c6       c7  c8

(The sentence appears twice on the slide, once per model, with a class variable c_i over each word.)

What Else?
§ Various newer ideas:
  § Context distributional clustering [Clark 00]
  § Morphology-driven models [Clark 03]
  § Contrastive estimation [Smith and Eisner 05]
  § Feature-rich induction [Haghighi and Klein 06]
§ Also:
  § What about ambiguous words?
  § Using wider context signatures has been used for learning synonyms (what's wrong with this approach?)
  § Can extend these ideas for grammar induction (later)

Computing Marginals

  (marginal of state s at position t) = (sum of all paths through s at t) / (sum of all paths)

Forward Scores

Backward Scores

Total Scores
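A minimal Python sketch of the forward-backward computation behind these slides; the HMM-style score tables trans and emit and the boundary symbols are assumptions for illustration.

  def forward_backward_marginals(words, states, trans, emit, start="^", stop="$"):
      n = len(words)
      alpha = [dict() for _ in range(n)]   # alpha[i][s]: sum of path scores into (i, s)
      beta = [dict() for _ in range(n)]    # beta[i][s]: sum of path scores out of (i, s)
      for s in states:
          alpha[0][s] = trans[start][s] * emit[s][words[0]]
      for i in range(1, n):
          for s in states:
              alpha[i][s] = emit[s][words[i]] * sum(
                  alpha[i - 1][sp] * trans[sp][s] for sp in states)
      for s in states:
          beta[n - 1][s] = trans[s][stop]
      for i in range(n - 2, -1, -1):
          for s in states:
              beta[i][s] = sum(
                  trans[s][sn] * emit[sn][words[i + 1]] * beta[i + 1][sn] for sn in states)
      Z = sum(alpha[n - 1][s] * beta[n - 1][s] for s in states)   # sum of all paths
      # Posterior marginal: (sum of paths through s at i) / (sum of all paths)
      return [{s: alpha[i][s] * beta[i][s] / Z for s in states} for i in range(n)]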

Syntax

Parse Trees

The move followed a round of similar increases by other lenders, reflecting a continuing decline in that market

Phrase Structure Parsing
§ Phrase structure parsing organizes syntax into constituents or brackets
§ In general, this involves nested trees
§ Linguists can, and do, argue about details
§ Lots of ambiguity
§ Not the only kind of syntax…

  new art critics write reviews with computers
[Figure: a constituency tree over this sentence with S, VP, NP, N', and PP nodes]

Constituency Tests
§ How do we know what nodes go in the tree?
§ Classic constituency tests:
  § Substitution by proform
  § Question answers
  § Semantic grounds
    § Coherence
    § Reference
    § Idioms
  § Dislocation
  § Conjunction
§ Cross-linguistic arguments, too

Conflicting Tests
§ Constituency isn't always clear
§ Units of transfer:
  § think about ~ penser à
  § talk about ~ hablar de
§ Phonological reduction:
  § I will go → I'll go
  § I want to go → I wanna go
  § à le centre → au centre
§ Coordination
  § He went to and came from the store.

  La vélocité des ondes sismiques

Classical NLP: Parsing
§ Write symbolic or logical rules:

  Grammar (CFG)        Lexicon
  ROOT → S             NN → interest
  S → NP VP            NNS → raises
  NP → DT NN           VBP → interest
  NP → NN NNS          VBZ → raises
  NP → NP PP
  VP → VBP NP
  VP → VBP NP PP
  PP → IN NP

§ Use deduction systems to prove parses from words
  § Minimal grammar on the "Fed raises" sentence: 36 parses
  § Simple 10-rule grammar: 592 parses
  § Real-size grammar: many millions of parses
§ This scaled very badly and didn't yield broad-coverage tools

Ambiguities

Ambiguities: PP Attachment

Attachments
§ I cleaned the dishes from dinner
§ I cleaned the dishes with detergent
§ I cleaned the dishes in my pajamas
§ I cleaned the dishes in the sink

Syntactic Ambiguities I
§ Prepositional phrases:
  They cooked the beans in the pot on the stove with handles.
§ Particle vs. preposition:
  The puppy tore up the staircase.
§ Complement structures:
  The tourists objected to the guide that they couldn't hear.
  She knows you like the back of her hand.
§ Gerund vs. participial adjective:
  Visiting relatives can be boring.
  Changing schedules frequently confused passengers.

Syntactic Ambiguities II
§ Modifier scope within NPs:
  impractical design requirements
  plastic cup holder
§ Multiple gap constructions:
  The chicken is ready to eat.
  The contractors are rich enough to sue.
§ Coordination scope:
  Small rats and mice can squeeze into holes or cracks in the wall.

Dark Ambiguities
§ Dark ambiguities: most analyses are shockingly bad (meaning, they don't have an interpretation you can get your mind around)
§ Unknown words and new usages
§ Solution: we need mechanisms to focus attention on the best ones; probabilistic techniques do this

This analysis corresponds to the correct parse of "This will panic buyers!"

Ambiguities as Trees

PCFGs

Probabilistic Context-Free Grammars
§ A context-free grammar is a tuple <N, T, S, R>
  § N : the set of non-terminals
    § Phrasal categories: S, NP, VP, ADJP, etc.
    § Parts-of-speech (pre-terminals): NN, JJ, DT, VB
  § T : the set of terminals (the words)
  § S : the start symbol
    § Often written as ROOT or TOP
    § Not usually the sentence non-terminal S
  § R : the set of rules
    § Of the form X → Y1 Y2 … Yk, with X, Yi ∈ N
    § Examples: S → NP VP, VP → VP CC VP
    § Also called rewrites, productions, or local trees
§ A PCFG adds:
  § A top-down production probability per rule, P(Y1 Y2 … Yk | X)

Treebank Sentences

Treebank Grammars
§ Need a PCFG for broad coverage parsing.
§ Can take a grammar right off the trees (doesn't work well):

  ROOT → S          1
  S → NP VP .       1
  NP → PRP          1
  VP → VBD ADJP     1
  …

§ Better results by enriching the grammar (e.g., lexicalization).
§ Can also get state-of-the-art parsers without lexicalization.

[Figure: example tree fragment with NP, PP, CONJ, DET, ADJ, and NOUN nodes]
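A minimal sketch of "taking the grammar right off the trees" in Python; the nested (label, children...) tuple tree format is an assumption for illustration, not the Penn Treebank file format.

  from collections import Counter, defaultdict

  def count_rules(tree, counts):
      label, children = tree[0], tree[1:]
      if len(children) == 1 and isinstance(children[0], str):
          counts[label][(children[0],)] += 1       # lexical rule, e.g. PRP -> "He"
          return
      counts[label][tuple(child[0] for child in children)] += 1   # e.g. S -> NP VP .
      for child in children:
          count_rules(child, counts)

  def treebank_pcfg(trees):
      counts = defaultdict(Counter)
      for tree in trees:
          count_rules(("ROOT", tree), counts)
      # Relative-frequency estimate: P(X -> rhs) = count(X -> rhs) / count(X)
      return {lhs: {rhs: c / sum(rhs_counts.values()) for rhs, c in rhs_counts.items()}
              for lhs, rhs_counts in counts.items()}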

Treebank Grammar Scale
§ Treebank grammars can be enormous
  § As FSAs, the raw grammar has ~10K states, excluding the lexicon
  § Better parsers usually make the grammars larger, not smaller

Chomsky Normal Form
§ Chomsky normal form:
  § All rules of the form X → Y Z or X → w
  § In principle, this is no limitation on the space of (P)CFGs
    § N-ary rules introduce new non-terminals
    § Unaries/empties are "promoted"
  § In practice it's kind of a pain:
    § Reconstructing n-aries is easy
    § Reconstructing unaries is trickier
    § The straightforward transformations don't preserve tree scores
  § Makes parsing algorithms simpler!

[Example: the flat rule VP → VBD NP PP PP is binarized using intermediate dotted symbols such as [VP → VBD NP •] and [VP → VBD NP PP •]]

CKY Parsing

A Recursive Parser
§ Will this parser work?
§ Why or why not?
§ Memory requirements?

  bestScore(X, i, j, s)
    if (j == i+1)
      return tagScore(X, s[i])
    else
      return max over k, X -> Y Z of
        score(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k, j, s)

A Memoized Parser
§ One small change:

  bestScore(X, i, j, s)
    if (scores[X][i][j] == null)
      if (j == i+1)
        score = tagScore(X, s[i])
      else
        score = max over k, X -> Y Z of
          score(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k, j, s)
      scores[X][i][j] = score
    return scores[X][i][j]

§ Can also organize things bottom-up

A Bottom-Up Parser (CKY)

  bestScore(s)
    for (i : [0, n-1])
      for (X : tags[s[i]])
        score[X][i][i+1] = tagScore(X, s[i])
    for (diff : [2, n])
      for (i : [0, n-diff])
        j = i + diff
        for (X -> Y Z : rule)
          for (k : [i+1, j-1])
            score[X][i][j] = max(score[X][i][j],
                                 score(X -> Y Z) * score[Y][i][k] * score[Z][k][j])

[Diagram: rule X → Y Z builds X over [i, j] from Y over [i, k] and Z over [k, j]]
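The pseudocode above is schematic; here is a minimal runnable version of the same bottom-up recurrence in Python. The grammar format (binary rules as (X, Y, Z, score) tuples, a tag-score dict for the lexicon) is an assumption for illustration.

  def cky(words, binary_rules, tag_scores):
      """Return best[(i, j)][X] = best score of an X over the span [i, j)."""
      n = len(words)
      best = {(i, i + 1): dict(tag_scores.get(words[i], {})) for i in range(n)}
      for diff in range(2, n + 1):
          for i in range(n - diff + 1):
              j = i + diff
              cell = {}
              for k in range(i + 1, j):
                  left, right = best[(i, k)], best[(k, j)]
                  for X, Y, Z, rule_score in binary_rules:
                      if Y in left and Z in right:
                          score = rule_score * left[Y] * right[Z]
                          if score > cell.get(X, 0.0):
                              cell[X] = score
              best[(i, j)] = cell
      return best

  # Toy usage: cky(["critics", "write", "reviews"],
  #                [("S", "NP", "VP", 1.0), ("VP", "V", "NP", 1.0)],
  #                {"critics": {"NP": 0.5}, "write": {"V": 1.0}, "reviews": {"NP": 0.5}})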

Unary Rules
§ Unary rules?

  bestScore(X, i, j, s)
    if (j == i+1)
      return tagScore(X, s[i])
    else
      return max of:
        max over k, X -> Y Z of
          score(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k, j, s)
        max over X -> Y of
          score(X -> Y) * bestScore(Y, i, j, s)

CNF + Unary Closure
§ We need unaries to be non-cyclic
  § Can address by pre-calculating the unary closure
  § Rather than having zero or more unaries, always have exactly one
§ Alternate unary and binary layers
§ Reconstruct unary chains afterwards

[Example trees: a VP → VBD NP structure (with NP → DT NN) and an S / SBAR / VP unary chain, shown before and after the transform]

Alternating Layers

  bestScoreU(X, i, j, s)
    if (j == i+1)
      return tagScore(X, s[i])
    else
      return max over X -> Y of
        score(X -> Y) * bestScoreB(Y, i, j, s)

  bestScoreB(X, i, j, s)
    return max over k, X -> Y Z of
      score(X -> Y Z) * bestScoreU(Y, i, k, s) * bestScoreU(Z, k, j, s)

Analysis

Memory
§ How much memory does this require?
  § Have to store the score cache
  § Cache size: |symbols| * n^2 doubles
  § For the plain treebank grammar:
    § X ~ 20K, n = 40, double ~ 8 bytes = ~256 MB
    § Big, but workable.
§ Pruning: Beams
  § score[X][i][j] can get too large (when?)
  § Can keep beams (truncated maps score[i][j]) which only store the best few scores for the span [i, j]
§ Pruning: Coarse-to-Fine
  § Use a smaller grammar to rule out most X[i,j]
  § Much more on this later…

Time: Theory
§ How much time will it take to parse?
  § For each diff (<= n)
    § For each i (<= n)
      § For each rule X → Y Z
        § For each split point k: do constant work
§ Total time: |rules| * n^3
§ Something like 5 sec for an unoptimized parse of a 20-word sentence

[Diagram: rule X → Y Z builds X over [i, j] from Y over [i, k] and Z over [k, j]]

Time: Practice
§ Parsing with the vanilla treebank grammar (~20K rules, not an optimized parser!):
  [Plot: parse time vs. sentence length; observed exponent: 3.6]
§ Why is it worse in practice?
  § Longer sentences "unlock" more of the grammar
  § All kinds of systems issues don't scale

Same-Span Reachability
[Figure: which nonterminals can be reached over the same span. One group: ADJP, ADVP, FRAG, INTJ, NP, PP, PRN, QP, S, SBAR, UCP, VP, WHNP. Shown separately: TOP, LST, CONJP, WHADJP, WHADVP, WHPP, NX, NAC, SBARQ, SINV, RRC, SQ, X, PRT]

Rule State Reachability
§ Many states are more likely to match larger spans!

[Figure: alignments of dotted-rule states. "Example: NP CC •": the NP covers [0, n-1] and the CC covers [n-1, n], giving 1 alignment. "Example: NP CC NP •": with the final NP of width k, the NP covers [0, n-k-1] and the final NP covers [n-k, n], giving n alignments]

Efficient CKY
§ Lots of tricks to make CKY efficient
§ Some of them are little engineering details:
  § E.g., first choose k, then enumerate through the Y:[i,k] which are non-zero, then loop through rules by left child.
  § Optimal layout of the dynamic program depends on grammar, input, even system details.
§ Another kind is more important (and interesting):
  § Many X[i,j] can be suppressed on the basis of the input string
  § We'll see this next class as figures-of-merit, A* heuristics, coarse-to-fine, etc.

Agenda-Based Parsing

Agenda-Based Parsing
§ Agenda-based parsing is like graph search (but over a hypergraph)
§ Concepts:
  § Numbering: we number fenceposts between words
  § "Edges" or items: spans with labels, e.g. PP[3,5], represent the sets of trees over those words rooted at that label (cf. search states)
  § A chart: records edges we've expanded (cf. closed set)
  § An agenda: a queue which holds edges (cf. a fringe or open set)

  0       1     2       3    4         5
   critics write reviews with computers
                         PP[3,5]

Word Items
§ Building an item for the first time is called discovery. Items go onto the agenda on discovery.
§ To initialize, we discover all word items (with score 1.0).

  0       1     2       3    4         5
   critics write reviews with computers

  AGENDA: critics[0,1], write[1,2], reviews[2,3], with[3,4], computers[4,5]
  CHART:  [EMPTY]

Unary Projection
§ When we pop a word item, the lexicon tells us the tag item successors (and scores), which go onto the agenda

  0       1     2       3    4         5
   critics write reviews with computers

  CHART:  critics[0,1], write[1,2], reviews[2,3], with[3,4], computers[4,5]
  AGENDA: NNS[0,1], VBP[1,2], NNS[2,3], IN[3,4], NNS[4,5]

Item Successors
§ When we pop items off of the agenda:
  § Graph successors: unary projections (NNS → critics, NP → NNS)
    § Y[i,j] with X → Y forms X[i,j]
  § Hypergraph successors: combine with items already in our chart
    § Y[i,j] and Z[j,k] with X → Y Z form X[i,k]
  § Enqueue / promote resulting items (if not in chart already)
  § Record backtraces as appropriate
  § Stick the popped edge in the chart (closed set)
§ Queries a chart must support:
  § Is edge X[i,j] in the chart? (What score?)
  § What edges with label Y end at position j?
  § What edges with label Z start at position i?
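A minimal sketch of the whole discovery loop in Python (recognition only, without scores or backtraces); the grammar accessor functions are assumptions for illustration.

  from collections import deque

  def agenda_parse(words, tags_for, unary_parents, rules_with_left, rules_with_right):
      """
      tags_for(word)       -> tags T, giving the initial word/tag items T[i, i+1]
      unary_parents(Y)     -> all X with a rule X -> Y
      rules_with_left(Y)   -> all (X, Z) with a rule X -> Y Z
      rules_with_right(Z)  -> all (X, Y) with a rule X -> Y Z
      Returns the chart: the set of edges (label, i, j) that can be built.
      """
      agenda = deque()
      chart = set()
      starts_at, ends_at = {}, {}             # chart indices for the two span queries
      for i, w in enumerate(words):           # discovery of word items
          for t in tags_for(w):
              agenda.append((t, i, i + 1))
      while agenda:
          edge = agenda.popleft()
          if edge in chart:
              continue
          chart.add(edge)                     # popped edge goes in the chart (closed set)
          label, i, j = edge
          starts_at.setdefault(i, []).append(edge)
          ends_at.setdefault(j, []).append(edge)
          for X in unary_parents(label):      # graph successors: X -> label
              agenda.append((X, i, j))
          for X, Z in rules_with_left(label):       # X -> label Z: need some Z[j, k]
              for (lab, _, k) in starts_at.get(j, []):
                  if lab == Z:
                      agenda.append((X, i, k))
          for X, Y in rules_with_right(label):      # X -> Y label: need some Y[h, i]
              for (lab, h, _) in ends_at.get(i, []):
                  if lab == Y:
                      agenda.append((X, h, j))
      return chart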

An Example

  0       1     2       3    4         5
   critics write reviews with computers
   NNS    VBP   NNS     IN   NNS

[Trace of edges discovered, in roughly the order popped: NNS[0,1], VBP[1,2], NNS[2,3], IN[3,4], NNS[4,5]; NP[0,1], NP[2,3], NP[4,5]; VP[1,2], S[0,2]; PP[3,5]; VP[1,3]; ROOT[0,2]; S[0,3], VP[1,5]; NP[2,5]; ROOT[0,3], S[0,5], ROOT[0,5]]

Empty Elements
§ Sometimes we want to posit nodes in a parse tree that don't contain any pronounced words:

  I want you to parse this sentence
  I want [ ] to parse this sentence

§ These are easy to add to an agenda-based parser!
  § For each position i, add the "word" edge e[i,i]
  § Add rules like NP → e to the grammar
  § That's it!

  0 1    2  3     4       5
   I like to parse empties
  (an empty edge e at every fencepost)

UCS / A*
§ With weighted edges, order matters
  § Must expand the optimal parse from the bottom up (subparses first)
  § CKY does this by processing smaller spans before larger ones
  § UCS pops items off the agenda in order of decreasing Viterbi score
  § A* search is also well defined
§ You can also speed up the search without sacrificing optimality
  § Can select which items to process first
  § Can do with any "figure of merit" [Charniak 98]
  § If your figure-of-merit is a valid A* heuristic, no loss of optimality [Klein and Manning 03]

(Speech) Lattices
§ There was nothing magical about words spanning exactly one position.
§ When working with speech, we generally don't know how many words there are, or where they break.
§ We can represent the possibilities as a lattice and parse these just as easily.

[Lattice figure with word arcs: I, awe, of, van, eyes, saw, a, 've, an, Ivan]

Learning PCFGs

Treebank PCFGs
§ Use PCFGs for broad coverage parsing
§ Can take a grammar right off the trees (doesn't work well):

  ROOT → S          1
  S → NP VP .       1
  NP → PRP          1
  VP → VBD ADJP     1
  …

  Model       F1
  Baseline    72.0

[Charniak 96]

Conditional Independence?
§ Not every NP expansion can fill every NP slot
§ A grammar with symbols like "NP" won't be context-free
§ Statistically, conditional independence is too strong

Non-Independence
§ Independence assumptions are often too strong.
§ Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects).
§ Also: the subject and object expansions are correlated!

  Expansion   All NPs   NPs under S   NPs under VP
  NP PP       11%       9%            23%
  DT NN       9%        9%            7%
  PRP         6%        21%           4%

Grammar Refinement
§ Example: PP attachment

Grammar Refinement
§ Structure Annotation [Johnson '98, Klein & Manning '03]
§ Lexicalization [Collins '99, Charniak '00]
§ Latent Variables [Matsuzaki et al. '05, Petrov et al. '06]

Structural Annotation

The Game of Designing a Grammar
§ Annotation refines base treebank symbols to improve statistical fit of the grammar
§ Structural annotation

Typical Experimental Setup
§ Corpus: Penn Treebank, WSJ
  § Training: sections 02-21
  § Development: section 22 (here, the first 20 files)
  § Test: section 23
§ Accuracy: F1, the harmonic mean of per-node labeled precision and recall
§ Here: also size, the number of symbols in the grammar

Vertical Markovization
§ Vertical Markov order: rewrites depend on the past k ancestor nodes (cf. parent annotation).

[Figure: example trees for Order 1 vs. Order 2; charts of parsing accuracy (roughly 72%-79% F1) and grammar size (up to ~25,000 symbols) as the vertical Markov order varies over 1, 2v, 2, 3v, 3]
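A minimal sketch of order-2 vertical markovization (parent annotation) in Python; the (label, children...) tuple tree format is an assumption for illustration, and preterminals are left unannotated here.

  def parent_annotate(tree, parent="ROOT"):
      label, children = tree[0], tree[1:]
      if len(children) == 1 and isinstance(children[0], str):
          return tree                       # preterminal over a word: leave as-is
      annotated = tuple(parent_annotate(c, parent=label) for c in children)
      return (f"{label}^{parent}",) + annotated

  # ("S", ("NP", ("PRP", "He")), ("VP", ("VBD", "was"), ("ADJP", ("JJ", "right"))))
  # becomes
  # ("S^ROOT", ("NP^S", ("PRP", "He")), ("VP^S", ("VBD", "was"), ("ADJP^VP", ("JJ", "right"))))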

Horizontal Markovization

[Figure: example binarized rules for Order 1 vs. Order ∞; charts of parsing accuracy (roughly 70%-74% F1) and grammar size (up to ~12,000 symbols) as the horizontal Markov order varies over 0, 1, 2v, 2, ∞]

Unary Splits
§ Problem: unary rewrites are used to transmute categories so a high-probability rule can be used.
§ Solution: mark unary rewrite sites with -U

  Annotation   F1     Size
  Base         77.8   7.5K
  UNARY        78.3   8.0K

Tag Splits
§ Problem: Treebank tags are too coarse.
§ Example: sentential, PP, and other prepositions are all marked IN.
§ Partial solution: subdivide the IN tag.

  Annotation   F1     Size
  Previous     78.3   8.0K
  SPLIT-IN     80.3   8.1K

A Fully Annotated (Unlex) Tree

Some Test Set Results

  Parser          LP     LR     F1     CB     0 CB
  Magerman 95     84.9   84.6   84.7   1.26   56.6
  Collins 96      86.3   85.8   86.0   1.14   59.9
  Unlexicalized   86.9   85.7   86.3   1.10   60.3
  Charniak 97     87.4   87.5   87.4   1.00   62.1
  Collins 99      88.7   88.6   88.6   0.90   67.1

§ Beats "first generation" lexicalized parsers.
§ Lots of room to improve: more complex models next.

Efficient Parsing for Structural Annotation

Grammar Projections

  Coarse Grammar: NP → DT N'
  Fine Grammar:   NP^S → DT^NP N'[…DT]^NP

Note: X-Bar grammars are projections with rules like XP → Y X' or XP → X' Y or X' → X

Coarse-to-Fine Pruning
For each coarse chart item X[i,j], compute a posterior probability:

  coarse:   … QP NP VP …
  refined:  (each coarse symbol stands for a set of refined symbols)

E.g. consider the span 5 to 12: refined items whose coarse posterior is < threshold are pruned.
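The pruning criterion on this slide is an image in the source; as a reconstruction of the standard coarse-to-fine rule, the posterior of a coarse item is its outside score times its inside score, normalized by the total sentence score, and refined items are pruned when their coarse projection falls below a fixed threshold:

  P(X, i, j \mid s) = \frac{\text{outside}(X, i, j) \cdot \text{inside}(X, i, j)}{\text{inside}(\text{ROOT}, 0, n)} < \text{threshold} \;\;\Rightarrow\;\; \text{prune all refined } X_k[i, j]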

Computing (Max-)Marginals

Inside and Outside Scores

Pruning with A*
§ You can also speed up the search without sacrificing optimality
§ For agenda-based parsers:
  § Can select which items to process first
  § Can do with any "figure of merit" [Charniak 98]
  § If your figure-of-merit is a valid A* heuristic, no loss of optimality [Klein and Manning 03]

A* Parsing

Lexicalization

The Game of Designing a Grammar
§ Annotation refines base treebank symbols to improve statistical fit of the grammar
§ Structural annotation [Johnson '98, Klein and Manning '03]
§ Head lexicalization [Collins '99, Charniak '00]

Problems with PCFGs
§ If we do no annotation, these trees differ only in one rule:
  § VP → VP PP
  § NP → NP PP
§ The parse will go one way or the other, regardless of words
§ We addressed this in one way with unlexicalized grammars (how?)
§ Lexicalization allows us to be sensitive to specific words

Problems with PCFGs
§ What's different between the basic PCFG scores here?
§ What (lexical) correlations need to be scored?

Lexicalized Trees
§ Add "head words" to each phrasal node
  § Syntactic vs. semantic heads
  § Headship not in (most) treebanks
  § Usually use head rules, e.g.:
    § NP:
      § Take leftmost NP
      § Take rightmost N*
      § Take rightmost JJ
      § Take right child
    § VP:
      § Take leftmost VB*
      § Take leftmost VP
      § Take left child

Lexicalized PCFGs?
§ Problem: we now have to estimate probabilities like
§ Never going to get these atomically off of a treebank
§ Solution: break up the derivation into smaller steps

Lexical Derivation Steps
§ A derivation of a local tree [Collins 99]:
  § Choose a head tag and word
  § Choose a complement bag
  § Generate children (incl. adjuncts)
  § Recursively derive children

Lexicalized CKY

  bestScore(X, i, j, h)
    if (j == i+1)
      return tagScore(X, s[i])
    else
      return max of:
        max over k, h', X[h] -> Y[h] Z[h'] of
          score(X[h] -> Y[h] Z[h']) * bestScore(Y, i, k, h) * bestScore(Z, k, j, h')
        max over k, h', X[h] -> Y[h'] Z[h] of
          score(X[h] -> Y[h'] Z[h]) * bestScore(Y, i, k, h') * bestScore(Z, k, j, h)

[Diagram: X[h] over [i, j] built from Y[h] over [i, k] and Z[h'] over [k, j]]
[Example: (VP -> VBD •)[saw] combines with NP[her] to form (VP -> VBD...NP •)[saw]]

Efficient Parsing for Lexical Grammars

Quartic Parsing
§ Turns out, you can do (a little) better [Eisner 99]
§ Gives an O(n^4) algorithm
§ Still prohibitive in practice if not pruned

[Diagram: the naive combination enumerates both heads (X[h] from Y[h] and Z[h'] over positions i, h, k, h', j), while the factored version drops one head choice (X[h] from Y[h] and Z over positions i, h, k, j)]

Pruning with Beams
§ The Collins parser prunes with per-cell beams [Collins 99]
  § Essentially, run the O(n^5) CKY
  § Remember only a few hypotheses for each span <i,j>
  § If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  § Keeps things more or less cubic (and in practice is more like linear!)
§ Also: certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Pruning with a PCFG
§ The Charniak parser prunes using a two-pass, coarse-to-fine approach [Charniak 97+]
  § First, parse with the base grammar
  § For each X:[i,j], calculate P(X | i, j, s)
    § This isn't trivial, and there are clever speedups
  § Second, do the full O(n^5) CKY
    § Skip any X:[i,j] which had low (say, < 0.0001) posterior
  § Avoids almost all work in the second phase!
§ Charniak et al 06: can use more passes
§ Petrov et al 07: can use many more passes

Results
§ Some results:
  § Collins 99: 88.6 F1 (generative lexical)
  § Charniak and Johnson 05: 89.7 / 91.3 F1 (generative lexical / reranked)
  § Petrov et al 06: 90.7 F1 (generative unlexicalized)
  § McClosky et al 06: 92.1 F1 (gen + rerank + self-train)
§ However:
  § Bilexical counts rarely make a difference (why?)
  § Gildea 01: removing bilexical counts costs < 0.5 F1

Latent Variable PCFGs

The Game of Designing a Grammar
§ Annotation refines base treebank symbols to improve statistical fit of the grammar
  § Parent annotation [Johnson '98]
  § Head lexicalization [Collins '99, Charniak '00]
  § Automatic clustering?

Latent Variable Grammars

[Figure: a parse tree and sentence, its derivations over split subcategories, and the grammar parameters]

Learning Latent Annotations
EM algorithm:
§ Brackets are known
§ Base categories are known
§ Only induce subcategories

[Figure: a tree with latent subcategory variables X1 … X7 over "He was right .", with forward and backward scores]
Just like Forward-Backward for HMMs.

Refinement of the DT tag

[Figure: DT split into subcategories DT-1, DT-2, DT-3, DT-4]

Hierarchical refinement

Hierarchical Estimation Results

[Plot: parsing accuracy (F1, roughly 74-90) vs. total number of grammar symbols (100-1700)]

  Model                   F1
  Flat Training           87.3
  Hierarchical Training   88.4

Refinement of the , tag
§ Splitting all categories equally is wasteful:

Adaptive Splitting
§ Want to split complex categories more
§ Idea: split everything, roll back the splits which were least useful

Adaptive Splitting Results

  Model              F1
  Previous           88.4
  With 50% Merging   89.5

Number of Phrasal Subcategories
[Bar chart: number of learned subcategories (0-40) per phrasal category; NP, VP, and PP receive the most splits, while categories such as INTJ, SBAR, RRC, WHADJP, X, ROOT, and LST receive few]

Number of Lexical Subcategories
[Bar chart: number of learned subcategories (0-70) per POS tag; NNP, JJ, NNS, NN, VBN, and RB receive the most splits, while tags such as UH, SYM, RP, LS, and # receive few]

Learned Splits
§ Proper nouns (NNP):

  NNP-14   Oct.    Nov.        Sept.
  NNP-12   John    Robert      James
  NNP-2    J.      E.          L.
  NNP-1    Bush    Noriega     Peters
  NNP-15   New     San         Wall
  NNP-3    York    Francisco   Street

§ Personal pronouns (PRP):

  PRP-0    It      He          I
  PRP-1    it      he          they
  PRP-2    it      them        him

§ Relative adverbs (RBR):

  RBR-0    further   lower     higher
  RBR-1    more      less      More
  RBR-2    earlier   Earlier   later

§ Cardinal numbers (CD):

  CD-7     one       two       Three
  CD-4     1989      1990      1988
  CD-11    million   billion   trillion
  CD-0     1         50        100
  CD-3     1         30        31
  CD-9     78        58        34

Final Results (Accuracy)

                                                  ≤ 40 words F1   all F1
  ENG   Charniak & Johnson '05 (generative)       90.1            89.6
        Split / Merge                             90.6            90.1
  GER   Dubey '05                                 76.3            -
        Split / Merge                             80.8            80.1
  CHN   Chiang et al. '02                         80.0            76.6
        Split / Merge                             86.3            83.4

Still higher numbers from reranking / self-training methods.

Efficient Parsing for Hierarchical Grammars

Coarse-to-Fine Inference
§ Example: PP attachment

Hierarchical Pruning

  coarse:          … QP NP VP …
  split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
  split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
  split in eight:  … (and so on) …

Bracket Posteriors
[Figure: bracket posterior heat maps at successive coarse-to-fine stages; quoted parsing times of 1621 min, 111 min, 35 min, and 15 min (no search error)]


Projection-Based A*

[Figure: the lexicalized parse of "Factory payrolls fell in Sept." (S:fell over NP:payrolls, VP:fell, PP:in) and its two projections: a SYNTACTIC projection (the plain S, NP, VP, PP tree over the words) and a SEMANTIC projection (the head-word dependencies among payrolls, fell, in, and Sept.)]

A* Speedup
§ Total time dominated by calculation of A* tables in each projection … O(n^3)

[Plot: parsing time (sec, 0-60) vs. sentence length (0-40) for the PCFG phase, the dependency phase, and the combined phase]

Breaking Up the Symbols
§ We can relax independence assumptions by encoding dependencies into the PCFG symbols:
  § Parent annotation [Johnson 98]
  § Marking possessive NPs
§ What are the most useful "features" to encode?

Other Tag Splits
§ UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
§ UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
§ TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
§ SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
§ SPLIT-CC: separate "but" and "&" from other conjunctions
§ SPLIT-%: "%" gets its own tag

  Annotation   F1     Size
  UNARY-DT     80.4   8.1K
  UNARY-RB     80.5   8.1K
  TAG-PA       81.2   8.5K
  SPLIT-AUX    81.6   9.0K
  SPLIT-CC     81.7   9.1K
  SPLIT-%      81.8   9.3K
