Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning and Richard Socher
Lecture 6: Dependency Parsing
Lecture Plan
1. Syntactic Structure: Consistency and Dependency
2. Dependency Grammar
3. Research highlight
4. Transition-based dependency parsing
5. Neural dependency parsing

Reminders/comments:
• Assignment 1 due tonight :)
• Assignment 2 out today :(
• Final project discussion – come meet with us
• Sorry that we've been kind of overloaded for office hours
1. Two views of linguistic structure: Constituency = phrase structure grammar = context-free grammars (CFGs)

Phrase structure organizes words into nested constituents.

Example fragments: the cat; a dog; large; in a crate; barking; on the table; cuddly; by the door
Further fragments: large, barking; talk to, look for
Two views of linguistic structure: Dependency structure

• Dependency structure shows which words depend on (modify or are arguments of) which other words.

Look for the large barking dog by the door in a crate
Ambiguity: PP attachments

Scientists study whales from space
Attachment ambiguities

• A key parsing decision is how we 'attach' various constituents
  • PPs, adverbial or participial phrases, infinitives, coordinations, etc.
• Catalan numbers: Cn = (2n)!/[(n+1)!n!]
• An exponentially growing series, which arises in many tree-like contexts:
  • E.g., the number of possible triangulations of a polygon with n+2 sides
• Turns up in triangulation of probabilistic graphical models….
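As a quick illustration (not from the original slides), the formula can be checked directly in Python:

```python
from math import comb

# Catalan numbers: C_n = (2n)! / ((n+1)! n!) = C(2n, n) / (n+1)
def catalan(n):
    return comb(2 * n, n) // (n + 1)

print([catalan(n) for n in range(8)])  # [1, 1, 2, 5, 14, 42, 132, 429]
```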
The rise of annotated data: Universal Dependencies treebanks
[Universal Dependencies: http://universaldependencies.org/; cf. Marcus et al. 1993, The Penn Treebank, Computational Linguistics]
The rise of annotated data

Starting off, building a treebank seems a lot slower and less useful than building a grammar.

But a treebank gives us many things:
• Reusability of the labor
  • Many parsers, part-of-speech taggers, etc. can be built on it
  • Valuable resource for linguistics
• Broad coverage, not just a few intuitions
• Frequencies and distributional information
• A way to evaluate systems
2. Dependency Grammar and Dependency Structure

Dependency syntax postulates that syntactic structure consists of relations between lexical items, normally binary asymmetric relations ("arrows") called dependencies
[Dependency tree for: "Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas"]
Dependency Grammar and Dependency Structure

The arrows are commonly typed with the name of grammatical relations (subject, prepositional object, apposition, etc.)
[The same dependency tree, now with typed arcs: nsubj:pass, aux, obl, case, nmod, conj, cc, flat, appos]
Dependency Grammar and Dependency Structure

The arrow connects a head (governor, superior, regent) with a dependent (modifier, inferior, subordinate)

Usually, dependencies form a tree (connected, acyclic, single-head)
Pāṇini’s grammar (c. 5th century BCE)
Gallery: http://wellcomeimages.org/indexplus/image/L0032691.html, CC BY 4.0. File: Birch bark MS from Kashmir of the Rupavatra, Wellcome L0032691.jpg
Dependency Grammar/Parsing History

• The idea of dependency structure goes back a long way
  • To Pāṇini's grammar (c. 5th century BCE)
  • Basic approach of 1st millennium Arabic grammarians
• Constituency/context-free grammar is a new-fangled invention
  • 20th century invention (R. S. Wells, 1947)
• Modern dependency work is often linked to the work of L. Tesnière (1959)
  • Was the dominant approach in the "East" (Russia, China, …)
  • Good for free-er word order languages
• Among the earliest kinds of parsers in NLP, even in the US:
  • David Hays, one of the founders of U.S. computational linguistics, built an early (first?) dependency parser (Hays 1962)
Dependency Grammar and Dependency Structure

• Some people draw the arrows one way; some the other way!
  • Tesnière had them point from head to dependent…
• Usually add a fake ROOT so every word is a dependent of precisely 1 other node

ROOT Discussion of the outstanding issues was completed.
Dependency Conditioning Preferences

What are the sources of information for dependency parsing?
1. Bilexical affinities: [discussion → issues] is plausible
2. Dependency distance: mostly with nearby words
3. Intervening material: dependencies rarely span intervening verbs or punctuation
4. Valency of heads: how many dependents on which side are usual for a head?

ROOT Discussion of the outstanding issues was completed.
Dependency Parsing

• A sentence is parsed by choosing for each word what other word (including ROOT) it is a dependent of
• Usually some constraints:
  • Only one word is a dependent of ROOT
  • Don't want cycles A → B, B → A
• This makes the dependencies a tree
• Final issue is whether arrows can cross (be non-projective) or not

[Example with a crossing arc: ROOT I 'll give a talk tomorrow on bootstrapping]
Methods of Dependency Parsing

1. Dynamic programming: Eisner (1996) gives a clever algorithm with complexity O(n³), by producing parse items with heads at the ends rather than in the middle
2. Graph algorithms: you create a Minimum Spanning Tree for a sentence. McDonald et al.'s (2005) MSTParser scores dependencies independently using an ML classifier (he uses MIRA, for online learning, but it can be something else)
3. Constraint Satisfaction: edges are eliminated that don't satisfy hard constraints. Karlsson (1990), etc.
4. "Transition-based parsing" or "deterministic dependency parsing": greedy choice of attachments guided by good machine learning classifiers. MaltParser (Nivre et al. 2008). Has proven highly effective.
Improving Distributional Similarity with Lessons Learned from Word Embeddings
Omer Levy, Yoav Goldberg, Ido Dagan
Presented by: Ajay Sohmshetty
Word Representation Methods

Count-based distributional models:
● SVD (Singular Value Decomposition)
● PPMI (Positive Pointwise Mutual Information)

Neural network-based models:
● SGNS (Skip-Gram with Negative Sampling) / CBOW
● GloVe

Conventional wisdom: neural network-based models > count-based models.

Levy et al.: hyperparameters and system design choices are more important than the embedding algorithms themselves.
Hyperparameters in Skip-Gram

● # of negative samples
● Unigram distribution smoothing exponent
● Context Distribution Smoothing
● Shifted PMI

→ These can be transferred over to the count-based variants.
All Transferable Hyperparameters

Hyperparameter                  Explored Values        Applicable Methods
Window                          2, 5, 10               All
Dynamic Context Window          None, with             All
Subsampling                     None, dirty, clean     All
Deleting Rare Words             None, with             All
Shifted PMI                     1, 5, 15               PPMI, SVD, SGNS
Context Distribution Smoothing  1, 0.75                PPMI, SVD, SGNS
Adding Context Vectors          Only w, w+c            SVD, SGNS, GloVe
Eigenvalue Weighting            0, 0.5, 1              SVD
Vector Normalization            None, row, col, both   All

(These fall into three groups: preprocessing, association-metric, and postprocessing hyperparameters.)
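To make the transfer concrete, here is a minimal sketch (our own illustration, not code from the paper) of a PPMI matrix with two of the transferred hyperparameters: context distribution smoothing (cds = 0.75) and shifting by log k, where k plays the role of SGNS's number of negative samples:

```python
import numpy as np

def shifted_ppmi(counts, cds=0.75, k=5):
    """Sketch: counts is a (words x contexts) co-occurrence matrix,
    assumed to have no all-zero rows or columns."""
    total = counts.sum()
    p_wc = counts / total                           # joint probabilities
    p_w = counts.sum(axis=1, keepdims=True) / total
    smoothed = counts.sum(axis=0) ** cds            # context distribution smoothing
    p_c = smoothed / smoothed.sum()
    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    return np.maximum(pmi - np.log(k), 0.0)         # shift by log k, then clip at 0
```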
Results: [charts of performance on word similarity tasks and analogy tasks]
Key Takeaways

• This paper challenges the conventional wisdom that neural network-based models are superior to count-based models.
• While model design is important, hyperparameters are also KEY for achieving reasonable results. Don't discount their importance!
• Challenge the status quo!
4. Greedy transition-based parsing [Nivre 2003]

• A simple form of greedy discriminative dependency parser
• The parser does a sequence of bottom-up actions
  • Roughly like "shift" or "reduce" in a shift-reduce parser, but the "reduce" actions are specialized to create dependencies with head on left or right
• The parser has:
  • a stack σ, written with top to the right
    • which starts with the ROOT symbol
  • a buffer β, written with top to the left
    • which starts with the input sentence
  • a set of dependency arcs A
    • which starts off empty
  • a set of actions
Basic transition-based dependency parser

Start: σ = [ROOT], β = w1, …, wn, A = ∅
1. Shift: σ, wi|β, A ⇒ σ|wi, β, A
2. Left-Arc_r: σ|wi|wj, β, A ⇒ σ|wj, β, A ∪ {r(wj, wi)}
3. Right-Arc_r: σ|wi|wj, β, A ⇒ σ|wi, β, A ∪ {r(wi, wj)}
Finish: σ = [w], β = ∅
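As a concrete illustration, here is a deliberately minimal Python sketch of this transition system. The `next_action` oracle/classifier is assumed to be supplied (in a real parser it is a trained classifier; see MaltParser below), and untyped arcs are stored as (head, dependent) pairs:

```python
def parse(words, next_action):
    """Arc-standard parsing sketch; assumes next_action always returns a legal move."""
    stack = ["ROOT"]            # sigma, top of stack = end of list
    buffer = list(words)        # beta, top of buffer = start of list
    arcs = []                   # A, the set of dependency arcs
    while len(stack) > 1 or buffer:
        action = next_action(stack, buffer, arcs)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":          # second-from-top becomes dependent of top
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT-ARC":         # top becomes dependent of second-from-top
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
    return arcs
```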
Arc-standard transition-based parser (there are other transition schemes…)
Analysis of "I ate fish"

Start:  σ = [root], β = I ate fish
Shift:  σ = [root] I, β = ate fish
Shift:  σ = [root] I ate, β = fish

(Transitions as defined above: Shift, Left-Arc_r, Right-Arc_r; finish when β = ∅.)
Arc-standard transition-based parser: Analysis of "I ate fish" (continued)

Left-Arc:  σ = [root] ate, β = fish       A += nsubj(ate → I)
Shift:     σ = [root] ate fish, β = ∅
Right-Arc: σ = [root] ate, β = ∅          A += obj(ate → fish)
Right-Arc: σ = [root], β = ∅              A += root([root] → ate)
Finish
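This derivation corresponds to replaying a fixed action sequence through the `parse` sketch from earlier (labels omitted, since the sketch builds untyped arcs):

```python
actions = iter(["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"])
arcs = parse(["I", "ate", "fish"], lambda s, b, a: next(actions))
print(arcs)  # [('ate', 'I'), ('ate', 'fish'), ('ROOT', 'ate')]
```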
MaltParser [Nivre and Hall 2005]

• We have left to explain how we choose the next action
• Each action is predicted by a discriminative classifier (often SVM, can be perceptron, maxent classifier) over each legal move
  • Max of 3 untyped choices; max of |R| × 2 + 1 when typed
  • Features: top of stack word, POS; first in buffer word, POS; etc.
• There is NO search (in the simplest form)
  • But you can profitably do a beam search if you wish (slower but better)
• It provides VERY fast linear-time parsing
• The model's accuracy is slightly below the best dependency parsers, but
  • it provides fast, close to state of the art parsing performance
Feature Representation

Feature templates: usually a combination of 1–3 elements from the configuration.

Indicator features: binary, sparse vectors, e.g. [0 0 0 1 0 0 1 0 0 0 1 0 …], dim = 10^6–10^7
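For concreteness, a hypothetical sketch of how a few such templates could be instantiated as feature strings before being mapped to indices in the sparse vector (the template names are illustrative, not MaltParser's actual feature set):

```python
def indicator_features(stack, buffer, pos_tags):
    """Hypothetical feature templates over a parser configuration."""
    feats = []
    if stack:
        feats += [f"s1.w={stack[-1]}", f"s1.t={pos_tags.get(stack[-1], 'NULL')}"]
    if buffer:
        feats += [f"b1.w={buffer[0]}", f"b1.t={pos_tags.get(buffer[0], 'NULL')}"]
    if stack and buffer:
        feats.append(f"s1.w+b1.w={stack[-1]}+{buffer[0]}")  # a combined template
    return feats

print(indicator_features(["ROOT", "ate"], ["fish"], {"ate": "VBD", "fish": "NN"}))
```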
Evaluation of Dependency Parsing: (labeled) dependency accuracy

ROOT She saw the video lecture
  0   1   2   3    4     5

Gold                      Parsed
1 2 She     nsubj         1 2 She     nsubj
2 0 saw     root          2 0 saw     root
3 5 the     det           3 4 the     det
4 5 video   nn            4 5 video   nsubj
5 2 lecture obj           5 2 lecture ccomp

Acc = # correct deps / # of deps

UAS = 4/5 = 80%
LAS = 2/5 = 40%
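The computation behind these scores is just a per-token comparison of predicted (head, label) pairs against gold; a minimal sketch reproducing the numbers above:

```python
def attachment_scores(gold, parsed):
    """gold, parsed: aligned lists of (head_index, label) per token."""
    uas = sum(g[0] == p[0] for g, p in zip(gold, parsed)) / len(gold)
    las = sum(g == p for g, p in zip(gold, parsed)) / len(gold)
    return uas, las

gold   = [(2, "nsubj"), (0, "root"), (5, "det"), (5, "nn"),    (2, "obj")]
parsed = [(2, "nsubj"), (0, "root"), (4, "det"), (5, "nsubj"), (2, "ccomp")]
print(attachment_scores(gold, parsed))  # (0.8, 0.4) -> UAS = 80%, LAS = 40%
```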
Dependency paths identify semantic relations – e.g., for protein interaction

[Erkan et al. EMNLP 07, Fundel et al. 2007, etc.]

KaiC ←nsubj– interacts –nmod:with→ SasA
KaiC ←nsubj– interacts –nmod:with→ SasA –conj:and→ KaiA
KaiC ←nsubj– interacts –nmod:with→ SasA –conj:and→ KaiB
[Dependency parse of "The results demonstrated that KaiC interacts rhythmically with SasA, KaiA, and KaiB", with arcs det, nsubj, mark, ccomp, advmod, nmod:with, case, cc, conj:and]
Projectivity

• Dependencies parallel to a CFG tree must be projective
  • There must not be any crossing dependency arcs when the words are laid out in their linear order, with all arcs above the words
• But dependency theory normally does allow non-projective structures to account for displaced constituents
  • You can't easily get the semantics of certain constructions right without these non-projective dependencies

Who did Bill buy the coffee from yesterday?
Handling non-projectivity

• The arc-standard algorithm we presented only builds projective dependency trees
• Possible directions to head:
  1. Just declare defeat on non-projective arcs
  2. Use a dependency formalism which only admits projective representations (a CFG doesn't represent such structures…)
  3. Use a postprocessor to a projective dependency parsing algorithm to identify and resolve non-projective links
  4. Add extra transitions that can model at least most non-projective structures (e.g., add an extra SWAP transition, cf. bubble sort)
  5. Move to a parsing mechanism that does not use or require any constraints on projectivity (e.g., the graph-based MSTParser)
5. Why train a neural dependency parser? Indicator Features Revisited

• Problem #1: sparse
• Problem #2: incomplete
• Problem #3: expensive computation

More than 95% of parsing time is consumed by feature computation.

Our approach: learn a dense and compact feature representation, e.g. [0.1 0.9 −0.2 0.3 −0.1 −0.5 …], dense, dim ≈ 1000
A neural dependency parser [Chen and Manning 2014]

• English parsing to Stanford Dependencies:
  • Unlabeled attachment score (UAS) = head
  • Labeled attachment score (LAS) = head and label

Parser        UAS    LAS    sent./s
MaltParser    89.8   87.2   469
MSTParser     91.4   88.1   10
TurboParser   92.3*  89.6*  8
C & M 2014    92.0   89.7   654
Distributed Representations

• We represent each word as a d-dimensional dense vector (i.e., word embedding)
  • Similar words are expected to have close vectors.
• Meanwhile, part-of-speech tags (POS) and dependency labels are also represented as d-dimensional vectors.
  • The smaller discrete sets also exhibit many semantic similarities.

[2-D embedding visualization: come/go near each other; were/was/is clustered; good nearby]

NNS (plural noun) should be close to NN (singular noun).
num (numerical modifier) should be close to amod (adjective modifier).
Extracting Tokens and then vector representations from configuration

• We extract a set of tokens based on the stack/buffer positions:

        s1    s2   b1       lc(s1)  rc(s1)  lc(s2)  rc(s2)
word:   good  has  control  ∅       ∅       He      ∅
POS:    JJ    VBZ  NN       ∅       ∅       PRP     ∅
dep.:   ∅     ∅    ∅        ∅       ∅       nsubj   ∅

• We convert them to vector embeddings and concatenate them
Model Architecture

Input layer x: lookup + concat
Hidden layer h: h = ReLU(Wx + b1)
Output layer y: y = softmax(Uh + b2)

Softmax probabilities; the cross-entropy error will be back-propagated to the embeddings.
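A minimal NumPy sketch of this forward pass (dimensions and initialization are illustrative; in the actual parser, W, U, the biases, and the embeddings are all learned by back-propagating the cross-entropy error):

```python
import numpy as np

def forward(x, W, b1, U, b2):
    h = np.maximum(W @ x + b1, 0.0)    # hidden layer: ReLU(Wx + b1)
    scores = U @ h + b2
    scores -= scores.max()             # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()                 # softmax over parser actions

rng = np.random.default_rng(0)
d_in, d_h, n_actions = 48 * 50, 200, 3   # e.g., 48 tokens x 50-dim embeddings
x = rng.normal(size=d_in)                # concatenated word/POS/dep embeddings
W, b1 = 0.01 * rng.normal(size=(d_h, d_in)), np.zeros(d_h)
U, b2 = 0.01 * rng.normal(size=(n_actions, d_h)), np.zeros(n_actions)
print(forward(x, W, b1, U, b2))          # probabilities over Shift/Left-Arc/Right-Arc
```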
Non-linearities between layers: Why they're needed

• For logistic regression: map to probabilities
• Here: function approximation, e.g., for regression or classification
• Without non-linearities, deep neural networks can't do anything more than a linear transform
  • Extra layers could just be compiled down into a single linear transform
• People use various non-linearities
Non-linearities:What’sused
logistic(“sigmoid”)tanh
tanh isjustarescaledandshiftedsigmoidtanh isoftenusedandoftenperformsbetterfordeepnets
• It’soutputissymmetricaround0
tanh(z) = 2logistic(2z)−1
Non-linearities:What’sused
hardtanh softsign linearrectifier(ReLU)
• hardtanh similarbutcomputationallycheaperthantanh andsaturateshard.• [Glorot andBengioAISTATS 2010,2011]discusssoftsign andrectifier• Rectifiedlinearunitisnowmegacommon– transferslinearsignalifactive
rect(z) =max(z, 0)softsign(z) = a1+ a
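For reference, the non-linearities from these two slides as NumPy one-liners (a sketch; hardtanh is the clip-to-[−1, 1] variant):

```python
import numpy as np

logistic = lambda z: 1.0 / (1.0 + np.exp(-z))
tanh     = lambda z: 2.0 * logistic(2.0 * z) - 1.0  # the identity from the tanh slide
hardtanh = lambda z: np.clip(z, -1.0, 1.0)
softsign = lambda z: z / (1.0 + np.abs(z))
relu     = lambda z: np.maximum(z, 0.0)
```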
Dependency parsing for sentence structure

Neural networks can accurately determine the structure of sentences, supporting interpretation.

Chen and Manning (2014) was the first simple, successful neural dependency parser.

The dense representations let it outperform other greedy parsers in both accuracy and speed.
Further developments in transition-based neural dependency parsing

This work was further developed and improved by others, including in particular at Google:
• Bigger, deeper networks with better tuned hyperparameters
• Beam search
• Global, conditional random field (CRF)-style inference over the decision sequence

Leading to SyntaxNet and the Parsey McParseface model:
https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html

Method (PTB WSJ SD 3.3)   UAS     LAS
Chen & Manning 2014       92.0    89.7
Weiss et al. 2015         93.99   92.05
Andor et al. 2016         94.61   92.79