Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning and Richard Socher
Lecture 6: Dependency Parsing
Lecture Plan
1. Syntactic Structure: Consistency and Dependency
2. Dependency Grammar
3. Research highlight
4. Transition-based dependency parsing
5. Neural dependency parsing

Reminders/comments:
• Assignment 1 due tonight :)
• Assignment 2 out today :(
• Final project discussion – come meet with us
• Sorry that we've been kind of overloaded for office hours
1. Two views of linguistic structure: Constituency = phrase structure grammar = context-free grammars (CFGs)

Phrase structure organizes words into nested constituents.

Example fragments: the cat; a dog; large; in a crate; barking; on the table; cuddly; by the door
Further fragments: large, barking; talk to, look for
Two views of linguistic structure: Dependency structure

• Dependency structure shows which words depend on (modify or are arguments of) which other words.

Look for the large barking dog by the door in a crate
Ambiguity: PP attachments

Scientists study whales from space
Attachment ambiguities

• A key parsing decision is how we 'attach' various constituents
  • PPs, adverbial or participial phrases, infinitives, coordinations, etc.
• Catalan numbers: Cn = (2n)!/[(n+1)!n!]
• An exponentially growing series, which arises in many tree-like contexts:
  • E.g., the number of possible triangulations of a polygon with n+2 sides
• Turns up in triangulation of probabilistic graphical models….
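As a quick illustration (not from the original slides), the formula can be checked directly in Python:

```python
from math import comb

# Catalan numbers: C_n = (2n)! / ((n+1)! n!) = C(2n, n) / (n+1)
def catalan(n):
    return comb(2 * n, n) // (n + 1)

print([catalan(n) for n in range(8)])  # [1, 1, 2, 5, 14, 42, 132, 429]
```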
The rise of annotated data: Universal Dependencies treebanks
[Universal Dependencies: http://universaldependencies.org/; cf. Marcus et al. 1993, The Penn Treebank, Computational Linguistics]
The rise of annotated data

Starting off, building a treebank seems a lot slower and less useful than building a grammar.

But a treebank gives us many things:
• Reusability of the labor
  • Many parsers, part-of-speech taggers, etc. can be built on it
  • Valuable resource for linguistics
• Broad coverage, not just a few intuitions
• Frequencies and distributional information
• A way to evaluate systems
2. Dependency Grammar and Dependency Structure

Dependency syntax postulates that syntactic structure consists of relations between lexical items, normally binary asymmetric relations ("arrows") called dependencies
[Dependency tree for: "Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas"]
Dependency Grammar and Dependency Structure

The arrows are commonly typed with the name of grammatical relations (subject, prepositional object, apposition, etc.)
[The same dependency tree, now with typed arcs: nsubj:pass, aux, obl, case, nmod, conj, cc, flat, appos]
Dependency Grammar and Dependency Structure

The arrow connects a head (governor, superior, regent) with a dependent (modifier, inferior, subordinate)

Usually, dependencies form a tree (connected, acyclic, single-head)
Pāṇini’s grammar (c. 5th century BCE)
Gallery: http://wellcomeimages.org/indexplus/image/L0032691.html, CC BY 4.0. File: Birch bark MS from Kashmir of the Rupavatra, Wellcome L0032691.jpg
Dependency Grammar/Parsing History

• The idea of dependency structure goes back a long way
  • To Pāṇini's grammar (c. 5th century BCE)
  • Basic approach of 1st millennium Arabic grammarians
• Constituency/context-free grammar is a new-fangled invention
  • 20th century invention (R. S. Wells, 1947)
• Modern dependency work is often linked to the work of L. Tesnière (1959)
  • Was the dominant approach in the "East" (Russia, China, …)
  • Good for free-er word order languages
• Among the earliest kinds of parsers in NLP, even in the US:
  • David Hays, one of the founders of U.S. computational linguistics, built an early (first?) dependency parser (Hays 1962)
Dependency Grammar and Dependency Structure

• Some people draw the arrows one way; some the other way!
  • Tesnière had them point from head to dependent…
• Usually add a fake ROOT so every word is a dependent of precisely 1 other node

ROOT Discussion of the outstanding issues was completed.
Dependency Conditioning Preferences

What are the sources of information for dependency parsing?
1. Bilexical affinities: [discussion → issues] is plausible
2. Dependency distance: mostly with nearby words
3. Intervening material: dependencies rarely span intervening verbs or punctuation
4. Valency of heads: how many dependents on which side are usual for a head?

ROOT Discussion of the outstanding issues was completed.
Dependency Parsing

• A sentence is parsed by choosing for each word what other word (including ROOT) it is a dependent of
• Usually some constraints:
  • Only one word is a dependent of ROOT
  • Don't want cycles A → B, B → A
• This makes the dependencies a tree
• Final issue is whether arrows can cross (be non-projective) or not

[Example with a crossing arc: ROOT I 'll give a talk tomorrow on bootstrapping]
Methods of Dependency Parsing

1. Dynamic programming: Eisner (1996) gives a clever algorithm with complexity O(n³), by producing parse items with heads at the ends rather than in the middle
2. Graph algorithms: you create a Minimum Spanning Tree for a sentence. McDonald et al.'s (2005) MSTParser scores dependencies independently using an ML classifier (he uses MIRA, for online learning, but it can be something else)
3. Constraint Satisfaction: edges are eliminated that don't satisfy hard constraints. Karlsson (1990), etc.
4. "Transition-based parsing" or "deterministic dependency parsing": greedy choice of attachments guided by good machine learning classifiers. MaltParser (Nivre et al. 2008). Has proven highly effective.
Improving Distributional Similarity with Lessons Learned from Word Embeddings
Omer Levy, Yoav Goldberg, Ido Dagan
Presented by: Ajay Sohmshetty
Word Representation Methods

Count-based distributional models:
● SVD (Singular Value Decomposition)
● PPMI (Positive Pointwise Mutual Information)

Neural network-based models:
● SGNS (Skip-Gram with Negative Sampling) / CBOW
● GloVe

Conventional wisdom: neural network-based models > count-based models.

Levy et al.: hyperparameters and system design choices are more important than the embedding algorithms themselves.
Hyperparameters in Skip-Gram

● # of negative samples
● Unigram distribution smoothing exponent
● Context Distribution Smoothing
● Shifted PMI

→ These can be transferred over to the count-based variants.
All Transferable Hyperparameters

Hyperparameter                  Explored Values        Applicable Methods
Window                          2, 5, 10               All
Dynamic Context Window          None, with             All
Subsampling                     None, dirty, clean     All
Deleting Rare Words             None, with             All
Shifted PMI                     1, 5, 15               PPMI, SVD, SGNS
Context Distribution Smoothing  1, 0.75                PPMI, SVD, SGNS
Adding Context Vectors          Only w, w+c            SVD, SGNS, GloVe
Eigenvalue Weighting            0, 0.5, 1              SVD
Vector Normalization            None, row, col, both   All

(These fall into three groups: preprocessing, association-metric, and postprocessing hyperparameters.)
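To make the transfer concrete, here is a minimal sketch (our own illustration, not code from the paper) of a PPMI matrix with two of the transferred hyperparameters: context distribution smoothing (cds = 0.75) and shifting by log k, where k plays the role of SGNS's number of negative samples:

```python
import numpy as np

def shifted_ppmi(counts, cds=0.75, k=5):
    """Sketch: counts is a (words x contexts) co-occurrence matrix,
    assumed to have no all-zero rows or columns."""
    total = counts.sum()
    p_wc = counts / total                           # joint probabilities
    p_w = counts.sum(axis=1, keepdims=True) / total
    smoothed = counts.sum(axis=0) ** cds            # context distribution smoothing
    p_c = smoothed / smoothed.sum()
    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    return np.maximum(pmi - np.log(k), 0.0)         # shift by log k, then clip at 0
```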
Results: [charts of performance on word similarity tasks and analogy tasks]
Key Takeaways

• This paper challenges the conventional wisdom that neural network-based models are superior to count-based models.
• While model design is important, hyperparameters are also KEY for achieving reasonable results. Don't discount their importance!
• Challenge the status quo!
4. Greedy transition-based parsing [Nivre 2003]

• A simple form of greedy discriminative dependency parser
• The parser does a sequence of bottom-up actions
  • Roughly like "shift" or "reduce" in a shift-reduce parser, but the "reduce" actions are specialized to create dependencies with head on left or right
• The parser has:
  • a stack σ, written with top to the right
    • which starts with the ROOT symbol
  • a buffer β, written with top to the left
    • which starts with the input sentence
  • a set of dependency arcs A
    • which starts off empty
  • a set of actions
Basic transition-based dependency parser

Start: σ = [ROOT], β = w1, …, wn, A = ∅
1. Shift: σ, wi|β, A ⇒ σ|wi, β, A
2. Left-Arc_r: σ|wi|wj, β, A ⇒ σ|wj, β, A ∪ {r(wj, wi)}
3. Right-Arc_r: σ|wi|wj, β, A ⇒ σ|wi, β, A ∪ {r(wi, wj)}
Finish: σ = [w], β = ∅
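As a concrete illustration, here is a deliberately minimal Python sketch of this transition system. The `next_action` oracle/classifier is assumed to be supplied (in a real parser it is a trained classifier; see MaltParser below), and untyped arcs are stored as (head, dependent) pairs:

```python
def parse(words, next_action):
    """Arc-standard parsing sketch; assumes next_action always returns a legal move."""
    stack = ["ROOT"]            # sigma, top of stack = end of list
    buffer = list(words)        # beta, top of buffer = start of list
    arcs = []                   # A, the set of dependency arcs
    while len(stack) > 1 or buffer:
        action = next_action(stack, buffer, arcs)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":          # second-from-top becomes dependent of top
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT-ARC":         # top becomes dependent of second-from-top
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
    return arcs
```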
Arc-standard transition-based parser (there are other transition schemes…)
Analysis of "I ate fish"

Start:  σ = [root], β = I ate fish
Shift:  σ = [root] I, β = ate fish
Shift:  σ = [root] I ate, β = fish

(Transitions as defined above: Shift, Left-Arc_r, Right-Arc_r; finish when β = ∅.)
Arc-standard transition-based parser: Analysis of "I ate fish" (continued)

Left-Arc:  σ = [root] ate, β = fish       A += nsubj(ate → I)
Shift:     σ = [root] ate fish, β = ∅
Right-Arc: σ = [root] ate, β = ∅          A += obj(ate → fish)
Right-Arc: σ = [root], β = ∅              A += root([root] → ate)
Finish
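This derivation corresponds to replaying a fixed action sequence through the `parse` sketch from earlier (labels omitted, since the sketch builds untyped arcs):

```python
actions = iter(["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"])
arcs = parse(["I", "ate", "fish"], lambda s, b, a: next(actions))
print(arcs)  # [('ate', 'I'), ('ate', 'fish'), ('ROOT', 'ate')]
```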
MaltParser [Nivre and Hall 2005]

• We have left to explain how we choose the next action
• Each action is predicted by a discriminative classifier (often SVM, can be perceptron, maxent classifier) over each legal move
  • Max of 3 untyped choices; max of |R| × 2 + 1 when typed
  • Features: top of stack word, POS; first in buffer word, POS; etc.
• There is NO search (in the simplest form)
  • But you can profitably do a beam search if you wish (slower but better)
• It provides VERY fast linear-time parsing
• The model's accuracy is slightly below the best dependency parsers, but
  • it provides fast, close to state of the art parsing performance
Feature Representation

Feature templates: usually a combination of 1–3 elements from the configuration.

Indicator features: binary, sparse vectors, e.g. [0 0 0 1 0 0 1 0 0 0 1 0 …], dim = 10^6–10^7
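For concreteness, a hypothetical sketch of how a few such templates could be instantiated as feature strings before being mapped to indices in the sparse vector (the template names are illustrative, not MaltParser's actual feature set):

```python
def indicator_features(stack, buffer, pos_tags):
    """Hypothetical feature templates over a parser configuration."""
    feats = []
    if stack:
        feats += [f"s1.w={stack[-1]}", f"s1.t={pos_tags.get(stack[-1], 'NULL')}"]
    if buffer:
        feats += [f"b1.w={buffer[0]}", f"b1.t={pos_tags.get(buffer[0], 'NULL')}"]
    if stack and buffer:
        feats.append(f"s1.w+b1.w={stack[-1]}+{buffer[0]}")  # a combined template
    return feats

print(indicator_features(["ROOT", "ate"], ["fish"], {"ate": "VBD", "fish": "NN"}))
```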
Evaluation of Dependency Parsing: (labeled) dependency accuracy

ROOT She saw the video lecture
  0   1   2   3    4     5

Gold                      Parsed
1 2 She     nsubj         1 2 She     nsubj
2 0 saw     root          2 0 saw     root
3 5 the     det           3 4 the     det
4 5 video   nn            4 5 video   nsubj
5 2 lecture obj           5 2 lecture ccomp

Acc = # correct deps / # of deps

UAS = 4/5 = 80%
LAS = 2/5 = 40%
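The computation behind these scores is just a per-token comparison of predicted (head, label) pairs against gold; a minimal sketch reproducing the numbers above:

```python
def attachment_scores(gold, parsed):
    """gold, parsed: aligned lists of (head_index, label) per token."""
    uas = sum(g[0] == p[0] for g, p in zip(gold, parsed)) / len(gold)
    las = sum(g == p for g, p in zip(gold, parsed)) / len(gold)
    return uas, las

gold   = [(2, "nsubj"), (0, "root"), (5, "det"), (5, "nn"),    (2, "obj")]
parsed = [(2, "nsubj"), (0, "root"), (4, "det"), (5, "nsubj"), (2, "ccomp")]
print(attachment_scores(gold, parsed))  # (0.8, 0.4) -> UAS = 80%, LAS = 40%
```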
Dependency paths identify semantic relations – e.g., for protein interaction

[Erkan et al. EMNLP 07, Fundel et al. 2007, etc.]

KaiC ←nsubj– interacts –nmod:with→ SasA
KaiC ←nsubj– interacts –nmod:with→ SasA –conj:and→ KaiA
KaiC ←nsubj– interacts –nmod:with→ SasA –conj:and→ KaiB
[Dependency parse of "The results demonstrated that KaiC interacts rhythmically with SasA, KaiA, and KaiB", with arcs det, nsubj, mark, ccomp, advmod, nmod:with, case, cc, conj:and]
Projectivity

• Dependencies parallel to a CFG tree must be projective
  • There must not be any crossing dependency arcs when the words are laid out in their linear order, with all arcs above the words
• But dependency theory normally does allow non-projective structures to account for displaced constituents
  • You can't easily get the semantics of certain constructions right without these non-projective dependencies

Who did Bill buy the coffee from yesterday?
Handling non-projectivity

• The arc-standard algorithm we presented only builds projective dependency trees
• Possible directions to head:
  1. Just declare defeat on non-projective arcs
  2. Use a dependency formalism which only admits projective representations (a CFG doesn't represent such structures…)
  3. Use a postprocessor to a projective dependency parsing algorithm to identify and resolve non-projective links
  4. Add extra transitions that can model at least most non-projective structures (e.g., add an extra SWAP transition, cf. bubble sort)
  5. Move to a parsing mechanism that does not use or require any constraints on projectivity (e.g., the graph-based MSTParser)
5. Why train a neural dependency parser? Indicator Features Revisited

• Problem #1: sparse
• Problem #2: incomplete
• Problem #3: expensive computation

More than 95% of parsing time is consumed by feature computation.

Our approach: learn a dense and compact feature representation, e.g. [0.1 0.9 −0.2 0.3 −0.1 −0.5 …], dense, dim ≈ 1000
A neural dependency parser [Chen and Manning 2014]

• English parsing to Stanford Dependencies:
  • Unlabeled attachment score (UAS) = head
  • Labeled attachment score (LAS) = head and label

Parser        UAS    LAS    sent./s
MaltParser    89.8   87.2   469
MSTParser     91.4   88.1   10
TurboParser   92.3*  89.6*  8
C & M 2014    92.0   89.7   654
Distributed Representations

• We represent each word as a d-dimensional dense vector (i.e., word embedding)
  • Similar words are expected to have close vectors.
• Meanwhile, part-of-speech tags (POS) and dependency labels are also represented as d-dimensional vectors.
  • The smaller discrete sets also exhibit many semantic similarities.

[2-D embedding visualization: come/go near each other; were/was/is clustered; good nearby]

NNS (plural noun) should be close to NN (singular noun).
num (numerical modifier) should be close to amod (adjective modifier).
Extracting Tokens and then vector representations from configuration

• We extract a set of tokens based on the stack/buffer positions:

        s1    s2   b1       lc(s1)  rc(s1)  lc(s2)  rc(s2)
word:   good  has  control  ∅       ∅       He      ∅
POS:    JJ    VBZ  NN       ∅       ∅       PRP     ∅
dep.:   ∅     ∅    ∅        ∅       ∅       nsubj   ∅

• We convert them to vector embeddings and concatenate them
Model Architecture

Input layer x: lookup + concat
Hidden layer h: h = ReLU(Wx + b1)
Output layer y: y = softmax(Uh + b2)

Softmax probabilities; the cross-entropy error will be back-propagated to the embeddings.
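A minimal NumPy sketch of this forward pass (dimensions and initialization are illustrative; in the actual parser, W, U, the biases, and the embeddings are all learned by back-propagating the cross-entropy error):

```python
import numpy as np

def forward(x, W, b1, U, b2):
    h = np.maximum(W @ x + b1, 0.0)    # hidden layer: ReLU(Wx + b1)
    scores = U @ h + b2
    scores -= scores.max()             # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()                 # softmax over parser actions

rng = np.random.default_rng(0)
d_in, d_h, n_actions = 48 * 50, 200, 3   # e.g., 48 tokens x 50-dim embeddings
x = rng.normal(size=d_in)                # concatenated word/POS/dep embeddings
W, b1 = 0.01 * rng.normal(size=(d_h, d_in)), np.zeros(d_h)
U, b2 = 0.01 * rng.normal(size=(n_actions, d_h)), np.zeros(n_actions)
print(forward(x, W, b1, U, b2))          # probabilities over Shift/Left-Arc/Right-Arc
```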
Non-linearities between layers: Why they're needed

• For logistic regression: map to probabilities
• Here: function approximation, e.g., for regression or classification
• Without non-linearities, deep neural networks can't do anything more than a linear transform
  • Extra layers could just be compiled down into a single linear transform
• People use various non-linearities
Non-linearities:What’sused
logistic(“sigmoid”)tanh
tanh isjustarescaledandshiftedsigmoidtanh isoftenusedandoftenperformsbetterfordeepnets
• It’soutputissymmetricaround0
tanh(z) = 2logistic(2z)−1
Non-linearities:What’sused
hardtanh softsign linearrectifier(ReLU)
• hardtanh similarbutcomputationallycheaperthantanh andsaturateshard.• [Glorot andBengioAISTATS 2010,2011]discusssoftsign andrectifier• Rectifiedlinearunitisnowmegacommon– transferslinearsignalifactive
rect(z) =max(z, 0)softsign(z) = a1+ a
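For reference, the non-linearities from these two slides as NumPy one-liners (a sketch; hardtanh is the clip-to-[−1, 1] variant):

```python
import numpy as np

logistic = lambda z: 1.0 / (1.0 + np.exp(-z))
tanh     = lambda z: 2.0 * logistic(2.0 * z) - 1.0  # the identity from the tanh slide
hardtanh = lambda z: np.clip(z, -1.0, 1.0)
softsign = lambda z: z / (1.0 + np.abs(z))
relu     = lambda z: np.maximum(z, 0.0)
```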
Dependency parsing for sentence structure

Neural networks can accurately determine the structure of sentences, supporting interpretation.

Chen and Manning (2014) was the first simple, successful neural dependency parser.

The dense representations let it outperform other greedy parsers in both accuracy and speed.
Further developments in transition-based neural dependency parsing

This work was further developed and improved by others, including in particular at Google:
• Bigger, deeper networks with better tuned hyperparameters
• Beam search
• Global, conditional random field (CRF)-style inference over the decision sequence

Leading to SyntaxNet and the Parsey McParseface model:
https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html

Method (PTB WSJ SD 3.3)   UAS     LAS
Chen & Manning 2014       92.0    89.7
Weiss et al. 2015         93.99   92.05
Andor et al. 2016         94.61   92.79