
Natural Language Processing with Deep Learning

CS224N/Ling284

Christopher Manning and Richard Socher

Lecture 6: Dependency Parsing

Lecture Plan

1. Syntactic Structure: Constituency and Dependency
2. Dependency Grammar
3. Research highlight
4. Transition-based dependency parsing
5. Neural dependency parsing

Reminders/comments:
• Assignment 1 due tonight :)
• Assignment 2 out today :(
• Final project discussion – come meet with us
• Sorry that we've been kind of overloaded for office hours

1. Two views of linguistic structure: Constituency = phrase structure grammar = context-free grammars (CFGs)

Phrase structure organizes words into nested constituents.

Starting units: words and word groups, e.g., the, cat, a, dog; large, barking, cuddly; in a crate, on the table, by the door; talk to, look for


Two views of linguistic structure: Dependency structure

• Dependency structure shows which words depend on (modify or are arguments of) which other words.

Look for the large barking dog by the door in a crate


Ambiguity: PP attachments

Scientists study whales from space
(the PP "from space" can attach to "whales" or to "study")

Attachment ambiguities

• A key parsing decision is how we 'attach' various constituents
• PPs, adverbial or participial phrases, infinitives, coordinations, etc.


• Catalan numbers: Cn = (2n)!/[(n+1)!n!]

• An exponentially growing series, which arises in many tree-like contexts:
• E.g., the number of possible triangulations of a polygon with n+2 sides

• Turns up in triangulation of probabilistic graphical models….
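As a quick sanity check of the formula above, here is a small illustrative Python snippet computing the first few Catalan numbers:

```python
from math import comb

def catalan(n: int) -> int:
    """C_n = (2n)! / ((n+1)! n!) = C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

# Grows roughly like 4^n / n^(3/2):
print([catalan(n) for n in range(8)])   # [1, 1, 2, 5, 14, 42, 132, 429]
```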

The rise of annotated data: Universal Dependencies treebanks

[Universal Dependencies: http://universaldependencies.org/ ; cf. Marcus et al. 1993, The Penn Treebank, Computational Linguistics]

The rise of annotated data

Starting off, building a treebank seems a lot slower and less useful than building a grammar.

But a treebank gives us many things
• Reusability of the labor
  • Many parsers, part-of-speech taggers, etc. can be built on it
  • Valuable resource for linguistics
• Broad coverage, not just a few intuitions
• Frequencies and distributional information
• A way to evaluate systems

2. Dependency Grammar and Dependency Structure

Dependency syntax postulates that syntactic structure consists of relations between lexical items, normally binary asymmetric relations ("arrows") called dependencies.

[Dependency tree for "Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas"]

The arrows are commonly typed with the name of grammatical relations (subject, prepositional object, apposition, etc.).

[The same tree with typed relations: nsubj:pass, aux, obl, case, nmod, conj, cc, flat, appos]

The arrow connects a head (governor, superior, regent) with a dependent (modifier, inferior, subordinate).

Usually, dependencies form a tree (connected, acyclic, single-head).

Pāṇini's grammar (c. 5th century BCE)

[Image: Birch bark MS from Kashmir of the Rupavatra, Wellcome L0032691, CC BY 4.0; http://wellcomeimages.org/indexplus/image/L0032691.html]

Dependency Grammar/Parsing History

• The idea of dependency structure goes back a long way
  • To Pāṇini's grammar (c. 5th century BCE)
  • Basic approach of 1st millennium Arabic grammarians
• Constituency/context-free grammars are a new-fangled invention
  • 20th-century invention (R. S. Wells, 1947)
• Modern dependency work is often linked to the work of L. Tesnière (1959)
  • Was the dominant approach in the "East" (Russia, China, …)
  • Good for freer word order languages
• Among the earliest kinds of parsers in NLP, even in the US:
  • David Hays, one of the founders of U.S. computational linguistics, built an early (first?) dependency parser (Hays 1962)

Dependency Grammar and Dependency Structure

• Some people draw the arrows one way; some the other way!
  • Tesnière had them point from head to dependent…
• Usually we add a fake ROOT so every word is a dependent of precisely 1 other node

ROOT Discussion of the outstanding issues was completed.

Dependency Conditioning Preferences

What are the sources of information for dependency parsing?
1. Bilexical affinities: [discussion → issues] is plausible
2. Dependency distance: mostly with nearby words
3. Intervening material: dependencies rarely span intervening verbs or punctuation
4. Valency of heads: how many dependents on which side are usual for a head?

ROOT Discussion of the outstanding issues was completed.

Dependency Parsing

• A sentence is parsed by choosing for each word what other word (including ROOT) it is a dependent of
• Usually some constraints:
  • Only one word is a dependent of ROOT
  • Don't want cycles A → B, B → A
• This makes the dependencies a tree
• Final issue is whether arrows can cross (non-projective) or not

[Non-projective example: ROOT I 'll give a talk tomorrow on bootstrapping]

Methods of Dependency Parsing

1. Dynamic programming
   Eisner (1996) gives a clever algorithm with complexity O(n³), by producing parse items with heads at the ends rather than in the middle
2. Graph algorithms
   You create a Minimum Spanning Tree for a sentence
   McDonald et al.'s (2005) MSTParser scores dependencies independently using an ML classifier (he uses MIRA, for online learning, but it can be something else)
3. Constraint Satisfaction
   Edges are eliminated that don't satisfy hard constraints. Karlsson (1990), etc.
4. "Transition-based parsing" or "deterministic dependency parsing"
   Greedy choice of attachments guided by good machine learning classifiers
   MaltParser (Nivre et al. 2008). Has proven highly effective.

3. Research highlight: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Omer Levy, Yoav Goldberg, Ido Dagan

Presented by: Ajay Sohmshetty

Word Representation Methods

Count-based distributional models
● SVD (Singular Value Decomposition)
● PPMI (Positive Pointwise Mutual Information)

Neural network-based models
● SGNS (Skip-Gram with Negative Sampling) / CBOW
● GloVe


Conventional wisdom: neural network-based models > count-based models

Levy et al.: hyperparameters and system design choices matter more than the embedding algorithms themselves.

Hyperparameters in Skip-Gram

• Unigram distribution smoothing exponent → Context Distribution Smoothing
• # of negative samples → Shifted PMI

→ These can be transferred over to the count-based variants.
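As a rough sketch of how these two hyperparameters carry over to a count-based PPMI model (following Levy et al.'s recipe), assuming `counts` is a dense word-by-context co-occurrence matrix, `alpha` is the context distribution smoothing exponent, and `k` plays the role of the number of negative samples:

```python
import numpy as np

def shifted_ppmi(counts: np.ndarray, alpha: float = 0.75, k: int = 5) -> np.ndarray:
    """PPMI with context-distribution smoothing (alpha) and a log(k) shift,
    mirroring SGNS's smoothing exponent and number of negative samples."""
    total = counts.sum()
    p_wc = counts / total                              # joint P(w, c)
    p_w = counts.sum(axis=1, keepdims=True) / total    # P(w)
    c_smooth = counts.sum(axis=0, keepdims=True) ** alpha
    p_c = c_smooth / c_smooth.sum()                    # smoothed P_alpha(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = -np.inf                   # zero counts -> -inf PMI
    return np.maximum(pmi - np.log(k), 0.0)            # shifted positive PMI
```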

All Transferable Hyperparameters

Hyperparameter                   | Explored Values       | Applicable Methods
Preprocessing
  Window                         | 2, 5, 10              | All
  Dynamic Context Window         | None, with            | All
  Subsampling                    | None, dirty, clean    | All
  Deleting Rare Words            | None, with            | All
Association Metric
  Shifted PMI                    | 1, 5, 15              | PPMI, SVD, SGNS
  Context Distribution Smoothing | 1, 0.75               | PPMI, SVD, SGNS
Postprocessing
  Adding Context Vectors         | Only w, w + c         | SVD, SGNS, GloVe
  Eigenvalue Weighting           | 0, 0.5, 1             | SVD
  Vector Normalization           | None, row, col, both  | All

Results
[Charts comparing the methods on word similarity tasks and analogy tasks]

Key Takeaways

- This paper challenges the conventional wisdom that neural network-based models are superior to count-based models.

- While model design is important, hyperparameters are also KEY for achieving reasonable results. Don’t discount their importance!

- Challenge the status quo!

4. Greedy transition-based parsing [Nivre 2003]

• A simple form of greedy discriminative dependency parser
• The parser does a sequence of bottom-up actions
  • Roughly like "shift" or "reduce" in a shift-reduce parser, but the "reduce" actions are specialized to create dependencies with head on left or right
• The parser has:
  • a stack σ, written with top to the right
    • which starts with the ROOT symbol
  • a buffer β, written with top to the left
    • which starts with the input sentence
  • a set of dependency arcs A
    • which starts off empty
  • a set of actions

Basic transition-based dependency parser

Start:  σ = [ROOT], β = w1, …, wn, A = ∅
1. Shift          σ, wi|β, A  ⇒  σ|wi, β, A
2. Left-Arc_r     σ|wi|wj, β, A  ⇒  σ|wj, β, A ∪ {r(wj, wi)}
3. Right-Arc_r    σ|wi|wj, β, A  ⇒  σ|wi, β, A ∪ {r(wi, wj)}
Finish: σ = [w], β = ∅
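A minimal Python sketch of this transition system (the class and method names are illustrative, not from any particular library); word positions are 1-indexed and 0 stands for ROOT:

```python
class ArcStandardConfig:
    """Configuration (stack σ, buffer β, arc set A) for arc-standard parsing."""

    def __init__(self, n_words):
        self.stack = [0]                             # σ starts as [ROOT]
        self.buffer = list(range(1, n_words + 1))    # β = w1 … wn
        self.arcs = set()                            # A starts empty

    def shift(self):
        self.stack.append(self.buffer.pop(0))        # move front of buffer onto stack

    def left_arc(self, label):
        dep = self.stack.pop(-2)                     # second-from-top becomes the dependent
        self.arcs.add((self.stack[-1], label, dep))  # head is the (new) top of stack

    def right_arc(self, label):
        dep = self.stack.pop()                       # top of stack becomes the dependent
        self.arcs.add((self.stack[-1], label, dep))  # head is the word below it

    def is_finished(self):
        return not self.buffer and self.stack == [0]
```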

Arc-standard transition-based parser (there are other transition schemes…)
Analysis of "I ate fish"

Start:  σ = [root]          β = I ate fish
Shift:  σ = [root] I        β = ate fish
Shift:  σ = [root] I ate    β = fish

Left-Arc:   σ = [root] ate        β = fish    A += nsubj(ate → I)
Shift:      σ = [root] ate fish   β = ∅
Right-Arc:  σ = [root] ate        β = ∅       A += obj(ate → fish)
Right-Arc:  σ = [root]            β = ∅       A += root([root] → ate)
Finish
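Using the ArcStandardConfig sketch above, the same transition sequence for "I ate fish" (I=1, ate=2, fish=3) looks like:

```python
c = ArcStandardConfig(3)   # "I ate fish": I=1, ate=2, fish=3
c.shift()                  # σ = [ROOT, I]        β = [ate, fish]
c.shift()                  # σ = [ROOT, I, ate]   β = [fish]
c.left_arc("nsubj")        # A += nsubj(ate → I)
c.shift()                  # σ = [ROOT, ate, fish]
c.right_arc("obj")         # A += obj(ate → fish)
c.right_arc("root")        # A += root(ROOT → ate)
print(c.arcs)              # {(2, 'nsubj', 1), (2, 'obj', 3), (0, 'root', 2)}
print(c.is_finished())     # True
```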

MaltParser [Nivre and Hall 2005]

• We have left to explain how we choose the next action
  • Each action is predicted by a discriminative classifier (often SVM, can be perceptron, maxent classifier) over each legal move
  • Max of 3 untyped choices; max of |R| × 2 + 1 when typed
  • Features: top of stack word, POS; first in buffer word, POS; etc.
• There is NO search (in the simplest form)
  • But you can profitably do a beam search if you wish (slower but better)
• It provides VERY fast linear time parsing
• The model's accuracy is slightly below the best dependency parsers, but
• It provides fast, close to state of the art parsing performance

Feature Representation

Feature templates: usually a combination of 1–3 elements from the configuration.

Indicator features: binary, sparse vectors (e.g., 0 0 0 1 0 0 1 0 0 0 1 0), dim = 10^6 ~ 10^7
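A small illustrative sketch of how such indicator features might be generated (reusing the ArcStandardConfig sketch above; `words` and `pos` are assumed lists mapping positions to word forms and POS tags; each resulting string is one dimension of the huge sparse vector):

```python
def indicator_features(config, words, pos):
    """Conjoin 1-3 attributes of the configuration into feature-template strings."""
    s1 = config.stack[-1] if config.stack else None      # top of stack
    b1 = config.buffer[0] if config.buffer else None     # front of buffer
    feats = []
    if s1 is not None:
        feats += [f"s1.w={words[s1]}", f"s1.t={pos[s1]}"]
    if b1 is not None:
        feats += [f"b1.w={words[b1]}", f"b1.t={pos[b1]}"]
    if s1 is not None and b1 is not None:
        feats.append(f"s1.t+b1.t={pos[s1]}+{pos[b1]}")    # conjoined template
        feats.append(f"s1.w+b1.w={words[s1]}+{words[b1]}")
    return feats
```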

Evaluation of Dependency Parsing: (labeled) dependency accuracy

ROOT She saw the video lecture
  0   1   2   3    4     5

Gold                       Parsed
1 2 She     nsubj          1 2 She     nsubj
2 0 saw     root           2 0 saw     root
3 5 the     det            3 4 the     det
4 5 video   nn             4 5 video   nsubj
5 2 lecture obj            5 2 lecture ccomp

Acc = # correct deps / # of deps

UAS = 4/5 = 80%
LAS = 2/5 = 40%
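A tiny sketch of these two metrics on the example above (representing each dependent as a (head, label) pair):

```python
def attachment_scores(gold, parsed):
    """UAS = fraction of correct heads; LAS = fraction of correct heads and labels."""
    uas = sum(g[0] == p[0] for g, p in zip(gold, parsed)) / len(gold)
    las = sum(g == p for g, p in zip(gold, parsed)) / len(gold)
    return uas, las

# "She saw the video lecture": (head, label) for dependents 1..5
gold   = [(2, "nsubj"), (0, "root"), (5, "det"), (5, "nn"),    (2, "obj")]
parsed = [(2, "nsubj"), (0, "root"), (4, "det"), (5, "nsubj"), (2, "ccomp")]
print(attachment_scores(gold, parsed))   # (0.8, 0.4)
```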

Dependency paths identify semantic relations – e.g., for protein interaction

[Erkan et al. EMNLP 07, Fundel et al. 2007, etc.]

KaiC ←nsubj– interacts –nmod:with→ SasA
KaiC ←nsubj– interacts –nmod:with→ SasA –conj:and→ KaiA
KaiC ←nsubj– interacts –prep_with→ SasA –conj:and→ KaiB

[Dependency tree for "The results demonstrated that KaiC interacts rhythmically with SasA, KaiA and KaiB", with relations det, nsubj, ccomp, mark, advmod, nmod:with, case, cc, conj:and]

Projectivity

• Dependencies parallel to a CFG tree must be projective
  • There must not be any crossing dependency arcs when the words are laid out in their linear order, with all arcs above the words.
• But dependency theory normally does allow non-projective structures to account for displaced constituents
  • You can't easily get the semantics of certain constructions right without these non-projective dependencies

Who did Bill buy the coffee from yesterday?
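A small sketch of what the constraint means computationally: a tree is non-projective exactly when two arcs cross. The head indices below are one plausible analysis of the earlier "I 'll give a talk tomorrow on bootstrapping" example, given just for illustration:

```python
def is_projective(heads):
    """heads[i-1] is the head position of word i (1-indexed; 0 = ROOT).
    Returns False if any two arcs cross when drawn above the sentence."""
    arcs = [(min(d, h), max(d, h)) for d, h in enumerate(heads, start=1)]
    for i, (l1, r1) in enumerate(arcs):
        for l2, r2 in arcs[i + 1:]:
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True

# I→give, 'll→give, give→ROOT, a→talk, talk→give, tomorrow→give, on→bootstrapping, bootstrapping→talk
print(is_projective([3, 3, 0, 5, 3, 3, 8, 5]))   # False: give→tomorrow crosses talk→bootstrapping
```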

Handling non-projectivity

• The arc-standard algorithm we presented only builds projective dependency trees
• Possible directions to head:
  1. Just declare defeat on non-projective arcs
  2. Use a dependency formalism which only admits projective representations (a CFG doesn't represent such structures…)
  3. Use a postprocessor to a projective dependency parsing algorithm to identify and resolve non-projective links
  4. Add extra transitions that can model at least most non-projective structures (e.g., add an extra SWAP transition, cf. bubble sort)
  5. Move to a parsing mechanism that does not use or require any constraints on projectivity (e.g., the graph-based MSTParser)

5. Why train a neural dependency parser? Indicator Features Revisited

• Problem #1: sparse
• Problem #2: incomplete
• Problem #3: expensive computation

More than 95% of parsing time is consumed by feature computation.

Our approach: learn a dense and compact feature representation
(a dense vector, e.g., 0.1 0.9 -0.2 0.3 -0.1 -0.5 …, dim ≈ 1000)

A neural dependency parser [Chen and Manning 2014]

• English parsing to Stanford Dependencies:
  • Unlabeled attachment score (UAS) = head
  • Labeled attachment score (LAS) = head and label

Parser        UAS    LAS    sent./s
MaltParser    89.8   87.2   469
MSTParser     91.4   88.1   10
TurboParser   92.3*  89.6*  8
C & M 2014    92.0   89.7   654

Distributed Representations

• We represent each word as a d-dimensional dense vector (i.e., word embedding)
  • Similar words are expected to have close vectors.
• Meanwhile, part-of-speech tags (POS) and dependency labels are also represented as d-dimensional vectors.
  • The smaller discrete sets also exhibit many semantic similarities.

[Embedding-space illustration: come/go, were/was/is, good]

NNS (plural noun) should be close to NN (singular noun).
num (numerical modifier) should be close to amod (adjective modifier).

Extracting Tokens and then vector representations from configuration

• We extract a set of tokens based on the stack/buffer positions:

  token:  s1    s2    b1       lc(s1)  rc(s1)  lc(s2)  rc(s2)
  word:   good  has   control  ∅       ∅       He      ∅
  POS:    JJ    VBZ   NN       ∅       ∅       PRP     ∅
  dep.:   ∅     ∅     ∅        ∅       ∅       nsubj   ∅

• We convert them to vector embeddings and concatenate them
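A rough sketch of that lookup-and-concatenate step (names are illustrative; `emb_word`, `emb_pos`, and `emb_dep` are assumed dicts mapping symbols, including a '<NULL>' placeholder for ∅, to d-dimensional numpy vectors):

```python
import numpy as np

def input_vector(tokens, emb_word, emb_pos, emb_dep):
    """tokens is a list of (word, pos, dep) triples for s1, s2, b1, lc(s1), ...
    e.g. [('good', 'JJ', None), ('has', 'VBZ', None), ('control', 'NN', None), ...]."""
    parts = []
    for word, pos, dep in tokens:
        parts.append(emb_word.get(word, emb_word["<NULL>"]))
        parts.append(emb_pos.get(pos, emb_pos["<NULL>"]))
        parts.append(emb_dep.get(dep, emb_dep["<NULL>"]))
    return np.concatenate(parts)    # this is the x fed to the network below
```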

Model Architecture

Input layer x:   lookup + concat
Hidden layer h:  h = ReLU(Wx + b1)
Output layer y:  y = softmax(Uh + b2)

Softmax probabilities; the cross-entropy error will be back-propagated to the embeddings.
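A minimal numpy sketch of one forward pass of this network (toy dimensions chosen only for illustration; in the real parser x concatenates the embeddings of all extracted tokens and y ranges over the legal transitions):

```python
import numpy as np

def forward(x, W, b1, U, b2):
    """Single hidden layer with ReLU, then a softmax over transitions."""
    h = np.maximum(W @ x + b1, 0.0)               # h = ReLU(Wx + b1)
    scores = U @ h + b2
    scores -= scores.max()                        # numerical stability
    return np.exp(scores) / np.exp(scores).sum()  # y = softmax(Uh + b2)

rng = np.random.default_rng(0)
x = rng.normal(size=120)                          # toy input vector
W, b1 = 0.01 * rng.normal(size=(200, 120)), np.zeros(200)
U, b2 = 0.01 * rng.normal(size=(3, 200)), np.zeros(3)
print(forward(x, W, b1, U, b2))                   # probabilities of Shift / Left-Arc / Right-Arc
```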

Non-linearities between layers: Why they're needed

• For logistic regression: map to probabilities
• Here: function approximation, e.g., for regression or classification
• Without non-linearities, deep neural networks can't do anything more than a linear transform
  • Extra layers could just be compiled down into a single linear transform
• People use various non-linearities

Non-linearities: What's used

logistic ("sigmoid"), tanh

tanh is just a rescaled and shifted sigmoid: tanh(z) = 2·logistic(2z) − 1
tanh is often used and often performs better for deep nets
• Its output is symmetric around 0

hard tanh, softsign, linear rectifier (ReLU)

• hard tanh is similar to, but computationally cheaper than, tanh and saturates hard.
• [Glorot and Bengio AISTATS 2010, 2011] discuss softsign and rectifier
• The rectified linear unit is now mega common – it transfers a linear signal if active

rect(z) = max(z, 0)
softsign(z) = z / (1 + |z|)
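For concreteness, a quick numpy sketch of these non-linearities, including a check of the tanh/logistic identity above:

```python
import numpy as np

z = np.linspace(-3, 3, 7)
logistic = 1 / (1 + np.exp(-z))
hardtanh = np.clip(z, -1.0, 1.0)
softsign = z / (1 + np.abs(z))
relu     = np.maximum(z, 0.0)

# tanh is a rescaled, shifted sigmoid: tanh(z) = 2 * logistic(2z) - 1
assert np.allclose(np.tanh(z), 2 / (1 + np.exp(-2 * z)) - 1)
```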

Dependency parsing for sentence structure

Neural networks can accurately determine the structure of sentences, supporting interpretation.

Chen and Manning (2014) was the first simple, successful neural dependency parser.

The dense representations let it outperform other greedy parsers in both accuracy and speed.

Further developments in transition-based neural dependency parsing

This work was further developed and improved by others, including in particular at Google:
• Bigger, deeper networks with better tuned hyperparameters
• Beam search
• Global, conditional random field (CRF)-style inference over the decision sequence

Leading to SyntaxNet and the Parsey McParseFace model
https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html

Method                 UAS     LAS (PTB WSJ SD 3.3)
Chen & Manning 2014    92.0    89.7
Weiss et al. 2015      93.99   92.05
Andor et al. 2016      94.61   92.79
