
Page 1

Introduction to NLP
Data-Driven Dependency Parsing

Prof. Reut Tsarfaty
Bar Ilan University

November 24, 2020

Page 2

Statistical Parsing

The Big Picture

Page 5

Statistical Parsing: The Big Picture

The Questions
• What kind of Trees?
• What kind of Models?
  • Generative
  • Discriminative
• Which Search Algorithm (Decoding)?
• Which Learning Algorithm (Training)?
• What kind of Evaluation?

Page 6

Statistical Parsing: The Big Picture

Previously on NLP@BIU

Representation   Phrase-Structure Trees
Model            Generative
Objective        Probabilistic
Search           CKY
Train            Maximum Likelihood
Evaluation       Precision/Recall/F1

Page 7

Today: Introduction to Dependency Parsing

Today: More Modeling Choices:

Representation   Constituency Trees   Dependency Trees
Model            Generative           ?
Objective        Probabilistic        ?
Search           Exhaustive           ?
Train            MLE                  ?
Evaluation       F1-Scores            Attachment Scores

Page 8

Introduction to Dependency Parsing

• The purpose of Syntactic Structures:
  • Encode Predicate-Argument Structures
  • Who Does What to Whom? (When, Where, Why...)
• Properties of Dependency Structures:
  • Defined as (labeled) binary relations between words
  • Reflect a long linguistic (European) tradition
  • Explicitly represent Argument Structure

Page 9

Representation: Labeled vs. Unlabeled

Unlabeled Dependency Tree (for “workers dumped sacks into bins”):

–ROOT– → dumped
dumped → workers
dumped → sacks
dumped → into
into → bins

Labeled Dependency Tree:

–ROOT– → dumped
dumped –subj→ workers
dumped –dobj→ sacks
dumped –prep→ into
into –pobj→ bins

Page 10

Representation: Functional vs. Lexical

Functional Dependencies:

–ROOT– → dumped
dumped –subj→ workers
dumped –dobj→ sacks
dumped –prep→ into
into –pobj→ bins

Lexical Dependencies:

–ROOT– → dumped
dumped –subj→ workers
dumped –dobj→ sacks
dumped –nmod→ bins
bins –case→ into

Page 11

Discussions: Options and Schemes

Vertical vs. Horizontal Representation: http://nlp.stanford.edu:8080/corenlp/

The Universal Dependencies Initiative: https://universaldependencies.org/

Page 12

Let’s Analyse!

The cat sat on the mat .

Page 13

Let’s Analyse!

The cat is on the mat .

Page 14

Let’s Analyse!

The cat , which I met , is sitting on the mat .

Page 15

Let’s Analyse!

The dog and the cat sat on the big and fluffy mat

You should know how to read/analyse these!

Page 17

Dependency Trees: Formal Definition

• A labeled dependency tree is a labeled directed tree T:
  • a set V of nodes, labeled with words (including ROOT)
  • a set A of arcs, labeled with dependency types
  • a linear precedence order < on V
• Notation:
  • An arc 〈v1, v2〉 connects head v1 with dependent v2
  • An arc 〈v1, l, v2〉 connects head v1 with dependent v2, with label l ∈ L
  • A node v0 (ROOT) serves as the unique root of the tree

Page 18

Properties of Dependency Trees

A dependency tree T is:
• connected:
  for every node i there is a node j such that i → j or j → i
• acyclic:
  if i → j then not j →* i
• single-head:
  if i → j then not k → j for any k ≠ i
• projective:
  if i → j then i →* k for any k such that i < k < j
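These properties can be checked mechanically once a tree is encoded as a head array (heads[d] = h for the arc h → d, with head 0 standing for ROOT). A minimal Python sketch under that encoding; the names are illustrative, not from the lecture:

def is_tree(heads):
    # Single-head holds by construction (one heads[d] per token); check
    # connectedness/acyclicity: every token must reach ROOT (node 0).
    n = len(heads) - 1
    for d in range(1, n + 1):
        seen, k = set(), d
        while k != 0:
            if k in seen:
                return False  # cycle
            seen.add(k)
            k = heads[k]
    return True

def is_projective(heads):
    # An arc (h, d) is projective iff every token strictly between h and d
    # is reachable from h by following head links upward from that token.
    def reaches(h, k):
        while k != 0:
            if k == h:
                return True
            k = heads[k]
        return h == 0
    n = len(heads) - 1
    for d in range(1, n + 1):
        h = heads[d]
        lo, hi = min(h, d), max(h, d)
        if any(not reaches(h, k) for k in range(lo + 1, hi)):
            return False
    return True

# "workers dumped sacks into bins"; heads[0] is a ROOT placeholder
heads = [0, 2, 0, 2, 2, 4]
assert is_tree(heads) and is_projective(heads)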

Page 19

Non-Projective Dependency Trees

Page 20

Non-Projective Dependency Trees

Many parsing algorithms are restricted to projective dependency trees. Is this a problem?

Statistics from the CoNLL-X Shared Task 2006 (NPD = non-projective dependencies; NPS = non-projective sentences):

Language     %NPD   %NPS
Dutch         5.4   36.4
German        2.3   27.8
Czech         1.9   23.2
Slovene       1.9   22.2
Portuguese    1.3   18.9
Danish        1.0   15.6

We will (mostly) focus on projective dependencies.

Page 21

Evaluation Metrics

I Unlabeled Attachement Scores (UAS)The percentage of identical arcsfrom the total number or arcs in the treeUAS =

Aintersect(i,j)n

I Labeled Attachement Scores (LAS)The percentage of identical arcs with identical labelsfrom the total number or arcs in the treeLAS =

Aintersect(i,l,j)n

I Root AccuracyThe percentage of sentences with correct root dependency

I Exact MatchThe percentage of sentences with parses identical to gold
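For concreteness, both attachment scores reduce to a few lines over head/label arrays. A minimal sketch (names illustrative):

def attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels):
    # UAS/LAS for a single sentence; token i has head heads[i], label labels[i]
    n = len(gold_heads)
    uas = sum(g == p for g, p in zip(gold_heads, pred_heads)) / n
    las = sum(g == p and gl == pl
              for g, p, gl, pl in zip(gold_heads, pred_heads,
                                      gold_labels, pred_labels)) / n
    return uas, las

# "workers dumped sacks into bins", with one misattached arc in the prediction
uas, las = attachment_scores(
    [2, 0, 2, 2, 4], ["subj", "root", "dobj", "prep", "pobj"],
    [2, 0, 2, 3, 4], ["subj", "root", "dobj", "prep", "pobj"])
print(uas, las)  # 0.8 0.8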

Page 22

Models for Dependency Parsing

The Parsing Objective:

  y* = argmax_{y ∈ GEN(x)} Score(y)

The Modeling Choices:

Representation   Dependency Trees
Model            ?
Decoder          ?
Trainer          ?
Evaluation       Attachment Scores

Page 23

Modeling Methods

Our Modeling Tasks:

  y* = argmax_{y ∈ GEN(x)} Score(y)

• GEN: How do we generate all candidates y?
• Score: How do we score any y?
• argmax: How do we find the best y?

Page 24

Modeling Methods

• Conversion-Based: convert Phrase-Structure trees to Dependency Trees
• Grammar-Based: generative methods based on PCFGs
• Graph-Based: globally optimized, restricted features
• Transition-Based: locally optimal, unrestricted features
• Neural-Based

Page 25

Modeling Methods (1)

• Conversion-Based: Convert PS trees using a Head Table
• Grammar-Based
• Graph-Based
• Transition-Based

Phrase-structure tree for “workers dumped sacks into bins”:

(TOP (S (NP workers)
        (VP (VP (VP dumped) (NP sacks))
            (PP (P into) (NP bins)))))

Page 26

Modeling Methods (1)

• Conversion-Based: Convert PS trees using a Head Table

VP   →  VBD VBN MD VBZ VB VBG VBP VP
NP   ←  NN NX JJR CD JJ JJS RB
ADJP ←  NNS QP NN ADVP JJ VBN VBG
ADVP →  RB RBR RBS FW ADVP TO CD JJR
S    ←  VP S SBAR ADJP UCP NP
SQ   ←  VBZ VBD VBP VB MD PRD VP SQ
SBAR ←  S SQ SINV SBAR FRAG IN DT
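A table like this drives a simple recursive conversion: pick each constituent's head child (searching left or right by priority), percolate lexical heads upward, and emit an arc from the parent's head word to the head word of every non-head child. A minimal sketch under that reading; the tiny head table and the tree encoding here are illustrative, not the full table above:

# category -> (search direction, priority list of head-child labels)
HEAD_TABLE = {
    "S":  ("left",  ["VP", "S", "SBAR", "NP"]),
    "VP": ("left",  ["VBD", "VB", "VP", "NP"]),
    "NP": ("right", ["NN", "NNS", "NP"]),
    "PP": ("left",  ["P", "IN"]),
}

def head_child(label, children):
    direction, priorities = HEAD_TABLE.get(label, ("left", []))
    order = children if direction == "left" else list(reversed(children))
    for want in priorities:
        for child in order:
            if child[0] == want:
                return child
    return order[0]  # fallback: first child in search order

def convert(tree, arcs):
    # Trees are (label, children) pairs; leaves are (pos, word).
    # Returns the lexical head of `tree`, appending (head, dependent) arcs.
    label, rest = tree
    if isinstance(rest, str):
        return rest
    child_heads = [convert(child, arcs) for child in rest]
    h = rest.index(head_child(label, rest))
    for i, dep in enumerate(child_heads):
        if i != h:
            arcs.append((child_heads[h], dep))
    return child_heads[h]

tree = ("S", [("NP", [("NN", "workers")]),
              ("VP", [("VP", [("VBD", "dumped"), ("NP", [("NNS", "sacks")])]),
                      ("PP", [("P", "into"), ("NP", [("NNS", "bins")])])])])
arcs = []
root = convert(tree, arcs)
print(root, arcs)
# dumped [('dumped', 'sacks'), ('into', 'bins'), ('dumped', 'into'), ('dumped', 'workers')]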

Page 27

Where Do Dependency Trees Come From?

(S (NP workers)
   (VP (VP (V dumped) (NP sacks))
       (PP (P into) (NP bins))))

Page 28

Where Do Dependency Trees Come From?

(S/dumped (NP/workers workers)
          (VP/dumped (VP/dumped (V/dumped dumped) (NP/sacks sacks))
                     (PP/into (P/into into) (NP/bins bins))))

Page 29

Where do Dependency Trees Come From?

The same tree with each category label replaced by its lexical head:

(dumped (workers workers)
        (dumped (dumped (dumped dumped) (sacks sacks))
                (into (into into) (bins bins))))

Page 31

Where do Dependency Trees Come From?

–ROOT– → dumped
dumped → workers
dumped → sacks
dumped → into
into → bins

Page 32

Modeling Methods (2)

✓ Conversion-Based
• Grammar-Based
• Graph-Based
• Transition-Based

(TOP (dumped (workers workers)
             (dumped (dumped (dumped dumped) (sacks sacks))
                     (into (into into) (bins bins)))))

Page 33

Grammar-Based Dependency Parsing

The Basic Idea
• Treat bi-lexical dependencies as constituents
• Decode using a chart-based algorithm (e.g., CKY)
• Learn using standard MLE methods
• Evaluate over the set of resulting dependencies as usual

Relevant Studies
• Original version: [Hays 1964]
• Link Grammar: [Sleator and Temperley 1991]
• Earley-style left-corner: [Lombardo and Lesmo 1996]
• Bilexical grammars: [Eisner 1996a, 1996b; Eisner 2000]

http://cs.jhu.edu/~jason/papers/eisner.coling96.pdf
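To make the chart-based idea concrete, here is a minimal sketch of Eisner-style decoding: an O(n³) dynamic program over complete and incomplete spans that returns the score of the best projective tree under an arc-scoring function. Backpointers, omitted for brevity, would recover the arcs. This is an illustrative reconstruction, not code from the lecture:

NEG_INF = float("-inf")

def eisner_best_score(score):
    # score[h][d] = score of arc h -> d; node 0 is ROOT, nodes 1..n are words.
    # C[i][j][s]: best complete span i..j; I[i][j][s]: best incomplete span.
    # s indexes the head side: 1 = head at the left end, 0 = head at the right.
    n = len(score) - 1
    C = [[[0.0, 0.0] for _ in range(n + 1)] for _ in range(n + 1)]
    I = [[[NEG_INF, NEG_INF] for _ in range(n + 1)] for _ in range(n + 1)]
    for span in range(1, n + 1):
        for i in range(n + 1 - span):
            j = i + span
            # attach: join two facing complete spans with a new arc
            best = max(C[i][r][1] + C[r + 1][j][0] for r in range(i, j))
            I[i][j][0] = best + score[j][i]   # arc j -> i
            I[i][j][1] = best + score[i][j]   # arc i -> j
            # close: extend an incomplete span with a complete one
            C[i][j][0] = max(C[i][r][0] + I[r][j][0] for r in range(i, j))
            C[i][j][1] = max(I[i][r][1] + C[r][j][1] for r in range(i + 1, j + 1))
    return C[0][n][1]  # best projective tree with ROOT (node 0) as head

In the grammar-based setting, the arc scores would be (smoothed) log-probabilities derived from the bilexical grammar, so the same chart computes the probabilistic objective on the next slide.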

Page 35

Grammar-Based Dependency Parsing

The Objective Function:

  t* = argmax_{t ∈ GEN(x)} P(t)

The Modeling Choices:

Representation   Dependency Trees
Model            PCFG
Decoder          Adapted CKY
Trainer          Smoothed MLE
Evaluation       Attachment Scores

Page 36

Modeling Methods (3)

✓ Conversion-Based
✓ Grammar-Based
• Graph-Based
• Transition-Based

Page 37

Graph-Based Dependency Parsing

The Basic IdeaI Define a global Arc-Factored modelI Treat the search as an MST problemI Treat the learning as a classification problemI Evaluate over the set of gold dependencies as usual

Page 39

Graph-Based Dependency Parsing

Step 1: Defining the Arc-Factored Model

  t* = argmax_{t ∈ GEN(V)} w·Φ(t)
     = argmax_{t ∈ GEN(V)} Σ_{(i→j) ∈ t} w·φ_arc(i → j)

Page 40

Graph-Based Dependency Parsing

Step 2: Defining Feature Templates

Name                             φ_i(had, OBJ, effect)   w_i
Unigram head                     “had”                   w_unihead
Unigram dep                      “effect”                w_unidep
Unigram head POS                 VB                      w_uniheadpos
Unigram dep POS                  NN                      w_unideppos
Bigram head-dep                  “had-effect”            w_bigram
Bigram headpos-deppos            VB-NN                   w_bigrampos
Labeled bigram head-dep          “had-OBJ-effect”        w_bigramlabel
Labeled bigram headpos-deppos    VB-obj-NN               w_bigramposlabel
In-between POS                   VB-IN-NN                w_inbetween
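In code, each template is a string-valued function of the candidate arc, and a sparse weight vector scores the arc as the sum of its features' weights, matching the w·φ_arc objective above. A minimal sketch (feature names illustrative):

def arc_features(sent, pos, h, d, label):
    # templates for a candidate arc h -> d (token indices into sent/pos)
    feats = [
        f"head_word={sent[h]}",
        f"dep_word={sent[d]}",
        f"head_pos={pos[h]}",
        f"dep_pos={pos[d]}",
        f"head_dep={sent[h]}_{sent[d]}",
        f"head_dep_pos={pos[h]}_{pos[d]}",
        f"labeled_head_dep={sent[h]}_{label}_{sent[d]}",
        f"labeled_head_dep_pos={pos[h]}_{label}_{pos[d]}",
    ]
    # one in-between POS feature per token between head and dependent
    for k in range(min(h, d) + 1, max(h, d)):
        feats.append(f"between={pos[h]}_{pos[k]}_{pos[d]}")
    return feats

def arc_score(weights, sent, pos, h, d, label):
    # dot product of a sparse weight dict with the arc's indicator features
    return sum(weights.get(f, 0.0) for f in arc_features(sent, pos, h, d, label))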

Page 42

Graph-Based Dependency Parsing

Step 3: Learning
E.g., the Perceptron
• Theory:
  - Find w that assigns a higher score to the gold tree y_i than to any other y ∈ Y
  - If separation exists, the algorithm will learn to separate the correct structure from the incorrect structures
• Practice:
  - Training requires repeated inference and update
  - Computing feature values is time-consuming
  - The Averaged-Perceptron variant is preferred
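A minimal sketch of the structured-perceptron update for this setting, assuming a decode(weights, sent, pos) function that scores all candidate arcs (e.g., with the arc_score sketch above) and runs an MST decoder (see the sketch after the next slide); all names are illustrative:

def tree_features(sent, pos, heads, labels):
    # indicator features of a full tree = sum over its arcs
    feats = []
    for d in range(1, len(sent)):
        feats.extend(arc_features(sent, pos, heads[d], d, labels[d]))
    return feats

def perceptron_update(weights, sent, pos, gold_heads, gold_labels, decode):
    # predict with the current weights, then move the weights toward the
    # gold tree's features and away from the predicted tree's features
    pred_heads, pred_labels = decode(weights, sent, pos)
    if (pred_heads, pred_labels) != (gold_heads, gold_labels):
        for f in tree_features(sent, pos, gold_heads, gold_labels):
            weights[f] = weights.get(f, 0.0) + 1.0
        for f in tree_features(sent, pos, pred_heads, pred_labels):
            weights[f] = weights.get(f, 0.0) - 1.0

In practice, the averaged variant keeps a running sum of the weight vectors across updates and returns their mean, which markedly improves generalization.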

Page 43

Graph-Based Dependency Parsing

Step 4: Finding the Max-Spanning Tree
The Chu-Liu-Edmonds Algorithm

Runtime complexity: O(n²)
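For reference, the maximum spanning arborescence (the directed MST needed here) is also available off the shelf. A minimal sketch using networkx, a library choice of this note rather than of the lecture:

import networkx as nx

def mst_decode(score, n):
    # score[h][d]: arc score h -> d over nodes 0..n, with node 0 = ROOT.
    # Only ROOT lacks incoming edges, so the arborescence is rooted at 0.
    G = nx.DiGraph()
    for h in range(n + 1):
        for d in range(1, n + 1):
            if h != d:
                G.add_edge(h, d, weight=score[h][d])
    tree = nx.maximum_spanning_arborescence(G)
    heads = [0] * (n + 1)
    for h, d in tree.edges():
        heads[d] = h
    return heads  # heads[d] for d = 1..n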

Page 44

Graph-Based Dependency Parsing

Step 3: Online Learning
Perceptron / MIRA (Margin-Infused Relaxed Algorithm)

Step 4: Max-Spanning-Tree Decoding
The Chu-Liu-Edmonds Algorithm (CLE)

http://repository.upenn.edu/cgi/viewcontent.cgi?article=1056&context=cis_reports

Page 45

Graph-Based Dependency Parsing

The Objective Function:

  t* = argmax_{t ∈ GEN(x)} Σ_{a ∈ arcs(t)} w^T Φ(a)

The Modeling Choices:

Representation   Dependency Trees
Model            Graph-Based, Arc-Factored
Decoder          MST/CLE, O(n²)
Trainer          Perceptron/MIRA
Evaluation       Attachment Scores

Page 46

Modeling Methods (4)

✓ Conversion-Based
✓ Grammar-Based
✓ Graph-Based
• Transition-Based

Page 47

Transition-Based Dependency Parsing

The Basic Idea
• Define a transition system
• Define an Oracle algorithm for decoding
• Approximate the Oracle algorithm via learning
• Evaluate over dependency arcs as usual

http://stp.lingfil.uu.se/~nivre/docs/BeyondMaltParser.pdf

Page 48

Transition-Based Dependency Parsing

Defining Configurations
A parser configuration is a triplet c = (S, Q, A), where
• S = a stack [..., w_i]_S of partially processed nodes
• Q = a queue [w_j, ...]_Q of remaining input nodes
• A = a set of labeled arcs (w_i, l, w_j)

Initialization: c_0 = ([w_0]_S, [w_1, ..., w_n]_Q, {}), where w_0 = ROOT
Termination: c_t = ([w_0]_S, []_Q, A)

Page 49

Transition-Based Dependency Parsing

Defining Transitions
• Shift:
  ([...]_S, [w_i, ...]_Q, A) → ([..., w_i]_S, [...]_Q, A)
• Arc-Left(l):
  ([..., w_i, w_j]_S, Q, A) → ([..., w_j]_S, Q, A ∪ {(w_j, l, w_i)})
• Arc-Right(l):
  ([..., w_i, w_j]_S, Q, A) → ([..., w_i]_S, Q, A ∪ {(w_i, l, w_j)})
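These transitions (the arc-standard system) translate directly into code over a (stack, queue, arcs) triple. A minimal sketch using token indices, with configurations treated immutably for clarity:

def shift(c):
    stack, queue, arcs = c
    return stack + [queue[0]], queue[1:], arcs

def arc_left(c, label):
    stack, queue, arcs = c
    *rest, wi, wj = stack               # adds (w_j, label, w_i), pops w_i
    return rest + [wj], queue, arcs | {(wj, label, wi)}

def arc_right(c, label):
    stack, queue, arcs = c
    *rest, wi, wj = stack               # adds (w_i, label, w_j), pops w_j
    return rest + [wi], queue, arcs | {(wi, label, wj)}

def initial(n):
    return [0], list(range(1, n + 1)), frozenset()   # w_0 = ROOT

def terminal(c):
    stack, queue, _ = c
    return queue == [] and stack == [0]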

Page 50

Transition-Based Dependency Parsing

Demo Deck

Page 51

Transition-Based Dependency Parsing

Deterministic Parsing
Given an oracle O that correctly predicts the next transition O(c), parsing is deterministic:

PARSE(w_1, ..., w_n):
  c ← ([w_0]_S, [w_1, ..., w_n]_Q, {})
  while Q_c ≠ [] or |S_c| > 1:
    t ← O(c)
    c ← t(c)
  return T = (w_0, w_1, ..., w_n, A_c)

Page 52

Transition-Based Dependency Parsing

Data-Driven Parsing
We approximate the oracle O using a classifier Predict(c) that predicts the next transition from features of c, feats(c).

PARSE(w_1, ..., w_n):
  c ← ([w_0]_S, [w_1, ..., w_n]_Q, {})
  while Q_c ≠ [] or |S_c| > 1:
    t ← Predict(w, feats(c))
    c ← t(c)
  return T = (w_0, w_1, ..., w_n, A_c)
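Together with the transition functions sketched earlier, this loop is the whole parser. A minimal sketch with a hand-written stub in place of a trained classifier:

def parse(n, predict):
    # run predicted transitions from the initial configuration to termination
    c = initial(n)
    while not terminal(c):
        c = predict(c)(c)
    return c[2]  # the arc set A

# stub predictor for "workers dumped sacks" (gold arcs: 2->1, 2->3, 0->2)
def stub_predict(c):
    stack, queue, _ = c
    if stack[-2:] == [1, 2]:
        return lambda c: arc_left(c, "subj")
    if stack[-2:] == [2, 3]:
        return lambda c: arc_right(c, "dobj")
    if queue:
        return shift
    return lambda c: arc_right(c, "root")

print(parse(3, stub_predict))
# frozenset({(2, 'subj', 1), (2, 'dobj', 3), (0, 'root', 2)})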

Page 53

Transition-Based Dependency Parsing

Feature Engineering

Name           Feature          Weight
S[0] word      effect           w1
S[0] POS       NN               w2
S[1] word      little           w3
S[1] POS       JJ               w4
Q[0] word      on               w5
Q[0] POS       P                w6
Q[1] word      financial        w7
Q[1] POS       JJ               w8
Root(A) word   had              w9
Root(A) POS    VB               w10
S[0]-S[1]      effect→little    w11
S[1]-S[0]      little→effect    w12
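As code, the table is a configuration-feature function over the stack, the queue, and (if desired) the arcs built so far. A minimal sketch covering most of the templates above (names illustrative; missing positions simply contribute no feature):

def feats(c, sent, pos):
    stack, queue, arcs = c
    f = []
    # words and POS tags at the top two stack and queue positions
    for name, seq in (("S", stack[::-1]), ("Q", queue)):
        for k in (0, 1):
            if k < len(seq):
                f.append(f"{name}[{k}]_word={sent[seq[k]]}")
                f.append(f"{name}[{k}]_pos={pos[seq[k]]}")
    # word-pair features over the top two stack items
    if len(stack) >= 2:
        s0, s1 = stack[-1], stack[-2]
        f.append(f"S0_S1={sent[s0]}_{sent[s1]}")
        f.append(f"S1_S0={sent[s1]}_{sent[s0]}")
    return f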

Page 55

Transition-Based Dependency Parsing

• An Oracle O can be approximated by a (linear) classifier:

  Predict(c) = argmax_t w·Φ(c, t)

• History-based features Φ(c, t):
  - Features over input words relative to S and Q
  - Features over the (partial) dependency tree defined by A
  - Features over the (partial) transition sequence so far
• Learning w from treebank data (see the oracle sketch below):
  - Reconstruct the oracle sequence for each sentence
  - Construct a training data set D = {(c, t) | O(c) = t}
  - Maximize the accuracy of local predictions O(c) = t
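The oracle-sequence reconstruction is typically done with a static oracle that replays the gold tree as a transition sequence. A minimal arc-standard sketch, valid for projective gold trees and reusing the transition functions sketched earlier:

def static_oracle(c, gold_heads, gold_labels):
    stack, queue, arcs = c
    if len(stack) >= 2:
        s1, s0 = stack[-2], stack[-1]
        # Arc-Left: s1 is a gold dependent of s0 (ROOT is never a dependent)
        if s1 != 0 and gold_heads[s1] == s0:
            return lambda c: arc_left(c, gold_labels[s1])
        # Arc-Right: s0 depends on s1 and has no pending dependents in the queue
        if gold_heads[s0] == s1 and all(gold_heads[q] != s0 for q in queue):
            return lambda c: arc_right(c, gold_labels[s0])
    return shift

def training_pairs(gold_heads, gold_labels):
    # replay the gold tree, collecting (configuration, transition) pairs;
    # feats(c) is applied to each configuration at training time
    c = initial(len(gold_heads) - 1)   # gold_heads[0] is a ROOT placeholder
    pairs = []
    while not terminal(c):
        t = static_oracle(c, gold_heads, gold_labels)
        pairs.append((c, t))
        c = t(c)
    return pairs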

Page 56

Transition-Based Dependency Parsing

Online Learning
Online learning algorithms

Step 4: Greedy Decoding
Greedy: at each step, select the maximum-scoring transition.

Page 57

Recent Advances: Neural-Network Models

The Basic Claim: Both graph-based and transition-based models benefit from the move to neural networks.
• Same overall approach and algorithms as before, but:
  → Replace the linear classifier with a non-linear MLP.
  → Use pre-trained word embeddings.
  → Replace the feature extractor with a Bi-LSTM (see the sketch below).
• Further explorations:
  → Semi-supervised learning.
  → Multi-task learning.
• Remaining challenges:
  → Out-of-domain parsing (e.g., Twitter)
  → Parsing Morphologically Rich Languages (e.g., Hebrew)
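As an illustration of the Bi-LSTM recipe (in the spirit of Kiperwasser and Goldberg's parser; this exact code is not from the lecture), a PyTorch sketch of an arc scorer: embed the words, encode with a Bi-LSTM, then score every head-dependent pair with an MLP:

import torch
import torch.nn as nn

class BiLSTMArcScorer(nn.Module):
    # score every candidate arc (h, d): embed -> BiLSTM -> pairwise MLP
    def __init__(self, vocab_size, emb_dim=100, hid_dim=128, mlp_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True,
                               bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(4 * hid_dim, mlp_dim),   # [head repr; dep repr]
            nn.Tanh(),
            nn.Linear(mlp_dim, 1),
        )

    def forward(self, tokens):
        # tokens: (1, n) word ids, with position 0 holding a ROOT symbol;
        # returns an (n, n) matrix of scores for arcs h -> d
        states, _ = self.encoder(self.emb(tokens))   # (1, n, 2*hid_dim)
        v = states.squeeze(0)
        n = v.size(0)
        heads = v.unsqueeze(1).expand(n, n, -1)      # row h: head vector
        deps = v.unsqueeze(0).expand(n, n, -1)       # column d: dep vector
        return self.mlp(torch.cat([heads, deps], dim=-1)).squeeze(-1)

scores = BiLSTMArcScorer(vocab_size=1000)(torch.tensor([[0, 5, 17, 42]]))
print(scores.shape)   # torch.Size([4, 4]); train with an arc or tree loss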

Page 58

Summarising Dependency Parsing

I Dependency trees as labeled bi-lexical dependenciesI Data-Driven parsing trained over Dependency Treebanks

I Varied Methods:I Conversion-Based (Rules)I Grammar-Based (Probabilistic)I Graph-Based (Linear, Globally Optimized)I Transition-Based (Linear, Locally Optimized)

I Neural Network models work the same but:I Non-linear objective eg. MLPI Better word-representations eg. Word EmbeddingsI Better (automatic) feature-extraction eg. BiLSTM

I English is “solved” — What about other languages?I Stanford CoreNLP https://corenlp.runI The UD Initiative: https://universaldependencies.org/I UDPipe: http://lindat.mff.cuni.cz/services/udpipe/I ONLP: nlp.biu.ac.il/~rtsarfaty/onlp/hebrew/

Page 60

NLP@BIU: Where We’re At

So Far
✓ Part 1: Introduction (classes 1-2)
✓ Part 2: Words/Sequences (classes 3-4)
✓ Part 3: Sentences/Trees (classes 5-6)
→ Part 4: Meanings (Prof. Ido Dagan, starting class 7)

To Be Continued...