
Page 1

Introduction to NLP
Data-Driven Dependency Parsing

Prof. Reut Tsarfaty
Bar Ilan University

November 24, 2020

Page 2

Statistical Parsing

The Big Picture

Page 5

Statistical Parsing: The Big Picture

The Questions
• What kind of Trees?
• What kind of Models?
  • Generative
  • Discriminative
• Which Search Algorithm (Decoding)?
• Which Learning Algorithm (Training)?
• What kind of Evaluation?

Page 6

Statistical Parsing: The Big Picture

Previously on NLP@BIU

Representation   Phrase-Structure Trees
Model            Generative
Objective        Probabilistic
Search           CKY
Train            Maximum Likelihood
Evaluation       Precision/Recall/F1

Page 7

Today: Introduction to Dependency Parsing

Today: More Modeling Choices:

Representation   Constituency Trees   Dependency Trees
Model            Generative           ?
Objective        Probabilistic        ?
Search           Exhaustive           ?
Train            MLE                  ?
Evaluation       F1-Scores            Attachment Scores

Page 8

Introduction to Dependency Parsing

• The purpose of Syntactic Structures:
  • Encode Predicate-Argument Structures
  • Who Does What to Whom? (When, Where, Why...)
• Properties of Dependency Structures:
  • Defined as (labeled) binary relations between words
  • Reflect a long linguistic (European) tradition
  • Explicitly represent Argument Structure

Page 9

Representation: Labeled vs. Unlabeled

Unlabeled Dependency Tree (for “workers dumped sacks into bins”):

–ROOT– → dumped
dumped → workers
dumped → sacks
dumped → into
into → bins

Labeled Dependency Tree:

–ROOT– → dumped
dumped –subj→ workers
dumped –dobj→ sacks
dumped –prep→ into
into –pobj→ bins

Page 10

Representation: Functional vs. Lexical

Functional Dependencies:

–ROOT– → dumped
dumped –subj→ workers
dumped –dobj→ sacks
dumped –prep→ into
into –pobj→ bins

Lexical Dependencies:

–ROOT– → dumped
dumped –subj→ workers
dumped –dobj→ sacks
dumped –nmod→ bins
bins –case→ into

Page 11

Discussions: Options and Schemes

Vertical vs. Horizontal Representation: http://nlp.stanford.edu:8080/corenlp/

The Universal Dependencies Initiative: https://universaldependencies.org/

Page 12

Let’s Analyse!

The cat sat on the mat .

Page 13

Let’s Analyse!

The cat is on the mat .

Page 14

Let’s Analyse!

The cat , which I met , is sitting on the mat .

Page 15

Let’s Analyse!

The dog and the cat sat on the big and fluffy mat

You should know how to read/analyse these!

Page 17

Dependency Trees: Formal Definition

• A labeled dependency tree is a labeled directed tree T:
  • a set V of nodes, labeled with words (including ROOT)
  • a set A of arcs, labeled with dependency types
  • a linear precedence order < on V
• Notation:
  • An arc 〈v1, v2〉 connects head v1 with dependent v2
  • An arc 〈v1, l, v2〉 connects head v1 with dependent v2, with label l ∈ L
  • A node v0 (ROOT) serves as the unique root of the tree

Page 18

Properties of Dependency Trees

A dependency tree T is:
• connected:
  for every node i there is a node j such that i → j or j → i
• acyclic:
  if i → j then not j →* i
• single-head:
  if i → j then not k → j for any k ≠ i
• projective:
  if i → j then i →* k for any k such that i < k < j
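These properties can be checked mechanically once a tree is encoded as a head array (heads[d] = h for the arc h → d, with head 0 standing for ROOT). A minimal Python sketch under that encoding; the names are illustrative, not from the lecture:

def is_tree(heads):
    # Single-head holds by construction (one heads[d] per token); check
    # connectedness/acyclicity: every token must reach ROOT (node 0).
    n = len(heads) - 1
    for d in range(1, n + 1):
        seen, k = set(), d
        while k != 0:
            if k in seen:
                return False  # cycle
            seen.add(k)
            k = heads[k]
    return True

def is_projective(heads):
    # An arc (h, d) is projective iff every token strictly between h and d
    # is reachable from h by following head links upward from that token.
    def reaches(h, k):
        while k != 0:
            if k == h:
                return True
            k = heads[k]
        return h == 0
    n = len(heads) - 1
    for d in range(1, n + 1):
        h = heads[d]
        lo, hi = min(h, d), max(h, d)
        if any(not reaches(h, k) for k in range(lo + 1, hi)):
            return False
    return True

# "workers dumped sacks into bins"; heads[0] is a ROOT placeholder
heads = [0, 2, 0, 2, 2, 4]
assert is_tree(heads) and is_projective(heads)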

Page 19

Non-Projective Dependency Trees

Page 20

Non-Projective Dependency Trees

Many parsing algorithms are restricted to projective dependency trees. Is this a problem?

Statistics from the CoNLL-X Shared Task 2006 (NPD = non-projective dependencies; NPS = non-projective sentences):

Language     %NPD   %NPS
Dutch         5.4   36.4
German        2.3   27.8
Czech         1.9   23.2
Slovene       1.9   22.2
Portuguese    1.3   18.9
Danish        1.0   15.6

We will (mostly) focus on projective dependencies.

Page 21

Evaluation Metrics

I Unlabeled Attachement Scores (UAS)The percentage of identical arcsfrom the total number or arcs in the treeUAS =

Aintersect(i,j)n

I Labeled Attachement Scores (LAS)The percentage of identical arcs with identical labelsfrom the total number or arcs in the treeLAS =

Aintersect(i,l,j)n

I Root AccuracyThe percentage of sentences with correct root dependency

I Exact MatchThe percentage of sentences with parses identical to gold
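For concreteness, both attachment scores reduce to a few lines over head/label arrays. A minimal sketch (names illustrative):

def attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels):
    # UAS/LAS for a single sentence; token i has head heads[i], label labels[i]
    n = len(gold_heads)
    uas = sum(g == p for g, p in zip(gold_heads, pred_heads)) / n
    las = sum(g == p and gl == pl
              for g, p, gl, pl in zip(gold_heads, pred_heads,
                                      gold_labels, pred_labels)) / n
    return uas, las

# "workers dumped sacks into bins", with one misattached arc in the prediction
uas, las = attachment_scores(
    [2, 0, 2, 2, 4], ["subj", "root", "dobj", "prep", "pobj"],
    [2, 0, 2, 3, 4], ["subj", "root", "dobj", "prep", "pobj"])
print(uas, las)  # 0.8 0.8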

Page 22

Models for Dependency Parsing

The Parsing Objective:

  y* = argmax_{y ∈ GEN(x)} Score(y)

The Modeling Choices:

Representation   Dependency Trees
Model            ?
Decoder          ?
Trainer          ?
Evaluation       Attachment Scores

Page 23

Modeling Methods

Our Modeling Tasks:

  y* = argmax_{y ∈ GEN(x)} Score(y)

• GEN: How do we generate all candidates y?
• Score: How do we score any y?
• argmax: How do we find the best y?

Page 24

Modeling Methods

• Conversion-Based: convert Phrase-Structure trees to Dependency Trees
• Grammar-Based: generative methods based on PCFGs
• Graph-Based: globally optimized, restricted features
• Transition-Based: locally optimal, unrestricted features
• Neural-Based

Page 25

Modeling Methods (1)

• Conversion-Based: Convert PS trees using a Head Table
• Grammar-Based
• Graph-Based
• Transition-Based

Phrase-structure tree for “workers dumped sacks into bins”:

(TOP (S (NP workers)
        (VP (VP (VP dumped) (NP sacks))
            (PP (P into) (NP bins)))))

Page 26

Modeling Methods (1)

• Conversion-Based: Convert PS trees using a Head Table

VP   →  VBD VBN MD VBZ VB VBG VBP VP
NP   ←  NN NX JJR CD JJ JJS RB
ADJP ←  NNS QP NN ADVP JJ VBN VBG
ADVP →  RB RBR RBS FW ADVP TO CD JJR
S    ←  VP S SBAR ADJP UCP NP
SQ   ←  VBZ VBD VBP VB MD PRD VP SQ
SBAR ←  S SQ SINV SBAR FRAG IN DT
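A table like this drives a simple recursive conversion: pick each constituent's head child (searching left or right by priority), percolate lexical heads upward, and emit an arc from the parent's head word to the head word of every non-head child. A minimal sketch under that reading; the tiny head table and the tree encoding here are illustrative, not the full table above:

# category -> (search direction, priority list of head-child labels)
HEAD_TABLE = {
    "S":  ("left",  ["VP", "S", "SBAR", "NP"]),
    "VP": ("left",  ["VBD", "VB", "VP", "NP"]),
    "NP": ("right", ["NN", "NNS", "NP"]),
    "PP": ("left",  ["P", "IN"]),
}

def head_child(label, children):
    direction, priorities = HEAD_TABLE.get(label, ("left", []))
    order = children if direction == "left" else list(reversed(children))
    for want in priorities:
        for child in order:
            if child[0] == want:
                return child
    return order[0]  # fallback: first child in search order

def convert(tree, arcs):
    # Trees are (label, children) pairs; leaves are (pos, word).
    # Returns the lexical head of `tree`, appending (head, dependent) arcs.
    label, rest = tree
    if isinstance(rest, str):
        return rest
    child_heads = [convert(child, arcs) for child in rest]
    h = rest.index(head_child(label, rest))
    for i, dep in enumerate(child_heads):
        if i != h:
            arcs.append((child_heads[h], dep))
    return child_heads[h]

tree = ("S", [("NP", [("NN", "workers")]),
              ("VP", [("VP", [("VBD", "dumped"), ("NP", [("NNS", "sacks")])]),
                      ("PP", [("P", "into"), ("NP", [("NNS", "bins")])])])])
arcs = []
root = convert(tree, arcs)
print(root, arcs)
# dumped [('dumped', 'sacks'), ('into', 'bins'), ('dumped', 'into'), ('dumped', 'workers')]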

Page 27

Where Do Dependency Trees Come From?

(S (NP workers)
   (VP (VP (V dumped) (NP sacks))
       (PP (P into) (NP bins))))

Page 28

Where Do Dependency Trees Come From?

(S/dumped (NP/workers workers)
          (VP/dumped (VP/dumped (V/dumped dumped) (NP/sacks sacks))
                     (PP/into (P/into into) (NP/bins bins))))

Page 29

Where do Dependency Trees Come From?

The same tree with each category label replaced by its lexical head:

(dumped (workers workers)
        (dumped (dumped (dumped dumped) (sacks sacks))
                (into (into into) (bins bins))))

Page 31

Where do Dependency Trees Come From?

–ROOT– → dumped
dumped → workers
dumped → sacks
dumped → into
into → bins

Page 32

Modeling Methods (2)

✓ Conversion-Based
• Grammar-Based
• Graph-Based
• Transition-Based

(TOP (dumped (workers workers)
             (dumped (dumped (dumped dumped) (sacks sacks))
                     (into (into into) (bins bins)))))

Page 33

Grammar-Based Dependency Parsing

The Basic Idea
• Treat bi-lexical dependencies as constituents
• Decode using a chart-based algorithm (e.g., CKY)
• Learn using standard MLE methods
• Evaluate over the set of resulting dependencies as usual

Relevant Studies
• Original version: [Hays 1964]
• Link Grammar: [Sleator and Temperley 1991]
• Earley-style left-corner: [Lombardo and Lesmo 1996]
• Bilexical grammars: [Eisner 1996a, 1996b; Eisner 2000]

http://cs.jhu.edu/~jason/papers/eisner.coling96.pdf
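To make the chart-based idea concrete, here is a minimal sketch of Eisner-style decoding: an O(n³) dynamic program over complete and incomplete spans that returns the score of the best projective tree under an arc-scoring function. Backpointers, omitted for brevity, would recover the arcs. This is an illustrative reconstruction, not code from the lecture:

NEG_INF = float("-inf")

def eisner_best_score(score):
    # score[h][d] = score of arc h -> d; node 0 is ROOT, nodes 1..n are words.
    # C[i][j][s]: best complete span i..j; I[i][j][s]: best incomplete span.
    # s indexes the head side: 1 = head at the left end, 0 = head at the right.
    n = len(score) - 1
    C = [[[0.0, 0.0] for _ in range(n + 1)] for _ in range(n + 1)]
    I = [[[NEG_INF, NEG_INF] for _ in range(n + 1)] for _ in range(n + 1)]
    for span in range(1, n + 1):
        for i in range(n + 1 - span):
            j = i + span
            # attach: join two facing complete spans with a new arc
            best = max(C[i][r][1] + C[r + 1][j][0] for r in range(i, j))
            I[i][j][0] = best + score[j][i]   # arc j -> i
            I[i][j][1] = best + score[i][j]   # arc i -> j
            # close: extend an incomplete span with a complete one
            C[i][j][0] = max(C[i][r][0] + I[r][j][0] for r in range(i, j))
            C[i][j][1] = max(I[i][r][1] + C[r][j][1] for r in range(i + 1, j + 1))
    return C[0][n][1]  # best projective tree with ROOT (node 0) as head

In the grammar-based setting, the arc scores would be (smoothed) log-probabilities derived from the bilexical grammar, so the same chart computes the probabilistic objective on the next slide.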

Page 35

Grammar-Based Dependency Parsing

The Objective Function:

  t* = argmax_{t ∈ GEN(x)} P(t)

The Modeling Choices:

Representation   Dependency Trees
Model            PCFG
Decoder          Adapted CKY
Trainer          Smoothed MLE
Evaluation       Attachment Scores

Page 36

Modeling Methods (3)

✓ Conversion-Based
✓ Grammar-Based
• Graph-Based
• Transition-Based

Page 37

Graph-Based Dependency Parsing

The Basic IdeaI Define a global Arc-Factored modelI Treat the search as an MST problemI Treat the learning as a classification problemI Evaluate over the set of gold dependencies as usual

Page 39

Graph-Based Dependency Parsing

Step 1: Defining the Arc-Factored Model

  t* = argmax_{t ∈ GEN(V)} w·Φ(t)
     = argmax_{t ∈ GEN(V)} Σ_{(i→j) ∈ t} w·φ_arc(i → j)

Page 40

Graph-Based Dependency Parsing

Step 2: Defining Feature Templates

Name                             φ_i(had, OBJ, effect)   w_i
Unigram head                     “had”                   w_unihead
Unigram dep                      “effect”                w_unidep
Unigram head POS                 VB                      w_uniheadpos
Unigram dep POS                  NN                      w_unideppos
Bigram head-dep                  “had-effect”            w_bigram
Bigram headpos-deppos            VB-NN                   w_bigrampos
Labeled bigram head-dep          “had-OBJ-effect”        w_bigramlabel
Labeled bigram headpos-deppos    VB-obj-NN               w_bigramposlabel
In-between POS                   VB-IN-NN                w_inbetween
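In code, each template is a string-valued function of the candidate arc, and a sparse weight vector scores the arc as the sum of its features' weights, matching the w·φ_arc objective above. A minimal sketch (feature names illustrative):

def arc_features(sent, pos, h, d, label):
    # templates for a candidate arc h -> d (token indices into sent/pos)
    feats = [
        f"head_word={sent[h]}",
        f"dep_word={sent[d]}",
        f"head_pos={pos[h]}",
        f"dep_pos={pos[d]}",
        f"head_dep={sent[h]}_{sent[d]}",
        f"head_dep_pos={pos[h]}_{pos[d]}",
        f"labeled_head_dep={sent[h]}_{label}_{sent[d]}",
        f"labeled_head_dep_pos={pos[h]}_{label}_{pos[d]}",
    ]
    # one in-between POS feature per token between head and dependent
    for k in range(min(h, d) + 1, max(h, d)):
        feats.append(f"between={pos[h]}_{pos[k]}_{pos[d]}")
    return feats

def arc_score(weights, sent, pos, h, d, label):
    # dot product of a sparse weight dict with the arc's indicator features
    return sum(weights.get(f, 0.0) for f in arc_features(sent, pos, h, d, label))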

Page 42

Graph-Based Dependency Parsing

Step 3: Learning
E.g., the Perceptron
• Theory:
  - Find w that assigns a higher score to the gold tree y_i than to any other y ∈ Y
  - If separation exists, the algorithm will learn to separate the correct structure from the incorrect structures
• Practice:
  - Training requires repeated inference and update
  - Computing feature values is time-consuming
  - The Averaged-Perceptron variant is preferred
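A minimal sketch of the structured-perceptron update for this setting, assuming a decode(weights, sent, pos) function that scores all candidate arcs (e.g., with the arc_score sketch above) and runs an MST decoder (see the sketch after the next slide); all names are illustrative:

def tree_features(sent, pos, heads, labels):
    # indicator features of a full tree = sum over its arcs
    feats = []
    for d in range(1, len(sent)):
        feats.extend(arc_features(sent, pos, heads[d], d, labels[d]))
    return feats

def perceptron_update(weights, sent, pos, gold_heads, gold_labels, decode):
    # predict with the current weights, then move the weights toward the
    # gold tree's features and away from the predicted tree's features
    pred_heads, pred_labels = decode(weights, sent, pos)
    if (pred_heads, pred_labels) != (gold_heads, gold_labels):
        for f in tree_features(sent, pos, gold_heads, gold_labels):
            weights[f] = weights.get(f, 0.0) + 1.0
        for f in tree_features(sent, pos, pred_heads, pred_labels):
            weights[f] = weights.get(f, 0.0) - 1.0

In practice, the averaged variant keeps a running sum of the weight vectors across updates and returns their mean, which markedly improves generalization.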

Page 43

Graph-Based Dependency Parsing

Step 4: Finding the Max-Spanning Tree
The Chu-Liu-Edmonds Algorithm

Runtime complexity: O(n²)
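For reference, the maximum spanning arborescence (the directed MST needed here) is also available off the shelf. A minimal sketch using networkx, a library choice of this note rather than of the lecture:

import networkx as nx

def mst_decode(score, n):
    # score[h][d]: arc score h -> d over nodes 0..n, with node 0 = ROOT.
    # Only ROOT lacks incoming edges, so the arborescence is rooted at 0.
    G = nx.DiGraph()
    for h in range(n + 1):
        for d in range(1, n + 1):
            if h != d:
                G.add_edge(h, d, weight=score[h][d])
    tree = nx.maximum_spanning_arborescence(G)
    heads = [0] * (n + 1)
    for h, d in tree.edges():
        heads[d] = h
    return heads  # heads[d] for d = 1..n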

Page 44

Graph-Based Dependency Parsing

Step 3: Online Learning
Perceptron / MIRA (Margin-Infused Relaxed Algorithm)

Step 4: Max-Spanning-Tree Decoding
The Chu-Liu-Edmonds Algorithm (CLE)

http://repository.upenn.edu/cgi/viewcontent.cgi?article=1056&context=cis_reports

Page 45

Graph-Based Dependency Parsing

The Objective Function:

  t* = argmax_{t ∈ GEN(x)} Σ_{a ∈ arcs(t)} w^T Φ(a)

The Modeling Choices:

Representation   Dependency Trees
Model            Graph-Based, Arc-Factored
Decoder          MST/CLE, O(n²)
Trainer          Perceptron/MIRA
Evaluation       Attachment Scores

Page 46

Modeling Methods (4)

✓ Conversion-Based
✓ Grammar-Based
✓ Graph-Based
• Transition-Based

Page 47

Transition-Based Dependency Parsing

The Basic Idea
• Define a transition system
• Define an Oracle algorithm for decoding
• Approximate the Oracle algorithm via learning
• Evaluate over dependency arcs as usual

http://stp.lingfil.uu.se/~nivre/docs/BeyondMaltParser.pdf

Page 48

Transition-Based Dependency Parsing

Defining Configurations
A parser configuration is a triplet c = (S, Q, A), where
• S = a stack [..., w_i]_S of partially processed nodes
• Q = a queue [w_j, ...]_Q of remaining input nodes
• A = a set of labeled arcs (w_i, l, w_j)

Initialization: c_0 = ([w_0]_S, [w_1, ..., w_n]_Q, {}), where w_0 = ROOT
Termination: c_t = ([w_0]_S, []_Q, A)

Page 49

Transition-Based Dependency Parsing

Defining Transitions
• Shift:
  ([...]_S, [w_i, ...]_Q, A) → ([..., w_i]_S, [...]_Q, A)
• Arc-Left(l):
  ([..., w_i, w_j]_S, Q, A) → ([..., w_j]_S, Q, A ∪ {(w_j, l, w_i)})
• Arc-Right(l):
  ([..., w_i, w_j]_S, Q, A) → ([..., w_i]_S, Q, A ∪ {(w_i, l, w_j)})
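These transitions (the arc-standard system) translate directly into code over a (stack, queue, arcs) triple. A minimal sketch using token indices, with configurations treated immutably for clarity:

def shift(c):
    stack, queue, arcs = c
    return stack + [queue[0]], queue[1:], arcs

def arc_left(c, label):
    stack, queue, arcs = c
    *rest, wi, wj = stack               # adds (w_j, label, w_i), pops w_i
    return rest + [wj], queue, arcs | {(wj, label, wi)}

def arc_right(c, label):
    stack, queue, arcs = c
    *rest, wi, wj = stack               # adds (w_i, label, w_j), pops w_j
    return rest + [wi], queue, arcs | {(wi, label, wj)}

def initial(n):
    return [0], list(range(1, n + 1)), frozenset()   # w_0 = ROOT

def terminal(c):
    stack, queue, _ = c
    return queue == [] and stack == [0]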

Page 50

Transition-Based Dependency Parsing

Demo Deck

Page 51

Transition-Based Dependency Parsing

Deterministic Parsing
Given an oracle O that correctly predicts the next transition O(c), parsing is deterministic:

PARSE(w_1, ..., w_n):
  c ← ([w_0]_S, [w_1, ..., w_n]_Q, {})
  while Q_c ≠ [] or |S_c| > 1:
    t ← O(c)
    c ← t(c)
  return T = (w_0, w_1, ..., w_n, A_c)

Page 52

Transition-Based Dependency Parsing

Data-Driven Parsing
We approximate the oracle O using a classifier Predict(c) that predicts the next transition from features of c, feats(c).

PARSE(w_1, ..., w_n):
  c ← ([w_0]_S, [w_1, ..., w_n]_Q, {})
  while Q_c ≠ [] or |S_c| > 1:
    t ← Predict(w, feats(c))
    c ← t(c)
  return T = (w_0, w_1, ..., w_n, A_c)
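Together with the transition functions sketched earlier, this loop is the whole parser. A minimal sketch with a hand-written stub in place of a trained classifier:

def parse(n, predict):
    # run predicted transitions from the initial configuration to termination
    c = initial(n)
    while not terminal(c):
        c = predict(c)(c)
    return c[2]  # the arc set A

# stub predictor for "workers dumped sacks" (gold arcs: 2->1, 2->3, 0->2)
def stub_predict(c):
    stack, queue, _ = c
    if stack[-2:] == [1, 2]:
        return lambda c: arc_left(c, "subj")
    if stack[-2:] == [2, 3]:
        return lambda c: arc_right(c, "dobj")
    if queue:
        return shift
    return lambda c: arc_right(c, "root")

print(parse(3, stub_predict))
# frozenset({(2, 'subj', 1), (2, 'dobj', 3), (0, 'root', 2)})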

Page 53

Transition-Based Dependency Parsing

Feature Engineering

Name           Feature          Weight
S[0] word      effect           w1
S[0] POS       NN               w2
S[1] word      little           w3
S[1] POS       JJ               w4
Q[0] word      on               w5
Q[0] POS       P                w6
Q[1] word      financial        w7
Q[1] POS       JJ               w8
Root(A) word   had              w9
Root(A) POS    VB               w10
S[0]-S[1]      effect→little    w11
S[1]-S[0]      little→effect    w12
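As code, the table is a configuration-feature function over the stack, the queue, and (if desired) the arcs built so far. A minimal sketch covering most of the templates above (names illustrative; missing positions simply contribute no feature):

def feats(c, sent, pos):
    stack, queue, arcs = c
    f = []
    # words and POS tags at the top two stack and queue positions
    for name, seq in (("S", stack[::-1]), ("Q", queue)):
        for k in (0, 1):
            if k < len(seq):
                f.append(f"{name}[{k}]_word={sent[seq[k]]}")
                f.append(f"{name}[{k}]_pos={pos[seq[k]]}")
    # word-pair features over the top two stack items
    if len(stack) >= 2:
        s0, s1 = stack[-1], stack[-2]
        f.append(f"S0_S1={sent[s0]}_{sent[s1]}")
        f.append(f"S1_S0={sent[s1]}_{sent[s0]}")
    return f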

Page 55

Transition-Based Dependency Parsing

• An Oracle O can be approximated by a (linear) classifier:

  Predict(c) = argmax_t w·Φ(c, t)

• History-based features Φ(c, t):
  - Features over input words relative to S and Q
  - Features over the (partial) dependency tree defined by A
  - Features over the (partial) transition sequence so far
• Learning w from treebank data (see the oracle sketch below):
  - Reconstruct the oracle sequence for each sentence
  - Construct a training data set D = {(c, t) | O(c) = t}
  - Maximize the accuracy of local predictions O(c) = t
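The oracle-sequence reconstruction is typically done with a static oracle that replays the gold tree as a transition sequence. A minimal arc-standard sketch, valid for projective gold trees and reusing the transition functions sketched earlier:

def static_oracle(c, gold_heads, gold_labels):
    stack, queue, arcs = c
    if len(stack) >= 2:
        s1, s0 = stack[-2], stack[-1]
        # Arc-Left: s1 is a gold dependent of s0 (ROOT is never a dependent)
        if s1 != 0 and gold_heads[s1] == s0:
            return lambda c: arc_left(c, gold_labels[s1])
        # Arc-Right: s0 depends on s1 and has no pending dependents in the queue
        if gold_heads[s0] == s1 and all(gold_heads[q] != s0 for q in queue):
            return lambda c: arc_right(c, gold_labels[s0])
    return shift

def training_pairs(gold_heads, gold_labels):
    # replay the gold tree, collecting (configuration, transition) pairs;
    # feats(c) is applied to each configuration at training time
    c = initial(len(gold_heads) - 1)   # gold_heads[0] is a ROOT placeholder
    pairs = []
    while not terminal(c):
        t = static_oracle(c, gold_heads, gold_labels)
        pairs.append((c, t))
        c = t(c)
    return pairs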

Page 56

Transition-Based Dependency Parsing

Online Learning
Online learning algorithms

Step 4: Greedy Decoding
Greedy: at each step, select the maximum-scoring transition.

Page 57

Recent Advances: Neural-Network Models

The Basic Claim: Both graph-based and transition-based models benefit from the move to neural networks.
• Same overall approach and algorithms as before, but:
  → Replace the linear classifier with a non-linear MLP.
  → Use pre-trained word embeddings.
  → Replace the feature extractor with a Bi-LSTM (see the sketch below).
• Further explorations:
  → Semi-supervised learning.
  → Multi-task learning.
• Remaining challenges:
  → Out-of-domain parsing (e.g., Twitter)
  → Parsing Morphologically Rich Languages (e.g., Hebrew)
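As an illustration of the Bi-LSTM recipe (in the spirit of Kiperwasser and Goldberg's parser; this exact code is not from the lecture), a PyTorch sketch of an arc scorer: embed the words, encode with a Bi-LSTM, then score every head-dependent pair with an MLP:

import torch
import torch.nn as nn

class BiLSTMArcScorer(nn.Module):
    # score every candidate arc (h, d): embed -> BiLSTM -> pairwise MLP
    def __init__(self, vocab_size, emb_dim=100, hid_dim=128, mlp_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True,
                               bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(4 * hid_dim, mlp_dim),   # [head repr; dep repr]
            nn.Tanh(),
            nn.Linear(mlp_dim, 1),
        )

    def forward(self, tokens):
        # tokens: (1, n) word ids, with position 0 holding a ROOT symbol;
        # returns an (n, n) matrix of scores for arcs h -> d
        states, _ = self.encoder(self.emb(tokens))   # (1, n, 2*hid_dim)
        v = states.squeeze(0)
        n = v.size(0)
        heads = v.unsqueeze(1).expand(n, n, -1)      # row h: head vector
        deps = v.unsqueeze(0).expand(n, n, -1)       # column d: dep vector
        return self.mlp(torch.cat([heads, deps], dim=-1)).squeeze(-1)

scores = BiLSTMArcScorer(vocab_size=1000)(torch.tensor([[0, 5, 17, 42]]))
print(scores.shape)   # torch.Size([4, 4]); train with an arc or tree loss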

Page 58

Summarising Dependency Parsing

I Dependency trees as labeled bi-lexical dependenciesI Data-Driven parsing trained over Dependency Treebanks

I Varied Methods:I Conversion-Based (Rules)I Grammar-Based (Probabilistic)I Graph-Based (Linear, Globally Optimized)I Transition-Based (Linear, Locally Optimized)

I Neural Network models work the same but:I Non-linear objective eg. MLPI Better word-representations eg. Word EmbeddingsI Better (automatic) feature-extraction eg. BiLSTM

I English is “solved” — What about other languages?I Stanford CoreNLP https://corenlp.runI The UD Initiative: https://universaldependencies.org/I UDPipe: http://lindat.mff.cuni.cz/services/udpipe/I ONLP: nlp.biu.ac.il/~rtsarfaty/onlp/hebrew/

Page 60

NLP@BIU: Where We’re At

So Far
✓ Part 1: Introduction (classes 1-2)
✓ Part 2: Words/Sequences (classes 3-4)
✓ Part 3: Sentences/Trees (classes 5-6)
→ Part 4: Meanings (Prof. Ido Dagan, starting class 7)

To Be Continued...