Introduction to NLP
Data-Driven Dependency Parsing
Prof. Reut Tsarfaty, Bar Ilan University
November 24, 2020
Statistical Parsing: The Big Picture
The Questions
- What kind of Trees?
- What kind of Models?
  - Generative
  - Discriminative
- Which Search Algorithm (Decoding)?
- Which Learning Algorithm (Training)?
- What kind of Evaluation?
Statistical Parsing: The Big Picture
Previously on NLP@BIU
Representation   Phrase-Structure Trees
Model            Generative
Objective        Probabilistic
Search           CKY
Train            Maximum Likelihood
Evaluation       Precision/Recall/F1
Today: Introduction to Dependency Parsing
Today: More Modeling Choices

Representation   Constituency Trees   Dependency Trees
Model            Generative           ?
Objective        Probabilistic        ?
Search           Exhaustive           ?
Train            MLE                  ?
Evaluation       F1-Scores            Attachment Scores
Introduction to Dependency Parsing
The purpose of Syntactic Structures:
- Encode Predicate-Argument Structures
- Who Does What to Whom? (When, Where, Why...)

Properties of Dependency Structures:
- Defined as (labeled) binary relations between words
- Reflect a long linguistic (European) tradition
- Explicitly represent Argument Structure
Representation: Labeled vs. Unlabeled
Unlabeled Dependency Tree (for "workers dumped sacks into bins"):
–ROOT– → dumped; dumped → workers, sacks, into; into → bins

Labeled Dependency Tree:
–ROOT– → dumped; dumped -subj→ workers; dumped -dobj→ sacks; dumped -prep→ into; into -pobj→ bins
Representation: Functional vs. Lexical
Functional Dependencies:
–ROOT– → dumped; dumped -subj→ workers; dumped -dobj→ sacks; dumped -prep→ into; into -pobj→ bins

Lexical Dependencies:
–ROOT– → dumped; dumped -subj→ workers; dumped -dobj→ sacks; dumped -nmod→ bins; bins -case→ into
Discussion: Options and Schemes

Vertical vs. Horizontal Representation: http://nlp.stanford.edu:8080/corenlp/
The Universal Dependencies Initiative: https://universaldependencies.org/
Let’s Analyse!
The cat sat on the mat .
Let’s Analyse!
The cat is on the mat .
Let’s Analyse!
The cat , which I met , is sitting on the mat .
Let’s Analyse!
The dog and the cat sat on the big and fluffy mat
You should know how to read/analyse these!
Dependency Trees: Formal Definition
A labeled dependency tree is a labeled directed tree T with:
- a set V of nodes, labeled with words (including ROOT)
- a set A of arcs, labeled with dependency types
- a linear precedence order < on V

Notation:
- An arc ⟨v1, v2⟩ connects head v1 with dependent v2
- An arc ⟨v1, l, v2⟩ connects head v1 with dependent v2, with label l ∈ L
- A node v0 (ROOT) serves as the unique root of the tree
Properties of Dependency Trees
A dependency tree T is:
- connected: for every node i there is a node j such that i → j or j → i
- acyclic: if i → j then not j →* i
- single-head: if i → j then not k → j for any k ≠ i
- projective: if i → j then i →* k for any k such that i < k < j

(Here →* denotes the transitive closure of →.) A minimal check of these properties is sketched below.
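To make the properties concrete, here is a minimal Python sketch (the encoding and function names are illustrative, not from the lecture): a tree over tokens 1..n is stored as an array of head indices, which makes the single-head property automatic, and the acyclicity and projectivity conditions above are checked directly.

# heads[j] = i means arc i -> j; node 0 is ROOT (heads[0] is a placeholder).
# Single-head holds by construction: each j has exactly one heads[j].

def is_acyclic(heads):
    # Follow head pointers from every node; we must reach ROOT (0)
    # without revisiting any node.
    for j in range(1, len(heads)):
        seen, i = set(), j
        while i != 0:
            if i in seen:
                return False
            seen.add(i)
            i = heads[i]
    return True

def is_projective(heads):
    # Arc i -> j is projective iff every k strictly between i and j
    # is dominated by i (i ->* k). Assumes the tree is acyclic.
    def dominates(i, k):
        while True:
            if k == i:
                return True
            if k == 0:
                return False
            k = heads[k]
    for j in range(1, len(heads)):
        i = heads[j]
        lo, hi = min(i, j), max(i, j)
        if not all(dominates(i, k) for k in range(lo + 1, hi)):
            return False
    return True

# "workers dumped sacks into bins" (unlabeled tree from the earlier slide):
heads = [0, 2, 0, 2, 2, 4]
print(is_acyclic(heads), is_projective(heads))  # True True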
Non-Projective Dependency Trees
Many parsing algorithms are restricted to projective dependency trees. Is this a problem? Statistics from the CoNLL-X Shared Task 2006 (NPD = non-projective dependencies, NPS = non-projective sentences):

Language     %NPD   %NPS
Dutch         5.4   36.4
German        2.3   27.8
Czech         1.9   23.2
Slovene       1.9   22.2
Portuguese    1.3   18.9
Danish        1.0   15.6
We will (mostly) focus on projective dependencies.
Evaluation Metrics
- Unlabeled Attachment Score (UAS): the percentage of predicted arcs identical to the gold arcs, out of the total number n of arcs in the tree:
  UAS = |A_intersect(i, j)| / n
- Labeled Attachment Score (LAS): the percentage of predicted arcs identical to the gold arcs, with identical labels:
  LAS = |A_intersect(i, l, j)| / n
- Root Accuracy: the percentage of sentences with the correct root dependency
- Exact Match: the percentage of sentences whose parse is identical to the gold parse
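A minimal sketch of how UAS and LAS might be computed, assuming gold and predicted trees are encoded as lists of (head, label) pairs, one per dependent (the encoding and names are illustrative):

def attachment_scores(gold, pred):
    # gold, pred: lists of (head, label) pairs, one per token.
    # Returns (UAS, LAS) as fractions of the n arcs in the tree.
    assert len(gold) == len(pred)
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return uas, las

# "workers dumped sacks into bins", dependents indexed 1..5:
gold = [(2, "subj"), (0, "root"), (2, "dobj"), (2, "prep"), (4, "pobj")]
pred = [(2, "subj"), (0, "root"), (2, "dobj"), (5, "prep"), (4, "pobj")]
print(attachment_scores(gold, pred))  # (0.8, 0.8)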
Models for Dependency Parsing
The Parsing Objective:

y* = argmax_{y ∈ GEN(x)} Score(y)

The Modeling Choices:

Representation   Dependency Trees
Model            ?
Decoder          ?
Trainer          ?
Evaluation       Attachment Scores
Modeling Methods
Our Modeling Tasks:

y* = argmax_{y ∈ GEN(x)} Score(y)

- GEN: How do we generate all candidate trees y?
- Score: How do we score any tree y?
- argmax: How do we find the best tree y?
Modeling Methods
- Conversion-Based: convert phrase-structure trees to dependency trees
- Grammar-Based: generative methods based on PCFGs
- Graph-Based: globally optimized, restricted features
- Transition-Based: locally optimal, unrestricted features
- Neural-Based
Modeling Methods (1)
→ Conversion-Based: Convert PS trees using a Head Table
- Grammar-Based
- Graph-Based
- Transition-Based

Example phrase-structure tree for "workers dumped sacks into bins":
(TOP (S (NP workers) (VP (VP (VP dumped) (NP sacks)) (PP (P into) (NP bins)))))
Modeling Methods (1)
→ Conversion-Based: Convert PS trees using a Head Table

Head Table (excerpt):
Parent   Direction   Priority list
VP       →           VBD VBN MD VBZ VB VBG VBP VP
NP       ←           NN NX JJR CD JJ JJS RB
ADJP     ←           NNS QP NN ADVP JJ VBN VBG
ADVP     →           RB RBR RBS FW ADVP TO CD JJR
S        ←           VP S SBAR ADJP UCP NP
SQ       ←           VBZ VBD VBP VB MD PRD VP SQ
SBAR     ←           S SQ SINV SBAR FRAG IN DT
Where Do Dependency Trees Come From?
(S (NP workers) (VP (VP (V dumped) (NP sacks)) (PP (P into) (NP bins))))
Where Do Dependency Trees Come From?
After head annotation, each node carries its lexical head:
(S/dumped (NP/workers workers) (VP/dumped (VP/dumped (V/dumped dumped) (NP/sacks sacks)) (PP/into (P/into into) (NP/bins bins))))
Where do Dependency Trees Come From?
Dependencies are read off the lexicalized tree, one arc per non-head child:
dumped → workers; dumped → sacks; dumped → into; into → bins
Where do Dependency Trees Come From?
Adding the artificial root node yields the final dependency tree:
–ROOT– → dumped; dumped → workers, sacks, into; into → bins
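To make the conversion concrete, here is a minimal Python sketch; the tree encoding, the toy head table, and all function names are illustrative, not the lecture's:

# Trees as (label, children) tuples; leaves are (pos, word) with word a str.
TREE = ("S", [("NP", "workers"),
              ("VP", [("VP", [("V", "dumped"), ("NP", "sacks")]),
                      ("PP", [("P", "into"), ("NP", "bins")])])])

# Toy head table: parent label -> (search direction, child-label priorities).
HEAD_TABLE = {"S": ("left", ["VP", "S"]),
              "VP": ("left", ["V", "VP"]),
              "NP": ("right", ["NN", "NP"]),
              "PP": ("left", ["P"])}

def head_child(label, children):
    # Scan the children in the table's direction, by label priority.
    direction, priorities = HEAD_TABLE.get(label, ("left", []))
    order = children if direction == "left" else list(reversed(children))
    for wanted in priorities:
        for child in order:
            if child[0] == wanted:
                return child
    return order[0]  # fallback: first child in search order

def to_dependencies(node, arcs):
    # Percolate lexical heads upward; emit one arc per non-head child.
    label, children = node
    if isinstance(children, str):      # leaf: (pos, word)
        return children
    head = head_child(label, children)
    head_word = to_dependencies(head, arcs)
    for child in children:
        if child is not head:
            arcs.append((head_word, to_dependencies(child, arcs)))
    return head_word

arcs = []
root = to_dependencies(TREE, arcs)
print(root, arcs)
# dumped [('dumped', 'sacks'), ('into', 'bins'), ('dumped', 'into'), ('dumped', 'workers')]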
Modeling Methods (2)
✓ Conversion-Based
→ Grammar-Based
- Graph-Based
- Transition-Based

(The dependency tree obtained by conversion: TOP → dumped; dumped → workers, sacks, into; into → bins)
Grammar-Based Dependency Parsing
The Basic Idea
- Treat bi-lexical dependencies as constituents
- Decode using a chart-based algorithm (e.g., CKY)
- Learn using standard MLE methods
- Evaluate over the set of resulting dependencies as usual

Relevant Studies
- Original version: [Hays 1964]
- Link Grammar: [Sleator and Temperley 1991]
- Earley-style left-corner: [Lombardo and Lesmo 1996]
- Bilexical grammars: [Eisner 1996a, 1996b, Eisner 2000]
  http://cs.jhu.edu/~jason/papers/eisner.coling96.pdf
Grammar-Based Dependency Parsing
The Objective Function:

t* = argmax_{t ∈ GEN(x)} P(t)

The Modeling Choices:

Representation   Dependency Trees
Model            PCFG
Decoder          Adapted CKY
Trainer          Smoothed MLE
Evaluation       Attachment Scores
Modeling Methods (3)
✓ Conversion-Based
✓ Grammar-Based
→ Graph-Based
- Transition-Based
Graph-Based Dependency Parsing
The Basic Idea
- Define a global Arc-Factored model
- Treat the search as an MST problem
- Treat the learning as a classification problem
- Evaluate over the set of gold dependencies as usual
Graph-Based Dependency Parsing
Step 1: Defining the Arc Factored Model
t* = argmax_{t ∈ GEN(V)} w · Φ(t)
   = argmax_{t ∈ GEN(V)} Σ_{(i→j) ∈ t} w · φ_arc(i → j)
Graph-Based Dependency Parsing
Step 2: Defining Feature Templates
Name                            φ_i(had, OBJ, effect)   w_i
Unigram head                    "had"                   w_unihead
Unigram dep                     "effect"                w_unidep
Unigram head pos                VB                      w_uniheadpos
Unigram dep pos                 NN                      w_unideppos
Bigram head-dep                 "had-effect"            w_bigram
Bigram headpos-deppos           VB-NN                   w_bigrampos
Labeled Bigram head-dep         "had-OBJ-effect"        w_bigramlabel
Labeled Bigram headpos-deppos   VB-obj-NN               w_bigramposlabel
In-Between pos                  VB-IN-NN                w_inbetween
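A minimal sketch of how such arc-factored features might be extracted; the feature-string format and the 1-indexed word/tag encoding are assumptions for illustration:

def arc_features(words, tags, head, dep, label):
    # Extract arc-factored features for the arc head -label-> dep.
    # words/tags are 1-indexed lists; index 0 is ROOT.
    feats = [
        f"unihead={words[head]}",
        f"unidep={words[dep]}",
        f"uniheadpos={tags[head]}",
        f"unideppos={tags[dep]}",
        f"bigram={words[head]}-{words[dep]}",
        f"bigrampos={tags[head]}-{tags[dep]}",
        f"bigramlabel={words[head]}-{label}-{words[dep]}",
        f"bigramposlabel={tags[head]}-{label}-{tags[dep]}",
    ]
    # In-between POS features: one per token between head and dep.
    lo, hi = min(head, dep), max(head, dep)
    for k in range(lo + 1, hi):
        feats.append(f"inbetween={tags[lo]}-{tags[k]}-{tags[hi]}")
    return feats

words = ["<ROOT>", "it", "had", "little", "effect", "on", "markets"]
tags  = ["<ROOT>", "PRP", "VB", "JJ", "NN", "IN", "NNS"]
print(arc_features(words, tags, head=2, dep=4, label="OBJ"))

The arc's score is then w · φ_arc, the sum of the weights of its active features.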
Graph-Based Dependency Parsing
Step 3: Learning, e.g., with the Perceptron

Theory:
- Find w that assigns higher scores to the gold tree y_i than to any other y ∈ Y
- If separation exists, the perceptron will learn to separate the correct structure from the incorrect structures

Practice:
- Training requires repeated inference-update steps (sketched below)
- Computing feature values is time consuming
- The Averaged-Perceptron variant is preferred
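A minimal sketch of the structured-perceptron training loop for this setting, reusing arc_features from the sketch above; the decode function (the inference step, e.g., the MST decoder of Step 4), the sentence object with .words/.tags, and the tree encoding as sorted lists of (head, label, dep) triples are all assumptions:

from collections import defaultdict

def tree_features(sentence, tree):
    # Sum arc-factored features over all arcs of the tree; returns a
    # sparse feature-count vector Phi(tree).
    phi = defaultdict(float)
    for head, label, dep in tree:
        for f in arc_features(sentence.words, sentence.tags, head, dep, label):
            phi[f] += 1.0
    return phi

def perceptron_train(data, decode, epochs=10):
    # data: (sentence, gold_tree) pairs.
    w = defaultdict(float)
    for _ in range(epochs):
        for sentence, gold_tree in data:
            pred_tree = decode(sentence, w)      # inference step
            if pred_tree != gold_tree:           # update step
                # Standard update: w += Phi(gold) - Phi(pred).
                # (The preferred Averaged-Perceptron also keeps a
                # running sum of w and returns its mean.)
                for f, v in tree_features(sentence, gold_tree).items():
                    w[f] += v
                for f, v in tree_features(sentence, pred_tree).items():
                    w[f] -= v
    return w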
Graph-Based Dependency Parsing
Step 4: Finding the Max-Spanning Tree
The Chu-Liu-Edmonds Algorithm; runtime complexity: O(n²)
Graph-Based Dependency Parsing
Step 3: Online Learning
Perceptron / MIRA (Margin Infused Relaxed Algorithm)

Step 4: Max-Spanning-Tree Decoding
The Chu-Liu-Edmonds Algorithm (CLE)
http://repository.upenn.edu/cgi/viewcontent.cgi?article=1056&context=cis_reports
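A minimal decoding sketch that delegates Chu-Liu-Edmonds to networkx's maximum_spanning_arborescence (an Edmonds-style implementation) rather than hand-coding it; the arc_score callback stands in for w · φ_arc:

import networkx as nx

def mst_decode(n, arc_score):
    # Decode the best unlabeled dependency tree over tokens 1..n.
    # arc_score(i, j) should return w . phi_arc(i -> j); node 0 is ROOT.
    G = nx.DiGraph()
    for i in range(0, n + 1):
        for j in range(1, n + 1):   # no arcs enter ROOT
            if i != j:
                G.add_edge(i, j, weight=arc_score(i, j))
    tree = nx.maximum_spanning_arborescence(G)
    return sorted(tree.edges())

# Toy scores favouring the tree ROOT -> 2, 2 -> 1, 2 -> 3:
scores = {(0, 2): 10, (2, 1): 8, (2, 3): 9}
print(mst_decode(3, lambda i, j: scores.get((i, j), 0)))
# [(0, 2), (2, 1), (2, 3)]

Since no arcs enter node 0, any spanning arborescence of this graph is necessarily rooted at ROOT.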
Graph-Based Dependency Parsing
The Objective Function:

t* = argmax_{t ∈ GEN(x)} Σ_{a ∈ arcs(t)} wᵀ Φ(a)

The Modeling Choices:

Representation   Dependency Trees
Model            Graph-Based, Arc-Factored
Decoder          MST/CLE, O(n²)
Trainer          Perceptron/MIRA
Evaluation       Attachment Scores
Modeling Methods (4)
✓ Conversion-Based
✓ Grammar-Based
✓ Graph-Based
→ Transition-Based
Transition-Based Dependency Parsing
The Basic Idea
- Define a transition system
- Define an Oracle algorithm for decoding
- Approximate the Oracle algorithm via learning
- Evaluate over dependency arcs as usual
http://stp.lingfil.uu.se/~nivre/docs/BeyondMaltParser.pdf
Transition-Based Dependency Parsing
Defining Configurations
A parser configuration is a triplet c = (S, Q, A), where
- S = a stack [..., wi]_S of partially processed nodes
- Q = a queue [wj, ...]_Q of remaining input nodes
- A = a set of labeled arcs (wi, l, wj)

Initialization: c0 = ([w0]_S, [w1, ..., wn]_Q, {}), where w0 = ROOT
Termination: ct = ([w0]_S, []_Q, A)
Transition-Based Dependency Parsing
Defining Transitions (a code sketch of this system follows below)
- Shift:
  ([...]_S, [wi, ...]_Q, A) → ([..., wi]_S, [...]_Q, A)
- Arc-Left(l):
  ([..., wi, wj]_S, Q, A) → ([..., wj]_S, Q, A ∪ {(wj, l, wi)})
- Arc-Right(l):
  ([..., wi, wj]_S, Q, A) → ([..., wi]_S, Q, A ∪ {(wi, l, wj)})
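A minimal Python sketch of this transition system, with configurations as (stack, queue, arcs) triples; the class and method names are illustrative:

class Configuration:
    # A configuration c = (S, Q, A); token 0 is ROOT.
    def __init__(self, n):
        self.stack = [0]                      # S = [w0]
        self.queue = list(range(1, n + 1))    # Q = [w1, ..., wn]
        self.arcs = set()                     # A = {}

    def shift(self):
        self.stack.append(self.queue.pop(0))

    def arc_left(self, label):                # adds (wj, l, wi), keeps wj
        wj = self.stack.pop()
        wi = self.stack.pop()
        self.arcs.add((wj, label, wi))
        self.stack.append(wj)

    def arc_right(self, label):               # adds (wi, l, wj), keeps wi
        wj = self.stack.pop()
        self.arcs.add((self.stack[-1], label, wj))

    def is_terminal(self):
        return not self.queue and self.stack == [0]

# "the cat sat": shift, shift, Arc-Left(det), shift, Arc-Left(subj),
# Arc-Right(root).
c = Configuration(3)
c.shift(); c.shift(); c.arc_left("det")
c.shift(); c.arc_left("subj"); c.arc_right("root")
print(c.is_terminal(), sorted(c.arcs))
# True [(0, 'root', 3), (2, 'det', 1), (3, 'subj', 2)]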
Transition-Based Dependency Parsing
Demo Deck
Transition-Based Dependency Parsing
Deterministic Parsing
Given an oracle O that correctly predicts the next transition O(c), parsing is deterministic:

PARSE(w1, ..., wn)
1. c ← ([w0]_S, [w1, ..., wn]_Q, {})
2. while Qc ≠ [] or |Sc| > 1
3.   t ← O(c)
4.   c ← t(c)
5. return T = (w0, w1, ..., wn, Ac)
Transition-Based Dependency Parsing
Data-Driven Parsing
We approximate the oracle O using a classifier Predict that predicts the next transition using features of c, feats(c):

PARSE(w1, ..., wn)
1. c ← ([w0]_S, [w1, ..., wn]_Q, {})
2. while Qc ≠ [] or |Sc| > 1
3.   t ← Predict(w, feats(c))
4.   c ← t(c)
5. return T = (w0, w1, ..., wn, Ac)
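A minimal sketch of this data-driven loop on top of the Configuration sketch above; the scripted predict function stands in for a trained classifier:

def greedy_parse(n, predict):
    # Greedy transition-based parsing: predict maps a configuration
    # to the next transition, approximating the oracle O.
    c = Configuration(n)
    while c.queue or len(c.stack) > 1:
        action, label = predict(c)        # e.g., argmax_t w . phi(c, t)
        if action == "shift":
            c.shift()
        elif action == "arc_left":
            c.arc_left(label)
        else:
            c.arc_right(label)
    return c.arcs

# A hand-written stand-in for the trained classifier, for "the cat sat":
script = iter([("shift", None), ("shift", None), ("arc_left", "det"),
               ("shift", None), ("arc_left", "subj"), ("arc_right", "root")])
print(sorted(greedy_parse(3, lambda c: next(script))))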
Transition-Based Dependency Parsing
Feature Engineering
Name           Feature          Weight
S[0] word      effect           w1
S[0] pos       NN               w2
S[1] word      little           w3
S[1] pos       JJ               w4
Q[0] word      on               w5
Q[0] pos       P                w6
Q[1] word      financial        w7
Q[1] pos       JJ               w8
Root(A) word   had              w9
Root(A) pos    VB               w10
S[0]-S[1]      effect→little    w11
S[1]-S[0]      little→effect    w12
Transition-Based Dependency Parsing
An oracle O can be approximated by a (linear) classifier:

Predict(c) = argmax_t w · Φ(c, t)

History-Based Features Φ(c, t):
- Features over input words relative to S and Q
- Features over the (partial) dependency tree defined by A
- Features over the (partial) transition sequence so far

Learning w from Treebank Data:
- Reconstruct the oracle sequence for each sentence
- Construct a training data set D = {(c, t) | O(c) = t}
- Maximize the accuracy of the local predictions O(c) = t
Transition-Based Dependency Parsing
Step 3: Online Learning
The classifier is trained with online learning algorithms (e.g., the Perceptron).

Step 4: Greedy Decoding
At each step, select the maximum-scoring transition.
Recent Advances: Neural-Network Models
The Basic Claim: Both graph-based and transition-based models benefit from the move to Neural Networks.

Same overall approach and algorithm as before, but:
→ Replace the linear classifier with a non-linear MLP.
→ Use pre-trained word embeddings.
→ Replace the feature extractor with a Bi-LSTM.

Further explorations:
→ Semi-supervised learning.
→ Multi-task learning.

Remaining challenges:
→ Out-of-domain parsing (e.g., Twitter)
→ Parsing Morphologically-Rich Languages (e.g., Hebrew)
Summarising Dependency Parsing
- Dependency trees as labeled bi-lexical dependencies
- Data-driven parsing trained over dependency treebanks
- Varied methods:
  - Conversion-Based (Rules)
  - Grammar-Based (Probabilistic)
  - Graph-Based (Linear, Globally Optimized)
  - Transition-Based (Linear, Locally Optimized)
- Neural network models work the same, but with:
  - a non-linear objective, e.g., an MLP
  - better word representations, e.g., word embeddings
  - better (automatic) feature extraction, e.g., a BiLSTM
- English is "solved": what about other languages?
  - Stanford CoreNLP: https://corenlp.run
  - The UD Initiative: https://universaldependencies.org/
  - UDPipe: http://lindat.mff.cuni.cz/services/udpipe/
  - ONLP: nlp.biu.ac.il/~rtsarfaty/onlp/hebrew/
NLP@BIU: Where We’re At
So Far
✓ Part 1: Introduction (classes 1-2)
✓ Part 2: Words/Sequences (classes 3-4)
✓ Part 3: Sentences/Trees (classes 5-6)
→ Part 4: Meanings (Prof. Ido Dagan, starting class 7)
To Be Continued...