View
214
Download
0
Tags:
Embed Size (px)
Citation preview
Treebanks are Not Naturally Occurring Data
Choices in Treebank Design and What They Mean for Natural Language Processing
Owen RambowColumbia University – CCLS
Goal of this Talk
• In natural language processing, we all use syntactic representations (treebanks)
• In theoretical syntax, they use syntactic representations• Problems in both communities:
– Little reflection on what these syntactic representations mean
– Representation confused with phenomenon which is the object of scientific investigation: natural language syntax
• Goal: increase meta-scientific understanding of syntactic representations for NLP
Overview
• What is a Syntactic Representation?• Dependency and Phrase Structure – Definitions– Syntactic and representational constituency and
dependency– Intermediate projections
• What does this Mean for NLP?• Takehome lessons
Acknowledgements
• Rajesh Bhatt and Fei Xia – Bhatt, Rambow and Xia: Linguistic Phenomena,
Analyses, and Representations: Understanding Conversion between Treebanks, IJCNLP 2011
• Hindi-Urdu Treebank Project– Dipti Misra Sharma (IIIT Hyderabad)– Martha Palmer (Colorado)– NSF Grant
• All errors and infelicitous choices in this presentation are my own
What is a Syntactic Representation?
1. Syntactic phenomena, e.g.:– Subject of a verb– Relative clause– Small clauseLinguists tend to agree on what phenomena exist
2. Mathematical representation type, e.g.:– Phrase structure tree– Dependency tree– Or something more complicated: LFG, TAG, …
3. Formal syntactic description:a. Mapping from phenomena to representations (in particular type)b. Chosen representation for a specific phenomenon also called
analysisc. Phenomena extracted in representation are the interpretationd. Formal description is a syntactic theory if it makes predictions
Example: Small Clauses
• Hindi– अा��ति�फ ने सी�मा� को� बेवको� फ सीमाझा� – Atif ne Seema ko bewakuuf samjhaa – Atif Erg Seema Acc stupid consider.Pfv – ‘Atif considered Seema stupid.’
• English– Atif considered Seema stupid– Atif considered her stupid
What is the Phenomenon?
• Syntactically and semantically, consider takes a clausal complement– Atif considered [clause that she is stupid]
– Atif considered [clause her stupid]
• But two problems:– No verb – her is semantically subject of stupid but has accusative
case, which is unusual (subjects are usually nominative)• So:– Atif considered [small clause her stupid]
What is the Representation Type?
• For this example, we will show dependency trees and phrase structure trees
Analysis 1a for Small Clauses:No Accusative Case Marking
• Structure represents her as subject but not accusative case marking of her
considers
Atif stupid
Subj Obj
her
Subj
Analysis 1b for Small Clauses:Exceptional Case Marking
• Structure represents her as subject and accusative case marking through node label
considers
Atif stupid
Subj Obj-ECM
her
Subj
Analysis 1a for Small Clauses:No Accusative Case Marking
• Structure represents her as subject but not accusative case marking of her
considers
Atif stupid
Subj Obj
her
Subj
S
NP
Atif
VP
considers S
her VP
AdjPstupid
Analysis 1b for Small Clauses:Exceptional Case Marking
• Structure represents her as subject but not accusative case marking of her
considers
Atif stupid
Subj Obj-ECM
her
Subj
S
NP
Atif
VP
considers SC
her VP
AdjPstupidClose to analysis adopted in Chomsky (1981)
Note on DS and PS
• These analyses are intuitively very similar• Formal notion: “consistency” (Fei Xia, see
Bhatt, Rambow & Fei 2011)– Intution: very simple and general algorithm can
transform consistent DS to PS and vice versaI
Analysis 2a for Small Clauses:General Monoclausal Analysis
• Structure represents accusative case marking of her (as object of matrix verb) but not her as semantic subject
considers
Atif stupid
Subj Obj2
her
Obj
Analysis 2b for Small Clauses:Syntactic Monoclausal Analysis
• Structure represents accusative case marking of her (as object of matrix verb) and her as semantic subject using node label
considers
Atif stupid
Subj ObjPred
her
Obj
Analysis 2b for Small Clauses:Syntactic Monoclausal Analysis
• Structure represents accusative case marking of her (as object of matrix verb) and her as semantic subject using node label
considers
Atif stupid
k1 k2s
her
k2
Neo-Paninian analysis from IIIT Hyderabad,Used for DS in Hindi-Urdu Treebank
Analysis 2b for Small Clauses:Syntactic Monoclausal Analysis
• Structure represents accusative case marking of her (as object of matrix verb) and her as semantic subject using node label
सीमाझा�
अा��ति�फ ने बेवको� फ
k1 k2s
सी�मा� को�
k2
Neo-Paninian analysis from IIIT Hyderabad,Used for DS in Hindi-Urdu Treebank
Analysis 2a for Small Clauses:General Monoclausal Analysis
• Structure represents accusative case marking of her (as object of matrix verb) but not her as semantic subject
considers
Atif stupid
Subj Obj2
her
ObjNP
Atif
VP
considers her AdjP
stupid
S
Analysis 2b for Small Clauses:Syntactic Monoclausal Analysis
• Structure represents accusative case marking of her (as object of matrix verb) and her as semantic subject using node label
considers
Atif stupid
k1 k2s
her
k2NP
Atif
VP
considers her
AdjP
stupid
S
SC
Analysis 3 for Small Clauses:Raising to Object
• Structure represents accusative case marking of her and her as semantic subject but requires empty category
considers
Atif stupid
Subj Obj-Pred
her1
Obj
e1
Subj
Analysis 3 for Small Clauses:Raising to Object
• Structure represents accusative case marking of her and her as semantic subject but requires empty category
considers
Atif stupid
Subj Obj-Pred
her1
Obj
e1
Subj
S
NP
Atif
VP
considers S
VP
AdjPstupid
her1
e1
Analysis used for PS in Hindi-Urdu Treebank
Comparison of Representations
• Less Information • Same information
considers
Atif stupid
Subj Obj
her
Subj
considers
Atif stupid
Subj Obj2
her
Obj
considers
Atif stupid
Subj Obj-Pred
her1
Obj
e1
Subj
considers
Atif stupid
Subj ObjPred
her
Obj
considers
Atif stupid
Subj Obj-ECM
her
Subj
Tree 1a
Tree 2a
Tree 1b
Tree 2b
Tree 3
Initial Summary: Syntactic Phenomena, Representation Types, Analyses
• Syntactic phenomena are the empirical data of syntax as part of the science of language– Can be very similar across languages
• There can be several possible analyses– Some have less information– But there can be different analyses that represent
the same information differently• The analyses can be similar in DS and PS• Lots of choices in treebank design!
Overview
• What is a Syntactic Representation?• Dependency and Phrase Structure – Definitions– Syntactic and representational constituency and
dependency– Intermediate projections
• What does this Mean for NLP?• Takehome lessons
Representation Types:Dependency and Phrase Structure
• Dependency Tree (DS):– One label alphabet, words (= words in a sentence)– All nodes labeled with words or empty strings
• Phrase Structure Tree (PS):– Two disjoint label alphabets, terminals (= words in
sentence) and nonterminals– All and only interior nodes are labeled with
nonterminals– Leaves are labeled with terminals or empty strings
Non-Differences Between DS and PS
• WRONG: “DS can’t have empty categories”
considers
Atif stupid
Subj Obj-Pred
her1
Obj
e1
Subj
Non-Differences Between DS and PS
• WRONG: “DS must be unordered, PS must be ordered”
that the dancer a flower the violinist gives
Non-Differences Between DS and PS
• WRONG: “PS can’t be non-projective/have crossing arcs”
Overview
• What is a Syntactic Representation?• Dependency and Phrase Structure – Definitions– Syntactic and representational constituency and
dependency– Intermediate projections
• What does this Mean for NLP?• Takehome lessons
The Double Life of Dependency and of Constituency
• There are two meanings to “dependency”:– A type of tree (no nonterminal labels)– A type of syntactic phenomenon, namely a
relation between words (e.g., “subjecthood”)• There are two meanings to “constituency”– A type of tree (terminal labels on leaf nodes; = PS
tree)– A type of syntactic phenomenon, namely a
grouping of words into phrases
Mathematical objects = tools for linguists to
represent data
Empirical facts = data for linguists
Don’t confuse data with its representation!
Constituency in PS Representation but not in Syntax
• VP premodifiers in PTB can be either sister to VP or in VP, there is no difference in meaning– (S (NP-SBJ Sandy
(ADVP-TMP often) (VP throws (NP curves)))
– (S (NP-SBJ Sandy (VP (ADVP-TMP often) throws (NP curves)))
• “The variation is in general free” (PTB Guidelines p. 137)
Constituency in Syntax but not in PS Representation
• Flat NP in PTB– Syntax of Adj-N-N sequences: • (green (baby bassinet))• ((small business) administration)
– Representation in Penn Treebank:• (green baby bassinet)• (small business administration)
Conscious decision not to represent NP-internal pre-nominal sturcture
Dependency in DS Representation but not in Syntax
• --- links in CATiB dependency treebank of Arabic– Signal that connected words are not analyzed by
annotators for dependency• Pof link in DS part of Hindi-Urdu Treebank– Used for multiword expressions which form a unit
syntactically (and thus have no structure) but may occur apart in sentence
Dependency in Syntax but not in DS Representation
• Small clauses in DS for Hindi-Urdu Treebank: dependency between embedded predicate and its semantic subject not marked as dependency link in tree
considers
Atif stupid
k1 k2s
her
k2
Aren’t DS and PS Representations Complementary? NO!
• Syntactic dependency can be encoded in DS, and typically is
• Usual convention: attachment in projection shows type of dependency
Aren’t DS and PS Representations Complementary? NO!
• Syntactic constituency is represented in DS• Usual convention: each node is the word, and
the head of the phrase containing it and all descendents
Overview
• What is a Syntactic Representation?• Dependency and Phrase Structure – Definitions– Syntactic and representational constituency and
dependency– Intermediate projections
• What does this Mean for NLP?• Takehome lessons
But What About Intermediate Projections?
• If PS:– only has preterminals (eg V) and maximal
projections (eg S);– and no attchment happens at preterminal;then we have trivial correspondence to DS (consistency in the formal sense of Xia)
• Then PS and DS must have the exact same expressive power
No Intermediate Projections: DS and PS Representationally Equivalent
S
N Adv NV
Rajesh happily eats laddus
eats
Rajesh happily laddus
What about the POS tags in the DS?
No Intermediate Projections: DS and PS Representationally Equivalent
S
N Adv NV
Rajesh happily eats laddus
eatsV
RajeshN happilyAdv laddusN
What about grammatical functions?
No Intermediate Projections: DS and PS Representationally Equivalent
S
NSubj AdvAdj NObjV
Rajesh happily eats laddus
eatsV
RajeshN happilyAdv laddusN
Subj ObjAdj
Intermediate Projections
• If PS only has preterminals (eg V) and maximal projections (eg S), we have trivial correspondence to DS
• No direct equivalent of intermediate projections (eg VP) in DS
Intermediate Projections: DS and PS Different!
S
NP
Adv NPVRajesh
happily eats laddus
eatsV
RajeshN happilyAdv laddusN
Subj ObjAdj
VP
VP
AdvPN
Note: X– X’—XP = maximal projectionEXCEPT: V—VP—S:
VP = intermediate projection
Intermediate Projections
• If PS only has preterminals (eg V) and maximal projections (eg S), we have trivial correspondence to DS
• No direct equivalent of intermediate projections (eg VP) in DS – we have a problem?
• What is important: how intermediate projections are used in the formal syntactic description
• A node is not information about the syntax, only a node and its interpretation
Why Have Intermediate Projections?
• Evidence for IntProj as a constituent– VP fronting in English
• Use of a VP to represent semantic facts– Interpretation of adverbs in English
• Use of IntProj to represent syntactic facts– Subject and object of verbs in English– Genitive versus adjectival modification in Arabic– Also: • Arguments of verbs in Hindi• Verb second in German
Intermediate Projection as Constituent:VP Fronting in English
• VP can be moved as a constituent– Rajesh said he would [VP eat laddus]
and [VP eat laddus] Rajesh did
VP Fronting in English
S
NP
didRajesh
VP
AuxN
NPV
eat laddus
VP
Sbar
What can Dependency Do?
ε1
laddus
Rajesh did
Aux
Obj
Subj
eat1
Top
did
laddus
Rajesh
Obj
Subj
eat
Top
did
Rajesh
Subj
laddus
Obj
eat
Comp
eat
did laddus
ObjAux
Rajesh
Subj
If auxiliary did is head(basically, we have a VP!)
If eat is head
Issue: two possible analysesof auxiliary-verb structures in DS
• Possible claim:– VP adverbs characterize manner of event
• John quickly/happily played chess
– S adverbs characterize speaker attitude towards event:• Happily/fortunately (enough), John lost against me at chess
• Problem: interpretation independent of word order– Quickly/happily John played chess– John happily/oddly (enough) lost against me at chess
• Conflict between using VP to identify subject and using VP to describe semantics of adverbs– Would need traces to show scope
• In fact, Penn Treebank for English does not use VPs for scope
Intermediate Projections and Semantic Scope
Use of VP to Indicate Subject
considers
Atif stupid
Subj Obj-Pred
her1
Obj
e1
Subj
S
NP
Atif
VP
considers S
VP
AdjPstupid
her1
e1
Projections:Noun Modification in Arabic Treebank
• Idafa (genitive) construction: – (NP (N بيت/byt/house)
(NP (N المصري/AlmSry/the-Egyptian)))the house of the Egyptian
• Modification construction:– (NP (N الوزير/Alwzyr/the-minister)
(Adj المصري/AlmSry/the-Egyptian)) the Egyptian minister
• Syntactic difference signaled by projection to X or XP• Can’t do in DS – need labels
Conclusion from Intermediate Projections
• Intermediate projections (eg VP, N’) are specific to PS
• They can be used to encode syntactic or semantic information
• But: DS can express the same information!• Conclusion: the existence of intermediate
projections in PS is not a reason to say that PS is more expressive than DS
Overview
• What is a Syntactic Representation?• Dependency and Phrase Structure – Definitions– Syntactic and representational constituency and
dependency– Intermediate projections
• What does this Mean for NLP?• Takehome lessons
What Does This Mean for NLP?
• Treebanks are not naturally occurring data • The guidelines are painstakingly produced by linguists and
represent a formal description of the language• Annotators understand a sentence, determine what syntactic
phenomena exist, and use the guidelines to choose an analysis for the sentence (a structure)
• Users of the treebank can use the guidelines to interpret the structures and get back the syntactic phenomena present
• These phenomena, and not their representation in the treebank, can be used for NLP in whatever representation chosen by the researcher!
• There is already lots of linguistics in our resources, we just need to make use of that linguistic information!
Conclusion
• 4 takehome lessons
Lesson 1: There are All Sortsof Trees
Fallacy: DS trees are unordered,have no empty categories,… and PS trees are ordered and …
Truth: DS and PS trees are graphs that can
have all sorts of properties
Lesson 2:Trees Need an Interpretation
Fallacy: Trees have intrinsic syntactic meaning
Truth: Trees only gain syntactic meaning through interpretation;in treebanks: guidelines
Lesson 3: Conversion between Formats Depends on Interpretation
Fallacy: It is easier/harder to convertDS to PS than vice versa
Truth: conversion means preserving interpretation; thus it depends on the two representations between which conversion happens
Lesson 4: There are two Meanings to “Constituency” and “Dependency”
Fallacy: DS trees only show dependency and PS trees only show constituency
Truth: there are distinct mathematical (data structure) notions and linguistic notions of “dependency” and
“phrase structure”; each tree type can show each
Thank you!