Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
The Prague Dependency TreebanksMorphology, Syntax, Semantics
Jan HajičInstitute of Formal and Applied Linguistics
School of Computer ScienceFaculty of Mathematics and Physics
Charles University, PragueCzech Republic
Dec. 15, 2010 CLARA / META-NET training course 2
The Prague Dependency Treebank
The ideaApply the “old” Prague theory to real-word textsProvide enough data for ML experiments
?“Old” Prague theoryPrague structuralism (1930s)Stratificational approachCentered on “deep syntax”
Separated from “surface form”Dependency based (how else ☺)
Dec. 15, 2010 CLARA / META-NET training course 3
PDT:The Methodology
Manual annotation is PRIMARYSome help from existing tools possible
“No information loss, no redundancy”Much formalization, but…… original form always retrievable
Dictionaries In theory: “secondary”, side effect of annotationIn reality: help consistencyLinks: data → dictionary(-ies)
Extensive support for Machine LearningErgonomy of annotation
Graphical (“linguistic”) presentation & editing
Dec. 15, 2010 CLARA / META-NET training course 4
The Prague Dependency Treebank Project: Czech Treebank
1995 (Dublin) 1996-2006-2010-…1998 PDT v. 0.5 released (JHU workshop)
400k words manually annotated, unchecked2001 PDT 1.0 released (LDC):
1.3MW annotated, morphology & surface syntax2006 PDT 2.0 release
0.8MW annotated (50k sentences) + PDT 1.0 correctedthe “tectogrammatical layer”
underlying (deep) syntax
Dec. 15, 2010 CLARA / META-NET training course 5
Related Projects (Treebanks)
Prague Czech-English Dependency TreebankWSJ portion of PTB, translated to Czech (1.2 mil. words)automatically analyzed
English side (PTB), tooManual annotation started
Prague Arabic Dependency Treebankapply same representation to annotation of Arabicsurface syntax so far
Both published (partial version) in 2004 (LDC)PCEDT version 2.0 being prepared (2011)
Dec. 15, 2010 CLARA / META-NET training course 6
PDT Annotation LayersL0 (w) Words (tokens)
automatic segmentation and markup onlyL1 (m) Morphology
Tag (full morphology, 13 categories), lemmaL2 (a) Analytical layer (surface syntax)
Dependency, analytical dependency functionL3 (t) Tectogrammatical layer (“deep” syntax)
Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon
PD
T 1.
0 (2
001)
PD
T 2.
0 (2
006)
Dec. 15, 2010 CLARA / META-NET training course 7
PDT Annotation LayersL0 (w) Words (tokens)
automatic segmentation and markup onlyL1 (m) Morphology
Tag (full morphology, 13 categories), lemmaL2 (a) Analytical layer (surface syntax)
Dependency, analytical dependency functionL3 (t) Tectogrammatical layer (“deep” syntax)
Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon
Dec. 15, 2010 CLARA / META-NET training course 8
Morphological Attributes
Tag: 13 categoriesExample: AAFP3----3N----
Adjective no poss. Gender negatedRegular no poss. Number no voiceFeminine no person reserve1Plural no tense reserve2Dative superlative base var.
Lemma: POS-unique identifierBooks/verb -> book-1, went -> go, to/prep. -> to-1
Ex.: nejnezajímavějším“(to) the most uninteresting”
Dec. 15, 2010 CLARA / META-NET training course 9
Morphological Disambiguation
Full morphological disambiguationmore complex than (e.g. English) POS tagging
Several full morphological taggers:(Pure) HMMFeature-based (MaxEnt-like)
used in the PDT distributionAveraged Perceptron (M. Collins, EMNLP’02)
All: ~ 94-96% accuracy (perceptron is best) “COMPOST” (available for several languages)
EACL 2009 paper, http://ufal.mff.cuni.cz/compost
Dec. 15, 2010 CLARA / META-NET training course 10
The Segmentation Problem:Arabic
Tokenization / segmentation not always trivialArabic, German, Chinese, Japanese
Dec. 15, 2010 CLARA / META-NET training course 11
Layer 2 (a-layer): Analytical Syntax
Dependency + Analytical Function
dependent
governor
The influence of the Mexicancrisis on Central and EasternEurope has apparently been underestimated.
Dec. 15, 2010 CLARA / META-NET training course 12
Analytical Syntax: Functions
Main (for [main] semantic lexemes):Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom“Double” dependency: AtrAdv, AtrObj, AtrAtr
Special (function words, punctuation,...):Reflefives, particles: AuxT, AuxR, AuxO, AuxZ, AuxYPrepositions/Conjunctions: AuxP, AuxCPunctuation, Graphics: AuxX, AuxS, AuxG, AuxK
StructuralElipsis: ExD, Coordination etc.: Coord, Apos
Dec. 15, 2010 CLARA / META-NET training course 13
PDT-styleArabic Surface Syntax
Only several differences(Sometimes) Separate nodes for individual segments (cf. tagging/segmentation) Copula treatment (Czech: rare treated as ellispsis; Arabic: systematic solution), Pred(Added) analytic functions:
AuxM (did-not)
Ante (what)
Work by Faculty of Arts (Arabic language) students
Dec. 15, 2010 CLARA / META-NET training course 14
Arabic Surface SyntaxExample
In the section on literature, the magazine presented the issue of the Arabic language and the dangers that threaten it.
Dec. 15, 2010 CLARA / META-NET training course 15
English Analytic LayerBy conversion from PTB
Extended analytic functions
Head rules Jason Eisner’s, added more for full conversion
Coordination, traces, etc.
Coordination handlingSame as in Czech/Arabic PDT
Dec. 15, 2010 CLARA / META-NET training course 16
Penn TreebankUniversity of Pennsylvania, 1993
Linguistic Data ConsortiumWall Street Journal texts, ca. 50,000 sentences
1989-1991Financial (most), news, arts, sports2499 (2312) documents in 25 sections
AnnotationPOS (Part-of-speech tags)Syntactic “bracketing” + bracket (syntactic) labels(Syntactic) Function tags, traces, co-indexing
Dec. 15, 2010 CLARA / META-NET training course 17
Penn Treebank Example( (S
(NP-SBJ (NP (NNP Pierre) (NNP Vinken) )(, ,) (ADJP
(NP (CD 61) (NNS years) )(JJ old) )
(, ,) )(VP (MD will)
(VP (VB join) (NP (DT the) (NN board) )(PP-CLR (IN as)
(NP (DT a) (JJ nonexecutive) (NN director) ))(NP-TMP (NNP Nov.) (CD 29) )))
(. .) ))
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
POS tag (NNS)(noun, plural)
Phrase label (NP)
Noun Phrase
“Preterminal”
Dec. 15, 2010 CLARA / META-NET training course 18
Penn Treebank Example:Sentence Tree
Phrase-based tree representation:
Dec. 15, 2010 CLARA / META-NET training course 19
Parallel Czech-English Annotation
English text -> Czech text (human translation)Czech side (goal): all layers manual annotationEnglish side (goal):
Morphology and surface syntax: technical conversionPenn Treebank style -> PDT Analytic layer
Tectogrammatical annotation: manual annotation(Slightly) different rules needed for English
AlignmentNatural, sentence level only (now)
Dec. 15, 2010 CLARA / META-NET training course 20
Human Translation ofWSJ Texts
Hired translators / FCE levelSpecific rules for translation
Sentence per sentence only …to get simple 1:1 alignment
Fluent Czech at the target sideIf a choice, prefer “literal” translation
The numbers:English tokens: 1,173,766Translated to Czech:
Revised/PCEDT 1.0: 487,929Now finished (all 2312 documents)
Dec. 15, 2010 CLARA / META-NET training course 21
English Annotation POS and Syntax
Automatic conversion from Penn TreebankPDT morphological layer
From POS tagsPDT analytic layer
From:Penn Treebank Syntactic StructureNon-terminal labelsFunction tags (non-terminal “suffixes”)
2-step processHead determination rulesConversion to dependency + analytic function
Dec. 15, 2010 CLARA / META-NET training course 22
Head Determination Rules
Exhaustive set of rules By J. Eisner + M. Cmejrek/J. Curin4000 rules (non-terminal based)
Ex.: (S (NP-SBJ VP .)) → VPAdditional rules
Coordination, AppositionPunctuation (end-of-sentence, internal)
Original idea (possibility of conversion)J. Robinson (1960s)
Dec. 15, 2010 CLARA / META-NET training course 23
Example: Head Determination Rules (J.E.)
(board)
(board)(the)
(join)
(will) (join)
(join)
(join)
(NP (DT NN)) → NN
(VP (VB NP)) → VB
(VP (MD VP)) → VP
(S (… VP …)) → VP
Rules:
Dec. 15, 2010 CLARA / META-NET training course 24
Example: Analytical Structure, Functions
(board)
(board)(the)
(join)
(will) (join)
(join)
(join)
→→
Penn Treebank structure(with heads added) PDT-like Analytic
Representation
Dec. 15, 2010 CLARA / META-NET training course 25
PDT Annotation LayersL0 (w) Words (tokens)
automatic segmentation and markup onlyL1 (m) Morphology
Tag (full morphology, 13 categories), lemmaL2 (a) Analytical layer (surface syntax)
Dependency, analytical dependency functionL3 (t) Tectogrammatical layer (“deep” syntax)
Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon
Dec. 15, 2010 CLARA / META-NET training course 26
Layer 3 (t-layer): Tectogrammatical
Underlying (deep) syntax4 sublayers (integrated):
dependency structure, (detailed) functorsvalency annotation
topic/focus and deep word ordercoreference (mostly grammatical only)all the rest (grammatemes):
detailed functorsunderlying gender, number, ...
Total39 attributes (vs. 5 at m-layer, 2 at a-layer)
Dec. 15, 2010 CLARA / META-NET training course 27
Analytical vs. Tectogrammatical
Underlying verb + tense
Deep function
Elided Actor in
Prepositions out
Another ellipsis...
(TR: sublayer 1 only shown)
Dec. 15, 2010 CLARA / META-NET training course 28
Layer 3: Tectogrammatical
Underlying (deep) syntax4 sublayers:
dependency structure, (detailed) functorstopic/focus and deep word ordercoreferenceall the rest (grammatemes):
detailed functorsunderlying gender, number, ...
Dec. 15, 2010 CLARA / META-NET training course 29
Tectogrammatical Functors
“Actants”: ACT, PAT, EFF, ADDR, ORIG
modify: verbs, nouns, adjectivescannot repeat in a clause, usually obligatory
Free modifications (~ 50), semantically definedcan repeat; optional, sometimes obligatoryEx.: LOC, DIR1, ...; TWHEN, TTILL,...; RSTR; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR, ...
SpecialCoordination, Rhematizers, Foreign phrases,...
syntactic semantic
Dec. 15, 2010 CLARA / META-NET training course 30
Tectogrammatical Example
Analytical verb form:(he) allowed would-be to-be enrolled
směl by být zapsán
Additional attributes (grammatemes):conditional + “allow”
Collapsed
Dec. 15, 2010 CLARA / META-NET training course 31
Tectogrammatical Example
Passive construction (action)(The) book has-been translated [by Mr. X]Kniha byla přeložena
Disappeared Added
Dec. 15, 2010 CLARA / META-NET training course 32
Tectogrammatical Example
Object(he) gave him a-book
dal mu knihu
Obj goes into ACT, PAT, ADDR, EFF or ORIG based on governor’s valency frame
Dec. 15, 2010 CLARA / META-NET training course 33
Tectogrammatical Example
Incomplete phrasesPeter works well , but Paul badlyPetr pracuje dobře, ale Pavel špatně
Added
Dec. 15, 2010 CLARA / META-NET training course 34
Layer 3: Tectogrammatical
Underlying (deep) syntax4 sublayers:
dependency structure, (detailed) functorstopic/focus and deep word ordercoreferenceall the rest (grammatemes):
detailed functorsunderlying gender, number, ...
Dec. 15, 2010 CLARA / META-NET training course 35
Deep Word OrderTopic/Focus
Example:
Baker bakes rolls. vs. BakerIC bakes rolls.
Analyticaldep. tree:
Dec. 15, 2010 CLARA / META-NET training course 36
Layer 3: TectogrammaticalUnderlying (deep) syntax4 sublayers:
dependency structure, (detailed) functorstopic/focus and deep word ordercoreferenceall the rest (grammatemes):
detailed functorsunderlying gender, number, ...
Dec. 15, 2010 CLARA / META-NET training course 37
Coreference
Grammatical (easy)relative clauses
which, whoPeter and Paul, who ...
controlinfinitival constructions
John promised to go ...
reflexive pronouns{him,her,thme}self(-ves)
Mary saw herself in ...
Johngo
he home
promisePRED
ACTPAT
ACT DIR3
Dec. 15, 2010 CLARA / META-NET training course 38
Coreference
TextualEx.: Peter moved to Iowa after he finished his PhD.
Peter Iowafinish
he PhD
movePRED
ACT DIR1TWHEN
ACT PAT
heAPP
Dec. 15, 2010 CLARA / META-NET training course 39
Layer 3: TectogrammaticalUnderlying (deep) syntax4 sublayers:
dependency structure, (detailed) functorstopic/focus and deep word ordercoreferenceall the rest (grammatemes):
detailed functorsunderlying gender, number, ...
Dec. 15, 2010 CLARA / META-NET training course 40
GrammatemesDetailed functors (subfunctors)
only for some functors: TWHEN: before/afterLOC: next-to, behind, in-front-of, ...also: ACMP, BEN, CPR, DIR1, DIR2, DIR3, EXT
Lexical (underlying)number (SG/PL), tense, modality, degree of comparison, ...strictly only where necessary (agreement!)
Dec. 15, 2010 CLARA / META-NET training course 41
Fully Annotated Sentence
The boundaries of some problems seem to be clearer after they were revived by Havel’s speech.
Dec. 15, 2010 CLARA / META-NET training course 42
Arabic Example:Tectogrammatics
In the section on literature, the magazine presented the issue of the Arabic language and the dangers that threaten it.
Dec. 15, 2010 CLARA / META-NET training course 43
English PDT-style Annotation
Morphology and SyntaxBy conversion
Tectogrammatical annotationManual (English TR: by S. Cinková)Pre-annotation
Transformation from Penn Treebank & Propbank (Palmer, Kingsbury) by Z. Žabokrtský et al.
ValencyFrom Propbank Frame Files (Cinková, Šindlerová, Nedolužko, Semecký)
The annotation is finished now (Nov. 2010; 1 mil. words)
Dec. 15, 2010 CLARA / META-NET training course 44
Valency in the PDTValency: specific ability of a word to combine itself withother units of meaning
dát (give)
matka (mother)ADDR
EvaACT
pršet (rain)
zítra (tomorrow)TWHEN
plakat (cry)
Adam noc (night)ACT TWHEN
Specific behavior
dar (gift)PAT
neděle (Sunday)TWHEN
---
Modifies anything
Dec. 15, 2010 CLARA / META-NET training course 45
Valency - Basic Principles
inner participants vs. free modifications(arguments vs. adjuncts)
obligatory vs. optional modifications(the dialogue test)
Dec. 15, 2010 CLARA / META-NET training course 46
Inner Participant …… Free Modification
ACT(or), PAT(ient) ADDR(essee), EFF(ect), ORIG(in) (5)
each occurs just with particular verbs
each modifies the verb only once (in a clause)
Location (LOC, DIR1,…) Time (TWHEN, TTILL, …), Manner, Intention,… (70)
can modify in principleany verb
can be repeated (within the same clause)
Dec. 15, 2010 CLARA / META-NET training course 47
Obligatory … Optional
A: John left.B: From where?A: *I don't know.
A: John left.B: To where?A: I don't know.
„from where“obligatory modification
„to where“optional modification
The Dialogue Test
Answering a question about a semantically obligatory modification, the speaker cannot say: I don't know.
Dec. 15, 2010 CLARA / META-NET training course 48
Valency frame
obligatory optionalargumentadjunct
Structure:
one meaning of the word one valency frame
Contents:
functorobligatorinesssurface form
word: leavemeaning 1: sb left sthmeaning 2: sb left from somewhere
frame1: ACT PATframe2: ACT DIR1
Dec. 15, 2010 CLARA / META-NET training course 49
Valency lexicon:PDT-VALLEX
8500 verb senses / valency frames9000 noun sense / valency framessome adjectives and adverbs
PDT-VALLEX Entryverb: dosáhnout meaning 1: to reach sthmeaning 2: to get sb to do sthmeaning 3: …meaning 4: …
Dec. 15, 2010 CLARA / META-NET training course 50
The PDT-VALLEX editor
‘lay down’
resign
win
ask
senses:
Dec. 15, 2010 CLARA / META-NET training course 52
Corpus <-> Valency Lexicon
Corpus – occurrences of „uzavřít“ (to close) :
ENTRY: uzavřítvf1: ACT(.1) CPHR({smlouva}.4)
ex: u. dohodu (close a contract)vf2: ACT(.1) PAT(.4)
ex.: u. pokoj (close a room, house)
Lexicon:
Sentence 2035: Sentence 15345: Sentence 51042:
Dec. 15, 2010 CLARA / META-NET training course 53
Valency and Text Generation
Using valency for......getting the correct (lemma, tag) of verb arguments
Example:
starat_sePRED
MartinACT
tygrPAT
Martin....1..........
staratV..............
o...............
tygr....4..........
VALLEX entry: starat (se) ACT(.1) PAT(o.[.4])
se...............
Martin se stará o tygry.
“Martin takes care of tigers.”
“to take care of”
“tiger”
Dec. 15, 2010 CLARA / META-NET training course 54
The Annotation Process4 sublayers
work on structure first, rest in parallelStructure
automatic preprocessing - programmed conversion from analytical layer annotation
Grammatemesmostly automatically (based on lower layers’annotation), manual checking, corrections
Cross-sublayer/cross-layer checkingpartly automatic, then manual
Dec. 15, 2010 CLARA / META-NET training course 55
The Annotation SchemeXML + principles of linear- and tree-based standoff annotation
⇒ PML(Prague Markup Language)
Layer schemes (Relax NG)PDT/PADT: t(ecto), a(nalytic), m(orphology), …English: + phrase-based (p-layer)
Dec. 15, 2010 CLARA / META-NET training course 56
PML/XML AnnotationLayers
Strictly top-down linksw+m+a can be easily “knitted”API for cross-layer access (programming)PML Schema / Relax NG[z and audio layers:used for spoken data (audio as layer “-1”)]
LFGanalogy:
f-struct
Φ
c-structz-
laye
rau
dio
BYL BYS ČELO LESA …
Dec. 15, 2010 CLARA / META-NET training course 58
Tectogrammatical Layer in Machine Translation
The Translation (“Vauquois”) triangle
transfer
source target
Tectogrammatical Representation
Surface Syntax
MorphologyGeneration
Cz En
Dec. 15, 2010 CLARA / META-NET training course 59
Dependency trees in MT
According to his opinion UAL's executives were misinformed about the financing of the original transaction.
Transfer:
Podle jeho názoru bylo vedení UAL o financovánípůvodní transakce nesprávně informováno.
- structure (~0)- lexical- functions- grammatical
Dec. 15, 2010 CLARA / META-NET training course 60
Valency and Translation
leave-1 nechat-3ACT() PAT() LOC() ACT(.1) PAT(.4) LOC()
leave-2 odjet-1ACT() DIR1(from.) ACT(.1) DIR1(z.[.2])
Dec. 15, 2010 CLARA / META-NET training course 61
To summarize…
PDT is/has (a)…Dependency-based treebanking project
Czech (other languages in the works – Eng, Ar)~ 1mil. words
sufficient size for ML experiments4 layers of annotation
token, morphology, syntax, deep syntax/semantics++)independent and full information at all levels, but...interlinked (for the development of parsers/generators)
Valency dictionary integrated (links from data)
Dec. 15, 2010 CLARA / META-NET training course 62
Some pointersCurrent version of PDT: v2.0, LDC2006T01
all three levels, 1.9/1.5/0.8 Mwordshttp://ufal.mff.cuni.cz/pdt2.0
http://ufal.mff.cuni.czResearch -> Corpora (Treebank(s))
http://ufal.mff.cuni.cz/pedtDeep syntax (TR) of Penn Treebank texts
http://www.ldc.upenn.eduLDC2001T10 (PDT v1.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PCEDT 1.0), LDC2006T01 (PDT 2.0)
http://www.clsp.jhu.edu: Workshop 2002Using TL for MT Generation