62
The Prague Dependency Treebanks Morphology, Syntax, Semantics Jan Hajič Institute of Formal and Applied Linguistics School of Computer Science Faculty of Mathematics and Physics Charles University, Prague Czech Republic

Slides - META-NET

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

The Prague Dependency TreebanksMorphology, Syntax, Semantics

Jan HajičInstitute of Formal and Applied Linguistics

School of Computer ScienceFaculty of Mathematics and Physics

Charles University, PragueCzech Republic

Dec. 15, 2010 CLARA / META-NET training course 2

The Prague Dependency Treebank

The ideaApply the “old” Prague theory to real-word textsProvide enough data for ML experiments

?“Old” Prague theoryPrague structuralism (1930s)Stratificational approachCentered on “deep syntax”

Separated from “surface form”Dependency based (how else ☺)

Dec. 15, 2010 CLARA / META-NET training course 3

PDT:The Methodology

Manual annotation is PRIMARYSome help from existing tools possible

“No information loss, no redundancy”Much formalization, but…… original form always retrievable

Dictionaries In theory: “secondary”, side effect of annotationIn reality: help consistencyLinks: data → dictionary(-ies)

Extensive support for Machine LearningErgonomy of annotation

Graphical (“linguistic”) presentation & editing

Dec. 15, 2010 CLARA / META-NET training course 4

The Prague Dependency Treebank Project: Czech Treebank

1995 (Dublin) 1996-2006-2010-…1998 PDT v. 0.5 released (JHU workshop)

400k words manually annotated, unchecked2001 PDT 1.0 released (LDC):

1.3MW annotated, morphology & surface syntax2006 PDT 2.0 release

0.8MW annotated (50k sentences) + PDT 1.0 correctedthe “tectogrammatical layer”

underlying (deep) syntax

Dec. 15, 2010 CLARA / META-NET training course 5

Related Projects (Treebanks)

Prague Czech-English Dependency TreebankWSJ portion of PTB, translated to Czech (1.2 mil. words)automatically analyzed

English side (PTB), tooManual annotation started

Prague Arabic Dependency Treebankapply same representation to annotation of Arabicsurface syntax so far

Both published (partial version) in 2004 (LDC)PCEDT version 2.0 being prepared (2011)

Dec. 15, 2010 CLARA / META-NET training course 6

PDT Annotation LayersL0 (w) Words (tokens)

automatic segmentation and markup onlyL1 (m) Morphology

Tag (full morphology, 13 categories), lemmaL2 (a) Analytical layer (surface syntax)

Dependency, analytical dependency functionL3 (t) Tectogrammatical layer (“deep” syntax)

Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon

PD

T 1.

0 (2

001)

PD

T 2.

0 (2

006)

Dec. 15, 2010 CLARA / META-NET training course 7

PDT Annotation LayersL0 (w) Words (tokens)

automatic segmentation and markup onlyL1 (m) Morphology

Tag (full morphology, 13 categories), lemmaL2 (a) Analytical layer (surface syntax)

Dependency, analytical dependency functionL3 (t) Tectogrammatical layer (“deep” syntax)

Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon

Dec. 15, 2010 CLARA / META-NET training course 8

Morphological Attributes

Tag: 13 categoriesExample: AAFP3----3N----

Adjective no poss. Gender negatedRegular no poss. Number no voiceFeminine no person reserve1Plural no tense reserve2Dative superlative base var.

Lemma: POS-unique identifierBooks/verb -> book-1, went -> go, to/prep. -> to-1

Ex.: nejnezajímavějším“(to) the most uninteresting”

Dec. 15, 2010 CLARA / META-NET training course 9

Morphological Disambiguation

Full morphological disambiguationmore complex than (e.g. English) POS tagging

Several full morphological taggers:(Pure) HMMFeature-based (MaxEnt-like)

used in the PDT distributionAveraged Perceptron (M. Collins, EMNLP’02)

All: ~ 94-96% accuracy (perceptron is best) “COMPOST” (available for several languages)

EACL 2009 paper, http://ufal.mff.cuni.cz/compost

Dec. 15, 2010 CLARA / META-NET training course 10

The Segmentation Problem:Arabic

Tokenization / segmentation not always trivialArabic, German, Chinese, Japanese

Dec. 15, 2010 CLARA / META-NET training course 11

Layer 2 (a-layer): Analytical Syntax

Dependency + Analytical Function

dependent

governor

The influence of the Mexicancrisis on Central and EasternEurope has apparently been underestimated.

Dec. 15, 2010 CLARA / META-NET training course 12

Analytical Syntax: Functions

Main (for [main] semantic lexemes):Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom“Double” dependency: AtrAdv, AtrObj, AtrAtr

Special (function words, punctuation,...):Reflefives, particles: AuxT, AuxR, AuxO, AuxZ, AuxYPrepositions/Conjunctions: AuxP, AuxCPunctuation, Graphics: AuxX, AuxS, AuxG, AuxK

StructuralElipsis: ExD, Coordination etc.: Coord, Apos

Dec. 15, 2010 CLARA / META-NET training course 13

PDT-styleArabic Surface Syntax

Only several differences(Sometimes) Separate nodes for individual segments (cf. tagging/segmentation) Copula treatment (Czech: rare treated as ellispsis; Arabic: systematic solution), Pred(Added) analytic functions:

AuxM (did-not)

Ante (what)

Work by Faculty of Arts (Arabic language) students

Dec. 15, 2010 CLARA / META-NET training course 14

Arabic Surface SyntaxExample

In the section on literature, the magazine presented the issue of the Arabic language and the dangers that threaten it.

Dec. 15, 2010 CLARA / META-NET training course 15

English Analytic LayerBy conversion from PTB

Extended analytic functions

Head rules Jason Eisner’s, added more for full conversion

Coordination, traces, etc.

Coordination handlingSame as in Czech/Arabic PDT

Dec. 15, 2010 CLARA / META-NET training course 16

Penn TreebankUniversity of Pennsylvania, 1993

Linguistic Data ConsortiumWall Street Journal texts, ca. 50,000 sentences

1989-1991Financial (most), news, arts, sports2499 (2312) documents in 25 sections

AnnotationPOS (Part-of-speech tags)Syntactic “bracketing” + bracket (syntactic) labels(Syntactic) Function tags, traces, co-indexing

Dec. 15, 2010 CLARA / META-NET training course 17

Penn Treebank Example( (S

(NP-SBJ (NP (NNP Pierre) (NNP Vinken) )(, ,) (ADJP

(NP (CD 61) (NNS years) )(JJ old) )

(, ,) )(VP (MD will)

(VP (VB join) (NP (DT the) (NN board) )(PP-CLR (IN as)

(NP (DT a) (JJ nonexecutive) (NN director) ))(NP-TMP (NNP Nov.) (CD 29) )))

(. .) ))

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

POS tag (NNS)(noun, plural)

Phrase label (NP)

Noun Phrase

“Preterminal”

Dec. 15, 2010 CLARA / META-NET training course 18

Penn Treebank Example:Sentence Tree

Phrase-based tree representation:

Dec. 15, 2010 CLARA / META-NET training course 19

Parallel Czech-English Annotation

English text -> Czech text (human translation)Czech side (goal): all layers manual annotationEnglish side (goal):

Morphology and surface syntax: technical conversionPenn Treebank style -> PDT Analytic layer

Tectogrammatical annotation: manual annotation(Slightly) different rules needed for English

AlignmentNatural, sentence level only (now)

Dec. 15, 2010 CLARA / META-NET training course 20

Human Translation ofWSJ Texts

Hired translators / FCE levelSpecific rules for translation

Sentence per sentence only …to get simple 1:1 alignment

Fluent Czech at the target sideIf a choice, prefer “literal” translation

The numbers:English tokens: 1,173,766Translated to Czech:

Revised/PCEDT 1.0: 487,929Now finished (all 2312 documents)

Dec. 15, 2010 CLARA / META-NET training course 21

English Annotation POS and Syntax

Automatic conversion from Penn TreebankPDT morphological layer

From POS tagsPDT analytic layer

From:Penn Treebank Syntactic StructureNon-terminal labelsFunction tags (non-terminal “suffixes”)

2-step processHead determination rulesConversion to dependency + analytic function

Dec. 15, 2010 CLARA / META-NET training course 22

Head Determination Rules

Exhaustive set of rules By J. Eisner + M. Cmejrek/J. Curin4000 rules (non-terminal based)

Ex.: (S (NP-SBJ VP .)) → VPAdditional rules

Coordination, AppositionPunctuation (end-of-sentence, internal)

Original idea (possibility of conversion)J. Robinson (1960s)

Dec. 15, 2010 CLARA / META-NET training course 23

Example: Head Determination Rules (J.E.)

(board)

(board)(the)

(join)

(will) (join)

(join)

(join)

(NP (DT NN)) → NN

(VP (VB NP)) → VB

(VP (MD VP)) → VP

(S (… VP …)) → VP

Rules:

Dec. 15, 2010 CLARA / META-NET training course 24

Example: Analytical Structure, Functions

(board)

(board)(the)

(join)

(will) (join)

(join)

(join)

→→

Penn Treebank structure(with heads added) PDT-like Analytic

Representation

Dec. 15, 2010 CLARA / META-NET training course 25

PDT Annotation LayersL0 (w) Words (tokens)

automatic segmentation and markup onlyL1 (m) Morphology

Tag (full morphology, 13 categories), lemmaL2 (a) Analytical layer (surface syntax)

Dependency, analytical dependency functionL3 (t) Tectogrammatical layer (“deep” syntax)

Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon

Dec. 15, 2010 CLARA / META-NET training course 26

Layer 3 (t-layer): Tectogrammatical

Underlying (deep) syntax4 sublayers (integrated):

dependency structure, (detailed) functorsvalency annotation

topic/focus and deep word ordercoreference (mostly grammatical only)all the rest (grammatemes):

detailed functorsunderlying gender, number, ...

Total39 attributes (vs. 5 at m-layer, 2 at a-layer)

Dec. 15, 2010 CLARA / META-NET training course 27

Analytical vs. Tectogrammatical

Underlying verb + tense

Deep function

Elided Actor in

Prepositions out

Another ellipsis...

(TR: sublayer 1 only shown)

Dec. 15, 2010 CLARA / META-NET training course 28

Layer 3: Tectogrammatical

Underlying (deep) syntax4 sublayers:

dependency structure, (detailed) functorstopic/focus and deep word ordercoreferenceall the rest (grammatemes):

detailed functorsunderlying gender, number, ...

Dec. 15, 2010 CLARA / META-NET training course 29

Tectogrammatical Functors

“Actants”: ACT, PAT, EFF, ADDR, ORIG

modify: verbs, nouns, adjectivescannot repeat in a clause, usually obligatory

Free modifications (~ 50), semantically definedcan repeat; optional, sometimes obligatoryEx.: LOC, DIR1, ...; TWHEN, TTILL,...; RSTR; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR, ...

SpecialCoordination, Rhematizers, Foreign phrases,...

syntactic semantic

Dec. 15, 2010 CLARA / META-NET training course 30

Tectogrammatical Example

Analytical verb form:(he) allowed would-be to-be enrolled

směl by být zapsán

Additional attributes (grammatemes):conditional + “allow”

Collapsed

Dec. 15, 2010 CLARA / META-NET training course 31

Tectogrammatical Example

Passive construction (action)(The) book has-been translated [by Mr. X]Kniha byla přeložena

Disappeared Added

Dec. 15, 2010 CLARA / META-NET training course 32

Tectogrammatical Example

Object(he) gave him a-book

dal mu knihu

Obj goes into ACT, PAT, ADDR, EFF or ORIG based on governor’s valency frame

Dec. 15, 2010 CLARA / META-NET training course 33

Tectogrammatical Example

Incomplete phrasesPeter works well , but Paul badlyPetr pracuje dobře, ale Pavel špatně

Added

Dec. 15, 2010 CLARA / META-NET training course 34

Layer 3: Tectogrammatical

Underlying (deep) syntax4 sublayers:

dependency structure, (detailed) functorstopic/focus and deep word ordercoreferenceall the rest (grammatemes):

detailed functorsunderlying gender, number, ...

Dec. 15, 2010 CLARA / META-NET training course 35

Deep Word OrderTopic/Focus

Example:

Baker bakes rolls. vs. BakerIC bakes rolls.

Analyticaldep. tree:

Dec. 15, 2010 CLARA / META-NET training course 36

Layer 3: TectogrammaticalUnderlying (deep) syntax4 sublayers:

dependency structure, (detailed) functorstopic/focus and deep word ordercoreferenceall the rest (grammatemes):

detailed functorsunderlying gender, number, ...

Dec. 15, 2010 CLARA / META-NET training course 37

Coreference

Grammatical (easy)relative clauses

which, whoPeter and Paul, who ...

controlinfinitival constructions

John promised to go ...

reflexive pronouns{him,her,thme}self(-ves)

Mary saw herself in ...

Johngo

he home

promisePRED

ACTPAT

ACT DIR3

Dec. 15, 2010 CLARA / META-NET training course 38

Coreference

TextualEx.: Peter moved to Iowa after he finished his PhD.

Peter Iowafinish

he PhD

movePRED

ACT DIR1TWHEN

ACT PAT

heAPP

Dec. 15, 2010 CLARA / META-NET training course 39

Layer 3: TectogrammaticalUnderlying (deep) syntax4 sublayers:

dependency structure, (detailed) functorstopic/focus and deep word ordercoreferenceall the rest (grammatemes):

detailed functorsunderlying gender, number, ...

Dec. 15, 2010 CLARA / META-NET training course 40

GrammatemesDetailed functors (subfunctors)

only for some functors: TWHEN: before/afterLOC: next-to, behind, in-front-of, ...also: ACMP, BEN, CPR, DIR1, DIR2, DIR3, EXT

Lexical (underlying)number (SG/PL), tense, modality, degree of comparison, ...strictly only where necessary (agreement!)

Dec. 15, 2010 CLARA / META-NET training course 41

Fully Annotated Sentence

The boundaries of some problems seem to be clearer after they were revived by Havel’s speech.

Dec. 15, 2010 CLARA / META-NET training course 42

Arabic Example:Tectogrammatics

In the section on literature, the magazine presented the issue of the Arabic language and the dangers that threaten it.

Dec. 15, 2010 CLARA / META-NET training course 43

English PDT-style Annotation

Morphology and SyntaxBy conversion

Tectogrammatical annotationManual (English TR: by S. Cinková)Pre-annotation

Transformation from Penn Treebank & Propbank (Palmer, Kingsbury) by Z. Žabokrtský et al.

ValencyFrom Propbank Frame Files (Cinková, Šindlerová, Nedolužko, Semecký)

The annotation is finished now (Nov. 2010; 1 mil. words)

Dec. 15, 2010 CLARA / META-NET training course 44

Valency in the PDTValency: specific ability of a word to combine itself withother units of meaning

dát (give)

matka (mother)ADDR

EvaACT

pršet (rain)

zítra (tomorrow)TWHEN

plakat (cry)

Adam noc (night)ACT TWHEN

Specific behavior

dar (gift)PAT

neděle (Sunday)TWHEN

---

Modifies anything

Dec. 15, 2010 CLARA / META-NET training course 45

Valency - Basic Principles

inner participants vs. free modifications(arguments vs. adjuncts)

obligatory vs. optional modifications(the dialogue test)

Dec. 15, 2010 CLARA / META-NET training course 46

Inner Participant …… Free Modification

ACT(or), PAT(ient) ADDR(essee), EFF(ect), ORIG(in) (5)

each occurs just with particular verbs

each modifies the verb only once (in a clause)

Location (LOC, DIR1,…) Time (TWHEN, TTILL, …), Manner, Intention,… (70)

can modify in principleany verb

can be repeated (within the same clause)

Dec. 15, 2010 CLARA / META-NET training course 47

Obligatory … Optional

A: John left.B: From where?A: *I don't know.

A: John left.B: To where?A: I don't know.

„from where“obligatory modification

„to where“optional modification

The Dialogue Test

Answering a question about a semantically obligatory modification, the speaker cannot say: I don't know.

Dec. 15, 2010 CLARA / META-NET training course 48

Valency frame

obligatory optionalargumentadjunct

Structure:

one meaning of the word one valency frame

Contents:

functorobligatorinesssurface form

word: leavemeaning 1: sb left sthmeaning 2: sb left from somewhere

frame1: ACT PATframe2: ACT DIR1

Dec. 15, 2010 CLARA / META-NET training course 49

Valency lexicon:PDT-VALLEX

8500 verb senses / valency frames9000 noun sense / valency framessome adjectives and adverbs

PDT-VALLEX Entryverb: dosáhnout meaning 1: to reach sthmeaning 2: to get sb to do sthmeaning 3: …meaning 4: …

Dec. 15, 2010 CLARA / META-NET training course 50

The PDT-VALLEX editor

‘lay down’

resign

win

ask

senses:

Dec. 15, 2010 CLARA / META-NET training course 51

Valency Lexicon and TrEd

to write sth (about sth)

Dec. 15, 2010 CLARA / META-NET training course 52

Corpus <-> Valency Lexicon

Corpus – occurrences of „uzavřít“ (to close) :

ENTRY: uzavřítvf1: ACT(.1) CPHR({smlouva}.4)

ex: u. dohodu (close a contract)vf2: ACT(.1) PAT(.4)

ex.: u. pokoj (close a room, house)

Lexicon:

Sentence 2035: Sentence 15345: Sentence 51042:

Dec. 15, 2010 CLARA / META-NET training course 53

Valency and Text Generation

Using valency for......getting the correct (lemma, tag) of verb arguments

Example:

starat_sePRED

MartinACT

tygrPAT

Martin....1..........

staratV..............

o...............

tygr....4..........

VALLEX entry: starat (se) ACT(.1) PAT(o.[.4])

se...............

Martin se stará o tygry.

“Martin takes care of tigers.”

“to take care of”

“tiger”

Dec. 15, 2010 CLARA / META-NET training course 54

The Annotation Process4 sublayers

work on structure first, rest in parallelStructure

automatic preprocessing - programmed conversion from analytical layer annotation

Grammatemesmostly automatically (based on lower layers’annotation), manual checking, corrections

Cross-sublayer/cross-layer checkingpartly automatic, then manual

Dec. 15, 2010 CLARA / META-NET training course 55

The Annotation SchemeXML + principles of linear- and tree-based standoff annotation

⇒ PML(Prague Markup Language)

Layer schemes (Relax NG)PDT/PADT: t(ecto), a(nalytic), m(orphology), …English: + phrase-based (p-layer)

Dec. 15, 2010 CLARA / META-NET training course 56

PML/XML AnnotationLayers

Strictly top-down linksw+m+a can be easily “knitted”API for cross-layer access (programming)PML Schema / Relax NG[z and audio layers:used for spoken data (audio as layer “-1”)]

LFGanalogy:

f-struct

Φ

c-structz-

laye

rau

dio

BYL BYS ČELO LESA …

Dec. 15, 2010 CLARA / META-NET training course 57

PDT 2.0: The DataData sizes

Dec. 15, 2010 CLARA / META-NET training course 58

Tectogrammatical Layer in Machine Translation

The Translation (“Vauquois”) triangle

transfer

source target

Tectogrammatical Representation

Surface Syntax

MorphologyGeneration

Cz En

Dec. 15, 2010 CLARA / META-NET training course 59

Dependency trees in MT

According to his opinion UAL's executives were misinformed about the financing of the original transaction.

Transfer:

Podle jeho názoru bylo vedení UAL o financovánípůvodní transakce nesprávně informováno.

- structure (~0)- lexical- functions- grammatical

Dec. 15, 2010 CLARA / META-NET training course 60

Valency and Translation

leave-1 nechat-3ACT() PAT() LOC() ACT(.1) PAT(.4) LOC()

leave-2 odjet-1ACT() DIR1(from.) ACT(.1) DIR1(z.[.2])

Dec. 15, 2010 CLARA / META-NET training course 61

To summarize…

PDT is/has (a)…Dependency-based treebanking project

Czech (other languages in the works – Eng, Ar)~ 1mil. words

sufficient size for ML experiments4 layers of annotation

token, morphology, syntax, deep syntax/semantics++)independent and full information at all levels, but...interlinked (for the development of parsers/generators)

Valency dictionary integrated (links from data)

Dec. 15, 2010 CLARA / META-NET training course 62

Some pointersCurrent version of PDT: v2.0, LDC2006T01

all three levels, 1.9/1.5/0.8 Mwordshttp://ufal.mff.cuni.cz/pdt2.0

http://ufal.mff.cuni.czResearch -> Corpora (Treebank(s))

http://ufal.mff.cuni.cz/pedtDeep syntax (TR) of Penn Treebank texts

http://www.ldc.upenn.eduLDC2001T10 (PDT v1.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PCEDT 1.0), LDC2006T01 (PDT 2.0)

http://www.clsp.jhu.edu: Workshop 2002Using TL for MT Generation