http://ufal.mff.cuni.cz/pdt2.0

PDT 2.0

Prague Dependency Treebank 2.0

Zdeněk Žabokrtský

Dept. of Formal and Applied Linguistics

Charles University, Prague

zabokrtsky@ufal.mff.cuni.cz

Outline of the talk

Introduction

Layers of annotation

Data

Software tools

Documentation

Tour through the CD-ROM

Final remarks

Introduction

treebank: a syntactically annotated corpus (a "bank" of syntactic trees)

Prague Dependency Treebank: a collection of linguistically annotated Czech texts (2 million words), software tools, and documentation

the annotation consists of morphological, surface-syntactic, and deep-syntactic dependency-oriented sentence analyses

About Czech

western group of Slavic languages

rich inflectional morphology

(relatively) free word order language

Latin alphabet extended with accents

(příliš žluťoučký kůň)

spoken in the Czech Republic

10+ million speakers

Historical background and development of PDT

1920’s – Prague Linguistic Circle founded

1930-50's – influential dependency-oriented works of Lucien Tesnière and Vladimír Šmilauer

mid 1960’s – Petr Sgall’s Functional Generative Description

1992 – Penn Treebank

1994 – Czech National Corpus

1995 – PDT started

1998 – PDT 0.5 pre-release

2001 – PDT 1.0 released by LDC

2006 – PDT 2.0 to be released by LDC

Layered annotation scheme

tectogrammatical layer: deep-syntactic dependency tree

analytical layer: surface-syntactic dependency tree

morphological layer: morphological lemma and tag associated with each token

word layer: original text, segmented on word boundaries

He would have gone intoforest.

M-layer

sentence represented as a sequence of tokens

each token lemmatized and tagged (attributes lemma and tag)

15-character positional morphological tag:

1. (main) POS 2. detailed POS 3. gender 4. number 5. case ...
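The positional encoding can be illustrated with a short Python sketch. Only the names of the first five positions come from the slide; positions 6 to 15 are labeled generically, and the sample tag is an assumed value chosen for illustration, not taken from the data.

```python
# Names of the first five positions, as listed on the slide. The remaining
# positions are not named there, so this sketch gives them generic labels.
POSITION_NAMES = ["POS", "detailed POS", "gender", "number", "case"]
TAG_LENGTH = 15


def decode_tag(tag):
    """Split a 15-character positional tag into (position name, character) pairs."""
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} characters, got {len(tag)}")
    names = POSITION_NAMES + [f"position {i}" for i in range(6, TAG_LENGTH + 1)]
    return list(zip(names, tag))


# Hypothetical tag, used purely for illustration.
decoded = decode_tag("NNMS1-----A----")
for name, value in decoded[:5]:
    print(f"{name}: {value}")
```

Because every category has a fixed position, a single category (e.g. case) can be read off by index without parsing the rest of the tag.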

A-layer (1) - nodes and edges

sentence represented as a rooted ordered tree with labeled nodes and edges

edges labeled with analytical functions:

dependency relations (Sb, Obj, Adv, Atr)

non-dependency relations (Coord)

auxiliary (functional) nodes (AuxP for prepositions, AuxC for subordinating conjunctions...)

special treatment of coordination constructions

A-layer (2) - coordination

intricate interplay between dependency and coordination relations

PDT solution: both the conjuncts (members of the coordination) and their shared modifiers are attached below the coordinating conjunction; the two are distinguished by the special attribute is_member

direct parent vs. effective parent (figure: coordination tree with the conjuncts marked M)
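The difference between the direct and the effective parent can be sketched in Python. This is a simplified illustration under stated assumptions: the class `ANode` and the functions `conjuncts` and `effective_parents` are hypothetical names, only `afun`, `Coord`, and `is_member` come from the slides, and the sketch ignores auxiliary nodes such as AuxP and AuxC.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ANode:
    form: str
    afun: str
    is_member: bool = False
    parent: Optional["ANode"] = field(default=None, repr=False)
    children: List["ANode"] = field(default_factory=list, repr=False)

    def attach(self, child: "ANode") -> "ANode":
        child.parent = self
        self.children.append(child)
        return child


def conjuncts(coord):
    """Members of a coordination, descending into nested coordinations."""
    result = []
    for child in coord.children:
        if child.is_member:
            result.extend(conjuncts(child) if child.afun == "Coord" else [child])
    return result


def effective_parents(node):
    # A conjunct hangs below the conjunction, so climb past it first.
    current = node
    while current.is_member and current.parent is not None:
        current = current.parent
    parent = current.parent
    if parent is None:
        return []
    # A shared modifier hangs below the conjunction but modifies every conjunct.
    if parent.afun == "Coord":
        return conjuncts(parent)
    return [parent]


# "old men and women came", with "old" read as modifying both nouns
came = ANode("came", "Pred")
and_ = came.attach(ANode("and", "Coord"))
men = and_.attach(ANode("men", "Sb", is_member=True))
women = and_.attach(ANode("women", "Sb", is_member=True))
old = and_.attach(ANode("old", "Atr"))  # shared modifier, is_member stays False

print(men.parent.form, [p.form for p in effective_parents(men)])
print(old.parent.form, [p.form for p in effective_parents(old)])
```

In this toy tree the direct parent of both "men" and "old" is the conjunction, while their effective parents are the verb and the two conjuncts, respectively.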

T-layer (1) - nodes

t-nodes:

complex typed feature structures

nodes represent autosemantic words

functional words do not have nodes of their own

artificially added nodes (e.g. for pro-drops)

node attributes:

tectogrammatical lemma

dependency relation – functor and subfunctor

grammateme attributes (representing morphological meanings)

attributes for topic-focus articulation

attributes for coreference relations

T-layer (2) - dependency relations

according to FGD, two types of functors:

actants (arguments): ACT – actor, PAT – patient, ADDR – addressee, EFF – effect, ORIG – origin

free modifiers (adjuncts):

various types of temporal modifiers – TWHEN, TTILL, TSIN...

spatial and directional modifiers – LOC, DIR1, DIR2, DIR3

MEANS, BENeficiary, CAUSe, REGard, EXTent, MATerial, CONDition...

additional functors for representing non-dependency relations:

coordinations – CONJ, DISJ, ADVS...

appositions – APPS

parenthetical constructions – PAR

expressions in a foreign language – FPHR

T-layer (3) - valency

all occurrences of all verbs in t-trees are interlinked with the valency lexicon PDT-VALLEX

individual valency frames roughly correspond to individual senses of the given verb

valency frame ~ a sequence of frame slots; for each slot, its functor, its obligatoriness, and its possible surface realizations are specified
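A valency frame as described above can be modeled as a small data structure. The following Python sketch is only an illustration: the names `FrameSlot` and `check_frame`, the plain-string labels for surface forms, and the example frame are all assumptions, not the actual PDT-VALLEX format or content.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class FrameSlot:
    functor: str                    # e.g. ACT, PAT, ADDR
    obligatory: bool
    surface_forms: FrozenSet[str]   # allowed surface realizations (illustrative labels)


def check_frame(frame: List[FrameSlot], dependents: Dict[str, str]) -> List[str]:
    """Compare observed dependents (functor -> surface form) with a frame."""
    problems = []
    for slot in frame:
        form = dependents.get(slot.functor)
        if form is None:
            if slot.obligatory:
                problems.append(f"missing obligatory {slot.functor}")
        elif form not in slot.surface_forms:
            problems.append(f"{slot.functor}: unexpected form {form}")
    return problems


# Hypothetical frame for a verb of giving; not taken from the lexicon.
frame = [
    FrameSlot("ACT", True, frozenset({"nominative"})),
    FrameSlot("PAT", True, frozenset({"accusative"})),
    FrameSlot("ADDR", False, frozenset({"dative"})),
]

ok = check_frame(frame, {"ACT": "nominative", "PAT": "accusative"})
bad = check_frame(frame, {"ACT": "nominative", "ADDR": "genitive"})
print(ok)
print(bad)
```

Keeping the functor, the obligatoriness flag, and the admissible forms together in one slot lets a check of this kind flag both missing arguments and unexpected realizations in a single pass.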

T-layer (3) - coreference

two types of coreference according to FGD:

grammatical (verbs of control, relative clauses, reflexive pronouns...)

textual (personal pronouns, incl. elided ones)

coreference in PDT: a binary relation between t-nodes, depicted as a "non-tree" arc (arrow)

T-layer (4) - grammatemes

grammatemes: t-node attributes representing morphological meanings

motivation

number for nouns, tense for verbs, degree for adjectives, deontic/verb/sentence modality ...

Peter met her youngest brother.

meet PRED tense=ant
  Peter ACT
  brother PAT number=sg
    #PersPron APP
    young RSTR degree=sup

Peter will meet her young brothers.

meet PRED tense=post
  Peter ACT
  brother PAT number=pl
    #PersPron APP
    young RSTR degree=pos
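The motivation behind grammatemes can be made concrete with a small Python sketch: the two example sentences share the same t-lemmas, functors, and tree shape, and differ only in grammateme values. The tuple encoding and the function names are hypothetical, and the tree shape is reconstructed from the figure and should be read as an assumption.

```python
# Each t-node is (t_lemma, functor, grammatemes, children).
past = ("meet", "PRED", {"tense": "ant"}, [      # Peter met her youngest brother.
    ("Peter", "ACT", {}, []),
    ("brother", "PAT", {"number": "sg"}, [
        ("#PersPron", "APP", {}, []),
        ("young", "RSTR", {"degree": "sup"}, []),
    ]),
])
future = ("meet", "PRED", {"tense": "post"}, [   # Peter will meet her young brothers.
    ("Peter", "ACT", {}, []),
    ("brother", "PAT", {"number": "pl"}, [
        ("#PersPron", "APP", {}, []),
        ("young", "RSTR", {"degree": "pos"}, []),
    ]),
])


def skeleton(node):
    """The tree without grammatemes: lemmas, functors, and structure only."""
    lemma, functor, _grammatemes, children = node
    return (lemma, functor, [skeleton(child) for child in children])


def grammatemes(node):
    """Flatten all grammatemes into {(t_lemma, name): value}."""
    lemma, _functor, values, children = node
    found = {(lemma, name): value for name, value in values.items()}
    for child in children:
        found.update(grammatemes(child))
    return found


same_structure = skeleton(past) == skeleton(future)
a, b = grammatemes(past), grammatemes(future)
differences = {key: (a[key], b[key]) for key in a if a[key] != b[key]}

print(same_structure)
for (lemma, name), (first, second) in differences.items():
    print(f"{lemma}.{name}: {first} -> {second}")
```

Keying the grammatemes by t-lemma works here only because each lemma occurs once per tree; real data would need node IDs instead.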

T-layer (5) - node typing

presence/absence of a given attribute? the need for node typing

two-level hierarchy of t-layer node types used in PDT 2.0:

tectogrammatical node types (first level): complex, atom, qcomplex, list, coap, dphr, fphr, root

complex nodes (second level): semantic nouns, semantic adjectives, semantic adverbs, semantic verbs

semantic nouns and their grammatemes:

denotative: n.denot (number, gender)

denotative, negation: n.denot.neg (number, gender, negation)

pronominal, indefinite: n.pron.indef (number, gender, person, indeftype)

pronominal, definite, demonstrative: n.pron.def.demon (number, gender)

pronominal, definite, personal: n.pron.def.pers (number, gender, person, politeness)

quantificative, definite: n.quant.def (number, gender, numertype)

Interlinking the layers

every unit at every layer has an ID that is unique within PDT

neighboring layers connected by top-down pointers
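The linking principle can be sketched in Python as a table of units keyed by ID, each pointing down to units on the layer below. Everything concrete here is an assumption made for illustration: the ID strings, the `refs` key, the function name, and the particular links (a mistyped token split in two on the m-layer, a t-node covering a noun together with its preposition).

```python
# Hypothetical IDs and contents. Only the principle (unique IDs, pointers
# from each layer to the layer directly below) comes from the slide.
units = {
    "w1": {"layer": "w", "form": "gone"},
    "w2": {"layer": "w", "form": "intoforest"},
    "m1": {"layer": "m", "refs": ["w1"]},
    "m2": {"layer": "m", "refs": ["w2"]},        # "into"
    "m3": {"layer": "m", "refs": ["w2"]},        # "forest"
    "a1": {"layer": "a", "refs": ["m1"]},
    "a2": {"layer": "a", "refs": ["m2"]},
    "a3": {"layer": "a", "refs": ["m3"]},
    "t1": {"layer": "t", "refs": ["a1"]},
    "t2": {"layer": "t", "refs": ["a3", "a2"]},  # noun plus its preposition
}


def original_forms(unit_id):
    """Follow the pointers down to the w-layer and collect the word forms."""
    unit = units[unit_id]
    if unit["layer"] == "w":
        return [unit["form"]]
    forms = []
    for ref in unit["refs"]:
        for form in original_forms(ref):
            if form not in forms:   # two units may share one w-layer token
                forms.append(form)
    return forms


print(original_forms("t1"))
print(original_forms("t2"))
```

Since the pointers only lead downwards, getting from a word to the nodes above it requires building a reverse index first.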

Sources of text

texts provided by the Czech National Corpus

7000 articles (or article fragments) from Czech newspapers and journals:

Lidové noviny (daily newspaper)

Mladá fronta Dnes (daily newspaper)

Českomoravský profit (business weekly)

Vesmír (scientific journal)

Amount of annotated data

(MW = million words, kS = thousand sentences)

m-layer data: 1.96 MW in 116 kS

a-layer data (75 % of m-layer): 1.5 MW in 88 kS

t-layer data (59 % of a-layer): 0.8 MW in 49 kS

Division into files

1 XML file per document and annotation layer

Train/test data

train : devtest : evaltest = 8 : 1 : 1
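The 8 : 1 : 1 ratio can be illustrated with a short Python sketch. How documents were actually assigned to the three parts is not stated on the slide; the modulo rule and the function name below are assumptions chosen only to reproduce the ratio, applied to the 7000 source documents mentioned earlier.

```python
from collections import Counter


def split_name(doc_index):
    """Assign a document to a part so that the parts follow the 8 : 1 : 1 ratio."""
    bucket = doc_index % 10
    if bucket < 8:
        return "train"
    return "devtest" if bucket == 8 else "evaltest"


# 7000 is the number of articles (or article fragments) given for PDT 2.0.
counts = Counter(split_name(i) for i in range(7000))
print(counts["train"], counts["devtest"], counts["evaltest"])
```

Splitting by whole documents rather than by sentences keeps all sentences of one article in the same part.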

Full vs. sample data

sample data: 500 sentences

a freely available subset of the full data

also converted to HTML (can be viewed in any WWW browser, no tree editor needed)

the whole PDT 2.0 except for the full data (i.e. the sample data, all tools, and docs) is available on the web

the full data will be available only to licensed users who obtain the CD from the Linguistic Data Consortium

Tree editor TrEd

general customizable tree editor implemented in Perl

the main editing and browsing tool in the PDT project

Batch processing of the data

btred – batch processing version of tred

ntred – networked (parallelized) version of btred

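example (prints the t_lemma of every node directly below the root that has a child whose functor starts with DIR):

$ btred -TNe 'print "$this->{t_lemma}\n" if $this->parent==$root and grep{$_->{functor}=~/^DIR/} $this->children()' data/sample/*.t.gz -q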

Netgraph

client-server application for on-line PDT search

implemented in Java

Tools for post-annotation consistency checking

hundreds of btred scripts of various types:

technical tests, e.g.:

each sentence contains at least one token

all identifiers are unique, all referred identifiers exist...

m-layer tests:

locative (6th case) cannot occur without a preposition

improbable word forms (e.g. imperatives haš, tel)

a-layer tests:

not more than one subject in a clause

attributes (afun Atr) should not appear directly below verbs

t-layer tests:

surface forms of verb arguments match the specifications in the valency lexicon

relative pronouns in relative clauses should be in agreement with their antecedent (in the sense of grammatical coreference)
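The technical tests listed above are simple enough to sketch in a few lines. The real checks are btred scripts; the Python version below is only an illustration on an assumed in-memory representation (a list of sentences, each a list of token dictionaries with an `id` and an optional `ref`), and the function name is hypothetical.

```python
def technical_problems(sentences):
    """Run three technical tests: non-empty sentences, unique IDs, resolvable references."""
    problems = []
    seen = set()
    for number, tokens in enumerate(sentences, start=1):
        if not tokens:
            problems.append(f"sentence {number}: no tokens")
        for token in tokens:
            if token["id"] in seen:
                problems.append(f"duplicate id {token['id']}")
            seen.add(token["id"])
    # References are checked in a second pass, once every ID is known.
    for tokens in sentences:
        for token in tokens:
            ref = token.get("ref")
            if ref is not None and ref not in seen:
                problems.append(f"dangling reference {ref}")
    return problems


# Hypothetical data with one violation of each kind.
sentences = [
    [{"id": "a1"}, {"id": "a2", "ref": "a1"}],
    [],
    [{"id": "a2", "ref": "a9"}],
]
problems = technical_problems(sentences)
for problem in problems:
    print(problem)
```

Collecting all problems instead of stopping at the first one suits post-annotation checking, where the output is a list of places to revisit.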

Tools for automatic annotation

chain of tools for automatic text processing (from a raw text to a-layer trees):

1. sentence segmentation and tokenization

2. morphological analysis

3. morphological disambiguation

4. dependency parsing (adapted Collins)

5. analytical function assignment

Tools for format conversions

conversion not only between PDT data formats, but also from other treebanks' formats

example: constituency trees from Negra displayed in TrEd

PDT 2.0 Documentation

PDT Guide:

overview of all parts of PDT 2.0

mirrors the directory structure of the PDT 2.0 CD-ROM

Annotation guidelines:

m-layer (~100 pages)

a-layer (~250 pages)

t-layer (~800 pages)

Publications: conference and journal papers, technical reports, theses...

Technical documentation (software tools and data formats)

Want to experiment with...

tagging? dependency parsing? semantic-role labeling? frame semantics? word-sense disambiguation? anaphora resolution? information structure? ...

Use PDT 2.0, it's all there!

Annotation scheme not limited to Czech

T-layer in English T-layer in German A-layer in German

A-layer in Arabic A-layer in Slovene A-layer in Romanian

Some of those involved

Thank you!

BTW, anyone interested in beta-testing?
