Jan Hajič Otakar Smrž Petr Zemánek Jan Šnaidauf Emanuel Beška Faculty of Mathematics and Physics Faculty of Philosophy and Arts Charles University in Prague Development in Data and Tools Prague Arabic Dependency Treebank


Jan Hajič
Otakar Smrž
Petr Zemánek
Jan Šnaidauf
Emanuel Beška

Faculty of Mathematics and Physics
Faculty of Philosophy and Arts
Charles University in Prague

Prague Arabic Dependency Treebank
Development in Data and Tools


September 23, 2004 Prague Arabic Dependency Treebank: Development in Data and Tools


Project Release – PADT 1.0

December 2004, Linguistic Data Consortium
148 000 Morpho, 113 500 Syntax

AFP  13 000     N/A  France Presse  Penn ATB 1
UMH  38 500     N/A  Ummah Press    Penn ATB 2
XIN  13 500     N/A  Xinhua News    A Gigaword
ALH  10 000  73 500  Al-Hayat News  A Gigaword
ANN  12 500  25 500  An-Nahar News  A Gigaword
XIA  26 500  49 500  Xinhua News    A Gigaword


Open-Source Tools

TrEd Tree Editor
  Multi-purpose annotation environment
  Suite of programming utilities
Netgraph Search Engine
  Server/Client system architecture
  Easy-to-learn query language
Encode::Arabic Perl Module
  Extension for processing of Arabic script
  ArabTeX, Buckwalter, Unicode, …
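As a rough illustration of what converting between the Buckwalter notation and Unicode involves, here is a minimal Python sketch. It is not the Encode::Arabic interface (that is a Perl module); the function names are hypothetical, and the table is the standard one-to-one Buckwalter mapping. Symbols outside the table are passed through unchanged.

```python
# Standard Buckwalter symbols mapped to Arabic code points.
# Function names below are illustrative, not part of any existing library.
BUCKWALTER_TO_ARABIC = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624",
    "<": "\u0625", "}": "\u0626", "A": "\u0627", "b": "\u0628",
    "p": "\u0629", "t": "\u062a", "v": "\u062b", "j": "\u062c",
    "H": "\u062d", "x": "\u062e", "d": "\u062f", "*": "\u0630",
    "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638",
    "E": "\u0639", "g": "\u063a", "_": "\u0640", "f": "\u0641",
    "q": "\u0642", "k": "\u0643", "l": "\u0644", "m": "\u0645",
    "n": "\u0646", "h": "\u0647", "w": "\u0648", "Y": "\u0649",
    "y": "\u064a", "F": "\u064b", "N": "\u064c", "K": "\u064d",
    "a": "\u064e", "u": "\u064f", "i": "\u0650", "~": "\u0651",
    "o": "\u0652", "`": "\u0670", "{": "\u0671",
}

# The mapping is one-to-one, so the reverse table is a plain inversion.
ARABIC_TO_BUCKWALTER = {v: k for k, v in BUCKWALTER_TO_ARABIC.items()}


def buckwalter_to_unicode(text):
    """Replace every Buckwalter symbol by its Arabic character."""
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)


def unicode_to_buckwalter(text):
    """Replace every Arabic character by its Buckwalter symbol."""
    return "".join(ARABIC_TO_BUCKWALTER.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(buckwalter_to_unicode("kitAb"))
```

Because the table is a bijection, converting to Unicode and back returns the original string for any input made only of table symbols.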


PADT Functional Views

Functional Generative Description
  Theory of linguistic meaning and its expression
  Prague Dependency Treebank for Czech
Independence of representation levels
  Tectogrammatical – linguistic meaning
  Analytical – surface dependency syntax
  Morphological – categories and lexical units
Abstraction of the relations across levels
  Strict distinction between form and function
  Different units of description on each level


Functional Morphology

Provides syntax levels with their abstract language, not just giving letters in tokens
Revives multiple senses of categories
Completeness of generation
Strict modeling of grammatical control
MorphoTrees – ‘human tagging’
Successful prototype feature-based tagger


Syntactic Levels of Description

Analytical level
  Pragmatically motivated, close to surface syntax
  Every single token resulting from morphological level forms one node
  Tree-like dependency structure for every sentence
Tectogrammatical level
  Linguistic (literal) meaning, deep relations, TFA
  Initial structures transformed from AL
  Nodes for autosemantic words only
  Decisive role of valency frames
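The idea that tectogrammatical structures start from analytical ones and keep nodes for autosemantic words only can be sketched very crudely in Python. This is an assumption-laden simplification, not the actual PADT transformation: here every node whose function label starts with "Aux" is treated as a function word and dropped, and its dependents are re-attached to the nearest remaining ancestor. The Node class, the function name, and the edges in the sample fragment are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    id: int
    form: str
    afun: str                # analytical function label, e.g. "Sb", "AuxP"
    parent: Optional[int]    # id of the governing node, None for the root


def drop_function_words(nodes: List[Node]) -> List[Node]:
    """Remove Aux* nodes and re-attach their dependents upwards."""
    by_id: Dict[int, Node] = {n.id: n for n in nodes}

    def is_aux(node: Node) -> bool:
        return node.afun.startswith("Aux")

    def kept_ancestor(pid: Optional[int]) -> Optional[int]:
        # Climb until a node that survives the reduction (or the top).
        while pid is not None and is_aux(by_id[pid]):
            pid = by_id[pid].parent
        return pid

    return [
        Node(n.id, n.form, n.afun, kept_ancestor(n.parent))
        for n in nodes
        if not is_aux(n)
    ]


# Fragment "iqtirAHu ... EalA zumalA'i-hi"; the edges are assumed.
fragment = [
    Node(1, "iqtirAHu", "Sb", None),
    Node(2, "EalA", "AuxP", 1),
    Node(3, "zumalA'i", "Obj", 2),
    Node(4, "-hi", "Atr", 3),
]

if __name__ == "__main__":
    for node in drop_function_words(fragment):
        print(node)
```

In the sample fragment the preposition node disappears and its dependent attaches directly to the noun. A real transformation would also have to handle valency frames and TFA, which this sketch ignores.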


Logic of Analytical Trees

Concepts of dependency and valency
Reduction: sentence must retain grammatical correctness if leaves (terminal nodes) are chopped off
Trees: clause components, clauses, sentences, paragraphs etc.
  Subtrees of clauses exchangeable for non-clauses
Nodes: words, tokenized parts of words, punctuation marks – marked by functions
Edges: syntactic relations – between governing node and dependent node/subtree
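A small Python sketch may help picture these notions: nodes carry a form and a function, edges run from a governing node to its dependents, and reduction removes leaves one at a time. The class and function names are hypothetical, the edges of the sample tree are assumed, and the code only performs the structural operation; whether the reduced sentence stays grammatical is a linguistic judgment it cannot check.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    form: str
    function: str
    dependents: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a dependent node and return it for chaining."""
        self.dependents.append(child)
        return child


def leaves(node: Node) -> List[Node]:
    """Collect terminal nodes (nodes without dependents)."""
    if not node.dependents:
        return [node]
    found: List[Node] = []
    for dependent in node.dependents:
        found.extend(leaves(dependent))
    return found


def chop(node: Node, form: str) -> bool:
    """Remove the first leaf with this form; non-leaves are left alone."""
    for index, dependent in enumerate(node.dependents):
        if dependent.form == form and not dependent.dependents:
            del node.dependents[index]
            return True
        if chop(dependent, form):
            return True
    return False


def forms(node: Node) -> List[str]:
    """List forms in depth-first order, governing node first."""
    collected = [node.form]
    for dependent in node.dependents:
        collected.extend(forms(dependent))
    return collected


# Assumed edges for a shortened sentence "dAma iqtirAHu-hu sAEatayni".
root = Node("dAma", "Pred")
subject = root.add(Node("iqtirAHu", "Sb"))
subject.add(Node("-hu", "Atr"))
root.add(Node("sAEatayni", "Adv"))

if __name__ == "__main__":
    print([leaf.form for leaf in leaves(root)])
```

Calling chop(root, "sAEatayni") removes the adverbial leaf, while chop(root, "iqtirAHu") returns False because that node still governs a dependent.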


Some Syntax Issues of Arabic

Non-verbal predication of several types
Subordinate non-verbal clauses / modification
Verb-like behavior of many nominal forms
Mostly VSO in verbal sentences, but…
  vice-versa in non-verbal clauses
  different, depending on context boundness
Compound verbs, fixed composite prepositions
Grammatical co-reference, accusative of inner object, complex referencing, etc.


Problem I: Predication

Head node of tree: PREDICATE
  Why? Steady role in sentence, cannot be omitted
Verbal predicate: I-go to school
Non-verbal predicate
  Nominal: The-house a-big (=the house is big)
  Existential: There a-city (=there is a city)
  Prepositional
    Possessive: For him a-house (=he has a house)
    Adverbial: The-mosque in the-city (=…is…)
  Conjunctional: The-problem that (=…is that)


Predication Types in Trees

Verbal
  dAma [Pred] lasted
  iqtirAHu [Sb] proposal
  -hu [Atr] his
  al-EamalIyata [Obj] the-operation [acc.]
  EalA [AuxP] on
  zumalA’i [Obj] colleagues
  -hi [Atr] his
  sAEatayni [Adv] two-hours [acc.]
  Verb-like behavior (object of noun?)

Nominal
  al-baytu [Sb] the-house [nom.]
  kabIrun [Pnom] a-big [nom.]

Existential
  vam~ata [PredE] there-is
  madInatun [Sb] a-city [nom.]

Prepositional (possessive)
  la- [PredP] for
  -hu [Obj] him
  baytun [Sb] a-house [nom.]

Prepositional (adverbial, locative)
  fI [PredP] in
  al-madInati [Adv] the-city [gen.]
  al-jAmiEu [Sb] the-mosque [nom.]
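The predicate labels used in these trees (Pred, Pnom, PredE, PredP) can be read off mechanically. The following Python sketch, with hypothetical names, classifies a sentence by the first predicate label found among its nodes. It covers only the example sentences listed here and assumes each contains exactly one such label; sentences that combine several predicate labels would need a more careful rule.

```python
# Predicate labels as they appear in the example trees.
PREDICATE_LABELS = {
    "Pred": "verbal",
    "Pnom": "nominal",
    "PredE": "existential",
    "PredP": "prepositional",
}

# (form, function) pairs copied from the example trees.
EXAMPLES = {
    "possessive": [("la-", "PredP"), ("-hu", "Obj"), ("baytun", "Sb")],
    "locative": [("fI", "PredP"), ("al-madInati", "Adv"), ("al-jAmiEu", "Sb")],
    "existential": [("vam~ata", "PredE"), ("madInatun", "Sb")],
    "nominal": [("al-baytu", "Sb"), ("kabIrun", "Pnom")],
    "verbal": [("dAma", "Pred"), ("iqtirAHu", "Sb"), ("sAEatayni", "Adv")],
}


def predication_type(nodes):
    """Return (type, form) for the first node with a predicate label."""
    for form, function in nodes:
        if function in PREDICATE_LABELS:
            return PREDICATE_LABELS[function], form
    return None


if __name__ == "__main__":
    for name, nodes in EXAMPLES.items():
        print(name, predication_type(nodes))
```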


Problem II: Clauses & Co-reference

Recursiveness: subordinate clause is contained as subtree in place of simple element
  Head-node of clause gets the same function
  Problem: non-verbal structures – clauses or not?
  Compound verbs (mA zAla etc.) treated equally
Grammatical co-reference: Personal pronoun formally required by another element
  Pronoun must be marked to be treated as such
  Target of reference is unambiguously identifiable
  Often in subordinate clauses, mostly attributive
Ex.: He-wrote a-book number its-pages hundred


Clauses & Co-reference in Trees

  kataba [Pred] he-wrote
  al-rajulu [Sb] the-man [nom.]
  kitAban [Obj] a-book
  fI [Atr_PredP] in – Attributive clause, prepositional predicate (adverbial)
  -hi [Adv_Ref] it – Referencing pronoun, as adverbial in clause
  mi’atu [Sb] hundred [nom.]
  SafHatin [Atr] pages [gen.]

  mA [AuxM] not
  zAlat [Pred] she-stopped
  zaynabu [Sb] Zaynab
  tuHis~u [Atv] she-feels – Compound verb, formed as main verb and its complement
  anna [AuxC] that
  jumalan [Sb] sentences [acc.]
  naHwu [Sb] grammar [nom.]
  -hA [Atr_Ref] their – Referencing pronoun, as attribute in clause
  wADiHun [Atr_Pnom] clear [nom.] – Attributive clause, nominal predicate
  tuEjibu [Obj_Pred] they-impress – Objective clause, verbal predicate
  -hA [Obj] her
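The composite labels in these trees (Atr_PredP, Obj_Pred, Atr_Pnom, Atr_Ref, Adv_Ref) pair the function a clause or pronoun has in its context with a marker of what the node is. A minimal Python sketch of reading such labels follows. The function name and the wording of the descriptions are hypothetical, and only the labels that occur in these slides are covered.

```python
# Functions and predicate kinds seen in the composite labels of the slides.
CLAUSE_FUNCTIONS = {"Atr": "attributive", "Obj": "objective", "Adv": "adverbial"}
PREDICATE_KINDS = {
    "Pred": "verbal predicate",
    "Pnom": "nominal predicate",
    "PredP": "prepositional predicate",
}


def describe(label):
    """Split a label such as 'Atr_PredP' and describe its two parts."""
    function, _, suffix = label.partition("_")
    if not suffix:
        return f"{function} (plain function)"
    if suffix == "Ref":
        # Pronoun marked as grammatically co-referent.
        return f"referencing pronoun, {function} in clause"
    if suffix in PREDICATE_KINDS:
        name = CLAUSE_FUNCTIONS.get(function, function)
        return f"{name} clause, {PREDICATE_KINDS[suffix]}"
    return f"{function} with unrecognized suffix {suffix}"


if __name__ == "__main__":
    for label in ["Atr_PredP", "Obj_Pred", "Atr_Pnom", "Atr_Ref", "Adv_Ref", "Sb"]:
        print(label, "->", describe(label))
```

Keeping the clause function as the first part of the label reflects the point that the head node of a clause gets the same function a simple element would have in that position.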


Future Prospects

Implementation of Functional Morphology
Tectogrammatical annotation
Lexicons of valency frames
Re-training the feature-based tagger on MorphoTrees
Machine-learning on the treebank data for various purposes


Thank you

Questions welcome!

http://ckl.mff.cuni.cz/padt/