The PAULA framework: Automatic and Manual Annotation of Linguistic Data Christian Chiarcos and Manfred Stede Universität Potsdam {chiarcos|stede}@ling.uni-potsdam.de Workshop „Processing Pipelines“ Darmstadt 2008/07/10


The PAULA framework: Automatic and Manual Annotation of Linguistic Data

Christian Chiarcos and Manfred Stede
Universität Potsdam
{chiarcos|stede}@ling.uni-potsdam.de

Workshop „Processing Pipelines“, Darmstadt, 2008/07/10


Overview

• Motivation
• Multi-level annotation for discourse structure research
• Multi-level annotation for information structure research
• The ANNIS linguistic information system: multi-level querying and visualization
• Example pipelines
• Corpus annotation and exploitation
• PAULA for text summarization
• The PAULA format
• Current state, future plans

Multi-level annotation (1): Discourse structure(s)

• Thesis:

Coherence of a text is not adequately characterized by „the“ discourse structure (a single tree or graph) but by the interplay of different levels of description, each reflecting a separate dimension of textuality.

(In Textlinguistik, this idea is not new (e.g., Motsch 96) but the programme has not been carried through yet.)

Impfpflicht gegen Kinderkrankheiten?

[1] Kein Kind weiß heute noch, was Pocken sind. [2] So ein Glück. [3] Als die Pockenimpfung 1854 eingeführt wurde, [4] glaubten manche Menschen, [5] dass sich ihr Kopf in einen Kuhkopf verwandelt, [6] wenn sie sich impfen lassen. [7] Denn der Impfstoff wurde damals aus der Haut von Rindern hergestellt. [8] Heute ist diese furchtbare Krankheit ausgerottet. [9] Dank einer entschlossenen, weltweiten Impfkampagne. [10] Aber es gibt noch: Masern, Kinderlähmung, Diphtherie, Mumps, Röteln, Hepatitis B, Tuberkulose, Keuchhusten. [11] Daran sterben, vor allem in den Entwicklungsländern, jährlich immer noch Millionen Kinder. [12] In Deutschland werden diese Krankheiten von vielen Eltern offenbar nicht ernst genommen. [13] Weil sie sie gar nicht mehr kennen! [14] Denn mit Impfstoffen wurde erreicht, [15] dass diese Infektionen nur noch sporadisch auftreten. [16] Doch wer aus eigenem Erleben weiß, [17] wie schrecklich Kinder leiden, [18] wenn sie ‚nur‘ Masern oder Keuchhusten haben, [19] sollte ihnen dies ersparen. [20] Und auch die gesundheitlichen Folgewirkungen. [21] Nur wer impfen lässt, hilft mit, dass Impfungen eines Tages überflüssig werden. [22] Stattdessen wird über Nebenwirkungen von Impfstoffen schwadroniert, [23] die höchst selten auftreten und die man erst Recht nur aus Büchern kennt. [24] Dann gibt es noch das schöne Argument: Das ist mein Kind, das darf der Staat nicht pieken. [25] Gegen solche Eltern hilft auch keine Impfung.

Mandatory vaccination against children’s diseases?

[1] Today, children don’t know anymore what pox are. [2] What a joy. [3] When pox vaccination was introduced in 1854, [4] quite a few people believed [5] that their head would turn into a cow’s head [6] if they got themselves vaccinated. [7] For the vaccine was made from cattle’s skin at the time. [8] Nowadays this dreadful disease is exterminated. [9] Thanks to a determined, world-wide vaccination campaign. [10] But there still are other diseases: Measles, polio, diphtheria, mumps, rubella, hepatitis B, tuberculosis, pertussis. [11] Millions of children die of these, especially in less developed countries. [12] In Germany, many parents apparently don’t take these diseases seriously. [13] Because they don’t know them anymore! [14] For it has been achieved with vaccines [15] that these infections hit only rarely today. [16] But those who have experienced [17] how terribly children suffer [18] when they come down with ‘just’ measles or pertussis, [19] should spare them the agony. [20] As well as the long-term consequences. [21] Only those who have their children vaccinated will contribute to vaccines’ becoming superfluous some day. [22] Instead, people rant about side effects [23] that occur very rarely and are known merely from books. [24] Then there is the great argument: This is my child, the government must not prick her. [25] No vaccine can help against such parents.


Referential Structure


Thematic Structure


Conjunctive Relations

• temporal
  – simultaneous, succession
• consequential
  – manner, consequence, condition, purpose, concession
• comparative
  – similarity, contrast, reformulation
• additive
  – addition, alternation
• Relations can be directed but not weighted - there is no nuclearity

(Martin 1992)


Intentional structure

• Illocutions (inspired by Schmitt 00, Searle 76)
  – Reportivum: writer describes a state of affairs
  – Identifikativum: writer characterizes own state of mind, health, etc.
  – Estimativum: writer presents proposition as probably true
  – Evaluativum: writer presents a personal opinion
  – Appellativum: writer orders or suggests an action
• Support Relations (subset of RST)
  – Ease-understanding (Background)
  – Encourage-acting (Motivation)
  – Ease-acting (Enablement)
  – Encourage-believing (Evidence)
  – Encourage appreciating (Antithesis, Concession)
• Compare „types of argument“ (e.g., Eggs 00):
  – deontic
  – epistemic
  – ethic/aesthetic


Argument structure

(inspired by Freeman 1993)


Text understanding: Relating levels of analysis


Text understanding: Relations to sentence syntax


Multi-level annotation: syntax tree

Annotate, Synpathy

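[Figure: TIGER-style syntax tree for the NP „Die einstige Fußball-Weltmacht“ (POS tags ART ADJA NN); each token is attached to the NP node by an edge labelled NK]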


Multi-level annotation: coreference

MMAX


Multi-level annotation: text tree

RST Tool


Multi-level annotation: layers

Exmaralda


Multi-level annotation (2): Information structure - SFB632

• B1 (Gur/Kwa languages)
• B2 (Tchadic languages)
• B4 (Diachronic Germanic / Latin translation)
• B6 (Spoken „Kiezdeutsch“)
• C1 (Newspaper German) - see below
• C6 (Hindi)
• D1 (Newspaper German) - see above
• D2 (Questionnaire - 13 different languages)

Multi-level annotation (2): Information structure - SFB632

• B1 (Gur/Kwa languages) - Shoe/Toolbox, Exm.
• B2 (Tchadic languages) - Shoe/Toolbox, Exm.
• B4 (Diachronic Germanic / Latin) - Exmaralda
• B6 (Spoken „Kiezdeutsch“) - Exmaralda
• C1 (Newspaper German) - Synpathy, MMax
• C6 (Hindi) - XML
• D1 (Newspaper German) - Syn, MMax, RST, Exm
• D2 (Questionnaire) - Exmaralda

Multi-level annotation (2): Information structure - SFB632

• B4 (Diachronic Germanic / Latin) - Exmaralda
  – 1800 sentences: info structure, syntax, coherence relations
• C1 (Newspaper German) - Synpathy, MMax
  – Large text collection with only selected sentences being annotated - see below
• D1 (Newspaper German) - Syn, MMax, RST, Exm
  – 200 texts/2500 sentences, in part with coherence relations, coreference, syntax, info structure
• D2 (Questionnaire) - Exmaralda
  – GB of audio data / 50K transcribed tokens, in part with phrase structure, info structure


SFB632: From annotation tool to database

• Database reads PAULA
• Conversion scripts map from tool output to PAULA
  – Add metadata to documents
  – Fix some inconsistent tokenization
• Challenges
  – Enforce common tokenization across layers (and thus across tools)
  – Enforce syntactically correct annotation (Exmaralda)
• Manual work
  – Check for typos and other errors (wrong type of annotation layer, etc.)
  – Repair some inconsistent tokenization
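
The tokenization challenge can be illustrated with a minimal sketch. It assumes that each tool exports its tokens as a plain list of strings over the same primary text; the function names are hypothetical and not part of the SFB632 conversion scripts. Tokens are mapped to character offsets, and spans found in only one layer are reported for manual repair.

```python
from typing import List, Tuple

def token_offsets(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Locate each token in the primary text and return (start, end) character offsets."""
    offsets = []
    pos = 0
    for tok in tokens:
        start = text.find(tok, pos)
        if start < 0:
            raise ValueError(f"token {tok!r} not found after offset {pos}")
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    return offsets

def tokenization_conflicts(text, tokens_a, tokens_b):
    """Return the spans that occur in only one of the two tokenizations."""
    a = set(token_offsets(text, tokens_a))
    b = set(token_offsets(text, tokens_b))
    return sorted(a ^ b)

text = "So ein Glück."
layer_a = ["So", "ein", "Glück", "."]   # tool A splits off the full stop
layer_b = ["So", "ein", "Glück."]       # tool B keeps it attached
conflicts = tokenization_conflicts(text, layer_a, layer_b)
```

Here `conflicts` lists the three spans on which the layers disagree; an empty list would mean that both layers can share one token file.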


ANNIS Database

• ANNIS V1: Data resides in main memory
  – In use since 2005
• ANNIS V2: System with relational DB backend (PostgreSQL)
  – To be launched this summer


ANNIS query language

• Issue queries across annotation layers
  – ...to combine different realms of information
    givenness=giv & syncat=pp & rhetrel=contrast
  – ...to check for conflicting annotations within the same realm
    ann1::givenness=new & ann2::givenness=giv & #1 _=_ #2
  – ...to check for completeness of annotations
    aboutness=ref & !givenness=* & #1 _=_ #2
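
To make the second query type concrete, the following sketch emulates a conflict check in plain Python. It is not how ANNIS evaluates queries; the tuple layout and the function name are assumptions for illustration. Two annotations conflict when they cover exactly the same token span (the `_=_` operator) but carry different values on the same layer. Unlike the query above, which asks for the specific pair new/giv, the sketch reports any disagreement.

```python
from collections import defaultdict

# Each annotation: (annotator, layer, value, first_token, last_token)
annotations = [
    ("ann1", "givenness", "new", 3, 5),
    ("ann2", "givenness", "giv", 3, 5),
    ("ann1", "givenness", "giv", 8, 8),
    ("ann2", "givenness", "giv", 8, 8),
]

def conflicting_spans(annotations, layer):
    """Find spans with identical extent whose annotators disagree on `layer`."""
    by_span = defaultdict(dict)
    for annotator, lay, value, start, end in annotations:
        if lay == layer:
            by_span[(start, end)][annotator] = value
    return {span: vals for span, vals in by_span.items()
            if len(set(vals.values())) > 1}

conflicts = conflicting_spans(annotations, "givenness")
```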

ANNIS V1: Text view and annotation layers

ANNIS V2: Search for multiple constituents in the Vorfeld

ANNIS V2: Hit list

ANNIS V2: Tree view

ANNIS V2: Coreference view


Availability

• ANNIS database V1
• ANNIS database V2: later this year
• PAULA documentation
• Conversion scripts
  – AnnotTools to PAULA: Exmaralda, MMAX2, TigerXML, RSTTool, URML, Palinka, generic inline XML
• Ontology & Tools*
  – extensions for ontology-based corpus querying
  – HTML-export for ontologies

* Developed in the DFG project „Nachhaltigkeit linguistischer Daten“ (SFB 441, SFB 538, SFB 632)


Example pipelines


Corpus annotation pipeline

• information structure and word order in German*
  – What contextual conditions license pre-field occupation of non-subject constituents?
• annotations
  – grammatical annotation
    • syntax, morphology
  – pragmatic annotation
    • anaphora, bridging, information status
• efficient, goal-specific annotation
  – partial annotation
    • selected examples + immediate context
  – semiautomatic annotation

* Chiarcos, C., J. Ritz, M. Stede (2008), Investigating non-canonical constructions in context: efficient corpus annotation and retrieval. To be presented at KONVENS 2008, Berlin, October 2008


Corpus annotation pipeline

sample selection
• collect a number of texts
• mark target sentences

automated pre-processing
• tokenization
• parsing

manual annotation (anaphora, grammar)
• annotate anaphora and verify syntax
• use standard annotation tools for both tasks
• synchronization

integration
• conversion to PAULA

Corpus annotation pipeline: automated pre-processing

sample selection
• collect a number of texts
• mark target sentences
• convert to plain text with markup

tokenization
• use standard tokenizer
• mark sentence boundaries
• preserve markup

parsing
• BitPar (German version of TracePar)*: POS, morph, TIGER-style syntax
• in case of failure, use TreeTagger/Chunker**: POS, NP/PP-chunks

conversion to TIGER XML
• conversion from bracket format

* http://www.ims.uni-stuttgart.de/tcl/SOFTWARE/BitPar.html
** http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger
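
The last step, conversion from bracket format, might look roughly like the sketch below. It assumes well-formed input in which every leaf is written as `(POS word)`, and it emits a simplified nested XML rendering with hypothetical `nt` and `t` elements. Real TIGER XML lists terminals and nonterminals separately and links them by id-based edges, so this is an illustration of the idea rather than a TIGER XML writer.

```python
import re
import xml.etree.ElementTree as ET

def parse_brackets(s):
    """Parse '(CAT child ...)' into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word form
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()

def to_xml(tree):
    """Render phrases as <nt cat=...> and (POS word) leaves as <t pos=... word=...>."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return ET.Element("t", pos=label, word=children[0])
    elem = ET.Element("nt", cat=label)
    for child in children:
        elem.append(to_xml(child))
    return elem

tree = parse_brackets("(NP (ART Die) (ADJA einstige) (NN Fußball-Weltmacht))")
xml_string = ET.tostring(to_xml(tree), encoding="unicode")
```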

Corpus annotation pipeline: manual annotation

pre-processing
• synchronization: identify relevant context sentences
• produce TIGER XML

anaphoric annotation
• MMAX*: converted from TIGER XML, preserve TIGER ids as annotations

grammatical annotation
• Synpathy**: correct selected sentences
• synchronization: verify MMAX references to TIGER XML

integration

* http://mmax2.sourceforge.net
** http://www.mpi.nl/tools/synpathy.html

Corpus annotation pipeline: integration

annotation
• anaphora: MMAX format, TIGER ids as annotation values
• grammar: TIGER XML

loss-less conversion to PAULA
• isomorphic to source format
• yields MMAX@PAULA and TIGER@PAULA

merging
• references to the same token file
• yields a merged PAULA project

integration
• replace TIGER ids from MMAX@PAULA with pointing relations to TIGER@PAULA elements
• yields an integrated PAULA project
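
The integration step can be sketched as follows. The dictionaries are hypothetical in-memory stand-ins for the two converted layers, not the PAULA file format, and all ids are invented for illustration. Each markable's TIGER id, stored as a plain annotation value, is replaced by a pointer to the corresponding syntax node; ids without a target are collected so that they can be repaired during synchronization.

```python
# TIGER@PAULA stand-in: syntax node id -> covered token ids
tiger_layer = {
    "s1_n500": ["tok_1", "tok_2", "tok_3"],
    "s2_n501": ["tok_9"],
}

# MMAX@PAULA stand-in: markables that keep the TIGER id as a plain value
mmax_layer = [
    {"id": "markable_1", "tiger_id": "s1_n500", "anaphor_of": None},
    {"id": "markable_2", "tiger_id": "s2_n501", "anaphor_of": "markable_1"},
    {"id": "markable_3", "tiger_id": "s7_n999", "anaphor_of": None},
]

def integrate(mmax_layer, tiger_layer):
    """Replace TIGER id values by pointers to TIGER nodes; collect dangling ids."""
    integrated, dangling = [], []
    for m in mmax_layer:
        if m["tiger_id"] not in tiger_layer:
            dangling.append(m["id"])   # synchronization problem, needs manual repair
            continue
        entry = {k: v for k, v in m.items() if k != "tiger_id"}
        entry["points_to"] = ("tiger", m["tiger_id"])
        integrated.append(entry)
    return integrated, dangling

integrated, dangling = integrate(mmax_layer, tiger_layer)
```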


Corpus exploitation pipeline

• What is the relation between different levels of description?
  – information status vs. morphosyntax
  – discourse structure vs. anaphora
• Qualitative analysis
  – Query the corpus for corresponding annotations and analyse these examples manually (cf. ANNIS slides).
• Quantitative analysis
  – Assess statistical correlations between different annotations.

Corpus exploitation pipeline: Quantitative analysis

source formats and annotations
• TIGER XML: POS, morph, syntax
• Exmaralda: information structure
• RST Tool: discourse structure
• MMAX: coreference

conversion to PAULA
• integration of multiple annotations of the same set of documents
• yields a corpus of PAULA projects

conversion to ARFF
• extraction of feature vectors
• so far, no generic ARFF exporter has been developed; ANNIS 2.0 will be augmented with a number of example converters

WEKA
• WEKA* workbench for statistical analyses: statistical, neural, symbolic classifiers

* http://sourceforge.net/projects/weka/
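
Since no generic exporter exists, the extraction of feature vectors is task-specific. The sketch below shows one way such an export could be written for nominal features; the attribute names and values are invented for illustration and do not come from the Potsdam corpora. It writes the ARFF header (`@relation`, `@attribute`, `@data`) followed by one comma-separated row per instance.

```python
def to_arff(relation, rows, nominal):
    """Serialize feature dicts as ARFF; `nominal` maps attribute name -> allowed values."""
    lines = [f"@relation {relation}", ""]
    for name, values in nominal.items():
        lines.append(f"@attribute {name} {{{','.join(values)}}}")
    lines += ["", "@data"]
    for row in rows:
        # '?' is ARFF's marker for a missing value
        lines.append(",".join(row.get(name, "?") for name in nominal))
    return "\n".join(lines) + "\n"

# Hypothetical features: part of speech and information status of a referring expression
nominal = {
    "pos": ["NN", "PPER", "NE"],
    "infstat": ["giv", "new"],
}
rows = [
    {"pos": "PPER", "infstat": "giv"},
    {"pos": "NN", "infstat": "new"},
    {"pos": "NE"},                      # information status not annotated
]
arff = to_arff("information_status", rows, nominal)
```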

Corpus exploitation pipeline: Quantitative analysis with WEKA

Preprocessing: selecting relevant features from an ARFF feature list

Example analysis (decision tree): information status and referring expressions in German (Potsdam Commentary Corpus)


NLP pipeline

• Summarization project*
  – high-quality summarization
  – syntax, coreference, text structure, causal markers
  – PAULA as exchange format between different NLP modules
  – output of different modules is to be combined
    • these may also run in parallel
  – specific requirements for the exchange format

* Stede, M., H. Bieler, S. Dipper, and A. Suriyawongkul (2006). SUMMaR: Combining Linguistics and Statistics for Text Summarization. In Proceedings of the 17th European Conference on Artificial Intelligence (ECAI-06)

Summarization, architecture

Preprocessing Modules
• Layout Structure and Metadata Extraction
• Text Structure Extraction
• Tokenization and Sentence Boundary Detection

Flexible Modules
• Syntactical Analysis (Connexor)
• Structure Weight Calculation
• Discourse Marker Annotation
• Term Weight Calculation
• Treetagger
• Topic Segmentation
• Number and Time Annotation
• Coreference Analysis (Rosana)

Final Modules
• Merging
• Summary Calculation
• Graphical Representation

Flexible modules can be arranged in any order in the pipeline or be processed non-sequentially, with PAULA as common interchange format.

Summarization pipeline

Preprocessing Modules
• Layout Structure and Metadata Extraction
• Text Structure Extraction
• Tokenization and Sentence Boundary Detection

Flexible Modules (selection)
• Syntactic Analysis (Connexor)
• Term Weight Calculation
• Coreference Analysis (Rosana)
• Topic Segmentation
• Robust Morphosyntactic Analysis (TreeTagger)

Final Modules
• Merging
• Summary Calculation
• Graphical Representation

Summarization pipeline: A fragment

• PAULA, coming from a preprocessing module
• Transforming relevant PAULA annotations to Connexor input format
• Syntactic Analysis (Connexor)
• Coreference Analysis (Rosana)*
• Transforming Rosana output to PAULA
• PAULA, to be processed by other components in the summarization pipeline

Components in the pipeline are „wrapped“ to become consumers and generators of PAULA.

* Rosana requires Connexor as input format; hence, the mapping to PAULA is skipped at this point.
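
The wrapping idea can be sketched in a few lines. Here a PAULA project is reduced to a dictionary from layer names to annotation lists, and `toy_tagger` is a hypothetical stand-in for a native tool such as Connexor or TreeTagger; none of these names belong to the SUMMaR code. The wrapper converts the project to the tool's input format, runs the tool, and adds the converted output as a new layer while leaving existing layers untouched.

```python
from typing import Callable, Dict, List

# A "PAULA project" is reduced here to a dict: layer name -> list of annotations.
Project = Dict[str, List[dict]]

def wrap(tool: Callable, to_native: Callable, from_native: Callable, layer: str):
    """Turn a native tool into a consumer and generator of PAULA projects."""
    def module(project: Project) -> Project:
        native_in = to_native(project)
        native_out = tool(native_in)
        result = dict(project)            # existing layers are left untouched
        result[layer] = from_native(native_out)
        return result
    return module

# Hypothetical native tool: tags whitespace-separated words as NUM or X.
def toy_tagger(text: str) -> str:
    return " ".join(f"{w}/{'NUM' if w.isdigit() else 'X'}" for w in text.split())

tagger = wrap(
    tool=toy_tagger,
    to_native=lambda p: " ".join(t["word"] for t in p["tok"]),
    from_native=lambda out: [{"tag": item.rsplit("/", 1)[1]} for item in out.split()],
    layer="pos",
)

project = {"tok": [{"word": "eingeführt"}, {"word": "1854"}]}
tagged = tagger(project)
```

Because every wrapped module maps a project to a project, modules built this way can be chained in any order, which is the property the flexible modules rely on.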

Summarization pipeline: A fragment

• Merging multiple annotation layers in one PAULA project
• one single PAULA project comprising annotations from different modules

Requirements for an interchange format for summarization

• advantages
  – scalability
  – modularization
• requirements
  – supporting merge and split operations for annotations of the same document
  – clear conceptual separation of annotations
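
A minimal sketch of the two required operations, assuming that a project is a dictionary holding the primary text and a set of named layers (a simplification, not the PAULA data model). Merging is only allowed for projects over the same primary data, and a layer present in several projects, such as a shared token layer, must be identical in all of them.

```python
def merge(*projects):
    """Merge annotation projects of the same document into one project."""
    base = projects[0]
    merged = {"text": base["text"], "layers": {}}
    for p in projects:
        if p["text"] != base["text"]:
            raise ValueError("projects annotate different primary data")
        for name, layer in p["layers"].items():
            if name in merged["layers"] and merged["layers"][name] != layer:
                raise ValueError(f"conflicting definitions of layer {name!r}")
            merged["layers"][name] = layer
    return merged

def split(project, names):
    """Extract the named layers into a project of their own."""
    return {"text": project["text"],
            "layers": {n: project["layers"][n] for n in names}}

text = "Kein Kind weiß heute noch, was Pocken sind."
syntax = {"text": text, "layers": {"pos": ["PIAT", "NN"]}}
coref = {"text": text, "layers": {"coref": [("m1", "m2")]}}
merged = merge(syntax, coref)
only_coref = split(merged, ["coref"])
```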


PAULA


PAULA format: desiderata I

• PAULA: Potsdamer Austauschformat Linguistischer Annotationen (Potsdam Interchange Format for Linguistic Annotations)

• designed with the following premises
– very general, annotation-specific format supporting
  • multi-layer annotations for information structural (and other) phenomena
  • conflicting hierarchies (RST vs. syntax)
  • pointing references (e.g., anaphora)


PAULA format: desiderata II

• PAULA: Potsdamer Austauschformat Linguistischer Annotationen

• designed with the following premises
– high coverage: lossless representation of information from a multitude of input formats and tools
  – TIGER XML, Exmaralda, MMAX, RSTTool
  – Connexor, Rosana, Brill Tagger


PAULA format: desiderata III

• PAULA: Potsdamer Austauschformat Linguistischer Annotationen

• designed with the following premises
– merging and splitting operations
  • self-contained annotation layers
  • extraction/addition of new annotation layers with minimal effects on other annotation layers
– XML


PAULA format: an “interlingua” for tools

• Radical standoff
– each annotation layer stored in a separate file

• systematic application of xlinks
– for non-tree fragments
  • crossing branches for discontinuous constituents, anaphoric annotation

• Make as few structural commitments as possible, so that a wide variety of data formats can be represented
  • as opposed to earlier, task-specific formats

• design inspired by early drafts for LAF
– conceptually related to GrAF (Ide & Suderman 2007)


PAULA format: basic elements

<mark> (markable): span of text which is subject to annotation, e.g. a token

<struct> (structure): node in a hierarchical (tree or tree-like) structure

<rel> (relation): relation between struct or mark elements

<feat> (feature): annotation attached to a mark, struct, or rel element
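
The four element types can be pictured as a small in-memory object model. The following Python sketch is only an illustration under stated assumptions: the class and field names, the character-offset representation of marks, and the helper covered_text are hypothetical and are not taken from the PAULA specification or the PAULA API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Mark:
    """A markable: a span of primary data (assumed here: character offsets)."""
    id: str
    start: int  # inclusive offset into the primary text
    end: int    # exclusive offset
    type: str = "tok"


@dataclass
class Rel:
    """A relation (edge) that points to a mark or struct by its ID."""
    id: str
    target: str


@dataclass
class Struct:
    """A node in a hierarchical structure; its outgoing edges are rels."""
    id: str
    rels: List[Rel] = field(default_factory=list)


@dataclass
class Feat:
    """A name/value annotation attached to a mark, struct, or rel."""
    target: str
    name: str
    value: str


text = "Die einstige Fußball-Weltmacht"
marks = [Mark("tok_1", 0, 3), Mark("tok_2", 4, 12), Mark("tok_3", 13, 30)]
np_node = Struct("np_1", [Rel("rel_1", "tok_1"),
                          Rel("rel_2", "tok_2"),
                          Rel("rel_3", "tok_3")])
feats = [Feat("np_1", "cat", "NP"),
         Feat("rel_1", "func", "NK"), Feat("rel_2", "func", "NK"),
         Feat("rel_3", "func", "NK"),
         Feat("tok_1", "pos", "ART"), Feat("tok_2", "pos", "ADJA"),
         Feat("tok_3", "pos", "NN")]


def covered_text(struct: Struct, marks_by_id: Dict[str, Mark], primary: str) -> str:
    """Return the primary-data text covered by the marks a struct dominates."""
    parts = []
    for rel in struct.rels:
        mark = marks_by_id[rel.target]
        parts.append(primary[mark.start:mark.end])
    return " ".join(parts)


marks_by_id = {m.id: m for m in marks}
print(covered_text(np_node, marks_by_id, text))
```

Keeping feats outside the structural classes mirrors the separation between structure elements and annotation elements: an annotation can be added or dropped without touching marks, structs, or rels.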


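PAULA format: basic elements of syntax annotation

[Figure: syntax tree as displayed in Annotate, TIGERSearch, or Synpathy. An NP node dominates the tokens "Die" (ART), "einstige" (ADJA), and "Fußball-Weltmacht" (NN); each of the three edges is labelled NK.]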


PAULA format: basic elements of syntax annotation

PAULA representation of structure elements (struct, mark)

[Figure: the NP tree over "Die einstige Fußball-Weltmacht" mapped to PAULA elements. The primary data is the text itself; each token is a <mark> element of type "tok"; the NP node is a <struct> element; each of the three NK edges is a <rel> element.]


PAULA format: basic elements of syntax annotation

PAULA representation of annotation elements (feat)

[Figure: the NP tree over "Die einstige Fußball-Weltmacht" with its PAULA elements (primary data, three <mark> elements of type "tok", one <struct> element, three <rel> elements) and the attached feats: cat=NP on the struct element; func=NK on each rel element; POS=ART, POS=ADJA, POS=NN on the three mark elements.]


PAULA format: physical representation

[Figure: the NP example "Die einstige Fußball-Weltmacht" with its PAULA elements and feats (cat=NP, func=NK, POS=ART/ADJA/NN), distributed over files: text.xml (primary data), tok.xml (mark elements of type "tok"), syntax.xml (struct and rel elements).]

Every type of structure (primary data, mark, struct) is represented in an individual file. struct and rel together encode hierarchical structures.


PAULA format: physical representation

[Figure: the NP example "Die einstige Fußball-Weltmacht" distributed over text.xml (primary data), tok.xml (mark elements), and syntax.xml (struct and rel elements); the <struct> element is highlighted as an inline XML fragment.]

Dominance relations are represented by the XML hierarchy between struct and rel.


PAULA format: physical representation

[Figure: the NP example "Die einstige Fußball-Weltmacht" distributed over text.xml (primary data), tok.xml (mark elements), and syntax.xml (struct and rel elements); the <struct> element is highlighted as an inline XML fragment, and the links from the rel elements are labelled xlink/xpointer.]

Dominance relations are represented by the XML hierarchy between struct and rel, and by xlinks/xpointer between rel and the dominated struct/mark.


PAULA format: physical representation

[Figure: the NP example "Die einstige Fußball-Weltmacht" with its PAULA elements and feats, distributed over text.xml (primary data), tok.xml (mark elements), and syntax.xml (struct and rel elements).]

Every type of structure (primary data, mark, struct) is represented in an individual file. Marks refer to token sequences.


PAULA format: physical representation

[Figure: the NP example "Die einstige Fußball-Weltmacht" with its PAULA elements and feats, distributed over text.xml (primary data), tok.xml (mark elements), and syntax.xml (struct and rel elements).]

Every type of structure (primary data, mark, struct) is represented in an individual file. Marks of type "tok" refer to spans of primary data.


PAULA format: physical representation

[Figure: the NP example "Die einstige Fußball-Weltmacht" with its PAULA elements and feats, distributed over text.xml (primary data), tok.xml (mark elements), and syntax.xml (struct and rel elements); the links from the mark elements to the primary data are labelled xlink/xpointer.]

Every type of structure (primary data, mark, struct) is represented in an individual file. Marks of type "tok" refer to spans of primary data by means of xlink/xpointer.
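
How a token mark might be resolved against the primary data can be sketched in a few lines of Python. The element names (markList, mark), the attribute layout, and the string-range expressions below are assumptions modelled on XPointer's string-range() scheme (1-based start position plus length); they are meant as an illustration, not as a normative rendering of tok.xml.

```python
import re
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Primary data, as it would be stored in text.xml.
TEXT = "Die einstige Fußball-Weltmacht"

# Hypothetical token layer; names and pointer syntax are assumptions.
TOK_XML = """<markList xmlns:xlink="http://www.w3.org/1999/xlink" type="tok">
  <mark id="tok_1" xlink:href="#xpointer(string-range(//body,'',1,3))"/>
  <mark id="tok_2" xlink:href="#xpointer(string-range(//body,'',5,8))"/>
  <mark id="tok_3" xlink:href="#xpointer(string-range(//body,'',14,17))"/>
</markList>"""

RANGE = re.compile(r"string-range\(//body,\s*'',\s*(\d+),\s*(\d+)\)")


def resolve_tokens(tok_xml: str, text: str) -> dict:
    """Map each mark ID to the substring of the primary data it points to."""
    resolved = {}
    for mark in ET.fromstring(tok_xml).iter("mark"):
        href = mark.get(XLINK_HREF, "")
        match = RANGE.search(href)
        if match is None:
            raise ValueError(f"unsupported pointer: {href!r}")
        start, length = int(match.group(1)), int(match.group(2))
        # string-range positions are 1-based; Python slices are 0-based.
        resolved[mark.get("id")] = text[start - 1:start - 1 + length]
    return resolved


tokens = resolve_tokens(TOK_XML, TEXT)
print(tokens)
```

Because the token layer stores only pointers, the primary data stays untouched no matter how many annotation layers refer to it.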


PAULA format: physical representation

[Figure: the NP example "Die einstige Fußball-Weltmacht" distributed over text.xml (primary data), tok.xml (mark elements), and syntax.xml (struct and rel elements), with two additional feat files: cat_func.xml (cat=NP, func=NK) and pos.xml (POS=ART, POS=ADJA, POS=NN).]

For every annotation layer, every type of feat is also represented in a separate file.


PAULA format: physical representation

[Figure: the NP example "Die einstige Fußball-Weltmacht" distributed over text.xml (primary data), tok.xml (mark elements), and syntax.xml (struct and rel elements), with the feat files cat_func.xml and pos.xml pointing to the elements they annotate.]

Feats are attached to mark/struct elements by means of xlink/xpointer expressions.
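
The attachment of feats by reference, and the need to check that every reference actually resolves, can be sketched as follows. This is a simplified illustration: the element names (markList, featList, feat), the value attribute, and the bare "#id" references are assumptions, and the function attach_feats is hypothetical rather than part of any PAULA tool.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical token layer (pointers to the primary data omitted for brevity).
TOK_XML = """<markList type="tok">
  <mark id="tok_1"/>
  <mark id="tok_2"/>
  <mark id="tok_3"/>
</markList>"""

# Hypothetical part-of-speech layer; the last feat deliberately points nowhere.
POS_XML = """<featList xmlns:xlink="http://www.w3.org/1999/xlink" type="pos">
  <feat xlink:href="#tok_1" value="ART"/>
  <feat xlink:href="#tok_2" value="ADJA"/>
  <feat xlink:href="#tok_3" value="NN"/>
  <feat xlink:href="#tok_4" value="NN"/>
</featList>"""


def attach_feats(target_xml: str, feat_xml: str):
    """Return (attached, dangling): feat values by target ID, and unresolved IDs."""
    known_ids = {el.get("id") for el in ET.fromstring(target_xml).iter()
                 if el.get("id") is not None}
    attached, dangling = {}, []
    for feat in ET.fromstring(feat_xml).iter("feat"):
        target = feat.get(XLINK_HREF, "").lstrip("#")
        if target in known_ids:
            attached[target] = feat.get("value")
        else:
            dangling.append(target)
    return attached, dangling


attached, dangling = attach_feats(TOK_XML, POS_XML)
print(attached)
print(dangling)
```

A generic XML parser does not detect the dangling reference to tok_4 by itself; the check has to be implemented on top of it, which is one reason why standoff formats are harder to process with simple scripts.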


PAULA format: achievements

• Generic format
– capable of representing hierarchical structures
  struct elements correspond to nodes
  struct/rel elements correspond to dominance relations
– capable of representing flat, layer-based annotations*
  mark elements correspond to spans of text without hierarchical structure
– capable of representing pointing relations*
  rel elements without a dominating struct element represent non-dominance relations
– capable of representing any annotation assigned to these
  feat elements may point to any struct, mark, or rel element

* not shown here


PAULA format: achievements

• Hierarchies are modelled by means of xlinks
– may represent any kind of dominance relation using the same mechanism, including discontinuous segments and crossing edges

• Represents every annotation layer on its own
– structures from different annotation layers do not interfere with each other
  • e.g. conflicting hierarchies
– addition or removal of another annotation layer does not affect the representation of the remaining layers


PAULA format: achievements

• Addition of annotation layers and merging of annotation projects is easy
– if two annotation projects exist for one piece of primary data:*
  • redirect all references to the token layer to the common token layer
  • register the new annotation layer

• Removal of annotation layers is trivial
– if an annotation layer is to be removed:
  • remove the registration of the annotation layer in the current annotation project

* Merging of two annotation projects requires identical tokenization; more on this in a minute.
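
The merge and removal steps can be sketched in Python under simplifying assumptions: a project is modelled as a plain dictionary with a token layer (token ID to character span) and a registry of annotation layers (layer name to annotations keyed by token ID). The data layout and the function names merge_projects and remove_layer are hypothetical and only illustrate the procedure described on the slide.

```python
def merge_projects(base: dict, other: dict) -> dict:
    """Merge the layers of `other` into `base`; both must share one tokenization."""
    base_by_span = {span: tid for tid, span in base["tokens"].items()}
    if set(base_by_span) != set(other["tokens"].values()):
        raise ValueError("tokenizations differ; align them before merging")
    # Step 1: redirect references to other's token layer to the common one.
    redirect = {tid: base_by_span[span] for tid, span in other["tokens"].items()}
    merged = {"tokens": dict(base["tokens"]), "layers": dict(base["layers"])}
    # Step 2: register each new annotation layer.
    for name, annotations in other["layers"].items():
        if name in merged["layers"]:
            raise ValueError(f"layer {name!r} is already registered")
        merged["layers"][name] = {redirect[tid]: value
                                  for tid, value in annotations.items()}
    return merged


def remove_layer(project: dict, name: str) -> dict:
    """Removal only drops the registration; other layers stay untouched."""
    layers = {k: v for k, v in project["layers"].items() if k != name}
    return {"tokens": project["tokens"], "layers": layers}


project_a = {
    "tokens": {"tok_1": (0, 3), "tok_2": (4, 12), "tok_3": (13, 30)},
    "layers": {"pos": {"tok_1": "ART", "tok_2": "ADJA", "tok_3": "NN"}},
}
project_b = {
    "tokens": {"t1": (0, 3), "t2": (4, 12), "t3": (13, 30)},
    "layers": {"lemma": {"t1": "die", "t2": "einstig", "t3": "Fußball-Weltmacht"}},
}

merged = merge_projects(project_a, project_b)
print(sorted(merged["layers"]))
print(sorted(remove_layer(merged, "pos")["layers"]))
```

The precondition check at the top corresponds to the footnote: if the two token layers do not cover identical spans, the tokenizations have to be aligned first.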


PAULA format: some minor disadvantages

• Overhead
– for a project with n annotation layers with different annotations, at least 2n+2 files are created

• Only partially human-readable
– information distributed across multiple files


PAULA format: more serious problems

• Hard to process using scripting languages
– validity of xlink references must be verified

• Maintenance
– there are a number of (quite elaborate) converters from and to PAULA
– any extension of the original format requires all these converters to be updated

• Merging annotation projects with different tokenization
– correction of tokenization is regularly required, e.g., in the output of tools that are insensitive to tokenization (RSTTool) or re-tokenize (Connexor)
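
One conceivable way to reconcile two tokenizations of the same primary data is to group tokens into minimal chunks that end at a shared character boundary. The sketch below assumes that both tokenizations are given as sorted lists of (start, end) character offsets over the same text; the function align_tokenizations is hypothetical and is not the routine used in any PAULA tool.

```python
def align_tokenizations(tok_a, tok_b):
    """Pair up token index groups from two tokenizations of the same text.

    Each result entry is (indices_a, indices_b): the smallest groups of
    consecutive tokens that end at the same character offset.
    """
    groups = []
    i = j = 0
    while i < len(tok_a) and j < len(tok_b):
        group_a, group_b = [i], [j]
        end_a, end_b = tok_a[i][1], tok_b[j][1]
        # Extend whichever side ends earlier until both ends coincide.
        while end_a != end_b:
            if end_a < end_b:
                i += 1
                if i == len(tok_a):
                    raise ValueError("tokenizations do not cover the same text")
                group_a.append(i)
                end_a = tok_a[i][1]
            else:
                j += 1
                if j == len(tok_b):
                    raise ValueError("tokenizations do not cover the same text")
                group_b.append(j)
                end_b = tok_b[j][1]
        groups.append((group_a, group_b))
        i += 1
        j += 1
    if i != len(tok_a) or j != len(tok_b):
        raise ValueError("tokenizations do not cover the same text")
    return groups


# "Die einstige Fußball-Weltmacht": B splits the compound at the hyphen.
tok_a = [(0, 3), (4, 12), (13, 30)]
tok_b = [(0, 3), (4, 12), (13, 20), (20, 21), (21, 30)]
print(align_tokenizations(tok_a, tok_b))
```

Each resulting group identifies a stretch of text on which both tokenizations agree, so annotations from one project can be re-anchored group by group onto the tokens of the other.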


PAULA: recent developments

• Currently, the PAULA Java API is under development, including
– an implementation of the PAULA Object Model
– a parser for PAULA
  • downward-compatible
– serialization facilities
  • downward-compatible
– routines for standard operations
  • aligning divergent tokenizations


PAULA: forthcoming

• Intended extensions of PAULA concern
– sub-token annotations
  • morphemes, tones, etc.
– parallel corpora
  • multiple streams of primary data
– integration of media files


What we’ve shown

• need for multi-level annotation (MLA) for annotation and processing of pragmatic (and other linguistic) phenomena

• ANNIS, a tool for the querying and visualization of MLA

• example pipelines involving MLA
– typical problems
  • synchronization
  • adding/removal operations
  • expressivity of existing formats


What we’ve shown

• typical problems when processing MLA
– synchronization
– adding/removal operations
– expressivity of existing formats

• premises for the development of PAULA
– generic format specifically designed for linguistic annotations
– ‘radical’ standoff


What we’ve shown

• Problems of radical standoff formats
– excessive use of xlinks
– hard to read
– validation facilities

• ... and a solution to these
– PAULA API


Thank you... and thanks to the team:

• Anke Lüdeling (HUB), Ulf Leser (HUB)

• Heike Bieler (UP), Michael Götze (UP), Julia Ritz (UP), Amir Zeldes (HUB), Uwe Küssner (ext), {Stefanie Dipper, Tillmann Wegst}

• Karsten Hütter (HUB), Christian Lemke (UP), Viktor Rosenfeld (HUB), Florian Zipser (UP)