The PAULA framework: Automatic and Manual Annotation of Linguistic Data Christian Chiarcos and Manfred Stede Universität Potsdam {chiarcos|stede}@ling.uni-potsdam.de


Workshop „Processing Pipelines“, Darmstadt, 2008/07/10

Overview

• Motivation
  • Multi-level annotation for discourse structure research
  • Multi-level annotation for information structure research
• The ANNIS linguistic information system: multi-level querying and visualization
• Example pipelines
  • Corpus annotation and exploitation
  • PAULA for text summarization
• The PAULA format
• Current state, future plans

Multi-level annotation (1): Discourse structure(s)

• Thesis:

Coherence of a text is not adequately characterized by „the“ discourse structure (a single tree or graph) but by the interplay of different levels of description, each reflecting a separate dimension of textuality.

(In Textlinguistik, this idea is not new (e.g., Motsch 96) but the programme has not been carried through yet.)

Impfpflicht gegen Kinderkrankheiten?

[1] Kein Kind weiß heute noch, was Pocken sind. [2] So ein Glück. [3] Als die Pockenimpfung 1854 eingeführt wurde, [4] glaubten manche Menschen, [5] dass sich ihr Kopf in einen Kuhkopf verwandelt, [6] wenn sie sich impfen lassen. [7] Denn der Impfstoff wurde damals aus der Haut von Rindern hergestellt. [8] Heute ist diese furchtbare Krankheit ausgerottet. [9] Dank einer entschlossenen, weltweiten Impfkampagne. [10] Aber es gibt noch: Masern, Kinderlähmung, Diphtherie, Mumps, Röteln, Hepatitis B, Tuberkulose, Keuchhusten. [11] Daran sterben, vor allem in den Entwicklungsländern, jährlich immer noch Millionen Kinder. [12] In Deutschland werden diese Krankheiten von vielen Eltern offenbar nicht ernst genommen. [13] Weil sie sie gar nicht mehr kennen! [14] Denn mit Impfstoffen wurde erreicht, [15] dass diese Infektionen nur noch sporadisch auftreten. [16] Doch wer aus eigenem Erleben weiß, [17] wie schrecklich Kinder leiden, [18] wenn sie ‚nur‘ Masern oder Keuchhusten haben, [19] sollte ihnen dies ersparen. [20] Und auch die gesundheitlichen Folgewirkungen. [21] Nur wer impfen lässt, hilft mit, dass Impfungen eines Tages überflüssig werden. [22] Stattdessen wird über Nebenwirkungen von Impfstoffen schwadroniert, [23] die höchst selten auftreten und die man erst Recht nur aus Büchern kennt. [24] Dann gibt es noch das schöne Argument: Das ist mein Kind, das darf der Staat nicht pieken. [25] Gegen solche Eltern hilft auch keine Impfung.

Mandatory vaccination against children's diseases?

[1] Today, children don't know anymore what pox are. [2] What a joy. [3] When pox vaccination was introduced in 1854, [4] quite a few people believed [5] that their head would turn into a cow's head [6] if they got themselves vaccinated. [7] For the vaccine was made from cattle's skin at the time. [8] Nowadays this dreadful disease is exterminated. [9] Thanks to a determined, world-wide vaccination campaign. [10] But there still are other diseases: Measles, polio, diphtheria, mumps, rubella, hepatitis B, tuberculosis, pertussis. [11] Millions of children still die of these every year, especially in less developed countries. [12] In Germany, many parents apparently don't take these diseases seriously. [13] Because they don't know them anymore! [14] For it has been achieved with vaccines [15] that these infections hit only rarely today. [16] But those who have experienced [17] how terribly children suffer [18] when they come down with 'just' measles or pertussis, [19] should spare them the agony. [20] As well as the long-term consequences. [21] Only those who have their children vaccinated will contribute to vaccines' becoming superfluous some day. [22] Instead, people rant about side effects [23] that occur very rarely and are known merely from books. [24] Then there is the great argument: This is my child, the government must not prick her. [25] No vaccine can help against such parents.



Referential Structure



Thematic Structure



Conjunctive Relations

• temporal
– simultaneous, succession
• consequential
– manner, consequence, condition, purpose, concession
• comparative
– similarity, contrast, reformulation
• additive
– addition, alternation
• Relations can be directed but not weighted: there is no nuclearity

(Martin 1992)


Intentional structure

• Illocutions (inspired by Schmitt 00, Searle 76)

– Reportivum: writer describes a state of affairs

– Identifikativum: writer characterizes own state of mind, health, etc.

– Estimativum: writer presents proposition as probably true

– Evaluativum: writer presents a personal opinion

– Appellativum: writer orders or suggests an action

• Support Relations (subset of RST)
– Ease-understanding (Background)
– Encourage-acting (Motivation)
– Ease-acting (Enablement)
– Encourage-believing (Evidence)
– Encourage-appreciating (Antithesis, Concession)

• Compare „types of argument“ (e.g., Eggs 00):
– deontic
– epistemic
– ethic/aesthetic



Argument structure

(inspired by Freeman 1993)

Text understanding: Relating levels of analysis

Text understanding: Relations to sentence syntax


Multi-level annotation: syntax tree

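Tools: Annotate, Synpathy

[Figure: syntax tree for „Die einstige Fußball-Weltmacht“ (POS tags ART, ADJA, NN): a single NP node dominating the three tokens, each edge labelled NK]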

Multi-level annotation: coreference

MMAX

Multi-level annotation: text tree

RST Tool

Multi-level annotation: layers

Exmaralda

Multi-level annotation (2): Information structure - SFB632

• B1 (Gur/Kwa languages) - Shoe/Toolbox, Exmaralda
• B2 (Tchadic languages) - Shoe/Toolbox, Exmaralda
• B4 (Diachronic Germanic / Latin translation) - Exmaralda
• B6 (Spoken „Kiezdeutsch“) - Exmaralda
• C1 (Newspaper German) - Synpathy, MMAX
• C6 (Hindi) - XML
• D1 (Newspaper German) - Synpathy, MMAX, RST Tool, Exmaralda
• D2 (Questionnaire - 13 different languages) - Exmaralda


• B4 (Diachronic Germanic / Latin) - Exmaralda
– 1800 sentences: info structure, syntax, coherence relations
• C1 (Newspaper German) - Synpathy, MMAX
– Large text collection with only selected sentences being annotated - see below
• D1 (Newspaper German) - Synpathy, MMAX, RST Tool, Exmaralda
– 200 texts/2500 sentences, in part with coherence relations, coreference, syntax, info structure
• D2 (Questionnaire) - Exmaralda
– GB of audio data / 50K transcribed tokens, in part with phrase structure, info structure

SFB632: From annotation tool to database

• Database reads PAULA
• Conversion scripts map from tool output to PAULA
– Add metadata to documents
– Fix some inconsistent tokenization
• Challenges
– Enforce common tokenization across layers (and thus across tools)
– Enforce syntactically correct annotation (Exmaralda)
• Manual work
– Check for typos and other errors (wrong type of annotation layer, etc.)
– Repair some inconsistent tokenization

ANNIS Database

• ANNIS V1: Data resides in main memory
– In use since 2005
• ANNIS V2: System with relational DB backend (PostgreSQL)
– To be launched this summer

ANNIS query language

• Issue queries across annotation layers
– ...to combine different realms of information
  givenness=giv & syncat=pp & rhetrel=contrast
– ...to check for conflicting annotations within the same realm
  ann1::givenness=new & ann2::givenness=giv & #1 _=_ #2
– ...to check for completeness of annotations
  aboutness=ref & !givenness=* & #1 _=_ #2
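The conflict and completeness checks can be pictured with a small Python sketch. It assumes that `_=_` holds between two annotated spans covering exactly the same tokens; the `Span` class, the helper names, and the sample data are hypothetical, and this is not how ANNIS itself evaluates queries. The conflict check here is also more general than the query on the slide: it reports any pair of differing values, not only new vs. giv.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    layer: str   # annotation name, optionally with namespace, e.g. "ann1::givenness"
    value: str   # annotation value, e.g. "new"
    start: int   # index of the first covered token
    end: int     # index of the last covered token (inclusive)


def same_extent(a, b):
    """Assumed reading of `_=_`: both spans cover the same tokens."""
    return (a.start, a.end) == (b.start, b.end)


def conflicts(spans, layer1, layer2):
    """Pairs of co-extensive spans whose values differ across two layers."""
    return [(a, b)
            for a in spans if a.layer == layer1
            for b in spans if b.layer == layer2
            and same_extent(a, b) and a.value != b.value]


def missing(spans, present_layer, required_layer):
    """Spans on present_layer without a co-extensive span on required_layer."""
    required = [s for s in spans if s.layer == required_layer]
    return [a for a in spans if a.layer == present_layer
            and not any(same_extent(a, b) for b in required)]


# Hypothetical sample annotations from two annotators
spans = [
    Span("ann1::givenness", "new", 0, 2),
    Span("ann2::givenness", "giv", 0, 2),
    Span("ann1::givenness", "giv", 5, 6),
    Span("ann2::givenness", "giv", 5, 6),
    Span("aboutness", "ref", 8, 9),
]
disagreements = conflicts(spans, "ann1::givenness", "ann2::givenness")
incomplete = missing(spans, "aboutness", "ann1::givenness")
```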

ANNIS V1: Text view and annotation layers

ANNIS V2: Search for multiple constituents in the Vorfeld (pre-field)

ANNIS V2: Hit list

ANNIS V2: Tree view

ANNIS V2: Coreference view

Availability

• ANNIS database V1
• ANNIS database V2: later this year
• PAULA documentation
• Conversion scripts
– annotation tools to PAULA: Exmaralda, MMAX2, TigerXML, RSTTool, URML, Palinka, generic inline XML
• Ontology & Tools*
– extensions for ontology-based corpus querying
– HTML export for ontologies

* Developed in the DFG project „Nachhaltigkeit linguistischer Daten“ (Sustainability of Linguistic Data; SFB 441, SFB 538, SFB 632)

Example pipelines







Corpus annotation pipeline

• information structure and word order in German*
– What contextual conditions license pre-field occupation of non-subject constituents?
• annotations
– grammatical annotation
  • syntax, morphology
– pragmatic annotation
  • anaphora, bridging, information status
• efficient, goal-specific annotation
– partial annotation
  • selected examples + immediate context
– semiautomatic annotation

* Chiarcos, C., J. Ritz, M. Stede (2008). Investigating non-canonical constructions in context: efficient corpus annotation and retrieval. To be presented at KONVENS 2008, Berlin, October 2008.


sample selection
• collect a number of texts
• mark target sentences

automated pre-processing
• tokenization
• parsing

manual annotation (anaphora, grammar)
• annotate anaphora and verify syntax
• use standard annotation tools for both tasks
• synchronization

integration
• conversion to PAULA


automated pre-processing

sample selection
• collect a number of texts
• mark target sentences
• convert to plain text with markup

tokenization
• use standard tokenizer
• mark sentence boundaries
• preserve markup

parsing
• BitPar (German version of TracePar)*: POS, morph, TIGER-style syntax
• in case of failure, use TreeTagger/Chunker**: POS, NP/PP-chunks

conversion to TIGER XML
• conversion from bracket format

* http://www.ims.uni-stuttgart.de/tcl/SOFTWARE/BitPar.html
** http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger
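The fallback from full parsing to chunking could be organised roughly as follows. This is only a sketch of the control flow: `full_parse` and `chunk` are hypothetical stand-ins for calls to the parser and to the tagger/chunker, neither real tool is invoked, and the toy functions at the end exist only to make the example runnable.

```python
from typing import Callable, List, Optional


def preprocess(sentences: List[List[str]],
               full_parse: Callable[[List[str]], Optional[str]],
               chunk: Callable[[List[str]], str]) -> List[dict]:
    """Parse each tokenized sentence; fall back to chunking if parsing fails."""
    results = []
    for tokens in sentences:
        analysis = full_parse(tokens)        # returns None on failure (assumption)
        source = "parser"
        if analysis is None:
            analysis = chunk(tokens)         # shallower but more robust analysis
            source = "chunker"
        results.append({"tokens": tokens, "analysis": analysis, "source": source})
    return results


def toy_parse(tokens):
    # Hypothetical parser that gives up on sentences longer than three tokens
    return None if len(tokens) > 3 else "(S " + " ".join(tokens) + ")"


def toy_chunk(tokens):
    # Hypothetical chunker that always succeeds
    return "[" + " ".join(tokens) + "]"


out = preprocess([["So", "ein", "Glück"], ["Kein", "Kind", "weiß", "heute"]],
                 toy_parse, toy_chunk)
```

Recording which component produced each analysis makes it easy to find the sentences that need more attention during manual correction.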

manual annotation

pre-processing
• synchronization: identify relevant context sentences
• produce TIGER XML

anaphoric annotation
• MMAX*: converted from TIGER XML, preserve TIGER ids as annotations

grammatical annotation
• Synpathy**: correct selected sentences
• synchronization: verify MMAX references to TIGER XML

* http://mmax2.sourceforge.net
** http://www.mpi.nl/tools/synpathy.html

integration

annotation
• anaphora: MMAX format, TIGER ids as annotation values
• grammar: TIGER XML

• loss-less conversion to PAULA, isomorphic to source format
– MMAX@PAULA
– TIGER@PAULA

merged PAULA project
• merging: references to the same token file

integrated PAULA project
• integration: replace TIGER ids from MMAX@PAULA with pointing relations to TIGER@PAULA elements
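A minimal sketch of the integration step, under the assumption that each anaphoric markable carries the id of a syntax node as a plain annotation value. The dictionary layout, the key `tiger_id`, and the relation type `tiger_ref` are hypothetical and only illustrate the idea of turning id-valued annotations into pointing relations.

```python
def integrate(mmax_markables, tiger_node_ids):
    """Replace 'tiger_id' annotation values by explicit pointing relations.

    Returns the markables without the id annotation, the new relations,
    and the ids of markables whose reference could not be resolved.
    """
    relations, unresolved = [], []
    for markable in mmax_markables:
        tiger_id = markable.get("tiger_id")
        if tiger_id in tiger_node_ids:
            relations.append({"source": markable["id"],
                              "target": tiger_id,
                              "type": "tiger_ref"})
        else:
            unresolved.append(markable["id"])    # needs manual synchronization
    cleaned = [{k: v for k, v in m.items() if k != "tiger_id"}
               for m in mmax_markables]
    return cleaned, relations, unresolved


# Hypothetical sample data
mmax = [{"id": "markable_1", "tiger_id": "s1_500", "status": "given"},
        {"id": "markable_2", "tiger_id": "s9_999", "status": "new"}]
tiger = {"s1_500", "s1_501"}
cleaned, rels, unresolved = integrate(mmax, tiger)
```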

Corpus exploitation pipeline

• What is the relation between different levels of description?
– information status vs. morphosyntax
– discourse structure vs. anaphora
• Qualitative analysis
– Query the corpus for corresponding annotations and analyse these examples manually (cf. ANNIS slides).
• Quantitative analysis
– Assess statistical correlations between different annotations.

Corpus exploitation pipeline: Quantitative analysis

corpus of PAULA projects
• conversion to PAULA from TIGER XML, Exmaralda, RST Tool, MMAX
– POS, morph, syntax
– information structure
– discourse structure
– coreference
• integration of multiple annotations of the same set of documents

conversion to ARFF
• extraction of feature vectors
– so far, no generic ARFF exporter has been developed; ANNIS 2.0 will be augmented with a number of example converters

WEKA
• WEKA* workbench for statistical analyses
– statistical, neural, symbolic classifiers

* http://sourceforge.net/projects/weka/
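To make the ARFF step concrete, here is a minimal sketch of a task-specific exporter that writes already extracted feature vectors in ARFF syntax (`@relation`, `@attribute`, `@data`, with `?` for missing values). It is not one of the converters mentioned above; the function name, the feature names, and the sample rows are invented for illustration.

```python
def to_arff(relation, attributes, rows):
    """Serialize feature vectors as ARFF text.

    attributes: list of (name, kind), where kind is either the string
    "numeric" or a list of nominal values.
    rows: sequences of values in attribute order; None marks a missing value.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append("@attribute %s numeric" % name)
        else:
            lines.append("@attribute %s {%s}" % (name, ",".join(kind)))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical feature vectors: one row per referring expression
attributes = [("pos", ["NN", "PPER"]),
              ("givenness", ["giv", "new"]),
              ("length", "numeric")]
rows = [("PPER", "giv", 1), ("NN", "new", None)]
arff = to_arff("referring_expressions", attributes, rows)
```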

Corpus exploitation pipeline: Quantitative analysis with WEKA

• Preprocessing: selecting relevant features from an ARFF feature list
• Example analysis (decision tree): information status and referring expressions in German (Potsdam Commentary Corpus)

NLP pipeline

• Summarization project*
– high-quality summarization
– syntax, coreference, text structure, causal markers
– PAULA as exchange format between different NLP modules
– output of different modules is to be combined
  • these may also run in parallel
– specific requirements for the exchange format

* Stede, M., H. Bieler, S. Dipper, and A. Suriyawongkul (2006). SUMMaR: Combining Linguistics and Statistics for Text Summarization. In Proceedings of the 17th European Conference on Artificial Intelligence (ECAI-06).

Summarization, architecture

Preprocessing Modules
• Layout Structure and Metadata Extraction
• Text Structure Extraction
• Tokenization and Sentence Boundary Detection

Flexible Modules
• Syntactical Analysis (Connexor)
• Structure Weight Calculation
• Discourse Marker Annotation
• Term Weight Calculation
• Treetagger
• Topic Segmentation
• Number and Time Annotation
• Coreference Analysis (Rosana)

Final Modules
• Merging
• Summary Calculation
• Graphical Representation

Flexible modules can be arranged in any order in the pipeline or be processed non-sequentially, with PAULA as the common interchange format.


Summarization pipeline

[Figure: pipeline with a selection of flexible modules (Syntactic Analysis (Connexor), Topic Segmentation, Robust Morphosyntactic Analysis (TreeTagger)) between the preprocessing modules and the final modules Merging and Summary Calculation]

Summarization pipeline: A fragment

[Figure: pipeline fragment between the preprocessing modules and the final modules, with PAULA as input and output]

• input: PAULA, coming from a preprocessing module
• Transforming relevant PAULA annotations to Connexor input format
• Rosana requires Connexor as input format; hence, the mapping to PAULA is skipped at this point
• Transforming Rosana output to PAULA
• output: PAULA, to be processed by other components in the summarization pipeline
• components in the pipeline are „wrapped“ to become consumers and generators of PAULA
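The wrapping idea can be sketched as a higher-order function: a native tool is surrounded by an input converter and an output converter so that, seen from the pipeline, it consumes and produces the interchange format only. In this sketch a plain dictionary from layer names to annotations stands in for a PAULA project, and the toy tool and converters are hypothetical.

```python
def wrap(tool, needs, to_tool_input, provides, from_tool_output):
    """Turn a native tool into a component mapping project -> project."""
    def component(project):
        selected = {name: project[name] for name in needs}   # relevant layers only
        native_output = tool(to_tool_input(selected))        # run the unmodified tool
        result = dict(project)                               # other layers stay untouched
        result[provides] = from_tool_output(native_output)   # register the new layer
        return result
    return component


# Hypothetical tool: expects a whitespace-separated string, returns token lengths
def toy_tool(text):
    return [len(word) for word in text.split()]


length_component = wrap(
    tool=toy_tool,
    needs=["tok"],
    to_tool_input=lambda layers: " ".join(layers["tok"]),
    provides="length",
    from_tool_output=lambda lengths: {"length": lengths},
)
project_in = {"tok": ["Die", "einstige"]}
project_out = length_component(project_in)
```

Because every wrapped component has the same signature, components can be chained in any order, and the outputs of components run in parallel can be combined layer by layer.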

Summarization pipeline: A fragment

• Merging multiple annotation layers in one PAULA project
• one single PAULA project comprising annotations from different modules

Requirements for an interchange format for summarization

• advantages
– scalability
– modularization
• requirements
– supporting merge and split operations for annotations of the same document
– clear conceptual separation of annotations

PAULA







PAULA format: desiderata I

• PAULA: Potsdamer Austauschformat Linguistischer Annotationen (Potsdam Interchange Format for Linguistic Annotations)
• designed with the following premises
– very general, annotation-specific format supporting
  • multi-layer annotations for information structural (and other) phenomena
  • conflicting hierarchies (RST vs. syntax)
  • pointing references (e.g., anaphora)

PAULA format: desiderata II

• designed with the following premises
– high coverage: loss-less representation of information from a multitude of input formats and tools
  • TIGER XML, Exmaralda, MMAX, RSTTool
  • Connexor, Rosana, Brill Tagger

PAULA format: desiderata III

• designed with the following premises
– merging and splitting operations
  • self-contained annotation layers
  • extraction/addition of new annotation layers with minimal effects on other annotation layers
– XML

PAULA format: An “interlingua” for tools

• Radical standoff
– each annotation layer stored in a separate file
• systematic application of xlinks
– for non-tree fragments
  • crossing branches for discontinuous constituents, anaphoric annotation
• Make as few structural commitments as possible
– a wide variety of data formats can be represented
  • as opposed to earlier, task-specific formats
• design inspired by early drafts for LAF
– conceptually related to GrAF (Ide & Suderman 2007)

PAULA format: Basic elements

<mark> (markable): span of text which is subject to annotation, e.g. a token

<struct> (structure): node in a hierarchical (tree or tree-like) structure

<rel> (relation): relation between struct or mark elements

<feat> (feature): annotation attached to a mark, struct, or rel element

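As an informal illustration, the four element types can be mirrored by a small in-memory data model. This is a simplification for exposition, not the PAULA object model: the class fields, the id scheme, and the use of character offsets for token marks are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Mark:
    id: str
    start: int   # character offset into the primary data (assumption)
    end: int     # end offset, exclusive


@dataclass
class Struct:
    id: str      # a node in a hierarchical structure


@dataclass
class Rel:
    id: str
    source: str  # id of the dominating struct
    target: str  # id of the dominated struct or mark


@dataclass
class Feat:
    target: str  # id of the annotated mark, struct, or rel
    name: str
    value: str


text = "Die einstige Fußball-Weltmacht"
marks = [Mark("tok_1", 0, 3), Mark("tok_2", 4, 12), Mark("tok_3", 13, 30)]
np = Struct("np_1")
rels = [Rel("rel_%d" % i, np.id, m.id) for i, m in enumerate(marks, 1)]
feats = ([Feat(np.id, "cat", "NP")]
         + [Feat(r.id, "func", "NK") for r in rels]
         + [Feat(m.id, "pos", tag) for m, tag in zip(marks, ["ART", "ADJA", "NN"])])
tokens = [text[m.start:m.end] for m in marks]
```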
PAULA format: Basic elements of syntax annotation

Tools: Annotate, TIGERSearch, Synpathy

[Figure: syntax tree with an NP node dominating three tokens tagged ART, ADJA, NN; each edge labelled NK]


PAULA representation of structure elements (struct, mark)

[Figure: the NP tree decomposed into PAULA elements]
• primary data: Die einstige Fußball-Weltmacht
• mark elements (type „tok“): one <mark> per token
• struct elements: one <struct> for the NP node
• rel elements: three <rel> elements, one per edge from the <struct> to a token <mark>


PAULA representation of annotation elements (feat)

[Figure: feat elements attached to the elements of the NP tree]
• struct element: cat=NP
• rel elements: func=NK, func=NK, func=NK
• token marks: POS=ART, POS=ADJA, POS=NN

PAULA format: Physical representation

[Figure, shown in several build steps: the NP example (cat=NP; edges NK; tokens tagged ART, ADJA, NN) distributed over the files text.xml, tok.xml, syntax.xml, cat_func.xml, and pos.xml]

• Every type of structure (primary data, mark, struct) is represented in an individual file
– text.xml: primary data
– tok.xml: mark elements
– syntax.xml: struct and rel elements
• struct and rel together encode hierarchical structures
• Dominance relations are represented by the XML hierarchy between struct and rel (<struct> is an inline XML fragment) and by xlinks/xpointer between rel and the dominated struct/mark
• marks refer to token sequences
• marks of type ‚tok‘ refer to spans of primary data (xlink/xpointer)
• For every annotation layer, every type of feat is also represented in a separate file
– cat_func.xml
– pos.xml
• Feats are attached to mark/struct elements by means of xlink/xpointer expressions
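The distribution over files can be imitated with Python's standard `xml.etree.ElementTree`. The sketch below reuses the element names mark, struct, rel, and feat, but the list elements, attribute names, ids, and the `#char(start,end)` pointer syntax are placeholders chosen for readability; they are not taken from the PAULA specification.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)
HREF = "{%s}href" % XLINK


def mark_file(base, spans):
    """Token layer: one mark per (start, end) span of the primary data."""
    root = ET.Element("markList", {"type": "tok", "base": base})
    for i, (start, end) in enumerate(spans, 1):
        ET.SubElement(root, "mark",
                      {"id": "tok_%d" % i, HREF: "#char(%d,%d)" % (start, end)})
    return root


def struct_file(base, node_id, child_ids):
    """Hierarchy: rel elements nested in a struct, each pointing to a child."""
    root = ET.Element("structList", {"base": base})
    struct = ET.SubElement(root, "struct", {"id": node_id})
    for i, child in enumerate(child_ids, 1):
        ET.SubElement(struct, "rel", {"id": "rel_%d" % i, HREF: "#" + child})
    return root


def feat_file(base, name, values):
    """Annotation layer: one feat per annotated element in the base file."""
    root = ET.Element("featList", {"type": name, "base": base})
    for target, value in values.items():
        ET.SubElement(root, "feat", {HREF: "#" + target, "value": value})
    return root


files = {
    "tok.xml": mark_file("text.xml", [(0, 3), (4, 12), (13, 30)]),
    "syntax.xml": struct_file("tok.xml", "np_1", ["tok_1", "tok_2", "tok_3"]),
    "pos.xml": feat_file("tok.xml", "pos",
                         {"tok_1": "ART", "tok_2": "ADJA", "tok_3": "NN"}),
}
serialized = {name: ET.tostring(root, encoding="unicode")
              for name, root in files.items()}
```

Each file only points "downwards" to its base file, so a feat file can be added or dropped without touching the token or syntax files.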

PAULA format: Achievements

• Generic format
– capable of representing hierarchical structures
  • struct elements correspond to nodes
  • struct/rel elements correspond to dominance relations
– capable of representing flat, layer-based annotations*
  • mark elements correspond to spans of text without hierarchical structure
– capable of representing pointing relations*
  • rel elements without a dominating struct element represent non-dominance relations
– capable of representing any annotation assigned to these
  • feat elements may point to any struct, mark, or rel element

* not shown here

• Hierarchies are modelled by means of xlinks
– may represent any kind of dominance relation using the same mechanism, including discontinuous segments and crossing edges

• Represents every annotation layer on its own
– structures from different annotation layers do not interfere with each other
  • e.g. conflicting hierarchies
– addition or removal of another annotation layer does not affect the representation of the remaining layers

• Addition of annotation layers and merging of annotation projects is easy
– if two annotation projects exist for one piece of primary data:**
  • redirect all references to the token layer to the common token layer
  • register the new annotation layer

• Removal of annotation layers is trivial
– if an annotation layer is to be removed
  • remove the registration of the annotation layer in the current annotation project

** Merging of two annotation projects requires identical tokenization, more in a minute.
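The two operations can be sketched on a toy project representation. Here a project is a dictionary with a token file name and a registry of layers, each layer recording the file it refers to; this layout and the file names are hypothetical, and identical tokenization of both projects is assumed.

```python
def merge_projects(a, b, common_tok):
    """Merge two projects over the same primary data into one project."""
    merged = {"tok": common_tok, "layers": {}}
    for project in (a, b):
        for name, layer in project["layers"].items():
            if name in merged["layers"]:
                raise ValueError("layer registered twice: " + name)
            layer = dict(layer)
            if layer["base"] == project["tok"]:
                layer["base"] = common_tok      # redirect to the common token layer
            merged["layers"][name] = layer      # register the layer
    return merged


def remove_layer(project, name):
    """Removing a layer only removes its registration."""
    layers = {k: v for k, v in project["layers"].items() if k != name}
    return {"tok": project["tok"], "layers": layers}


# Hypothetical projects produced by two converters
mmax_project = {"tok": "mmax.tok.xml",
                "layers": {"coref": {"base": "mmax.tok.xml"}}}
tiger_project = {"tok": "tiger.tok.xml",
                 "layers": {"syntax": {"base": "tiger.tok.xml"},
                            "cat": {"base": "syntax"}}}
merged = merge_projects(mmax_project, tiger_project, "tok.xml")
without_coref = remove_layer(merged, "coref")
```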

PAULA format: Some minor disadvantages

• Overhead
– for a project with n annotation layers with different annotations, at least 2n+2 files are created
• Only partially human readable
– information distributed across multiple files

PAULA format: More serious problems

• Hard to process using script languages
– validity of xlink references must be verified
• Maintenance
– there are a number of (quite elaborate) converters from and to PAULA
– any extension of the original format requires all these converters to be updated
• Merging annotation projects with different tokenization
– Correction of tokenization is regularly required, e.g., in the output of tools that are insensitive to tokenization (RSTTool) or re-tokenize (Connexor)
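What verifying references involves can be shown with a small sketch. It assumes that the ids defined in each file and the (file, id) pairs referenced from each file have already been extracted; the dictionary layout and file names are hypothetical.

```python
def dangling_references(files):
    """Report references whose target file or target id does not exist.

    files maps a file name to {"ids": set of ids defined in that file,
    "refs": list of (target_file, target_id) pairs used in that file}.
    """
    problems = []
    for name, content in files.items():
        for target_file, target_id in content["refs"]:
            if target_file not in files:
                problems.append((name, target_file, target_id, "missing file"))
            elif target_id not in files[target_file]["ids"]:
                problems.append((name, target_file, target_id, "missing id"))
    return problems


# Hypothetical project with two broken references
project_files = {
    "tok.xml": {"ids": {"tok_1", "tok_2"}, "refs": []},
    "pos.xml": {"ids": set(),
                "refs": [("tok.xml", "tok_1"),
                         ("tok.xml", "tok_3"),
                         ("syntax.xml", "np_1")]},
}
problems = dangling_references(project_files)
```

A check of this kind has to read every file of a project before it can report anything, which is part of what makes standoff data awkward for ad hoc scripts.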

PAULA: Recent developments

• Currently, the PAULA Java API is under development, including
– an implementation of the PAULA Object Model
– a parser for PAULA
  • downward-compatible
– serialization facilities
  • downward-compatible
– routines for standard operations
  • aligning divergent tokenizations
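One simple way to align divergent tokenizations of the same text is to compute the finest segmentation that respects the token boundaries of both. The following sketch illustrates this idea on character offsets; it is not the routine of the PAULA API, and the function names and example are invented.

```python
def offsets(text, tokens):
    """Locate each token in the text, left to right, as (start, end) offsets."""
    spans, position = [], 0
    for token in tokens:
        start = text.index(token, position)
        spans.append((start, start + len(token)))
        position = start + len(token)
    return spans


def common_tokenization(spans_a, spans_b):
    """Finest segmentation respecting the boundaries of both tokenizations."""
    all_spans = spans_a + spans_b
    cuts = sorted({point for span in all_spans for point in span})
    segments = []
    for start, end in zip(cuts, cuts[1:]):
        # keep a segment only if some original token covers it (skips whitespace)
        if any(s <= start and end <= e for s, e in all_spans):
            segments.append((start, end))
    return segments


text = "Fußball-Weltmacht gewinnt"
coarse = offsets(text, ["Fußball-Weltmacht", "gewinnt"])
fine = offsets(text, ["Fußball", "-", "Weltmacht", "gewinnt"])
common = common_tokenization(coarse, fine)
common_tokens = [text[start:end] for start, end in common]
```

Annotations from either project can then be re-expressed as sequences of the common tokens, since every original token is a concatenation of common segments.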

PAULA: Forthcoming

• Intended extensions of PAULA concern
– sub-token annotations
  • morphemes, tones, etc.
– parallel corpora
  • multiple streams of primary data
– integration of media files

What we've shown

• need for MLA (multi-level annotation) for annotation and processing of pragmatic (and other linguistic) phenomena
• ANNIS, a tool for the querying and visualization of MLA
• example pipelines involving MLA
• typical problems when processing MLA
– synchronization
– adding/removal operations
– expressivity of existing formats
• premises for the development of PAULA
– generic format specifically designed for linguistic annotations
– 'radical' standoff
• Problems of radical standoff formats
– excessive use of xlinks
– hard to read
– validation facilities
• ... and a solution to these
– PAULA API

Thank you... and thanks to the team:

• Anke Lüdeling (HUB), Ulf Leser (HUB)

• Heike Bieler (UP), Michael Götze (UP), Julia Ritz (UP), Amir Zeldes (HUB), Uwe Küssner (ext), {Stefanie Dipper, Tillmann Wegst}

• Karsten Hütter (HUB), Christian Lemke (UP), Viktor Rosenfeld (HUB), Florian Zipser (UP)
