The PAULA framework: Automatic and Manual Annotation of Linguistic Data
Christian Chiarcos and Manfred Stede, Universität Potsdam
{chiarcos|stede}@ling.uni-potsdam.de
Workshop "Processing Pipelines", Darmstadt, 2008/07/10
Overview
• Motivation
  – Multi-level annotation for discourse structure research
  – Multi-level annotation for information structure research
• The ANNIS linguistic information system: multi-level querying and visualization
• Example pipelines
  – Corpus annotation and exploitation
  – PAULA for text summarization
• The PAULA format
• Current state, future plans
Multi-level annotation (1): Discourse structure(s)
• Thesis:
The coherence of a text is not adequately characterized by "the" discourse structure (a single tree or graph) but by the interplay of different levels of description, each reflecting a separate dimension of textuality.
(In text linguistics (Textlinguistik), this idea is not new (e.g., Motsch 96), but the programme has not been carried through yet.)
Impfpflicht gegen Kinderkrankheiten?
[1] Kein Kind weiß heute noch, was Pocken sind. [2] So ein Glück. [3] Als die Pockenimpfung 1854 eingeführt wurde, [4] glaubten manche Menschen, [5] dass sich ihr Kopf in einen Kuhkopf verwandelt, [6] wenn sie sich impfen lassen. [7] Denn der Impfstoff wurde damals aus der Haut von Rindern hergestellt. [8] Heute ist diese furchtbare Krankheit ausgerottet. [9] Dank einer entschlossenen, weltweiten Impfkampagne. [10] Aber es gibt noch: Masern, Kinderlähmung, Diphtherie, Mumps, Röteln, Hepatitis B, Tuberkulose, Keuchhusten. [11] Daran sterben, vor allem in den Entwicklungsländern, jährlich immer noch Millionen Kinder. [12] In Deutschland werden diese Krankheiten von vielen Eltern offenbar nicht ernst genommen. [13] Weil sie sie gar nicht mehr kennen! [14] Denn mit Impfstoffen wurde erreicht, [15] dass diese Infektionen nur noch sporadisch auftreten. [16] Doch wer aus eigenem Erleben weiß, [17] wie schrecklich Kinder leiden, [18] wenn sie ‚nur‘ Masern oder Keuchhusten haben, [19] sollte ihnen dies ersparen. [20] Und auch die gesundheitlichen Folgewirkungen. [21] Nur wer impfen lässt, hilft mit, dass Impfungen eines Tages überflüssig werden. [22] Stattdessen wird über Nebenwirkungen von Impfstoffen schwadroniert, [23] die höchst selten auftreten und die man erst Recht nur aus Büchern kennt. [24] Dann gibt es noch das schöne Argument: Das ist mein Kind, das darf der Staat nicht pieken. [25] Gegen solche Eltern hilft auch keine Impfung.
Mandatory vaccination against children's diseases?
[1] Today, children don't know anymore what smallpox is. [2] What a joy. [3] When smallpox vaccination was introduced in 1854, [4] quite a few people believed [5] that their head would turn into a cow's head [6] if they got themselves vaccinated. [7] For the vaccine was made from cattle skin at the time. [8] Nowadays this dreadful disease is eradicated. [9] Thanks to a determined, world-wide vaccination campaign. [10] But there still are other diseases: measles, polio, diphtheria, mumps, rubella, hepatitis B, tuberculosis, pertussis. [11] Millions of children still die of these every year, especially in less developed countries. [12] In Germany, many parents apparently don't take these diseases seriously. [13] Because they don't know them anymore! [14] For it has been achieved with vaccines [15] that these infections occur only rarely today. [16] But those who have experienced [17] how terribly children suffer [18] when they come down with 'just' measles or pertussis, [19] should spare them the agony. [20] As well as the long-term consequences. [21] Only those who have their children vaccinated will contribute to vaccinations becoming superfluous some day. [22] Instead, people rant about side effects [23] that occur very rarely and are known merely from books. [24] Then there is the great argument: This is my child, the government must not prick her. [25] No vaccine can help against such parents.
Referential Structure
[Slide figure: the example text above with its referential structure marked up]
Thematic Structure
[Slide figure: the example text above with its thematic structure marked up]
Conjunctive Relations
• temporal
  – simultaneous, succession
• consequential
  – manner, consequence, condition, purpose, concession
• comparative
  – similarity, contrast, reformulation
• additive
  – addition, alternation
• Relations can be directed but not weighted: there is no nuclearity
(Martin 1992)
Intentional structure
• Illocutions (inspired by Schmitt 00, Searle 76)
  – Reportivum: writer describes a state of affairs
  – Identifikativum: writer characterizes own state of mind, health, etc.
  – Estimativum: writer presents proposition as probably true
  – Evaluativum: writer presents a personal opinion
  – Appellativum: writer orders or suggests an action
• Support relations (subset of RST)
  – Ease-understanding (Background)
  – Encourage-acting (Motivation)
  – Ease-acting (Enablement)
  – Encourage-believing (Evidence)
  – Encourage-appreciating (Antithesis, Concession)
• Compare "types of argument" (e.g., Eggs 00):
  – deontic
  – epistemic
  – ethic/aesthetic
Argument structure
(inspired by Freeman 1993)
Text understanding: Relating levels of analysis
Text understanding: Relations to sentence syntax
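Multi-level annotation: syntax tree
Annotate, Synpathy
[Figure: syntax tree for "Die einstige Fußball-Weltmacht" ('the former football world power'): an NP node dominates the three tokens via NK edges; POS tags: ART, ADJA, NN]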
Multi-level annotation: coreference
MMAX
Multi-level annotation: text tree
RST Tool
Multi-level annotation: layers
Exmaralda
Multi-level annotation (2): Information structure - SFB632
Projects, data, and annotation tools:
• B1 (Gur/Kwa languages): Shoebox/Toolbox, Exmaralda
• B2 (Chadic languages): Shoebox/Toolbox, Exmaralda
• B4 (Diachronic Germanic / Latin translation): Exmaralda
• B6 (Spoken "Kiezdeutsch"): Exmaralda
• C1 (Newspaper German): Synpathy, MMAX
• C6 (Hindi): XML
• D1 (Newspaper German): Synpathy, MMAX, RST Tool, Exmaralda
• D2 (Questionnaire, 13 different languages): Exmaralda
Multi-level annotation (2): Information structure - SFB632
• B4 (Diachronic Germanic / Latin) - Exmaralda
  – 1800 sentences: info structure, syntax, coherence relations
• C1 (Newspaper German) - Synpathy, MMAX
  – large text collection with only selected sentences being annotated
• D1 (Newspaper German) - Synpathy, MMAX, RST Tool, Exmaralda
  – 200 texts / 2500 sentences, in part with coherence relations, coreference, syntax, info structure
• D2 (Questionnaire) - Exmaralda
  – GB of audio data / 50K transcribed tokens, in part with phrase structure, info structure
SFB632: From annotation tool to database
• Database reads PAULA
• Conversion scripts map from tool output to PAULA
  – Add metadata to documents
  – Fix some inconsistent tokenization
• Challenges
  – Enforce common tokenization across layers (and thus across tools)
  – Enforce syntactically correct annotation (Exmaralda)
• Manual work
  – Check for typos and other errors (wrong type of annotation layer, etc.)
  – Repair some inconsistent tokenization
ANNIS Database
• ANNIS V1: Data resides in main memory
  – In use since 2005
• ANNIS V2: System with relational DB backend (PostgreSQL)
  – To be launched this summer
ANNIS query language
• Issue queries across annotation layers
  – ...to combine different realms of information:
    givenness=giv & syncat=pp & rhetrel=contrast
  – ...to check for conflicting annotations within the same realm:
    ann1::givenness=new & ann2::givenness=giv & #1 _=_ #2
  – ...to check for completeness of annotations:
    aboutness=ref & !givenness=* & #1 _=_ #2
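To make the second and third query patterns concrete, here is a minimal Python sketch of what such checks compute. It is not ANNIS code: the `Span` tuple, the sample annotations, and the helper names are hypothetical, and `_=_` is read as "both annotations cover exactly the same span", which follows ANNIS usage but is only modelled here with start/end offsets.

```python
from collections import namedtuple

# Hypothetical in-memory annotation record: who annotated what, and where
Span = namedtuple("Span", "annotator layer value start end")

annotations = [
    Span("ann1", "givenness", "new", 0, 2),
    Span("ann2", "givenness", "giv", 0, 2),   # disagrees with ann1 on the same span
    Span("ann1", "aboutness", "ref", 3, 5),   # span 3..5 carries no givenness value
    Span("ann1", "aboutness", "ref", 0, 2),
]

def same_span(a, b):
    """Identical coverage: both annotations cover exactly the same stretch."""
    return (a.start, a.end) == (b.start, b.end)

def conflicts(spans, layer, value1, value2):
    """Pairs on one layer and one span that carry the two given values."""
    return [(a, b) for a in spans for b in spans
            if a.layer == b.layer == layer
            and a.value == value1 and b.value == value2
            and same_span(a, b)]

def missing(spans, present_layer, present_value, required_layer):
    """Annotations present_layer=present_value whose span lacks required_layer."""
    return [a for a in spans
            if a.layer == present_layer and a.value == present_value
            and not any(b.layer == required_layer and same_span(a, b) for b in spans)]

conflict_hits = conflicts(annotations, "givenness", "new", "giv")
incomplete_hits = missing(annotations, "aboutness", "ref", "givenness")
```

In this toy data the conflict check finds the one span on which the two annotators disagree, and the completeness check finds the one `aboutness=ref` span that has no `givenness` annotation.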
ANNIS V1: Text view and annotation layers
ANNIS V2: Search for multiple constituents in the Vorfeld
ANNIS V2: Hit list
ANNIS V2: Tree view
ANNIS V2: Coreference view
Availability
• ANNIS database V1
• ANNIS database V2: later this year
• PAULA documentation
• Conversion scripts
  – annotation tools to PAULA: Exmaralda, MMAX2, TigerXML, RSTTool, URML, Palinka, generic inline XML
• Ontology & Tools*
  – extensions for ontology-based corpus querying
  – HTML export for ontologies
* Developed in the DFG project "Nachhaltigkeit linguistischer Daten" (SFB 441, SFB 538, SFB 632)
Example pipelines
Corpus annotation pipeline
• information structure and word order in German*
  – Which contextual conditions license pre-field (Vorfeld) occupation by non-subject constituents?
• annotations
  – grammatical annotation
    • syntax, morphology
  – pragmatic annotation
    • anaphora, bridging, information status
• efficient, goal-specific annotation
  – partial annotation
    • selected examples + immediate context
  – semiautomatic annotation
* Chiarcos, C., J. Ritz, M. Stede (2008). Investigating non-canonical constructions in context: efficient corpus annotation and retrieval. To be presented at KONVENS 2008, Berlin, October 2008.
Corpus annotation pipeline
• sample selection
  – collect a number of texts
  – mark target sentences
• automated pre-processing
  – tokenization
  – parsing
• manual annotation (anaphora, grammar)
  – annotate anaphora and verify syntax
  – use standard annotation tools for both tasks
  – synchronization
• integration
  – conversion to PAULA
Corpus annotation pipeline: automated pre-processing
• sample selection
  – collect a number of texts
  – mark target sentences
  – convert to plain text with markup
• tokenization
  – use standard tokenizer
  – mark sentence boundaries
  – preserve markup
• parsing
  – BitPar (German version of TracePar)*: POS, morph, TIGER-style syntax
  – in case of failure, use TreeTagger/Chunker**: POS, NP/PP chunks
• conversion to TIGER XML
  – conversion from bracket format
* http://www.ims.uni-stuttgart.de/tcl/SOFTWARE/BitPar.html
** http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger
Corpus annotation pipeline: manual annotation
• pre-processing
  – synchronization: identify relevant context sentences
  – produce TIGER XML
• anaphoric annotation
  – MMAX*: converted from TIGER XML; preserve TIGER ids as annotations
• grammatical annotation
  – Synpathy**: correct selected sentences
• synchronization: verify MMAX references to TIGER XML
• integration
* http://mmax2.sourceforge.net
** http://www.mpi.nl/tools/synpathy.html
Corpus annotation pipeline: integration
• annotation
  – anaphora: MMAX format, TIGER ids as annotation values
  – grammar: TIGER XML
• loss-less conversion to PAULA, isomorphic to source format
  – MMAX@PAULA
  – TIGER@PAULA
• merging, yielding a merged PAULA project
  – references to the same token file
• integration, yielding an integrated PAULA project
  – replace TIGER ids from MMAX@PAULA with pointing relations to TIGER@PAULA elements
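The integration step can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the dictionaries standing in for MMAX@PAULA markables and TIGER@PAULA elements, the id strings, and the `integrate` helper are hypothetical and do not reproduce the actual conversion scripts.

```python
# Hypothetical MMAX-derived markables that still carry a TIGER node id as a plain value
mmax_markables = [
    {"id": "markable_1", "tiger_id": "s1_500"},
    {"id": "markable_2", "tiger_id": "s2_501"},
    {"id": "markable_3", "tiger_id": "s9_999"},   # no such node in the TIGER layer
]

# Hypothetical index of elements in the TIGER-derived layer, keyed by TIGER id
tiger_elements = {"s1_500": "tiger.struct_17", "s2_501": "tiger.struct_42"}

def integrate(markables, element_index):
    """Turn TIGER id values into pointing relations to TIGER-layer elements."""
    relations, unresolved = [], []
    for markable in markables:
        target = element_index.get(markable["tiger_id"])
        if target is None:
            # report dangling ids instead of silently dropping them
            unresolved.append(markable["id"])
        else:
            relations.append({"source": markable["id"], "target": target})
    return relations, unresolved

relations, unresolved = integrate(mmax_markables, tiger_elements)
```

Collecting unresolved ids separately mirrors the synchronization concern raised earlier: a reference that cannot be resolved usually points to a mismatch between the two tool outputs and needs manual inspection.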
Corpus exploitation pipeline
• What is the relation between different levels of description?
  – information status vs. morphosyntax
  – discourse structure vs. anaphora
• Qualitative analysis
  – Query the corpus for corresponding annotations and analyse these examples manually (cf. ANNIS slides).
• Quantitative analysis
  – Assess statistical correlations between different annotations.
Corpus exploitation pipeline: Quantitative analysis
• annotation sources
  – TIGER XML: POS, morph, syntax
  – Exmaralda: information structure
  – RST Tool: discourse structure
  – MMAX: coreference
• conversion to PAULA, yielding a corpus of PAULA projects
  – integration of multiple annotations of the same set of documents
• conversion to ARFF
  – extraction of feature vectors
  – so far, no generic ARFF exporter has been developed; ANNIS 2.0 will be augmented with a number of example converters
• WEKA*
  – workbench for statistical analyses
  – statistical, neural, symbolic classifiers
* http://sourceforge.net/projects/weka/
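As an illustration of the "conversion to ARFF" step, the following Python sketch serializes nominal feature vectors in WEKA's ARFF text format. It is not one of the converters mentioned above; the function name, the attribute names, and the sample values are hypothetical, and only nominal attributes are handled.

```python
def to_arff(relation, attributes, rows):
    """Serialize nominal feature vectors as ARFF text.

    attributes: list of (name, list of allowed nominal values)
    rows: list of dicts mapping attribute names to values; a missing
          value is written as '?', the ARFF marker for unknown values
    """
    lines = ["@relation " + relation, ""]
    for name, values in attributes:
        lines.append("@attribute %s {%s}" % (name, ",".join(values)))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(row.get(name, "?") for name, _ in attributes))
    return "\n".join(lines) + "\n"

# Hypothetical feature vectors extracted from merged annotation layers
attributes = [("pos", ["NN", "PPER"]), ("infstat", ["giv", "new"])]
rows = [
    {"pos": "PPER", "infstat": "giv"},
    {"pos": "NN", "infstat": "new"},
    {"pos": "NN"},                       # information status not annotated
]
arff_text = to_arff("referring_expressions", attributes, rows)
```

Each row corresponds to one annotated unit, and each attribute to one annotation layer, so correlations between layers can be explored with any ARFF-reading tool.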
Corpus exploitation pipeline: Quantitative analysis with WEKA
• Preprocessing: selecting relevant features from an ARFF feature list
• Example analysis (decision tree): information status and referring expressions in German (Potsdam Commentary Corpus)
NLP pipeline
• Summarization project*
  – high-quality summarization
  – syntax, coreference, text structure, causal markers
  – PAULA as exchange format between different NLP modules
  – output of different modules is to be combined
    • these may also run in parallel
  – specific requirements for the exchange format
* Stede, M., H. Bieler, S. Dipper, and A. Suriyawongkul (2006). SUMMaR: Combining Linguistics and Statistics for Text Summarization. In Proceedings of the 17th European Conference on Artificial Intelligence (ECAI-06).
Summarization, architecture
• Preprocessing Modules
  – Layout Structure and Metadata Extraction
  – Text Structure Extraction
  – Tokenization and Sentence Boundary Detection
• Flexible Modules
  – Syntactical Analysis (Connexor)
  – Structure Weight Calculation
  – Discourse Marker Annotation
  – Term Weight Calculation
  – Treetagger
  – Topic Segmentation
  – Number and Time Annotation
  – Coreference Analysis (Rosana)
• Final Modules
  – Merging
  – Summary Calculation
  – Graphical Representation
• Flexible modules can be arranged in any order in the pipeline or be processed non-sequentially.
• PAULA as common interchange format
Summarization pipeline
• Preprocessing Modules
  – Layout Structure and Metadata Extraction
  – Text Structure Extraction
  – Tokenization and Sentence Boundary Detection
• Flexible Modules (selection)
  – Syntactic Analysis (Connexor)
  – Term Weight Calculation
  – Coreference Analysis (Rosana)
  – Topic Segmentation
  – Robust Morphosyntactic Analysis (TreeTagger)
• Final Modules
  – Merging
  – Summary Calculation
  – Graphical Representation
Summarization pipeline: A fragment
• Preprocessing Modules
  – Layout Structure and Metadata Extraction
  – Text Structure Extraction
  – Tokenization and Sentence Boundary Detection
• Flexible Modules
  – Term Weight Calculation
  – Topic Segmentation
  – Robust Morphosyntactic Analysis (TreeTagger)
  – ???
• Final Modules
  – Merging
  – Summary Calculation
  – Graphical Representation
• The fragment, step by step
  – PAULA, coming from a preprocessing module
  – Transforming relevant PAULA annotations to Connexor input format
  – Syntactic Analysis (Connexor)*
  – Coreference Analysis (Rosana)
  – Transforming Rosana output to PAULA
  – PAULA, to be processed by other components in the summarization pipeline
• Components in the pipeline are "wrapped" to become consumers and generators of PAULA.
* Rosana requires Connexor as input format; hence, the mapping to PAULA is skipped at this point.
Summarization pipeline: A fragment (continued)
[Same pipeline fragment as on the previous slide, now including the merging step]
• Merging multiple annotation layers in one PAULA project
  – one single PAULA project comprising annotations from different modules
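The idea of "wrapping" components can be sketched in a few lines of Python. This is only a toy model under stated assumptions: a project is a plain dictionary with primary text and named layers, the two "tools" are trivial stand-ins for external components, and `wrap` is a hypothetical helper, not part of any PAULA software.

```python
import copy

def wrap(tool, from_project, to_layer, layer_name):
    """Wrap a tool so that it consumes a project and returns it with one more layer."""
    def module(project):
        native_input = from_project(project)      # project -> tool-specific input
        native_output = tool(native_input)        # run the unchanged tool
        result = copy.deepcopy(project)           # leave the incoming project untouched
        result["layers"][layer_name] = to_layer(native_output)
        return result
    return module

# Toy stand-ins for external tools (hypothetical, for illustration only)
def toy_case_tagger(tokens):
    return ["CAP" if tok[:1].isupper() else "low" for tok in tokens]

def toy_length_counter(tokens):
    return [len(tok) for tok in tokens]

def get_tokens(project):
    return project["layers"]["tok"]

case_module = wrap(toy_case_tagger, get_tokens, list, "case")
length_module = wrap(toy_length_counter, get_tokens, list, "length")

project = {"text": "Die einstige Fußball-Weltmacht",
           "layers": {"tok": ["Die", "einstige", "Fußball-Weltmacht"]}}

# The two modules only depend on the token layer, so their order does not matter.
result_a = length_module(case_module(project))
result_b = case_module(length_module(project))
```

Because every wrapped module reads and writes the same project structure, modules that do not depend on each other can be reordered freely, which is the property the flexible modules rely on.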
Requirements for an interchange format for summarization
• advantages
  – scalability
  – modularization
• requirements
  – supporting merge and split operations for annotations of the same document
  – clear conceptual separation of annotations
PAULA
PAULA format: desiderata I
• PAULA: Potsdamer Austauschformat Linguistischer Annotationen (Potsdam interchange format for linguistic annotations)
• designed with the following premises
  – very general, annotation-specific format supporting
    • multi-layer annotations for information structural (and other) phenomena
    • conflicting hierarchies (RST vs. syntax)
    • pointing references (e.g., anaphora)
PAULA format: desiderata II
• designed with the following premises
  – high coverage: loss-less representation of information from a multitude of input formats and tools
    • TIGER XML, Exmaralda, MMAX, RSTTool
    • Connexor, Rosana, Brill Tagger
PAULA format: desiderata III
• designed with the following premises
  – merging and splitting operations
    • self-contained annotation layers
    • extraction/addition of new annotation layers with minimal effects on other annotation layers
  – XML
PAULA format: An "interlingua" for tools
• Radical standoff
  – each annotation layer stored in a separate file
• Systematic application of xlinks
  – for non-tree fragments
    • crossing branches for discontinuous constituents, anaphoric annotation
• Make as few structural commitments as possible
  – a wide variety of data formats can be represented
    • as opposed to earlier, task-specific formats
• Design inspired by early drafts for LAF
  – conceptually related to GrAF (Ide & Suderman 2007)
PAULA format: Basic elements
• <mark> (markable): span of text which is subject to annotation, e.g. a token
• <struct> (structure): node in a hierarchical (tree or tree-like) structure
• <rel> (relation): relation between struct or mark elements
• <feat> (feature): annotation attached to a mark, struct, or rel element
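The four element types can be read as a small data model. The following Python sketch is an assumption-laden illustration, not the PAULA object model: the class and field names are hypothetical, and the example uses the noun phrase "Die einstige Fußball-Weltmacht" with the labels NP, NK, ART, ADJA, and NN from the syntax example in this talk.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Mark:                      # span of primary text, here given by character offsets
    id: str
    start: int
    end: int
    type: str = "tok"

@dataclass
class Rel:                       # edge pointing at a dominated mark or struct
    id: str
    target: Union[Mark, "Struct"]

@dataclass
class Struct:                    # node of a hierarchical structure
    id: str
    rels: List[Rel] = field(default_factory=list)

@dataclass
class Feat:                      # annotation attached to a mark, struct, or rel
    target: Union[Mark, Struct, Rel]
    name: str
    value: str

text = "Die einstige Fußball-Weltmacht"
marks, pos = [], 0
for i, word in enumerate(text.split(), 1):
    start = text.index(word, pos)
    marks.append(Mark("tok_%d" % i, start, start + len(word)))
    pos = start + len(word)

np_node = Struct("np_1", [Rel("rel_%d" % i, m) for i, m in enumerate(marks, 1)])
feats = [Feat(np_node, "cat", "NP")]
feats += [Feat(rel, "func", "NK") for rel in np_node.rels]
feats += [Feat(m, "POS", tag) for m, tag in zip(marks, ["ART", "ADJA", "NN"])]

def yield_of(node):
    """Tokens covered by a mark or, recursively, by a struct."""
    if isinstance(node, Mark):
        return [text[node.start:node.end]]
    return [tok for rel in node.rels for tok in yield_of(rel.target)]
```

Keeping features as separate objects that point at their targets, rather than as fields of the targets, is what allows annotations to be added or removed without touching the structure they describe.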
PAULA format: Basic elements of syntax annotation
Annotate, TIGERSearch, Synpathy
[Figure: syntax tree for "Die einstige Fußball-Weltmacht" ('the former football world power'): an NP node dominates the three tokens via NK edges; POS tags: ART, ADJA, NN]
PAULA representation of structure elements (struct, mark)
• primary data: Die einstige Fußball-Weltmacht
• mark elements (type "tok"): one <mark> per token
• struct elements: one <struct> for the NP node
• rel elements: one <rel> per edge from the <struct> to a <mark>
PAULA representation of annotation elements (feat)
• on the <struct>: cat=NP
• on each <rel>: func=NK
• on the <mark> elements: POS=ART, POS=ADJA, POS=NN
PAULA format: Physical representation
[Figure: the example tree for "Die einstige Fußball-Weltmacht", with each group of elements assigned to a file]
• Files
  – text.xml: primary data
  – tok.xml: mark elements (type "tok")
  – syntax.xml: struct and rel elements
  – cat_func.xml: feat elements cat=NP and func=NK
  – pos.xml: feat elements POS=ART, POS=ADJA, POS=NN
• Every type of structure (primary data, mark, struct) is represented in an individual file.
• struct and rel together encode hierarchical structures.
• Dominance relations are represented by the XML hierarchy between struct and rel (rel elements form an inline XML fragment within their struct) and by xlinks/xpointers between a rel and the dominated struct/mark.
• Marks refer to token sequences; marks of type 'tok' refer to spans of primary data by means of xlink/xpointer.
• For every annotation layer, every type of feat is also represented in a separate file.
• Feats are attached to mark/struct elements by means of xlink/xpointer expressions.
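To give a feel for the file layout, the Python sketch below generates three standoff files for the example phrase with the standard library's `xml.etree.ElementTree`. The element names `mark`, `struct`, `rel`, and `feat` and the file names come from the slides; everything else is an assumption for illustration, including the list wrapper elements, the attribute names, and the `string-range` pointer syntax, so the output should not be mistaken for schema-valid PAULA.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)

def set_href(elem, target):
    """Attach an xlink:href attribute to an element."""
    elem.set("{%s}href" % XLINK, target)

text = "Die einstige Fußball-Weltmacht"
tags = ["ART", "ADJA", "NN"]

# tok.xml: one mark per token, pointing at a character span of the primary data
tok_root = ET.Element("markList", {"type": "tok"})
pos = 0
for i, word in enumerate(text.split(), 1):
    start = text.index(word, pos)
    mark = ET.SubElement(tok_root, "mark", {"id": "tok_%d" % i})
    # assumed pointer syntax: 1-based start offset and length
    set_href(mark, "text.xml#xpointer(string-range(//body,'',%d,%d))" % (start + 1, len(word)))
    pos = start + len(word)

# syntax.xml: rel elements nested in their struct, pointing at the dominated marks
syn_root = ET.Element("structList")
struct = ET.SubElement(syn_root, "struct", {"id": "np_1"})
for i in range(1, len(tags) + 1):
    set_href(ET.SubElement(struct, "rel", {"id": "rel_%d" % i}), "tok.xml#tok_%d" % i)

# pos.xml: feats attached to marks by reference only
pos_root = ET.Element("featList", {"type": "POS"})
for i, tag in enumerate(tags, 1):
    set_href(ET.SubElement(pos_root, "feat", {"value": tag}), "tok.xml#tok_%d" % i)

files = {name: ET.tostring(root, encoding="unicode")
         for name, root in [("tok.xml", tok_root),
                            ("syntax.xml", syn_root),
                            ("pos.xml", pos_root)]}
```

Note that no file contains the text of the tokens: each layer only points at the layer below it, which is what "radical standoff" amounts to in practice.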
PAULA format: Achievements
• Generic format
  – capable of representing hierarchical structures
    • struct elements correspond to nodes
    • struct/rel elements correspond to dominance relations
  – capable of representing flat, layer-based annotations*
    • mark elements correspond to spans of text without hierarchical structure
  – capable of representing pointing relations*
    • rel elements without a dominating struct element represent non-dominance relations
  – capable of representing any annotation assigned to these
    • feat elements may point to any struct, mark, or rel element
* not shown here
PAULA format: Achievements
• Hierarchies are modelled by means of xlinks
  – may represent any kind of dominance relation using the same mechanism, including discontinuous segments and crossing edges
• Represents every annotation layer on its own
  – structures from different annotation layers do not interfere with each other
    • e.g. conflicting hierarchies
  – addition or removal of another annotation layer does not affect the representation of the remaining layers
PAULA format: Achievements
• Addition of annotation layers and merging of annotation projects is easy
  – if two annotation projects exist for one piece of primary data:*
    • redirect all references to the token layer to the common token layer
    • register the new annotation layer
• Removal of annotation layers is trivial
  – if an annotation layer is to be removed:
    • remove the registration of the annotation layer in the current annotation project
* Merging two annotation projects requires identical tokenization; more on this below.
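The merge and removal recipes above can be spelled out as a short Python sketch. The project dictionaries, the file names, and the helper functions are hypothetical; the sketch only models the two steps named on the slide (redirecting token references, registering or deregistering a layer) and assumes the tokenizations are already identical.

```python
def merge_layer(project, other, layer_name):
    """Add one layer of `other` to `project`, assuming identical tokenization."""
    if project["tokens"] != other["tokens"]:
        raise ValueError("tokenizations differ; align them before merging")
    # redirect every token reference to the common token file
    redirected = [
        dict(item, ref=item["ref"].replace(other["tok_file"], project["tok_file"], 1))
        for item in other["layers"][layer_name]
    ]
    project["layers"][layer_name] = redirected    # register the new layer
    return project

def remove_layer(project, layer_name):
    """Removal only deregisters the layer; the other layers stay untouched."""
    project["layers"].pop(layer_name, None)
    return project

tokens = ["Die", "einstige", "Fußball-Weltmacht"]
syntax_project = {
    "tok_file": "doc.tok.xml", "tokens": tokens,
    "layers": {"pos": [{"ref": "doc.tok.xml#tok_1", "value": "ART"}]},
}
coref_project = {
    "tok_file": "mmax.tok.xml", "tokens": tokens,
    "layers": {"coref": [{"ref": "mmax.tok.xml#tok_3", "value": "discourse-new"}]},
}
merged = merge_layer(syntax_project, coref_project, "coref")
```

The explicit tokenization check is the important design choice: if the two token layers differ, redirecting references would silently attach annotations to the wrong tokens.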
PAULA format: Some minor disadvantages
• Overhead
  – for a project with n annotation layers with different annotations, at least 2n+2 files are created (e.g., at least 8 files for n = 3)
• Only partially human-readable
  – information is distributed across multiple files
PAULA format: More serious problems
• Hard to process using scripting languages
  – validity of xlink references must be verified
• Maintenance
  – there are a number of (quite elaborate) converters from and to PAULA
  – any extension of the original format requires all these converters to be updated
• Merging annotation projects with different tokenization
  – correction of tokenization is regularly required, e.g., in the output of tools that are insensitive to tokenization (RSTTool) or that re-tokenize (Connexor)
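What "verifying the validity of xlink references" involves can be shown with a small Python sketch. It is a simplified check under stated assumptions: references are assumed to have the form `file.xml#id` or `#id`, the sample XML snippets are hypothetical, and other pointer schemes such as `xpointer(...)` expressions are skipped rather than resolved.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

def dangling_references(files):
    """Return (file, href) pairs whose target id does not exist.

    files maps a file name to its XML text.
    """
    roots = {name: ET.fromstring(xml) for name, xml in files.items()}
    ids = {name: {e.get("id") for e in root.iter() if e.get("id")}
           for name, root in roots.items()}
    problems = []
    for name, root in roots.items():
        for elem in root.iter():
            target = elem.get(XLINK_HREF)
            if not target or "(" in target:        # no reference, or a pointer expression
                continue
            target_file, _, target_id = target.partition("#")
            if target_id not in ids.get(target_file or name, set()):
                problems.append((name, target))
    return problems

tok_xml = ('<markList xmlns:xlink="http://www.w3.org/1999/xlink">'
           '<mark id="tok_1"/><mark id="tok_2"/></markList>')
pos_xml = ('<featList xmlns:xlink="http://www.w3.org/1999/xlink">'
           '<feat xlink:href="tok.xml#tok_1" value="ART"/>'
           '<feat xlink:href="tok.xml#tok_9" value="NN"/></featList>')
problems = dangling_references({"tok.xml": tok_xml, "pos.xml": pos_xml})
```

Even this reduced check has to load every file of a project before a single reference can be trusted, which illustrates why ad hoc scripts over standoff files are laborious.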
PAULA: Recent developments
• Currently, the PAULA Java API is under development, including
  – an implementation of the PAULA Object Model
  – a parser for PAULA
    • downward-compatible
  – serialization facilities
    • downward-compatible
  – routines for standard operations
    • aligning divergent tokenizations
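One simple way to align divergent tokenizations is to map both onto character offsets of the shared primary data. The Python sketch below illustrates that idea only; it is not the routine of the PAULA API, the function names are hypothetical, and it assumes that both tokenizations are substrings of the same text in reading order.

```python
def token_spans(text, tokens):
    """Character spans of the tokens, located left to right in the primary data."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)     # raises ValueError if a token is not found
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans

def align(text, tokens_a, tokens_b):
    """For each token of A, the indices of the B tokens that overlap it."""
    spans_a = token_spans(text, tokens_a)
    spans_b = token_spans(text, tokens_b)
    return [[j for j, (s2, e2) in enumerate(spans_b) if s2 < e1 and s1 < e2]
            for (s1, e1) in spans_a]

text = "Die einstige Fußball-Weltmacht"
coarse = ["Die", "einstige", "Fußball-Weltmacht"]
fine = ["Die", "einstige", "Fußball", "-", "Weltmacht"]   # a tool that splits at hyphens

coarse_to_fine = align(text, coarse, fine)
fine_to_coarse = align(text, fine, coarse)
```

With such a mapping, annotations made on one tokenization can be redirected to the tokens of the other, provided the overlaps are unambiguous enough for the annotation in question.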
PAULA: Forthcoming
• Intended extensions of PAULA concern
  – sub-token annotations
    • morphemes, tones, etc.
  – parallel corpora
    • multiple streams of primary data
  – integration of media files
What we've shown
• need for multi-level annotation (MLA) for annotation and processing of pragmatic (and other linguistic) phenomena
• ANNIS, a tool for the querying and visualization of MLA
• example pipelines involving MLA
• typical problems when processing MLA
  – synchronization
  – adding/removal operations
  – expressivity of existing formats
• premises for the development of PAULA
  – generic format specifically designed for linguistic annotations
  – "radical" standoff
• problems of radical standoff formats
  – excessive use of xlinks
  – hard to read
  – validation facilities
• ... and a solution to these
  – PAULA API
Thank you ... and thanks to the team:
• Anke Lüdeling (HUB), Ulf Leser (HUB)
• Heike Bieler (UP), Michael Götze (UP), Julia Ritz (UP), Amir Zeldes (HUB), Uwe Küssner (ext), {Stefanie Dipper, Tillmann Wegst}
• Karsten Hütter (HUB), Christian Lemke (UP), Viktor Rosenfeld (HUB), Florian Zipser (UP)