Hierarchical XML Layers Representation for
Heavily Annotated Corpora
Dan Cristea, Cristina Butnariu
“Al. I. Cuza” University of Iaşi
Faculty of Computer Science
and
Romanian Academy – the Iaşi Branch
Institute for Theoretical Computer Science
LREC 2004 – Workshop on Richly Annotated Corpora 2/48
XML in LR annotation
• A de facto framework to support language annotation
• Used to:
  – record experts' views on linguistic phenomena in corpora
  – store intermediate results in pipeline NLP applications
  – post NLP results
• BUT:
  – annotation schemes: chaotic and not reusable
  – many annotations share parts in common
  – not all layers are useful for the task at hand
Presentation
• Motivation for a structural view on annotation schemes
• Proposal for a hierarchical representation
  – circular references
  – classification within the hierarchy
  – operations within the hierarchy
• Conclusions
An annotation session
• a source XML annotated document,
• a database image of the annotation,
• or both
[Figure: an annotation session and its DTD file]
A sequence of annotation sessions
[Figure: two consecutive annotation sessions, labelled DTD1 and DTD2]
Mixing human with automatic annotation
[Figure: a manual annotation session and an automatic annotation session, labelled DTD1 and DTD2]
Multiple parentage of a scheme
[Figure sequence: diagrams in which annotation layers, drawn as < … > markup, are combined (+) into a scheme with several parents]
The hierarchy – a DAG representation
[Figure: a DAG of annotation schemes with the nodes ST-ROOT, ST-TOK, ST-SEG, ST-PAR, ST-POS, ST-NP, ST-VP, ST-COREF, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP and ST-COREF-IN-SEG]
Definition of a scheme
<scheme name="scheme-name" parents="list-of-parents">
  <tag name="tag-name" attributes="list-of-attributes"/>
  …
  <ref source-tag="tag-name" source-attribute="attribute-name" target-tag="tag-name" target-attribute="attribute-name"/>
  …
</scheme>
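A scheme definition of this shape can be read with any XML parser. The sketch below is only an illustration and not the authors' Java system: it assumes that `parents` and `attributes` are whitespace-separated lists, and the sample scheme (including the choice of ST-NP as parent of ST-COREF), the `Scheme` class and `parse_scheme` are hypothetical names introduced here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical scheme definition following the template on the slide.
SAMPLE = """
<scheme name="ST-COREF" parents="ST-NP">
  <tag name="NP" attributes="id head-id coref"/>
  <ref source-tag="NP" source-attribute="coref"
       target-tag="NP" target-attribute="id"/>
</scheme>
"""

@dataclass
class Scheme:
    name: str
    parents: list
    tags: dict = field(default_factory=dict)  # tag-name -> set of attribute names
    refs: set = field(default_factory=set)    # (source tag, source attr, target tag, target attr)

def parse_scheme(text):
    """Build a Scheme from a <scheme> element (lists assumed whitespace-separated)."""
    root = ET.fromstring(text)
    scheme = Scheme(root.get("name"), root.get("parents", "").split())
    for tag in root.findall("tag"):
        scheme.tags[tag.get("name")] = set(tag.get("attributes", "").split())
    for ref in root.findall("ref"):
        scheme.refs.add((ref.get("source-tag"), ref.get("source-attribute"),
                         ref.get("target-tag"), ref.get("target-attribute")))
    return scheme

scheme = parse_scheme(SAMPLE)
```

Keeping tags as a mapping from tag-name to attribute set, and references as a set of tuples, makes the later set comparisons between schemes straightforward.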
The subsumption relation
A node A subsumes a node B in the hierarchy (B is a descendant of A) iff:
– any tag-name of A is also in B;
– any attribute in the list of attributes of a tag-name in A is also in the list of attributes of the same tag-name in B;
– any semantic relation which holds in A also holds in B;
– B has at least one tag-name which is not in A, and/or there is at least one tag-name in B with at least one attribute that is not in the list of attributes of the homonymous tag-name in A, and/or there is at least one semantic relation which holds in B and does not hold in A.
[Figure: node A drawn above node B]
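The four conditions translate directly into set comparisons. The following is a minimal sketch under assumptions made here: a scheme is a pair `(tags, refs)`, where `tags` maps each tag-name to its set of attributes and `refs` is a set of semantic relations; the two sample schemes are hypothetical.

```python
def subsumes(a, b):
    """Return True if scheme `a` subsumes scheme `b` (b is strictly richer than a)."""
    tags_a, refs_a = a
    tags_b, refs_b = b
    # Conditions 1 and 2: every tag of A is in B, with at least A's attributes.
    for name, attrs in tags_a.items():
        if name not in tags_b or not attrs <= tags_b[name]:
            return False
    # Condition 3: every semantic relation of A also holds in B.
    if not refs_a <= refs_b:
        return False
    # Condition 4: B adds a tag, an attribute, or a semantic relation.
    extra_tag = any(name not in tags_a for name in tags_b)
    extra_attr = any(name in tags_a and bool(tags_b[name] - tags_a[name])
                     for name in tags_b)
    extra_ref = bool(refs_b - refs_a)
    return extra_tag or extra_attr or extra_ref

# Hypothetical schemes, for illustration only.
st_np = ({"NP": {"id"}}, set())
st_coref = ({"NP": {"id", "coref"}}, {("NP", "coref", "NP", "id")})
```

With these samples, `subsumes(st_np, st_coref)` holds, while `subsumes(st_np, st_np)` does not, because the last condition makes the relation strict.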
Example
<?xml version="1.0" encoding="ISO-8859-1" ?>
<ROOT>
  <SEG id="0">
    <NP head-id="2" id="0">
      <TOK id="2" pos="N" lemma="Winston">Winston</TOK>
    </NP>
    <TOK id="3" pos="V" lemma="be">was</TOK>
    <TOK id="4" pos="ING" lemma="dream">dreaming</TOK>
    <TOK id="5" pos="PREP" lemma="of">of</TOK>
    <NP head-id="7" id="2">
      <NP head-id="6" id="1" coref="0">
        <TOK id="6" pos="PRON" lemma="he">his</TOK>
      </NP>
      <TOK id="7" pos="N" lemma="mother">mother</TOK>
    </NP>
    <TOK id="8" pos="PUNCT">.</TOK>
  </SEG>
  <SEG id="1">
    <NP head-id="9" id="3" coref="0">
      <TOK id="9" pos="PRON" lemma="he">He</TOK>
    </NP>
    <TOK id="10" pos="V" lemma="must">must</TOK>
    <TOK id="11" pos="PUNCT">,</TOK>
  </SEG>
  <SEG id="2">
    <NP head-id="12" id="4" coref="0">
      <TOK id="12" pos="PRON" lemma="he">he</TOK>
    </NP>
    <TOK id="13" pos="V" lemma="think">thought</TOK>
    <TOK id="14" pos="PUNCT">,</TOK>
  </SEG>
</ROOT>
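To show how such a document can be consumed, the sketch below groups NPs by the NP they point to. It assumes that the `coref` attribute of an NP holds the `id` of its antecedent NP, which is how the example reads but is not stated on the slide. `DOC` is an abbreviated copy of the example, and `coref_chains` and `np_text` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abbreviated copy of the example document (first two segments only).
DOC = """<ROOT>
<SEG id="0">
  <NP head-id="2" id="0"><TOK id="2" pos="N" lemma="Winston">Winston</TOK></NP>
  <TOK id="3" pos="V" lemma="be">was</TOK>
  <NP head-id="7" id="2">
    <NP head-id="6" id="1" coref="0"><TOK id="6" pos="PRON" lemma="he">his</TOK></NP>
    <TOK id="7" pos="N" lemma="mother">mother</TOK>
  </NP>
</SEG>
<SEG id="1">
  <NP head-id="9" id="3" coref="0"><TOK id="9" pos="PRON" lemma="he">He</TOK></NP>
</SEG>
</ROOT>"""

def coref_chains(xml_text):
    """Map each antecedent NP id to the ids of the NPs referring to it."""
    root = ET.fromstring(xml_text)
    chains = defaultdict(list)
    for np in root.iter("NP"):          # document order
        target = np.get("coref")
        if target is not None:
            chains[target].append(np.get("id"))
    return dict(chains)

def np_text(xml_text, np_id):
    """Return the surface text of the NP with the given id, or None."""
    root = ET.fromstring(xml_text)
    for np in root.iter("NP"):
        if np.get("id") == np_id:
            return " ".join(tok.text for tok in np.iter("TOK"))
    return None

chains = coref_chains(DOC)
```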
How can circular references be notated?
<SEG id="seg0" head-id="vp0">
  Winston
  <VP id="vp0" in-seg="seg0">was dreaming</VP>
  of his mother
</SEG>
Representing circular references

SEG annotation (hierarchy: ST-ROOT, ST-SEG)
<SEG id="seg0">
  Winston
  was dreaming
  of his mother
</SEG>

VP annotation (hierarchy: ST-ROOT, ST-VP)
Winston
<VP id="vp0">
  was dreaming
</VP>
of his mother

SEG refers into VP (hierarchy: ST-ROOT, ST-VP, ST-SEG, ST-SEG-TO-VP)
<SEG id="seg0" head-id="vp0">
  Winston
  <VP id="vp0">
    was dreaming
  </VP>
  of his mother
</SEG>

VP refers into SEG (hierarchy: ST-ROOT, ST-VP, ST-SEG, ST-VP-TO-SEG)
<SEG id="seg0">
  Winston
  <VP id="vp0" in-seg="seg0">
    was dreaming
  </VP>
  of his mother
</SEG>

Keeping all references (hierarchy: ST-ROOT, ST-VP, ST-SEG, ST-SEG-TO-VP, ST-VP-TO-SEG, ST-SEG-VP)
<SEG id="seg0" head-id="vp0">
  Winston
  <VP id="vp0" in-seg="seg0">
    was dreaming
  </VP>
  of his mother
</SEG>

Delete unnecessary layers
[Figure: the hierarchy ST-ROOT, ST-VP, ST-SEG, ST-SEG-TO-VP, ST-VP-TO-SEG, ST-SEG-VP reduced to ST-ROOT, ST-VP, ST-SEG, ST-SEG-VP]
Under what conditions can a document interact with a hierarchy?
• Compatibility of names (tag and attribute names)
  – simple translation
  – expanding/shrinking values: msd="Ncmso" expands into a set of elementary features:
    pos="noun" type="common" gender="masculine" number="singular" case="oblique"
• Matching of semantic relations
  – only by explicit declaration
  – automatic detection (intersection of attribute value ranges) is prone to errors
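The expansion of a compact `msd` value into elementary features, and the reverse shrinking, can be sketched as a position-by-position lookup. The table below is hypothetical and covers only the letters of the "Ncmso" example; a real morphosyntactic specification defines different positions for each part of speech. The use of "-" for an unspecified position is likewise an assumption.

```python
# Hypothetical lookup table for noun codes, limited to the slide's example.
MSD_POSITIONS = [
    ("pos", {"N": "noun"}),
    ("type", {"c": "common"}),
    ("gender", {"m": "masculine"}),
    ("number", {"s": "singular"}),
    ("case", {"o": "oblique"}),
]

def expand_msd(msd):
    """Expand a compact code such as 'Ncmso' into attribute/value pairs."""
    features = {}
    for char, (name, values) in zip(msd, MSD_POSITIONS):
        if char == "-":        # assumed marker for an unspecified position
            continue
        features[name] = values[char]
    return features

def shrink_features(features):
    """Inverse operation: pack attribute/value pairs back into a compact code."""
    chars = []
    for name, values in MSD_POSITIONS:
        inverse = {value: code for code, value in values.items()}
        chars.append(inverse[features[name]] if name in features else "-")
    return "".join(chars).rstrip("-")

expanded = expand_msd("Ncmso")
```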
Operations on the lattice: classification
• Automatic classification of a document on the lattice proceeds in two steps:
  – the witness-collection is formed:
    • the document is parsed, which yields the tag declarations
    • the semantic-relations declaration in the header yields the ref declarations
  – the witness-collection is “classified” down the hierarchy
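The first step, forming the witness collection, can be sketched as a single pass over the document. The slides do not show the format of the header that declares semantic relations, so this sketch simply takes the declared relations as a parameter; `witness_collection` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

def witness_collection(xml_text, declared_refs=()):
    """Collect tag-names with their attributes, plus the declared relations."""
    root = ET.fromstring(xml_text)
    tags = {}
    for element in root.iter():
        # Union of the attribute names seen on any occurrence of this tag.
        tags.setdefault(element.tag, set()).update(element.attrib)
    return tags, set(declared_refs)

# Hypothetical document fragment and relation declaration.
DOC = ('<ROOT><SEG id="0">'
       '<NP id="0" head-id="2"><TOK id="2" pos="N">Winston</TOK></NP>'
       '<NP id="1" coref="0"><TOK id="6" pos="PRON">his</TOK></NP>'
       '</SEG></ROOT>')
tags, refs = witness_collection(DOC, [("NP", "coref", "NP", "id")])
```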
• The “programming by classification” paradigm of Mellish & Reiter (1993)
  – the witness collection satisfies the restrictions of a node collection (is classified under it) if the features of the node collection form a subset of the features of the witness collection
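The subset test suggests a simple top-down walk: starting at the root, descend into every child whose features are contained in the witness collection, and stop at nodes none of whose children qualify. The sketch below is one possible reading, not the authors' algorithm. The flat feature encoding, the edges of the small hierarchy and the function names are assumptions made here.

```python
def satisfies(node_features, witness_features):
    """A node is satisfied if its features are a subset of the witness features."""
    return node_features <= witness_features

def classify(features, children, witness, root="ST-ROOT"):
    """Return the most specific satisfied nodes reachable from the root."""
    result, seen, stack = set(), set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        matching = [child for child in children.get(node, [])
                    if satisfies(features[child], witness)]
        if matching:
            stack.extend(matching)   # keep descending
        else:
            result.add(node)         # no satisfied child: stop here
    return result

# Hypothetical fragment of a hierarchy; the edges are assumed for illustration.
features = {
    "ST-ROOT": set(),
    "ST-SEG": {("tag", "SEG")},
    "ST-NP": {("tag", "NP")},
    "ST-VP": {("tag", "VP")},
    "ST-SEG-NP-VP": {("tag", "SEG"), ("tag", "NP"), ("tag", "VP")},
}
children = {
    "ST-ROOT": ["ST-SEG", "ST-NP", "ST-VP"],
    "ST-SEG": ["ST-SEG-NP-VP"],
    "ST-NP": ["ST-SEG-NP-VP"],
    "ST-VP": ["ST-SEG-NP-VP"],
}
witness = {("tag", "SEG"), ("tag", "NP"), ("attr", "NP", "id")}
result = classify(features, children, witness)
```

Because a child's features include those of its parents, a satisfied node can always be reached through satisfied ancestors, so the walk does not miss nodes that have several parents in the DAG.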
[Figure sequence: automatic classification of a document on the lattice, shown as animation frames over the hierarchy with the nodes ST-ROOT, ST-TOK, ST-SEG, ST-PAR, ST-POS, ST-NP, ST-VP, ST-COREF, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP and ST-COREF-IN-SEG. Labels in the frames: superior borderline, inferior borderline. One frame adds the node ST-SEG-NP-VP-1 next to ST-SEG-NP-VP; a later frame, following a second superior borderline, adds the node ST-NP-PP.]
Operations on the lattice: merge
[Figure: the hierarchy with the nodes ST-ROOT, ST-TOK, ST-SEG, ST-PAR, ST-POS, ST-NP, ST-VP, ST-COREF, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-COREF-IN-SEG and an additional node ST-NP-SEG]
Operations on the lattice: extract
[Figure: the hierarchy with the nodes ST-ROOT, ST-TOK, ST-SEG, ST-PAR, ST-POS, ST-NP, ST-VP, ST-COREF, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP and ST-COREF-IN-SEG]
Conclusions
• Propose a data structure facilitating:
  – Definition and exploitation of annotation schemes
  – Visualization of the hierarchy
  – Representation of circular references
  – Concurrent annotations
  – Automatic classification
  – Operations
    • initialize-hierarchy
    • classify
    • merge
    • extract
• System developed in Java, freely available on request
Acknowledgements
The research presented in this paper has been partly supported by the EC IST-2000-29388 Balkanet project and by the Balkanet-MEC project, funded by the Romanian Ministry of Education and Research.
Thank you…