EAGLES/ISLE Workshop, LREC 2000 • Athens, Greece
The XML Framework: Its Implications for Corpus Access and Use
Nancy Ide
Department of Computer Science
Vassar College
Data Architectures and Software Support for Large Corpora
Towards an American National Corpus
XML
• emerging standard for data representation and exchange on the World Wide Web
• powerful tool for data representation and access
• obvious standard for interchange of language resources
  – supports text, speech, video, audio
  – ...and linkage among them!
XML provides more than SGML
• better linkage mechanisms
• XSLT for document access and transformation
• XML schemas
• provision for accessing all or part of multiple DTDs
XML Links
• "stand-off" annotation is the accepted norm for annotated resources
• maintain all or most annotations in separate documents
  – each references appropriate locations in the original data
  – yields a finely linked hypertext format where the links specify a semantic role rather than navigational options
Requirements of the stand-off architecture
• address XML elements
• address characters and chains of characters within those elements
• address elements and characters both within the same document and in other XML documents
XML Path Language (XPath)
• concise notation for locating elements in the document tree
  – /div/p[2]/s[3] - third sentence of the second paragraph of the top-level <div>
  – /descendant::p - all <p> elements
• functions for accessing characters within elements
  – substring(/p/s[2]/text(),10,12) - 12 characters starting at character 10
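The expressions above can be tried out with a minimal sketch, assuming Python's standard-library xml.etree.ElementTree. That module supports only a subset of XPath and has no substring() function, so the character range is taken with a Python slice instead. The sample document and the helper name xpath_substring are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names follow the slide's examples.
DOC = """<div>
  <p><s>First.</s></p>
  <p><s>One.</s><s>Two.</s><s>It was a bright cold day in April.</s></p>
</div>"""

root = ET.fromstring(DOC)  # root is the <div> element

# /div/p[2]/s[3]: relative to the root <div>, this is p[2]/s[3].
third = root.find("p[2]/s[3]")

# /descendant::p: ElementTree spells the descendant axis as ".//p".
paragraphs = root.findall(".//p")


def xpath_substring(text, start, length):
    """Emulate XPath substring(text, start, length) for start >= 1 (1-based)."""
    return text[start - 1:start - 1 + length]


print(third.text)
print(len(paragraphs))
print(xpath_substring(third.text, 10, 6))
```

A full XPath processor would evaluate substring() directly. The slice is only a stand-in that mirrors XPath's 1-based character positions.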
XPointer
• extends XPath syntax to allow:
  – addressing points and ranges as well as nodes
  – locating information by string matching
  – use of addressing expressions in URI references as fragment identifiers
XLink
• uni- or multi-directional links
• can specify how a link is to be activated
  – by hand or automatically by the browser
• can specify what to do with the target fragment
  – replace it or insert it into the source document
Links to External Documents
• None in SGML
• HyTime/TEI invented "doc" attribute
• CES used "doc" with inheritance to avoid repetition of the attribute
  – not supported by SGML processors
• XML: XLink and xml:base attribute
XSLT
• a powerful tree-traversal language
• translates any XML document into another document in any form
  – HTML
  – XML
  – plain text
  – etc.
• has the most to offer for handling annotated resources
XSLT Capabilities
• selection of elements or portions of element content using the XPath syntax
• rearrangement, transformation of extracted information (text content, element names, etc.) in the target document
• addition of information to the target document
A Simple Example
<?xml version="1.0"?>
<chunk type="BODY" lang="en"
       xml:base="http://www.cs.vassar.edu/~ME/Oen.xcesDoc#">
  <par xlink:href="xptr(substring(//p[1]))">
    <s xlink:href="xptr(substring(//p/s[1]))">
      <tok type="WORD"
           xlink:href="xptr(substring(//p/s[1]/text(),1,2))">
        <orth>It</orth>
        <disamb>
          <base>it</base>
          <msd>Pp3ns</msd>
          <ctag>PPER3</ctag></disamb>
        <lex>
          <base>it</base>
          <msd>Pp3ns</msd>
          <ctag>PPER3</ctag></lex></tok>...
xcesAna document
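To make the stand-off link concrete, here is a minimal sketch of how a token's pointer might be resolved against the original data, assuming Python's standard library. It is not an XPointer processor: the regular expression handles only the substring(PATH/text(),START,LENGTH) shape used for the token above, and the sample documents and the helper name resolve_pointer are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical primary document (the original data being annotated).
PRIMARY = "<text><p><s>It was a bright cold day.</s></p></text>"

# Hypothetical annotation document, modeled on the slide's fragment.
ANNOTATION = """<chunk xmlns:xlink="http://www.w3.org/1999/xlink">
  <tok type="WORD" xlink:href="xptr(substring(//p/s[1]/text(),1,2))">
    <orth>It</orth>
  </tok>
</chunk>"""

# Matches only the pointer shape used above: substring(PATH/text(),START,LENGTH).
POINTER = re.compile(r"xptr\(substring\(//(.+)/text\(\),(\d+),(\d+)\)\)")


def resolve_pointer(href, primary_root):
    """Return the characters of the primary document that href points at."""
    match = POINTER.fullmatch(href)
    if match is None:
        raise ValueError(f"unsupported pointer: {href}")
    path = match.group(1)
    start, length = int(match.group(2)), int(match.group(3))
    element = primary_root.find(".//" + path)  # "//p/s[1]" becomes ".//p/s[1]"
    return element.text[start - 1:start - 1 + length]  # XPath positions are 1-based


primary_root = ET.fromstring(PRIMARY)
tok = ET.fromstring(ANNOTATION).find("tok")
target = resolve_pointer(tok.get(XLINK_HREF), primary_root)
print(target, target == tok.findtext("orth"))
```

The annotation document carries no copy of the text beyond <orth>. The pointer alone is enough to recover the annotated characters from the primary document.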
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- XSLT creates HTML -->
  <xsl:template match="/">
    <html> <body> <xsl:apply-templates/> </body> </html>
  </xsl:template>
  <xsl:template match="//par">
    <xsl:for-each select=".//tok">
      <xsl:value-of select="orth"/>
      <xsl:text>|</xsl:text>
      <xsl:value-of select="disamb/base"/>
      <xsl:text>|</xsl:text>
      <xsl:value-of select="disamb/ctag"/>
      <xsl:text> </xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
XSLT document
Result
It|it|PPER3 was|be|PAST3 a|a|DINT bright|bright|ADJE cold|cold|ADJE day|day|NN…
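The same orth|base|ctag extraction can be sketched without an XSLT processor. The version below assumes Python's standard-library xml.etree.ElementTree, which has no XSLT support, so it walks the tree directly. It is a swapped-in technique for illustration, not the stylesheet itself. The two-token sample and the function name tokens_to_lines are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation fragment with two tokens, using the slide's element names.
ANNOTATION = """<chunk><par><s>
  <tok><orth>It</orth><disamb><base>it</base><ctag>PPER3</ctag></disamb></tok>
  <tok><orth>was</orth><disamb><base>be</base><ctag>PAST3</ctag></disamb></tok>
</s></par></chunk>"""


def tokens_to_lines(root):
    """Return one 'orth|base|ctag' string per <tok>, in document order."""
    return [
        "|".join([
            tok.findtext("orth"),
            tok.findtext("disamb/base"),
            tok.findtext("disamb/ctag"),
        ])
        for tok in root.iter("tok")
    ]


print(" ".join(tokens_to_lines(ET.fromstring(ANNOTATION))))
```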
Possibilities
• create new documents containing selected annotations
• transduce XML-encoded documents to tool-internal formats
• generate a new document with all phonemes that appear in a certain context (or all the unique contexts of a certain phoneme), etc.
XML Schemas
• constrain and document the meaning, usage and relationships of the constituent parts of XML documents
  – datatypes
  – elements and their content
  – attributes and their values
• provide default values for attributes and elements
Impact for language resources
• provide means to define an abstract data model for a class of documents
  – e.g., data model for annotations and annotated objects
  – one of the most important tasks for corpus and tool creators
• provide for much tighter validation of document form and content
Capabilities
• different attribute declarations and/or content models can apply to elements with the same name in different contexts
  – allows for more tightly constrained content models than possible with DTDs
  – e.g., <name> in header and <name> in text likely have different content constraints
• define equivalence classes for groups of elements and/or attributes
  – may be used in the same ways as defined for a particular named element
• CES used parameter entities to define a class of phrase-level objects (for example)
  – a "kludge"
• constrain attribute or element values (or combinations) to be unique, e.g.,
  – only one entry in a computational lexicon can be defined with a given word form
  – only one paragraph can have an attribute indicating that it is the 23rd
  – only one disambiguated form is given for each token
  – only one correspondence for a given item in an alignment document
Useful for error detection and prevention
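As an illustration of what a uniqueness constraint catches, here is a minimal sketch that performs the check in application code, assuming Python's standard library. It is not XML Schema syntax. A schema-aware validator would report the same violation declaratively. The lexicon sample, its element and attribute names, and the function name duplicate_values are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical lexicon; element and attribute names are illustrative only.
LEXICON = """<lexicon>
  <entry form="day"/><entry form="cold"/><entry form="day"/>
</lexicon>"""


def duplicate_values(root, tag, attribute):
    """Return the attribute values that occur on more than one <tag> element."""
    counts = Counter(el.get(attribute) for el in root.iter(tag))
    return sorted(value for value, n in counts.items() if n > 1)


print(duplicate_values(ET.fromstring(LEXICON), "entry", "form"))
```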
• establish dependencies based on element or attribute values, for example:
  – prevent nouns from being assigned a tense
  – specify that tokens with type attribute value PUNCT include only <orth> elements containing specific characters
  – specify annotation labels elsewhere, constrain element content to these values only
    • e.g., constrain the values of the <msd> element in an XCES annotation document to the EAGLES morpho-syntactic specifications
Another means for error control and validation
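The last constraint, restricting <msd> content to a label set specified elsewhere, can be sketched in application code, assuming Python's standard library. This is an illustration of the check, not schema syntax. The label inventory below is a hypothetical placeholder rather than the actual EAGLES specification, and the annotation sample is simplified so that <msd> sits directly under <tok>.

```python
import xml.etree.ElementTree as ET

# Hypothetical label inventory; a real one would come from the EAGLES specifications.
ALLOWED_MSD = {"Pp3ns", "Vmis3s", "Ncns"}

# Hypothetical, simplified annotation: <msd> is a direct child of <tok> here.
ANNOTATION = """<chunk>
  <tok><orth>It</orth><msd>Pp3ns</msd></tok>
  <tok><orth>day</orth><msd>Nxxx</msd></tok>
</chunk>"""


def invalid_msd(root, allowed):
    """Return (orth, msd) pairs whose <msd> value is not in the allowed set."""
    return [
        (tok.findtext("orth"), tok.findtext("msd"))
        for tok in root.iter("tok")
        if tok.findtext("msd") not in allowed
    ]


print(invalid_msd(ET.fromstring(ANNOTATION), ALLOWED_MSD))
```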
Why is XML a good thing?
• search, extraction, and transformation capabilities answer most current and foreseen needs for corpus-based language engineering
• means to fully implement the stand-off data architecture
• processing tools for XML recommendations are freely distributed
  – no need for costly and time-consuming tool development
Conclusion
• XML will allow for
  – representation of multi-lingual, multi-modal resources
  – implementation of the stand-off scheme
  – compatibility with the WWW, enabling
    • exploitation by LE researchers via the web
    • harmonization and combination of LRE resources with other WWW data
  – distributed model for data delivery
P.S....
• A set of XML recommendations for encoding language resources exists:
  – XCES (XML version of the Corpus Encoding Standard, CES)
  – http://www.cs.vassar.edu/XCES
Acknowledgements
• Laurent Romary (LORIA/CNRS)
• Patrice Bonhomme (LORIA/CNRS)