EAGLES/ISLE Workshop
LREC 2000 • Athens, Greece
Requirements, Tools, and Architectures for Annotated Corpora
Nancy Ide • Vassar College
Chris Brew • Ohio State University
Data Architectures and Software Support for Large Corpora
Towards an American National Corpus
Resources are expensive!
• funders expect to amortize cost of resource creation over several projects
• researchers don't want to reinvent the wheel
• want to be able to accommodate uses for corpora and tools that may not yet be envisaged
• cross-disciplinary acceptance is no longer optional
• we need
  – reusability to avoid unnecessary labor and cost
  – flexibility and extensibility to accommodate different applications, different modes and media, different approaches, and potential future uses
Areas for consideration
• Annotation formats: format of the annotations themselves
• Encoding formats: markup scheme used to identify and delineate elements in the data
• Data architecture: organization of data in terms of document structure, linkage
• Tools architecture: framework for tool interoperability
• Tool support components: facilities to enable tools to work efficiently
Annotation Formats
• need not be identical to achieve commonality
• must work toward specifications that enable mapping among annotations of the same type
• EAGLES/ISLE guidelines
  – layered model
    • universally agreed-upon and applicable specifications at the bottom
    • modules for specific languages, applications, and/or theoretical approaches at higher levels
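The layered idea can be illustrated with a minimal Python sketch. The tag sets, feature names, and function names below are invented for illustration and are not taken from the EAGLES/ISLE guidelines: two project-specific tag sets are each described in terms of a small shared core layer, and translation between them goes through that layer.

```python
# Hypothetical shared bottom layer: a few core categories.
CORE_CATEGORIES = {"NOUN", "VERB", "ADJ"}

# Hypothetical project-specific tag sets (the higher layers). Each tag is
# described as a core category plus optional refining features.
TAGSET_A = {
    "NN": ("NOUN", {"number": "sg"}),
    "NNS": ("NOUN", {"number": "pl"}),
    "VB": ("VERB", {}),
}
TAGSET_B = {
    "N-sing": ("NOUN", {"number": "sg"}),
    "N-plur": ("NOUN", {"number": "pl"}),
    "V": ("VERB", {}),
}


def to_core(tag, tagset):
    """Describe a project-specific tag in terms of the shared core layer."""
    category, features = tagset[tag]
    if category not in CORE_CATEGORIES:
        raise ValueError(f"{tag!r} does not map to a core category")
    return category, features


def translate(tag, source, target):
    """Return the target tags whose core description equals that of `tag`."""
    core = to_core(tag, source)
    return sorted(t for t in target if to_core(t, target) == core)
```

For example, `translate("NNS", TAGSET_A, TAGSET_B)` yields `["N-plur"]`: the two schemes never need to be identical, only mappable through the common layer.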
Encoding Formats
• standardized formats required for
  – data interchange
  – enabling easy human-readable display and access
• may or may not serve as direct input to tools
• but must be capable of capturing all information that is input and output of tools
XML
• international standard, web compatible
• used in several corpus-handling applications
  • LT XML (Edinburgh)
  • ATLAS (NIST)
  • XCES (EAGLES)
  • American National Corpus
• provides good tools for linkage, search and extraction, validation and error reduction
Data Architectures
• must support:
  – full range of annotation types
  – alternative annotations and versions
  – different languages
  – different media and modalities (e.g., text, speech signal, audio, video, image)
  – potentially complex linkage among documents, parts of documents, and different modalities
"Stand-off" Data Architecture
• annotations maintained in separate documents that point back to the original
• yields a “hyper-document” composed of the original text and all annotations
• increasingly accepted as the appropriate architecture for language resources
  – MULTEXT, LT NSL and LT XML, ATLAS, CES and XCES, ANC
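A minimal Python sketch of the stand-off idea follows. The class, field, and layer names are assumptions made for illustration, and character offsets stand in for whatever pointing mechanism a real system would use. The base text is never modified; each annotation layer is kept separately and points back into it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffAnnotation:
    layer: str   # name of the annotation layer, e.g. "token" or "pos"
    start: int   # character offset into the base text (inclusive)
    end: int     # character offset into the base text (exclusive)
    label: str   # the annotation content


# The original data stays untouched.
BASE_TEXT = "Resources are expensive"

# Each layer could live in its own document; here they are separate lists.
TOKEN_LAYER = [
    StandoffAnnotation("token", 0, 9, "w1"),
    StandoffAnnotation("token", 10, 13, "w2"),
    StandoffAnnotation("token", 14, 23, "w3"),
]
POS_LAYER = [
    StandoffAnnotation("pos", 0, 9, "NOUN"),
    StandoffAnnotation("pos", 10, 13, "VERB"),
    StandoffAnnotation("pos", 14, 23, "ADJ"),
]


def resolve(text, annotation):
    """Follow an annotation's pointer back into the base text."""
    return text[annotation.start:annotation.end]


def hyper_document(text, *layers):
    """Combine the base text with any number of layers into one view."""
    merged = [a for layer in layers for a in layer]
    merged.sort(key=lambda a: (a.start, a.end, a.layer))
    return [(resolve(text, a), a.layer, a.label) for a in merged]
```

Calling `hyper_document(BASE_TEXT, TOKEN_LAYER, POS_LAYER)` produces the combined view, while dropping or swapping a layer requires no change to the base text or to the other layers.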
Advantages
• avoids unwieldy documents
• allows for versioning, alternative annotations
• XML mechanisms support complex inter-document linkage, linking various media
• XSLT enables selecting from, transforming, and adding to multiple documents to create a new document
Data Models
• XML support for easy transduction of tags makes a common tag set less of an issue
• But... must have a common underlying data model
  – formalized description of data objects
    • composition, attributes, class membership, applicable procedures, etc.
    • relations among these, independent of instantiation in any particular form
The data model...
• must be able to capture structure and relations in diverse types of data and annotations
• impacts the design of annotation schema, encoding formats, data and tool architectures
• is the most important current need for corpus-based work
Existing models
• TIPSTER
  – object-oriented
  – designed for use in IE
• ATLAS
  – annotation graph formalism
  – designed for use in speech
Design strongly influenced by background assumptions that may not scale up
Abstraction
• an annotation is a one- or two-way link between
  – an annotation object, and
  – a point or span (or a list/set of points or spans) within a base data set
• Links may or may not have a semantics
• Points and spans may be objects, or sets/lists of objects
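One way to read this abstraction is as a small set of Python data classes. This is only a sketch: the class and field names, and the use of character offsets for points and spans, are assumptions made for illustration rather than part of any proposed standard.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Point:
    offset: int          # a position between elements of the base data


@dataclass(frozen=True)
class Span:
    start: int           # inclusive
    end: int             # exclusive


Target = Union[Point, Span]


@dataclass
class Annotation:
    content: dict                    # the annotation object
    targets: Tuple[Target, ...]      # one or more points or spans in the base
    relation: Optional[str] = None   # semantics of the link, if it has any
    two_way: bool = False            # whether the base also links back


def extract(base, target):
    """Return the stretch of base data a target covers (empty for a point)."""
    if isinstance(target, Point):
        return ""
    return base[target.start:target.end]


BASE = "look the word up"

# A discontinuous unit: a single annotation object linked to two spans.
phrasal_verb = Annotation(
    content={"type": "phrasal-verb", "lemma": "look up"},
    targets=(Span(0, 4), Span(14, 16)),
    relation="realizes",
)
```

Here `[extract(BASE, t) for t in phrasal_verb.targets]` gives `["look", "up"]`, showing how a list of spans lets one annotation cover material that is not contiguous in the base data.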
Observations
• assumes fundamental linearity of objects in the base
  – time line (speech)
  – sequence of characters, words, sentences, etc.
  – pixel data
  – etc.
• the granularity of the data representation and encoding is critical
• Targets may be individual objects or sets or lists of objects, so information with more than one dimension is accommodated
Implications
• annotation scheme must be mappable to the structures defined for annotation objects
• encoding scheme must be able to capture the object structure and relations expressed in the model (e.g., class membership and inheritance)
  – requires sophisticated means to specify linkage
  – consider logistics of identifying spans by enclosing them in start and end tags (enabling hierarchical grouping of objects in the data), vs. explicit addressing of start and end points
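The trade-off between enclosing tags and explicit addressing can be made concrete with a short Python sketch using the standard `xml.etree.ElementTree` module. The element names and the sample sentence are invented for illustration. The first function converts inline (enclosing) markup into explicitly addressed spans; the second detects span pairs that overlap without nesting, which explicit addressing can record but a single hierarchy of enclosing tags cannot.

```python
import xml.etree.ElementTree as ET

# Spans identified by enclosing start and end tags (hypothetical tag names).
INLINE = "<s><np>Resources</np> <vp>are <adj>expensive</adj></vp></s>"


def inline_to_standoff(xml_string):
    """Return (plain text, [(tag, start, end), ...]) for inline markup."""
    root = ET.fromstring(xml_string)
    pieces = []
    spans = []

    def walk(element, pos):
        start = pos
        if element.text:
            pieces.append(element.text)
            pos += len(element.text)
        for child in element:
            pos = walk(child, pos)
            if child.tail:
                pieces.append(child.tail)
                pos += len(child.tail)
        spans.append((element.tag, start, pos))
        return pos

    walk(root, 0)
    return "".join(pieces), spans


def crosses(a, b):
    """True if two (tag, start, end) spans overlap without nesting."""
    (_, s1, e1), (_, s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1
```

With explicit addressing, a further span such as `("intonation", 0, 13)` can sit alongside `("vp", 10, 23)` even though `crosses` reports that the two could not both be expressed as nested elements in one document.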
Implications...
• must be possible to represent objects and relations in some form that is both usable by a variety of tools and prevents information loss
  – ideally, in a variety of formats suitable to different tools and applications
Recommendation
• Form a group to study this, consisting of representatives for
  – different areas of LE (text, speech, etc.)
  – different languages, geographical location
  – different media
  – different user needs
  – Information Retrieval and Computer Science
Tools and Tool Architectures
• must support multi-lingual, multi-modal data
• must be flexible
  – adaptable to different annotation schemes, different applications
• must be extensible
• must be reusable
Existing systems
• MULTEXT (1994)
  – developed fundamental data and tool architecture for corpora used in subsequent systems
    • tool modularity, pipeline tool architecture
    • API interface
    • SGML encoding standard for linguistic annotation (CES)
    • concept of "stand-off" annotation
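The combination of tool modularity and a pipeline architecture can be sketched in a few lines of Python. This is not MULTEXT code; the document representation (a dict with a `layers` entry) and the two toy tools are assumptions chosen to show the pattern: each tool reads a document, adds one annotation layer, and passes the result on.

```python
def segment(doc):
    """Tool 1: add a 'token' layer of (start, end) character offsets."""
    spans, pos = [], 0
    for word in doc["text"].split():
        start = doc["text"].index(word, pos)
        pos = start + len(word)
        spans.append((start, pos))
    return {**doc, "layers": {**doc["layers"], "token": spans}}


def tag_capitalized(doc):
    """Tool 2: add a toy 'cap' layer; depends only on the 'token' layer."""
    text = doc["text"]
    flags = [text[s:e][0].isupper() for s, e in doc["layers"]["token"]]
    return {**doc, "layers": {**doc["layers"], "cap": flags}}


def run_pipeline(doc, tools):
    """Apply each tool in order; tools share only the document format."""
    for tool in tools:
        doc = tool(doc)
    return doc


result = run_pipeline(
    {"text": "Athens hosted LREC", "layers": {}},
    [segment, tag_capitalized],
)
```

Because the tools agree only on the document format, one of them can be replaced (for example by a segmenter for another language) without touching the others.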
• LT XML (1999), U of Edinburgh
  – grew out of MULTEXT
  – views XML files as either
    • flat stream of markup and text
    • tree-structured XML
  – powerful query language
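The two views of an XML file can be illustrated with Python's standard `xml.etree.ElementTree` module. This is not the LT XML API, only an analogous sketch; the sample markup is invented. The stream view processes the file as a flat sequence of markup events, while the tree view loads the whole structure and navigates it.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document.
XML = "<s><w pos='NOUN'>Resources</w> <w pos='VERB'>are</w></s>"


def stream_view(xml_string):
    """Read the document as a flat sequence of (event, tag) pairs."""
    source = io.BytesIO(xml_string.encode("utf-8"))
    events = ET.iterparse(source, events=("start", "end"))
    return [(event, element.tag) for event, element in events]


def tree_view(xml_string):
    """Load the document as a tree and navigate it by structure."""
    root = ET.fromstring(xml_string)
    return [(w.get("pos"), w.text) for w in root.findall("w")]
```

The stream view suits large files and simple filters, since nothing needs to be held in memory beyond the current event; the tree view suits operations that need context, such as looking at a word's siblings or ancestors.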
• GATE (Sheffield)
  – implements TIPSTER data and tool architecture
  – object model for data and annotation
  – modular tool design, very extensible
• ATLAS (2000) (NIST)
  – still in development
  – layered data and tool architecture similar to previous systems
  – annotation graph formalism instantiated in XML
Agreement on tools/systems
• tool architecture
  – "plug-and-play"
  – modular
  – layered design
• physical storage representation
• intermediate data representation (model)
• API to enable application development
• query capability
• stand-off data architecture
Details to work out
• data model
• level to extend notion of modularity
  – gross function, or minimal function?
• best means to accommodate different languages, modalities
  – engine-based approach, language- or medium-specific knowledge as data?
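The engine-based option can be sketched as follows in Python. The table of per-language patterns is hypothetical and deliberately tiny; the point is only that the engine contains no language-specific code, and supporting a new language means supplying new data.

```python
import re

# Hypothetical language-specific knowledge, held as data rather than code.
LANGUAGE_DATA = {
    # English: keep an apostrophe clitic attached to its word ("don't").
    "en": {"token_pattern": r"[^\W\d_]+(?:'[^\W\d_]+)?|\d+|[^\w\s]"},
    # French: split an elided article or pronoun off the next word ("l'").
    "fr": {"token_pattern": r"[^\W\d_]+'|[^\W\d_]+|\d+|[^\w\s]"},
}


def tokenize(text, language):
    """Generic engine: behavior is determined entirely by the data table."""
    pattern = re.compile(LANGUAGE_DATA[language]["token_pattern"])
    return pattern.findall(text)
```

For example, `tokenize("don't stop", "en")` returns `["don't", "stop"]`, while `tokenize("l'arbre", "fr")` returns `["l'", "arbre"]`, using the same function in both cases.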
Tool Support Components
• resources are large
• compression and indexing required for a usable system
  – compression is easy
    • excellent compression techniques for XML data
  – indexing is trickier
    • good techniques for full-text search exist
    • but... may not scale up to more complex data
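As a reminder of what the well-understood full-text case looks like, here is a minimal inverted index in Python. The document ids and texts are invented, and the tokenization is a plain whitespace split. Note that it indexes words only: nothing in it knows about annotation layers, spans, or links, which is where the scaling concern arises.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, *words):
    """Return the ids of documents that contain every query word."""
    postings = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*postings) if postings else set()


# Hypothetical mini-corpus.
DOCUMENTS = {
    "d1": "resources are expensive",
    "d2": "corpora are large resources",
    "d3": "indexing is trickier",
}
INDEX = build_index(DOCUMENTS)
```

A query such as `search(INDEX, "resources", "are")` returns `{"d1", "d2"}`; answering a structural query (say, all nouns inside a particular kind of phrase) would need an index over annotations as well as words.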
Non-traditional data
• Documents with diagrams, engineering drawings
• Illustrated books, with body text and illustration intermingled or overlaid
• Manuscripts in which the physical details of the calligraphy and media matter
• Interlinked texts, including output of machine translation systems, speech transcription efforts, lexicographic endeavors
• Databases of phonetic phenomena
• Personal and public information spaces: hard disk folder structures, mailing list archives, personal email archives, voice mailboxes, etc.
• Dialogue
• etc.
Recommendations
• develop architectures that abandon the notion of a single distinguished time line
• adopt ideas from the database community
  – work on semi-structured data
  – work that views XML documents as a collection of documents with additional tags and relations between tags
Conclusion
• design tools and resources that are not tied to the needs of one particular research community
• open architecture approach
• build on existing standards, emerging consensus
• (widely) distributed development
• involve other relevant communities (IR, CS)