EAGLES/ISLE Workshop

LREC 2000 • Athens, Greece

Requirements, Tools, and Architectures for Annotated Corpora

Nancy Ide • Vassar College

Chris Brew • Ohio State University

Data Architectures and Software Support for Large Corpora

Towards an American National Corpus

Resources are expensive!

• funders expect to amortize cost of resource creation over several projects

• researchers don't want to reinvent the wheel

• want to be able to accommodate uses for corpora and tools that may not yet be envisaged

• cross-disciplinary acceptance is no longer optional

• we need
  – reusability to avoid unnecessary labor and cost
  – flexibility and extensibility to accommodate different applications, different modes and media, different approaches, and potential future uses

Areas for consideration

• Annotation formats: format of the annotations themselves

• Encoding formats: markup scheme used to identify and delineate elements in the data

• Data architecture: organization of data in terms of document structure, linkage

• Tools architecture: framework for tool interoperability

• Tool support components: facilities to enable tools to work efficiently

Annotation Formats

• need not be identical to achieve commonality

• must work toward specifications that enable mapping among annotations of the same type

• EAGLES/ISLE guidelines
  – layered model
    • universally agreed-upon and applicable specifications at the bottom
    • modules for specific languages, applications, and/or theoretical approaches at higher levels
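
The layered idea can be illustrated with a small sketch: two annotation schemes keep their own fine-grained tags, and each declares a mapping onto a shared bottom layer. The tag names and the three-category core below are invented for illustration and are not taken from the EAGLES/ISLE guidelines.

```python
# Hypothetical shared bottom layer: coarse categories both schemes agree on.
CORE_CATEGORIES = {"NOUN", "VERB", "ADJ"}

# Hypothetical scheme-specific tagsets (higher layers), each mapped to the core.
SCHEME_A_TO_CORE = {"NN": "NOUN", "NNS": "NOUN", "VBD": "VERB", "JJ": "ADJ"}
SCHEME_B_TO_CORE = {"N.sg": "NOUN", "N.pl": "NOUN", "V.past": "VERB", "A": "ADJ"}


def to_core(tag, mapping):
    """Map a scheme-specific tag to its core category, or None if unmapped."""
    core = mapping.get(tag)
    return core if core in CORE_CATEGORIES else None


def comparable(tag_a, tag_b):
    """True when a scheme A tag and a scheme B tag share a core category."""
    core_a = to_core(tag_a, SCHEME_A_TO_CORE)
    core_b = to_core(tag_b, SCHEME_B_TO_CORE)
    return core_a is not None and core_a == core_b


print(comparable("NNS", "N.pl"))    # plural nouns in both schemes
print(comparable("JJ", "V.past"))   # adjective vs. past-tense verb
```

The two schemes never have to be identical; they only have to agree on what sits at the bottom.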

Encoding Formats

• standardized formats required for
  – data interchange
  – enabling easy human-readable display and access

• may or may not serve as direct input to tools

• but must be capable of capturing all information that is input and output of tools

XML

• international standard, web compatible

• used in several corpus-handling applications
  – LT XML (Edinburgh)
  – ATLAS (NIST)
  – XCES (EAGLES)
  – American National Corpus

• provides good tools for linkage, search and extraction, validation and error reduction

Data Architectures

• must support:
  – full range of annotation types
  – alternative annotations and versions
  – different languages
  – different media and modalities (e.g., text, speech signal, audio, video, image)
  – potentially complex linkage among documents, parts of documents, and different modalities

"Stand-off" Data Architecture

• annotations maintained in separate documents that point back to the original

• yields a “hyper-document” composed of the original text and all annotations

• increasingly accepted as the appropriate architecture for language resources
  – MULTEXT, LT NSL and LT XML, ATLAS, CES and XCES, ANC
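
A minimal sketch of the stand-off idea, assuming character offsets as the pointing mechanism (the systems listed above use XML linking mechanisms instead). The class name, layer names, and labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffAnnotation:
    start: int   # character offset where the span begins
    end: int     # character offset just past the span
    label: str


# The original text is stored once and never modified.
BASE_TEXT = "Resources are expensive"

# Each layer would live in its own document; plain lists stand in for them here.
LAYERS = {
    "pos": [
        StandoffAnnotation(0, 9, "NOUN"),
        StandoffAnnotation(10, 13, "VERB"),
        StandoffAnnotation(14, 23, "ADJ"),
    ],
    "chunk": [
        StandoffAnnotation(0, 9, "NP"),
        StandoffAnnotation(10, 23, "VP"),
    ],
}


def resolve(text, annotation):
    """Follow the pointer back to the original text."""
    return text[annotation.start:annotation.end]


def hyper_document(text, layers):
    """Combine the base text with every layer into (layer, span text, label) triples."""
    return [
        (name, resolve(text, ann), ann.label)
        for name, annotations in layers.items()
        for ann in annotations
    ]


for row in hyper_document(BASE_TEXT, LAYERS):
    print(row)
```

Adding a new layer, or an alternative version of an existing one, means adding an entry to `LAYERS`; the base text is never touched.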

Advantages

• avoids unwieldy documents

• allows for versioning, alternative annotations

• XML mechanisms support complex inter-document linkage, linking various media

• XSLT enables selecting from, transforming, and adding to multiple documents to create a new document

Data Models

• XML support for easy transduction of tags makes common tag set less an issue

• But... must have a common underlying data model
  – formalized description of data objects
    • composition, attributes, class membership, applicable procedures, etc.
    • relations among these, independent of instantiation in any particular form
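
To illustrate why tag transduction makes a common tag set less of an issue, here is a small sketch using Python's standard `xml.etree.ElementTree` in place of XSLT. The element names and the two "projects" are invented; the point is only that renaming is mechanical once both encodings share the same underlying structure.

```python
import xml.etree.ElementTree as ET


def transduce(xml_string, tag_map):
    """Rename elements according to tag_map, leaving structure and text intact."""
    root = ET.fromstring(xml_string)
    for element in root.iter():
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


# Hypothetical tag names from two projects that share one underlying model.
PROJECT_A = "<s><w>large</w><w>corpora</w></s>"
PROJECT_B = "<sentence><token>large</token><token>corpora</token></sentence>"
A_TO_B = {"s": "sentence", "w": "token"}

print(transduce(PROJECT_A, A_TO_B) == PROJECT_B)
```

The mapping works only because both encodings describe the same objects and relations; if the underlying models differed, no renaming table would reconcile them.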

The data model...

• must be able to capture structure and relations in diverse types of data and annotations

• impacts the design of annotation schemas, encoding formats, data and tool architectures

• is the most important current need for corpus-based work

Existing models

• TIPSTER
  – object-oriented
  – designed for use in IE

• ATLAS
  – annotation graph formalism
  – designed for use in speech

Design strongly influenced by background assumptions that may not scale up

Abstraction

• an annotation is a one- or two-way link between
  – an annotation object, and
  – a point or span (or a list/set of points or spans) within a base data set

• Links may or may not have a semantics

• Points and spans may be objects, or sets/lists of objects
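
The abstraction above might be written down as follows. This is a sketch under assumptions: the class and field names are hypothetical, offsets are character positions in a text, and a "point" is taken to cover no material of its own.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Point:
    offset: int


@dataclass(frozen=True)
class Span:
    start: int
    end: int


Target = Union[Point, Span]


@dataclass
class Annotation:
    content: dict                    # the annotation object
    targets: List[Target]            # one or more points/spans in the base data
    two_way: bool = False            # whether the base data also points back
    semantics: Optional[str] = None  # optional meaning attached to the link


def covered(base, annotation):
    """Return the stretch of base data each target covers (empty for a point)."""
    pieces = []
    for target in annotation.targets:
        if isinstance(target, Span):
            pieces.append(base[target.start:target.end])
        else:
            pieces.append("")
    return pieces


BASE = "look the word up"
phrasal_verb = Annotation(
    content={"type": "phrasal-verb", "lemma": "look up"},
    targets=[Span(0, 4), Span(14, 16)],  # discontinuous: two spans, one annotation
    semantics="marks",
)
print(covered(BASE, phrasal_verb))
```

Allowing a list of targets is what lets a single annotation cover discontinuous material.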

Observations

• assumes fundamental linearity of objects in the base
  – time line (speech)
  – sequence of characters, words, sentences, etc.
  – pixel data
  – etc.

• the granularity of the data representation and encoding is critical

Targets may be individual objects or sets or lists of objects, so information with more than one dimension is accommodated

Implications

• annotation scheme must be mappable to the structures defined for annotation objects

• encoding scheme must be able to capture the object structure and relations expressed in the model (e.g., class membership and inheritance)
  – requires sophisticated means to specify linkage
  – consider logistics of identifying spans by enclosing them in start and end tags (enabling hierarchical grouping of objects in the data), vs. explicit addressing of start and end points
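
The two ways of identifying spans can be compared directly. The sketch below, which assumes well-formed XML input and character offsets as the addressing scheme, converts enclosing start/end tags into explicit start and end points. The function name and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def inline_to_standoff(xml_string):
    """Return (plain text, [(tag, start, end), ...]) for an inline-tagged string."""
    root = ET.fromstring(xml_string)
    pieces = []
    spans = []

    def walk(element, position):
        start = position
        if element.text:
            pieces.append(element.text)
            position += len(element.text)
        for child in element:
            position = walk(child, position)
            if child.tail:  # text following the child, still inside this element
                pieces.append(child.tail)
                position += len(child.tail)
        spans.append((element.tag, start, position))
        return position

    walk(root, 0)
    # Outer spans first when two spans start at the same offset.
    spans.sort(key=lambda span: (span[1], -span[2]))
    return "".join(pieces), spans


INLINE = "<s><np>Resources</np> <vp>are <adj>expensive</adj></vp></s>"
text, spans = inline_to_standoff(INLINE)
print(text)
for span in spans:
    print(span)
```

The hierarchy that nesting gave for free survives only implicitly, as containment between offsets, which is one of the trade-offs between the two approaches.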

Implications...

• must be possible to represent objects and relations in some form that is both usable by a variety of tools and prevents information loss
  – ideally, in a variety of formats suitable to different tools and applications

Recommendation

• Form a group to study this, consisting of representatives for
  – different areas of LE (text, speech, etc.)
  – different languages, geographical location
  – different media
  – different user needs
  – Information Retrieval and Computer Science

Tools and Tool Architectures

• must support multi-lingual, multi-modal data

• must be flexible
  – adaptable to different annotation schemes, different applications

• must be extensible

• must be reusable

Existing systems

• MULTEXT (1994)
  – developed fundamental data and tool architecture for corpora used in subsequent systems
    • tool modularity, pipeline tool architecture
    • API interface
    • SGML encoding standard for linguistic annotation (CES)
    • concept of "stand-off" annotation

• LT XML (1999), U of Edinburgh
  – grew out of MULTEXT
  – views XML files as either
    • flat stream of markup and text
    • tree-structured XML
  – powerful query language
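
The two views of an XML file can be shown with Python's standard library. This is not the LT XML API; `xml.parsers.expat` stands in for the flat stream view and `xml.etree.ElementTree` for the tree view, and the sample markup is invented.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat


def flat_events(xml_string):
    """View the document as a flat stream of markup and text events."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver each text run as a single chunk
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = lambda data: events.append(("text", data))
    parser.Parse(xml_string, True)
    return events


def tree_words(xml_string):
    """View the same document as a tree and pull out the text of each <w>."""
    root = ET.fromstring(xml_string)
    return [word.text for word in root.iter("w")]


SAMPLE = "<s><w>large</w><w>corpora</w></s>"
for event in flat_events(SAMPLE):
    print(event)
print(tree_words(SAMPLE))
```

The stream view suits pipeline tools that process a document in a single pass, while the tree view suits queries that need context above or around an element.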

• GATE (Sheffield)
  – implements TIPSTER data and tool architecture
  – object model for data and annotation
  – modular tool design, very extensible

• ATLAS (2000) (NIST)
  – still in development
  – layered data and tool architecture similar to previous systems
  – annotation graph formalism instantiated in XML
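
As a rough illustration of an annotation graph, the sketch below treats it as a directed graph whose nodes may carry time offsets and whose arcs carry a layer name and a label. This is a reading of the formalism under assumptions, not the ATLAS data model or API; all names and time values are invented.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Arc:
    source: str
    target: str
    layer: str
    label: str


# Nodes map an identifier to an optional time offset in seconds (None = unanchored).
NODES: Dict[str, Optional[float]] = {"n0": 0.0, "n1": 0.4, "n2": None, "n3": 1.1}

ARCS = [
    Arc("n0", "n1", "word", "large"),
    Arc("n1", "n3", "word", "corpora"),
    Arc("n0", "n3", "speaker", "A"),
    Arc("n1", "n2", "phone", "k"),
]


def time_span(arc, nodes):
    """Return the (start, end) times of an arc; either may be None."""
    return nodes[arc.source], nodes[arc.target]


def layer(arcs, name):
    """Select every arc belonging to one annotation layer."""
    return [arc for arc in arcs if arc.layer == name]


for arc in layer(ARCS, "word"):
    print(arc.label, time_span(arc, NODES))
```

Every layer hangs off the same set of nodes, which reflects the single time line the formalism was designed around.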

Agreement on tools/systems

• tool architecture
  – "plug-and-play"
  – modular
  – layered design

• physical storage representation

• intermediate data representation (model)

• API to enable application development

• query capability

• stand-off data architecture
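
A toy version of the points of agreement, assuming a dictionary as the intermediate data representation and plain functions as the plug-in tools. The tool names and the layer layout are hypothetical.

```python
import re


def tokenize(doc):
    """Add a 'token' layer of (start, end) spans; the text itself is untouched."""
    spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", doc["text"])]
    return {**doc, "layers": {**doc["layers"], "token": spans}}


def tag_capitalized(doc):
    """Add a 'cap' layer listing the token spans that begin with an uppercase letter."""
    text = doc["text"]
    spans = [(s, e) for s, e in doc["layers"]["token"] if text[s].isupper()]
    return {**doc, "layers": {**doc["layers"], "cap": spans}}


def run_pipeline(text, tools):
    """Pass one shared document through each tool in order."""
    doc = {"text": text, "layers": {}}
    for tool in tools:
        doc = tool(doc)
    return doc


result = run_pipeline("Towards an American National Corpus", [tokenize, tag_capitalized])
print(result["layers"]["token"])
print(result["layers"]["cap"])
```

Each tool reads layers and adds a layer, so a tool can be swapped or inserted without changing the others, provided they agree on the intermediate representation.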

Details to work out

• data model

• level to extend notion of modularity
  – gross function, or minimal function?

• best means to accommodate different languages, modalities
  – engine-based approach, language- or medium-specific knowledge as data?

Tool Support Components

• resources are large

• compression and indexing required for a usable system
  – compression is easy
    • excellent compression techniques for XML data
  – indexing is trickier
    • good techniques for full-text search exist
    • but... may not scale up to more complex data
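
Both components can be sketched with the Python standard library. The inverted index below is the basic full-text technique; `zlib` is a general-purpose compressor used here as a stand-in, not one of the XML-specific techniques the slide has in mind. Document identifiers and contents are invented.

```python
import re
import zlib
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to a list of (doc id, character offset) postings."""
    index = defaultdict(list)
    for doc_id, text in sorted(documents.items()):
        for match in re.finditer(r"\w+", text):
            index[match.group().lower()].append((doc_id, match.start()))
    return dict(index)


DOCS = {
    "d1": "Resources are expensive",
    "d2": "Large resources need indexing",
}
INDEX = build_index(DOCS)
print(INDEX["resources"])

# Compression is the easy part: a round trip through zlib is lossless.
xml_bytes = ("<doc>" + "<w>resources</w>" * 100 + "</doc>").encode("utf-8")
packed = zlib.compress(xml_bytes)
print(len(xml_bytes), len(packed), zlib.decompress(packed) == xml_bytes)
```

An index like this answers "where does this word occur?" but not queries over structure or over several annotation layers at once, which is where the scaling concern arises.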

Non-traditional data

• Documents with diagrams, engineering drawings

• Illustrated books, with body text and illustration intermingled or overlaid

• Manuscripts in which the physical details of the calligraphy and media matter

• Interlinked texts, including output of machine translation systems, speech transcription efforts, lexicographic endeavors

• Databases of phonetic phenomena

• Personal and public information spaces: hard disk folder structures, mailing list archives, personal email archives, voice mailboxes, etc.

• Dialogue

• etc.

Recommendations

• develop architectures that abandon the notion of a single distinguished time line

• adopt ideas from the database community
  – work on semi-structured data
  – work that views XML documents as a collection of documents with additional tags and relations between tags

Conclusion

• design tools and resources not based on needs of a particular research community

• open architecture approach

• build on existing standards, emerging consensus

• (widely) distributed development

• involve other relevant communities (IR, CS)