The GOLD Effort So Far

Terry Langendoen, Brian Fitzsimons, Emily Kidder
Department of Linguistics, University of Arizona

July 1-3, 2005, E-MELD 2005
Ontologies in Linguistic Annotation




Acknowledgments

Everyone else who’s worked on E-MELD at U Arizona 2001-05, especially:
  Graduate students: Scott Farrar, Will Lewis, Peter Norquest, Ruby Basham
  Undergraduate students: Jesse Kirchner, Shauna Eggers, Alexis Lanham, Sandy Chow

Everyone who’s worked on E-MELD elsewhere, especially:
  Gary, Helen, Anthony, Laura, Zhenwei, Baden, Doug


Whalen’s problem

“We want to be able to describe the data in just the way we want, but we don’t want to program it.”
  Doug Whalen, at the 2001 E-MELD Workshop


Our problem

We want to be able to describe the data in just the way we want, and we want to be able to use everybody else’s data described in just the way they want, and we want to be able to process it in all kinds of ways that make sense to us as scientists and teachers.

Call this the interoperability problem.


TEI’s data interchange solution

Create a “data interchange” format such as the Text Encoding Initiative’s P3.

Require projects that wish to share data to define mappings to and from the interchange format.

  X --φ--> P3 --ψ⁻¹--> Y

  Y --ψ--> P3 --φ⁻¹--> X
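
The two conversion paths above can be sketched as composed mappings. This is only an illustration: the tag sets for projects X and Y and the interchange labels are invented for the example, and a real interchange format carries far more than part-of-speech labels. It assumes both mappings are one-to-one, which is what makes the inverses φ⁻¹ and ψ⁻¹ well defined.

```python
# Hypothetical label sets: project X, project Y, and an interchange vocabulary.
# PHI maps X's labels into the interchange format; PSI does the same for Y.
PHI = {"v": "verb", "n": "noun", "adj": "adjective"}
PSI = {"VB": "verb", "NN": "noun", "JJ": "adjective"}


def invert(mapping):
    """Return the inverse of a one-to-one mapping; fail loudly otherwise."""
    inverse = {}
    for key, value in mapping.items():
        if value in inverse:
            raise ValueError(f"mapping is not invertible: {value!r} has two sources")
        inverse[value] = key
    return inverse


PHI_INV = invert(PHI)
PSI_INV = invert(PSI)


def convert(labels, to_interchange, from_interchange):
    """Send each label into the interchange format, then out to the target project."""
    return [from_interchange[to_interchange[label]] for label in labels]


def x_to_y(labels):
    # X --phi--> interchange --psi inverse--> Y
    return convert(labels, PHI, PSI_INV)


def y_to_x(labels):
    # Y --psi--> interchange --phi inverse--> X
    return convert(labels, PSI, PHI_INV)
```

Each project defines only its own pair of mappings, so adding a third project does not require touching X or Y.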


Two lessons from the TEI

Use a standard markup language. Our choice (like theirs): XML.

Individual projects don’t have to use XML, but their software should export to XML.


XML markup is syntax

In TEI, the tags <s>, <w> and <m> were designed to delimit sentences, words and morphemes respectively.

But they can be used to describe any three-level hierarchy over character strings, such as:
  <s> = sentence, <w> = word, <m> = morpheme
  <s> = paragraph, <w> = sentence, <m> = word
  <s> = chapter, <w> = paragraph, <m> = morpheme
  <s> = big chunk, <w> = middle-size chunk, <m> = small chunk
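
A small sketch of the point, using Python's standard `xml.etree.ElementTree`. The sample string and the two label tables are made up for illustration. The parser accepts the document and exposes its nesting, but which reading of <s>, <w> and <m> is intended has to be supplied from outside the markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-level document; nothing in it says what the tags mean.
SAMPLE = "<s><w><m>dog</m><m>s</m></w><w><m>bark</m></w></s>"

# Two of the readings listed above, supplied separately from the markup.
INTERPRETATIONS = {
    "linguistic": {"s": "sentence", "w": "word", "m": "morpheme"},
    "generic": {"s": "big chunk", "w": "middle-size chunk", "m": "small chunk"},
}


def describe(xml_text, labels):
    """Render the element tree as indented lines, naming each tag via `labels`."""
    lines = []

    def walk(node, depth):
        text = (node.text or "").strip()
        suffix = f": {text}" if text else ""
        lines.append("  " * depth + labels[node.tag] + suffix)
        for child in node:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_text), 0)
    return lines
```

Calling `describe(SAMPLE, INTERPRETATIONS["linguistic"])` and `describe(SAMPLE, INTERPRETATIONS["generic"])` walks the same tree both times; only the externally supplied labels differ.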


Two avenues to markup semantics

The syntax is the semantics (SIS)
  This is essentially the TEI solution.

Leave the semantics to us (LSU)
  Essentially the “Semantic Web” idea.


Problems with SIS

Hard sell. Based on the TEI experience, it’ll be hard to convince linguists to use it.

Expensive. It will be costly to retrofit existing resources to conform to it.

Fragile. Future changes will be likely to break existing applications.


Advantages of LSU

Easier sell. Can have lots of special purpose markup schemas for different purposes, which will be easier to use.

Cheap. Migration to best practice much less costly.

Robust. Changes are less likely to break existing applications.


Place of a linguistic ontology as part of LSU

The central component of LSU is a linguistic ontology that:
  defines the common concepts used in linguistic analysis and description,
  expresses the relations that hold among those concepts,
  relates those concepts to concepts of common-sense understanding (“upper” ontology) and concepts in other disciplines.


Proof of concept that it works

Last year, the Arizona team, together with Gary, Scott, and Will’s team at CSU Fresno, showed that GOLD could be used for smart searching across massive cross-linguistic databases created from XML documents of different types:
  Interlinear glossed texts
  Lexicons


The GOLD Summit

Last November, Will hosted a summit meeting of researchers most involved with GOLD to plan for its further development and maintenance after Arizona’s E-MELD funding ran out yesterday. It recommended:
  Creating a GOLD website.
  Forming a GOLD Council with oversight responsibility, and putting procedures in place using the OLAC model to foster and evaluate development and maintenance.
  Focusing the E-MELD 2005 workshop on GOLD.


Current state of play

We’re proposing to move GOLD “out of the lab” effective with this meeting despite the fact that:
  GOLD version 0.2 has very small coverage, even within morphosyntax, and many areas of the field are not covered at all.
  Several important design issues have not been settled.
    What upper ontology should we use? (Currently SUMO)
    Some “core GOLD” concepts are in flux.
  We broke last year’s applications with our redesign of the treatment of grammatical features.


Classes and instances in GOLD 0.1 (“Old GOLD”)

Reasoning with classes and instances
  If i is of type A and A is a subclass of B, then i is of type B.
  For example, a search for instances of Verb will find all instances of both TransitiveVerb and IntransitiveVerb.
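
The inference rule can be sketched in a few lines. This is a toy model, not how GOLD itself is stored or queried: the dictionaries, the instance names and the `Noun` entry are invented for the example, and each class is assumed to have at most one direct superclass.

```python
# Hypothetical fragment of a class hierarchy: child class -> direct superclass.
SUBCLASS_OF = {"TransitiveVerb": "Verb", "IntransitiveVerb": "Verb"}

# Hypothetical annotated items: instance -> the class it was directly tagged with.
INSTANCE_OF = {
    "x_hit": "TransitiveVerb",
    "x_sleep": "IntransitiveVerb",
    "x_dog": "Noun",
}


def superclasses(cls):
    """Yield cls itself, then each class above it in the hierarchy."""
    seen = set()
    while cls is not None and cls not in seen:
        seen.add(cls)
        yield cls
        cls = SUBCLASS_OF.get(cls)


def instances_of(cls):
    """If i is of type A and A is a subclass of B, then i is also of type B."""
    return sorted(
        item for item, direct in INSTANCE_OF.items() if cls in superclasses(direct)
    )
```

A search with `instances_of("Verb")` returns the items tagged with either subclass, while `instances_of("TransitiveVerb")` returns only the directly tagged one.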


A problem with saying what we want about language X

In language X, verbs are inflected only for tense.

Verb inflectedFor Tense?
  This won’t do if both subject and object of the relation are classes.
  Fails to represent the claim that tense is the only feature that verbs are inflected for in X.

XVerb inflectedFor XTense?
  OK, since XVerb and XTense are both instances (of the GOLD classes Verb and Tense respectively).
  Lack of other inflectional features will show up in response to query.
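
A minimal sketch of that query over a set of triples. The triple store and helper names are hypothetical, and the "only" reading relies on a closed-world assumption: whatever is not asserted about XVerb is taken not to hold.

```python
# Hypothetical description of language X as (subject, relation, object) triples.
TRIPLES = {
    ("XVerb", "instanceOf", "Verb"),
    ("XTense", "instanceOf", "Tense"),
    ("XVerb", "inflectedFor", "XTense"),
}


def objects(subject, relation):
    """Return every object related to `subject` by `relation`, sorted."""
    return sorted(o for s, r, o in TRIPLES if s == subject and r == relation)


def inflected_only_for(subject, feature_class):
    """True if subject has inflectional features and all are instances of feature_class.

    Closed-world assumption: features absent from TRIPLES are treated as absent
    from the language.
    """
    features = objects(subject, "inflectedFor")
    return bool(features) and all(
        (feature, "instanceOf", feature_class) in TRIPLES for feature in features
    )
```

Querying `objects("XVerb", "inflectedFor")` lists XTense and nothing else, which is how the absence of other inflectional features shows up in the response.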


A problem with saying what we want in GOLD

XTense hasValue XFutureTense
  OK, since here hasValue relates instances.

Tense hasValue FutureTense
  Not OK, since here hasValue would relate classes.


Parallel structures for GOLD and language-specific concepts

Allow certain GOLD concepts to be instances of other GOLD classes. In particular, define atomic feature values as instances of particular feature classes.

Allow certain language-specific concepts to be classes that are instantiated by other language-specific concepts. In particular, define language-specific features as classes instantiated by their language-specific values.


Feature systems as substructures

             Any
            / | \
           /  |  \
          /   |   \
      NonP  HodP  PreHodP

TenseSystem-x as a substructure of TenseFeature


Mapping from a language class to a GOLD class

+------------+    +------------+
| Any     <--+----+-- XAny     |
|            |    |            |
| NonP    <--+----+-- XPres    |
|            |    |            |
| HodP    <--+----+-- XRecP    |
|            |    |            |
| PreHodP <--+----+-- XRemP    |
+------------+    +------------+

Mapping to GOLD TenseSystem-x from XTense
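
The mapping in the diagram can be written down as a lookup table, which is enough to search language-specific annotations by GOLD value. The sketch below is illustrative only: the function names and the `(form, tense value)` annotation format are assumptions, not part of GOLD.

```python
# The four correspondences shown in the diagram: XTense value -> TenseSystem-x value.
X_TO_GOLD = {"XAny": "Any", "XPres": "NonP", "XRecP": "HodP", "XRemP": "PreHodP"}


def to_gold(x_value):
    """Translate a language-X tense value into its GOLD counterpart."""
    try:
        return X_TO_GOLD[x_value]
    except KeyError:
        raise ValueError(f"no GOLD mapping for {x_value!r}") from None


def gold_search(annotations, gold_value):
    """Return the forms whose language-specific tense maps to `gold_value`.

    `annotations` is a list of (form, X tense value) pairs, a hypothetical format.
    """
    return [form for form, x_value in annotations if to_gold(x_value) == gold_value]
```

A query phrased in GOLD terms, such as `gold_search(data, "HodP")`, then finds forms that the language's own description labels XRecP.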


Isomorphism between a language system and a GOLD system

             XAny
            / | \
           /  |  \
          /   |   \
     XPres  XRecP  XRemP

XTense system isomorphic to TenseSystem-x
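
One way to check the isomorphism claim mechanically is to treat each system as a tree given by child-to-parent links and verify that the mapping is a bijection on nodes that preserves those links. This is a sketch under that assumption; the representation and function names are chosen for the example.

```python
# Each system as child -> parent links, read off the two tree diagrams.
GOLD_PARENT = {"NonP": "Any", "HodP": "Any", "PreHodP": "Any"}
X_PARENT = {"XPres": "XAny", "XRecP": "XAny", "XRemP": "XAny"}

# The correspondence between the two systems.
MAPPING = {"XAny": "Any", "XPres": "NonP", "XRecP": "HodP", "XRemP": "PreHodP"}


def nodes(parent_of):
    """All nodes mentioned in a child -> parent table."""
    return set(parent_of) | set(parent_of.values())


def is_isomorphism(mapping, source_parent, target_parent):
    """True if `mapping` is one-to-one, onto, and preserves parent links."""
    targets = set(mapping.values())
    if set(mapping) != nodes(source_parent) or targets != nodes(target_parent):
        return False  # some node is unmapped or unreached
    if len(targets) != len(mapping) or len(source_parent) != len(target_parent):
        return False  # two sources share a target, or the link counts differ
    return all(
        target_parent.get(mapping[child]) == mapping[parent]
        for child, parent in source_parent.items()
    )
```

With the tables above, `is_isomorphism(MAPPING, X_PARENT, GOLD_PARENT)` holds; a mapping that sends XAny to a leaf value fails the parent-link check.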


Future of GOLD

?