Towards portability and interoperability for linguistic annotation and language-specific ontologies Robert Munro & David Nathan Endangered Languages Archive, School of Oriental and African Studies

Towards portability and interoperability for linguistic annotation and language-specific ontologies

Robert Munro & David Nathan

Endangered Languages Archive, School of Oriental and African Studies

Outline

1. Introduction and motivation

2. Linguistic ontologies and markups

3. Representing knowledge

4. Supporting fieldworkers

5. Supporting speakers

6. Conclusions

1. Introduction and motivation

Introduction

The main goal of this paper:
  how does GOLD meet the requirements of portability for language documentation and description (Bird & Simons, 2003)?

Road-testing:
  its ability to meet the needs of archive users and contributors

Motivation

The Endangered Languages Archive (ELAR) is part of the Hans Rausing Endangered Languages Project (HRELP)

HRELP supports:
  the archive
  grants for documentation projects
  postgraduate programs focussing on language documentation

Motivation

We (ELAR):
  support a digital archive (preserve data and provide access to it)

We also train students and grantees in:
  markup strategies
  data management strategies
  multimedia development
  choice of recording equipment

Motivation

There is concern that:
  cataloguing metadata (IMDI / OLAC) has not yet been sufficiently extended (Nathan and Austin, 2004)
  rich linguistic and contextual information is not being recorded in well-formed portable formats/structures

Common ontologies present a solution to this

How does GOLD meet our needs?

We find GOLD to be the most suitable ontology for supporting data portability

GOLD’s focus has been on ‘datanalysis sets’

Summary

We suggest extending the focus to:
  data acquisition
  data access

Key extensions:
  formalising the definitions of concepts by representing them as a set of formal properties
  explicitly capturing the conventions and constraints for presentation (rendering)
  modelling features that are inherently indeterminate and/or complex structures

2. Linguistic ontologies and markups

Linguistic ontologies and markups

Ontology:
  strictly, what we agree exists

Markup:
  strictly, what we are certain about

Ontology and markup converge:
  only with consensus and complete confidence
  but there is rarely full confidence in the classification of new hard-to-classify phenomena in little-studied endangered languages

Indeterminacy

Builders of ontologies outside of linguistics have been reluctant to accept inherent indeterminacy:

In some cases, the incompatibilities [between ontologies] can be smoothed over by tweaking definitions of concepts or formalizations of axioms; in other cases, wholesale theoretical revision may be required. (Niles & Pease, 2001)

If we can identify the incompatibilities, we can model them

Supporting linguistics

A theory-neutral model of linguistics is not possible:
  Theories are poly-centric
  They will change

We need a pan-theory model of linguistics

Formalising definitions

Each concept in GOLD should be represented by a set of properties that describe that concept

Three possible values for a given property: ‘Yes’, ‘No’, or ‘Undefined’ (default)

To accurately represent variance: include enough properties to distinguish terms

For portability: include as many properties as possible

Formalising definitions

‘Yes’ can potentially be expanded:
  whether the property is mandatory or optional for the concept
  dependencies between properties for a concept
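
As a rough illustration of the proposal on these two slides, the sketch below stores a concept as a set of three-valued properties, with ‘Undefined’ as the default for anything not stated, and lets a ‘Yes’ be refined as mandatory. The class names, the property names and the `mandatory` field are assumptions made for this example; they are not part of GOLD, and dependencies between properties are not modelled here.

```python
from dataclasses import dataclass, field
from enum import Enum


class Value(Enum):
    YES = "Yes"
    NO = "No"
    UNDEFINED = "Undefined"  # default for any property that is not stated


@dataclass
class Concept:
    name: str
    # Only explicitly stated properties are stored.
    properties: dict = field(default_factory=dict)
    # Assumed refinement of 'Yes': properties that are mandatory for the concept.
    mandatory: set = field(default_factory=set)

    def value(self, prop: str) -> Value:
        """Return the stated value, or UNDEFINED if nothing was stated."""
        return self.properties.get(prop, Value.UNDEFINED)

    def is_mandatory(self, prop: str) -> bool:
        """A property can only be mandatory if its value is 'Yes'."""
        return prop in self.mandatory and self.value(prop) is Value.YES


# Hypothetical property names, chosen only for illustration.
noun = Concept(
    name="Noun",
    properties={"property_a": Value.YES, "property_b": Value.NO},
    mandatory={"property_a"},
)

print(noun.value("property_a"))   # stated as Yes
print(noun.value("property_z"))   # never stated, so Undefined
```

Storing only the stated properties keeps ‘Undefined’ distinct from ‘No’, which is the distinction the three-valued scheme relies on.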

Example

‘Noun’ in GOLD:
  Noun Definition: A noun is a broad classification of parts of speech which include substantives and nominals (Crystal 1997:371; Mish et al. 1990:1176). (http://emeld.org/gold-ns/description.html#Noun, last checked 23/05/2003)

How do I know if my definition is the same as that of Crystal or Mish et al.?

Is it both definitions, or the common ground?

Example

Will future users of GOLD have the same definition?
  the core of ‘noun’ may have longevity
  the boundaries with other concepts will not

COPEs can define extensions in terms of sets of properties, and add those properties to GOLD

Example

GOLD:
  NOUN

COPEs:
  GerundNOUN
  NomVerbNOUN

Can’t formally identify the similarities

Example

GOLD:
  NOUN

COPEs:
  GerundNOUN + property: verb suffix
  NomVerbNOUN + property: verb suffix

Can formally identify the similarities

Definition of NOUN can grow
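
The contrast between the two example slides can be sketched in code: once each COPE extension states what it adds to NOUN as properties, the overlap between extensions can be computed rather than guessed from their names. The dictionary layout, the `noun_core` placeholder and the helper names are assumptions for illustration only.

```python
def added_by_extension(extension: dict, base: dict) -> dict:
    """Properties an extension states that the base concept does not."""
    return {p: v for p, v in extension.items() if p not in base}


def shared_properties(a: dict, b: dict) -> dict:
    """Properties on which both concepts give the same defined value."""
    return {p: v for p, v in a.items() if v != "Undefined" and b.get(p) == v}


# 'noun_core' is a placeholder standing in for whatever GOLD's NOUN defines.
NOUN = {"noun_core": "Yes"}

# Two COPE extensions, each adding the 'verb suffix' property from the slide.
gerund_noun = {**NOUN, "verb_suffix": "Yes"}
nomverb_noun = {**NOUN, "verb_suffix": "Yes"}

common_additions = shared_properties(
    added_by_extension(gerund_noun, NOUN),
    added_by_extension(nomverb_noun, NOUN),
)
print(common_additions)  # {'verb_suffix': 'Yes'}
```

A shared addition such as this one is a candidate for being added to the definition of NOUN itself.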

3. Representing knowledge

Rendering

Separating form from content:
  ideal for flexibility
  not possible for some materials (esp. video)

Rendering conventions / constraints

Some are well known:
  italicize part-of-speech in dictionaries
  align interlinear transcriptions

Some are not:
  representation of language-specific kinship systems, ethnobotanical ontologies etc

Solution 1

Include a (written) description and/or example of the rendering conventions and constraints:
  hard-code the interface

Solution 2

Include formal representations of the conventions within the data:
  interface takes instructions from the data
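
A minimal sketch of Solution 2, assuming a dictionary entry that carries its own rendering rules: the generic `render` function knows nothing about dictionaries and simply follows the instructions stored with the data. The record layout, the style names and the plain-text markers (`*bold*`, `_italic_`) are all assumptions made for this example.

```python
# Generic styles the interface knows how to draw (plain-text stand-ins).
STYLES = {
    "plain": lambda text: text,
    "italic": lambda text: f"_{text}_",
    "bold": lambda text: f"*{text}*",
}

# The record carries both the content and the conventions for presenting it.
entry = {
    "fields": {
        "headword": "dog",
        "pos": "n.",
        "gloss": "a domesticated animal",
    },
    "rendering": [
        {"field": "headword", "style": "bold"},
        {"field": "pos", "style": "italic"},  # italicize part-of-speech
        {"field": "gloss", "style": "plain"},
    ],
}


def render(record: dict) -> str:
    """Render a record by following the rules stored inside it."""
    parts = []
    for rule in record["rendering"]:
        text = record["fields"][rule["field"]]
        parts.append(STYLES[rule["style"]](text))
    return " ".join(parts)


print(render(entry))  # *dog* _n._ a domesticated animal
```

Under Solution 1 the same ordering and styling would instead be written into `render` itself, which is what makes that interface language specific.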

Solutions

These are two extremes:
  hard-coded and language specific
  data driven and language independent

Database architectures and linguistic ontologies:
  not designed for navigation
  ‘transparent’ access to such structures – who does it support?

4. Supporting fieldworkers

Supporting indeterminacy

There are two kinds of indeterminacy in linguistics:
  confidence in assigning a category (uncertainty)
  phenomena that are inherently variable, probabilistic, gradient or continuous

The most valuable information

The most valuable information that a field linguist learns may be the least likely to be annotated

Example: 7uhch in Lakandon Maya:
  A temporal-modal deictic expressing participant frames and speaker's footings (Bergqvist 2005)
  This term has been given the most thought by the researcher, but it is still not completely understood
  The uncertainty (or the extent of certainty) should be recorded: all the properties we do know

5 reasons for modelling uncertainty

1. To record the extent of our knowledge
  For example, we want everything known about 7uhch in Lakandon Maya to be recorded, even if we don’t yet have a category for it

5 reasons for modelling uncertainty

2. For searchability
  If an archive implementing an ontology with uncertain categories exists, then we can more easily find existing solutions to a problem
  If a problem is truly new, then we can allow future researchers to find it
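
The first two reasons can be sketched together: an annotation may leave its category unassigned while still recording every property that is known, and a search over properties then finds it regardless of whether a category exists. The `Annotation` fields, the property names and the second corpus entry are assumptions for illustration; the three properties given for 7uhch only restate the description on the earlier slide.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Annotation:
    form: str
    category: Optional[str] = None      # None: no category assigned yet
    confidence: Optional[float] = None  # annotator's confidence, 0.0 to 1.0
    properties: dict = field(default_factory=dict)  # everything known so far


def find_by_properties(annotations, wanted: dict):
    """Return annotations whose recorded properties include every wanted pair."""
    return [
        a for a in annotations
        if all(a.properties.get(k) == v for k, v in wanted.items())
    ]


corpus = [
    # No category yet, but the known properties are still recorded.
    Annotation(
        form="7uhch",
        properties={"temporal": "Yes", "modal": "Yes", "deictic": "Yes"},
    ),
    # Placeholder entry with a confidently assigned category.
    Annotation(
        form="example_form",
        category="Noun",
        confidence=0.9,
        properties={"deictic": "No"},
    ),
]

hits = find_by_properties(corpus, {"temporal": "Yes", "deictic": "Yes"})
print([a.form for a in hits])  # ['7uhch']
```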

5 reasons for modelling uncertainty

3. To reach certainty
  Even an indeterminate markup can allow a corpus analysis that can inform a decision about assigning the appropriate category

5 reasons for modelling uncertainty

4. To highlight problems with descriptive frameworks

A feature may only appear to belong to multiple (or no) categories because the descriptive framework does not yet account for it

5 reasons for modelling uncertainty

5. Because the concept is inherently indeterminate

The concept may be inherently fuzzy but not previously encountered as a continuous / contiguous phenomenon

Inherently indeterminate features

Eg: cline, gradience, squish, continuities, contiguities, vague, fuzzy, probabilistic

Many prosodic, semantic and discourse features are inherently continuous

Growing arguments for probabilities to be part of our formal linguistic models for morphological and syntactic structures (Aarts, 2004; Baayen, 2003; Manning, 2003)

Inherently indeterminate features

Representing categories by formal properties meets the current requirements of modelling gradience (Aarts, 2004)

Perhaps the “ContinuousObject” concept of SUMO (Niles & Pease, 2001) could also be used?

The problem is, currently, largely unresolved
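
One simple way to read the first point on this slide is that an item's degree of membership in a category can be expressed as the share of the category's defining properties it exhibits. The sketch below shows only that reading; it is an assumption made for illustration, not the formulation of Aarts (2004) and not a proposal for GOLD, and the property names are placeholders.

```python
def membership_degree(item_properties: set, concept_properties: set) -> float:
    """Share of a concept's defining properties that the item exhibits."""
    if not concept_properties:
        raise ValueError("the concept must define at least one property")
    return len(item_properties & concept_properties) / len(concept_properties)


# Placeholder property names, for illustration only.
concept = {"p1", "p2", "p3", "p4"}
central_member = {"p1", "p2", "p3", "p4"}
marginal_member = {"p1", "p2", "p3"}

print(membership_degree(central_member, concept))   # 1.0
print(membership_degree(marginal_member, concept))  # 0.75
```

A count of this kind treats every property as equally important, which is one of the ways in which the problem remains open.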

Incorporating new categories

How do we know that a given category is not the same as another one identified elsewhere?

Formal properties for concepts give us another means for comparison
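
A sketch of such a comparison, assuming each category is stored as a mapping from property names to ‘Yes’, ‘No’ or ‘Undefined’: properties are sorted into those where the two categories agree, those where they conflict, and those where at least one side is silent. The function name and the placeholder properties are assumptions for illustration.

```python
def compare_categories(a: dict, b: dict):
    """Sort properties into agreements, conflicts and unknowns."""
    agree, conflict, unknown = [], [], []
    for prop in sorted(set(a) | set(b)):
        value_a = a.get(prop, "Undefined")
        value_b = b.get(prop, "Undefined")
        if "Undefined" in (value_a, value_b):
            unknown.append(prop)   # at least one description is silent
        elif value_a == value_b:
            agree.append(prop)
        else:
            conflict.append(prop)  # one says Yes, the other says No
    return agree, conflict, unknown


# Placeholder categories from two hypothetical language descriptions.
category_x = {"p1": "Yes", "p2": "No", "p3": "Yes"}
category_y = {"p1": "Yes", "p2": "Yes", "p4": "No"}

agree, conflict, unknown = compare_categories(category_x, category_y)
print(agree)     # ['p1']
print(conflict)  # ['p2']
print(unknown)   # ['p3', 'p4']
```

An empty conflict list does not show that two categories are the same, only that nothing recorded so far rules it out.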

Incorporating structures

As well as inherently discrete phenomena and inherently indeterminate ones, there is a third kind: concepts that are complex structures
  common in syntax and discourse semantics

How do we model a structure in an ontology?

5. Supporting speakers

Users of EL archives

The largest (and growing) user group for endangered languages materials is the speakers of endangered languages

They are rarely interested in linguistic categories, or in navigating a corpus or archive via them

Supporting language-specific ontologies means supporting information-rich structures for both navigation and analysis

Case Study: Yolngu kinship

The Yolngu languages have an extensive kinship terminology called Gurrutu:
  27 terms that identify individuals and sets of individuals in terms of moiety, generation, gender, and patriline or matriline.

The terms extend infinitely through cyclicity

Case Study: Yolngu kinship

Speakers draw from the same sets of kinship relations to describe their relationship to the Yolngu lands

We cannot always annotate well-known linguistic concepts independently of language-specific ontologies

6. Conclusions

Conclusions

Ontology building for endangered languages can be very different to other ontology projects:
  The uncertain is often more valuable than the certain
  The local is often more interesting than the universal
  … but will still need interoperability

We suggest extending the focus of GOLD to:
  data acquisition
  data access

Conclusions

Current GOLD does not need to be altered to incorporate our suggestions
  except to remove assumptions of invariability

Key extensions:
  formalising the definitions of concepts by representing them as a set of formal properties
  explicitly capturing the conventions and constraints for presentation (rendering)
  modelling features that are inherently indeterminate and/or complex structures

References

Aarts, B. 2004. Modelling linguistic gradience. Studies in Language 28(1): 1–49.

Baayen, H. 2003. Probabilistic Approaches to Morphology. In Bod, R., Hay, J. and Jannedy, S. (eds), Probabilistic Linguistics. MIT Press.

Bateman, J. 1992. The theoretical status of ontologies in natural language processing. In Text Representation and Domain Modelling – ideas from linguistics and AI. Technische Universität Berlin.

Bergqvist, H. 2005. Semantics of temporal deictics in Lakandon Maya. Presentation given at the ELAP-ELAR seminar series, SOAS, London.

Bird, S. & G. Simons. 2003. Seven Dimensions of Portability for Language Documentation and Description. Language 79(3): 557–582.

Christie, M. & W. Gaykamangu. 2003. Kinship, moiety, land & language in Arnhem Land. In Literacy Link. Australian Council for Adult Literacy, vol 23, no 5, Oct 2003.

Christie, M., W. Gaykamangu & D. Nathan. 2001. Yolngu Languages and Culture: Gupapuyngu. Faculty of Aboriginal and Torres Strait Islander Studies, NTU. [Multimedia CD-ROM]

Crystal, D. 1997. A dictionary of linguistics and phonetics. 4th edition. Cambridge, MA: Blackwell.

Cysouw, M., J. Good, M. Albu & H. J. Bibiko. 2005. Can GOLD “cope” with WALS? Retrofitting an ontology onto the World Atlas of Language Structures. Proceedings of the E-MELD 2005.

Farrar, S. & D. T. Langendoen. 2003. A linguistic ontology for the Semantic Web. GLOT International 7(3): 97–100.

Farrar, S. 2003a. Markup and the GOLD ontology. Proceedings of the EMELD 2003.

Farrar, S. 2003b. An ontological account of linguistics: extending SUMO with GOLD. Proceedings of the 2003 IEEE International Conference on Natural Language Processing and Knowledge Engineering. Beijing.

Foley, W. A. 2003. Genre, register and language documentation in literate and preliterate communities. In Peter K. Austin (ed.), Language Documentation and Description, vol 1.

Grinevald, C. 2003. Speakers and documentation of endangered languages. In Peter K. Austin (ed.), Language Documentation and Description, vol 1.

Gruber, T. R. 1993. A translation approach to portable ontologies. Knowledge Acquisition 5(2): 199–220.

Himmelmann, N. P. 1998. Documentary and descriptive linguistics. Linguistics 36: 161–195. Berlin: de Gruyter.

Holton, G. 2003. Approaches to digitization and annotation: A survey of language documentation materials in the Alaska Native Language Center Archive. Proceedings of the EMELD 2003.

Manning, C. 2003. Probabilistic Syntax. In Bod, R., Hay, J. and Jannedy, S. (eds), Probabilistic Linguistics. MIT Press.

Nathan, D. (ed.) 1996. Australia’s Indigenous Languages. Adelaide: SSABSA.

Nathan, D. & P. K. Austin. 2004. Reconceiving metadata: language documentation through thick and thin. In Peter K. Austin (ed.), Language Documentation and Description, vol 2.

Niles, I. & A. Pease. 2001. Towards a standard upper ontology. Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001).

Penton, D., C. Bow, S. Bird & B. Hughes. 2004. Towards a General Model for Linguistic Paradigms. Proceedings of EMELD 2004.