ISOcat Data Category Registry: Defining widely accepted linguistic concepts. Menzo Windhouwer, CLARIN-NL MD tutorial, 24-25 September 2009

Page 1: ISOcat Data Category Registry Defining widely accepted linguistic concepts

ISOcat

Data Category Registry
Defining widely accepted linguistic concepts

Menzo Windhouwer

CLARIN-NL MD tutorial, 24-25 September 2009

Page 2: ISOcat Data Category Registry Defining widely accepted linguistic concepts

ISOcat: a reference implementation

• ISO 12620:2009

– Terminology and other content and language resources — Specification of data categories and management of a Data Category Registry for language resources

– ISO 12620:1999 was a fixed list of data categories; this revision provides a data model and management procedures

• ISO Technical Committee 37

– Terminology and other language and content resources

Page 3: ISOcat Data Category Registry Defining widely accepted linguistic concepts

ISO 24613:2008 Lexical Markup Framework

[Figure: class diagram with the classes Lexicon, Lexical Entry, Form, Sense, Lemma and Word Form, connected with multiplicities 0..* and 1..*. The classes are annotated with the data categories partOfSpeech, writtenForm, grammaticalNumber and lexicalType.]

Page 4: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Data categories

• “result of the specification of a given data field” (ISO 12620:2009)
• data element concept (ISO 11179)
– “concept for which the definition, identification and conceptual domain are specified independently of any particular representation”
• complex data categories are data element concepts

Page 5: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Data category types

• complex:
– open: writtenForm (string)
– closed: grammaticalGender (string), with the values neuter, masculine, feminine
– constrained: email (string), Constraint: .+@.+
• simple:
– neuter, masculine, feminine
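
The three kinds of complex data category above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ComplexDataCategory`, its fields and the `accepts` method are hypothetical and not part of ISOcat, and the constraint is assumed to be a regular expression that must match the whole value.

```python
import re
from dataclasses import dataclass


@dataclass
class ComplexDataCategory:
    """Hypothetical model of a complex data category (not the ISOcat data model)."""
    identifier: str
    data_type: str = "string"
    kind: str = "open"                 # "open", "closed" or "constrained"
    values: frozenset = frozenset()    # simple data categories, used when closed
    constraint: str = ""               # regular expression, used when constrained

    def accepts(self, value: str) -> bool:
        if self.kind == "closed":
            # A closed data category only takes values from its value domain.
            return value in self.values
        if self.kind == "constrained":
            # Assumption: the constraint has to match the complete value.
            return re.fullmatch(self.constraint, value) is not None
        # An open data category takes any value of its data type.
        return True


written_form = ComplexDataCategory("writtenForm")
gender = ComplexDataCategory(
    "grammaticalGender",
    kind="closed",
    values=frozenset({"neuter", "masculine", "feminine"}),
)
email = ComplexDataCategory("email", kind="constrained", constraint=r".+@.+")
```

The simple data categories (neuter, masculine, feminine) appear here only as plain strings; in the registry they are data categories in their own right.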

Page 6: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Data category relationships

• Value domain membership
• Subsumption relationships between simple data categories
• Relationships between complex data categories are not stored in the DCR

[Figure: partOfSpeech (string) with the simple data categories pronoun and personal pronoun, illustrating value domain membership and subsumption.]
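
A minimal sketch of how the two stored relationship types could be combined. The identifier `personalPronoun`, the dictionaries and the rule that a subsumed value counts as a member of its ancestor's value domain are assumptions made for illustration; the slides do not state how ISOcat evaluates this.

```python
# Subsumption between simple data categories: child -> parent (hypothetical data).
SUBSUMED_BY = {"personalPronoun": "pronoun"}

# Value domain membership: complex data category -> simple data categories.
VALUE_DOMAIN = {"partOfSpeech": {"pronoun"}}


def in_value_domain(complex_dc: str, simple_dc: str) -> bool:
    """True if simple_dc, or a data category that subsumes it, is in the value domain."""
    allowed = VALUE_DOMAIN.get(complex_dc, set())
    seen = set()
    current = simple_dc
    while current is not None and current not in seen:
        if current in allowed:
            return True
        seen.add(current)                    # guards against accidental cycles
        current = SUBSUMED_BY.get(current)   # climb one subsumption step
    return False
```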

Page 7: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Data category specification

• Administration Information Section
• Description Section
– Data Element Name
– Language Section
    • Name Section
• Conceptual Domain
• Linguistic Section
– Conceptual Domain

Mandatory:
1. A mnemonic identifier
2. An English definition
3. An English name
4. A conceptual domain
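
The four mandatory parts can be checked mechanically. The sketch below assumes a specification is held as a plain dictionary; the key names are hypothetical and do not correspond to DCR or DCIF element names.

```python
# Hypothetical key names for the four mandatory parts listed above.
MANDATORY = ("identifier", "english_definition", "english_name", "conceptual_domain")


def missing_mandatory(spec: dict) -> list:
    """Return the mandatory keys that are absent or empty in the specification."""
    return [key for key in MANDATORY if not spec.get(key)]


draft = {
    "identifier": "partOfSpeech",
    "english_name": "part of speech",
    "english_definition": "",        # still to be written
    "conceptual_domain": "string",
}
problems = missing_mandatory(draft)
```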

Page 8: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Guidelines for data categories (I)

• Identifier:
– camel case and XML-valid element name (without a namespace)
    • valid: partOfSpeech
    • invalid: my:POS (namespace prefix), 123POS (starts with a digit)
• Data Element Name:
– language independent name for the data category used in a specific application domain (specified in the source)
    • PoS in TBX
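
The identifier guideline can be approximated with a regular expression. This is a deliberately simplified sketch: it accepts only ASCII lower camel case, which is stricter than the full XML name rules, and the function name is hypothetical.

```python
import re

# Simplification: a lowercase ASCII letter followed by ASCII letters or digits.
# Every match is a valid XML element name without a namespace prefix, but
# XML itself allows many more names (underscores, non-ASCII letters, ...).
_IDENTIFIER = re.compile(r"[a-z][A-Za-z0-9]*")


def is_valid_identifier(name: str) -> bool:
    return _IDENTIFIER.fullmatch(name) is not None
```

With this check, partOfSpeech passes, while my:POS fails on the colon and 123POS fails on the leading digit.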

Page 9: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Guidelines for data categories (II)

• Name Section in a Language Section
– legible name
    • ‘part of speech’ in the English language section
    • ‘partie du discours’ in the French language section
• Definition:
– intensional definitions (ISO 704)
– should consist of a single sentence fragment
• Source:
– add a source for any quoted material

Page 10: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Guidelines for data categories (III)

• Justification:
– a simple statement justifying the relevance of the data category to the field of language resources
– especially needed for standardization

Page 11: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Private versus standard

• The standard subset of data categories in the registry should be coherent
• The coherence is guarded by Thematic Domain Groups and the DCR Board
• Standard data categories need to meet more constraints than private ones:
– mandatory justification
– DC relations demand profile overlap
– …

Page 12: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Data Category Selections

• Anyone
1. can register with ISOcat
2. can create data categories
3. can create data category selections (DCSs)
4. can share DCSs
5. can make DCSs public
6. may submit DCSs for standardization

Page 13: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Profiles versus DCSs

• Profile membership is part of the DC specification
– the profile indicates the thematic domain of the DC
– the profile view in the UI is created by a query
– there are a limited number of profiles
• A DCS is a collection of DCs
– hand-picked by a user for a specific purpose
– can contain DCs from various profiles
– there can be an unlimited number of DCSs
• There isn’t (yet) a profile-specific view on a DCS
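
The difference between the two can be sketched as a query versus a stored list. Everything in this example is hypothetical: the class, the function and in particular the profile assignments of the three data categories are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCategory:
    identifier: str
    profiles: frozenset   # profile membership is part of the specification


# Invented profile assignments, for illustration only.
REGISTRY = [
    DataCategory("partOfSpeech", frozenset({"Morphosyntax", "Terminology"})),
    DataCategory("grammaticalGender", frozenset({"Morphosyntax"})),
    DataCategory("writtenForm", frozenset({"Lexicography"})),
]


def profile_view(registry, profile):
    """A profile view is computed by querying the registry."""
    return [dc.identifier for dc in registry if profile in dc.profiles]


# A DCS is an explicit, hand-picked list and may cross profile boundaries.
my_selection = ["writtenForm", "partOfSpeech"]

morphosyntax = profile_view(REGISTRY, "Morphosyntax")
```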

Page 14: ISOcat Data Category Registry Defining widely accepted linguistic concepts

ISO standardization process

[Figure: standardization workflow involving the Submission group, the Data Category Registry Board (Validation), the Thematic Domain Group (Evaluation), the Stewardship group and ISO (Publication).]

Page 15: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Submission group

• The owner, possibly together with a group of users, who submits a DCS for standardization
• The data categories in the selection should already meet the stricter constraints for standardized data categories (as far as possible)
– justification
– profile(s)
– …

Page 16: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Thematic Domain Groups

TDG 1: Metadata
TDG 2: Morphosyntax
TDG 3: Semantic Content Representation
TDG 4: Syntax
TDG 5: Machine Readable Dictionary
TDG 6: Language Resource Ontology
TDG 7: Lexicography
TDG 8: Language Codes
TDG 9: Terminology
TDG 11: Multilingual Information Management
TDG 12: Lexical Resources
TDG 13: Lexical Semantics
TDG 14: Source Identification

• TDGs are the owners and guardians of a coherent subset of the DCR
• TDGs own one or more profiles
• Each TDG has a chair
• A number of judges (assigned by SC P members)
• A number of expert members (up to 50%)
• TDGs are constituted at the TC37/SC plenary
• New TDGs need to be proposed by an SC

Page 17: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Harmonization

• When a DC belongs to multiple profiles belonging to different TDGs, harmonization may be needed
– one TDG becomes the owner of the DC
– judges from the other TDG(s) are involved in the evaluation process

Page 18: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Stewardship group

• Members of the TDG who will maintain the data category
• The TDG becomes the owner of a standardized data category
• Changes to the data category need to go through the standardization procedure (evaluation by the TDG, validation by the DCR Board)

Page 19: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Using data categories (I)

• Each data category has a Persistent Identifier (PID): http://www.isocat.org/datcat/DC-1297
– once a data category has been created it can never be deleted, only deprecated or superseded
– the registration authority of 12620 is obliged to keep these URLs working

Page 20: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Using data categories (II)

• This PID can be embedded in the schemata of linguistic resources:
– CMD: <CMD_Element name="Role" ValueScheme="string" ConceptLink="…/DC-1234">
– Relax NG: <rng:element name="gender" dcr:datcat="…/DC-1297">
– XML Schema, TEI ODD, TBX, RDF, XML, …
• DC Reference vocabulary:
– http://www.isocat.org/12620/
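
As a rough sketch of what a consumer of such an annotated schema might do, the following collects the `dcr:datcat` references from a small Relax NG fragment with Python's standard library. The namespace URI bound to the `dcr` prefix is an assumption (the slides do not give it), the surrounding grammar is invented around the `gender` element from the slide, and the function name is hypothetical. The full PID is the one shown for DC-1297 earlier in the deck.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"
DCR_NS = "http://www.isocat.org/ns/dcr"   # assumed namespace URI for the dcr prefix

SCHEMA = f"""<rng:grammar xmlns:rng="{RNG_NS}" xmlns:dcr="{DCR_NS}">
  <rng:start>
    <rng:element name="gender" dcr:datcat="http://www.isocat.org/datcat/DC-1297">
      <rng:text/>
    </rng:element>
  </rng:start>
</rng:grammar>"""


def datcat_references(schema_xml: str) -> dict:
    """Map each annotated element's name attribute to its data category PID."""
    root = ET.fromstring(schema_xml)
    attr = f"{{{DCR_NS}}}datcat"          # ElementTree's {namespace}local form
    return {
        el.get("name"): el.get(attr)
        for el in root.iter()
        if el.get(attr) is not None
    }


refs = datcat_references(SCHEMA)
```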

Page 21: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Using data categories (III)

• The full data category specification can be downloaded from ISOcat in the Data Category Interchange Format (DCIF)
– DCIF is based on a simplified version of the DCR data model, and leaves out some administrative information
– DCIF vocabulary:
    • http://www.isocat.org/12620/

Page 22: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Usage scenarios

• DC references only:
– find semantic overlap between two or more resources by comparing their DC references
• DC references and a schema/component registry:
– find interesting resource (types) by comparing the DC references of schemas/components in the registry
• DC references and a network of registries:
– find (in)directly related resources by related DCs
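
The first scenario, comparing DC references only, amounts to a set intersection. The sketch below assumes each resource is represented simply by the collection of PIDs it references; the function name and the two example.org PIDs are hypothetical.

```python
def semantic_overlap(*resources):
    """Return the DC references shared by all given resources."""
    if not resources:
        return set()
    return set.intersection(*(set(refs) for refs in resources))


lexicon_refs = {
    "http://www.isocat.org/datcat/DC-1297",
    "http://example.org/datcat/hypothetical-1",   # invented PID
}
corpus_refs = {
    "http://www.isocat.org/datcat/DC-1297",
    "http://example.org/datcat/hypothetical-2",   # invented PID
}

shared = semantic_overlap(lexicon_refs, corpus_refs)
```

Two resources that use different element names but point at the same PID still show up as overlapping, which is the point of referencing the registry rather than comparing names.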

Page 23: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Relation Registry

• ISOcat contains a ‘flat’ list of concepts
• The Relation Registry will support storing (user-specific) relations between these concepts
– is-a
– part-of
– equivalent-to
– related-to
– …

Will support:
1. Ontologies and taxonomies on top of data categories
2. Searches across related data categories
3. …
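
A toy version of such a registry, to make the idea of searching across related data categories concrete. This is not the Relation Registry's design: the class, its methods and the example identifiers are hypothetical, and the search treats every relation type as an undirected link, which is a simplification.

```python
from collections import defaultdict, deque


class RelationRegistry:
    """Hypothetical in-memory store of relations between data categories."""

    def __init__(self):
        self.triples = []                     # (subject, relation, object)
        self._neighbours = defaultdict(set)   # undirected adjacency for searching

    def add(self, subject, relation, obj):
        self.triples.append((subject, relation, obj))
        self._neighbours[subject].add(obj)
        self._neighbours[obj].add(subject)

    def related(self, start):
        """All data categories reachable from start, directly or indirectly."""
        seen = {start}
        queue = deque([start])
        while queue:                          # breadth-first traversal
            node = queue.popleft()
            for neighbour in self._neighbours.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        seen.discard(start)
        return seen


rr = RelationRegistry()
rr.add("personalPronoun", "is-a", "pronoun")
rr.add("pronoun", "equivalent-to", "myPronoun")   # invented private data category
reachable = rr.related("personalPronoun")
```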

Page 24: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Registry network

[Figure: network with three layers: Linguistic resources, Data category registries and Relation registries. Labels in the figure: MPI DCR, ISO DCR, Typological Database System RR, MPI RR, MPI archive, TDS database, resource.]

Page 25: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Status of ISOcat

• ISOcat is under active development:
– Now:
    • You can access public data categories and selections
    • You can create your own data categories and selections
    • You can share your data categories and selections with others (everyone, or a specified group)
– Future:
    • Some social features (forum to discuss specific data categories)
    • Cleanup of profiles by TDGs
    • Import external ‘data category’ sets, such as:
        – parts of the ISO Concept Database
        – Dublin Core
        – TEI
    • Standardization workflow
    • High availability (mirrors)
    • Relation registry

Page 26: ISOcat Data Category Registry Defining widely accepted linguistic concepts

Thanks for your attention!

http://www.isocat.org/

[email protected]@mpi.nl
