www.isocat.org
ISOcat: A short introduction
Marc Kemps-Snijders (a), Sue Ellen Wright (b), Menzo Windhouwer (a)
(a) Max Planck Institute for Psycholinguistics, (b) Kent State University
marc.kemps-snijders@mpi.nl, [email protected], [email protected]
CLARIN-NL Bijeenkomst, February 19, 2010

ISOcat: a short introduction

DESCRIPTION

ISOcat presentation at the CLARIN-NL Bijeenkomst in Utrecht on February 19, 2010.

Page 1: ISOcat: a short introduction

www.isocat.org

CLARIN-NL Bijeenkomst

ISOcat: A short introduction

Marc Kemps-Snijders (a), Sue Ellen Wright (b), Menzo Windhouwer (a)
(a) Max Planck Institute for Psycholinguistics, (b) Kent State University

marc.kemps-snijders@mpi.nl, [email protected], [email protected]

February 19, 2010

Page 2: ISOcat: a short introduction

ISOcat: a data category registry

• ISO 12620:2009
  – Terminology and other language and content resources: Specification of data categories and management of a Data Category Registry for language resources

Page 3: ISOcat: a short introduction

Data category

• The result of the specification of a given data field
  – A data category is an elementary descriptor in a linguistic structure or an annotation scheme.
• Model consists of 3 main parts:
  – Administrative part
    • Administration and identification
  – Descriptive part
    • Documentation in various working languages
  – Linguistic part
    • Conceptual domain(s) for various object languages

Page 4: ISOcat: a short introduction

Data Category Registry

• ISOcat is a free service: anyone can access it, or register as an expert and create and share their own data categories.
• Data categories can be submitted to the standardization process, in which case they are assigned to a Thematic Domain Group (TDG), which evaluates them.
• At regular intervals, snapshots of the standardized subset of the DCR will be submitted to ISO.

[Diagram: the DCR Board, with the Thematic Domain Groups beneath it: TDG metadata, TDG ….., TDG morphosyntax, TDG terminology]

Page 5: ISOcat: a short introduction

Standardization

[Workflow diagram, labels in reading order: Submission group; Data Category Registry Board (Validation); Thematic Domain Group (Evaluation); Stewardship group; Decision Group; Publication. Two branches are labelled "rejected".]

Page 6: ISOcat: a short introduction

Data categories and linguistic resources

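[Diagram: a (schema for a) lexicon with the elements Lexicon, Lexical Entry, Form, Sense, Word Form and Lemma (cardinalities 0..* and 1..*), annotated with the data categories partOfSpeech, writtenForm, grammaticalGender and lexicalType. Next to it, a (schema for a) typological database with the columns Language, BWO and genders, annotated with the data categories grammaticalGender and wordOrder. Caption: Shared semantics!]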

Page 7: ISOcat: a short introduction

Referencing data categories

Data category persistent identifier (PID): http://isocat.org/datcat/ISO-DC-1345

The PID answers with an HTTP 307 redirect to one of:

• HTML content type: http://www.isocat.org/rest/dc/1345.html
• Default content type: http://www.isocat.org/rest/dc/1345.dcif
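The relation between the PID and the two REST URLs on this slide can be sketched in Python. This is only an illustration: the URL pattern is inferred from the single example shown here, and `REST_BASE` and `rest_url` are hypothetical names, not part of any ISOcat client library.

```python
import re

# Assumption: this base URL and the "<number>.<format>" layout are inferred
# from the two example URLs on the slide, not from a documented API.
REST_BASE = "http://www.isocat.org/rest/dc/"


def rest_url(pid: str, fmt: str = "dcif") -> str:
    """Derive the REST representation URL that a data category PID redirects to."""
    # Accept both PID spellings that occur in the slides: ISO-DC-1345 and DC-1345.
    match = re.search(r"/datcat/(?:ISO-)?DC-(\d+)$", pid)
    if match is None:
        raise ValueError(f"not a recognised data category PID: {pid!r}")
    if fmt not in ("dcif", "html"):
        raise ValueError(f"unsupported representation: {fmt!r}")
    return f"{REST_BASE}{match.group(1)}.{fmt}"


if __name__ == "__main__":
    pid = "http://isocat.org/datcat/ISO-DC-1345"
    print(rest_url(pid))          # default (DCIF) representation
    print(rest_url(pid, "html"))  # HTML representation
```

In practice a client would simply request the PID and follow the redirect; the PID, not the REST URL, is what should be stored in a resource.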

<dcif:dataCategory pid="http://www.isocat.org/datcat/DC-1345" type="complex">
  <dcif:administrationInformation>
    <dcif:administrationRecord>
      <dcif:identifier>partOfSpeech</dcif:identifier>
      <dcif:version>0.0.0</dcif:version>
      <dcif:registrationStatus>candidate</dcif:registrationStatus>
      <dcif:origin>?</dcif:origin>
      <dcif:creation>
        <dcif:creationDate>2004-07-09</dcif:creationDate>
        <dcif:changeDescription xml:lang="en">
          …
        </dcif:changeDescription>
      </dcif:creation>
    </dcif:administrationRecord>
  </dcif:administrationInformation>
  <dcif:descriptionSection>
    <dcif:profile>MorphoSyntax</dcif:profile>
    <dcif:languageSection>
      <dcif:language>en</dcif:language>
      <dcif:definitionSection>
        <dcif:definition xml:lang="en">Term used to describe how a particular word is used in a sentence.
        … … …
</dcif:dataCategory>
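A minimal sketch of how a client might read the administrative part of such a DCIF record, using only the Python standard library. The sample is a trimmed, well-formed version of the fragment above with the elided parts left out. The namespace URI bound to the dcif: prefix is not shown on the slide, so the value used here is an assumption, and `summarize` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: the slide does not show which namespace URI dcif: is bound to.
NS = {"dcif": "http://www.isocat.org/ns/dcif"}

# Trimmed, well-formed version of the fragment on the slide.
SAMPLE = """<dcif:dataCategory xmlns:dcif="http://www.isocat.org/ns/dcif"
    pid="http://www.isocat.org/datcat/DC-1345" type="complex">
  <dcif:administrationInformation>
    <dcif:administrationRecord>
      <dcif:identifier>partOfSpeech</dcif:identifier>
      <dcif:version>0.0.0</dcif:version>
      <dcif:registrationStatus>candidate</dcif:registrationStatus>
    </dcif:administrationRecord>
  </dcif:administrationInformation>
  <dcif:descriptionSection>
    <dcif:profile>MorphoSyntax</dcif:profile>
  </dcif:descriptionSection>
</dcif:dataCategory>"""


def summarize(dcif_xml: str) -> dict:
    """Extract the identifying fields of one data category record."""
    root = ET.fromstring(dcif_xml)
    record = "dcif:administrationInformation/dcif:administrationRecord/"
    return {
        "pid": root.get("pid"),
        "type": root.get("type"),
        "identifier": root.findtext(record + "dcif:identifier", namespaces=NS),
        "status": root.findtext(record + "dcif:registrationStatus", namespaces=NS),
        "profiles": [
            p.text for p in root.iterfind("dcif:descriptionSection/dcif:profile", NS)
        ],
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```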

Page 8: ISOcat: a short introduction

Annotating linguistic resources

• Schema language support for equivalence:
  – for example ODD from TEI

<elementSpec ident="pos">
  <equiv name="partOfSpeech" uri="http://isocat.org/datcat/ISO-DC-369"/>
  …
</elementSpec>

• Annotation using dcr:datcat attribute:
  – for schemas or instances
  – for example RelaxNG schema

<rng:element name="partOfSpeech" dcr:datcat="http://isocat.org/datcat/ISO-DC-369">
  <rng:choice>
    <rng:value dcr:datcat="http://isocat.org/datcat/ISO-DC-370">verb</rng:value>
    <rng:value dcr:datcat="http://isocat.org/datcat/ISO-DC-371">noun</rng:value>
  </rng:choice>
</rng:element>

• XML oriented, is more needed?
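To illustrate what a tool can do with these annotations, the following Python sketch collects every dcr:datcat reference from the RelaxNG example on this slide. The RelaxNG namespace is the standard one; the namespace URI for the dcr: prefix is not given on the slide and is an assumption here, as is the helper name `datcat_references`.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"
# Assumption: the slide does not show which namespace URI dcr: is bound to.
DCR_NS = "http://www.isocat.org/ns/dcr"

# The RelaxNG example from the slide, with namespace declarations added.
SCHEMA = f"""<rng:element xmlns:rng="{RNG_NS}" xmlns:dcr="{DCR_NS}"
    name="partOfSpeech" dcr:datcat="http://isocat.org/datcat/ISO-DC-369">
  <rng:choice>
    <rng:value dcr:datcat="http://isocat.org/datcat/ISO-DC-370">verb</rng:value>
    <rng:value dcr:datcat="http://isocat.org/datcat/ISO-DC-371">noun</rng:value>
  </rng:choice>
</rng:element>"""


def datcat_references(schema_xml: str) -> dict:
    """Map each annotated element name or value to its data category PID."""
    root = ET.fromstring(schema_xml)
    refs = {}
    for node in root.iter():
        pid = node.get(f"{{{DCR_NS}}}datcat")
        if pid is None:
            continue
        # Elements carry a name attribute; values carry their text content.
        label = node.get("name") or (node.text or "").strip()
        refs[label] = pid
    return refs


if __name__ == "__main__":
    for label, pid in datcat_references(SCHEMA).items():
        print(label, "->", pid)
```

Two schemas that use different element names but point at the same PIDs can then be compared on the PIDs alone.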

Page 9: ISOcat: a short introduction

Data categories as RDF resources

:headword dcr:datcat <http://isocat.org/datcat/DC-258> ;
    rdfs:label "head word"@en ;
    rdfs:comment "A lemma heading a dictionary entry."@en ;
    rdfs:label "lemma"@nl ;
    rdfs:comment "Het eerste woord van een artikel in een woordenboek."@nl .

:partOfSpeech dcr:datcat <http://isocat.org/datcat/DC-396> ;
    rdfs:label "part of speech"@en ;
    rdfs:comment "A category assigned to a word based on its grammatical and semantic properties."@en .

A domain modeling approach:

:headword a rdfs:Class .
:partOfSpeech a rdf:Property ;
    rdfs:domain :headword .

Alternative approach:

:headword a rdfs:Class .
:partOfSpeech a rdfs:Class .
:hasPartOfSpeech a rdf:Property ;
    rdfs:domain :headword ;
    rdfs:range :partOfSpeech .
:noun a :partOfSpeech .
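As a small illustration of how such statements could be produced from registry data, the sketch below writes the dcr:datcat annotation for one data category as Turtle. It is a hand-rolled string builder using only the standard library, not an RDF library; `turtle_literal` and `datcat_statement` are hypothetical helper names, and prefix declarations are omitted, as they are on the slide.

```python
def turtle_literal(text: str, lang: str) -> str:
    """Quote a string as a language-tagged Turtle literal."""
    escaped = (
        text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"@{lang}'


def datcat_statement(local_name: str, pid: str, labels: dict, comments: dict) -> str:
    """Build one Turtle statement linking a local resource to a data category PID.

    labels and comments map a language code to the text in that language.
    """
    parts = [f"dcr:datcat <{pid}>"]
    parts += [f"rdfs:label {turtle_literal(t, lang)}" for lang, t in labels.items()]
    parts += [f"rdfs:comment {turtle_literal(t, lang)}" for lang, t in comments.items()]
    return f":{local_name} " + " ;\n    ".join(parts) + " ."


if __name__ == "__main__":
    print(
        datcat_statement(
            "partOfSpeech",
            "http://isocat.org/datcat/DC-396",
            {"en": "part of speech"},
            {},
        )
    )
```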

Page 10: ISOcat: a short introduction

ISOcat status

• ISOcat is under active development:
  – Now:
    • You can access public data categories and selections
    • You can create your own data categories and selections
    • You can share your data categories and selections with others (everyone, or a specified group)
  – In progress:
    • Cleanup of profiles by TDGs
    • Standardization workflow
    • Some social features (forum to discuss specific data categories)
    • Import external ‘data category’ sets, such as:
      – parts of the ISO Concept Database
      – Dublin Core
      – TEI
  – Future:
    • High availability (mirrors)
    • Relation registry

Page 11: ISOcat: a short introduction

ISOcat workshop

• Utrecht, Thursday March 25, 2010
• Especially aimed at supporting Call 1 projects
• Signup at: www.clarin.nl
• Program:
  – A deeper introduction to ISOcat
  – A tutorial on using ISOcat
  – How to annotate specific linguistic resources?

Invitation

Send examples of the types of linguistic resources your project wants to annotate with data category references to [email protected] and we will discuss them at the workshop!

Page 12: ISOcat: a short introduction

Thank you for your attention!

Visit www.isocat.org

[email protected]