AIMS
Is ISO 639 enough for a multilingual thesaurus? The AGROVOC case

Caterina Caracciolo, Gudrun Johannsen, Lavanya Kiran, Johannes Keizer
Food and Agriculture Organization of the UN
AOS 2012, Sept 4, 2012 - Kuching (MY)



Page 2: Caracciolo et al_2012_aos_agrovoc_multilinguality

Background

• AGROVOC is published in 21 languages, with others under development

• Multilinguality has always been an issue

• Since the beginning, multilinguality was interpreted as “translation”:

– One hierarchy of terms (one structure), translations in various languages

• This organization remained with the move from a term-centered to a concept-centered resource

04/13/2023 2

Page 3: Caracciolo et al_2012_aos_agrovoc_multilinguality

AGROVOC as object-centered resource…

• Being mainly a resource for document indexing in the area of agriculture, AGROVOC contains a large number of words referring to plants, animals, and food in general


Page 4: Caracciolo et al_2012_aos_agrovoc_multilinguality

# of concepts below top concepts

[Bar chart: number of concepts below each top concept, on a horizontal axis from 0 to 25,000. Top concepts shown: strategies, site, events, time, factors, processes, technology, stages, state, measures, groups, locations, systems, subjects, resources, objects, features, properties, methods, products, activities, phenomena, entities, substances, organism.]

Page 5: Caracciolo et al_2012_aos_agrovoc_multilinguality

Differentiating languages

• Salmon (en)

• Salmón (es)

• лососи (ru)


Page 6: Caracciolo et al_2012_aos_agrovoc_multilinguality

But distribution of languages may be wide…


Page 7: Caracciolo et al_2012_aos_agrovoc_multilinguality

… and names of food tend to vary…


Palta

Aguacate

Page 8: Caracciolo et al_2012_aos_agrovoc_multilinguality

… and names of food tend to vary…


Coime, coimi, cuimi, millmi

Achis, Coyos (Cajamarca), Achita (Ayacucho), Kiwicha (Cusco)

Ataco morado, sangorache, sergorache, hawarcha

Page 9: Caracciolo et al_2012_aos_agrovoc_multilinguality

Not only food names vary


Page 10: Caracciolo et al_2012_aos_agrovoc_multilinguality

Requirements for rendering multilinguality in AGROVOC

1. Unambiguously express the geographic area where a given word is used

– Specification of the area of use of a given word should be optional.

2. No limitations on the type of area allowed

– Countries, groups of countries, geographical or administrative regions should be equally available for specification.

04/13/2023 KISAF, Rome 10

Page 11: Caracciolo et al_2012_aos_agrovoc_multilinguality

AGROVOC as a SKOS resource

• skos:Concept is used to indicate a group of words in various languages that are considered translations of one another

• URIs are kept “abstract” to emphasize the independence of the concept from language

– E.g. http://aims.fao.org/aos/agrovoc/c_12332

• The words grouped are then labels of the given concept


Page 12: Caracciolo et al_2012_aos_agrovoc_multilinguality

SKOS properties to express terms

• skos:prefLabel, skos:altLabel

– Take plain literals as values

– And an optional language tag, expressed by the XML attribute xml:lang

• skosxl:prefLabel, skosxl:altLabel

– Take entities with URIs, so extra information can be attached to labels
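The difference between the two kinds of label properties can be sketched in a few lines of Python. This is only an illustrative data model, not AGROVOC's actual representation: the class names, the label URI, the "es" tag on "Kiwicha", and the `extra` dictionary are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class PlainLabel:
    """Value of skos:prefLabel / skos:altLabel: a literal plus an optional language tag."""
    text: str
    lang: Optional[str] = None  # what xml:lang would carry, e.g. "es"


@dataclass
class XLLabel:
    """A SKOS-XL style label: an entity with its own URI that can carry extra statements."""
    uri: str  # hypothetical label URI, for illustration only
    literal_form: PlainLabel
    extra: Dict[str, str] = field(default_factory=dict)


# A plain label can say nothing beyond its text and language tag.
plain = PlainLabel("Salmón", "es")

# An XL label is a resource, so further information can be attached to it.
xl = XLLabel(
    uri="http://example.org/label/kiwicha",  # made-up URI
    literal_form=PlainLabel("Kiwicha", "es"),  # the "es" tag is an assumption
)
xl.extra["isNameUsedInRegion"] = "Cusco"
```

The plain label has no place to record where the word is used, while the label-as-entity does.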


Page 13: Caracciolo et al_2012_aos_agrovoc_multilinguality

AGROVOC uses ISO 639 2-letter codes to tag languages in xml:lang

• ISO 639 provides codes for languages independently of

– the country where they are spoken:

  • Spanish, Basque (same country, both official languages)

  • Dutch, Flemish (different countries, similar enough languages…)

– and their status: French and Breton (same country, Breton has no official status)

• Only one code for English, Spanish…

• Limitations shown by the previous examples
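A small sketch of the limitation, using the two avocado names from the earlier slide. It assumes both names would be tagged "es" in xml:lang; the list layout is only for illustration.

```python
from collections import defaultdict

# (label, xml:lang) pairs for one concept; tagging both names "es" is an assumption
avocado_labels = [
    ("Palta", "es"),
    ("Aguacate", "es"),
]

# Group the labels by their language tag, as a consumer of the thesaurus might do.
by_lang = defaultdict(list)
for text, lang in avocado_labels:
    by_lang[lang].append(text)

# Both names fall under the same single code, so the tag alone
# cannot tell a user in which area each name is used.
distinct_tags = set(by_lang)
```

With only one code for Spanish, the two names are indistinguishable by tag.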

Page 14: Caracciolo et al_2012_aos_agrovoc_multilinguality

Multilinguality

ISO 639 language codes


Page 15: Caracciolo et al_2012_aos_agrovoc_multilinguality

Is the ISO 639 3-letter version an option?

• More languages are included

– More contemporary languages

  • Bemba language

– “Old” languages (no longer spoken)

  • Old French (ca. 842-1400)

– Groups of languages

  • Caucasian languages

– Artificial languages

• Same approach as the 2-letter version


Page 16: Caracciolo et al_2012_aos_agrovoc_multilinguality

Is IETF an option?

• Internet Engineering Task Force (IETF)

• RFC 5646, “Tags for Identifying Languages”

– Basis is ISO for languages (639)

– Subtags from ISO for countries (3166) and ISO for scripts (15924)

• Examples:

– tr-CY = Turkish from Cyprus

– zh-Hant-HK = Chinese in traditional Chinese script, as used in Hong Kong
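The structure of the two example tags can be made explicit with a small parser. This is a simplified sketch that handles only the language[-script][-region] shape seen in the examples; the full RFC 5646 grammar (variants, extensions, private use, grandfathered tags) is not covered, and the names `LanguageTag` and `parse_tag` are made up for the illustration.

```python
from typing import NamedTuple, Optional


class LanguageTag(NamedTuple):
    language: str           # ISO 639 code
    script: Optional[str]   # ISO 15924 code, if present
    region: Optional[str]   # ISO 3166 code (or 3-digit area code), if present


def parse_tag(tag: str) -> LanguageTag:
    """Split a simple language[-script][-region] tag into its subtags."""
    parts = tag.split("-")
    language = parts[0].lower()
    script: Optional[str] = None
    region: Optional[str] = None
    for part in parts[1:]:
        is_script = len(part) == 4 and part.isalpha()
        is_region = (len(part) == 2 and part.isalpha()) or (
            len(part) == 3 and part.isdigit()
        )
        if is_script and script is None and region is None:
            script = part.title()   # scripts are conventionally written "Hant"
        elif is_region and region is None:
            region = part.upper()   # regions are conventionally written "HK"
        else:
            raise ValueError(f"unsupported subtag: {part!r}")
    return LanguageTag(language, script, region)


turkish_cyprus = parse_tag("tr-CY")
chinese_hk = parse_tag("zh-Hant-HK")
```

Note that the region subtag is limited to what ISO 3166 offers, so an area such as Cusco would still have no code of its own.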


Page 17: Caracciolo et al_2012_aos_agrovoc_multilinguality

Is a relational approach an option?

• Keep the tagging approach to mark the language

– Use ISO 639 or IETF

• And introduce a relational notion of “where a given word is used”

• Link together a concept representing a geographic area and the object to name

– E.g., Kiwicha isNameUsedInRegion Cusco

• Aim at “standard” relations…
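The relational idea can be sketched as a list of subject-predicate-object statements. The relation name isNameUsedInRegion and the name/region pairs come from the slides; the `Statement` type and the two helper functions are assumptions made for this illustration, and plain strings stand in for the concepts and labels that would be linked in the thesaurus.

```python
from typing import List, NamedTuple


class Statement(NamedTuple):
    subject: str    # the name (label)
    predicate: str  # the relation
    obj: str        # the geographic area


IS_NAME_USED_IN_REGION = "isNameUsedInRegion"

# Name/region pairs taken from the earlier slide on regional names
statements = [
    Statement("Kiwicha", IS_NAME_USED_IN_REGION, "Cusco"),
    Statement("Achita", IS_NAME_USED_IN_REGION, "Ayacucho"),
    Statement("Coyos", IS_NAME_USED_IN_REGION, "Cajamarca"),
]


def names_used_in(region: str, data: List[Statement]) -> List[str]:
    """Return the names recorded as used in the given region."""
    return [
        s.subject
        for s in data
        if s.predicate == IS_NAME_USED_IN_REGION and s.obj == region
    ]


def regions_for(name: str, data: List[Statement]) -> List[str]:
    """Return the regions where the given name is recorded as used."""
    return [
        s.obj
        for s in data
        if s.predicate == IS_NAME_USED_IN_REGION and s.subject == name
    ]
```

Because the area is an ordinary value rather than a fixed code, it can be a city, a region, a country, or a group of countries, and a name with no statement simply has no area specified.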


Page 18: Caracciolo et al_2012_aos_agrovoc_multilinguality

Conclusions?

• This is work in progress

• We continue working out use cases, especially from Spanish and Portuguese

• Assess alternatives
