28
Turning three overlapping thesauri into a G lobal A gricultural C oncept S cheme SWIB14, Bonn, 3 December 2014 Osma Suominen and Thomas Baker

Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Turning three overlapping thesauriinto a Global Agricultural Concept Scheme

SWIB14, Bonn, 3 December 2014

Osma Suominen and Thomas Baker

Page 2: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Outline

1. Background

2. Starting point: three thesauri

3. Creating GACS

4. Challenges

5. Next steps and future of GACS

Page 3: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Background● Food and Agriculture Organization of the UN● CABI (UK)● National Agricultural Library (US)

Each organization maintains a thesaurus of terms and concepts related to agriculture -- concepts like rice, ricefield aquaculture, and plant pests.

Page 4: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Global Agricultural Concept Scheme (GACS)

1. To improve the semantic interoperability of thesauri maintained by FAO, CABI, and NAL.

2. To provide core concepts broadly supported across the three thesauri.

3. To achieve efficiencies of scale by maintaining the core concepts in cooperation.

Page 5: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Three Thesauri

Page 6: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Separate thesauri, separate databasesCreate GACS as a glue linking them together

Page 7: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

AGROVOC CAB Thesaurus NAL Thesaurus

140,000concepts,

>1.4M terms

32,000concepts,

>1.2M terms

53,000concepts,

>200k terms

English, Spanish, Portuguese, German, Czech, Persian, Polish, Hindi, French, Italian, Russian, Japanese, Hungarian, Chinese, Slovak, Thai, Lao, Turkish, Korean, Arabic, Telugu ...

English, Spanish, Portuguese, Dutch+ many languages with lower coverage

English, Spanish

All thesauri represented using SKOS

Page 8: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Overlap estimateObtained via automatic mappings created using AgreementMakerLight

Page 9: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Long tail distribution (in AGRIS)10,000 concepts cover nearly 99% of occurrences in metadata

Page 10: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Creating GACS

Page 11: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Requirements and Wishes1. An integrated view and bridge of existing thesauri2. Reuses thesaurus development work, incl. translations3. Compatible with existing databases4. Based on RDF technologies: URIs, SKOS etc.5. Available as Linked Open Data

Currently building GACS Beta, a proof-of-concept implementation attempting to fulfill most requirements

Page 12: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Selection of top 10,000 conceptsEach partner organization provided the 10,000 concepts most frequently used in their respective databases.

These lists of concepts were modified as follows:● added all countries (from

AGROVOC)● added organisms hierarchy all

the way to the top

Page 13: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Automated mappingsCreated using AgreementMakerLight softwarebetween the full thesauri, for completeness

AgreementMakerLight was top performer atOAEI 2014 ontology mapping competition!

Page 14: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Human evaluation of mappingsCreated Google Docs spreadsheets using the lists of selected concepts and the auto-generated mappings. Three sheets with circa 10,700 rows each.

Mappings manually evaluated bystaff of partner organizations.

Evaluated 60 to 150 rows/hour,total evaluation time over 300hours so far.Currently projected to take500-600 hours for GACS Beta.

Page 15: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Forming GACS conceptsby merging the source concepts and aggregating their information

riceUF paddyUF paddy rice

cerealsUF feed cerealsUF small grain cereals (grain)

Oryza sativaUF Oryza glutinosaUF Oryza indicaUF Oryza japonicaUF Oryza sativa … (subsp, var etc.)

OryzaUF PadiaUF rice (plant)

agrovoc:c_5435cabt:82917nalt:56271

exactMatch

agrovoc:c_5438cabt:82935nalt:56277

exactMatch

agrovoc:c_1474cabt:26247

exactMatch

agrovoc:c_6599cabt:101613

nalt:56293

exactMatch

(actually we use SKOS, not traditional thesaurus tags)

Page 16: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Size of GACS

GACSGACS Beta will have around14,000 of the most used concepts

Page 17: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Quality evaluationUsing the qSKOS and Skosify tools that can find and correct problems in SKOS vocabularies [1], we can detect● missing, invalid or overlapping concept labels● anomalies in concept hierarchy, e.g. cycles● ...and many other kinds of problems.

Many problems are expected due to merging of concepts within GACS, but most should be automatically corrected.

[1] Osma Suominen and Christian Mader: Assessing and Improving the Quality of SKOS Vocabularies. JoDS, 3(1) 2014.

Page 19: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Lessons already learned● It is hard to sustain focus on mapping beyond circa five hours per day.

● Mapping reveals issues with both the source and target thesauri -- areas for improvement, or errors, fixable in collaboration.

● Starting with the 10,000 most-used concepts shines a light on parts of thesauri that may long have lacked attention.

● Starting small, with a core, avoids the potential stress of over-committing resources.

● Mapping provides an incentive to adopt open-data technologies that can have prove beneficial in other areas.

Page 20: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Challenges

Page 21: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Differences in modelingQ: Are taxonomic organism names (e.g. ‘Bos taurus’) different concepts than the common names (‘cattle’)?● sometimes there is no 1:1 match

and/or context of use is different● the source thesauri all have different policies

No final answer yet...

Page 22: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Lumpsclusters of concepts mapped one-to-several, several-to-one, or in spirals

Page 23: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Next steps and future of GACS

Page 24: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Additional mapping roundsNeed to perform 2-3 moresmaller mapping roundsin order to ensure thatall necessary conceptshave been fully mappedbetween all source thesauri

Page 25: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

GACS system infrastructure

Page 26: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

VocBench for editing

Page 27: Osma Suominen and Thomas Baker SWIB14, Bonn, 3 December 2014swib.org/swib14/slides/suominen_swib14_30.pdf · AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000

Beyond GACS Beta?Q: Can GACS replace existing agricultural thesauri?● definitely not with GACS Beta due to smaller scope/size● a future GACS may be an alternative for some

scenarios, but not all uses of existing thesauri because ○ they cover areas beyond agriculture○ existing systems and processes (publication,

automatic indexing…) depend on current thesauri

In future, more partners are expected and the scope of GACS can be adjusted.