Turning three overlapping thesauri into a Global Agricultural Concept Scheme
SWIB14, Bonn, 3 December 2014
Osma Suominen and Thomas Baker
Outline
1. Background
2. Starting point: three thesauri
3. Creating GACS
4. Challenges
5. Next steps and future of GACS
Background
● Food and Agriculture Organization of the UN
● CABI (UK)
● National Agricultural Library (US)
Each organization maintains a thesaurus of terms and concepts related to agriculture -- concepts like rice, ricefield aquaculture, and plant pests.
Global Agricultural Concept Scheme (GACS)
1. To improve the semantic interoperability of thesauri maintained by FAO, CABI, and NAL.
2. To provide core concepts broadly supported across the three thesauri.
3. To achieve efficiencies of scale by maintaining the core concepts in cooperation.
Three Thesauri
Separate thesauri, separate databases
Create GACS as a glue linking them together
AGROVOC CAB Thesaurus NAL Thesaurus
140,000 concepts, >1.4M terms
32,000 concepts, >1.2M terms
53,000 concepts, >200k terms
English, Spanish, Portuguese, German, Czech, Persian, Polish, Hindi, French, Italian, Russian, Japanese, Hungarian, Chinese, Slovak, Thai, Lao, Turkish, Korean, Arabic, Telugu ...
English, Spanish, Portuguese, Dutch + many languages with lower coverage
English, Spanish
All thesauri represented using SKOS
Overlap estimate
Obtained via automatic mappings created using AgreementMakerLight
Long tail distribution (in AGRIS)
10,000 concepts cover nearly 99% of occurrences in metadata
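A coverage figure of this kind can be computed from per-concept usage counts. The following is a minimal sketch only: the function name and the toy counts are hypothetical and are not AGRIS data.

```python
from collections import Counter


def coverage_of_top(usage_counts, top_n):
    """Share of all occurrences accounted for by the top_n most used concepts."""
    total = sum(usage_counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in Counter(usage_counts).most_common(top_n))
    return top / total


# Hypothetical toy counts (concept -> number of occurrences in metadata).
toy_counts = {"rice": 900, "cereals": 60, "Oryza": 30, "Padia": 10}
print(coverage_of_top(toy_counts, 2))  # 0.96
```

With a long-tailed distribution, a small top_n already yields a coverage close to 1.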
Creating GACS
Requirements and Wishes
1. An integrated view and bridge of existing thesauri
2. Reuses thesaurus development work, incl. translations
3. Compatible with existing databases
4. Based on RDF technologies: URIs, SKOS etc.
5. Available as Linked Open Data
Currently building GACS Beta, a proof-of-concept implementation attempting to fulfill most requirements
Selection of top 10,000 concepts
Each partner organization provided the 10,000 concepts most frequently used in their respective databases.
These lists of concepts were modified as follows:
● added all countries (from AGROVOC)
● added organisms hierarchy all the way to the top
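The selection step can be sketched as "take the most used concepts, then add their ancestors up to the top of the hierarchy". This is an illustration under assumptions, not the partners' actual procedure: the function name, the toy usage counts, and the toy hierarchy are hypothetical, and for simplicity the sketch adds ancestors for every selected concept rather than for organisms only.

```python
def select_with_ancestors(usage_counts, broader, top_n):
    """Pick the top_n most used concepts, then add every ancestor up to the top."""
    # Rank by descending usage; ties are broken alphabetically for determinism.
    ranked = sorted(usage_counts, key=lambda c: (-usage_counts[c], c))
    selected = set(ranked[:top_n])
    stack = list(selected)
    while stack:
        concept = stack.pop()
        for parent in broader.get(concept, ()):
            if parent not in selected:
                selected.add(parent)
                stack.append(parent)
    return selected


# Hypothetical toy data: usage counts and a simplified broader-concept hierarchy.
usage = {"Oryza sativa": 50, "Bos taurus": 40, "Padia": 1}
broader = {
    "Oryza sativa": ["Oryza"],
    "Oryza": ["Poaceae"],
    "Poaceae": ["organisms"],
    "Bos taurus": ["Bos"],
    "Bos": ["organisms"],
}
print(sorted(select_with_ancestors(usage, broader, 1)))
```

Adding ancestors explains why a list that starts at 10,000 concepts can end up somewhat longer.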
Automated mappings
Created using AgreementMakerLight software between the full thesauri, for completeness
AgreementMakerLight was top performer at the OAEI 2014 ontology mapping competition!
Human evaluation of mappings
Created Google Docs spreadsheets using the lists of selected concepts and the auto-generated mappings. Three sheets with circa 10,700 rows each.
Mappings manually evaluated by staff of partner organizations.
Evaluated 60 to 150 rows/hour, total evaluation time over 300 hours so far. Currently projected to take 500-600 hours for GACS Beta.
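As a rough cross-check of these figures, the sheet sizes and evaluation rates can be combined into an estimate of total effort. This sketch assumes a single pass over every row, which the slides do not state; the function name is hypothetical.

```python
def evaluation_hours(sheets, rows_per_sheet, rows_per_hour):
    """Hours needed to evaluate every row once at a constant rate."""
    return sheets * rows_per_sheet / rows_per_hour


rows_total = 3 * 10_700
fastest = evaluation_hours(3, 10_700, 150)  # at 150 rows/hour
slowest = evaluation_hours(3, 10_700, 60)   # at 60 rows/hour
print(rows_total, round(fastest), round(slowest))  # 32100 214 535
```

Under that assumption the 500-600 hour projection sits near the slow end of the observed rates.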
Forming GACS concepts
by merging the source concepts and aggregating their information
rice
  UF paddy
  UF paddy rice
cereals
  UF feed cereals
  UF small grain cereals (grain)
Oryza sativa
  UF Oryza glutinosa
  UF Oryza indica
  UF Oryza japonica
  UF Oryza sativa … (subsp, var etc.)
Oryza
  UF Padia
  UF rice (plant)

exactMatch: agrovoc:c_5435, cabt:82917, nalt:56271
exactMatch: agrovoc:c_5438, cabt:82935, nalt:56277
exactMatch: agrovoc:c_1474, cabt:26247
exactMatch: agrovoc:c_6599, cabt:101613, nalt:56293
(actually we use SKOS, not traditional thesaurus tags)
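The merging step can be sketched as grouping source concepts that are linked by exactMatch and pooling their labels. This is a minimal illustration, not the GACS implementation: the function name and the concept identifiers are hypothetical, and labels are treated as plain strings rather than SKOS prefLabel/altLabel values.

```python
def merge_concepts(labels, exact_matches):
    """Group source concepts linked by exactMatch and pool their labels."""
    parent = {concept: concept for concept in labels}

    def find(concept):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[concept] != concept:
            parent[concept] = parent[parent[concept]]
            concept = parent[concept]
        return concept

    for a, b in exact_matches:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the alphabetically smallest identifier as the cluster root.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    merged = {}
    for concept in parent:
        entry = merged.setdefault(find(concept), {"members": set(), "labels": set()})
        entry["members"].add(concept)
        entry["labels"].update(labels.get(concept, ()))
    return merged


# Hypothetical identifiers; the labels echo the "rice" example on the slide.
labels = {
    "agrovoc:x1": {"rice"},
    "cabt:x2": {"rice", "paddy"},
    "nalt:x3": {"rice", "paddy rice"},
    "agrovoc:y1": {"cereals"},
}
matches = [("agrovoc:x1", "cabt:x2"), ("cabt:x2", "nalt:x3")]
merged = merge_concepts(labels, matches)
print(sorted(merged["agrovoc:x1"]["labels"]))  # ['paddy', 'paddy rice', 'rice']
```

Because exactMatch is treated as transitive here, a chain of pairwise mappings collapses into one merged concept.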
Size of GACS
GACS Beta will have around 14,000 of the most used concepts
Quality evaluation
Using the qSKOS and Skosify tools that can find and correct problems in SKOS vocabularies [1], we can detect
● missing, invalid or overlapping concept labels
● anomalies in concept hierarchy, e.g. cycles
● ...and many other kinds of problems.
Many problems are expected due to merging of concepts within GACS, but most should be automatically corrected.
[1] Osma Suominen and Christian Mader: Assessing and Improving the Quality of SKOS Vocabularies. JoDS, 3(1) 2014.
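To make two of these checks concrete, here is a small sketch of detecting a hierarchy cycle and finding concepts without labels. It does not reproduce how qSKOS or Skosify work internally; the function names and the toy data are hypothetical, and the recursive search is only suitable for small hierarchies.

```python
def find_hierarchy_cycle(broader):
    """Return one cycle in the broader relation as a list of concepts, or None."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(node, path):
        state[node] = IN_PROGRESS
        path.append(node)
        for parent in broader.get(node, ()):
            parent_state = state.get(parent, UNSEEN)
            if parent_state == IN_PROGRESS:
                # The parent is already on the current path: a cycle is closed.
                return path[path.index(parent):] + [parent]
            if parent_state == UNSEEN:
                cycle = visit(parent, path)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in list(broader):
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


def concepts_without_labels(labels):
    """Return the concepts whose label collection is empty."""
    return sorted(concept for concept, values in labels.items() if not values)


# Hypothetical toy data.
print(find_hierarchy_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))  # ['a', 'b', 'c', 'a']
print(find_hierarchy_cycle({"rice": ["cereals"], "cereals": []}))  # None
print(concepts_without_labels({"x": ["rice"], "y": []}))           # ['y']
```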
Demo of GACS Alpha in Skosmos
Lessons already learned
● It is hard to sustain focus on mapping beyond circa five hours per day.
● Mapping reveals issues with both the source and target thesauri -- areas for improvement, or errors, fixable in collaboration.
● Starting with the 10,000 most-used concepts shines a light on parts of thesauri that may long have lacked attention.
● Starting small, with a core, avoids the potential stress of over-committing resources.
● Mapping provides an incentive to adopt open-data technologies that can prove beneficial in other areas.
Challenges
Differences in modeling
Q: Are taxonomic organism names (e.g. ‘Bos taurus’) different concepts than the common names (‘cattle’)?
● sometimes there is no 1:1 match and/or context of use is different
● the source thesauri all have different policies
No final answer yet...
Lumps
clusters of concepts mapped one-to-several, several-to-one, or in spirals
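One possible way to find such lumps automatically is to look for connected clusters of mappings in which a single source thesaurus contributes more than one concept. The slides do not define lumps formally, so this definition is an assumption, as are the function name and the identifiers (a source prefix, a colon, then a made-up local name).

```python
from collections import Counter, defaultdict


def find_lumps(mappings):
    """Return connected clusters where one source contributes several concepts."""
    neighbours = defaultdict(set)
    for a, b in mappings:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, lumps = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Collect everything reachable from this concept via mappings.
        component, queue = set(), [start]
        while queue:
            node = queue.pop()
            if node in component:
                continue
            component.add(node)
            queue.extend(neighbours[node] - component)
        seen |= component
        # Count members per source prefix, e.g. "cabt" in "cabt:cattle".
        per_source = Counter(concept.split(":", 1)[0] for concept in component)
        if any(count > 1 for count in per_source.values()):
            lumps.append(sorted(component))
    return lumps


# Hypothetical mappings: one AGROVOC concept mapped to two CAB Thesaurus concepts.
mappings = [
    ("agrovoc:cattle", "cabt:cattle"),
    ("agrovoc:cattle", "cabt:Bos_taurus"),
    ("agrovoc:rice", "nalt:rice"),
]
print(find_lumps(mappings))
```

In this toy input the cattle cluster is reported as a lump, while the clean one-to-one rice mapping is not.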
Next steps and future of GACS
Additional mapping rounds
Need to perform 2-3 more smaller mapping rounds in order to ensure that all necessary concepts have been fully mapped between all source thesauri
GACS system infrastructure
VocBench for editing
Beyond GACS Beta?
Q: Can GACS replace existing agricultural thesauri?
● definitely not with GACS Beta due to smaller scope/size
● a future GACS may be an alternative for some scenarios, but not all uses of existing thesauri because
  ○ they cover areas beyond agriculture
  ○ existing systems and processes (publication, automatic indexing…) depend on current thesauri
In future, more partners are expected and the scope of GACS can be adjusted.
Thank you
Reports available on the FAO AIMS site:
http://aims.fao.org/community/agrovoc/blogs/phase-one-gacs-approved-read-reports
These slides: http://tinyurl.com/swib14-gacs