Wikidata, a target for Europeana’s semantic strategy Valentine Charles, Hugo Manguinhas, Antoine Isaac: Europeana Vladimir Alexiev: Ontotext Corp GLAM Wiki 2015, Den Haag



Page 1: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Wikidata, a target for Europeana’s semantic strategy

Valentine Charles, Hugo Manguinhas, Antoine Isaac: Europeana

Vladimir Alexiev: Ontotext Corp

GLAM Wiki 2015, Den Haag

Page 2: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Europeana.eu, Europe’s cultural heritage portal

40M objects from 2,200 galleries, museums, archives and libraries

Page 3: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Europeana has many data challenges: diversity

Aggregates metadata from the cultural heritage sector in Europe

• Large amount of references to places, agents, concepts, time

Page 4: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Europeana has many data challenges: diversity

Metadata in more than 30 languages

From all EU countries

Page 5: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Europeana’s priority 1: Improve data quality

Europeana Data Model (EDM), a framework for richer data

• Re-uses several existing Semantic Web-based models

Dublin Core, OAI-ORE, SKOS, CIDOC-CRM…

• EDM gives support for contextual resources (semantic layer)

Rely on vocabularies to solve a problem of data interlinking

• Encourage data providers to contribute their own vocabularies and benefit from data links made at data providers’ level

Page 6: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Vocabularies currently provided to Europeana

Page 7: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Europeana also manages its own vocabularies

Page 8: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Europeana performs automatic enrichment based on vocabularies

Goal: Contextualization which reaches outside the scope of a particular platform


Page 9: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Automatic enrichment process in Europeana

Analysis

• Selection of metadata fields in resource descriptions

• Selection of potential rules to match

Linking

• Matching the values of the metadata fields to values of the contextual resources

• Adding contextual links

Augmentation

• Selecting the values from the contextual resource

• Augmentation of the search index with the labels from the vocabulary
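The three stages above (Analysis, Linking, Augmentation) can be sketched as follows. This is only a minimal illustration, not Europeana's actual implementation: the vocabulary, its example.org URIs, the `Record` class and the exact-label matching rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical toy vocabulary: URI -> labels per language (not real Europeana data).
VOCABULARY = {
    "http://example.org/place/paris": {"en": "Paris", "fr": "Paris"},
    "http://example.org/place/the-hague": {"en": "The Hague", "nl": "Den Haag"},
}

# Analysis: the metadata fields selected for matching (assumed rule).
FIELDS_TO_ENRICH = ("dcterms:spatial", "dc:coverage")


@dataclass
class Record:
    """Hypothetical stand-in for one resource description."""
    metadata: dict
    links: list = field(default_factory=list)
    index_labels: set = field(default_factory=set)


def build_label_index(vocabulary):
    """Map every case-folded label to the URI of its contextual resource."""
    index = {}
    for uri, labels in vocabulary.items():
        for label in labels.values():
            index.setdefault(label.casefold(), uri)
    return index


def enrich(record, vocabulary=VOCABULARY, fields=FIELDS_TO_ENRICH):
    label_index = build_label_index(vocabulary)
    for name in fields:  # Analysis: only the selected fields are inspected
        for value in record.metadata.get(name, []):
            # Linking: match the field value against vocabulary labels
            uri = label_index.get(value.strip().casefold())
            if uri and uri not in record.links:
                record.links.append(uri)
                # Augmentation: add all labels of the resource to the search index
                record.index_labels.update(vocabulary[uri].values())
    return record
```

With this sketch, a record whose `dcterms:spatial` value is "den haag" is linked to the hypothetical The Hague resource and becomes findable under both "The Hague" and "Den Haag". A value in a field outside `FIELDS_TO_ENRICH` is ignored.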

Page 10: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Enrichment Types and Current Vocabularies

Enrichment type | Target vocabulary | Source metadata fields
Places | GeoNames | dcterms:spatial, dc:coverage
Concepts | GEMET, DBpedia | dc:subject, dc:type
Agents | DBpedia | dc:creator, dc:contributor
Time | Semium Time | dc:date, dc:coverage, dcterms:temporal, edm:year
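The table above could be expressed as a small configuration structure. The following is a sketch under the assumption that the rules are keyed by enrichment type; the dictionary layout and the function name `enrichment_types_for` are invented for illustration and are not Europeana's configuration format.

```python
# Enrichment rules transcribed from the slide's table; the layout is an assumption.
ENRICHMENT_RULES = {
    "Places": {
        "vocabularies": ["GeoNames"],
        "fields": ["dcterms:spatial", "dc:coverage"],
    },
    "Concepts": {
        "vocabularies": ["GEMET", "DBpedia"],
        "fields": ["dc:subject", "dc:type"],
    },
    "Agents": {
        "vocabularies": ["DBpedia"],
        "fields": ["dc:creator", "dc:contributor"],
    },
    "Time": {
        "vocabularies": ["Semium Time"],
        "fields": ["dc:date", "dc:coverage", "dcterms:temporal", "edm:year"],
    },
}


def enrichment_types_for(field_name):
    """Return the enrichment types whose rules use the given metadata field."""
    return [
        enrichment_type
        for enrichment_type, rule in ENRICHMENT_RULES.items()
        if field_name in rule["fields"]
    ]
```

Note that `dc:coverage` appears under both Places and Time, so a value in that field is a candidate for two kinds of enrichment.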

Page 11: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Europeana enrichment - an example

Page 12: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

How does Wikidata fit into Europeana’s semantic strategy?

Page 13: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Wikipedia's Relevance for Cultural Heritage

Authority Lists and Thesauri have central importance in CH

Wikipedia being "the sum of all knowledge" has broader reach than any institutional authority list

Only large-scale aggregations like VIAF (35 institutions) and LCSH (about 10 libraries around LoC) are comparable

While some facts are inaccurate and disputable, Wikipedia has a great role as a source of stable URLs on all kinds of topics

Page 14: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

How Big is Wikidata?

Name data sources for semantic enrichment (Europeana Creative D2.4) gives DBpedia and Wikidata stats

Wikidata: 3y old, 14M items, 209M edits

2.7M humans, 5k families, 22k literary characters

215k organizations

66k creative orgs (bands, radio/TV stations, newspapers…)

30k educational institutions

20k non-profit orgs

13k GLAM orgs: 0.5k galleries, 1k libraries, 0.2k archives, 9k museums

500k creative works

110k heritage sites and monuments

40k family names, 20k first names

Page 15: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Is this big enough?

Wikidata: 2.7M humans, 215k organizations, 800k places, 500k works

VIAF: 35M personal names, 5.4M orgs/conferences, 410k places, 1.7M works

GeoNames: 9M places

Only 1.1M persons are coreferenced, see Authority Addicts: The New Frontier of Authority Control on Wikidata

VIAF is much bigger, but Wikidata is still very important for GLAM:

Wikidata is active in Authority Control and Coreferencing

(VIAF) Moving to Wikidata: will get 1M persons/orgs, and many multilingual names (see next)

Authority Files have barely more than names & dates; Wikipedia often has a lot more info

Page 16: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Wikidata Multilingual Coverage

Wikidata/DBpedia has huge multilingual coverage

Each entity is represented in 2.11 Wikipedias on average (see Europeana food and drink classification scheme, EFD D2.2)

But popular entities are present in many more (up to 180); and even in one Wikipedia there are many languages

E.g. Lucas Cranach in Wikidata: 57 lang tags, representing 44 languages and 13 language variants

Languages are consistently marked

Important for semantic enrichment (Named Entity Recognition)

Even though language labels in Europeana are not consistent
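A count such as "57 lang tags, representing 44 languages and 13 language variants" can be derived by splitting the tags into plain language codes and tags that carry a subtag. The sketch below assumes that a variant is any tag containing a hyphen (for example `de-ch`); the function name and the sample tags are illustrative, not taken from the actual Wikidata item.

```python
def split_language_tags(tags):
    """Partition language tags into plain languages and variants.

    Assumption: a tag with a hyphenated subtag (e.g. "pt-br") is a variant,
    a tag without one (e.g. "pt") is a language. Tags are compared
    case-insensitively and duplicates are collapsed.
    """
    languages, variants = set(), set()
    for tag in tags:
        tag = tag.strip().lower()
        if not tag:
            continue
        if "-" in tag:
            variants.add(tag)
        else:
            languages.add(tag)
    return languages, variants


# Illustrative sample only, not the real label set of any Wikidata item.
sample_tags = ["de", "de-ch", "en", "en-gb", "en-ca", "pt", "pt-br", "zh", "zh-hans", "EN"]
languages, variants = split_language_tags(sample_tags)
```

For the sample above the function yields 4 languages and 5 variants, 9 distinct tags in total.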

Page 17: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Name Variants for Lucas Cranach

Wikidata and VIAF each have 70 variants and dominate the "Wikipedia tradition" and "Library tradition" datasets respectively (see Name data sources for semantic enrichment)

Only 5 variants are in common (see Interactive Venn diagram)

Excellent complementarity. VIAF has more variants, Wikidata more multilingual names

VIAF's move to sync to Wikidata will narrow the gap
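The overlap between two sets of name variants, as in the Venn diagram mentioned above, can be computed with plain set operations once the names are normalised. This is a sketch under assumptions: the normalisation (Unicode NFKC, case folding, whitespace collapsing) is one possible choice, and the sample names are illustrative rather than the actual variant lists.

```python
import unicodedata


def normalise_name(name):
    """Normalise a name variant for comparison (assumed rule, see lead-in)."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return " ".join(folded.split())


def compare_variants(variants_a, variants_b):
    """Return (common, only_a, only_b) as sets of normalised names."""
    a = {normalise_name(n) for n in variants_a}
    b = {normalise_name(n) for n in variants_b}
    return a & b, a - b, b - a


# Illustrative samples only.
wikidata_names = {"Lucas Cranach the Elder", "Lucas Cranach der Ältere", "Лукас Кранах Старший"}
viaf_names = {"Cranach, Lucas", "lucas  cranach the elder", "Cranach, Lucas, the Elder"}
common, only_wikidata, only_viaf = compare_variants(wikidata_names, viaf_names)
```

A small `common` set next to large `only_wikidata` and `only_viaf` sets is what the slide describes as complementarity: each source contributes names the other lacks.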

Page 18: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Wikidata is connected to other vocabularies

Europeana prefers using pivot vocabularies

• that are connected to many other vocabularies

• It is key to avoid duplication and redundancy

Wikidata has a lot of coreferences to other vocabularies that can be used to create extra links and extract missing data

• https://www.wikidata.org/wiki/Wikidata:WikiProject_Authority_control

• https://twitter.com/hashtag/coreferencing: shots and news

• Please tweet!

Page 19: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

VIAF-Wikidata Coreferences for Lucas Cranach

Can be leveraged to fill the gaps, e.g. bring RKDartists into VIAF

VIAF id in VIAF | Wikidata id in Wikidata
viafID 49268177 | VIAF 49268177
BAV ADV10197613 | -
BNC .a10853637 | -
BNE XX907273 | -
BNF cb12176451h | BNF 12176451h
DNB 118522582 | GND 118522582
ISNI 0000000121319721 | ISNI 0000 0001 2131 9721
JPG 500115364 | ULAN 500115364
LC n50020861 | LCCN n50020861
LNB LNC10-000002573 | -
NDL 00436834 | -
NKC jn20000700335 | -
NLA 000035031951 | -
NLI 000035532,001445575,001448179 | -
NLP a16828161 | -
NTA 068435312 | NTA PPN 068435312
NUKAT vtls000190728 | -
SELIBR 182422 | -
SUDOC 028710010 | -
WKP Lucas_Cranach_the_Elder | Many Wikipedias
IMAGINE T7238,T267474 | -
- | Cantic a10853637
- | Commons Creator Lucas Cranach (I)
- | Commons category Lucas Cranach d. Ä.
- | Freebase /m/0kqp0
- | RKDartists 18978
- | SIMBAD CRANACH, Lucas the Elder
- | Your Paintings lucas-the-elder-cranach
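The "fill the gaps" idea can be sketched by comparing the two identifier lists. The example below uses a subset of the identifiers as read from the slide's table. The mapping between VIAF source codes and Wikidata labels, the normalisation rules (dropping spaces, stripping the "cb" prefix of BNF ids) and all function names are assumptions made for illustration.

```python
# Assumed correspondence between VIAF source codes and Wikidata identifier labels.
VIAF_TO_WIKIDATA_LABEL = {
    "viafID": "VIAF", "BNF": "BNF", "DNB": "GND", "ISNI": "ISNI",
    "JPG": "ULAN", "LC": "LCCN", "NTA": "NTA PPN",
}

# Subset of the Lucas Cranach identifiers shown on the slide.
viaf_ids = {
    "viafID": "49268177", "BNF": "cb12176451h", "DNB": "118522582",
    "ISNI": "0000000121319721", "JPG": "500115364", "LC": "n50020861",
    "NTA": "068435312", "SELIBR": "182422", "SUDOC": "028710010",
}
wikidata_ids = {
    "VIAF": "49268177", "BNF": "12176451h", "GND": "118522582",
    "ISNI": "0000 0001 2131 9721", "ULAN": "500115364", "LCCN": "n50020861",
    "NTA PPN": "068435312", "RKDartists": "18978", "Freebase": "/m/0kqp0",
}


def normalise(source, value):
    """Make ids comparable: drop spaces and the BNF 'cb' prefix (assumed rules)."""
    value = value.replace(" ", "")
    if source == "BNF" and value.startswith("cb"):
        value = value[2:]
    return value


def compare_coreferences(viaf, wikidata):
    """Return ids that agree, ids that conflict, and the gaps on each side."""
    shared, conflicts = [], []
    for code, value in viaf.items():
        label = VIAF_TO_WIKIDATA_LABEL.get(code)
        if label is None or label not in wikidata:
            continue
        if normalise(code, value) == normalise(label, wikidata[label]):
            shared.append(label)
        else:
            conflicts.append(label)
    missing_in_wikidata = sorted(
        code for code in viaf if VIAF_TO_WIKIDATA_LABEL.get(code) not in wikidata
    )
    known_labels = set(VIAF_TO_WIKIDATA_LABEL.values())
    missing_in_viaf = sorted(label for label in wikidata if label not in known_labels)
    return sorted(shared), sorted(conflicts), missing_in_wikidata, missing_in_viaf


shared, conflicts, missing_in_wikidata, missing_in_viaf = compare_coreferences(
    viaf_ids, wikidata_ids
)
```

In this sketch RKDartists shows up in `missing_in_viaf`, which corresponds to the slide's example of bringing RKDartists into VIAF.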

Page 20: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Wikidata Coreferencing (1)

Excellent Mix-n-Match tool by Magnus Manske. 54 catalogs loaded!!

Decent auto-matching and excellent crowd-sourcing features

Page 21: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Wikidata Coreferencing (2)

Excellent Authority Control navbox in Wikipedia

E.g. matching British Museum person-institution thesaurus (currently not coreferenced to anything: high value to BM)

Page 22: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Europeana Food and Drink

How do you define such a wide area as Food and Drink, which is so pervasive in everyday life and culture?

Europeana food and drink classification scheme (EFD D2.2, or presentation) studies ~20 datasets for relevance to FD

Concludes that Wikipedia is our playing ground, and we should try to use Wikipedia Categories to delineate the topic

• AGROVOC has 32k concepts but on production/science

• Wikipedia/DBpedia has 6.6k proper Foods (with infoboxes and ingredients)

• But I estimate 0.6-1.2M things relevant to FD in all Wikipedias

Background image: 2 levels of Food_and_drink cat hierarchy
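Delineating a topic with Wikipedia Categories amounts to walking the category hierarchy from a root category down to a chosen depth. The following is a minimal sketch of such a depth-limited traversal. The category graph is a made-up toy, not real Wikipedia data, and the function name is hypothetical.

```python
from collections import deque

# Toy category graph (parent -> subcategories); invented for illustration.
# It deliberately contains a cycle, since category hierarchies can have them.
SUBCATEGORIES = {
    "Food_and_drink": ["Foods", "Drinks", "Cooking"],
    "Foods": ["Cheese", "Bread"],
    "Drinks": ["Beer"],
    "Cheese": ["Dutch_cheeses"],
    "Beer": ["Food_and_drink"],
}


def categories_within(subcategories, root, max_depth):
    """Breadth-first walk returning {category: depth} up to max_depth levels."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        category = queue.popleft()
        if depths[category] == max_depth:
            continue  # do not expand beyond the depth limit
        for child in subcategories.get(category, []):
            if child not in depths:  # guards against cycles and repeats
                depths[child] = depths[category] + 1
                queue.append(child)
    return depths


two_levels = categories_within(SUBCATEGORIES, "Food_and_drink", max_depth=2)
```

With `max_depth=2` the walk reaches "Cheese" and "Beer" but stops before "Dutch_cheeses", mirroring the two levels shown in the slide's background image.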

Page 23: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Wikidata is Easily Accessible

It is important for Europeana to have the data

• Technically available:

• Data dump preferably as Linked Data (RDF)

• SPARQL end-point or other query mechanism (e.g. WDQ)

• Properly documented and structured

• Wikidata has an excellent Property Proposal process

• Wikidata integrity constraints are excellent

• In contrast, there is no Class creation process, so the classes are quite a mess (16k, of which 2/3 have fewer than 5 instances)

• Data templates should be made more visible and be used as references

• Open access
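As an illustration of the query-mechanism point, the sketch below builds (but does not send) a SPARQL request that looks up the Wikidata item carrying a given VIAF id. It assumes a SPARQL endpoint at `https://query.wikidata.org/sparql` and uses property P214 (VIAF ID). The function names are hypothetical, and the slide itself does not prescribe any particular endpoint.

```python
from urllib.parse import urlencode

# Assumed endpoint; not named on the slide.
ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(viaf_id):
    """Return a SPARQL query selecting items whose VIAF ID (P214) matches."""
    if not viaf_id.isdigit():
        raise ValueError("VIAF ids are expected to be numeric strings")
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        'SELECT ?item WHERE { ?item wdt:P214 "%s" }' % viaf_id
    )


def build_request_url(viaf_id):
    """Return the GET URL for the query, asking for JSON results."""
    params = {"query": build_query(viaf_id), "format": "json"}
    return ENDPOINT + "?" + urlencode(params)


# VIAF id of Lucas Cranach as given on the coreference slide.
url = build_request_url("49268177")
```

Validating the id before interpolating it keeps the query well-formed. Actually fetching `url` is left out so that the example has no network dependency.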

Page 24: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Wikidata Property Integrity Constraints

E.g. ULAN id constraints help to find records to merge / split

E.g. Communist Party of the Russian Federation has 5 LCNAF id's, what's up? Is it so popular with the Library of Congress?
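The two kinds of checks behind these examples can be sketched as follows: an item with several values for one identifier property (as in the LCNAF case above) and one identifier value shared by several items (a merge candidate). The data, item names and function names are hypothetical, and the sketch does not reproduce how Wikidata itself evaluates constraints.

```python
from collections import defaultdict

# Hypothetical statements: (item, property, value). Not real Wikidata content.
STATEMENTS = [
    ("item_A", "LCNAF", "id-1"),
    ("item_A", "LCNAF", "id-2"),
    ("item_A", "LCNAF", "id-3"),
    ("item_B", "ULAN", "id-9"),
    ("item_C", "ULAN", "id-9"),
    ("item_D", "ULAN", "id-7"),
]


def single_value_violations(statements, prop):
    """Items holding more than one distinct value for prop (split candidates)."""
    values = defaultdict(set)
    for item, p, value in statements:
        if p == prop:
            values[item].add(value)
    return {item: sorted(v) for item, v in values.items() if len(v) > 1}


def distinct_value_violations(statements, prop):
    """Values of prop used on more than one item (merge candidates)."""
    items = defaultdict(set)
    for item, p, value in statements:
        if p == prop:
            items[value].add(item)
    return {value: sorted(i) for value, i in items.items() if len(i) > 1}
```

Here `item_A` is reported for LCNAF because it carries three ids, and the ULAN value `id-9` is reported because two items share it.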

Page 25: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

How Wikidata will be used by Europeana

Semantic Enrichment of Europeana data with additional information

• With a specific focus on entities such as persons and concepts

Linking Europeana objects with Wikidata

• Approach similar to https://www.wikidata.org/wiki/Wikidata:WikiProject_sum_of_all_paintings

• But would be extended to the whole Europeana dataset

• Links would be added in the Europeana data

The structure (data template) for CH objects (e.g. paintings) is still not very rich on Wikidata, e.g. measurements are missing

• Improvements are made all the time, but see next

Page 26: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Wikidata Items as Linking Hubs

Still, they're great as stable URLs

Providing the basic info (who, when, where, what)

And acting as coreferencing hubs

I don't expect Wikidata CH objects to ever be described in the full richness & complexity of professional art research. E.g. see British Museum Mapping to CIDOC CRM

Page 27: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Wikidata and DBpedia

Wikidata and DBpedia are the two structured representations of Wikipedia

Wikidata: initially populated from Wikipedia, manually curated, will master structured data for Wikipedia. Synchronized through an assortment of bots

Data is fairly accurate but data depth is still small

DBpedia: automatically extracted from Wikipedia, live update, one-way extraction only.

Data reach is deep, but there are many problems in ontology and individual mappings, especially for non-English. E.g. United Nations is extracted as "Country". See DBpedia Ontology and Mapping Problems.

Should they be together?

Page 28: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

GLAMs should add to Wikipedia or Wikidata!

EFD project: Swiecenie Koszyczek, "blessing of the baskets", a colorful Polish tradition

There's no article in pl.wikipedia.org, so we can't relate such artifacts to anything

Content partner's museum staff have no time to make a proper Wikipedia article

But adding a Wikidata item is quick & easy

Appropriate categories (Easter Traditions, Easter-related Foods) will put it in context

Page 29: Wikidata, a target for Europeana’s semantic strategy (Glam-Wiki 2015)

Thank you

Valentine Charles, [email protected]

Vladimir Alexiev, [email protected]

Hugo Manguinhas, [email protected]