Understanding Change in Versioned KOS on the Web
Albert Meroño-Peñuela, Christophe Guéret, Stefan Schlobach
@albertmeronyo
Evolution and variation of classification systems – KnoweScape workshop
04-03-2015
CEDAR: Harmonizing Historical Census Data in the Semantic Web
CEDAR: Source Historical Data
Dutch Historical Censuses (1795-1971) [Public Historical Statistical Data]
From scans to spreadsheets
Uniform queries on the Web
1795 1830 1840 1849 1859 1869 1879 1889 1899 1909 1919 1920 1930 1947 1956 1960 1971
(through ~3K heterogeneous tables)
RDF Data Cube
“There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that they can be linked to related data sets and concepts.”
RDF Data Cube vocabulary (QB)
• SDMX compatible
• Defines cubes as a set of observations that consist of dimensions, measures and attributes
• Dimensions: time period, region, sex (qb:DimensionProperty)
• Measure: population life expectancy (qb:MeasureProperty)
• Attribute: unit of measure = years, metadata status = measured (qb:AttributeProperty)
Observation: “the measured life expectancy of males in Newport in the period 2004-2006 is 76.7 years”
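As a rough illustration of the QB model on this slide, the following Python sketch writes the life-expectancy observation as plain subject-predicate-object triples. Only qb:Observation and qb:dataSet are taken from the vocabulary; every eg: name (eg:o1, eg:refPeriod, eg:refArea, eg:sex, eg:lifeExpectancy, eg:lifeExpectancyDataset) is a hypothetical placeholder and not an identifier from the CEDAR data.

```python
# Minimal sketch: one QB observation as (subject, predicate, object) triples.
# All "eg:" names are invented placeholders for illustration only.

def make_observation(obs_id, dataset, dimensions, measure, value):
    """Return the triples describing a single qb:Observation."""
    triples = [
        (obs_id, "rdf:type", "qb:Observation"),
        (obs_id, "qb:dataSet", dataset),
    ]
    # Dimensions locate the observation inside the cube.
    for prop, val in dimensions.items():
        triples.append((obs_id, prop, val))
    # The measure carries the observed value itself.
    triples.append((obs_id, measure, value))
    return triples


observation = make_observation(
    obs_id="eg:o1",
    dataset="eg:lifeExpectancyDataset",
    dimensions={
        "eg:refPeriod": "2004-2006",
        "eg:refArea": "Newport",
        "eg:sex": "male",
    },
    measure="eg:lifeExpectancy",
    value=76.7,
)

for s, p, o in observation:
    print(s, p, o)
```

Attributes such as the unit of measure would be attached the same way, as further triples on the observation.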
Dynamic Classifications
• Gemeentegeschiedenis.nl
Dynamic Classifications
http://lod.cedar-project.nl/maps/ (kudos to Richard Zijdeman)
Dynamic Classifications
• HISCO
http://historyofwork.iisg.nl/
LSD Dimensions
http://lsd-dimensions.org/ https://github.com/albertmeronyo/LSD-Dimensions
Daily JSON-LD dumps
http://lsd-dimensions.org/
Concept Drift
Census classification of occupations as of 1859
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
Concept Drift
Census classification of occupations as of 1889
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
Concept Drift
Census classification of occupations as of 1899
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
Concept Drift
Upper ontologies (HISCO, AC)
Year-dependent ontologies
1859 1869 1879
Concept Drift
Upper ontologies (HISCO, AC)
Year-dependent ontologies
? ?
Predicting Change
• KOS version chains: subsequent unique version identifiers assigned to unique states of a KOS
• Problematic for
  – Data publishers (KOS maintainability)
  – Data users/linkers (link validity)
A. Meroño-Peñuela, C. Guéret, S. Schlobach. Predicting Change in Versioned Knowledge Organisation Systems on the Web. IJCAI 2015 (under review)
Predicting Change
• Proposal: generic approach to predict when and where a Web KOS of any domain will change
  – Using supervised learning on past versions of KOS
• SotA¹: prediction of class extension in
  – 1 OBO/OWL version chain (Gene Ontology)
  – using few classifiers
• Contribution²: prediction of concept drift in
  – 150 Web KOS version chains
  – using all (21) SotA classifiers (WEKA API)
1 C. Pesquita, F.M. Couto. “Predicting the extension of biomedical ontologies”. PLoS Computational Biology 8 (9), e1002630
2 A. Meroño-Peñuela, C. Guéret, S. Schlobach. “Predicting Change in Versioned Knowledge Organisation Systems on the Web”. IJCAI 2015 (under review)
Concept Drift
• Proxy for change of meaning over time¹
  – Intension drift: occurs when there is a difference in the properties or attributes of two variants of the same concept
  – Extension drift: occurs when there is a difference in the individuals that belong to two variants of the same concept
  – Label drift: occurs when there is a difference in the labels of two variants of the same concept
1 S. Wang, S. Schlobach, K. Klein. “What Is Concept Drift and How to Measure It?”. EKAW 2010.
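The three definitions above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the Concept structure is invented for the example, the sample data is made up, and any difference at all counts as drift, whereas a real measure may instead compare similarity against a threshold.

```python
# Minimal sketch: flag label, intension and extension drift between two
# variants of the same concept. The Concept class is an assumption made
# for this example, not a structure from the slides.
from dataclasses import dataclass, field


@dataclass
class Concept:
    labels: set = field(default_factory=set)      # names given to the concept
    properties: set = field(default_factory=set)  # intension: properties/attributes
    members: set = field(default_factory=set)     # extension: individuals


def drift(old, new):
    """Report which kinds of drift occurred between two variants."""
    return {
        "label": old.labels != new.labels,
        "intension": old.properties != new.properties,
        "extension": old.members != new.members,
    }


# Invented example: same label and properties, one extra member later on.
v_old = Concept({"smith"}, {"works with metal"}, {"p1", "p2"})
v_new = Concept({"smith"}, {"works with metal"}, {"p1", "p2", "p3"})

result = drift(v_old, v_new)
print(result)
```

In this toy case only the extension differs, so only extension drift is flagged.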
Input Datasets
KOS version chains from
• HISCO/CEDAR (1 version chain)
• DBpedia (2 version chains)
• Linked Open Vocabularies¹ (134 version chains)
• *Ontology chains from 637 SPARQL endpoints² (6 version chains)
1 http://lov.okfn.org/
2 https://github.com/albertmeronyo/ConceptDrift-data/tree/master/src
Features
• From which data characteristics (related to change) should we learn?
• SotA in Ontology Change [Stojanovic 2004]
  – Structure-driven (rdfs:subClassOf, skos:broader)
    • maxDepth, children, parents, siblings
  – Data-driven (rdf:type)
    • members, childMembers, parentMembers, siblingMembers
  – Usage-driven
    • incExtLinks (on the Web!)
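To make the structure-driven features concrete, here is a small Python sketch that derives them from a child-to-parent map, as could be extracted from skos:broader or rdfs:subClassOf links. It rests on assumptions: the hierarchy is a tree with at most one parent per concept, maxDepth is read as the depth of the subtree below the concept, and the toy occupations are invented. The actual pipeline may define these features differently.

```python
# Minimal sketch: structure-driven features from a child -> parent map.
# Assumes a tree (single parent per concept, no cycles); data is invented.

def structural_features(parent_of, concept):
    children = [c for c, p in parent_of.items() if p == concept]
    parent = parent_of.get(concept)
    siblings = [
        c for c, p in parent_of.items()
        if parent is not None and p == parent and c != concept
    ]

    def depth_below(node):
        # Length of the longest path from node down to a leaf.
        kids = [c for c, p in parent_of.items() if p == node]
        return 0 if not kids else 1 + max(depth_below(k) for k in kids)

    return {
        "children": len(children),
        "parents": 0 if parent is None else 1,
        "siblings": len(siblings),
        "maxDepth": depth_below(concept),
    }


parent_of = {
    "agriculture": "root",
    "industry": "root",
    "farmer": "agriculture",
    "smith": "industry",
    "weaver": "industry",
}

features = structural_features(parent_of, "industry")
print(features)
```

The data-driven features (members, childMembers, and so on) could be counted the same way from a map of concepts to their rdf:type instances.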
Pipeline
https://github.com/albertmeronyo/ConceptDrift
EvaluaFon
• Use a subset of past versions for learning (Vt)
• Check whether change happened by observing Vr, Ve
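One possible reading of this setup is sketched below in Python. The roles are assumptions, since the slide does not spell them out: Vt is taken as the versions used for learning, Vr as a reference version that supplies the training labels, and Ve as a held-out version for scoring predictions. The toy chain and the rule "changed means the member set differs" are also invented for illustration.

```python
# Minimal sketch: derive "changed" labels along a version chain.
# Assumption: each version maps concept id -> set of member ids, and a
# concept counts as changed when its member set differs later on.

def label_changes(version, later_version):
    """Label each concept of `version` as changed (True) or unchanged (False)."""
    return {
        concept: members != later_version.get(concept, set())
        for concept, members in version.items()
    }


# Invented chain of four versions.
chain = [
    {"farmer": {"a"}, "smith": {"b"}},
    {"farmer": {"a"}, "smith": {"b"}},
    {"farmer": {"a"}, "smith": {"b", "c"}},
    {"farmer": {"a", "d"}, "smith": {"b", "c"}},
]

# Assumed roles: chain[:2] for learning (Vt), chain[2] as reference (Vr),
# chain[3] held out for evaluation (Ve).
v_t_last, v_r, v_e = chain[1], chain[2], chain[3]

train_labels = label_changes(v_t_last, v_r)  # targets for fitting a classifier
eval_labels = label_changes(v_r, v_e)        # ground truth for scoring it
print(train_labels)
print(eval_labels)
```

A classifier trained on features of the Vt versions with train_labels as targets would then be scored against eval_labels.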
Results – classifier performance
CEDAR/HISCO classification performance over time
DBpedia ontology classification performance over time
Results – understanding performance
Relationship between characteristics of input version chains and selected classifiers / performance?
• totalSize
• nSnapshots
• avgGap
• avgTreeDepth
• ratioInstances
• ratioStructural
• ratioInserts
• ratioDeletes
• ratioComm
f(x_i)?
• roc
• classifier
Table 1:
Dependent variable:
                  functions   rules       trees       functions   rules       trees       functions   rules       trees
                  (1)         (2)         (3)         (4)         (5)         (6)         (7)         (8)         (9)
log(nSnapshots)   -0.291      -0.257      1.975       -0.180      -0.239      1.745       -0.193      -0.212      1.838
                  (0.656)     (0.765)     (1.503)     (0.680)     (0.790)     (1.512)     (0.667)     (0.777)     (1.497)
log(avgGap)       0.238       0.145       1.385*      0.266       0.173       1.269*      0.248       0.161       1.351*
                  (0.242)     (0.271)     (0.734)     (0.240)     (0.269)     (0.703)     (0.240)     (0.270)     (0.729)
log(totalSize)    0.669***    0.539*      -0.052      0.636**     0.531*      -0.010      0.641***    0.524*      -0.025
                  (0.249)     (0.278)     (0.563)     (0.251)     (0.282)     (0.555)     (0.249)     (0.279)     (0.557)
avgTreeDepth      -0.399      -0.334      0.534       -0.393      -0.336      0.564       -0.385      -0.323      0.553
                  (0.302)     (0.330)     (0.719)     (0.304)     (0.334)     (0.728)     (0.303)     (0.332)     (0.728)
ratioInstances    1.378       2.463       3.090       1.071       2.246       3.394       1.269       2.330       3.221
                  (3.485)     (4.021)     (6.654)     (3.455)     (3.981)     (6.629)     (3.476)     (4.005)     (6.649)
ratioStructural   -9.054      1.357       -9.539      -9.039      1.674       -10.799     -9.594      1.116       -10.030
                  (6.040)     (6.135)     (13.505)    (6.142)     (6.353)     (13.945)    (6.136)     (6.267)     (13.827)
ratioInserts      3.006       2.376       -3.540
                  (1.906)     (2.210)     (4.401)
ratioDeletes                                          1.918       0.929       -2.341
                                                      (1.907)     (2.154)     (4.058)
ratioComm                                                                                 -1.440      -0.945      1.615
                                                                                          (1.028)     (1.170)     (2.219)
Constant          -5.610**    -5.580**    -12.702**   -5.288**    -5.259**    -12.402**   -4.059*     -4.494*     -14.266**
                  (2.248)     (2.511)     (5.954)     (2.210)     (2.494)     (5.759)     (2.265)     (2.585)     (6.511)
Akaike Inf. Crit. 313.543     313.543     313.543     316.179     316.179     316.179     314.605     314.605     314.605
Note: *p<0.1; **p<0.05; ***p<0.01
Classifier Selection
Simulation of avgGap vs. Classifier Family Selection
Conclusions
• Semantic technology for Social History
  – It saved work!
• Historical datasets as an observatory of dynamic KOS
  – Logging usage of KOS in Linked Statistical Data
• Modeling change in Web KOS
  – Version chains are scarce (beware of bias)
  – Chain recipe: nSnapshots, avgTreeDepth, ratioStructural, ratioInserts, ratioComm
  – Classifier dependence: avgGap, totalSize
Thank you
Questions, suggestions, comments most welcome
@albertmeronyo
https://github.com/albertmeronyo/ConceptDrift
http://www.cedar-project.nl http://krr.cs.vu.nl/
http://easy.dans.knaw.nl/ http://lsd-dimensions.org/
Me in 6 tweets http://www.albertmeronyo.org
• Background: Computer Science, Web hacker, AI & Law
• PhD candidate at the VU University Amsterdam, DANS, and eHumanities group (KNAW)
• Topic: Semantic Web for the Humanities
• CEDAR project (2012-2015): harmonized historical Dutch censuses in the Semantic Web
• Problem: statistical data publishing, concept drift and dynamics of meaning
• Last paper: What is Linked Historical Data? (EKAW 2014)