http://aligned-project.eu COLD@ISWC, 18th October 2016, Kobe
Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach
Monika Solanki
https://w3id.org/people/msolanki
@nimonika
University of Oxford
Joint work with Christian Mader
Fraunhofer IAIS, Germany
PoolParty (SWC) Use case
PoolParty (PPT): a leading commercial taxonomy management application and authoring tool for knowledge graphs; it provides taxonomy import functionality to interact with third-party datasets.
Taxonomists using PPT integrate a variety of models, schemata, ontologies and vocabularies into their knowledge bases.
Challenge: combining varied data sources while ensuring that these data mashups conform at all times to a set of quality heuristics, as expected by the data processing algorithms.
[email protected], @nimonika Constraint validation and repair for taxonomies
Motivation
Consuming and interlinking enterprise data and openly available data within an industry setting.
Ensuring that the interlinked datasets conform to a set of quality heuristics.
Interactively detecting and repairing datasets with constraint violations.
Ensuring Data Consistency
Current: checks to ensure that the data persisted in the triple store does not violate data consistency are scattered across the code and sometimes performed multiple times.
Requirements
Provide a mechanism to specify data constraints in a formal way.
Identify and analyse datasets that are imported into PPT and are a source of constraint violations.
Constraint resolution
Current: checks to ensure that the data persisted in the triple store does not violate data consistency are scattered across the code and sometimes performed multiple times.
Requirements
Provide a validation mechanism to check for constraint violations and evaluate it against the selected datasets.
Combine formal data constraint definitions with reusable repair strategies that end-users can easily apply in a (semi-)automatic way.
Dataset selection
SWC-generated: datasets for which a conversion to a PPT-compatible taxonomy has been performed by SWC (10 datasets).
Custom-generated: datasets for which a conversion to a PPT-compatible taxonomy has been performed by third-party institutions (9 datasets).
Web: datasets that use SKOS, but for which it is currently unknown whether they are compatible with PPT (7 datasets).
Constraint specification
ConceptTypeAssertion (cta):
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT DISTINCT ?resource WHERE {
      ?resource skos:broader|skos:narrower ?otherRes .
      FILTER NOT EXISTS { ?resource a skos:Concept }
    }
HierarchicalConsistency (hc):
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT DISTINCT ?resource WHERE {
      ?resource a skos:Concept .
      FILTER NOT EXISTS {
        ?resource (skos:broader|^skos:narrower)*/skos:topConceptOf ?parent .
        ?parent a skos:ConceptScheme .
      }
    }
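To make the semantics of the two queries concrete, here is a minimal Python sketch that evaluates the same two checks over an in-memory set of (subject, predicate, object) tuples. It is not the PoolParty implementation: the function names, the string-based triple representation and the toy data in EXAMPLE are all assumptions made for illustration.

```python
from collections import deque

BROADER, NARROWER = "skos:broader", "skos:narrower"
TOP_CONCEPT_OF = "skos:topConceptOf"
TYPE, CONCEPT, SCHEME = "rdf:type", "skos:Concept", "skos:ConceptScheme"


def cta_violations(triples):
    """cta: subjects of skos:broader/skos:narrower not typed as skos:Concept."""
    concepts = {s for s, p, o in triples if p == TYPE and o == CONCEPT}
    return {s for s, p, o in triples
            if p in (BROADER, NARROWER) and s not in concepts}


def hc_violations(triples):
    """hc: concepts with no (broader|^narrower)* path to a top concept."""
    concepts = {s for s, p, o in triples if p == TYPE and o == CONCEPT}
    schemes = {s for s, p, o in triples if p == TYPE and o == SCHEME}
    tops = {s for s, p, o in triples if p == TOP_CONCEPT_OF and o in schemes}

    # child -> parents, following skos:broader forwards and skos:narrower backwards
    parents = {}
    for s, p, o in triples:
        if p == BROADER:
            parents.setdefault(s, set()).add(o)
        elif p == NARROWER:
            parents.setdefault(o, set()).add(s)

    def reaches_top(start):
        # Breadth-first search; the start node itself counts (zero-length path).
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            if node in tops:
                return True
            for parent in parents.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return False

    return {c for c in concepts if not reaches_top(c)}


# Hypothetical toy taxonomy, not taken from the evaluated datasets.
EXAMPLE = {
    ("ex:scheme", TYPE, SCHEME),
    ("ex:animals", TYPE, CONCEPT),
    ("ex:animals", TOP_CONCEPT_OF, "ex:scheme"),
    ("ex:cats", TYPE, CONCEPT),
    ("ex:cats", BROADER, "ex:animals"),
    ("ex:orphan", TYPE, CONCEPT),
    ("ex:untyped", BROADER, "ex:animals"),
}

print(sorted(cta_violations(EXAMPLE)))  # ['ex:untyped']
print(sorted(hc_violations(EXAMPLE)))   # ['ex:orphan']
```

In the toy data, ex:untyped takes part in the hierarchy without a type assertion, and ex:orphan is a concept that no chain of broader or inverse narrower links connects to a top concept of a scheme.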
Validation using SHACL
HierarchicalConsistency (hc):
    ppts:ConceptShape
      a sh:Shape ;
      sh:scopeClass skos:Concept ;
      sh:property [
        a sh:PropertyConstraint ;
        sh:predicate skos:prefLabel ;
        sh:minCount 1 ;
        sh:minLength 1 ;
        sh:datatype rdf:langString ;
        sh:uniqueLang true
      ] ;
      sh:constraint [
        a sh:Constraint ;
        a sh:OrConstraint ;
        sh:shapes ( ppts:ConceptHasBroaderShape ppts:ConceptIsTopConceptShape )
      ] .
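As a rough illustration of what the sh:property part of this shape requires of each concept, the Python sketch below checks a concept's preferred labels against the four facets listed (sh:minCount, sh:minLength, sh:datatype rdf:langString, sh:uniqueLang). It is not a SHACL engine and does not cover the sh:OrConstraint part; the function name and the (text, language tag) representation of labels are assumptions.

```python
from collections import Counter


def pref_label_violations(labels):
    """Return the names of the violated facets for one concept.

    labels: list of (lexical form, language tag or None) pairs, one per
    skos:prefLabel value of the concept.
    """
    problems = []
    if len(labels) < 1:
        problems.append("minCount")        # at least one preferred label
    if any(len(text) < 1 for text, _ in labels):
        problems.append("minLength")       # no empty label strings
    if any(not lang for _, lang in labels):
        problems.append("datatype")        # rdf:langString needs a language tag
    # Language tags are compared case-insensitively.
    per_lang = Counter(lang.lower() for _, lang in labels if lang)
    if any(count > 1 for count in per_lang.values()):
        problems.append("uniqueLang")      # at most one label per language
    return problems


print(pref_label_violations([("Cat", "en"), ("Katze", "de")]))  # []
print(pref_label_violations([("Cat", "en"), ("Feline", "EN")]))  # ['uniqueLang']
```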
Repair strategies
AddInverseStrategy
    ppts:ConceptHavingBroader
      a sh:Shape ;
      sh:scope [
        a sh:Scope ;
        a sh:PropertyScope ;
        sh:predicate skos:broader
      ] ;
      sh:inverseProperty [
        a sh:InversePropertyConstraint ;
        sh:predicate skos:narrower ;
        sh:minCount 1 ;
        rs:strategy [ a rs:AddInverseStrategy ]
      ] .
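One plausible reading of AddInverseStrategy is sketched below in Python: for every skos:broader statement whose inverse skos:narrower statement is missing, propose that inverse triple as an addition. This is an assumption about the strategy's behaviour, and it is stricter than the sh:minCount 1 in the shape, which only asks for at least one incoming skos:narrower link. The function name and tuple-based triples are hypothetical.

```python
def add_inverse_changeset(triples, predicate="skos:broader",
                          inverse="skos:narrower"):
    """Return the sorted list of inverse triples that are missing.

    For each (s, predicate, o) in triples, the triple (o, inverse, s)
    is proposed unless it is already present.
    """
    existing = set(triples)
    return sorted(
        (o, inverse, s)
        for s, p, o in existing
        if p == predicate and (o, inverse, s) not in existing
    )


# Hypothetical toy data: the inverse link for ex:dogs already exists.
data = {
    ("ex:cats", "skos:broader", "ex:animals"),
    ("ex:dogs", "skos:broader", "ex:animals"),
    ("ex:animals", "skos:narrower", "ex:dogs"),
}

changes = add_inverse_changeset(data)
print(changes)  # [('ex:animals', 'skos:narrower', 'ex:cats')]

# Applying the additions leaves nothing further to repair.
print(add_inverse_changeset(data | set(changes)))  # []
```

Returning the additions as a separate list, instead of mutating the input, keeps the repair reviewable before it is applied, in the spirit of the semi-automatic workflow described earlier.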
Implementation
SHACL implementation (TopQuadrant), Sesame, SWC libraries ⇒ Java application
SKOS data model, dataset file, constraint specification ⇒ Violation report
Violation report, SKOS data model, dataset file, constraint specification ⇒ Triples changeset
Not yet optimised for runtime performance.
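The data flow above (dataset and constraints in, violation report out; report and strategies in, triples changeset out) can be sketched in a few lines of Python. The actual tool is a Java application built on the TopQuadrant SHACL implementation and Sesame; every class and function name here is hypothetical, as is the idea of repairing cta by adding a type assertion.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Violation:
    constraint: str
    resource: str


@dataclass
class Changeset:
    additions: set = field(default_factory=set)
    removals: set = field(default_factory=set)

    def apply(self, triples):
        """Return a new triple set with the changes applied."""
        return (set(triples) - self.removals) | self.additions


def validate(triples, constraints):
    """constraints: name -> function(triples) returning violating resources."""
    return [Violation(name, resource)
            for name, check in constraints.items()
            for resource in sorted(check(triples))]


def repair(triples, report, strategies):
    """strategies: constraint name -> function(triples, resource) -> triples to add."""
    changes = Changeset()
    for violation in report:
        strategy = strategies.get(violation.constraint)
        if strategy is not None:
            changes.additions |= set(strategy(triples, violation.resource))
    return changes


# Hypothetical constraint and strategy, for illustration only.
def untyped_in_hierarchy(triples):
    concepts = {s for s, p, o in triples
                if p == "rdf:type" and o == "skos:Concept"}
    return {s for s, p, o in triples
            if p in ("skos:broader", "skos:narrower") and s not in concepts}


def assert_concept_type(triples, resource):
    return {(resource, "rdf:type", "skos:Concept")}


data = {("ex:a", "skos:broader", "ex:b"),
        ("ex:b", "rdf:type", "skos:Concept")}
constraints = {"cta": untyped_in_hierarchy}
strategies = {"cta": assert_concept_type}

report = validate(data, constraints)
changes = repair(data, report, strategies)
repaired = changes.apply(data)
print(report)                           # [Violation(constraint='cta', resource='ex:a')]
print(validate(repaired, constraints))  # []
```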
Validation results
cta was never violated in datasets converted to PPT taxonomies.
upl is a SKOS-level constraint, better respected by vocabulary providers.
Violations observed across all datasets.
Validation performance
Omitted 10 datasets that contained ≤ 50,000 triples.
No correlation between the dataset size and the time taken to perform the validation.
Structure of the dataset makes a difference.
Repair strategy execution performance
Repair strategy applied to a special case of the constraint br (BidirectionalRelationsHierarchical).
Only considered skos:broader and skos:narrower. Did not consider owl:inverseOf.
Repair scales well even with larger datasets.
Summary and Conclusions
Interwoven SHACL-based data consistency specification and validation with repair strategies.
Validation of datasets generated by PPT can be done with reasonable performance.
Integrating repair strategies and data constraint specification helps in building a unified, maintainable model.
The model also plays a pivotal role in harmonizing data and software development processes.