4V: Volumen, Velocidad, Variedad y Validez en la gestión innovadora de datos
(TIN2013-46238)
Progress Report – WP3
Zaragoza, 15 June 2016
Ontology Engineering Group (OEG)
Escuela Técnica Superior de Ingenieros Informáticos
Universidad Politécnica de Madrid
Campus de Montegancedo, Boadilla del Monte, 28660, Spain
2
Outline
• Loupe
  • On-going work
• Quality Assessment and Repair
  • Conciseness
  • Consistency
• Collaborations
  • A two-fold quality assurance approach for dynamic KBs: The 3cixty use case
Nandana Mihindukulasooriya, OEG
3
Loupe - An Online Tool for Inspecting Datasets in the Linked Data Cloud
Demo @ ISWC2015
4
Loupe - Overview
Explore the vocabularies used and the abstract triple patterns in more than 2 billion triples, including all DBpedia datasets, Wikidata, LinkedBrainz, and Bio2RDF.
Loupe helps to understand data, uncover patterns, formulate queries, and detect quality issues.
5
Loupe – Google Analytics
6
Loupe – Google Analytics (II)
• Users from 84 countries
  • Spain (23.76%), US (16.69%), Germany (10.64%), UK (9.14%), Italy (4.51%)
7
Loupe On-going work
8
Loupe – Use Case Analysis
• Dataset descriptions
  • Dataset statistics
  • Dataset profiling
• Dataset exploration
  • Class/property browsing
  • Triple pattern browsing
• Dataset discovery and recommendation
  • Keywords, vocabularies
  • SPARQL queries
  • RDF shapes
• Quality assessment
  • Consistency
  • Misused vocabularies
• Guided SPARQL query generation
  • Auto-complete based on abstract triple patterns
• Vocabulary reuse and recommendation
  • Recommendation of vocabularies based on popularity
• Ontology development feedback
  • Common properties
9
Loupe – LOD Laundromat integration
• Current status of Loupe
  • 2 billion triples from 32 datasets
• LOD Laundromat
  • 32 billion triples from 650K documents
  • Cleaned for syntax errors and duplicates
  • Coverage of smaller documents
• Collaboration with VU University Amsterdam
  • Steps
    • Fully automatic dataset download, SPARQL endpoint creation, indexing, clean up
    • UI changes to handle a large number of datasets
    • Vocabulary usage datasets
10
Loupe Ontology – Vocabulary Usage Statistics of LOD
• Analysis of existing metrics
  • VoID
  • DCAT
  • RDFStats
  • LODStats
  • VoID-Ext
• Analysis of use case requirements
  • Statistics
  • Profiling
  • Discovery
  • Recommendation
11
Loupe Ontology
12
An Analysis of the Quality Issues of the Properties Available in the Spanish DBpedia
CAEPIA 2015, Albacete
13
Analyzed Quality Dimensions
A. Conciseness. A dataset does not contain redundant concepts with different identifiers.
B. Consistency. A dataset does not contain conflicting or contradictory data.
C. Syntactic Validity. Values belong to the legal value range for the represented domain and do not violate the syntactic rules.
D. Semantic Accuracy. Values correctly represent real-world facts.
14
Conciseness
• Many redundant properties in esDBpedia
  • 97.93% are auto-generated
• Causes
  • Capitalization (857): partidosEnPrimera, partidosenprimera
  • Synonyms: causaDeMuerte, causaDeFallecimiento
  • Prepositions: causaDeFallecimiento, causaFallecimiento
  • Spelling (7,495): apeliido, apelldio, apellid
  • Singular/plural: apellido, apellidos
  • Gender: administrador, administradora
  • Accent usage (1,252): administracion, administración
  • Parsing (107): altitudMin/máx, residencia/trabajo, idioma/s
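The capitalization and accent causes listed above lend themselves to a simple normalization check. The following is a minimal sketch, not the method used in the study: it assumes property local names are available as plain strings, and the helper names `normalize` and `group_variants` are hypothetical.

```python
import unicodedata
from collections import defaultdict


def normalize(name: str) -> str:
    """Fold case and strip diacritics so near-duplicate names collide."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.casefold()


def group_variants(names):
    """Return only the normalized keys shared by more than one property name."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Property names taken from the examples on the slide.
properties = [
    "partidosEnPrimera",
    "partidosenprimera",
    "administracion",
    "administración",
    "apellido",
]
redundant = group_variants(properties)
```

Spelling errors, synonyms, and singular/plural variants would not be caught by this normalization and would need string similarity or lexical resources.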
15
Consistency
• Diverse and incorrect domain and range types
  • esdbpedia:edad has range of type dbo:Place
  • esdbpedia:lugarmuerte has range of type dbo:Person
  • esdbpedia:pais has range of type dbo:Actor
• OWL properties with IRI and literal values
  • 3,380 properties
  • Strings and URLs used interchangeably
    • esdbpedia:lugarDeEntierro
      • "Madrid"@es
      • http://es.dbpedia.org/resource/Madrid
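A property that mixes IRIs and literals can be spotted by recording which kinds of object each property takes. This is a minimal sketch under stated assumptions: triples are plain string tuples, an object counts as an IRI when it starts with `http://` or `https://`, and the subjects `ex:person1` and `ex:person2` are hypothetical.

```python
from collections import defaultdict


def is_iri(value: str) -> bool:
    """Crude IRI test, assumed sufficient for this illustration."""
    return value.startswith(("http://", "https://"))


def mixed_object_properties(triples):
    """Return properties whose objects include both IRIs and literals."""
    kinds = defaultdict(set)
    for _subject, prop, obj in triples:
        kinds[prop].add("iri" if is_iri(obj) else "literal")
    return sorted(p for p, seen in kinds.items() if len(seen) == 2)


triples = [
    ("ex:person1", "esdbpedia:lugarDeEntierro", '"Madrid"@es'),
    ("ex:person2", "esdbpedia:lugarDeEntierro", "http://es.dbpedia.org/resource/Madrid"),
    ("ex:person1", "esdbpedia:apellido", '"García"@es'),
]
mixed = mixed_object_properties(triples)
```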
16
Conciseness
17
How to query for the birth place of a person in DBpedia?
DBpedia (lang): English
  Syntactically similar: birthplace, birthplace, placeofbirth, birthplace, birthdplace, birthPalce, birthplace, PlaceOfBirth, laceOfBirth, oplaceOfBirth, birthplace, birthplace, birthPalce, birthPlae, birthPace, birthPlaxe, birtPlace, birthPlcace, bithPlace, brithPlace, nbirthPlace, birthplace, birghPlace, birthdplace, biRthPlace, birth, placebirth, placeOfBirth, placOfBirth, birthPlaceOf, birthPlae
  Semantically similar: cityofbirth, cityofbirthPlace, cityOfBirth, birthLocation
DBpedia (lang): Spanish
  Syntactically similar: birthPlace, placeOfBirth, birthPlace, birthplace, lugarDeNacimiento, lugarNacimiento, lugarNacimiento, lugarnacimiento, lugardenacimiento, lugarNacimento, lugarNaciento
  Semantically similar: ciudaddenacimiento, ciudadDenacimiento, paisdenacimiento, paisNacimiento
DBpedia (lang): German
  Syntactically similar: geburtsort, birthplace, birthPlace, placeOfBirth, placeofbirth
  Semantically similar: geburtsland, countryofbirth
18
Conciseness
• Less-concise datasets
  • Multiple identifiers with the same semantics
• Issues
  • Harder to understand data and vocabularies used
  • Harder to write queries
  • Harder to reuse
• Causes
  • Less concise mappings
    • Diverse distributed mappings created by multiple teams
    • No policies or guidance on consistent vocabulary usage
    • No tools for recommending classes / properties
  • Crowd-sourced ontologies
    • No or minimal labels / descriptions
19
RDF generation process
[Figure: RDF generation process]
Layers: Data sources, Transformation, Storage, Access
Components:
• Structured data, unstructured data
• Bulk RDF Transformation (e.g., LOD Refine, DBpedia extraction framework, ad-hoc programs)
• Query Rewriting
• RDF Mappings (e.g., R2RML, Mappings Wiki, D2R mappings, LOD Refine RDF skeletons)
• Triple Store, RDF Dumps, Web Server
• SPARQL Endpoint (e.g., Virtuoso, Fuseki)
• Linked Data Resources (e.g., Pubby, ELDA)
• SPARQL Clients, Linked Data Clients
20
DBpedia extraction process
[Figure: DBpedia extraction process. Labels: mappings, infobox, RDF, Triplestore, Rendering]
21
Issues in DBpedia mappings
• 16 DBpedia chapters
  • Crowd-sourced mappings using the mappings wiki
    • 5553 template mappings
  • Mostly using the DBpedia ontology
    • 739 classes, 3049 properties
  • In-concise usage of similar properties
    • elevation & height, formationYear & foundingYear, team & club, occupation & profession, foundedBy & founder
• Plan for repair
  • Detection of inconsistent property usage
    • Feedback to the ontology team
    • Feedback and guidance to the mapping teams
  • Automatic cleaning of the mappings (in RML)
22
Repairing conciseness issues in mappings
[Figure: the RDF generation process diagram (Data sources, Transformation, Storage, Access), repeated on this slide]
23
Detecting in-concise mapping based on data
dbr:Adobe_Systems dbo:formationYear "1982"^^xsd:gYear
dbr:Adobe_Systems dbo:foundingYear "1982"^^xsd:gYear
Sources compared: DBpedia EN, DBpedia ES
Detection of in-concise mappings
24
Graph 1 (e.g., DBpedia EN), Graph 2 (e.g., DBpedia ES); S_C denotes a subject of class C.
M1(C,P1,P2): S_C P1 ?o in Graph 1, S_C P2 ?o in Graph 2
M2(C,P1,P2): S_C P1 O in Graph 1, S_C P2 O in Graph 2
M3(C,P1,P2): S_C P1 O1 in Graph 1, S_C P2 O2 in Graph 2
M4(G1,C,P1,P2): S_C P1 ?o ; P2 ?o, both within Graph 1
M5(G2,C,P1,P2): S_C P1 ?o ; P2 ?o, both within Graph 2

C        P1                  P2             M1    M2/M1  M3/M1  M4/M1  M5/M1
Company  foundingYear        formationYear  170   0.72   0.24   0      0.05
Person   activeYearsEndYear  year           150   0.84   0.16   0      0
Person   birthPlace          deathPlace     2845  0.59   0.43   0.53   0.31
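The cross-graph metrics can be illustrated with a small computation. This is a sketch under one possible reading of the slide, which is an assumption: M1 counts subjects that have P1 in graph 1 and P2 in graph 2, M2 counts those that share at least one object, and M3 counts those with at least one differing pair of objects. The companies `ex:CompanyB` and `ex:CompanyC` and all years other than Adobe's are made up for the example.

```python
def objects(graph, subject, prop):
    """All objects of (subject, prop) in a graph given as (s, p, o) tuples."""
    return {o for s, p, o in graph if s == subject and p == prop}


def cross_graph_metrics(g1, g2, subjects, p1, p2):
    """Compute M1 and the ratios M2/M1 and M3/M1 for one property pair."""
    m1 = m2 = m3 = 0
    for subject in subjects:
        o1 = objects(g1, subject, p1)
        o2 = objects(g2, subject, p2)
        if not (o1 and o2):
            continue  # subject does not use both properties
        m1 += 1
        if o1 & o2:
            m2 += 1  # same value under both properties
        if any(a != b for a in o1 for b in o2):
            m3 += 1  # at least one differing pair of values
    ratio = lambda n: n / m1 if m1 else 0.0
    return {"M1": m1, "M2/M1": ratio(m2), "M3/M1": ratio(m3)}


graph_1 = [
    ("dbr:Adobe_Systems", "dbo:foundingYear", "1982"),
    ("ex:CompanyB", "dbo:foundingYear", "1990"),
    ("ex:CompanyC", "dbo:foundingYear", "2001"),
]
graph_2 = [
    ("dbr:Adobe_Systems", "dbo:formationYear", "1982"),
    ("ex:CompanyB", "dbo:formationYear", "1991"),
]
companies = ["dbr:Adobe_Systems", "ex:CompanyB", "ex:CompanyC"]
metrics = cross_graph_metrics(
    graph_1, graph_2, companies, "dbo:foundingYear", "dbo:formationYear"
)
```

Under this reading, a high M2/M1 together with a low M3/M1 suggests that two properties carry the same information, as in the foundingYear and formationYear row of the table.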
25
RDF generation process
[Figure: the RDF generation process diagram (Data sources, Transformation, Storage, Access), repeated on this slide]
26
Property Maps
Property Map Generation
• Step 1: group properties into clusters according to their domain and range
• Step 2: Multilingual NL preprocessing
• Step 3: aggregate properties by similarity (syntactic and semantic)
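Step 3 can be illustrated for the syntactic case with a greedy clustering over string similarity. This is only a sketch, not the project's implementation: the use of `difflib.SequenceMatcher`, the 0.8 threshold, and the comparison against each cluster's first member are all assumptions chosen for brevity.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(names, threshold=0.8):
    """Greedily add each name to the first cluster whose representative is close enough."""
    clusters = []
    for name in names:
        for members in clusters:
            if similarity(members[0], name) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Assume these properties already share a domain/range cluster (Step 1).
names = ["birthPlace", "birthplace", "birthPalce", "placeOfBirth", "deathPlace"]
clusters = cluster(names)
```

Here `placeOfBirth` ends up in its own cluster even though it means the same as `birthPlace`, which is why the slide calls for semantic similarity in addition to syntactic similarity.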
27
Enhance SPARQL queries with property mappings
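One way a property map could be applied at query time is to replace a single property with a `VALUES` clause over all of its known variants. The sketch below is an assumption about how such an enhancement might look: the function `expand_property`, the `{VALUES}` placeholder, the example map, the `dbp:` variants, and the subject `dbr:Pablo_Picasso` are illustrative and do not come from the slides.

```python
def expand_property(query_template: str, prop: str, property_map: dict) -> str:
    """Fill the {VALUES} placeholder with every variant mapped to `prop`."""
    variants = property_map.get(prop, [prop])
    values_clause = f"VALUES ?p {{ {' '.join(variants)} }}"
    return query_template.replace("{VALUES}", values_clause)


# Hypothetical property map entry for illustration.
PROPERTY_MAP = {
    "dbo:birthPlace": ["dbo:birthPlace", "dbp:birthplace", "dbp:placeOfBirth"],
}

TEMPLATE = """SELECT ?place WHERE {
  {VALUES}
  dbr:Pablo_Picasso ?p ?place .
}"""

query = expand_property(TEMPLATE, "dbo:birthPlace", PROPERTY_MAP)
```

The resulting query matches any of the mapped properties, so the user does not need to know which variant a given DBpedia chapter uses.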
28
Consistency
29
Consistency
• Consistent data does not contain conflicting or contradictory data.
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

dbo:City a owl:Class ;
    rdfs:subClassOf
        [ a owl:Restriction ;
          owl:onProperty dbo:populationTotal ;
          owl:maxCardinality "1"^^xsd:nonNegativeInteger ] ,
        [ a owl:Restriction ;
          owl:onProperty dbo:mayor ;
          owl:maxCardinality "1"^^xsd:nonNegativeInteger ] .

dbo:country a owl:ObjectProperty ;
    rdfs:domain dbo:City ;
    rdfs:range dbo:Country .
30
Consistency (II)
• Consistency issues
  • Data does not comply with the formal definitions or schema

@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .

dbr:Zaragoza a dbo:City ;
    dbo:populationTotal 666058 ;
    dbo:populationTotal 684953 ;
    dbo:country dbr:Aragón ;
    dbo:mayor dbr:Juan_Alberto_Belloch ;
    dbo:mayor dbr:Pedro_Santisteve_Roche .

dbr:Aragón a dbo:AutonomousCommunity .
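The cardinality problems in this data can be found with a plain count of distinct values per subject and property. The sketch below is illustrative only: triples are represented as string tuples, and the set `MAX_ONE` simply restates the two maximum-cardinality restrictions from the schema slide.

```python
from collections import defaultdict

# Properties restricted to at most one value for dbo:City on the schema slide.
MAX_ONE = {"dbo:populationTotal", "dbo:mayor"}


def cardinality_violations(triples, max_one):
    """Return {(subject, property): values} where more than one value is present."""
    values = defaultdict(set)
    for subject, prop, obj in triples:
        if prop in max_one:
            values[(subject, prop)].add(obj)
    return {key: sorted(vals) for key, vals in values.items() if len(vals) > 1}


zaragoza = [
    ("dbr:Zaragoza", "dbo:populationTotal", "666058"),
    ("dbr:Zaragoza", "dbo:populationTotal", "684953"),
    ("dbr:Zaragoza", "dbo:country", "dbr:Aragón"),
    ("dbr:Zaragoza", "dbo:mayor", "dbr:Juan_Alberto_Belloch"),
    ("dbr:Zaragoza", "dbo:mayor", "dbr:Pedro_Santisteve_Roche"),
]
violations = cardinality_violations(zaragoza, MAX_ONE)
```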
31
populationTotal - Cardinality Violation
32
Consistency – (Incorrect) inferences
dbr:Juan_Alberto_Belloch owl:sameAs dbr:Pedro_Santisteve_Roche .
dbr:Aragón a dbo:Country .
• Open World Assumption and Non-Unique Name Assumption
  • Works better for inferencing than validation
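The two inferences shown on this slide can be reproduced with a toy rule application, which makes clear why such a schema infers new facts instead of reporting errors. This is a deliberately simplified sketch and not an OWL reasoner: it applies only a range rule and a maximum-cardinality rule, and it is given only the object property `dbo:mayor`, since the sameAs rule makes sense for individuals rather than for literal values such as populations.

```python
from itertools import combinations


def infer(triples, max_one, ranges):
    """Apply two toy rules: range typing and sameAs from max cardinality 1."""
    inferred = set()
    values_by_key = {}
    for subject, prop, obj in triples:
        values_by_key.setdefault((subject, prop), set()).add(obj)
        if prop in ranges:
            # rdfs:range: every object of the property gets the range type.
            inferred.add((obj, "rdf:type", ranges[prop]))
    for (_subject, prop), values in values_by_key.items():
        if prop in max_one:
            # Without a unique name assumption, two values must denote one individual.
            for a, b in combinations(sorted(values), 2):
                inferred.add((a, "owl:sameAs", b))
    return inferred


data = [
    ("dbr:Zaragoza", "dbo:country", "dbr:Aragón"),
    ("dbr:Zaragoza", "dbo:mayor", "dbr:Juan_Alberto_Belloch"),
    ("dbr:Zaragoza", "dbo:mayor", "dbr:Pedro_Santisteve_Roche"),
]
new_facts = infer(data, {"dbo:mayor"}, {"dbo:country": "dbo:Country"})
```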
33
Consistency – Rich Semantics
• Checking consistency with OWL.
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

dbo:City a owl:Class ;
    rdfs:subClassOf
        [ a owl:Restriction ;
          owl:onProperty dbo:populationTotal ;
          owl:maxCardinality "1"^^xsd:nonNegativeInteger ] ,
        [ a owl:Restriction ;
          owl:onProperty dbo:mayor ;
          owl:maxCardinality "1"^^xsd:nonNegativeInteger ] .

dbo:country a owl:ObjectProperty ;
    rdfs:domain dbo:Place ;
    rdfs:range dbo:Country .

dbo:AutonomousCommunity owl:disjointWith dbo:Country .

dbr:Juan_Alberto_Belloch owl:differentFrom dbr:Pedro_Santisteve_Roche .
34
Consistency – SHACL constraints
• Checking consistency with W3C SHACL.
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix schema: <http://schema.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

_:cityShape a sh:Shape ;
    sh:scopeClass dbo:City ;
    sh:property [
        sh:predicate dbo:mayor ;
        sh:maxCount 1 ;
        sh:nodeKind sh:IRI ;
        sh:classIn ( dbo:Person schema:Person foaf:Person )
    ] ;
    sh:property [
        sh:predicate dbo:country ;
        sh:maxCount 1 ;
        sh:minCount 1 ;
        sh:nodeKind sh:IRI ;
        sh:classIn ( dbo:Country ) ;
        sh:stem "http://dbpedia.org/"
    ] .
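To show what checking such a shape involves, the sketch below validates one node against a dictionary that mirrors the count and stem constraints of the city shape. It is a hand-rolled illustration, not a SHACL processor: the dictionary layout, the `validate` function, and the message format are assumptions, and the class constraints (`sh:classIn`) are left out because they would need type information about the values.

```python
# Assumed in-memory rendering of the count and stem constraints from the shape.
CITY_SHAPE = {
    "dbo:mayor": {"maxCount": 1},
    "dbo:country": {"minCount": 1, "maxCount": 1, "stem": "http://dbpedia.org/"},
}


def validate(node, data, shape):
    """Check {property: [values]} for one node against the shape; return messages."""
    report = []
    for prop, rules in shape.items():
        values = data.get(prop, [])
        if len(values) < rules.get("minCount", 0):
            report.append(f"{node} {prop}: fewer than {rules['minCount']} value(s)")
        if "maxCount" in rules and len(values) > rules["maxCount"]:
            report.append(f"{node} {prop}: more than {rules['maxCount']} value(s)")
        stem = rules.get("stem")
        if stem:
            for value in values:
                if not value.startswith(stem):
                    report.append(f"{node} {prop}: {value} does not start with {stem}")
    return report


zaragoza = {
    "dbo:mayor": [
        "http://dbpedia.org/resource/Juan_Alberto_Belloch",
        "http://dbpedia.org/resource/Pedro_Santisteve_Roche",
    ],
    "dbo:country": ["http://dbpedia.org/resource/Aragón"],
}
report = validate("dbr:Zaragoza", zaragoza, CITY_SHAPE)
```

Unlike the OWL reading, the second mayor is reported as a violation here rather than being merged with the first.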
35
Data validation with semi-automatically generated RDF Shapes
[Figure: pipeline] Pattern Extraction → Domain Expert Review → RDF Shape Generation → Data Validation → Data Repair
Artifact: SHACL Shapes
36
Cardinality constraints example
Property cardinalities of schema:Place class (extracted from data)

schema:Place         Min  Max  P1  P99  Mean    0        1        2        3       4        5
rdf:type             1    2    1   1    1.0002  0        99.9793  0.0207   0       0        0
rdfs:label           1    6    1   6    4.2508  0        4.4048   36.6743  1.7445  0.4831   0
rdfs:seeAlso         0    4    1   2    1.5717  0.0340   42.7702  57.1905  0.0041  0.0011   0
owl:sameAs           0    6    0   0    0.0058  99.4455  0.5339   0.0146   0.0041  0.0015   0
schema.org:review    0    2    0   2    0.0329  98.3175  0.0717   1.6108   0       0        0
schema.org:url       0    40   0   10   0.5085  89.8340  1.8947   3.7013   0.3008  1.2155   0.3434
events:poster        0    23   0   1    0.0155  98.9609  0.5900   0.4237   0.0097  0.0120   0.0007
dc:publisher         0    2    0   2    1.0677  39.1777  14.8776  45.9447  0       0        0
events:businessType  0    4    0   2    1.5273  4.1889   38.9255  56.8673  0.0041  0.0142   0
schema:description   0    28   1   12   3.0573  0.0886   30.5193  32.8359  1.9605  19.1139  0.1226
geo:location         0    24   0   4    0.2040  92.7525  0.6819   3.2436   0.2634  2.9831   0.0060

Common cardinalities

Pat.  Min  Max  Description
A     0    N    No restrictions
B     0    1    Maximum 1
C     1    N    Minimum 1
D     1    1    Exactly 1

Cardinality Classifier → Extracted Patterns (schema:Place class)
rdf:type             D (Exactly 1)
rdfs:label           C (Minimum 1)
rdfs:seeAlso         C (Minimum 1)
owl:sameAs           A (No restrictions)
schema.org:review    A (No restrictions)

Expert Review → Approved Patterns (schema:Place class)
rdf:type             C (Minimum 1)
rdfs:label           C (Minimum 1)
rdfs:seeAlso         C (Minimum 1)
owl:sameAs           A (No restrictions)
schema.org:review    A (No restrictions)

Restrictions in SHACL

_:placeShape a sh:Shape ;
    sh:scopeClass schema:Place ;
    sh:property [ sh:predicate rdf:type ; sh:minCount 1 ] ;
    sh:property [ sh:predicate rdfs:label ; sh:minCount 1 ] ;
    sh:property [ sh:predicate rdfs:seeAlso ; sh:minCount 1 ] ;
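The step from approved patterns to SHACL restrictions is mechanical and can be sketched as follows. The slides do not say how the classifier derives a minimum and maximum from the statistics, so this sketch takes a (min, max) pair as given; the function names and the use of `None` for "N" are assumptions, and the emitted text follows the draft SHACL vocabulary used on the slide.

```python
# The four common cardinality patterns; None stands for the unbounded "N".
PATTERNS = {
    (0, None): "A",  # No restrictions
    (0, 1): "B",     # Maximum 1
    (1, None): "C",  # Minimum 1
    (1, 1): "D",     # Exactly 1
}


def classify(min_count, max_count):
    """Map an observed (min, max) pair onto one of the patterns A to D."""
    lower = 1 if min_count >= 1 else 0
    upper = 1 if max_count is not None and max_count <= 1 else None
    return PATTERNS[(lower, upper)]


def shacl_property(predicate, pattern):
    """Render one sh:property constraint for a pattern (draft syntax, as on the slide)."""
    parts = [f"sh:predicate {predicate}"]
    if pattern in ("C", "D"):
        parts.append("sh:minCount 1")
    if pattern in ("B", "D"):
        parts.append("sh:maxCount 1")
    return "sh:property [ " + "; ".join(parts) + " ] ;"


# Approved patterns from the slide; pattern A yields no restriction.
approved = {
    "rdf:type": "C",
    "rdfs:label": "C",
    "rdfs:seeAlso": "C",
    "owl:sameAs": "A",
    "schema.org:review": "A",
}
constraints = [shacl_property(p, pat) for p, pat in approved.items() if pat != "A"]
```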
37
W3C SHACL restrictions
• Value type constraints
  • sh:class, sh:classIn, sh:datatype, sh:datatypeIn, sh:nodeKind
• Cardinality constraints
  • sh:minCount, sh:maxCount
• Value range constraints
  • sh:minInclusive, sh:minExclusive, sh:maxInclusive, sh:maxExclusive
• String based constraints
  • sh:minLength, sh:maxLength, sh:pattern, sh:stem, sh:uniqueLang
• Property pair constraints
  • sh:equals, sh:disjoint, sh:lessThan, sh:lessThanOrEquals
38
A Two-Fold Quality Assurance Approach for Dynamic Knowledge Bases: The 3cixty Use Case
39
Continuous Integration is essential
40
Exploratory testing with Loupe
41
Automated testing with SPARQL Interceptor
• A set of user-defined SPARQL queries (as unit tests)
• Knowledge-base specific
[Figure: Test SPARQL Queries, with sources: System Requirements; Schema Constraints; Conventions and other restrictions; Inputs from Exploratory Testing]
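The idea of SPARQL queries as unit tests can be sketched with a tiny runner. This is not SPARQL Interceptor's API, which the slides do not show: the `run_tests` function, the test names, and the queries are hypothetical, and the endpoint is replaced by a stub callable so that the example runs without a network connection.

```python
from typing import Callable, Dict, List


def run_tests(tests: Dict[str, str], ask: Callable[[str], bool]) -> List[str]:
    """Run each ASK query and return the names of the tests that failed."""
    return [name for name, query in tests.items() if not ask(query)]


# Hypothetical tests; each ASK query is expected to return true on a healthy KB.
TESTS = {
    "places_exist": "ASK { ?s a schema:Place }",
    "no_place_without_label": "ASK { FILTER NOT EXISTS { ?s a schema:Place . FILTER NOT EXISTS { ?s rdfs:label ?l } } }",
}


def stub_endpoint(query: str) -> bool:
    """Stand-in for a real SPARQL endpoint: pretends some places lack labels."""
    return "rdfs:label" not in query


failed = run_tests(TESTS, stub_endpoint)
```

In a continuous integration setting, a non-empty list of failed tests would mark the build of the knowledge base as broken.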
42
SPARQL Interceptor
Designed and implemented by Localidata.