Validating RDF Data Quality using Constraints to Direct the Development of Constraint Languages. Thomas Hartmann, Benjamin Zapilko, Joachim Wackerow, Kai Eckert. International Conference on Semantic Computing (ICSC 2016)





Validating RDF Data Quality using Constraints

to Direct the Development of Constraint Languages

Thomas Hartmann

Benjamin Zapilko, Joachim Wackerow, Kai Eckert

International Conference on Semantic Computing (ICSC 2016)


XML Validation


<!ELEMENT library (book+, author*)>
<!ELEMENT book (isbn, title, author-ref+)>
<!ATTLIST book
  id ID #REQUIRED
>
<!ELEMENT author-ref EMPTY>
<!ATTLIST author-ref
  id IDREF #REQUIRED
>
<!ELEMENT author (name)>
<!ATTLIST author
  id ID #REQUIRED
>
<!ELEMENT isbn (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT name (#PCDATA)>
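One rule this DTD expresses is referential integrity: every IDREF on an author-ref must match some ID declared in the document. The sketch below checks only that rule, using Python's standard library, which does not validate against a DTD itself. The sample document and the function name `dangling_refs` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document shaped like the DTD above.
SAMPLE = """
<library>
  <book id="b1">
    <isbn>123</isbn>
    <title>Example</title>
    <author-ref id="a1"/>
    <author-ref id="a2"/>
  </book>
  <author id="a1"><name>Ada</name></author>
</library>
"""

def dangling_refs(xml_text):
    """Return author-ref IDREF values that match no declared ID."""
    root = ET.fromstring(xml_text)
    # In the DTD, only book and author carry an attribute of type ID.
    declared_ids = {
        element.get("id")
        for element in root.iter()
        if element.tag in ("book", "author")
    }
    return sorted(
        ref.get("id")
        for ref in root.iter("author-ref")
        if ref.get("id") not in declared_ids
    )

print(dangling_refs(SAMPLE))  # ['a2']
```

A validating XML parser would report the same dangling reference along with any content-model errors; the point here is only to show what kind of check the schema implies.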


RDF Validation Workshop


Working Groups on RDF Validation

W3C Data Shapes Working Group

DCMI RDF Application Profiles Task Group


http://purl.org/net/rdf-validation

81 Types of Constraints on RDF Data


Constraint Languages


SPARQL Query Language for RDF

SELECT ?concept
WHERE {
  ?concept a [ rdfs:subClassOf* skos:Concept ] .
  FILTER NOT EXISTS {
    ?concept ?p ?o .
    FILTER ( ?p IN (
      skos:related,
      skos:relatedMatch,
      skos:broader, ... ) ) .
  }
}
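The query reports every concept (an instance of skos:Concept or of one of its subclasses) that has none of the listed linking properties. The following is a rough plain-Python rendering of that logic, assuming triples are modelled as tuples of prefixed-name strings. The property list on the slide is elided, so `LINKING` holds only the three names that are shown; the sample data is hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"
# Only the properties visible on the slide; the original list continues ("...").
LINKING = {"skos:related", "skos:relatedMatch", "skos:broader"}

def subclasses_of(triples, cls):
    """Classes reachable from cls by following rdfs:subClassOf backwards (cls included)."""
    found, frontier = {cls}, [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == SUBCLASS_OF and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found

def unlinked_concepts(triples):
    """Concepts that are the subject of no triple using a linking property."""
    concept_classes = subclasses_of(triples, "skos:Concept")
    concepts = {s for s, p, o in triples if p == RDF_TYPE and o in concept_classes}
    linked = {s for s, p, o in triples if p in LINKING}
    return sorted(concepts - linked)

# Hypothetical sample graph.
TRIPLES = [
    (":Topic", SUBCLASS_OF, "skos:Concept"),
    (":c1", RDF_TYPE, "skos:Concept"),
    (":c1", "skos:broader", ":c0"),
    (":c2", RDF_TYPE, ":Topic"),
]

print(unlinked_concepts(TRIPLES))  # [':c2']
```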


SPARQL Inferencing Notation (SPIN)

# FILTER NOT EXISTS { ?book :author ?person }
[ a sp:Filter ;
  sp:expression [
    a sp:notExists ;
    sp:elements (
      [ sp:subject [ sp:varName "book" ] ;
        sp:predicate :author ;
        sp:object [ sp:varName "person" ] ] ) ] ]


Web Ontology Language (OWL)

:Publication rdfs:subClassOf
  [ a owl:Restriction ;
    owl:onProperty :author ;
    owl:allValuesFrom :Person ] .
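Under OWL's standard open-world semantics this axiom does not reject data: a reasoner would infer that every author of a publication is a :Person. Using it as a constraint means reading it under a closed-world assumption, where an author not explicitly typed as :Person counts as a violation. Here is a small sketch of that closed-world reading, assuming triples as tuples of strings; the function name and sample data are illustrative only.

```python
def all_values_from_violations(triples, cls, prop, value_cls):
    """Pairs (subject, value) where subject is a cls, but value is not typed value_cls."""
    types = {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
    return sorted(
        (s, o)
        for s, p, o in triples
        if p == prop
        and cls in types.get(s, set())
        and value_cls not in types.get(o, set())
    )

# Hypothetical sample graph: :acme is an author but is never typed as :Person.
TRIPLES = [
    (":pub1", "rdf:type", ":Publication"),
    (":pub1", ":author", ":alice"),
    (":pub1", ":author", ":acme"),
    (":alice", "rdf:type", ":Person"),
]

print(all_values_from_violations(TRIPLES, ":Publication", ":author", ":Person"))
# [(':pub1', ':acme')]
```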


Shape Expressions (ShEx)

:Publication {
  ( :isbn xsd:string, :title xsd:string )
  |
  ( :issn xsd:string, :title xsd:string )
}
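The shape offers two alternatives: a publication has an ISBN and a title, or an ISSN and a title. The sketch below checks one node against that shape in plain Python. It assumes each property occurs exactly once (the default cardinality) and that a node carrying both :isbn and :issn matches neither alternative; the function name and the dictionary representation are assumptions made for the example.

```python
def matches_publication(props):
    """props maps a predicate to the list of its literal values for one node."""
    def exactly_one_string(pred):
        values = props.get(pred, [])
        return len(values) == 1 and isinstance(values[0], str)

    def absent(pred):
        return not props.get(pred)

    # First alternative: ( :isbn xsd:string, :title xsd:string )
    isbn_branch = exactly_one_string(":isbn") and absent(":issn")
    # Second alternative: ( :issn xsd:string, :title xsd:string )
    issn_branch = exactly_one_string(":issn") and absent(":isbn")
    return exactly_one_string(":title") and (isbn_branch or issn_branch)

# Hypothetical nodes.
print(matches_publication({":isbn": ["123"], ":title": ["A Book"]}))      # True
print(matches_publication({":issn": ["456"], ":title": ["A Journal"]}))   # True
print(matches_publication({":isbn": ["123"], ":issn": ["456"], ":title": ["Both"]}))  # False
print(matches_publication({":isbn": ["123"]}))                            # False
```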


Resource Shapes (ReSh)

:Computer-Science-Book
  a oslc:ResourceShape ;
  oslc:property [
    oslc:propertyDefinition :subject ;
    oslc:allowedValues [
      oslc:allowedValue
        "Computer Science" ,
        "Informatics" ,
        "Information Technology" ] ] .


Description Set Profiles (DSP)

[ a dsp:DescriptionTemplate ;
  dsp:resourceClass :Science-Fiction-Book ;
  dsp:statementTemplate [
    dsp:property :subject ;
    dsp:nonLiteralConstraint [
      dsp:valueClass skos:Concept ;
      dsp:valueURI
        :Science-Fiction, :Sci-Fi, :SF ;
      dsp:vocabularyEncodingScheme
        :Science-Fiction-Book-Subjects ; ] ] ] .


Shapes Constraint Language (SHACL)

:BookShape
  a sh:Shape ;
  sh:scopeClass :Book ;
  sh:property [
    sh:predicate :author ;
    sh:valueShape :PersonShape ;
    sh:minCount 1 ; ] .
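The shape requires every instance of :Book to have at least one :author, and each author to conform to :PersonShape. Since :PersonShape is not defined on the slide, the sketch below covers only the minimum-cardinality part. It models triples as tuples of strings; the function name and sample data are hypothetical.

```python
def min_count_violations(triples, scope_class, predicate, min_count=1):
    """Instances of scope_class with fewer than min_count distinct values for predicate."""
    values = {
        s: set()
        for s, p, o in triples
        if p == "rdf:type" and o == scope_class
    }
    for s, p, o in triples:
        if p == predicate and s in values:
            values[s].add(o)  # a set, because an RDF graph holds no duplicate triples
    return sorted(node for node, found in values.items() if len(found) < min_count)

# Hypothetical sample graph: :book2 has no author.
TRIPLES = [
    (":book1", "rdf:type", ":Book"),
    (":book1", ":author", ":alice"),
    (":book2", "rdf:type", ":Book"),
]

print(min_count_violations(TRIPLES, ":Book", ":author"))  # [':book2']
```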


http://purl.org/net/rdfval-demo

RDF Validation Environment


Constraint Types Classification

1. RDFS/OWL Based

2. Constraint Language Based

3. SPARQL Based


RDFS/OWL Based

:Publication rdfs:subClassOf
  [ a owl:Restriction ;
    owl:onProperty :author ;
    owl:allValuesFrom :Person ] .


Constraint Language Based

:Publication {
  ( :isbn xsd:string, :title xsd:string )
  |
  ( :issn xsd:string, :title xsd:string )
}


SPARQL Based

SELECT ?concept
WHERE {
  ?concept a [ rdfs:subClassOf* skos:Concept ] .
  FILTER NOT EXISTS {
    ?concept ?p ?o .
    FILTER ( ?p IN (
      skos:related,
      skos:relatedMatch,
      skos:broader, ... ) ) .
  }
}


Constraints Classification

1. Informational

2. Warning

3. Error


Evaluation Setup

• 115 constraints from vocabularies and experts
• constraints classified and implemented
• on 3 vocabularies in the social, behavioral, and economic (SBE) sciences
  – well-established vocabularies (QB, SKOS)
  – vocabulary under development (DDI-RDF)


Validated Data Sets

Vocabulary   Data Sets   Triples
QB               9,990   3,775,983,610
SKOS             4,178     477,737,281
DDI-RDF          1,526       9,673,055
Total           15,694   4.26 billion

33 SPARQL Endpoints
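The totals in the table can be re-derived from the per-vocabulary rows. This short check uses only the figures shown above; the variable names are arbitrary.

```python
# (data sets, triples) per vocabulary, copied from the table above.
DATA_SETS = {
    "QB": (9_990, 3_775_983_610),
    "SKOS": (4_178, 477_737_281),
    "DDI-RDF": (1_526, 9_673_055),
}

total_sets = sum(sets for sets, _ in DATA_SETS.values())
total_triples = sum(triples for _, triples in DATA_SETS.values())

print(total_sets)                            # 15694
print(f"{total_triples / 1e9:.2f} billion")  # 4.26 billion
```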


Finding 1

            C [%]   CV [%]
SPARQL       63.2     78.2
CL           34.7     21.8
RDFS/OWL     35.6     21.8

C (constraints), CV (constraint violations)


Finding 2

            C [%]   CV [%]
SPARQL       63.2     78.2
CL           34.7     21.8
RDFS/OWL     35.6     21.8

C (constraints), CV (constraint violations)


Finding 3

            C [%]   CV [%]
Info         42.3     31.3
Warning      18.7     62.7
Error        39.0      6.1

C (constraints), CV (constraint violations)


Limitations

> 3 Vocabularies

> 1 Domain