Test-driven Assessment of [R2]RML Mappings to Improve Dataset Quality


Mapping and RDF Dataset Quality Assessment with RDFUnit SPARQL-based test cases applied to [R2]RML mappings


http://rml.io • http://rdfunit.aksw.org

Anastasia Dimou1, Dimitris Kontokostas2, Markus Freudenberg2, Ruben Verborgh1,

Jens Lehmann2, Erik Mannens1, Sebastian Hellmann2, Rik Van de Walle1

SPARQL-based test case pattern for the RDF dataset (%%P1%% and %%D1%% are placeholders for a property and its expected datatype):

… WHERE { ?resource %%P1%% ?c.
  FILTER (DATATYPE(?c) != %%D1%%) }

[R2]RML Mapping:

<#Mapping>
  rr:predicateObjectMap [
    rr:predicate foaf:age;
    rr:objectMap [ rml:reference "Age" ] ] .
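Because [R2]RML mapping definitions are themselves RDF, they can be parsed and queried like any other graph, which is what makes mapping-level assessment possible. The following is a minimal sketch of that idea in Python, assuming the rdflib library; it is an illustration only, not part of RDFUnit or the RML tooling.

    from rdflib import Graph

    MAPPING = """
    @prefix rr:   <http://www.w3.org/ns/r2rml#> .
    @prefix rml:  <http://semweb.mmlab.be/ns/rml#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    <#Mapping>
      rr:predicateObjectMap [
        rr:predicate foaf:age ;
        rr:objectMap [ rml:reference "Age" ] ] .
    """

    g = Graph()
    g.parse(data=MAPPING, format="turtle")

    # List every predicate/reference pair the mapping defines.
    q = """
    PREFIX rr:  <http://www.w3.org/ns/r2rml#>
    PREFIX rml: <http://semweb.mmlab.be/ns/rml#>
    SELECT ?p ?ref WHERE {
      ?tm rr:predicateObjectMap ?pom .
      ?pom rr:predicate ?p ;
           rr:objectMap [ rml:reference ?ref ] .
    }
    """
    for p, ref in g.query(q):
        print(p, ref)  # e.g. http://xmlns.com/foaf/0.1/age  Age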

Input data:

  Name       Surname       Age
  Anastasia  Dimou         12
  Dimitris   Kontokostas   15

For each row, the mapping generates a subject http://example.com/{Name}_{Surname} of type foaf:Project, with foaf:age taken from the Age column and typed as xsd:float.

RDF Dataset:

  <http://example.com/Anastasia_Dimou> a foaf:Project ;
      foaf:age "12"^^xsd:float .
  <http://example.com/Dimitris_Kontokostas> a foaf:Project ;
      foaf:age "15"^^xsd:float .
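To make the example concrete, here is a small illustrative sketch in Python (again assuming rdflib; the in-memory rows and variable names are invented for illustration, and this is a stand-in, not an actual [R2]RML processor) that applies the subject template and the object map to the two input rows and yields the triples shown above.

    from rdflib import Graph, Literal, Namespace, RDF
    from rdflib.namespace import XSD

    EX = Namespace("http://example.com/")
    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    rows = [
        {"Name": "Anastasia", "Surname": "Dimou", "Age": "12"},
        {"Name": "Dimitris", "Surname": "Kontokostas", "Age": "15"},
    ]

    g = Graph()
    for row in rows:
        # Subject template: http://example.com/{Name}_{Surname}
        subject = EX[row["Name"] + "_" + row["Surname"]]
        g.add((subject, RDF.type, FOAF.Project))
        # The object map references the "Age" column; values are typed as xsd:float.
        g.add((subject, FOAF.age, Literal(row["Age"], datatype=XSD.float)))

    print(g.serialize(format="turtle"))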

Test case instantiated for the RDF dataset (foaf:age expected to be xsd:int):

… WHERE { ?resource foaf:age ?c.
  FILTER (DATATYPE(?c) != xsd:int) }

Test case pattern for the [R2]RML mappings:

… WHERE {
  ?resource rr:predicateObjectMap ?poMap.
  ?poMap rr:predicate %%P1%%;
         rr:objectMap ?objM.
  ?objM rr:datatype ?c.
  FILTER (?c != %%D1%%) }
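As a rough illustration of how such a placeholder pattern could be instantiated and executed against a mapping document, here is a short Python sketch (assuming rdflib and a hypothetical mapping.ttl file; RDFUnit's actual implementation differs). It reports one violation per mistyped Term Map instead of one per affected entity.

    from rdflib import Graph

    PATTERN = """
    PREFIX rr:   <http://www.w3.org/ns/r2rml#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
    SELECT ?resource ?c WHERE {
      ?resource rr:predicateObjectMap ?poMap .
      ?poMap rr:predicate %%P1%% ;
             rr:objectMap ?objM .
      ?objM rr:datatype ?c .
      FILTER (?c != %%D1%%)
    }
    """

    # Instantiate the placeholders for the foaf:age / xsd:int test case.
    query = PATTERN.replace("%%P1%%", "foaf:age").replace("%%D1%%", "xsd:int")

    mapping = Graph()
    mapping.parse("mapping.ttl", format="turtle")  # hypothetical mapping file
    for row in mapping.query(query):
        print("violating term map:", row.resource, "declares datatype:", row.c)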

Quality Assessment results for the [R2]RML mappings:

              size   time   failed test cases   violations
  DBpedia EN  115K   11s    1                   160
  DBpedia NL  53K    6s     1                   124
  DBLP        368    12s    2                   8

Quality Assessment results for the RDF dataset:

              size   time   failed test cases   violations
  DBpedia EN  62M    16h    1,128               3.2M
  DBpedia NL  21M    1.5h   683                 815K
  DBLP        12M    12h    7                   8.1M

For the RDF dataset, the number of errors grows linearly with the number of iterations, and geometrically when multiple references return multiple values; the same violations appear repeatedly over distinct entities. A single mistyped Term Map, such as the foaf:age example above, surfaces once per generated entity.

Shift the Quality Assessment from the RDF dataset also to the mapping definitions that generate the dataset.

For the [R2]RML mappings, the number of errors grows linearly with the number of applicable Term Maps, and the time to execute the assessment is significantly reduced.

Quality Assessment (QA)
(-) is not incorporated into the publishing workflow
(-) is performed by third parties

Dataset Quality Assessment (DQA)
(-) results are not incorporated into the dataset
(-) adjustments are applied manually, rarely, and not at the root
(-) adjustments are overwritten when a new version of the original data is mapped and published

Mapping Quality Assessment (MQA)
(+) discovers violations before they are even generated
(+) specifies the origin of the violation
(+) structural adjustments can still be applied easily
(+) reduces the effort required to act upon QA results
(+) prevents the same violations from appearing repeatedly within the dataset and over distinct entities
(+) prevents the generation of low-quality RDF datasets
(+) enables uniform Mapping and Dataset Quality Assessment

The violations derive from the mapping definitions that specify how the RDF dataset will be generated.

1 Ghent University – iMinds – Multimedia Lab    2 AKSW, University of Leipzig