Using SPARQL and SPIN for Data Quality Management on the Semantic Web Christian Fürber / Martin Hepp [email protected], [email protected] Presentation @ BIS May 4th 2010


Page 1: Using SPARQL and SPIN for Data Quality Management on the Semantic Web

Using SPARQL and SPIN for

Data Quality Management

on the Semantic Web

Christian Fürber / Martin Hepp

[email protected], [email protected]

Presentation @ BIS

May 4th 2010

Page 2: Using SPARQL and SPIN for Data Quality Management on the Semantic Web

Vision of the Semantic Web

Publishing data on the web in a meaningful way for more automation, better integration, and higher reusability of data.

C. Fürber, M. Hepp: Using SPARQL and SPIN for Data

Quality Management on the Semantic Web 2

© Hanspeter Graf / www.pixelio.de

Page 3: Using SPARQL and SPIN for Data Quality Management on the Semantic Web


Growth of Data: Well on Track…

Reference: http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2009-03-05.html

Retrieving information

Building smart SemWeb apps

Page 4: Using SPARQL and SPIN for Data Quality Management on the Semantic Web


…but what if the published data was of poor quality?

Get a giant camcorder from Amazon!

Page 5: Using SPARQL and SPIN for Data Quality Management on the Semantic Web

Using Poor Data is Costly

Without quality checks, your SemWeb apps will take this data seriously and…

…get an oversized shipping package with expensive postage,

…and waste transportation capacity.


Page 6: Using SPARQL and SPIN for Data Quality Management on the Semantic Web

Is there any way to avoid data quality disasters?

Yes, if we know about data quality problems before anything bad happens!

A giant camcorder on the road!

Page 7: Using SPARQL and SPIN for Data Quality Management on the Semantic Web

The Impact of Poor Data Quality


Poor Decisions

Failed Business Processes

Failed Projects

Higher Costs

Missed Revenues

Lower Product / Service Quality

Lower Stakeholder Satisfaction

Fatal Disasters


Page 8: Using SPARQL and SPIN for Data Quality Management on the Semantic Web

Data Quality is a Key Bottleneck of the Semantic Web

<vocab:location rdf:about="http://www.stockdbdemo2.com/stockdblocation/1">
  <vocab:location_ZIP></vocab:location_ZIP>
  <vocab:location_STREETNO></vocab:location_STREETNO>
  <vocab:location_COUNTRY>France</vocab:location_COUNTRY>
  <vocab:location_ID rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</vocab:location_ID>
  <vocab:location_STREET>8489 Strong St.</vocab:location_STREET>
  <vocab:location_STATE>NV</vocab:location_STATE>
  <rdfs:label>location #1</rdfs:label>
  <vocab:location_CITY>Las Vegas</vocab:location_CITY>
</vocab:location>

Problems annotated on the slide:

Missing literal values

Functional dependency violation

Syntax violation

Unique value violation

Page 9: Using SPARQL and SPIN for Data Quality Management on the Semantic Web

Our Approach

Identification of data quality problems on the instance level of Semantic Web sources solely with Semantic Web technologies.

<vocab:location rdf:about="http://www.stockdbdemo2.com/stockdblocation/1">
  <vocab:location_ZIP></vocab:location_ZIP>
  <vocab:location_STREETNO></vocab:location_STREETNO>
  <vocab:location_COUNTRY>France</vocab:location_COUNTRY>
  <vocab:location_ID rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</vocab:location_ID>
  <vocab:location_STREET>8489 Strong St.</vocab:location_STREET>
  <vocab:location_STATE>NV</vocab:location_STATE>
  <rdfs:label>location #1</rdfs:label>
  <vocab:location_CITY>Las Vegas</vocab:location_CITY>
</vocab:location>

Integration advantages

Access to SemWeb data may be useful for data quality management (DQM).

Page 10: Using SPARQL and SPIN for Data Quality Management on the Semantic Web

Proposed Architecture

[Architecture diagram. Layers: SPARQL + SPIN Query Layer; Ontology Layer; Data Sources Layer. Components shown: SPIN, OBDQM, Domain Ontology, RDB, Knowledge Base, Linked Data Cloud.]

Page 11: Using SPARQL and SPIN for Data Quality Management on the Semantic Web

Defining Data Quality Rules with SPARQL (1)

Define what is allowed and negate it.

Define what is not allowed.

Negations and regular expressions save manual effort.
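The two rule styles above can be sketched outside SPARQL as well. The following is a minimal Python illustration, not the authors' implementation: the sample subjects, the allowed pattern, and the list of disallowed values are all hypothetical and chosen only to show the difference between negating a whitelist and stating a blacklist.

```python
import re

# Hypothetical sample data: subject -> country literal (not from the slides).
countries = {
    "loc1": "France",
    "loc2": "USA",
    "loc3": "U$A",
    "loc4": "",
}

# Style 1: define what is allowed (letters, dots, spaces), then negate it.
# The pattern is an assumption made for this sketch.
ALLOWED = re.compile(r"[A-Za-z. ]+")
violations_by_negation = sorted(
    subject for subject, value in countries.items()
    if not ALLOWED.fullmatch(value)
)

# Style 2: define what is not allowed directly (hypothetical list).
NOT_ALLOWED = {"", "N/A", "unknown"}
violations_by_blacklist = sorted(
    subject for subject, value in countries.items()
    if value in NOT_ALLOWED
)

print(violations_by_negation)
print(violations_by_blacklist)
```

The negated pattern catches values nobody anticipated (such as "U$A"), which is where the saving in manual effort comes from; the explicit list only catches what was enumerated.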


Page 12: Using SPARQL and SPIN for Data Quality Management on the Semantic Web

Defining Data Quality Rules with SPARQL (2)

The city "Las Vegas" must be in the country "USA".

# Checking functional dependency of {?arg4} with {?arg2}
CONSTRUCT {
    _:b0 a spin:ConstraintViolation .
    _:b0 spin:violationRoot ?this .
    _:b0 spin:violationPath vocab:location_COUNTRY .
}
WHERE {
    ?this vocab:location_CITY "Las Vegas" .
    FILTER (!spl:hasValue(?this, vocab:location_COUNTRY, "USA")) .
}
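To make the logic of this rule concrete, here is a small Python sketch that mimics it on in-memory triples. It is only an approximation under stated assumptions: triples are plain tuples, `has_value` is a hypothetical stand-in for `spl:hasValue`, and the sample subjects other than location 1 are invented.

```python
# Triples as (subject, predicate, object); location/1 mirrors the slide's example,
# the other two subjects are hypothetical.
TRIPLES = [
    ("location/1", "location_CITY", "Las Vegas"),
    ("location/1", "location_COUNTRY", "France"),
    ("location/2", "location_CITY", "Las Vegas"),
    ("location/2", "location_COUNTRY", "USA"),
    ("location/3", "location_CITY", "Paris"),
    ("location/3", "location_COUNTRY", "France"),
]


def has_value(triples, subject, predicate, value):
    """Rough stand-in for spl:hasValue: True if the exact triple is asserted."""
    return (subject, predicate, value) in triples


def las_vegas_violations(triples):
    """Return (violationRoot, violationPath) pairs for Las Vegas locations
    that do not have the country value "USA"."""
    roots = sorted({s for s, p, o in triples
                    if p == "location_CITY" and o == "Las Vegas"})
    return [(s, "location_COUNTRY") for s in roots
            if not has_value(triples, s, "location_COUNTRY", "USA")]


print(las_vegas_violations(TRIPLES))
```

Note that a Las Vegas location with no country at all would also be reported, because the check only asks whether the expected value is present.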


Page 13: Using SPARQL and SPIN for Data Quality Management on the Semantic Web

Defining Data Quality Rules with SPARQL (3)

High reusability of data quality rules through SPIN's SPARQL query templates.

# Checking functional dependency of {?arg4} with {?arg2}
CONSTRUCT {
    _:b0 a spin:ConstraintViolation .
    _:b0 spin:violationRoot ?this .
    _:b0 spin:violationPath ?arg3 .
}
WHERE {
    ?this ?arg1 ?arg2 .
    FILTER (!spl:hasValue(?this, ?arg3, ?arg4)) .
}
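The reuse idea behind the template can be illustrated with a parameterized function. This Python sketch is an analogy only, not SPIN itself: `functional_dependency_template` is a hypothetical name, its four parameters correspond to `?arg1` through `?arg4` in the template above, and the sample triples are invented.

```python
def functional_dependency_template(arg1, arg2, arg3, arg4):
    """Build a reusable check: every subject with (arg1, arg2) must also
    have (arg3, arg4). Returns a function over a list of triples."""
    def check(triples):
        asserted = set(triples)
        roots = sorted({s for s, p, o in asserted if p == arg1 and o == arg2})
        # One (violationRoot, violationPath) pair per offending subject.
        return [(s, arg3) for s in roots if (s, arg3, arg4) not in asserted]
    return check


# Hypothetical sample data.
TRIPLES = [
    ("location/1", "location_CITY", "Las Vegas"),
    ("location/1", "location_COUNTRY", "France"),
    ("location/2", "location_CITY", "Paris"),
    ("location/2", "location_COUNTRY", "France"),
]

# The same template instantiated twice with different arguments.
vegas_rule = functional_dependency_template(
    "location_CITY", "Las Vegas", "location_COUNTRY", "USA")
paris_rule = functional_dependency_template(
    "location_CITY", "Paris", "location_COUNTRY", "France")

print(vegas_rule(TRIPLES))
print(paris_rule(TRIPLES))
```

Each new rule then needs only four argument values rather than a new query, which is the kind of reuse the slide attributes to SPIN templates.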


Page 14: Using SPARQL and SPIN for Data Quality Management on the Semantic Web

Enforced DQ-Rules with SPIN


Application: http://www.topquadrant.com/products/TB_Composer.html#free

Page 15: Using SPARQL and SPIN for Data Quality Management on the Semantic Web

More Data Quality Rule Templates (1)

Data Quality Problem: Missing literal values
SPARQL Query Template:
ASK WHERE {
    ?this ?arg1 "" .
}

Data Quality Problem: Out of range value (lower limit)
SPARQL Query Template:
ASK WHERE {
    ?this ?arg1 ?value .
    FILTER (?value < ?arg2) .
}

Data Quality Problem: Out of range value (upper limit)
SPARQL Query Template:
ASK WHERE {
    ?this ?arg1 ?value .
    FILTER (?value > ?arg2) .
}

[Diagram labels: RDB, RDB, Knowledge Base, Global Ontology]
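As a rough illustration of what these three templates test, the Python sketch below applies the same conditions to in-memory triples. It is a simplification under assumptions: the property `product_WEIGHT`, the limits, and all sample data are hypothetical, and the functions return the offending subjects, whereas an ASK query only reports whether any match exists.

```python
# Hypothetical sample triples as (subject, predicate, object).
TRIPLES = [
    ("location/1", "location_ZIP", ""),
    ("location/2", "location_ZIP", "89101"),
    ("product/1", "product_WEIGHT", -2.0),
    ("product/2", "product_WEIGHT", 1.5),
    ("product/3", "product_WEIGHT", 9000.0),
]


def missing_literals(triples, arg1):
    """Subjects whose value for property arg1 is the empty string."""
    return sorted(s for s, p, o in triples if p == arg1 and o == "")


def below_lower_limit(triples, arg1, arg2):
    """Subjects whose value for property arg1 is below the limit arg2."""
    return sorted(s for s, p, o in triples if p == arg1 and o < arg2)


def above_upper_limit(triples, arg1, arg2):
    """Subjects whose value for property arg1 is above the limit arg2."""
    return sorted(s for s, p, o in triples if p == arg1 and o > arg2)


print(missing_literals(TRIPLES, "location_ZIP"))
print(below_lower_limit(TRIPLES, "product_WEIGHT", 0))
print(above_upper_limit(TRIPLES, "product_WEIGHT", 100))
```

A non-empty result corresponds to the ASK query answering true for at least one instance.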

Page 16: Using SPARQL and SPIN for Data Quality Management on the Semantic Web

More Data Quality Rule Templates (2)

Data Quality Problem: Syntax violation (only letters, commas, dots, and spaces allowed)
SPARQL Query Template:
ASK WHERE {
    ?this ?arg1 ?value .
    FILTER (!regex(str(?value), "^([A-Za-z,. ])*$"))
}

Data Quality Problem: Unique value violation
SPARQL Query Template:
CONSTRUCT {
    _:b0 a spin:ConstraintViolation .
    _:b0 spin:violationRoot ?a .
    _:b0 spin:violationPath ?arg1 .
}
WHERE {
    ?a ?arg1 ?uniqueValue .
    ?b ?arg1 ?uniqueValue .
    FILTER (?a != ?b)
}

[Diagram labels: RDB, RDB, Knowledge Base, Global Ontology]
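The syntax and uniqueness templates can be approximated the same way. In this Python sketch the regular expression is copied from the template, while the function names and sample triples are hypothetical; it is meant as an illustration of the checks, not as the authors' tooling.

```python
import re

# Same character class as the SPARQL template: letters, commas, dots, spaces.
PATTERN = re.compile(r"^([A-Za-z,. ])*$")

# Hypothetical sample triples as (subject, predicate, object).
TRIPLES = [
    ("location/1", "location_CITY", "Las Vegas"),
    ("location/2", "location_CITY", "St. Louis"),
    ("location/3", "location_CITY", "Paris 75"),
    ("location/1", "location_ID", 1),
    ("location/2", "location_ID", 1),
    ("location/3", "location_ID", 2),
]


def syntax_violations(triples, arg1):
    """Subjects whose value for arg1 contains a character outside the pattern."""
    return sorted(s for s, p, o in triples
                  if p == arg1 and not PATTERN.search(str(o)))


def unique_value_violations(triples, arg1):
    """Subjects that share a value for arg1 with at least one other subject."""
    subjects_by_value = {}
    for s, p, o in triples:
        if p == arg1:
            subjects_by_value.setdefault(o, set()).add(s)
    return sorted(s for subjects in subjects_by_value.values()
                  if len(subjects) > 1 for s in subjects)


print(syntax_violations(TRIPLES, "location_CITY"))
print(unique_value_violations(TRIPLES, "location_ID"))
```

Grouping subjects by value avoids comparing every pair of subjects, which is one possible way to sidestep the pairwise join that the `?a != ?b` pattern implies on large data sets.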

Page 17: Using SPARQL and SPIN for Data Quality Management on the Semantic Web

Contributions

• Domain-independent SPARQL query templates for data quality problem identification

• Queries are highly reusable

• Architecture enables the use of Linked Data

• Methodology for data quality management of Semantic Web data

• First approach on how to apply SPIN for DQM


Page 18: Using SPARQL and SPIN for Data Quality Management on the Semantic Web

Limitations & Open Issues

• Knowing the problem does not mean we can solve it

• Homonym / Synonym handling

• Incomplete knowledge may cause constraint violations of clean instances

• Current approach focuses on literal values

• Scalability on large data sets


Page 19: Using SPARQL and SPIN for Data Quality Management on the Semantic Web

Ongoing Extensions

• Extension to a broader set of data quality problems

• Enabling synonym handling and homonym tolerance

• Enhancement of performance

• Calculation of information quality scores

• Integration of Linked Data as trusted reference for data quality management

• Evaluation of the quality of popular Semantic Web data sets on instance level (e.g. Geonames & DBPedia)

• Extension for (semi-)automated data cleansing


Page 20: Using SPARQL and SPIN for Data Quality Management on the Semantic Web


Christian Fuerber

Researcher

E-Business & Web Science Research Group

Werner-Heisenberg-Weg 39

85577 Neubiberg

Germany

skype c.fuerber

email [email protected]

web http://www.unibw.de/ebusiness

homepage http://www.fuerber.com

Paper is available at http://bit.ly/bYes0V