Using SPARQL and SPIN for
Data Quality Management
on the Semantic Web
Christian Fürber / Martin Hepp
[email protected], [email protected]
Presentation @ BIS
May 4th 2010
Vision of the Semantic Web
Publishing data on the
web in a meaningful way for
more automation,
better integration,
and higher reusability of data.
C. Fürber, M. Hepp: Using SPARQL and SPIN for Data
Quality Management on the Semantic Web 2
© Hanspeter Graf / www.pixelio.de
Growth of Data:
Well on Track…
Reference: http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2009-03-05.html
Retrieving information
Building smart SemWeb apps
…but what if the published data was of poor quality?
Get a giant camcorder from Amazon!
Using Poor Data is Costly
Without quality checks, your SemWeb apps will take this data seriously and…
…get an oversized shipping package with expensive postage,
…and waste transportation capacity.
Is there any way to avoid data quality disasters?
Yes, if we know about data quality problems before anything bad happens!
A giant camcorder on the road!
The Impact of Poor Data Quality
• Poor Decisions
• Failed Business Processes
• Failed Projects
• Higher Costs
• Missed Revenues
• Lower Product / Service Quality
• Lower Stakeholder Satisfaction
• Fatal Disasters
Data Quality is a Key Bottleneck of the Semantic Web

<vocab:location rdf:about="http://www.stockdbdemo2.com/stockdblocation/1">
  <vocab:location_ZIP></vocab:location_ZIP>
  <vocab:location_STREETNO></vocab:location_STREETNO>
  <vocab:location_COUNTRY>France</vocab:location_COUNTRY>
  <vocab:location_ID rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</vocab:location_ID>
  <vocab:location_STREET>8489 Strong St.</vocab:location_STREET>
  <vocab:location_STATE>NV</vocab:location_STATE>
  <rdfs:label>location #1</rdfs:label>
  <vocab:location_CITY>Las Vegas</vocab:location_CITY>
</vocab:location>

Problems annotated on this record:
• Missing literal values
• Functional dependency violation
• Syntax violation
• Unique value violation
Our Approach
Identification of data quality problems at the
instance level of Semantic Web sources,
solely with Semantic Web technologies.
• Integration advantages
• Access to Semantic Web data may be useful for data quality management (DQM)
Proposed Architecture
[Figure: three-layer architecture. Layers: SPARQL + SPIN Query Layer, Ontology Layer, Data Sources Layer. Components shown: SPIN, OBDQM, Domain Ontology, Knowledge Base, RDB, Linked Data Cloud.]
Defining Data Quality Rules with
SPARQL (1)
• Define what is allowed and negate it.
• Define what is not allowed.
• Negations and regular expressions save manual effort.
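
The two rule styles above can be sketched outside SPARQL as well. The following is a minimal Python illustration, not the authors' implementation: the triples, the property names `zip` and `country`, the five-digit ZIP pattern, and the list of forbidden placeholder values are all assumptions made up for this example.

```python
import re

# Hypothetical instance data as (subject, property, value) triples.
TRIPLES = [
    ("loc:1", "zip", "89101"),
    ("loc:2", "zip", "8910A"),
    ("loc:3", "zip", ""),
    ("loc:4", "country", "n/a"),
]

# Style 1: define what is allowed, then negate it.
# Assumed rule for illustration: a ZIP code is exactly five digits.
ALLOWED_ZIP = re.compile(r"[0-9]{5}")


def violates_allowed(value):
    """True if the value does NOT match the allowed pattern."""
    return ALLOWED_ZIP.fullmatch(value) is None


# Style 2: define what is not allowed.
# Assumed placeholder values that should never appear as a country.
FORBIDDEN_VALUES = {"n/a", "unknown", "-"}


def violates_forbidden(value):
    """True if the value is one of the explicitly forbidden values."""
    return value.lower() in FORBIDDEN_VALUES


def find_violations(triples, prop, is_violation):
    """Return the subjects whose value for `prop` breaks the rule."""
    return [s for s, p, v in triples if p == prop and is_violation(v)]


zip_violations = find_violations(TRIPLES, "zip", violates_allowed)
country_violations = find_violations(TRIPLES, "country", violates_forbidden)
print(zip_violations, country_violations)
```

The negated pattern catches both the malformed and the empty ZIP code with a single rule, which is the manual effort the slide refers to; the forbidden list has to enumerate every bad value explicitly.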
Defining Data Quality Rules with
SPARQL (2)
The city "Las Vegas" must be in the country "USA".

# Checking functional dependency: city "Las Vegas" requires country "USA"
CONSTRUCT {
    _:b0 a spin:ConstraintViolation .
    _:b0 spin:violationRoot ?this .
    _:b0 spin:violationPath vocab:location_COUNTRY .
}
WHERE {
    ?this vocab:location_CITY "Las Vegas" .
    FILTER (!spl:hasValue(?this, vocab:location_COUNTRY, "USA")) .
}
Defining Data Quality Rules with
SPARQL (3)
High reusability of data quality rules through SPIN's
SPARQL query templates.

# Checking functional dependency of {?arg4} with {?arg2}
# ?arg1 / ?arg2: determining property and its value (e.g. city = "Las Vegas")
# ?arg3 / ?arg4: dependent property and its required value (e.g. country = "USA")
CONSTRUCT {
    _:b0 a spin:ConstraintViolation .
    _:b0 spin:violationRoot ?this .
    _:b0 spin:violationPath ?arg3 .
}
WHERE {
    ?this ?arg1 ?arg2 .
    FILTER (!spl:hasValue(?this, ?arg3, ?arg4)) .
}
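
To make the role of the four template arguments concrete, here is a rough Python sketch of the same check over an in-memory list of triples. It only mirrors the logic of the template; it does not use SPIN or a SPARQL engine. The `Triple` type, the function name, and the records `location/2` and `location/3` are hypothetical additions for illustration; only `location/1` comes from the slides.

```python
from typing import Iterable, List, NamedTuple


class Triple(NamedTuple):
    subject: str
    prop: str
    value: str


def functional_dependency_violations(
    triples: Iterable[Triple], arg1: str, arg2: str, arg3: str, arg4: str
) -> List[str]:
    """Subjects that have arg1 = arg2 but lack arg3 = arg4."""
    triples = list(triples)
    # Mirrors the WHERE pattern: ?this ?arg1 ?arg2
    candidates = [t.subject for t in triples
                  if t.prop == arg1 and t.value == arg2]
    # Mirrors spl:hasValue(?this, ?arg3, ?arg4)
    satisfied = {t.subject for t in triples
                 if t.prop == arg3 and t.value == arg4}
    # Mirrors the negated FILTER: keep candidates without the required value
    return [s for s in candidates if s not in satisfied]


DATA = [
    Triple("location/1", "location_CITY", "Las Vegas"),
    Triple("location/1", "location_COUNTRY", "France"),
    # Hypothetical extra records, not taken from the slides:
    Triple("location/2", "location_CITY", "Las Vegas"),
    Triple("location/2", "location_COUNTRY", "USA"),
    Triple("location/3", "location_CITY", "Paris"),
    Triple("location/3", "location_COUNTRY", "France"),
]

violations = functional_dependency_violations(
    DATA, "location_CITY", "Las Vegas", "location_COUNTRY", "USA"
)
print(violations)
```

As in the SPARQL template, a record that has the city but no country at all is also reported, because the check asks for the presence of the required value rather than the presence of a wrong one.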
Enforced DQ-Rules with SPIN
Application: http://www.topquadrant.com/products/TB_Composer.html#free
More Data Quality Rule Templates (1)

Data Quality Problem: Missing literal values
SPARQL Query Template:
    ASK WHERE {
        ?this ?arg1 "" .
    }

Data Quality Problem: Out of range value (lower limit)
SPARQL Query Template:
    ASK WHERE {
        ?this ?arg1 ?value .
        FILTER (?value < ?arg2) .
    }

Data Quality Problem: Out of range value (upper limit)
SPARQL Query Template:
    ASK WHERE {
        ?this ?arg1 ?value .
        FILTER (?value > ?arg2) .
    }
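
For readers who want to try these three checks without a SPARQL engine, the following Python sketch applies the same conditions to a list of tuples. It is an illustration under assumptions: the function names, the `weight_kg` property, and the `product/...` records are invented for this example; only the empty ZIP and street number of `location/1` come from the slides.

```python
def missing_literal(triples, arg1):
    """Subjects whose value for property arg1 is the empty string."""
    return [s for s, p, v in triples if p == arg1 and v == ""]


def out_of_range(triples, arg1, lower=None, upper=None):
    """Subjects whose numeric value for arg1 is below lower or above upper."""
    flagged = []
    for s, p, v in triples:
        # Only numeric values of the requested property are range-checked.
        if p != arg1 or not isinstance(v, (int, float)):
            continue
        below = lower is not None and v < lower
        above = upper is not None and v > upper
        if below or above:
            flagged.append(s)
    return flagged


DATA = [
    ("location/1", "location_ZIP", ""),
    ("location/1", "location_STREETNO", ""),
    ("location/1", "location_CITY", "Las Vegas"),
    # Hypothetical product weights, not taken from the slides:
    ("product/1", "weight_kg", 0.4),
    ("product/2", "weight_kg", -3),
    ("product/3", "weight_kg", 950),
]

missing_zip = missing_literal(DATA, "location_ZIP")
too_light = out_of_range(DATA, "weight_kg", lower=0)
too_heavy = out_of_range(DATA, "weight_kg", upper=100)
print(missing_zip, too_light, too_heavy)
```

The lower and upper limits are kept as separate optional parameters to match the two separate templates; passing both gives a combined range check.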
More Data Quality Rule Templates (2)

Data Quality Problem: Syntax violation (only letters, commas, dots and spaces allowed)
SPARQL Query Template:
    ASK WHERE {
        ?this ?arg1 ?value .
        FILTER (!regex(str(?value), "^([A-Za-z,. ])*$"))
    }

Data Quality Problem: Unique value violation
SPARQL Query Template:
    CONSTRUCT {
        _:b0 a spin:ConstraintViolation .
        _:b0 spin:violationRoot ?a .
        _:b0 spin:violationPath ?arg1 .
    }
    WHERE {
        ?a ?arg1 ?uniqueValue .
        ?b ?arg1 ?uniqueValue .
        FILTER (?a != ?b)
    }
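
The unique value check can be pictured as grouping subjects by value and reporting every group with more than one member. The Python sketch below shows that idea; it is not the SPIN implementation, and the function name and the records `location/2` and `location/3` are assumptions for illustration.

```python
from collections import defaultdict


def unique_value_violations(triples, arg1):
    """Subjects that share a value of property arg1 with another subject."""
    subjects_by_value = defaultdict(set)
    for s, p, v in triples:
        if p == arg1:
            subjects_by_value[v].add(s)
    flagged = set()
    for subjects in subjects_by_value.values():
        # Two or more distinct subjects with the same value: ?a != ?b
        if len(subjects) > 1:
            flagged |= subjects
    return sorted(flagged)


DATA = [
    ("location/1", "location_ID", 1),
    # Hypothetical extra records, not taken from the slides:
    ("location/2", "location_ID", 2),
    ("location/3", "location_ID", 1),
]

duplicates = unique_value_violations(DATA, "location_ID")
print(duplicates)
```

Unlike the CONSTRUCT template, which yields one violation per matching pair of subjects, this sketch reports each affected subject once.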
Contributions
• Domain-independent SPARQL query
templates for data quality problem identification
• Queries are highly reusable
• Architecture enables the use of Linked Data
• Methodology for data quality management of
Semantic Web data
• First approach on how to apply SPIN for DQM
Limitations & Open Issues
• Knowing the problem does not mean we can
solve it
• Homonym / Synonym handling
• Incomplete knowledge may cause constraint
violations of clean instances
• Current approach focuses on literal values
• Scalability on large data sets
Ongoing Extensions
• Extension to a broader set of data quality problems
• Enabling synonym handling and homonym tolerance
• Enhancement of performance
• Calculation of information quality scores
• Integration of Linked Data as trusted reference for
data quality management
• Evaluation of the quality of popular Semantic Web data sets
on instance level (e.g. GeoNames & DBpedia)
• Extension for (semi-)automated data cleansing
Christian Fuerber
Researcher
E-Business & Web Science Research Group
Werner-Heisenberg-Weg 39
85577 Neubiberg
Germany
skype c.fuerber
email [email protected]
web http://www.unibw.de/ebusiness
homepage http://www.fuerber.com
Paper is available at http://bit.ly/bYes0V
References & Links
LOD-Cloud: http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2009-03-05.html
D2RQ: http://www4.wiwiss.fu-berlin.de/bizer/d2rq/spec/
SPIN: http://spinrdf.org/
TopBraid Composer Free Edition: http://www.topquadrant.com/products/TB_Composer.html#free