Upload
lucas-lane
View
216
Download
0
Tags:
Embed Size (px)
Citation preview
Can RDB2RDF Tools Feasible Expose Large Science Archives
for Data Integration?
Alasdair J G Gray (University of Glasgow now Manchester)
Norman Gray (Universities of Leicester and Glasgow)
Iadh Ounis (University of Glasgow)
ESWC 2009 – Crete3 June 2009
A.J.G. Gray - ESWC 2009 2
Outline
• Motivation: The Virtual Observatory• Can SPARQL be used to express scientific
queries?• Can existing archives be exposed with
semantic tools?– Can RDB2RDF tools extract large volumes of data?
3 June 2009
A.J.G. Gray - ESWC 2009 3
International Virtual Observatory Alliance
“facilitate the international coordination and collaboration necessary for the development and deployment of the tools, systems and organizational structures necessary to enable the international utilization of astronomical archives as an integrated and interoperating virtual observatory.”
3 June 2009
A.J.G. Gray - ESWC 2009 4
Searching for Brown Dwarfs
• Data sets:– Near Infrared, 2MASS/UK Infrared Deep Sky
Survey– Optical, APMCAT/Sloan Digital Sky Survey
• Complex colour/motion selection criteria• Similar problems
– Halo White Dwarfs
3 June 2009
A.J.G. Gray - ESWC 2009 5
Deep Field Surveys
• Observations in multiple wavelengths– Radio to X-Ray
• Searching for new objects– Galaxies, stars, etc
• Requires correlations across many catalogues– ISO– Hubble– SCUBA– etc
3 June 2009
A.J.G. Gray - ESWC 2009 6
The Problem
Locate and combine relevant data
• Heterogeneous publishers– Archive centres– Research labs
• Heterogeneous data– Relational– XML– Files
Virtual Observatory
3 June 2009
A.J.G. Gray - ESWC 2009 7
A Data Integration Approach
• Heterogeneous sources– Autonomous – Local schemas
• Homogeneous view– Mediated global schema
• Mapping– LAV: local-as-view– GAV: global-as-view
3 June 2009
Global Schema
Query1 Queryn
DB1
Wrapper1
DBk
Wrapperk
DBi
Wrapperi
Mappings
Relies on agreement of a common global
schema
A.J.G. Gray - ESWC 2009 8
P2P Data Integration Approach
• Heterogeneous sources– Autonomous – Local schemas
• Heterogeneous views– Multiple schemas
• Mappings– From sources to common
schema– Between pairs of schema
• Require common integration data model
Can RDF do this?3 June 2009
Schema1
DB1
Wrapper1
DBk
Wrapperk
DBi
Wrapperi
Schemaj
Query1 Queryn
Mappings
A.J.G. Gray - ESWC 2009 9
Resource Description Framework
• W3C standard• Designed as a metadata
data model• Contains semantic
details• Ideal for linking
distributed data• Queried through
SPARQL
3 June 2009
#foundIn
#Sun
The Sun#name
#MilkyWay
Milky Way#name
The Galaxy#name
IAU:Starrdf:type
IAU:BarredSpiral
rdf:type
A.J.G. Gray - ESWC 2009 10
SPARQL
• Declarative query language– Select returned data
• Graph or tuples• Attributes to return
– Describe structure of desired results– Filter data
• W3C standard• Syntactically similar to SQL
– Should be easy for scientists to learn!
3 June 2009
A.J.G. Gray - ESWC 2009 11
Integrating Using RDF
• Data resources– Expose schema and data
as RDF– Need a SPARQL endpoint
• Allows multiple – Access models– Storage models
• Easy to relate data from multiple sources
Relational DB
RDF / Relational
Conversion
XML DB
RDF / XML Conversion
Common Model (RDF)
Mappings
SPARQLquery
3 June 2009
We will focus on exposing relational data sources
A.J.G. Gray - ESWC 2009 12
RDB2RDF
Extract-Transform-Load• Data replicated as RDF
– Data can become stale
• Native SPARQL query support– Limited optimisation
mechanisms
Existing RDF stores• Jena• Seasame
Query-driven Conversion• Data stored as relations
• Native SQL query support– Highly optimised access
methods
• SPARQL queries must be translated
Existing translation systems• D2RQ• SquirrelRDF
3 June 2009
A.J.G. Gray - ESWC 2009 13
System Hypothesis
Is it viable to perform query-driven conversions to facilitate data access from a data model that a scientist is familiar with?
Can RDB2RDF tools feasibly expose large science archives for data integration?
Relational DB
RDB2RDF
XML DB
RDF / XML Conversion
Common Model (RDF)
Mappings
SPARQLquery
3 June 2009
SPARQLquery
A.J.G. Gray - ESWC 2009 14
Astronomical Test Data Set
• SuperCOSMOS Science Archive (SSA)– Data extracted from scans of Schmidt plates– Stored in a relational database– About 4TB of data, detailing 6.4 billion objects– Fairly typical of astronomical data archives
• Schema designed using 20 real queries• Personal version contains
– Data for a specific region of the sky
– About 0.1% of the data– About 500MB
3 June 2009
A.J.G. Gray - ESWC 2009 15
Analysis of Test Data
• Using personal version– About 500MB in size (similar size to related work)
• Organised in 14 Relations– Number of attributes: 2 – 152
• 4 relations with more than 20 attributes
– Number of rows: 3 – 585,560– Two views
• Complex selection criteria in views
3 June 2009
Makes this different from business cases and previous work!
A.J.G. Gray - ESWC 2009 16
Is SPARQL expressive enough?
Can the 20 sample queries be expressed in SPARQL?
3 June 2009
A.J.G. Gray - ESWC 2009 17
Real Science QueriesQuery 5: Find the positions and (B,R,I) magnitudes of all star-like objects within delta mag of 0.2 of the colours of a quasar of redshift 2.5 < z < 3.5SQL:SELECT ra, dec, sCorMagB,
sCorMagR2, sCorMagIFROM ReliableStarsWHERE (sCorMagB-
sCorMagR2 BETWEEN 0.05 AND 0.80) AND (sCorMagR2-sCorMagI BETWEEN -0.17 AND 0.64)
SPARQL:SELECT ?ra ?decl
?sCorMagB ?sCorMagR2 ?sCorMagI
WHERE {<bindings>FILTER (?sCorMagB –
?sCorMagR2 >= 0.05 && ?sCorMagB - ?sCorMagR2 <= 0.80)
FILTER (?sCorMagR2 – ?sCorMagI >= -0.17 && ?sCorMagR2 - ?sCorMagI <= 0.64)}
3 June 2009
A.J.G. Gray - ESWC 2009 18
Analysis of Test Queries
Query Feature Query Numbers
Arithmetic in body 1-5, 7, 9, 12, 13, 15-20
Arithmetic in head 7-9, 12, 13
Ordering 1-8, 10-17, 19, 20
Joins (including self-joins) 12-17, 19
Range functions (e.g. Between, ABS) 2, 3, 5, 8, 12, 13, 15, 17-20
Aggregate functions (including Group By) 7-9, 18
Math functions (e.g. power, log, root) 4, 9, 16
Trigonometry functions 8, 12
Negated sub-query 18, 20
Type casting (e.g. Radians to degrees) 7, 8, 12
Server defined functions 10, 11
3 June 2009
A.J.G. Gray - ESWC 2009 19
Expressivity of SPARQL
Features• Select-project-join• Arithmetic in body• Conjunction and disjunction• Ordering• String matching• External function calls
(extension mechanism)
Limitations• Range shorthands• Arithmetic in head• Math functions• Trigonometry functions• Sub queries• Aggregate functions• Casting
3 June 2009
A.J.G. Gray - ESWC 2009 20
Analysis of Test Queries
Query Feature Query Numbers
Arithmetic in body 1-5, 7, 9, 12, 13, 15-20
Arithmetic in head 7-9, 12, 13
Ordering 1-8, 10-17, 19, 20
Joins (including self-joins) 12-17, 19
Range functions (e.g. Between, ABS) 2, 3, 5, 8, 12, 13, 15, 17-20
Aggregate functions (including Group By) 7-9, 18
Math functions (e.g. power, log, root) 4, 9, 16
Trigonometry functions 8, 12
Negated sub-query 18, 20
Type casting (e.g. radians to degrees) 7, 8, 12
Server defined functions 10, 11
Expressible queries: 1, 2, 3, 5, 6, 14, 15, 17, 19 3 June 2009
A.J.G. Gray - ESWC 2009 21
Can RDB2RDF tools feasibly expose large science archives for data integration?
3 June 2009
A.J.G. Gray - ESWC 2009 22
Experiment
• Time query evaluation– 5 out of 20 queries used– No joins
• Systems compared:– Relational DB (Base line)
• MySQL v5.1.25
– RDB2RDF tools• D2RQ v0.5.2• SquirrelRDF v0.1
– RDF Triple stores• Jena v2.5.6 (SDB)• Sesame v2.1.3 (Native)
3 June 2009
Relational DB
RDB2RDF
SPARQLquery
Triple store
SPARQLquery
Relational DB
SQLquery
A.J.G. Gray - ESWC 2009 23
Experimental Configuration
• 8 identical machines– 64 bit Intel Quad Core Xeon 2.4GHz– 4GB RAM– 100 GB Hard drive– Java 1.6– Linux
• 10 repetitions
3 June 2009
A.J.G. Gray - ESWC 2009 24
Performance Results
3 June 2009
# Query 1 # Query 2 # Query 3 # Query 5 # Query 60
100
200
300
400
500
600
700
800
900
1000
MySQLD2RQSqRDFJenaSesame
ms
3,45
0
5,33
921
,492
485,
932
2,73
3
7,22
9
4,09
01,
307
17,7
93
7,46
819
,984
372,
561
1
A.J.G. Gray - ESWC 2009 25
Conclusions
• SPARQL not expressive enough for real astronomical queries
• RDBMS benefits from 30+ years research– Query optimisation– Indexes
• RDF stores are improving– Require existing data to be replicated
• RDB2RDF tools show promise– Need to exploit relational database
3 June 2009