Using Semantic Web Technology to Integrate Scientific Data
Alasdair J G GrayUniversity of Manchester
A.J.G. Gray - Bolzano-Bozen Seminar 2
Outline
• Motivation: Astronomy & The Virtual Observatory• Data Integration Challenges
1. Locating the relevant data
2. Requesting the required data
3. Understanding the returned data
23 June 2009
Using Semantic Web tools and technology to overcome these challenges
A.J.G. Gray - Bolzano-Bozen Seminar 3
Context: Astronomy
23 June 2009
• Data collected across electromagnetic spectrum
• Analysed within one wavelength
• Data collection is – expensive– time consuming
• Existing data– large quantities– freely available
Image: Wikipedia
A.J.G. Gray - Bolzano-Bozen Seminar 4
Virtual Observatory“facilitate the international
coordination and collaboration necessary for the development and deployment of the tools, systems and organizational structures necessary to enable the international utilization of astronomical archives as an integrated and interoperating virtual observatory.”
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 5
Searching for Brown Dwarfs• Data sets:
– Near Infrared, 2MASS/UK Infrared Deep Sky Survey– Optical, APMCAT/Sloan Digital Sky Survey
• Complex colour/motion selection criteria• Similar problems
– Halo White Dwarfs
23 June 2009Image: AstroGrid
A.J.G. Gray - Bolzano-Bozen Seminar 6
Deep Field Surveys
• Observations in multiple wavelengths– Radio to X-Ray
• Searching for new objects– Galaxies, stars, etc
• Requires correlations across many catalogues– ISO– Hubble– SCUBA– etc
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 7
Locate, retrieve, and interpret relevant data
• Heterogeneous publishers– Archive centres– Research labs
• Heterogeneous data– Relational– XML– Image Files
23 June 2009
Virtual Observatory
Virtual Observatory: The Problems
A.J.G. Gray - Bolzano-Bozen Seminar 8
Virtual Observatory: The Problems
Locate, retrieve, and interpret relevant data
1. Which data sources contain relevant data?
2. How do I query the relevant data sources?
3. How can I interpret/combine/analyse the data?
23 June 2009
Virtual Observatory
A.J.G. Gray - Bolzano-Bozen Seminar 9
Finding relevant data sources
1. Which data sources contain relevant data?
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 10
Which data sources do I use?• VO registry
– 65,000+ entries– Many mirrored services
• VOExplorer– Registry search tool
• Resources tagged with keywords
23 June 2009
- 6df- survey- galaxy- galaxies
- redshift- redshifts- 2mass
A.J.G. Gray - Bolzano-Bozen Seminar 11
Analysis of Registry KeywordsProblems:– Plural/singular– Case– Abbreviations– Different tags– Specificity of tags
Thanks to Sébastien Derriere for this data.
23 June 2009
75 Star 52 Galaxy 37 Stars 36 Galaxies 16 AGN 12 Cluster of
Galaxies 12 Nebulae 11 Planets 10 GRB 10 Globular Clusters 8 Star Cluster 7 Nebula 6 Variable stars 5 Hot stars 5 Pulsar
4 supernova 3 Clusters of
Galaxies 3 Infrared:stars 3 Quasars: general 3 Supernova 3 White dwarfs 3 galaxies 2 Comets 2 Cool stars 2 Extragalactic
Source 2 Extragalactic
objects 2 Infrared: stars 2 Interstellar
medium 2 QSO 2 QSOs 2 SNR 2 Variable Star 2 White Dwarf 2 clusters of
galaxies 2 stars 1 Asteroids 1 BL Lac 1 Be/X-ray binary
stars 1 Binary stars ...
A.J.G. Gray - Bolzano-Bozen Seminar 12
Analysis of Registry Keywords
Problems:– Plural/singular– CaseSolution:(standard IR techniques)
– Stemming• Star & Stars
become Star
– Case normalisation• lowercase
23 June 2009
75 Star 52 Galaxy 37 Stars 36 Galaxies 16 AGN 12 Cluster of
Galaxies 12 Nebulae 11 Planets 10 GRB 10 Globular Clusters 8 Star Cluster 7 Nebula 6 Variable stars 5 Hot stars 5 Pulsar
4 supernova 3 Clusters of
Galaxies 3 Infrared:stars 3 Quasars: general 3 Supernova 3 White dwarfs 3 galaxies 2 Comets 2 Cool stars 2 Extragalactic
Source 2 Extragalactic
objects 2 Infrared: stars 2 Interstellar
medium 2 QSO 2 QSOs 2 SNR 2 Variable Star 2 White Dwarf 2 clusters of
galaxies 2 stars 1 Asteroids 1 BL Lac 1 Be/X-ray binary
stars 1 Binary stars ...
A.J.G. Gray - Bolzano-Bozen Seminar 13
Analysis of Registry Keywords
Problems:– Abbreviations– Different tags– Specificity of tags Solution:
Need to understand semantics!
23 June 2009
75 Star 52 Galaxy 37 Stars 36 Galaxies 16 AGN 12 Cluster of
Galaxies 12 Nebulae 11 Planets 10 GRB 10 Globular Clusters 8 Star Cluster 7 Nebula 6 Variable stars 5 Hot stars 5 Pulsar
4 supernova 3 Clusters of
Galaxies 3 Infrared:stars 3 Quasars: general 3 Supernova 3 White dwarfs 3 galaxies 2 Comets 2 Cool stars 2 Extragalactic
Source 2 Extragalactic
objects 2 Infrared: stars 2 Interstellar
medium 2 QSO 2 QSOs 2 SNR 2 Variable Star 2 White Dwarf 2 clusters of
galaxies 2 stars 1 Asteroids 1 BL Lac 1 Be/X-ray binary
stars 1 Binary stars ...
A.J.G. Gray - Bolzano-Bozen Seminar 14
Semantic Options• Folksonomies
– Keyword tags, freely chosen
• Vocabulary– Controlled list of words with definitions
• Taxonomy– Relationships:
Broader/Narrower/Related
• Thesaurus– Synonyms, antonyms, see also
• Ontology– Formal specification of a shared
conceptualisation – OWL
“Vocabulary” used to cover vocabularies, taxonomies, and thesauri.
23 June 2009
Ima
ge:
Leo
nard
Co
hen
Sea
rch
A.J.G. Gray - Bolzano-Bozen Seminar 15
What is a Vocabulary?
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 16
What is a Controlled Vocabulary?• A set of terms with:
– Label– Synonyms– Definition– Relationships to other
terms:• Broader term• Narrower term• Related term
• Example:– “Spiral galaxy”– “Spiral nebula”– “A galaxy having a spiral structure”– Relationships carrying semantic
information:• BT: “Galaxy”• NT: “Barred spiral galaxy”• RT: “Spiral arm”
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 17
Existing Vocabularies in Astronomy
• Journal Keywords– Developed for tagging
papers– 311 terms– Actively used
• Astronomy Visualization Metadata (AVM)– Tagging images– 217 terms– Actively used
• IAU Thesaurus– Developed for libraries in
1993– 2,551 terms– Never really used
• Unified Content Descriptor (UCD)– Tagging resource data– 473 terms– Actively used
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 18
Common Vocabulary FormatRequirements:
– Provide term identifiers • Unambiguous tagging
– Capture semantic relationships
• Poly-hierarchy structure
– Machine processable• Allows inter-operability• “Machine intelligence”
– Avoids problems of:• Spelling• Case• Plurality problems• Tags
– Automated reasoning:• Interested in all “Supernova”• Items tagged as “1a Supernova”
also returned
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 19
SKOS– W3C standard for sharing vocabularies– Based on RDF
• Semantic model for describing resources– Provides URI for each term– Captures properties of terms– Encodes relationships between terms
• Enables automated reasoning• Standard serialisations• “Looser” semantics than OWL
– Adopted by IVOA as a standard for vocabularies23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 20
Example SKOS Vocabulary Term
Example“Spiral galaxy”“Spiral nebula”“A galaxy having a spiral
structure”Relationships:
BT: “Galaxy”NT: “Barred spiral
galaxy”RT: “Spiral arm”
In turtle notation#spiralGalaxy a concept; prefLabel “Spiral galaxy”@en; altLabel “Spiral nebula”@en; definition “A galaxy having a
spiral structure”@en; broader #galaxy; narrower #barredSpiralGalaxy; related #spiralArm .
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 21
Inter-operable Vocabularies
Which vocabulary should I use?
• One that you know!• Closest match to your
needs• Vocabulary terms related
using mappings– Part of the SKOS standard– One mapping file per pair
of vocabularies
Inter-vocabulary mappings
• Broad match: – more general term
• Narrow match: – more specific term
• Related match: – associated term
• Exact match: – equivalent term
• Close match: – similar but not equivalent term
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 22
Mapping Editor
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 23
Putting it all together• Use vocabulary concepts for
– Tagging (using URI)• Resources in the registry• VOEvent packets
– Searching by vocabulary concept• User keyword search converted to vocabulary URI
• Provides semantic advantages– Reasoning about terms
• Relationships (Intra-vocabulary)• Mappings (Inter-vocabulary)
• Requires a mechanism to convert a string to a concept
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 24
Vocabulary Explorer• Search and browse
vocabularies– Configure
• Vocabularies• Mappings
• Based on Information Retrieval techniques• Matching mechanisms• Ranking results
23 June 2009
http://explicator.dcs.gla.ac.uk/WebVocabularyExplorer/
A.J.G. Gray - Bolzano-Bozen Seminar 25
Search Results
Run BB2 BM25 DFR-BM25
IFB2 In-expB2
In-expC2
InL2 PL2 TF-IDF
Initial 0.93 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95
Query Expansion
0.93 0.94 0.94 0.95 0.95 0.94 0.95 0.94 0.94
Term weighting 1
0.93 0.95 0.95 0.95 0.95 0.95 0.95 0.96 0.96
Term weighting 2
0.93 0.95 0.95 0.96 0.96 0.95 0.96 0.96 0.96
Combined 0.91 0.94 0.94 0.94 0.94 0.93 0.94 0.94 0.94
23 June 2009
•Evaluation over 59 queries•nDCG evaluation model (distinguishes highly relevant/relevant/not relevant)
A.J.G. Gray - Bolzano-Bozen Seminar 26
Vocabulary Explorer Screenshot
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 27
Vocabulary Explorer Screenshot
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 28
Vocabulary Explorer Screenshot
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 29
Vocabulary Explorer Screenshot
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 30
Finding the Right Term: Conclusions
• Vocabularies improve search– Remove ambiguity– Increase precision and recall– Enable
• Reasoning about relevance• Faceted browsing
• Provided tools for working with vocabularies– Reliable search from keyword string to vocabulary term– Exploration of vocabularies– Mapping terms across vocabularies
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 31
Extracting relevant data
2. How do I query the relevant data sources?
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 32
Locate, retrieve, and interpret relevant data
• Heterogeneous publishers– Archive centres– Research labs
• Heterogeneous data– Relational– XML– Image Files
23 June 2009
Virtual Observatory
Virtual Observatory: The Problems
A.J.G. Gray - Bolzano-Bozen Seminar 33
A Data Integration Approach• Heterogeneous sources
– Autonomous – Local schemas
• Homogeneous view– Mediated global schema
• Mapping– LAV: local-as-view– GAV: global-as-view
23 June 2009
Global Schema
Query1 Queryn
DB1
Wrapper1
DBk
Wrapperk
DBi
Wrapperi
Mappings
Relies on agreement of a common global
schema
A.J.G. Gray - Bolzano-Bozen Seminar 34
P2P Data Integration Approach• Heterogeneous sources
– Autonomous – Local schemas
• Heterogeneous views– Multiple schemas
• Mappings– From sources to common
schema– Between pairs of schema
• Require common integration data model
Can RDF do this?23 June 2009
Schema1
DB1
Wrapper1
DBk
Wrapperk
DBi
Wrapperi
Schemaj
Query1 Queryn
Mappings
A.J.G. Gray - Bolzano-Bozen Seminar 35
Resource Description Framework
• W3C standard• Designed as a metadata
data model• Contains semantic details• Ideal for linking
distributed data• Queried through SPARQL
23 June 2009
#foundIn
#Sun
The Sun#name
#MilkyWay
Milky Way#name
The Galaxy#name
IAU:Starrdf:type
IAU:BarredSpiral
rdf:type
A.J.G. Gray - Bolzano-Bozen Seminar 36
SPARQL
• Declarative query language– Select returned data (projection)
• Graph or tuples• Attributes to return
– Describe structure of desired results– Filter data (selection)
• W3C standard• Syntactically similar to SQL
– Should be easy for astronomers to learn!23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 37
Integrating Using RDF• Data resources
– Expose schema and data as RDF
– Need a SPARQL endpoint
• Allows multiple – Access models– Storage models
• Easy to relate data from multiple sources
23 June 2009
Relational DB
RDF / Relational Conversion
XML DB
RDF / XML Conversion
Common Model (RDF)
Mappings
SPARQLquery
We will focus on exposing relational data
sources
A.J.G. Gray - Bolzano-Bozen Seminar 38
RDB2RDF: Two Approaches
Extract-Transform-Load• Data replicated as RDF
– Data can become stale
• Native SPARQL query support– Limited optimisation
mechanisms
Existing RDF stores• Jena• Seasame
Query-driven Conversion• Data stored as relations
• Native SQL query support– Highly optimised access methods
• SPARQL queries must be translated
Existing translation systems• D2RQ• SquirrelRDF
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 39
System Test Hypothesis
Is it viable to perform query-driven conversions to facilitate data access from a data model that an astronomer is familiar with?
Can RDB2RDF tools feasibly expose large science archives for data integration?
23 June 2009
Relational DB
RDB2RDF
XML DB
RDF / XML Conversion
Common Model (RDF)
Mappings
SPARQLquery
SPARQLquery
A.J.G. Gray - Bolzano-Bozen Seminar 40
Astronomical Test Data Set
• SuperCOSMOS Science Archive (SSA)– Data extracted from scans of Schmidt plates– Stored in a relational database– About 4TB of data, detailing 6.4 billion objects– Fairly typical of astronomical data archives
• Schema designed using 20 real queries• Personal version contains
– Data for a specific region of the sky– About 0.1% of the data– About 500MB
23 June 2009
Ima
ge:
Su
perC
OS
MO
S S
cien
ce A
rchi
ve
A.J.G. Gray - Bolzano-Bozen Seminar 41
Analysis of Test Data
• Using personal version– About 500MB in size (similar size to related work)
• Organised in 14 Relations– Number of attributes: 2 – 152
• 4 relations with more than 20 attributes
– Number of rows: 3 – 585,560– Two views
• Complex selection criteria in views
23 June 2009
Makes this different from business cases and previous work!
A.J.G. Gray - Bolzano-Bozen Seminar 42
Is SPARQL expressive enough?
Can the 20 sample queries be expressed in SPARQL?
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 43
Real Science QueriesQuery 5: Find the positions and (B,R,I) magnitudes of all star-like objects within delta mag of 0.2 of the colours of a quasar of redshift 2.5 < z < 3.5SQL:SELECT ra, dec, sCorMagB,
sCorMagR2, sCorMagIFROM ReliableStarsWHERE (sCorMagB-sCorMagR2
BETWEEN 0.05 AND 0.80) AND (sCorMagR2-sCorMagI BETWEEN -0.17 AND 0.64)
SPARQL:SELECT ?ra ?decl ?sCorMagB
?sCorMagR2 ?sCorMagIWHERE {…<bindings>…FILTER (?sCorMagB –
?sCorMagR2 >= 0.05 && ?sCorMagB - ?sCorMagR2 <= 0.80)
FILTER (?sCorMagR2 – ?sCorMagI >= -0.17 && ?sCorMagR2 - ?sCorMagI <= 0.64)}
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 44
Analysis of Test QueriesQuery Feature Query Numbers
Arithmetic in body 1-5, 7, 9, 12, 13, 15-20
Arithmetic in head 7-9, 12, 13
Ordering 1-8, 10-17, 19, 20
Joins (including self-joins) 12-17, 19
Range functions (e.g. Between, ABS) 2, 3, 5, 8, 12, 13, 15, 17-20
Aggregate functions (including Group By) 7-9, 18
Math functions (e.g. power, log, root) 4, 9, 16
Trigonometry functions 8, 12
Negated sub-query 18, 20
Type casting (e.g. Radians to degrees) 7, 8, 12
Server defined functions 10, 11
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 45
Expressivity of SPARQL
Features• Select-project-join• Arithmetic in body• Conjunction and disjunction• Ordering• String matching• External function calls
(extension mechanism)
Limitations• Range shorthands• Arithmetic in head• Math functions• Trigonometry functions• Sub queries• Aggregate functions• Casting
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 46
Analysis of Test QueriesQuery Feature Query Numbers
Arithmetic in body 1-5, 7, 9, 12, 13, 15-20
Arithmetic in head 7-9, 12, 13
Ordering 1-8, 10-17, 19, 20
Joins (including self-joins) 12-17, 19
Range functions (e.g. Between, ABS) 2, 3, 5, 8, 12, 13, 15, 17-20
Aggregate functions (including Group By) 7-9, 18
Math functions (e.g. power, log, root) 4, 9, 16
Trigonometry functions 8, 12
Negated sub-query 18, 20
Type casting (e.g. radians to degrees) 7, 8, 12
Server defined functions 10, 11
23 June 2009
Expressible queries: 1, 2, 3, 5, 6, 14, 15, 17, 19
A.J.G. Gray - Bolzano-Bozen Seminar 47
Can RDB2RDF tools feasibly expose large science archives for data integration?
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 48
Experiment• Time query evaluation
– 5 out of 20 queries used– No joins
• Systems compared:– Relational DB (Base line)
• MySQL v5.1.25
– RDB2RDF tools• D2RQ v0.5.2• SquirrelRDF v0.1
– RDF Triple stores• Jena v2.5.6 (SDB)• Sesame v2.1.3 (Native)
23 June 2009
Relational DB
RDB2RDF
SPARQLquery
Triple store
SPARQLquery
Relational DB
SQLquery
A.J.G. Gray - Bolzano-Bozen Seminar 49
Experimental Configuration
• 8 identical machines– 64 bit Intel Quad Core Xeon 2.4GHz– 4GB RAM– 100 GB Hard drive– Java 1.6– Linux
• 10 repetitions
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 50
Performance Results
23 June 2009
# Query 1 # Query 2 # Query 3 # Query 5 # Query 60
100
200
300
400
500
600
700
800
900
1000
MySQLD2RQSqRDFJenaSesame
ms
3,45
0
5,33
9
21,4
92
485
,93
2
2,73
3
7,22
9
4,09
01,
307
17,7
93
7,46
8
19,9
84
372
,56
1
1
A.J.G. Gray - Bolzano-Bozen Seminar 51
The Show Stopper: Query Translation
• Each bound variable resulted in a self-join– RDBMS cannot optimize for this– RDBMS perform badly with self-joins
• Each row retrieved with a separate query– 1 query becomes n queries,
where n is cardinality of relation• Predicate selection in RDB2RDF tool
– No optimization possible23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 52
Extracting Relevant Data: Conclusions
• SPARQL not expressive enough for real (astronomy) queries
• RDBMS benefits from 30+ years research– Query optimisation– Indexes
• RDF stores are improving– Require existing data to be replicated
• RDB2RDF tools show promise– Need to exploit relational database
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 53
Can RDB2RDF Tools Feasible Expose Large Science Archives for Data Integration?
Not currently!
More work needed on query translation…
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar 54
Conclusions & Future WorkTraditional Integration Challenges
1. Locating data
2. Extracting relevant data
3. Understanding data
Semantic Web Solution• SKOS Vocabularies
– Removes ambiguity– Enables limited machine
understanding
• RDB2RDF Tools– Requires improved query
translation
• Semantic model mappings– Follow “chains” of mappings– Relies on RDB2RDF work
23 June 2009