David Shotton
Image BioInformatics Research Group
Department of Zoology
University of Oxford, UK
http://ibrg.zoo.ox.ac.uk
Data integration – options using the Semantic Web
© David Shotton, 2010 Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence
e-mail: [email protected]
Workshop on Linked Data and the Semantic Web
OUCS, 29/03/2010
Outline of the talk
The move towards open data
Integrating open data published in heterogeneous formats across distributed databases
Query distribution across relational databases
Data warehousing
Open linked data
Data webs
The benefits of published data
The benefits of open data publication:
review and validation by others,
re-use in other contexts, and
integration with other data to create a new greater whole
Governments, research funding agencies, publishers and (increasingly) researchers are agreed that the results of publicly funded research should be made publicly available
Barriers to data sharing
(From RIN Report)
Ethical constraints
IPR issues
Concerns about misuse
Concerns about data ownership
“Above all, as researchers, we see data as a critical part of our ‘intellectual capital’, generated through a considerable investment of time, effort and skill.
“In a competitive environment, our willingness to share is therefore subject to reservations, in particular as to the control we have over the manner and timing of sharing.
“Any sharing or publishing environment must therefore have secure embargo procedures, such that we can state at the outset when in the future we are happy for the data to be published, safe in the knowledge that our wishes will be honoured.”
Desiderata to promote data sharing
A combination of eight social and technical factors:
Personal attribution and credit for data publication
An established mechanism for citation of datasets
A generic minimum metadata standard for datasets
A tool to permit the easy creation of well-structured metadata
A standard mechanism for packaging data files and their metadata
Appropriate repositories to archive and publish research datasets
Reciprocal citation links between datasets and research articles
Mechanisms for quality control of data publications
Requirements for usefulness: structured research datasets
For research datasets to be maximally useful, they have to be:
Saved in machine-processable form, in conformity with appropriate Web standards (e.g. XML, RDF, OWL)
Published and made freely accessible on the Web
Referenced by globally unique and resolvable identifiers (e.g. DOIs)
Accompanied by useful metadata based upon minimal information standards and ontologies, including provenance information:
by whom, when, where and why the data were recorded
by whom the research was funded
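As a sketch, such provenance metadata might be expressed in RDF (Turtle) using the real Dublin Core terms and FOAF vocabularies; the dataset URI, names and values below are purely illustrative:

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical description of a research dataset, including provenance
<http://example.org/dataset/42>
    dcterms:title       "In situ gene expression images, Drosophila testis" ;
    dcterms:creator     [ a foaf:Person ; foaf:name "A. Researcher" ] ;   # by whom
    dcterms:created     "2009-06-01"^^xsd:date ;                          # when
    dcterms:spatial     "Oxford, UK" ;                                    # where
    dcterms:description "Recorded to map testis-specific transcription" ; # why
    dcterms:contributor "Funded by an (illustrative) research council grant" .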
FlyTED
FlyTED (the Drosophila Testis Gene Expression Database) is our specialist database of gene expression images and their metadata
http://www.fly-ted.org
The challenges of integrating bioinformatics data
There are now over 1200 bioinformatics databases
Combining information from various sources is a time-consuming process
Different databases have different and often highly sophisticated user interfaces
They also differ markedly in their underlying data representations
Nucleic Acids Research Database Collection
[Figure: bar chart of the number of bioinformatics databases per year, 2003–2010, rising to over 1,200 by 2010]
Cochrane GR, Galperin MY (2010) Nucleic Acids Research 38:D1-D4
Some people spend more time searching for information per week than they spend actually using the data !
Data integration for many researchers amounts to nothing more sophisticated than cutting and pasting into a Word document !!
Different approaches to data integration
Sequential integration into workflows, e.g. Taverna, myGrid
Synchronous integration – four quite different approaches:
Distributed querying
Data warehousing
Open Linked Data
The ‘pure’ Semantic Web approach
Crawl the entire web of linked RDF data to find what you want
Data Webs
‘Bespoke’ data webs for distinct information domains
Convert core metadata from contributing resources into RDF for querying
Then link users back to original sources for more complete details
Database integration – the heavyweight approach
OGSA-DAI(Open Grid Services Architecture – Database Access and Integration)
Mechanism for distributing SQL queries over geographically separate conventional relational databases
Heavy investment from UK e-Science budget
Large development team
Heavyweight software
Not used by researchers!
(The following 3 slides are taken from the 2006 OGSA-DAI Architecture document at http://www.ogsadai.org.uk/documentation/presentations/ggf16/)
Data warehousing
All the data from contributing resources is copied into a central repository
Incoming data are normalized to the central data model
Queries are then efficiently made against this single resource
Incurs a large maintenance task in keeping the warehoused data current, in the face of ongoing ‘churn’ in the contributing databases
In our field of Drosophila gene expression, the data warehouse FlyMine provides an integrated database for Drosophila and Anopheles genomics
A customised instance of the generic data warehouse platform InterMine
Its powerful user interface can be used to construct arbitrary queries over the FlyMine data model, but it can be hard to understand without an informatics background, and requires time to master
http://www.flymine.org/
Open Linked Data
Tim Berners-Lee’s four rules for open linked data (from http://www.w3.org/DesignIssues/LinkedData.html):
Use URIs as names for things
Use HTTP URIs, so that people can look up those names
When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
Include links to other URIs, so that they can discover more things
Open linked data facilitates web-scale distributed publication of data, without having to go through any kind of central authority, fulfilling Tim’s original dream of the web
By using common ontology terms to describe things, it becomes easy to navigate around related sets of information by following links, and to pull together information from unrelated sources in novel ways
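A minimal illustration of this linking, with hypothetical URIs: two independent sources describe the same gene, and because they share one URI a client can merge their descriptions simply by following links:

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/vocab/> .

# Source A (a hypothetical gene database) describes the gene
<http://genes.example.org/gene/CG17736>
    dcterms:title "schumacher-levy" .

# Source B (a hypothetical expression database) refers to the SAME URI,
# so its expression record links directly into source A's description
<http://expression.example.org/record/99>
    ex:expressedGene <http://genes.example.org/gene/CG17736> .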
Jeni Tennison’s blog, Monday 22 March 2010
“Link following is all very well, but querying provides much more potential power.
What’s been very unclear to me is how this distributed publication of data can be married with the use of SPARQL for querying.
After all, SPARQL doesn’t (in its present form) support federated search, so to use SPARQL over all this distributed linked data, it sounds like you really need a central triplestore that contains everything you might want to query.
SPARQL queries operate over a default graph (or dataset) and a set of supplementary named graphs. For efficiency, these need to be pulled into a single triplestore.
I think the answer (for the moment at least) is to forget about querying the entire web of linked data and focus on supporting the easy creation of targeted, curated, triplestores that each incorporate a useful subset of the linked data that’s out there. ”
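As a sketch of what querying such a targeted, curated triplestore might look like: two sources are loaded as named graphs in one store, and a single SPARQL query joins across them (the graph URIs and the `ex:` vocabulary are illustrative, not real):

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/vocab/>

# Join gene labels from one source graph with expression images from another,
# all within a single curated triplestore
SELECT ?gene ?label ?image
FROM NAMED <http://example.org/graphs/source-a>
FROM NAMED <http://example.org/graphs/source-b>
WHERE {
  GRAPH <http://example.org/graphs/source-a> { ?gene  rdfs:label ?label . }
  GRAPH <http://example.org/graphs/source-b> { ?image ex:depictsExpressionOf ?gene . }
}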
Enter the data web . . .
The data web concept
A data web is designed to integrate data published in a distributed fashion by independent data providers using their own Web servers
The data may either be in RDF or in conventional relational databases for which SPARQL endpoints are created
Bespoke data webs are created, one for each specific domain of interest, integrating information from a number of ‘subscribing’ resources
Core RDF metadata are made available from each resource, and are ideally, but not necessarily, integrated to a common data web schema
SPARQL querying of the data web permits discovery of the existence of related information across the subscribing resources
If users require further details on a particular item than provided by the metadata within the data web, links are provided back to the full dataset that resides unmodified in the original database
(Data webs were first proposed at the symposium Semantic Interoperability for e-Research in the Sciences, Arts and Humanities, Imperial College, March 30th 2006; and first realized for OpenFlyData in October 2008)
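As a sketch, a data web query can return both the core metadata and a link back to the original record, here via the real rdfs:seeAlso property (the gene label and URIs are illustrative):

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Find a gene by name in the data web's core metadata, together with a
# dereferencable link back to the full record in the source database
SELECT ?resource ?sourcePage
WHERE {
  ?resource rdfs:label   "schuy" ;
            rdfs:seeAlso ?sourcePage .
}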
Problems to be solved when creating a data web
Syntactic differences between data sources
Data are stored in incompatible formats within different DBMSs
Solved by converting all data to RDF, accessible using SPARQL
Semantic differences between data sources (class names)
One person’s “author” is another person’s “creator”
Solved by mapping to a common data schema or ontology
The co-reference problem
The same entity – for example a particular gene – is known by different names in different databases
(e.g. schuy, schumacher-levy or CG17736)
Solved by creating a co-reference service to disambiguate synonyms and homonyms
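Such a co-reference service might publish, for each gene, a bundle of its synonyms and equivalent URIs, for example (illustrative URIs, using the real SKOS and OWL vocabularies):

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# A co-reference bundle asserting that three database names
# all refer to the same Drosophila gene
<http://coref.example.org/gene/schuy>
    skos:prefLabel "schuy" ;
    skos:altLabel  "schumacher-levy" , "CG17736" ;
    owl:sameAs     <http://genedb-a.example.org/gene/CG17736> ,
                   <http://genedb-b.example.org/gene/schumacher-levy> .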
Technical approaches for our creation of data webs
We use the Web as the platform, and the browser as the user interface
We use W3C Semantic Web tools and standards:
All identifiers are mapped to URIs
RDF (the Resource Description Framework, a standard for describing data on the Web) is used as the standard format for describing data
The RDF query language SPARQL is used for data web queries
A SPARQL web service endpoint is made on each data resource, or on an integration of these
By presenting standard relational database content as RDF, SPARQL queries of the data web can be used to access ‘legacy’ data
We employ open source software components to build our services, loosely coupled by RESTful Web services
Two methods of creating a SPARQL endpoint
Creation of a local RDF triplestore that caches selected source metadata, which are SPARQLed– “RDF caching”
Use of software to dynamically rewrite the SPARQL query into the database query language (SQL) – “SPARQL virtualization”
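A hedged sketch of SPARQL virtualization, as performed by tools such as D2RQ; the `ex:` vocabulary and the relational table and column names are hypothetical:

```sparql
PREFIX ex: <http://example.org/vocab/>

# Incoming SPARQL query against the virtual RDF graph
SELECT ?name
WHERE {
  ?gene a ex:Gene ;
        ex:name ?name .
}

# ...which the virtualization layer might rewrite, on the fly, into SQL
# against the underlying relational database (table name hypothetical):
#   SELECT name FROM gene;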
OpenFlyData
A data web integrating heterogeneous Drosophila data from distributed bioinformatics databases
Created during the JISC FlyWeb Project, October 2007-May 2009
First OpenFlyData release: Oct 2008; full functionality: April 2009
Displays information from four Drosophila gene expression databases
About 180 million triples in present form http://openflydata.org/
Miles A, Zhao J, Klyne G, White-Cooper H, Shotton D (2010). OpenFlyData: An exemplar data web integrating gene expression data on the fruit fly Drosophila melanogaster. Journal of Biomedical Informatics (accepted for publication)
Preprint at http://imageweb.zoo.ox.ac.uk/pub/2009/publications/Miles_et_al_OpenFlyData_paper/
CLAROS
A data web integrating heterogeneous classical art data from distributed sources
Created during the OU CLAROS Project, Nov 2008-Oct 2009
First CLAROS release: Aug 2009; still under development
Integrates information from four major European classics resources
About 10 million triples in present form http://www.clarosnet.org/
Kurtz D, Parker G, Shotton D, Klyne G, Schroff F, Zisserman A and Wilks Y (2009). CLAROS – bringing classical art to a global public. Proc. IEEE e-Science Conference, Oxford, 9-11 December 2009.
Preprint at http://imageweb.zoo.ox.ac.uk/pub/2009/publications/Kurtz_Parker_Shotton_et_al-IEEE_CLAROS_paper.pdf
OpenCitations.net
CiTO, the Citation Typing Ontology (http://purl.org/net/cito/) enables the publication of citation information as open linked data
By citations, I mean the statements of the type
<http://example1.com/citingwork> cito:cites <http://example2.com/citedwork> .
I plan to apply to the JISC for start-up funding to set up OpenCitations.net, a public RDF triplestore of biomedical literature citations
Open access journals from UK Pubmed Central and Biomed Central will be used as an initial source of citations, plus some 3 million citations already mined from PDFs of life science articles harvested from the web
Availability of such open linked data for the biomedical sciences would be invaluable for examination of citation networks and other phenomena, such as tracking citation bias and the conversion of hypotheses into ‘facts’ simply by the process of citation
(Greenberg SA (2009): How citation distortions create unfounded authority: analysis of a citation network. British Medical Journal 339:b2680, doi:10.1136/bmj.b2680)
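As a sketch, such a triplestore could then be queried for the most-cited works with a SPARQL 1.1 aggregate (the endpoint and its contents are assumed, not yet real):

```sparql
PREFIX cito: <http://purl.org/net/cito/>

# Count incoming citations per cited work, most-cited first:
# a first step towards analysing citation networks and citation bias
SELECT ?cited (COUNT(?citing) AS ?citationCount)
WHERE {
  ?citing cito:cites ?cited .
}
GROUP BY ?cited
ORDER BY DESC(?citationCount)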
In conclusion, an analogy:Data publishing and global warming
Waiting for some international committee in Copenhagen to create the perfect solution to the data publication problem is not the way forward
Just as we can each act locally to reduce our carbon footprint,
so we can each do something personally to increase our data footprint
Each of us can take responsibility for publishing our own research data, getting help from experts who can assist us technically as necessary
The important thing is to make a start !
Acknowledgements
My colleagues Graham Klyne, Jun Zhao and Alistair Miles, who have done all the real work in creating FlyTED, OpenFlyData and the CLAROS data web
Helen White-Cooper, University of Cardiff and her team for Drosophila in situ data, and for user feedback on FlyTED and OpenFlyData
The HP Labs Jena team, especially Andy Seaborne, for technical advice on using Jena TDB
The JISC, the RIN, the John Fell Fund and EPSRC for funding our work