David Shotton
Image BioInformatics Research Group
Department of Zoology
University of Oxford, UK
http://ibrg.zoo.ox.ac.uk
Data integration – options using the Semantic Web
© David Shotton, 2010 Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence
e-mail: [email protected]
Workshop on Linked Data and the Semantic Web
OUCS, 29/03/2010
Outline of the talk
The move towards open data
Integrating open data published in heterogeneous formats across distributed databases
Query distribution across relational databases
Data warehousing
Open linked data
Data webs
The benefits of published data
The benefits of open data publication:
review and validation by others,
re-use in other contexts, and
integration with other data to create a new greater whole
Governments, research funding agencies, publishers and (increasingly) researchers are agreed that the results of publicly funded research should be made publicly available
Barriers to data sharing
(From RIN Report)
Ethical constraints
IPR issues
Concerns about misuse
Concerns about data ownership
“Above all, as researchers, we see data as a critical part of our ‘intellectual capital’, generated through a considerable investment of time, effort and skill.
“In a competitive environment, our willingness to share is therefore subject to reservations, in particular as to the control we have over the manner and timing of sharing.
“Any sharing or publishing environment must therefore have secure embargo procedures, such that we can state at the outset when in the future we are happy for the data to be published, safe in the knowledge that our wishes will be honoured.”
Desiderata to promote data sharing
A combination of eight social and technical factors:
Personal attribution and credit for data publication
An established mechanism for citation of datasets
A generic minimum metadata standard for datasets
A tool to permit the easy creation of well-structured metadata
A standard mechanism for packaging data files and their metadata
Appropriate repositories to archive and publish research datasets
Reciprocal citation links between datasets and research articles
Mechanisms for quality control of data publications
Requirements for usefulness: structured research datasets
For research datasets to be maximally useful, they have to be:
Saved in machine-processable form, in conformity with appropriate Web standards (e.g. XML, RDF, OWL)
Published and made freely accessible on the Web
Referenced by globally unique and resolvable identifiers (e.g. DOIs)
Accompanied by useful metadata based upon minimal information standards and ontologies, including provenance information:
by whom, when, where and why the data were recorded
by whom the research was funded
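As a sketch, such provenance metadata might be expressed in RDF (Turtle) using the real Dublin Core terms and FOAF vocabularies; the dataset URI, names and values below are purely illustrative:

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical description of a research dataset, including provenance
<http://example.org/dataset/42>
    dcterms:title       "In situ gene expression images, Drosophila testis" ;
    dcterms:creator     [ a foaf:Person ; foaf:name "A. Researcher" ] ;   # by whom
    dcterms:created     "2009-06-01"^^xsd:date ;                          # when
    dcterms:spatial     "Oxford, UK" ;                                    # where
    dcterms:description "Recorded to map testis-specific transcription" ; # why
    dcterms:contributor "Funded by an (illustrative) research council grant" .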
FlyTED
FlyTED (the Drosophila Testis Gene Expression Database) is our specialist database of gene expression images and their metadata
http://www.fly-ted.org
The challenges of integrating bioinformatics data
There are now over 1200 bioinformatics databases
Combining information from various sources is a time-consuming process
Different databases have different and often highly sophisticated user interfaces
They also differ markedly in their underlying data representations
Nucleic Acids Research Database Collection
[Figure: bar chart of the number of bioinformatics databases per year, 2003–2010, rising to over 1,200 by 2010]
Cochrane GR, Galperin MY (2010) Nucleic Acids Research 38:D1-D4
Some people spend more time searching for information per week than they spend actually using the data !
Data integration for many researchers amounts to nothing more sophisticated than cutting and pasting into a Word document !!
Different approaches to data integration
Sequential integration into workflows, e.g. Taverna, myGrid
Synchronous integration – four quite different approaches:
Distributed querying
Data warehousing
Open Linked Data
The ‘pure’ Semantic Web approach
Crawl the entire web of linked RDF data to find what you want
Data Webs
‘Bespoke’ data webs for distinct information domains
Convert core metadata from contributing resources into RDF for querying
Then link users back to original sources for more complete details
Database integration – the heavyweight approach
OGSA-DAI(Open Grid Services Architecture – Database Access and Integration)
Mechanism for distributing SQL queries over geographically separate conventional relational databases
Heavy investment from UK e-Science budget
Large development team
Heavyweight software
Not used by researchers!
(The following 3 slides are taken from the 2006 OGSA-DAI Architecture document at http://www.ogsadai.org.uk/documentation/presentations/ggf16/)
Data warehousing
All the data from contributing resources is copied into a central repository
Incoming data are normalized to the central data model
Queries are then efficiently made against this single resource
Incurs a large maintenance task in keeping the warehoused data current, in the face of ongoing ‘churn’ in the contributing databases
In our field of Drosophila gene expression, the data warehouse FlyMine provides an integrated database for Drosophila and Anopheles genomics
A customised instance of the generic data warehouse platform InterMine
Its powerful user interface can be used to construct arbitrary queries over the FlyMine data model, but it can be hard to understand without an informatics background, and requires time to master
http://www.flymine.org/
Open Linked Data
Tim Berners-Lee’s four rules for open linked data (from http://www.w3.org/DesignIssues/LinkedData.html):
Use URIs as names for things
Use HTTP URIs, so that people can look up those names
When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
Include links to other URIs, so that they can discover more things
Open linked data facilitates web-scale distributed publication of data, without having to go through any kind of central authority, fulfilling Tim’s original dream of the web
By using common ontology terms to describe things, it becomes easy to navigate around related sets of information by following links, and to pull together information from unrelated sources in novel ways
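A minimal illustration of this linking, with hypothetical URIs: two independent sources describe the same gene, and because they share one URI a client can merge their descriptions simply by following links:

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/vocab/> .

# Source A (a hypothetical gene database) describes the gene
<http://genes.example.org/gene/CG17736>
    dcterms:title "schumacher-levy" .

# Source B (a hypothetical expression database) refers to the SAME URI,
# so its expression record links directly into source A's description
<http://expression.example.org/record/99>
    ex:expressedGene <http://genes.example.org/gene/CG17736> .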
Jeni Tennison’s blog, Monday 22 March 2010
“Link following is all very well, but querying provides much more potential power.
What’s been very unclear to me is how this distributed publication of data can be married with the use of SPARQL for querying.
After all, SPARQL doesn’t (in its present form) support federated search, so to use SPARQL over all this distributed linked data, it sounds like you really need a central triplestore that contains everything you might want to query.
SPARQL queries operate over a default graph (or dataset) and a set of supplementary named graphs. For efficiency, these need to be pulled into a single triplestore.
I think the answer (for the moment at least) is to forget about querying the entire web of linked data and focus on supporting the easy creation of targeted, curated, triplestores that each incorporate a useful subset of the linked data that’s out there. ”
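As a sketch of what querying such a targeted, curated triplestore might look like: two sources are loaded as named graphs in one store, and a single SPARQL query joins across them (the graph URIs and the `ex:` vocabulary are illustrative, not real):

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/vocab/>

# Join gene labels from one source graph with expression images from another,
# all within a single curated triplestore
SELECT ?gene ?label ?image
FROM NAMED <http://example.org/graphs/source-a>
FROM NAMED <http://example.org/graphs/source-b>
WHERE {
  GRAPH <http://example.org/graphs/source-a> { ?gene  rdfs:label ?label . }
  GRAPH <http://example.org/graphs/source-b> { ?image ex:depictsExpressionOf ?gene . }
}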
Enter the data web . . .
The data web concept
A data web is designed to integrate data published in a distributed fashion by independent data providers using their own Web servers
The data may either be in RDF or in conventional relational databases for which SPARQL endpoints are created
Bespoke data webs are created, one for each specific domain of interest, integrating information from a number of ‘subscribing’ resources
Core RDF metadata are made available from each resource, and are ideally, but not necessarily, integrated to a common data web schema
SPARQL querying of the data web permits discovery of the existence of related information across the subscribing resources
If users require further details on a particular item than provided by the metadata within the data web, links are provided back to the full dataset that resides unmodified in the original database
(Data webs were first proposed at the symposium Semantic Interoperability for e-Research in the Sciences, Arts and Humanities, Imperial College, March 30th 2006; and first realized for OpenFlyData in October 2008)
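As a sketch, a data web query can return both the core metadata and a link back to the original record, here via the real rdfs:seeAlso property (the gene label and URIs are illustrative):

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Find a gene by name in the data web's core metadata, together with a
# dereferencable link back to the full record in the source database
SELECT ?resource ?sourcePage
WHERE {
  ?resource rdfs:label   "schuy" ;
            rdfs:seeAlso ?sourcePage .
}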
Problems to be solved when creating a data web
Syntactic differences between data sources
Data are stored in incompatible formats within different DBMSs
Solved by converting all data to RDF, accessible using SPARQL
Semantic differences between data sources (class names)
One person’s “author” is another person’s “creator”
Solved by mapping to a common data schema or ontology
The co-reference problem
The same entity – for example a particular gene – is known by different names in different databases
(e.g. schuy, schumacher-levy or CG17736)
Solved by creating a co-reference service to disambiguate synonyms and homonyms
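Such a co-reference service might publish, for each gene, a bundle of its synonyms and equivalent URIs, for example (illustrative URIs, using the real SKOS and OWL vocabularies):

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# A co-reference bundle asserting that three database names
# all refer to the same Drosophila gene
<http://coref.example.org/gene/schuy>
    skos:prefLabel "schuy" ;
    skos:altLabel  "schumacher-levy" , "CG17736" ;
    owl:sameAs     <http://genedb-a.example.org/gene/CG17736> ,
                   <http://genedb-b.example.org/gene/schumacher-levy> .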
Technical approaches for our creation of data webs
We use the Web as the platform, and the browser as the user interface
We use W3C Semantic Web tools and standards:
All identifiers are mapped to URIs
RDF (the Resource Description Framework, a standard for describing data on the Web) is used as the standard format for describing data
The RDF query language SPARQL is used for data web queries
A SPARQL web service endpoint is made on each data resource, or on an integration of these
By presenting standard relational database content as RDF, SPARQL queries of the data web can be used to access ‘legacy’ data
We employ open source software components to build our services, loosely coupled by RESTful Web services
Two methods of creating a SPARQL endpoint
Creation of a local RDF triplestore that caches selected source metadata, which are SPARQLed– “RDF caching”
Use of software to dynamically rewrite the SPARQL query into the database query language (SQL) – “SPARQL virtualization”
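A hedged sketch of SPARQL virtualization, as performed by tools such as D2RQ; the `ex:` vocabulary and the relational table and column names are hypothetical:

```sparql
PREFIX ex: <http://example.org/vocab/>

# Incoming SPARQL query against the virtual RDF graph
SELECT ?name
WHERE {
  ?gene a ex:Gene ;
        ex:name ?name .
}

# ...which the virtualization layer might rewrite, on the fly, into SQL
# against the underlying relational database (table name hypothetical):
#   SELECT name FROM gene;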
OpenFlyData
A data web integrating heterogeneous Drosophila data from distributed bioinformatics databases
Created during the JISC FlyWeb Project, October 2007-May 2009
First OpenFlyData release: Oct 2008; full functionality: April 2009
Displays information from four Drosophila gene expression databases
About 180 million triples in present form http://openflydata.org/
Miles A, Zhao J, Klyne G, White-Cooper H, Shotton D (2010). OpenFlyData: An exemplar data web integrating gene expression data on the fruit fly Drosophila melanogaster. Journal of Biomedical Informatics (accepted for publication)
Preprint at http://imageweb.zoo.ox.ac.uk/pub/2009/publications/Miles_et_al_OpenFlyData_paper/
CLAROS
A data web integrating heterogeneous classical art data from distributed sources
Created during the OU CLAROS Project, Nov 2008-Oct 2009
First CLAROS release: Aug 2009; still under development
Integrates information from four major European classics resources
About 10 million triples in present form http://www.clarosnet.org/
Kurtz D, Parker G, Shotton D, Klyne G, Schroff F, Zisserman A and Wilks Y (2009). CLAROS – bringing classical art to a global public. Proc. IEEE e-Science Conference, Oxford, 9-11 December 2009.
Preprint at http://imageweb.zoo.ox.ac.uk/pub/2009/publications/Kurtz_Parker_Shotton_et_al-IEEE_CLAROS_paper.pdf
OpenCitations.net
CiTO, the Citation Typing Ontology (http://purl.org/net/cito/) enables the publication of citation information as open linked data
By citations, I mean the statements of the type
<http://example1.com/citingwork> cito:cites <http://example2.com/citedwork> .
I plan to apply to the JISC for start-up funding to set up OpenCitations.net, a public RDF triplestore of biomedical literature citations
Open access journals from UK Pubmed Central and Biomed Central will be used as an initial source of citations, plus some 3 million citations already mined from PDFs of life science articles harvested from the web
Availability of such open linked data for the biomedical sciences would be invaluable for examination of citation networks and other phenomena, such as tracking citation bias and the conversion of hypotheses into ‘facts’ simply by the process of citation
(Greenberg SA (2009): How citation distortions create unfounded authority: analysis of a citation network. British Medical Journal 339:b2680, doi:10.1136/bmj.b2680)
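As a sketch, such a triplestore could then be queried for the most-cited works with a SPARQL 1.1 aggregate (the endpoint and its contents are assumed, not yet real):

```sparql
PREFIX cito: <http://purl.org/net/cito/>

# Count incoming citations per cited work, most-cited first:
# a first step towards analysing citation networks and citation bias
SELECT ?cited (COUNT(?citing) AS ?citationCount)
WHERE {
  ?citing cito:cites ?cited .
}
GROUP BY ?cited
ORDER BY DESC(?citationCount)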
In conclusion, an analogy:Data publishing and global warming
Waiting for some international committee in Copenhagen to create the perfect solution to the data publication problem is not the way forward
Just as we can each act locally to reduce our carbon footprint,
so we can each do something personally to increase our data footprint
Each of us can take responsibility for publishing our own research data, getting help from experts who can assist us technically as necessary
The important thing is to make a start !
Acknowledgements
My colleagues Graham Klyne, Jun Zhao and Alistair Miles, who have done all the real work in creating FlyTED, OpenFlyData and the CLAROS data web
Helen White-Cooper, University of Cardiff and her team for Drosophila in situ data, and for user feedback on FlyTED and OpenFlyData
The HP Labs Jena team, especially Andy Seaborne, for technical advice on using Jena TDB
The JISC, the RIN, the John Fell Fund and EPSRC for funding our work