Building mashup from Linked Data using Bio2RDF’s Talend components
François Belleau, Vincent Emonet, Arnaud Droit
Centre de Biologie Computationnelle
Centre de recherche du CHUQ
Context
What is known about the PARP family proteins involved in Reactome
pathways? To answer this question, we propose building a semantic
mashup with two open source tools: the OpenLink Virtuoso triplestore
and Talend Open Studio for data integration.
Our goal is to help solve the data integration problem, a reality in
bioinformatics. Taverna and Galaxy workflows have been very
successful in addressing this problem, but they still lack support for
Semantic Web technologies such as RDF and SPARQL. BioMart has also
been successful by offering a global model to share and query data.
The Bio2RDF project has the same goal, but it uses a Semantic
Web strategy based on the distributed RDF graph of
Linked Data and public SPARQL endpoints.
Methodology
To implement our strategy, we added Semantic Web technology and life
science Linked Data sources to Talend. We created two
collections of components. The first, Talend4SW, integrates the
Virtuoso triplestore into Talend and offers simple utilities to transform
RDF data. The second, Talend4Bio2RDF,
is used to fetch RDF data from life science SPARQL endpoints.
Connected together in a workflow, these components query the
Bio2RDF release 2 endpoints, the UniProt REST service and
EBI’s SPARQL endpoints. They all consume the new Bio2RDF
REST services available at http://bio2rdf.org.
Using these components in a Talend workflow, we
populate a triplestore by fetching RDF data directly from the web.
Each triple is stored in a local Virtuoso triplestore, which is then
queried using SPARQL to discover new URIs to
dereference. Once the data needed to answer our initial question
has been loaded, a final SPARQL query returns the
answer.
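The fetch, store, query and dereference cycle described above can be sketched in Python. This is only an illustration under stated assumptions, not the Talend workflow itself: the function name `crawl`, the in-memory set standing in for the Virtuoso triplestore, and the caller-supplied `fetch` function are all hypothetical.

```python
from collections import deque
from typing import Callable, Iterable, Set, Tuple

# A triple is modelled as (subject, predicate, object) strings.
Triple = Tuple[str, str, str]


def crawl(seeds: Iterable[str],
          fetch: Callable[[str], Iterable[Triple]],
          max_rounds: int = 3) -> Set[Triple]:
    """Hypothetical sketch: fetch triples, store them, then look in the
    store for URIs that have not been dereferenced yet and repeat."""
    store: Set[Triple] = set()   # stands in for the local triplestore
    seen: Set[str] = set()       # URIs already dereferenced
    frontier = deque(seeds)

    for _ in range(max_rounds):
        # Dereference every URI currently in the frontier.
        while frontier:
            uri = frontier.popleft()
            if uri in seen:
                continue
            seen.add(uri)
            store.update(fetch(uri))

        # "Query" the store for object URIs that are still unvisited.
        # In the real workflow this step is a SPARQL query.
        new_uris = sorted({o for (_, _, o) in store
                           if o.startswith("http") and o not in seen})
        if not new_uris:
            break
        frontier.extend(new_uris)

    return store
```

Passing `fetch` as a parameter keeps the sketch independent of any particular endpoint; `max_rounds` bounds how far the crawl follows newly discovered links.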
Results
This semantic workflow instantiates the database
needed to answer the initial question in a few steps. The result:
PARP1_HUMAN is the only protein of the PARP family present in
Reactome’s pathways.
These new Talend components can be imported from Talend
Exchange (http://www.talendforge.org/exchange). The Talend
workflow used to answer the PARP question can be downloaded
from myExperiment:
http://www.myexperiment.org/workflows/4050.html
The PI of this project is Dr Arnaud Droit, Director of the Centre de
Biologie Computationnelle of the CRCHUQ at Université Laval.
http://bio2rdf.org
The tBio2RDFRequest component fetches an RDF graph from the Bio2RDF describe, links and search REST services. Results are available in different formats.
The tNtriplesTemplate component generates N-Triples from the incoming data flow using a text template. Here it creates the owl:sameAs triples needed to connect Bio2RDF resources to their UniProt counterparts, because the two use different URI patterns.
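As a rough illustration of template-based N-Triples generation, the Python sketch below fills a text template with a UniProt accession to produce an owl:sameAs statement. It is not the component's implementation. The two URI patterns (`http://bio2rdf.org/uniprot:` and `http://purl.uniprot.org/uniprot/`) and the example accession P09874 are assumptions chosen for illustration, and `same_as_triples` is a hypothetical helper name.

```python
from string import Template

# Assumed URI patterns for the two naming schemes (illustrative only).
SAME_AS = Template(
    "<http://bio2rdf.org/uniprot:$acc> "
    "<http://www.w3.org/2002/07/owl#sameAs> "
    "<http://purl.uniprot.org/uniprot/$acc> ."
)


def same_as_triples(accessions):
    """Return one N-Triples line per accession, linking the two URIs."""
    return [SAME_AS.substitute(acc=acc) for acc in accessions]


# Example: one row of the incoming data flow becomes one triple.
for line in same_as_triples(["P09874"]):
    print(line)
```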
The tDerefrencableURI component fetches a graph by dereferencing its URI. Here it dereferences UniProt URIs for proteins and keywords.
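Dereferencing a Linked Data URI usually means an HTTP GET with an Accept header asking for an RDF serialization. The Python sketch below shows that idea with the standard library only. The helper names, the default `application/rdf+xml` media type and the timeout are assumptions; the poster does not say how the component negotiates content.

```python
import urllib.request


def build_deref_request(uri, accept="application/rdf+xml"):
    """Build a GET request that asks the server for RDF via the Accept header."""
    return urllib.request.Request(uri, headers={"Accept": accept})


def dereference(uri, timeout=30):
    """Fetch the RDF document behind a URI and return its raw bytes.

    Performs a network call; the response format depends on the server.
    """
    with urllib.request.urlopen(build_deref_request(uri), timeout=timeout) as resp:
        return resp.read()
```

Separating request construction from the network call makes the content negotiation easy to inspect without contacting the server.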
The tEBIRequest component sends queries to EBI’s new SPARQL endpoints; here it fetches Reactome data.
This final, complex query answers the question by linking together data from the HGNC, UniProt and Reactome databases the Linked Data way.
The execution process can be monitored by looking at the URIs used to fetch RDF data from the web. The table shows the number of triples loaded in the previous run.
Because Talend is a complete ETL solution, results can easily be exported to an Excel spreadsheet for analysis.
Our team can help you add your own curated database to this RDF Linked Data project, which is based on open source software. Your project can then join the Semantic Web of life science resources.