1
Graph-based queries of Semantic-Web integrated biological data Marco Moretto 1,2 , Alessandro Cestaro 2 , Enrico Blanzieri 1 and Riccardo Velasco 2 1 University of Trento Via Sommarive, 14 38100 Trento, Italy 2 Fondazione Edmund Mach Via Edmund Mach, 1 38010 S. Michele all’Adige, Trento , Italy [email protected] Abstract In the post-genomic era, life science researchers are faced with the need to manage and inspect a grow- ing abundance of data and information. Data from different sources, both public and proprietary, have the most value when considered in the context of each other as they give more information. In or- der to answer questions that spans multiple fields in the biology domain without an integrated approach, a biologist needs to visit all data sources related to the problem and find relevant data. In the last years we have become witnesses of a growing inter- est for the Semantic Web technologies to integrate and query biological data. Semantic Web technolo- gies were designed to meet the challenges of re- duce the complexity of combining data from multi- ple sources, resolve the lack of widely accepted stan- dards and manage highly distributed and mutable resources. However, Semantic Web standard tech- nologies do not provide any tools to query integrated knowledge bases from a graph perspective, that is defining graph traversal patterns. For example, it is not possible to ask a query like “are enzyme A and compound B related?” without knowing the com- plete structure of the knowledge base. After explor- ing different alternatives we come up with the use of a graph traversal programming language on top of a triplestore in order to perform several path traversal queries on an integrated graph. We tested the feasi- bility of the approach integrating Uniprot, Gene On- tology, Chebi and Kegg resources posing queries of different complexity. Overview The Resource Description Framework (RDF) data model is based upon the idea of making statements about resources in the form of subject-predicate-object ex- pressions, known as triples. From a database perspec- tive, RDF can be considered an extension of graph database models. Proposed solution Gathering resources from public repositories If necessary convert them into RDF format Store them into a Sesame triplestore Integrate them providing linking triples Query the integrated RDF graph using Gremlin Example query: Gremlin code Query Given an enzyme and a compound, are they related? interactions= [g.v(g.uri(’bpx:control’)),g.v(g.uri(’bpx:biochemicalReaction’))] compound = g.v(g.uri(’chebi:CHEBI_16077’)) enzyme = g.v(g.uri(’uniprot:D7SXJ4’)) enzyme.outE[[label:g.uri(’unicore:enzyme’)]].inV.inE[[ label:g.uri(’rdfs:seeAlso’)]].outV.inE.outV.loop(2){!interactions.contains(( it.object.outE[[label:g.uri(’rdf:type’)]].inV >> 1))}.outE.inV.loop(2) {it.object != compound } Gremlin Is a domain specific programming language for graphs based on Groovy. It is not tied to a particular graph backend and its syntax allows for the represen- tation of graph traversal expression succinctly Example query: graph Query Given an enzyme and a compound, are they related? Conclusions and results KB Uniprot GO Chebi KEGG kb1 Vitis vinifera only ids only ids Vitis vinifera kb2 Eudicotyledons only ids only ids Vitis vinifera kb3 Viridiplantae full full Vitis vinifera KB N of vertexes N of edges Loading time Disk space kb1 297.207 3.136.164 3 minutes 214 Mb kb2 1.375.608 21.196.845 15 minutes 1.8 Gb kb3 13.149.000 181.693.000 3 hours 15 Gb An integrated approach allows biologists to query different information resources without the need to visit all of them in order to find relevant data DBMS knowledge bases must be designed and modified with an idea of the type of queries they are going to answer Semantic Web technologies provide standard tools and technologies to easily integrate data from different sources SPARQL does not allow path traversal queries Graph-based approach allows to express queries like “are entity A and entity B related?” References [1] Goble, C. and Stevens, R. State of the nation in data integration for bioinformatics Journal of biomedical informatics, 41(5):687–693, 2008. [2] Angles, R. and Gutierrez, C. Querying RDF data from a graph database perspective The Semantic Web: Research and Applications, Springer, 346–360 2005. [3] Marko Rodriguez Gremlin https://github.com/tinkerpop/gremlin/wiki

Semantic-Webintegratedbiologicaldata Graph-basedqueriesof ISCB-ECCB... · Marco Moretto1;2, Alessandro Cestaro2, Enrico Blanzieri1 and Riccardo Velasco2 1 University of Trento Via

Embed Size (px)

Citation preview

Page 1: Semantic-Webintegratedbiologicaldata Graph-basedqueriesof ISCB-ECCB... · Marco Moretto1;2, Alessandro Cestaro2, Enrico Blanzieri1 and Riccardo Velasco2 1 University of Trento Via

Graph-based queries ofSemantic-Web integrated biological data

Marco Moretto1,2, Alessandro Cestaro2, Enrico Blanzieri1 and Riccardo Velasco2

1University of Trento Via Sommarive, 14 38100 Trento, Italy

2Fondazione Edmund Mach Via Edmund Mach, 1 38010 S. Michele all’Adige, Trento , Italy

[email protected]

AbstractIn the post-genomic era, life science researchers arefaced with the need to manage and inspect a grow-ing abundance of data and information. Data fromdifferent sources, both public and proprietary, havethe most value when considered in the context ofeach other as they give more information. In or-der to answer questions that spans multiple fields inthe biology domain without an integrated approach,a biologist needs to visit all data sources relatedto the problem and find relevant data. In the lastyears we have become witnesses of a growing inter-est for the Semantic Web technologies to integrateand query biological data. Semantic Web technolo-gies were designed to meet the challenges of re-duce the complexity of combining data from multi-ple sources, resolve the lack of widely accepted stan-dards and manage highly distributed and mutableresources. However, Semantic Web standard tech-nologies do not provide any tools to query integratedknowledge bases from a graph perspective, that isdefining graph traversal patterns. For example, it isnot possible to ask a query like “are enzyme A andcompound B related?” without knowing the com-plete structure of the knowledge base. After explor-ing different alternatives we come up with the use ofa graph traversal programming language on top ofa triplestore in order to perform several path traversalqueries on an integrated graph. We tested the feasi-bility of the approach integrating Uniprot, Gene On-tology, Chebi and Kegg resources posing queries ofdifferent complexity.

Overview

The Resource Description Framework (RDF) data modelis based upon the idea of making statements aboutresources in the form of subject-predicate-object ex-

pressions, known as triples. From a database perspec-tive, RDF can be considered an extension of graphdatabase models.

Proposed solution• Gathering resources from public repositories

• If necessary convert them into RDF format

• Store them into a Sesame triplestore

• Integrate them providing linking triples

• Query the integrated RDF graph using Gremlin

Example query: Gremlin codeQuery Given an enzyme and a compound, are they related?

interactions= [g.v(g.uri(’bpx:control’)),g.v(g.uri(’bpx:biochemicalReaction’))]compound = g.v(g.uri(’chebi:CHEBI_16077’))enzyme = g.v(g.uri(’uniprot:D7SXJ4’))enzyme.outE[[label:g.uri(’unicore:enzyme’)]].inV.inE[[label:g.uri(’rdfs:seeAlso’)]].outV.inE.outV.loop(2){!interactions.contains((it.object.outE[[label:g.uri(’rdf:type’)]].inV >> 1))}.outE.inV.loop(2){it.object != compound }

GremlinIs a domain specific programming language forgraphs based on Groovy. It is not tied to a particulargraph backend and its syntax allows for the represen-tation of graph traversal expression succinctly

Example query: graphQuery Given an enzyme and a compound, are

they related?

Conclusions and resultsKB Uniprot GO Chebi KEGGkb1 Vitis vinifera only ids only ids Vitis viniferakb2 Eudicotyledons only ids only ids Vitis viniferakb3 Viridiplantae full full Vitis vinifera

KB N of vertexes N of edges Loading time Disk spacekb1 297.207 3.136.164 3 minutes 214 Mbkb2 1.375.608 21.196.845 15 minutes 1.8 Gbkb3 13.149.000 181.693.000 3 hours 15 Gb

• An integrated approach allows biologists toquery different information resources without theneed to visit all of them in order to find relevantdata

• DBMS knowledge bases must be designed andmodified with an idea of the type of queries theyare going to answer

• Semantic Web technologies provide standardtools and technologies to easily integrate datafrom different sources

• SPARQL does not allow path traversal queries

• Graph-based approach allows to express

queries like “are entity A and entity B related?”

References[1] Goble, C. and Stevens, R. State of the nation in data integration for bioinformatics Journal of biomedical informatics, 41(5):687–693, 2008.[2] Angles, R. and Gutierrez, C. Querying RDF data from a graph database perspective The Semantic Web: Research and Applications, Springer, 346–360 2005.[3] Marko Rodriguez Gremlin https://github.com/tinkerpop/gremlin/wiki

1