The Power of Graphs to Analyze Biological Data - Davy Suvee @ GraphConnect London 2013

DESCRIPTION

This talk illustrates the power and flexibility of graph databases, and Neo4j specifically, for analyzing biological data sets. Davy will show how to build a visual exploration environment that helps researchers identify clusters within various biological data sets, including gene expression and mutation prevalence data. He will also demo BRAIN (Bio Relations and Intelligence Network), a data exploration platform that combines several scientific data sources (including PubMed, Swiss-Prot and DrugBank). It uses Neo4j under the covers both to store the data and to support the queries that yield key insights and deductions.

GraphConnect

the power of graphs to analyze biological data

about me

who am i ...

Davy Suvee (@DSUVEE)

➡ big data architect @ datablend - continuum
• provide big data and nosql consultancy
• 5 years of hands-on expertise in the pharma/biotech sector

massive data

big data in pharma

full genome sequencing

complex data

biological networks

scalable number crunching platform

visual insights-driven platform

graphs!!

outlier detection platform

big data in pharma (2 specific use cases)

neo4j, mongodb/cassandra and gephi

euretos - brain: neo4j, mongodb, solr and prefuse

gene expression clustering

➡ oncology data set:
★ 4,800 samples
★ 27,000 genes

➡ Question:
★ for a particular subset of samples, which genes are co-expressed?

storing gene expressions (mongodb)

{ "_id" : { "$oid" : "4f1fb64a1695629dd9d916e3" },
  "sample_name" : "122551hp133a21.cel",
  "genomics_id" : 122551,
  "sample_id" : 343981,
  "donor_id" : 143981,
  "sample_type" : "Tissue",
  "sample_site" : "Ascending colon",
  "pathology_category" : "MALIGNANT",
  "pathology_morphology" : "Adenocarcinoma",
  "pathology_type" : "Primary malignant neoplasm of colon",
  "primary_site" : "Colon",
  "expressions" : [ { "gene" : "X1_at", "expression" : 5.54217719084415 },
                    { "gene" : "X10_at", "expression" : 3.92335121981739 },
                    { "gene" : "X100_at", "expression" : 7.81638155662255 },
                    { "gene" : "X1000_at", "expression" : 5.44318512260619 },
                    … ] }
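To correlate samples, each document first has to be reduced to a vector of expression values. The following is a minimal sketch of that step, not the code used in the talk; the helper name `expression_vector` is hypothetical, and the document is a trimmed copy of the one above (fewer metadata fields, only the four probes visible on the slide).

```python
import json

# Trimmed copy of the slide's document: some metadata fields dropped,
# and only the four probes that are visible on the slide.
RAW = """
{ "_id": {"$oid": "4f1fb64a1695629dd9d916e3"},
  "genomics_id": 122551,
  "primary_site": "Colon",
  "expressions": [
    {"gene": "X1_at", "expression": 5.54217719084415},
    {"gene": "X10_at", "expression": 3.92335121981739},
    {"gene": "X100_at", "expression": 7.81638155662255},
    {"gene": "X1000_at", "expression": 5.44318512260619}
  ] }
"""


def expression_vector(doc):
    """Map each gene (probe) name to its expression value for one sample."""
    return {entry["gene"]: entry["expression"] for entry in doc["expressions"]}


sample = json.loads(RAW)
vector = expression_vector(sample)
print(sample["genomics_id"], len(vector), vector["X100_at"])
```

Keying the vector by gene name rather than by position keeps two samples comparable even if their `expressions` arrays are ordered differently.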

correlating samples (mongodb/map-reduce)

pearson correlation

x    y
43   99
21   65
25   79
42   75
57   87
59   81

correlation: 0.52
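The example above can be reproduced with a few lines of code. This is a plain-Python sketch of the Pearson coefficient applied to the six (x, y) pairs from the slide; it is not the map-reduce job from the talk, and the function name `pearson` is my own. The exact value is about 0.5298, which the slide appears to show truncated to two decimals.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance times n).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances times n).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


xs = [43, 21, 25, 42, 57, 59]
ys = [99, 65, 79, 75, 87, 81]
r = pearson(xs, ys)
print(round(r, 4))
```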

co-expression graph (neo4j)

➡ create a node for each sample
➡ if correlation between two samples >= 0.8, create an edge between both nodes
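The two rules above can be sketched in plain Python before any graph database is involved. This is an illustration under stated assumptions: the expression vectors below are made-up toy values (only the sample ids come from the slides), the function names are hypothetical, and the result is an in-memory node set and edge list rather than actual Neo4j calls.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def build_coexpression_graph(samples, threshold=0.8):
    """Return (nodes, edges): one node per sample, and one edge per pair
    of samples whose correlation is >= threshold."""
    nodes = set(samples)
    edges = []
    for a, b in combinations(sorted(samples), 2):
        value = pearson(samples[a], samples[b])
        if value >= threshold:
            edges.append((a, b, value))
    return nodes, edges


# Toy expression vectors (hypothetical values, NOT data from the talk).
samples = {
    122551: [1.0, 2.0, 3.0, 4.0],
    122552: [1.1, 2.1, 2.9, 4.2],
    122553: [4.0, 1.0, 3.5, 1.5],
}

nodes, edges = build_coexpression_graph(samples)
for a, b, value in edges:
    print(f"{a} -[correlated, value: {value:.2f}]- {b}")
```

Comparing every pair is quadratic in the number of samples, which is presumably why the talk moves the correlation step into map-reduce for the full data set.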

[figure: sample nodes 122551, 122552 and 122553; edge label "correlated", value : 0.86]

co-expression visualisation (gephi)

euretos - brain

➡ pubmed: 23 million biomedical articles
• 1300 new ones added every day
• google-like search interface

➡ reading an article ...
• malaria is transferred by mosquitoes

euretos - brain

[figure labels: authors, references]

euretos - brain

ooooooh crap ...

euretos - brain

➡ nanopub (nanopub.org)
• the smallest unit of publishable information

➡ assertion
• subject: malaria
• predicate: transferred by
• object: mosquito

➡ provenance
• how this came to be (meta-data)
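As a rough illustration of the structure described above, a nanopub can be modelled as an assertion triple plus a provenance record. This is a simplified sketch, not the nanopub.org data model or BRAIN's internal representation; the class names and the provenance keys are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assertion:
    """A single subject / predicate / object statement."""
    subject: str
    predicate: str
    obj: str

    def as_triple(self):
        return (self.subject, self.predicate, self.obj)


@dataclass
class Nanopub:
    """An assertion together with meta-data on how it came to be."""
    assertion: Assertion
    # Provenance keys are hypothetical; real nanopubs define their own schema.
    provenance: dict = field(default_factory=dict)


nanopub = Nanopub(
    assertion=Assertion("malaria", "transferred by", "mosquito"),
    provenance={"source": "pubmed article (placeholder)"},
)
print(nanopub.assertion.as_triple())
```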

euretos - brain

➡ unfortunately, malaria is encoded in various ways ...

db1       db2      db3
malaria   P22384   AQ879

all three map to the single concept: malaria

euretos - brain

malaria -[transferred by]-> mosquito
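The normalisation step can be sketched as a lookup from (database, identifier) pairs to one shared concept, so that assertions from different sources collapse into the same edge. This is an assumption-laden illustration, not how BRAIN does it: the index is hard-coded, the function names are hypothetical, and the mosquito entries are made up to complete the example (only the malaria identifiers and database labels come from the slides).

```python
# Maps a (database, identifier) pair to a shared concept name.
CONCEPT_INDEX = {
    ("db1", "malaria"): "malaria",
    ("db2", "P22384"): "malaria",
    ("db3", "AQ879"): "malaria",
    # Hypothetical entries, added only so the example has an object concept.
    ("db1", "mosquito"): "mosquito",
    ("db2", "mosquito"): "mosquito",
    ("db3", "mosquito"): "mosquito",
}


def resolve(db, identifier):
    """Return the shared concept for a source-specific identifier."""
    try:
        return CONCEPT_INDEX[(db, identifier)]
    except KeyError:
        raise KeyError(f"no concept known for {identifier!r} in {db!r}") from None


def add_assertion(graph, db, subject_id, predicate, object_id):
    """Resolve both ends of an assertion and store it as one triple."""
    graph.add((resolve(db, subject_id), predicate, resolve(db, object_id)))


graph = set()
add_assertion(graph, "db1", "malaria", "transferred by", "mosquito")
add_assertion(graph, "db2", "P22384", "transferred by", "mosquito")
add_assertion(graph, "db3", "AQ879", "transferred by", "mosquito")
print(graph)
```

Because the triples are stored in a set, the three source-specific records end up as a single malaria-to-mosquito edge.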

euretos - brain

➡ brain (http://www.euretos.com/brain)
• exploration and analysis platform
• millions of concepts/triples/nanopubs
• pubmed, uniprot, omim, pubchem, ...

➡ architectural stack
• meta-data is stored in mongodb
• graph in neo4j
• swing interface connecting to rest endpoints

brain

Questions?

E-MAIL

info@datablend.be

Follow us

twitter.com/data_blend

www.datablend.be

www.datablend.be info@datablend.be 0499/05.00.89

datablend - continuum