Structuring what we know and use that to better understand your data

@Chris_Evelo: Department of Bioinformatics – BiGCaT,

WikiPathways team, ELIXIR Interoperability team, Open PHACTS

So many…

ELIXIR, EXCELERATE, CORBEL, GA4GH, EGA, dbNP, ENPADASI, DISH, Open PHACTS, BBMRI, DRE, EuroCAT, DTL, EATRIS, DiXa, UniProt, PDB, ChEBI, ChEMBL, HMDB, ISA, FAIR, RDF, VoID, Nanopubs, eNanoMapper, KEGG, Reactome, Entrez, Parelsnoer, ArrayExpress, GEO, ENCODE, Recon2, SBML, SBGN, MIM

And that is just what I discussed yesterday…

The typical question we get about using big data

We can do things like this (diabetic liver)

Pihlajamäki et al. dataset is from Gene Expression Omnibus (accession number GSE15653)

Pihlajamäki et al. J Clin Endocrinol Metab. 2009, 94 (9): 3521-3529. DOI: 10.1210/jc.2009-0212.

Martina Kutmon et al. BMC Genomics 2014, 15:971. DOI: 10.1186/1471-2164-15-971

Data predators

Data: Wang et al. 2011, in Gene Expression Omnibus (GEO, http://ncbi.nlm.nih.gov/geo/), accession number GSE17461.

Published paper: Effects of 1alpha,25 dihydroxyvitamin D3 and testosterone on miRNA and mRNA expression in LNCaP cells. WL Wang et al. Mol Cancer 2011. 10. doi:10.1186/1476-4598-10-58

Or: Vitamin D effects on prostate cancer cells

Integrative network-based analysis of mRNA and microRNA expression in vitamin D3-treated cancer cells

Integrative Systems Biology

• Internal & external data repositories, e.g. dbNP, Sage, Atlas

• Knowledge resources & (semantic web) integration, e.g. Open PHACTS, WikiPathways

• Study capturing: ISA

• Models

• Study data processing, statistics, storage, e.g. arrayanalysis.org

• Ontologies

• Modeling & data integration, network biology (extension), supervised statistics

• Curation, simulation, annotation & provenance

• Research applications

• Mapping: BridgeDb

• Extraction, SPARQLing, conversion

http://www.wikipathways.org/instance/WP430

http://www.wikipathways.org/index.php/Pathway:WP430

WikiPathways

• Public resource for biological pathways

• Anyone can contribute and curate

• More up-to-date representation of biological knowledge

WikiPathways: capturing the full diversity of pathway knowledge. M Kutmon et al. Nucleic Acids Res 2015, first published online Oct 19.

Big data: Wikiomics. Mitch Waldrop. Nature 2008: 455, 22-25

We the curators. Allison Doerr. Nature Methods 2008: 5, 754–755

No rest for the bio-wikis. Ewen Callaway. Nature 2010: 468, 359-360

How to do interoperable data visualization?

Connect to Genome Databases

Backpages link to multiple databases

You could do this for gene lists

Don’t be afraid to reinvent wheels!

BridgeDb: Abstraction Layer

• interface IDMapper

• class IDMapperRdb: relational database

• class IDMapperFile: tab-delimited text

• class IDMapperBiomart: web service

The BridgeDb Framework: Standardized Access to Gene, Protein and Metabolite Identifier Mapping Services. Martijn P van Iersel, Alexander R Pico, Thomas Kelder, Jianjiong Gao, Isaac Ho, Kristina Hanspers, Bruce R Conklin, Chris T Evelo. BMC Bioinformatics 2010, 11: 5.
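The abstraction-layer idea can be sketched in a few lines. BridgeDb itself is a Java framework; the Python below is only an analogue of the pattern, and the method name `map_id`, the `link` table schema and all identifiers are assumptions made for illustration, not the BridgeDb API. It shows two backends (tab-delimited text and a relational database) behind one interface, so that calling code never needs to know where the mappings live.

```python
import csv
import io
import sqlite3
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Set


class IDMapper(ABC):
    """Common interface; callers depend only on this type."""

    @abstractmethod
    def map_id(self, source_db: str, identifier: str, target_db: str) -> Set[str]:
        """Return identifiers in target_db equivalent to the given one."""


class IDMapperFile(IDMapper):
    """Backend reading one source->target mapping from tab-delimited text."""

    def __init__(self, handle, source_db: str, target_db: str):
        self._pair = (source_db, target_db)
        self._table: Dict[str, Set[str]] = {}
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) >= 2:
                self._table.setdefault(row[0], set()).add(row[1])

    def map_id(self, source_db, identifier, target_db):
        if (source_db, target_db) != self._pair:
            return set()
        return set(self._table.get(identifier, set()))


class IDMapperRdb(IDMapper):
    """Backend querying a relational database (hypothetical 'link' table)."""

    def __init__(self, connection: sqlite3.Connection):
        self._conn = connection

    def map_id(self, source_db, identifier, target_db):
        rows = self._conn.execute(
            "SELECT target_id FROM link "
            "WHERE source_db = ? AND source_id = ? AND target_db = ?",
            (source_db, identifier, target_db),
        )
        return {row[0] for row in rows}


def map_gene_list(mapper: IDMapper, ids: Iterable[str],
                  source_db: str, target_db: str) -> Dict[str, Set[str]]:
    """Works with any backend, because it only uses the interface."""
    return {i: mapper.map_id(source_db, i, target_db) for i in ids}


# Made-up identifiers, loaded into both backends.
pairs = [("GENE_A", "PROT_1"), ("GENE_B", "PROT_2")]

text = "".join(f"{src}\t{dst}\n" for src, dst in pairs)
file_mapper = IDMapperFile(io.StringIO(text), "GeneDb", "ProteinDb")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE link (source_db TEXT, source_id TEXT, target_db TEXT, target_id TEXT)"
)
conn.executemany(
    "INSERT INTO link VALUES (?, ?, ?, ?)",
    [("GeneDb", src, "ProteinDb", dst) for src, dst in pairs],
)
rdb_mapper = IDMapperRdb(conn)

file_result = map_gene_list(file_mapper, ["GENE_A", "GENE_B"], "GeneDb", "ProteinDb")
rdb_result = map_gene_list(rdb_mapper, ["GENE_A", "GENE_B"], "GeneDb", "ProteinDb")
```

Swapping `file_mapper` for `rdb_mapper` changes nothing in `map_gene_list`, which is the point of the abstraction layer.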

Combine: WikiPathways tissue analyzer

Work done by Jonathan Melius

WikiPathways, a house of webs?

Combine: adding miRNAs clutters

Combine: regulator interaction in the MiPaSt PathVisio plugin

Work done by Christian Oertlin.

Pathways in Cytoscape

Figure 2. The Cardiac Hypertrophic Response pathway loaded as a network.

Kutmon M, Lotia S, Evelo CT and Pico AR 2014 [v1; ref status: indexed, http://f1000r.es/3ij] F1000Research 2014, 3:152 (doi: 10.12688/f1000research.4254.1)

PPS1

Liver

All pathways

Pathways with high z-score grouped together. Explains why there are relatively few significant genes, but many pathways with high z-score.

Cytoscape visualization used to group
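The slides do not say how the pathway z-score was calculated. A common choice in pathway over-representation analysis, assumed here, compares the number of changed genes in a pathway with the number expected from the whole dataset, using the hypergeometric variance. The function name, parameter names and the example numbers below are illustrative only.

```python
import math


def pathway_z_score(total_genes: int, total_changed: int,
                    pathway_genes: int, pathway_changed: int) -> float:
    """Over-representation z-score under a hypergeometric model.

    total_genes     N: genes measured in the dataset
    total_changed   R: genes in the dataset meeting the criterion
    pathway_genes   n: measured genes in the pathway
    pathway_changed r: pathway genes meeting the criterion
    """
    N, R, n, r = total_genes, total_changed, pathway_genes, pathway_changed
    if not (0 < R < N) or not (0 < n < N):
        raise ValueError("need 0 < R < N and 0 < n < N for a non-zero variance")
    expected = n * R / N
    variance = n * (R / N) * (1 - R / N) * (1 - (n - 1) / (N - 1))
    return (r - expected) / math.sqrt(variance)


# Made-up numbers: 1000 genes measured, 100 changed; a 50-gene pathway
# contains 15 changed genes where 5 would be expected by chance.
z = pathway_z_score(1000, 100, 50, 15)
```

Because the score is relative to the dataset-wide fraction of changed genes, pathways that share the same few changed genes can all score highly, which fits the observation that few significant genes can still give many high-scoring pathways.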

Pathway interactions and what causes them

Thomas Kelder, Lars Eijssen, Robert Kleemann, Marjan van Erk, Teake Kooistra, Chris Evelo (2011) Exploring pathway interactions in insulin resistant mouse liver. BMC Systems Biology 5: 127 Aug. http://dx.doi.org/10.1186/1752-0509-5-127

Pathway interactions and detailed network visualization for the interactions with three apoptosis related pathways for the comparison between HF and LF diet at t = 0. A: Subgraph of the pathway interaction network, based on incoming interactions to three stress response and apoptosis pathways with the highest in-degree. Pathway nodes with a thick border are significantly enriched (p < 0.05) with differentially expressed genes. B: The protein interactions that compose the interactions between the three apoptosis related pathways and their neighbors in the subgraph as shown in box A (see inset, included interactions are colored orange). Protein nodes have a thick border when their encoding genes are significantly differentially expressed (q < 0.05).

Regulation resources

human ErbB signaling pathway extended with validated microRNA regulation

If we don’t do the magic

Literature, PubChem, GenBank, Patents, Databases, Downloads

Data Analysis, Data Integration, Firewalled Databases

How do R&D companies use public data?

How do pharma companies use public data?

Pfizer

AZ

Roche

n

@gray_alasdair Big Data Integration

Semantic web grammar

[Figure: Open PHACTS platform architecture. Data sources: databases (Db) and nanopublications (Nanopub), each described with VoID, covering Public Content, Commercial, Public Ontologies and User Annotations. Core Platform: Data Cache (Virtuoso Triple Store), Semantic Workflow Engine, Linked Data API (RDF/XML, TTL, JSON), Domain Specific Services, Identity Resolution Service, Chemistry Registration Normalisation & Q/C, Identifier Management Service, Indexing. Example identifiers and labels: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”. Apps.]

Choose a standard

Link one resource to another

Or use both and map

Mapping tools are core tools: need funding and sustainability

Database identifier mapping tools we have:

• A software framework (BridgeDb)

– Application in WikiPathways, PathVisio, Cytoscape, R/Bioconductor

– An installable webservice

– Open source

– Community based

– Database based (small)

• A semantic web implementation (Open PHACTS IMS)

– With installable Docker image

– Linkset based (fast)

– Transitivity (and limits for that): gene -> protein -> has enzyme code; protein -> has enzyme code -> other proteins

• Identifiers.org for ID schemas and resolution
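Why transitivity needs limits can be shown with a toy linkset. This is a minimal sketch, not how the Open PHACTS IMS is implemented; the identifiers, the link list and the `max_hops` parameter are all hypothetical. Following gene -> protein -> enzyme code is useful, but one more hop from the enzyme code reaches every other protein sharing that code.

```python
from collections import deque

# Hypothetical links between identifiers; each pair is treated as symmetric.
LINKS = [
    ("gene:G1", "protein:P1"),
    ("protein:P1", "enzyme:EC-X"),   # P1 "has enzyme code" EC-X
    ("enzyme:EC-X", "protein:P2"),   # other proteins with the same code
    ("enzyme:EC-X", "protein:P3"),
]


def build_index(links):
    """Turn a list of pairs into an undirected adjacency index."""
    index = {}
    for a, b in links:
        index.setdefault(a, set()).add(b)
        index.setdefault(b, set()).add(a)
    return index


def map_transitively(index, start, max_hops):
    """Breadth-first walk over links, stopping after max_hops steps."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # the limit: do not expand any further from here
        for neighbour in index.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    seen.discard(start)
    return seen


index = build_index(LINKS)
one_hop = map_transitively(index, "gene:G1", max_hops=1)
two_hops = map_transitively(index, "gene:G1", max_hops=2)
three_hops = map_transitively(index, "gene:G1", max_hops=3)
```

With one or two hops the gene maps to its own protein and enzyme code; at three hops it also "maps" to P2 and P3, which are different proteins. A real service would more likely restrict which link types may be chained rather than only counting hops.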

This is not just Open PHACTS

Federated SPARQL queries:

e.g. find all genes related to disease, then all pathways with these genes…

Used as hackathon (SWAT4LS) examples

Only works sometimes, by chance

Needs integrated ID mapping!
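A sketch of why such a query only works "by chance". The query text shows the general shape of a SPARQL 1.1 federated query with two SERVICE clauses; the endpoints, the `ex:` predicates and every IRI are placeholders, not real services or vocabularies. The Python part simulates the join on `?gene` with made-up results: it succeeds only if both sources happen to use the same gene identifier, or if a mapping step is added.

```python
# Shape of a federated query. Endpoints and predicates are placeholders.
FEDERATED_QUERY = """
PREFIX ex: <http://example.org/vocab#>
SELECT DISTINCT ?gene ?pathway WHERE {
  SERVICE <http://example.org/disease/sparql> {
    ?gene ex:associatedWith <http://example.org/disease/D1> .
  }
  SERVICE <http://example.org/pathway/sparql> {
    ?pathway ex:hasGene ?gene .
  }
}
"""

# Made-up results each source might return on its own.
disease_genes = {"http://example.org/dbA/gene/101"}
pathway_by_gene = {
    "http://example.org/dbB/gene/G101": "http://example.org/pathway/PW1",
}

# Hypothetical identifier mapping between the two schemes.
id_mapping = {
    "http://example.org/dbA/gene/101": "http://example.org/dbB/gene/G101",
}


def join(genes, gene_to_pathway, mapping=None):
    """Join on the gene identifier, optionally translating it first."""
    mapping = mapping or {}
    hits = set()
    for gene in genes:
        key = mapping.get(gene, gene)
        if key in gene_to_pathway:
            hits.add(gene_to_pathway[key])
    return hits


without_mapping = join(disease_genes, pathway_by_gene)
with_mapping = join(disease_genes, pathway_by_gene, id_mapping)
```

Without the mapping the join is empty even though both sources describe the same gene; with it, the pathway is found.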

Ontology mapping

• Many available, even as services

• Often integrated in data resources

– Make my own, slim, combine, map, extend

– Needs feedback to original!

Metabolite mapping needs

• More mappings! (plant products, drugs, xenobiotics)

• Ontology based mapping (ChEBI)

• Because:

– Palmitic acid is a fatty acid

– R,R,R-tocopherol is a form of Vitamin E

• And these should (sometimes) map

Also applies to biology: scientific lenses
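Ontology-based mapping with a lens can be sketched as follows. The hierarchy holds only the two is-a statements from the slide; in practice it would come from an ontology such as ChEBI. The function names and the lens names "strict" and "relaxed" are assumptions for illustration. The lens decides whether a specific compound may map to its broader class, which is the "(sometimes)" in the bullet above.

```python
# Toy is-a hierarchy, using only the two statements from the slide.
IS_A = {
    "palmitic acid": "fatty acid",
    "R,R,R-tocopherol": "vitamin E",
}


def ancestors(term, is_a=IS_A):
    """Walk up the is-a chain and return all broader terms."""
    chain = []
    while term in is_a:
        term = is_a[term]
        chain.append(term)
    return chain


def maps_to(measured, annotated, lens="strict"):
    """Decide whether a measured compound matches an annotated one.

    strict:  only identical terms match.
    relaxed: a compound also matches any of its broader classes.
    """
    if measured == annotated:
        return True
    if lens == "relaxed":
        return annotated in ancestors(measured)
    return False


strict_hit = maps_to("palmitic acid", "fatty acid", lens="strict")
relaxed_hit = maps_to("palmitic acid", "fatty acid", lens="relaxed")
wrong_class = maps_to("palmitic acid", "vitamin E", lens="relaxed")
```

Under the strict lens a pathway node annotated as "fatty acid" would not receive palmitic acid measurements; under the relaxed lens it would, while unrelated classes still do not match.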

Chemistry mapping

• Structure not ID based

• Allow substructure searches

• Open PHACTS open source ???

• We need it, may have to redo

From reproducibility to reusability

Reuse problems

The age distribution in the experimental groups was not significantly different…

Can we reuse that data to find out age effects?

Yes, if that is actually captured

Needs:

• Ontologies (BioPortal)

• Principles/standards (FAIR, ISA)

• Capture tools (dbNP, MOLGENIS, OpenClinica, eNotebooks)

• Study repositories (BioSamples, BioStudies)

• Data repositories (EGA, GEO, ArrayExpress, MetaboLights, PRIDE)
