Designing an IT infrastructure for data-intensive collaborative -omics projects


Stathis [email protected]

European Bioinformatics Institute, Cambridge, UK

ICTA 2011

Outline

• Introduction
• Why design at all?
• Principles of collaborative design
• A software suite for cross-disciplinary collaborative studies
• Results
• Conclusions

INTRODUCTION

The “central dogma” of information flow in molecular biology

• Replication (DNA synthesis): DNA → DNA
• Transcription (RNA synthesis): DNA → RNA
• Translation (protein synthesis): RNA → Protein

Source: http://www.rsc.org/chemistryworld/Issues/2009/November/BiologysNobelMoleculeFactory.asp

The -omics cascade

• GENOMICS: What CAN happen
• TRANSCRIPTOMICS: What APPEARS to happen
• PROTEOME: What MAKES it happen
• METABOLOME: What HAS happened
• PHENOTYPE

Source: Systems Biology and the Omics Cascade, Karolinska Institutet, June 9-13, 2008

http://xkcd.com/793/

• 407: -omes and -omics terms [1]
• 330: genomes sequenced to date [2]
• 3B: size of the human genome in bases
• $10k: cost to sequence a single human genome [3]
• 30k: interdisciplinary bachelor's degrees awarded in the USA in 2005 [4]

Sources:
[1] http://omics.org/index.php/Alphabetically_ordered_list_of_omes_and_omics
[2] http://www.ensemblgenomes.org/
[3] http://www.genome.gov/sequencingcosts/
[4] http://en.wikipedia.org/wiki/Interdisciplinarity

Trends in publication keywords in the field of bioinformatics

Figure: keyword frequency per year, 2006 to 2011, in three panels:
• semantic, linked data
• cloud, server
• omics, genomics

Challenges in -omics research

• Expensive studies
  – Small number of replicates (n): microarrays, subjects...
  – Large number of variables: genes, proteins, etc.
• This results in:
  – Inflated type I error (false positives)
  – Poor statistical power (few true positives detected)
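The inflation of type I error with many variables can be illustrated with a small calculation. The following is a minimal sketch, not taken from the slides: it assumes independent tests, a significance level of 0.05, and a hypothetical count of 20,000 tested genes, and it uses the Bonferroni correction as one common (and conservative) remedy. All function names are illustrative.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Probability of at least one false positive across m independent
    tests, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def expected_false_positives(alpha: float, m: int) -> float:
    """Expected number of false positives if all m null hypotheses are true."""
    return alpha * m


def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / m


if __name__ == "__main__":
    alpha = 0.05
    m = 20_000  # hypothetical number of genes tested (assumption)

    print("FWER without correction:", familywise_error_rate(alpha, m))
    print("Expected false positives:", expected_false_positives(alpha, m))

    per_test = bonferroni_threshold(alpha, m)
    print("Bonferroni per-test threshold:", per_test)
    print("FWER with Bonferroni:", familywise_error_rate(per_test, m))
```

The stricter per-test threshold controls false positives but, with a small number of replicates, also makes true effects harder to detect, which is the power problem named above.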

WHY DESIGN AT ALL?

http://xkcd.com/970/

Volume vs Complexity cost model

Project | Samples | Research subjects | Studies/data types | Assays    | Files/volume   | Users/roles/user groups | Publ-s per year
MolPAGE | 16.5k   | 2.2k              | 300/11             | 26 000/11 | 27 000/0.7 TB  | 80/1/1                  | 1
ENGAGE  | >100k   | 100k              | 400/13             | ***       | 400/0.25 TB    | 30/5/13                 | 10

Figure: volume (V) plotted against complexity (C), where C ~ data types * user roles * scripts. Two regions are marked:
• Growth of complexity is slower than volume
• Both volume and complexity grow fast

Maria Krestyaninova, 2009
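The complexity relation C ~ data types * user roles * scripts can be tried on the two projects in the table. This is a minimal sketch under stated assumptions: the data-type and role counts are read from the "Studies/data types" and "Users/roles/user groups" columns, the number of scripts is not given in the slides and is set to a placeholder of 1 for both projects, and the class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    name: str
    data_types: int   # second number in "Studies/data types"
    user_roles: int   # second number in "Users/roles/user groups"
    scripts: int      # NOT in the slides: placeholder value (assumption)
    volume_tb: float  # from "Files/volume"

    def complexity(self) -> int:
        """Complexity proxy: C ~ data types * user roles * scripts."""
        return self.data_types * self.user_roles * self.scripts


projects = [
    Project("MolPAGE", data_types=11, user_roles=1, scripts=1, volume_tb=0.7),
    Project("ENGAGE", data_types=13, user_roles=5, scripts=1, volume_tb=0.25),
]

if __name__ == "__main__":
    for p in projects:
        print(f"{p.name}: volume={p.volume_tb} TB, complexity={p.complexity()}")
```

With equal placeholder script counts, ENGAGE stores less file volume than MolPAGE yet scores higher on the complexity proxy, which is the kind of divergence between the two axes that the cost model is meant to capture.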

Ome vs Omics

Source: http://omics.org/index.php/File:Ome_versus_omics_graph_by_Jong_Bhak_openfree.gif

Figure: cost over time.
• Cost axis labels: $3,000,000,000; $10,000; ~$0
• Time axis labels: 2003, 2010, 2016
• Annotations: "Ome and Omics balance point"; "$50,000 per person"

Reporting requirements for publication

Omics investigation:
• Phenotypes/conditions or outcomes considered in a study
• Statistical methods/protocols used in a study
• HTP data used for association (e.g. GWA), genomics
• Raw data
• Processed data
• Results of analysis

Standards and tools shown:
• DataShaper, OBO
• ISATAB, MAGETAB, MIBBI
• Bioconductor

Nobody wants a cellphone that makes calls!

Make your application:
1. Contextualized
2. Usable
3. Enjoyable
4. Visible (increases reputation)
5. Sociable
6. Valuable
7. Explorable
8. Flexible
9. In a participatory way
10. …

OPEN-SOURCE COLLABORATIVE DESIGN

Maxims of the post-information era

• “If the news is important, it will find me”
• “Information wants to be free”
• “It's not information overload, it's filter failure”
• “The people formerly known as the audience”
• “The sources go direct”
• and finally…

Source: http://markcoddington.com/2010/01/30/a-quick-guide-to-the-maxims-of-new-media/

“Do what you do best, link the rest”

http://xkcd.com/974/

Agile development

Individuals & interactions over processes and tools

Working software over comprehensive documentation

Customer collaboration over contract negotiation

Responding to change over following a plan

• In practice: frequent iterations over customer feedback, trust

Metadesign

Figure: participation level at each design phase.
• Design phases: Analysis, Concept design, Concept communication, Distribution, End-of-life
• Participation levels: none, indirect, consultative, shared control, full control

Courtesy of Massimo Menichinelli http://www.openp2pdesign.org/

SOFTWARE FOR CROSS-DISCIPLINARY COLLABORATIVE STUDIES

SIMBioMS

The big picture

SIMBIOM, SOBIBA, ISA, QURETEC, METABAR, etc.
• dynamic storage
• project hosting
• fast exchange
Support for collaborative discovery

CENTRAL DATA ARCHIVES
• permanent deposition
• large volumes
• open access
Knowledge access and sustainability

USERS, DATA PROVIDERS: large consortia, stand-alone researchers

System overview

Diagram components:
• Data sources: Biobanks, -omics
• Databases: Sample DB, Experiment DB, Public Index
• Flows: submission (shown twice), controlled access, open access
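The split between controlled access and open access in the overview can be sketched in code. This is a hypothetical illustration only, not SIMBioMS code: it assumes that submitted records keep their full data behind a group check while a public index exposes only a non-sensitive summary. Every class, method, and field name below is invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    record_id: str
    summary: str                      # non-sensitive text, safe to publish
    payload: dict                     # full sample or experiment data
    allowed_groups: set = field(default_factory=set)


class Repository:
    """Toy store with an open index and group-restricted payloads."""

    def __init__(self) -> None:
        self._records: dict[str, Record] = {}

    def submit(self, record: Record) -> None:
        """Submission: store the record under its identifier."""
        self._records[record.record_id] = record

    def public_index(self) -> list[tuple[str, str]]:
        """Open access: identifiers and summaries only, never payloads."""
        return sorted((r.record_id, r.summary) for r in self._records.values())

    def fetch(self, record_id: str, user_groups: set) -> dict:
        """Controlled access: payload only for members of an allowed group."""
        record = self._records[record_id]
        if record.allowed_groups & user_groups:
            return record.payload
        raise PermissionError(f"no access to {record_id}")


if __name__ == "__main__":
    repo = Repository()
    repo.submit(Record("S1", "plasma samples, twin cohort",
                       {"n_samples": 120}, {"consortium-a"}))
    print(repo.public_index())
    print(repo.fetch("S1", {"consortium-a"}))
```

Keeping the index free of payload data lets anyone discover that a dataset exists while the decision to release it stays with the groups named on the record.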


Current infrastructural volume

• 12 installations in 3 countries
• 100 user organisations
• >50,000 samples
• >50,000 assays and studies
• 4 large federated R&D projects across Europe and Russia

Krestyaninova et al, Bioinformatics, 2009
Viksna et al, BMC Bioinformatics, 2007

SIMBioMS in collaborative biomedical research initiatives

Strategic research collaborations

• BBMRI (www.bbmri.eu)
  – Goal/Description: Build a network of population-based biobanks, experts, and foster collaboration between them. Provide advice to industry.
  – Funded by: EC, OECD
  – SIMBioMS team involvement: Prototyping of data management model, use-case design, discussions.
• P3G (www.p3g.org)
  – Funded by: Canadian Gov., memberships
  – SIMBioMS team involvement: Leading international Informatics Working Group; discussions.
• ELIXIR (www.elixir-europe.org/page.php)
  – Goal/Description: Create a sustainable infrastructure for the storage and distribution of information produced by bioscientists.
  – Funded by: EC
  – SIMBioMS team involvement: Prototyping, reports, cooperation with organisation of medical informatics committee on behalf of EBI.
• TaraOceans (oceans.taraexpeditions.org)
  – Goal/Description: 3-year long circumnavigation expedition for marine genomics and climate integrative study.
  – Funded by: CNRS, industry, potentially EC
  – SIMBioMS team involvement: Preliminary design of data management solution; meetings, discussions.

Services for research collaborations

• ENGAGE (www.euengage.org)
  – Goal/Description: Genetic and genomic research for clinical application.
  – Funded by: EC
  – SIMBioMS team involvement: Design, development and maintenance of dedicated data exchange services, based on SIMBioMS.
• MolPAGE (www.molpage.org)
  – Goal/Description: Biomarkers: discovery and development of novel high-throughput methods.
  – Funded by: EC
• MuTHER
  – Goal/Description: Exploration of gene expression in multiple tissues on 1000 twins associated with aging.
  – Funded by: Wellcome Trust
• SIROCCO (www.sirocco-project.eu)
  – Goal/Description: Study of small RNAs as regulatory cell mechanism; therapeutical applications.
  – Funded by: EC
• CAGEKID
  – Goal/Description: Kidney cancer study.
  – Funded by: EC
• SUMMIT
  – Goal/Description: Surrogate markers for vascular Micro- & Macrovascular hard endpoints for Innovative diabetes Tools.
  – Funded by: EC


Anton Enright, 2011

CONCLUSIONS

Complex interactions

• Who has a say in knowledge extracted from information?
  – Research subjects
    • Consent to particular research being conducted
  – Scientists
    • Protective of vision about their data
  – Funding sources
    • Expect publications from grantees

Diagram: big data shared among industry, academia and state.
Actors shown: Pharma, BioBanks, Research Institutions, FDA, Ministry of Health, Ministry of Education.

Yulia Tammisto, 2011

Complex software

• TIME is the scarcest resource
• Software adoption due to:
  – Requirements
  – No other way to do things
  – Usefulness
• Use = 1 - Reuse

One goal

Search for the truth

Thank you!

Acknowledgements:

• Maria Krestyaninova
• Ugis Sarkans
• Anton Enright
• Mat Davis
• Yulia Tammisto
• Massimo Menichinelli
• Teemu Perheentupa
• Jani Heikkinen
• Balaji Rajashekar
• Raivo Kolde
• Jaak Vilo

Uniquer

www.simbioms.org

