Transcript
Page 1: Designing an IT infrastructure for data-intensive collaborative - omics  projects

Designing an IT infrastructure for data-intensive collaborative -omics projects

Stathis [email protected]

European Bioinformatics Institute, Cambridge, UK

ICTA 2011

Page 2

Outline

• Introduction
• Why design at all?
• Principles of collaborative design
• A software suite for cross-disciplinary collaborative studies
• Results
• Conclusions

Page 3

INTRODUCTION

Page 4

The “central dogma” of information flow in molecular biology

DNA → RNA → Protein

Transcription (RNA synthesis): DNA → RNA
Translation (protein synthesis): RNA → Protein
Replication (DNA synthesis): DNA → DNA

Source: http://www.rsc.org/chemistryworld/Issues/2009/November/BiologysNobelMoleculeFactory.asp

Page 5

The -omics cascade

GENOMICS: what CAN happen
TRANSCRIPTOMICS: what APPEARS to happen
PROTEOME: what MAKES it happen
METABOLOME: what HAS happened
PHENOTYPE

Source: Systems Biology and the Omics Cascade, Karolinska Institutet, June 9-13, 2008

Page 6

http://xkcd.com/793/

Page 7

407: -omes and -omics terms [1]
330: genomes sequenced to date [2]
3B: size of the human genome in bases
$10k: cost to sequence a single human [3]
30k: interdisciplinary bachelor's degrees awarded in 2005 in the USA [4]

Sources:
1 http://omics.org/index.php/Alphabetically_ordered_list_of_omes_and_omics
2 http://www.ensemblgenomes.org/
3 http://www.genome.gov/sequencingcosts/
4 http://en.wikipedia.org/wiki/Interdisciplinarity

Page 8

[Figure, three panels: Trends in publication keywords in the field of bioinformatics, 2006 to 2011. Keywords per panel: "semantic" and "linked data"; "cloud" and "server"; "omics" and "genomics".]

Page 9

Challenges in -omics research

• Expensive studies
  – Small number of replicates (n) (microarrays, subjects...)
  – Large number of variables (genes, proteins, etc.)

• This results in:
  – Inflated type I error (false positives)
  – Poor statistical power (fewer true positives detected)
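The inflation of type I error with many variables can be illustrated with a short calculation. This is a minimal sketch, not part of the original slides: it assumes the tests are independent, and the function names and the Bonferroni correction are illustrative choices rather than anything the talk prescribes.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive across n independent tests,
    each run at significance level alpha (independence is an assumption)."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    alpha = 0.05
    # Illustrative numbers of variables (e.g. genes tested in one study).
    for n_tests in (1, 10, 100, 20000):
        fwer = family_wise_error_rate(alpha, n_tests)
        threshold = bonferroni_threshold(alpha, n_tests)
        print(f"{n_tests:>6} tests: FWER = {fwer:.4f}, Bonferroni threshold = {threshold:.2e}")
```

With thousands of variables and only a handful of replicates, the corrected per-test threshold becomes very strict, which is one way to see why power suffers.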

Page 10

WHY DESIGN AT ALL?

http://xkcd.com/970/

Page 11

Volume vs Complexity cost model

Project | Samples | Research subjects | Studies/data types | Assays | Files/volume | Users/roles/user groups | Publ-s per year
MolPAGE | 16.5k | 2.2k | 300/11 | 26 000/11 | 27 000/0.7 TB | 80/1/1 | 1
ENGAGE | >100k | 100k | 400/13 | *** | 400/0.25 TB | 30/5/13 | 10

C ~ data types × user roles × scripts

[Figure: volume (V) and complexity (C) axes, with two annotations:]
• Growth of complexity is slower than volume
• Both volume and complexity grow fast

Maria Krestyaninova, 2009
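The complexity relation on this slide (C ~ data types × user roles × scripts) can be sketched as a small calculation. This is an illustration under stated assumptions, not the author's model: the `Project` class is hypothetical, "300/11" and "400/13" are read as studies/data types, "80/1/1" and "30/5/13" as users/roles/user groups, and the number of scripts is not reported on the slide, so it is set to a placeholder of 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    """Hypothetical record holding the figures needed for the C ~ ... relation."""
    name: str
    volume_tb: float   # data volume as listed under "Files/volume"
    data_types: int    # second number under "Studies/data types" (assumed reading)
    user_roles: int    # second number under "Users/roles/user groups" (assumed reading)
    scripts: int = 1   # placeholder: script counts are not given on the slide

    def complexity(self) -> int:
        """Complexity proxy: data types * user roles * scripts."""
        return self.data_types * self.user_roles * self.scripts


molpage = Project("MolPAGE", volume_tb=0.7, data_types=11, user_roles=1)
engage = Project("ENGAGE", volume_tb=0.25, data_types=13, user_roles=5)

if __name__ == "__main__":
    for project in (molpage, engage):
        print(f"{project.name}: volume = {project.volume_tb} TB, "
              f"complexity proxy = {project.complexity()}")
```

Under these assumptions the project with the smaller file volume has the larger complexity proxy, which is consistent with treating volume and complexity as separate cost drivers.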

Page 12

Ome vs Omics

Source: http://omics.org/index.php/File:Ome_versus_omics_graph_by_Jong_Bhak_openfree.gif

[Figure: cost against time, 2003 to 2016. Cost labels: $3,000,000,000; $50,000 per person; $10,000; ~$0. Also marked: 2010 and the "Ome and Omics balance point".]

Page 13

Reporting requirements for publication

Phenotypes/conditions or outcomes considered in a study

Statistical methods/protocols used in a study

HTP data used for association (e.g. GWA)

genomics

Raw data

Processed data

Results of analysis

Omics investigation

DataShaper, OBO

ISATAB, MAGETAB, MIBBI

Bioconductor

Page 14

Nobody wants a cellphone that makes calls!

Make your application:
1. Contextualized
2. Usable
3. Enjoyable
4. Visible (increases reputation)
5. Sociable
6. Valuable
7. Explorable
8. Flexible
9. In a participatory way
10. …

Page 15

OPEN-SOURCE COLLABORATIVE DESIGN

Page 16

Maxims of the post-information era

• “If the news is important, it will find me”
• “Information wants to be free”
• “It's not information overload, it's filter failure”
• “The people formerly known as the audience”
• “The sources go direct”
• and finally…

Source: http://markcoddington.com/2010/01/30/a-quick-guide-to-the-maxims-of-new-media/

Page 17

“Do what you do best, link the rest”

http://xkcd.com/974/

Page 18

Agile development

Individuals & interactions over processes and tools

Working software over comprehensive documentation

Customer collaboration over contract negotiation

Responding to change over following a plan

• In practice: frequent iterations over customer feedback, trust

Page 19

Metadesign

Participation level: none, indirect, consultative, shared control, full control

Stages: Analysis, Concept design, Concept communication, Distribution, End-of-life

Courtesy of Massimo Menichinelli http://www.openp2pdesign.org/

Page 20

SOFTWARE FOR CROSS-DISCIPLINARY COLLABORATIVE STUDIES

SIMBioMS

Page 21

The big picture

CENTRAL DATA ARCHIVES

SIMBIOM, SOBIBA, ISA, QURETEC, METABAR, etc.

• dynamic storage
• project hosting
• fast exchange

• permanent deposition
• large volumes
• open access

support for collaborative discovery

knowledge access and sustainability

large consortia

stand alone researchers

Maria Krestyaninova, 2009

Page 22

USERS

DATA PROVIDERS

System overview

Biobanks

-omics

Experiment DB

Sample DB

Public Index

submission

submission

controlled access

open access

Maria Krestyaninova, 2009

Page 23

Current infrastructural volume

• 12 installations in 3 countries
• 100 user-organisations
• >50,000 samples
• >50,000 assays and studies
• 4 large federated R&D projects across Europe and Russia

Krestyaninova et al, Bioinformatics, 2009
Viksna et al, BMC Bioinformatics, 2007

Page 24

SIMBIOMS in collaborative biomedical research initiatives

Strategic research collaborations

Project: BBMRI (www.bbmri.eu)
Goal/Description: Build a network of population-based biobanks, experts, and foster collaboration between them. Provide advice to industry.
Funded by: EC, OECD
Simbioms team involvement: Prototyping of data management model, use-case design, discussions.

Project: P3G (www.p3g.org)
Funded by: Canadian Gov., memberships
Simbioms team involvement: Leading international Informatics Working Group; discussions.

Project: ELIXIR (www.elixir-europe.org/page.php)
Goal/Description: Create a sustainable infrastructure for the storage and distribution of information produced by bioscientists.
Funded by: EC
Simbioms team involvement: Prototyping, reports, cooperation with organisation of medical informatics committee on behalf of EBI.

Project: TaraOceans (oceans.taraexpeditions.org)
Goal/Description: 3-year long circumnavigation expedition for marine genomics and climate integrative study.
Funded by: CNRS, industry, potentially EC
Simbioms team involvement: Preliminary design of data management solution; meetings, discussions.

Services for research collaborations

Project: ENGAGE (www.euengage.org)
Goal/Description: Genetic and genomic research for clinical application.
Funded by: EC
Simbioms team involvement: Design, development and maintenance of dedicated data exchange services, based on SIMBioMS.

Project: MolPAGE (www.molpage.org)
Goal/Description: Biomarkers: discovery and development of novel high-throughput methods.
Funded by: EC

Project: MuTHER
Goal/Description: Exploration of gene expression in multiple tissues on 1000 twins associated with aging.
Funded by: Wellcome Trust

Project: SIROCCO (www.sirocco-project.eu)
Goal/Description: Study of small RNAs as regulatory cell mechanism; therapeutical applications.
Funded by: EC

Project: CAGEKID
Goal/Description: Kidney cancer study.
Funded by: EC

Project: SUMMIT
Goal/Description: Surrogate markers for vascular Micro- & Macrovascular hard endpoints for Innovative diabetes Tools
Funded by: EC

Page 25

Anton Enright, 2011

Page 26

CONCLUSIONS

Page 27

Complex interactions

• Who has a say in knowledge extracted from information?
  – Research subjects
    • Consent to particular research being conducted
  – Scientists
    • Protective of vision about their data
  – Funding sources
    • Expect publications from grantees

Pharma

BioBanks

Research Institutions

big data

industry

academia

state

FDA

Ministry of Health

Ministry of Education

Yulia Tammisto, 2011

Page 28

Complex software

• TIME is the scarcest resource
• Software adoption due to:
  – Requirements
  – No other way to do things
  – Usefulness
• Use = 1 – Reuse

Page 29

One goal

Search for the truth

Page 30

Thank you!

Acknowledgements:

• Maria Krestyaninova
• Ugis Sarkans
• Anton Enright
• Mat Davis
• Yulia Tammisto
• Massimo Menichinelli
• Teemu Perheentupa
• Jani Heikkinen
• Balaji Rajashekar
• Raivo Kolde
• Jaak Vilo

Uniquer

www.simbioms.org

