Designing an IT infrastructure for data-intensive collaborative -omics projects



Stathis Kanterakis
European Bioinformatics Institute, Cambridge, UK
ICTA 2011

Delivered at the International Conference on Informatics and Communication Technologies and Applications, Orlando, FL, Dec 1st, 2011, by Stathis Kanterakis.

Outline

  • Introduction
  • Why design at all?
  • Principles of collaborative design
  • A software suite for cross-disciplinary collaborative studies
  • Results
  • Conclusions

Introduction

The central dogma of information flow in molecular biology

DNA → RNA → Protein

  • Replication (DNA synthesis): DNA → DNA
  • Transcription (RNA synthesis): DNA → RNA
  • Translation (protein synthesis): RNA → Protein

The so-called 'central dogma' of biology, 'DNA makes RNA makes protein', was formalised by Francis Crick in 1970. It describes how genetic information stored in the form of DNA is converted into functional molecules; 'RNA makes protein' is where the ribosome comes in.

The -omics cascade

  • GENOMICS: what CAN happen
  • TRANSCRIPTOMICS: what APPEARS to happen
  • PROTEOME: what MAKES it happen
  • METABOLOME: what HAS happened
  • PHENOTYPE

Source: Systems Biology and the Omics Cascade, Karolinska Institutet, June 9-13, 2008

Notes:

  • Metabolomics is the systematic study of the unique chemical fingerprints that specific cellular processes leave behind (small-molecule metabolite profiles).
  • This is what we study in bioinformatics.
  • It is no longer sufficient to identify one of these layers; it needs to be looked at in the context of the other -omics.
  • Journals adhere to specific reporting requirements.
  • Systems biology creates models that span wider than a particular phenomenon.

  • 407: -omes and -omics terms
  • 330: genomes sequenced to date
  • 3B: size of the human genome in bases
  • $10k: cost to sequence a single human
  • 30k: interdisciplinary bachelor's degrees awarded in the USA in 2005

Systems biology?

An informatics technology cannot, by itself, solve complex biological problems; without more, these terms are just buzzwords.

Challenges in -omics research

  • Expensive studies
  • Small number of replicates (n): microarrays, subjects...
  • Large number of variables: genes, proteins, etc.

This results in:

  • Inflated type I error (false positives)
  • Poor statistical power (true positives)

  • Biomarker and drug target discovery
  • Hypothesis generation

Notes: equipment is expensive, so people have started forming consortia and biobanks have started to emerge. Very diverse expertise is required.

Why design at all?

  • About the increase in cross-disciplinary biological research
  • Increase in volume and complexity
  • Why care? What are the value and costs associated with doing, or not doing, things right?
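To make the "inflated type I error" point from the challenges slide concrete, here is a minimal sketch in Python. It is not from the talk. It assumes every test is independent and every null hypothesis is true, and the test counts (up to 20,000 variables) are purely illustrative. It shows how the chance of at least one false positive grows with the number of variables tested, and what per-test threshold a Bonferroni correction would impose.

```python
def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent tests,
    assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


def expected_false_positives(num_tests: int, alpha: float = 0.05) -> float:
    """Expected number of false positives when every null hypothesis is true."""
    return num_tests * alpha


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / num_tests


if __name__ == "__main__":
    # Illustrative counts only: one test, a small panel, and array-scale studies.
    for num_tests in (1, 10, 1_000, 20_000):
        print(
            f"tests={num_tests:>6}  "
            f"FWER={family_wise_error_rate(num_tests):.4f}  "
            f"expected false positives={expected_false_positives(num_tests):.1f}  "
            f"Bonferroni threshold={bonferroni_threshold(num_tests):.2e}"
        )
```

The stricter per-test threshold is also why a small number of replicates hurts power: with few samples, few true effects are strong enough to pass it.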

Volume vs complexity cost model

MolPAGE
  • Samples: 16.5k
  • Research subjects: 2.2k
  • Studies/data types: 300/11
  • Assays: 26 000/11
  • Files/volume: 27 000/0.7 TB
  • Users/roles/user groups: 80/1/1
  • Publications per year: 1

ENGAGE
  • Samples: >100k
  • Research subjects: 100k
  • Studies/data types: 400/13
  • Assays: ***
  • Files/volume: 400/0.25 TB
  • Users/roles/user groups: 30/5/13
  • Publications per year: 10

[Figure: bubble chart of volume against complexity, with complexity ~ data types * user roles * scripts. Annotations: "Growth of complexity is slower than volume"; "Both volume and complexity grow fast". Maria Krestyaninova, 2009]

The size of a bubble represents price (or effort); the colours of the bubbles correspond to two different types of projects. The first project in the table (MolPAGE) has a relatively high volume of data (0.7 TB), simple access rights management, and follows standard ways of data description. The second (ENGAGE), despite its lower volume, requires more effort because of the complexity of user access rights and more diverse data description. The table shows that the number of discoveries is higher in the project of higher complexity, but so is the effort required for data management. If data is stored in a universal standard it is much cheaper to handle, but the creative yield would be lower, because the ability to rename (re-annotate) data units or data points and the ability to share the data selectively with other experts are imperative for collaborative discovery.

Ome vs Omics

[Figure: cost (axis from ~$0 to $3,000,000,000) against time (2003 to 2016), with curves labelled Ome and Omics. Balance point: 2010, $50,000 per person.]

  • Digitalization costs going down
  • Analysis costs going up

Reporting requirements for publication

  • DataShaper, OBO
  • ISA-TAB, MAGE-TAB, MIBBI
  • Bioconductor
  • A REQUIREMENT!
  • Ontologies & study design
  • Minimum information for x

Nobody wants a cellphone that makes calls!

Make your application:

  • Contextualized
  • Usable
  • Enjoyable
  • Visible (increases reputation)
  • Sociable
  • Valuable
  • Explorable
  • Flexible

In a participatory way

Users expect more (cell phone shaver).

OPEN-SOURCE collaborative design

Maxims of the post-information era

  • "If the news is important, it will find me"
  • "Information wants to be free"
  • "It's not information overload, it's filter failure"
  • "The people formerly known as the audience"
  • "The sources go direct"

Notes on the maxims:

  • "If the news is important, it will find me": an unnamed college student in an anecdote in a March 27, 2008, New York Times article by Brian Stelter on how young people share political news. It summarises the way news is consumed online: by linking, sharing, and reading one bit without even seeing the whole or the original source. The user creates her own news agenda, and her most trusted sources are her social networks.
  • "Information wants to be free": writer Stewart Brand, 1984. As he recalled it 13 years later: "On the one hand information wants to be expensive, because it's so valuable. The right information in the right place just changes your life. On the other hand, information wants to be free, because the cost of getting it out is getting lower and lower all the time."
  • "It's not information overload, it's filter failure": title of a keynote speech given by NYU professor and new media guru Clay Shirky on Sept. 18, 2008, at the Web 2.0 Expo in New York. Information overload has been around for centuries, and the reason it seems so problematic on the web is that we haven't developed the proper filters for all that information: social filters and sharing, and curation and aggregation of news.
  • "The people formerly known as the audience": NYU professor Jay Rosen, June 27, 2006. In short: our users know more than we do.
  • "The sources go direct": blogging and RSS pioneer Dave Winer, who seems to have officially coined it in the March 19, 2009, post "The reboot of journalism".

...and finally

Do what you do best, link the rest

By journalist Jeff Jarvis: can we do it better? If not, then link, and devote your time to what you can do better.

  • Generality vs usefulness (Use = 1 / Re-Use)
  • Semantic web vs schema.org
  • MAGE vs MAGE-TAB

Agile development

  • Individuals & interactions over processes and tools
  • Working software over comprehensive documentation
  • Customer collaboration over contract negotiation
  • Responding to change over following a plan

In practice: frequent iterations over customer feedback, trust. Feedback forms, workshops, individual interactions.

Metadesign

[Figure: participation level (none, indirect, consultative, shared control, full control) across the design stages (analysis, concept design, concept communication, distribution, end-of-life). Courtesy of Massimo Menichinelli]

Infrastructure creation & involvement of users

Software for cross-disciplinary collaborative studies

SIMBioMS

The big picture

[Figure: the big picture. Labels: CENTRAL DATA ARCHIVES; SIMBIOMS, OBIBA, ISA, QURETEC, METABAR, etc.; dynamic storage, project hosting, fast exchange; permanent deposition, large volumes, open access; support for collaborative discovery; knowledge access and sustainability; large consortia; stand-alone researchers. Maria Krestyaninova, 2009]

Notes:

  • Data graveyard (grows bigger).
  • Green points stay lean because they pull and push data constantly.
  • Each package serves consortia but does not cover the full set of needs.
  • Problem: highly varied in-house/open data management solutions; efficient data exchange is needed.
  • Solution: a web-based shared infrastructure, customizable and secure.

System overview

[Figure: system overview. Labels: USERS; DATA PROVIDERS (Biobanks, -omics); Sample DB; Experiment DB; Public Index; submission (from each data provider); controlled access; open access.]

Maria Krestyaninova, 2009
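The split between controlled and open access in the system overview can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SIMBioMS implementation. It assumes that full records in the sample and experiment databases are visible only to authorised user groups, while the public index exposes a short summary of every record. All class, field, and function names here are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    OPEN = "open"
    CONTROLLED = "controlled"


@dataclass(frozen=True)
class Record:
    """One submitted item (hypothetical model, not the SIMBioMS schema)."""
    record_id: str
    source: str                      # e.g. "sample_db" or "experiment_db"
    summary: str                     # short description exposed via the index
    payload: str                     # the full data, possibly restricted
    access: Access = Access.CONTROLLED
    allowed_groups: frozenset = frozenset()


@dataclass(frozen=True)
class User:
    name: str
    groups: frozenset = frozenset()


def public_index(records):
    """Build the openly visible index: summaries only, never payloads."""
    return [
        {"id": r.record_id, "source": r.source,
         "summary": r.summary, "access": r.access.value}
        for r in records
    ]


def can_read(user: User, record: Record) -> bool:
    """Open records are readable by anyone; controlled records require
    membership in at least one of the record's allowed groups."""
    if record.access is Access.OPEN:
        return True
    return bool(user.groups & record.allowed_groups)


if __name__ == "__main__":
    records = [
        Record("S1", "sample_db", "Plasma samples", "raw sample data",
               Access.CONTROLLED, frozenset({"consortium_a"})),
        Record("E1", "experiment_db", "Expression study", "raw assay data",
               Access.OPEN),
    ]
    member = User("alice", frozenset({"consortium_a"}))
    outsider = User("bob")
    print(public_index(records))
    print([can_read(member, r) for r in records])    # member sees both
    print([can_read(outsider, r) for r in records])  # outsider sees only the open one
```

Keeping the index separate from the payloads lets outside researchers discover that data exists without being given the data itself.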

Current infrastructural volume

  • 12 installations in 3 countries
  • 100 user-organisations
  • >50,000 samples
  • >50,000 assays and studies
  • 4 large federated R&D projects across Europe and Russia

Krestyaninova et al., Bioinformatics, 2009
Viksna et al., BMC Bioinformatics, 2007

SIMBIOMS in collaborative biomedical research initiatives

Strategic research collaborations

BBMRI (www.bbmri.eu)
  • Goal/description: Build a network of population-based biobanks and experts, and foster collaboration between them. Provide advice to industry.
  • Funded by: EC, OECD
  • Simbioms team involvement: Prototyping of data management model, use-case design, discussions.

P3G (www.p3g.org)
  • Goal/description: Leading international Informatics Working Group; a sustainable infrastructure for the storage and distribution of information produced by bioscientists.
  • Funded by: Canadian Gov., memberships; EC
  • Simbioms team involvement: Prototyping, reports, cooperation with organisation of medical informatics committee on behalf of EBI.

TaraOceans
  • Goal/description: 3-year long circumnavigation expedition for marine genomics and climate integrative study.
  • Funded by: CNRS, industry, potentially EC
  • Simbioms team involvement: Preliminary design of data management solution; meetings, discussions.

Services for research collaborations

ENGAGE (www.euengage.org)
  • Goal/description: Genetic and genomic research for clinical application.
  • Funded by: EC
  • Simbioms team involvement: Design, development and maintenance of dedicated data exchange services based on SIMBioMS.

MolPAGE (www.molpage.org)
  • Goal/description: Biomarkers: discovery and development of novel high-throughput methods.
  • Funded by: EC

MuTHER
  • Goal/description: Exploration of gene expression in multiple tissues on 1000 twins associated with aging.
  • Funded by: Wellcome Trust

SIROCCO (www.sirocco-project.eu)
  • Goal/description: Study of small RNAs as regulatory cell mechanism; therapeutical applications.
  • Funded by: EC

CAGEKID
  • Goal/description: Kidney cancer study.
  • Funded by: EC

SUMMIT
  • Goal/description: Surrogate markers for vascular Micro- & Macrovascular hard endpoints for Innovative diabetes Tools.
  • Funded by: EC

SIROCCO example

[Figure: SIROCCO example. Anton Enright, 2011]

Conclusions

Complex interactions

Who has a say in knowledge extracted from information?

  • Research subjects: consent to particular research being conducted
  • Scientists: protective of vision about their data
  • Funding sources: expect publications from grantees

[Figure: stakeholders. Labels: Pharma; BioBanks; Research Institutions; big data; industry; academia; state; FDA; Ministry of Health; Ministry of Education. Yulia Tammisto, 2011]

Notes:

  • The subject/patient has primary ownership of the data and may be against somebody finding out health-related risks, e.g. predisposition to cancer.
  • The scientist/data generator wants to protect their vision of how data should be converted to knowledge (pride bias).
  • They also want to affirm proper credit for knowledge extracted from their data, to justify and secure future funding.

Complex software

TIME is the scarcest resource

Software adoption due to:

  • Requirements
  • No other way to do things
  • Usefulness

Use = 1 / Reuse

  • Alignment of objectives
  • Users are NOT the decision makers
  • Generality vs usefulness

One goal

Search for the truth

Thank you!

Acknowledgements:

  • Maria Krestyaninova
  • Ugis Sarkans
  • Anton Enright
  • Mat Davis
  • Yulia Tammisto
  • Massimo Menichinelli
  • Teemu Perheentupa
  • Jani Heikkinen
  • Balaji Rajashekar
  • Raivo Kolde
  • Jaak Vilo


