Broad Institute Workbench
a Cloud-Based Platform for Data Analysis, Management, and Sharing
Topics
● Overview
● Scientific Use Scenarios
● Cancer Genome Analysis
● Medical and Population Genetics
● US Precision Medicine Initiative
● Software Overview
● Takeaway and Questions
Overview
David Siedzik
Chief Product Owner
Broad Institute Genomics Data Sciences Platform
A story about genomics data generation
The Widening Gulf
The day we ran out of compute
Genomic Data Generation Increase
Genome Processing at the Broad...
...Produces a lot of data...
● A single human genome is approximately 120 GB
● We are sequencing genomes at a rate of 1 every 12 minutes
● That’s ~1 Petabyte every 2 months
● Earlier this year we started filling up storage and maxing out compute.
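The figures above can be checked with a little arithmetic. The sketch below uses only the two numbers from the slide (120 GB per genome, one genome every 12 minutes) and assumes decimal units (1 PB = 1,000,000 GB); the variable names are illustrative.

```python
# Back-of-the-envelope check of the slide's data-generation figures.
# Assumption: decimal units, i.e. 1 PB = 1,000,000 GB.
GENOME_SIZE_GB = 120        # from the slide: one human genome is ~120 GB
MINUTES_PER_GENOME = 12     # from the slide: one genome every 12 minutes

genomes_per_day = 24 * 60 / MINUTES_PER_GENOME
gb_per_day = genomes_per_day * GENOME_SIZE_GB
days_per_petabyte = 1_000_000 / gb_per_day

print(f"{genomes_per_day:.0f} genomes per day")
print(f"{gb_per_day / 1000:.1f} TB per day")
print(f"{days_per_petabyte:.0f} days per petabyte")
```

At this rate a petabyte accumulates in a little over two months, which is consistent with the "~1 Petabyte every 2 months" estimate.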
...And it’s hard to store and use that data locally...
● Too many copies of the same dataset littered across the file system
● Hard to track and control data access
● Local file storage is expensive and difficult to maintain at petabyte scale
● Not enough compute to go around
The Opportunity
These challenges focused us on a bold goal:
To develop systems and tools that not only keep up with Broad’s own needs but also increase the availability of genomic resources across the global scientific community - both in data generation, and in the ability to analyze, manage and most importantly share data.
In keeping with our mission, we brought together two core groups—our Sequencing Platform and Data Sciences Platform—to ensure seamless integration of best in class sequencing products and open-source software resources for the global bioinformatics community.
Broad has sequenced and hosts genomes/exomes of > 250K individuals to date.
Broad does a lot of sequencing...
● Public clouds allow scientists to rent compute time rather than rely on an institution’s cluster
● Broad uses more than 100 pipelines: how do we make them available to run consistently for both our internal and external communities?
● Platforms built on cloud infrastructure can reduce the need to hire a bioinformatician and buy hardware
● Computing in the cloud can yield both speed and cost advantages
Opportunity: Expand Access to Computing At Scale
250,000
65,000
The Exome Aggregation Consortium (ExAC)
The Exome Aggregation Consortium II (ExACII)
GnomAD
Variant frequency across 60K exomes -> tremendous value for clinical interpretation of rare disease genomes
Collaborations (since 2012):
● 1753 sample collections
● 913 distinct projects
● 5679 orders
● 150 PIs at any time
Opportunity: Science Is Ever More Collaborative
Opportunity: Bring Researchers to the Data
● NCI identified the importance of these three opportunities and funded three software platform development pilots
● Each system aimed to simplify TCGA data access in a secure and scalable cloud-based offering that brings the analysis to a single copy of the data, enabling access to protected TCGA data for those with dbGaP approval
● The Broad Institute Workbench framework was initially developed to power Broad’s NCI Cloud Pilot, FireCloud
The First Step: NCI Cloud Pilot
● One copy of the data. Datasets can be stored in one place and shared with collaborators -> eliminating duplication in data sharing
● A community-driven “App Store” for methods and best practices pipelines.
● Infinite compute resource, as needed. Workflows can scale to use significant computing power as needed, which yields a reduction in compute time and cost compared to on-premises computation
● Access by web browser or API. Addresses the needs of computational scientists/software engineers as well as those who are less technical
A cloud-based data management and analysis platform for scalable, collaborative research
http://www.firecloud.org
Broad Institute Workbench - In Concept
Broad Institute Workbench - In Reality
Cancer Genomics Analysis on FireCloud
Chet Birger
Getz Lab @ Broad Institute
Scientific Goal: Comprehensive Catalogue of Genes Responsible for Cancer Initiation and Progression
● Foundational for cancer diagnostics, therapeutics, clinical trial design, and selection of rational combination therapies for individual patients
● Guides therapeutic development by identifying dysregulated pathways and druggable targets
Catalogue of Cancer Genes
Unbiased identification of genes harbouring somatic genetic variations at a statistically significant rate or pattern in cancer
● MutSig tool suite for SNPs and INDELs
● GISTIC for CNVs
● Bayesian Nonnegative Matrix Factorization for mutational signal discovery
● GSEA, PARADIGM to identify dysregulated pathways
● and a lot more!
Computational Methods
● Requires large international cohorts to achieve goal
● Several large international projects working to compile this catalogue: e.g., TCGA, ICGC and PCAWG to name a few.
● Getz Lab is a key contributor to these projects
○ Tools and analytical pipelines
○ Sequencing (Broad Genomics)
○ Computational analysis and interpretation
International Effort
Complete characterization of ~35 adult cancers
● ~20 common cancers at 500 cases each
● ~15 rare cancers at 50-150 cases each
● ~11,000 cases
● ~2.5 PB data, originally stored in CGHub and DCC, now in the GDC
The size of the data set and compute capacity required to work on it makes access and analysis difficult for any but the best-resourced institutions.
TCGA Produced Large Amounts of Data
International Cancer Genome Consortium
0.8 PB
5 PB
Genomic Data Distribution is a Challenge
For 90% power to detect 90% of cancer genes with frequency ≥ 2%, need ~2000 samples per tumor type.
50 tumor types x 2000 = 100,000 pairs
Lawrence et al., Discovery and saturation analysis of cancer genes across 21 tumour types, Nature, January 2014
What size cohorts do we need?
We need large datasets with genomic and clinical data to obtain sufficient power to detect/learn:
1. Complete catalog of cancer genes and pathways (>2% of patients) (1000s / tumor type)
2. Explain >95% of tumor types and subtypes (1000s / tumor type)
3. Mutational Signatures (100s - 1000s / tumor type)
4. Germline risk alleles (10,000s / tumor type)
5. Biomarkers for response (100s to 1000s / tumor type / drug)
Preparing for a lot more data
● 2009: Getz Lab began development of Firehose (FireCloud’s on-premises precursor) as a computational platform for TCGA data
● The size of the data sets and the computational needs have grown dramatically
● On-premises Firehose is not capable of supporting current and future research
● Moving to the cloud, with its near limitless storage and elastic compute
● The need to migrate to the cloud coincided with the NCI Cancer Genomics Cloud Pilot Project; the Broad Institute Workbench grew from this initial funding
Moving to the Cloud
● Virtually all of Getz Lab’s efforts at fully characterizing the somatic variations driving cancer are done in the context of large international projects
● Many smaller projects also collaborative efforts
● Support for collaborative science one of FireCloud’s principal design goals
● Achieved through FireCloud’s workspace-centric design
Supporting Collaborative Science
Medical and Population Genetics
Alisa Manning
Diabetes Research Group
Broad Institute
Large-Scale Statistical Analysis of Whole Genome Sequence Data with Hail and
the Broad Institute Workbench
http://hail.is/
Hail is an open-source framework for scalable genetic data analysis...
What is… ?
● Hail is a scalable, reliable framework and a powerful language for genetic data analysis
● Open source, under active development, widespread adoption at Broad
● Leveraging open-source big-data tools
● Innovating to solve the unique problems of genetics
Hail web site: http://hail.is
Hail code: github.com/broadinstitute/hail
Contact: [email protected]
• N = 18,877 samples (FHS, OOA, JHS)
• Variants called with `vt` (Tan A et al. Bioinformatics. 2015)
• ~193,000,000 variants (passing, biallelic sites)
NHLBI’s Trans-Omics for Precision Medicine (TOPMed)
Hail commands for complex QC and variant annotation (2 - 12 hours depending on the number of cores)
    hail read -i file:///mnt/geno/nhlbi.1575.sftp-exchange-area.keep.freeze3a.pass.gtonly.minDP10.genotypes.vds \
      filtervariants expr -c 'va.pass' --keep \
      filtervariants expr -c 'v.contig == "X" || v.contig == "Y" || v.contig == "MT"' --remove \
      …
      variantqc filtervariants expr -c 'va.qc.AC > 0' --keep \
      annotatevariants intervals -r va.isLCF -i file:///mnt/lustre/aganna/LCR.interval_list \
      annotatevariants expr -c 'va.badpHWE = va.annot.pHWE_Amish <= 0.000000001 || va.annot.pHWE_FHS <= 0.000000001 || va.annot.pHWE_JHS <= 0.000000001' \
      ...
Fuchsberger C et al. The genetic architecture of type 2 diabetes. Nature. 2016 Aug 4;536(7614):41–7.
Pilot Analysis in Workbench
Workspaces
Method Repository
Google Cloud Storage
Summary Data Analysis Methods Monitor
Workbench Schematic
The Workspace links methods and analysis to data
Workspaces
The Data Model links sample IDs to VCF files, trait files, and other inputs
Data Model
We implemented single-variant association analysis with EPACTS.
EPACTS in Workbench
Methods allow you to specify the input and output for your pipeline
Method Customization
Logs, input, and output files are linked to the execution of a method
Method Provenance
The data model links methods and analyses to results
Method Results
We are...
• Implementing our standard analysis pipelines in the Workbench
• Creating new methods for whole genome sequence data
• Developing our pipelines with state of the art computing paradigms
• Looking for broader engagement from you!
http://hail.is/
Hail is an open-source framework for scalable genetic data analysis...
Scaling Statistical Genetics
Precision Medicine Initiative
Kristian Cibulskis
Engineering Director
Broad Data Sciences Platform
US Health - Early 1900s
1948: Launch of Framingham Heart Study
Dawber TR, Meadors GF, Moore FEJ: Epidemiological approaches to heart disease: the Framingham Study. Am J Public Health 1951, 41:279-286.
Dawber TR, Kannel WB, Revotskie N, Stokes JI, Kagan A, Gordon T: Some factors associated with the development of coronary heart disease; six years' follow-up experience in the Framingham Study. Am J Public Health 1959, 49:1349-1356.
1961: Early Biomarkers - “Factors of Risk”
Decline in Heart Disease Mortality
Source: CDC Morbidity and Mortality Weekly Report (MMWR)
Framingham for the 21st Century
Precision Medicine Initiative aims to be a
“Framingham for the 21st Century”
It will be the largest medical scientific study in the history of the world
January 2015: State of the Union
● 1 million or more participants
● Longitudinal, ability to recontact
● Focus on engagement
● Two methods of enrollment
○ Healthcare provider organizations
○ Direct volunteers
September 2015: PMI Working Group report
Precision Medicine Initiative Cohort Program
now known as
All of Us Research Program
joinallofus.org
Building a Research Foundation For 21st Century Medicine
All of Us
PMI Organization and Data Flow
Data & Research Center
Biobank
PMI Data & Research Center (DRC)
Mission
● To acquire, organize and provide access to what will be one of the world’s largest and most diverse datasets for precision medicine research
● Provide research support for the scientific data and analysis tools for the program, helping to build a vibrant community of researchers
Data & Research Center (DRC)
Acquire & Organize
Research Support
Data & Research Center (DRC)
Broad Institute Workbench
The Workbench Vision
Plans for launch and beyond
● PMI Cohort Program anticipates 3–4 years to reach one million participants
● Phased implementation as we pilot, iterate, and scale
● Initial releases will focus on data collection and portals
Workbench Deep Dive
Alex Baumann
Product Owner
● One copy of the data. Datasets can be stored in one place and shared with collaborators -> eliminating duplication in data sharing
● A community-driven “App Store” for methods and best practices pipelines
● Infinite compute resource, as needed. Workflows can scale to use significant computing power as needed, which yields a reduction in compute time and cost compared to on-premises computation
● Access by web browser or API. Addresses the needs of computational scientists/software engineers as well as those who are less technical
A cloud-based data management and analysis platform for scalable, collaborative research
http://www.firecloud.org
Broad Institute Workbench
...To the Cloud!
WDL: an open source language for computational biologists to express analytical pipelines
Cromwell: an open source, scalable, robust engine for interpreting and executing a WDL using various backends
Workbench: Several services, all open source
Google Genomics Pipelines API: co-developed by Broad and Google Genomics, a scalable Docker-as-a-Service data scheduler
Introducing Workbench
Within the workbench you can access your workspaces and browse methods
What Are Workspaces?
● Datasets are stored and their analyses are run within Workspaces, so you can:
○ Organize: Workspaces contain datasets and the analyses that can be run on them
○ Track: Workspaces retain the history of all analyses that have been run, to support reproducibility and traceability
○ Collaborate: Workspaces can be shared with others as Readers (view-only), Writers (run analyses and modify data), and Owners (modify, and also delete and share)
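The three sharing roles can be pictured as an ordered set of permissions. The following is a minimal sketch, not the Workbench implementation: it assumes the roles are strictly hierarchical (each role includes everything the one below it can do), and the action names are hypothetical labels derived from the slide's wording.

```python
from enum import IntEnum


class AccessLevel(IntEnum):
    """Workspace roles from the slide, ordered from least to most privileged."""
    READER = 1
    WRITER = 2
    OWNER = 3


# Hypothetical action names; the minimum role for each follows the slide's wording.
REQUIRED_LEVEL = {
    "view": AccessLevel.READER,
    "run_analysis": AccessLevel.WRITER,
    "modify_data": AccessLevel.WRITER,
    "delete": AccessLevel.OWNER,
    "share": AccessLevel.OWNER,
}


def is_allowed(level: AccessLevel, action: str) -> bool:
    """Return True if a user holding `level` may perform `action`."""
    return level >= REQUIRED_LEVEL[action]


print(is_allowed(AccessLevel.READER, "run_analysis"))
print(is_allowed(AccessLevel.WRITER, "modify_data"))
print(is_allowed(AccessLevel.OWNER, "share"))
```

Under these assumptions a Reader cannot launch an analysis, a Writer can modify data, and only an Owner can share or delete.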
Inside a Workspace
The Summary tab supports sharing, accessing the bucket, and various metadata
Workbench Features
● Domain-specific layer above Google Cloud Platform - data accessible and usable outside of Workbench via Google APIs, with other clouds coming soon
● Designed for scalability of data and analyses
● TCGA data (both open and controlled access) is available in Workbench; Broad data will be delivered within workspaces; and we will host other large public cancer data sets (e.g., TARGET, CCLE)
● Broad’s best practice pipelines are going into the methods repo
● Data model supports easily running methods at scale and maintaining organized data files and metadata
Google Cloud Storage
Data is stored within buckets, which are accessible via the Google Cloud console (or gsutil)
Sharing a Workspace
A workspace can be shared with other users as Owners, Writers and Readers
Workspace Data Model
● Incorporates TCGA data model with Participants, Samples, Pairs, and sets
● Metadata can be constant values, such as the number 50, or data file URLs
● Allows you to organize multiple datasets that all use one copy of the data
● Can be used as inputs to analyses and updated from outputs of analyses
● Outputs can be written back to data model and used in downstream analyses
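As a rough illustration of the points above, the sketch below models entities with attributes that are either constants or file URLs, groups them into sets that share one underlying record, and writes an analysis output back as a new attribute. The class names, attribute names, and `gs://example-bucket/...` paths are all hypothetical; this is not the Workbench schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# An attribute is either a constant (e.g. the number 50) or a string such as a file URL.
AttributeValue = Union[int, float, str]


@dataclass
class Entity:
    entity_type: str                      # e.g. "participant", "sample", "pair"
    name: str
    attributes: Dict[str, AttributeValue] = field(default_factory=dict)


@dataclass
class EntitySet:
    name: str
    members: List[Entity] = field(default_factory=list)


# Hypothetical sample entities; the bucket and file names are placeholders.
tumor = Entity("sample", "P001-T", {
    "participant": "P001",
    "bam": "gs://example-bucket/P001-T.bam",
    "min_depth": 50,
})
normal = Entity("sample", "P001-N", {
    "participant": "P001",
    "bam": "gs://example-bucket/P001-N.bam",
})
pair = Entity("pair", "P001-pair", {"case_sample": "P001-T", "control_sample": "P001-N"})

# Two datasets that reference the same entity: the data file itself is stored once.
all_samples = EntitySet("all_samples", [tumor, normal])
tumors_only = EntitySet("tumors_only", [tumor])

# An analysis output is written back as a new attribute and is then visible
# from every set that includes the entity, ready for downstream analyses.
tumor.attributes["vcf"] = "gs://example-bucket/outputs/P001-T.vcf"
print(tumors_only.members[0].attributes["vcf"])
```

Because both sets hold a reference to the same entity rather than a copy, the written-back output does not need to be duplicated per dataset.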
Organizing Data
The Data tab organizes data files and other metadata around higher level concepts such as participants, samples, pairs of samples, and sets
IGV Integration
The Integrative Genomics Viewer can be used to visualize genomics datasets using the data model and files within buckets
Launching Methods
Analyses can be run upon entities in the data model, gathering all data files and metadata from attributes of entities
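One way to picture how inputs are gathered from an entity is shown below. This is a sketch under stated assumptions: the `this.<attribute>` expression form, the function name `resolve_inputs`, and the input names are illustrative and are not taken from the slides.

```python
from typing import Any, Dict


def resolve_inputs(entity_attributes: Dict[str, Any],
                   input_expressions: Dict[str, str]) -> Dict[str, Any]:
    """Resolve each 'this.<attribute>' expression against one entity.

    Any expression that does not start with 'this.' is passed through as a literal.
    """
    resolved: Dict[str, Any] = {}
    for input_name, expression in input_expressions.items():
        if expression.startswith("this."):
            attribute = expression[len("this."):]
            if attribute not in entity_attributes:
                raise KeyError(
                    f"entity has no attribute {attribute!r} needed by input {input_name!r}"
                )
            resolved[input_name] = entity_attributes[attribute]
        else:
            resolved[input_name] = expression
    return resolved


# Hypothetical sample entity and method inputs.
sample = {"bam": "gs://example-bucket/S1.bam", "min_depth": 50}
inputs = resolve_inputs(sample, {
    "align.input_bam": "this.bam",
    "align.min_depth": "this.min_depth",
    "align.reference_name": "hg19",
})
print(inputs)
```

Running the same method over a set of entities then amounts to calling the resolver once per member, which is what lets a single configuration scale across a cohort.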
Monitoring Analyses
Analyses can be viewed as they progress, and all are retained as a historical record
Methods Repository
● Stores workflows and tasks you can reuse and share publicly or privately
● Method tools are packaged as Docker images to ensure tool portability
● Methods are versioned to support reproducibility and reusability
● Many Broad best practice public methods are available, with more in the works
● We plan to provide and consume methods via the GA4GH Tool Registry API
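The versioning idea can be sketched as an append-only store in which publishing never overwrites an earlier version. Everything here is an assumption made for illustration (the class names, the namespace/name/version key, and the `example/aligner` image tags); it is not the Methods Repository API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MethodSnapshot:
    """One immutable, versioned method definition."""
    namespace: str
    name: str
    version: int
    docker_image: str   # the tool is pinned to a specific image for portability
    public: bool


class MethodRepository:
    """Append-only store: publishing always creates a new version."""

    def __init__(self) -> None:
        self._snapshots: Dict[Tuple[str, str], List[MethodSnapshot]] = {}

    def publish(self, namespace: str, name: str, docker_image: str,
                public: bool = False) -> MethodSnapshot:
        history = self._snapshots.setdefault((namespace, name), [])
        snapshot = MethodSnapshot(namespace, name, len(history) + 1, docker_image, public)
        history.append(snapshot)
        return snapshot

    def get(self, namespace: str, name: str, version: int) -> MethodSnapshot:
        return self._snapshots[(namespace, name)][version - 1]


repo = MethodRepository()
v1 = repo.publish("demo", "align", "example/aligner:1.0")
v2 = repo.publish("demo", "align", "example/aligner:1.1", public=True)

# An analysis that recorded version 1 can still retrieve exactly what it ran.
print(repo.get("demo", "align", 1).docker_image)
```

Keeping old versions retrievable is what allows a past analysis to be rerun with exactly the method and tool image it originally used.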
Supporting Curation
Methods show their version, creator, documentation and other metadata, and we plan to add ratings, comments and other tools to aid in community curation
Architecture
FireCloud API & Web Portal
Workbench Service
Cromwell (WDL Execution)
Methods Repository
gsutil
Google IDs for Authentication
Google Cloud Storage
All Services
Google Compute Engine
CloudSQL for RDBMS
Cloud Monitoring for Operations
Google Genomics Pipelines API
What you need to know
David Siedzik
Chief Product Owner
How to Learn More
● Available now at www.firecloud.org; the APIs are at api.firecloud.org
● Post questions and comments on our forum!
● All of our tools are open source, and we welcome software collaborators and feature requests: https://github.com/broadinstitute
● Alexander Baumann will be available to answer your questions or show demos at Meet the Expert (Booth 329) Friday from 1-2 pm
An open, unlimited, and fast future
To the Cloud!
Acknowledgements - The Team
Analysis
Alisa Manning ([email protected])
Cotton Seed
Seung Hoan Choi
Pradeep Natarajan
Maryam Zekavat
Links
FireCloud: www.firecloud.org
Workflow Description Language (WDL): https://software.broadinstitute.org/wdl/
Broad Institute Data Science and Data Engineering
Genomic Platform Cancer Program
National Cancer Institute
National Institutes of Health
PIs
Gad Getz
Anthony Philippakis