Management, and Sharing a Cloud-Based Platform for Data Analysis,genomics.broadinstitute.org/data-sheets/PPT_Workbench... · 2017-12-15 · a Cloud-Based Platform for Data Analysis,

Broad Institute Workbench

a Cloud-Based Platform for Data Analysis, Management, and Sharing

[email protected]

Topics

● Overview

● Scientific Use Scenarios

● Cancer Genome Analysis

● Medical and Population Genetics

● US Precision Medicine Initiative

● Software Overview

● Takeaway and Questions

Overview

David Siedzik
Chief Product Owner
Broad Institute Genomics Data Sciences Platform

A story about genomics data generation

The Widening Gulf

The day we ran out of compute

Genomic Data Generation Increase

Genome Processing at the Broad...

...Produces a lot of data...

● A single human genome is approximately 120 GB

● We are sequencing genomes at a rate of 1 every 12 minutes

● That’s ~1 Petabyte every 2 months

● Earlier this year we started filling up storage and maxing out compute.
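
The petabyte figure above can be checked with a quick calculation. This is a rough sketch that assumes "2 months" means 60 days and uses decimal units (1 PB = 1,000,000 GB); the genome size and sequencing rate are taken from the bullets above.

```python
# Back-of-the-envelope check of the data-generation figures quoted above.
GENOME_SIZE_GB = 120       # approximate size of one human genome (from the slide)
MINUTES_PER_GENOME = 12    # one genome sequenced every 12 minutes (from the slide)
DAYS = 60                  # assumption: "2 months" taken as 60 days

genomes_per_day = 24 * 60 / MINUTES_PER_GENOME
gb_per_day = genomes_per_day * GENOME_SIZE_GB
total_pb = gb_per_day * DAYS / 1_000_000  # decimal units: 1 PB = 1,000,000 GB

print(f"{genomes_per_day:.0f} genomes/day, {gb_per_day / 1000:.1f} TB/day")
print(f"{total_pb:.2f} PB in {DAYS} days")
```

Under these assumptions the result is a little under 1 PB, which is consistent with the "~1 Petabyte every 2 months" estimate.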

...And it’s hard to store and use that data locally...

● Too many copies of the same dataset littered across the file system

● Hard to track and control data access

● Local file storage is expensive and difficult to maintain at petabyte scale

● Not enough compute to go around

The Opportunity

These challenges focused us on a bold goal:

To develop systems and tools that not only keep up with Broad’s own needs but also increase the availability of genomic resources across the global scientific community - both in data generation, and in the ability to analyze, manage and most importantly share data.

In keeping with our mission, we brought together two core groups, our Sequencing Platform and Data Sciences Platform, to ensure seamless integration of best-in-class sequencing products and open-source software resources for the global bioinformatics community.

Broad has sequenced and hosts genomes/exomes of > 250K individuals to date.

Broad does a lot of sequencing...

● Public clouds allow scientists to rent compute time rather than relying on an institution’s cluster

● Broad uses more than 100 pipelines: how do we make those available to run consistently for our internal and external communities?

● Platforms built on cloud infrastructure can reduce the need to hire a bioinformatician and buy hardware

● Computing in the cloud can yield both speed and cost advantages

Opportunity: Expand Access to Computing At Scale

250,000

65,000

The Exome Aggregation Consortium (ExAC)

The Exome Aggregation Consortium II (ExACII)

GnomAD

Variant frequency across 60K exomes -> tremendous value for clinical interpretation of rare disease genomes

Collaborations (since 2012):

● 1753 sample collections

● 913 distinct projects

● 5679 orders

● 150 PIs at any time

Opportunity: Science Is Ever More Collaborative

Opportunity: Bring Researchers to the Data

● NCI identified the importance of these three opportunities and funded three software platform development pilots

● Each system aimed to simplify TCGA data access in a secure and scalable cloud-based offering that brings the analysis to a single copy of the data, enabling access to protected TCGA data for those with dbGaP approval

● The Broad Institute Workbench framework was initially developed to power Broad’s NCI Cloud Pilot, FireCloud

The First Step: NCI Cloud Pilot

● One copy of the data. Datasets can be stored in one place and shared with collaborators -> eliminating duplication in data sharing

● A community-driven “App Store” for methods and best practices pipelines.

● Infinite compute resource, as needed. Workflows can scale to use significant computing power as needed, which yields a reduction in compute time and cost compared to on-premises computation

● Access by web browser or API. Addresses the needs of computational scientists/software engineers as well as those who are less technical

A cloud-based data management and analysis platform for scalable, collaborative research

http://www.firecloud.org

Broad Institute Workbench - In Concept



Broad Institute Workbench - In Reality

Cancer Genomics Analysis on FireCloud

Chet Birger
Getz Lab @ Broad Institute

Scientific Goal: Comprehensive Catalogue of Genes Responsible for Cancer Initiation and Progression

● Foundational for cancer diagnostics, therapeutics, clinical trial design, and selection of rational combination therapies for individual patients

● Guides therapeutic development by identifying dysregulated pathways and druggable targets

Catalogue of Cancer Genes

Unbiased identification of genes harbouring somatic genetic variations at a statistically significant rate or pattern in cancer

● MutSig tool suite for SNPs and INDELs

● GISTIC for CNVs

● Bayesian Nonnegative Matrix Factorization for mutational signal discovery

● GSEA, PARADIGM to identify dysregulated pathways

● and a lot more!

Computational Methods

● Requires large international cohorts to achieve goal

● Several large international projects working to compile this catalogue: e.g., TCGA, ICGC and PCAWG to name a few.

● Getz Lab is a key contributor to these projects

○ Tools and analytical pipelines

○ Sequencing (Broad Genomics)

○ Computational analysis and interpretation

International Effort

Complete characterization of ~35 adult cancers:

● ~20 common cancers at 500 cases each

● ~15 rare cancers at 50-150 cases each

~11,000 cases, ~2.5 PB of data, originally stored in CGHub and DCC, now in the GDC

The size of the data set and the compute capacity required to work on it make access and analysis difficult for any but the best-resourced institutions.

TCGA Produced Large Amounts of Data

International Cancer Genome Consortium

0.8 PB

5 PB

Genomic Data Distribution is a Challenge

For 90% power to detect 90% of cancer genes with frequency ≥ 2%, need ~2000 samples per tumor type.

50 tumor types x 2000 = 100,000 pairs

Lawrence et al., Discovery and saturation analysis of cancer genes across 21 tumour types, Nature, January 2014

What size cohorts do we need?

We need large datasets with genomic and clinical data to obtain sufficient power to detect/learn:

1. Complete catalog of cancer genes and pathways (>2% of patients) (1000s / tumor type)

2. Explain >95% of tumor types and subtypes (1000s / tumor type)

3. Mutational Signatures (100s - 1000s / tumor type)

4. Germline risk alleles (10,000s / tumor type)

5. Biomarkers for response (100s to 1000s / tumor type / drug)

Preparing for a lot more data

● 2009: Getz Lab began development of Firehose (FireCloud’s on-premises precursor) as a computational platform for TCGA data

● The size of the data sets and the computational needs have grown dramatically

● On-premises Firehose is not capable of supporting current and future research

● Moving to the cloud with its near limitless storage and elastic compute

● Necessity of migration to cloud coincided with NCI Cancer Genomics Cloud Pilot Project - The Broad Institute Workbench grew from this initial funding

Moving to the Cloud

● Virtually all of Getz Lab’s efforts at fully characterizing the somatic variations driving cancer are done in the context of large international projects

● Many smaller projects also collaborative efforts

● Support for collaborative science one of FireCloud’s principal design goals

● Achieved through FireCloud’s workspace-centric design

Supporting Collaborative Science

Medical and Population Genetics

Alisa Manning
Diabetes Research Group

Broad Institute

Large-Scale Statistical Analysis of Whole Genome Sequence Data with Hail and the Broad Institute Workbench

Hail is an open-source framework for scalable genetic data analysis...
http://hail.is/

Broad Institute Workbench

a Cloud-Based Platform for Data Analysis, Management, and Sharing

What is… ?

● Hail is a scalable, reliable framework and a powerful language for genetic data analysis

● Open source, under active development, widespread adoption at Broad

● Leveraging open-source big-data tools

● Innovating to solve the unique problems of genetics

Hail web site: http://hail.is
Hail code: github.com/broadinstitute/hail
Contact: [email protected]

• N = 18,877 samples (FHS, OOA, JHS)

• Variants called with `vt` (Tan A et al. Bioinformatics. 2015)

• ~193,000,000 variants (passing, biallelic sites)

NHLBI’s Trans-Omics for Precision Medicine (TOPMed)

Hail commands for complex QC and variant annotation (2 - 12 hours depending on the number of cores)

hail read -i file:///mnt/geno/nhlbi.1575.sftp-exchange-area.keep.freeze3a.pass.gtonly.minDP10.genotypes.vds \
  filtervariants expr -c 'va.pass' --keep \
  filtervariants expr -c 'v.contig == "X" || v.contig == "Y" || v.contig == "MT"' --remove \
  …
  variantqc filtervariants expr -c 'va.qc.AC > 0' --keep \
  annotatevariants intervals -r va.isLCF -i file:///mnt/lustre/aganna/LCR.interval_list \
  annotatevariants expr -c 'va.badpHWE = va.annot.pHWE_Amish <= 0.000000001 || va.annot.pHWE_FHS <= 0.000000001 || va.annot.pHWE_JHS <= 0.000000001' \
  ...
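
To make the filter and annotation expressions above easier to read, here is a plain-Python restatement of their logic. It is only an illustration and does not use the Hail API. The dictionary layout (`pass`, `contig`, `AC`, `annot`) is a hypothetical stand-in for Hail's `va`/`v` annotations, and the elided steps in the command are not reproduced.

```python
HWE_THRESHOLD = 1e-9  # written as 0.000000001 in the Hail expressions
HWE_KEYS = ("pHWE_Amish", "pHWE_FHS", "pHWE_JHS")


def keep_variant(variant: dict) -> bool:
    """Mirror the three filters: passing, not on X/Y/MT, allele count > 0."""
    return (
        variant["pass"]
        and variant["contig"] not in {"X", "Y", "MT"}
        and variant["AC"] > 0
    )


def bad_phwe(annot: dict) -> bool:
    """Mirror va.badpHWE: true if any cohort's HWE p-value is <= the threshold."""
    return any(annot[key] <= HWE_THRESHOLD for key in HWE_KEYS)


# Hypothetical variant record, used only to exercise the two functions.
example = {
    "pass": True,
    "contig": "2",
    "AC": 3,
    "annot": {"pHWE_Amish": 0.4, "pHWE_FHS": 1e-12, "pHWE_JHS": 0.9},
}
print(keep_variant(example), bad_phwe(example["annot"]))
```

Note that in the original commands `badpHWE` is an annotation attached to each variant, not a filter, so a variant can be kept and still be flagged.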

Fuchsberger C et al. The genetic architecture of type 2 diabetes. Nature. 2016 Aug 4;536(7614):41–7.

Pilot Analysis in Workbench

Workbench Schematic

(Figure labels: Workspaces; Method Repository; Google Cloud Storage; Summary, Data, Analysis, Methods, Monitor)

The Workspace links methods and analysis to data

Workspaces

The Data Model links sample IDs to VCF files, trait files, and other inputs

Data Model

We implemented single-variant association analysis with EPACTS.

EPACTS in Workbench

Methods allow you to specify the input and output for your pipeline

Method Customization

Logs, input, and output files are linked to the execution of a method

Method Provenance

The Data Model links methods and analyses to results

Method Results

We are...

• Implementing our standard analysis pipelines in the Workbench

• Creating new methods for whole genome sequence data

• Developing our pipelines with state-of-the-art computing paradigms

• Looking for broader engagement from you!

Scaling Statistical Genetics

Precision Medicine Initiative

Kristian Cibulskis
Engineering Director
Broad Data Sciences Platform

US Health - Early 1900s

1948: Launch of Framingham Heart Study

Dawber TR, Meadors GF, Moore FEJ: Epidemiological approaches to heart disease: the Framingham Study. Am J Public Health 1951, 41:279-286.

Dawber TR, Kannel WB, Revotskie N, Stokes JI, Kagan A, Gordon T: Some factors associated with the development of coronary heart disease; six years' follow-up experience in the Framingham Study. Am J Public Health 1959, 49:1349-1356.

1961: Early Biomarkers - “Factors of Risk”

Decline in Heart Disease Mortality

Source: CDC Morbidity and Mortality Weekly Report (MMWR)

Framingham for the 21st Century

The Precision Medicine Initiative aims to be a “Framingham for the 21st Century”

It will be the largest medical scientific study in the history of the world

January 2015: State of the Union

● 1 million or more participants

● Longitudinal, ability to recontact

● Focus on engagement

● Two methods of enrollment

○ Healthcare provider organizations

○ Direct volunteers

September 2015: PMI Working Group report

Precision Medicine Initiative Cohort Program

now known as

All of Us Research Program
joinallofus.org

Building a Research Foundation for 21st Century Medicine

All of Us

PMI Organization and Data Flow

Data & Research Center

Biobank

PMI Data & Research Center (DRC)

Mission

● To acquire, organize and provide access to what will be one of the world’s largest and most diverse datasets for precision medicine research

● Provide research support for the scientific data and analysis tools for the program, helping to build a vibrant community of researchers

Data & Research Center (DRC)

Acquire & Organize

Research Support


The Workbench Vision

Plans for launch and beyond

● PMI Cohort Program anticipates 3–4 years to reach one million participants

● Phased implementation as we pilot, iterate, and scale

● Initial releases will focus on data collection and portals

Workbench Deep Dive

Alex Baumann
Product Owner





...To the Cloud!

WDL: an open source language for computational biologists to express analytical pipelines

Cromwell: an open source, scalable, robust engine for interpreting and executing WDL workflows using various backends

Workbench: Several services, all open source

Google Genomics Pipelines API: co-developed by Broad and Google Genomics, a scalable Docker-as-a-Service data scheduler

Introducing Workbench

Within the workbench you can access your workspaces and browse methods

What Are Workspaces?

● Datasets are stored and their associated analyses are run within Workspaces, so you can:

● Organize: Workspaces contain datasets and analyses that can be run on these datasets

● Track: Workspaces retain the history of all analyses that have been run to support reproducibility and traceability

● Collaborate: Workspaces can be shared with others as Readers (view-only), Writers (run analyses and modify data), and Owners (modify but also delete and share)
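
The three sharing roles above can be sketched as an ordered set of access levels. This is a minimal illustration, not the Workbench implementation; the `AccessLevel` enum, the action names, and the `can` helper are hypothetical names chosen for this sketch.

```python
from enum import IntEnum


class AccessLevel(IntEnum):
    """Roles from the slide, ordered so that higher levels include lower ones."""
    READER = 1   # view-only
    WRITER = 2   # run analyses and modify data
    OWNER = 3    # modify, and also delete and share


# Hypothetical action names mapped to the minimum role that may perform them.
REQUIRED_LEVEL = {
    "view": AccessLevel.READER,
    "run_analysis": AccessLevel.WRITER,
    "modify_data": AccessLevel.WRITER,
    "delete": AccessLevel.OWNER,
    "share": AccessLevel.OWNER,
}


def can(level: AccessLevel, action: str) -> bool:
    """Return True if a user with this access level may perform the action."""
    return level >= REQUIRED_LEVEL[action]


print(can(AccessLevel.WRITER, "run_analysis"), can(AccessLevel.WRITER, "share"))
```

Ordering the roles keeps the check to a single comparison, on the assumption that each role includes everything the role below it can do.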

Inside a Workspace

The Summary tab supports sharing, accessing the bucket, and various metadata

Workbench Features

● Domain specific layer above Google Cloud Platform - data accessible and usable outside of Workbench via Google APIs, with other clouds coming soon

● Designed for scalability of data and analyses

● TCGA data (both open and controlled access) available in Workbench, Broad data delivery will be within workspaces, and we will be hosting other large public cancer data sets (e.g., TARGET, CCLE)

● Broad’s best practice pipelines are going into the methods repo

● Data model supports easily running methods at scale and maintaining organized data files and metadata

Data is stored within buckets, which are accessible via Google console (or gsutil)

Sharing a Workspace

A workspace can be shared with other users as Owners, Writers and Readers

Workspace Data Model

● Incorporates TCGA data model with Participants, Samples, Pairs, and sets

● Metadata can be constant values, such as the number 50, or data file URLs

● Allows you to organize multiple datasets that all use one copy of the data

● Can be used as inputs to analyses and updated from outputs of analyses

● Outputs can be written back to data model and used in downstream analyses
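
One way to picture the data model described above is as typed entities that carry named attributes. The sketch below is an assumption-laden illustration, not the Workbench schema: the `Entity` and `EntitySet` classes, the attribute names, and the `gs://example-bucket/...` paths are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

AttributeValue = Union[int, float, str]  # a constant value or a data file URL


@dataclass
class Entity:
    entity_type: str  # e.g. "participant", "sample", "pair"
    name: str
    attributes: Dict[str, AttributeValue] = field(default_factory=dict)


@dataclass
class EntitySet:
    entity_type: str  # e.g. "sample_set"
    name: str
    members: List[Entity] = field(default_factory=list)


def is_file_url(value: AttributeValue) -> bool:
    """Treat strings starting with gs:// as references to files in a bucket."""
    return isinstance(value, str) and value.startswith("gs://")


# A sample whose metadata mixes a constant (50) and a file URL.
sample = Entity(
    "sample",
    "sample_1",
    {"min_depth": 50, "bam": "gs://example-bucket/sample_1.bam"},
)
cohort = EntitySet("sample_set", "pilot_cohort", [sample])

# An analysis output written back to the data model for downstream use.
sample.attributes["vcf"] = "gs://example-bucket/sample_1.vcf"
print(sorted(sample.attributes))
```

Because attributes hold URLs and not file contents, several sets can reference the same sample without copying the underlying data.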

Organizing Data

The Data tab organizes data files and other metadata around higher level concepts such as participants, samples, pairs of samples, and sets

IGV Integration

The Integrative Genomics Viewer can be used to visualize genomics datasets using the data model and files within buckets

Launching Methods

Analyses can be run upon entities in the data model, gathering all data files and metadata from attributes of entities
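
The idea of gathering a method's inputs from an entity's attributes can be sketched as a simple lookup. This is a hypothetical illustration under assumed names (`resolve_inputs`, the `bindings` mapping, and the example attribute names); the actual binding syntax used by Workbench is not shown in the slides.

```python
def resolve_inputs(entity_attributes: dict, bindings: dict) -> dict:
    """Map each method input name to the value of the entity attribute bound to it.

    bindings maps method input names to entity attribute names.
    """
    missing = [attr for attr in bindings.values() if attr not in entity_attributes]
    if missing:
        raise KeyError(f"entity is missing attributes: {missing}")
    return {
        input_name: entity_attributes[attr]
        for input_name, attr in bindings.items()
    }


# Hypothetical sample attributes and method input bindings.
sample_attributes = {"bam": "gs://example-bucket/sample_1.bam", "min_depth": 50}
bindings = {"input_bam": "bam", "depth_cutoff": "min_depth"}

print(resolve_inputs(sample_attributes, bindings))
```

Running the same method over a set of entities would then amount to calling this once per member, which is one plausible reading of how the data model supports running methods at scale.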

Monitoring Analyses

Analyses can be viewed as they progress, and all runs are retained in the workspace history

Methods Repository

● Stores workflows and tasks you can reuse and share publicly or privately

● Method tools are packaged as Docker images to ensure tool portability

● Methods are versioned to support reproducibility and reusability

● Many Broad best practice public methods are available, with more in the works

● We plan to provide and consume methods via the GA4GH Tool Registry API
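
The versioning and public/private sharing described above can be illustrated with a small in-memory registry. This is a sketch under stated assumptions, not the Methods Repository implementation: `MethodVersion`, `MethodRegistry`, the integer version scheme, and the example image name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MethodVersion:
    """An immutable record of one published version of a method."""
    namespace: str
    name: str
    version: int
    docker_image: str
    public: bool


class MethodRegistry:
    """Keeps every published version so that past analyses stay reproducible."""

    def __init__(self) -> None:
        self._versions: Dict[Tuple[str, str], List[MethodVersion]] = {}

    def publish(self, namespace: str, name: str, docker_image: str,
                public: bool = False) -> MethodVersion:
        versions = self._versions.setdefault((namespace, name), [])
        method = MethodVersion(namespace, name, len(versions) + 1,
                               docker_image, public)
        versions.append(method)  # earlier versions are never overwritten
        return method

    def get(self, namespace: str, name: str, version: int) -> MethodVersion:
        return self._versions[(namespace, name)][version - 1]


registry = MethodRegistry()
first = registry.publish("example-lab", "align", "example/align:1.0")
second = registry.publish("example-lab", "align", "example/align:1.1", public=True)
print(first.version, second.version)
```

Pinning an analysis to a specific version and Docker image is what lets it be rerun later with the same tools.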

Supporting Curation

Methods show their version, creator, documentation and other metadata, and we plan to add ratings, comments and other tools to aid in community curation

Architecture

FireCloud API & Web Portal

Workbench Service

Cromwell(WDL Execution)

Methods Repository

gsutil

Google IDs for Authentication


All Services

Google Compute Engine

CloudSQL for RDBMS

Cloud Monitoring for Operations

Google Genomics Pipeline API

What you need to know

David Siedzik
Chief Product Owner

How to Learn More

● Available now at www.firecloud.org and the APIs are at api.firecloud.org

● Post questions and comments on our forum!

● All of our tools are open source, and we welcome software collaborators and feature requests: https://github.com/broadinstitute

● Alexander Baumann will be available to answer your questions or show demos at Meet the Expert (Booth 329) Friday from 1-2 pm
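
For readers who want to try the API mentioned in the first bullet above, the sketch below builds (but does not send) an authenticated request. Only the host `api.firecloud.org` comes from the slides. The `/api/workspaces` path, the bearer-token header, and the function name are assumptions that should be checked against the API documentation.

```python
import urllib.request

API_ROOT = "https://api.firecloud.org"  # host taken from the slide


def build_list_workspaces_request(access_token: str) -> urllib.request.Request:
    """Build a GET request for listing workspaces, without sending it.

    Assumptions: the "/api/workspaces" path and the use of a Google OAuth
    bearer token are not confirmed by the slides.
    """
    return urllib.request.Request(
        f"{API_ROOT}/api/workspaces",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
        method="GET",
    )


request = build_list_workspaces_request("EXAMPLE_TOKEN")
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) would require a valid token for a registered account.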


An open, unlimited, and fast future

To the Cloud!

Acknowledgements - The Team

Analysis: Alisa Manning ([email protected]), Cotton Seed, Seung Hoan Choi, Pradeep Natarajan, Maryam Zekavat

Links
FireCloud: www.firecloud.org
Workflow Description Language (WDL): https://software.broadinstitute.org/wdl/

Broad Institute: Data Science and Data Engineering, Genomic Platform, Cancer Program

National Cancer Institute, National Institutes of Health

PIs: Gad Getz, Anthony Philippakis
