This is a talk I gave at XLDB 2012 on September 11, 2012 at Stanford University.
Bionimbus: Lessons from a Petabyte-Scale Science Cloud Service Provider (CSP)
Robert Grossman
Institute for Genomics & Systems Biology, Center for Research Informatics, Computation Institute, Department of Medicine, University of Chicago
& Open Data Group
September 11, 2012
The OSDC & Bionimbus Teams
• Open Science Data Cloud (OSDC) Team – Matt Greenway, Allison Heath, Ray Powell, Rafael Suarez.
– Major funding for the OSDC is provided by the Gordon and Betty Moore Foundation.
• Bionimbus Team – Elizabeth Bartom, Casey Brown, Jason Grundstad, David Hanley, Nicolas Negre, Tom Stricker, Matt Slattery, Rebecca Spokony & Kevin White.
– Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago and uses in part the OSDC infrastructure.
Let’s Step Back 20 Years
• 1992-96: Petabyte Access & Storage Solutions (PASS) Project for the SSC.
• It developed & benchmarked federated relational, OO DB, object store, & column-oriented data warehouse solutions at the TB scale.
A picture of CERN's Large Hadron Collider (LHC). The LHC took about a decade to construct and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
Part 1. Genomics as a Big Data Science
Source: Lincoln Stein
One Million Genomes
• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1,000 PB or 1 EB.
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing would cost about $1B (a quick arithmetic check follows).
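As a quick sanity check of these estimates, here is the back-of-envelope arithmetic as a sketch (the 10x compression factor is the slide's implied assumption):

```python
# Back-of-envelope check of the numbers above, using the slide's
# per-genome estimates (1 TB/genome raw, ~10x compression, $1000/genome).
GENOMES = 10**6
TB_PER_GENOME = 1.0          # tumor + normal tissue, uncompressed
COST_PER_GENOME = 1000       # dollars

total_pb = GENOMES * TB_PER_GENOME / 1000    # 1 PB = 1,000 TB
total_eb = total_pb / 1000                   # 1 EB = 1,000 PB
compressed_pb = total_pb / 10                # assuming ~10x compression
cost_billions = GENOMES * COST_PER_GENOME / 10**9

print(total_pb, "PB raw =", total_eb, "EB")  # 1000.0 PB raw = 1.0 EB
print(compressed_pb, "PB compressed")        # 100.0 PB compressed
print("$%.0fB to sequence" % cost_billions)  # $1B to sequence
```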
[Diagram: big data driven discovery on 1,000,000 genomes and 1 EB of data feeds genomic-driven diagnosis, an improved understanding of genomic science, and genomic-driven drug development, leading to precision diagnosis and treatment and preventive health care.]
[Figure: stratifying breast cancer into subtypes such as TNBC and ER+. Source: White Lab, University of Chicago.]
With genomics, we can stratify diseases and treat each stratum differently.
Clonal Evolution of Tumors
Tumors evolve temporally and spatially. Source: Mel Greaves & Carlo C. Maley, Clonal evolution in cancer, Nature, Volume 481, pages 306-313, 2012.
Combinations of Rare Alleles
[Figure: allele frequency (very rare, rare at 0.001, uncommon at 0.01, common at 0.1) plotted against penetrance (low, modest, intermediate, high). Rare alleles of high penetrance cause Mendelian disease; most common variants implicated in common disease by GWA have modest effects; high-penetrance common variants influencing common disease are rare examples; rare variants of small effect are very hard to identify by genetic means; low-frequency variants with intermediate penetrance sit in between. Source: Mark McCarthy.]
TCGA Analysis of Lung Cancer
• 178 cases of SQCC (lung cancer)
• Matched tumor & normal
• Mean of 360 exonic mutations, 323 CNVs, & 165 rearrangements per tumor
Source: The Cancer Genome Atlas Research Network, Comprehensive genomic characterization of squamous cell lung cancers, Nature, 2012, doi:10.1038/nature11404.
Some Examples of Big Data Science

Discipline | Duration | Size | # Devices
HEP - LHC | 10 years | 15 PB/year* | One large instrument
Astronomy - LSST | 10 years | 12 PB/year** | One large instrument
Genomics - NGS | 2-4 years | 0.5 TB/genome | 1000's of smaller instruments

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
Part 2. What Instrument Do We Use to Make Big Data Discoveries?
How do we build a "datascope"?
What is big data?
TB? PB? EB? ZB?
Think of data as big if you measure it in MW, as in Facebook's Prineville Data Center, which is 30 MW. (See opencompute.org.)
Another way: an algorithm and computing infrastructure are "big-data scalable" if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
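One way to state this precisely (the notation here is mine, not the talk's) is as a weak-scaling condition on the running time T(d, r) of a computation over d bytes of data on r racks:

```latex
% Big-data scalable: growing data and racks together keeps time flat.
T(k d,\; k r) \approx T(d, r) \qquad \text{for } k = 1, 2, 3, \ldots
```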
Commercial Cloud Service Provider (CSP): a 15 MW Data Center
• 100,000 servers, 1 PB DRAM, 100's of PB of disk
• Automatic provisioning and infrastructure management
• Monitoring, network security and forensics
• Accounting and billing
• Customer-facing portal
• Data center network, ~1 Tbps egress bandwidth
• 25 operators for a 15 MW commercial cloud
What are some of the important differences between commercial and research-focused CSPs?
Science Clouds

 | Science CSP | Commercial CSP
POV | Democratize access to data. Integrate data to make discoveries. Long term archive. | As long as you pay the bill; as long as the business model holds.
Data & Storage | Data intensive computing & HP storage | Internet style scale out and object-based storage
Flows | Large data flows in and out | Lots of small web flows
Streams | Streaming processing required | NA
Accounting | Essential | Essential
Lock in | Moving the environment between CSPs is essential | Lock in is good
Part 3. The Open Cloud Consortium's Open Science Data Cloud
www.opencloudconsortium.org
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: the Open Science Data Cloud.
• Manages cloud computing testbeds: the Open Cloud Testbed.
Cloud Services Operations Centers (CSOC)
• The OSDC operates a Cloud Services Operations Center (or CSOC).
• It is a CSOC focused on supporting science clouds for researchers.
• Compare to a Network Operations Center or NOC.
• Both are an important part of the cyberinfrastructure for big data science.
Different Styles of OSDC Racks
• Design 1: Put cores over spindles. Higher cost, but easy to compute over all the data.
• Design 2: Separate (some of the) storage from the compute.
2012 OSDC rack design (draft): 950 TB/rack, 600 cores/rack.
Open Science Data Cloud
• 3 PB in 2011, 10 PB in 2012; able to scale to 100 PB?
• Automatic provisioning and infrastructure management
• Monitoring, compliance, & security
• Accounting and billing (OSDC)
• Customer-facing portal (Tukey)
• Data center network, ~100 Gbps bandwidth
• 5-12 operators to operate a 1-5 MW science cloud
Science Cloud SW & Services
OSDC Data Stack based upon OpenStack, Hadoop, GlusterFS, UDT, …
OSDC Philosophy
• We try to automate as much as possible (we automate the setup & operations of a rack).
• We try to write as little software as possible.
• Each project is a bit different, but in general:
• We assign (permanent) IDs to data managed by the OSDC and manage the associated metadata.
• We assign and enforce permissions for users & groups of users, and for files/objects, collections of files/objects, and collections of collections (see the sketch after this list).
• We support RESTful interfaces.
• We do accounting for storage and core-hours.
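A minimal sketch of the permission model just described, assuming nothing about the OSDC's actual implementation (the class and IDs are illustrative): permissions attach at any level, and a user's effective permission is resolved by walking up the collection hierarchy.

```python
# Sketch of hierarchical permissions over files/objects, collections,
# and collections of collections. Illustrative only; not OSDC code.
class Node:
    def __init__(self, pid, parent=None):
        self.pid = pid        # permanent ID assigned at ingest
        self.parent = parent  # enclosing collection, if any
        self.acl = {}         # principal -> set of permissions

    def grant(self, principal, *perms):
        self.acl.setdefault(principal, set()).update(perms)

    def allowed(self, principals, perm):
        """True if any principal (user ID or group ID) holds `perm`
        on this node or on any enclosing collection."""
        node = self
        while node is not None:
            if any(perm in node.acl.get(p, set()) for p in principals):
                return True
            node = node.parent
        return False

# Grant a group read access at the collection level; every object
# in the collection (and its sub-collections) inherits it.
project = Node("id-0001")                # collection of collections
run42 = Node("id-0002", parent=project)  # collection
bam = Node("id-0003", parent=run42)      # file/object
project.grant("group:lab", "read")
assert bam.allowed({"user:alice", "group:lab"}, "read")
```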
Some of Our Biggest Mistakes
• Not charging those who were the largest users of our services. This resulted in a lot of bad behavior.
• Trying to support donated equipment without adequate staff.
• Being too optimistic about when big data software would be ready for prime time.
• Some problems with big data software don't show up at less than the full scale of the OSDC, but we have only one OSDC and it is difficult to test at this scale.
Essential Services for a Science CSP
• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting (a toy metering sketch follows this list)
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services
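Of these, billing and accounting reduce to metering the two quantities the OSDC accounts for, core-hours and storage. A toy sketch; the rates and record format are invented for illustration:

```python
# Toy metering of core-hours of compute and storage (as TB-months).
# Rates and the usage-record format are invented for illustration.
def bill(usage, rate_core_hour=0.05, rate_tb_month=10.0):
    """usage: list of (project, cores, stored_tb, hours) samples."""
    totals = {}
    for project, cores, stored_tb, hours in usage:
        core_hours = cores * hours
        tb_months = stored_tb * hours / (30 * 24)
        cost = core_hours * rate_core_hour + tb_months * rate_tb_month
        totals[project] = totals.get(project, 0.0) + cost
    return totals

# A project that held 64 cores and 5 TB for a week:
print(bill([("proj-a", 64, 5.0, 7 * 24)]))
```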
[Diagram: as data size grows from small to medium/large to very large, the number of projects falls from 1000's to 100's to 10's. Individual scientists & small projects use public infrastructure; community based science runs on shared community infrastructure via Science as a Service; very large projects get dedicated infrastructure.]
Part 4. Bionimbus
Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago.
Step 1. Prepare a sample.
Step 2. Log in to Bionimbus and get a Bionimbus key.
Step 3. Send your sample to the sequencing center.
Step 4. Log in to Bionimbus and view your data.
Step 5. Use Bionimbus to perform standard and custom pipelines. Bionimbus can launch multiple virtual machines (a hypothetical client-side sketch of steps 2-5 follows).
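Here is a sketch of what steps 2-5 might look like from the user's side, against a REST-style front end. The endpoints, field names, and pipeline name are invented for illustration; the slides do not show Bionimbus' actual interface.

```python
# Hypothetical client-side view of the Bionimbus workflow above.
# Endpoints, field names, and the pipeline name are invented.
import requests

BASE = "https://bionimbus.example.org/api"
session = requests.Session()
session.headers["Authorization"] = "Bearer <bionimbus-key>"  # step 2

# Step 4: view the sequence files the center uploaded for our sample.
files = session.get(BASE + "/samples/SAMPLE-123/files").json()

# Step 5: run a standard pipeline over them; Bionimbus can launch
# multiple virtual machines to do the work.
run = session.post(BASE + "/pipelines/peak-calling/runs",
                   json={"inputs": [f["id"] for f in files],
                         "vm_count": 8}).json()
print("launched run", run["id"])
```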
Bionimbus Virtual Machine Releases
• Peak calling: MAT, MA2C, PeakSeq, MACS, SPP
• Quality control: various
• Alignment & genotyping: Bowtie, TopHat, Samtools, Picard
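For instance, the alignment stage of a standard pipeline strings together two of the tools named above, Bowtie and Samtools. A minimal sketch; file and index names are illustrative, and the flags are the 2012-era command lines for these tools:

```python
# Minimal sketch of an alignment step using Bowtie and Samtools.
# File names are illustrative.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# Align reads against a prebuilt Bowtie index, emitting SAM.
run(["bowtie", "-S", "genome_index", "reads.fastq", "aln.sam"])

# Convert to BAM, sort, and index with Samtools.
run(["samtools", "view", "-bS", "-o", "aln.bam", "aln.sam"])
run(["samtools", "sort", "aln.bam", "aln.sorted"])  # writes aln.sorted.bam
run(["samtools", "index", "aln.sorted.bam"])
```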
Software Tools: Moving Genomes
Bionimbus Community Genomic Cloud: the researcher gets a personal "dropbox" + compute, backed by a cloud for public data (1K genomes, PubMed, etc.).
Bionimbus Private Genomic Cloud: the researcher gets a personal "dropbox" & compute, a cloud for public data (1K genomes, PubMed, etc.), and a cloud for controlled data (TCGA, dbGaP).
Bionimbus Private Biomedical Cloud: the researcher gets a personal "dropbox" plus compute, a cloud for public data (1K genomes, PubMed, etc.), a cloud for controlled data (TCGA, dbGaP), a cloud for PHI data, and a clinical research data warehouse reached via scatter, gather queries (a generic sketch follows).
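A scatter/gather query fans a request out to each cloud or warehouse that may hold matching records and merges whatever comes back. A generic sketch; the endpoint URLs and response shape are hypothetical:

```python
# Generic scatter/gather over several data services.
# Endpoint URLs and the response shape are hypothetical.
from concurrent.futures import ThreadPoolExecutor
import requests

ENDPOINTS = [
    "https://public-data.example.org/query",
    "https://controlled-data.example.org/query",
    "https://clinical-warehouse.example.org/query",
]

def scatter_gather(query):
    def ask(url):
        try:
            return requests.post(url, json=query, timeout=30).json()["rows"]
        except requests.RequestException:
            return []  # one slow or down service should not sink the query
    with ThreadPoolExecutor(max_workers=len(ENDPOINTS)) as pool:
        return [row for rows in pool.map(ask, ENDPOINTS) for row in rows]
```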
[Diagram: the UC Bionimbus private cloud and a Bionimbus private cloud XY interoperate with the Bionimbus community cloud, Amazon, and dbGaP, and take in data from internal sequencers and an external sequencing partner.]
Step 1. Get a Bionimbus ID (BID) from the BID generator; assign the project to a private, community, or public cloud, etc.
Step 2. Send the sample to be sequenced.
Step 3a. Return raw reads. Step 3b. Return variant calls, CNVs, annotation, …
Step 4. Secure data routing to the appropriate cloud based upon the BID (a sketch follows below).
Step 5. Cloud based analysis using IGSB and 3rd party tools and applications.
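Step 4's routing decision can be thought of as a lookup keyed on the BID. A sketch; the registry contents, cloud labels, and endpoints are invented for illustration:

```python
# Sketch of step 4: route incoming data to the appropriate cloud using
# the project metadata registered against its Bionimbus ID (BID).
# Registry contents, cloud labels, and endpoints are invented.
REGISTRY = {
    "BID-0001": {"project": "community-study", "cloud": "community"},
    "BID-0002": {"project": "controlled-study", "cloud": "private"},
}

INGEST_ENDPOINTS = {
    "community": "https://community.example.org/ingest",
    "private": "https://private.example.org/ingest",
    "public": "https://public.example.org/ingest",
}

def route(bid):
    """Return the ingest endpoint for the cloud assigned in step 1."""
    cloud = REGISTRY[bid]["cloud"]
    return INGEST_ENDPOINTS[cloud]

print(route("BID-0002"))  # https://private.example.org/ingest
```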
• web2py-based Front End
• Database Services (PostgreSQL)
• Analysis Pipelines & Re-analysis Services
• Data Cloud Services (Hadoop, Sector/Sphere)
• Data Ingestion Services (IDs, etc.)
• Utility Cloud Services (Eucalyptus, OpenStack)
• Intercloud Services (UDT, replication)
>300 ChIP datasets: chromatin/RNA timecourse, CBP, PolII, Pho/silencers, HDACs, insulators, TFs.
Predictions: 537 silencers, 2,307 new promoters, 12,285 enhancers, 14,145 insulators.
www.modencode.org
Negre et al., Nature, 2011.
Part 5. Managing One Million Genomes
Tier | Size | Candidate stores
Sequence (BAM) files (sequence data in binary form) | 100-1000 PB | NoSQL, DFS, file overlays?
Variation (VCF) files (genomic variation) | 1-10 PB | NoSQL & scientific databases
Summary level, enriched with clinical data | 10-100 TB | Relational databases
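Read as a lookup from data product to candidate store, this is a sketch restating the tiers above (sizes are the slide's estimates); note that each derived tier is roughly two to three orders of magnitude smaller than the one it is computed from.

```python
# The storage tiers above as a lookup; sizes are the slide's estimates
# and the candidate stores are the slide's suggestions.
TIERS = [
    ("sequence (BAM) files", "100-1000 PB", "NoSQL, DFS, file overlays?"),
    ("variation (VCF) files", "1-10 PB", "NoSQL & scientific databases"),
    ("summary level (+ clinical data)", "10-100 TB", "relational databases"),
]

def store_for(product):
    """Pick the candidate store for a data product."""
    for name, size, store in TIERS:
        if product in name:
            return store
    raise KeyError(product)

print(store_for("VCF"))  # NoSQL & scientific databases
```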
Acknowledgements
Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities. Additional funding for the OSDC has been provided by the following sponsors:
• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
• NSF awarded the OSDC a 5-year (2010-2016) PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.
The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at [email protected].
For more information
• You can find more information on my blog: rgrossman.com.
• Some of my technical papers are also available there.
• My email address is robert.grossman at uchicago dot edu.
• I recently wrote a popular book about computing called The Structure of Digital Computing: From Mainframes to Big Data, which you can buy from Amazon.
Center for Research Informatics
Sources for images
• The image of the hard disk is from Norlando Pobre, Creative Commons.
• The image of the Facebook Prineville Data Center is from the Intel Free Press, www.flickr.com/photos/intelfreepress/6722296855/, Creative Commons BY 2.0.
• The image of the LHC is from Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732.