This is a talk I gave at XLDB 2012 on September 11, 2012 at Stanford University.
Bionimbus: Lessons from a Petabyte-Scale Science Cloud Service Provider (CSP)
Robert Grossman
Institute for Genomics & Systems Biology, Center for Research Informatics, Computation Institute, Department of Medicine, University of Chicago
& Open Data Group
September 11, 2012
The OSDC & Bionimbus Teams
• Open Science Data Cloud (OSDC) Team – Matt Greenway, Allison Heath, Ray Powell, Rafael Suarez.
– Major funding for the OSDC is provided by the Gordon and Betty Moore Foundation.
• Bionimbus Team – Elizabeth Bartom, Casey Brown, Jason Grundstad, David Hanley, Nicolas Negre, Tom Stricker, Matt Slattery, Rebecca Spokony & Kevin White.
– Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago and uses in part the OSDC infrastructure.
Let’s Step Back 20 Years
• 1992-96: Petabyte Access & Storage Solutions (PASS) Project for the SSC.
• It developed & benchmarked federated relational, OO DB, object store, & column-oriented data warehouse solutions at the TB scale.
A picture of CERN's Large Hadron Collider (LHC). The LHC took about a decade to construct and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
Part 1. Genomics as a Big Data Science
Source: Lincoln Stein
One Million Genomes
• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1,000 PB or 1 EB.
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing would cost about $1B (a quick arithmetic check follows).
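As a quick sanity check of these estimates, here is the back-of-envelope arithmetic as a sketch (the 10x compression factor is the slide's implied assumption):

```python
# Back-of-envelope check of the numbers above, using the slide's
# per-genome estimates (1 TB/genome raw, ~10x compression, $1000/genome).
GENOMES = 10**6
TB_PER_GENOME = 1.0          # tumor + normal tissue, uncompressed
COST_PER_GENOME = 1000       # dollars

total_pb = GENOMES * TB_PER_GENOME / 1000    # 1 PB = 1,000 TB
total_eb = total_pb / 1000                   # 1 EB = 1,000 PB
compressed_pb = total_pb / 10                # assuming ~10x compression
cost_billions = GENOMES * COST_PER_GENOME / 10**9

print(total_pb, "PB raw =", total_eb, "EB")  # 1000.0 PB raw = 1.0 EB
print(compressed_pb, "PB compressed")        # 100.0 PB compressed
print("$%.0fB to sequence" % cost_billions)  # $1B to sequence
```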
[Diagram: big data driven discovery on 1,000,000 genomes and 1 EB of data feeds genomic-driven diagnosis, an improved understanding of genomic science, and genomic-driven drug development, leading to precision diagnosis and treatment and preventive health care.]
[Figure: stratifying breast cancer into subtypes such as TNBC and ER+. Source: White Lab, University of Chicago.]
With genomics, we can stratify diseases and treat each stratum differently.
Clonal Evolution of Tumors
Tumors evolve temporally and spatially. Source: Mel Greaves & Carlo C. Maley, Clonal evolution in cancer, Nature, Volume 481, pages 306-313, 2012.
Combinations of Rare Alleles
[Figure: allele frequency (very rare, rare at 0.001, uncommon at 0.01, common at 0.1) plotted against penetrance (low, modest, intermediate, high). Rare alleles of high penetrance cause Mendelian disease; most common variants implicated in common disease by GWA have modest effects; high-penetrance common variants influencing common disease are rare examples; rare variants of small effect are very hard to identify by genetic means; low-frequency variants with intermediate penetrance sit in between. Source: Mark McCarthy.]
TCGA Analysis of Lung Cancer
• 178 cases of SQCC (lung cancer)
• Matched tumor & normal
• Mean of 360 exonic mutations, 323 CNVs, & 165 rearrangements per tumor
Source: The Cancer Genome Atlas Research Network, Comprehensive genomic characterization of squamous cell lung cancers, Nature, 2012, doi:10.1038/nature11404.
Some Examples of Big Data Science

Discipline | Duration | Size | # Devices
HEP - LHC | 10 years | 15 PB/year* | One large instrument
Astronomy - LSST | 10 years | 12 PB/year** | One large instrument
Genomics - NGS | 2-4 years | 0.5 TB/genome | 1000's of smaller instruments

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
Part 2. What Instrument Do We Use to Make Big Data Discoveries?
How do we build a "datascope"?
What is big data?
TB? PB? EB? ZB?
Think of data as big if you measure it in MW, as in Facebook's Prineville Data Center, which is 30 MW. (See opencompute.org.)
Another way: an algorithm and computing infrastructure are "big-data scalable" if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
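One way to state this precisely (the notation here is mine, not the talk's) is as a weak-scaling condition on the running time T(d, r) of a computation over d bytes of data on r racks:

```latex
% Big-data scalable: growing data and racks together keeps time flat.
T(k d,\; k r) \approx T(d, r) \qquad \text{for } k = 1, 2, 3, \ldots
```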
Commercial Cloud Service Provider (CSP): a 15 MW Data Center
• 100,000 servers, 1 PB DRAM, 100's of PB of disk
• Automatic provisioning and infrastructure management
• Monitoring, network security and forensics
• Accounting and billing
• Customer-facing portal
• Data center network, ~1 Tbps egress bandwidth
• 25 operators for a 15 MW commercial cloud
What are some of the important differences between commercial and research-focused CSPs?
Science Clouds

 | Science CSP | Commercial CSP
POV | Democratize access to data. Integrate data to make discoveries. Long term archive. | As long as you pay the bill; as long as the business model holds.
Data & Storage | Data intensive computing & HP storage | Internet style scale out and object-based storage
Flows | Large data flows in and out | Lots of small web flows
Streams | Streaming processing required | NA
Accounting | Essential | Essential
Lock in | Moving the environment between CSPs is essential | Lock in is good
Part 3. The Open Cloud Consortium's Open Science Data Cloud
www.opencloudconsortium.org
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: the Open Science Data Cloud.
• Manages cloud computing testbeds: the Open Cloud Testbed.
Cloud Services Operations Centers (CSOC)
• The OSDC operates a Cloud Services Operations Center (or CSOC).
• It is a CSOC focused on supporting science clouds for researchers.
• Compare to a Network Operations Center or NOC.
• Both are an important part of the cyberinfrastructure for big data science.
Different Styles of OSDC Racks
• Design 1: Put cores over spindles. Higher cost, but easy to compute over all the data.
• Design 2: Separate (some of the) storage from the compute.
2012 OSDC rack design (draft): 950 TB/rack, 600 cores/rack.
Open Science Data Cloud
• 3 PB in 2011, 10 PB in 2012; able to scale to 100 PB?
• Automatic provisioning and infrastructure management
• Monitoring, compliance, & security
• Accounting and billing (OSDC)
• Customer-facing portal (Tukey)
• Data center network, ~100 Gbps bandwidth
• 5-12 operators to operate a 1-5 MW science cloud
Science Cloud SW & Services
OSDC Data Stack based upon OpenStack, Hadoop, GlusterFS, UDT, …
OSDC Philosophy
• We try to automate as much as possible (we automate the setup & operations of a rack).
• We try to write as little software as possible.
• Each project is a bit different, but in general:
• We assign (permanent) IDs to data managed by the OSDC and manage the associated metadata.
• We assign and enforce permissions for users & groups of users, and for files/objects, collections of files/objects, and collections of collections (see the sketch after this list).
• We support RESTful interfaces.
• We do accounting for storage and core-hours.
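A minimal sketch of the permission model just described, assuming nothing about the OSDC's actual implementation (the class and IDs are illustrative): permissions attach at any level, and a user's effective permission is resolved by walking up the collection hierarchy.

```python
# Sketch of hierarchical permissions over files/objects, collections,
# and collections of collections. Illustrative only; not OSDC code.
class Node:
    def __init__(self, pid, parent=None):
        self.pid = pid        # permanent ID assigned at ingest
        self.parent = parent  # enclosing collection, if any
        self.acl = {}         # principal -> set of permissions

    def grant(self, principal, *perms):
        self.acl.setdefault(principal, set()).update(perms)

    def allowed(self, principals, perm):
        """True if any principal (user ID or group ID) holds `perm`
        on this node or on any enclosing collection."""
        node = self
        while node is not None:
            if any(perm in node.acl.get(p, set()) for p in principals):
                return True
            node = node.parent
        return False

# Grant a group read access at the collection level; every object
# in the collection (and its sub-collections) inherits it.
project = Node("id-0001")                # collection of collections
run42 = Node("id-0002", parent=project)  # collection
bam = Node("id-0003", parent=run42)      # file/object
project.grant("group:lab", "read")
assert bam.allowed({"user:alice", "group:lab"}, "read")
```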
Some of Our Biggest Mistakes
• Not charging those who were the largest users of our services. This resulted in a lot of bad behavior.
• Trying to support donated equipment without adequate staff.
• Being too optimistic about when big data software would be ready for prime time.
• Some problems with big data software don't show up at less than the full scale of the OSDC, but we have only one OSDC and it is difficult to test at this scale.
Essential Services for a Science CSP
• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting (a toy metering sketch follows this list)
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services
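Of these, billing and accounting reduce to metering the two quantities the OSDC accounts for, core-hours and storage. A toy sketch; the rates and record format are invented for illustration:

```python
# Toy metering of core-hours of compute and storage (as TB-months).
# Rates and the usage-record format are invented for illustration.
def bill(usage, rate_core_hour=0.05, rate_tb_month=10.0):
    """usage: list of (project, cores, stored_tb, hours) samples."""
    totals = {}
    for project, cores, stored_tb, hours in usage:
        core_hours = cores * hours
        tb_months = stored_tb * hours / (30 * 24)
        cost = core_hours * rate_core_hour + tb_months * rate_tb_month
        totals[project] = totals.get(project, 0.0) + cost
    return totals

# A project that held 64 cores and 5 TB for a week:
print(bill([("proj-a", 64, 5.0, 7 * 24)]))
```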
[Diagram: as data size grows from small to medium/large to very large, the number of projects falls from 1000's to 100's to 10's. Individual scientists & small projects use public infrastructure; community based science runs on shared community infrastructure via Science as a Service; very large projects get dedicated infrastructure.]
Part 4. Bionimbus
Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago.
Step 1. Prepare a sample.
Step 2. Log in to Bionimbus and get a Bionimbus key.
Step 3. Send your sample to the sequencing center.
Step 4. Log in to Bionimbus and view your data.
Step 5. Use Bionimbus to perform standard and custom pipelines. Bionimbus can launch multiple virtual machines (a hypothetical client-side sketch of steps 2-5 follows).
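Here is a sketch of what steps 2-5 might look like from the user's side, against a REST-style front end. The endpoints, field names, and pipeline name are invented for illustration; the slides do not show Bionimbus' actual interface.

```python
# Hypothetical client-side view of the Bionimbus workflow above.
# Endpoints, field names, and the pipeline name are invented.
import requests

BASE = "https://bionimbus.example.org/api"
session = requests.Session()
session.headers["Authorization"] = "Bearer <bionimbus-key>"  # step 2

# Step 4: view the sequence files the center uploaded for our sample.
files = session.get(BASE + "/samples/SAMPLE-123/files").json()

# Step 5: run a standard pipeline over them; Bionimbus can launch
# multiple virtual machines to do the work.
run = session.post(BASE + "/pipelines/peak-calling/runs",
                   json={"inputs": [f["id"] for f in files],
                         "vm_count": 8}).json()
print("launched run", run["id"])
```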
Bionimbus Virtual Machine Releases
• Peak calling: MAT, MA2C, PeakSeq, MACS, SPP
• Quality control: various
• Alignment & genotyping: Bowtie, TopHat, Samtools, Picard
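For instance, the alignment stage of a standard pipeline strings together two of the tools named above, Bowtie and Samtools. A minimal sketch; file and index names are illustrative, and the flags are the 2012-era command lines for these tools:

```python
# Minimal sketch of an alignment step using Bowtie and Samtools.
# File names are illustrative.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# Align reads against a prebuilt Bowtie index, emitting SAM.
run(["bowtie", "-S", "genome_index", "reads.fastq", "aln.sam"])

# Convert to BAM, sort, and index with Samtools.
run(["samtools", "view", "-bS", "-o", "aln.bam", "aln.sam"])
run(["samtools", "sort", "aln.bam", "aln.sorted"])  # writes aln.sorted.bam
run(["samtools", "index", "aln.sorted.bam"])
```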
Software Tools: Moving Genomes
Bionimbus Community Genomic Cloud: the researcher gets a personal "dropbox" + compute, backed by a cloud for public data (1K genomes, PubMed, etc.).
Bionimbus Private Genomic Cloud: the researcher gets a personal "dropbox" & compute, a cloud for public data (1K genomes, PubMed, etc.), and a cloud for controlled data (TCGA, dbGaP).
Bionimbus Private Biomedical Cloud: the researcher gets a personal "dropbox" plus compute, a cloud for public data (1K genomes, PubMed, etc.), a cloud for controlled data (TCGA, dbGaP), a cloud for PHI data, and a clinical research data warehouse reached via scatter, gather queries (a generic sketch follows).
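A scatter/gather query fans a request out to each cloud or warehouse that may hold matching records and merges whatever comes back. A generic sketch; the endpoint URLs and response shape are hypothetical:

```python
# Generic scatter/gather over several data services.
# Endpoint URLs and the response shape are hypothetical.
from concurrent.futures import ThreadPoolExecutor
import requests

ENDPOINTS = [
    "https://public-data.example.org/query",
    "https://controlled-data.example.org/query",
    "https://clinical-warehouse.example.org/query",
]

def scatter_gather(query):
    def ask(url):
        try:
            return requests.post(url, json=query, timeout=30).json()["rows"]
        except requests.RequestException:
            return []  # one slow or down service should not sink the query
    with ThreadPoolExecutor(max_workers=len(ENDPOINTS)) as pool:
        return [row for rows in pool.map(ask, ENDPOINTS) for row in rows]
```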
[Diagram: the UC Bionimbus private cloud and a Bionimbus private cloud XY interoperate with the Bionimbus community cloud, Amazon, and dbGaP, and take in data from internal sequencers and an external sequencing partner.]
Step 1. Get a Bionimbus ID (BID) from the BID generator; assign the project to a private, community, or public cloud, etc.
Step 2. Send the sample to be sequenced.
Step 3a. Return raw reads. Step 3b. Return variant calls, CNVs, annotation, …
Step 4. Secure data routing to the appropriate cloud based upon the BID (a sketch follows below).
Step 5. Cloud based analysis using IGSB and 3rd party tools and applications.
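Step 4's routing decision can be thought of as a lookup keyed on the BID. A sketch; the registry contents, cloud labels, and endpoints are invented for illustration:

```python
# Sketch of step 4: route incoming data to the appropriate cloud using
# the project metadata registered against its Bionimbus ID (BID).
# Registry contents, cloud labels, and endpoints are invented.
REGISTRY = {
    "BID-0001": {"project": "community-study", "cloud": "community"},
    "BID-0002": {"project": "controlled-study", "cloud": "private"},
}

INGEST_ENDPOINTS = {
    "community": "https://community.example.org/ingest",
    "private": "https://private.example.org/ingest",
    "public": "https://public.example.org/ingest",
}

def route(bid):
    """Return the ingest endpoint for the cloud assigned in step 1."""
    cloud = REGISTRY[bid]["cloud"]
    return INGEST_ENDPOINTS[cloud]

print(route("BID-0002"))  # https://private.example.org/ingest
```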
• web2py-based Front End
• Database Services (PostgreSQL)
• Analysis Pipelines & Re-analysis Services
• Data Cloud Services (Hadoop, Sector/Sphere)
• Data Ingestion Services (IDs, etc.)
• Utility Cloud Services (Eucalyptus, OpenStack)
• Intercloud Services (UDT, replication)
>300 ChIP datasets: chromatin/RNA timecourse, CBP, PolII, Pho/silencers, HDACs, insulators, TFs.
Predictions: 537 silencers, 2,307 new promoters, 12,285 enhancers, 14,145 insulators.
www.modencode.org
Negre et al., Nature, 2011.
Part 5. Managing One Million Genomes
Tier | Size | Candidate stores
Sequence (BAM) files (sequence data in binary form) | 100-1000 PB | NoSQL, DFS, file overlays?
Variation (VCF) files (genomic variation) | 1-10 PB | NoSQL & scientific databases
Summary level, enriched with clinical data | 10-100 TB | Relational databases
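Read as a lookup from data product to candidate store, this is a sketch restating the tiers above (sizes are the slide's estimates); note that each derived tier is roughly two to three orders of magnitude smaller than the one it is computed from.

```python
# The storage tiers above as a lookup; sizes are the slide's estimates
# and the candidate stores are the slide's suggestions.
TIERS = [
    ("sequence (BAM) files", "100-1000 PB", "NoSQL, DFS, file overlays?"),
    ("variation (VCF) files", "1-10 PB", "NoSQL & scientific databases"),
    ("summary level (+ clinical data)", "10-100 TB", "relational databases"),
]

def store_for(product):
    """Pick the candidate store for a data product."""
    for name, size, store in TIERS:
        if product in name:
            return store
    raise KeyError(product)

print(store_for("VCF"))  # NoSQL & scientific databases
```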
Acknowledgements
Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities. Additional funding for the OSDC has been provided by the following sponsors:
• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
• NSF awarded the OSDC a 5-year (2010-2016) PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.
The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at [email protected].
For more information
• You can find more information on my blog: rgrossman.com.
• Some of my technical papers are also available there.
• My email address is robert.grossman at uchicago dot edu.
• I recently wrote a popular book about computing called The Structure of Digital Computing: From Mainframes to Big Data, which you can buy from Amazon.
Center for Research Informatics
Sources for images
• The image of the hard disk is from Norlando Pobre, Creative Commons.
• The image of the Facebook Prineville Data Center is from the Intel Free Press, www.flickr.com/photos/intelfreepress/6722296855/, Creative Commons BY 2.0.
• The image of the LHC is from Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732.