This is a talk titled "Cloud-Based Services For Large Scale Analysis of Sequence & Expression Data: Lessons from Cistrack" that I gave at CAMDA 2009 on October 6, 2009.
Cloud-Based Services For Large Scale Analysis of Sequence & Expression Data:
Lessons from Cistrack
Robert Grossman
Laboratory for Advanced Computing
University of Illinois at Chicago
Open Data Group
Institute for Genomics & Systems Biology
University of Chicago
October 6, 2009
Cistrack Team (UIC & U. Chicago)
Nick Bild, Jia Chen, Robert Grossman, Yunhong Gu, David Hanley, Oleksiy Karpenko, Xiangjun Liu,
Nicolas Negre, Michal Sabala, Damian Roqueiro, Parantu K Shah, Feng Tian, Kevin White
Part 1: Biology as a Data Intensive Science
Two of the four Solexa machines at the IGSB facility at Argonne National Laboratory.
[Figure: Projected sequencing capabilities (world-wide). X-axis: year, 2006 to 2070. Y-axis: log10 billions of base pairs, 0 to 12. Milestones: 2019, one of each described species; 2023, one of each species (~10M estimate); 2031, one of each species (~100M estimate); 2060, total human population. Source: Kevin White, unpublished.]
Is Biology a Large Data Science?
The amount of publicly available sequence data is doubling approximately every 12 months.
vs
CPUs double approximately every 18 months (Moore’s Law). Disks double every 12-15 months (Johnson’s Law).
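The gap between these doubling rates compounds quickly. As a rough sketch, assuming clean exponential growth at the quoted doubling times (the six-year horizon and the function name are chosen only for illustration):

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years`, given a doubling time in months."""
    return 2.0 ** (years * 12.0 / doubling_months)


years = 6  # illustrative horizon, not from the talk
data_growth = growth_factor(years, 12)  # sequence data: doubles every 12 months
cpu_growth = growth_factor(years, 18)   # CPUs: double every 18 months
gap = data_growth / cpu_growth          # how far data outpaces CPUs

print(f"After {years} years: data x{data_growth:.0f}, CPU x{cpu_growth:.0f}, gap x{gap:.0f}")
# prints "After 6 years: data x64, CPU x16, gap x4"
```

Under these assumptions the data volume per unit of CPU capacity doubles every three years.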
IBM joins race for $100 personal genome.
We Have a Problem
More and more of your colleagues (e.g. the biologist down the hall) with access to modern instruments are producing so much data that they cannot easily manage, analyze and archive it.
Large projects build their own infrastructure.
vs
Almost all other biologists are on their own.
Point of View
Data
Analytic algorithms & statistical models
Analytic infrastructure
To do research today…
Part 2: What is a Cloud?
What is a Cloud?
Software as a Service
Is Anything Else a Cloud?
Infrastructure as a Service – based upon scaling Virtual Machines (VMs)
Are There Other Types of Clouds?
Large Data Cloud Services
ad targeting
What is Virtualization?
Idea Dates Back to the 1960s
Virtualization first widely deployed with IBM VM/370.
[Diagram: an IBM mainframe runs IBM VM/370, which hosts guest systems (CMS, MVS, CMS), each running an app. Native (full) virtualization. Example: VMware ESX.]
One Definition
Clouds provide on-demand resources or services over a network, often the Internet, with the scale and reliability of a data center.
No standard definition. Cloud architectures are not new. What is new:
– Scale
– Ease of use
– Pricing model
Scale is new.
Elastic, Usage Based Pricing Is New
1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
Elastic, usage based pricing turns capex into opex. Clouds can be used to manage surges in computing.
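The pricing comparison above is simple arithmetic: under usage-based pricing, the bill depends only on total computer-hours. A minimal sketch, assuming a flat hourly rate (the rate and the function name are hypothetical, not taken from any provider's price list):

```python
HOURLY_RATE = 0.10  # hypothetical price per computer-hour, for illustration only


def usage_cost(computers: int, hours: int, rate: float = HOURLY_RATE) -> float:
    """Cost under flat usage-based pricing: total computer-hours times the rate."""
    return (computers * hours) * rate


slow = usage_cost(computers=1, hours=120)   # 1 computer for 120 hours
fast = usage_cost(computers=120, hours=1)   # 120 computers for 1 hour

print(slow == fast)  # prints "True": both runs use 120 computer-hours
```

The second option finishes 120 times sooner at the same cost, provided the workload parallelizes across the machines.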
Simplicity Offered By the Cloud is New
+ .. and you have a computer ready to work.
A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
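To give a sense of why the programming model is easy to pick up, here is a toy, single-process sketch of the map, shuffle, and reduce steps in plain Python. It is not the Hadoop or Sector API; the record format (chromosome, tab, position) and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_read(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (chromosome, 1) for one aligned-read record."""
    chrom, _position = line.split("\t")
    yield chrom, 1


def reduce_counts(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts collected for one chromosome."""
    return key, sum(values)


def run_mapreduce(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, group values by key (the shuffle), then reduce each group."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in map_read(line):
            groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in sorted(groups.items()))


# Hypothetical input records: "<chromosome>\t<position>"
reads = ["chr2L\t1042", "chr2L\t5310", "chrX\t77", "chr3R\t9001", "chrX\t120"]
counts = run_mapreduce(reads)
print(counts)  # prints "{'chr2L': 2, 'chr3R': 1, 'chrX': 2}"
```

The programmer writes only the map and reduce functions; in a real deployment the framework handles the shuffle, data distribution, and fault tolerance.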
[Figure: three eras labeled experimental science, simulation science, and data science, with data points 1609: 30x; 1670: 250x; 1976: 10x-100x; 2004: 10x-100x.]
Clouds vs Grids

                 Grids                                              Clouds
Problem          Not enough cycles                                  Too much data
Infrastructure   Access supercomputers; local clusters              Local and remote data centers provide services
Ease of use      Difficult to use                                   Easy to use
Model            NSF, DOD, DOE high performance computing centers   Google, Amazon, Yahoo, Microsoft, Facebook, …
Projects         caBIG, BIRN, …                                     CUBioS/Cistrack
Hadoop & Sector

                    Hadoop                     Sector
Storage Cloud       Block-based file system    File-based
Programming Model   MapReduce                  UDF & MapReduce
Protocol            TCP                        UDP-based protocol (UDT)
Replication         At time of writing         Periodically
Security            Not yet                    HIPAA capable
Language            Java                       C++
MalStone B Benchmark

                           MalStone B
Hadoop v0.18.3             799 min
Hadoop Streaming v0.18.3   142 min
Sector v1.19               44 min
# Nodes                    20 nodes
# Records                  10 Billion
Size of Dataset            1 TB
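The benchmark times are easier to compare as ratios. This short sketch uses only the three runtimes reported above; the variable names are illustrative.

```python
# MalStone B runtimes in minutes, as reported on the slide
runtimes_min = {
    "Hadoop v0.18.3": 799,
    "Hadoop Streaming v0.18.3": 142,
    "Sector v1.19": 44,
}

baseline = runtimes_min["Sector v1.19"]
# Ratio of each runtime to the Sector runtime
ratios = {name: minutes / baseline for name, minutes in runtimes_min.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the Sector runtime")
# prints:
# Hadoop v0.18.3: 18.2x the Sector runtime
# Hadoop Streaming v0.18.3: 3.2x the Sector runtime
# Sector v1.19: 1.0x the Sector runtime
```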
Part 3: Cistrack
www.cistrack.org
Cistrack
Resource for cis-regulatory data.
It is open source and based upon CUBioS.
Currently used by the White Lab at University of Chicago for managing modENCODE fly data.
Contains raw, intermediate, and analyzed data from approximately 240 experiments from Agilent, Affy and Solexa platforms.
Chromatin Developmental Time-Course
H3K4me1: enhancers
H3K4me3: promoters & enhancers
H3K9Ac: activation
H3K9me3: heterochromatin
H3K27Ac: activation
H3K27me3: repression
PolII: transcript. & promoters
CBP: HAT-enhancers
Total RNA: expression
12 time points for which chromatin and RNA have been collected (Carolyn Morrison & Nicolas Negre)
X
8 antibodies used for ChIP experiments (Zirong Li and Bing Ren)
1. Cistrack Supports Cubes of Data
Drosophila regulatory elements from Drosophila modENCODE.
ChIP-chip data using Agilent 244K dual-color arrays.
Six histone modifications (H3K9me3, H3K27me3, H3K4me3, H3K4me1, H3K27Ac, H3K9Ac), PolII and CBP.
Each factor has been studied for 12 different time-points of Drosophila development.
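One way to picture such a cube is as a mapping keyed by (factor, time point). The sketch below is only an illustration of the idea, not Cistrack's actual data model; the time-point labels 1 to 12 and the helper names are hypothetical.

```python
from itertools import product

# Six histone modifications plus PolII and CBP, as listed above
FACTORS = ["H3K9me3", "H3K27me3", "H3K4me3", "H3K4me1",
           "H3K27Ac", "H3K9Ac", "PolII", "CBP"]
TIME_POINTS = list(range(1, 13))  # 12 time points; numeric labels are hypothetical

# Each cell would hold one experiment's data; None is a placeholder here
cube = {(factor, t): None for factor, t in product(FACTORS, TIME_POINTS)}


def slice_by_factor(cube, factor):
    """All time points for one factor (a time course)."""
    return {t: v for (f, t), v in cube.items() if f == factor}


def slice_by_time(cube, time_point):
    """All factors at one time point (a snapshot of development)."""
    return {f: v for (f, t), v in cube.items() if t == time_point}


print(len(cube))                            # prints "96" (8 factors x 12 time points)
print(len(slice_by_factor(cube, "PolII")))  # prints "12"
print(len(slice_by_time(cube, 1)))          # prints "8"
```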
2. ChIP-Seq Data Volumes are Large
Cistrack integrates with large data clouds.
3. Continuous Reanalysis is Desirable
Cistrack supports VMs that can simplify re-applying Cistrack pipelines that have been updated to include a new algorithm.
In general, it is quite labor intensive to reanalyze your existing data with a new algorithm.
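Part of what makes reanalysis tractable is knowing which stored results were produced by an older pipeline. The following sketch shows one possible bookkeeping scheme under stated assumptions: each result records the pipeline name and version that produced it. The record fields, pipeline names, and experiment IDs are all hypothetical and do not describe Cistrack's schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AnalysisRecord:
    """Hypothetical provenance record stored alongside each analysis result."""
    experiment_id: str
    pipeline: str
    pipeline_version: int


def needs_reanalysis(records: List[AnalysisRecord],
                     current_versions: Dict[str, int]) -> List[str]:
    """Return IDs of experiments analyzed with an outdated pipeline version."""
    return sorted(
        r.experiment_id
        for r in records
        if r.pipeline_version < current_versions[r.pipeline]
    )


records = [
    AnalysisRecord("exp-001", "peak-calling", 1),
    AnalysisRecord("exp-002", "peak-calling", 2),
    AnalysisRecord("exp-003", "expression", 1),
]
current = {"peak-calling": 2, "expression": 1}  # latest version of each pipeline

stale = needs_reanalysis(records, current)
print(stale)  # prints "['exp-001']"
```

Only the experiments in the returned list would need to be rerun after a pipeline is updated.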
Cistrack Architecture
Cistrack Database
Analysis Pipelines & Re-analysis Services
Cistrack Web Portal & Widgets
Cistrack Cloud Services
Ingestion Services
Part 4: Reanalysis
Can you repeat an analytic pipeline one year after a post-doc leaves your lab?
Promoters: Use of H3K4me3, PolII & RNA to Map Active Genes
Active Genes - Solexa Result
Basic Idea
At any time, you can launch one or more virtual machines (VMs) to redo pipeline analysis, persisting the results to a database and the VM to a cloud.
[Diagram labels: Replace; Cloud; VM, VM, VM]
Raywulf
We have designed a cluster (called a Raywulf Cloud) that is optimized to serve as your own private cloud.
About $2K/TB.
Will be used by the Open Science Data Cloud.
Acknowledgements
Cis-Regulatory Map of the Drosophila Genome (modENCODE)
Data Generation
– Kevin White, U. Chicago (Antibody pipeline, ChIP-chip pipeline)
– Bing Ren, UCSD (Antibody validation, ChIP-chip pipeline)
– Robert Grossman, U. Illinois (LIMS, data management & analysis)
Computational identification of Cis-Regulatory Motifs
– Manolis Kellis, MIT (Motif analysis, ChIP-chip data analysis)
Biological validation
– Jim Posakony, UCSD (Promoters/Enhancers)
– Steve Russell, Cambridge U. (Insulators/Silencers)
– Hugo Bellen, Baylor (Element “necessity” validations)
Cistrack
Cistrack Cloud
– Yunhong Gu
– Michal Sabala
Cistrack DB
– David Hanley
– Xiangjun Liu
– Nicolas Negre
– Michal Sabala
– Parantu K Shah
Cistrack Analysis Pipelines & Tools
– Nick Bild
– Jia Chen
– Xiangjun Liu
– Nicolas Negre
– Damian Roqueiro
– Parantu K Shah
– Feng Tian
– Kevin White
Thank You