Cloud-Based Services For Large Scale Analysis of Sequence & Expression Data: Lessons from Cistrack. Robert Grossman, Laboratory for Advanced Computing, University of Illinois at Chicago; Open Data Group; Institute for Genomics & Systems Biology, University of Chicago. October 6, 2009.

Bioclouds CAMDA (Robert Grossman) 09-v9p


DESCRIPTION

This is a talk titled "Cloud-Based Services For Large Scale Analysis of Sequence & Expression Data: Lessons from Cistrack" that I gave at CAMDA 2009 on October 6, 2009.


Page 1: Bioclouds CAMDA (Robert Grossman) 09-v9p

Cloud-Based Services For Large Scale Analysis of Sequence & Expression Data:

Lessons from Cistrack

Robert Grossman

Laboratory for Advanced Computing

University of Illinois at Chicago

Open Data Group

Institute for Genomics & Systems Biology

University of Chicago

October 6, 2009

Page 2: Bioclouds CAMDA (Robert Grossman) 09-v9p

Cistrack Team (UIC & U. Chicago)

Nick Bild, Jia Chen, Robert Grossman, Yunhong Gu, David Hanley, Oleksiy Karpenko, Xiangjun Liu,

Nicolas Negre, Michal Sabala, Damian Roqueiro, Parantu K Shah, Feng Tian, Kevin White

Page 3: Bioclouds CAMDA (Robert Grossman) 09-v9p

Part 1: Biology as a Data Intensive Science

Two of the four Solexa machines at the IGSB facility at Argonne National Laboratory.

Page 4: Bioclouds CAMDA (Robert Grossman) 09-v9p

[Figure: Projected sequencing capabilities (world-wide). Vertical axis: log10 billions of base pairs (0 to 12). Horizontal axis: year (2006 to 2070). Annotated milestones: 2019, one of each described species; 2023, one of each species (~10M estimate); 2031, one of each species (~100M estimate); 2060, total human population.]

Kevin White, unpublished

Page 5: Bioclouds CAMDA (Robert Grossman) 09-v9p

Is Biology a Large Data Science?


The amount of publicly available sequence data is doubling approximately every 12 months.

vs

CPUs double approximately every 18 months (Moore’s Law). Disks double every 12-15 months (Johnson’s Law).
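To make the comparison concrete, here is a small sketch of how quickly the gap opens up. It assumes clean exponential growth at the doubling times quoted on this slide; the function name and the six-year horizon are illustrative choices, not figures from the talk.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years`, assuming a fixed doubling time."""
    return 2 ** (years * 12 / doubling_months)

YEARS = 6  # illustrative horizon, not a figure from the talk

data_growth = growth_factor(YEARS, 12)  # sequence data: doubles every 12 months
cpu_growth = growth_factor(YEARS, 18)   # CPUs: double every 18 months

print(f"data x{data_growth:.0f}, CPU x{cpu_growth:.0f}, "
      f"gap x{data_growth / cpu_growth:.0f}")
```

Under these assumptions, six years brings 64 times more data but only 16 times more CPU capacity, so the data outgrows the processors by a factor of 4.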

Page 6: Bioclouds CAMDA (Robert Grossman) 09-v9p

IBM joins race for $100 personal genome.

Page 7: Bioclouds CAMDA (Robert Grossman) 09-v9p

We Have a Problem

More and more of your colleagues (e.g. the biologist down the hall) with access to modern instruments are producing so much data that they cannot easily manage, analyze and archive it.

Large projects build their own infrastructure. Almost all other biologists are on their own.

vs

Page 8: Bioclouds CAMDA (Robert Grossman) 09-v9p

Point of View

Data

Analytic algorithms & statistical models

Analytic infrastructure

To do research today…

Page 9: Bioclouds CAMDA (Robert Grossman) 09-v9p

Part 2: What is a Cloud?

Page 10: Bioclouds CAMDA (Robert Grossman) 09-v9p

What is a Cloud?


Software as a Service

Page 11: Bioclouds CAMDA (Robert Grossman) 09-v9p

Is Anything Else a Cloud?


Infrastructure as a Service – based upon scaling Virtual Machines (VMs)

Page 12: Bioclouds CAMDA (Robert Grossman) 09-v9p

Are There Other Types of Clouds?


Large Data Cloud Services

ad targeting

Page 13: Bioclouds CAMDA (Robert Grossman) 09-v9p

What is Virtualization?


Page 14: Bioclouds CAMDA (Robert Grossman) 09-v9p

Idea Dates Back to the 1960s

Virtualization first widely deployed with IBM VM/370.

[Diagram: an IBM mainframe runs IBM VM/370, which hosts three guest operating systems (CMS, MVS, CMS), each running its own application (App).]

Native (Full) Virtualization

Examples: VMware ESX

Page 15: Bioclouds CAMDA (Robert Grossman) 09-v9p

One Definition

Clouds provide on-demand resources or services over a network, often the Internet, with the scale and reliability of a data center.

No standard definition. Cloud architectures are not new. What is new:

– Scale
– Ease of use
– Pricing model

Page 16: Bioclouds CAMDA (Robert Grossman) 09-v9p


Scale is new.

Page 17: Bioclouds CAMDA (Robert Grossman) 09-v9p

Elastic, Usage Based Pricing Is New

1 computer in a rack for 120 hours

costs the same as

120 computers in three racks for 1 hour

Elastic, usage-based pricing turns capex into opex. Clouds can be used to manage surges in computing.
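The arithmetic behind this slide can be sketched in a few lines. This is a minimal illustration that assumes a flat price per machine-hour; the rate below is a made-up placeholder, not a price quoted in the talk.

```python
HOURLY_RATE = 0.10  # hypothetical price per machine-hour in dollars; not from the talk

def usage_cost(machines: int, hours: float, rate: float = HOURLY_RATE) -> float:
    """Usage-based cost: pay only for the machine-hours consumed."""
    machine_hours = machines * hours
    return machine_hours * rate

slow = usage_cost(machines=1, hours=120)   # one machine for five days
fast = usage_cost(machines=120, hours=1)   # 120 machines for one hour

print(slow == fast)
```

Both runs consume 120 machine-hours, so under flat usage-based pricing the bill is identical and only the turnaround time differs.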

Page 18: Bioclouds CAMDA (Robert Grossman) 09-v9p

Simplicity Offered By the Cloud is New


+ .. and you have a computer ready to work.

Using MapReduce, a new programmer can develop a program to process a container full of data with less than a day of training.
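To give a sense of why the programming model is quick to learn, here is a toy, single-process sketch of the map, shuffle and reduce steps. It is plain Python rather than the Hadoop or Sector API, and the function names and sample reads are invented for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_read(read: str):
    """Map step: emit a (base, 1) pair for every base in one sequence read."""
    return [(base, 1) for base in read]

def shuffle(pairs):
    """Shuffle step: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)

reads = ["ACGT", "AACG", "TTGA"]  # toy input, made up for illustration
mapped = chain.from_iterable(map_read(r) for r in reads)
counts = dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

The programmer writes only the map and reduce functions; in a real framework the shuffle, data distribution and fault handling are supplied by the infrastructure.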

Page 19: Bioclouds CAMDA (Robert Grossman) 09-v9p

experimental science

simulation science

data science

1609: 30x

1670: 250x

1976: 10x-100x

2004: 10x-100x

Page 20: Bioclouds CAMDA (Robert Grossman) 09-v9p

Clouds vs Grids

Problem
– Grids: Not enough cycles
– Clouds: Too much data

Infrastructure
– Grids: Access supercomputers; local clusters
– Clouds: Local and remote data centers provide services

Ease of use
– Grids: Difficult to use
– Clouds: Easy to use

Model
– Grids: NSF, DOD, DOE high performance computing centers
– Clouds: Google, Amazon, Yahoo, Microsoft, Facebook, …

Projects
– Grids: caBIG, BIRN, …
– Clouds: CUBioS/Cistrack

Page 21: Bioclouds CAMDA (Robert Grossman) 09-v9p

Hadoop & Sector

Storage Cloud
– Hadoop: Block-based file system
– Sector: File-based

Programming Model
– Hadoop: MapReduce
– Sector: UDF & MapReduce

Protocol
– Hadoop: TCP
– Sector: UDP-based protocol (UDT)

Replication
– Hadoop: At time of writing
– Sector: Periodically

Security
– Hadoop: Not yet
– Sector: HIPAA capable

Language
– Hadoop: Java
– Sector: C++

Page 22: Bioclouds CAMDA (Robert Grossman) 09-v9p

MalStone B Benchmark

MalStone B

– Hadoop v0.18.3: 799 min
– Hadoop Streaming v0.18.3: 142 min
– Sector v1.19: 44 min
– # Nodes: 20 nodes
– # Records: 10 Billion
– Size of Dataset: 1 TB
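The MalStone B running times above are easier to compare as ratios. The following sketch only does arithmetic on the numbers reported on this slide; the variable names are illustrative.

```python
# Running times in minutes, as reported for MalStone B on this slide.
minutes = {
    "Hadoop v0.18.3": 799,
    "Hadoop Streaming v0.18.3": 142,
    "Sector v1.19": 44,
}
RECORDS = 10_000_000_000  # 10 billion records

baseline = minutes["Sector v1.19"]
# How many times longer each system took than Sector.
relative_time = {name: round(t / baseline, 1) for name, t in minutes.items()}
# Aggregate throughput across the 20-node cluster.
records_per_sec = {name: RECORDS / (t * 60) for name, t in minutes.items()}

for name in minutes:
    print(f"{name}: {relative_time[name]}x Sector's time, "
          f"{records_per_sec[name]:,.0f} records/s")
```

On these figures Hadoop took roughly 18 times as long as Sector, and Hadoop Streaming roughly 3 times as long.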

Page 23: Bioclouds CAMDA (Robert Grossman) 09-v9p

Part 3: Cistrack

www.cistrack.org

Page 24: Bioclouds CAMDA (Robert Grossman) 09-v9p

Cistrack

Resource for cis-regulatory data.

It is open source and based upon CUBioS.

Currently used by the White Lab at University of Chicago for managing modENCODE fly data.

Contains raw, intermediate, and analyzed data from approximately 240 experiments from Agilent, Affy and Solexa platforms.

Page 25: Bioclouds CAMDA (Robert Grossman) 09-v9p

Chromatin Developmental Time-Course

– H3K4me1: enhancers
– H3K4me3: promoters & enhancers
– H3K9Ac: activation
– H3K9me3: heterochromatin
– H3K27Ac: activation
– H3K27me3: repression
– PolII: transcript. & promoters
– CBP: HAT-enhancers
– Total RNA: expression

12 time points for which chromatin and RNA have been collected (Carolyn Morrison & Nicolas Negre)

X

8 antibodies used for ChIP experiments (Zirong Li and Bing Ren)

Page 26: Bioclouds CAMDA (Robert Grossman) 09-v9p

1. Cistrack Supports Cubes of Data

Drosophila regulatory elements from Drosophila modENCODE.

ChIP-chip data using Agilent 244K dual-color arrays.

Six histone modifications (H3K9me3, H3K27me3, H3K4me3, H3K4me1, H3K27Ac, H3K9Ac), PolII and CBP.

Each factor has been studied for 12 different time-points of Drosophila development.
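One way to picture such a cube is as a table of cells indexed by factor and time point, with each cell holding the array measurements for that experiment. The sketch below is an assumption about the shape of the data, not Cistrack's actual schema; the numbering of time points and the per-cell structure are hypothetical.

```python
from itertools import product

# The eight factors named on this slide.
FACTORS = ["H3K9me3", "H3K27me3", "H3K4me3", "H3K4me1",
           "H3K27Ac", "H3K9Ac", "PolII", "CBP"]
TIME_POINTS = range(1, 13)  # 12 time points, numbered 1-12 here for illustration

# One cell per (factor, time point). Each cell would hold that experiment's
# measurements, e.g. probe_id -> value (a hypothetical layout).
cube = {(factor, t): {} for factor, t in product(FACTORS, TIME_POINTS)}

def time_course(cube, factor):
    """Slice the cube along the time axis for a single factor."""
    return [cube[(factor, t)] for t in TIME_POINTS]

print(len(cube))
```

Slicing by factor gives a developmental time course, and slicing by time point would give all eight factors at one stage.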

Page 27: Bioclouds CAMDA (Robert Grossman) 09-v9p

2. ChIP-Seq Data Volumes are Large

Cistrack integrates with large data clouds.

Page 28: Bioclouds CAMDA (Robert Grossman) 09-v9p

3. Continuous Reanalysis is Desirable

Cistrack supports VMs that can simplify re-applying Cistrack pipelines that have been updated to include a new algorithm.

In general, it is quite labor intensive to reanalyze your existing data with a new algorithm.

Page 29: Bioclouds CAMDA (Robert Grossman) 09-v9p

Cistrack Architecture

Cistrack Database

Analysis Pipelines & Re-analysis Services

Cistrack Web Portal & Widgets

Cistrack Cloud Services

Ingestion Services

Page 30: Bioclouds CAMDA (Robert Grossman) 09-v9p

Part 4: Reanalysis

Can you repeat an analytic pipeline one year after a post-doc leaves your lab?

Page 31: Bioclouds CAMDA (Robert Grossman) 09-v9p

Promoters: Use H3K4me3, PolII & RNA to Map Active Genes

Page 32: Bioclouds CAMDA (Robert Grossman) 09-v9p

Promoters: Use of H3K4me3, PolII & RNA to Map Active Genes

Page 33: Bioclouds CAMDA (Robert Grossman) 09-v9p

Active Genes - Solexa Result

Page 34: Bioclouds CAMDA (Robert Grossman) 09-v9p

Basic Idea

At any time, you can launch one or more virtual machines (VMs) to redo the pipeline analysis, persist the results to a database, and persist the VM to a cloud.

Replace

Cloud

VM VM VM
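The "persist results to a database" half of this idea can be sketched as follows. This is a minimal illustration, not Cistrack's implementation: it uses an in-memory SQLite database as a stand-in, the table layout and identifiers are hypothetical, the two "pipeline versions" are placeholder calculations, and launching or archiving VMs is not shown.

```python
import sqlite3

def run_pipeline(raw_values, version):
    """Stand-in for an analysis pipeline; both 'algorithms' are placeholders."""
    if version == "v1":
        return sum(raw_values) / len(raw_values)  # v1: mean
    ordered = sorted(raw_values)
    return ordered[len(ordered) // 2]             # v2: (upper) median

def reanalyze(db, experiment_id, raw_values, version):
    """Rerun the pipeline and persist the result, keyed by pipeline version."""
    result = run_pipeline(raw_values, version)
    db.execute(
        "INSERT OR REPLACE INTO results (experiment_id, pipeline_version, value) "
        "VALUES (?, ?, ?)",
        (experiment_id, version, result),
    )
    db.commit()
    return result

db = sqlite3.connect(":memory:")  # in-memory stand-in for the results database
db.execute(
    "CREATE TABLE results (experiment_id TEXT, pipeline_version TEXT, value REAL, "
    "PRIMARY KEY (experiment_id, pipeline_version))"
)

raw = [1.0, 2.0, 9.0]  # toy data, made up for illustration
reanalyze(db, "exp-001", raw, "v1")
reanalyze(db, "exp-001", raw, "v2")  # rerun with the updated algorithm

rows = db.execute(
    "SELECT pipeline_version, value FROM results ORDER BY pipeline_version"
).fetchall()
print(rows)
```

Keying each result by pipeline version keeps old and new analyses side by side, which is what makes a rerun a year later comparable with the original.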

Page 35: Bioclouds CAMDA (Robert Grossman) 09-v9p

Raywulf

We have designed a cluster (called a Raywulf Cloud) that is optimized to serve as your own private cloud.

About $2K/TB.

Will be used by the Open Science Data Cloud.

Page 36: Bioclouds CAMDA (Robert Grossman) 09-v9p

Acknowledgements

Page 37: Bioclouds CAMDA (Robert Grossman) 09-v9p

Cis-Regulatory Map of the Drosophila Genome (modENCODE)

Data Generation
– Kevin White, U. Chicago (Antibody pipeline, ChIP-chip pipeline)
– Bing Ren, UCSD (Antibody validation, ChIP-chip pipeline)
– Robert Grossman, U. Illinois (LIMS, data management & analysis)

Computational identification of Cis-Regulatory Motifs
– Manolis Kellis, MIT (Motif analysis, ChIP-chip data analysis)

Biological validation
– Jim Posakony, UCSD (Promoters/Enhancers)
– Steve Russell, Cambridge U. (Insulators/Silencers)
– Hugo Bellen, Baylor (Element “necessity” validations)

Page 38: Bioclouds CAMDA (Robert Grossman) 09-v9p

Cistrack

Cistrack Cloud
– Yunhong Gu
– Michal Sabala

Cistrack DB
– David Hanley
– Xiangjun Liu
– Nicolas Negre
– Michal Sabala
– Parantu K Shah

Cistrack Analysis Pipelines & Tools
– Nick Bild
– Jia Chen
– Xiangjun Liu
– Nicolas Negre
– Damian Roqueiro
– Parantu K Shah
– Feng Tian
– Kevin White

Page 39: Bioclouds CAMDA (Robert Grossman) 09-v9p

Thank You