Ntino Cloud BioLinux Barcelona Spain 2012

Cloud BioLinux: Pre-configured Bioinformatics Computing for the Genomics Community

Ntino Krampis

Asst. Professor - Informatics

J. Craig Venter Institute

kkrampis@jcvi.org

http://www.jcvi.org/cms/about/bios/kkrampis/

Tuesday, November 6, 12

J. Craig Venter Institute (JCVI)

• Human Microbiome Project (Nelson et al. Science 2010; 328: 994–99)

• NIH funded, launched in 2008, $115 million

• metagenomic sequencing of microbial genomes from the human body

• sequence everything in sample, use informatics to separate genomes

J. Craig Venter Institute

• Global Ocean Survey (first publication, Venter et al. Science 2004; 304: 66-74)

• metagenomic sequencing of microbes from oceans around the world

• Darwin’s route ?

• Numbers: HMP > 2 mil. new proteins, GOS > 1.2

Big Data and sequencing

• JCVI sequencing facility: 454, Solexa, HiSeq, and IonTorrent on the way

• Processed data: size information content

• But... look at SOLiD 3

Source: http://www.politigenomics.com/next-generation-sequencing-informatics

JCVI: sequencing and computing infrastructure

• “big” sequencing needs large-scale informatics

• ~1000 node Grid Engine cluster

• research with Hadoop / MapReduce, and a small private cloud

• 50+ bioinformaticians and software developers

A new paradigm: Low-cost, bench-top sequencers

• GS Junior - 454, MiSeq - Illumina

• complete sequencing of bacterial, viral, fungal genomes

• RNA-seq (gene expression), ChIP-seq (protein-DNA interactions), gene variant discovery

• sequencing as a standard technique in basic genetics research - like PCR ?

Will smaller academic labs become the long tail of sequencing ?

[Figure: amount of sequencing vs. number of labs. Labels: “sequencing factories” (JCVI, Broad Inst., Washington Univ., Inst. of Genome Sciences); small academic labs with bench-top sequencers.]

Sequencers shipped without clusters

• Problem A : sequence analysis requires computational capacity

• genome assembly, BLAST, gene finders - annotation

• Problem B: bioinformatics tools need software engineering expertise

• unix/linux operating systems, maintaining software libraries, compiling source code

???

Each lab builds a cluster ?

• need additional funds to buy the hardware

• funds for personnel to maintain the cluster and software

• duplication of effort across labs

• sub-optimal utilization of the hardware

Centralized bioinformatics services

• Bioinformatics Resource Centers, e.g. GSCID

• bioinformatic services usually coupled with sequencing of a genome

• provide mostly data access to external PIs

• cannot support every lab with a sequencer

Problem A : sequence analysis requires computational capacity

• Amazon Elastic Compute Cloud (EC2), pay-by-the-hour computing

• cloud servers cost $0.085 - $2 per hour

• max capacity 64GB RAM / 8 CPU (can boot hundreds of servers)

750 hours free for new users: aws.amazon.com/free/

free compute for teaching: aws.amazon.com/grants/

World-wide data centers
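The pay-by-the-hour figures above lend themselves to a quick back-of-the-envelope estimate. The sketch below is illustrative only: the function name `estimate_cost` and the example workload (10 servers for 48 hours each) are assumptions. Only the hourly rates ($0.085 and $2) and the 750 free hours come from the slide, and the way the free tier applies to a particular server type is not modelled.

```python
def estimate_cost(servers, hours_each, rate_per_hour, free_hours=0):
    """Return an estimated bill in USD for a pay-by-the-hour workload."""
    total_hours = servers * hours_each
    # Hours covered by a free allowance are not billed.
    billable_hours = max(0, total_hours - free_hours)
    return round(billable_hours * rate_per_hour, 2)


# Hypothetical workload: 10 servers running for 48 hours each.
cheapest = estimate_cost(10, 48, 0.085)       # lowest hourly rate on the slide
priciest = estimate_cost(10, 48, 2.0)         # highest hourly rate on the slide
within_free = estimate_cost(10, 48, 0.085, free_hours=750)

print(cheapest, priciest, within_free)
```

At these rates the same 480 server-hours fall somewhere between the two printed estimates, depending on the server size chosen.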

Cloud Computing and Virtualization

• OS, software and data, pre-installed in Virtual Machine (VM)

• cloud provider: hardware and virtualization layer

• VM is a full-featured server in a single file

• VM transfer on private cloud

Credit: VMware Inc.

Problem B: bioinformatics tools need software engineering expertise

• VM with pre-installed software on the cloud

• avoid compiling source code, or other software dependencies

• rent computational capacity, on a pay as you go basis

• run the VM on the closest Amazon data center

Solving Problems A & B : Cloud BioLinux

• Cloud BioLinux: publicly accessible VM on EC2

• 100+ pre-installed bioinformatics tools

• remote desktop for non-command line experts

• you can create a cluster with Cloud BioLinux - CloudMan

Krampis K, Booth T, Chapman B, Tiwari B, Bicak M, Field D, Nelson K. Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics. 2012 Mar 19; 13: 42.

Accessing Cloud BioLinux

http://aws.amazon.com/console

Launch through the EC2 cloud console

Amazon EC2 VM launch wizard

cloudbiolinux.org

Cloud BioLinux desktop remote connection

tinyurl.com/bootcloud1 tinyurl.com/bootcloud2

Cloud BioLinux desktop

Data exchange on the cloud

VM snapshots

Cloud computing research at JCVI

• open-source cloud platforms, fully compatible with Amazon EC2

• active funding, NIAID viral genomics pipeline on cloud

• end-to-end, sequence to assembly, annotation, visualization via Galaxy

• run on Amazon, private cloud, or desktop

Scriptable Cloud Infrastructures

• Cloud BioLinux VM configuration in plain text

• high-level configuration, software groups

• each group lists individual bioinformatics tools

Fabric framework
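The two-level layout described here (a high-level configuration naming software groups, and each group naming individual tools) can be sketched in a few lines. Everything concrete in this example is an assumption made for illustration: the `key: a, b, c` text format, the group names, the tool names, and the helper functions `parse_lists` and `resolve_tools`. The actual Cloud BioLinux configuration files may be organised differently.

```python
# Hypothetical plain-text configuration; the real Cloud BioLinux files may differ.
MAIN_CONFIG = """
# high-level configuration: which software groups to install
groups: assembly, alignment
"""

GROUP_CONFIG = """
# each group lists individual tools (names are illustrative)
assembly: velvet, abyss
alignment: bwa, bowtie
phylogeny: phyml
"""


def parse_lists(text):
    """Parse 'key: a, b, c' lines into a dict, skipping blanks and comments."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, values = line.partition(":")
        result[key.strip()] = [v.strip() for v in values.split(",") if v.strip()]
    return result


def resolve_tools(main_text, group_text):
    """Expand the enabled groups into a flat, ordered list of tools."""
    enabled = parse_lists(main_text)["groups"]
    groups = parse_lists(group_text)
    tools = []
    for group in enabled:
        tools.extend(groups[group])
    return tools


selected_tools = resolve_tools(MAIN_CONFIG, GROUP_CONFIG)
print(selected_tools)
```

Keeping the group selection separate from the group contents means a lab can switch a whole category of tools on or off by editing one line.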

Scriptable Cloud Infrastructures

• Python Fabric leverages Linux packages (APTitude repositories)

• mix and match software from repositories

• share VM configuration as source code

• clone across clouds

Krampis K, Booth T, Chapman B, Tiwari B, Bicak M, Field D, Nelson K. Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics. 2012 Mar 19; 13: 42.
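Since the tools are installed from Linux package repositories, a VM configuration ultimately reduces to a list of package names handed to the package manager. The sketch below shows only that last step, under stated assumptions: the helper `apt_install_command` and the package names are hypothetical, and the code builds the command string without running it. In a Fabric-based setup a string like this would be executed on the remote VM; how Cloud BioLinux does so is not shown on the slide.

```python
import shlex


def apt_install_command(packages):
    """Build one non-interactive apt-get command for the given package names."""
    if not packages:
        raise ValueError("no packages given")
    # De-duplicate and sort so the same configuration always yields the same command.
    quoted = " ".join(shlex.quote(name) for name in sorted(set(packages)))
    return "apt-get install -y " + quoted


# Illustrative package names; a real list would come from the VM configuration.
command = apt_install_command(["bwa", "velvet", "bwa"])
print(command)
```

Because the command is derived from plain text, the same configuration can be replayed on another cloud to rebuild an equivalent VM.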

Scalable Data Analysis

• Cloud BioLinux + CloudMan

• dual role : Master / Worker

• Cloud BioLinux VM has CloudMan scripts that start more copies of itself

• Grid Engine (SGE) cluster

• http://usecloudman.org/

Afgan, E., Chapman, B. et al. (2012). Using Cloud Computing Infrastructure with CloudBioLinux, CloudMan, and Galaxy. Current Protocols in Bioinformatics, 11-9.
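The Master / Worker idea can be illustrated with a toy example in which a master splits a list of input files among a number of workers. This is a sketch of the general pattern only, not of how CloudMan or Grid Engine schedule jobs (a real scheduler queues jobs and dispatches them as workers become free). The function `assign_round_robin` and the file names are hypothetical.

```python
def assign_round_robin(tasks, n_workers):
    """Distribute tasks over n_workers in turn and return one list per worker."""
    if n_workers < 1:
        raise ValueError("need at least one worker")
    assignments = [[] for _ in range(n_workers)]
    for index, task in enumerate(tasks):
        assignments[index % n_workers].append(task)
    return assignments


# Hypothetical input files to be analysed on a two-worker cluster.
inputs = ["s1.fasta", "s2.fasta", "s3.fasta", "s4.fasta", "s5.fasta"]
plan = assign_round_robin(inputs, 2)
for worker_id, files in enumerate(plan):
    print(worker_id, files)
```

Adding workers shortens each worker's list, which is the effect of starting more copies of the VM.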

Goodies with Cloud BioLinux

From sequencer to the cloud

credit: basespace.illumina.com

Acknowledgments

• Cloud BioLinux community: Brad Chapman, Enis Afgan, Tim Booth, Mesude Bicak, Dawn Field

• JCVI collaborators: Alex Richter, Ravi Sanka, Andrey Tovchigrechko, Johannes Goll, Karen Nelson, Bill Nierman, JCVI IT support.

• NIAID, for funding: Maria Giovani, Punam Mathur

cloudbiolinux.org

groups.google.com/group/cloudbiolinux

tinyurl.com/cloudboot1

tinyurl.com/cloudboot2

kkrampis@jcvi.org

slideshare.com/agbiotec

Thank you!
