
Ntino Cloud BioLinux Barcelona Spain 2012


Page 1: Ntino Cloud BioLinux Barcelona Spain 2012

Cloud BioLinux: Pre-configured Bioinformatics Computing for the Genomics Community

Ntino Krampis

Asst. Professor - Informatics

J. Craig Venter Institute

http://www.jcvi.org/cms/about/bios/kkrampis/

Tuesday, November 6, 12

Page 2: Ntino Cloud BioLinux Barcelona Spain 2012

J. Craig Venter Institute (JCVI)

• Human Microbiome Project (Nelson et al. Science 2010; 328: 994–99)

• NIH funded, launched in 2008, $115 million

• metagenomic sequencing of microbial genomes from the human body

• sequence everything in sample, use informatics to separate genomes


Page 3: Ntino Cloud BioLinux Barcelona Spain 2012

J. Craig Venter Institute

• Global Ocean Survey (first publication, Venter et al. Science 2004; 304: 66-74)

• metagenomic sequencing of microbes from oceans around the world

• Darwin’s route ?

• Numbers: HMP > 2 mil. new proteins, GOS > 1.2


Page 4: Ntino Cloud BioLinux Barcelona Spain 2012

Big Data and sequencing

• JCVI sequencing facility: 454, Solexa, HiSeq, and IonTorrent on the way

• Processed data: size information content

• But... look at SOLiD 3

Source: http://www.politigenomics.com/next-generation-sequencing-informatics


Page 5: Ntino Cloud BioLinux Barcelona Spain 2012

JCVI: sequencing and computing infrastructure

• “big” sequencing needs large-scale informatics

• ~1000 node Grid Engine cluster

• research with Hadoop / MapReduce, and a small private cloud

• 50+ bioinformaticians and software developers


Page 6: Ntino Cloud BioLinux Barcelona Spain 2012

A new paradigm: Low-cost, bench-top sequencers

• GS Junior (454), MiSeq (Illumina)

• complete sequencing of bacterial, viral, fungal genomes

• RNA-seq (gene expression), ChIP-seq (protein-DNA interactions), gene variant discovery

• sequencing as a standard technique in basic genetics research - like PCR ?


Page 7: Ntino Cloud BioLinux Barcelona Spain 2012

Will smaller academic labs become the long tail of sequencing ?

[Figure: long-tail curve. Axes: amount of sequencing, number of labs. Head of the curve: “sequencing factories” (JCVI, Broad Inst., Washington Univ., Inst. of Genome Sciences). Tail: small academic labs with bench-top sequencers.]


Page 8: Ntino Cloud BioLinux Barcelona Spain 2012

Sequencers shipped without clusters

• Problem A : sequence analysis requires computational capacity

• genome assembly, BLAST, gene finders - annotation

• Problem B: bioinformatics tools need software engineering expertise

• unix/linux operating systems, maintaining software libraries, compiling source code

???


Page 9: Ntino Cloud BioLinux Barcelona Spain 2012

Each lab builds a cluster ?

• need additional funds to buy the hardware

• funds for personnel to maintain the cluster and software

• duplication of effort across labs

• sub-optimal utilization of the hardware


Page 10: Ntino Cloud BioLinux Barcelona Spain 2012

Centralized bioinformatics services

• Bioinformatics Resource Centers, e.g. GSCID

• bioinformatic services usually coupled with sequencing of a genome

• provide mostly data access to external PIs

• cannot support every lab with a sequencer


Page 11: Ntino Cloud BioLinux Barcelona Spain 2012

Problem A : sequence analysis requires computational capacity

• Amazon Elastic Compute Cloud (EC2), pay-by-the-hour computing

• cloud servers cost $0.085 - $2 per hour

• max capacity 64GB RAM / 8 CPU (can boot hundreds of servers)
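The pay-by-the-hour arithmetic can be sketched as follows. The two prices are the ends of the range quoted above; the workloads and the rounding of partial hours up to a full hour are assumptions made for illustration, not figures from the talk.

```python
import math


def estimate_cost(hours, servers=1, price_per_hour=0.085):
    """Estimate the bill in USD for `servers` machines each running `hours`.

    Assumption: a partial hour is billed as a full hour.
    """
    if hours < 0 or servers < 0 or price_per_hour < 0:
        raise ValueError("inputs must be non-negative")
    billed_hours = math.ceil(hours)
    return round(billed_hours * servers * price_per_hour, 2)


# Hypothetical workloads: a 10-hour assembly on one large server,
# and a 3-hour BLAST run spread over 20 small servers.
assembly = estimate_cost(hours=10, servers=1, price_per_hour=2.00)
blast = estimate_cost(hours=3, servers=20, price_per_hour=0.085)
print(f"assembly: ${assembly:.2f}, BLAST: ${blast:.2f}")
```

Under these assumptions, many small servers for a few hours can cost less than one large server for a long run, which is the point of renting capacity on demand.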

750 hours free for new users: aws.amazon.com/free/

free compute for teaching: aws.amazon.com/grants/

World-wide data centers


Page 12: Ntino Cloud BioLinux Barcelona Spain 2012

Cloud Computing and Virtualization

• OS, software and data, pre-installed in Virtual Machine (VM)

• cloud provider: hardware and virtualization layer

• VM is a full-featured server in a single file

• VM transfer on private cloud

Credit: VMware Inc.


Page 13: Ntino Cloud BioLinux Barcelona Spain 2012

Problem B: bioinformatics tools need software engineering expertise

• VM with pre-installed software on the cloud

• avoid compiling source code, or other software dependencies

• rent computational capacity, on a pay as you go basis

• run the VM on the closest Amazon data center


Page 14: Ntino Cloud BioLinux Barcelona Spain 2012

Solving Problems A & B : Cloud BioLinux

• Cloud BioLinux: publicly accessible VM on EC2

• 100+ pre-installed bioinformatics tools

• remote desktop for non-command line experts

• you can create a cluster with Cloud BioLinux - CloudMan

Krampis K, Booth T, Chapman B, Tiwari B, Bicak M, Field D, Nelson K. Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics. 2012 Mar 19; 13: 42.


Page 15: Ntino Cloud BioLinux Barcelona Spain 2012

Accessing Cloud BioLinux

http://aws.amazon.com/console


Page 16: Ntino Cloud BioLinux Barcelona Spain 2012

Launch through the EC2 cloud console


Page 17: Ntino Cloud BioLinux Barcelona Spain 2012

Amazon EC2 VM launch wizard

cloudbiolinux.org
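The same launch can also be scripted instead of clicked through the wizard. The sketch below is not from the talk: it assumes the present-day boto3 library and configured AWS credentials, and the image ID, instance type, and key pair name are placeholders. The current Cloud BioLinux image ID would have to be looked up on cloudbiolinux.org.

```python
def build_launch_request(ami_id, instance_type="m1.large", key_name=None, count=1):
    """Assemble keyword arguments for the EC2 RunInstances call."""
    if count < 1:
        raise ValueError("count must be at least 1")
    request = {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
    }
    if key_name:
        request["KeyName"] = key_name
    return request


def launch(request, region="us-east-1"):
    """Send the request to EC2. Needs boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the request builder works without it

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**request)
    return [instance["InstanceId"] for instance in response["Instances"]]


# "ami-00000000" and "my-keypair" are placeholders, not real identifiers.
request = build_launch_request("ami-00000000", key_name="my-keypair")
print(request)
```

Calling `launch(request)` would start a billable server, so the example only builds and prints the request.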


Page 18: Ntino Cloud BioLinux Barcelona Spain 2012


Page 19: Ntino Cloud BioLinux Barcelona Spain 2012

Cloud BioLinux desktop remote connection

tinyurl.com/bootcloud1 tinyurl.com/bootcloud2


Page 20: Ntino Cloud BioLinux Barcelona Spain 2012

Cloud BioLinux desktop


Page 21: Ntino Cloud BioLinux Barcelona Spain 2012

Cloud BioLinux desktop


Page 22: Ntino Cloud BioLinux Barcelona Spain 2012

Data exchange on the cloud

VM snapshots


Page 23: Ntino Cloud BioLinux Barcelona Spain 2012

Cloud computing research at JCVI

• open-source cloud platforms, fully compatible with Amazon EC2

• active funding, NIAID viral genomics pipeline on cloud

• end-to-end, sequence to assembly, annotation, visualization via Galaxy

• run on Amazon, private cloud, or desktop


Page 24: Ntino Cloud BioLinux Barcelona Spain 2012

Scriptable Cloud Infrastructures

• Cloud BioLinux VM configuration in plain text

• high-level configuration, software groups

• each group individual bioinformatics tools
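A minimal sketch of the idea of a plain-text configuration in which high-level software groups expand into individual tools. The file layout, group names, and tool lists here are hypothetical and are not the actual Cloud BioLinux configuration files.

```python
# Hypothetical plain-text configuration: one software group per line.
CONFIG_TEXT = """
# group: tool, tool, ...
assembly: velvet, abyss
alignment: bwa, bowtie
"""


def parse_groups(text):
    """Parse 'group: tool, tool' lines into a dict of group -> tool list."""
    groups = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, _, tools = line.partition(":")
        groups[name.strip()] = [t.strip() for t in tools.split(",") if t.strip()]
    return groups


def resolve_tools(groups, selected):
    """Expand the selected groups into a de-duplicated, ordered tool list."""
    tools = []
    for group in selected:
        for tool in groups[group]:
            if tool not in tools:
                tools.append(tool)
    return tools


groups = parse_groups(CONFIG_TEXT)
print(resolve_tools(groups, ["assembly", "alignment"]))
```

Keeping the selection at the group level means a lab edits one short list to get a differently flavoured VM, while the per-tool detail stays in the shared configuration.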

Fabric framework


Page 25: Ntino Cloud BioLinux Barcelona Spain 2012

Scriptable Cloud Infrastructures

• Python Fabric leverages Linux packages (APTitude repositories)

• mix and match software from repositories

• share VM configuration as source code

• clone across clouds

Krampis K, Booth T, Chapman B, Tiwari B, Bicak M, Field D, Nelson K. Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics. 2012 Mar 19; 13: 42.
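How a script might turn a package list into an installation step can be sketched as below. This is an illustration rather than the Cloud BioLinux code: the package names are examples, and the runner defaults to `print` so the sketch works without a remote host.

```python
import shlex


def apt_install_command(packages):
    """Build a non-interactive apt-get command for the given package names."""
    if not packages:
        raise ValueError("no packages given")
    quoted = " ".join(shlex.quote(name) for name in packages)
    return f"apt-get install -y {quoted}"


def install(packages, run=print):
    """Hand the command to `run`; by default it is only printed."""
    command = apt_install_command(packages)
    run(command)
    return command


# With Fabric 1.x and a target host configured, its sudo() can be the runner:
#     from fabric.api import sudo
#     install(["bwa", "velvet"], run=sudo)
install(["bwa", "velvet"])
```

Because the installation is an ordinary function over a package list, the same script can be pointed at a VM on Amazon, on a private cloud, or on a desktop.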


Page 26: Ntino Cloud BioLinux Barcelona Spain 2012

Scalable Data Analysis

• Cloud BioLinux + CloudMan

• dual role: Master / Worker

• the Cloud BioLinux VM has CloudMan scripts that start more copies of itself

• Grid Engine (SGE) cluster

• http://usecloudman.org/

Afgan, E., Chapman, B. et al. (2012). Using Cloud Computing Infrastructure with CloudBioLinux, CloudMan, and Galaxy. Current Protocols in Bioinformatics, 11-9.
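Once the Grid Engine cluster is up, work is spread over the workers by submitting one job per input file from the master. The sketch below only builds the `qsub` command lines; the tool command and file names are hypothetical, and it assumes `qsub` is available on the master node.

```python
import shlex


def qsub_commands(tool_command, input_files):
    """Build one Grid Engine submission (argv list) per input file."""
    jobs = []
    for index, path in enumerate(input_files, start=1):
        # -N names the job, -cwd runs it in the current directory,
        # -b y submits the command as a binary rather than a script file.
        argv = ["qsub", "-N", f"job{index}", "-cwd", "-b", "y"]
        argv += shlex.split(tool_command) + [path]
        jobs.append(argv)
    return jobs


# Hypothetical inputs; "wc -l" stands in for a real analysis tool.
jobs = qsub_commands("wc -l", ["reads_1.fastq", "reads_2.fastq"])
for argv in jobs:
    print(" ".join(argv))
```

On a running cluster each list could be passed to `subprocess.run(argv, check=True)`; the scheduler then decides which worker executes each job.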


Page 27: Ntino Cloud BioLinux Barcelona Spain 2012

Goodies with Cloud BioLinux


Page 28: Ntino Cloud BioLinux Barcelona Spain 2012

Goodies with Cloud BioLinux


Page 29: Ntino Cloud BioLinux Barcelona Spain 2012

From sequencer to the cloud

credit: basespace.illumina.com


Page 30: Ntino Cloud BioLinux Barcelona Spain 2012

Acknowledgments

• Cloud BioLinux community: Brad Chapman, Enis Afgan, Tim Booth, Mesude Bicak, Dawn Field

• JCVI collaborators: Alex Richter, Ravi Sanka, Andrey Tovchigrechko, Johannes Goll, Karen Nelson, Bill Nierman, JCVI IT support.

• NIAID for funding: Maria Giovani, Punam Mathur

cloudbiolinux.org

groups.google.com/group/cloudbiolinux

tinyurl.com/cloudboot1

tinyurl.com/cloudboot2


slideshare.com/agbiotec

Thank you!