Cloud Computing and Bioinformatics Enis Afgan*, Nuwan Goonasekera * Johns Hopkins University, Taylor Lab, USA University of Melbourne, Victorian Life Science Computation Initiative, Australia @ University of Colombo Feb 2017


Page 1: Cloud computing and bioinformatics

Cloud Computing and Bioinformatics

Enis Afgan*, Nuwan Goonasekera†

* Johns Hopkins University, Taylor Lab, USA
† University of Melbourne, Victorian Life Science Computation Initiative, Australia

@ University of Colombo, Feb 2017

Page 2: Cloud computing and bioinformatics

Overview

• The key characteristics of cloud computing
• Dynamically scaling cloud resources
• Using cloud computing for bioinformatics

Source: http://dilbert.com/strips/comic/2012-05-25/

Page 3: Cloud computing and bioinformatics

Life before cloud computing

source: http://www.rackspace.com/knowledge_center/whitepaper/revolution-not-evolution-how-cloud-computing-differs-from-traditional-it-and-why-it

Page 4: Cloud computing and bioinformatics

Cloud Computing: A Definition

• NIST definition: “Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”

» National Institute of Standards and Technology (http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf)

Page 5: Cloud computing and bioinformatics

The Cloud Model

Deployment Models
• Private
• Community
• Public
• Hybrid

Delivery Models
• Software as a Service (SaaS)
• Platform as a Service (PaaS)
• Infrastructure as a Service (IaaS)

Essential Characteristics
• On-demand self-service
• Broad network access
• Resource pooling
• Rapid elasticity
• Measured service

Page 6: Cloud computing and bioinformatics

Delivery Models

source: http://www.businessinsider.com.au/10-most-important-in-cloud-computing-2013-4?op=1#a-word-about-clouds-1

Page 7: Cloud computing and bioinformatics

Public SaaS examples

• Gmail
• SharePoint
• Salesforce.com CRM
• OnLive
• Gaikai
• Microsoft Office 365
• Some definitions include services that do not require payment, e.g. ad-supported sites.

Page 8: Cloud computing and bioinformatics

Public PaaS Examples

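Google App Engine
• Languages and developer tools: Python, Java, Go, PHP + JVM languages (Scala, Groovy, JRuby)
• Programming models supported by provider: MapReduce, Web, DataStore, Storage and other APIs
• Target applications and storage options: Web applications and BigTable storage

Salesforce.com’s Force.com
• Languages and developer tools: Apex, Eclipse-based IDE, web-based wizard
• Programming models supported by provider: Workflow, Excel-like formula, web programming
• Target applications and storage options: Business applications such as CRM

Microsoft Azure
• Languages and developer tools: .NET, Visual Studio, Azure tools
• Programming models supported by provider: Unrestricted model
• Target applications and storage options: Enterprise and web apps

Amazon Elastic MapReduce
• Languages and developer tools: Hive, Pig, Java, Ruby etc.
• Programming models supported by provider: MapReduce
• Target applications and storage options: Data processing and e-commerce

Aneka
• Languages and developer tools: .NET, stand-alone SDK
• Programming models supported by provider: Threads, task, MapReduce
• Target applications and storage options: .NET enterprise applications, HPC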

Page 9: Cloud computing and bioinformatics

Infrastructure-as-a-Service (IaaS)

• Amazon Web Services (market leader)
• Rackspace Cloud
• NeCTAR/OpenStack Research Cloud
• Joyent Cloud
• GoGrid
• FlexiScale

Page 10: Cloud computing and bioinformatics

Common Terms

Machine Image: A stored image/template from which a new virtual machine can be launched. E.g. Ubuntu

Instance: A running virtual machine based on some machine image.

Volume: Attachable Block Storage, which is the equivalent of a virtual disk drive.

Object Store: A large store for storing simple binary objects + metadata within containers

Security Groups: A means of specifying firewall rules

Key-pairs: Public/private key pairs for accessing a virtual machine
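
To make the "Security Groups" term concrete, here is a minimal Python sketch of how a set of firewall rules might decide whether an inbound connection is admitted. The `Rule` class and `is_allowed` function are hypothetical names invented for illustration; they are not part of any cloud provider's API, and real security groups have more features (egress rules, references to other groups).

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    """One inbound firewall rule (hypothetical structure for illustration)."""
    protocol: str
    from_port: int
    to_port: int
    cidr: str


def is_allowed(rules, protocol, port, source_ip):
    """Return True if any rule admits the given inbound connection."""
    addr = ip_address(source_ip)
    return any(
        rule.protocol == protocol
        and rule.from_port <= port <= rule.to_port
        and addr in ip_network(rule.cidr)
        for rule in rules
    )


# A group that only admits SSH (TCP port 22) from any IPv4 address.
ssh_only = [Rule("tcp", 22, 22, "0.0.0.0/0")]

print(is_allowed(ssh_only, "tcp", 22, "203.0.113.7"))  # SSH traffic
print(is_allowed(ssh_only, "tcp", 80, "203.0.113.7"))  # HTTP traffic
```

Anything not matched by a rule is rejected, which mirrors the default-deny behaviour typical of security groups.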

Page 11: Cloud computing and bioinformatics

Getting started with Cloud resources

Page 12: Cloud computing and bioinformatics

Demo 1

Page 13: Cloud computing and bioinformatics

Many clouds exist - how do we use them?

Page 14: Cloud computing and bioinformatics

Many clouds and many solutions

launch.genome.edu.au ; use.jetstream-cloud.org ; launch.usegalaxy.org

?!?!

Page 15: Cloud computing and bioinformatics

Architectural stack

CloudLaunch.usegalaxy.org

APPLICATIONS

CloudBridge

CloudMan

Goonasekera, N., Lonie, A., Taylor, J., Afgan, E., “CloudBridge – a Simple Cross-Cloud Python Library”, XSEDE 16, Miami, FL, July 2016.

Page 16: Cloud computing and bioinformatics

Demo 2

Page 17: Cloud computing and bioinformatics

CloudBridge Design Principles

A simple, open-source Python multi-cloud library.

• Uniform API irrespective of the underlying provider
  • No special casing of application code
  • Simpler code
• Provide a set of conformance tests for all supported clouds
  • No need to test against each cloud
  • “Write-once-run-anywhere”
  • > 92% test coverage at present
• Supports OpenStack and AWS right now
  • Community contributions for GCE and Azure forthcoming!

http://cloudbridge.readthedocs.org/
https://github.com/gvlproject/cloudbridge
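
The "uniform API" and "conformance tests" principles can be illustrated with a small, self-contained Python sketch. All class and function names here (`CloudProvider`, `FakeAWS`, `FakeOpenStack`, `launch_worker`, `conformance_test`) are hypothetical stand-ins invented for this example; they do not reflect CloudBridge's actual class hierarchy.

```python
from abc import ABC, abstractmethod


class CloudProvider(ABC):
    """Hypothetical common interface; not CloudBridge's real class hierarchy."""

    @abstractmethod
    def launch(self, name: str) -> str:
        """Start an instance and return a provider-specific identifier."""


class FakeAWS(CloudProvider):
    def launch(self, name: str) -> str:
        return f"i-{name}"


class FakeOpenStack(CloudProvider):
    def launch(self, name: str) -> str:
        return f"server-{name}"


def launch_worker(provider: CloudProvider) -> str:
    # Application code depends only on the shared interface,
    # so there is no per-cloud special casing here.
    return provider.launch("worker")


def conformance_test(provider: CloudProvider) -> None:
    # One test, written once, run against every supported provider.
    instance_id = provider.launch("test")
    assert isinstance(instance_id, str) and instance_id


for provider in (FakeAWS(), FakeOpenStack()):
    conformance_test(provider)
    print(type(provider).__name__, launch_worker(provider))
```

Because `launch_worker` never inspects which provider it received, adding a new cloud only requires a new class that passes the same conformance test.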

Page 18: Cloud computing and bioinformatics

Sample code: launch an instance

# Assumes `provider` is an already configured CloudBridge provider
# and `image_id` holds a valid image ID for that cloud.

# Create a key pair and save the private key to disk
kp = provider.security.key_pairs.create('cloudbridge_intro')
with open('cloudbridge_intro.pem', 'w') as f:
    f.write(kp.material)

# Create a security group that admits SSH (TCP port 22) from anywhere
sg = provider.security.security_groups.create(
    'cloudbridge_intro', 'A security group used by CloudBridge')
sg.add_rule('tcp', 22, 22, '0.0.0.0/0')

# Launch an instance on the smallest type with at least 2 vCPUs and 4 units of RAM
img = provider.compute.images.get(image_id)
inst_type = sorted(
    [t for t in provider.compute.instance_types.list()
     if t.vcpus >= 2 and t.ram >= 4],
    key=lambda x: x.vcpus * x.ram)[0]
inst = provider.compute.instances.create(
    name='CloudBridge-intro', image=img, instance_type=inst_type,
    key_pair=kp, security_groups=[sg])

# Wait until ready
inst.wait_till_ready()
# Show instance state
inst.state
# 'running'
inst.public_ips
# [u'54.166.125.219']
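
The instance type selection used in the sample code (keep the types that meet minimum vCPU and RAM requirements, then take the one with the smallest vcpus × ram product) can be tried without a cloud account. The sketch below uses an invented catalog; the `InstanceType` tuple, the type names, and the `smallest_sufficient` helper are assumptions made for illustration, not CloudBridge objects.

```python
from collections import namedtuple

# Stand-in for the objects returned by an instance type listing;
# the names and sizes below are invented for illustration.
InstanceType = namedtuple("InstanceType", ["name", "vcpus", "ram"])

catalog = [
    InstanceType("tiny", 1, 1),
    InstanceType("large", 8, 32),
    InstanceType("medium", 2, 4),
    InstanceType("wide", 4, 4),
]


def smallest_sufficient(types, min_vcpus=2, min_ram=4):
    """Pick the type with the smallest vcpus*ram product that meets both minimums."""
    candidates = [t for t in types if t.vcpus >= min_vcpus and t.ram >= min_ram]
    if not candidates:
        raise ValueError("no instance type satisfies the requirements")
    return min(candidates, key=lambda t: t.vcpus * t.ram)


print(smallest_sufficient(catalog).name)
```

Unlike indexing a sorted list with `[0]`, which raises a bare `IndexError` when nothing matches, this variant fails with an explicit message.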

Page 19: Cloud computing and bioinformatics

CloudLaunch Feature Highlights

Portal for deploying cloud-enabled applications.

• Support for customization
  • Support launch for different versions, apps, configs, clouds → fills the role of a science gateway discovery and access portal
• Modular and extensible platform
  • App store for cloud-enabled applications
  • Users can develop and integrate custom application launch and management components, at the UI and backend
• Natively multi-cloud
  • Backed by CloudBridge

https://beta.launch.usegalaxy.org/
https://github.com/galaxyproject/cloudlaunch-ui
https://github.com/galaxyproject/cloudlaunch

Page 20: Cloud computing and bioinformatics

CloudLaunch architecture

[Figure: CloudLaunch architecture. Layers shown: Angular 2; Django + REST framework + Celery; CloudBridge. Pluggable components shown: GVL, CloudMan, Galaxy, Ubuntu.]

Page 21: Cloud computing and bioinformatics

Pluggable component example

<form class="form" [ngFormModel]="gvlLaunchForm" (ngSubmit)="onSubmit(gvlLaunchForm.value)">
  <!-- GVL Component Selection -->
  <config-panel>
    <panel-header>GVL Settings</panel-header>
    <panel-body>
      <div class="form-group">
        <label>Auto-start the selected applications</label>
        <div class="checkbox">
          <label>
            <input type="checkbox" name="gvlapp_cmdlineutils" ngControl="gvl_cmdline_utilities" />
            GVL Commandline Utilities
          </label>
        </div>
        <div class="checkbox">
          <label>
            <input type="checkbox" name="gvlapp_smrt_analysis" ngControl="smrt_portal" />
            PacBio SMRT Analysis
          </label>
        </div>
      </div>
    </panel-body>
  </config-panel>

  <!-- CloudMan settings -->
  <cloudman-config [initialConfig]="initialConfig.config_gvl"></cloudman-config>
</form>

Page 22: Cloud computing and bioinformatics

Cloud capacity is great - but what do we use it for?

Page 23: Cloud computing and bioinformatics

Bioinformatics: in one slide

A multi-disciplinary science using computers for acquiring, managing and analyzing biological data.

It is a data-driven science.

[Figure: bioinformatics at the intersection of biology, medicine, math and physics, and computer science.]

Page 24: Cloud computing and bioinformatics

What type of data are we talking about?

DNA → RNA → Protein → to Complex… to Tissues… to Organs… to full Organisms

Each cell contains (almost) the same DNA in its nucleus.

An adult human body has approximately 37 trillion cells.
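
The first two arrows of the DNA → RNA → Protein chain can be sketched in a few lines of Python. This is a simplified illustration, assuming the input is the coding strand and using the standard genetic code; the function names `transcribe` and `translate` are chosen for this example, and real analyses would normally rely on an established library instead.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino = CODON_TABLE[rna[i:i + 3]]
        if amino == "*":  # stop codon
            break
        protein.append(amino)
    return "".join(protein)


rna = transcribe("ATGGCCTAA")
print(rna, translate(rna))  # prints: AUGGCCUAA MA
```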

Page 25: Cloud computing and bioinformatics

What do we do with the data?

Apply data transformations to extract useful information

This is not always a well-defined process

This is typically done with existing tools, or by developing one’s own

Tools can be chained into workflows
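
As a rough illustration of chaining tools into a workflow, the Python sketch below treats each tool as a function and passes the output of one step to the next. The three steps and the `run_workflow` helper are toy examples invented here; they stand in for real bioinformatics tools and are not taken from any workflow system.

```python
from functools import reduce


def drop_short_reads(reads, min_length=4):
    """Quality-control step: discard reads shorter than min_length."""
    return [r for r in reads if len(r) >= min_length]


def uppercase(reads):
    """Normalisation step: use upper-case base letters throughout."""
    return [r.upper() for r in reads]


def gc_content(reads):
    """Analysis step: fraction of G or C bases in each read."""
    return [(r.count("G") + r.count("C")) / len(r) for r in reads]


def run_workflow(data, steps):
    """Feed the output of each step into the next one."""
    return reduce(lambda current, step: step(current), steps, data)


reads = ["acgt", "GG", "ggcc", "ATAT"]
print(run_workflow(reads, [drop_short_reads, uppercase, gc_content]))
# prints: [0.5, 1.0, 0.0]
```

Reordering or swapping entries in the step list changes the analysis without touching the individual tools, which is the main appeal of a workflow system.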

Page 26: Cloud computing and bioinformatics

And how do we obtain such data?

The first methods, known as Sanger sequencing, were developed in the mid-1970s.

Starting in 1990, the international Human Genome Project took 13 years to sequence the human genome.

In the 2000s, massively parallel Next Generation Sequencers (NGS) were developed that took days to sequence a human genome at a much lower cost.

Today, nanopore sequencers are emerging, offering real-time sequencing.

There are many public data repositories with free access to data (e.g., TCGA, 1000 Genomes, GenBank).

Page 27: Cloud computing and bioinformatics

omicsmaps.com

Page 28: Cloud computing and bioinformatics
Page 29: Cloud computing and bioinformatics
Page 30: Cloud computing and bioinformatics

Typical genomics flow

[Figure: raw data (100s to 1000s of GB) and external reference data are processed into results (a few GB).]

Page 31: Cloud computing and bioinformatics

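A real-world system

[Figure: raw data (100s to 1000s of GB) is analyzed against indexed genomes (10s to 100s of GB) to produce results (a few GB), repeatedly over time (Aug, Sep, Oct, Nov, ...).]

Requirements: some computers + reliable persistent data storage + bioinformatics tools + reference data + workflow system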

Page 32: Cloud computing and bioinformatics

A Data analysis and integration tool

A (free for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage

Open source software that makes integrating your own tools and data and customizing for your own site simple

Page 33: Cloud computing and bioinformatics

Galaxy: accessible analysis system

Page 34: Cloud computing and bioinformatics

Three ways to use Galaxy

1. Download and run locally

2. Public website (http://usegalaxy.org)

3. Run on the Cloud

Page 35: Cloud computing and bioinformatics

100sGB

100+

Page 36: Cloud computing and bioinformatics

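A real-world system

[Figure: raw data (100s to 1000s of GB) is analyzed against indexed genomes (10s to 100s of GB) to produce results (a few GB), repeatedly over time (Aug, Sep, Oct, Nov, ...).]

Requirements: some computers + reliable persistent data storage + bioinformatics tools + reference data + workflow system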

Page 37: Cloud computing and bioinformatics

CloudBridge

CloudLaunch

CloudMan

CloudScale

Pathway Expected Outcomes

Improved features (root volume size)

Cloud independence

Improved stability

Federated single sign-on

“One-click” launch

Bulk launch

Cloud Independence

Tasks

Complete

Use CloudBridge

Assemble image from Docker containers

Remove shared filesystem

Simpler deployment

Extensible platform

Scaling for institutional Galaxy instances

Scale-out support for labs

Audience

All users

Academic users

All

Virtual labs (e.g., GVL, CLIMB)

CLIMB/Other labs

Hosted services

Tutorials

Complete

Progress roadmap

Page 38: Cloud computing and bioinformatics

Acknowledgments

Page 39: Cloud computing and bioinformatics

Did this sound interesting?

This entire project is an effort from a large community!

Come talk to us - get involved.

[email protected] or [email protected]