Cloud Computing and Bioinformatics

Enis Afgan*, Nuwan Goonasekera†

* Johns Hopkins University, Taylor Lab, USA
† University of Melbourne, Victorian Life Sciences Computation Initiative, Australia

@ University of Colombo, Feb 2017

Overview

• The key characteristics of cloud computing
• Dynamically scaling cloud resources
• Using cloud computing for bioinformatics

Source: http://dilbert.com/strips/comic/2012-05-25/

Life before cloud computing

source: http://www.rackspace.com/knowledge_center/whitepaper/revolution-not-evolution-how-cloud-computing-differs-from-traditional-it-and-why-it

Cloud Computing: A Definition

• NIST definition: “Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”

» National Institute of Standards and Technology (http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf)

The Cloud Model

Deployment Models
• Private
• Community
• Public
• Hybrid

Delivery Models
• Software as a Service (SaaS)
• Platform as a Service (PaaS)
• Infrastructure as a Service (IaaS)

Essential Characteristics
• On-demand self-service
• Broad network access
• Resource pooling
• Rapid elasticity
• Measured service
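Two of the essential characteristics, rapid elasticity and measured service, can be illustrated with a small sketch. This is a hypothetical scaling rule, not any provider's actual autoscaling policy: the function names, the 30%/80% utilisation thresholds, and the instance limits are all assumptions chosen for illustration.

```python
def desired_instances(current, cpu_utilisation, low=0.30, high=0.80,
                      min_instances=1, max_instances=10):
    """Return how many instances to run for an average CPU utilisation (0..1).

    Hypothetical rule: add one instance above `high`, remove one below `low`.
    """
    if cpu_utilisation > high:
        target = current + 1
    elif cpu_utilisation < low:
        target = current - 1
    else:
        target = current
    # Keep the pool within its configured bounds.
    return max(min_instances, min(max_instances, target))


def metered_cost(instance_hours, price_per_hour):
    """Measured service: pay only for the instance-hours actually used."""
    return instance_hours * price_per_hour


if __name__ == "__main__":
    print(desired_instances(current=2, cpu_utilisation=0.95))  # scales out
    print(desired_instances(current=2, cpu_utilisation=0.10))  # scales in
    print(metered_cost(instance_hours=48, price_per_hour=0.5))
```

The point of the sketch is that capacity follows load in both directions, and the bill follows capacity.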

Delivery Models

source: http://www.businessinsider.com.au/10-most-important-in-cloud-computing-2013-4?op=1#a-word-about-clouds-1

Public SaaS examples

• Gmail
• SharePoint
• Salesforce.com CRM
• OnLive
• Gaikai
• Microsoft Office 365
• Some definitions also include services that do not require payment, e.g. ad-supported sites.

Public PaaS Examples

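Google App Engine
• Language and developer tools: Python, Java, Go, PHP + JVM languages (Scala, Groovy, JRuby)
• Programming models supported by provider: MapReduce, Web, DataStore, Storage and other APIs
• Target applications and storage options: Web applications and BigTable storage

Salesforce.com’s Force.com
• Language and developer tools: Apex, Eclipse-based IDE, web-based wizard
• Programming models supported by provider: Workflow, Excel-like formula, web programming
• Target applications and storage options: Business applications such as CRM

Microsoft Azure
• Language and developer tools: .NET, Visual Studio, Azure tools
• Programming models supported by provider: Unrestricted model
• Target applications and storage options: Enterprise and web apps

Amazon Elastic MapReduce
• Language and developer tools: Hive, Pig, Java, Ruby etc.
• Programming models supported by provider: MapReduce
• Target applications and storage options: Data processing and e-commerce

Aneka
• Language and developer tools: .NET, stand-alone SDK
• Programming models supported by provider: Threads, task, MapReduce
• Target applications and storage options: .NET enterprise applications, HPC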

Infrastructure-as-a-Service (IaaS)

• Amazon Web Services (market leader)
• Rackspace Cloud
• NeCTAR/OpenStack Research Cloud
• Joyent Cloud
• GoGrid
• FlexiScale

Common Terms

Machine Image: A stored image/template from which a new virtual machine can be launched. E.g. Ubuntu

Instance: A running virtual machine based on some machine image.

Volume: Attachable Block Storage, which is the equivalent of a virtual disk drive.

Object Store: A large store for storing simple binary objects + metadata within containers

Security Groups: A means of specifying firewall rules

Key-pairs: Public/private key pairs for accessing a virtual machine
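How these terms relate to one another can be sketched as a toy data model. The class and field names below are hypothetical and exist only to mirror the definitions above; they are not the API of any cloud provider or library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class MachineImage:
    """A stored template from which new virtual machines are launched."""
    image_id: str
    name: str


@dataclass
class Volume:
    """Attachable block storage, the equivalent of a virtual disk drive."""
    volume_id: str
    size_gb: int
    attached_to: Optional[str] = None  # instance ID once attached


@dataclass
class SecurityGroup:
    """A named set of firewall rules."""
    name: str
    rules: List[Tuple[str, int, int, str]] = field(default_factory=list)

    def add_rule(self, protocol, from_port, to_port, cidr):
        self.rules.append((protocol, from_port, to_port, cidr))


@dataclass
class ObjectStoreContainer:
    """A container holding simple binary objects plus metadata."""
    name: str
    objects: Dict[str, Tuple[bytes, dict]] = field(default_factory=dict)

    def put(self, key, data, metadata=None):
        self.objects[key] = (data, metadata or {})


@dataclass
class Instance:
    """A running virtual machine based on some machine image."""
    instance_id: str
    image: MachineImage
    key_pair_name: str  # the public/private key pair used for access
    security_groups: List[SecurityGroup] = field(default_factory=list)
    volumes: List[Volume] = field(default_factory=list)

    def attach(self, volume):
        volume.attached_to = self.instance_id
        self.volumes.append(volume)


if __name__ == "__main__":
    ssh_only = SecurityGroup("ssh-only")
    ssh_only.add_rule("tcp", 22, 22, "0.0.0.0/0")
    vm = Instance("i-001", MachineImage("img-001", "Ubuntu"), "my-key", [ssh_only])
    vm.attach(Volume("vol-001", size_gb=100))
    print(vm)
```

One image can back many instances, while a volume is attached to a single instance in this model.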

Getting started with Cloud resources

Demo 1

Many clouds exist - how do we use them?

Many clouds and many solutions

launch.genome.edu.au ; use.jetstream-cloud.org ; launch.usegalaxy.org

?!?!

Architectural stack

CloudLaunch.usegalaxy.org

APPLICATIONS

CloudBridge

CloudMan

Goonasekera, N., Lonie, A., Taylor, J., Afgan, E., “CloudBridge – a Simple Cross-Cloud Python Library”, XSEDE 16, Miami, FL, July 2016.

Demo 2

CloudBridge Design Principles

A simple, open-source Python multi-cloud library.

• Uniform API irrespective of the underlying provider
  • No special casing of application code
  • Simpler code
• Provide a set of conformance tests for all supported clouds
  • No need to test against each cloud
  • “Write-once-run-anywhere”
  • > 92% test coverage at present
• Supports OpenStack and AWS right now
  • Community contributions for GCE and Azure forthcoming!

http://cloudbridge.readthedocs.org/
https://github.com/gvlproject/cloudbridge
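The idea behind a uniform API and shared conformance tests can be sketched in a few lines. This is an illustration of the design pattern only: the classes and functions below are hypothetical stand-ins and are not CloudBridge's actual classes or method names.

```python
from abc import ABC, abstractmethod


class CloudProvider(ABC):
    """Hypothetical common interface that every provider must implement."""

    @abstractmethod
    def launch_instance(self, name, image_id):
        """Start an instance and return a provider-neutral description."""


class FakeAWSProvider(CloudProvider):
    def launch_instance(self, name, image_id):
        # A real implementation would call the AWS API here.
        return {"provider": "aws", "name": name, "image": image_id}


class FakeOpenStackProvider(CloudProvider):
    def launch_instance(self, name, image_id):
        # A real implementation would call the OpenStack API here.
        return {"provider": "openstack", "name": name, "image": image_id}


def deploy(provider, name, image_id):
    """Application code: identical for every provider, no special casing."""
    return provider.launch_instance(name, image_id)


def conformance_check(provider):
    """One shared test that every provider implementation must pass."""
    result = deploy(provider, "test-vm", "image-123")
    return result["name"] == "test-vm" and result["image"] == "image-123"


if __name__ == "__main__":
    for p in (FakeAWSProvider(), FakeOpenStackProvider()):
        print(type(p).__name__, conformance_check(p))
```

Because `deploy` depends only on the shared interface, adding a provider means passing the same conformance check rather than changing application code.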


Sample code: launch an instance

# Assumes `provider` is an already configured CloudBridge provider object
# and `image_id` is the ID of a machine image available on that cloud.

# Create a key pair and save the private key to a file
kp = provider.security.key_pairs.create('cloudbridge_intro')
with open('cloudbridge_intro.pem', 'w') as f:
    f.write(kp.material)

# Create a security group that allows inbound SSH (TCP port 22)
sg = provider.security.security_groups.create(
    'cloudbridge_intro', 'A security group used by CloudBridge')
sg.add_rule('tcp', 22, 22, '0.0.0.0/0')

# Launch an instance on the smallest instance type that passes the vCPU/RAM filter
img = provider.compute.images.get(image_id)
inst_type = sorted([t for t in provider.compute.instance_types.list()
                    if t.vcpus >= 2 and t.ram >= 4],
                   key=lambda x: x.vcpus*x.ram)[0]
inst = provider.compute.instances.create(
    name='CloudBridge-intro', image=img, instance_type=inst_type,
    key_pair=kp, security_groups=[sg])

# Wait until ready
inst.wait_till_ready()
# Show instance state
inst.state
# 'running'
inst.public_ips
# [u'54.166.125.219']

CloudLaunch Feature Highlights

Portal for deploying cloud-enabled applications.

• Support for customization
  • Support launch for different versions, apps, configs, clouds → fills the role of a science gateway discovery and access portal
• Modular and extensible platform
  • App-store for cloud-enabled applications
  • Users can develop and integrate custom application launch and management components, at the UI and backend
• Natively multi-cloud
  • Backed by CloudBridge

https://beta.launch.usegalaxy.org/
https://github.com/galaxyproject/cloudlaunch-ui
https://github.com/galaxyproject/cloudlaunch


CloudLaunch architecture

• Front end: Angular 2
• Back end: Django + REST framework + Celery, on top of CloudBridge
• Pluggable components: GVL, CloudMan, Galaxy, Ubuntu

Pluggable component example

<form class="form" [ngFormModel]="gvlLaunchForm" (ngSubmit)="onSubmit(gvlLaunchForm.value)">
  <config-panel>
    <panel-header>GVL Settings</panel-header>
    <panel-body>
      <div class="form-group">
        <label>Auto-start the selected applications</label>
        <div class="checkbox">
          <label>
            <input type="checkbox" name="gvlapp_cmdlineutils" ngControl="gvl_cmdline_utilities" /> GVL Commandline Utilities
          </label>
        </div>
        <div class="checkbox">
          <label>
            <input type="checkbox" name="gvlapp_smrt_analysis" ngControl="smrt_portal" /> PacBio SMRT Analysis
          </label>
        </div>
      </div>
    </panel-body>
  </config-panel>

  <cloudman-config [initialConfig]="initialConfig.config_gvl"></cloudman-config>
</form>

Cloud capacity is great - but what do we use it for?

Bioinformatics: in one slide

A multi-disciplinary science using computers for acquiring, managing and analyzing biological data.

It is a data-driven science.

Contributing fields: Biology, Medicine, Math & Physics, Computer Science → Bioinformatics

What type of data are we talking about?

DNA → RNA → Protein → to Complex… to Tissues… to Organs… to full Organisms

Each cell contains (almost) the same DNA in its nucleus.

An adult human body has approximately 37 trillion cells.

What do we do with the data?

• Apply data transformations to extract useful information
• This is not always a well-defined process
• This is typically done with existing tools, or by developing one’s own
• Tools can be chained into workflows
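Chaining tools into a workflow can be sketched as follows. The three "tools" are toy functions written for this illustration (they are not real bioinformatics packages), and `run_workflow` is a hypothetical helper that simply feeds each step's output into the next step.

```python
from functools import reduce


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))


def transcribe(seq):
    """Transcribe DNA to RNA by replacing T with U."""
    return seq.replace("T", "U")


def gc_content(seq):
    """Fraction of bases in the sequence that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def run_workflow(data, steps):
    """Apply each step in order, passing its output to the next step."""
    return reduce(lambda value, step: step(value), steps, data)


if __name__ == "__main__":
    workflow = [reverse_complement, transcribe, gc_content]
    print(run_workflow("ATGC", workflow))
```

A workflow system adds what this sketch leaves out: tracking inputs and outputs, scheduling steps on available compute, and making runs repeatable.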

And how do we obtain such data?

The first methods, called Sanger sequencing, were developed in the mid-1970s.

Starting in 1990, the international Human Genome Project took 13 years to sequence the human genome.

In the 2000s, massively parallel Next Generation Sequencers (NGS) were developed that took days to sequence a human genome at a much lower cost.

Today, nanopore sequencers are emerging, offering real-time sequencing.

There are many public data repositories with free access to data (e.g., TCGA, 1000 Genomes, GenBank).

omicsmaps.com

Raw data to results

Typical genomics flow: raw data (100s to 1000s of GB) + external reference data → results (a few GB)

A real-world system

• Raw data: 100s to 1000s of GB
• Indexed genomes: 10s to 100s of GB
• Results: a few GB
• Some computers + reliable persistent data storage + bioinformatics tools + reference data + workflow system
• Timeline: Aug, Sep, Oct, Nov...

Galaxy: accessible analysis system

• A data analysis and integration tool
• A (free for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
• Open source software that makes integrating your own tools and data and customizing for your own site simple

Three ways to use Galaxy

1. Download and run locally

2. Public website (http://usegalaxy.org)

3. Run on the Cloud


100sGB

100+

A real-world system

CloudBridge

CloudLaunch

CloudMan

CloudScale

Pathway Expected Outcomes

Improved features (root volume size)

Cloud independence

Improved stability

Federated single sign-on

“One-click” launch

Bulk launch

Cloud Independence

Tasks

Complete

Use CloudBridge

Assemble image from Docker containers

Remove shared filesystem

Simpler deployment

Extensible platform

Scaling for institutional Galaxy instances

Scale-out support for labs

Audience

All users

Academic users

All

Virtual labs (e.g., GVL, CLIMB)

CLIMB/Other labs

Hosted services

Tutorials

Complete

Progress roadmap

Acknowledgments

Did this sound interesting?

This entire project is an effort from a large community!

Come talk to us - get involved.

[email protected] or [email protected]
