Cloud Computing and Bioinformatics
Enis Afgan*, Nuwan Goonasekera†
* Johns Hopkins University, Taylor Lab, USA
† University of Melbourne, Victorian Life Science Computation Initiative, Australia
@ University of Colombo, Feb 2017
Overview
• The key characteristics of cloud computing
• Dynamically scaling cloud resources
• Using cloud computing for bioinformatics
Source: http://dilbert.com/strips/comic/2012-05-25/
Life before cloud computing
source: http://www.rackspace.com/knowledge_center/whitepaper/revolution-not-evolution-how-cloud-computing-differs-from-traditional-it-and-why-it
Cloud Computing: A Definition
• NIST definition: “Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”
» National Institute of Standards and Technology (http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf)
The Cloud Model
Deployment Models
• Private
• Community
• Public
• Hybrid
Delivery Models
• Software as a Service (SaaS)
• Platform as a Service (PaaS)
• Infrastructure as a Service (IaaS)
Essential Characteristics
• On-demand self-service
• Broad network access
• Resource pooling
• Rapid elasticity
• Measured service
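Rapid elasticity can be pictured as a policy that grows or shrinks the pool of instances as load changes. The following is a minimal sketch of a threshold policy, not any provider's actual autoscaler; the function name, the thresholds, and the instance limits are all assumptions chosen for illustration.

```python
def desired_instances(current, cpu_utilisation, low=0.30, high=0.70,
                      minimum=1, maximum=10):
    """Return the instance count a simple threshold policy would request.

    All parameters are hypothetical: `cpu_utilisation` is the average load
    across the pool as a fraction between 0 and 1.
    """
    if cpu_utilisation > high:
        target = current + 1      # scale out under heavy load
    elif cpu_utilisation < low:
        target = current - 1      # scale in when mostly idle
    else:
        target = current          # load is within the comfortable band
    # Never go below the minimum or above the maximum pool size.
    return max(minimum, min(maximum, target))


print(desired_instances(2, 0.90))
print(desired_instances(2, 0.10))
```

Measured service means each of those instances is billed only while it runs, which is what makes scaling in worthwhile.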
Delivery Models
source: http://www.businessinsider.com.au/10-most-important-in-cloud-computing-2013-4?op=1#a-word-about-clouds-1
Public SaaS examples
• Gmail
• SharePoint
• Salesforce.com CRM
• OnLive
• Gaikai
• Microsoft Office 365
• Some definitions include those that do not require payment, e.g. ad-supported sites
Public PaaS Examples
Cloud Name | Language and Developer Tools | Programming Models Supported by Provider | Target Applications and Storage Options
Google App Engine | Python, Java, Go, PHP + JVM languages (Scala, Groovy, JRuby) | MapReduce, Web, DataStore, Storage and other APIs | Web applications and BigTable storage
Salesforce.com’s Force.com | Apex, Eclipse-based IDE, web-based wizard | Workflow, Excel-like formula, web programming | Business applications such as CRM
Microsoft Azure | .NET, Visual Studio, Azure tools | Unrestricted model | Enterprise and web apps
Amazon Elastic MapReduce | Hive, Pig, Java, Ruby etc. | MapReduce | Data processing and e-commerce
Aneka | .NET, stand-alone SDK | Threads, task, MapReduce | .NET enterprise applications, HPC
Infrastructure-as-a-Service (IaaS)
• Amazon Web Services (market leader)
• Rackspace Cloud
• NeCTAR/OpenStack Research Cloud
• Joyent Cloud
• GoGrid
• FlexiScale
Common Terms
Machine Image: A stored image/template from which a new virtual machine can be launched. E.g. Ubuntu
Instance: A running virtual machine based on some machine image.
Volume: Attachable Block Storage, which is the equivalent of a virtual disk drive.
Object Store: A large store for storing simple binary objects + metadata within containers
Security Groups: A means of specifying firewall rules
Key-pairs: Public/private key pairs for accessing a virtual machine
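To make the security group term concrete, the sketch below models a group as a list of allow rules and checks whether a connection would be admitted. It is an illustration only and not any cloud's API: the `Rule` class and `is_allowed` function are hypothetical names, and the sketch assumes the common allow-list behaviour in which traffic that matches no rule is rejected.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One hypothetical firewall rule: protocol, port range, source CIDR."""
    protocol: str
    from_port: int
    to_port: int
    cidr: str


def is_allowed(rules, protocol, port, source_ip):
    """Return True if any rule admits the connection (default is deny)."""
    addr = ipaddress.ip_address(source_ip)
    return any(
        r.protocol == protocol
        and r.from_port <= port <= r.to_port
        and addr in ipaddress.ip_network(r.cidr)
        for r in rules
    )


rules = [
    Rule("tcp", 22, 22, "0.0.0.0/0"),    # SSH from anywhere
    Rule("tcp", 80, 80, "10.0.0.0/8"),   # HTTP from an internal range only
]
print(is_allowed(rules, "tcp", 22, "203.0.113.7"))   # True
print(is_allowed(rules, "tcp", 80, "203.0.113.7"))   # False
```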
Getting started with Cloud resources
Demo 1
Many clouds exist - how do we use them?
Many clouds and many solutions
launch.genome.edu.au ; use.jetstream-cloud.org ; launch.usegalaxy.org
?!?!
Architectural stack
[Figure: architectural stack with layers labelled CloudLaunch.usegalaxy.org, Applications, CloudBridge, and CloudMan]
Goonasekera, N., Lonie, A., Taylor, J., Afgan, E., “CloudBridge – a Simple Cross-Cloud Python Library”, XSEDE 16, Miami, FL, July 2016.
Demo 2
CloudBridge Design Principles
A simple, open-source Python multi-cloud library.
• Uniform API irrespective of the underlying provider
  » No special casing of application code
  » Simpler code
• Provide a set of conformance tests for all supported clouds
  » No need to test against each cloud
  » “Write-once-run-anywhere”
  » > 92% test coverage at present
• Supports OpenStack and AWS right now
  » Community contributions for GCE and Azure forthcoming!
http://cloudbridge.readthedocs.org/
https://github.com/gvlproject/cloudbridge
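The two design principles, a uniform API and a shared conformance suite, can be illustrated with a toy example. This is a sketch of the idea only and does not reproduce CloudBridge's classes: `CloudProvider`, `FakeAWS`, `FakeOpenStack`, and the test helpers are hypothetical names, and the providers are in-memory stand-ins rather than real cloud clients.

```python
from abc import ABC, abstractmethod


class CloudProvider(ABC):
    """Hypothetical uniform interface that every provider must implement."""

    @abstractmethod
    def create_key_pair(self, name):
        """Create a key pair and return its name."""

    @abstractmethod
    def list_key_pairs(self):
        """Return the names of all known key pairs."""


class FakeAWS(CloudProvider):
    """In-memory stand-in that stores key pairs in a dict."""

    def __init__(self):
        self._pairs = {}

    def create_key_pair(self, name):
        self._pairs[name] = "aws-material-" + name
        return name

    def list_key_pairs(self):
        return sorted(self._pairs)


class FakeOpenStack(CloudProvider):
    """In-memory stand-in with a different internal representation."""

    def __init__(self):
        self._names = []

    def create_key_pair(self, name):
        self._names.append(name)
        return name

    def list_key_pairs(self):
        return list(self._names)


def conformance_key_pairs(provider):
    """One conformance test, written once against the interface."""
    assert provider.create_key_pair("intro") == "intro"
    assert "intro" in provider.list_key_pairs()


def run_conformance(provider_classes):
    """Run the same test against every provider implementation."""
    results = {}
    for cls in provider_classes:
        conformance_key_pairs(cls())
        results[cls.__name__] = "passed"
    return results


print(run_conformance([FakeAWS, FakeOpenStack]))
```

Because application code only calls the interface, it needs no provider-specific branches, and a provider that passes the shared tests can be swapped in without retesting the application against each cloud.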
Sample code: launch an instance
# Assumes `provider` is an already configured CloudBridge provider object
# and `image_id` is the ID of a machine image available on that cloud.

# Create a key pair and save the private key locally
kp = provider.security.key_pairs.create('cloudbridge_intro')
with open('cloudbridge_intro.pem', 'w') as f:
    f.write(kp.material)

# Create a security group that allows SSH (TCP port 22) from any address
sg = provider.security.security_groups.create(
    'cloudbridge_intro', 'A security group used by CloudBridge')
sg.add_rule('tcp', 22, 22, '0.0.0.0/0')

# Launch an instance, picking the smallest instance type
# that satisfies vcpus >= 2 and ram >= 4
img = provider.compute.images.get(image_id)
inst_type = sorted(
    [t for t in provider.compute.instance_types.list()
     if t.vcpus >= 2 and t.ram >= 4],
    key=lambda x: x.vcpus * x.ram)[0]
inst = provider.compute.instances.create(
    name='CloudBridge-intro', image=img, instance_type=inst_type,
    key_pair=kp, security_groups=[sg])

# Wait until ready
inst.wait_till_ready()
# Show instance state
inst.state
# 'running'
inst.public_ips
# [u'54.166.125.219']
CloudLaunch Feature Highlights
• Portal for deploying cloud-enabled applications
• Support for customization
  » Support launching different versions, apps, configs, and clouds → fills the role of a science gateway discovery and access portal
• Modular and extensible platform
  » App-store for cloud-enabled applications
  » Users can develop and integrate custom application launch and management components, at the UI and backend
• Natively multi-cloud
  » Backed by CloudBridge
https://beta.launch.usegalaxy.org/
https://github.com/galaxyproject/cloudlaunch-ui
https://github.com/galaxyproject/cloudlaunch
CloudLaunch architecture
[Figure: stack of Angular 2, Django + REST framework + Celery, and CloudBridge, with pluggable components GVL, CloudMan, Galaxy, and Ubuntu]
Pluggable component example
<form class="form" [ngFormModel]="gvlLaunchForm" (ngSubmit)="onSubmit(gvlLaunchForm.value)">
  <!-- GVL Component Selection -->
  <config-panel>
    <panel-header>GVL Settings</panel-header>
    <panel-body>
      <div class="form-group">
        <label>Auto-start the selected applications</label>
        <div class="checkbox">
          <label>
            <input type="checkbox" name="gvlapp_cmdlineutils" ngControl="gvl_cmdline_utilities" /> GVL Commandline Utilities
          </label>
        </div>
        <div class="checkbox">
          <label>
            <input type="checkbox" name="gvlapp_smrt_analysis" ngControl="smrt_portal" /> PacBio SMRT Analysis
          </label>
        </div>
      </div>
    </panel-body>
  </config-panel>

  <!-- CloudMan settings -->
  <cloudman-config [initialConfig]="initialConfig.config_gvl"></cloudman-config>
</form>
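On the backend side, a pluggable design can be sketched as a registry that maps an application name to the code that launches it. The example below is a hypothetical illustration of that pattern and is not taken from the CloudLaunch code base: `register_app`, `launch`, and the config keys are assumed names.

```python
# Hypothetical registry: application name -> launch handler.
_REGISTRY = {}


def register_app(name):
    """Decorator that registers a launch handler under `name`."""
    def decorator(func):
        _REGISTRY[name] = func
        return func
    return decorator


@register_app("galaxy")
def launch_galaxy(config):
    # A real handler would talk to a cloud; this one just echoes a plan.
    return {"app": "galaxy", "cloud": config["cloud"],
            "version": config.get("version", "latest")}


@register_app("gvl")
def launch_gvl(config):
    return {"app": "gvl", "cloud": config["cloud"],
            "autostart": sorted(config.get("autostart", []))}


def launch(name, config):
    """Dispatch to the registered handler, or fail with a clear error."""
    try:
        handler = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"no launcher registered for {name!r}") from None
    return handler(config)


print(launch("gvl", {"cloud": "aws", "autostart": ["smrt_portal"]}))
```

New applications are added by registering another handler, so the dispatch code in `launch` never has to change.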
Cloud capacity is great - but what do we use it for?
Bioinformatics: in one slide
A multi-disciplinary science using computers for acquiring, managing and analyzing biological data.
It is a data-driven science.
[Figure: Bioinformatics at the intersection of Biology, Medicine, Math & Physics, and Computer Science]
What type of data are we talking about?
DNA → RNA → Protein → to Complex… to Tissues… to Organs… to full Organisms
Each cell contains (almost) the same DNA in its nucleus.
An adult human body has approximately 37 trillion cells.
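The first two arrows, DNA → RNA → Protein, can be shown on a toy sequence. This is a minimal sketch: the function names are hypothetical, the input is assumed to be the coding strand, and the codon table is deliberately limited to the few codons the example uses plus the three stop codons.

```python
# Partial codon table (assumption: only the codons needed for the example).
CODON_TABLE = {
    "AUG": "M",  # methionine, the usual start codon
    "GCC": "A",  # alanine
    "UGG": "W",  # tryptophan
    "AAA": "K",  # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Read the mRNA three bases at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("ATGGCCTGGAAATAA")
print(rna)             # AUGGCCUGGAAAUAA
print(translate(rna))  # MAWK
```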
What do we do with the data?
• Apply data transformations to extract useful information
• This is not always a well-defined process
• This is typically done with existing tools, or by developing one’s own
• Tools can be chained into workflows
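Chaining tools into a workflow amounts to feeding each tool's output into the next one. The sketch below uses three small stand-in functions in place of real bioinformatics tools; their names, the length threshold, and the sample reads are all assumptions made for illustration.

```python
from functools import reduce


def quality_filter(reads, min_length=4):
    """Drop reads that are too short or contain an unknown base (N)."""
    return [r for r in reads if len(r) >= min_length and "N" not in r]


def gc_content(reads):
    """Pair each read with the fraction of its bases that are G or C."""
    return [(r, sum(base in "GC" for base in r) / len(r)) for r in reads]


def summarise(scored):
    """Reduce the per-read scores to a small result record."""
    if not scored:
        return {"reads": 0, "mean_gc": 0.0}
    return {"reads": len(scored),
            "mean_gc": sum(gc for _, gc in scored) / len(scored)}


def run_workflow(data, steps):
    """Apply each step in order, passing the output on to the next step."""
    return reduce(lambda value, step: step(value), steps, data)


reads = ["ACGT", "GGCC", "ANGT", "AT"]
print(run_workflow(reads, [quality_filter, gc_content, summarise]))
# {'reads': 2, 'mean_gc': 0.75}
```

A workflow system adds what this sketch leaves out, such as recording parameters and tool versions so that an analysis can be rerun.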
And how do we obtain such data?
The first methods, called Sanger sequencing, were developed in the mid-1970s.
Starting in 1990, the international Human Genome Project took 13 years to sequence the human genome.
In the 2000s, massively parallel Next Generation Sequencers (NGS) were developed that took days to sequence a human genome at a much lower cost.
Today, nanopore sequencers are emerging, offering real-time sequencing.
There are many public data repositories with free access to data (e.g., TCGA, 1000 genomes, GenBank).
omicsmaps.com
Typical genomics flow
[Figure: raw data is turned into results with the help of external reference data; size labels: 100s to 1000s of GB, a few GB]
A real-world system
[Figure: raw data to results, requiring some computers + reliable persistent data storage + bioinformatics tools + reference data + a workflow system; size labels: 100s to 1000s of GB, a few GB; indexed genomes: 10s to 100s of GB; timeline: Aug, Sep, Oct, Nov, ...]
Galaxy: accessible analysis system
• A data analysis and integration tool
• A (free for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
• Open source software that makes integrating your own tools and data and customizing for your own site simple
Three ways to use Galaxy
1. Download and run locally
2. Public website (http://usegalaxy.org)
3. Run on the Cloud
CloudBridge
CloudLaunch
CloudMan
CloudScale
Pathway Expected Outcomes
Improved features (root volume size)
Cloud independence
Improved stability
Federated single sign-on
“One-click” launch
Bulk launch
Cloud Independence
Tasks
Complete
Use CloudBridge
Assemble image from Docker containers
Remove shared filesystem
Simpler deployment
Extensible platform
Scaling for institutional Galaxy instances
Scale-out support for labs
Audience
All users
Academic users
All
Virtual labs (e.g., GVL, CLIMB)
CLIMB/Other labs
Hosted services
Tutorials
Complete
Progress roadmap
Acknowledgments
Did this sound interesting?
This entire project is an effort from a large community!
Come talk to us - get involved.