The cloud... The cluster...

What is "the cloud"?
1. Many computers "in the sky"
2. A service "in the sky"
3. Sometimes #1 and #2

To you, the cloud is basically a set of virtual computers.

What is a virtual computer? What is a regular computer?
[Figure: a computer with Cores 1-4 and 8 GB of RAM, shown twice; task labels: transcript assembly, mrbayes model 1, mrbayes model 2]

But it's even cooler than that. You can have it your way!
- Each machine can be set up just like your computer
  - Programs, settings, etc.
- Different machines for different tasks
- Or one large machine for all tasks
- Caveat: pretty much command line only

Momentary digression: what is the command line?
- Text-based means of interacting with your computer
- More likely to be used on OS X or Linux
- Fast
- Somewhat obtuse

So, why, again, is this helpful?
The Cloud can make similar resources available at a fraction of their overall cost. It's essentially on-demand computing power.
- 48 cores, 256 GB RAM = $33,500

Benefits of The Cloud
- Pay by the hour
- Use what you need
- No purchase/depreciation of equipment
- Almost instant access to many resources
  - If you need 1 node, no problem
  - If you need 500 nodes, no problem

Costs of The Cloud
- Few safety nets
  - With flexibility comes the power to do wrong
- Interactions can be complex
  - Requires proficiency in seemingly arcane tools (the CLI)
- Can be expensive
  - 68.4 GB RAM, 8 cores: about $2.00/hr.
- Must rely on others

Why would you use this?
- Data pre-processing
  - Read trimming, adapter trimming
- Genome assembly
- Long-running processes that tie up machines
  - mrbayes, raxml, best
  - alignments (blast, blat, lastz, bwa)

Practical example: de novo genome assembly
- Have many reads
- Need to put them together
- Generally RAM intensive
- Generally slow

Actual example
- Start an Amazon EC2 instance
- Add in necessary software
  - Add 454 assembly software
- Get data to machine
- Start assembly
- Let it run
- Download assembled data

[Figure: reads, then align and orient, then assemble]

Why is this hard?
- Must ensure correct ends overlap
- Must put correct pieces together
- Must do this quickly
  - Do things in RAM/memory
- Must deal with massive amounts of data
  - 0.5 to 2 to 20 GB or more

What, exactly, is a cluster?
- Group of machines interacting to achieve a common goal

Clusters
[Figure: one machine with 1000 work units versus a cluster of machines with 125 work units each; roughly an 8X speedup, or 1/8th the time]

Why?
- Very long-running processes/complex jobs
  - Genome:genome alignments
  - Substitution models for thousands of loci
  - Species trees for thousands of loci
- Sometimes the only way to accomplish a genome-scale job in a reasonable time frame

Practical example
[Figure, built up over four slides: chr1 labelled "Similar"; then chr1 chr2 chr3 chr4; then two sets of chr1 chr2 chr3 chr4]

Cluster caveats
- Sometimes not suited to certain jobs
  - Essentially those without component parts
  - Some modeling (e.g. mcmc)
- Complex
  - More moving parts = more to break

Clusters in the Cloud
- You have a big, complicated job
- You need many computers for a job
- You need to run the job infrequently
- You don't have massive computer resources

The Cloud as a service
- Alternative meaning of The Cloud
- Essentially web-powered software
- Galaxy is one such service

Galaxy
- Very powerful analyses
- Relatively simple to use
- Repeatable
- Understandable
- Extendable

Galaxy: basic services
- Convert fastq to fasta
- Summarize fastq reads
- Fasta + Qual to Fastq
- Trim fastq reads
- Merge data sets
- Convert SFF

Galaxy: advanced services
- Intersect genomic regions
- Merge genomic regions
- Map with bowtie
- Map with bwa
- Use bwa to identify variants
- Convert genome coordinates

Actual example: finding missing genes
- You have a genome sequence
- You have gene annotation (i.e., RefSeq)
- You have aligned mRNA data
- You want to know where these do not overlap

Galaxy is very flexible
- Runs locally
- Runs on network
- Runs on cluster
- Runs in cloud
- Runs on cluster in cloud

Galaxy has some prerequisites
- You know what you want to do
- You generally know how to do it
- You know what data you need
- You know how to ensure the results are correct

Galaxy abstracts away the complexity of the implementation steps.
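
A rough sketch of the cluster idea from the slides on clusters, in Python. It assumes the chromosome figure depicts aligning every chromosome of one genome against every chromosome of another, so that each pair is an independent job. The function names and the round-robin assignment are illustrative assumptions, not a description of any particular cluster scheduler.

```python
from itertools import product


def chromosome_pair_jobs(genome_a, genome_b):
    """List every (chromosome, chromosome) pair as one independent job."""
    return list(product(genome_a, genome_b))


def assign_round_robin(jobs, n_machines):
    """Deal jobs out to machines one at a time (hypothetical scheduling)."""
    buckets = [[] for _ in range(n_machines)]
    for i, job in enumerate(jobs):
        buckets[i % n_machines].append(job)
    return buckets


def units_per_machine(total_units, n_machines):
    """Work units on the busiest machine if the work divides evenly (ceiling)."""
    return -(-total_units // n_machines)


chroms = ["chr1", "chr2", "chr3", "chr4"]
jobs = chromosome_pair_jobs(chroms, chroms)   # 4 x 4 chromosome pairs
buckets = assign_round_robin(jobs, 4)         # 4 hypothetical machines

print(len(jobs), "jobs;", [len(b) for b in buckets], "per machine")
print(units_per_machine(1000, 8), "work units per machine on 8 machines")
```

The 1000-unit case reproduces the figure from the slide: eight machines each take 125 units. The ideal 8X speedup only holds when the pieces really are independent, which is the point of the caveat about jobs without component parts.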
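
The "finding missing genes" example boils down to an interval question: which annotated regions have no aligned mRNA, and which mRNA alignments fall outside any annotation? Below is a minimal Python sketch of that logic only. It is not Galaxy's implementation; the coordinates are made up for illustration, and intervals are assumed to be half-open `(chrom, start, end)` tuples.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def without_overlap(features, others):
    """Keep the features that overlap nothing in `others`."""
    return [f for f in features if not any(overlaps(f, o) for o in others)]


# Hypothetical data: annotated genes and aligned mRNAs on the same genome.
annotated_genes = [("chr1", 100, 500), ("chr1", 900, 1200), ("chr2", 50, 300)]
mrna_alignments = [("chr1", 450, 700), ("chr2", 400, 600), ("chr3", 10, 90)]

# Annotated genes with no mRNA support.
genes_without_mrna = without_overlap(annotated_genes, mrna_alignments)

# mRNA alignments outside any annotation: candidate "missing" genes.
mrna_without_gene = without_overlap(mrna_alignments, annotated_genes)

print(genes_without_mrna)
print(mrna_without_gene)
```

This all-against-all comparison is fine for a handful of intervals. Genome-scale data would need sorted or indexed intervals, which is the kind of implementation detail the slides say Galaxy hides.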