
Page 1: Coates bosc2010 clouds-fluff-and-no-substance

Clouds: All fluff and no substance?

Guy Coates

Wellcome Trust Sanger Institute

[email protected]

Page 2: Coates bosc2010 clouds-fluff-and-no-substance

Outline

About the Sanger Institute.

Experience with cloud to date.

Future Directions.

Page 3: Coates bosc2010 clouds-fluff-and-no-substance

The Sanger Institute

Funded by the Wellcome Trust.
• 2nd largest research charity in the world.
• ~700 employees.
• Based on the Hinxton Genome Campus, Cambridge, UK.

Large-scale genomic research.
• Active cancer, malaria, pathogen and genomic variation / human health studies.
• 1k Genomes, 10k-UK Genomes and Cancer Genome projects.

All data is made publicly available.
• Websites, FTP, direct database access, programmatic APIs.

Page 4: Coates bosc2010 clouds-fluff-and-no-substance

Economic Trends:

The cost of sequencing halves every 12 months.
• cf. Moore's Law.

The Human Genome Project:
• 13 years.
• 23 labs.
• $500 million.

A human genome today:
• 3 days.
• 1 machine.
• $10,000.
• Large centres are now doing studies with 1000s and 10,000s of genomes.

Changes in sequencing technology are going to continue this trend.
• "Next-next" generation sequencers are on their way.
• A $500 genome is probable within 5 years.

Page 5: Coates bosc2010 clouds-fluff-and-no-substance

The scary graph

[Chart: sequencing output over time, annotated with instrument upgrades and peak yearly capillary sequencing.]

Page 6: Coates bosc2010 clouds-fluff-and-no-substance

[Chart: disk storage by year, 1994-2009; y-axis in terabytes, 0-6,000.]

Managing Growth

We have exponential growth in storage and compute.
• Storage / compute doubles every 12 months.
• 2009: ~7 PB raw.

Moore's law will not save us.
• Transistor / disk density: Td = 18 months.
• Sequencing cost: Td = 12 months.

My job:
• Running the team who do the IT systems heavy lifting to make it all work.
• Tech evaluations.
• Systems architecture.
• Day-to-day administration.
• All in conjunction with the informaticians, programmers and investigators who are doing the science.

Page 7: Coates bosc2010 clouds-fluff-and-no-substance

Cloud: Where are we at?

Page 8: Coates bosc2010 clouds-fluff-and-no-substance

What is cloud?

Technical view:

• On demand, virtual machines.

• Root access, total ownership.

• Pay-as-you-go model.

Non-technical view:

• “Free” compute we can use to solve all of the hard problems thrown up by new sequencing.
• (8 cents/hour is almost free, right...?)

• Web 2.0 / Friendface use it, so it must be good.

Page 9: Coates bosc2010 clouds-fluff-and-no-substance

Hype Cycle

[Chart: hype cycle, annotated from "Awesome!" to "Just works...".]

Page 10: Coates bosc2010 clouds-fluff-and-no-substance

Out of the trough of disillusionment...

Page 11: Coates bosc2010 clouds-fluff-and-no-substance

Victory!

Page 12: Coates bosc2010 clouds-fluff-and-no-substance

Cloud Use-Cases

We currently have three areas of activity:

• Web presence

• HPC workload

• Data Warehousing

Page 13: Coates bosc2010 clouds-fluff-and-no-substance

Ensembl

Ensembl is a system for genome annotation.

Data visualisation (Web Presence):
• www.ensembl.org
• Provides web / programmatic interfaces to genomic data.
• 10k visitors / 126k page views per day.

Compute pipeline (HPC Workload):
• Takes a raw genome and runs it through a compute pipeline to find genes and other features of interest.
• Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes.

• Software is open source (Apache license).
• Data is free to download.

We have done cloud experiments with both the web site and pipeline.

Page 14: Coates bosc2010 clouds-fluff-and-no-substance

Web presence

Page 15: Coates bosc2010 clouds-fluff-and-no-substance
Page 16: Coates bosc2010 clouds-fluff-and-no-substance

Web Presence

Ensembl has a worldwide audience.

Historically, web site performance was not great, especially for non-European institutes.
• Pages were quite heavyweight.
• Not properly cached etc.

The web team spent a lot of time re-designing the code to make it more streamlined.
• Greatly improved performance.

Coding can only get you so far.
• "A canna' change the laws of physics."
• 150-240 ms round-trip time from Europe to the US.
• We need a set of geographically dispersed mirrors.

Page 17: Coates bosc2010 clouds-fluff-and-no-substance

uswest.ensembl.org

Traditional mirror: Real machines in a co-lo facility in California.

Hardware was initially configured on site.
• 16 servers, SAN storage, SAN switches, SAN management appliance, Ethernet switches, firewall, out-of-band management etc.

Shipped to the co-lo for installation.
• Sent a person to California for 3 weeks.
• Spent 1 week getting stuff into/out of customs.
• ****ing FCC paperwork!

Additional infrastructure work.
• VPN between UK and US.

Incredibly time consuming.
• Really don't want to end up having to send someone on a plane to the US to fix things.

Page 18: Coates bosc2010 clouds-fluff-and-no-substance

Usage

US-West currently takes ~1/3rd of total Ensembl web traffic.
• Much lower latency and improved site usability.

Page 19: Coates bosc2010 clouds-fluff-and-no-substance

What has this got to do with clouds?

Page 20: Coates bosc2010 clouds-fluff-and-no-substance

useast.ensembl.org

We want an east coast US mirror to complement our west coast mirror.

Built the mirror in AWS.
• Initially a proof of concept / test-bed.
• Production-level in due course.

Gives us operational experience.
• We can compare to a "real" colo.

Page 21: Coates bosc2010 clouds-fluff-and-no-substance

Building a mirror on AWS

Some software development / sysadmin work needed.
• Preparation of OS images, software stack configuration.
• West coast was built as an extension of the Sanger internal network via VPN.
• AWS images are built as standalone systems (launch sketch below).

Web code changes.
• Significant code changes required to make the web code "mirror aware".
• Search, site login etc.
• We chose not to set up a VPN into AWS.
• Work already done for the first mirror.

A significant amount of tuning was required.
• Initial MySQL performance was pretty bad, especially for the large Ensembl databases (~1 TB).
• Lots of people are doing Apache/MySQL on AWS, so there is a good amount of best practice available.
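As a rough illustration of the "no physical hardware" workflow, a mirror node can be brought up from a pre-baked machine image with a few API calls. This is a minimal sketch using boto3 (a modern AWS SDK, not what was available at the time); the AMI ID, instance type and tag values are placeholders, not the actual useast.ensembl.org configuration.

```python
# Hypothetical sketch: launching a standalone mirror node from a pre-built
# AMI. The AMI ID, instance type and tag are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-00000000",          # placeholder: pre-baked Ensembl web/db image
    InstanceType="m1.large",         # placeholder instance size
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "ensembl-useast-web"}],
    }],
)
instance_id = response["Instances"][0]["InstanceId"]
print("Launched", instance_id)
```

Compared with the co-lo build, the whole "rack, cable, ship, clear customs" phase collapses into this kind of call plus the image-baking work described above.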

Page 22: Coates bosc2010 clouds-fluff-and-no-substance

Does it work?

[Screenshot: useast.ensembl.org, stamped "BETA!"]

Page 23: Coates bosc2010 clouds-fluff-and-no-substance

Is it better than the co-lo?

No physical hardware.
• Work can start as soon as we enter our credit card numbers...
• No US customs, FedEx etc.

Much simpler management infrastructure.
• AWS gives you out-of-band management "for free".
• Much simpler to deal with hardware problems.
• And we do remote management all the time.

"Free" hardware upgrades.
• As faster machines become available, we can take advantage of them immediately.
• No need to get tin decommissioned / re-installed at the co-lo.

Page 24: Coates bosc2010 clouds-fluff-and-no-substance

Is it cost effective?

Lots of misleading cost statements are made about cloud.
• "Our analysis only cost $500."
• CPU is only "$0.085 / hr".

What are we comparing against?
• Doing the analysis once? Continually?
• Buying a $2,000 server?
• Leasing a $2,000 server for 3 years?
• Using $150 of time at your local supercomputing facility?
• Buying $2,000 of server but having to build a $1M datacentre to put it in?

Requires the dreaded Total Cost of Ownership (TCO) calculation.
• Hardware + power + cooling + facilities + admin/developers etc.
• Incredibly hard to do.

Page 25: Coates bosc2010 clouds-fluff-and-no-substance

Let's do it anyway...

Comparing costs to the co-lo is simpler.
• Power and cooling costs are all included.
• Admin costs are the same, so we can ignore them.
• Same people responsible for both.

Cost for the co-location facility:
• $120,000 hardware + $51,000 / yr co-lo.
• $91,000 per year (3-year hardware lifetime).

Cost for AWS:
• $77,000 per year (estimated from US-East traffic / IOPS).

Result: an estimated 16% cost saving (worked through below).
• It is not free!
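The arithmetic behind that estimate, using the rounded figures quoted on this slide (the published ~16% presumably comes from the unrounded estimates):

```python
# Rough TCO comparison using the figures quoted above.
hardware = 120_000          # co-lo hardware, amortised over 3 years
colo_per_year = 51_000      # co-lo facility charge per year
colo_total_per_year = hardware / 3 + colo_per_year   # = $91,000 / yr

aws_per_year = 77_000       # estimated from US-East traffic / IOPS

saving = (colo_total_per_year - aws_per_year) / colo_total_per_year
print(f"co-lo: ${colo_total_per_year:,.0f}/yr, AWS: ${aws_per_year:,.0f}/yr")
print(f"estimated saving: {saving:.0%}")   # ~15-16%, depending on rounding
```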

Page 26: Coates bosc2010 clouds-fluff-and-no-substance

Additional Benefits

The website + code are packaged together.
• Can be conveniently given away to end users in a "ready-to-run" config.
• Simplifies configuration for other users wanting to run Ensembl sites.
• Configuring an Ensembl site is non-trivial for non-informaticians.
• CVS, MySQL setup, Apache configuration etc.

Ensembl data is already available as an Amazon public dataset.
• Makes a complete system (see the sketch below).
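One way the "complete system" could be assembled, as a hedged sketch: Amazon public datasets are published as EBS snapshots, so a running Ensembl instance can create a volume from the snapshot and mount it. The snapshot ID, availability zone, instance ID and device name below are placeholders, not the real identifiers.

```python
# Hypothetical sketch: attaching the Ensembl public dataset (published as an
# EBS snapshot) to a running instance. All identifiers are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

volume = ec2.create_volume(
    SnapshotId="snap-00000000",      # placeholder: Ensembl public dataset snapshot
    AvailabilityZone="us-east-1a",   # must match the instance's zone
)

ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-00000000",         # placeholder: the Ensembl web/db instance
    Device="/dev/sdf",               # then mount the filesystem inside the instance
)
```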

Page 27: Coates bosc2010 clouds-fluff-and-no-substance

Unknowns

What about scale-up?

The current installation is a minimal config.
• Single web / database nodes.
• The main site and us-west use multiple load-balanced servers.

The AWS load-balancing architecture is different from what we currently use.
• In theory there should be no problems...
• ...but we don't know until we try.
• Do we go for automatic scale-out?

Page 28: Coates bosc2010 clouds-fluff-and-no-substance

Downsides

We underestimated the time it would take to make the web code mirror-ready.
• Not a cloud-specific problem, but something to be aware of when you take big applications and move them outside your home institution.

Packaging OS images, code and data needs to be done for every Ensembl release.
• The Ensembl team now has a dedicated person responsible for the cloud.
• Somebody has to look after the systems.

Management overhead does not necessarily go down.
• But it does change.

Page 29: Coates bosc2010 clouds-fluff-and-no-substance

Going forward

useast.ensembl.org will go into production later this year.
• The Far East Amazon availability zone is also of interest.
• Likely to be next, assuming useast works.

The "virtual" co-location concept will be useful for a number of other projects.
• Other Sanger websites?

Disaster recovery.
• E.g. replicate critical databases / storage into AWS.
• Currently all Sanger data lives in a single datacentre.
• We have a small amount of co-lo space for mirroring critical data.
• The same arguments apply as for the uswest mirror.

Page 30: Coates bosc2010 clouds-fluff-and-no-substance

Hype Cycle

[Hype cycle chart, with "web services" marked on the curve.]

Page 31: Coates bosc2010 clouds-fluff-and-no-substance

Ensembl Pipeline

The HPC element of Ensembl.
• Takes raw genomes and performs automated annotation on them.

Page 32: Coates bosc2010 clouds-fluff-and-no-substance

Compute Pipeline

TCCTCTCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTGGAATTGTCAGACATATACCAAATCCCTTCTGTTGATTCTGCTGACAATCTATCTGAAAAATTGGAAAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTATTTAGAGAAGAGAAAGCAAACATATTATAAGTTTAATTCTTATATTTAAAAATAGGAGCCAAGTATGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGCTTGAGACCAGGAGTTTGATACCAGCCTGGGCAACATAGCAAGATGTTATCTCTACACAAAATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTGAAGCAGGAGGGTTACTTGAGCCCAGGAGTTTGAGGTTGCAGTGAGCTATGATTGTGCCACTGCACTCCAGCTTGGGTGACACAGCAAAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGGAACATCTCATTTTCACACTGAAATGTTGACTGAAATCATTAAACAATAAAATCATAAAAGAAAAATAATCAGTTTCCTAAGAAATGATTTTTTTTCCTGAAAAATACACATTTGGTTTCAGAGAATTTGTCTTATTAGAGACCATGAGATGGATTTTGTGAAAACTAAAGTAACACCATTATGAAGTAAATCGTGTATATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC

Page 33: Coates bosc2010 clouds-fluff-and-no-substance

Raw Sequence → Something useful

Page 34: Coates bosc2010 clouds-fluff-and-no-substance

Example annotation

Page 35: Coates bosc2010 clouds-fluff-and-no-substance

Gene Finding

DNA

HMM Prediction

Alignment with known proteins

Alignment with fragments recovered in vivo

Alignment with other genes and other species

Page 36: Coates bosc2010 clouds-fluff-and-no-substance

Workflow

Page 37: Coates bosc2010 clouds-fluff-and-no-substance

Compute Pipeline

Architecture:
• OO Perl pipeline manager.
• Core algorithms are C.
• 200 auxiliary binaries.

Workflow (toy sketch below):
• The investigator describes the analysis at a high level.
• The pipeline manager splits the analysis into parallel chunks.
• Typically 50k-100k jobs.
• Sorts out the dependencies and then submits jobs to a DRM.
• Typically LSF or SGE.
• Pipeline state and results are stored in a MySQL database.

The workflow is embarrassingly parallel.
• Integer, not floating point.
• 64-bit memory addressing is nice, but not required.
• 64-bit file access is required.
• Single-threaded jobs.
• Very IO intensive.
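A toy illustration of the chunk-and-submit pattern described above (not the actual Ensembl pipeline manager, which is OO Perl): split an input list into chunks and hand each chunk to the DRM, here via LSF's bsub. The chunk size, queue name and analysis command are placeholders.

```python
# Toy sketch of the "split into chunks, submit to a DRM" pattern.
# Not the Ensembl pipeline manager itself; chunk size, queue name and the
# analysis command are placeholders.
import subprocess

def submit_chunks(input_ids, chunk_size=100, queue="normal"):
    """Split input_ids into chunks and submit one LSF job per chunk."""
    for start in range(0, len(input_ids), chunk_size):
        chunk = input_ids[start:start + chunk_size]
        job_name = f"annotate_{start // chunk_size}"
        # Placeholder analysis command; a real pipeline would also record
        # the job and its state in the MySQL pipeline database.
        cmd = ["run_analysis", "--ids", ",".join(chunk)]
        subprocess.run(
            ["bsub", "-q", queue, "-J", job_name,
             "-o", f"{job_name}.out"] + cmd,
            check=True,
        )

# e.g. submit_chunks([f"contig_{i}" for i in range(50_000)])
```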

Page 38: Coates bosc2010 clouds-fluff-and-no-substance

Running the pipeline in practice

Requires a significant amount of domain knowledge.

The software install is complicated.
• Lots of Perl modules and dependencies.

Need a well-tuned compute cluster.
• The pipeline takes ~500 CPU days for a moderate genome.
• Ensembl chewed up 160k CPU days last year.
• Code is IO-bound in a number of places.
• Typically need a high-performance filesystem.
• Lustre, GPFS, Isilon, Ibrix etc.
• Need a large MySQL database.
• 100 GB-TB MySQL instances, with a very high query load generated from the cluster.

Page 39: Coates bosc2010 clouds-fluff-and-no-substance

Why Cloud?

Proof of concept.
• Is HPC even possible on cloud infrastructures?

Coping with the big increase in data.
• Will we be able to provision new machines / datacentre space to keep up?
• What happens if we need to "out-source" our compute?
• Can we be in a position to shift peaks of demand to cloud facilities?

Page 40: Coates bosc2010 clouds-fluff-and-no-substance

Expanding markets

There are going to be lots of new genomes that need annotating.
• Sequencers are moving into small labs and clinical settings.
• Limited informatics / systems experience.
• Typically postdocs / PhD students who have a "real" job to do.
• They may want to run the genebuild pipeline on their data, but they may not have the expertise to do so.

We have already done all the hard work of installing the software and tuning it.
• Can we package up the pipeline and put it in the cloud?

Goal: the end user should simply be able to upload their data, insert their credit-card number, and press "GO".

Page 41: Coates bosc2010 clouds-fluff-and-no-substance

Porting HPC code to the cloud

Let's build a compute cluster in the cloud.

Software stack / machine image.
• Creating images with the software is reasonably straightforward.
• No big surprises.

Queueing system.
• The pipeline requires a queueing system (LSF/SGE).
• Licensing problems.
• Getting them to run took a lot of fiddling.
• Machines need to find each other once they are inside the cloud.
• Building an automated "self-discovering" cluster takes some hacking (sketch below).
• Hopefully others can re-use it.

MySQL databases.
• Lots of best practice on how to do that on EC2.

It took time, even for experienced systems people.
• (You will not be firing your system administrators just yet!)
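One way the "self-discovery" could work, as a hedged sketch: tag cluster members at launch, then have each node query the EC2 API for running instances carrying that tag and feed the resulting host list into the LSF/SGE configuration. The tag key and value are placeholders.

```python
# Hypothetical sketch of self-discovery: list the private IPs of running
# instances tagged as members of this cluster. Tag key/value are
# placeholders; the host list would then be fed into the DRM config.
import boto3

def cluster_members(tag_value="ensembl-pipeline", region="us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:cluster", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    ips = []
    for reservation in response["Reservations"]:
        for instance in reservation["Instances"]:
            ips.append(instance["PrivateIpAddress"])
    return ips

# e.g. write cluster_members() out as the queueing system's host list
# when each node starts up.
```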

Page 42: Coates bosc2010 clouds-fluff-and-no-substance

Did it work? NO!

“High performance computing is not facebook.” -- Chris Dagdigian

The big problem: data.

• Moving data into the cloud is hard.

• Doing stuff with data once it is in the cloud is also hard.

If you look closely, most successful cloud projects have small amounts of data (10-100 Mbytes).

Genomics projects have Tbytes → Pbytes of data.

Page 43: Coates bosc2010 clouds-fluff-and-no-substance

Moving data is hard

Commonly used tools (FTP, ssh/rsync) are not suited to wide-area networks.
• Need to use specialised WAN tools: gridFTP / FDT / Aspera.

There is a lot of broken internet.

Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link):
• Cambridge → EC2 East Coast: 12 Mbytes/s (96 Mbits/s).
• Cambridge → EC2 Dublin: 25 Mbytes/s (200 Mbits/s).
• 11 hours to move 1 TB to Dublin.
• 23 hours to move 1 TB to the East Coast (arithmetic below).
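The transfer times follow directly from size / rate; a quick check of the figures above:

```python
# Transfer time = data size / sustained rate, using the figures above.
TB = 1e12  # bytes

for destination, rate_mb_s in [("EC2 East Coast", 12), ("EC2 Dublin", 25)]:
    seconds = TB / (rate_mb_s * 1e6)
    print(f"1 TB to {destination} at {rate_mb_s} Mbytes/s: "
          f"{seconds / 3600:.0f} hours")

# 1 TB to EC2 East Coast at 12 Mbytes/s: 23 hours
# 1 TB to EC2 Dublin at 25 Mbytes/s: 11 hours
```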

What speed should we get?
• Once we leave JANET (the UK academic network), finding out what the connectivity is and what we should expect is almost impossible.
• Finding out who to talk to when you diagnose a troublesome link is also almost impossible.

Page 44: Coates bosc2010 clouds-fluff-and-no-substance

Networking

"But the physicists do this all the time."
• No they don't.
• The LHC Grid uses dedicated networking between CERN and the Tier-1 centres, who get all of the data.

Can we use this model?
• We have relatively short-lived and fluid collaborations (1-2 years, many institutions).
• As more labs get sequencers, our potential collaborators also increase.
• We need good connectivity to everywhere.

Page 45: Coates bosc2010 clouds-fluff-and-no-substance

Using data within the cloud

Compute nodes need to have fast access to the data.
• We solve this with exotic and temperamental filesystems / storage.

No viable global filesystems on EC2.
• NFS has poor scaling at the best of times.
• EC2 has poor inter-node networking; with > 8 NFS clients, everything stops.

Nasty hacks:
• Subcloud: a commercial product that allows you to run a POSIX filesystem on top of S3.
• Interesting performance, and you are paying by the hour...

Page 46: Coates bosc2010 clouds-fluff-and-no-substance

Compute architecture

[Diagram: traditional HPC architecture (CPUs on a fat network, POSIX global filesystem, batch scheduler, shared data store) vs. cloud-style architecture (CPUs with local storage on a thin network, Hadoop/S3, data store).]

Page 47: Coates bosc2010 clouds-fluff-and-no-substance

Why not S3 / Hadoop / map-reduce?

Not POSIX.
• Lots of code expects files on a filesystem.
• Limitations: cannot store objects > 5 GB.
• Throw-away file formats? (See the contrast sketched below.)
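The gap in a nutshell, as a hedged sketch: a typical pipeline binary wants a path it can open and seek in, whereas S3 hands back whole objects over HTTP, so data has to be staged to local disk (or wrapped in a POSIX emulation layer) first. The bucket, key and file paths below are placeholders.

```python
# Illustration of the POSIX-vs-object-store gap. Bucket, key and paths are
# placeholders. Existing pipeline code does the first thing; S3 requires
# something like the second.
import boto3

# POSIX: the pipeline just opens a path on a shared filesystem.
with open("/lustre/genomes/human/chr1.fa") as fh:
    header = fh.readline()

# S3: the object must be fetched (staged to local disk) before the same
# code can run against it.
s3 = boto3.client("s3")
s3.download_file("example-genome-bucket", "human/chr1.fa", "/tmp/chr1.fa")
with open("/tmp/chr1.fa") as fh:
    header = fh.readline()
```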

Nobody wants to re-write existing applications.
• They already work on our compute farm.
• How do Hadoop apps co-exist with non-Hadoop ones?
• Do we have to have two different types of infrastructure and move data between them?
• The barrier to entry seems much lower for filesystems.

Am I being a reactionary old fart?
• 15 years ago, clusters of PCs were not "real" supercomputers.
• ...then Beowulf took over the world.
• Big difference: porting applications between the two architectures was easy.
• MPI/PVM etc.

Will the market provide “traditional” compute clusters in the cloud?

Page 48: Coates bosc2010 clouds-fluff-and-no-substance

Hype cycle

HPC

Page 49: Coates bosc2010 clouds-fluff-and-no-substance

HPC app summary

You cannot take an existing data-rich HPC app and expect it to work.
• The IO architectures are too different.

There is some re-factoring going on for the Ensembl pipeline to make it EC2-friendly.
• Currently on a case-by-case basis.
• For the less data-intensive parts.

Waiting for the market to deliver...

Page 50: Coates bosc2010 clouds-fluff-and-no-substance

Shared data archives

Page 51: Coates bosc2010 clouds-fluff-and-no-substance

Past Collaborations

[Diagram: several sequencing centres sending data to one sequencing centre acting as the DCC.]

Page 52: Coates bosc2010 clouds-fluff-and-no-substance

Genomics Data

Data size per genome:
• Intensities / raw data: 2 TB
• Alignments: 200 GB
• Sequence + quality data: 500 GB
• Variation data: 1 GB
• Individual features: 3 MB

[Figure also contrasts unstructured data (flat files) with structured data (databases), and sequencing informatics specialists with clinical researchers / non-informaticians as the consumers at each end.]

Page 53: Coates bosc2010 clouds-fluff-and-no-substance

The Problem With Current Archives

Data in current archives is "dark".
• You can put/get data, but cannot compute across it.

Data is all in one place.
• Problematic if you are not the DCC:
• You have to pull the data down to do something with it.
• Holding data in one place is bad for disaster recovery and network access.

Is data in an inaccessible archive really useful?

Page 54: Coates bosc2010 clouds-fluff-and-no-substance

A real example...

"We want to run our pipeline across 100 TB of data currently in EGA/SRA."

We would need to de-stage the data to Sanger and then run the compute.
• Extra 0.5 PB of storage, 1,000 cores of compute.
• 3-month lead time.
• ~$1.5M capex.
• Download times:
• 46 days at 25 Mbytes/s (best transatlantic link).
• 10 days at 1 Gbit/s (sling a cable across the datacentre to the EBI).

Page 55: Coates bosc2010 clouds-fluff-and-no-substance

An easy problem to solve in PowerPoint:

Put the data into a cloud.
• Big cloud providers already have replicated storage infrastructures.

Upload the workload onto VMs.
• Put the VMs on compute that is "attached" to the data.

[Diagram: VMs placed on compute nodes attached to replicated copies of the data.]

Page 56: Coates bosc2010 clouds-fluff-and-no-substance

Practical Hurdles

How do you expose the data?
• Flat files? A database?

How do you make the compute efficient?
• The cloud IO problems are still there.
• And you make the end user pay for them.

How do we deal with controlled access?
• A hard problem. Grid / delegated security mechanisms are complicated for a reason (one simple building block is sketched below).
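Not a solution to delegated, grid-style security, but one building block that commercial object stores already offer is time-limited, signed access to individual objects. A hedged sketch; the bucket and key names are placeholders.

```python
# Hypothetical sketch: granting a collaborator time-limited access to a
# single controlled-access object via a pre-signed URL. Bucket and key are
# placeholders; this is not a substitute for proper delegated security.
import boto3

s3 = boto3.client("s3")

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-controlled-archive",
            "Key": "study123/NA12878.bam"},
    ExpiresIn=24 * 3600,   # link valid for 24 hours
)
print(url)  # hand this to the approved collaborator
```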

Page 57: Coates bosc2010 clouds-fluff-and-no-substance

Whose Cloud?

Most of us are funded to hold data, not to fund everyone else's compute costs too.
• We would now need to budget for raw compute power as well as disk.
• Implement virtualisation infrastructure, billing etc.
• Are you legally allowed to charge?
• Who underwrites it if nobody actually uses your service?

This strongly implies the data has to be held with a commercial provider.

Page 58: Coates bosc2010 clouds-fluff-and-no-substance

Can it solve our networking problems?

Moving data across the internet is hard.
• Fixing the internet is not going to be cost-effective for us.

Fixing the internet may be cost-effective for big cloud providers.
• It is core to their business model.
• All we need to do is get data into Amazon, and then everyone else can get the data from there.

Do we invest in fast links to Amazon?
• It changes the business dynamic.
• We have effectively tied ourselves to a single provider.

Page 59: Coates bosc2010 clouds-fluff-and-no-substance

Where are we?

Computable archives

Page 60: Coates bosc2010 clouds-fluff-and-no-substance

Summary

Cloud works well for web services.

Data rich HPC workloads are still hard.

Cloud based data archives look really interesting.

Page 61: Coates bosc2010 clouds-fluff-and-no-substance

Acknowledgements

Phil Butcher

ISG Team
• James Beal
• Gen-Tao Chiang
• Pete Clapham
• Simon Kelley

Ensembl
• Steve Searle
• Jan-Hinnerk Vogel
• Bronwen Aken
• Glenn Proctor
• Stephen Keenan

Cancer Genome Project
• Adam Butler
• John Teague