Transcript
Page 1: Mapping Life Science Informatics to the Cloud

Mapping Informatics To the Cloud

2012 AIRI Petabyte Challenge

Chris [email protected]

Page 2: Mapping Life Science Informatics to the Cloud

I’m Chris.

I’m an infrastructure geek.

I work for the BioTeam.

Page 3: Mapping Life Science Informatics to the Cloud

The “C” Word.

Page 4: Mapping Life Science Informatics to the Cloud

When I say “cloud” I’m talking IaaS.

Page 5: Mapping Life Science Informatics to the Cloud

Amazon AWS is the IaaS cloud.

Most others are fooling themselves.

(Has-beens, also-rans & delusional marketing zombies)

Page 6: Mapping Life Science Informatics to the Cloud

A message for the pretenders…

Page 7: Mapping Life Science Informatics to the Cloud

No APIs? Not a cloud.

Page 8: Mapping Life Science Informatics to the Cloud

No self-service? Not a cloud.

Page 9: Mapping Life Science Informatics to the Cloud

I have to email a human?

Not a cloud.

Page 10: Mapping Life Science Informatics to the Cloud

~50% failure rate when provisioning new servers?

Stupid cloud.

Page 11: Mapping Life Science Informatics to the Cloud

Block storage and virtual servers only?

(Barely) a cloud.

Page 12: Mapping Life Science Informatics to the Cloud

Private Clouds: My $.02

Page 13: Mapping Life Science Informatics to the Cloud

Private Clouds in 2012:

• Hype vs. Reality ratio still wacky

• Sensible only for certain shops

• Have you seen what you have to do to your networks & gear?

• There are easier ways

Page 14: Mapping Life Science Informatics to the Cloud

Private Clouds: My Advice for ‘12

• Remain cynical (test vendor claims)

• Due diligence still essential

• I personally would not deploy/buy anything that does not explicitly provide Amazon API compatibility

Page 15: Mapping Life Science Informatics to the Cloud

Private Clouds: My Advice for ‘12

• Most people are better off:

• Adding VM platforms to existing HPC clusters & environments

• Extending enterprise VM platforms to allow user self-service & server catalogs

Page 16: Mapping Life Science Informatics to the Cloud

Enough Bloviating. Advice time.

Page 17: Mapping Life Science Informatics to the Cloud

Tip #1

Page 18: Mapping Life Science Informatics to the Cloud

HPC & Clouds: Whole New World

Page 19: Mapping Life Science Informatics to the Cloud

• We have spent decades learning to tune research HPC systems for shared access & many users.

• The cloud upends this model

Page 20: Mapping Life Science Informatics to the Cloud

• Far more common to see …

• Dedicated cloud resources spun up for each app or use case

• Each system gets individually tuned & optimized

Page 21: Mapping Life Science Informatics to the Cloud

Tip #2

Page 22: Mapping Life Science Informatics to the Cloud

Hybrid Clouds & Cloud Bursting

Page 23: Mapping Life Science Informatics to the Cloud

• Lots of aggressive marketing

• Lots of carefully constructed “case studies” and prototypes

• The truth?

• Less usable than you’ve been told

• Possible? Heck yeah.

• Practical? Only sometimes.

Page 24: Mapping Life Science Informatics to the Cloud

• Advice

• Be cynical

• Demand proof

• Test carefully

Page 25: Mapping Life Science Informatics to the Cloud

• Still want to do it?

• Buy it, don’t build it

• Cycle Computing

• Univa

• Bright Computing

• …

Page 26: Mapping Life Science Informatics to the Cloud

• Follow the crowd

• In the real world we see:

• Separation between local and cloud HPC resources

• Send your work to the system most suitable

Page 27: Mapping Life Science Informatics to the Cloud

Tip #3

Page 28: Mapping Life Science Informatics to the Cloud

You can’t rewrite EVERYTHING.

Page 29: Mapping Life Science Informatics to the Cloud

• Salesfolk will just glibly tell you to rewrite your apps so you can use whatever big data analysis framework they happen to be selling today

Page 30: Mapping Life Science Informatics to the Cloud

• They have no clue.

Page 31: Mapping Life Science Informatics to the Cloud

• In life science informatics we have hundreds of codes that will never be rewritten.

• We’ll be needing them for years to come.

Page 32: Mapping Life Science Informatics to the Cloud

• Advice:

• MapReduce-ish methods are the future for big-data informatics

• It will take years to get there

• We still have to deal with legacy algorithms and codes

Page 33: Mapping Life Science Informatics to the Cloud

• You will need:

• A process for figuring out when it’s worthwhile to rewrite/re-architect

• Tested cloud strategies for handling three use cases

Page 34: Mapping Life Science Informatics to the Cloud

You need 3 cloud architectures:

1. Legacy HPC

2. “Cloudy” HPC

3. Big Data HPC (Hadoop)

Page 35: Mapping Life Science Informatics to the Cloud

Legacy HPC on the cloud

• MIT StarCluster

• http://web.mit.edu/star/cluster/

• This is your baseline

• Extend as needed

Page 36: Mapping Life Science Informatics to the Cloud

“Cloudy” HPC

• Use this method when …

• It makes sense to rewrite or rearchitect an HPC workflow to better leverage modern cloud capabilities

Page 37: Mapping Life Science Informatics to the Cloud

“Cloudy” HPC, continued

• Ditch the legacy compute farm model

• Leverage elastic scale-out tools (***)

• Spot Instances for elastic & cheap compute

• SimpleDB for job statekeeping

• SQS for job queues & workflow “glue”

• SNS for message passing & monitoring

• S3 for input & output data

• Etc.
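
The division of labor in the list above can be sketched without touching AWS at all. This is a minimal illustration under loud assumptions: a `queue.Queue` stands in for an SQS queue, and plain dicts stand in for SimpleDB (job state) and S3 (input and output data). The names `run_worker`, `state` and `store` are hypothetical and not part of any AWS API.

```python
import queue


def run_worker(jobs, state, store, process):
    """Drain the job queue, recording a status for every job.

    jobs    : queue.Queue of job ids (stand-in for an SQS queue)
    state   : dict of job id -> status (stand-in for SimpleDB)
    store   : dict of key -> data (stand-in for S3)
    process : function applied to each job's input data
    """
    while True:
        try:
            job_id = jobs.get_nowait()
        except queue.Empty:
            return  # queue drained; an elastic worker could shut down here
        state[job_id] = "running"
        try:
            store["output/" + job_id] = process(store["input/" + job_id])
            state[job_id] = "done"
        except Exception:
            # Keep a record of the failure so the job can be retried later.
            state[job_id] = "failed"


jobs = queue.Queue()
state = {}
store = {"input/a": "acgt", "input/b": "ttga"}  # no input exists for "missing"
for job_id in ("a", "b", "missing"):
    jobs.put(job_id)
    state[job_id] = "queued"

run_worker(jobs, state, store, str.upper)
```

Because workers share nothing except the queue, the state table and the object store, any number of them can be started or stopped (for example on Spot Instances) without coordinating with each other.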

Page 38: Mapping Life Science Informatics to the Cloud

Big Data HPC

• It’s gonna be a MapReduce world

• Little need to roll your own

• Ecosystem already healthy

• Multiple providers today

• Often a slam-dunk cloud use case

Page 39: Mapping Life Science Informatics to the Cloud

Tip #4

Page 40: Mapping Life Science Informatics to the Cloud

The Cloud was not designed for “us”

Page 41: Mapping Life Science Informatics to the Cloud

• HPC is an edge case for the hyperscale IaaS clouds

• We need to deal with this and engineer around it.

Page 42: Mapping Life Science Informatics to the Cloud

• Many examples

• Eventual consistency

• Networking & subnets

• Latency

• Node placement

Page 43: Mapping Life Science Informatics to the Cloud

• Advice

• Manage expectations

• Benchmark & test

• Evangelize

• (pester the cloud sales reps …)

Page 44: Mapping Life Science Informatics to the Cloud

Tip #5

Page 45: Mapping Life Science Informatics to the Cloud

Data Movement Is Still Hard

Page 46: Mapping Life Science Informatics to the Cloud

• Consistently getting easier

• Amazon is not a bottleneck

• AWS Import/Export

• AWS Direct Connect

• Aspera has some amazing stuff out right now

Page 47: Mapping Life Science Informatics to the Cloud

• Advice

• AWS Import/Export works well

• Size of pipe is not everything

• Sweat the small stuff

• Tracking, checksums, disk speed

• Dedicated workstations

• Secure media storage
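
The “tracking, checksums” item can be made concrete with a small manifest check. This is a sketch rather than a prescribed tool: it assumes SHA-256 is an acceptable checksum, and the helper names (`sha256_of`, `build_manifest`, `changed_files`) and the sample file name are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to directory) to its checksum."""
    manifest = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            manifest[os.path.relpath(path, directory)] = sha256_of(path)
    return manifest


def changed_files(before, after):
    """List files that are missing or differ in the second manifest."""
    return sorted(name for name in before if after.get(name) != before[name])


# Demo: build a manifest, damage the file, then detect the difference.
with tempfile.TemporaryDirectory() as source:
    sample = os.path.join(source, "reads.fastq")
    with open(sample, "wb") as handle:
        handle.write(b"ACGT\n")
    before = build_manifest(source)
    with open(sample, "ab") as handle:
        handle.write(b"corruption")
    after = build_manifest(source)

damaged = changed_files(before, after)
```

In practice the first manifest would be built before the data leaves the building and the second after it arrives, whether it travels by disk or by network.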

Page 48: Mapping Life Science Informatics to the Cloud

Dedicated data movement station

Page 49: Mapping Life Science Informatics to the Cloud

‘naked’ Terabyte-scale data movement

Page 50: Mapping Life Science Informatics to the Cloud

Don’t overlook media storage …

Page 51: Mapping Life Science Informatics to the Cloud

• Advice for 2012

• BioTeam is dialing down our advocacy of physical data ingestion into the cloud

• Why?

• Operationally hard, expensive and no longer strictly needed

Page 52: Mapping Life Science Informatics to the Cloud

Real world cross-country internet-based data movement

March 2012

Page 53: Mapping Life Science Informatics to the Cloud

700Mb/sec into Amazon, stress-free & zero tuning

March 2012
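
To put that rate in perspective, here is a back-of-the-envelope calculation. It assumes “Mb” means megabits, that the rate is sustained for the whole transfer, and that a terabyte is 10^12 bytes; the function name `transfer_hours` is hypothetical.

```python
def transfer_hours(terabytes, megabits_per_second):
    """Hours needed to move the given volume at a sustained network rate."""
    bits = terabytes * 1e12 * 8                   # decimal terabytes to bits
    seconds = bits / (megabits_per_second * 1e6)  # megabits/s to bits/s
    return seconds / 3600


hours_per_tb = transfer_hours(1, 700)   # a little over 3 hours
hours_for_50_tb = transfer_hours(50, 700)
```

Under those assumptions a terabyte moves in a little over three hours, which is the kind of number to weigh against the staff time needed to handle physical media.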

Page 54: Mapping Life Science Informatics to the Cloud

• People trying to move data via physical media quickly realize the operational difficulties

• Bandwidth is cheaper than hiring another body to manage physical data ingestion & movement

• In 2012 we strongly recommend network-based data movement when at all possible

Page 55: Mapping Life Science Informatics to the Cloud

u r doing it wrong

Page 56: Mapping Life Science Informatics to the Cloud

cool data movement, bro!

Page 57: Mapping Life Science Informatics to the Cloud

Tips #6 & 7

Page 58: Mapping Life Science Informatics to the Cloud

Cloud storage. Still slow.

Page 59: Mapping Life Science Informatics to the Cloud

Big shared storage. Still hard.

Page 60: Mapping Life Science Informatics to the Cloud

• Not much we can do except engineer around it

• AWS compute cluster instances are a huge step forward

• AWS competitors take note

Page 61: Mapping Life Science Informatics to the Cloud

• We are not database nerds

• We care about more than just random IO performance

• We need it all

• Random I/O

• Long sequential read/write

Page 62: Mapping Life Science Informatics to the Cloud

• Faster Storage Options

• Software RAID on EBS

• Various GlusterFS options

• Even if you optimize everything, the virtual NICs are still a bottleneck

Page 63: Mapping Life Science Informatics to the Cloud

• Big Shared Storage

• 10GbE nodes and NFS

• Software RAID sets

• GlusterFS or similar

• 2012: pNFS finally?

Page 64: Mapping Life Science Informatics to the Cloud

Tip #8

Page 65: Mapping Life Science Informatics to the Cloud

Things fail differently in the cloud.

Page 66: Mapping Life Science Informatics to the Cloud

• Stuff breaks

• It breaks in weird ways

• Transient/temporary issues more common than what we see “at home”

Page 67: Mapping Life Science Informatics to the Cloud

• Advice

• Pessimism is good

• Design for failure

• Think hard about:

• How will you detect?

• How will you respond?

Page 68: Mapping Life Science Informatics to the Cloud

• Advice

• Remove humans from loop

• Automate recovery

• Automate your backups
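
One common way to automate recovery from transient failures is to retry with exponential backoff. The sketch below is an assumption about how that might look, not something taken from the slides; `call_with_retries` and `flaky` are hypothetical names, and the `sleep` parameter exists only so the demo can run without actually waiting.

```python
import time


def call_with_retries(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying on any exception with doubling delays."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure for alerting
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky():
    """Stand-in for a cloud API call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []  # record the requested delays instead of sleeping
result = call_with_retries(flaky, sleep=delays.append)
```

The final `raise` matters as much as the retries: a failure that survives every attempt is no longer transient and should reach whatever does the detecting and responding.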

Page 69: Mapping Life Science Informatics to the Cloud

Tip #9

Page 70: Mapping Life Science Informatics to the Cloud

Serial/batch computing at-scale

Page 71: Mapping Life Science Informatics to the Cloud

• Loosely coupled workflows are ideal

• Break the pipeline into discrete components

• Components should be able to scale up|down independently

Page 72: Mapping Life Science Informatics to the Cloud

• Component = Opportunity to:

• … Make a scaling decision (# nodes in use)

• … Make a sizing decision (instance type in use)
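
A per-component scaling decision can be as simple as turning queue depth into a node count. The following is a minimal sketch under assumed numbers: the stage names, the ten-jobs-per-node figure and the node limits are all hypothetical.

```python
import math


def nodes_needed(queued_jobs, jobs_per_node, min_nodes=0, max_nodes=20):
    """Pick a node count for one component from its current backlog."""
    wanted = math.ceil(queued_jobs / jobs_per_node)
    return max(min_nodes, min(max_nodes, wanted))


# Each pipeline stage has its own backlog, so each gets its own answer.
backlog = {"align": 95, "sort": 10, "report": 0}
plan = {stage: nodes_needed(depth, jobs_per_node=10)
        for stage, depth in backlog.items()}
```

The sizing decision (which instance type to use) would be a second, independent function per component; it is left out here because it depends on benchmarks the slides do not provide.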

Page 73: Mapping Life Science Informatics to the Cloud

Nirvana is …

Page 74: Mapping Life Science Informatics to the Cloud

… independent loosely connected components that can self-scale and communicate asynchronously

Page 75: Mapping Life Science Informatics to the Cloud

Advice:

• Many people already doing this

• Best practices are well known

• Steal from the best:

• RightScale, Opscode & Cycle Computing

Page 76: Mapping Life Science Informatics to the Cloud

Phew. Think I’m done now.

Page 77: Mapping Life Science Informatics to the Cloud

Questions?

Slides available at http://slideshare.net/chrisdag/

Page 78: Mapping Life Science Informatics to the Cloud

End;

Page 79: Mapping Life Science Informatics to the Cloud

Backup Slides

Page 80: Mapping Life Science Informatics to the Cloud

Private Clouds: Pick Your Poison

• OpenStack - http://openstack.org

• Pro: Super smart developers; significant mindshare; true open source

• Con: Commitment to AWS API compatibility (?) & stability

Page 81: Mapping Life Science Informatics to the Cloud

Private Clouds: Pick Your Poison

• CloudStack - http://cloudstack.org

• Pro: Explicit AWS API support; very recent move away from “open-core” model; usability

• Con: Developer mindshare? Sudden switch to Apache

Page 82: Mapping Life Science Informatics to the Cloud

Private Clouds: Pick Your Poison

• Eucalyptus - http://eucalyptus.com

• Pro: Direct AWS API compatibility; lots of hypervisor support

• Con: Open-core model; mindshare; recent resurrection

