chris-dagdigian
Infrastructure cloud platforms such as those offered by Amazon Web Services are not designed and built with scientific research as the primary use case. These presentation slides cover the current state of mapping life science research and HPC technique onto “the cloud” and how to work around the common engineering, orchestration and data movement problems. [Note: I've replaced the 2011 version of this talk deck with a slightly updated version as delivered at the AIRI Petabyte Challenge Meeting]
I’m Chris.
I’m an infrastructure geek.
I work for the BioTeam.
The “C” Word.
When I say “cloud” I’m talking IaaS.
Amazon AWS is the IaaS cloud.
Most others are fooling themselves. (Has-beens, also-rans & delusional marketing zombies)
A message for thepretenders…
No APIs?Not a cloud.
No self-service?Not a cloud.
I have to email a human?
Not a cloud.
~50% failure rate when provisioning new servers?
Stupid cloud.
Block storage and virtual servers only? (Barely) a cloud.
Private Clouds: My $.02
Private Clouds in 2012:
• Hype vs. Reality ratio still wacky
• Sensible only for certain shops
• Have you seen what you have to do to your networks & gear?
• There are easier ways
Private Clouds: My Advice for ‘12
• Remain cynical (test vendor claims)
• Due Diligence still essential
• I personally would not deploy/buy anything that does not explicitly provide Amazon API compatibility
Private Clouds: My Advice for ‘12, continued
• Most people are better off:
• Adding VM platforms to existing HPC clusters & environments
• Extending enterprise VM platforms to allow user self-service & server catalogs
Enough Bloviating. Advice time.
Tip #1
HPC & Clouds: Whole New World
• We have spent decades learning to tune research HPC systems for shared access & many users.
• The cloud upends this model
• Far more common to see …
• Dedicated cloud resources spun up for each app or use case
• Each system gets individually tuned & optimized
Tip #2
Hybrid Clouds & Cloud Bursting
• Lots of aggressive marketing
• Lots of carefully constructed “case studies” and prototypes
• The truth?
• Less usable than you’ve been told
• Possible? Heck yeah.
• Practical? Only sometimes.
• Advice
• Be cynical
• Demand proof
• Test carefully
• Still want to do it?
• Buy it, don’t build it
• Cycle Computing
• Univa
• Bright Computing
• …
• Follow the crowd
• In the real world we see:
• Separation between local and cloud HPC resources
• Send your work to the system most suitable
Tip #3
You can’t rewrite EVERYTHING.
• Salesfolk will just glibly tell you to rewrite your apps so you can use whatever big data analysis framework they happen to be selling today
• They have no clue.
• In life science informatics we have hundreds of codes that will never be rewritten.
• We’ll be needing them for years to come.
• Advice:
• MapReduce-ish methods are the future for big-data informatics
• It will take years to get there
• We still have to deal with legacy algorithms and codes
• You will need:
• A process for figuring out when it’s worthwhile to rewrite/re-architect
• Tested cloud strategies for handling three use cases
You need 3 cloud architectures:
1. Legacy HPC
2. “Cloudy” HPC
3. Big Data HPC (Hadoop)
Legacy HPC on the cloud
• MIT StarCluster
• http://web.mit.edu/star/cluster/
• This is your baseline
• Extend as needed
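A minimal StarCluster config sketch, to show how little is needed for the baseline. Section and field names follow the StarCluster docs; the credentials, key name, AMI ID and instance type below are placeholders you would fill in for your own account:

```ini
# ~/.starcluster/config -- minimal sketch; all IDs, keys and paths are placeholders
[global]
DEFAULT_TEMPLATE = smallcluster

[aws info]
AWS_ACCESS_KEY_ID = YOUR_KEY_ID
AWS_SECRET_ACCESS_KEY = YOUR_SECRET

[key mykey]
KEY_LOCATION = ~/.ssh/mykey.rsa

[cluster smallcluster]
KEYNAME = mykey
CLUSTER_SIZE = 4
NODE_IMAGE_ID = ami-xxxxxxxx
NODE_INSTANCE_TYPE = c1.xlarge
```

With that in place, `starcluster start mycluster` brings up a ready-to-use cluster; extend the template as your workflows demand.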
“Cloudy” HPC
• Use this method when …
• It makes sense to rewrite or rearchitect an HPC workflow to better leverage modern cloud capabilities
“Cloudy” HPC, continued
• Ditch the legacy compute farm model
• Leverage elastic scale-out tools (***)
• Spot Instances for elastic & cheap compute
• SimpleDB for job statekeeping
• SQS for job queues & workflow “glue”
• SNS for message passing & monitoring
• S3 for input & output data
• Etc.
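To make the component roles concrete, here is a pure-Python sketch of the kind of self-describing job message a submitter might drop on an SQS queue. The schema, field names and S3 paths are invented for illustration; they are not an AWS API:

```python
import json
import uuid

def make_job_message(input_key, output_prefix, tool, params):
    """Build a self-describing job message for a work queue.

    Illustrative schema: a worker pulls one message, fetches input
    from S3, runs the tool, writes results under output_prefix, and
    records job state (e.g. in SimpleDB) keyed by job_id.
    """
    return json.dumps({
        "job_id": str(uuid.uuid4()),
        "tool": tool,
        "params": params,
        "input_s3_key": input_key,
        "output_s3_prefix": output_prefix,
    })

# Hypothetical bucket/paths for illustration only
msg = make_job_message("s3://mybucket/reads/sample1.fastq",
                       "s3://mybucket/results/sample1/",
                       "blastn", {"evalue": "1e-5"})
job = json.loads(msg)
```

Because every message is self-contained, any worker can process any job, which is what lets the compute tier scale up and down freely.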
Big Data HPC
• It’s gonna be a MapReduce world
• Little need to roll your own
• Ecosystem already healthy
• Multiple providers today
• Often a slam-dunk cloud use case
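The paradigm itself fits in a few lines. This pure-Python toy word count shows the map → shuffle → reduce contract that Hadoop and the hosted MapReduce services implement at scale (the in-memory shuffle here stands in for what the framework does across a cluster):

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Mapper: emit a (key, 1) pair per word in the input record
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: collapse each key's values into one result
    return key, sum(values)

records = ["gattaca gattaca", "tagc gattaca"]
mapped = chain.from_iterable(map_phase(r) for r in records)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# counts == {"gattaca": 3, "tagc": 1}
```

The point of the contract: mappers and reducers are independent and stateless, so the framework can fan them out across as many nodes as you rent.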
Tip #4
The Cloud was not designed for “us”
• HPC is an edge case for the hyperscale IaaS clouds
• We need to deal with this and engineer around it.
• Many examples
• Eventual consistency
• Networking & subnets
• Latency
• Node placement
• Advice
• Manage expectations
• Benchmark & test
• Evangelize
• (pester the cloud sales reps …)
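Engineering around eventual consistency and transient faults usually starts with retries plus exponential backoff. A minimal sketch; the `flaky_read` function simulates an eventually-consistent read that is not yet visible, and in real cloud code the retriable set would be provider error types (throttling, 5xx, not-found-yet):

```python
import time

def retry(fn, attempts=5, base_delay=0.5, retriable=(IOError,)):
    """Call fn(), retrying transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Simulate an eventually-consistent read: fails twice, then succeeds
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("object not visible yet")
    return "data"

result = retry(flaky_read, base_delay=0.01)
```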
Tip #5
Data Movement Is Still Hard
• Consistently getting easier
• Amazon is not a bottleneck
• AWS Import/Export
• AWS Direct Connect
• Aspera has some amazing stuff out right now
• Advice
• AWS Import/Export works well
• Size of pipe is not everything
• Sweat the small stuff
• Tracking, checksums, disk speed
• Dedicated workstations
• Secure media storage
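“Sweat the small stuff” in practice means checksumming data before and after it moves. A chunked-hash sketch that never loads a whole file into memory, so it works at terabyte scale:

```python
import hashlib

def file_md5(path, chunk_size=4 * 1024 * 1024):
    """Stream a file through MD5 in fixed-size chunks.

    Record the digest before shipping, recompute on arrival, and
    compare to catch silent corruption in transit.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Quick self-check on a small temporary file
import tempfile, os
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"ACGT" * 1000)
tmp.close()
checksum = file_md5(tmp.name)
os.unlink(tmp.name)
```

For non-multipart S3 uploads the object’s ETag happens to be its MD5, which makes spot-checks on the Amazon side cheap; multipart uploads need their own scheme.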
Dedicated data movement station
‘naked’ Terabyte-scale data movement
Don’t overlook media storage …
• Advice for 2012
• BioTeam is dialing down our advocacy of physical data ingestion into the cloud
• Why?
• Operationally hard, expensive and no longer strictly needed
Real world cross-country internet-based data movement
March 2012
700Mb/sec into Amazon, stress-free & zero tuning
• People trying to move data via physical media quickly realize the operational difficulties
• Bandwidth is cheaper than hiring another body to manage physical data ingestion & movement
• In 2012 we strongly recommend network-based data movement when at all possible
u r doing it wrong
cool data movement, bro!
Tips #6 & 7
Cloud storage. Still slow.
Big shared storage. Still hard.
• Not much we can do except engineer around it
• AWS compute cluster instances are a huge step forward
• AWS competitors take note
• We are not database nerds
• We care about more than just random IO performance
• We need it all
• Random I/O
• Long sequential read/write
• Faster Storage Options
• Software RAID on EBS
• Various GlusterFS options
• Even if you optimize everything, the virtual NICs are still a bottleneck
• Big Shared Storage
• 10GbE nodes and NFS
• Software RAID sets
• GlusterFS or similar
• 2012: pNFS finally?
Tip #8
Things fail differently in the cloud.
• Stuff breaks
• It breaks in weird ways
• Transient/temporary issues more common than what we see “at home”
• Advice
• Pessimism is good
• Design for failure
• Think hard about
• How will you detect?
• How will you respond?
• Advice
• Remove humans from the loop
• Automate recovery
• Automate your backups
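“Automate your backups” includes automating the cleanup. A sketch of pure retention-policy logic; the function and field names are invented for illustration, and the returned list would drive the actual delete calls (e.g. EBS DeleteSnapshot):

```python
from datetime import datetime, timedelta

def snapshots_to_delete(snapshots, keep_days=14, now=None):
    """Given (snapshot_id, created_at) pairs, return the ids that
    fall outside the retention window. Decision logic only; no
    cloud API calls are made here."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=keep_days)
    return [sid for sid, created in snapshots if created < cutoff]

# Illustrative data: one stale snapshot, one recent one
now = datetime(2012, 3, 15)
snaps = [("snap-old", datetime(2012, 2, 1)),
         ("snap-new", datetime(2012, 3, 10))]
stale = snapshots_to_delete(snaps, keep_days=14, now=now)
# stale == ["snap-old"]
```

Keeping the policy pure makes it trivial to test without touching the cloud, which is exactly the kind of “remove humans from the loop” automation you want to trust.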
Tip #9
Serial/batch computing at-scale
• Loosely coupled workflows are ideal
• Break the pipeline into discrete components
• Components should be able to scale up|down independently
• Component = Opportunity to:
• … Make a scaling decision
• (# nodes in use)
• … Make a sizing decision
• (instance type in use)
Nirvana is …
… independent loosely connected components that can self-scale and communicate asynchronously
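The per-component scaling decision can be as simple as a pure function from queue depth to node count. A sketch with invented parameter names (the throughput and budget numbers are placeholders, not measured values):

```python
import math

def desired_nodes(queue_depth, jobs_per_node_hour, target_hours,
                  min_nodes=0, max_nodes=64):
    """Translate pending work into a node count: enough nodes to
    drain queue_depth jobs within target_hours, clamped between a
    floor and a budget ceiling."""
    if queue_depth == 0:
        return min_nodes
    needed = math.ceil(queue_depth / (jobs_per_node_hour * target_hours))
    return max(min_nodes, min(max_nodes, needed))

# 500 queued jobs, each node clears 10 jobs/hour, drain within 2 hours
n = desired_nodes(500, jobs_per_node_hour=10, target_hours=2)
# n == 25
```

Each component runs its own copy of this decision against its own queue, which is what lets the pieces scale up and down independently.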
Advice:
• Many people already doing this
• Best practices are well known
• Steal from the best:
• RightScale, Opscode & Cycle Computing
Phew. Think I’m done now.
End;
Backup Slides
Private Clouds: Pick Your Poison
• OpenStack - http://openstack.org
• Pro: Super smart developers; significant mindshare; True Open Source
• Con: Commitment to AWS API compatibility (?) & stability
Private Clouds: Pick Your Poison
• CloudStack - http://cloudstack.org
• Pro: Explicit AWS API support; very recent move away from “open-core” model; usability
• Con: Developer mindshare? Sudden switch to Apache
Private Clouds: Pick Your Poison
• Eucalyptus - http://eucalyptus.com
• Pro: Direct AWS API compatibility; lots of hypervisor support
• Con: Open-core model; mindshare; recent resurrection