Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Best Practices & Lessons Learned: Life Science Informatics & The Cloud

Tuesday, May 28, 13


I’m Chris.

I’m an infrastructure geek.

I work for the BioTeam.

Twitter: @chris_dag

Who, what & why

BioTeam

‣ Independent consulting shop

‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done

‣ 12+ years bridging the “gap” between science, IT & high performance computing

‣ www.bioteam.net

Seriously.

Listen to me at your own risk

‣ Clever people find multiple solutions to common issues

‣ I’m fairly blunt, burnt-out and cynical in my advanced age

‣ Significant portion of my work has been done in demanding production Biotech & Pharma environments

‣ Filter my words accordingly

Other 2013 Presentations ...

Bio-IT World Boston: “Multi-Tenant Research Clusters”

Bio-IT World Boston: “HPC Trends from the trenches.”

http://slideshare.net/chrisdag/

‣ Intro

‣ Meta: Why Cloud?

‣ What the sales & marketing folks won’t tell you

‣ Getting Practical

‣ HPC Case Study

The big picture

Why we need IaaS clouds ...

Why life science needs infrastructure clouds

Big Picture

‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed

• Example: CCD sensor upgrade on that confocal microscopy rig just doubled your storage requirements

• Example: That 2D ultrasound imager is now a 3D imager

• Example: Illumina HiSeq upgrade just doubled the rate at which you can acquire genomes. Massive downstream increase in storage, compute & data movement needs


The Central Problem Is ...

‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure

• The science is changing month-to-month ...

• ... while our IT infrastructure only gets refreshed every 2-7 years

‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)


The Central Problem Is ...

‣ The easy period is over

‣ 5 years ago you could toss inexpensive storage and servers at the problem; even in a nearby closet or under a lab bench if necessary

‣ That does not work any more; real solutions required

And a related problem ...

‣ It has never been easier to acquire vast amounts of data cheaply and easily

‣ Growth rate of data creation/ingest exceeds rate at which the storage industry is improving disk capacity

‣ Not just a storage lifecycle problem. This data *moves* and often needs to be shared among multiple entities and providers

• ... ideally without punching holes in your firewall or consuming all available internet bandwidth
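A rough sketch of how the gap between data growth and disk capacity compounds. The two growth rates below are illustrative assumptions and do not come from this talk: data volume doubling every year, and per-drive capacity improving 30% a year. Swap in your own measured rates.

```python
# Hypothetical rates for illustration only; replace with your own measurements.
DATA_GROWTH_PER_YEAR = 2.0   # assumption: stored data doubles every year
DISK_GROWTH_PER_YEAR = 1.3   # assumption: per-drive capacity grows 30% a year


def relative_drive_count(years: int,
                         data_growth: float = DATA_GROWTH_PER_YEAR,
                         disk_growth: float = DISK_GROWTH_PER_YEAR) -> float:
    """Return how many times more drives are needed after `years`,
    relative to today, when data grows faster than drive capacity."""
    return (data_growth / disk_growth) ** years


if __name__ == "__main__":
    for year in range(6):
        print(f"year {year}: {relative_drive_count(year):.1f}x today's drive count")
```

Under these assumed rates the drive count, and with it the power, space and staff time behind it, keeps multiplying even though each individual drive gets bigger.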


If you get it wrong ...

‣ Lost opportunity

‣ Missing capability

‣ Frustrated & very vocal scientific staff

‣ Problems in recruiting, retention, publication & product development

IaaS to the Rescue

IaaS solves the current critical “Research IT” dilemma


Why Cloud?

‣ IaaS clouds let us react and respond to scientific requirements that change far faster than we can refresh local datacenters and enterprise IT platforms

Image: shanelin via Flickr

http://www.flickr.com/photos/shanelin/4458565733/in/photostream/

Beyond capability and agility gains ...


Why Cloud?

‣ The economic benefits are real, inescapable and trending in the proper direction

‣ Internet-scale providers with millions of cores and exabytes of spinning disk spanning the globe leverage operational efficiencies you will never come close to matching internally

‣ ... be suspicious of people who claim otherwise

Also ...

Why Cloud?

‣ Clouds becoming a natural place for data exchange & access

‣ “scriptable everything” enables entirely new capabilities not possible internally*

‣ Finance people love converting CapEx to OpEx



What the salesfolk won’t tell you ...


‣ There is no one-size-fits-all research design pattern ...

‣ You are not going to toss everything and replace it with “Big Data”

‣ Very few of us have a single pipeline or workflow that we can devote endless engineering effort to

‣ We are not going to toss out hundreds of legacy codes and rewrite everything for GPUs or MapReduce

‣ For research HPC it’s all about the building blocks { and how we can effectively use/deploy them }


What the salesfolk won’t tell you

‣ Your organization actually needs THREE tested cloud design patterns:

• (1) To handle ‘legacy’ scientific apps & workflows

• (2) The special stuff that is worth re-architecting

• (3) Hadoop & big data analytics

Legacy HPC on the Cloud

Design Pattern #1 - Legacy

‣ There are many hundreds of existing algorithms and applications in the life science informatics space

‣ We’ll be running/using these codes for years to come

‣ Many can’t or will never be refactored or rewritten

‣ I call this the “legacy” design pattern

One Easy Solution.

StarCluster


Design Pattern #1 - Legacy

‣ MIT StarCluster

• http://web.mit.edu/star/cluster/

‣ Infinite Awesomeness. Worth a talk by itself.

‣ This is your baseline

‣ Extend as needed

Design Pattern #2 - “Cloudy”

‣ Some of our research workflows are important enough to be rewritten for “the cloud” and the advantages that a truly elastic & API-driven infrastructure can deliver

‣ This is where you have the most freedom

‣ Many published best practices you can borrow

‣ Warning: Cloud vendor lock-in potential is strongest here

Design Pattern #3 - Hadoop/BigData

‣ Hadoop and “big data” need to be on your radar

‣ Be careful though, you’ll need a gas mask to avoid the smog of marketing and vapid hype

‣ The utility is real and this does represent one “future path” for analysis of large data sets


‣ It’s going to be a MapReduce world, get used to it

‣ Little need to roll your own Hadoop in 2013

‣ ISV & commercial ecosystem already healthy

‣ Multiple providers today; both onsite & cloud-based

‣ Often a slam-dunk cloud use case

What you need to know



‣ “Hadoop” and “Big Data” are now general terms

‣ You need to drill down to find out what people actually mean

‣ We are still in the period where senior leadership may demand “Hadoop” or “BigData” capability without any actual business or scientific need

Hadoop & “Big Data”

‣ In broad terms you can break “Big Data” down into two very basic use cases:

1. Compute: Hadoop can be used as a very powerful platform for the analysis of very large data sets. The Google search term here is “map reduce”

2. Data Stores: Hadoop is driving the development of very sophisticated “no-SQL”, “non-relational” databases and data query engines. The Google search terms include “nosql”, “couchdb”, “hive”, “pig” & “mongodb”, etc.

‣ Your job is to figure out which type applies for the groups requesting “Hadoop” or “BigData” capability
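For the “Compute” case, the map/shuffle/reduce idea can be sketched in a few lines of plain Python. This is a single-process toy for explaining the pattern to the groups asking for “Hadoop”. It does not use Hadoop itself, and the reads and function names are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read: str, k: int = 3):
    """Map step: emit a (k-mer, 1) pair for every window in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: collapse each key's list of values into one total."""
    return {key: sum(values) for key, values in grouped.items()}


def count_kmers(reads, k: int = 3):
    """Run map, shuffle and reduce over a collection of reads."""
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    toy_reads = ["GATTACA", "ATTAC"]  # made-up reads, not real data
    print(count_kmers(toy_reads))
```

On a real cluster the map calls run next to the data blocks and the shuffle moves pairs across the network, which is why Hadoop cares so much about keeping compute close to data.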

Hadoop vs traditional Linux Clusters

High Throughput Science

‣ Hadoop is a very complex beast

‣ It’s also the way of the future so you can’t ignore it

‣ Very tight dependency on moving the ‘compute’ as close as possible to the ‘data’

‣ Hadoop clusters are just different enough that they do not integrate cleanly with traditional Linux HPC systems

‣ Often treated as a separate silo or punted to the cloud

Hadoop & “Big Data”

‣ Hadoop is being driven by a small group of academics writing and releasing open source life science Hadoop applications

‣ Your people will want to run these codes

‣ In some academic environments you may find people wanting to develop on this platform



Strategy


Practical Advice

‣ Research-oriented IT organizations need a cloud strategy today, or they risk being bypassed by their own employees

Design Patterns

Practical Advice

‣ Remember the three design patterns on the cloud:

• Legacy HPC systems (replicate traditional clusters in the cloud)

• Hadoop

• Cloudy (when you rewrite something to fully leverage cloud capability)

Policies and Procedures

Practical Advice

‣ Cloud technology bits are easy. Cloud Process and Policy discussions take forever

‣ Start these conversations sooner rather than later!

Core services that take time and advance planning

Practical Advice

‣ A few key foundational cloud services take time and advance planning to deploy properly:

‣ VPNs & subnet schemes

‣ Identity Management & Access Control

‣ Data Movement

Data Movement

Practical Advice

‣ A few words & pictures on data movement ...


Physical data movement station 1


Physical data movement station 2


“Naked” Data Movement


“Naked” Data Archive


Cloud Data Movement

‣ Things changed pretty definitively in 2012

‣ And the next image shows why ...

March 2012

Network vs. Physical

Cloud Data Movement

‣ With a 1GbE internet connection ...

‣ and using Aspera software ....

‣ We sustained 700 Mb/sec for more than 7 hours freighting genomes into Amazon Web Services

‣ This is fast enough for many use cases, including genome sequencing core facilities*

‣ Chris Dwan’s webinar on this topic: http://biote.am/7e
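A quick back-of-the-envelope check of what that sustained transfer amounts to. It reads the rate as 700 megabits per second, since a 1GbE link tops out at 1000 Mb/sec. The helper names are illustrative and units are decimal (1 TB = 10^12 bytes).

```python
def bytes_moved(rate_megabits_per_sec: float, hours: float) -> float:
    """Total bytes transferred at a sustained rate, using decimal units."""
    bits = rate_megabits_per_sec * 1_000_000 * hours * 3600
    return bits / 8


def link_utilization(rate_megabits_per_sec: float,
                     link_gigabits_per_sec: float = 1.0) -> float:
    """Fraction of the link's nominal capacity that the transfer consumed."""
    return rate_megabits_per_sec / (link_gigabits_per_sec * 1000)


if __name__ == "__main__":
    total = bytes_moved(700, 7)
    print(f"{total / 1e12:.3f} TB moved in 7 hours")
    print(f"{link_utilization(700):.0%} of a 1GbE link")
```

That works out to roughly 2.2 TB in seven hours at about 70% of nominal line rate, which is the scale at which shipping disks stops being the obvious choice.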

Network vs. Physical

Cloud Data Movement

‣ Results like this mean we now favor network-based data movement over physical media movement

‣ Large-scale physical data movement carries a high operational burden and consumes non-trivial staff time & resources


There are three ways to do network data movement ...

Cloud Data Movement

‣ Buy software from Aspera and be done with it

‣ Attend the annual SuperComputing conference & see which student group wins the bandwidth challenge contest; use their code

‣ Get GridFTP from the Globus folks


SysAdmin vs Programmer


Practical Advice

‣ Recognize the blurring line between IT / Informatics / SW Engineer

‣ ... and how it may mix up your org chart

Very blurry lines in 2013 for all of these roles

Scientist/SysAdmin/Programmer

‣ Radical change in last ~2 years for how IT is provisioned, delivered, managed & supported

‣ Root cause (Technology): Virtualization & Cloud

‣ Root cause (Operations): Configuration Mgmt, Systems Orchestration & Infrastructure Automation

‣ SysAdmins & IT staff need to re-skill and retrain to stay relevant

Scientist/SysAdmin/Programmer

‣ When everything has an API ..

‣ .. anything can be ‘orchestrated’ or ‘automated’ remotely

‣ And by the way ...

‣ The APIs (‘knobs & buttons’) are accessible to all
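As one illustration of “everything has an API”, here is a minimal sketch that requests spot-market workers from a script. It assumes boto3, a present-day AWS SDK for Python that postdates this talk. The function name and every argument value are placeholders, and the client is passed in so the logic can be exercised without touching AWS.

```python
def request_spot_workers(ec2, ami_id, instance_type, count, max_price, key_name):
    """Ask EC2 for `count` spot instances and return the request IDs.

    `ec2` is expected to behave like boto3.client("ec2"). All values are
    supplied by the caller; none are taken from the talk.
    """
    response = ec2.request_spot_instances(
        SpotPrice=str(max_price),  # the API expects the maximum price as a string
        InstanceCount=count,
        Type="one-time",
        LaunchSpecification={
            "ImageId": ami_id,
            "InstanceType": instance_type,
            "KeyName": key_name,
        },
    )
    return [req["SpotInstanceRequestId"]
            for req in response["SpotInstanceRequests"]]


# Example wiring (needs boto3 and AWS credentials; identifiers are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   request_spot_workers(ec2, "ami-xxxxxxxx", "m1.large", 4, 0.10, "my-key")
```

The same few lines work for a sysadmin, a programmer or a scientist with credentials, which is exactly the blurring of roles described here.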


Scientist/SysAdmin/Programmer

‣ IT jobs, roles and responsibilities are undergoing rapid upheaval

‣ SysAdmins must learn to program in order to harness automation tools

‣ Programmers & Scientists can now self-provision and control sophisticated IT resources


Scientist/SysAdmin/Programmer

‣ My take on the future ...

‣ Far more control is going into the hands of the research end user

‣ IT support roles will radically change -- no longer owners or gatekeepers

‣ IT will handle policies, procedures, reference patterns, security & best practices

‣ Researchers will control the “what”, “when” and “how big”

Thanks! Email: [email protected]



Cloud HPC Case Study

Time Permitting ...

Next Generation Nuclear Magnetic Resonance

NMR Probehead Simulation on AWS

‣ CAE Simulation Project

‣ via www.hpcexperiment.com

‣ Software: CST Studio 2012

‣ My role: Volunteer HPC Mentor

Simulating next-generation NMR probeheads

Why this was an interesting project

‣ Frontend interface is graphics heavy and requires Windows

‣ Studio ‘solvers’ run on Linux or Windows; support GPUs and MPI task distribution

‣ Simultaneous use of local and cloud-based solvers actually works

‣ FLEXlm license server involved

‣ Non-trivial security and geo-location requirements

When we ran at modest scale ...

16 large compute nodes + 22 GPU nodes

$30/hour on AWS Spot Market.

HPC on the cloud is real.
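The arithmetic behind that figure, as a small sketch. The node counts and the $30/hour total are from the slide. The blended per-node rate is only an average, and a constant price is an assumption because spot prices fluctuate.

```python
COMPUTE_NODES = 16                 # from the slide
GPU_NODES = 22                     # from the slide
CLUSTER_DOLLARS_PER_HOUR = 30.0    # spot-market total quoted on the slide


def blended_node_hour_cost() -> float:
    """Average cost per node-hour across compute and GPU nodes combined."""
    return CLUSTER_DOLLARS_PER_HOUR / (COMPUTE_NODES + GPU_NODES)


def run_cost(hours: float) -> float:
    """Cost of keeping the whole cluster up, assuming the price holds steady."""
    return CLUSTER_DOLLARS_PER_HOUR * hours


if __name__ == "__main__":
    print(f"blended rate: ${blended_node_hour_cost():.2f} per node-hour")
    print(f"8-hour run:   ${run_cost(8):.2f}")
```

That is under a dollar per node-hour averaged over 38 nodes, GPU nodes included.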

Design Attempt #1

‣ Hybrid Linux/Windows cloud running in AWS EU Region

‣ Failure:

• No GPU nodes in EU at the time

• No cc2.4xlarge at the time

Design Attempt #2

‣ Move Hybrid Linux/Windows system to US-EAST

‣ ... with synthetic test data

‣ Best-practices VPC isolation & VPN access

‣ It looked like this ...

Architecture #2

Design Attempt #2

‣ Attempt #2 Failed:

‣ CST FrontEnd Controller running at end-user site could not tolerate the NAT used by solvers

‣ No GPU nodes available within VPC at that time

Design Attempt #3

‣ Design #3 Finally works

‣ VPC shrunk to single license server running in US EAST

‣ All Windows/Linux/GPU solver nodes running in EU

‣ NO NAT, NO VPC For Solvers

‣ Extensive use of AWS spot instance servers

At experiment end it looked like this ...

Non Trivial HPC on the Cloud

16 large compute nodes + 22 GPU nodes

$30/hour on AWS Spot Market.

Why this work was ‘easy’ on Amazon AWS ...


Nightmare on any other cloud

‣ Let’s discuss why this simulation workload would be much, much harder to do on some other cloud platform ...

Why this work was ‘easy’ on Amazon AWS ...

Nightmare on any other cloud

‘Brand X’ Cloud

1. Virtual Servers

2. Block Storage

3. Object Storage

4. ... and maybe some other stuff if I’m lucky

AWS

‣ EC2, S3, EBS, RDS, SNS, SQS, SWF, GPUs, SSDs, CloudFormation, VPC, ENIs, SecurityGroups, 10GbE DirectConnect, Reserved Instances, ImportExport, Spot Market

‣ And ~25 other products and service features with more added monthly

Easy on AWS; much harder elsewhere

One very specific example

‣ The widely used FLEXlm license server uses NIC MAC addresses when generating license keys

‣ Different MAC? Science stops. Screwed.

‣ VPC ENIs decouple the network interface, and therefore its MAC address, from the instance, so the license server’s MAC survives an instance replacement. Badass.
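A minimal sketch of the ENI trick, again assuming boto3 (a present-day SDK, not what was used in 2013). The function name, description string and arguments are hypothetical. The idea is to create the interface once, generate the license against its MAC, and re-attach the same interface whenever the license server instance is rebuilt.

```python
def pin_license_mac(ec2, subnet_id, instance_id, device_index=1):
    """Create an ENI in `subnet_id`, attach it to the license server,
    and return (eni_id, mac_address).

    `ec2` is expected to behave like boto3.client("ec2"). A production
    script would also wait for the interface to become available before
    attaching and would handle errors; both are omitted here.
    """
    eni = ec2.create_network_interface(
        SubnetId=subnet_id,
        Description="license server interface (illustrative)",
    )["NetworkInterface"]

    ec2.attach_network_interface(
        NetworkInterfaceId=eni["NetworkInterfaceId"],
        InstanceId=instance_id,
        DeviceIndex=device_index,  # index 0 is the instance's primary interface
    )
    return eni["NetworkInterfaceId"], eni["MacAddress"]
```

Because the MAC belongs to the interface and not to the instance, the license key generated against it keeps working after the replacement server picks up the same ENI.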

Why this work was ‘easy’ on Amazon AWS ...

A few other examples ...

‣ VPC: Incredibly powerful. Actually useful. Approachable even if you are not an IPSEC or BGP routing god.

‣ Spot Market: Compelling economics. Once you start you’ll likely never run anywhere else. The competition can’t compete.

‣ cc* & cg* EC2 instance types: Fat nodes with bidirectional 10GbE bandwidth. And don’t get me started on SSD or Provisioned-performance EBS volumes.

Thanks! Email: [email protected]
