Best Practices & Lessons Learned Life Science Informatics & The Cloud Tuesday, May 28, 13

Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

DESCRIPTION

Slides from a talk @ Bio-IT World Asia

Page 1: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Best Practices & Lessons Learned Life Science Informatics & The Cloud

Page 2: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

I’m Chris.

I’m an infrastructure geek.

I work for the BioTeam.

Twitter: @chris_dag

Page 3: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Who, what & why

BioTeam

‣ Independent consulting shop

‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done

‣ 12+ years bridging the “gap” between science, IT & high performance computing

‣ www.bioteam.net

Page 4: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Seriously.

Listen to me at your own risk

‣ Clever people find multiple solutions to common issues

‣ I’m fairly blunt, burnt-out and cynical in my advanced age

‣ Significant portion of my work has been done in demanding production Biotech & Pharma environments

‣ Filter my words accordingly

Page 5: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Other 2013 Presentations ...

Bio-IT World Boston

Page 6: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Bio-IT World Boston: “Multi-Tenant Research Clusters”

http://slideshare.net/chrisdag/

Page 7: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Bio-IT World Boston: “HPC Trends from the trenches.”

http://slideshare.net/chrisdag/

Page 8: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

1. Intro

2. Meta: Why Cloud?

3. What the sales & marketing folks won’t tell you

4. Getting Practical

5. HPC Case Study

Page 9: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

The big picture

Why we need IaaS clouds ...

Page 10: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Why life science needs infrastructure clouds

Big Picture

‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed

• Example: CCD sensor upgrade on that confocal microscopy rig just doubled your storage requirements

• Example: That 2D ultrasound imager is now a 3D imager

• Example: Illumina HiSeq upgrade just doubled the rate at which you can acquire genomes. Massive downstream increase in storage, compute & data movement needs

Page 11: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

The Central Problem Is ...

‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure

• The science is changing month-to-month ...

• ... while our IT infrastructure only gets refreshed every 2-7 years

‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)

Page 12: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

The Central Problem Is ...

‣ The easy period is over

‣ 5 years ago you could toss inexpensive storage and servers at the problem; even in a nearby closet or under a lab bench if necessary

‣ That does not work any more; real solutions required

Page 13: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

And a related problem ...

‣ It has never been easier or cheaper to acquire vast amounts of data

‣ Growth rate of data creation/ingest exceeds rate at which the storage industry is improving disk capacity

‣ Not just a storage lifecycle problem. This data *moves* and often needs to be shared among multiple entities and providers

• ... ideally without punching holes in your firewall or consuming all available internet bandwidth

Page 14: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

If you get it wrong ...

‣ Lost opportunity

‣ Missing capability

‣ Frustrated & very vocal scientific staff

‣ Problems in recruiting, retention, publication & product development

Page 15: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

IaaS to the Rescue

Page 16: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

IaaS solves the current critical “Research IT” dilemma

Why Cloud?

‣ IaaS clouds let us react and respond to scientific requirements that change far faster than we can refresh local datacenters and enterprise IT platforms

Image: shanelin via Flickr

Page 17: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Beyond capability and agility gains ...

Why Cloud?

‣ The economic benefits are real, inescapable and trending in the proper direction

‣ Internet-scale providers with millions of cores and exabytes of spinning disk spanning the globe leverage operational efficiencies you will never come close to matching internally

‣ ... be suspicious of people who claim otherwise

Page 18: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Also ...

Why Cloud?

‣ Clouds becoming a natural place for data exchange & access

‣ “scriptable everything” enables entirely new capabilities not possible internally*

‣ Finance people love converting CapEx to OpEx

Page 19: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

1. Intro

2. Meta: Why Cloud?

3. What the sales & marketing folks won’t tell you

4. Getting Practical

5. HPC Case Study

Page 20: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

What the salesfolk won’t tell you ...

‣ There is no one-size-fits-all research design pattern ...

‣ You are not going to toss everything and replace it with “Big Data”

‣ Very few of us have a single pipeline or workflow that we can devote endless engineering effort to

‣ We are not going to toss out hundreds of legacy codes and rewrite everything for GPUs or MapReduce

‣ For research HPC it’s all about the building blocks { and how we can effectively use/deploy them }

Page 21: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

What the salesfolk won’t tell you

‣ Your organization actually needs THREE tested cloud design patterns:

‣ (1) To handle ‘legacy’ scientific apps & workflows

‣ (2) The special stuff that is worth re-architecting

‣ (3) Hadoop & big data analytics

Page 22: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Legacy HPC on the Cloud

Design Pattern #1 - Legacy

‣ There are many hundreds of existing algorithms and applications in the life science informatics space

‣ We’ll be running/using these codes for years to come

‣ Many can’t be, or never will be, refactored or rewritten

‣ I call this the “legacy” design pattern

Page 23: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

One Easy Solution.

Page 24: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

StarCluster

Design Pattern #1 - Legacy

‣ MIT StarCluster

• http://web.mit.edu/star/cluster/

‣ Infinite Awesomeness. Worth a talk by itself.

‣ This is your baseline

‣ Extend as needed

Page 25: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Design Pattern #2 - “Cloudy”

‣ Some of our research workflows are important enough to be rewritten for “the cloud” and the advantages that a truly elastic & API-driven infrastructure can deliver

‣ This is where you have the most freedom

‣ Many published best practices you can borrow

‣ Warning: Cloud vendor lock-in potential is strongest here

Page 26: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Design Pattern #3 - Hadoop/BigData

‣ Hadoop and “big data” need to be on your radar

‣ Be careful though, you’ll need a gas mask to avoid the smog of marketing and vapid hype

‣ The utility is real and this does represent one “future path” for analysis of large data sets

Page 27: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Design Pattern #3 - Hadoop/BigData

‣ It’s going to be a MapReduce world, get used to it

‣ Little need to roll your own Hadoop in 2013

‣ ISV & commercial ecosystem already healthy

‣ Multiple providers today; both onsite & cloud-based

‣ Often a slam-dunk cloud use case

Page 28: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

What you need to know

Design Pattern #3 - Hadoop/BigData

‣ “Hadoop” and “Big Data” are now general terms

‣ You need to drill down to find out what people actually mean

‣ We are still in the period where senior leadership may demand “Hadoop” or “BigData” capability without any actual business or scientific need

Page 29: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

What you need to know

Hadoop & “Big Data”

‣ In broad terms you can break “Big Data” down into two very basic use cases:

1. Compute: Hadoop can be used as a very powerful platform for the analysis of very large data sets. The google search term here is “map reduce”

2. Data Stores: Hadoop is driving the development of very sophisticated “no-SQL” “non-Relational” databases and data query engines. The google search terms include “nosql”, “couchdb”, “hive”, “pig” & “mongodb”, etc.

‣ Your job is to figure out which type applies for the groups requesting “Hadoop” or “BigData” capability
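For the “Compute” use case, the map/reduce idea fits in a few lines of plain Python. This is a toy sketch only, not Hadoop code: the function names, the k-mer counting task and the two example reads are all assumptions invented for illustration. A real framework would run the map and reduce steps in parallel across many nodes, next to the data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-length window in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_kmers(reads, k=3):
    """Run map, shuffle and reduce in sequence on a list of reads."""
    mapped = chain.from_iterable(map_phase(read, k) for read in reads)
    return reduce_phase(shuffle(mapped))


# Hypothetical toy reads, purely for illustration.
reads = ["GATTACA", "TTACAGA"]
counts = count_kmers(reads)
print(counts)
```

The point of the split is that each `map_phase` call only needs its own read, which is why this style of analysis spreads so easily across a cluster.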

Page 30: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Hadoop vs traditional Linux Clusters

High Throughput Science

‣ Hadoop is a very complex beast

‣ It’s also the way of the future so you can’t ignore it

‣ Very tight dependency on moving the ‘compute’ as close as possible to the ‘data’

‣ Hadoop clusters are just different enough that they do not integrate cleanly with traditional Linux HPC systems

‣ Often treated as separate silo or punted to the cloud

Page 31: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

What you need to know

Hadoop & “Big Data”

‣ Hadoop is being driven by a small group of academics writing and releasing open source life science Hadoop applications

‣ Your people will want to run these codes

‣ In some academic environments you may find people wanting to develop on this platform

Page 32: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

1. Intro

2. Meta: Why Cloud?

3. What the sales & marketing folks won’t tell you

4. Getting Practical

5. HPC Case Study

Page 33: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Strategy

Practical Advice

‣ Research oriented IT organizations need a cloud strategy today; or risk being bypassed by employees

Page 34: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Design Patterns

Practical Advice

‣ Remember the three design patterns on the cloud:

• Legacy HPC systems (replicate traditional clusters in the cloud)

• Hadoop

• Cloudy (when you rewrite something to fully leverage cloud capability)

Page 35: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Policies and Procedures

Practical Advice

‣ Cloud technology bits are easy. Cloud Process and Policy discussions take forever

‣ Start these conversations sooner rather than later!

Page 36: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Core services that take time and advance planning

Practical Advice

‣ A few key foundational cloud services take time and advance planning to deploy properly:

‣ VPNs & subnet schemes

‣ Identity Management & Access Control

‣ Data Movement
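On subnet schemes, one reason the planning takes time is that cloud address ranges must not collide with ranges already routed on-premise, or the VPN will not route cleanly. A minimal sketch of that check using Python's standard `ipaddress` module follows. The CIDR blocks, the subnet names and the helper functions are all hypothetical placeholders, not a recommended layout.

```python
import ipaddress

# Hypothetical address plan; substitute your organization's real ranges.
ON_PREM = ipaddress.ip_network("10.10.0.0/16")    # assumed campus/datacenter range
CLOUD_VPC = ipaddress.ip_network("10.20.0.0/16")  # assumed cloud VPC range


def plan_subnets(vpc, new_prefix, names):
    """Carve a VPC block into equal-sized subnets and label them in order."""
    subnets = vpc.subnets(new_prefix=new_prefix)
    return {name: next(subnets) for name in names}


def overlaps_any(candidate, existing):
    """True if the candidate range collides with any already-routed range."""
    return any(candidate.overlaps(net) for net in existing)


plan = plan_subnets(CLOUD_VPC, 20, ["license-servers", "compute", "data-transfer"])
vpn_safe = not overlaps_any(CLOUD_VPC, [ON_PREM])

for name, net in plan.items():
    print(f"{name}: {net} ({net.num_addresses} addresses)")
print("safe to route over the VPN:", vpn_safe)
```

Running a check like this before anything is deployed is cheap; renumbering a live VPC afterwards is not.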

Page 37: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Data Movement

Practical Advice

‣ A few words & pictures on data movement ...

Page 38: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Physical data movement station 1

Page 39: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Physical data movement station 2

Page 40: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

“Naked” Data Movement

Page 41: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

“Naked” Data Archive

Page 42: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Cloud Data Movement

‣ Things changed pretty definitively in 2012

‣ And the next image shows why ...

Page 43: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

March 2012

Page 44: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Network vs. Physical

Cloud Data Movement

‣ With a 1GbE internet connection ...

‣ and using Aspera software ...

‣ We sustained 700 Mb/sec for more than 7 hours freighting genomes into Amazon Web Services

‣ This is fast enough for many use cases, including genome sequencing core facilities*

‣ Chris Dwan’s webinar on this topic: http://biote.am/7e
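To put the transfer figure in perspective, here is a back-of-the-envelope calculation. A 1GbE link carries at most 1000 megabits (125 megabytes) per second, so the sustained rate is treated here as 700 megabits per second. The 7-hour duration is a lower bound taken from the slide, and the helper name is made up for this sketch.

```python
LINK_MBIT_PER_S = 1000       # nominal 1GbE line rate, in megabits per second
SUSTAINED_MBIT_PER_S = 700   # sustained rate, read as megabits per second
HOURS = 7                    # lower bound: the slide says "more than 7 hours"


def terabytes_moved(mbit_per_s, hours):
    """Convert a sustained rate in Mb/s over a duration into decimal terabytes."""
    bytes_per_s = mbit_per_s * 1_000_000 / 8
    return bytes_per_s * hours * 3600 / 1e12


utilization = SUSTAINED_MBIT_PER_S / LINK_MBIT_PER_S
moved_tb = terabytes_moved(SUSTAINED_MBIT_PER_S, HOURS)

print(f"link utilization: {utilization:.0%}")
print(f"data moved in {HOURS} hours: about {moved_tb:.1f} TB")
```

Under these assumptions the link ran at roughly 70% of line rate and moved a little over 2 TB in the 7-hour window.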

Page 45: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Network vs. Physical

Cloud Data Movement

‣ Results like this mean we now favor network-based data movement over physical media movement

‣ Large-scale physical data movement carries a high operational burden and consumes non-trivial staff time & resources

Page 46: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

There are three ways to do network data movement ...

Cloud Data Movement

‣ Buy software from Aspera and be done with it

‣ Attend the annual SuperComputing conference & see which student group wins the bandwidth challenge contest; use their code

‣ Get GridFTP from the Globus folks

Page 47: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

SysAdmin vs Programmer

Practical Advice

‣ Recognize the blurring line between IT / Informatics / SW Engineer

‣ ... and how it may mix up your org chart

Page 48: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Very blurry lines in 2013 for all of these roles

Scientist/SysAdmin/Programmer

‣ Radical change in last ~2 years for how IT is provisioned, delivered, managed & supported

‣ Root cause (Technology): Virtualization & Cloud

‣ Root cause (Operations): Configuration Mgmt, Systems Orchestration & Infrastructure Automation

‣ SysAdmins & IT staff need to re-skill and retrain to stay relevant

Page 49: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Very blurry lines in 2013 for all of these roles

Scientist/SysAdmin/Programmer

‣ When everything has an API ...

‣ ... anything can be ‘orchestrated’ or ‘automated’ remotely

‣ And by the way ...

‣ The APIs (‘knobs & buttons’) are accessible to all

Page 50: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Very blurry lines in 2013 for all of these roles

Scientist/SysAdmin/Programmer

‣ IT jobs, roles and responsibilities are undergoing rapid upheaval

‣ SysAdmins must learn to program in order to harness automation tools

‣ Programmers & Scientists can now self-provision and control sophisticated IT resources

Page 51: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Very blurry lines in 2013 for all of these roles

Scientist/SysAdmin/Programmer

‣ My take on the future ...

‣ Far more control is going into the hands of the research end user

‣ IT support roles will radically change -- no longer owners or gatekeepers

‣ IT will handle policies, procedures, reference patterns, security & best practices

‣ Researchers will control the “what”, “when” and “how big”

Page 53: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Cloud HPC Case Study

Time Permitting ...

Page 54: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Next Generation Nuclear Magnetic Resonance

NMR Probehead Simulation on AWS

‣ CAE Simulation Project

‣ via www.hpcexperiment.com

‣ Software: CST Studio 2012

‣ My role: Volunteer HPC Mentor

Page 55: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Simulating next-generation NMR probeheads

Why this was an interesting project

‣ Frontend interface is graphics heavy and requires Windows

‣ Studio ‘solvers’ run Linux or Windows; support GPUs and MPI task distribution

‣ Simultaneous use of local and cloud-based solvers actually works

‣ FLEXlm license server involved

‣ Non-trivial security and geo-location requirements

Page 56: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

When we ran at modest scale ...

16 large compute nodes + 22 GPU nodes

$30/hour on AWS Spot Market.

HPC on the cloud is real.
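The arithmetic behind that figure is worth spelling out. The sketch below uses only the node counts and the $30/hour number from the slide. The blended per-node rate ignores the price difference between compute and GPU instances, and the 8-hour run length is a made-up example rather than a figure from the project.

```python
COMPUTE_NODES = 16
GPU_NODES = 22
CLUSTER_USD_PER_HOUR = 30.0  # spot-market figure from the slide


def blended_node_hour(total_usd_per_hour, node_count):
    """Average hourly cost per node, ignoring per-instance-type price differences."""
    return total_usd_per_hour / node_count


def run_cost(total_usd_per_hour, hours):
    """Total cost of keeping the whole cluster up for a given number of hours."""
    return total_usd_per_hour * hours


nodes = COMPUTE_NODES + GPU_NODES
per_node = blended_node_hour(CLUSTER_USD_PER_HOUR, nodes)
eight_hour_run = run_cost(CLUSTER_USD_PER_HOUR, 8)  # hypothetical run length

print(f"{nodes} nodes at a blended ${per_node:.2f} per node-hour")
print(f"hypothetical 8-hour run: ${eight_hour_run:.2f}")
```

That works out to 38 nodes at roughly $0.79 per node-hour, or $240 for the hypothetical 8-hour run.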

Page 57: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Design Attempt #1

‣ Hybrid Linux/Windows cloud running in AWS EU Region

‣ Failure:

• No GPU nodes in EU at the time

• No cc2.4xlarge at the time

Page 58: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Design Attempt #2

‣ Move Hybrid Linux/Windows system to US-EAST

‣ ... with synthetic test data

‣ Best-practices VPC isolation & VPN access

‣ It looked like this ...

Page 59: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Architecture #2

Page 60: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Design Attempt #2

‣ Attempt #2 Failed:

‣ CST FrontEnd Controller running at end-user site could not tolerate NAT translation used by solvers

‣ No GPU nodes available within VPC at that time

Page 61: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Design Attempt #3

‣ Design #3 Finally works

‣ VPC shrunk to single license server running in US EAST

‣ All Windows/Linux/GPU solver nodes running in EU

‣ NO NAT, NO VPC For Solvers

‣ Extensive use of AWS spot instance servers

Page 62: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

At experiment end it looked like this ...

Page 63: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Non Trivial HPC on the Cloud

16 large compute nodes + 22 GPU nodes

$30/hour on AWS Spot Market.

Page 64: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Why this work was ‘easy’ on Amazon AWS ...

Nightmare on any other cloud

‣ Let’s discuss why this simulation workload would be much, much harder to do on some other cloud platform ...

Page 65: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Why this work was ‘easy’ on Amazon AWS ...

Nightmare on any other cloud

‘Brand X’ Cloud

1. Virtual Servers

2. Block Storage

3. Object Storage

4. ... and maybe some other stuff if I’m lucky

AWS

‣ EC2, S3, EBS, RDS, SNS, SQS, SWF, GPUs, SSDs, CloudFormation, VPC, ENIs, SecurityGroups, 10GbE DirectConnect, Reserved Instances, ImportExport, Spot Market

‣ And ~25 other products and service features with more added monthly

Page 66: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Easy on AWS; much harder elsewhere

One very specific example

‣ The widely used FLEXlm license server uses NIC MAC addresses when generating license keys

‣ Different MAC? Science stops. Screwed.

‣ VPC ENIs allow separation of MAC address from Network Interface. Badass.
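Why a changed MAC address stops the science can be shown with a toy model. The sketch below is emphatically not FLEXlm's real algorithm or file format: the secret, the key derivation, the function names and the MAC addresses are all hypothetical. It only illustrates the general idea of a license key that is bound to one host's MAC address.

```python
import hashlib
import hmac

VENDOR_SECRET = b"hypothetical-vendor-secret"  # stand-in; not a real FLEXlm value


def issue_key(mac_address, feature):
    """Toy issuer: derive a key from the host MAC and the licensed feature."""
    message = f"{mac_address.lower()}|{feature}".encode()
    return hmac.new(VENDOR_SECRET, message, hashlib.sha256).hexdigest()


def key_is_valid(key, current_mac, feature):
    """Toy check: the key only validates on the MAC it was issued for."""
    return hmac.compare_digest(key, issue_key(current_mac, feature))


original_mac = "02:00:00:aa:bb:01"     # hypothetical MAC of the first server
replacement_mac = "02:00:00:aa:bb:02"  # hypothetical MAC after a rebuild
key = issue_key(original_mac, "solver")

same_interface = key_is_valid(key, original_mac, "solver")
new_interface = key_is_valid(key, replacement_mac, "solver")

print("key valid on original MAC:", same_interface)
print("key valid after MAC change:", new_interface)
```

In this model the key stops validating as soon as the MAC changes. An ENI keeps its MAC address when it is detached from one instance and attached to another, so a rebuilt license server can keep presenting the MAC its key was issued for.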

Page 67: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Why this work was ‘easy’ on Amazon AWS ...

A few other examples ...

‣ VPC: Incredibly powerful. Actually useful. Approachable even if you are not an IPSEC or BGP routing god.

‣ Spot Market: Compelling economics. Once you start you’ll likely never run anywhere else. The competition can’t compete.

‣ cc* & cg* ec2 instance types: Fat nodes with bidirectional 10GbE bandwidth. And don’t get me started on SSD or Provisioned-performance EBS volumes.

Page 68: Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Thanks! Email: [email protected]