Collaboration @ Scale
September 2015: Life Sciences User Group, Cambridge MA
Chris Dwan ([email protected])
Director, Research Computing and Data Services
Acting Director, IT
Conclusions
• Good news: The fundamentals still apply.
• Understand your data.
– Get intense about what you need, why you need it, who is responsible, and how / when you plan to compute against it.
– This will require organizational courage.
• Stop thinking about “moving” data.
– Archive first. After that, all copies are transient.
• Object storage is different from files – at many weird levels.
• Elasticity in compute is not like elasticity in data.
– Availability of CPUs vs. proximity to elastic compute.
– Also, “trash storage?”
The Broad Institute

• The Broad Institute is a non-profit biomedical research institute founded in 2004
• Fifty core faculty members and hundreds of associate members from MIT and Harvard
• ~1000 research and administrative personnel, plus ~2,400+ associated researchers
• ~1.4 × 10^6 genotyped samples

Programs and Initiatives, focused on specific disease or biology areas:
Cancer, Genome Biology, Cell Circuits, Psychiatric Disease, Metabolism, Medical and Population Genetics, Infectious Disease, Epigenomics

Platforms, focused on technological innovation and application:
Genomics, Therapeutics, Imaging, Metabolite Profiling, Proteomics, Genetic Perturbation
“This generation has a historic opportunity and responsibility to transform medicine by using systematic approaches in the biological sciences to dramatically accelerate the understanding and cure of disease”
If a man’s at odds to know his own mind it’s because he hasn’t got aught but his mind to know it with.
Cormac McCarthy, Blood Meridian or The Evening Redness in the West
Broad Genomics Data Production
338 trillion base pairs (PF) in August.
At ~1.25 bytes per base: 422 TByte / month ≈ 170 MByte / sec
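A quick sanity check of that arithmetic (a minimal sketch; the 30-day month is an assumption, the other figures are from the slide):

```python
# Sanity check of the production throughput figures above (30-day month assumed).
bases_per_month = 338e12            # 338 trillion base pairs (PF) in August
bytes_per_base = 1.25               # ~1.25 bytes per base

bytes_per_month = bases_per_month * bytes_per_base
seconds_per_month = 30 * 24 * 3600

print(f"{bytes_per_month / 1e12:.0f} TByte / month")                    # ~422
print(f"{bytes_per_month / seconds_per_month / 1e6:.0f} MByte / sec")   # ~163, call it ~170
```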
Broad Genomics Data Production: Context

[Timeline chart of sequencing output growth; callouts:]
• “We were all talking about ‘data tsunamis’ here.”
• “I joined the Broad here.”
Under the hood: ~1TB of MongoDB
Organizations which design systems … are constrained to produce designs which are copies of the communication structures of those organizations
Melvin Conway, 1968
If you have four groups working on a compiler, you’ll get a four pass compiler
Eric S Raymond, The New Hacker’s Dictionary, 1996
Never send a human to do a machine’s job.
Agent Smith, The Matrix
Broad IT Services

Traditional IT:
• Globally shared services
• NFS, AD / LDAP, DNS, …
• Many services provided using public clouds
Responsibility: CIO

Cancer Genome Analysis, Connectivity Map – billing support:
• IT provides coordination between internal cost objects and cloud vendor “projects” or “roles”
• No shared services
Responsibility: User

Cloud / Hybrid Model:
• Granular shared services
• VPN used to expose selected services to particular projects
Responsibility: Project / Service Lead

[Diagram: BITS DevOps, DSDE Dev, and Cloud Pilot environments, each connected by VPN]
The future is already here – it’s just not very well distributed
William Gibson
CycleCloud provides straightforward, recognizable cluster functionality with autoscaling and a clean management UI.
Do not be fooled by the 85-page “quick start guide”; it’s just a cluster.
Instances are provisioned based on queued jobs
3,000 tasks completed in two hours (differential dependency on gene sets in R)
5 instances @ 32 cores: $8.54 / hr
This was a $20 analysis.
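A minimal sketch of the idea behind “instances are provisioned based on queued jobs” – not CycleCloud’s actual implementation; the qstat parsing and the launch step are illustrative assumptions:

```python
import subprocess

CORES_PER_INSTANCE = 32     # matches the 32-core instances used above
MAX_INSTANCES = 5           # budget ceiling

def pending_tasks() -> int:
    """Count queued (pending) jobs; Grid Engine's 'qstat -s p' output is assumed here."""
    out = subprocess.run(["qstat", "-s", "p"], capture_output=True, text=True).stdout
    return max(0, len(out.splitlines()) - 2)   # drop the two header lines

def desired_instances(queued: int) -> int:
    """One core per task, rounded up to whole instances, capped at the ceiling."""
    needed = -(-queued // CORES_PER_INSTANCE)   # ceiling division
    return min(needed, MAX_INSTANCES)

# A real autoscaler would now reconcile desired_instances(pending_tasks())
# against running instances via the cloud provider's API (omitted here).
```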
Searching for the right use case …
CycleCloud on Google Preemptible Instances
50,000+ cores used for ~2 hours
If you want to recruit the best people, you have to hit home runs from time to time.
The basics still apply

[Layered provisioning diagram spanning “My Metal”, Private Cloud, Public Cloud, and a “Containerized Wonderland”; layers include:]
• Hardware provisioning (UCS, xCAT)
• Boot image provisioning (PXE / Cobbler, Kickstart)
• Bare metal / Hypervisor OS
• Instance provisioning (OpenStack on premise, CycleCloud in the public cloud)
• Public cloud infrastructure
• Network topology (VLANs, et al.)
• End-user-visible OS and vendor patches (Red Hat, plus Satellite)
• Broad configuration (Puppet)
• User or execution environment (Dotkit, Docker, JVM, Tomcat)
• … Docker / Mesos / Kubernetes / Cloud Foundry / Workflow Description Language / …
A nightmare* of files

Sequencer → Flowcell Directories → Lane BAMs → Aggregated BAMs → gVCF → VCF, spread across filers (bragg, iodine, knox, kiwi, flynn, argon, mint, …)

• Flowcell Directories: base calling, paired reads; /seq/illumina; deleted after six weeks.
• Lane BAMs: aligned, not aggregated; /seq/picard.
• Aggregated BAMs: aligned to a reference; /seq/picard_aggregation; “keep forever” – six months on high performance storage, then migrated to cost effective filers.
• Over time, these directories become a highly curated forest of symbolic links, spanning several filesystems.
• Setting aside the operational issues, meaningful access management is frankly impossible in this architecture.
Caching edge filers for shared references

[Diagram: on premise data stores, the Production Farm, the Shared Research Farm, and OpenStack connected across 10 Gb/sec and 80+ Gb/sec networks, with a physical Avere Edge Filer in the path.]

Coherence on small volumes of files is provided by a combination of clever network routing and Avere’s caching algorithms.
Cloud-backed, file-based storage

[Diagram: the same on premise environment, with cloud backed data stores in multiple public clouds added behind the physical Avere Edge Filer.]

We decided to call this fargo. It’s cold, sort of far away, and not really where we were planning to go.
Caching edge filers for unlimited expansion space

[Diagram: as above, with a virtual Avere Edge Filer in the cloud fronting the cloud backed data stores.]

Eventually we can stand up “cloud pods” that make direct reference to fargo.
A nightmare* of files, with Fargo

[The same pipeline and symlink forest as above, now with Fargo (Avere backed, file storage) added alongside the existing filers.]

This is cool, but it’s not the answer.
Data push to “Fargo”

September 1, 2015:
• Sustained 250 MB/sec for several weeks
• 646 TB of files occupying 579 TB of usable space (compression, even at 10% savings, is totally worth it)
• Client side encryption in-line: skip the conversation, just click the button.
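The compression savings, worked out from the numbers above:

```python
logical_tb = 646     # TB of files pushed
physical_tb = 579    # TB of usable space actually consumed

savings = 1 - physical_tb / logical_tb
print(f"~{savings:.0%} saved, i.e. {logical_tb - physical_tb} TB we did not have to provision")
# ~10% saved, i.e. 67 TB we did not have to provision
```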
The edges are still a little rough

• The billing API is the best way to get usage information out of Google’s cloud offerings.
• Seriously? “df” can be off by hundreds of TB. (Eight exabytes is cool, though.)
• I guess it’s better than waiting all day for ‘du’ to finish…
• We write ~250 objects, 1 MB each, every second of every day. “ls” is not a meaningful tool at this scale.
• Old style dashboards simply won’t cut it.
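Why “ls” and old-style dashboards stop being meaningful, worked out from the write rate above:

```python
objects_per_second = 250        # ~250 objects of ~1 MB each, every second
seconds_per_day = 86_400

per_day = objects_per_second * seconds_per_day
per_month = per_day * 30

print(f"{per_day:,} objects/day")        # 21,600,000 objects/day  (~21.6 TB/day)
print(f"{per_month:,} objects/month")    # 648,000,000 objects/month
# No directory-listing tool is useful against hundreds of millions of objects.
```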
File based storage: The Information Limits

• Single namespace filers hit real-world limits at:
– ~5 PB (restriping times, operational hotspots, MTBF headaches)
– ~10^9 files: directories must either be wider or deeper than human brains can handle.
• Filesystem paths are presumed to persist forever
– Leads inevitably to forests of symbolic links
• Access semantics are inadequate for the federated world.
– We need complex, dynamic, context sensitive semantics, including consent for research use.
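To make the “wider or deeper than human brains can handle” point concrete, a simple calculation (the 1,000-entries-per-directory figure is an assumption, not from the slide):

```python
import math

total_files = 1e9            # ~10^9 files
entries_per_dir = 1_000      # a generous "still human readable" directory size (assumption)

depth = math.ceil(math.log(total_files, entries_per_dir))
leaf_dirs = int(total_files / entries_per_dir)

print(f"Tree must be at least {depth} levels deep")           # 3
print(f"...with ~{leaf_dirs:,} leaf directories to browse")   # ~1,000,000
```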
Object storage
• It’s still made out of disks and servers.
• You get the option of striping across on-premise and cloud in dynamic and sensible ways.
My object storage opinions

• The S3 standard defines object storage
– Any application that uses any special / proprietary features is a nonstarter – including clever metadata stuff.
• All object storage must be durable to the loss of an entire data center
– Conversations about sizing / usage need to be incredibly simple
• Must be cost effective at scale
– Throughput and latency are considerations, not requirements
– This breaks the data question into stewardship and usage
• Must not merely re-iterate the failure modes of filesystems
Do not call the tortoise unworthy because she is not something else.
Walt Whitman, Song of Myself
Object Storage is different

• Filesystems
– I/O errors or stalls are rare, and are usually evidence of serious problems
– Optimize for throughput by using long streaming reads and writes.
• Object Storage
– I/O errors are common, with an expectation of several retries
– Optimize for throughput by parallelizing and reducing the cost of a retry
– Multipart upload and download are essential
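A minimal sketch of “parallelize and reduce the cost of a retry” using boto3 against an S3-compatible endpoint (the bucket and file names are hypothetical; this is not the Broad’s pipeline code):

```python
import boto3
from boto3.s3.transfer import TransferConfig
from botocore.config import Config

# Retries are expected, not exceptional: tell the client up front.
s3 = boto3.client("s3", config=Config(retries={"max_attempts": 10, "mode": "standard"}))

# Multipart transfer: each 64 MB part is uploaded (and retried) independently,
# so a transient error costs one part rather than the whole multi-GB BAM.
xfer = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=16,
    use_threads=True,
)

s3.upload_file("sample.aligned.bam", "example-archive-bucket",
               "bams/sample.aligned.bam", Config=xfer)
```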
Broad Data Production, 2015: ~100 TB / week of unique information
“Data is heavy: It goes to the cheapest, closest place, and it stays there”
Jeff Hammerbacher
This means that you should put data in its final resting place as soon as it is generated. Anything else leads to madness.
Archive first

[The same pipeline: Sequencer → Flowcell Directories (/seq/illumina, deleted after six weeks) → Lane BAMs (/seq/picard) → Aggregated BAMs (/seq/picard_aggregation) → gVCF → VCF.]

• Our long term archive must be “object native”: CRAMmed, encrypted BAMs – not aligned, not aggregated.
• Must re-tool all pipelines to support object storage stage-in and stage-out.
• Once you have your archive right, all other data is transient.
• Once the long term archive is object-native, we can move the main-line production to the cloud.
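A minimal sketch of the stage-in / stage-out pattern the re-tooled pipelines need (the gs:// paths and the tool command are hypothetical; the point is that the object store holds the durable copy and local disk is scratch):

```python
import os
import subprocess
import tempfile

def run_staged(input_uri: str, output_uri: str, tool_cmd: list) -> None:
    """Stage an object in, run a tool against local scratch, stage the result out."""
    with tempfile.TemporaryDirectory() as scratch:
        local_in = os.path.join(scratch, "input")
        local_out = os.path.join(scratch, "output")

        subprocess.run(["gsutil", "cp", input_uri, local_in], check=True)    # stage in
        subprocess.run(tool_cmd + [local_in, local_out], check=True)         # compute
        subprocess.run(["gsutil", "cp", local_out, output_uri], check=True)  # stage out
    # scratch is deleted; the only durable copies live in the object store

# e.g. run_staged("gs://archive/sample.cram", "gs://results/sample.gvcf", ["my_caller"])
```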
The dashboard should look opaque, because metadata lives elsewhere

• Object “names” should be a bag of UUIDs.
• Object storage should be basically unusable without the metadata index.
• Anything else recapitulates the failure mode of file based storage.
• This should scare you.
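A minimal sketch of the “bag of UUIDs” idea: the object name carries no meaning, and everything a human needs lives in a separate metadata index (here a plain dict standing in for a real metadata service):

```python
import uuid

metadata_index = {}   # stand-in for the real metadata database

def store(bucket: str, payload: bytes, **metadata) -> str:
    """Name the object with a UUID; record what it means only in the index."""
    object_name = str(uuid.uuid4())
    # put_object(bucket, object_name, payload) would go here -- provider-specific, omitted
    metadata_index[object_name] = {"bucket": bucket, **metadata}
    return object_name

name = store("archive", b"...", sample="NA12878", data_type="gvcf", consent="GRU")
# A bucket listing shows only opaque names like '7c9e6679-7425-40de-944b-e07fc1f90ae7';
# without the index, the storage is (by design) unusable.
```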
Data Deletion @ Scale

Me: “Blah blah … I think we’re cool to delete about 600 TB of data from a cloud bucket. What do you think?”
Ray: “BOOM!”

• This was my first deliberate data deletion at this scale.
• It scared me how fast / easy it was.
• Considering a “pull request” model for large scale deletions.
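A minimal sketch of what a “pull request” model for large deletions could look like: one person proposes a manifest, a different person approves, and only then are delete calls issued (the delete_object call is a hypothetical placeholder):

```python
import json
import time

def propose_deletion(bucket, object_names, reason, proposer):
    """Write a reviewable deletion manifest instead of deleting anything."""
    manifest = {
        "bucket": bucket,
        "objects": object_names,
        "reason": reason,
        "proposer": proposer,
        "created": time.time(),
    }
    path = f"delete-request-{int(manifest['created'])}.json"
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
    return path

def execute_deletion(manifest_path, approver):
    """Require a second set of eyes before anything is actually removed."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    if approver == manifest["proposer"]:
        raise PermissionError("a second reviewer must approve deletions at this scale")
    for name in manifest["objects"]:
        pass  # delete_object(manifest["bucket"], name) -- provider-specific call, omitted
```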
Standards are needed for genomic data
“The mission of the Global Alliance for Genomics and Health is to accelerate progress in human health by helping to establish a common framework of harmonized approaches to enable effective and responsible sharing of genomic and clinical data, and by catalyzing data sharing projects that drive and demonstrate the value of data sharing.”
Regulatory issues • Ethical issues • Technical issues
This stuff is important
We have an opportunity to change lives and health outcomes, and to realize the gains of genomic medicine, this year.
We also have an opportunity to waste vast amounts of money and still not really help the world.
I would like to work together with you to build a better future, sooner.
Conclusions
• Good news: The fundamentals still apply.
• Understand your data.
– Get intense about what you need, why you need it, who is responsible, and how / when you plan to compute against it.
• Stop thinking about “moving” data.
– Archive first. After that, all copies are transient.
• Object storage is different from files – at many weird levels.
• Elasticity in compute is not like elasticity in data.
– Availability of CPUs vs. proximity to elastic compute.
– Also, “trash storage?”
The opposite of play is not work, it’s depression
Jane McGonigal, Reality is Broken
Thank You