Collaboration @ Scale
September 2015: Life Sciences User Group, Cambridge MA
Chris Dwan ([email protected])
Director, Research Computing and Data Services
Acting Director, IT
Conclusions
• Good news: The fundamentals still apply.
• Understand your data.
– Get intense about what you need, why you need it, who is responsible, and how / when you plan to compute against it.
– This will require organizational courage.
• Stop thinking about “moving” data.
– Archive first. After that, all copies are transient.
• Object storage is different from files – at many weird levels.
• Elasticity in compute is not like elasticity in data.
– Availability of CPUs vs. proximity to elastic compute.
– Also, “trash storage?”
The Broad Institute

• The Broad Institute is a non-profit biomedical research institute founded in 2004
• Fifty core faculty members and hundreds of associate members from MIT and Harvard
• ~1000 research and administrative personnel, plus ~2,400+ associated researchers
• ~1.4 × 10^6 genotyped samples

Programs and Initiatives, focused on specific disease or biology areas:
Cancer, Genome Biology, Cell Circuits, Psychiatric Disease, Metabolism, Medical and Population Genetics, Infectious Disease, Epigenomics

Platforms, focused on technological innovation and application:
Genomics, Therapeutics, Imaging, Metabolite Profiling, Proteomics, Genetic Perturbation
“This generation has a historic opportunity and responsibility to transform medicine by using systematic approaches in the biological sciences to dramatically accelerate the understanding and cure of disease”
If a man’s at odds to know his own mind it’s because he hasn’t got aught but his mind to know it with.
Cormac McCarthy, Blood Meridian or The Evening Redness in the West
Broad Genomics Data Production
338 trillion base pairs (PF) in August.
At ~1.25 bytes per base: 422 TByte / month ≈ 170 MByte / sec
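A quick sanity check of that arithmetic (a minimal sketch; the 30-day month is an assumption, the other figures are from the slide):

```python
# Sanity check of the production throughput figures above (30-day month assumed).
bases_per_month = 338e12            # 338 trillion base pairs (PF) in August
bytes_per_base = 1.25               # ~1.25 bytes per base

bytes_per_month = bases_per_month * bytes_per_base
seconds_per_month = 30 * 24 * 3600

print(f"{bytes_per_month / 1e12:.0f} TByte / month")                    # ~422
print(f"{bytes_per_month / seconds_per_month / 1e6:.0f} MByte / sec")   # ~163, call it ~170
```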
Broad Genomics Data Production: Context

[Timeline chart of sequencing output growth; callouts:]
• “We were all talking about ‘data tsunamis’ here.”
• “I joined the Broad here.”
Under the hood: ~1TB of MongoDB
Organizations which design systems … are constrained to produce designs which are copies of the communication structures of those organizations
Melvin Conway, 1968
If you have four groups working on a compiler, you’ll get a four pass compiler
Eric S Raymond, The New Hacker’s Dictionary, 1996
Never send a human to do a machine’s job.
Agent Smith, The Matrix
Broad IT Services

Traditional IT:
• Globally shared services
• NFS, AD / LDAP, DNS, …
• Many services provided using public clouds
Responsibility: CIO

Cancer Genome Analysis, Connectivity Map – billing support:
• IT provides coordination between internal cost objects and cloud vendor “projects” or “roles”
• No shared services
Responsibility: User

Cloud / Hybrid Model:
• Granular shared services
• VPN used to expose selected services to particular projects
Responsibility: Project / Service Lead

[Diagram: BITS DevOps, DSDE Dev, and Cloud Pilot environments, each connected by VPN]
The future is already here – it’s just not very well distributed
William Gibson
CycleCloud provides straightforward, recognizable cluster functionality with autoscaling and a clean management UI.
Do not be fooled by the 85-page “quick start guide”; it’s just a cluster.
Instances are provisioned based on queued jobs
3,000 tasks completed in two hours (differential dependency on gene sets in R)
5 instances @ 32 cores: $8.54 / hr
This was a $20 analysis.
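A minimal sketch of the idea behind “instances are provisioned based on queued jobs” – not CycleCloud’s actual implementation; the qstat parsing and the launch step are illustrative assumptions:

```python
import subprocess

CORES_PER_INSTANCE = 32     # matches the 32-core instances used above
MAX_INSTANCES = 5           # budget ceiling

def pending_tasks() -> int:
    """Count queued (pending) jobs; Grid Engine's 'qstat -s p' output is assumed here."""
    out = subprocess.run(["qstat", "-s", "p"], capture_output=True, text=True).stdout
    return max(0, len(out.splitlines()) - 2)   # drop the two header lines

def desired_instances(queued: int) -> int:
    """One core per task, rounded up to whole instances, capped at the ceiling."""
    needed = -(-queued // CORES_PER_INSTANCE)   # ceiling division
    return min(needed, MAX_INSTANCES)

# A real autoscaler would now reconcile desired_instances(pending_tasks())
# against running instances via the cloud provider's API (omitted here).
```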
Searching for the right use case …
CycleCloud on Google Preemptible Instances
50,000+ cores used for ~2 hours
If you want to recruit the best people, you have to hit home runs from time to time.
The basics still apply

[Layered provisioning diagram spanning “My Metal”, Private Cloud, Public Cloud, and a “Containerized Wonderland”; layers include:]
• Hardware provisioning (UCS, xCAT)
• Boot image provisioning (PXE / Cobbler, Kickstart)
• Bare metal / Hypervisor OS
• Instance provisioning (OpenStack on premise, CycleCloud in the public cloud)
• Public cloud infrastructure
• Network topology (VLANs, et al.)
• End-user-visible OS and vendor patches (Red Hat, plus Satellite)
• Broad configuration (Puppet)
• User or execution environment (Dotkit, Docker, JVM, Tomcat)
• … Docker / Mesos / Kubernetes / Cloud Foundry / Workflow Description Language / …
A nightmare* of files

Sequencer → Flowcell Directories → Lane BAMs → Aggregated BAMs → gVCF → VCF, spread across filers (bragg, iodine, knox, kiwi, flynn, argon, mint, …)

• Flowcell Directories: base calling, paired reads; /seq/illumina; deleted after six weeks.
• Lane BAMs: aligned, not aggregated; /seq/picard.
• Aggregated BAMs: aligned to a reference; /seq/picard_aggregation; “keep forever” – six months on high performance storage, then migrated to cost effective filers.
• Over time, these directories become a highly curated forest of symbolic links, spanning several filesystems.
• Setting aside the operational issues, meaningful access management is frankly impossible in this architecture.
Caching edge filers for shared references

[Diagram: on premise data stores, the Production Farm, the Shared Research Farm, and OpenStack connected across 10 Gb/sec and 80+ Gb/sec networks, with a physical Avere Edge Filer in the path.]

Coherence on small volumes of files is provided by a combination of clever network routing and Avere’s caching algorithms.
Cloud-backed, file-based storage

[Diagram: the same on premise environment, with cloud backed data stores in multiple public clouds added behind the physical Avere Edge Filer.]

We decided to call this fargo. It’s cold, sort of far away, and not really where we were planning to go.
Caching edge filers for unlimited expansion space

[Diagram: as above, with a virtual Avere Edge Filer in the cloud fronting the cloud backed data stores.]

Eventually we can stand up “cloud pods” that make direct reference to fargo.
A nightmare* of files, with Fargo

[The same pipeline and symlink forest as above, now with Fargo (Avere backed, file storage) added alongside the existing filers.]

This is cool, but it’s not the answer.
Data push to “Fargo”

September 1, 2015:
• Sustained 250 MB/sec for several weeks
• 646 TB of files occupying 579 TB of usable space (compression, even at 10% savings, is totally worth it)
• Client side encryption in-line: skip the conversation, just click the button.
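The compression savings, worked out from the numbers above:

```python
logical_tb = 646     # TB of files pushed
physical_tb = 579    # TB of usable space actually consumed

savings = 1 - physical_tb / logical_tb
print(f"~{savings:.0%} saved, i.e. {logical_tb - physical_tb} TB we did not have to provision")
# ~10% saved, i.e. 67 TB we did not have to provision
```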
The edges are still a little rough

• The billing API is the best way to get usage information out of Google’s cloud offerings.
• Seriously? “df” can be off by hundreds of TB. (Eight exabytes is cool, though.)
• I guess it’s better than waiting all day for ‘du’ to finish…
• We write ~250 objects, 1 MB each, every second of every day. “ls” is not a meaningful tool at this scale.
• Old style dashboards simply won’t cut it.
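Why “ls” and old-style dashboards stop being meaningful, worked out from the write rate above:

```python
objects_per_second = 250        # ~250 objects of ~1 MB each, every second
seconds_per_day = 86_400

per_day = objects_per_second * seconds_per_day
per_month = per_day * 30

print(f"{per_day:,} objects/day")        # 21,600,000 objects/day  (~21.6 TB/day)
print(f"{per_month:,} objects/month")    # 648,000,000 objects/month
# No directory-listing tool is useful against hundreds of millions of objects.
```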
File based storage: The Information Limits

• Single namespace filers hit real-world limits at:
– ~5 PB (restriping times, operational hotspots, MTBF headaches)
– ~10^9 files: directories must either be wider or deeper than human brains can handle.
• Filesystem paths are presumed to persist forever
– Leads inevitably to forests of symbolic links
• Access semantics are inadequate for the federated world.
– We need complex, dynamic, context sensitive semantics, including consent for research use.
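To make the “wider or deeper than human brains can handle” point concrete, a simple calculation (the 1,000-entries-per-directory figure is an assumption, not from the slide):

```python
import math

total_files = 1e9            # ~10^9 files
entries_per_dir = 1_000      # a generous "still human readable" directory size (assumption)

depth = math.ceil(math.log(total_files, entries_per_dir))
leaf_dirs = int(total_files / entries_per_dir)

print(f"Tree must be at least {depth} levels deep")           # 3
print(f"...with ~{leaf_dirs:,} leaf directories to browse")   # ~1,000,000
```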
Object storage
• It’s still made out of disks and servers.
• You get the option of striping across on-premise and cloud in dynamic and sensible ways.
My object storage opinions

• The S3 standard defines object storage
– Any application that uses any special / proprietary features is a nonstarter – including clever metadata stuff.
• All object storage must be durable to the loss of an entire data center
– Conversations about sizing / usage need to be incredibly simple
• Must be cost effective at scale
– Throughput and latency are considerations, not requirements
– This breaks the data question into stewardship and usage
• Must not merely re-iterate the failure modes of filesystems
Do not call the tortoise unworthy because she is not something else.
Walt Whitman, Song of Myself
Object Storage is different

• Filesystems
– I/O errors or stalls are rare, and are usually evidence of serious problems
– Optimize for throughput by using long streaming reads and writes.
• Object Storage
– I/O errors are common, with an expectation of several retries
– Optimize for throughput by parallelizing and reducing the cost of a retry
– Multipart upload and download are essential
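A minimal sketch of “parallelize and reduce the cost of a retry” using boto3 against an S3-compatible endpoint (the bucket and file names are hypothetical; this is not the Broad’s pipeline code):

```python
import boto3
from boto3.s3.transfer import TransferConfig
from botocore.config import Config

# Retries are expected, not exceptional: tell the client up front.
s3 = boto3.client("s3", config=Config(retries={"max_attempts": 10, "mode": "standard"}))

# Multipart transfer: each 64 MB part is uploaded (and retried) independently,
# so a transient error costs one part rather than the whole multi-GB BAM.
xfer = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=16,
    use_threads=True,
)

s3.upload_file("sample.aligned.bam", "example-archive-bucket",
               "bams/sample.aligned.bam", Config=xfer)
```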
Broad Data Production, 2015: ~100 TB / week of unique information
“Data is heavy: It goes to the cheapest, closest place, and it stays there”
Jeff Hammerbacher
This means that you should put data in its final resting place as soon as it is generated. Anything else leads to madness.
Archive first

[The same pipeline: Sequencer → Flowcell Directories (/seq/illumina, deleted after six weeks) → Lane BAMs (/seq/picard) → Aggregated BAMs (/seq/picard_aggregation) → gVCF → VCF.]

• Our long term archive must be “object native”: CRAMmed, encrypted BAMs – not aligned, not aggregated.
• Must re-tool all pipelines to support object storage stage-in and stage-out.
• Once you have your archive right, all other data is transient.
• Once the long term archive is object-native, we can move the main-line production to the cloud.
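A minimal sketch of the stage-in / stage-out pattern the re-tooled pipelines need (the gs:// paths and the tool command are hypothetical; the point is that the object store holds the durable copy and local disk is scratch):

```python
import os
import subprocess
import tempfile

def run_staged(input_uri: str, output_uri: str, tool_cmd: list) -> None:
    """Stage an object in, run a tool against local scratch, stage the result out."""
    with tempfile.TemporaryDirectory() as scratch:
        local_in = os.path.join(scratch, "input")
        local_out = os.path.join(scratch, "output")

        subprocess.run(["gsutil", "cp", input_uri, local_in], check=True)    # stage in
        subprocess.run(tool_cmd + [local_in, local_out], check=True)         # compute
        subprocess.run(["gsutil", "cp", local_out, output_uri], check=True)  # stage out
    # scratch is deleted; the only durable copies live in the object store

# e.g. run_staged("gs://archive/sample.cram", "gs://results/sample.gvcf", ["my_caller"])
```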
The dashboard should look opaque, because metadata lives elsewhere

• Object “names” should be a bag of UUIDs.
• Object storage should be basically unusable without the metadata index.
• Anything else recapitulates the failure mode of file based storage.
• This should scare you.
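A minimal sketch of the “bag of UUIDs” idea: the object name carries no meaning, and everything a human needs lives in a separate metadata index (here a plain dict standing in for a real metadata service):

```python
import uuid

metadata_index = {}   # stand-in for the real metadata database

def store(bucket: str, payload: bytes, **metadata) -> str:
    """Name the object with a UUID; record what it means only in the index."""
    object_name = str(uuid.uuid4())
    # put_object(bucket, object_name, payload) would go here -- provider-specific, omitted
    metadata_index[object_name] = {"bucket": bucket, **metadata}
    return object_name

name = store("archive", b"...", sample="NA12878", data_type="gvcf", consent="GRU")
# A bucket listing shows only opaque names like '7c9e6679-7425-40de-944b-e07fc1f90ae7';
# without the index, the storage is (by design) unusable.
```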
Data Deletion @ Scale

Me: “Blah blah … I think we’re cool to delete about 600 TB of data from a cloud bucket. What do you think?”
Ray: “BOOM!”

• This was my first deliberate data deletion at this scale.
• It scared me how fast / easy it was.
• Considering a “pull request” model for large scale deletions.
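A minimal sketch of what a “pull request” model for large deletions could look like: one person proposes a manifest, a different person approves, and only then are delete calls issued (the delete_object call is a hypothetical placeholder):

```python
import json
import time

def propose_deletion(bucket, object_names, reason, proposer):
    """Write a reviewable deletion manifest instead of deleting anything."""
    manifest = {
        "bucket": bucket,
        "objects": object_names,
        "reason": reason,
        "proposer": proposer,
        "created": time.time(),
    }
    path = f"delete-request-{int(manifest['created'])}.json"
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
    return path

def execute_deletion(manifest_path, approver):
    """Require a second set of eyes before anything is actually removed."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    if approver == manifest["proposer"]:
        raise PermissionError("a second reviewer must approve deletions at this scale")
    for name in manifest["objects"]:
        pass  # delete_object(manifest["bucket"], name) -- provider-specific call, omitted
```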
Standards are needed for genomic data
“The mission of the Global Alliance for Genomics and Health is to accelerate progress in human health by helping to establish a common framework of harmonized approaches to enable effective and responsible sharing of genomic and clinical data, and by catalyzing data sharing projects that drive and demonstrate the value of data sharing.”
Regulatory issues • Ethical issues • Technical issues
This stuff is important
We have an opportunity to change lives and health outcomes, and to realize the gains of genomic medicine, this year.
We also have an opportunity to waste vast amounts of money and still not really help the world.
I would like to work together with you to build a better future, sooner.
Conclusions
• Good news: The fundamentals still apply.
• Understand your data.
– Get intense about what you need, why you need it, who is responsible, and how / when you plan to compute against it.
• Stop thinking about “moving” data.
– Archive first. After that, all copies are transient.
• Object storage is different from files – at many weird levels.
• Elasticity in compute is not like elasticity in data.
– Availability of CPUs vs. proximity to elastic compute.
– Also, “trash storage?”
The opposite of play is not work, it’s depression
Jane McGonigal, Reality is Broken
Thank You