HPC Bursting with Google Cloud Platform
Karan Bhatia, Google
Jonathon LeFaive, University of Michigan
Proprietary + Confidential
SEPTEMBER 28, 2016

Google Cloud Platform HPC Bursting with · 11/7/2017  · Karan Bhatia, Google Jonathon LeFaive, University of Michigan Proprietary + Confidential SEPTEMBER 28, 2016. Cloud Deployment




Cloud Deployment Manager


● Repeatable

● Declarative

● Template Driven


Cloud Deployment Manager


% gcloud deployment-manager deployments create mycluster --config condor-cluster.yaml

% gcloud compute ssh condor-submit
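
The contents of condor-cluster.yaml are not shown in the slides. Deployment Manager templates can be written in Python, where the service calls a `GenerateConfig(context)` function and creates the resources it returns. The sketch below shows roughly what a "template driven" worker pool could look like; the property names (`zone`, `machineType`, `workerCount`, `sourceImage`) and the `-worker-N` naming are assumptions made for illustration, not the configuration used in the talk.

```python
"""Hypothetical Deployment Manager template: a pool of preemptible workers.

The property names (zone, machineType, workerCount, sourceImage) are
illustrative assumptions; they are not taken from condor-cluster.yaml.
"""


def GenerateConfig(context):
    """Entry point called by Deployment Manager; returns resources to create."""
    props = context.properties
    deployment = context.env['deployment']
    zone = props['zone']

    resources = []
    for i in range(props['workerCount']):
        resources.append({
            'name': '%s-worker-%d' % (deployment, i),
            'type': 'compute.v1.instance',
            'properties': {
                'zone': zone,
                'machineType': 'zones/%s/machineTypes/%s'
                               % (zone, props['machineType']),
                'disks': [{
                    'boot': True,
                    'autoDelete': True,
                    'initializeParams': {'sourceImage': props['sourceImage']},
                }],
                'networkInterfaces': [{'network': 'global/networks/default'}],
                # Request preemptible capacity; such VMs cannot auto-restart.
                'scheduling': {'preemptible': True, 'automaticRestart': False},
            },
        })
    return {'resources': resources}
```

Because the pool size is just a property, the same template can be deployed repeatedly with different values, which is the point of the "repeatable" and "declarative" bullets.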


Cores from Google


Not interested in Infrastructure? Just want to run jobs and get results...



MIT Research w/ VMs

Products used: Google Compute Engine, Cloud Storage, Cloud Datastore

220,000 cores on preemptible VMs

2,250 32-core instances, 60 CPU-years of computation in a single afternoon

Answers in hours vs. months

580,000 cores


Broad FireCloud: WDL, Cromwell and Google Genomics

WDL (Workflow Description Language): an external DSL used by computational biologists to express analytical pipelines

Cromwell: a scalable, robust engine for executing WDL against pluggable backends including local, Docker, Grid Engine or …

Google Genomics Pipelines API: co-developed by Broad and Google Genomics, a scalable Docker-as-a-Service with data scheduling


Pipeline definition

{
  "name": "samtools index",
  "description": "Run samtools index to generate a BAM index file",
  "inputParameters": [
    {
      "name": "inputFile",
      "localCopy": {
        "disk": "data",
        "path": "input.bam"
      }
    }
  ],
  "outputParameters": [
    {
      "name": "outputFile",
      "localCopy": {
        "disk": "data",
        "path": "output.bam.bai"
      }
    }
  ],
  "resources": {
    "minimumCpuCores": 1,
    "minimumRamGb": 1,
    "disks": [{
      "name": "data",
      "type": "PERSISTENT_HDD",
      "sizeGb": 200,
      "mountPoint": "/mnt/data"
    }]
  },
  "docker": {
    "imageName": "quay.io/cancercollaboratory/dockstore-tool-samtools-index",
    "cmd": "samtools index /mnt/data/input.bam /mnt/data/output.bam.bai"
  }
}
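
A pipeline definition like this one is easy to get subtly wrong by hand, so a quick local check before submitting can save a failed run. The sketch below is one possible check, not part of any Google tooling: `json.loads` rejects trailing or missing commas, and the hypothetical helper `undeclared_disks` reports parameters whose `localCopy.disk` is not declared under `resources.disks`. It looks at both `inputParameters` and `outputParameters`. The `scratch` disk in the sample is deliberately wrong to show a finding.

```python
import json


def undeclared_disks(pipeline):
    """Return (parameter name, disk name) pairs whose disk is not declared."""
    disks = pipeline.get("resources", {}).get("disks", [])
    declared = {disk["name"] for disk in disks}
    problems = []
    for key in ("inputParameters", "outputParameters"):
        for param in pipeline.get(key, []):
            disk = param.get("localCopy", {}).get("disk")
            if disk is not None and disk not in declared:
                problems.append((param["name"], disk))
    return problems


# Trimmed-down definition; "scratch" is an intentional mistake for the demo.
definition = json.loads("""
{
  "inputParameters": [
    {"name": "inputFile",
     "localCopy": {"disk": "data", "path": "input.bam"}}
  ],
  "outputParameters": [
    {"name": "outputFile",
     "localCopy": {"disk": "scratch", "path": "output.bam.bai"}}
  ],
  "resources": {
    "disks": [{"name": "data", "type": "PERSISTENT_HDD", "sizeGb": 200}]
  }
}
""")

print(undeclared_disks(definition))
```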


Create, run, monitor, and kill pipelines

Create
$ gcloud alpha genomics pipelines create --pipeline-json-file samtools_index.json
Created samtools index, id: PIPELINE-ID

Run
$ gcloud alpha genomics pipelines run --pipeline_id PIPELINE-ID \
    --logging gs://YOUR-BUCKET/YOUR-DIRECTORY/logs \
    --inputs inputFile=gs://genomics-public-data/gatk-examples/example1/NA12878_chr22.bam \
    --outputs outputFile=gs://YOUR-BUCKET/YOUR-DIRECTORY/output/NA12878_chr22.bam.bai
Running: operations/OPERATION-ID

Status
$ gcloud alpha genomics operations describe OPERATION-ID

Kill
$ gcloud alpha genomics operations cancel OPERATION-ID


dsub (Google Genomics Pipelines)


Task Tailored Resources


Preemptible VM Instances

● What Preemptible VMs are
  ○ Up to 80% cheaper than regular VMs (~$0.01 per core hour)
  ○ Very easy to use: just flip one switch in the UI, API or command line
  ○ Many of our biggest customers run huge clusters (10k+ cores) with great success and savings

● Things to keep in mind
  ○ Same great disk, OS images and network
  ○ Google Compute Engine can preempt (i.e. shut down and take away) the VM with 30 seconds of notice
  ○ Maximum 24 hours of uptime
  ○ No SLAs or guarantees of any kind, but we historically see preemption rates of 5-15%


Intel Skylake

● Significant “per core” performance improvements

● Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
  ○ 2x flops/second

● Accelerated IO with Intel® Omni-Path Architecture (Fabric)

● Integrated Intel® QuickAssist Technology (crypto & compression offload)

● Intel® Resource Director Technology (Intel® RDT) for Efficiency & TCO


Hardware Accelerated

● Available Today: NVIDIA K80 GPU, P100s

● Coming Soon: Tensor Processing Unit (TPU)

● Custom ASIC built and optimized for TensorFlow

● Used in production at Google for over 16 months

● 7 years ahead of GPU performance per watt


Tensor Processing Units (TPU): 180 TFLOPS


Where do I store data for my batch jobs?

● Cloud Storage (Object)
  ○ Good for: input & output, binary or object data (BLOB), long term storage
  ○ Such as: images, videos, etc.

● Cloud Datastore (NoSQL)
  ○ Good for: hierarchical metadata

● Cloud SQL (Relational)
  ○ Good for: metadata & other related data

● Cloud Bigtable (NoSQL)
  ○ Good for: heavy read + write, events

● BigQuery (Warehouse)
  ○ Good for: data warehouse, analytics, dashboards

● Persistent Disk (GCE) (Block)
  ○ Good for: local VM file storage & work space


Block storage
Reliable, high-performance block storage for any GCE VM instance

● Local SSD: Fastest, Attached, Ephemeral
  ○ Target scenarios: high-performance scratch space and frequently accessed data; excellent for scientific workloads, especially when combined with fast compute VMs like GPU instances
  ○ Features: ephemeral storage; highest performance ($0.218 GB); IOPS: 680k read / 360k write
  ○ Encryption; 3 TB (375 GB per partition, up to 8 partitions)

● Persistent Disk: SSD: Fast, Persistent, Durable, Remote
  ○ Target scenarios: latency-sensitive applications and files; high-performance database and enterprise applications; databases
  ○ Features: persistent storage; performance sensitive ($0.17 GB); IOPS: up to 40k read / 30k write

● Persistent Disk: HDD: Cheapest, Persistent, Durable, Remote
  ○ Target scenarios: large data processing workloads; latency-insensitive tasks with lots of data, such as genomics processing and video transcoding in GCE
  ○ Features: persistent storage; cost sensitive ($.04 GB); IOPS: 3k read / 15k write

● Persistent Disk: Encryption, Snapshots; 64 TB; disk size sets performance (attach larger VMs for max SSD performance)


GCS: Object/Blob store

● Google Cloud Storage is a scalable object storage service suitable for all kinds of unstructured data

● Cloud Storage vs Persistent Disk:
  ○ Scales to exabytes
  ○ Accessible from anywhere; REST interface
  ○ Higher latency than PD
  ○ Write semantics include insert and overwrite file only
  ○ Offers versioning
  ○ Cheaper: put your data here until you need it

● Lots of guidelines on picking storage on our site


Human Genetics on Google Cloud Platform

Jonathon LeFaive

University of Michigan

Center for Statistical Genetics


Human Genetics, Sample Sizes over Time


TOPMed Sequencing as of April 25, 2017
http://nhlbi.sph.umich.edu/

● 71,713 genomes
  ○ 70,497 pass quality checks (98.3%)
  ○ 823 flagged for low coverage (1.2%)
  ○ 393 fail quality checks (0.5%)

● Mean depth: 38.3x

● Genome covered: 98.7%

● Contamination: 0.27%

9 x 10^15 sequenced bases


9 x 10^15 sequenced bases

On the same scale as the number of grains of sand in a small beach

Image: Wikimedia Commons


What Does Alignment Mean?

Image: Wikimedia Commons


Compute Workload

● 70,000 genomes

● ~20 GiB per genome

● 456 core hours per genome


1.3 PiB
Highly Compressed Input Data


3,644
Core Years
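
Both headline numbers follow from the per-genome figures on the Compute Workload slide. A short check of the arithmetic, assuming a 365-day year for the core-year conversion (the slides do not say which convention was used):

```python
GENOMES = 70_000
GIB_PER_GENOME = 20            # "~20 GiB per genome"
CORE_HOURS_PER_GENOME = 456
HOURS_PER_YEAR = 24 * 365      # assumption: non-leap year

total_gib = GENOMES * GIB_PER_GENOME
total_pib = total_gib / 1024 ** 2          # GiB -> TiB -> PiB

core_hours = GENOMES * CORE_HOURS_PER_GENOME
core_years = core_hours / HOURS_PER_YEAR

print(f"input data: {total_pib:.1f} PiB")          # 1.3 PiB
print(f"compute:    {core_years:.0f} core years")  # 3644 core years
```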


Local Cluster Limitations

● Not enough CPU cores

● NFS saturation

● Shared resources


GCE Price Table

Machine Type    vCPUs  Memory   Price (USD/hour)  Preemptible price (USD/hour)
n1-standard-1   1      3.75 GB  $0.0475           $0.0100
n1-standard-2   2      7.5 GB   $0.0950           $0.0200
n1-standard-4   4      15 GB    $0.1900           $0.0400
n1-standard-8   8      30 GB    $0.3800           $0.0800

PVMs are approximately ⅕ the cost.


$0.0475 - $0.0100 = $0.0375 saved per core hour

$0.0375 x 456 hr = $17.10

Savings of $17 per genome.


$17.10 x 70,000 genomes = $1.2m

Total savings of $1.2m
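
The savings figures can be reproduced from the n1-standard-1 row of the GCE price table. The sketch below uses `Decimal` so the dollar amounts stay exact; it assumes every core hour runs at the per-core preemptible rate and ignores work repeated after a preemption, which the slides do not quantify.

```python
from decimal import Decimal

ON_DEMAND = Decimal("0.0475")      # n1-standard-1, USD per hour
PREEMPTIBLE = Decimal("0.0100")    # same machine type, preemptible
CORE_HOURS_PER_GENOME = 456
GENOMES = 70_000

saved_per_core_hour = ON_DEMAND - PREEMPTIBLE            # 0.0375
saved_per_genome = saved_per_core_hour * CORE_HOURS_PER_GENOME
total_saved = saved_per_genome * GENOMES
price_ratio = PREEMPTIBLE / ON_DEMAND                    # roughly one fifth

print("saved per genome:", saved_per_genome)
print("saved in total:  ", total_saved)
print("price ratio:     ", round(price_ratio, 3))
```

The total comes to $1,197,000, which the slide rounds to $1.2m.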


PVMs are Important … so how do we utilize them?

● Change nothing

● Process checkpointing (CRIU, BLCR, etc.)

● Linux “suspend to disk” (swsusp)

● Proprietary solution


Pipeline Flow

● Pre-process step outputs chunked files

● Main processing runs on chunked files using preemptible instances

● Chunks are merged before post-processing
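
Chunking is what makes preemption cheap: losing a VM costs one chunk of work rather than a whole genome. The flow above can be sketched as follows. This is an in-process illustration only, with preemption simulated by a hypothetical `Preempted` exception; the real pipeline runs chunks on separate VMs.

```python
class Preempted(Exception):
    """Raised by a worker when its VM is taken away (simulated here)."""


def split_into_chunks(records, chunk_size):
    """Pre-process step: cut the input into fixed-size chunks."""
    return [records[i:i + chunk_size]
            for i in range(0, len(records), chunk_size)]


def run_pipeline(records, chunk_size, process_chunk, max_attempts=5):
    """Process each chunk, redoing only the chunks that were preempted."""
    chunks = split_into_chunks(records, chunk_size)
    results = [None] * len(chunks)
    for index, chunk in enumerate(chunks):
        for _attempt in range(max_attempts):
            try:
                results[index] = process_chunk(chunk)
                break
            except Preempted:
                continue  # retry this chunk; finished chunks are untouched
        else:
            raise RuntimeError("chunk %d never completed" % index)
    # Merge step: concatenate per-chunk results in the original order.
    return [item for part in results for item in part]
```

Smaller chunks waste less work per preemption but add more merge and scheduling overhead, so the chunk size is a tuning knob rather than a fixed value.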


Application Design

● Checks out job from master database

● Downloads input from bucket

● Processes input

● Uploads output to bucket

● Uploads log to bucket

● Checks in job to master database
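
The worker loop described above can be sketched in a few lines. Everything here is a stand-in chosen so the example runs on its own: an in-memory SQLite table plays the master database, a plain dict plays the bucket, and the table layout and state names are assumptions. A real deployment with many workers would also need the checkout to be atomic across machines, which this single-connection sketch does not attempt.

```python
import sqlite3


def open_job_db():
    """Create a throwaway job table (stand-in for the master database)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, input TEXT, state TEXT)")
    return db


def check_out(db):
    """Claim one pending job, or return None if nothing is left."""
    with db:  # commits on success, rolls back on error
        row = db.execute(
            "SELECT id, input FROM jobs WHERE state = 'pending' "
            "ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None
        db.execute("UPDATE jobs SET state = 'running' WHERE id = ?", (row[0],))
    return row


def check_in(db, job_id):
    """Mark a job as finished."""
    with db:
        db.execute("UPDATE jobs SET state = 'done' WHERE id = ?", (job_id,))


def work(db, bucket, process):
    """Run jobs until none are pending; `bucket` is a dict standing in for GCS."""
    while True:
        job = check_out(db)
        if job is None:
            return
        job_id, input_name = job
        data = bucket[input_name]                        # download input
        bucket["output/" + input_name] = process(data)   # upload output
        bucket["logs/" + input_name] = "ok"              # upload log
        check_in(db, job_id)                             # check in last
```

Checking in only after the output and log are uploaded means a preempted worker leaves its job visibly unfinished, so it can be handed out again.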


Advice for New Users

● Use Google Cloud Storage

● Create a custom network

● Exponential backoff retries

● Cater to the majority when provisioning resources
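
For the retry advice, a minimal sketch of exponential backoff with jitter is shown below. The function name, defaults and the choice of "full jitter" (sleeping a random time up to the current cap) are illustrative assumptions, not what the TOPMed pipeline used; the `sleep` parameter exists so the delay can be replaced in tests.

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=32.0,
                       sleep=time.sleep, retry_on=(Exception,)):
    """Call `call()` until it succeeds, doubling the wait cap after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Random wait up to the cap, so many workers do not retry in step.
            sleep(random.uniform(0, cap))
```

The jitter matters at scale: thousands of workers that fail together and retry on the same schedule would otherwise hit the storage or database service in synchronized waves.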


What Catching Up Looks Like


Thank You!

Acknowledgments: Chris Scheller, Hyun M. Kang, Goncalo Abecasis, GCP Team