Scientific Computing on AWS
Why do researchers love using AWS?
• Time to Science: access research infrastructure in minutes
• Low Cost: pay-as-you-go pricing
• Elastic: easily add or remove capacity
• Globally Accessible: easily collaborate with researchers around the world
• Secure: a collection of tools to protect data and privacy
• Scalable: access to effectively limitless capacity
Cloud unlocks HPC for a broad range of use cases:
• High-energy physics simulations
• Weather and climate modeling and prediction
• Analysis of fluids, structures, and materials
• Thermal and electromagnetic simulations
• Genomics, proteomics, and molecular dynamics
• Seismic and reservoir simulations
• 3D rendering and visualizations
• Deep learning training and inference
AWS for High Performance Computing…
AWS Regions
HPC became an optimization problem
A Top500-class supercomputer
For less than $100/hr
Ready in 100 seconds
Time traveling workloads
[Figure: two usage profiles with the same total core-hours, one using many CPUs for a short time and one using few CPUs for a long time]
Wall clock time: 1 hour vs. 1 week. Cost: equal. With pay-per-use pricing, 168 instances for 1 hour costs the same as 1 instance for 168 hours, so the week-long run can finish in an hour for the same money.
The Solution
When you only pay for what you use …
• If you're only able to use your compute, say, 30% of the time, you only pay for that time.
… you have options:
1. Pocket the savings: buy chocolate, buy a spectrometer, or hire a research assistant.
2. Go faster: use 3x the cores to run your jobs at 3x the speed.
3. Go large: do 3x the science, or consume 3x the data.
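A back-of-the-envelope sketch of that trade-off, assuming a purely hypothetical price of $0.10 per core-hour:

$ echo "100 * 30 * 0.10" | bc   # 100 cores for 30 hours: $300
$ echo "300 * 10 * 0.10" | bc   # 300 cores for 10 hours: the same $300, 3x sooner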
Characterising HPC
• Loosely Coupled: embarrassingly parallel, elastic, batch workloads
• Tightly Coupled: interconnected jobs, network sensitivity, job-specific algorithms
Mapping HPC Use-Cases
Use cases map along two axes:
• Coupling: Clustered (Tightly Coupled) vs. Distributed/Grid (Loosely Coupled)
• Data intensity: Data Light (minimal requirements for high-performance storage) vs. Data Heavy (benefits from access to high-performance storage)
Workloads across this map include: fluid dynamics, weather forecasting, materials simulations, crash simulations, risk simulations, molecular modeling, contextual search, logistics simulations, animation and VFX, semiconductor verification, image processing/GIS, genomics, seismic processing, metagenomics, astrophysics, and deep learning.
Cluster HPC and Grid HPC on the Cloud
• Cluster HPC: tightly coupled, latency-sensitive applications. Use larger EC2 compute instances, placement groups, and enhanced networking.
• Grid HPC: loosely coupled, pleasingly parallel workloads. Use a variety of EC2 instances, multiple AZs, Spot, Auto Scaling, and Amazon SQS.
• Grids of Clusters: use a grid strategy on the cloud to run a group of parallel, individually clustered HPC jobs.
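As a sketch of the grid pattern, Amazon SQS can act as the work queue that feeds loosely coupled workers; the queue name and message body below are illustrative:

$ aws sqs create-queue --queue-name hpc-jobs
$ aws sqs send-message --queue-url "$QUEUE_URL" --message-body '{"job": "sim-001"}'
$ aws sqs receive-message --queue-url "$QUEUE_URL"   # each worker polls for its next job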
HPC Job Queues Are A Problem!
Conflicting goals:
• HPC users seek the fastest possible time-to-results
• The IT support team seeks the highest possible utilization
Result:
• The job queue becomes the capacity buffer, and there is little or no scalability
• Users are frustrated and run fewer jobs
• Money is being saved, but for the wrong reasons!
Instead, run multiple clusters at the same time, on-demand.
Match the Architectures to the Jobs
What are AWS HPC customers building?
Hybrid solutions:
• Joining on-premises HPC clusters to elastic clusters on AWS
• Investing in both on- and off-cloud HPC systems
• Picking which workload runs in which environment
What to send to the cloud?
• Many HPC clusters aren't running HPC workloads!
• Sending jobs that are 16-way or smaller to AWS is a no-brainer
• Free up your existing cluster(s) for tightly coupled HPC jobs
EC2 Instance Types for HPC
Broad set of compute instance types:
• General purpose: M3, M4
• Compute optimized: C3, C4, C5
• Memory optimized: R3, R4, X1
• Storage and I/O optimized: D2, I2, I3, HS1
• GPU or FPGA enabled: G2, P2, F1
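The AWS CLI can report the exact vCPU, memory, and network specs for any of these families; a quick sketch:

$ aws ec2 describe-instance-types --instance-types m4.16xlarge \
    --query 'InstanceTypes[].{vCPUs:VCpuInfo.DefaultVCpus,MemMiB:MemoryInfo.SizeInMiB,Net:NetworkInfo.NetworkPerformance}'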
Diving Deep: CPU-Based Instances for HPC (m4.16xlarge)
Intel CPUs:
• Up to 2.9 GHz, Turbo enabled up to 3.6 GHz
• Intel® Advanced Vector Extensions (Intel® AVX2)
• Control over C-states, P-states, and Hyper-Threading
• C4 and M4 are the most common instance types for HPC: up to 64 vCPUs (32 physical cores)
• R3 and X1 for higher-memory applications: up to 128 vCPUs (64 physical cores), up to 2 TB RAM
• Proprietary network delivering up to 20 Gbps
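One way to exercise that control is to take the hyper-thread sibling vCPUs offline inside the instance, leaving one thread per physical core. A minimal sketch, assuming vCPUs 32-63 are the siblings as on an m4.16xlarge (check thread_siblings_list before relying on that):

$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # e.g. "0,32"
$ for n in $(seq 32 63); do echo 0 | sudo tee /sys/devices/system/cpu/cpu$n/online; done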
GPU and FPGA Instances
P2: GPU instance
• Up to 16 NVIDIA GK210 GPUs (8 × K80 cards) in a single instance, with peer-to-peer PCIe GPU interconnect
• Supports a wide variety of use cases including deep learning, HPC simulations, financial computing, and batch rendering
F1: FPGA instance
• Up to 8 Xilinx Virtex® UltraScale+™ VU9P FPGAs in a single instance, with peer-to-peer PCIe and bidirectional ring interconnects
• Designed for hardware-accelerated applications including financial computing, genomics, accelerated search, and image processing
P2 GPU Instances
Instance size   GPUs   GPU peer-to-peer   vCPUs   Memory (GiB)   Network bandwidth*
p2.xlarge          1   -                      4             61   1.25 Gbps
p2.8xlarge         8   Yes                   32            488   10 Gbps
p2.16xlarge       16   Yes                   64            732   20 Gbps
*In a placement group
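To verify the peer-to-peer wiring from inside a p2 instance, nvidia-smi can print the GPU interconnect matrix:

$ nvidia-smi topo -m   # matrix of link types (PIX, PHB, ...) between the GPUs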
Deploying HPC on AWS
Traditional HPC Stack
• Shared file storage
• HPC cluster
• License managers and cluster head nodes with job schedulers
• 3D graphics remote desktop servers
• Remote graphics workstations
• Storage cache
• Remote sites
• Remote backup
Migrating HPC to AWS
• Shared file storage
• Cloud-based, scaling HPC cluster on EC2
• License managers and cluster head nodes with job schedulers
• 3D graphics virtual workstations
• AWS Direct Connect to on-premises IT resources
• Thin or zero clients (no local data)
• Storage cache: Amazon S3 and Amazon Glacier
cfncluster: provision an HPC cluster in minutes
#cfncluster
https://github.com/awslabs/cfncluster
cfncluster is a sample code framework that deploys and maintains clusters on AWS. It is reasonably agnostic to what the cluster is for and can easily be extended to support different frameworks. The CLI is stateless; everything is done using CloudFormation or resources within AWS.
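Getting started is two commands; a minimal sketch (cfncluster configure interactively writes ~/.cfncluster/config):

$ sudo pip install cfncluster
$ cfncluster configure   # prompts for region, key pair, VPC, and subnet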
[Architecture: a head node instance plus an Auto Scaling group of compute node instances on a 10G network inside a Virtual Private Cloud, all mounting a common /shared volume. Ready in about 10 minutes.]
Head instance:
• 2 or more cores (as needed)
• CentOS 6.x
• OpenMPI, gcc, etc.
• Choice of scheduler: Torque, SGE, OpenLava, or Slurm
Compute instances:
• 2 or more cores (as needed)
• CentOS 6.x
• Auto Scaling group driven by scheduler queue length; can start with 0 (zero) nodes and only scale when there are jobs
It's a real cluster
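It behaves like any on-premises cluster: standard scheduler submissions work unmodified. A minimal sketch for the Torque scheduler (job name and node counts are illustrative):

$ cat hello.sh
#!/bin/bash
#PBS -N hello
#PBS -l nodes=2:ppn=2
mpirun hostname   # OpenMPI picks up the allocated nodes from the scheduler
$ qsub hello.sh   # queued jobs drive the Auto Scaling group to add compute nodes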
Configuration is really simple … there's not a great deal involved in getting a cluster up and running. The config file below will do it. We'll spend the next ten minutes or so showing you how to assemble all the bits of data you need.
[aws]
# Region in which the cluster's CloudFormation stack is launched
aws_region_name = us-east-1

[cluster default]
# Which [vpc ...] section to use, and the EC2 key pair for SSH access
vpc_settings = public
key_name = boof-cluster

[vpc public]
# Subnet for the head (master) node, and the VPC to launch into
master_subnet_id = subnet-fe83e3c4
vpc_id = vpc-7cf12419

[global]
update_check = true
sanity_check = true
cluster_template = default
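With that file saved as ~/.cfncluster/config, one command launches the stack (the cluster name is illustrative):

$ cfncluster create mycluster
$ cfncluster list             # show running clusters
$ cfncluster delete mycluster # tear everything down when finished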
Up and running in about 10 minutes.
HPC Partner on AWS: Alces Flight
www.alces-flight.com
Log in to the master node!
• Use "alces" as the login (it should match what you input when creating the cluster)
• No password needed (uses your key pair)
Ready to go!
• "alces gridware list" shows the available software
Install an application:
• Search for an application using "alces gridware search …"
• Install an application using "alces gridware install …"
• Environment modules are updated when an application is installed
Modules after installing an application
• To run an application, don't forget to load its module first!
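A minimal end-to-end session, using GROMACS as an illustrative package (exact gridware package names and versions may differ):

$ alces gridware search gromacs
$ alces gridware install apps/gromacs
$ module avail            # the new module now appears in the list
$ module load apps/gromacs
$ gmx --version           # the application is now on the PATH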
Questions and Answers