Scientific Computing on AWS
Why do researchers love using AWS?
• Time to Science: access research infrastructure in minutes
• Low Cost: pay-as-you-go pricing
• Elastic: easily add or remove capacity
• Globally Accessible: easily collaborate with researchers around the world
• Secure: a collection of tools to protect data and privacy
• Scalable: access to effectively limitless capacity
Cloud unlocks HPC for a broad range of use cases:
• High-energy physics simulations
• Weather and climate modeling and prediction
• Analysis of fluids, structures, and materials
• Thermal and electromagnetic simulations
• Genomics, proteomics, and molecular dynamics
• Seismic and reservoir simulations
• 3D rendering and visualizations
• Deep learning training and inference
AWS for High Performance Computing…
AWS Regions
HPC became an optimization problem
A Top500-class supercomputer
For less than $100/hr
Ready in 100 seconds
Time traveling workloads
[Figure: two usage profiles with the same total core-hours, one using many CPUs for a short time and one using few CPUs for a long time]
Wall clock time: 1 hour vs. 1 week. Cost: equal. With pay-per-use pricing, 168 instances for 1 hour costs the same as 1 instance for 168 hours, so the week-long run can finish in an hour for the same money.
The Solution
When you only pay for what you use …
• If you're only able to use your compute, say, 30% of the time, you only pay for that time.
… you have options:
1. Pocket the savings: buy chocolate, buy a spectrometer, or hire a research assistant.
2. Go faster: use 3x the cores to run your jobs at 3x the speed.
3. Go large: do 3x the science, or consume 3x the data.
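A back-of-the-envelope sketch of that trade-off, assuming a purely hypothetical price of $0.10 per core-hour:

$ echo "100 * 30 * 0.10" | bc   # 100 cores for 30 hours: $300
$ echo "300 * 10 * 0.10" | bc   # 300 cores for 10 hours: the same $300, 3x sooner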
Characterising HPC
• Loosely Coupled: embarrassingly parallel, elastic, batch workloads
• Tightly Coupled: interconnected jobs, network sensitivity, job-specific algorithms
Mapping HPC Use-Cases
Use cases map along two axes:
• Coupling: Clustered (Tightly Coupled) vs. Distributed/Grid (Loosely Coupled)
• Data intensity: Data Light (minimal requirements for high-performance storage) vs. Data Heavy (benefits from access to high-performance storage)
Workloads across this map include: fluid dynamics, weather forecasting, materials simulations, crash simulations, risk simulations, molecular modeling, contextual search, logistics simulations, animation and VFX, semiconductor verification, image processing/GIS, genomics, seismic processing, metagenomics, astrophysics, and deep learning.
Cluster HPC and Grid HPC on the Cloud
• Cluster HPC: tightly coupled, latency-sensitive applications. Use larger EC2 compute instances, placement groups, and enhanced networking.
• Grid HPC: loosely coupled, pleasingly parallel workloads. Use a variety of EC2 instances, multiple AZs, Spot, Auto Scaling, and Amazon SQS.
• Grids of Clusters: use a grid strategy on the cloud to run a group of parallel, individually clustered HPC jobs.
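As a sketch of the grid pattern, Amazon SQS can act as the work queue that feeds loosely coupled workers; the queue name and message body below are illustrative:

$ aws sqs create-queue --queue-name hpc-jobs
$ aws sqs send-message --queue-url "$QUEUE_URL" --message-body '{"job": "sim-001"}'
$ aws sqs receive-message --queue-url "$QUEUE_URL"   # each worker polls for its next job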
HPC Job Queues Are A Problem!
Conflicting goals:
• HPC users seek the fastest possible time-to-results
• The IT support team seeks the highest possible utilization
Result:
• The job queue becomes the capacity buffer, and there is little or no scalability
• Users are frustrated and run fewer jobs
• Money is being saved, but for the wrong reasons!
Instead, run multiple clusters at the same time, on-demand.
Match the Architectures to the Jobs
What are AWS HPC customers building?
Hybrid solutions:
• Joining on-premises HPC clusters to elastic clusters on AWS
• Investing in both on- and off-cloud HPC systems
• Picking which workload runs in which environment
What to send to the cloud?
• Many HPC clusters aren't running HPC workloads!
• Sending jobs that are 16-way or smaller to AWS is a no-brainer
• Free up your existing cluster(s) for tightly coupled HPC jobs
EC2 Instance Types for HPC
Broad set of compute instance types:
• General purpose: M3, M4
• Compute optimized: C3, C4, C5
• Memory optimized: R3, R4, X1
• Storage and I/O optimized: D2, I2, I3, HS1
• GPU or FPGA enabled: G2, P2, F1
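The AWS CLI can report the exact vCPU, memory, and network specs for any of these families; a quick sketch:

$ aws ec2 describe-instance-types --instance-types m4.16xlarge \
    --query 'InstanceTypes[].{vCPUs:VCpuInfo.DefaultVCpus,MemMiB:MemoryInfo.SizeInMiB,Net:NetworkInfo.NetworkPerformance}'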
Diving Deep: CPU-Based Instances for HPC (m4.16xlarge)
Intel CPUs:
• Up to 2.9 GHz, Turbo enabled up to 3.6 GHz
• Intel® Advanced Vector Extensions (Intel® AVX2)
• Control over C-states, P-states, and Hyper-Threading
• C4 and M4 are the most common instance types for HPC: up to 64 vCPUs (32 physical cores)
• R3 and X1 for higher-memory applications: up to 128 vCPUs (64 physical cores), up to 2 TB RAM
• Proprietary network delivering up to 20 Gbps
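One way to exercise that control is to take the hyper-thread sibling vCPUs offline inside the instance, leaving one thread per physical core. A minimal sketch, assuming vCPUs 32-63 are the siblings as on an m4.16xlarge (check thread_siblings_list before relying on that):

$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # e.g. "0,32"
$ for n in $(seq 32 63); do echo 0 | sudo tee /sys/devices/system/cpu/cpu$n/online; done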
GPU and FPGA Instances
P2: GPU instance
• Up to 16 NVIDIA GK210 GPUs (8 × K80 cards) in a single instance, with peer-to-peer PCIe GPU interconnect
• Supports a wide variety of use cases including deep learning, HPC simulations, financial computing, and batch rendering
F1: FPGA instance
• Up to 8 Xilinx Virtex® UltraScale+™ VU9P FPGAs in a single instance, with peer-to-peer PCIe and bidirectional ring interconnects
• Designed for hardware-accelerated applications including financial computing, genomics, accelerated search, and image processing
P2 GPU Instances
Instance size   GPUs   GPU peer-to-peer   vCPUs   Memory (GiB)   Network bandwidth*
p2.xlarge          1   -                      4             61   1.25 Gbps
p2.8xlarge         8   Yes                   32            488   10 Gbps
p2.16xlarge       16   Yes                   64            732   20 Gbps
*In a placement group
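To verify the peer-to-peer wiring from inside a p2 instance, nvidia-smi can print the GPU interconnect matrix:

$ nvidia-smi topo -m   # matrix of link types (PIX, PHB, ...) between the GPUs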
Deploying HPC on AWS
Traditional HPC Stack
• Shared file storage
• HPC cluster
• License managers and cluster head nodes with job schedulers
• 3D graphics remote desktop servers
• Remote graphics workstations
• Storage cache
• Remote sites
• Remote backup
Migrating HPC to AWS
• Shared file storage
• Cloud-based, scaling HPC cluster on EC2
• License managers and cluster head nodes with job schedulers
• 3D graphics virtual workstations
• AWS Direct Connect to on-premises IT resources
• Thin or zero clients (no local data)
• Storage cache: Amazon S3 and Amazon Glacier
cfncluster: provision an HPC cluster in minutes
#cfncluster
https://github.com/awslabs/cfncluster
cfncluster is a sample code framework that deploys and maintains clusters on AWS. It is reasonably agnostic to what the cluster is for and can easily be extended to support different frameworks. The CLI is stateless; everything is done using CloudFormation or resources within AWS.
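Getting started is two commands; a minimal sketch (cfncluster configure interactively writes ~/.cfncluster/config):

$ sudo pip install cfncluster
$ cfncluster configure   # prompts for region, key pair, VPC, and subnet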
[Architecture: a head node instance plus an Auto Scaling group of compute node instances on a 10G network inside a Virtual Private Cloud, all mounting a common /shared volume. Ready in about 10 minutes.]
Head instance:
• 2 or more cores (as needed)
• CentOS 6.x
• OpenMPI, gcc, etc.
• Choice of scheduler: Torque, SGE, OpenLava, or Slurm
Compute instances:
• 2 or more cores (as needed)
• CentOS 6.x
• Auto Scaling group driven by scheduler queue length; can start with 0 (zero) nodes and only scale when there are jobs
It's a real cluster
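It behaves like any on-premises cluster: standard scheduler submissions work unmodified. A minimal sketch for the Torque scheduler (job name and node counts are illustrative):

$ cat hello.sh
#!/bin/bash
#PBS -N hello
#PBS -l nodes=2:ppn=2
mpirun hostname   # OpenMPI picks up the allocated nodes from the scheduler
$ qsub hello.sh   # queued jobs drive the Auto Scaling group to add compute nodes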
Configuration is really simple … there's not a great deal involved in getting a cluster up and running. The config file below will do it. We'll spend the next ten minutes or so showing you how to assemble all the bits of data you need.
[aws]
# Region in which the cluster's CloudFormation stack is launched
aws_region_name = us-east-1

[cluster default]
# Which [vpc ...] section to use, and the EC2 key pair for SSH access
vpc_settings = public
key_name = boof-cluster

[vpc public]
# Subnet for the head (master) node, and the VPC to launch into
master_subnet_id = subnet-fe83e3c4
vpc_id = vpc-7cf12419

[global]
update_check = true
sanity_check = true
cluster_template = default
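With that file saved as ~/.cfncluster/config, one command launches the stack (the cluster name is illustrative):

$ cfncluster create mycluster
$ cfncluster list             # show running clusters
$ cfncluster delete mycluster # tear everything down when finished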
Up and running in about 10 minutes.
HPC Partner on AWS: Alces Flight
www.alces-flight.com
Log in to the master node!
• Use "alces" as the login (it should match what you input when creating the cluster)
• No password needed (uses your key pair)
Ready to go!
• "alces gridware list" shows the available software
Install an application:
• Search for an application using "alces gridware search …"
• Install an application using "alces gridware install …"
• Environment modules are updated when an application is installed
Modules after installing an application
• To run an application, don't forget to load its module first!
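A minimal end-to-end session, using GROMACS as an illustrative package (exact gridware package names and versions may differ):

$ alces gridware search gromacs
$ alces gridware install apps/gromacs
$ module avail            # the new module now appears in the list
$ module load apps/gromacs
$ gmx --version           # the application is now on the PATH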
Questions and Answers