Evaluating Cloud Computing for HPC Applications Lavanya Ramakrishnan CRD & NERSC


Page 1: Evaluating Cloud Computing for HPC Applications

Lavanya Ramakrishnan, CRD & NERSC

Page 2: Magellan Research Agenda

•  What are the unique needs and features of a science cloud?
   –  NERSC Magellan User Survey
•  What applications can efficiently run on a cloud?
   –  Benchmarking cloud technologies (Hadoop, Eucalyptus) and platforms (Amazon EC2, Azure)
•  Are cloud computing programming models such as Hadoop effective for scientific applications?
   –  Experimentation with early applications
•  Can scientific applications use a data-as-a-service or software-as-a-service model?
•  What are the security implications of user-controlled cloud images?
•  Is it practical to deploy a single logical cloud across multiple DOE sites?
•  What is the cost and energy efficiency of clouds?

Page 3: Magellan User Survey

Respondents by Program Office:
•  Advanced Scientific Computing Research 17%
•  Biological and Environmental Research 9%
•  Basic Energy Sciences – Chemical Sciences 10%
•  Fusion Energy Sciences 10%
•  High Energy Physics 20%
•  Nuclear Physics 13%
•  Advanced Networking Initiative (ANI) Project 3%
•  Other 14%

Features of interest (chart, 0–90% of respondents):
•  Access to additional resources
•  Access to on-demand (commercial) paid resources closer to deadlines
•  Ability to control software environments specific to my application
•  Ability to share setup of software or experiments with collaborators
•  Ability to control groups/users
•  Exclusive access to the computing resources / ability to schedule independently of other groups/users
•  Easier to acquire/operate than a local cluster
•  Cost associativity (i.e., 10 CPUs for 1 hour cost the same as 2 CPUs for 5 hours)
•  MapReduce programming model / Hadoop
•  Hadoop File System
•  User interfaces / science gateways: use of clouds to host science gateways and/or access to cloud resources through science gateways

Page 4: Cloud Computing Services

•  Infrastructure as a Service (IaaS)
   –  Provides unlimited access to data storage and compute cycles
   –  e.g., Amazon EC2, Eucalyptus
•  Platform as a Service (PaaS)
   –  Delivery of a computing platform/software stack
   –  Containers/images for specific user groups
   –  e.g., Hadoop, Azure
•  Software as a Service (SaaS)
   –  Specific function provided for use across multiple user groups (e.g., science gateways)

Page 5: Magellan Software

[Figure: the Magellan cluster software stack – batch queues, virtual machines (Eucalyptus), Hadoop, private clusters, and science and storage gateways, with ANI connecting to public or remote clouds.]

Page 6: Amazon Web Services

•  Web-service API to IaaS offering
•  Uses Xen paravirtualization
   –  cluster compute instance type uses hardware-assisted virtualization
•  Non-persistent local disk in VM
•  Simple Storage Service (S3)
   –  scalable, persistent object store
•  Elastic Block Storage (EBS)
   –  persistent, block-level storage

Page 7: Eucalyptus

•  Open-source IaaS implementation
   –  API-compatible with Amazon AWS
   –  manages virtual machines
•  Walrus & Block Storage
   –  interfaces compatible with S3 & EBS
•  Available to users on the Magellan testbed
•  Private virtual clusters
   –  scripts to manage dynamic virtual clusters
   –  NFS/Torque, etc.

Page 8: Virtualization Impact

•  Platforms
   –  Amazon, Azure, Lawrencium (IT cluster)
   –  Magellan: IB, TCP over IB, TCP over Ethernet, VM
•  Workloads
   –  HPCC
   –  NERSC-6 benchmarks
   –  application pipelines: JGI, Supernova Factory
•  Metrics
   –  performance, cost, reliability, programmability

Page 9: NERSC-6 Benchmark Performance [1/2]

[Figure: runtime relative to Carver (0–18×) for GAMESS, GTC, IMPACT, fvCAM, and MAESTRO256 on Amazon EC2, Lawrencium, EC2-Beta, EC2-Beta-Opt, Franklin, and Carver.]

Page 10: NERSC-6 Benchmark Performance [2/2]

[Figure: runtime relative to Carver (0–60×) for MILC and PARATEC on Amazon EC2, Lawrencium, EC2-Beta, EC2-Beta-Opt, Franklin, and Carver.]

Page 11: Magellan: NERSC-6 Application Benchmarks

[Figure: percentage performance relative to native (0–100%) for GTC, PARATEC, and CAM under TCP over IB, TCP over Ethernet, and VM.]

Page 12: Magellan: HPCC

[Figure: two charts of performance relative to native – ping-pong latency and random-ring latency (0–12) under TCP over IB, TCP over Ethernet, and IB/VM; and DGEMM, ping-pong bandwidth, HPL, and PTRANS (0–100%) under TCP over IB, TCP over Ethernet, and VM.]

Page 13: Performance-Cost Tradeoffs

•  James Hamilton's cost model
   –  http://perspectives.mvdirona.com/2010/09/18/OverallDataCenterCosts.aspx
   –  expanded for HPC environments
•  Quantify the difference in cost between IB and Ethernet

Application Class                                      | Performance Increase / Cost Increase
Tightly coupled with IO (e.g., CAM)                    | 4.36
Tightly coupled with minimal IO (e.g., PARATEC, GTC)   | 1.2
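The ratio in the table can be read as speedup divided by relative cost. A minimal sketch of that arithmetic, using placeholder runtimes and prices for illustration – these are NOT the numbers behind the 4.36 and 1.2 figures:

```python
# How a performance-increase / cost-increase ratio like those in the
# table can be computed. All input values below are hypothetical.
def perf_cost_ratio(runtime_eth, runtime_ib, cost_eth, cost_ib):
    perf_increase = runtime_eth / runtime_ib  # speedup from using IB
    cost_increase = cost_ib / cost_eth        # extra cost of IB hardware
    return perf_increase / cost_increase

# Hypothetical tightly coupled, IO-heavy job: IB halves the runtime
# while raising the per-node-hour cost by 15%.
ratio = perf_cost_ratio(runtime_eth=10.0, runtime_ib=5.0,
                        cost_eth=1.00, cost_ib=1.15)
# A ratio above 1 means the performance gain outweighs the added cost.
```
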

Page 14: Nearby Supernova Factory

•  Tools to measure the expansion history of the Universe and explore the nature of Dark Energy
   –  largest-data-volume supernova search
•  Data pipeline
   –  custom data analysis codes, coordinated by Python scripts
   –  run on a standard Linux batch-queue cluster
•  Cloud provides
   –  control over OS versions
   –  root access and a shared "group" account
   –  immunity to externally enforced OS or architecture changes
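As a rough illustration of a Python-coordinated pipeline of the kind described above, a driver can run each analysis stage in order and stop on failure. The stage names and commands here are hypothetical stand-ins, not the Factory's actual analysis codes:

```python
# Hedged sketch of a pipeline coordinator: run each stage as a
# subprocess, collect its output, and abort if any stage fails.
import subprocess
import sys

# Hypothetical stages; each real stage would invoke an analysis binary.
STAGES = [
    ("extract",   [sys.executable, "-c", "print('extract ok')"]),
    ("calibrate", [sys.executable, "-c", "print('calibrate ok')"]),
]

def run_pipeline(stages):
    log = []
    for name, cmd in stages:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            raise RuntimeError(f"stage {name} failed: {proc.stderr}")
        log.append((name, proc.stdout.strip()))
    return log

log = run_pipeline(STAGES)
```
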

Page 15: Experiments on Amazon EC2

Input Data                          | Output Data
EBS via NFS                         | local storage to EBS via NFS
Staged to local storage from EBS    | local storage to EBS via NFS
EBS via NFS                         | EBS via NFS
Staged to local storage from EBS    | EBS via NFS
EBS via NFS                         | local storage to S3
Staged to local storage from S3     | local storage to S3

Page 16: Total Cost (Instance + Storage/month)

[Figure: total cost of an experiment (instance cost + storage cost/month, 0–180) for the EBS-A1, EBS-A2, EBS-B1, EBS-B2, S3-A, and S3-B configurations, broken down into data cost per month and instance cost at two scales (40 and 80).]

Page 17: Magellan Software

[Figure: the Magellan cluster software stack, as on Page 5 – batch queues, virtual machines (Eucalyptus), Hadoop, private clusters, science and storage gateways, and ANI connectivity to public or remote clouds.]

Page 18: Hadoop Stack

•  Open-source, reliable, scalable distributed computing
   –  implementation of MapReduce
   –  Hadoop Distributed File System (HDFS)

[Figure: the Hadoop stack – Core, Avro, MapReduce, HDFS, and ZooKeeper, with Pig, Chukwa, Hive, and HBase above. Source: Hadoop: The Definitive Guide]
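The MapReduce model at the heart of this stack can be sketched in a few lines of plain Python. This toy version runs in a single process; real Hadoop distributes the same three phases (map, shuffle, reduce) across nodes and stores the data in HDFS:

```python
# Minimal, single-process sketch of the MapReduce model (illustrative
# only -- not how Hadoop itself is invoked).
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key's grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count: the mapper emits (word, 1), the reducer sums.
def wc_mapper(line):
    for word in line.split():
        yield (word, 1)

def wc_reducer(word, counts):
    return sum(counts)

lines = ["hadoop maps then reduces", "hadoop scales"]
counts = reduce_phase(shuffle(map_phase(lines, wc_mapper)), wc_reducer)
# counts["hadoop"] == 2
```
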

Page 19: Hadoop for Science

•  Advantages of Hadoop
   –  transparent data replication, data-locality-aware scheduling
   –  fault-tolerance capabilities
•  Mode of operation
   –  use streaming to launch a script that calls the executable
   –  HDFS for input; a shared file system is needed for the binary and database
   –  input format: must handle multi-line inputs (BLAST sequences) and binary data (high-energy physics)
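The multi-line input issue can be made concrete with FASTA-style BLAST input: by default Hadoop streaming hands a script one line at a time, so something must regroup each ">" header with the sequence lines that follow it into a whole record. A minimal sketch of that regrouping (a production fix would typically be a custom InputFormat on the Java side):

```python
# Group FASTA-formatted lines into (header, sequence) records, the
# kind of record reassembly a line-oriented streaming job needs.
def read_fasta(lines):
    """Yield (header, sequence) records from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)   # emit the finished record
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)           # emit the final record

records = list(read_fasta([">gene1", "ACGT", "TTGA", ">gene2", "GGCC"]))
# records == [("gene1", "ACGTTTGA"), ("gene2", "GGCC")]
```
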

Page 20: Hadoop Benchmarking: Early Results [1/2]

•  Compare traditional parallel file systems to HDFS
   –  TeraGen and TeraSort to compare file system performance
   –  32 maps for TeraGen and 64 reduces for TeraSort over a terabyte of data

[Figure: time in seconds (0–6000) for TeraGen and TeraSort (64 reduces) on GPFS, HDFS, and Lustre.]

Page 21: Hadoop Benchmarking: Early Results [2/2]

•  TestDFSIO to understand concurrency at the default block size

[Figure: throughput in MB/s (0–70) versus number of concurrent writers (0–60) for 10 MB and 10 GB file sizes.]

Page 22: IMG Systems: Genome & Metagenome Data Flow

[Figure: data flow among the IMG systems. Recoverable figures: 5,115 genomes / 6.5 million genes; +~350–500 genomes (~0.5–1 million genes) every 4 months; +330 genomes (158 GEBA), 8.2 million genes, on demand; 65 samples from 21 studies, IMG + 2.6 million genes (9.1 million total), monthly; +287 samples from ~105 studies, +12.5 million genes (19 million total), on demand.]

Page 23: BLAST on Hadoop

•  NCBI BLAST (2.2.22)
   –  reference: IMG genomes, 6.5 million genes (~3 GB in size)
   –  full input set: 12.5 million metagenome genes against the reference
•  BLAST Hadoop
   –  uses streaming to manage input data sequences
   –  binary and databases on a shared file system
•  BLAST task-farming implementation
   –  server reads inputs and manages the tasks
   –  client runs BLAST, copies the database to local disk or ramdisk once on startup, pushes back results
   –  advantages: fault-resilient, and allows incremental expansion as resources become available
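The task-farming scheme described above can be sketched as a work queue: the server holds input chunks, each client repeatedly pulls a task, runs it, and pushes the result back, and a failed task is simply re-queued, which is what makes the scheme fault-resilient and lets new clients join mid-run. In this sketch, run_blast is a stand-in for invoking the BLAST binary, and the client loop runs serially for illustration:

```python
# Toy task-farming sketch: a queue of input chunks and a
# pull-work-push client loop with re-queue on failure.
import queue

def run_blast(chunk):
    # Stand-in for running the blast binary on one input chunk;
    # this toy version never fails.
    return f"result({chunk})"

def task_farm(chunks):
    tasks = queue.Queue()
    for c in chunks:
        tasks.put(c)
    results = []
    # Serial stand-in for many concurrent clients pulling tasks.
    while not tasks.empty():
        chunk = tasks.get()
        try:
            results.append(run_blast(chunk))
        except Exception:
            tasks.put(chunk)  # re-queue so another client retries it
    return results

out = task_farm(["seq_batch_0", "seq_batch_1", "seq_batch_2"])
```
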

Page 24: BLAST Performance

[Figure: time in minutes (0–100) versus number of processors (16, 32, 64, 128) for EC2-taskFarmer, Franklin-taskFarmer, EC2-Hadoop, and Azure.]

Page 25: BLAST on Yahoo! M45 Hadoop

•  Initial configuration: Hadoop memory ulimit issues
   –  Hadoop memory limits increased to accommodate high-memory tasks
   –  1 map per node for high-memory tasks to reduce contention
   –  thrashing when the DB does not fit in memory
•  NFS shared file system for the common DB
   –  move the DB to local nodes (copy to local /tmp)
   –  the initial copy takes 2 hours, but then a BLAST job completes in < 10 minutes
   –  performance is equivalent to other cloud environments
   –  future: experiment with DistributedCache
•  Time to solution varies; no guarantee of simultaneous availability of resources

Strong user group and sysadmin support was key in working through this.

Page 26: Summary: Virtualization for Science

•  Porting applications still requires a lot of work
•  Public clouds
   –  virtualization has a performance impact
   –  failures when creating large instances
   –  data costs tend to be overwhelming
•  Eucalyptus
   –  learning curve and stability at scale
•  Alternate stacks and trying version 2.0
   –  SSE instructions were not exposed in the VM
•  Additional benchmarking

Page 27: Summary: Hadoop for Science

•  Deployment challenges
   –  all jobs run as user "hadoop", affecting file permissions
   –  less control over how many nodes are used, which affects allocation policies
   –  file system performance for large file sizes
•  Programming challenges: no turn-key solution
   –  using existing code bases, managing input formats and data
•  Performance
   –  BLAST over Hadoop: performance is comparable to existing systems
   –  existing parallel file systems can be used through Hadoop On Demand
•  Additional benchmarking and tuning needed; plug-ins for science

Page 28: Acknowledgements

This work was funded in part by the Advanced Scientific Computing Research (ASCR) program in the DOE Office of Science under contract number DE-AC02-05CH11231.

CITRIS/UC, Yahoo! M45, Amazon EC2 Education and Research Grants, Microsoft Research, Wei Lu, Dennis Gannon, Masoud Nikravesh, and Greg Bell.

Magellan – Shane Canon, Iwona Sakrejda
Magellan benchmarking – Shane Canon, Nick Wright
EC2 benchmarking – Keith Jackson, Krishna Muriki, John Shalf, Shane Canon, Harvey Wasserman, Shreyas Cholia, Nick Wright
BLAST – Shreyas Cholia, Keith Jackson, Shane Canon, John Shalf
Supernova Factory – Keith Jackson, Rollin Thomas, Karl Runge

Page 29: Questions

[email protected]