Evaluating Cloud Computing for Science
Lavanya Ramakrishnan
Lawrence Berkeley National Lab
June 2011
Magellan Research Questions
• Are the open source cloud software stacks ready for DOE HPC science?
• Can DOE cyber security requirements be met within a cloud?
• How usable are cloud environments for scientific applications?
• Are the new cloud programming models useful for scientific computing?
• Can DOE HPC applications run efficiently in the cloud? What applications are suitable for clouds?
• When is it cost effective to run DOE HPC science in a cloud?
• What are the ramifications for data intensive computing?
Cloud Deployment Models
[Figure: cloud deployment models as a layered stack, bottom to top - Physical Resource Layer, Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS)]
Magellan User Survey
[Figure: survey respondents by program office - Advanced Scientific Computing Research 17%, Biological and Environmental Research 9%, Basic Energy Sciences - Chemical Sciences 10%, Fusion Energy Sciences 10%, High Energy Physics 20%, Nuclear Physics 13%, Advanced Networking Initiative (ANI) Project 3%, Other 14%]

[Figure: percentage of respondents (0-90%) interested in each capability:]
• Access to additional resources
• Access to on-demand (commercial) paid resources closer to deadlines
• Ability to control software environments specific to my application
• Ability to share setup of software or experiments with collaborators
• Ability to control groups/users
• Exclusive access to the computing resources/ability to schedule independently of other groups/users
• Easier to acquire/operate than a local cluster
• Cost associativity (i.e., I can get 10 CPUs for 1 hr now or 2 CPUs for 5 hrs at the same cost)
• MapReduce Programming Model/Hadoop
• Hadoop File System
• User interfaces/science gateways: use of clouds to host science gateways and/or access to cloud resources through science gateways
Magellan Software

[Figure: Magellan software stack on the Magellan cluster - batch queues, virtual machines (Eucalyptus), Hadoop, private clusters, public or remote cloud, science and storage gateways, and ANI]
Are the open source cloud software stacks ready for DOE HPC science?
Can DOE cyber security requirements be met within a cloud?
Amazon Web Services
• Web-service API to IaaS offering
• Non-persistent local disk in VM
• Simple Storage Service (S3)
  – scalable persistent object store
• Elastic Block Storage (EBS)
  – persistent, block-level storage
• Offers different instance types (see the sketch below)
  – standard, micro, high-memory, high-CPU, cluster compute, cluster GPU
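To make the web-service API concrete, here is a minimal sketch of launching an instance and storing a result in S3 using the boto Python library; the AMI ID, key pair, and bucket name are hypothetical placeholders, not from the talk.

```python
# Minimal sketch: launch an EC2 instance and store a result in S3 with boto.
# The AMI ID, key pair, and bucket name below are hypothetical placeholders.
from boto.ec2.connection import EC2Connection
from boto.s3.connection import S3Connection

ec2 = EC2Connection()  # reads AWS credentials from the environment

# Start one cluster-compute instance from a (hypothetical) machine image.
reservation = ec2.run_instances('ami-00000000',
                                instance_type='cc1.4xlarge',
                                key_name='my-keypair')
instance = reservation.instances[0]
print(instance.id, instance.state)

# Persist an output file in the scalable S3 object store.
s3 = S3Connection()
bucket = s3.create_bucket('my-science-results')
key = bucket.new_key('run-001/output.dat')
key.set_contents_from_filename('output.dat')
```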
Private Cloud Software
• Eucalyptus
  – open source IaaS implementation, API-compatible with AWS (see the sketch below)
  – KVM and Xen can be used as hypervisors
  – Walrus & Block Storage: interface-compatible with S3 & EBS
  – experience with 1.6.2 and 2.0
• Other options – OpenStack, Nimbus, etc.
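Because Eucalyptus is API-compatible with AWS, the same boto client can simply be pointed at the private endpoint. A sketch, with a hypothetical host name and credentials; Eucalyptus installs of this era typically served the EC2-compatible API on port 8773 under /services/Eucalyptus.

```python
# Sketch: the same boto EC2 client, aimed at a private Eucalyptus cloud.
# Host name and credentials are hypothetical placeholders.
from boto.ec2.connection import EC2Connection
from boto.ec2.regioninfo import RegionInfo

region = RegionInfo(name='eucalyptus', endpoint='cloud.example.gov')
conn = EC2Connection(aws_access_key_id='EUCA_ACCESS_KEY',
                     aws_secret_access_key='EUCA_SECRET_KEY',
                     is_secure=False, region=region,
                     port=8773, path='/services/Eucalyptus')
print(conn.get_all_images())  # images registered with the private cloud
```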
Experiences with Eucalyptus (1.6.2)
• Scalability
  – VM network traffic is routed through a single node
  – limit on concurrent VMs due to messaging size
• Requires tuning and tweaking
  – co-existing with services such as DHCP
  – advanced Nehalem CPU instructions
• Allocation and accounting
  – hard to ensure fairness with first-come, first-served scheduling
• Limited logging and monitoring
Security in the Cloud
• Trust issues
  – user-provided images are uploaded and shared
  – root privileges for untrained users open the door to mistakes
• An effective Intrusion Detection System (IDS) strategy is challenging
  – due to the ephemeral nature of virtual machine instances
• Fundamental threats are the same; security controls are different
Can DOE HPC applications run efficiently in the cloud? What applications are suitable for clouds?
Workloads
• High performance computing codes
  – supported by NERSC and other supercomputing centers
• Mid-range computing workloads
  – serviced by LBL/IT Services and other local cluster environments
• Interactive data-intensive processing
  – usually run on scientists' desktops
Experiment Setup
• Workloads
  – HPCC
  – subset of NERSC-6 application benchmarks for EC2 with smaller input sizes
    • represent the requirements of the NERSC workload
    • rigorous process for selection of codes
    • workload and algorithm/science-area coverage
• Platforms
  – Amazon
  – Lawrencium (IT cluster)
  – Magellan
Application Benchmarks

[Figure: runtime relative to Magellan (non-VM), 0-18x, for GAMESS, GTC, IMPACT, fvCAM, and MAESTRO256 on Franklin, Lawrencium, EC2-Beta-Opt, Amazon EC2, and Amazon CC]
Application Benchmarks

[Figure: runtime relative to Magellan, 0-60x, for MILC and PARATEC on Franklin, Lawrencium, EC2-Beta-Opt, Amazon EC2, and Amazon CC]
Application Scaling

[Figure: performance relative to native (IB) vs. number of cores for MILC (64-1024 cores) and PARATEC (32-1024 cores), comparing TCPoIB, TCPoEth, and Amazon CC]
Comparison of PingPong BW

[Figure: average ping-pong bandwidth as a percentage of native (IB) vs. number of cores (32-1024) for TCPoIB, TCPoEth, and Amazon]
Amazon Reliability: A Snapshot

[Figure: number of instances that failed at startup vs. started successfully, by time of day, Tue Mar 15 through Thu Mar 17, 2011 (PDT)]
BLAST Performance

[Figure: run time in minutes (0-100) vs. number of processors (16-128) for EC2-taskFarmer, Franklin-taskFarmer, EC2-Hadoop, and Azure]
Queue Wait Time vs. VM Startup Overhead

[Figure: Windows Azure VM startup times, 2010 (Yogesh Simmhan, Microsoft); batch queue prediction times from the QBETS service on NSF TeraGrid resources, 2010]

BReW: Blackbox Resource Selection for eScience Workflows, collaboration w/ Emad Soroush, Yogesh Simmhan, Deb Agarwal, Catharine van Ingen
What codes work well?
• Minimal synchronization, modest I/O requirements
• Large messages or very little communication
• Low core counts (non-uniform execution and limited scaling)
• Generally, applications that would do well on mid-range clusters
  – Future: analyzing data from our batch queue profiling (through IPM)
How usable are cloud environments for scientific applications?
Application Case Studies
• Magellan has a broad set of users
  – various domains and projects (MG-RAST, JGI, STAR, LIGO, ATLAS, Energy+)
  – various workflow styles (serial, parallel) and requirements
• Two use cases discussed today
  – MG-RAST: deep soil sequencing
  – STAR: streamed real-time data analysis
MG-RAST: Deep Soil Analysis
Background: genome sequencing of two soil samples pulled from two plots at the Rothamsted Research Center in the UK.
Goal: understand the impact of long-term plant influence (rhizosphere) on microbial community composition and function.
Used: 150 nodes for one week to perform one run (1/30 of the work planned); used NERSC for fault tolerance and recovery.
Observations: the MG-RAST application is well suited to clouds, and the user was already familiar with the cloud.
Image courtesy: Jared Wilkening
Early Science - STAR
• STAR performed real-time analysis of data coming from Brookhaven National Lab
• First time the data was analyzed in real time to a high degree
• Leveraged an existing OS image from a NERSC system
• Started out with 20 VMs at NERSC and expanded to ANL
Image courtesy: Jan Balewski, STAR collaboration
Application Design and Development
• Image creation and management
  – system administration skills
  – determining what goes on the image, etc.
• Workflow and data management
  – job distribution, data storage, and archiving must be managed explicitly
• Performance and reliability need to be considered
Are the new cloud programming models useful for scientific computing? What are the ramifications for data-intensive computing?
Hadoop Stack
• Open source, reliable, scalable distributed computing
  – implementation of MapReduce
  – Hadoop Distributed File System (HDFS)

[Figure: Hadoop stack - Core, Avro, MapReduce, HDFS, Pig, Chukwa, Hive, HBase, ZooKeeper. Source: Hadoop: The Definitive Guide]
HDFS Architecture

[Figure: HDFS architecture diagram]
Hadoop for Science
• Advantages of Hadoop
  – transparent data replication, data-locality-aware scheduling
  – fault tolerance capabilities
• Hadoop Streaming (see the sketch below)
  – allows users to plug in any binary as maps and reduces
  – input comes on standard input
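To illustrate the streaming model, here is a minimal word-count mapper and reducer in Python; the invocation in the comment is illustrative, and jar paths vary by installation.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming word count (illustrative sketch).
# Example invocation (paths vary by installation):
#   hadoop jar hadoop-streaming.jar -input /data/text -output /data/counts \
#       -mapper 'wordcount.py map' -reducer 'wordcount.py reduce' -file wordcount.py
import sys

def mapper():
    # Emit "word<TAB>1" for every word arriving on standard input.
    for line in sys.stdin:
        for word in line.split():
            print('%s\t1' % word)

def reducer():
    # Hadoop sorts map output by key, so equal words arrive contiguously.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit('\t', 1)
        if word != current and current is not None:
            print('%s\t%d' % (current, count))
            count = 0
        current = word
        count += int(n)
    if current is not None:
        print('%s\t%d' % (current, count))

if __name__ == '__main__':
    (mapper if sys.argv[1:] == ['map'] else reducer)()
```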
Application Examples
• Bioinformatics applications (e.g., BLAST)
  – parallel search of input sequences
  – managing input data format (see the sketch below)
• Tropical storm detection
  – binary file formats can't be handled in streaming
• Atmospheric river detection
  – maps are differentiated on file and parameter
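The binary-plugging pattern noted above can be sketched as a streaming mapper that shells out to an existing serial executable; the executable name and flag are hypothetical stand-ins for a real tool such as BLAST.

```python
#!/usr/bin/env python
# Sketch of a streaming mapper wrapping a legacy serial binary.
# './search_binary' and '--query' are hypothetical stand-ins.
import subprocess
import sys

for line in sys.stdin:
    input_file = line.strip()  # each input record names one unit of work
    if not input_file:
        continue
    # Run the legacy executable on this unit of work and emit its output
    # keyed by the input name, so a reduce step can group results per input.
    out = subprocess.check_output(['./search_binary', '--query', input_file])
    print('%s\t%s' % (input_file, out.decode().strip()))
```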
HDFS vs GPFS (Time)

[Figure: Teragen (1TB) time in minutes (0-12) vs. number of maps (0-3000) on HDFS and GPFS, with linear and exponential fits for each]
Challenges
• Deployment
  – all jobs run as user "hadoop", affecting file permissions
  – less control over how many nodes are used, which affects allocation policies
• Programming: no turn-key solution
  – using existing code bases, managing input formats and data
• Additional benchmarking and tuning needed; plug-ins for science
Comparison of MapReduce Implementations

[Figure: three panels comparing 64-core Twister, Hadoop, and LEMO-MR clusters - processing time (s) vs. output data size (MB) when producing random floating-point numbers; speedup vs. cluster size (cores) when processing 5 million 33x33 matrices; and load balancing, measured as billions of words processed on each of three nodes with node1 under stress]

Collaboration w/ Zacharia Fadika, Elif Dede, Madhusudhan Govindaraju, SUNY Binghamton
MARIANE
Collaboration w/ Zacharia Fadika, Elif Dede, Madhusudhan Govindaraju, SUNY Binghamton
Data Intensive Science
• Goal: evaluate hardware and software choices for supporting next-generation data problems
• Evaluation of Hadoop
  – using a mix of synthetic benchmarks and scientific applications
  – understanding application characteristics that can leverage the model
    • data operations: filter, merge, reorganization
    • compute-to-data ratio
(collaboration w/ Shane Canon, Nick Wright, Zacharia Fadika)
Tools for managing code ensembles/UQ
• Code ensembles
  – a problem decomposed into a large number of loosely coupled tasks (see the sketch below)
  – e.g., running VASP on 125K crystals in the Materials Genome database
  – uncertainty quantification
• Evaluate Hadoop for managing code ensembles
  – compare with databases (MySQL, MongoDB), workflow tools, and message queues
(collaboration w/ Dan Gunter, Elif Dede)
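The code-ensemble pattern itself is simple to state; below is a minimal, hypothetical sketch of farming out many loosely coupled runs with a local process pool. The executable name and input naming are illustrative, and a real VASP ensemble would also stage inputs and archive outputs per task.

```python
# Hypothetical sketch of a code ensemble: many independent runs of the
# same code over different inputs. './simulation_code' and the input
# directory naming are illustrative placeholders.
import multiprocessing
import subprocess

def run_task(input_dir):
    # One loosely coupled task; failures are reported, not fatal.
    rc = subprocess.call(['./simulation_code'], cwd=input_dir)
    return input_dir, rc

if __name__ == '__main__':
    inputs = ['crystal_%06d' % i for i in range(125000)]  # one dir per crystal
    pool = multiprocessing.Pool()  # defaults to one worker per core
    for input_dir, rc in pool.imap_unordered(run_task, inputs):
        if rc != 0:
            print('task %s failed with exit code %d' % (input_dir, rc))
```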
Unique Needs and Features of a Science Cloud
• Access to parallel filesystems and low-latency, high-bandwidth interconnects
  – access to legacy data sets
• Bare-metal provisioning for applications that require custom environments
  – or that cannot tolerate the performance hit from virtualization
• Preinstalled, pre-tuned application software stacks
  – specific libraries and performance considerations
• Customizations for site-specific policies
  – authentication, fairness
• Alternate MapReduce implementations
  – that account for scientific data and analysis methods
Conclusions
• Current-day cloud computing solutions have gaps for science
  – performance, reliability, stability
  – programming models are difficult for legacy apps
  – security mechanisms and policies
• HPC centers can adopt some of the technologies and mechanisms
  – support for data-intensive workloads
  – allow custom software environments
  – provide different levels of service
Acknowledgements
• US Department of Energy DE-AC02-05CH11232
• Magellan
  – Shane Canon, Tina Declerck, Iwona Sakrejda, Scott Campbell, Brent Draney
• Amazon Benchmarking
  – Krishna Muriki, Nick Wright, John Shalf, Keith Jackson, Harvey Wasserman, Shreyas Cholia
• Magellan/ANL
  – Susan Coghlan, Piotr T Zbiegiel, Narayan Desai, Rick Bradshaw, Anping Liu, Ed Holohan
• NERSC
  – Jeff Broughton, Kathy Yelick
• Applications
  – Jared Wilkening, Gabe West, Doug Olson, Jan Balewski, STAR collaboration, K. John Wu, Alex Sim, Prabhat, Suren Byna, Victor Markowitz
Questions?
Lavanya Ramakrishnan LRamakrishnan@lbl.gov