SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Comet: Realizing High-Performance Virtualized Clusters using SR-IOV Technology
HPC Advisory Council, Guangzhou, China, 5 November 2014
Richard Moore, Luca Clementi, Dmitry Mishin, Phil Papadopoulos, Mahidhar Tatineni, Rick Wagner
Outline
• Comet: Objectives and Description
• SR-IOV as an Enabler of Virtual HPC Clusters
• Benchmark Comparisons
• Implementing Virtual HPC Clusters
High-Performance Computing for the Long Tail of Science
• Comet goals (from NSF solicitation 13-528):
  • "… expand the use of high end resources to a much larger and more diverse community"
  • "… support the entire spectrum of NSF communities"
  • "… promote a more comprehensive and balanced portfolio"
  • "… include research communities that are not users of traditional HPC systems."
HPC for the 99%
• 99% of jobs run on NSF's HPC resources in 2012 used fewer than 2,048 cores
• Those jobs consumed more than 50% of the total core-hours across NSF resources
Key Strategies for Comet Users
• Target modest-scale users and new users/communities: goal of 10,000 users/year
• Support capacity computing, with a system optimized for small/modest-scale jobs and quicker resource response using allocation/scheduling policies
• Build upon and expand efforts with Science Gateways, encouraging gateway usage and hosting via software and operating policies
• Provide a virtualized environment to support development of customized software stacks, virtual environments, and project control of workspaces
Comet: System Characteristics
• Production early 2015
• Total peak flops ~2 PF (see the peak-flop arithmetic sketch below)
• Dell primary integrator
  • Intel Haswell processors with AVX2
  • Mellanox FDR InfiniBand
• 1,944 standard compute nodes
  • Dual CPUs, each 12-core, 2.5 GHz
  • 128 GB DDR4 2133 MHz DRAM
  • 2 × 160 GB SSDs (local disk)
• 36 GPU nodes (Feb 2015)
  • Same as standard nodes, plus two NVIDIA K80 cards, each with dual Kepler GPUs
• 4 large-memory nodes (April 2015)
  • 1.5 TB DDR4 1866 MHz DRAM
  • Four Haswell processors per node
• Hybrid fat-tree topology
  • FDR (56 Gbps) InfiniBand
  • Rack-level (72 nodes, 1,728 cores) full bisection bandwidth
  • 4:1 oversubscription cross-rack
• Performance Storage (Aeon)
  • 7.6 PB, 200 GB/s; Lustre
  • Scratch & Persistent Storage segments
• Durable Storage (Aeon)
  • 6 PB, 100 GB/s; Lustre
  • Automatic backups of critical data
• Gateway hosting nodes
• Virtual image repository
• Home directory storage
• 100 Gbps external connectivity to Internet2 & ESnet
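As a quick sanity check on the quoted ~2 PF figure, the sketch below multiplies the standard-node counts from this slide by the usual Haswell figure of 16 double-precision FLOPs per core per cycle (two 256-bit FMA units); that per-cycle constant and the omission of the GPU and large-memory nodes are my assumptions, not stated on the slide.

```python
# Rough peak-FLOPS estimate for Comet's standard compute partition.
# Assumption (not from the slide): Haswell peaks at 16 double-precision
# FLOPs per core per cycle (two 256-bit FMA units x 4 doubles x 2 ops).
nodes = 1944            # standard compute nodes
cores_per_node = 24     # dual 12-core CPUs
clock_hz = 2.5e9        # 2.5 GHz
flops_per_cycle = 16    # assumed Haswell AVX2 FMA peak per core

peak_flops = nodes * cores_per_node * clock_hz * flops_per_cycle
print(f"Standard-node peak: {peak_flops / 1e15:.2f} PFLOPS")  # ~1.87 PFLOPS
```

The GPU and large-memory nodes make up the remaining fraction of the advertised 2 PF total.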
Comet Network Architecture: InfiniBand Compute, Ethernet Storage
[Figure: network architecture diagram. Each of the 27 racks (72 Haswell nodes, 320 GB node-local storage per node) contains 7 × 36-port FDR switches wired as a full fat-tree, with 4:1 oversubscription between racks through a mid-tier and a core InfiniBand layer (2 × 108-port switches). The GPU and large-memory nodes attach to the same fabric. IB-Ethernet bridges (4 × 18-port each) connect to dual Arista 40GbE switches serving the Performance Storage (7 PB, 200 GB/s, 32 storage servers) and Durable Storage (6 PB, 100 GB/s, 64 storage servers) systems. Data mover nodes and a Juniper 100 Gbps router provide access to Internet2 and research/education networks. Additional support components (not shown for clarity): 10 GbE Ethernet management network, NFS servers for home directories, virtual image repository, gateway/portal hosting nodes, login nodes, and Rocks management nodes.]
Suggested Comet Applications
• Modest core counts: full bisection bandwidth up to a Comet island (1,728 cores)
• 128 GB DRAM/node (5.3 GB/core): single-node shared-memory apps and MPI codes with large per-process memory footprints
• AVX2: codes with vectorizable loops; any application with significant performance gain relative to Sandy Bridge or Ivy Bridge (AVX)
• SSDs: computational chemistry, finite elements; apps that generate large numbers of small temporary files (finance, QM/MM)
• GPU nodes: molecular dynamics, linear algebra, image and signal processing
  • Doesn't replace Keeneland, but suits workloads that have some GPU requirements
• Large-memory nodes: de novo genome assembly, visualization of large data sets, other large-memory apps
• Science Gateways: gateway-friendly environment with local gateway hosting capability, flexible allocations, scheduling policies for rapid throughput, heterogeneous workflows, and virtual clusters for software environments
• High-performance virtualization: workloads with customized software stacks, especially those that are difficult to port or deploy in the standard XSEDE environment
Realizing High-Performance Virtualized Clusters using SR-IOV Technology
• Commercial cloud providers have solved the problem of virtualization for single-core/single-node jobs
  • Some adoption in academia & government R&D labs
• Not so for HPC applications, which use message passing (MPI) to harness many compute nodes in parallel
  • Benchmarks show lower performance on cloud platforms, largely due to the overhead of I/O virtualization
• Single Root I/O Virtualization (SR-IOV) drastically reduces this overhead, opening the door to virtualized supercomputing at the cluster level
• Benefits of virtualization to users:
  • Maintain the user software environment and minimize porting/maintenance time
  • Lower barrier to entry for new users, from straightforward software to complex software stacks
  • Users keep control of their software stack, with root access to the virtual machine
  • More flexible software environment, including access via science gateways
  • Extends the cloud computing paradigm to clusters and HPC applications
Single Root I/O Virtualization in HPC
• Problem: virtualization has generally resulted in significant I/O performance degradation (e.g., excessive DMA interrupts)
• Solution: SR-IOV and Mellanox ConnectX-3 InfiniBand host channel adapters
  • One physical function → multiple virtual functions, each lightweight but with its own DMA streams, memory space, and interrupts (see the sysfs sketch below)
  • Allows DMA to bypass the hypervisor and go directly to VMs
• SR-IOV enables virtual HPC clusters with near-native InfiniBand latency/bandwidth and minimal overhead
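To make the physical-function/virtual-function split concrete, here is a minimal sketch (my own illustration, not from the slides) that walks the Linux sysfs PCI tree and reports how many SR-IOV virtual functions each capable device exposes. The sriov_totalvfs and sriov_numvfs attributes are standard Linux sysfs files, but the script assumes it runs on a host with SR-IOV-capable adapters.

```python
# Minimal sketch: list PCI devices that advertise SR-IOV capability and
# how many virtual functions (VFs) are currently enabled on each.
# Assumes a Linux host; sriov_totalvfs / sriov_numvfs are standard sysfs files.
from pathlib import Path

def sriov_devices(pci_root="/sys/bus/pci/devices"):
    for dev in sorted(Path(pci_root).iterdir()):
        total = dev / "sriov_totalvfs"   # present only on SR-IOV capable devices
        if total.exists():
            enabled = int((dev / "sriov_numvfs").read_text())
            yield dev.name, int(total.read_text()), enabled

if __name__ == "__main__":
    for addr, total_vfs, enabled_vfs in sriov_devices():
        print(f"{addr}: {enabled_vfs}/{total_vfs} virtual functions enabled")
```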
Benchmarks to Compare Ethernet and InfiniBand with/without SR-IOV
• Fundamental performance characteristics of the interconnect evaluated using the OSU Micro-Benchmarks (see the ping-pong sketch below):
  • Latency
  • Bandwidth (unidirectional and bidirectional)
• Application testing for integrated overhead estimates:
  • WRF (CONUS-12km benchmark): widely used weather modeling application, used in both research and operational forecasting
  • Quantum ESPRESSO (DEISA AUSURF112 benchmark): performs density functional theory (DFT) calculations for condensed matter problems
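As a rough stand-in for the osu_latency ping-pong pattern referenced above, the sketch below (my own illustration using mpi4py, not part of the original benchmark suite) times round trips between two ranks over a range of message sizes; running it across two nodes lets you compare, for example, native and SR-IOV InfiniBand.

```python
# Minimal MPI ping-pong latency sketch in the spirit of osu_latency.
# Assumes mpi4py is installed; run with: mpirun -np 2 python pingpong.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
ITERS, SKIP = 1000, 100  # timed iterations and warm-up rounds

for size in (1, 64, 1024, 65536):
    buf = bytearray(size)
    comm.Barrier()
    for i in range(ITERS + SKIP):
        if i == SKIP:
            start = MPI.Wtime()
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    if rank == 0:
        latency_us = (MPI.Wtime() - start) / ITERS / 2 * 1e6  # one-way latency
        print(f"{size:7d} bytes: {latency_us:8.2f} us")
```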
Hardware/Software Configurations of Test Clusters
• Native InfiniBand (SDSC): Rocks 6.1 (EL6); Intel Xeon E5-2660 (2.2 GHz), 16 cores/node; 64 GB DDR3; QDR 4X InfiniBand, Mellanox ConnectX-3
• SR-IOV InfiniBand (SDSC): Rocks 6.1 (EL6), KVM hypervisor; Intel Xeon E5-2660 (2.2 GHz), 16 cores/node; 64 GB DDR3; QDR 4X InfiniBand, Mellanox ConnectX-3
• Native 10GbE (SDSC): Rocks 6.1 (EL6); Intel Xeon E5-2660 (2.2 GHz), 16 cores/node; 64 GB DDR3; 10GbE
• Software-virtualized 10GbE (EC2): Amazon Linux 2013.09 (EL6), Xen HVM, cc2.8xlarge instance; Intel Xeon E5-2670 (2.6 GHz), 16 cores/node; 60.5 GB DDR3; 10GbE (Xen driver)
• SR-IOV 10GbE (EC2): Amazon Linux 2013.09 (EL6), Xen HVM, c3.8xlarge instance; Intel Xeon E5-2680v2 (2.8 GHz), 16 cores/node; 60.5 GB DDR3; 10GbE (Intel VF driver)
Latency Results, 10 GbE: Native, Virtualized, and with SR-IOV
[Figure: MPI point-to-point latency as measured by the osu_latency benchmark. Error bars are +/- three standard deviations from the mean.]
• Virtualized latency is 2-2.5× slower than the native case, even with SR-IOV
• SR-IOV gives a 12-40% improvement over the software-virtualized environment (SR-IOV is provided with Amazon's C3 instances)
• SR-IOV provides 3× to 4× less variation in latency for small message sizes
Bandwidth Results, 10 GbE: Native, Virtualized, and with SR-IOV
[Figure: MPI (a) unidirectional and (b) bidirectional bandwidth for the 10GbE interconnect, as measured by the osu_bw and osu_bibw benchmarks, respectively.]
• Unidirectional messaging bandwidth never exceeds 500 MB/s (~40% of the 10GbE line rate) with or without SR-IOV
• Native performance is 1.5-2× faster
• Similar results for bidirectional bandwidth
• SR-IOV provides very little benefit in either case
• SR-IOV helps slightly in the collective bandwidth tests (13% for random ring, 17% for natural ring)
• Native total ring bandwidth was more than 2× faster than the SR-IOV virtualized results
Latency Results: QDR IB & 10 GbE, Native and SR-IOV
[Figure 5: MPI point-to-point latency measured by osu_latency for QDR InfiniBand. Included for scale are the analogous 10GbE measurements from Amazon EC2 and non-virtualized 10GbE.]
• SR-IOV with QDR InfiniBand:
  • < 30% overhead for small messages (< 128 bytes)
  • < 10% overhead for eager send/receive
  • Overhead → 0% in the bandwidth-limited regime
• Amazon EC2 (10 GbE):
  • > 50× worse latency
  • Time dependent (noisy)
Takeaway: 50× lower latency than Amazon EC2
Bandwidth Results: QDR IB & 10 GbE, Native and SR-IOV
[Figure 6: MPI point-to-point bandwidth measured by osu_bw for QDR InfiniBand. Included for scale are the analogous 10GbE measurements from Amazon EC2 and non-virtualized 10GbE.]
• Comparison of bandwidth relative to native InfiniBand
• SR-IOV with QDR InfiniBand:
  • < 2% bandwidth loss over the entire range
  • > 95% of peak bandwidth
• Amazon EC2 (10 GbE):
  • < 35% of peak bandwidth
  • While the ratio of QDR to 10GbE line rates is ~4×, EC2 bandwidth is 9-25× worse than SR-IOV InfiniBand
Takeaway: 10× more bandwidth than Amazon EC2
WRF Weather Modeling: 15% Overhead with SR-IOV InfiniBand
• 96-core (6-node) calculation
• Nearest-neighbor communication
• Scalable algorithms
• SR-IOV incurs a modest (15%) performance hit
• ... but is still 20% faster than EC2, despite 20% slower CPUs
[Figure: WRF 3.4.1, 3-hour forecast runtimes.]
Quantum ESPRESSO: 28% Overhead
• 48-core (3-node) calculation
• CG matrix inversion: irregular communication
• 3D FFT matrix transposes (all-to-all communication)
• 28% slower with SR-IOV vs. native InfiniBand
• SR-IOV still > 500% faster than EC2, despite 20% slower CPUs
[Figure: Quantum ESPRESSO 5.0.2, DEISA AUSURF112 benchmark runtimes.]
Functional/Operational Design Points for Comet Virtualization
• Users may opt to run either as a normal batch job or as a virtual cluster
• The Virtual Cluster (VC) front end is up 24×365 on VM hosting nodes
• Virtual cluster nodes are transitory (turned on/off through the batch queue)
  • At least initially, cluster physical nodes are dedicated to a single VM (not shared)
• All cluster nodes retain disk state after VC power off (shutdown)
• Cluster owners have BIOS-level control of all nodes
• Clusters must be isolated from one another on the network
• Performance:
  • SR-IOV InfiniBand performance close to native IB
  • Disk performance close to local disk (disk migration)
  • Cluster nodes boot as soon as they are scheduled by the batch system
• Storage file systems:
  • Access to a shared networked file system (NFS)
  • No access to shared Lustre (security issues)
HPC Virtual Cluster (VC) Development Tasks (planned production summer 2015)
• Disk management
• Scheduler integration
• User API
• Implement SR-IOV
• User GUI (with remote console)
• Upload ISOs/disk images
• Home-area mounts to VC
• Lustre mounts to VC
[In the original slide, each task is color-coded with a status: debugging/testing phase, next task, want to improve what's there, relatively quick, nice to have, deployment/administrative issue, or not feasible yet.]
Comet: Implementing Virtual Clusters (VCs)
• Only one VM per physical node
• The VC head node is "always on"
[Figure: physical nodes run the XSEDE stack; virtual machines run the user's stack. Dedicated VM host nodes run the virtual cluster head nodes (HN0 ... HNN) and science gateway hosts (SG0 ... SGN); compute VMs (VC0-VC3) run one per physical node.]
Comet: Implementing Virtual Clusters (VCs), continued
• The head node remains active after VC shutdown
• But the cluster (compute) nodes are released
[Figure: same diagram as above, showing the VC head node still running on a VM host node while the compute-node VMs have been released back to the batch pool.]
VM Disk Management
• Each VM gets a 36 GB disk (small SCSI)
• Disk images must be persistent through reboots
• Two standard solutions:
  • iSCSI (network-mounted disk)
  • Persistent disk replication on nodes
• VMs can be allocated on any/all compute nodes depending on availability (scheduler), so persistent replication on nodes is not feasible
• But a network-mounted disk can be expensive or hurt performance
Hybrid Solution for Disk Management
• Dual central network-attached storage (NAS) devices store all disk images
• The initial boot of any cluster node uses an iSCSI disk (node disk) on the centralized NAS
• On startup, Comet moves the node disk to the physical host running the node VM, then disconnects from the NAS (see the sketch below)
  • All node disk operations are local to the physical host
  • Enables scale-out without a costly NAS device
• At shutdown, any changes made to the node disk (now on the physical host) are migrated back to the NAS, ready for the next boot
Implementing VCs: Startup/Shutdown
• Each VC has its own ZFS file system for VM images
[Figure: same virtual-cluster diagram as above, with a ZFS pool holding the VM disk images that are migrated to and from the physical hosts.]
SR-IOV Is a Huge Step Forward in High-Performance Virtualization
• SR-IOV InfiniBand shows a substantial improvement in latency over Amazon EC2 and provides nearly zero bandwidth overhead
• Application benchmarks confirm a significant performance improvement over EC2
• SR-IOV lowers the performance barrier to virtualizing the interconnect and makes virtualized HPC clusters viable
  • Extends the cloud computing paradigm to HPC applications
• Comet will deliver virtualized HPC to new/non-traditional communities that need software flexibility and reuse, without a major loss of performance
Thank you! (謝謝)
This work was supported by the National Science Foundation, award ACI-1341698.