SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Comet: Realizing High-Performance Virtualized Clusters using SR-IOV Technology
HPC Advisory Council, Guangzhou, China, 5 November 2014
Richard Moore, Luca Clementi, Dmitry Mishin, Phil Papadopoulos, Mahidhar Tatineni, Rick Wagner
Outline
• Comet: Objectives and Description
• SR-IOV as an Enabler of Virtual HPC Clusters
• Benchmark Comparisons
• Implementing Virtual HPC Clusters
High-Performance Computing for the Long Tail of Science
• Comet goals (from NSF solicitation 13-528):
  • "… expand the use of high end resources to a much larger and more diverse community"
  • "… support the entire spectrum of NSF communities"
  • "… promote a more comprehensive and balanced portfolio"
  • "… include research communities that are not users of traditional HPC systems."
HPC for the 99%
• 99% of jobs run on NSF's HPC resources in 2012 used fewer than 2,048 cores
• Those jobs consumed more than 50% of the total core-hours across NSF resources
Key Strategies for Comet Users
• Target modest-scale users and new users/communities: goal of 10,000 users/year
• Support capacity computing, with a system optimized for small/modest-scale jobs and quicker resource response using allocation/scheduling policies
• Build upon and expand efforts with Science Gateways, encouraging gateway usage and hosting via software and operating policies
• Provide a virtualized environment to support development of customized software stacks, virtual environments, and project control of workspaces
Comet: System Characteristics
• Production early 2015
• Total peak flops ~2 PF (see the peak-flop arithmetic sketch below)
• Dell primary integrator
  • Intel Haswell processors with AVX2
  • Mellanox FDR InfiniBand
• 1,944 standard compute nodes
  • Dual CPUs, each 12-core, 2.5 GHz
  • 128 GB DDR4 2133 MHz DRAM
  • 2 × 160 GB SSDs (local disk)
• 36 GPU nodes (Feb 2015)
  • Same as standard nodes, plus two NVIDIA K80 cards, each with dual Kepler GPUs
• 4 large-memory nodes (April 2015)
  • 1.5 TB DDR4 1866 MHz DRAM
  • Four Haswell processors per node
• Hybrid fat-tree topology
  • FDR (56 Gbps) InfiniBand
  • Rack-level (72 nodes, 1,728 cores) full bisection bandwidth
  • 4:1 oversubscription cross-rack
• Performance Storage (Aeon)
  • 7.6 PB, 200 GB/s; Lustre
  • Scratch & Persistent Storage segments
• Durable Storage (Aeon)
  • 6 PB, 100 GB/s; Lustre
  • Automatic backups of critical data
• Gateway hosting nodes
• Virtual image repository
• Home directory storage
• 100 Gbps external connectivity to Internet2 & ESnet
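As a quick sanity check on the quoted ~2 PF figure, the sketch below multiplies the standard-node counts from this slide by the usual Haswell figure of 16 double-precision FLOPs per core per cycle (two 256-bit FMA units); that per-cycle constant and the omission of the GPU and large-memory nodes are my assumptions, not stated on the slide.

```python
# Rough peak-FLOPS estimate for Comet's standard compute partition.
# Assumption (not from the slide): Haswell peaks at 16 double-precision
# FLOPs per core per cycle (two 256-bit FMA units x 4 doubles x 2 ops).
nodes = 1944            # standard compute nodes
cores_per_node = 24     # dual 12-core CPUs
clock_hz = 2.5e9        # 2.5 GHz
flops_per_cycle = 16    # assumed Haswell AVX2 FMA peak per core

peak_flops = nodes * cores_per_node * clock_hz * flops_per_cycle
print(f"Standard-node peak: {peak_flops / 1e15:.2f} PFLOPS")  # ~1.87 PFLOPS
```

The GPU and large-memory nodes make up the remaining fraction of the advertised 2 PF total.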
Comet Network Architecture: InfiniBand Compute, Ethernet Storage
[Figure: network architecture diagram. Each of the 27 racks (72 Haswell nodes, 320 GB node-local storage per node) contains 7 × 36-port FDR switches wired as a full fat-tree, with 4:1 oversubscription between racks through a mid-tier and a core InfiniBand layer (2 × 108-port switches). The GPU and large-memory nodes attach to the same fabric. IB-Ethernet bridges (4 × 18-port each) connect to dual Arista 40GbE switches serving the Performance Storage (7 PB, 200 GB/s, 32 storage servers) and Durable Storage (6 PB, 100 GB/s, 64 storage servers) systems. Data mover nodes and a Juniper 100 Gbps router provide access to Internet2 and research/education networks. Additional support components (not shown for clarity): 10 GbE Ethernet management network, NFS servers for home directories, virtual image repository, gateway/portal hosting nodes, login nodes, and Rocks management nodes.]
Suggested Comet Applications
• Modest core counts: full bisection bandwidth up to a Comet island (1,728 cores)
• 128 GB DRAM/node (5.3 GB/core): single-node shared-memory apps and MPI codes with large per-process memory footprints
• AVX2: codes with vectorizable loops; any application with significant performance gain relative to Sandy Bridge or Ivy Bridge (AVX)
• SSDs: computational chemistry, finite elements; apps that generate large numbers of small temporary files (finance, QM/MM)
• GPU nodes: molecular dynamics, linear algebra, image and signal processing
  • Doesn't replace Keeneland, but suits workloads that have some GPU requirements
• Large-memory nodes: de novo genome assembly, visualization of large data sets, other large-memory apps
• Science Gateways: gateway-friendly environment with local gateway hosting capability, flexible allocations, scheduling policies for rapid throughput, heterogeneous workflows, and virtual clusters for software environments
• High-performance virtualization: workloads with customized software stacks, especially those that are difficult to port or deploy in the standard XSEDE environment
Realizing High-Performance Virtualized Clusters using SR-IOV Technology
• Commercial cloud providers have solved the problem of virtualization for single-core/single-node jobs
  • Some adoption in academia & government R&D labs
• Not so for HPC applications, which use message passing (MPI) to harness many compute nodes in parallel
  • Benchmarks show lower performance on cloud platforms, largely due to the overhead of I/O virtualization
• Single Root I/O Virtualization (SR-IOV) drastically reduces this overhead, opening the door to virtualized supercomputing at the cluster level
• Benefits of virtualization to users:
  • Maintain the user software environment and minimize porting/maintenance time
  • Lower barrier to entry for new users, from straightforward software to complex software stacks
  • Users keep control of their software stack, with root access to the virtual machine
  • More flexible software environment, including access via science gateways
  • Extends the cloud computing paradigm to clusters and HPC applications
Single Root I/O Virtualization in HPC
• Problem: virtualization has generally resulted in significant I/O performance degradation (e.g., excessive DMA interrupts)
• Solution: SR-IOV and Mellanox ConnectX-3 InfiniBand host channel adapters
  • One physical function → multiple virtual functions, each lightweight but with its own DMA streams, memory space, and interrupts (see the sysfs sketch below)
  • Allows DMA to bypass the hypervisor and go directly to VMs
• SR-IOV enables virtual HPC clusters with near-native InfiniBand latency/bandwidth and minimal overhead
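To make the physical-function/virtual-function split concrete, here is a minimal sketch (my own illustration, not from the slides) that walks the Linux sysfs PCI tree and reports how many SR-IOV virtual functions each capable device exposes. The sriov_totalvfs and sriov_numvfs attributes are standard Linux sysfs files, but the script assumes it runs on a host with SR-IOV-capable adapters.

```python
# Minimal sketch: list PCI devices that advertise SR-IOV capability and
# how many virtual functions (VFs) are currently enabled on each.
# Assumes a Linux host; sriov_totalvfs / sriov_numvfs are standard sysfs files.
from pathlib import Path

def sriov_devices(pci_root="/sys/bus/pci/devices"):
    for dev in sorted(Path(pci_root).iterdir()):
        total = dev / "sriov_totalvfs"   # present only on SR-IOV capable devices
        if total.exists():
            enabled = int((dev / "sriov_numvfs").read_text())
            yield dev.name, int(total.read_text()), enabled

if __name__ == "__main__":
    for addr, total_vfs, enabled_vfs in sriov_devices():
        print(f"{addr}: {enabled_vfs}/{total_vfs} virtual functions enabled")
```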
Benchmarks to Compare Ethernet and InfiniBand with/without SR-IOV
• Fundamental performance characteristics of the interconnect evaluated using the OSU Micro-Benchmarks (see the ping-pong sketch below):
  • Latency
  • Bandwidth (unidirectional and bidirectional)
• Application testing for integrated overhead estimates:
  • WRF (CONUS-12km benchmark): widely used weather modeling application, used in both research and operational forecasting
  • Quantum ESPRESSO (DEISA AUSURF112 benchmark): performs density functional theory (DFT) calculations for condensed matter problems
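As a rough stand-in for the osu_latency ping-pong pattern referenced above, the sketch below (my own illustration using mpi4py, not part of the original benchmark suite) times round trips between two ranks over a range of message sizes; running it across two nodes lets you compare, for example, native and SR-IOV InfiniBand.

```python
# Minimal MPI ping-pong latency sketch in the spirit of osu_latency.
# Assumes mpi4py is installed; run with: mpirun -np 2 python pingpong.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
ITERS, SKIP = 1000, 100  # timed iterations and warm-up rounds

for size in (1, 64, 1024, 65536):
    buf = bytearray(size)
    comm.Barrier()
    for i in range(ITERS + SKIP):
        if i == SKIP:
            start = MPI.Wtime()
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    if rank == 0:
        latency_us = (MPI.Wtime() - start) / ITERS / 2 * 1e6  # one-way latency
        print(f"{size:7d} bytes: {latency_us:8.2f} us")
```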
Hardware/Software Configurations of Test Clusters
• Native InfiniBand (SDSC): Rocks 6.1 (EL6); Intel Xeon E5-2660 (2.2 GHz), 16 cores/node; 64 GB DDR3; QDR 4X InfiniBand, Mellanox ConnectX-3
• SR-IOV InfiniBand (SDSC): Rocks 6.1 (EL6), KVM hypervisor; Intel Xeon E5-2660 (2.2 GHz), 16 cores/node; 64 GB DDR3; QDR 4X InfiniBand, Mellanox ConnectX-3
• Native 10GbE (SDSC): Rocks 6.1 (EL6); Intel Xeon E5-2660 (2.2 GHz), 16 cores/node; 64 GB DDR3; 10GbE
• Software-virtualized 10GbE (EC2): Amazon Linux 2013.09 (EL6), Xen HVM, cc2.8xlarge instance; Intel Xeon E5-2670 (2.6 GHz), 16 cores/node; 60.5 GB DDR3; 10GbE (Xen driver)
• SR-IOV 10GbE (EC2): Amazon Linux 2013.09 (EL6), Xen HVM, c3.8xlarge instance; Intel Xeon E5-2680v2 (2.8 GHz), 16 cores/node; 60.5 GB DDR3; 10GbE (Intel VF driver)
Latency Results, 10 GbE: Native, Virtualized, and with SR-IOV
[Figure: MPI point-to-point latency as measured by the osu_latency benchmark. Error bars are +/- three standard deviations from the mean.]
• Virtualized latency is 2-2.5× slower than the native case, even with SR-IOV
• SR-IOV gives a 12-40% improvement over the software-virtualized environment (SR-IOV is provided with Amazon's C3 instances)
• SR-IOV provides 3× to 4× less variation in latency for small message sizes
Bandwidth Results, 10 GbE: Native, Virtualized, and with SR-IOV
[Figure: MPI (a) unidirectional and (b) bidirectional bandwidth for the 10GbE interconnect, as measured by the osu_bw and osu_bibw benchmarks, respectively.]
• Unidirectional messaging bandwidth never exceeds 500 MB/s (~40% of the 10GbE line rate) with or without SR-IOV
• Native performance is 1.5-2× faster
• Similar results for bidirectional bandwidth
• SR-IOV provides very little benefit in either case
• SR-IOV helps slightly in the collective bandwidth tests (13% for random ring, 17% for natural ring)
• Native total ring bandwidth was more than 2× faster than the SR-IOV virtualized results
Latency Results: QDR IB & 10 GbE, Native and SR-IOV
[Figure 5: MPI point-to-point latency measured by osu_latency for QDR InfiniBand. Included for scale are the analogous 10GbE measurements from Amazon EC2 and non-virtualized 10GbE.]
• SR-IOV with QDR InfiniBand:
  • < 30% overhead for small messages (< 128 bytes)
  • < 10% overhead for eager send/receive
  • Overhead → 0% in the bandwidth-limited regime
• Amazon EC2 (10 GbE):
  • > 50× worse latency
  • Time dependent (noisy)
Takeaway: 50× lower latency than Amazon EC2
Bandwidth Results: QDR IB & 10 GbE, Native and SR-IOV
[Figure 6: MPI point-to-point bandwidth measured by osu_bw for QDR InfiniBand. Included for scale are the analogous 10GbE measurements from Amazon EC2 and non-virtualized 10GbE.]
• Comparison of bandwidth relative to native InfiniBand
• SR-IOV with QDR InfiniBand:
  • < 2% bandwidth loss over the entire range
  • > 95% of peak bandwidth
• Amazon EC2 (10 GbE):
  • < 35% of peak bandwidth
  • While the ratio of QDR to 10GbE line rates is ~4×, EC2 bandwidth is 9-25× worse than SR-IOV InfiniBand
Takeaway: 10× more bandwidth than Amazon EC2
WRF Weather Modeling: 15% Overhead with SR-IOV InfiniBand
• 96-core (6-node) calculation
• Nearest-neighbor communication
• Scalable algorithms
• SR-IOV incurs a modest (15%) performance hit
• ... but is still 20% faster than EC2, despite 20% slower CPUs
[Figure: WRF 3.4.1, 3-hour forecast runtimes.]
Quantum ESPRESSO: 28% Overhead
• 48-core (3-node) calculation
• CG matrix inversion: irregular communication
• 3D FFT matrix transposes (all-to-all communication)
• 28% slower with SR-IOV vs. native InfiniBand
• SR-IOV still > 500% faster than EC2, despite 20% slower CPUs
[Figure: Quantum ESPRESSO 5.0.2, DEISA AUSURF112 benchmark runtimes.]
Functional/Operational Design Points for Comet Virtualization
• Users may opt to run either as a normal batch job or as a virtual cluster
• The Virtual Cluster (VC) front end is up 24×365 on VM hosting nodes
• Virtual cluster nodes are transitory (turned on/off through the batch queue)
  • At least initially, cluster physical nodes are dedicated to a single VM (not shared)
• All cluster nodes retain disk state after VC power off (shutdown)
• Cluster owners have BIOS-level control of all nodes
• Clusters must be isolated from one another on the network
• Performance:
  • SR-IOV InfiniBand performance close to native IB
  • Disk performance close to local disk (disk migration)
  • Cluster nodes boot as soon as they are scheduled by the batch system
• Storage file systems:
  • Access to a shared networked file system (NFS)
  • No access to shared Lustre (security issues)
HPC Virtual Cluster (VC) Development Tasks (planned production summer 2015)
• Disk management
• Scheduler integration
• User API
• Implement SR-IOV
• User GUI (with remote console)
• Upload ISOs/disk images
• Home-area mounts to VC
• Lustre mounts to VC
[In the original slide, each task is color-coded with a status: debugging/testing phase, next task, want to improve what's there, relatively quick, nice to have, deployment/administrative issue, or not feasible yet.]
Comet: Implementing Virtual Clusters (VCs)
• Only one VM per physical node
• The VC head node is "always on"
[Figure: physical nodes run the XSEDE stack; virtual machines run the user's stack. Dedicated VM host nodes run the virtual cluster head nodes (HN0 ... HNN) and science gateway hosts (SG0 ... SGN); compute VMs (VC0-VC3) run one per physical node.]
Comet: Implementing Virtual Clusters (VCs), continued
• The head node remains active after VC shutdown
• But the cluster (compute) nodes are released
[Figure: same diagram as above, showing the VC head node still running on a VM host node while the compute-node VMs have been released back to the batch pool.]
VM Disk Management
• Each VM gets a 36 GB disk (small SCSI)
• Disk images must be persistent through reboots
• Two standard solutions:
  • iSCSI (network-mounted disk)
  • Persistent disk replication on nodes
• VMs can be allocated on any/all compute nodes depending on availability (scheduler), so persistent replication on nodes is not feasible
• But a network-mounted disk can be expensive or hurt performance
Hybrid Solution for Disk Management
• Dual central network-attached storage (NAS) devices store all disk images
• The initial boot of any cluster node uses an iSCSI disk (node disk) on the centralized NAS
• On startup, Comet moves the node disk to the physical host running the node VM, then disconnects from the NAS (see the sketch below)
  • All node disk operations are local to the physical host
  • Enables scale-out without a costly NAS device
• At shutdown, any changes made to the node disk (now on the physical host) are migrated back to the NAS, ready for the next boot
Implementing VCs: Startup/Shutdown
• Each VC has its own ZFS file system for VM images
[Figure: same virtual-cluster diagram as above, with a ZFS pool holding the VM disk images that are migrated to and from the physical hosts.]
SR-IOV Is a Huge Step Forward in High-Performance Virtualization
• SR-IOV InfiniBand shows a substantial improvement in latency over Amazon EC2 and provides nearly zero bandwidth overhead
• Application benchmarks confirm a significant performance improvement over EC2
• SR-IOV lowers the performance barrier to virtualizing the interconnect and makes virtualized HPC clusters viable
  • Extends the cloud computing paradigm to HPC applications
• Comet will deliver virtualized HPC to new/non-traditional communities that need software flexibility and reuse, without a major loss of performance
Thank you! (謝謝)
This work was supported by the National Science Foundation, award ACI-1341698.