SAN DIEGO SUPERCOMPUTER CENTER
SR-IOV: Performance Benefits for Virtualized Interconnects
Glenn K. Lockwood, Mahidhar Tatineni, Rick Wagner
July 15, XSEDE14, Atlanta
Background
• High Performance Computing (HPC) is reaching beyond traditional application areas, creating a need for increased flexibility from HPC cyberinfrastructure.
• Increasingly, jobs on XSEDE resources originate from web portals and science gateways.
• Science gateways and user communities can develop virtual "compute appliances" that contain tightly integrated application stacks and can be deployed in a hardware-agnostic fashion.
• Several vendors also package software in appliances to enable easy deployment.
• Such benefits have been an important driving force behind compute clouds such as Amazon Web Services (AWS) EC2.
Background
• The network bandwidth and latency of virtualized systems have traditionally been markedly worse than those of the native hardware.
• Hardware vendors have been adding increased support for virtualization in hardware, such as I/O memory management units (IOMMUs).
• With the standardization and adoption of technologies such as Single Root I/O Virtualization (SR-IOV) in network device hardware, a road has been paved toward truly high-performance virtualization for HPC applications.
• These technologies will be available to XSEDE users via future computing resources (Comet at SDSC: production starting early 2015).
Single Root I/O Virtualization in HPC
• Problem: complex workflows demand increasing flexibility from HPC platforms
• Virtualization = flexibility
• Virtualization = I/O performance loss (e.g., excessive DMA interrupts)
• Solution: SR-IOV and Mellanox ConnectX-3 InfiniBand HCAs
• One physical function (PF) → multiple virtual functions (VFs), each with its own DMA streams, memory space, and interrupts
• Allows DMA to bypass the hypervisor and go directly to VMs
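As an aside on the PF/VF mechanics: on Linux, SR-IOV is exposed through per-device sysfs attributes (`sriov_totalvfs`, `sriov_numvfs`); writing a count to `sriov_numvfs` asks the kernel to create that many VFs. A minimal sketch of that interaction, with the sysfs directory passed in as a parameter (real paths such as `/sys/class/infiniband/mlx4_0/device` vary by system, and this is illustrative, not part of the benchmark setup described in these slides):

```python
from pathlib import Path

def enable_vfs(pf_sysfs_dir, num_vfs):
    """Request `num_vfs` virtual functions on a physical function.

    `pf_sysfs_dir` is the PF's sysfs device directory (hypothetical
    example: /sys/class/infiniband/mlx4_0/device).
    """
    pf = Path(pf_sysfs_dir)
    # The device advertises how many VFs its hardware supports.
    total = int((pf / "sriov_totalvfs").read_text())
    if num_vfs > total:
        raise ValueError(f"PF supports at most {total} VFs")
    # Writing this attribute makes the kernel instantiate the VFs,
    # each of which can then be passed through to a guest VM.
    (pf / "sriov_numvfs").write_text(str(num_vfs))
    return num_vfs
```

Because the sysfs root is a parameter, the logic can be exercised against a mock directory without SR-IOV hardware.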
High-Performance Virtualization on Comet
• Mellanox FDR InfiniBand HCAs with SR-IOV
• Rocks to manage high-performance virtual clusters
• Flexibility to support complex science gateways and web-based workflow engines
• Custom, user-defined application stacks on virtual clusters
• Leveraging FutureGrid expertise and experience
• High-bandwidth filesystem access via virtualized InfiniBand
Hardware/Software Configurations of Test Clusters

| | Native InfiniBand (SDSC) | SR-IOV InfiniBand (SDSC) | Native 10GbE (SDSC) | Software-Virtualized 10GbE (EC2) | SR-IOV 10GbE (EC2) |
|---|---|---|---|---|---|
| Platform | Rocks 6.1 (EL6) | Rocks 6.1 (EL6), KVM hypervisor | Rocks 6.1 (EL6) | Amazon Linux 2013.09 (EL6), Xen HVM, cc2.8xlarge instance | Amazon Linux 2013.09 (EL6), Xen HVM, c3.8xlarge instance |
| CPUs | Intel Xeon E5-2660 (2.2 GHz), 16 cores/node | Intel Xeon E5-2660 (2.2 GHz), 16 cores/node | Intel Xeon E5-2660 (2.2 GHz), 16 cores/node | Intel Xeon E5-2670 (2.6 GHz), 16 cores/node | Intel Xeon E5-2680v2 (2.8 GHz), 16 cores/node |
| RAM | 64 GB DDR3 | 64 GB DDR3 | 64 GB DDR3 | 60.5 GB DDR3 | 60.5 GB DDR3 |
| Interconnect | QDR 4X InfiniBand, Mellanox ConnectX-3 | QDR 4X InfiniBand, Mellanox ConnectX-3 | 10GbE | 10GbE (Xen driver) | 10GbE (Intel VF driver) |
Benchmarks
• Fundamental performance characteristics of the interconnect evaluated using the OSU Micro-Benchmarks: latency, unidirectional-bandwidth, and bidirectional-bandwidth tests.
• WRF: widely used weather modeling application, run in both research and operational forecasting. The CONUS-12km benchmark was used for performance evaluation.
• Quantum ESPRESSO: application that performs density functional theory (DFT) calculations for condensed matter problems. DEISA AUSURF112 benchmark.
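The latency methodology the OSU suite uses can be illustrated in a few lines: time many ping-pong exchanges and report half the average round trip. The sketch below applies that logic over a local socket pair, purely to show the measurement structure; it is not the MPI benchmark itself, and the iteration counts are arbitrary:

```python
import socket
import threading
import time

def _recv_exact(sock, n):
    """Receive exactly n bytes (stream sockets may fragment messages)."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed")
        buf += chunk
    return buf

def pingpong_latency(msg_size=8, iters=1000, warmup=100):
    """One-way latency via the ping-pong method: elapsed / (2 * iters)."""
    a, b = socket.socketpair()
    msg = b"x" * msg_size

    def echo():
        # The "server" side bounces every message straight back.
        for _ in range(warmup + iters):
            b.sendall(_recv_exact(b, msg_size))

    t = threading.Thread(target=echo)
    t.start()
    for _ in range(warmup):              # warm-up rounds are not timed
        a.sendall(msg)
        _recv_exact(a, msg_size)
    t0 = time.perf_counter()
    for _ in range(iters):               # timed ping-pong loop
        a.sendall(msg)
        _recv_exact(a, msg_size)
    elapsed = time.perf_counter() - t0
    t.join()
    a.close()
    b.close()
    return elapsed / (2 * iters)         # seconds, one-way
```

The same structure underlies osu_latency: exclude warm-up iterations, time a long loop to amortize timer overhead, and halve the round trip.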
Benchmark Build Details
• OSU Micro-Benchmarks (OMB version 3.9) compiled with OpenMPI 1.5 and GCC 4.4.6.
• Both test applications were built with Intel Composer XE 2013 and the options necessary for the compiler to generate the 256-bit AVX vector instructions available on all of the testing platforms.
• OpenMPI 1.5 was used for both the InfiniBand and 10GbE platform tests.
• Intel's Math Kernel Library (MKL) 11.0 provided all BLAS, LAPACK, ScaLAPACK, and FFTW3 functions where necessary.
SR-IOV with 10GbE*: Latency Results

MPI point-to-point latency as measured by the osu_latency benchmark. Error bars are +/- three standard deviations from the mean.

• 12-40% improvement in the virtualized environment with SR-IOV.
• Still 2-2.5x slower than the native case, even with SR-IOV.
• SR-IOV provides 3x to 4x less variation in latency for small message sizes.

* SR-IOV provided with Amazon's C3 instances
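The ±3σ error bars in the latency figure are a simple summary statistic over repeated runs; a minimal sketch of how such a band is computed from a list of latency samples (sample values here are made up):

```python
import statistics

def latency_stats(samples):
    """Mean latency plus the +/- 3 standard deviation band used for
    error bars; a wide band indicates noisy, time-dependent latency."""
    mean = statistics.fmean(samples)
    sd = statistics.stdev(samples)       # sample standard deviation
    return mean, mean - 3 * sd, mean + 3 * sd
```

The "3-4x less variation" claim above corresponds to the SR-IOV samples producing a standard deviation 3-4x smaller than the software-virtualized samples at the same message size.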
SR-IOV with 10GbE: Bandwidth Results

MPI (a) unidirectional and (b) bidirectional bandwidth for the 10GbE interconnect as measured by the osu_bw and osu_bibw benchmarks, respectively.

• Unidirectional messaging bandwidth never exceeds 500 MB/s (~40% of line speed).
• Native performance is 1.5-2x faster; similar results for bidirectional bandwidth. SR-IOV has very little benefit in both cases.
• SR-IOV helps slightly in the collective bandwidth tests (13% for random ring, 17% for natural ring).
• Native total ring bandwidth was more than 2x faster than the SR-IOV virtualized results.
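The "~40% of line speed" figure follows directly from the 10GbE line rate; a small worked check (assuming 1 MB = 10^6 bytes, which is the convention the percentage implies):

```python
def fraction_of_line_rate(observed_MBps, line_rate_Gbps=10.0):
    """Fraction of the nominal line rate an observed MPI bandwidth
    represents. 10GbE carries 10e9 bits/s = 1250 MB/s of raw payload."""
    line_MBps = line_rate_Gbps * 1e9 / 8 / 1e6
    return observed_MBps / line_MBps
```

For the 500 MB/s ceiling observed on EC2, `fraction_of_line_rate(500)` gives 0.4, i.e. the ~40% quoted above.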
SR-IOV with InfiniBand: Latency

OSU Micro-Benchmarks (3.9, osu_latency)

Figure 5. MPI point-to-point latency measured by osu_latency for QDR InfiniBand. Included for scale are the analogous 10GbE measurements from Amazon (AWS) and non-virtualized 10GbE.

• SR-IOV:
  • < 30% overhead for messages smaller than 128 bytes
  • < 10% overhead for eager send/recv
  • Overhead → 0% in the bandwidth-limited regime
• Amazon EC2:
  • > 5000% worse latency
  • Time-dependent (noisy)
• 50x less latency than Amazon EC2
SR-IOV with InfiniBand: Bandwidth

OSU Micro-Benchmarks (3.9, osu_bw)

Figure 6. MPI point-to-point bandwidth measured by osu_bw for QDR InfiniBand. Included for scale are the analogous 10GbE measurements from Amazon (AWS) and non-virtualized 10GbE.

• SR-IOV:
  • < 2% bandwidth loss over the entire range
  • > 95% of peak bandwidth
• Amazon EC2:
  • < 35% of peak bandwidth
  • 900% to 2500% worse bandwidth than virtualized InfiniBand
• 10x more bandwidth than Amazon EC2
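The "> 95% of peak" claim can be sanity-checked against the theoretical QDR 4X data rate: 4 lanes at 10 Gbit/s signaling, with 8b/10b encoding leaving 80% of the signal rate for data. A worked calculation (the 3.8 GB/s sample value below is illustrative, not a measurement from the slides):

```python
def qdr4x_peak_GBps():
    """Theoretical peak data rate of QDR 4X InfiniBand."""
    signal_Gbps = 4 * 10           # 4 lanes x 10 Gbit/s signaling
    data_Gbps = signal_Gbps * 8 / 10   # 8b/10b encoding overhead
    return data_Gbps / 8           # bits -> bytes: 4.0 GB/s

def fraction_of_peak(observed_GBps):
    """Observed osu_bw bandwidth as a fraction of QDR 4X peak."""
    return observed_GBps / qdr4x_peak_GBps()
```

So sustaining more than 95% of peak means osu_bw reaching above 3.8 GB/s through the virtualized HCA.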
Application Benchmarks: WRF
• WRF CONUS-12km benchmark: 12 km horizontal resolution on a 425 x 300 grid with 35 vertical levels and a time step of 72 seconds.
• Run using six nodes (96 cores) over QDR 4X InfiniBand virtualized with SR-IOV.
• SR-IOV test cluster used 2.2 GHz Intel Xeon E5-2660 processors.
• Amazon instances used 2.6 GHz Intel Xeon E5-2670 processors.
Weather Modeling: 15% Overhead
• 96-core (6-node) calculation
• Nearest-neighbor communication
• Scalable algorithms
• SR-IOV incurs a modest (15%) performance hit
• ...but is still 20% faster*** than Amazon
WRF 3.4.1 – 3-hour forecast
*** 20% faster despite the SR-IOV cluster having 20% slower CPUs
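The footnote's point can be made quantitative with a rough clock normalization. Assuming, as a crude model, that runtime scales linearly with CPU clock (an assumption, not something measured in the slides), the raw 1.2x win over Amazon understates the interconnect advantage:

```python
def clock_normalized_speedup(raw_speedup, clock_this_GHz, clock_other_GHz):
    """Credit the slower-clocked cluster for its clock deficit, under a
    crude linear-scaling assumption: if it still wins while clocked
    lower, the normalized advantage is proportionally larger."""
    return raw_speedup * (clock_other_GHz / clock_this_GHz)
```

With the 2.2 GHz SR-IOV nodes against Amazon's 2.6 GHz nodes, `clock_normalized_speedup(1.20, 2.2, 2.6)` is about 1.42, suggesting the interconnect accounts for a larger gap than the raw 20% figure shows.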
Application Benchmarks: Quantum ESPRESSO
• DEISA AUSURF112 benchmark with Quantum ESPRESSO.
• Matrix diagonalization is done with the conjugate gradient algorithm; communication overhead is larger than in the WRF case.
• 3D Fourier transforms stress the interconnect because they perform multiple matrix transposes in which every MPI rank must exchange data with every other MPI rank.
• These global collectives cause interconnect congestion, and efficient 3D FFTs are limited by the bisection bandwidth of the fabric connecting all of the compute nodes.
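The all-to-all cost of those transposes can be modeled with simple counting. Assuming a slab decomposition of an n³ grid over P ranks with double-complex (16-byte) elements, each transpose makes every rank ship the fraction (P-1)/P of its slab to other ranks, and roughly half the total grid crosses the network bisection. This is a back-of-the-envelope model, not the communication pattern Quantum ESPRESSO literally implements:

```python
def fft_transpose_traffic(n, ranks, bytes_per_elem=16):
    """Bytes one rank sends during a single 3-D FFT transpose under a
    slab decomposition: all of its slab except the part it keeps."""
    elems_per_rank = n ** 3 // ranks
    return elems_per_rank * bytes_per_elem * (ranks - 1) // ranks

def transpose_time_lower_bound(n, ranks, bisection_GBps, bytes_per_elem=16):
    """Optimistic time per transpose: roughly half the grid's bytes
    must cross the fabric's bisection, so bisection bandwidth bounds it."""
    total_bytes = n ** 3 * bytes_per_elem
    return (total_bytes / 2) / (bisection_GBps * 1e9)
```

The per-rank traffic shrinks only as 1/P while the number of messages grows as P², which is why these collectives congest the interconnect as node counts rise.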
Quantum ESPRESSO: 5x Faster than EC2
• 48-core (3-node) calculation
• CG matrix inversion (irregular communication)
• 3D FFT matrix transposes (all-to-all communication)
• 28% slower with SR-IOV
• SR-IOV still > 500% faster*** than EC2
Quantum ESPRESSO 5.0.2 – DEISA AUSURF112 benchmark
*** despite the SR-IOV cluster having 20% slower CPUs
Conclusions
• SR-IOV is a huge step forward in high-performance virtualization.
• It shows substantial latency improvement over Amazon EC2 and has negligible bandwidth overhead.
• Application benchmark performance confirms this: significant improvement over EC2.
• SR-IOV lowers the performance barrier to virtualizing the interconnect and makes fully virtualized HPC clusters viable.
• Comet will deliver virtualized HPC to new and non-traditional communities that need flexibility, without major loss of performance.