Virtualizing Modern High-Speed Interconnection Networks with
Performance and Scalability
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Bo Li, Zhigang Huo, Panyong Zhang, Dan Meng
{leo, zghuo, zhangpanyong, md}@ncic.ac.cn
Presenter: Xiang Zhang zhangxiang@ncic.ac.cn
Introduction
• Virtualization is now one of the enabling technologies of Cloud Computing
• Many HPC providers now use their systems as platforms for cloud/utility computing. These HPC-on-Demand offerings include:
  – Penguin's POD
  – IBM's Computing on Demand service
  – R Systems' dedicated hosting service
  – Amazon's EC2
Introduction:
Virtualizing HPC clouds?
• Pros:
  – good manageability
  – proactive fault tolerance
  – performance isolation
  – online system maintenance
• Cons:
  – performance gap
    • Lack of low-latency interconnects, which are important to tightly coupled MPI applications
    • VMM-bypass has been proposed to address this concern
Introduction:
VMM-bypass I/O Virtualization
• The Xen split device driver model is used only to set up the necessary user access points
• Data communication on the critical path bypasses both the guest OS and the VMM
[Figure: VMM-bypass I/O (courtesy [7]) — the application and OS in the guest VM, the guest/back-end/privileged modules in the IDD, and the OS-bypass I/O device; privileged access goes through the IDD, while VMM-bypass access goes directly to the device]
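The bypass is visible in the verbs API itself: once buffers and the QP have been set up through the privileged path, each transfer is posted directly from user space. Below is a minimal sketch using standard libibverbs calls; the helper name and the pre-established QP, MR, and remote buffer are illustrative assumptions, not code from the paper.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: post one RDMA write on an already-connected QP.
 * The critical path below touches only user-space-mapped queues and
 * doorbell registers -- no syscall, no hypercall, hence "VMM-bypass". */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *buf, size_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```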
Introduction:
InfiniBand Overview
• InfiniBand is a popular high-speed interconnect
  – OS-bypass/RDMA
  – Latency: ~1 µs
  – Bandwidth: 3300 MB/s
• ~41.4% of Top500 systems (June 2010) use InfiniBand as the primary interconnect
  [Chart: interconnect family / systems, June 2010. Source: http://www.top500.org]
[Figure: RC vs. XRC in InfiniBand — with RC, each of P1–P4 on node1 holds a QP (with its own RQ) to each of P5–P8 on node2; with XRC, each process makes one connection per remote node into that node's XRC domain, which uses SRQs]
Introduction:
InfiniBand Scalability Problem
• Reliable Connection (RC)
  – Each connection uses a Queue Pair (QP); each QP consists of a send queue (SQ) and a receive queue (RQ)
  – QPs require memory
• Shared Receive Queue (SRQ): many connections share one pool of receive buffers (see the sketch below)
• eXtensible Reliable Connection (XRC)
  – XRC domain & SRQ-based addressing
• Connections per process: (N-1)×C with RC vs. (N-1) with XRC
  (N: node count, C: cores per node)
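To make the SRQ idea concrete, here is a minimal sketch using standard libibverbs calls; the queue sizes and helper name are illustrative assumptions.

```c
#include <infiniband/verbs.h>
#include <string.h>

/* Create one shared receive queue that many connections post into.
 * With plain RC every QP still keeps its own send/receive state; XRC goes
 * further and lets a single per-node QP carry traffic for all local
 * processes, addressed by SRQ number. */
struct ibv_srq *create_shared_rq(struct ibv_pd *pd)
{
    struct ibv_srq_init_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.attr.max_wr  = 4096;  /* receive WRs shared across connections */
    attr.attr.max_sge = 1;
    return ibv_create_srq(pd, &attr);
}
```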
Problem Statement
• Does a scalability gap exist between native and virtualized environments?
  – CV: cores per VM
[Figure: XRC in VMs (CV=2) and XRC in VMs (CV=1) — each VM gets its own XRC domain (XRCD), so processes in different VMs on the same node cannot share XRC connections to a remote node]
Transport        QPs per Process    QPs per Node
Native  RC       (N-1)×C            (N-1)×C²
Native  XRC      (N-1)              (N-1)×C
VM      RC       (N-1)×C            (N-1)×C²
VM      XRC      (N-1)×(C/CV)       (N-1)×(C²/CV)
Scalability gap exists!
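Plugging illustrative numbers into the table shows where the gap comes from; the cluster size below (N=4096 nodes, C=16 cores, CV=1) is a hypothetical example, not a measured configuration.

```c
#include <stdio.h>

/* Worked example of the connection-count formulas in the table above. */
int main(void)
{
    const long N  = 4096;  /* node count     */
    const long C  = 16;    /* cores per node */
    const long Cv = 1;     /* cores per VM   */

    printf("Native RC : %ld QPs/process, %ld QPs/node\n",
           (N - 1) * C, (N - 1) * C * C);
    printf("Native XRC: %ld QPs/process, %ld QPs/node\n",
           N - 1, (N - 1) * C);
    printf("VM XRC    : %ld QPs/process, %ld QPs/node\n",
           (N - 1) * (C / Cv), (N - 1) * (C * C / Cv));
    /* With Cv=1, virtualized XRC needs C (= 16x) more QPs per process
     * than native XRC -- that is the scalability gap. */
    return 0;
}
```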
Presentation Outline
• Introduction
• Problem Statement
• Proposed Design
• Evaluation
• Conclusions and Future Work
Proposed Design:
VM-proof XRC design
• Design goal: eliminate the scalability gap
  – Conns/Process: from (N-1)×(C/CV) down to (N-1)
[Figure: VM-proof XRC — all VMs on a physical node share a single XRC domain, so P1 reaches P5–P8 on the remote node over one connection]
Proposed Design:
Design Challenges
• VM-proof sharing of the XRC domain
  – A single XRC domain must be shared among different VMs within a physical node
• VM-proof connection management
  – With a single XRC connection, P1 is able to send data to all the processes on another physical node (P5–P8), no matter which VMs those processes reside in (sketched below)
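One way to picture the second challenge is with the XRC verbs extension shipped in OFED at the time, where the send work request names the remote SRQ. The field names below follow the legacy OFED XRC API and should be read as an assumption-laden sketch, not the authors' implementation.

```c
#include <infiniband/verbs.h>
#include <string.h>

/* Send on an XRC QP, selecting the destination process on the remote node
 * by its SRQ number (legacy OFED XRC extension assumed). */
static int xrc_send_to(struct ibv_qp *xrc_qp, struct ibv_sge *sge,
                       uint32_t remote_srq_num)
{
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode             = IBV_WR_SEND;
    wr.sg_list            = sge;
    wr.num_sge            = 1;
    wr.send_flags         = IBV_SEND_SIGNALED;
    wr.xrc_remote_srq_num = remote_srq_num;  /* picks P5, P6, P7 or P8 */

    return ibv_post_send(xrc_qp, &wr, &bad_wr);
}

/* P1 reuses the same connection to node2 for every process there:
 *   xrc_send_to(qp_to_node2, &sge, srq_num_of_P5);
 *   xrc_send_to(qp_to_node2, &sge, srq_num_of_P6);  ...
 */
```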
[Figure: Internal MPI architecture — the MPI application and MPI library (ADI, VM-proof CM, channel interface, communication device APIs, InfiniBand OS-bypass I/O) run in the guest domain; VM-proof XRCD sharing and resource management are added both to the guest's front-end driver and to the IDD's back-end driver, above the core InfiniBand modules and the native HCA driver, connected through Xen device and event channels over the Xen hypervisor and the high-speed interconnection network]
Proposed Design:
Implementation
• VM-proof sharing of the XRCD
  – The XRCD is shared by opening the same XRCD file
  – Guest domains and the IDD have dedicated, non-shared filesystems
  – A pseudo XRCD file in the guest corresponds to the real XRCD file on the node (see the sketch below)
• VM-proof CM
  – Traditionally an IP address/hostname was used to identify a node
  – The LID of the HCA is used instead
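A rough sketch of both points, assuming the legacy OFED XRC API (ibv_open_xrc_domain) available with MVAPICH at the time; the file path and helper names are hypothetical.

```c
#include <fcntl.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* 1. XRCD sharing: every process, whichever VM it runs in, opens the same
 *    XRCD file; in the design above, the pseudo XRCD file seen by the guest
 *    corresponds to the node's real XRCD file. */
struct ibv_xrc_domain *open_shared_xrcd(struct ibv_context *ctx)
{
    int fd = open("/path/to/shared_xrcd", O_CREAT | O_RDONLY, 0600); /* hypothetical path */
    if (fd < 0)
        return NULL;
    return ibv_open_xrc_domain(ctx, fd, O_CREAT);  /* legacy OFED signature assumed */
}

/* 2. VM-proof CM: identify the physical node by the HCA port LID rather
 *    than by a per-VM IP address or hostname. */
uint16_t node_id_from_lid(struct ibv_context *ctx, uint8_t port)
{
    struct ibv_port_attr attr;

    if (ibv_query_port(ctx, port, &attr))
        return 0;
    return attr.lid;  /* identical for all VMs sharing the physical HCA */
}
```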
Proposed Design:
Discussions
• Safe XRCD sharing
  – Unauthorized applications from other VMs might try to share the XRCD
    • Isolation of XRCD sharing can be guaranteed by the IDD
  – Isolation between VMs running different MPI jobs
    • By using different XRCD files, different jobs (or VMs) share different XRCDs and run without interfering with each other
• XRC migration
  – Main challenge: an XRC connection is a process-to-node communication channel
  – Future work
Presentation Outline
• Introduction
• Problem Statement
• Proposed Design
• Evaluation
• Conclusions and Future Work
Evaluation:
Platform
• Cluster configuration
  – 128-core InfiniBand cluster
  – Quad-socket, quad-core Barcelona 1.9 GHz nodes
  – Mellanox DDR ConnectX HCAs, 24-port MT47396 InfiniScale-III switch
• Implementation
  – Xen 3.4 with Linux 2.6.18.8
  – OpenFabrics Enterprise Distribution (OFED) 1.4.2
  – MVAPICH-1.1.0
Evaluation:
Microbenchmark
• The bandwidth results are nearly the same
• Virtualized IB performs ~0.1 µs worse when using the BlueFlame mechanism
  – caused by the memory copy of the send data to the HCA's BlueFlame page
  – Explanation: memory copy operations in the virtualized case involve interactions between the guest domain and the IDD
  [Figures: IB verbs latency using doorbell; IB verbs latency using BlueFlame; MPI latency using BlueFlame]
Evaluation:
VM-proof XRC Evaluation
• Configurations
  – Native-XRC: native environment running XRC-based MVAPICH
  – VM-XRC (CV=n): VM-based environment running unmodified XRC-based MVAPICH; CV denotes the number of cores per VM
  – VM-proof XRC: VM-based environment running MVAPICH with our VM-proof XRC design
Evaluation:
Memory Usage
• 16 cores/node cluster, fully connected
  – The x-axis denotes the process count
  – ~12 KB of memory per QP
• ~16x less memory usage
  – 64K processes would consume ~13 GB/node with the VM-XRC (CV=1) configuration
  – The VM-proof XRC design reduces the memory usage to only ~800 MB/node (a quick check follows below)
  [Chart: per-node QP memory vs. process count (13 GB vs. 800 MB); lower is better]
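A back-of-the-envelope check of these numbers, using the ~12 KB/QP figure and the QPs-per-node formulas from the earlier table; the 64K-process, 16-core/node configuration is taken from the slide.

```c
#include <stdio.h>

int main(void)
{
    const double qp_kb = 12.0;        /* memory per QP (KB), from the slide */
    const long   procs = 64 * 1024;   /* total processes                    */
    const long   C     = 16;          /* cores per node                     */
    const long   N     = procs / C;   /* 4096 nodes                         */

    long vm_xrc_qps  = (N - 1) * C * C;   /* VM-XRC, Cv = 1    */
    long vmproof_qps = (N - 1) * C;       /* shared XRC domain */

    printf("VM-XRC (Cv=1): %.1f GB/node\n", vm_xrc_qps  * qp_kb * 1024.0 / 1e9);
    printf("VM-proof XRC : %.0f MB/node\n", vmproof_qps * qp_kb * 1024.0 / 1e6);
    /* ~12.9 GB vs ~805 MB per node: the ~13 GB and ~800 MB on the slide,
     * i.e. a ~16x reduction. */
    return 0;
}
```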
Evaluation:
MPI Alltoall Evaluation
• A total of 32 processes
• 10%–25% improvement (VM-proof XRC) for messages < 256 B
  [Chart: Alltoall latency vs. message size; lower is better]
Evaluation:
Application Benchmarks
• VM-proof XRC performs nearly the same as Native-XRC
  – except for BT and EP
• Both are better than VM-XRC
• Little variation across different CV values
  – CV=8 is an exception: memory allocation is not guaranteed to be NUMA-aware
  [Charts: application benchmark runtimes; lower is better]
Evaluation:
Application Benchmarks (Cont’d)
• ~15.9x fewer connections
• ~14.7x fewer connections
  [Charts: connection counts for the application benchmarks]
Conclusion and Future Work
• The VM-proof XRC design converges two technologies:
  – VMM-bypass I/O virtualization
  – eXtensible Reliable Connection (XRC) in modern high-speed interconnection networks (InfiniBand)
• With the VM-proof XRC design, virtualized environments achieve the same raw performance and scalability as the native, non-virtualized environment
  – ~16x scalability improvement is seen on 16-core/node clusters
• Future work
  – Evaluations on different platforms at larger scale
  – Add VM migration support to the VM-proof XRC design
  – Extend the work to the new SR-IOV-enabled ConnectX-2 HCAs
Questions?
{leo, zghuo, zhangpanyong, md}@ncic.ac.cn
Backup Slides
OS-bypass of InfiniBand
OpenIB Gen2 stack