Accelerating High Performance Computing with GPUDirect RDMA
Wednesday, August 7, 2013, 10AM-11AM PST
Leading Supplier of End-to-End Interconnect Solutions
[Portfolio diagram: Virtual Protocol Interconnect spanning server/compute, switch/gateway, and storage front/back-end; 56Gb/s InfiniBand and FCoIB, 10/40/56GbE and FCoE; ICs, adapter cards, switches/gateways, cables, and host/fabric software]
Comprehensive End-to-End InfiniBand and Ethernet Portfolio
I/O Offload Frees Up CPU for Application Processing
[Chart: without RDMA, roughly 53% CPU efficiency and 47% CPU overhead/idle; with RDMA and offload, roughly 88% CPU efficiency and 12% CPU overhead/idle (split across user space and system space)]
Mellanox Interconnect Development Timeline (2008-2013)
[Timeline: QDR InfiniBand end-to-end; GPUDirect technology released; MPI/SHMEM collectives offloads (FCA), Scalable HPC (MXM), OpenSHMEM, PGAS/UPC; long-haul solutions; FDR InfiniBand end-to-end; CORE-Direct technology; InfiniBand-Ethernet bridging; world's first Petaflop systems; Connect-IB 100Gb/s HCA with Dynamically Connected Transport; GPUDirect RDMA]
Technology and Solutions Leadership
GPUDirect History
The GPUDirect project was announced in November 2009
• "NVIDIA Tesla GPUs To Communicate Faster Over Mellanox InfiniBand Networks"
GPUDirect was developed jointly by Mellanox and NVIDIA
• New interface (API) within the Tesla GPU driver
• New interface within the Mellanox InfiniBand drivers
• Linux kernel modification to allow direct communication between the drivers
GPUDirect 1.0 was announced in Q2 2010
• "Mellanox Scalable HPC Solutions with NVIDIA GPUDirect Technology Enhance GPU-Based HPC Performance and Efficiency"
• "Mellanox was the lead partner in the development of NVIDIA GPUDirect"
The GPUDirect RDMA alpha release is available today
• Mellanox has over two dozen developers using it and providing feedback
• "Proof of concept" designs are in flight today with commercial end-customers and government entities
• Ohio State University has a version of MVAPICH2 with GPUDirect RDMA support available for MPI application developers
GPUDirect RDMA is targeted for GA release in Q4 2013
GPU-InfiniBand Bottleneck (pre-GPUDirect)
GPU communication uses "pinned" buffers for data movement
• A section of host memory dedicated to the GPU
• Allows optimizations such as write-combining and overlapping GPU computation with data transfer for best performance
InfiniBand uses "pinned" buffers for efficient RDMA transactions
• Zero-copy data transfers, kernel bypass
• Reduces CPU overhead
[Diagram: data moves in two steps, from GPU memory through the chipset into system memory (1), and from system memory to the InfiniBand adapter (2)]
A code sketch of this staged path is shown below.
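To make the two-step staged path concrete, here is a minimal sketch (illustrative only, not from the slides; names are hypothetical and error checking is omitted) of how an application pins a host staging buffer for the GPU and registers the same memory with the InfiniBand HCA, assuming a CUDA context and an ibv_pd protection domain 'pd' already exist:

    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>

    void stage_gpu_buffer(struct ibv_pd *pd, const void *d_buf, size_t size)
    {
        void *h_buf;
        struct ibv_mr *mr;

        /* Page-locked ("pinned") host buffer, usable by both CUDA and the HCA */
        cudaMallocHost(&h_buf, size);

        /* Step 1: copy GPU memory into the pinned system-memory buffer */
        cudaMemcpy(h_buf, d_buf, size, cudaMemcpyDeviceToHost);

        /* Step 2: register the buffer so the InfiniBand HCA can DMA from it */
        mr = ibv_reg_mr(pd, h_buf, size,
                        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);

        /* ... post send / RDMA work requests that reference mr ... */

        ibv_dereg_mr(mr);
        cudaFreeHost(h_buf);
    }

Every GPU-to-network transfer pays for this extra hop through system memory, which is exactly the copy that GPUDirect removes.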
GPUDirect 1.0
[Diagrams, transmit and receive paths: without GPUDirect, the GPU and the InfiniBand adapter each use their own buffer in system memory, so data is copied twice through the host (steps 1 and 2); with GPUDirect 1.0, both share the same pinned system-memory buffer, removing the extra host copy (step 1 only)]
GPUDirect 1.0 - Application Performance
• LAMMPS: 3 nodes, 10% gain
• Amber (Cellulose): 8 nodes, 32% gain
• Amber (FactorIX): 8 nodes, 27% gain
[Charts: results with 3 nodes and 1 GPU per node, and 3 nodes with 3 GPUs per node]
GPUDirect RDMA
[Diagrams, transmit and receive paths: with GPUDirect 1.0, data still flows through a shared buffer in system memory between the GPU and the InfiniBand adapter; with GPUDirect RDMA, the InfiniBand adapter reads from and writes to GPU memory directly, bypassing system memory]
Hardware considerations for GPUDirect RDMA
[Diagram: the GPU and the InfiniBand adapter attached either to the same CPU/chipset or behind the same PCIe switch]
Note: For GPUDirect RDMA to work properly, the NVIDIA GPU and the Mellanox InfiniBand adapter must share the same PCIe root complex.
How to get started evaluating GPUDirect RDMA…
How do I get started with the GPUDirect RDMA alpha code release?
The only way to get access to the alpha release is by sending an email to [email protected]. You will receive a response within 24 hours and will then be able to download the code from an FTP site.
If you would like to evaluate MVAPICH2-1.9-GDR, please state this in the email request, and you will receive a separate email explaining how to download it.
How can I ensure I get the latest updates and information?
The Community Site at Mellanox (http://community.mellanox.com) is a great place to get
the very latest information on GPUDirect RDMA. You will also be able to connect with
your peers, ask questions, exchange ideas, find additional resources, and share best
practices.
A Community for Mellanox Technology Enthusiasts
Thank You
MVAPICH2-GDR: MVAPICH2 with GPUDirect RDMA
Webinar on Accelerating High Performance Computing with GPUDirect RDMA
by Prof. Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
MVAPICH Project: https://mvapich.cse.ohio-state.edu/
Large-scale InfiniBand Cluster Installations
• 205 IB clusters (41%) in the June 2013 Top500 list (http://www.top500.org)
• Installations in the Top 40 (18 systems):
  – 462,462 cores (Stampede) at TACC (6th)
  – 147,456 cores (SuperMUC) in Germany (7th)
  – 110,400 cores (Pangea) at France/Total (11th)
  – 73,584 cores (Spirit) at USA/Air Force (14th)
  – 77,184 cores (Curie thin nodes) at France/CEA (15th)
  – 120,640 cores (Nebulae) at China/NSCS (16th)
  – 72,288 cores (Yellowstone) at NCAR (17th)
  – 125,980 cores (Pleiades) at NASA/Ames (19th)
  – 70,560 cores (Helios) at Japan/IFERC (20th)
  – 73,278 cores (Tsubame 2.0) at Japan/GSIC (21st)
  – 138,368 cores (Tera-100) at France/CEA (25th)
  – 53,504 cores (PRIMERGY) at Australia/NCI (27th)
  – 77,520 cores (Conte) at Purdue University (28th)
  – 48,896 cores (MareNostrum) at Spain/BSC (29th)
  – 78,660 cores (Lomonosov) in Russia (31st)
  – 137,200 cores (Sunway Blue Light) in China (33rd)
  – 46,208 cores (Zin) at LLNL (34th)
  – 38,016 cores at India/IITM (36th)
• More are getting installed!
MVAPICH2/MVAPICH2-X Software
• High-performance, open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Used by more than 2,055 organizations (HPC centers, industry, and universities) in 70 countries
  – More than 180,000 downloads directly from the OSU site
  – Empowering many TOP500 clusters
    • 7th-ranked 204,900-core cluster (Stampede) at TACC
    • 14th-ranked 125,980-core cluster (Pleiades) at NASA
    • 17th-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • 75th-ranked 16,896-core cluster (Keeneland) at GaTech
    • and many others
  – Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
MVAPICH2 1.9 and MVAPICH2-X 1.9
• Released on 05/06/13
• Major features and enhancements
  – Based on MPICH-3.0.3
    • Support for all MPI-3 features (non-blocking collectives, neighborhood collectives, etc.)
  – Support for single-copy intra-node communication using Linux-supported CMA (Cross Memory Attach)
    • Provides flexibility for intra-node communication: shared memory, LiMIC2, and CMA
  – Checkpoint/Restart using LLNL's Scalable Checkpoint/Restart Library (SCR)
    • Support for application-level checkpointing
    • Support for hierarchical system-level checkpointing
  – Scalable UD-multicast-based designs and tuned algorithm selection for collectives
  – Improved job startup time
    • New runtime variable MV2_HOMOGENEOUS_CLUSTER for optimized startup on homogeneous clusters
  – Revamped build system with support for parallel builds
  – Many enhancements related to GPUs
• MVAPICH2-X 1.9 supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models
  – Based on MVAPICH2 1.9 including MPI-3 features; compliant with UPC 2.16.2 and OpenSHMEM v1.0d
Designing GPU-Aware MPI Library
• OSU started this research and development direction in 2011
• Initial support was provided in MVAPICH2 1.8a (SC '11)
• Since then, many enhancements and new designs related to GPU communication have been incorporated into the 1.8 and 1.9 series
• The OSU Micro-Benchmark suite (OMB) has also been extended to test and evaluate
  – GPU-aware MPI communication
  – OpenACC
What is a GPU-Aware MPI Library?
MPI + CUDA - Naive
• Data movement in applications with standard MPI and CUDA interfaces
[Diagram: GPU and NIC attached to the CPU over PCIe on each node, connected through a switch]
At the sender:
    cudaMemcpy(s_hostbuf, s_devbuf, ...);
    MPI_Send(s_hostbuf, size, ...);
At the receiver:
    MPI_Recv(r_hostbuf, size, ...);
    cudaMemcpy(r_devbuf, r_hostbuf, ...);
High productivity and low performance
MPI + CUDA - Advanced
• Pipelining at the user level with non-blocking MPI and CUDA interfaces
[Diagram: GPU and NIC attached to the CPU over PCIe; the copy and the send are pipelined in blocks]
At the sender:
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(s_hostbuf + j * blk_sz, s_devbuf + j * blk_sz, ...);
    for (j = 0; j < pipeline_len; j++) {
        while (result != cudaSuccess) {
            result = cudaStreamQuery(...);
            if (j > 0) MPI_Test(...);
        }
        MPI_Isend(s_hostbuf + j * blk_sz, blk_sz, ...);
    }
    MPI_Waitall(...);
(similar at the receiver)
Low productivity and high performance
GPU-Aware MPI Library: MVAPICH2-GPU
At the sender:
    MPI_Send(s_devbuf, size, ...);
At the receiver:
    MPI_Recv(r_devbuf, size, ...);
(the staging and pipelining happen inside MVAPICH2)
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (CUDA 4.0 and later)
• Overlaps data movement from the GPU with RDMA transfers
High performance and high productivity
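As a concrete illustration (a minimal sketch, not taken from the slides; error checking omitted), a GPU-aware MVAPICH2 build lets an application pass cudaMalloc'd pointers straight to the MPI calls:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int n = 1 << 20;                    /* 1M floats */
        float *d_buf;                             /* device buffer */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMalloc((void **)&d_buf, n * sizeof(float));

        if (rank == 0)        /* the device pointer is handed directly to MPI */
            MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

The library detects (via UVA) that the pointer refers to device memory and performs the staging and pipelining internally.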
MPI Micro-benchmark Performance
• 45% improvement compared with a naive user-level implementation (Memcpy+Send) for 4MB messages
• 24% improvement compared with an advanced user-level implementation (MemcpyAsync+Isend) for 4MB messages
[Chart: time (us) vs. message size from 32K to 4M bytes for Memcpy+Send, MemcpyAsync+Isend, and MVAPICH2-GPU; lower is better]
H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, "MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters", ISC '11
MVAPICH2 1.9 Features for NVIDIA GPU Clusters
• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
• High-performance intra-node point-to-point communication for multiple GPUs per node (GPU-GPU, GPU-Host, and Host-GPU)
• Takes advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication between multiple GPUs per node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
GPU-Direct RDMA with CUDA 5.0
• Fastest possible communication between the GPU and other PCIe devices
• The network adapter can directly read/write data from/to GPU device memory
• Avoids copies through the host
• Allows for better asynchronous communication
• OFED with GPUDirect support is being developed by NVIDIA and Mellanox
[Diagram: the InfiniBand adapter accesses GPU memory directly, bypassing system memory and the CPU/chipset]
Initial Design of OSU-MVAPICH2 with GPU-Direct-RDMA
• Peer-to-peer (P2P) bottlenecks on Sandy Bridge (SNB E5-2670): P2P write 5.2 GB/s, P2P read < 1.0 GB/s
• Design of MVAPICH2: a hybrid design
  – Takes advantage of GPU-Direct-RDMA for writes to the GPU
  – Uses the host-based buffered design in current MVAPICH2 for reads
  – Works around the bottlenecks transparently
[Diagram: IB adapter, CPU, chipset, system memory, and GPU memory on a Sandy Bridge node]
Performance of MVAPICH2 with GPU-Direct-RDMA
GPU-GPU Internode MPI Latency
[Charts: small-message latency (1 byte to 4KB) and large-message latency (16KB to 4MB) for MVAPICH2-1.9 vs. MVAPICH2-1.9-GDR; lower is better. Small-message latency drops from 19.1 us to 6.2 us, a 67.5% improvement]
Based on MVAPICH2-1.9; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.5; OFED 1.5.4.1 with GPU-Direct-RDMA patch
Performance of MVAPICH2 with GPU-Direct-RDMA
GPU-GPU Internode MPI Uni-Directional Bandwidth
[Charts: small-message (1 byte to 4KB) and large-message (8KB to 2MB) bandwidth in MB/s for MVAPICH2-1.9 vs. MVAPICH2-1.9-GDR; higher is better; annotated gains of 2.8x and 33%]
Based on MVAPICH2-1.9; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.5; OFED 1.5.4.1 with GPU-Direct-RDMA patch
Performance of MVAPICH2 with GPU-Direct-RDMA
GPU-GPU Internode MPI Bi-directional Bandwidth
[Charts: small-message (1 byte to 4KB) and large-message (8KB to 2MB) bi-directional bandwidth in MB/s for MVAPICH2-1.9 vs. MVAPICH2-1.9-GDR; higher is better; annotated gains of 3x and 54%]
Based on MVAPICH2-1.9; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.5; OFED 1.5.4.1 with GPU-Direct-RDMA patch
How will it help me?
• MPI applications can be made GPU-aware to use direct communication from/to GPU buffers, as supported by MVAPICH2 1.9, and extract performance benefits
• GPU-Aware MPI applications using short and medium messages can extract added performance and scalability benefits with MVAPICH2-GPUDirect RDMA (MVAPICH2-GDR)
How can I get Started with GDR Experimentation?
• Two modules are needed
  – The GPUDirect RDMA (GDR) driver from Mellanox
  – MVAPICH2-GDR from OSU
• Send a note to [email protected]
• You will get alpha versions of the GDR driver and MVAPICH2-GDR (based on the MVAPICH2 1.9 release)
• You can get started with this version
• The MVAPICH2 team is working on multiple enhancements (collectives, datatypes, one-sided) to exploit the advantages of GDR
• As the GDR driver matures, successive versions of MVAPICH2-GDR with enhancements will be made available to the community
Will it be too Hard to Use GDR?
• No
• You first need to install the OFED-GDR driver from Mellanox
• Then install MVAPICH2-GDR
• Current GPU-aware features in MVAPICH2 are triggered with a runtime parameter: MV2_USE_CUDA=1
• To activate GDR functionality, you just need one more runtime parameter: MV2_USE_GPUDIRECT=1
• A short demo will be shown now to illustrate the easy usage of MVAPICH2-GDR
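For example, with an MVAPICH2-GDR installation, an OSU latency test between two GPU device buffers could be launched roughly as follows (an illustrative command, not from the slides; launcher syntax, hostnames, and benchmark paths will vary by system):

    mpirun_rsh -np 2 node1 node2 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=1 ./osu_latency D D

Here the trailing 'D D' asks the OSU micro-benchmark to place both the send and receive buffers in device memory; dropping MV2_USE_GPUDIRECT=1 falls back to the host-staged GPU-aware path.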
Additional Information and Contact Point for Questions
http://www.cse.ohio-state.edu/~panda
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu
Accelerating High Performance Computing with GPUDirect™ RDMA
NVIDIA webinar 8/7/2013
Outline
• GPUDirect technology family
• Current NVIDIA software and hardware requirements
• Current MPI status
• Using GPUDirect with IB verbs extensions
• Using GPUDirect RDMA and MPI
• CUDA 6: moving GPUDirect from alpha to beta to GA
• Team Q & A
GPUDirect is a family of technologies
• GPUDirect Shared GPU-Sysmem, for inter-node copy optimization
  – How: use GPUDirect-aware 3rd-party network drivers
• GPUDirect P2P, for intra-node, accelerated GPU-GPU memcpy
  – How: use CUDA APIs directly in the application (see the sketch after this list)
  – How: use a P2P-aware MPI implementation
• GPUDirect P2P, for intra-node, inter-GPU LD/ST access
  – How: access remote data by address directly in GPU device code
• GPUDirect RDMA, for inter-node copy optimization
  – What: 3rd-party PCIe devices can read and write GPU memory
  – How: use GPUDirect RDMA-aware 3rd-party network drivers* and MPI implementations*, or custom device drivers for other hardware
* forthcoming
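As an illustration of the CUDA-API route to GPUDirect P2P (a minimal sketch, not from the slides; error checking omitted), a direct copy between two GPUs in the same node looks like this:

    #include <cuda_runtime.h>

    /* Copy 'bytes' from a buffer on GPU 0 to a buffer on GPU 1 */
    int copy_p2p(float *dst_on_gpu1, const float *src_on_gpu0, size_t bytes)
    {
        int can_access = 0;
        cudaDeviceCanAccessPeer(&can_access, 1, 0);   /* can GPU 1 map GPU 0? */
        if (!can_access)
            return -1;                                /* caller stages via host instead */

        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);             /* map GPU 0 into GPU 1's address space */

        /* With peer access enabled, the copy moves directly over PCIe,
           without a staging buffer in system memory */
        cudaMemcpyPeer(dst_on_gpu1, 1, src_on_gpu0, 0, bytes);
        return 0;
    }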
NVIDIA Software and Hardware Requirements
• What drivers and CUDA versions are required to support GPUDirect?
  – The alpha patches work with CUDA 5.0 or CUDA 5.5
  – The final release will be based on CUDA 6.0 (beta in October)
  – New driver, probably version 331
  – Register at developer.nvidia.com for early access
• NVIDIA hardware requirements
  – RDMA is available on Tesla and Quadro Kepler-class hardware
GPU-Aware MPI Libraries - Current Status
All libraries allow:
• The GPU and the network device to share the same sysmem buffers
• Use of the best transfer mode (such as CUDA IPC direct transfer between GPUs within a node)
• Send and receive of GPU buffers, and most collectives
Versions:
• MVAPICH2 1.9
• Open MPI 1.7.2
• IBM Platform MPI V9.1
Reference: NVIDIA GPUDirect Technology Overview
IB verbs extensions for GPUDirect RDMA
• Developers may program at the IB verbs level or with MPI
• A current version with RDMA support is available via Mellanox
  – Gives application developers early access to an RDMA path
• IB verbs was changed to provide:
  – Extended memory registration APIs to support GPU buffers
  – A GPU memory de-allocation callback (for efficient MPI implementations)
IB verbs with GPUDirect RDMA
Use the existing memory registration APIs:
    struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr, size_t length, int access);
• pd: protection domain
• access flags:
    IBV_ACCESS_LOCAL_WRITE = 1, IBV_ACCESS_REMOTE_WRITE = (1<<1), IBV_ACCESS_REMOTE_READ = (1<<2), IBV_ACCESS_REMOTE_ATOMIC = (1<<3), IBV_ACCESS_MW_BIND = (1<<4)
    int ibv_dereg_mr(struct ibv_mr *mr);
Example (registering a GPU device buffer directly):
    cudaMalloc(&d_buf, size);
    mr = ibv_reg_mr(pd, d_buf, size, ...);
    /* ... RDMA on the buffer here ... */
    ibv_dereg_mr(mr);
    cudaFree(d_buf);
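Continuing that example (an illustrative sketch of standard IB verbs usage, not a GPUDirect-specific API; it assumes a connected RC queue pair 'qp' and that the peer's buffer address 'remote_addr' and key 'remote_rkey' were exchanged out of band), an RDMA write can then be posted directly from the registered GPU buffer:

    struct ibv_sge sge = {
        .addr   = (uintptr_t)d_buf,          /* GPU device address registered above */
        .length = size,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list             = &sge,
        .num_sge             = 1,
        .opcode              = IBV_WR_RDMA_WRITE,
        .send_flags          = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr,  /* peer's buffer address */
        .wr.rdma.rkey        = remote_rkey,  /* peer's memory key */
    };
    struct ibv_send_wr *bad_wr = NULL;

    ibv_post_send(qp, &wr, &bad_wr);         /* completion is reported on the send CQ */

The HCA then reads the data straight out of GPU memory, which is the behavior the GPUDirect RDMA driver patch enables.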
GPUDirect RDMA
[Diagram: a third-party PCIe device reads and writes GPU memory directly across the CPU/IOH, alongside CPU memory; today the third-party hardware is the Mellanox adapter, with other devices possible tomorrow]
GPUDirect RDMA: Common use cases
• Inter-node MPI communication
  – Transfer data between local GPU memory and a remote node
• Interfacing with third-party hardware
  – Requires adopting the NVIDIA GPUDirect-Interop API in the vendor's software stack
GPUDirect RDMA: What does it get you?
• MPI_Send latency of ~20 us with Shared GPU-Sysmem
  – No overlap possible
  – Bidirectional transfer is difficult
• MPI_Send latency of ~6 us with RDMA
  – Does not affect running kernels
  – Unlimited concurrency
  – RDMA possible!
So what happens at CUDA 6?
• No change to MPI programs
• Interfaces are simplified, reducing the work for MPI implementors
• Programmers working at the verbs level also benefit
• Requires upgrading to CUDA 6 and the then-current NVIDIA driver
• Register to receive updates on:
  – The release of CUDA 6 (RC1, then final)
  – The release of the MVAPICH2 beta, then final
  – Progress from other MPI vendors
Contacts and Resources
• NVIDIA
  – Register at developer.nvidia.com for early access to CUDA 6
  – Developer Zone GPUDirect page
  – Developer Zone RDMA page
• Mellanox
  – Register at community.mellanox.com
• Ohio State
  – http://mvapich.cse.ohio-state.edu
  – Emails
The End
Question Time
Upcoming GTC Express Webinars
August 13 - GPUs in the Film Visual Effects Pipeline
August 14 - Beyond Real-time Video Surveillance Analytics with GPUs
August 15 - CUDA 5.5 Production Release: Features Overview
September 5 - Data Discovery through High-Data-Density Visual Analysis using NVIDIA GRID GPUs
September 12 - Guided Performance Analysis with NVIDIA Visual Profiler
Register at www.gputechconf.com/gtcexpress
GTC 2014 Call for Submissions
Looking for submissions in the fields of
• Science and research
• Professional graphics
• Mobile computing
• Automotive applications
• Game development
• Cloud computing
Submit by September 27 at www.gputechconf.com