Dror Goldenberg, March 2016, HPCAC Swiss
Co-Design Architecture: Emergence of New Co-Processors
Co-Design Architecture to Enable Exascale Performance
CPU-Centric: limited to main CPU usage, which results in performance limitations
Co-Design: creating synergies between software and hardware enables higher performance and scale
In-CPU Computing · In-Network Computing · In-Storage Computing
The Intelligence is Moving to the Interconnect
Past: intelligence resides in the CPU → Future: intelligence moves to the interconnect
Intelligent Interconnect Delivers Higher Datacenter ROI
[Diagram: today network functions run on the CPU, taking computing away from users' applications; a smart network moves that intelligence into the network, offloading computing for applications and increasing datacenter value]
Breaking the Application Latency Wall
§ Today: Network device latencies are on the order of 100 nanoseconds
§ Challenge: Enabling the next order of magnitude improvement in application performance
§ Solution: Creating synergies between software and hardware – intelligent interconnect
Intelligent Interconnect Paves the Road to Exascale Performance
10 years ago: Network ~10 microseconds, Communication framework ~100 microseconds
Today: Network ~0.1 microsecond, Communication framework ~10 microseconds
Future: Co-Design network ~0.05 microsecond, Communication framework ~1 microsecond
Introducing Switch-IB 2 World’s First Smart Switch
§ The world's fastest switch, with <90 nanosecond latency
§ 36 ports, 100Gb/s per port, 7.2Tb/s throughput, 7.02 billion messages/sec
§ Adaptive routing, congestion control, and support for multiple topologies
Built for Scalable Compute and Storage Infrastructures
10X Higher Performance with the New Switch SHArP Technology
SHArP (Scalable Hierarchical Aggregation Protocol) Technology
Delivering 10X Performance Improvement
for MPI and SHMEM/PGAS Communications
Switch-IB 2 Enables the Switch Network to
Operate as a Co-Processor
SHArP Enables Switch-IB 2 to Manage and
Execute MPI Operations in the Network
SHArP Performance Advantage
§ MiniFE is a finite element mini-application
• Implements kernels that represent implicit finite-element applications
10X to 25X Performance Improvement
AllReduce MPI Collective
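As an illustration of the collective being offloaded, here is a minimal MPI sketch (not from the original deck): the application issues a standard MPI_Allreduce, and with a SHArP-enabled MPI library (e.g. HPC-X) the reduction is aggregated inside the switch fabric rather than on the host CPUs; the vector length below is arbitrary.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each rank contributes a small vector; the size is illustrative. */
        double local[8], global[8];
        for (int i = 0; i < 8; i++)
            local[i] = (double)rank + i;

        /* Standard MPI collective: with a SHArP-enabled MPI the sum is
         * aggregated in the switch network instead of on the host CPUs;
         * the application code is unchanged. */
        MPI_Allreduce(local, global, 8, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global[0] = %f\n", global[0]);

        MPI_Finalize();
        return 0;
    }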
The Intelligence is Moving to the Interconnect
Communication Frameworks (MPI, SHMEM/PGAS)
The Only Approach to Deliver 10X Performance Improvements
[Roadmap: interconnect offloads across the stack – transport, RDMA, SR-IOV, collectives, Peer-Direct, GPUDirect and more; MPI / SHMEM offloads arriving in Q1'16 and Q3'16]
Multi-Host Socket Direct™ – Low Latency Socket Communication
§ Each CPU has direct network access
§ QPI avoidance for I/O – improves performance
§ Enables GPU / peer direct on both sockets
§ Solution is transparent to software
Multi-Host Socket Direct Performance
50% Lower CPU Utilization
20% Lower Latency
Multi Host Evaluation Kit
Lower Application Latency, Free-up CPU
Introducing ConnectX-4 Lx Programmable Adapter
Scalable, Efficient, High-Performance and Flexible Solution
Security
Cloud/Virtualization
Storage
High Performance Computing
Precision Time Synchronization
Networking + FPGA
Mellanox Acceleration Engines and FPGA Programmability on One Adapter
Mellanox InfiniBand – Proven and Most Scalable HPC Interconnect
“Summit” System “Sierra” System
Paving the Road to Exascale
NCAR-Wyoming Supercomputing Center (NWSC) – “Cheyenne”
§ Cheyenne supercomputer system
§ 5.34-petaflop SGI ICE XA Cluster
§ Intel “Broadwell” processors
§ More than 4K compute nodes
§ Mellanox EDR InfiniBand interconnect
§ Mellanox Unified Fabric Manager
§ Partial 9D Enhanced Hypercube interconnect topology
§ DDN SFA14KX systems
§ 20 petabytes of usable file system space
§ IBM GPFS (General Parallel File System)
High-Performance Designed 100Gb/s Interconnect Solutions
§ Transceivers and active optical / copper cables (10 / 25 / 40 / 50 / 56 / 100Gb/s): VCSELs, silicon photonics and copper
§ InfiniBand switch: 36 EDR (100Gb/s) ports, <90ns latency, 7.2Tb/s throughput, 7.02 billion msg/sec (195M msg/sec/port)
§ Adapter: 100Gb/s, 0.7us latency, 150 million messages per second (10 / 25 / 40 / 50 / 56 / 100Gb/s)
§ Ethernet switch: 32 100GbE ports, 64 25/50GbE ports (10 / 25 / 40 / 50 / 100GbE), 6.4Tb/s throughput
Leading Supplier of End-to-End Interconnect Solutions
Comprehensive End-to-End InfiniBand and Ethernet Portfolio (VPI): software, ICs, switches/gateways, adapter cards, cables/modules, Metro/WAN, NPU & multicore (NPS, TILE)
Enabling the Use of Data – Store, Analyze
The Performance Advantage of EDR 100G InfiniBand (28-80%)
End-to-End Interconnect Solutions for All Platforms
Highest Performance and Scalability for
X86, Power, GPU, ARM and FPGA-based Compute and Storage Platforms
10, 20, 25, 40, 50, 56 and 100Gb/s Speeds
Smart Interconnect to Unleash The Power of All Compute Architectures
Technology Roadmap – One-Generation Lead over the Competition
[Roadmap: 20G → 40G → 56G → 100G (2015) → 200G → Mellanox 400G, spanning terascale through petascale to exascale; milestones include the Virginia Tech (Apple) cluster, 3rd on the TOP500 in 2003, and the Mellanox-connected "Roadrunner", 1st on the TOP500]
§ Transparent InfiniBand integration into OpenStack • Since Havana
§ RDMA directly from the VM – SR-IOV
§ MAC to GUID mapping
§ VLAN to pkey mapping
§ InfiniBand SDN network
§ Ideal fit for High Performance Computing Clouds
OpenStack Over InfiniBand – Extreme Performance in the Cloud
InfiniBand Enables The Highest Performance and Efficiency
§ Mellanox end-to-end: ConnectX-4 NIC family, Switch-IB/Spectrum switches and 25/100Gb/s cables
§ Brings 100Gb/s speeds to the cloud with minimal CPU utilization
• Both VMs and hypervisors
• Accelerations are critical to reach line rate – SR-IOV, RDMA, etc.
25, 50 And 100Gb/s Clouds Are Here!
[Benchmark: 92.412 Gb/s throughput, 0.71% CPU utilization]
The Next Generation HPC Software Framework To Meet the Needs of Future Systems / Applications
Unified Communication – X Framework (UCX)
Exascale Co-Design Collaboration
Collaborative Effort Industry, National Laboratories and Academia
The Next Generation
HPC Software Framework
A Collaboration Effort
§ Mellanox co-designs the network interface and contributes MXM technology
• Infrastructure, transport, shared memory, protocols, integration with OpenMPI/SHMEM, MPICH
§ ORNL co-designs the network interface and contributes the UCCS project
• InfiniBand optimizations, Cray devices, shared memory
§ NVIDIA co-designs high-quality support for GPU devices
• GPUDirect, GDR copy, etc.
§ IBM co-designs the network interface and contributes ideas and concepts from PAMI
§ UH/UTK focus on integration with their research platforms
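For context, here is a minimal UCX (UCP API) initialization sketch of the framework these contributions feed into; it is not from the deck, and the feature flags and thread mode chosen are illustrative.

    #include <ucp/api/ucp.h>
    #include <stdio.h>

    int main(void)
    {
        ucp_config_t *config;
        ucp_context_h context;
        ucp_worker_h worker;

        /* Read UCX configuration from the environment (UCX_* variables). */
        if (ucp_config_read(NULL, NULL, &config) != UCS_OK)
            return 1;

        /* Request tag-matching (MPI-style) and RMA (PGAS-style) features;
         * this feature set is illustrative. */
        ucp_params_t params = {
            .field_mask = UCP_PARAM_FIELD_FEATURES,
            .features   = UCP_FEATURE_TAG | UCP_FEATURE_RMA
        };
        if (ucp_init(&params, config, &context) != UCS_OK)
            return 1;
        ucp_config_release(config);

        /* A worker is a progress/communication context (e.g. one per thread). */
        ucp_worker_params_t wparams = {
            .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
            .thread_mode = UCS_THREAD_MODE_SINGLE
        };
        if (ucp_worker_create(context, &wparams, &worker) != UCS_OK)
            return 1;

        printf("UCX context and worker initialized\n");

        ucp_worker_destroy(worker);
        ucp_cleanup(context);
        return 0;
    }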
Mellanox HPC-X™ Scalable HPC Software Toolkit
§ Complete MPI, PGAS/OpenSHMEM and UPC package
§ Maximize application performance
§ For commercial and open source applications
§ Based on UCX (Unified Communication – X Framework)
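A minimal OpenSHMEM sketch (not from the deck) of the kind of one-sided PGAS communication HPC-X supports; the symmetric buffer and ring-style put are illustrative.

    #include <shmem.h>
    #include <stdio.h>

    int main(void)
    {
        shmem_init();

        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Symmetric (remotely accessible) buffer allocated on every PE. */
        long *dest = shmem_malloc(sizeof(long));
        *dest = -1;
        shmem_barrier_all();

        /* One-sided put: each PE writes its rank into the next PE's buffer.
         * Over InfiniBand this maps onto RDMA writes. */
        long src = me;
        shmem_long_put(dest, &src, 1, (me + 1) % npes);

        shmem_barrier_all();
        printf("PE %d received %ld\n", me, *dest);

        shmem_free(dest);
        shmem_finalize();
        return 0;
    }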
Mellanox Delivers Highest MPI (HPC-X) Performance
Enabling Highest Applications Scalability and Performance
Mellanox ConnectX-4 Collectives Offload
Mellanox Delivers Highest Applications Performance (HPC-X)
§ Quantum Espresso application
Test Case   #Nodes   Intel MPI time (s)   Bull MPI (HPC-X) time (s)   Gain
A           43       584                  368                         37%
B           196      2592                 998                         61%
Enabling Highest Applications Scalability and Performance
Maximize Performance via Accelerator and GPU Offloads
GPUDirect RDMA Technology
GPUs are Everywhere!
GPUDirect RDMA / Sync
[Diagram: CPU, chipset, system memory, GPU and GPU memory – GPUDirect RDMA / Sync bypasses the CPU and system memory on the GPU communication path]
GPUDirect™ RDMA (GPUDirect 3.0) – with GPUDirect™ RDMA using PeerDirect™
§ Eliminates CPU bandwidth and latency bottlenecks
§ Uses remote direct memory access (RDMA) transfers between GPUs
§ Results in significantly improved MPI efficiency between GPUs in remote nodes
§ Based on PCIe PeerDirect™ technology
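A hedged sketch of how GPUDirect RDMA is commonly used from the verbs API: memory allocated with cudaMalloc is registered directly with ibv_reg_mr, so the HCA can DMA to and from GPU memory without staging through host buffers. It assumes the NVIDIA peer-memory kernel module is loaded; device selection and most error handling are omitted.

    #include <infiniband/verbs.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        /* Open the first InfiniBand device (selection logic omitted). */
        struct ibv_device **devs = ibv_get_device_list(NULL);
        if (!devs || !devs[0])
            return 1;
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        /* Allocate a buffer in GPU memory; 1 MB is illustrative. */
        void *gpu_buf = NULL;
        size_t len = 1 << 20;
        cudaMalloc(&gpu_buf, len);

        /* With GPUDirect RDMA the GPU pointer can be registered like host
         * memory; the HCA then DMAs to/from GPU memory directly, bypassing
         * the CPU and system memory. */
        struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) {
            perror("ibv_reg_mr on GPU memory failed");
            return 1;
        }
        printf("Registered GPU buffer, rkey=0x%x\n", mr->rkey);

        ibv_dereg_mr(mr);
        cudaFree(gpu_buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }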
Mellanox GPUDirect RDMA Performance Advantage
§ HOOMD-blue is a general-purpose molecular dynamics simulation code accelerated on GPUs
§ GPUDirect RDMA allows direct peer-to-peer GPU communications over InfiniBand
• Unlocks performance between GPU and InfiniBand
• Significantly decreases GPU-GPU communication latency
• Provides complete CPU offload for all GPU communications across the network
2X Application Performance (102% improvement)!
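At the application level, a CUDA-aware MPI (such as HPC-X with GPUDirect RDMA enabled) lets device pointers be passed straight to MPI calls, so GPU-to-GPU exchanges avoid host staging. The sketch below is illustrative and not taken from the deck; buffer size, ranks and tag are arbitrary.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* The buffer resides in GPU memory; 4096 doubles is illustrative. */
        double *d_buf;
        cudaMalloc((void **)&d_buf, 4096 * sizeof(double));

        /* With a CUDA-aware MPI the device pointer is passed directly;
         * GPUDirect RDMA lets the HCA move the data GPU-to-GPU without
         * copying through host memory. */
        if (rank == 0)
            MPI_Send(d_buf, 4096, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, 4096, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }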
GPUDirect Sync (GPUDirect 4.0)
§ GPUDirect RDMA (3.0) – direct data path between the GPU and the Mellanox interconnect
• Control path still uses the CPU
- CPU prepares and queues communication tasks on the GPU
- GPU triggers communication on the HCA
- Mellanox HCA directly accesses GPU memory
§ GPUDirect Sync (GPUDirect 4.0)
• Both the data path and the control path go directly between the GPU and the Mellanox interconnect
[Chart: 2D stencil benchmark – average time per iteration (us) vs. number of nodes/GPUs; RDMA+PeerSync is 27% faster at 2 nodes and 23% faster at 4 nodes than RDMA only]
Maximum Performance For GPU Clusters
Remote GPU Access through rCUDA
[Architecture: on the client side, the CUDA application uses the rCUDA library in place of a local CUDA runtime; calls travel over the network interface to the server side, where the rCUDA daemon runs them on the real CUDA driver and runtime of the GPU servers – GPU as a Service]
rCUDA provides remote access from every node to any GPU in the system
Interconnect Architecture Comparison
Offload versus Onload (Non-Offload)
§ Two interconnect architectures exist – Offload-based and Onload-based
§ Offload architecture
• The interconnect manages and executes all network operations
• The interconnect can include application acceleration engines
• Offloads the CPU and therefore frees CPU cycles for the applications
• Development requires a large R&D investment
• Higher data center ROI
§ Onload architecture
• A CPU-centric approach – everything must be executed on and by the CPU
• The CPU is responsible for all network functions; the interconnect only pushes the data onto the wire
• Cannot support acceleration engines, no support for RDMA, and network transport is done by the CPU
• Onloads the CPU and reduces the CPU cycles available for the applications
• Does not require R&D investment or interconnect expertise
Sandia National Laboratory Paper – Offloading versus Onloading
Interconnect Throughput – Offload versus Onload
The Offloading Advantage!
Network Performance Dramatically Depends on CPU Frequency!
Data Throughput:
20% Higher at common Xeon Frequency
250% Higher at common Xeon Phi Frequency
Common Xeon Frequency: 2.6GHz
Common Xeon Phi Frequency: ~1GHz
Only Offload Architecture Can Enable Co-Processors
Offloading (Highest Performance for all Frequencies)
Onloading (performance loss with lower CPU frequency)
Common Xeon Frequency
Common Xeon Phi Frequency
Onloading Technology Not Suitable for Co-Processors!
Mellanox InfiniBand Leadership Over Omni-Path
§ Switch latency: 20% lower
§ Message rate: 44% higher
§ Power consumption per switch port: 25% lower
§ Scalability / CPU efficiency: 2X higher
100Gb/s link speed since 2014; 200Gb/s link speed in 2017
Gain Competitive Advantage Today – Protect Your Future
Smart Network for Smart Systems: RDMA, Acceleration Engines, Programmability
Higher Performance, Unlimited Scalability, Higher Resiliency – Proven!
Thank You