© 2019 Mellanox Technologies | Confidential
Mellanox Smart Interconnect and Roadmap
HPC-AI Advisory Council – Perth, WA
Ashrut Ambastha
InfiniBand Accelerates 6 of Top 10 Supercomputers
Ranked #1, #2, #3, #5, #8 and #10
HDR 200G InfiniBand Wins Next Generation Supercomputers
▪ 1.7 Petaflops, 2K HDR InfiniBand nodes, Dragonfly+ topology
▪ 23.5 Petaflops, 8K HDR InfiniBand nodes, Fat-Tree topology
▪ 3.1 Petaflops, 1.8K HDR InfiniBand nodes, Fat-Tree topology
▪ 1.6 Petaflops, HDR InfiniBand, hybrid CPU-GPU-FPGA, Fat-Tree topology
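For context on node counts like those above, a back-of-envelope sketch (an illustration, not vendor data) of the capacity of a non-blocking three-level fat-tree given the switch radix; the 40-port figure corresponds to an HDR Quantum-class switch:

```python
def fat_tree_hosts(radix: int) -> int:
    """Maximum hosts in a non-blocking three-level fat-tree
    built from switches with `radix` ports: radix^3 / 4."""
    return radix ** 3 // 4

def fat_tree_switches(radix: int) -> int:
    """Switch count for the same topology: 5 * radix^2 / 4."""
    return 5 * radix ** 2 // 4

# A 40-port HDR switch radix supports up to:
print(fat_tree_hosts(40))     # 16000 hosts
print(fat_tree_switches(40))  # 2000 switches
```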
High Performance Interconnect Development
1995 – 2000 – 2005 – 2010 – 2015 – 2020 – 2025
▪ InfiniBand: SDR, DDR, QDR, FDR, EDR, HDR, NDR, XDR
▪ Cray: Crossbar, Seastar, Gemini, Aries, Slingshot
▪ Myrinet
▪ QsNet, then QsNet with gateway to Ethernet
▪ InfiniPath, TrueScale, OmniPath
▪ First Teraflop supercomputer: Sandia ASCI Red (Intel)
▪ First Petaflop supercomputer: LANL Roadrunner (IBM / Mellanox InfiniBand)
Accelerating All Levels of HPC/AI Frameworks
▪ Application Framework: Data Analysis, Configurable Logic
▪ Communication Framework: SHARP, MPI Tag Matching, MPI Rendezvous, Software-Defined Virtual Devices
▪ Network Framework: Network Transport Offload, RDMA, GPUDirect RDMA, SHIELD (self-healing network)
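To make the MPI Tag Matching offload concrete, here is a toy Python model of the matching the NIC performs in hardware: each arriving message is matched against the posted-receive queue, honoring wildcard source/tag. The class and constants below are illustrative stand-ins, not a real MPI API:

```python
ANY_SOURCE = -1  # stands in for MPI_ANY_SOURCE
ANY_TAG = -1     # stands in for MPI_ANY_TAG

class TagMatcher:
    """Toy model of the posted-receive queue an MPI library
    (or a tag-matching NIC) searches for each arriving message."""
    def __init__(self):
        self.posted = []  # (source, tag, buffer_id), in post order

    def post_recv(self, source, tag, buffer_id):
        self.posted.append((source, tag, buffer_id))

    def match(self, msg_source, msg_tag):
        """Return the buffer of the first posted receive that
        matches, honoring wildcards; None means 'unexpected message'."""
        for i, (src, tag, buf) in enumerate(self.posted):
            if src in (ANY_SOURCE, msg_source) and tag in (ANY_TAG, msg_tag):
                del self.posted[i]
                return buf
        return None

m = TagMatcher()
m.post_recv(source=3, tag=7, buffer_id="bufA")
m.post_recv(ANY_SOURCE, ANY_TAG, "bufB")
print(m.match(3, 7))   # bufA
print(m.match(5, 1))   # bufB (wildcard receive)
print(m.match(5, 1))   # None (unexpected message)
```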
In-Network Computing to Enable Data-Centric Data Centers
Acceleration engines connecting CPUs and GPUs across the fabric:
▪ GPUDirect RDMA
▪ SHARP: Scalable Hierarchical Aggregation and Reduction Protocol
▪ NVMe over Fabrics
Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale
Mellanox In-Network Computing and Acceleration Engines
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
SHARP Allreduce Performance Advantages
SHARP enables a 75% reduction in latency, providing scalable, flat latency across message sizes.
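A toy sketch of the hierarchical aggregation idea behind SHARP: aggregation nodes in the switch tree combine partial results level by level, so the number of aggregation steps grows logarithmically with the number of endpoints, which is why latency stays nearly flat at scale. This illustrates the principle only, not the protocol itself:

```python
from functools import reduce
import operator

def tree_reduce(values, fanout=4, op=operator.add):
    """Toy model of SHARP-style hierarchical aggregation: each
    'switch' combines up to `fanout` children, level by level,
    so the depth (and thus latency) grows logarithmically
    with the number of contributing endpoints."""
    levels = 0
    while len(values) > 1:
        values = [reduce(op, values[i:i + fanout])
                  for i in range(0, len(values), fanout)]
        levels += 1
    return values[0], levels

total, depth = tree_reduce(list(range(1024)))
print(total, depth)  # 523776 5  (five aggregation levels for 1024 endpoints)
```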
Oak Ridge National Laboratory – Coral Summit Supercomputer
SHARP AllReduce Performance Advantages
▪ 2K nodes, MPI AllReduce latency, 2KB message size
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) enables the highest performance
SHARP AllReduce Performance Advantages: 1500 Nodes, 60K MPI Ranks, Dragonfly+ Topology
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) enables the highest performance
SHARP Accelerates AI Performance
▪ The CPU in a parameter server becomes the bottleneck
▪ SHARP performs the gradient averaging, replacing all physical parameter servers
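Without in-network aggregation, gradient averaging is typically done with a host-based allreduce such as ring allreduce (reduce-scatter followed by allgather). The toy simulation below shows the collective that SHARP replaces; it models the data movement across n simulated workers in plain Python, not real MPI communication:

```python
def ring_allreduce(grads):
    """Toy simulation of ring allreduce on n workers whose gradient
    length is divisible by n: a reduce-scatter pass accumulates
    partial sums around the ring, then an allgather pass circulates
    the fully reduced chunks. Returns the averaged gradient as seen
    by every worker."""
    n = len(grads)
    csz = len(grads[0]) // n
    # split each worker's gradient into n chunks
    data = [[g[c * csz:(c + 1) * csz] for c in range(n)] for g in grads]
    # reduce-scatter: in each step, worker i sends one chunk to i+1
    for s in range(n - 1):
        out = [[chunk[:] for chunk in w] for w in data]  # snapshot of sends
        for i in range(n):
            c = (i - s) % n
            dst = (i + 1) % n
            data[dst][c] = [a + b for a, b in zip(data[dst][c], out[i][c])]
    # allgather: circulate the fully reduced chunks around the ring
    for s in range(n - 1):
        out = [[chunk[:] for chunk in w] for w in data]
        for i in range(n):
            c = (i + 1 - s) % n
            data[(i + 1) % n][c] = out[i][c]
    return [[x / n for chunk in w for x in chunk] for w in data]

grads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(ring_allreduce(grads))  # every worker ends with [4.0, 5.0, 6.0]
```

Note that each worker sends and receives 2*(n-1) chunks, all on the hosts' CPUs and NICs; SHARP moves the summation itself into the switch ASICs.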
SHARP Performance Advantage for AI
▪ SHARP provides a 16% performance increase for deep learning (initial results)
▪ TensorFlow with Horovod running the ResNet-50 benchmark over HDR InfiniBand (ConnectX-6, Quantum)
Mellanox Accelerates Record-Breaking AI Systems
NVIDIA DGX SATURNV
▪ 124 DGX-1 nodes interconnected by 32 L1 TOR switches (2016)
▪ Mellanox 36-port EDR L1 and L2 switches, 4 EDR ports per system
▪ Upgraded to 660 NVIDIA DGX-1 V100 server nodes (2017)
▪ 5280 V100 GPUs, 660 PetaFLOPS (AI)
ImageNet training record breakers:
▪ V100 x 1088, EDR InfiniBand: 91.62% scaling efficiency
▪ P100 x 256, EDR InfiniBand: ~90% scaling efficiency
▪ P100 x 1024, FDR InfiniBand: 80% scaling efficiency
Mellanox Accelerates Record-Breaking AI Systems
Faster Speed InfiniBand Enabled Superior Scaling for Top-Level AI systems
Chart: GPU Scaling Efficiency (ResNet-50), higher is better
▪ Preferred Networks (FDR + Chainer, Tesla P100 x 1024): 80%
▪ Sony (EDR x 2 + Sony NNL, Tesla V100 x 1088): 91.62%, an 11.6% higher efficiency

Chart: Training Time (ResNet-50), lower is better
▪ Preferred Networks (FDR + Chainer, Tesla P100 x 1024): 900 seconds
▪ Sony (EDR x 2 + Sony NNL, Tesla V100 x 1088): 291 seconds, a 3.1x higher performance

Sony broke the ImageNet training record on the AI Bridging Cloud Infrastructure (ABCI) cluster with a 2D-Torus GPU topology in Dec. 2018. Nodes are connected with two rails of EDR.
https://nnabla.org/paper/imagenet_in_224sec.pdf
Adaptive Routing (AR) Performance – ORNL Summit
▪ Oak Ridge National Laboratory, Coral Summit supercomputer
▪ Bisection bandwidth benchmark, based on mpiGraph, which explores the bandwidth between possible MPI process pairs
▪ AR results demonstrate an average performance of 96% of the maximum bandwidth measured
mpiGraph explores the bandwidth between possible MPI process pairs. In the histograms, the single cluster with AR indicates that all pairs achieve nearly maximum bandwidth while single-path static routing has nine clusters as congestion limits bandwidth, negatively impacting overall application performance.
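The cluster counting described above can be sketched as a simple histogram analysis. The bandwidth samples below are hypothetical, chosen only to illustrate how adaptive routing yields a single cluster near line rate while static routing spreads pairs across congestion-limited levels:

```python
def bandwidth_clusters(samples, bin_width=1.0):
    """Histogram pairwise-bandwidth samples (GB/s) and count the
    separated clusters, as in the mpiGraph analysis above: adaptive
    routing should yield one cluster near peak bandwidth, while
    static routing yields several congestion-limited clusters."""
    if not samples:
        return 0, {}
    bins = {}
    for bw in samples:
        b = int(bw // bin_width)
        bins[b] = bins.get(b, 0) + 1
    occupied = sorted(bins)
    # adjacent occupied bins belong to one cluster; a gap starts a new one
    clusters = 1 + sum(1 for a, b in zip(occupied, occupied[1:]) if b - a > 1)
    return clusters, bins

# hypothetical measurements: AR keeps all pairs near line rate,
# static routing splits pairs across congestion-limited levels
adaptive = [12.1, 12.3, 12.2, 12.4, 11.9, 12.0]
static = [12.2, 6.1, 6.3, 4.0, 3.1, 2.9]
print(bandwidth_clusters(adaptive)[0])  # 1 cluster
print(bandwidth_clusters(static)[0])    # 3 clusters
```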
InfiniBand Congestion Control (2010)
▪ Without congestion control: congestion, throughput loss
▪ With congestion control: no congestion, highest throughput!
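InfiniBand congestion control marks congested flows (FECN/BECN notifications) and throttles their injection rate. The classic feedback loop behind such schemes is additive-increase/multiplicative-decrease; the toy model below illustrates that principle only and is not the actual InfiniBand algorithm:

```python
def aimd(steps, capacity=100.0, rate=10.0, add=5.0, mult=0.5):
    """Toy AIMD (additive-increase, multiplicative-decrease) rate
    control: the sender probes upward until the fabric signals
    congestion, then backs off multiplicatively, oscillating just
    around link capacity instead of collapsing."""
    history = []
    for _ in range(steps):
        if rate > capacity:      # congestion signal received
            rate *= mult         # multiplicative decrease
        else:
            rate += add          # additive increase
        history.append(min(rate, capacity))
    return history

h = aimd(50)
# delivered rate never exceeds capacity and average utilization stays high
print(round(sum(h) / len(h), 1))
```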
For the HPC-AI Cloud Enthusiasts
It has been around for a while….
Introducing Mellanox BlueField SmartNIC
Programmability
Performance
Isolation
BlueField-2 SoC Block Diagram
▪ Tile architecture running 8 x Arm A72 CPUs
▪ SkyMesh™ coherent low-latency interconnect
▪ 6MB L3 last-level cache
▪ Arm frequency: 2GHz to 2.5GHz
▪ Up to 256GB DDR4 @ 3200MT/s with ECC
▪ Dual 10 to 100Gb/s ports or a single 200Gb/s port
▪ 50Gb/s PAM4 SerDes
▪ Supports both Ethernet and InfiniBand
▪ ConnectX-6 Dx controller
▪ 1GbE out-of-band management port
▪ Fully integrated PCIe Gen 4.0 switch, 16 lanes
Block diagram highlights:
▪ 8 x Arm A72 cores in four dual-core clusters, each pair sharing an L2 cache, with a 6MB L3 cache
▪ DDR4 interface: 64b + 8b ECC @ 3200MT/s
▪ PCIe Gen 4.0 switch; operates as root complex or endpoint
▪ ConnectX-6 Dx subsystem: packet processors, eSwitch flow steering/switching, RDMA transport, IPsec/TLS/CT encrypt/decrypt, application offloads (NVMe-oF, T10-DIF, etc.)
▪ Security engines: RNG, PubKey, secure boot
▪ Accelerators: Deflate/Inflate, SHA-2 (de-dup), regular expression, GACC DMA
▪ Peripherals: I2C, USB, DAP, UART, eMMC, GPIO; 1GbE out-of-band management port
▪ Dual VPI ports, InfiniBand/Ethernet: 1, 10, 25, 50, 100, 200G
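Among the accelerators above, the SHA-2 (de-dup) engine computes the content hashes that storage deduplication builds on. A toy software sketch of that use, with the hashing done on the CPU here rather than in BlueField hardware:

```python
import hashlib

class DedupStore:
    """Toy content-addressed store illustrating what a SHA-2
    de-dup engine enables: chunks are keyed by their SHA-256
    digest, so identical chunks are stored only once."""
    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}      # digest -> chunk bytes, stored once
        self.raw_bytes = 0    # total bytes written by callers

    def write(self, data: bytes):
        """Chunk the data, hash each chunk, and store only new chunks.
        Returns the list of digests referencing the written data."""
        refs = []
        for off in range(0, len(data), self.chunk_size):
            chunk = data[off:off + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # de-dup hit if present
            refs.append(digest)
        self.raw_bytes += len(data)
        return refs

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())

store = DedupStore()
store.write(b"A" * 8192)   # two identical 4 KiB chunks -> stored once
store.write(b"A" * 4096)   # exact duplicate of an existing chunk
print(store.raw_bytes, store.stored_bytes())  # 12288 4096
```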
Functional Isolation with BlueField-2
▪ A computer in front of a computer
▪ Isolation and offload: infrastructure functions (networking, security and storage) fully implemented in the SmartNIC
▪ Functionality runs securely in a separate trust domain
▪ Enforces policies on a compromised host
▪ Host access to the SmartNIC can be blocked by hardware
Diagram: VM, container, bare-metal and SR-IOV workloads on the host connect through OVS / OVS-DPDK to the eSwitch and hardware tables; the control plane (ovsdb-server, controller, Neutron) and the management port are isolated on the Arm side.
BlueField Enables SDN in Bare-Metal Clouds

TOR switch networking (status quo):
▪ Limited to no SDN capabilities
▪ Orchestration through proprietary TOR switch vendor plugins (OpenStack Neutron with the Mellanox ML2 plugin)
▪ Mandates proprietary network driver installation in the bare-metal host

SDN integration with the Mellanox BlueField SmartNIC:
✓ Full-featured SDN capabilities
✓ Full orchestration through upstream OpenStack Neutron APIs (the Neutron OVS L2 agent and OVS run on the SmartNIC)
✓ No network driver installation in the bare-metal host: the host sees a standard VirtIO network interface

The bare-metal host stays in the tenant's domain; the SmartNIC, TOR switch and control plane stay in the provider's domain.
Software-defined Network Accelerated Processing
Mellanox BlueField NVMe SNAP
▪ NVMe SNAP exposes an NVMe interface that appears as a physical NVMe SSD
▪ Implements NVMe-oF in the SmartNIC, so no NVMe-oF driver is required on the bare-metal host
▪ Leverages the standard NVMe driver, which is available on all major OSs
▪ Solves cloud storage pain points: OS agnostic, near-local performance, secured and locked-down, boot from (remote) disk, any Ethernet/InfiniBand wire protocol (NVMe-oF, iSER, iSCSI, proprietary, etc.)
▪ NVMe SNAP + SmartNIC is a 2-in-1: it serves as both a smart network adapter and an emulator of local storage, and the bundle saves even more on CAPEX
BlueField Enables Storage Virtualization in Bare-Metal Clouds

Local physical drive in the bare-metal host:
▪ Bound by physical storage capacity
▪ No backup service, or limited to local RAID
▪ No way to manage storage resources
▪ No migration of resources

NVMe SNAP emulation (BlueField SmartNIC as storage initiator to remote storage):
✓ Same flexibility as virtualized storage
✓ Same performance as local storage
✓ OS agnostic, only the standard NVMe driver required
✓ Backed up in the storage cloud
✓ Dynamically allocated cloud storage
✓ Any wire protocol and storage management

The emulated storage appears local to the bare-metal host (tenant's domain) while the SmartNIC and remote storage remain in the provider's domain.
Delivering Highest Performance and Scalability
▪ Scalable, intelligent, flexible, high-performance, end-to-end connectivity solutions
▪ Standards-based, supported by a large ecosystem
▪ Supports all compute architectures: x86, Power, Arm, GPU, FPGA, etc.
▪ Offloading and In-Network Computing architecture
▪ Flexible topologies: Fat-Tree, Mesh, 3D Torus, Dragonfly+, etc.
▪ Converged I/O: compute, storage and management on a single fabric
▪ Backward and forward compatible
The Future Depends On Smart Interconnect
Thank You