8/8/2019 3 Volt a Ire
1/36
2009 Voltaire Inc.
Voltaire Unified Fabric Manager
A new dimension to performance analysis and tuning
Ghislain de Jacquelot
Voltaire Inc. 2Confidential - Internal
Introducing
Voltaire's Grid Director 4000 Series
4036 - 36 ports
First generally available commercial-grade QDR switches in the market
Lowest-latency switch at 100ns/300ns port-to-port
Smart switch with advanced management capabilities on-board
Most mature, 4th-generation switch family and switch silicon
Most scalable with HyperScale technology
4700 From 324 ports to
InfiniBand: a black box?
An InfiniBand Fabric is not a black box (1/2)
Requires hardware management
Detect failures, communication problems
Inside the InfiniBand Fabric
- Port counters
- Port status (QDR, DDR, SDR; 4X, 2X, 1X)
- Firmware upgrades (Switch and HCA ASICs)
Outside the InfiniBand Fabric
- Chassis
- Power supplies
- Fans
- Temperature
- Chassis software updates (Switch management)
An InfiniBand Fabric is not a black box (2/2)
What about performance?
Blocking vs non-blocking fabrics?
Influence of routing algorithms? Congestion?
Mixing different protocols on the same fabric?
Running multiple jobs on the same fabric?
Performance monitoring tools?
Some InfiniBand technology
Fabric?
A fabric is made of switch ASICs interconnected together
Mellanox InfiniScale III (aka Anafa): 24 ports
Mellanox InfiniScale IV (aka Shaldag): 36 ports
[Figure: Inside a 96-port switch: 8 edge ASICs (24 ports each), each serving 12 nodes, connected to 4 spine ASICs (24 ports each)]
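The arithmetic behind the 96-port example can be sketched as follows. This is a minimal sketch, assuming a non-blocking two-tier layout in which every edge ASIC splits its ports evenly between nodes and uplinks; the function name `two_tier_nonblocking` is illustrative and not part of any Voltaire tool.

```python
import math


def two_tier_nonblocking(end_ports: int, radix: int) -> tuple[int, int]:
    """Return (edge_asics, spine_asics) for a non-blocking two-tier fabric.

    Assumption: each edge ASIC uses half of its ports for nodes and the
    other half for uplinks to the spine ASICs.
    """
    down = radix // 2                  # ports per edge ASIC facing nodes
    up = radix - down                  # ports per edge ASIC facing spines
    edge = math.ceil(end_ports / down)
    if edge > radix:
        # A spine ASIC has only `radix` ports, so it can reach
        # at most `radix` edge ASICs.
        raise ValueError("too many end ports for two tiers of this radix")
    spine = math.ceil(edge * up / radix)
    return edge, spine


# 96 end ports built from 24-port ASICs, as in the figure.
print(two_tier_nonblocking(96, 24))    # (8, 4)
```

With 24-port ASICs this gives 8 edge ASICs (12 nodes each) and 4 spine ASICs, which matches the counts visible in the figure.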
Blocking?
Defines the bandwidth ratio between layers in the fabric
Examples with a 24-port switch ASIC:
- 12 nodes, 12 uplinks: fully non-blocking
- 16 nodes, 8 uplinks: 50% blocking
- 20 nodes, 4 uplinks: 20% blocking
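The percentages on this slide appear to be the ratio of uplink ports to node ports on each edge ASIC (8/16 = 50%, 4/20 = 20%). A minimal sketch under that reading, assuming all links run at the same speed; `uplink_ratio_percent` is a hypothetical helper name:

```python
def uplink_ratio_percent(nodes: int, uplinks: int) -> int:
    """Uplink bandwidth as a percentage of node-facing bandwidth.

    Assumption: every port runs at the same link speed, so the
    bandwidth ratio equals the port-count ratio.
    """
    if nodes <= 0 or uplinks < 0:
        raise ValueError("nodes must be positive and uplinks non-negative")
    return min(100, uplinks * 100 // nodes)


for nodes, uplinks in [(12, 12), (16, 8), (20, 4)]:
    pct = uplink_ratio_percent(nodes, uplinks)
    label = "fully non-blocking" if pct == 100 else f"{pct}% blocking"
    print(f"{nodes} nodes, {uplinks} uplinks: {label}")
```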
Congestion?
Example: All orange nodes write simultaneously to the IO node (red)
[Figure: two-tier fabric of 24-port switch ASICs (8 edge, 4 spine), with compute nodes (CN, orange) and one IO node (red) attached to the edge switches]
Congestion Example
Degradation due to node oversubscription
Destination port in saturation (multiple sources)
Congestion spreads across the fabric
ALL other flows drop to 20% of capacity
Takes time to recover
Common with storage traffic
[Figure: traffic graph with the drop and the recovery marked]
Routing?
InfiniBand packets are destination routed based on the Destination Logical ID (DLID) field in the header
In IB: DLID = route (not only remote address)
DLIDs are 16 bits
48K values are used for unicast
16K values are used for multicast
At each switch ASIC, the incoming unicast DLID is used as an index into a Linear Forwarding Table (LFT) that returns the outgoing switch port number
E.g. the InfiniScale III ASIC supports all 48K possible LFT entries
[Figure: LFT with DLID entries 0 to 11, each mapped to an Out Port #]
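The LFT lookup described on this slide can be sketched as a plain array indexed by DLID. This is a minimal sketch, not switch firmware: the class name `SwitchLFT` and its methods are hypothetical, and the LID range constants follow the usual InfiniBand split (unicast up to 0xBFFF, multicast from 0xC000), which corresponds to the 48K/16K figures above.

```python
UNICAST_LID_MIN = 0x0001      # LID 0 is reserved
UNICAST_LID_MAX = 0xBFFF      # roughly 48K unicast values
MULTICAST_LID_MIN = 0xC000    # roughly 16K multicast values start here


class SwitchLFT:
    """Hypothetical model of one switch ASIC's Linear Forwarding Table."""

    def __init__(self) -> None:
        # One slot per possible unicast DLID; None means "no route set".
        self.table = [None] * (UNICAST_LID_MAX + 1)

    @staticmethod
    def _check(dlid: int) -> None:
        if not UNICAST_LID_MIN <= dlid <= UNICAST_LID_MAX:
            raise ValueError(f"DLID 0x{dlid:04x} is not a unicast LID")

    def set_route(self, dlid: int, out_port: int) -> None:
        self._check(dlid)
        self.table[dlid] = out_port

    def lookup(self, dlid: int):
        """Return the outgoing port for a unicast DLID (None if unset)."""
        self._check(dlid)
        return self.table[dlid]   # the DLID is used directly as the index


lft = SwitchLFT()
lft.set_route(dlid=7, out_port=3)
print(lft.lookup(7))   # 3
```

Because the DLID alone selects the output port, two packets with the same DLID always leave a given switch on the same port, which is why the slide says the DLID is the route and not only the remote address.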
Communication Patterns (un-balanced)
Nodes: A B C D E F G H
Communication pattern:
A-C
B-E
D-G
F-H
2:1 link contention:
A->C and B->E share uplink to Switch 1 port 1
G->D and H->F share uplink to Switch 2 port 4
[Figure: nodes A to H attached to edge switches (ports 1 to 4, downlinks and uplinks) below Switch 1 and Switch 2 (ports 1 to 4), with per-switch forwarding tables; legend: IB path, 2 symmetric IB paths]
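The 2:1 contention above can be reproduced with a small model. This is a sketch under stated assumptions, since the forwarding tables in the figure did not survive extraction: four edge switches with two nodes each (A-B, C-D, E-F, G-H), two spine switches, and a hypothetical static rule in which the destination's position on its edge switch selects the spine. The names `LEAF_HOSTS`, `uplink_for` and `uplink_load` are illustrative only.

```python
from collections import Counter

# Assumed layout: two nodes per edge switch, in alphabetical order.
LEAF_HOSTS = {"leaf1": "AB", "leaf2": "CD", "leaf3": "EF", "leaf4": "GH"}
SPINES = ["Switch 1", "Switch 2"]


def leaf_of(host: str) -> str:
    for leaf, hosts in LEAF_HOSTS.items():
        if host in hosts:
            return leaf
    raise KeyError(host)


def uplink_for(src: str, dst: str):
    """Return (source leaf, spine) used by a flow, or None if it stays local.

    Hypothetical static, destination-based rule: the destination's index
    on its own edge switch picks the spine.
    """
    src_leaf, dst_leaf = leaf_of(src), leaf_of(dst)
    if src_leaf == dst_leaf:
        return None
    spine = SPINES[LEAF_HOSTS[dst_leaf].index(dst) % len(SPINES)]
    return (src_leaf, spine)


def uplink_load(flows) -> Counter:
    """Count how many flows share each (leaf, spine) uplink."""
    load = Counter()
    for src, dst in flows:
        link = uplink_for(src, dst)
        if link is not None:
            load[link] += 1
    return load


unbalanced = [("A", "C"), ("B", "E"), ("G", "D"), ("H", "F")]
for link, count in uplink_load(unbalanced).items():
    print(link, f"{count}:1")
```

Under this assumed rule, A->C and B->E both leave leaf1 towards Switch 1, and G->D and H->F both leave leaf4 towards Switch 2, so each of those uplinks carries two flows while the other uplinks sit idle. A pattern such as A->C with B->D spreads over both spines instead.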
Optimization of Parallel Applications?
Single-thread optimization
Some examples:
Instruction Pipelining
Blocking
Prefetch data
Tools: processor counters, profiling tools, compiler reports, etc.
Goal: Overcome processor, cache, memory architecture constraints
Parallel optimization, scalability
Some examples:
Load Balancing
Mix OpenMP and MPI
Barrier optimization
Tools: MPI Profilers (Intel Trace Analyzer, etc.)
Goals: Overcome balancing issues, increase computation to communication ratio, use parallel IO, etc.
Fabric optimization?
Benchmarking and production environments are different
Systems are used simultaneously by several applications and several kinds of traffic
Efficiently handling multiple concurrent flows
Observations
Blocking in cut-through networks is a big issue
Different traffic classes have different requirements
Collectives and storage require congestion control
IPC requires low latency (high priority)
Storage may use more bandwidth and not be latency sensitive
Hardware-based adaptive routing is not efficient with bursty or storage traffic
Job layout can influence routing decisions
IPC traffic typically stays within a job, or has unique patterns
Storage traffic fans into storage nodes
Management traffic spreads to all nodes
Hardware capabilities can be destructive if used inappropriately
E.g. mis-configured adaptive routing or congestion management
Introducing
Voltaire UFM - Unified Fabric Manager
Monitor, Analyze and Optimize
Ensure fabric health and performance visibility
Unique visibility into fabric traffic and bottlenecks
Optimize application performance
Benchmark performance in real life
(we've managed to see 10X improvements)
Manage the scale-out
Application-centric platform
Efficient operations to thousands of fabric resources
Automate configurations and manage changes on the fly
Increase fabric up-time and resiliency, better utilization
Advanced Monitoring and Analysis
Monitor & analyze fabric performance
B/W utilization
Unique congestion monitoring
Dashboard for aggregated fabric view
Real-time fabric-wide health monitoring
Monitor events and errors throughout the fabric
Threshold-based alarms
Granular monitoring of host and switch parameters
Innovative congestion mapping
One view for fabric-wide congestion and traffic patterns
Enables root cause analysis for routing, job placement or resource allocation inefficiencies
Everything is managed at the application/aggregation level
Fabric Optimization with UFM
[Diagram: fabric optimization workflow around UFM]
Characterize application traffic and priorities
Feedback and Analysis
Application Modeling (CLI / GUI / API)
Schedulers (optional)
Fabric Optimization
Fabric virtualization and QoS
Optimize routing and job placement
UFM
Show traffic and congestion information
Monitoring
UFM Application Centric Approach
Physical Infrastructure
Virtual Infrastructure
Applications
Map application requirements to fabric policies, and map element status to application status
Fabric
Policy
Monitoring
Combining UFM with Industry-Leading Schedulers
Enabling Intelligent Performance Driven Job Scheduling
UFM's traffic-aware routing
Today's routing algorithms are static while clusters are dynamic
Nodes are moving in and out of the cluster
Traffic patterns change
Static algorithms can't cope with changes, resulting in congestion and inefficiencies
Voltaire routing performance optimization
Optimizations for various topologies, enhanced over recent years in large clusters
New: major conceptual shift from static to traffic-pattern-based algorithms
Traffic model can be derived automatically from topology
Voltaire's enhancements are built on top of OpenSM in a modular plug-in architecture
Voltaire's routing optimizations improve fabric performance without increasing cost
Performance optimization: partitioning and QoS
UFM makes it possible to run multiple clusters or separate application jobs on the same infrastructure
Drag-and-drop configuration automatically creates dedicated IPC and virtual I/O for each cluster
Quality of Service can be associated with fabric partitions so critical applications get priority in fabric routing queues
Easy configuration of QoS via GUI or CLI: assignment to pre-defined service levels
Changes in application needs are easily handled by simply re-allocating servers to apps or networks
Drag-and-drop assignment to a network triggers all configuration behind the scenes
Critical applications can be allocated the right resources and priority
Benefits
Boost Apps Performance with Voltaire UFM
Optimize Real-Life Environments
Test Environment
12 nodes running a bandwidth-consuming job
2 nodes running a latency-critical job
Goal: achieve best performance with latency-critical tasks
W/O Partitioning: latency rises to ~215% of the standalone value
Latency job running alone (Latency = ~0.000210)
Bandwidth job added on same partition (Latency = ~0.000450)
Create Partitions and Set QoS in UFM
Create 2 Logical Groups
Latency job
B/W oriented job
Create 2 Networks
One for each job
Assign Service Level
SL0 Low Latency Queue
SL1 50% (high/bandwidth)
(SL2 25%, SL3 25%)
UFM automatically creates virtual NICs, partitions and Service Level definitions
Run jobs with isolation and QoS: return almost to original performance (~5% impact only)
Latency job running alone (Latency = ~0.000210)
Bandwidth job added on same partition (Latency = ~0.000450)
Separate partitions and QoS (Latency = ~0.000220) (!)
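The percentages quoted for this test follow directly from the three latency values. A short worked check, using only the numbers on the slides (the slides do not state the unit, so none is assumed here; the variable names are illustrative):

```python
# Latency values as shown on the slides (unit not stated in the source).
alone = 0.000210      # latency job running alone
shared = 0.000450     # bandwidth job added on the same partition
isolated = 0.000220   # separate partitions and QoS

slowdown = shared / alone                    # factor without partitioning
overhead_pct = (isolated / alone - 1) * 100  # residual impact with QoS

print(f"without partitioning: {slowdown:.2f}x the standalone latency")  # 2.14x
print(f"with partitions and QoS: {overhead_pct:.1f}% above standalone")  # 4.8%
```

The first figure is the roughly 215% quoted for the shared partition, and the second is the "~5% impact" quoted once partitions and QoS are applied.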
Voltaire UFM
Redefining Fabric Management
OpenSM: Subnet Manager only, Technology Test Bed
Voltaire engineer is the OpenSM Maintainer
Voltaire UFM: Monitor, Analyze & Optimize application performance; Automate and ease fabric management; Uses OpenSM with advanced routing Plug-ins
Other Fabric Mgmt. Solution: Limited Proprietary SM; Device/Port oriented; limited viewer and some troubleshooting tools
Voltaire GridVision: Basic monitoring & Troubleshooting; Rich GUI, CLI, SNMP functionality; Voltaire SM; Embedded in Switches
Questions ?
Open-MPI Accelerator (OMA)
Voltaire OMA Benefits
Accelerating standard, open source Open-MPI
Significant performance improvement (shmem only)
More effective when there is more intra-node communications (between cores)
Depends on the HW (# of cores, # of sockets) and the traffic pattern
Enhanced documentation
Open-MPI expertise: RoadRunner and many others
Works with InfiniBand and Ethernet (iWARP and TCP)
How Shared Memory is Done Today?
[Figure: two CPU sockets (4 CPU cores) with RAM, linked by ccNUMA, plus an HCA/iWARP adapter; processes #1 and #2 communicate through shared memory in RAM]
1. Process #1 writes the data into shmem RAM
2. Process #2 reads the data from shmem RAM
The OMA Way
[Figure: two CPU sockets with RAM, linked by ccNUMA, plus an HCA/iWARP adapter; data moves from process #1 to process #2 in a single step]
1. For large messages, the kernel will copy data from process #1 directly into process #2 (saving one copy); small messages will be handled as they are today
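The difference between the two paths can be illustrated by counting the bytes copied. This is a user-space sketch with plain buffers, not the OMA implementation: the real direct copy is done by the kernel between process address spaces, and the threshold value and function names below are hypothetical.

```python
LARGE_MESSAGE_THRESHOLD = 4096   # hypothetical cutoff, not an OMA value


def send_via_shmem(src: bytes, shmem: bytearray, dst: bytearray) -> int:
    """Classic path: sender writes to shared memory, receiver reads it back."""
    n = len(src)
    shmem[:n] = src          # copy 1: process #1 -> shmem RAM
    dst[:n] = shmem[:n]      # copy 2: shmem RAM -> process #2
    return 2 * n             # bytes copied


def send_direct(src: bytes, dst: bytearray) -> int:
    """Single-copy path, standing in for the kernel-assisted copy."""
    n = len(src)
    dst[:n] = src            # the only copy: process #1 -> process #2
    return n


def send(src: bytes, shmem: bytearray, dst: bytearray) -> int:
    """Large messages take the single-copy path; small ones stay as today."""
    if len(src) >= LARGE_MESSAGE_THRESHOLD:
        return send_direct(src, dst)
    return send_via_shmem(src, shmem, dst)


shmem = bytearray(LARGE_MESSAGE_THRESHOLD)
for size in (100, 8192):
    msg = bytes(range(256)) * (size // 256) + bytes(size % 256)
    dst = bytearray(size)
    copied = send(msg, shmem, dst)
    print(f"{size} byte message: {copied} bytes copied")
```

For the large message the bytes copied equal the message size, half of what the shared-memory path would move, which is the "save one copy" point made on the slide.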
OMA - Fluent Aircraft Benchmark
[Chart: Fluent Aircraft; x-axis: # of processes (0 to 35); y-axis: Fluent Rating (0 to 800); series: Open MPI with OMA, Open MPI; improvement labels: 9%, 7%, 11%, 25%, 10%]
* OMA improves Fluent Aircraft Benchmark by up to 25%
[Chart: eff. bandwidth/proc for alltoall; x-axis: bytes (1 to 10000000, log scale); y-axis: MB/s (0 to 3000); series: HP-MPI, MVAPICH2, OPENMPI, OPENMPI+OMA]
[Chart: pingpong bandwidth; x-axis: bytes (1E+00 to 1E+07, log scale); y-axis: MB/s (0 to 6000); series: HP-MPI, MVAPICH2, OPENMPI, OPENMPI+OMA]
Questions ?