Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Devendar Bureddy
HPCAC-Stanford, Feb 8, 2017
In-Network Computing SHARP Technology for MPI Offloads
© 2017 Mellanox Technologies 2- Mellanox Confidential -
Accelerating HPC applications and Artificial Intelligence
Accelerating HPC Applications
Significantly reduce MPI collective runtime
Increase CPU availability and efficiency
Enable communication and computation overlap
Enabling Artificial Intelligence Solutions to Perform
Critical and Timely Decision Making
Accelerating distributed machine learning
© 2017 Mellanox Technologies 3- Mellanox Confidential -
SwitchIB-2 : Smart Switch
Switch-IB 2 Enables the Switch Network to
Operate as a Co-Processor
SHARP Enables Switch-IB 2 to Manage and
Execute MPI Operations in the Network
The world fastest switch with <90 nanosecond latency
36-ports, 100Gb/s per port, 7.2Tb/s throughput, 7.02
Billion messages/sec
Adaptive routing, congestion control
Multiple topologies
© 2017 Mellanox Technologies 4- Mellanox Confidential -
SHARP Performance Data – OSU Allreduce 1PPN, 128 nodes
© 2017 Mellanox Technologies 5- Mellanox Confidential -
SHArP Performance Advantage
MiniFE is a Finite Element mini-application
• Implements kernels that represent
implicit finite-element applications
AllReduce
© 2017 Mellanox Technologies 6- Mellanox Confidential -
Allreduce – Four Process Recursive Doubling
1 2 3 4
1 2 3 4
1 2 3 4
Step 1
Step 2
© 2017 Mellanox Technologies 7- Mellanox Confidential -
SHARP: Scalable Hierarchical Aggregation Protocol
Reliable Scalable General Purpose Primitive
• In-network Tree based aggregation mechanism
• Large number of groups
• Multiple simultaneous outstanding operations
Applicable to Multiple Use-cases
• HPC Applications using MPI / SHMEM
• Distributed Deep Learning applications
Scalable High Performance Collective Offload
• Barrier, Reduce, All-Reduce, Broadcast
• Sum, Min, Max, Min-loc, max-loc, OR, XOR, AND
• Integer and Floating-Point, 32 / 64 bit
SHARP Tree
SHARP Tree Aggregation Node
(Process running on HCA)
SHARP Tree Endnode
(Process running on HCA)
SHARP Tree Root
© 2017 Mellanox Technologies 8- Mellanox Confidential -
SHARP Tree
SHARP Operations are Executed by a SHARP Tree• Multiple SHArP Trees are Supported
• Each SHArP Tree can handle Multiple Outstanding SHArP Operations
• Within a SHArP Tree, each Operation is Uniquely Identified by a SHArP-Tuple
- GroupID
- SequenceNumber
SHARP Tree is a Logical Construct• Nodes in the SHArP Tree are IB Endnodes
• Logical tree defined on top of the physical underlying fabric
• SHArP Tree Links are implemented on top of the IB transport (Reliable Connection)
• Expected to follow the physical topology for performance but not required
Physical Topology
One SHArP Tree
Switch/Router
HCA
SHArP Tree Aggregation Node
(Process running on HCA)
SHArP Tree Endnode
(Process running on HCA)
SHArP Tree Root
© 2017 Mellanox Technologies 9- Mellanox Confidential -
SHARP Tree (cont’d)
SHARP Tree is a Logical Construct
• Motivation to loosely follow the underlying
Physical Topology for Performance
- Efficient use of Physical Link Bandwidth
- Shortest Overall Path Length from Leafs to
Root
SHARP Group is a Tree Subset
• A sub-tree spanning a subset of the tree
end-nodes
SHARP Tree Nodes are ‘Processes’
• Multiple SHARP Tree “Nodes” can reside
on the same Physical HCA
SHArP Tree Root
SHArP Tree Root
SHARP Tree
The same SHArP Tree
(Logical View)
SHArP Tree Aggregation Node
SHArP Tree Endnode
(Process running on HCA)
© 2017 Mellanox Technologies 10- Mellanox Confidential -
SHARP SW Overview
Release
- | version |MOFED version |SwitchIB-2 FW | HPCX n| UFM version|
- |-------------------|---------------------------------|----------------------|------------|-----------------|
- |v1.0 |MLNX OFED 3.3-x.x.x | 15.1100.0072 | 1.6 | - |
- |v1.1 |MLNX OFED 3.4-0.1.2.0| 15.1200.0102 | 1.7 | - |
- |v1.2 |MLNX OFED 4.0-x.x.x | 15.1200.0102 | 1.8 | 5.8 |
Prerequisites
• SHARP software modules are delivered as part of HPC-X
• Switch-IB 2 firmware - 15.1200.0102 | or later
• MLXN OS - 3.6.1002 or later
• MLNX OFED 4.0
• MLNX OpenSM 4.7.0 or later (available with MLNX OFED 3.3-x.x.x or UFM 5.6)
© 2017 Mellanox Technologies 11- Mellanox Confidential -
Mellanox HPC-X™ Scalable HPC Software Toolkit
Complete MPI, PGAS OpenSHMEM and UPC package
Maximize application performance
For commercial and open source applications
Best out of the box experience
© 2017 Mellanox Technologies 12- Mellanox Confidential -
Mellanox HPC-X - Package Contents
Allow fast and simple deployment of
HPC libraries
Both Stable & Latest Beta are bundled
All libraries are pre-compiled
Includes scripts/module files to ease
deployment
Package Includes
OpenMPI / OpenSHMEM
BUPC (Berkeley UPC)
UCX
MXM
FCA-2.5
FCA-3.x (HCOLL)
SHARP
KNEM
Allows fast intra-node MPI communication for large
messages
Profiling Tools
Libibprof
IPM
Standard Benchmarks
OSU
IMB
© 2017 Mellanox Technologies 13- Mellanox Confidential -
HPCX/SHARP SW architecture
HCOLL
• optimized collective library
Libsharp_coll.so
• Implementation of high level sharp API for
enabling sharp collectives for MPI
• uses low level libsharp.so API
Libsharp.so
• Implementation of low level sharp API
High level API
• Easy to use
• Easy to integrate with multiple MPIs(OpenMPI,
MPICH, MVAPICH*)
SHARP(libsharp/libsharp_coll)
HCOLL(libhcoll)
MPI (OpenMPI)
InfiniBand Network
© 2017 Mellanox Technologies 14- Mellanox Confidential -
SHARP SW Architecture
Aggregation
Node
Aggregation
Node Aggregation
Node
Aggregation
Node….
Compute node
MPI
Process
SHARPD
daemon
Compute node
MPI
Process
SHARPD
daemon
Compute node
MPI
Process
SHARPD
daemon
..….
Subnet
Manager
Aggragation
Manager (AM)
© 2017 Mellanox Technologies 15- Mellanox Confidential -
SHARP SW Components
SHARP SW components:
• Libs
libsharp.so ( low level api)
libsharp_coll.so ( high level api)
• Daemons (sysadmin)
sharpd, sharp_am
• Scripts
sharp_benchmark.sh
sharp_daemons_setup.sh
• Utilities
sharp_coll_dump_config
sharp_hello
sharp_mpi_test
• public API
sharp_coll.h
© 2017 Mellanox Technologies 16- Mellanox Confidential -
SHARP: SHARP Daemons
sharp_am: Aggregation Manager daemon
• same node as Subnet Manager
• Resource manager
sharpd: SHARP daemon
• compute nodes
• Light wait process
• Almost 0% cpu usage
• Only control path
© 2017 Mellanox Technologies 17- Mellanox Confidential -
SHARP: Configuring Subnet Manager
Edit the opensm.conf file.
Set the parameter “sharp_enabled” to “2”.
Run OpenSM with the configuration file.
• % opensm -F <opensm configuration file> -B
Verify that the Aggregation Nodes were activated by the OpenSM, run
"ibnetdiscover".For example:
vendid=0x0
devid=0xcf09
sysimgguid=0x7cfe900300a5a2a0
caguid=0x7cfe900300a5a2a8
Ca 1 "H-7cfe900300a5a2a8" # "Mellanox Technologies Aggregation Node"
[1](7cfe900300a5a2a8) "S-7cfe900300a5a2a0"[37] # lid 256 lmc 0 "MF0;sharp2:MSB7800/U1" lid 512 4xFDR
© 2017 Mellanox Technologies 18- Mellanox Confidential -
SHARP: Configuring Aggregation Manager
Create a configuration directory for the future SHArP configuration file.
• % mkdir $HPCX_SHARP_DIR/conf
Create the fabric.lst file.
• Copy the subnet LST file created by the Subnet Manager to the AM’s configuration files directory
• Rename it to fabric.lst :
- % cp /var/log/opensm-subnet.lst $HPCX_SHARP_DIR/conf/fabric.lst
Create root GUIDs file.
• Copy the root_guids.conf file if used for configuration of Subnet Manager to
$HPCX_SHARP_DIR/conf/root_guid.cfg (or)
• Identify the root switches of the fabric and create a file with the node GUIDs of the root switches of the fabric.
• For example : if there are two root switches files contains
0x0002c90000000001
0x0002c90000000008
Create sharp_am.conf file% cat > $HPCX_SHARP_DIR/conf/sharp_am.conf << EOF
fabric_lst_file $HPCX_SHARP_DIR/conf/fabric.lst
root_guids_file $HPCX_SHARP_DIR/conf/root_guid.cfg
ib_port_guid <PortGUID of the relevant HCA port or 0x0> EOF
© 2017 Mellanox Technologies 19- Mellanox Confidential -
SHARP: Running SHARP Daemons
Setup the daemons
• $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh
Usage
• Usage: sharp_daemons_setup.sh <-s> <-r> [-p SHArP location dir] <-d daemon> <-m>
-s - Setup SHArP daemon
-r - Remove SHArP daemon
-p - Path to alternative SHArP location dir
-d - Daemon name (sharpd or sharp_am)
-m - Add monit capability for daemon control
• $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s $HPCX_SHARP_DIR -d sharp_am
• $service sharp_am start
© 2017 Mellanox Technologies 20- Mellanox Confidential -
SHARP: Running SHARP Daemons
sharp_am• %$HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s $HPCX_SHARP_DIR -d sharp_am
• %service sharp_am start
• Log : /var/log/sharp_am.log
Sharpd• conf file: $HPCX_SHARP_DIR/conf/sharpd.conf
ib_dev <relevant_hca:port>
sharpd_log_level 2
• %pdsh -w <hostlist> $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s $HPCX_SHARP_DIR -d
sharpd
• %pdsh -w jupiter[001-032] service sharpd start
• Log : /var/log/sharpd.log
© 2017 Mellanox Technologies 21- Mellanox Confidential -
SHARP resource(quota) handling
Job Quota
• Max groups
disable sharp on communicator if it fails to create sharp group
• Max osts
1 for blocking collectives
> 1 for non-blocking & fragmentation
• Data size per OST
Max sharp supported payload (256 B)
• Max QPs
Need at least #PPN QPs/node
Hard problem to efficiently allocate resources
• Application profiling might help
• Per group
- Max OSTs
- Data size per OST
© 2017 Mellanox Technologies 22- Mellanox Confidential -
MPI Collective offloads using SHARP
Enabled through FCA-3.x (HCOLL)
Flags• HCOLL_ENABLE_SHARP (default : 0)
0 - Don't use SHArP
1 - probe SHArP availability and use it
2 - Force to use SHArP
3 - Force to use SHArP for all MPI communicators
4 - Force to use SHArP for all MPI communicators and for all supported collectives
• HCOLL_SHARP_NP ( default: 2)Number of nodes(node leaders) threshold in communicator to create SHArP group and use SHArP collectives
• SHARP_COLL_LOG_LEVEL
0 – fatal , 1 – error, 2 – warn, 3 – info, 4 – debug, 5 – trace
• HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX
- Maximum allreduce size run through SHArP
© 2017 Mellanox Technologies 23- Mellanox Confidential -
Resources (quota)
• SHARP_COLL_JOB_QUOTA_MAX_GROUPS
- #communicators
• SHARP_COLL_JOB_QUOTA_OSTS
- Parallelism on communicator
• SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST
- Payload/OST
For complete list of SHARP COLL tuning options
- $HPCX_SHARP_DIR/bin/sharp_coll_dump_config -f
MPI Collectives offloads using SHARP
© 2017 Mellanox Technologies 24- Mellanox Confidential -
MPI Collectives offloads using SHARP
$ mpirun --display-map --bind-to core --map-by node -np 32 -mca pml yalla -mca btl_openib_warn_default_gid_prefix
0 -mca rmaps_dist_device mlx5_0:1 -mca rmaps_base_mapping_policy dist:span -x MXM_RDMA_PORTS=mlx5_0:1
-x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=1 -x
HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096 -x SHARP_COLL_LOG_LEVEL=3 -x
HCOLL_ENABLE_MCAST_ALL=0 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=128 /home/user02/hpcx-
v1.7.405-gcc-MLNX_OFED_LINUX-3.4-1.0.0.0-redhat6.5-x86_64/ompi-v1.10/tests/osu-micro-benchmarks-
5.2/osu_allreduce -i 10000 -x 1000 -m 256
# OSU MPI Allreduce Latency Test v5.2
# Size Avg Latency(us)
4 2.01
8 2.58
16 2.04
32 2.59
64 2.10
128 2.79
256 2.69
sharp_init_session
© 2017 Mellanox Technologies 25- Mellanox Confidential -
SHARP API
© 2017 Mellanox Technologies 26- Mellanox Confidential -
SHARP API
Datatypes
• 32 bit – INT/UINT/FLOAT
• 64 bit – LONG/Unsinged LONG/DOUBLE
Reduce Ops
• MIN
• MAX
• SUM
• LAND
• LOR
• BOR
• LXOR
• BXOR
• MAXLOC
• MINLOC
• PROD ( Not supported)
© 2017 Mellanox Technologies 27- Mellanox Confidential -
SHARP API
Control Path
• sharp Init / finalize
• sharp comm(group) create/destroy
Data path
• Collectives operations
- Allreduce
- Barrier
- Bast ( Not supported yet)
- Reduce ( Not supported yet)
• Blocking & Non-Blocking
© 2017 Mellanox Technologies 28- Mellanox Confidential -
SHARP API
libsharp_coll.so
d
Data pathControl path
SHARP API
Request pool mpool
utils(timer/env/data
structures..)
lisharp.soLow level SHARP API
© 2017 Mellanox Technologies 29- Mellanox Confidential -
SHARP API – Control path
Sharp job initialize
• API
int sharp_coll_init(struct sharp_coll_init_spec *sharp_coll_spec,
struct sharp_coll_context **sharp_coll_context)
• Code flow in HPC-X
• Input spec
MPI_Init () hcoll_init() sharp_coll_init()
struct sharp_coll_init_spec {int job_id; /* MPI Job ID */
char *hostlist; /* optional in job scheduler case */
int world_rank;
int world_size;
int (*progress_func)(void); /* External progress func */
struct sharp_coll_config config; /* SHARP COLL Configuration. See `sharp_coll_default_config' */
struct sharp_coll_out_of_band_collsoob_colls;
};
© 2017 Mellanox Technologies 30- Mellanox Confidential -
High level API – Control path..
- Sharp job finalize
• api
void sharp_coll_finalize(struct sharp_coll_context *context);
- Code flow in HPC-X
MPI_Finalize() hcoll_finalize() sharp_coll_finalize()
© 2017 Mellanox Technologies 31- Mellanox Confidential -
High level API – Control path…
Sharp comm(group) create
• Sharp comm is created for hcoll ptp subgroup
• api
• code flow in HPCX
• Input spec
int sharp_coll_comm_init(struct sharp_coll_context *context,
struct sharp_coll_comm_init_spec *spec,
struct sharp_coll_comm **sharp_coll_comm);
struct sharp_coll_comm_init_spec {int rank; /* MPI rank */int size; /* Communicator size */int is_comm_world; /* is MPI_COMM_WORLD */void *oob_ctx; /* External context for OOB functions */
};
MPI_Comm_create() hcoll_create_context sharp_coll_comm_init()
© 2017 Mellanox Technologies 32- Mellanox Confidential -
High level API – Control path…
Sharp comm(group) create
• Sharp comm is created for hcoll ptp subgroup
• api
void sharp_coll_comm_destroy(struct sharp_coll_comm *comm);
• code flow in HPC-X
• Input specstruct sharp_coll_comm_init_spec {
int rank; /* MPI rank */int size; /* Communicator size */int is_comm_world; /* is MPI_COMM_WORLD */void *oob_ctx; /* External context for OOB functions */
};
MPI_Comm_free() hcoll_destroy_context sharp_coll_comm_destroy()
© 2017 Mellanox Technologies 33- Mellanox Confidential -
High level API – data path
Barrier int sharp_coll_do_barrier(struct sharp_coll_comm *comm);
Allreduce int sharp_coll_do_allreduce(struct sharp_coll_comm *comm, struct sharp_coll_reduce_spec *spec);
- Reduce spec:
- Input Buffer types
Contig Buffer
Stream ( not supported yet)
IOV ( not supported yet)
struct sharp_coll_reduce_spec {int root; /* root rank number (ignored for allreduce) */struct sharp_coll_data_desc sbuf_desc; /* data to submit */struct sharp_coll_data_desc rbuf_desc; /* buffer to receive the result */enum sharp_datatype dtype; /* data type */int length; /* reduce operation size */enum sharp_reduce_op op; /* operator */
};
© 2017 Mellanox Technologies 34- Mellanox Confidential -
High level API – data path..
Non-blocking API• Returns request handle
• use can test/wait for the completion
• API
- int sharp_coll_do_barrier_nb(struct sharp_coll_comm *comm, struct sharp_coll_request **req);
- int sharp_coll_do_allreduce_nb(.., .., struct sharp_coll_request **req);
- int sharp_coll_req_test(struct sharp_coll_request *req);
- int sharp_coll_req_wait(struct sharp_coll_request *req);
- int sharp_coll_req_free(struct sharp_coll_request *req);
© 2017 Mellanox Technologies 35- Mellanox Confidential -
Summary
Scalable In-network computing
Easy to use
Easy to integrate API
Thank You