
WHITE PAPER

Understanding Low and Scalable Message Passing Interface Latency

Latency Benchmarks for High Performance Computing
QLogic InfiniBand Solutions Offer 70% Advantage Over the Competition

Executive Summary

Considerable improvements in InfiniBand® (IB) interconnect technology for High Performance Computing (HPC) applications have pushed bandwidth to a point where streaming large amounts of data off-node is nearly as fast as within a node. However, latencies for small-message transfers have not kept up with memory subsystems and are increasingly the bottleneck in high-performance clusters.

Different IB solutions provide dramatically varying latencies, especially as cluster sizes scale upward. Understanding how latencies will scale as your cluster grows is critical to choosing a network that will optimize your time to solution.

The traditional latency benchmarks, which send 0-byte messages between two adjacent systems, result in similar latency measurements of about 1.4 microseconds (µs) for emerging DDR IB Host Channel Adapters (HCAs) from QLogic® and competitors. However, on larger messages, or across more nodes in a cluster, QLogic shows a 60-70% latency advantage over competitive offerings. These scalable latency measurements indicate why QLogic IB products provide a significant advantage on real HPC applications.

Key Findings

• The QLogic QLE7140 and QLE7280 HCAs outperform the Mellanox® ConnectX™ HCA in osu_latency at the 128-byte and 1024-byte message sizes by as much as 70%.
• The QLogic QLE7140 and QLE7280 HCAs outperform the ConnectX HCA in “scalable latency” by as much as 70% as the number of MPI processes increases.

Introduction

Today’s HPC applications are overwhelmingly implemented using a parallel programming model known as the Message Passing Interface (MPI). To achieve maximum performance, HPC applications require a high-performing MPI solution, involving both a high-performance interconnect and highly tuned MPI libraries. InfiniBand has rapidly become the HPC interconnect of choice, appearing on 128 systems in the June 2007 Top 500 list. This rapid upswing is due to its high maximum bandwidth (2 GB/s) and its low latency (~1.4–3 µs). High bandwidth is important because it allows an application to move large amounts of data very quickly. Low latency is important because it allows rapid synchronization and exchanges of small amounts of data.


This white paper compares several benchmark results. For all of these results, the test bed consists of eight servers built from standard, off-the-shelf components and a QLogic SilverStorm® 9024 24-port DDR IB switch.

Servers

• 2-socket rack-mounted servers
• 2.6 GHz dual-core AMD™ Opteron® 2218 processors
• 8 GB of DDR2-667 memory
• Tyan® Thunder n3600R (S2912) motherboards

The HCAs benchmarked were:

• Mellanox MHGH28-XTC (ConnectX) DDR HCA
• QLogic QLE7140 SDR HCA
• QLogic QLE7280 DDR HCA

All benchmarks were run using MVAPICH-0.9.9 as the MPI implementation. For the Mellanox ConnectX HCAs, MVAPICH was run over the user-space verbs provided by the OFED-1.2.5 release. For the QLE7140 and QLE7280, MVAPICH was run over the InfiniPath™ 2.2 software stack, using the QLogic PSM API and OFED-1.2-based drivers.

Motivation for Studying Latency

Bandwidths over the network are approaching memory bandwidths within a system. Running the bandwidth microbenchmark from Ohio State (osu_bw) on a node, using the MVAPICH-0.9.9 implementation of MPI, measures a large-message intra-node (socket-to-socket) MPI bandwidth of 2 GB/s for message sizes of 512 KB or smaller. This bandwidth is at a 1:1 ratio with the bandwidth available from a DDR IB connection.

In contrast, socket-to-socket MPI latency within these systems is 0.40 µs, while the fastest achievable inter-node IB MPI latency is 1.3–3 µs, a 3x to 7x penalty relative to socket-to-socket. Thus, small-message latency is one of the areas where there is a significant penalty for going off-node. Though "back-to-back" two-node benchmarks are available to help quantify this, the latency they observe does not always represent the latency a high-performance cluster will actually deliver at scale.

Different Ways to Measure Latency

MPI latency is often measured by one of a number of common microbenchmarks, such as osu_latency, the ping-pong component of the Intel® MPI Benchmarks (formerly the Pallas MPI Benchmarks), or the ping-pong latency component of the High Performance Computing Challenge (HPCC) suite of benchmarks. All of these microbenchmarks have the same basic pattern: each runs a single ping-pong test sending a 0- or 1-byte message between two cores on different cluster nodes, reporting the latency as half the time of one round trip. Here are some example graphs showing the results of running osu_latency using three different IB HCAs.
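For reference, the following is a minimal sketch of the ping-pong loop that these microbenchmarks share, written in C with MPI. It is a simplified illustration of the common pattern, not the actual osu_latency source, and the iteration count is an arbitrary example value.

```c
/* Minimal ping-pong latency sketch: rank 0 sends a tiny message to rank 1,
 * rank 1 echoes it back, and the reported latency is half the averaged
 * round-trip time. Simplified illustration, not the osu_latency source. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000;          /* arbitrary example iteration count */
    char buf[1] = {0};                /* 1-byte payload, as in these tests */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)                    /* half the round trip, reported in µs */
        printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}
```

Run it with two ranks placed on different nodes (for example, mpirun -np 2 with a host file naming two machines) so that the measurement exercises the network rather than shared memory.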


Judging from the osu_latency results, the QLE7280, QLE7140, and ConnectX HCAs are all similar with respect to 0-byte latency. However, as the message size increases, significant differences are observed. For example, with a 128-byte message size, the QLE7280 has a latency of 1.7 µs, whereas the ConnectX DDR adapter has a latency of 2.7 µs, a roughly 60% performance advantage for the QLE7280. With a 1024-byte message size, the QLE7280's latency is 2.80 µs, a 70% advantage over the ConnectX's latency of 4.74 µs.

Another test that measures latency is the RandomRing latency benchmark, which is part of the High Performance Computing Challenge (HPCC) suite of benchmarks. The benchmark measures latency across a series of randomly assigned rings and averages across all of them.1 The benchmark forces each process to talk to every other process in the cluster. This is important because it reveals a substantial difference in scalability at large core counts between HCAs that seemed so similar when running osu_latency.

1 The measurement differs from the ping-pong case since the messages are sent by two processes calling MPI_Sendrecv, rather than one calling MPI_Send followed by MPI_Recv.

As demonstrated, the QLE7280 and QLE7140 latencies remain largely flat with increasing process count. The ConnectX HCA's latency, however, rises as the number of processes increases. At 32 cores, the RandomRing latency of the QLogic QLE7280 DDR HCA is 1.33 µs, compared to 2.26 µs for the ConnectX HCA, which amounts to 70% better performance for the QLE7280. The trend is toward larger differences at larger core counts. Since low latency is required even at large core counts to scale application performance to the greatest extent possible, the QLogic HCAs' consistently low latency is referred to as "scalable latency."


Understanding Why Latency Scalability Varies

To understand why latency scalability would differ, it helps to understand, at least at a basic level, how MPI works. The following is the basic path of an MPI packet, from a sending application process to a receiving application process.

1. The sending process has data for some remote process.
2. The sender places the data in a buffer and passes a pointer to the MPI stack, along with an indication of who the receiver is and a tag for identifying the message.
3. A ‘context’ or ‘communication ID’ identifies the context over which the point-to-point communication happens; only messages in the same communicator can be matched (there is no "any" communicator). (See the sketch following this list.)
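As a concrete illustration of that interface, the minimal two-rank example below (written for this discussion, not taken from any particular application) shows the sender handing the MPI library a buffer pointer, a destination rank, a tag, and a communicator, and the receiver posting a matching receive.

```c
/* Minimal illustration of the send path described above: the application
 * passes MPI a buffer pointer plus the destination rank, a tag, and a
 * communicator. Only a receive posted on the same communicator with a
 * matching source and tag will pick this message up. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int tag = 42;                       /* identifies this message */
    double payload[4] = {1.0, 2.0, 3.0, 4.0};

    if (rank == 0) {
        /* buffer pointer, count/type, receiver, tag, communicator */
        MPI_Send(payload, 4, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double recvbuf[4];
        MPI_Recv(recvbuf, 4, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %.1f ... %.1f\n", recvbuf[0], recvbuf[3]);
    }

    MPI_Finalize();
    return 0;
}
```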

There are some variations in how this process is implemented, often based on the underlying mechanism for data transfer.

With many interconnects offering high-performance RDMA, there is a push towards utilizing it to improve MPI performance. RDMA is a one-sided communication model, allowing data to be transferred from one host to another without involving the remote CPU. This has the advantage of reducing CPU utilization, but it requires the RDMA initiator to know where it is writing to or reading from, which in turn requires an exchange of information before the data can be sent.
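To make the one-sided model concrete, the minimal two-rank sketch below uses MPI's standard RMA interface (MPI_Win_create and MPI_Put) as a stand-in for raw RDMA verbs: the origin has to know which exposed window, and which offset within it, it is writing to, while the target CPU posts no receive.

```c
/* One-sided transfer sketch using MPI RMA: rank 0 writes directly into a
 * memory window exposed by rank 1. The window-creation step is the
 * up-front exchange of "where to write" information the text refers to. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double target_buf[4] = {0};
    MPI_Win win;

    /* every rank exposes a buffer; addresses/keys are exchanged here */
    MPI_Win_create(target_buf, sizeof target_buf, sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        double data[4] = {1.0, 2.0, 3.0, 4.0};
        /* one-sided write into rank 1's window; rank 1 posts no receive */
        MPI_Put(data, 4, MPI_DOUBLE, 1, 0, 4, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1 window now holds %.1f ... %.1f\n",
               target_buf[0], target_buf[3]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```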

Another mechanism is the Send/Recv model. This is a two-sided communication model in which the receiver maintains a single queue where all messages initially land, and the receiver then directs messages from that queue to their final destinations. This has the advantage of not requiring remote knowledge to begin a transfer, as each side only needs to know about its own buffers, but at the cost of involving the CPU on both sides.

Most high-performance interconnects provide mechanisms for both of these models, but they make different optimization choices in tuning them. Almost all implementations use RDMA for large messages, where the initial setup cost of exchanging information is small relative to the cost of involving the CPU in transferring large amounts of data.

Thus, most MPIs implement a 'rendezvous protocol' for large messages: the sender sends a 'request to send', the receiver pins the destination buffer and sends back a key, and the sender then does an RDMA write to the final location. MPIs implemented on OpenFabrics verbs do this explicitly, while the PSM layer provided with the QLogic QLE7100 and QLE7200 series HCAs does it behind the scenes.
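The following single-process simulation is meant only to make that handshake concrete; the function names and the in-memory "RDMA write" are illustrative stand-ins rather than any real MPI or verbs API.

```c
/* Simulated rendezvous handshake: request-to-send (RTS), clear-to-send
 * (CTS) carrying the pinned buffer's location, then a one-sided write
 * into that buffer. Everything here runs in one process for illustration. */
#include <stdio.h>
#include <string.h>

#define MSG_LEN 16

static char receiver_buf[MSG_LEN];   /* the receiver's application buffer   */
static char *cts_addr;               /* "address" advertised in the CTS     */

static void send_rts(size_t len)   { printf("RTS: announcing %zu-byte message\n", len); }
static void send_cts(char *pinned) { cts_addr = pinned; }  /* pin + advertise */
static void rdma_write_sim(const char *src, size_t len)
{
    memcpy(cts_addr, src, len);       /* stands in for a one-sided RDMA write */
}

int main(void)
{
    const char payload[MSG_LEN] = "large message";

    send_rts(sizeof payload);                  /* 1. sender: request to send        */
    send_cts(receiver_buf);                    /* 2. receiver: pin buffer, send CTS */
    rdma_write_sim(payload, sizeof payload);   /* 3. sender: write to final location */

    printf("receiver buffer now holds: \"%s\"\n", receiver_buf);
    return 0;
}
```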

However, for small messages the latency cost of that initial setup is large compared to the cost of sending the message itself. A round trip on the wire can triple the cost of sending a small message, while copying a couple of cache lines from a receive buffer to their final location costs very little. This leads most implementors to use a Send/Recv-based approach. However, on HCAs that have been tuned for RDMA to the exclusion of Send/Recv, this causes a large slowdown, resulting in poor latency. An RDMA write is much faster, but it requires that costly setup. The following describes a mechanism used to sidestep this problem.

Achieving Low Latency with RDMA

For interconnects that have been optimized for Remote Direct Memory Access (RDMA), it can be desirable to use RDMA not only for large messages but also for small ones. This can be done without incurring the setup latency cost by mimicking a receive mailbox in memory. For each MPI process, the MPI library sets up a temporary memory location for every other process in the job. The setup and coordination are done at initialization time, so by the time communication starts every MPI process knows which memory location to write to and can use RDMA. On the receiving side, the MPI library checks each temporary memory location and copies any messages that have arrived to the correct buffers.
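The sketch below simulates this per-peer mailbox scheme within a single process, purely to show the data structure involved; NPEERS and SLOT_LEN are arbitrary example values, and remote_deposit stands in for a remote RDMA write.

```c
/* Simulated small-message "mailbox" scheme: every remote process gets its
 * own slot, arranged at startup, and the receiver must poll all of them. */
#include <stdio.h>
#include <string.h>

#define NPEERS   8      /* number of remote processes (example value) */
#define SLOT_LEN 64     /* bytes reserved per remote process          */

static char mailbox[NPEERS][SLOT_LEN];   /* one RDMA-writable slot per peer */

/* Stands in for a remote peer RDMA-writing into its dedicated slot. */
static void remote_deposit(int peer, const char *msg)
{
    strncpy(mailbox[peer], msg, SLOT_LEN - 1);
}

/* The receiver has no single queue to check: it scans every slot, so the
 * polling cost grows with the number of remote peers. */
static void poll_mailboxes(void)
{
    for (int p = 0; p < NPEERS; p++) {
        if (mailbox[p][0] != '\0') {
            printf("message from peer %d: %s\n", p, mailbox[p]);
            mailbox[p][0] = '\0';            /* mark slot as consumed */
        }
    }
}

int main(void)
{
    remote_deposit(3, "hello from rank 3");
    remote_deposit(6, "hello from rank 6");
    poll_mailboxes();
    return 0;
}
```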

This can work well in small clusters or jobs, such as when running a common point-to-point microbenchmark: each receiving process has only one memory location to check, and can very quickly find and copy any incoming message.


The issue with this approach is that it does not scale. With RDMA, each remote process needs its own temporary memory location to write to. Thus, as a cluster grows, the receiving process has to check an additional memory location for every remote process. In today's world of multicore processors and large clusters, the array of memory locations becomes very large.

The per-local-process memory and host software time requirements of this algorithm go up linearly with the number of processes in the cluster. This means that in a cluster of N nodes with M cores each, per-process memory use and polling latency grow as O(M × (N−1)), while per-node memory use grows even faster, as O(M² × (N−1)).
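To put rough numbers on this (a hypothetical cluster size, not one of the tested configurations): with N = 128 nodes and M = 8 cores per node, each receiving process must poll 8 × 127 = 1,016 mailboxes, and each node must reserve memory for 8 × 8 × 127 = 8,128 of them. Doubling the node count roughly doubles both figures, while doubling the cores per node quadruples the per-node total.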

A Scalable Solution: Send/Recv

A more scalable solution is to use Send/Recv. Because the location in memory where messages are placed is determined locally, all messages can go into a single queue with a single place to check, instead of requiring a memory location per remote process. Messages are then copied out, in the order they arrive, to the memory buffers posted by the application. Thus, the per-local-process memory requirements for this approach are constant, and the per-node memory requirements increase only with the size of the node.
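For contrast with the mailbox sketch above, here is the corresponding single-process simulation of the Send/Recv model: every incoming message lands in one local queue, so the receiver checks a single place regardless of how many peers exist. QUEUE_DEPTH and SLOT_LEN are arbitrary example values.

```c
/* Simulated single receive queue shared by all senders. */
#include <stdio.h>

#define QUEUE_DEPTH 128
#define SLOT_LEN     64

static char queue[QUEUE_DEPTH][SLOT_LEN];   /* one queue for every peer */
static int  head, tail;

/* Stands in for the HCA delivering an incoming message into the queue. */
static void enqueue(int from_rank, const char *msg)
{
    snprintf(queue[tail % QUEUE_DEPTH], SLOT_LEN, "rank %d: %s", from_rank, msg);
    tail++;
}

/* Receiver side: a single place to poll, independent of the peer count. */
static void drain_queue(void)
{
    while (head < tail) {
        printf("received %s\n", queue[head % QUEUE_DEPTH]);
        head++;
    }
}

int main(void)
{
    enqueue(3, "hello");
    enqueue(6, "hello");
    drain_queue();
    return 0;
}
```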

Connection State

A final element, which is harder to measure but apparent in very large clusters, is the advantage of a connectionless protocol. PSM is based on a connectionless protocol, as opposed to the connected protocol (Reliable Connection, or RC) used by most verbs-based MPIs.

A connected protocol requires some amount of per-partner state, both on the host and on the chip. When the number of processes scales up, this can lead to adverse caching effects as data is sent to and received from the HCA. This can be mitigated to some extent using methods like Shared Receive Queues (SRQ) and Scalable RC, but it remains a problem for very large clusters using RC-based MPIs.

The QLogic approach with the PSM API sidesteps this by using a connectionless protocol and keeping the minimum necessary state to ensure reliability. Investigations at Ohio State showed the advantages of a connectionless protocol at scale when compared to an RC-based protocol, although those advantages were limited by the small MTU and lack of reliability in the IB UD protocol.2 In another paper, the OSU investigators showed that a 'UD RDMA' approach was needed to achieve full bandwidth.3

PSM takes care of all of these issues behind the scenes. It gives the MPI implementor the scalability of a connectionless protocol without the need to develop yet another implementation of segmentation and reliability, and without running into the high-end bandwidth issues seen with UD.

2 http://nowlab.cse.ohio-state.edu/publications/conf-papers/2007/koop-ics07.pdf
3 http://nowlab.cse.ohio-state.edu/publications/conf-papers/2007/koop-cluster07.pdf


Summary and Conclusion

This white paper explores latency measurements and illustrates how benchmarks that measure point-to-point latency may not be representative of the latencies that applications require on large-scale clusters. It also explains some of the underlying architectural reasons for the varying approaches to low MPI latency, and shows how the QLogic QLE7140 and QLE7280 IB HCAs scale efficiently to large node counts.

The current trend towards RDMA in high-performance interconnects is very useful for applications with large amounts of data to move. Because system resources are already constrained, it is vital to limit CPU usage when moving large amounts of data through the system. However, a large and growing number of applications are more latency-bound than bandwidth-bound, and for those an approach to low latency that scales is necessary. The QLogic QLE7100 and QLE7200 series IB HCAs provide scalable low latency.

Disclaimer

Reasonable efforts have been made to ensure the validity and accuracy of these performance tests. QLogic Corporation is not liable for any error in this published white paper or the results thereof. Variation in results may be a result of change in configuration or in the environment. QLogic specifically disclaims any warranty, expressed or implied, relating to the test results and their accuracy, analysis, completeness or quality.

Corporate Headquarters: QLogic Corporation, 26650 Aliso Viejo Parkway, Aliso Viejo, CA 92656, 949.389.6000, www.qlogic.com

Europe Headquarters: QLogic (UK) LTD., Surrey Technology Centre, 40 Occam Road, Guildford, Surrey GU2 7YG, UK, +44 (0)1483 295825

© 2007 QLogic Corporation. Specifications are subject to change without notice. All rights reserved worldwide. QLogic, the QLogic logo, and SilverStorm are registered trademarks of QLogic Corporation. InfiniBand is a registered trademark of the InfiniBand Trade Association. AMD and Opteron are trademarks or registered trademarks of Advanced Micro Devices. Tyan is a registered trademark of Tyan Computer Corporation. Mellanox and ConnectX are trademarks or registered trademarks of Mellanox Technologies, Inc. InfiniPath is a trademark of PathScale, Inc. Intel is a registered trademark of Intel Corporation. All other brand and product names are trademarks or registered trademarks of their respective owners. Information supplied by QLogic Corporation is believed to be accurate and reliable. QLogic Corporation assumes no responsibility for any errors in this brochure. QLogic Corporation reserves the right, without notice, to make changes in product design or specifications.

HSG-WP07017 SN0032014-00 A