
PROCEEDINGS OF THE CRAY USER GROUP, 2012

Leveraging the Cray Linux Environment Core Specialization Feature to Realize MPI Asynchronous Progress on Cray XE Systems

Howard Pritchard, Duncan Roweth, David Henseler, and Paul Cassella

Abstract—Cray has enhanced the Linux operating system with a Core Specialization (CoreSpec) feature that allows for differentiated use of the compute cores available on Cray XE compute nodes. With CoreSpec, most cores on a node are dedicated to running the parallel application while one or more cores are reserved for OS and service threads. The MPICH2 MPI implementation has been enhanced to make use of this CoreSpec feature to better support MPI independent progress. In this paper, we describe how the MPI implementation uses CoreSpec along with hardware features of the XE Gemini Network Interface to obtain overlap of MPI communication with computation for micro-benchmarks and applications.

Index Terms—MPI, CLE, core specialization, asynchronous progress


1 INTRODUCTION

The importance of overlapping computation with communication and independent progress in Message Passing Interface (MPI) applications is well known (see, for example, [5], [9], [15]), even if in practice many MPI applications are not structured to take advantage of such capabilities. Many different approaches have been taken since MPI was first standardized to provide for this capability, including hardware-based approaches in which the network adapter itself handles much of the MPI protocol [3], hybrid approaches in which the network adapter and network adapter device driver together offload the MPI protocol from the application [4], host software-based approaches that assist RDMA-capable, but MPI-unaware, network adapters [10], [18], as well as more generalized host software-based approaches that take advantage of modern multi-core processors [11], [19].

The authors are with Cray, Inc. E-mail: howardp,droweth,dah,[email protected]

This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0001. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency. This work was supported in part by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357.

The Cray XE Gemini RDMA-capable network adapter has features intended to assist in the implementation of effective host software-based approaches for providing independent progression of MPI and for allowing for overlap of communication with computation. To provide for more effective implementation of such host software-based approaches, Cray has also enhanced the Cray Linux Environment (CLE) Core Specialization feature to facilitate management of the host processor resources needed for this approach. This paper describes the combination of Gemini hardware features, the CLE Core Specialization feature, and enhancements made to MPICH2 to realize this capability.

The rest of this paper is organized as follows. First, an overview of the Core Specialization feature is presented. Features of the Gemini network adapter that are significant for this work are described in Section 3. Section 4 describes the approach Cray has taken with the MPICH2 implementation of MPI to realize better support for independent progress and communication/computation overlap.


In Section 5, results using a standard MPI overhead benchmark are presented, as well as results obtained for two MPI applications. The paper concludes with a discussion of future work planned for the Core Specialization feature in CLE, as well as improvements to MPICH2 and possible extensions to the GNI API to better support MPI independent progress.

2 CORE SPECIALIZATION OVERVIEW

Cray has enhanced the Linux kernel delivered as part of CLE to allow for dynamic partitioning of the processors (cpus) on nodes of Cray XE/XK systems into two groups – one group dedicated to application threads, and the other dedicated to system services. This partitioning is selectable by the user at application launch time. Note that this partitioning scheme does not prevent compute-intensive applications from being able to use all available cpus on the node if so desired. A prototype implementation of this Core Specialization (CoreSpec) feature for the Cray XT is described in [13].

An initial goal for this CoreSpec feature was the reduction of the impact of system-related noise on the performance of noise-sensitive parallel applications. The basic idea is that by reserving one or more of the cpus on each node for system services daemons and kernel threads, noise-sensitive applications can perform significantly better using the remaining cpus, which are now dedicated exclusively to the application processes. Note that in addition to system daemons and kernel threads, interrupts from the network adapter are also directed toward one or more of the cpus within the set of cpus reserved for the operating system. Significant improvement in the runtime for the POP ocean model application on a Cray XT system was demonstrated when using the prototype CoreSpec feature described in [13].

It was realized early on in the design of the Cray XE that CoreSpec could also be used for other purposes. Unlike the earlier Cray XT systems, the Cray XE was introduced when the multi-core era of x86-64 processors was already in full swing. Indeed, except for a few special systems, all Cray XE compute nodes have at least 16 cpus per node, with the latest Interlagos-based systems having 32 cpus per node. The number of cpus per node is expected to continue to increase. However, the ability of most HPC applications to use all the cpus on the node effectively is unlikely to continue, partly owing to the decreasing amount of memory and memory bandwidth per cpu, as well as increased sharing of cpu resources with other cpus on the same compute unit (see footnote 1). These otherwise unused cpus are ideal for use by any threads responsible for progress of the MPI state engines of the application processes on the node.

The prototype Cray XT implementation of CoreSpec was enhanced to support this new usage model. A capability was added to allow a thread to inform CoreSpec to schedule it on the system services core(s). Threads that spend large amounts of time descheduled, waiting on interrupts from network devices, are ideal candidates for scheduling on the system services cpus. The job kernel module package [17] was also enhanced with an extension to allow MPI to create extra progress threads without interfering with, or knowing about, the placement policy requested by the application at job launch.

Some of the functionality of CoreSpec described here could, in principle, be implemented using existing kernel interfaces available to user-space applications. However, in the context of complex, layered software, a more specialized kernel placement mechanism allows for simpler usage by the various independent software packages that an application may be using concurrently.
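As an illustration of the kind of placement CoreSpec automates, the sketch below pins the calling thread to a single cpu using only the generic Linux affinity interface; it is not the CoreSpec kernel interface itself, and the cpu number passed in is purely illustrative.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Confine the calling thread to one cpu, e.g. a cpu reserved for
     * system services.  CoreSpec performs this kind of placement through
     * a Cray-specific kernel interface on behalf of cooperating software
     * layers; this sketch uses only the generic Linux affinity call. */
    static int bind_thread_to_cpu(int cpu)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }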

3 GEMINI HARDWARE FEATURES FOR SUPPORTING MPI ASYNCHRONOUS PROGRESS

The Cray XE Gemini network adapter [2] has several features to help support host software-based MPI independent progress mechanisms. The most important of these is a multi-channel DMA engine.

1. A compute unit refers to a group of cpus which share processor resources. An example is the AMD Interlagos compute unit, in which two cpus share, among other resources, a floating point unit.


Transaction requests (TX descriptors) submitted to the DMA engine can be programmed to select for generation of a local interrupt at the originator of the transaction, generation of an interrupt at the target of the transaction, or both, when the transaction is complete. A TX descriptor may also be programmed to delay processing of subsequent TX descriptors in the same channel until the transaction is complete. Complete in this context means that for an RDMA write transaction, all data has been received at the target, and for an RDMA read transaction, all response data has been received back at the initiator's memory. In addition to the DMA engine, Gemini Completion Queues (CQs) can be configured for polling or blocking (interrupt-driven) mode. The Gemini also provides a low-overhead mechanism for a process operating in user space to generate an interrupt and Completion Queue Event (CQE) at a remote node.
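The descriptor options just described can be sketched as follows. Flag and field names follow the uGNI API document [1], but the fragment is only a schematic: memory registration, addresses, lengths, and error handling are omitted, and the exact field layout should be taken from gni_pub.h.

    #include "gni_pub.h"   /* uGNI public API [1] */

    /* Schematic: an RDMA put TX descriptor that asks the hardware for a
     * completion event at both ends and fences the DMA channel.  Local
     * and remote addresses/memory handles are assumed to have been set
     * up elsewhere. */
    static gni_return_t post_put_with_events(gni_ep_handle_t ep,
                                             gni_post_descriptor_t *d)
    {
        d->type      = GNI_POST_RDMA_PUT;
        d->cq_mode   = GNI_CQMODE_GLOBAL_EVENT |  /* CQE at the initiator    */
                       GNI_CQMODE_REMOTE_EVENT;   /* CQE/interrupt at target */
        d->rdma_mode = GNI_RDMAMODE_FENCE;        /* delay later descriptors on
                                                     this channel until this
                                                     transaction is complete */
        return GNI_PostRdma(ep, d);
    }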

Some preliminary investigations of the overhead of using blocking CQs in conjunction with the low-overhead remote interrupt generation mechanism, using the GNI [1] GNI_PostCqWrite and GNI_CqWait functions, showed that when the processes were scheduled on the same cpu where the Gemini device driver (kgni) interrupt handler was run, a ping-pong latency of 3-4 µsec was obtained. When the processes were run on cpus other than the one where the kgni interrupt handler was run, the latency increased to 7-8 µsec. These times were obtained on AMD Interlagos 2.3 GHz processors. The measured wake-up times, particularly when the process being made runnable by the kgni interrupt handler runs on the same cpu as the interrupt handler, were small enough to warrant pursuing an interrupt-driven approach to realizing MPI asynchronous progress.

4 IMPLEMENTATION OF PROGRESS MECHANISM IN MPICH2

4.1 Phase One

The MPICH2 for Cray XE systems utilizes the Nemesis CH3 channel [6]. A Nemesis Network module (Netmod) using the GNI interface was implemented for the Cray XE [14]. This initial GNI Netmod implementation did not provide an explicit mechanism for MPI asynchronous progress. The MPICH2 1.3.1 code base from Argonne was used for implementing changes in MPICH2 to support MPI asynchronous progress on Cray XE systems. This version of MPICH2 has effective support for MPI-2 MPI_THREAD_MULTIPLE. Although a global lock is used for protecting internal data structures, there are yield points within the MPI library where threads blocked waiting for completion of sends and receives yield the lock to allow other threads into the MPI progress engine.

A primitive asynchronous progress mechanism already exists in MPICH2 and is available when MPICH2 is configured for runtime-selectable MPI-2 thread safety support. The method uses an active progress thread which posts a blocking MPI receive request on an MPICH2 internal communicator at job startup. The thread then goes into the MPI progress engine, periodically yielding a mutex lock to allow application threads to enter the MPI state engine. At job termination, the main thread in each MPI rank sends the message to itself which matches the posted receive on which the progress thread is blocked. This causes the progress thread to return from the blocking receive, allowing MPI finalization to gracefully clean up any resources associated with the progress thread. This active progress thread approach only works satisfactorily if each MPI rank has an extra cpu for its progress thread. In addition to this problem, the extra thread adds significantly to MPI message latency, especially for short messages, since there is significant contention for the mutex lock. Note that the progress thread is described as active because, except at points where it is trying to acquire a mutex lock, it is scheduled and running from the perspective of the kernel. Cray decided to take a different approach, utilizing the Gemini features mentioned above to avoid the use of active progress threads, and instead rely on progress threads that are only scheduled (woken up) when interrupts are received from the Gemini network interface indicating there is MPI message processing to be done.
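The generic active-progress-thread pattern described above can be illustrated with public MPI calls alone (MPICH2 itself does the equivalent on an internal communicator). The tag value, the duplicated communicator, and the function names below are arbitrary choices for the example.

    #include <mpi.h>
    #include <pthread.h>

    #define SHUTDOWN_TAG 999           /* arbitrary tag for the shutdown message */

    static MPI_Comm progress_comm;     /* stands in for MPICH2's internal comm   */

    /* The "active" progress thread: blocked in a receive that only this same
     * rank can match.  Inside MPI_Recv the thread keeps driving the progress
     * engine, periodically yielding the global lock to application threads.  */
    static void *progress_fn(void *arg)
    {
        char dummy;
        MPI_Recv(&dummy, 1, MPI_CHAR, 0, SHUTDOWN_TAG, progress_comm,
                 MPI_STATUS_IGNORE);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided;
        char dummy = 0;
        pthread_t tid;

        /* Requires MPI_THREAD_MULTIPLE support from the MPI library. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_dup(MPI_COMM_SELF, &progress_comm);
        pthread_create(&tid, NULL, progress_fn, NULL);

        /* ... application communication and computation ... */

        /* At termination, send the message that matches the posted receive so
         * the progress thread unblocks and resources can be cleaned up.       */
        MPI_Send(&dummy, 1, MPI_CHAR, 0, SHUTDOWN_TAG, progress_comm);
        pthread_join(tid, NULL);
        MPI_Comm_free(&progress_comm);
        MPI_Finalize();
        return 0;
    }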


There were several goals for the method used to enable MPI asynchronous progress in the MPICH2 library. One was to avoid, to the greatest extent possible, negatively impacting the message latency and message rate for short messages. Some degradation is unavoidable as long as a progress thread approach is used, owing to the need for mutex lock/unlocks in the path through MPI. Another goal was not to focus on optimizing only specific message delivery protocols, but to use the progress thread in a way that progresses the MPI state engine's non-cpu-intensive tasks. For the first phase of this effort, another goal was to keep to a minimum the additional complexity needed to support the feature.

The first goal was met in two ways. A given rank's progress thread is only woken up when progress is needed on messages large enough to use the rendezvous (LMT) protocol [12]. For smaller messages, the progress thread is not involved. The second goal was realized by using the local/remote interrupt capability of the Gemini DMA engine in a manner that allows for both the RDMA read and the cooperative RDMA write LMT protocols (Section 4.1.2) to be handled by the progress threads. The third goal was realized by retaining significant parts of the existing MPICH2 progress thread infrastructure (thus avoiding some of the complexities of more generalized on-load solutions [11]), by leveraging the relatively mature MPI-2 thread safety support in MPICH2, and by using the Gemini DMA engine in a way that avoids the need for additional locks within the GNI Netmod itself. Also, to keep things simple, no support was added to facilitate asynchronous progression of intra-node messages. The following sections detail the changes to the GNI Netmod to support MPI asynchronous progress.

4.1.1 MPI Initialization

If the user has set environment variables indicating that the asynchronous progress mechanism should be used, MPICH2 configures itself early in startup for the MPI-2 thread support level MPI_THREAD_MULTIPLE, initializing mutexes and structures related to managing thread-private storage regions. As part of the GNI Netmod initialization, each MPI rank creates a blocking receive (RX) CQ, in addition to the CQs described in [14]. Each rank also creates a progress thread during the GNI Netmod initialization procedure. The CoreSpec-related procedures described in Section 2 are taken to ensure that the progress threads are schedulable only on one of the OS cpus if the -r CoreSpec option was specified on the aprun command line when the job was launched. The progress thread then enters a GNI_CqWait call where it is descheduled and blocks inside the kernel, waiting for interrupts from the Gemini network adapter. While blocked in the kernel, these helper threads consume no cpu cycles and do not interfere with the application threads.
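A minimal sketch of such a progress thread is shown below. The uGNI call follows the API document [1] (GNI_CqWaitEvent is the blocking wait referred to above as GNI_CqWait); the shutdown flag and the drive_mpi_progress() wrapper around the netmod's internal poll routine are hypothetical placeholders, not MPICH2 symbols.

    #include <stdint.h>
    #include "gni_pub.h"

    extern void drive_mpi_progress(void);   /* hypothetical wrapper around the
                                               MPICH2/netmod progress routine
                                               (MPID_nem_network_poll)          */
    static volatile int shutting_down;      /* set during MPI finalization      */

    static void *async_progress_thread(void *arg)
    {
        gni_cq_handle_t rx_cq = *(gni_cq_handle_t *) arg;  /* blocking RX CQ */
        gni_cq_entry_t  event;

        while (!shutting_down) {
            /* The thread is descheduled inside the kernel until the Gemini
             * raises an interrupt; it consumes no cpu cycles while blocked. */
            if (GNI_CqWaitEvent(rx_cq, (uint64_t) -1, &event) != GNI_RC_SUCCESS)
                continue;
            /* Drive the MPI state engine to process whatever triggered the
             * interrupt (LMT control message or completed transaction).     */
            drive_mpi_progress();
        }
        return NULL;
    }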

4.1.2 Rendezvous (LMT) Protocol

In addition to the changes in the GNI Netmod initialization procedure, the other major enhancements to MPICH2 were in the management of the rendezvous, or LMT, protocol. As described in [14], both RDMA read and cooperative RDMA write protocols are employed. For the RDMA read protocol, the sender rank should optimally be notified when the receiver has completed the RDMA read of the message data out of the sender's send buffer, in other words, when the receiver sends a Nemesis LMT receive done message. On the receiver side, the receiver needs to be notified when the sender sends a ready to send (RTS) message, as well as when the RDMA read request it posts to read the data out of the sender's send buffer has completed. Similarly, for the RDMA write protocol, the sender rank needs to be notified when the receiver sends a clear to send (CTS) message and when the RDMA write delivering the message data from its send buffer to the receiver's receive buffer has completed. The receiver rank needs to be notified when an RTS message arrives from the sender, and when a send done is received from the sender. For long messages, multiple CTS, RTS, send done, and RDMA write transactions are required to deliver the message.
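The control-message flows just described can be summarized schematically. This is an outline of the protocol, not actual MPICH2 code; the message names follow the Nemesis LMT interface [12].

    /* Control messages used by the two LMT variants described above.      */
    enum lmt_ctrl_msg {
        LMT_RTS,        /* sender -> receiver: ready to send                */
        LMT_CTS,        /* receiver -> sender: clear to send (write path)   */
        LMT_RECV_DONE,  /* receiver -> sender: RDMA read of the data done   */
        LMT_SEND_DONE   /* sender -> receiver: RDMA write of the data done  */
    };

    /* RDMA read path:  RTS -> receiver posts RDMA read(s) -> RECV_DONE
     * RDMA write path: RTS -> CTS -> sender posts RDMA write(s) -> SEND_DONE
     * For long messages the write path repeats the CTS / write / SEND_DONE
     * exchange until the whole message has been delivered.                 */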

In order to allow for asynchronous progress for both transfer types, a generalized approach for waking up the progress threads at both the receiver and the sender side is required. In both cases, the progress thread on either side needs to be woken up when a relevant control message arrives.


Rather than complicate the existing method for delivering the control messages, the control message itself is sent using the same approach as is taken when there is no progress thread. In addition to the control message, a remote interrupt message is also delivered to the target, behind the control message, using the GNI_PostCqWrite function. This has the effect of waking up the progress thread of the targeted rank once the LMT control message has been delivered. The progress thread then returns from GNI_CqWait and invokes the MPICH2 internal MPID_nem_network_poll function to progress the state engine and process the control message.
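The wake-up can be sketched as follows. Descriptor field names follow [1], but the endpoint setup, the memory handle identifying the target's blocking RX CQ, and the WAKEUP_COOKIE value are all assumptions made for the example.

    #include <string.h>
    #include "gni_pub.h"

    #define WAKEUP_COOKIE 0x1ULL   /* hypothetical value decoded by the target */

    /* After the LMT control message has been sent by the normal path, queue
     * a CQ write behind it so that a CQE and interrupt are generated on the
     * target's blocking RX CQ, waking its progress thread. */
    static gni_return_t send_wakeup(gni_ep_handle_t target_ep)
    {
        static gni_post_descriptor_t d;    /* one outstanding wakeup at a time
                                              keeps the sketch short           */
        memset(&d, 0, sizeof(d));
        d.type          = GNI_POST_CQWRITE;
        d.cqwrite_value = WAKEUP_COOKIE;
        /* d.remote_mem_hndl must identify the target's RX CQ (setup omitted) */
        return GNI_PostCqWrite(target_ep, &d);
    }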

The other progress point to handle is how to receive notification that the Gemini DMA engine has completed processing a TX descriptor associated with a message using the LMT path. For phase one, to keep things simple, only a single RX CQ is used by each progress thread. This means, however, that the option to generate a local interrupt when the DMA engine has processed the TX descriptor is not available. Instead, for every TX descriptor posted to the DMA engine which is used to transfer data for an LMT message, a second TX descriptor is posted which does a 4-byte dummy write back to the initiator of the transaction. This second TX descriptor is programmed to generate a remote interrupt. Note that since the second descriptor is actually only doing a dummy write back to itself, the remote interrupt is being sent to the initiator of the DMA transaction. The first TX descriptor, the one that actually moves the data, is also marked with the GNI_RDMAMODE_FENCE rdma mode bit. Thus, the arrival of the remote interrupt back at the initiator means that the first TX descriptor is complete, as defined in Section 3. This modification to the use of the DMA engine has the advantage that the progress thread only needs to block on a single RX CQ, and more importantly, that the processing of the first TX transaction's CQE is not delayed by the time it takes for the progress thread to wake up and process the CQE. The wake-up is purely optional: if the thread detects that an application thread is already in the MPI progress engine, it can just immediately go back into the GNI_CqWait call.

The principal downside of this usage model is that there is a stall for every TX descriptor with the GNI_RDMAMODE_FENCE bit set. Since the fence blocks until all network responses have returned before proceeding to the next TX descriptor, this stall can be significant, particularly for distant target nodes and in cases where the network is heavily congested.
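The phase one trick can be sketched as two back-to-back descriptors on the same DMA channel. Field names follow [1]; addresses, lengths, memory handles, and error handling for the real data transfer are omitted, and the loopback endpoint (an endpoint bound back to the local NIC) is assumed to have been set up elsewhere.

    #include "gni_pub.h"

    /* Phase one completion notification: desc[0] moves the LMT data and is
     * fenced, so desc[1] (a 4-byte dummy put back to the initiator's own
     * memory) is not processed until desc[0] is complete.  The remote event
     * requested for desc[1] therefore raises an interrupt on the initiator's
     * blocking RX CQ only after the data transfer has finished. */
    static void post_lmt_data_with_wakeup(gni_ep_handle_t target_ep,
                                          gni_ep_handle_t loopback_ep,
                                          gni_post_descriptor_t desc[2])
    {
        desc[0].type      = GNI_POST_RDMA_PUT;   /* or RDMA_GET for the read path */
        desc[0].rdma_mode = GNI_RDMAMODE_FENCE;  /* delay desc[1] until complete  */
        GNI_PostRdma(target_ep, &desc[0]);

        desc[1].type    = GNI_POST_RDMA_PUT;       /* 4-byte dummy write to self  */
        desc[1].length  = 4;
        desc[1].cq_mode = GNI_CQMODE_REMOTE_EVENT; /* CQE + interrupt on our own
                                                      blocking RX CQ              */
        GNI_PostRdma(loopback_ep, &desc[1]);
    }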

This phase one feature is available in the MPICH2 packaged as part of the Message Passing Toolkit (MPT) 5.4 release.

4.2 Changes to MPICH2 - Phase Two

The main goal of the phase two portion of the asynchronous progress feature is improving the mechanism for progression of DMA transactions by removing the use of the GNI_RDMAMODE_FENCE bit from the TX transactions which move message data, as well as dispensing with the dummy TX transaction used to generate an interrupt at the initiator. By removing the need to use the GNI_RDMAMODE_FENCE bit and the extra TX descriptor, three significant performance advantages are realized, at least in theory:

1) Since the fence mode is no longer required, multiple BTE channels can be used by the application,

2) The stalls introduced into the DMA engine TX descriptor pipeline are eliminated,

3) The number of TX descriptors that need to be processed is cut in half.

In order to generate local interrupts upon completion of a TX descriptor associated with a message using the LMT path, an additional blocking TX CQ must be created during the GNI Netmod initialization procedure. The progress threads now have to block on two CQs, which can be accomplished using the GNI_CqVectorWaitEvent function. In addition to these changes, a more complex CQE management scheme is required, since the progress threads now need to store CQEs recovered from the new blocking TX CQ in a way that both the progress threads and any application threads that are in the MPI progress engine can process these CQEs.


Currently, in the phase two implementation, a linked list of CQEs, protected by a mutex, is used for storing these CQEs. When a progress thread is woken up due to a CQE on the blocking TX CQ, the dequeued CQE is added to this linked list. If an application thread is already within the MPI progress engine, the asynchronous progress thread simply goes back to blocking in the kernel by again calling GNI_CqVectorWaitEvent. Application threads already in the MPI progress engine then pick up the CQEs from this linked list and process them in a manner analogous to that used to process CQEs returned from the existing non-blocking TX CQ.
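A sketch of the phase two progress loop is shown below. GNI_CqVectorWaitEvent is the vectored blocking wait named above; its exact signature is abbreviated from [1], and the in-progress-engine check and the drive_mpi_progress() wrapper are hypothetical placeholders for MPICH2 internals.

    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include "gni_pub.h"

    extern int  application_thread_in_progress_engine(void);   /* hypothetical */
    extern void drive_mpi_progress(void);                       /* hypothetical */

    struct cqe_node { gni_cq_entry_t cqe; struct cqe_node *next; };

    static pthread_mutex_t  cqe_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct cqe_node *cqe_list;   /* consumed by whichever thread is in
                                           the MPI progress engine            */

    static void *phase_two_progress_thread(void *arg)
    {
        gni_cq_handle_t *cqs = (gni_cq_handle_t *) arg; /* [0]=RX CQ, [1]=TX CQ */
        gni_cq_entry_t   event;
        uint32_t         which;

        for (;;) {
            if (GNI_CqVectorWaitEvent(cqs, 2, (uint64_t) -1,
                                      &event, &which) != GNI_RC_SUCCESS)
                continue;
            if (which == 1) {                    /* blocking TX CQ: stash CQE  */
                struct cqe_node *n = malloc(sizeof(*n));
                n->cqe = event;
                pthread_mutex_lock(&cqe_lock);
                n->next  = cqe_list;
                cqe_list = n;
                pthread_mutex_unlock(&cqe_lock);
            }
            if (application_thread_in_progress_engine())
                continue;                        /* let that thread process it */
            drive_mpi_progress();
        }
        return NULL;
    }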

Phase two is still very much a work in progress. As will be seen in the results section, the need for the progress thread to explicitly dequeue CQEs from the blocking TX CQ has a significant impact on MPI performance if the application threads do not have sufficient work to completely overlap the communication time.

5 RESULTS

5.1 Micro–benchmark Results

The CPU overheads of sending and receiving MPI messages were measured with the Sandia SMB test [16] using the post-work-wait method described in [7], where the overhead is defined to be:

...the overhead, defined as the length of time that a processor is engaged in the transmission or reception of each message; during this time, the processor cannot perform other operations.

Application availability is defined to be the fraction of total transfer time that the application is free to perform non-MPI related work. In terms of the Sandia micro-benchmark and the following figures:

overhead = iter_t - work_t    (1)

and

availability = 1 - (iter_t - work_t) / base_t    (2)
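As an illustration with purely hypothetical numbers: if base_t = 10 µs, work_t = 50 µs, and iter_t = 52 µs, then the overhead is 52 - 50 = 2 µs and the availability is 1 - 2/10 = 80%.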

Fig. 1. Calculating sender availability on 16K messages. Communication overhead is low; CPU availability is measured at 83%.

The test measures the time to complete a non-blocking MPI_Isend (or MPI_Irecv) of a given size, some number of iterations of a work loop, and then an MPI_Wait. The test is repeated for increasing numbers of iterations of the work loop until the total time taken (iter_t) exceeds the time for the communication step alone (base_t) by some threshold. Examples of this calculation are illustrated for 16 Kbyte transfers in Figures 1 and 2, in which sender availability is measured to be 83% and receiver availability 20%.
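The sender-side measurement can be sketched as follows; the work-loop body, buffer, and peer rank are placeholders, and the SMB harness's calibration of work_t and base_t is not shown.

    #include <mpi.h>

    extern void do_work_unit(void);   /* hypothetical unit of compute work */

    /* One post-work-wait iteration on the sender side: post a non-blocking
     * send, run 'iters' units of work, then wait.  The SMB harness grows
     * 'iters' until the returned iter_t exceeds base_t (the time with no
     * work at all) by a threshold, then reports overhead and availability. */
    static double post_work_wait(void *buf, int count, int peer, long iters)
    {
        MPI_Request req;
        double t0 = MPI_Wtime();

        MPI_Isend(buf, count, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req);
        for (long i = 0; i < iters; i++)
            do_work_unit();
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        return MPI_Wtime() - t0;       /* iter_t for this message size */
    }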

The SMB test repeats this measurement, reporting sender and receiver CPU availability for increasing message size (see Figure 3). The results shown in the figure are baseline values without MPI asynchronous progression enabled.

The characteristics of each of the three MPI protocols are clear. The small message case requires CPU time for each message on both the sender and the receiver. For intermediate size messages, where the LMT read protocol is used, sender availability increases as the receiver fetches data independently. The receiver, however, consumes increasing amounts of CPU time as the message size increases. The large message LMT cooperative put protocol provides no overlap of computation and communication; both sender and receiver must be in MPI calls to make progress.


Fig. 2. Calculating receiver availability on 16K messages. Communication overhead is high; CPU availability is measured at 20%.

These results are in line with expectations in that the user process must be in an MPI call in order to perform MPI matching or to advance the underlying GNI protocol. This restriction is not an issue when the application is making MPI calls at a high rate, but poor overlap on intermediate and large transfers has a negative impact on the performance of some applications.

In Figure 4 we repeat the measurements using the phase one library. At small message sizes performance is much as before. Receive availability dips as we switch to the rendezvous protocol at message sizes of 8K bytes and above. With progression enabled, both send and receive side availability then increase steadily with message size.

In Figure 5 we compare the phase one and phase two implementations, focusing on intermediate size messages. The phase one implementation shows poor receive side availability for message sizes between 8K and 10K.

We can lower the threshold at which we switch to the rendezvous protocol. The effects of making this change are illustrated in Figure 6. We show send and receive side availability for intermediate message sizes with progression enabled and disabled. Receiver availability increases markedly once progression is enabled, and from there continues to increase gradually with message size.

Fig. 3. Application availability as a function of message size, measured using the SMB overhead test with one process per node. Availability is shown for the sender (red) and the receiver (blue). Progression disabled.

Fig. 4. Updated availability plot. Send and receive side availability are shown as a function of message size. Progression enabled, phase one.

In Figure 7 we show the impact on latency of enabling asynchronous progression. There is a jump in latency as we switch from the eager to the rendezvous protocol. By default the library makes this change at 8K bytes so as to minimize the effect. The jump in latency is more pronounced when asynchronous progression is enabled. Latency also increases with the number of processes per node. This is a result of the Gemini device only supporting a single user in-band interrupt: all progress threads must be woken when an interrupt is delivered.


Fig. 5. Comparison of CPU availability at intermediate message sizes for the phase one and phase two implementations.

The standard rendezvous algorithm performs message matching and then initiates a block transfer to move the bulk data. Our results show this working well for large message sizes, but pushing up latencies at small sizes. An alternative approach is to use an FMA get at intermediate sizes rather than a block transfer. This reduces the latency, especially with large numbers of processes per node, but it also reduces CPU availability. In Figure 8 we show the impact on bandwidth of enabling progression at 1K and 8K message sizes. The eager protocol delivers good bandwidth at intermediate message sizes, but CPU availability is relatively poor and main memory bandwidth is wasted copying from system buffers to user space. The rendezvous protocol with thread-based progression delivers high bandwidth and high CPU availability at large message sizes, but reduced bandwidth at intermediate sizes. These effects will diminish as we reduce the overhead costs of thread-based progression. The combination of the overhead plots and the latency/bandwidth charts can be seen as a measure of success for progress mechanisms. CoreSpec was observed to significantly reduce the variability in the SMB benchmark measurements.

Fig. 6. CPU availability for intermediate message sizes. In these measurements the small message eager protocol is used for transfers of up to 512 bytes and rendezvous for all larger sizes.

Fig. 7. Latency measurements with 4 processes per node. In these measurements the small message eager protocol is used for transfers of up to 512 bytes and rendezvous for all larger sizes.

5.2 S3D Application Results

S3D is a massively parallel direct numerical solver (DNS) for the full compressible Navier-Stokes, total energy, species, and mass continuity equations coupled with detailed chemistry [8]. It is based on a high-order accurate, non-dissipative numerical scheme solved on a three-dimensional structured Cartesian mesh. S3D's performance has been optimized for increased grid size, more simulation time steps, and more species equations. These are critical to the scientific goals of turbulent combustion simulations in that they help achieve higher Reynolds numbers, better statistics through larger ensembles, more complete temporal development of a turbulent flame, and the simulation of fuels with greater chemical complexity.


Fig. 8. Bandwidth measurements with 4 processes per node. In these measurements the small message eager protocol is used for transfers of up to 512 bytes (green) or 4K bytes (blue). Rendezvous protocol with progression enabled is used for all larger sizes.

TABLE 1
Profile of S3D without progression.

  Time%    Time (secs)   Calls      Group / Function
  100.0%   333.55        2022296    Total
   56.7%   189.06         605408    USER
   38.3%   127.90         869027    MPI
   23.7%    78.91          61200    mpi_waitall
   11.8%    39.43           6960    mpi_wait
    2.4%     7.938          6960    mpi_wait
    4.1%    13.73         546009    OMP

In addition, ORNL and Cray have spent the past year converting S3D into a hybrid MPI/OpenMP application capable of using MPI for inter-node data exchange, while using OpenMP within the node.

Profiles of this hybrid version of S3D running on a Cray XE6 show MPI traffic dominated by non-blocking communications, with large amounts of time spent in wait functions (see Table 1). Table 2 shows the corresponding message size profile. The use of large non-blocking messages makes S3D a good candidate for asynchronous progression.

TABLE 2
MPI message profile for S3D

  MPI Msg Size   Msg Count
  < 16           1795
  16 - 256       27
  256 - 4K       100806
  4K - 64K       2
  64K - 1M       298681

TABLE 3
S3D Time Step Summary

  # Application Threads   Progression disabled   Progression enabled
  14                      4.77                   3.93
  15                      4.68                   4.05
  16                      4.59                   4.06

In Table 3 we show the S3D runtime in seconds per time step with 14, 15, and 16 application threads per node. Our first set of measurements was performed with progression disabled. In the second set, progression (phase one) was enabled, with 2, 1, or 0 cores dedicated to progression using the CoreSpec method. In the final measurement, with 16 application threads and progression enabled, the interrupts generated by the progress mechanism will cause application threads to be descheduled. All tests were run on 64 nodes of a Cray XK system using AMD Interlagos processors. Best performance was obtained using 14 application threads, with two cpus used for the progress thread. A 14% reduction in overall runtime was achieved by enabling progression. The benefits of core specialization were relatively small for this hybrid MPI/OpenMP application, 3% of the 14% gain.

5.3 MILC7 Application Results

The MIMD Lattice Computation (MILC) code (version 7) is a set of codes developed by the MIMD Lattice Computation Collaboration for doing simulations of four-dimensional SU(3) lattice gauge theory on MIMD parallel machines. Profiling of this application showed that for certain types of calculations, the MPI message pattern was dominated by small (2-4 KB) to medium size (9-16 KB) messages at 8192 ranks (see Table 4).


TABLE 4
MPI message profile for MILC

  MPI Msg Size   Msg Count
  < 8            1651
  8 - 1K         283736
  1K - 2K        98304
  2K - 4K        1167192
  4K - 8K        35926
  8K - 16K       948208
  16K - 32K      107202

The application's use of MPI point-to-point messaging appears to lend itself to potential overlap of communication with computation. For the purposes of investigating the effectiveness of the thread-based asynchronous progression mechanism, Cray benchmarking modified the application to gather timing data for the main MPI message gather/scatter kernel within the application. An Asqtad R algorithm test case was run with a lattice dimension of 64x64x64x144. Unlike the S3D application, MILC is a pure MPI code and is typically run with one MPI rank per cpu.

The application was run four different ways on a Cray XE6 using AMD Interlagos processors. Runs were made without MPI progression, with MPI progression using the phase one method and one cpu per node reserved by CoreSpec, with MPI progression using the phase two method and one cpu per node reserved by CoreSpec, and with MPI progression using the phase one method but no cpus reserved for the progress threads. Table 5 gives the runtime in seconds for these different runtime settings. With CoreSpec reserving 1 cpu per node, the 8192 rank job requires 264 nodes as opposed to 256 nodes when running without CoreSpec. The modest runtime improvements when using the phase one progression mechanism correlate closely with the reduction in the average and maximum MPI wait time in the gather/scatter operation. For example, at 8192 ranks, the average MPI_Wait time in the application's gather wait function fell from 309 seconds to 270 seconds, and the maximum time fell from 719 seconds to 504 seconds, using the phase one progress method.

TABLE 5
MILC Run Time Summary (secs)

  Run Type                                         4096 ranks   8192 ranks
  No progression                                   2165         1168
  Progression (phase 1)                            2121         1072
  Progression (phase 2)                            3782         2138
  Progression (phase 1), no reserved cores         3560         2210
  Progression (phase 1), reserved core, no CoreSpec  2930       2070

The modest reduction in the average wait time when using the phase one progress mechanism, and the much higher wait times with the phase two mechanism, indicate that at least some of the ranks do not have sufficient work to fully mask the communication time. The message sizes are in the range where latency dominates, particularly when running a flat MPI application with many ranks per node. The poor performance of the phase two method for this application is most likely explained by the fact that the progress thread must wake up and dequeue CQEs before the main thread can make progress on completing a message. Since the application does not have sufficient computation to fully mask the communication time, the rank processes themselves have, in many cases, probably already entered the MPI_Wait function well before the message has been delivered. Thus the full overhead of the need to wake up the progress threads in order to dequeue CQEs from the TX CQ, plus contention for mutex locks, is encountered.

The times in the final row of Table 5, when compared with the times in the second row, show that CoreSpec leads to significantly better results for the phase one progress mechanism than can be obtained simply by running a reduced number of MPI ranks per node and leaving an extra cpu available.

6 CONCLUSIONS AND FUTURE WORK

The thread-based progression mechanism appears to have promise for use with applications on the Cray XE and future architectures. The results presented in this paper show that the phase one approach can be used with the class of MPI applications which tend to use larger messages in the 10-100 KB range. Hybrid MPI/OpenMP applications, which tend to have larger messages and fewer intra-node messages, are the most likely beneficiaries of this progress mechanism.

The preliminary results for the phase two approach have demonstrated that improved processor availability is realized for smaller messages when using the specialized MPI overlap test, although other micro-benchmarks and applications show a need to more efficiently process CQEs dequeued from CQs configured as blocking. Extensions to the GNI API to allow for more efficient CQE processing in such cases are currently being investigated.

Cray is also enhancing the CoreSpec mechanism to target compute units with hyper-thread support. In many cases, HPC applications cannot efficiently use all of the hyper-threads that a compute unit can support, yet these unused cpus should work very well for MPI progress threads.

In addition to software improvements to the thread-based progress mechanism, follow-on products to the Cray XE6 will have additional features, including many more in-band interrupts and lower-overhead access to the DMA engine, that will improve the performance of host thread-based MPI asynchronous progression techniques.

ACKNOWLEDGMENTS

The authors would like to thank John Levesque (Cray, Inc.) and Steve Whalen (Cray, Inc.) for their help with the S3D and MILC applications.

REFERENCES

[1] Cray Software Document S-2446-3103: Using the GNI and DMAPP APIs, March 2011.

[2] R. Alverson, D. Roweth, and L. Kaplan. The Gemini System Interconnect. In IEEE Symposium on High-Performance Interconnects, pages 83–87, 2010.

[3] J. Beecroft, D. Addison, D. Hewson, M. McLaren, D. Roweth, F. Petrini, and J. Nieplocha. QsNetII: Defining High-Performance Network Design. IEEE Micro, 25(4):34–47, July 2005.

[4] R. Brightwell, K. Pedretti, K. Underwood, and T. Hudson. SeaStar Interconnect: Balanced Bandwidth for Scalable Performance. IEEE Micro, 26(3):41–57, May-June 2006.

[5] R. Brightwell, R. Riesen, and K. D. Underwood. Analyzing the Impact of Overlap, Offload, and Independent Progress for Message Passing Interface Applications. Int. J. High Perform. Comput. Appl., 19:103–117, 2005.

[6] D. Buntinas, G. Mercier, and W. Gropp. Design and Evaluation of Nemesis, a Scalable, Low-Latency, Message-Passing Communication Subsystem. In CCGRID '06, pages 521–530, 2006.

[7] D. Doerfler and R. Brightwell. Measuring MPI Send and Receive Overhead and Application Availability in High Performance Network Interfaces. In Proceedings of the 13th European PVM/MPI User's Group Conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface, EuroPVM/MPI '06, pages 331–338, Berlin, Heidelberg, 2006. Springer-Verlag.

[8] E. R. Hawkes, R. Sankaran, J. C. Sutherland, and J. H. Chen. Direct Numerical Simulation of Turbulent Combustion: Fundamental Insights Towards Predictive Models. Journal of Physics: Conference Series, 16(1), 2005.

[9] J. W. III and S. Bova. Where's the Overlap? An Analysis of Popular MPI Implementations, 1999.

[10] R. Kumar, A. R. Mamidala, M. J. Koop, G. Santhanaraman, and D. K. Panda. Lock-Free Asynchronous Rendezvous Design for MPI Point-to-Point Communication. In Proceedings of the 15th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 185–193, Berlin, Heidelberg, 2008. Springer-Verlag.

[11] P. Lai, P. Balaji, R. Thakur, and D. K. Panda. ProOnE: a General-purpose Protocol Onload Engine for Multi- and Many-core Architectures. Computer Science - R&D, pages 133–142, 2009.

[12] MPICH2-Nemesis. Nemesis Network Module API. wiki.mcs.anl.gov/mpich2/index.php/Nemesis_Network_Module_API.

[13] S. Oral, F. Wang, D. Dillow, R. Miller, G. Shipman, D. Maxwell, D. Henseler, J. Becklehimer, and J. Larkin. Reducing Application Runtime Variability on Jaguar XT5. In Proceedings of the Cray User Group, 2010.

[14] H. Pritchard, I. Gorodetsky, and D. Buntinas. A uGNI-based MPICH2 Nemesis Network Module for the Cray XE. In Proceedings of the 18th European MPI Users' Group Conference on Recent Advances in the Message Passing Interface, EuroMPI '11, pages 110–119, Berlin, Heidelberg, 2011. Springer-Verlag.

[15] J. C. Sancho, K. J. Barker, D. J. Kerbyson, and K. Davis. Quantifying the Potential Benefit of Overlapping Communication and Computation in Large-Scale Scientific Applications. In SC Conference, 2006.

[16] Sandia Micro-Benchmark (SMB) Suite. www.cs.sandia.gov/smb/.

[17] SGI. Process Aggregates (PAGG). oss.sgi.com/projects/pagg.

[18] S. Sur, H.-W. Jin, L. Chai, and D. K. Panda. RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits. In Symposium on PPOPP, March 2006.

[19] F. Trahay, E. Brunet, A. Denis, and R. Namyst. A Multithreaded Communication Engine for Multicore Architectures. In International Parallel and Distributed Processing Symposium (IPDPS), 2008.