
Asymmetric Interactions in Symmetric Multi-core Systems: Analysis, Enhancements and Evaluation

T. Scogland, Dept. of Computer Science, Virginia Tech, [email protected]
P. Balaji†, Math. and Computer Science, Argonne National Lab, [email protected]
W. Feng, G. Narayanaswamy, Dept. of Computer Science, Virginia Tech, {feng, cnganesh}@cs.vt.edu

Abstract—Multi-core architectures have spurred the recent rapid growth in high-end computing systems. While the vast majority of such multi-core processors contain symmetric hardware components, their interaction with systems software, in particular the communication stack, results in a remarkable amount of asymmetry in the effective capability of the different cores. In this paper, we analyze such interactions and propose a novel management library called SyMMer (Systems Mapping Manager) that monitors these interactions and dynamically manages the mapping of processes on processor cores to transparently improve application performance. Together with a detailed description of the SyMMer library, we also present performance evaluation comparing SyMMer to a vanilla communication library using various micro-benchmarks as well as popular applications and scientific libraries. Experimental results demonstrate more than a two-fold improvement in communication time and 10-15% improvement in overall application performance.

I. INTRODUCTION

Multi- and many-core architectures [1]–[5] have been one of the primary driving factors in the recent rapid growth of high-end computing (HEC) systems. The great majority of these architectures are symmetric (i.e., homogeneous) in that the cores have equal and interchangeable computational parameters. Consequently, many application and systems software developers assume that such symmetry is a direct indication of the effective capability of each core in the system. Unfortunately, as we have shown in our previous work [6], this is often untrue.

The reason for this is that while the processor hardware itself is symmetric, the rest of the system is not. For example, most network hardware has not been designed to maintain state concerning which application process is running on which core of the system; thus hardware interrupts corresponding to data packets for different processes are either randomly distributed across all cores or are statically directed to a single core. This additional interrupt processing can cause one core to be more overloaded than others, thereby affecting performance. Further, on processors where each core has its own dedicated L1 and/or L2 cache, the core handling the interrupt fetches the data being communicated into its local cache for processing. Thus this data will not be equidistant from all cores. Consequently, further processing of the data is faster on some cores than on others. Finally, as we will show in later sections, assumptions made within a systems software stack (such as the MPI library) with respect to the symmetry of the different cores can propagate as skew to end applications, resulting in unintended out-of-sync communication patterns. In these and other examples, there can be a marked asymmetry in the effective capability of the different cores with respect to what each core can offer to the overall parallel application.

This research is funded in part by the Mathematical, Information, and Computational Sciences Division of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357 and the Department of Computer Science in the College of Engineering at Virginia Tech. †This author's work was also supported in part by the National Science Foundation Grant #0702182.

So, how can such asymmetry be dealt with? There are at least two options: (1) we can design a programming model specific to multi-core systems that exposes the asymmetry of the different cores to the application or systems software developer, allowing them to explicitly manage the workload of different processes; or (2) we can provide an automated approach, either through a library or an external system daemon, which monitors the processes and dynamically maps different processes to different cores to improve performance. Unfortunately, neither option is perfect. For the first option, even if we build a new programming model, having application writers port their applications is cumbersome and impractical. Also, the exact amount of asymmetry depends on various factors specific to the processor hardware and is difficult to abstract through a programming interface. For the second option, neither a library nor a daemon has explicit information from the application. Thus, any decision made will be based on monitoring the interactions and inferring the application's computation and communication patterns.

In this paper, we take the second approach to handling asymmetry in multi-core environments by designing a novel Systems Mapping Manager (SyMMer) library. Specifically, we first perform a detailed analysis of the asymmetric interactions between multi-core architectures and the communication stack. Next, we utilize this analysis to efficiently monitor these interactions and dynamically manage the mapping of processes onto processor cores so as to improve performance, while remaining completely transparent to the programmer and end user. To achieve this, SyMMer uses several heuristics to identify the types of asymmetry between the effective capabilities of cores and the processing requirements of the application. If substantial asymmetric behavior is detected, SyMMer transparently re-maps the processes, thereby achieving improved performance. While SyMMer is a generic framework that can fit into any communication library, in this paper our descriptions refer to a version plugged into the MPICH2 implementation [7] of the Message Passing Interface (MPI) [8].

Together with the detailed design of the SyMMer library, we also present performance evaluations comparing SyMMer to vanilla MPICH2 using various micro-benchmarks as well as the popular scientific libraries and applications GROMACS [9], LAMMPS [10], and the FFTW [11] library. Our experimental results demonstrate that our approach can provide more than a two-fold improvement in communication time and 10-15% improvement in overall application performance.

II. OVERVIEW OF MULTI-CORE ARCHITECTURES

For many years, hardware manufacturers have been replicating structures on processors to create multiple pathways allowing multiple instructions to run concurrently. Duplicate arithmetic and floating point units, co-processing units, and multiple thread contexts (SMT) on the same processing die are examples of such replication. Multi-core processors are the next step in such hardware replication, where two or more independent execution units are combined on the same integrated circuit.

At a high level, multi-core architectures are similar to multi-processor architectures. The operating system handles multiple cores in the same way as multiple processors: by allocating one process at a time to each core. Arbitration of shared resources between the cores happens completely in hardware, with no intervention from the OS. However, multi-core processors are also very different from multi-processor systems. For example, in multi-core processors, both computation units are integrated on the same die. Thus, communication between these computation units does not have to go outside the die, and hence is independent of the die pin overhead, making intra-die communication much faster than inter-die. Further, architectures such as the current Intel multi-core processors, as shown in Figure 1, provide a shared cache between the cores on the same die. This makes communication even simpler by eliminating the need for explicit communication protocols; the data is simply there in the local cache.

Fig. 1. Intel Dual-core Dual-processor System (two dual-core processors connected to main memory over the system bus; each core has a private L1 cache, and the two cores on each die share an L2 cache).

However, multi-core processors also have the disadvantage of more shared resources as compared to multi-processor systems. That is, multi-core processors might require different cores on a processor die to block waiting for a locally shared resource when the resource is being used by a different core. Such contention increases as the ratio of cores to other resources (e.g., memory controllers or shared caches) increases.

III. ASYMMETRIC INTERACTIONS IN MULTI-CORE SYSTEMS

Given the rise of multi-core architectures, their interactions with applications and systems software are important. This section describes such interactions.

A. Application and Developer Viewpoint of the Communication

The communication stack described in this section consists of the MPICH2 implementation of MPI, using Linux TCP/IP sockets, though the behavior is similar to other implementations of MPI as well.

On the transmission side, the application formulates a message and hands it over to MPI to transmit. MPI buffers this data, appends a header, and passes it to the TCP stack for transmission. On the receiver side, the network adapter transfers received packets to the socket buffer, where they are assembled, checked for validity, and held until requested by the application. When the application calls an MPI receive operation, this data is copied out of the socket buffer into the application-designated receive buffer.
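From the application's point of view, this entire path is hidden behind a matched send and receive. The following is a minimal sketch of that viewpoint (a hypothetical two-rank MPI program, not code from the paper); all of the buffering, header handling, and TCP processing described above happens beneath these two calls.

```c
#include <mpi.h>

/* Hypothetical two-rank example of the application-level view of the
 * path described above; buffering, header handling, and TCP processing
 * all happen inside the MPI and sockets layers beneath these calls. */
int main(int argc, char **argv)
{
    char buf[64 * 1024];          /* 64 KB message, illustrative size */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* MPI buffers the data, appends a header, hands it to TCP. */
        MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Data assembled in the socket buffer is copied into buf here. */
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```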


B. Architectural Viewpoint of the Communication Stack

While the path the message follows is straightforward when viewed from an application or developer standpoint, there are many hidden side effects below the clean abstraction provided by the higher layers of the communication stack. In this section, we present the impact these effects can have on a core's effective capability to perform computation and communication.

1) Processing Impact: As described in Section III-A, when a packet arrives, the network adapter places the data in memory to be handled by TCP, after which an interrupt is raised to inform the communication stack that there is a message to process. For most system architectures, the processing core to which the interrupt is directed is either statically or randomly chosen using utilities such as irqbalance. However, in both approaches, the core to which the interrupt is assigned is not guaranteed to be the same core on which the process performing the relevant communication resides. Thus, all of the processing done on the packet at the receiver takes place on the chosen core, including data integrity checks, connection demultiplexing, and other compute-intensive operations that can significantly impact the computational load on that core.

Note that this protocol-processing load is in addition to whatever computation the application process is performing. Thus, as far as the application processes are concerned, the chosen core tends to have a reduced effective computational capability as compared to the remaining cores.
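For illustration, on Linux the static case can be reproduced by writing a CPU bitmask to /proc/irq/&lt;irq&gt;/smp_affinity; the sketch below assumes a hypothetical IRQ number, which in practice would be taken from /proc/interrupts for the network adapter.

```c
#include <stdio.h>

/* Sketch: statically direct a NIC's interrupts to core 0 by writing a
 * CPU bitmask to /proc/irq/<irq>/smp_affinity (requires root). The IRQ
 * number below is a placeholder taken from /proc/interrupts in practice. */
int main(void)
{
    const char *path = "/proc/irq/90/smp_affinity";  /* hypothetical IRQ */
    FILE *f = fopen(path, "w");

    if (!f)
        return 1;
    fprintf(f, "1\n");            /* bitmask 0x1 = deliver to core 0 only */
    fclose(f);
    return 0;
}
```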

Fig. 2. MPI Bandwidth on Intel (bandwidth in Mbps versus message size from 1 byte to 4 MB, with the application process bound to core 0, 1, 2, or 3).

2) Cache Transaction Impact: Aspects of protocol processing such as data copies and checksum-based data integrity require the communication stack to touch the data before handing it over to the application (through the socket buffer). For example, on the receiver side, the network adapter places the incoming data in memory and raises an interrupt to the device driver. However, when the TCP/IP stack performs a checksum of this data, it has to fetch the data into its local cache. Once the checksum is complete, when the application process has to read this data, it has to fetch the data into its own local cache. That is, if the application process resides on the same die as the core performing the protocol processing, then the data is already on the die and can be quickly accessed. Instead, if the application process resides on a different die, then the data has to be fetched using a cache-to-cache transfer.

To demonstrate these effects (processing impact and cache transaction impact), we measured the communication bandwidth between two processes on two Intel dual-processor/dual-core machines, with the processes bound to different cores of the system. Figure 2 shows these measurements. The interrupts (and hence the protocol processing) are always directed to core 0 in this case. As shown in the figure, when the application processes are also bound to core 0, they have to compete with the protocol processing that is occurring on the same core and hence suffer a performance penalty. On the other hand, when the application processes are bound to core 1, they do not have to face this additional protocol processing overhead. Further, since core 0 performs the protocol processing, the communication data is available in the local die's shared L2 cache (Figure 1), and hence can be easily used by core 1. Thus, core 1 does not face any of the protocol processing overheads but has the benefit of in-cache access to its data through a free ride from core 0. This results in the best possible performance for this situation. Cores 2 and 3 do not have to face the protocol processing overheads, but do not have the benefits of increased cache hits either. Thus, their performance is in between that of cores 0 and 1. This behavior is confirmed in Figures 3(a) and 3(b), using the actual performance counters.

C. Identifying the Symptoms of Communication Stack and Multi-core Architecture Interaction

Directly understanding the actual interactions between the communication stack and the multi-core architecture is complicated and requires detailed monitoring of various aspects of the kernel and hardware as well as correlation between the various events. Therefore, we take an indirect approach to understanding these interactions by monitoring for symptoms in the application behavior that are triggered by known interactions, instead of monitoring the interactions themselves. We should note that while a certain interaction can result in a symptom, the occurrence of the symptom does not necessarily mean that the interaction has taken place. That is, each symptom can have a number of causes that could have triggered it.


Fig. 3. (a) Interrupts Per Message (number of interrupts, log scale, versus message size) and (b) Cache Analysis (percentage difference of L2 cache misses versus message size), for processes bound to cores 0 through 3.

In this section, we discuss the various symptoms that we need to monitor in order to infer that an interaction has taken place. In Section IV, we describe our approach to monitoring these symptoms and minimizing the impact of the interactions through appropriate metrics.

1) Symptom 1: Communication Idleness: As noted in Section III-B, if a core is busy performing protocol processing, the number of compute cycles it can allocate to the application process is lower than on other cores, thus slowing down the process allocated to this core. Therefore, a remote process that is trying to communicate with this slower process would observe longer communication delays and be idle for longer periods as compared to other communicating pairs. This symptom is referred to as communication idleness. Again, as noted earlier, communication idleness can occur for a number of reasons, including native imbalance in the application's communication model.

2) Symptom 2: Out-of-Sync Communication: Communication middleware such as MPI performs internal buffering of data before communicating. Assuming both the sender and receiver have equal computational capabilities, there would not be any backlog of data at the sender, and the MPI internal buffering would not be utilized. Let us consider a case where process A sends data to process B and both processes compute for a long time. Then process B sends data to process A, and again both processes compute for a long time. Now, suppose process B is slower than process A, and A wishes to send data to B. In this case, process A's MPI library would have to buffer the data since process B is not ready to receive more. From process A's perspective the send has completed once the data has been handed over to the MPI library; thus, it moves on to perform its computation.

After its computation, when it tries to receive data from process B, it sees that the previous data that it attempted to send is still buffered and tries to send it out again. By now, B is ready to receive more data and the send is successful. After receiving the data, process B goes off to perform its computation, while process A waits to receive its data. This behavior is caused because, despite the fact that processes A and B are performing similar tasks, they are slightly out of sync due to the difference between their effective computational capabilities.

3) Symptom 3: Cache Locality: As mentioned in Section III-B, when a core performs protocol processing, it fetches the data into its cache in order to perform the data integrity check, among other things. Thus, if the process which is waiting for that data is either on that core or on another core on the same die, it will have the data without any further delay. On the other hand, if the process is on another die, it will require a costly inter-die data transfer to obtain it. Thus, a process which is transferring large amounts of data will gain a performance benefit from sharing a die with the core processing the communication interrupts. In addition to this, we find that processes which share data locally also present this symptom. For example, if one process is frequently sharing data with another process on the same machine, the data will be transferred between their caches on a regular basis, and both can benefit from a shared L2 cache.

D. Our Focus

As shown above, the effective capabilities of logically identical cores are frequently not the same. As a result, where a process is currently running can greatly affect its performance. The appropriate response to this issue is to map the processes onto the correct cores. Traditionally, the role of making sure that processes are on the best possible processing unit has fallen to the system scheduler, but current schedulers do not take effective capability into account. Another option is to map the processes manually, as we did in our previous work [6], but the time and difficulty of that method make it impractical for many situations, such as projects where the code base changes frequently, or applications where the work done per process changes frequently. As such, an automatic or dynamic approach is necessary. In the rest of this paper we will describe and evaluate our solution, the dynamic mapping framework known as SyMMer.

IV. THE SYMMER LIBRARY

This section describes our novel Systems Mapping Manager (SyMMer) library and its associated monitoring metrics. The SyMMer library, shown in Figure 4, is an interactive monitoring, information sharing, and analysis framework that can be tied into existing communication middleware such as MPI or OpenMP. The library directly implements the monitoring, communication, and analysis components, while the actual metrics that are used for decision making are separately pluggable, as we will describe in Section IV-A. This allows the design of metrics and the programming environment where they are used to be completely independent of each other.

Fig. 4. The SyMMer Component Architecture: the monitoring, communication, and analysis components sit between the application and the communication library, with SyMMer information flowing between them.

Interaction Monitoring: The interaction monitoring component is responsible for monitoring the system for information. This includes system-specific information (hardware interrupts, software signals), communication middleware-specific information (MPI data buffering time and other internal stack overheads), and processor performance counters (cache misses, interrupts). This component utilizes existing libraries and the operating system for as much of the instrumentation as possible, while relying on built-in functionality for the rest. For example, processor performance counters are measured using the Performance Application Programming Interface (PAPI [12]) and system-specific information through the proc file system. While MPI-specific information can be monitored through libraries such as PERUSE [13], several of the current MPI implementations do not support these libraries yet. Thus, we use built-in profiling functionality to obtain such information.
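As an illustration of the performance-counter portion of this monitoring, the following sketch uses PAPI's low-level API to count L2 cache misses around a region of interest; the event used and the placement of the measured region are illustrative, and event availability depends on the platform.

```c
#include <papi.h>
#include <stdio.h>

/* Illustrative sketch: count total L2 cache misses around a code region
 * using PAPI's low-level API. Event availability is platform dependent. */
int main(void)
{
    int eventset = PAPI_NULL;
    long long misses = 0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_L2_TCM);   /* total L2 cache misses */

    PAPI_start(eventset);
    /* ... region of interest, e.g. touching a received message ... */
    PAPI_stop(eventset, &misses);

    printf("L2 misses: %lld\n", misses);
    return 0;
}
```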

In order to minimize monitoring overhead, the monitoring component dynamically enables only the components which are required for the metrics being used. For example, if no metric makes use of processor performance counters, such information is not monitored.

Communication: The communication component is responsible for the exchange of state or data between different processes in the application. Several forms of communication are supported, including point-to-point sharing models (for sending data to a specific process), collective sharing models (for sending data to a group of processes), and bulletin board models (for publishing events that can be asynchronously read by other processes). For each of these models, both intra-node communication (between cores on the same machine) and inter-node communication (between cores on different machines) are provided. Inter-node communication is designed to avoid out-of-band communication by making use of added fields in the communication middleware's existing headers. Whenever a packet is sent, the sender adds the information which needs to be shared to the outgoing header. The receiver, on receiving the header, shares this information with other local processes using regular out-of-band intra-node communication. This approach has the advantage that any single inter-node communication can share information about all processes on the node. Intra-node communication, on the other hand, has been designed and optimized using shared memory without requiring locks or communication blocking of any kind. This provides a great deal more flexibility and reduces the overhead of our framework significantly.
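The sketch below illustrates the piggybacking idea with a hypothetical header extension; the struct and field names are purely illustrative and do not reflect MPICH2's actual wire format.

```c
#include <stdint.h>

/* Hypothetical layout for piggybacking monitoring state on an outgoing
 * message header; the struct and field names are illustrative only and
 * do not reflect MPICH2's actual header format. */
struct symmer_piggyback {
    uint32_t rank;            /* sender's local rank on its node         */
    uint32_t core;            /* core the sender is currently bound to   */
    float    idleness_ratio;  /* idle time / compute time (Section IV-A) */
};

/* The sender appends a struct symmer_piggyback after its regular header;
 * the receiver strips it off and republishes it to the other processes
 * on its node over shared memory, so a single inter-node message can
 * carry state for all processes on the sending node. */
```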

Analysis: The information collected by the monitoring component in each process and shared with other processes is raw, in the sense that there is no correlation between the different pieces of information. Further, the monitored information is low-level data which needs to be processed and summarized into higher-level, more compact information before it can be processed by the various metrics. The information analysis component performs all the analysis and summarization of the data before passing it along. This component also allows the various pluggable metrics to be arbitrarily prioritized for cases where an application may show multiple symptoms. Finally, each monitoring event has a certain degree of inaccuracy or burstiness associated with it. Thus, some monitoring events have more data noise than others. To handle such issues, the analysis component allows different monitors to define confidence levels for their monitored data. Thus, depending on the number of events that are received, the analysis component can accept or discard different events based on their confidence levels, using appropriate thresholds.

In addition to its duties in preparing data for the metrics to use, the analysis component takes action when a metric determines that it is required. Once the analyzed data identifies two processes which are currently scheduled on cores that are not best suited for them, but which can potentially improve performance by swapping the processes between the cores, the analysis component is responsible for performing the swap as quickly and efficiently as possible. The communication component comes into play in a handshake phase used to minimize the time the processes spend on the same core during a swap. For the purposes of this paper, the actual mapping and swapping is accomplished using the get- and set-affinity functions available on Linux to set each process to have an affinity with only one core, and when swapping, simply changing the affinity of each process to its new target core. The design does not mandate that this be the method used, and in fact would benefit from the availability of an interface which would allow one to get and set the current core without affinity being set.
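A minimal sketch of this mechanism on Linux is shown below, using sched_setaffinity; the handshake and the decision of which core to move to are omitted.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to a single core; this is one way a
 * SyMMer-style remap can be realized on Linux. The handshake that
 * decides which core to move to is omitted. */
static int bind_to_core(int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* pid 0 = self */
}

int main(void)
{
    /* Swapping two processes amounts to each one calling bind_to_core()
     * with the core its peer was previously bound to. */
    if (bind_to_core(1) != 0)
        perror("sched_setaffinity");
    return 0;
}
```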

A. Metrics for Mapping Decisions

In this section, we discuss different metrics that can be plugged into the SyMMer library. Specifically, we focus on metrics that address the symptoms noted in Section III-C. Since our current reference implementation makes use of MPI, these metrics are described in relation to its facilities, but they are equally applicable to other implementations and models. In all cases, the default mapping is assumed to be random with respect to how the capabilities of the cores correspond to the demands of the processes. In practice this is supported by the fact that, while the scheduler generally puts processes on cores in order, the capabilities of each core per machine, the demands of the processes, and the order in which processes are launched are sufficiently variable for the mapping to be considered random.

1) Communication Idleness Metric: This metric is based on the communication idleness symptom defined in Section III-C. The main idea of this metric is to calculate the ratio of the idle time (waiting for communication) to the computation time of different processes. This metric utilizes the MPI monitoring capability of the SyMMer framework to determine this ratio. The idle time is measured as the time between the entry and exit of each blocking MPI call within MPI's progress engine. Similarly, the computation time is measured as the time between the exit of one blocking MPI call and the entry of the next. The computation time thus represents the amount of computation done by the application process, assuming that the process does not block or wait on any other resource.
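For illustration, similar timing could be collected outside the library through the MPI profiling interface; the sketch below wraps a blocking receive, counting the time inside the call as idle time and the time between calls as computation time. The accumulator variables are illustrative; SyMMer itself gathers this information inside MPI's progress engine rather than in PMPI wrappers.

```c
#include <mpi.h>

/* Illustrative accumulators; a real implementation would sit inside the
 * MPI library's progress engine rather than in PMPI wrappers. */
static double idle_time    = 0.0;   /* time spent blocked in MPI     */
static double compute_time = 0.0;   /* time spent between MPI calls  */
static double last_exit    = 0.0;   /* timestamp of last MPI exit    */

int MPI_Recv(void *buf, int count, MPI_Datatype type, int src, int tag,
             MPI_Comm comm, MPI_Status *status)
{
    double entry = MPI_Wtime();
    int rc;

    if (last_exit != 0.0)
        compute_time += entry - last_exit;   /* exit -> entry = compute */

    rc = PMPI_Recv(buf, count, type, src, tag, comm, status);

    last_exit = MPI_Wtime();
    idle_time += last_exit - entry;          /* entry -> exit = idle */
    return rc;
}
/* idleness ratio ~ idle_time / compute_time, compared across local ranks */
```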

Hence, the communication idleness metric represents the idleness experienced by each process. For example, a process which has high communication idleness can accommodate other computation, such as protocol processing. Processes with very little idle time, on the other hand, cannot tolerate even a small amount of protocol processing without seeing material performance degradation. Comparing this idleness factor between different processes provides an idea of which processes are more suited to sharing the protocol processing overhead. Clearly, the idleness metric needs to be compared only for processes running on the same node; hence this metric only uses the intra-node communication channel discussed above.

Once the idleness is compared, if a process on a core other than the protocol processing core is experiencing a high idleness ratio, and the process on the protocol processing core is experiencing very low or no idleness, this metric determines that they are suitable for a swap. Figure 5 illustrates this process by depicting each state a process may be in, and the core or cores it might wish to switch to from that state. Once the swap is made, the idleness of the two processes should be more similar, both to each other as well as to the average idleness of all local processes. While this on the surface appears similar to the cache locality metric, there is a significant difference in that the locality metric tries to improve locality between local processes based on cache misses, whereas the idleness metric attempts to match the computation or communication demands of a process with a core whose capabilities will help balance the uneven load.

2) Out-of-Sync Communication Metric: The out-of-sync metric captures the amount of computation performed by a process while it has unsent data buffered in its internal MPI buffers, and compares that with the wait time of other processes. As described in Section III-C, this metric represents the case where unsent data followed by a long compute phase results in high wait times on the receiving process. When the computation time (with buffered data) is above a threshold, a message is sent to all processes informing them of this symptom. Similarly, when the wait time of a process is above a threshold, this information is distributed as well. If communicating peer processes observe these conditions, a leader process is dynamically chosen, which then analyzes the data and decides on the best mapping. It then informs all processes concerned of the new mapping.

The mapping decision works to move each process to a core that is similar to the one its peer is running on. In some cases a disparity between only two processes can cause all processes in an application to see out-of-sync communication, and fixing that one pair repairs this behavior for all the others as well. This metric is unique among the three discussed in this paper in that it makes a completely distributed decision. In fact, it occurs between a minimum of two nodes, and all nodes presenting the symptom make a decision that one or more nodes should change their mappings.

Fig. 5. Process states: each circle is a possible process behavior (a combination of high or low computation and high or low communication), each box a level of core capability (low, average, or high), and each line a potential destination to improve the performance of the process, marked as a high- or low-priority move (based on the Idleness Metric).

3) Cache Locality Metric: The cache locality metric utilizes the L2 cache misses monitored by the SyMMer library. This metric is specific to the processor architecture and relies on the locality of cache between cores on the same die. If the number of cache misses for a process is sufficiently greater than those observed by another process on the same node, then these two processes can be swapped, as long as the communicating process is moved closer to the core performing the protocol stack processing. In some cases it can even determine that the protocol processing core is the best location to place a process which has high enough cache misses, since that core will have no misses, the same as the other core on the same die, allowing two processes to be mapped in this manner rather than just one. This is an example of prioritized metrics: in these cases, following the communication idleness metric alone would yield a suboptimal result. Again, this metric relies only on intra-node communication, as it is only used to switch processes to the respective cores within the same node.

V. EXPERIMENTAL EVALUATION

In this section, we evaluate our proposed approach with multiple microbenchmarks in Section V-A and with the GROMACS and LAMMPS applications and the FFTW Fourier transform library in Section V-B. The analysis of each microbenchmark varies a parameter which we found to affect both the performance of the benchmark and the asymmetry we have been monitoring; the applications lack such a parameter because they do not expose malleable characteristics of this kind. Thus the applications are presented as a simple comparison of execution time with and without the SyMMer library.

The cluster testbed used for our evaluation consisted of two Dell PowerEdge 2950 servers, each equipped with two dual-core Intel Xeon 2.66 GHz processors. Each server has 4 GB of 667 MHz DDR2 SDRAM. Each processor has a 4 MB L2 cache which is shared between two cores. The machines run the Fedora Core 6 operating system with Linux kernel version 2.6.18 and were connected using NetEffect NE010 10-Gigabit Ethernet network adapters.

A. Microbenchmark Evaluation

In this section, we evaluate the three microbenchmarks we developed specifically to test each of the symptoms described in Section III-C individually, and the benefits achievable by the SyMMer library in these cases. Each microbenchmark was thoroughly tested to ensure that it stressed its own symptom and not the others. Since cache locality can always affect performance, it should be noted that while it does make a difference in the first two benchmarks, that difference is completely eclipsed by the symptom they are designed to test and is lost as statistical noise. In Section V-B, we evaluate the SyMMer library with real applications and scientific libraries to show its impact in the real world.

Fig. 6. Communication Idleness Benchmark Performance (execution time in seconds versus idleness ratio, for vanilla and SyMMer-enabled MPICH2).

1) Communication Idleness Benchmark: The communication idleness benchmark stresses the performance impact of delays due to irregular communication patterns. In this benchmark, processes communicate in pairs using MPI_Send and MPI_Recv, with each pair performing a different ratio of computation between each communication step. Thus, a pair that is performing less computation spends more time in an idle state waiting for communication. Such processes are less impacted by the protocol processing overhead on the same core as compared to other processes which spend more of their time doing computation.
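A minimal sketch of this communication pattern is shown below; the message size, iteration count, and compute loop are illustrative placeholders rather than the benchmark's actual parameters.

```c
#include <mpi.h>

/* Sketch of the idleness pattern: ranks form pairs, and each pair does a
 * different amount of computation per step, so lightly loaded pairs spend
 * more time idle in MPI_Recv. Work amounts and counts are illustrative. */
static void compute(int units)
{
    volatile double x = 0.0;
    for (long i = 0; i < (long)units * 10000000L; i++)
        x += i * 0.5;
}

int main(int argc, char **argv)
{
    double buf[1024];
    int rank, peer, work;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    peer = rank ^ 1;                /* ranks 0-1, 2-3, ... form pairs  */
    work = (rank / 2) + 1;          /* each pair gets a different load */

    for (int iter = 0; iter < 100; iter++) {
        if (rank % 2 == 0) {
            MPI_Send(buf, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
        }
        compute(work);              /* lightly loaded pairs idle longer */
    }

    MPI_Finalize();
    return 0;
}
```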

Figure 6 shows the performance of the communication idleness benchmark using SyMMer-enabled MPICH2 as compared to vanilla MPICH2. We define the idleness ratio to be the ratio between the computation done by the pair doing the most computation and the pair doing the least. This ratio hence represents the amount of computational irregularity in the benchmark. Thus an idleness ratio of 1 means that all the processes in the benchmark perform the same amount of computation, while a value of 4 means that one communicating pair performs up to 4 times more computation than another.

In Figure 6, we plot the time taken for the benchmark to execute with various idleness ratios. We observe that both vanilla and SyMMer-enabled MPICH2 perform the same for an idleness ratio of 1. This is expected, given that an idleness ratio of 1 represents a completely symmetric benchmark with no irregularity in computation; thus there is no scope for SyMMer to improve performance in this case. We do observe, however, that as the idleness ratio increases, the performance of the benchmark with vanilla MPICH2 increasingly depends on the mapping of processes to cores. That dependence makes it possible for SyMMer to use the disparity in computation to achieve a better arrangement, reducing runtime by up to about 30%.

To further analyze the behavior of the benchmark, we show the distribution of time spent in the different parts of communication in Figures 7(a) and 7(b) for vanilla MPICH2 and SyMMer-enabled MPICH2, respectively. For consistency, both are based on the same initial mapping of processes to cores with an idleness ratio of 4. Figure 7(a) shows that the wait times for the different processes are quite variable. This means that some processes spend a lot of time waiting for their peer processes to respond, while other processes are overloaded with the application computation as well as protocol processing overhead. Figure 7(b), on the other hand, illustrates the distribution for SyMMer-enabled MPICH2. As shown in the figure, SyMMer keeps the wait time more even across the various processes, thus reducing the overhead seen by the application.

2) Out-of-Sync Communication Benchmark: This benchmark emulates the behavior where two application processes perform a large synchronous data transfer and a large computation immediately thereafter. Because some of the user data may be buffered, usually due to a full buffer in a lower-level communication layer, there is a possibility that the processes may go out of sync. In this case, that refers to the situation where the sending and receiving processes are not assigned to cores with equal effective capabilities. This results in data being buffered at the sender node, causing the send to be delayed, which in turn results in the receiver process waiting not only for the amount of time it takes to send the data, but also for the time needed to complete the computation which logically follows the send and receive pair.
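A sketch of this pattern is shown below; the message size, iteration count, and compute loop are illustrative placeholders rather than the benchmark's actual parameters.

```c
#include <mpi.h>

/* Sketch of the out-of-sync pattern: a large exchange followed by a long
 * computation. If one side sits on a weaker core, its send is buffered and
 * the peer ends up waiting through both the transfer and the computation. */
#define MSG_DOUBLES (1 << 19)       /* ~4 MB of doubles, illustrative */

static double sendbuf[MSG_DOUBLES], recvbuf[MSG_DOUBLES];

static void long_compute(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 200000000L; i++)
        x += i;
}

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int iter = 0; iter < 20; iter++) {
        if (rank == 0) {
            MPI_Send(sendbuf, MSG_DOUBLES, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            long_compute();
            MPI_Recv(recvbuf, MSG_DOUBLES, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(recvbuf, MSG_DOUBLES, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            long_compute();
            MPI_Send(sendbuf, MSG_DOUBLES, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```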


Fig. 7. Communication Idleness Benchmark division of time (percentage of time spent in computation, wait, and communication per process rank): (a) Vanilla MPICH2 (b) SyMMer-enabled MPICH2.

Fig. 8. Out-of-Sync Communication Benchmark: (a) Performance (execution time in seconds versus message size) (b) MPI data buffering time (seconds blocked), for vanilla and SyMMer-enabled MPICH2.

Figure 8(a) shows the performance of the out-of-sync communication benchmark using vanilla MPICH2 and SyMMer-enabled MPICH2; it compares the total time in seconds to execute the benchmark for various message sizes used in the communication step. Similar to the communication idleness benchmark, SyMMer-enabled MPICH2 performs as well as or better than vanilla MPICH2 in all cases. At message sizes up to 256 KB, both vanilla and SyMMer-enabled MPICH2 have the same performance. This is because messages at and below 256 KB are sent out without buffering by MPICH2 (eager messages). Because of the absence of buffering up to 256 KB, there can be no out-of-sync behavior, which represents our base case where there is no scope for performance improvement. For message sizes above 512 KB, however, we observe that SyMMer-enabled MPICH2 consistently outperforms vanilla MPICH2 by up to 80%. As the message size continues to rise above 512 KB, however, the performance gap between SyMMer and vanilla begins to close. We believe this to be caused by MPI's tendency to resist buffering larger messages when possible. A detailed analysis of this is left for future work.

The internal workings of the benchmark are demonstrated in Figure 8(b), which shows the times when data is buffered within the MPI library, running under the same initial conditions with and without SyMMer. As shown in the figure, the data buffering time is almost an order of magnitude less when using SyMMer. It should be noted that, when an out-of-sync message occurs with SyMMer, the time taken is the same, but since SyMMer corrects the error in synchronization, it does not recur. This ultimately results in the performance improvement demonstrated in Figure 8(a).

3) Cache Locality Benchmark: This benchmark stresses the cache locality of processes by doing the majority of the network communication in certain processes in the application. Thus, depending on which core the communicating processes are mapped to, they may present the cache locality symptom discussed in Section III-C, which indicates an impact on the performance of the application. Hence, when the communicating processes are not on the same processor die as the core performing protocol processing, they can potentially take a severe performance hit.

In our benchmark, processes communicate in pairs wherein two pairs of processes are engaged in heavy inter-node communication, while the other pairs perform computation and exchange data locally. We measure the performance as the time it takes to complete a certain number of iterations of the benchmark. This is portrayed in Figure 9(a), which showcases performance by comparing the total execution time against a computational load factor, which is effectively a measure of the amount of work done per run. We find that as the computational load factor increases, SyMMer is able to outperform vanilla MPICH2 by up to 29%.

We profile the benchmark and count the number of L2 cache misses observed by each process to gain a further understanding of SyMMer's behavior. Figure 9(b) shows the total number of cache misses observed by the processes performing inter-node and intra-node communication, respectively. We observe that the number of cache misses is significantly decreased for the inter-node communicating processes, and, despite the migration and other overhead incurred in the process, the cache misses fall for the intra-node communicating processes as well. As it turns out, the intra-node communicating processes gain two benefits from SyMMer that we did not initially anticipate. First, their locality with respect to the local processes they communicate with improves. Second, moving them away from the die doing the inter-node processing reduces the amount of cache thrashing they have to contend with. Thus SyMMer is able to improve not only the cache locality of the inter-node communicating processes but also that of the intra-node communicating processes.

B. Evaluating Applications and Scientific Libraries

In this section, we evaluate the performance of two molecular dynamics applications, GROMACS and LAMMPS, and the FFTW Fourier transform library, and demonstrate the performance benefits achievable using SyMMer.

1) GROMACS Application: GROMACS (GROningen MAchine for Chemical Simulations) [9] is a molecular dynamics application developed at Groningen University, primarily designed to simulate the dynamics of millions of biochemical particles in a molecular structure. GROMACS is optimized for locality of processes. It splits the particles in the overall molecular structure into segments, distributes different segments to different processes, and each process simulates the dynamics of the particles within its segment. If a particle interacts with another particle that is not within the process' local segment, MPI communication is used to exchange information regarding the interaction between the two processes. The overall simulation is broken into many steps, and performance is reported as the number of nanoseconds of simulated time per day of execution. For our measurements, we use the GROMACS LZM application.

2) LAMMPS Application: LAMMPS [10] is a molecular dynamics simulator developed at Sandia National Laboratories. It uses spatial decomposition techniques to partition the simulation domain into small 3D sub-domains, one of which is assigned to each processor. This allows it to run large problems in a scalable way wherein both memory usage and execution time scale linearly with the number of atoms being simulated. We use the Lennard-Jones liquid simulation with LAMMPS, scaled up 64 times, for our evaluation and use the communication time to measure performance.

3) FFTW Scientific Library: Fourier transform libraries are extensively used in several high-end scientific computing applications, especially those which rely on periodicity of data volumes in multiple dimensions (e.g., signal processing, numerical libraries). Because computing the Fourier transform directly is computationally expensive, scientists typically use Fast Fourier Transform (FFT) algorithms to compute the transform and its inverse. FFTW [11] is a popular parallel implementation of the FFT.

Fig. 10. Performance Evaluation comparing vanilla and SyMMer-enabled MPICH2: (A) LAMMPS (communication time) and (B) FFTW (execution time), lower is better; (C) GROMACS (ns/day), higher is better.

Fig. 9. (a) Cache locality benchmark performance (execution time versus computational load factor) and (b) cache miss analysis (L2 cache misses for the inter-node and intra-node communicating processes), comparing vanilla and SyMMer-enabled MPICH2.

Figure 10 illustrates the performance achieved by SyMMer-enabled MPICH2 as compared to vanilla MPICH2 for GROMACS, LAMMPS, and FFTW. As shown in the figure, SyMMer-enabled MPICH2 can remap processes to the right cores so as to maximize performance, resulting in a performance improvement across the board. For GROMACS, the overall application execution time is presented in the form of its own internal performance measure, ns/day, which shows an improvement of about 10-15%. For LAMMPS, communication overhead is presented because communication in LAMMPS tends not to scale with problem size, making it a more stable measure; it shows nearly a two-fold improvement in performance. For FFTW, execution time is presented based on the internal average execution time provided by the benchmark; we noticed only about a 3-5% performance difference in our experiments. This is due to the small communication volumes that are used for FFTW in our experiments. Given that SyMMer monitors for interactions between the communication protocol stack and the multi-core architecture, small data volumes mean that such interaction would be small as well.

In summary, we see a noticeable improvement in performance with SyMMer-enabled MPICH2 in all three cases. This shows that dynamic process-to-core mapping is a promising approach to minimizing the impact of the interactions of the communication protocol stack with the multi-core architecture, and allows us to improve application performance significantly in some cases.

VI. DISCUSSION ON ALTERNATIVE MULTI-CORE ARCHITECTURES

While we demonstrated our approach on the Intel multi-core architecture, the idea of dynamically mapping application processes to the cores in a system is relevant to most current and next-generation multi- and many-core systems. For example, many-core accelerator systems such as GPGPUs provide complicated hierarchical architectures. The new NVidia Tesla system, for instance, has up to 16 GPUs, with each GPU having up to 128 processing cores. The cores in a GPU are grouped together into different multi-processor blocks, where cores within a block communicate via fast shared memory, while cores across blocks share lower-speed global device memory. Cores on different GPUs can only communicate through the system bus and host memory. Such architectures are clearly highly sensitive to the placement of processes on the different computation units, and would certainly benefit from the knowledge gained in the development of a mapping library such as SyMMer, if not from the facilities of the library itself.

AMD-based multi-core architectures are quite similar to Intel-based multi-core architectures, and as such one might think they would behave the same. There are, however, a few key differences which need to be addressed before they can utilize SyMMer-like process management libraries to full effect. Specifically, the non-uniform memory access (NUMA) model which they use makes process management more complicated as compared to Intel systems. For example, on an Intel system, SyMMer could freely move any process to any core in the system, with the only migration cost being re-populating the new cache from main memory. However, on an AMD system, as soon as a process touches a memory buffer, this buffer is allocated in its local memory. If the process is subsequently migrated to a different die, all of its memory accesses will be remote, resulting in significant overhead. For example, we noticed that for the LAMMPS application, process migration can increase non-local memory transfers by 17-31 fold in some cases. While the SyMMer architecture is still valid for AMD systems, aspects such as page migration between different memory controllers become important concerns in achieving good performance. In the absence of such page migration, SyMMer would be restricted to mapping processes only to cores within a die, which would result in a more limited improvement in performance.

Finally, with the upcoming network-on-chip (NoC) architectures such as the Intel terascale processors and the Tilera processors, dynamic process-to-core mapping will become increasingly important, primarily owing to the huge number of cores that will reside on the same die. For example, the Intel terascale processors would accommodate 80 cores per die, while Tilera processors accommodate 64. In such cases, assigning a process to the wrong die could lead to a significant amount of inter-die communication, which would be largely limited by the die pin overhead. A library like our novel process-to-core mapping framework could dynamically search for the right assignments and use them to achieve better and more consistent performance.


VII. RELATED WORK

While both multi-core architectures and communication protocol stacks are heavily studied topics, to the best of our knowledge, there is no existing literature that directly studies the interactions between these two aside from our previous paper [6].

In [14], the authors quantify the performance difference in asymmetric multi-core architectures and define operating system constructs for efficient scheduling on such architectures. This closely relates to the effective capability of the cores (similar to the current paper), but the authors only quantify asymmetry that already exists in the architecture (i.e., different core speeds). In our work, we study the impact of the interaction of multi-core architectures with communication protocol stacks, which externally creates such asymmetry through aspects such as interrupts and cache sharing.

In [15], the authors study the impact of the Intel multi-core architecture on the performance of various applications by looking at the amount of intra-node and inter-node communication performed. Some of the performance problems of multi-cores are identified in that paper, and the importance of multi-core-aware communication libraries is underlined. They also discuss a technique of data tiling for reducing cache contention, which is dependent on the cache size. While the underlying principle and approach of that work differ significantly from ours, we believe that these two techniques can be utilized in a complementary manner to further improve performance.

The authors of [16], [17] look at the impact of shared caches on scheduling processes or threads on multi-core architectures, wherein certain mappings of processes to cores utilize the shared cache better than others. We improve upon their work by generalizing the extraneous factors that affect application performance on multi-core systems and by devising a framework for dynamically performing the ideal mapping of processes to cores.

In summary, our work differs from existing literature with respect to its capabilities and underlying principles, but at the same time, forms a complementary contribution to other existing literature that can be simultaneously utilized.

VIII. CONCLUDING REMARKS AND FUTURE WORK

In this paper, we demonstrated that the interactions between the communication stack and multi-core architectures can result in heavy asymmetry in the effective capabilities of the different cores; this results in significant performance degradation for various applications. We further presented the design and evaluation of a novel systems software stack, known as the SyMMer (Systems Mapping Manager) library, that monitors such interactions and dynamically manages the mapping of processes onto processor cores so as to improve performance. Our evaluation of the SyMMer library demonstrated nearly a two-fold improvement in communication time and a 10-15% improvement in overall performance for various applications.

As future work, we plan to study other multi- and many-core architectures, such as AMD NUMA and GPGPU systems, to understand the implications of novel process-to-core mapping on such systems.

REFERENCES

[1] “Intel Core 2 Extreme quad-core processor,” http://www.intel.com/products/processor/core2xe/qc_prod_brief.pdf.
[2] “AMD Quad-core Opteron processor,” http://multicore.amd.com/us-en/quadcore/.
[3] “Sun Niagara,” http://www.sun.com/processors/UltraSPARC-T1/.
[4] “Intel Terascale Research,” http://www.intel.com/research/platform/terascale/teraflops.htm.
[5] “Tilera TILE64 Processor family,” http://www.tilera.com/pdf/ProBrief_Tile64_Web.pdf.
[6] G. Narayanaswamy, P. Balaji, and W. Feng, “An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multicore Environments,” in 15th International Symposium on High-Performance Interconnects (HotI 2007), Palo Alto, California, August 2007.
[7] Argonne National Laboratory, “MPICH2: High performance and portable message passing.”
[8] “Message Passing Interface Forum,” http://www.mpi-forum.org.
[9] H. J. C. Berendsen, D. van der Spoel, and R. van Drunen, “GROMACS: A message-passing parallel molecular dynamics implementation,” Computer Physics Communications, vol. 91, no. 1-3, pp. 43–56, September 1995. [Online]. Available: http://dx.doi.org/10.1016/0010-4655(95)00042-E
[10] S. Plimpton, “Fast Parallel Algorithms for Short-Range Molecular Dynamics,” J. Comput. Phys., vol. 117, no. 1, pp. 1–19, 1995.
[11] M. Frigo and S. G. Johnson, “The design and implementation of FFTW3,” Proceedings of the IEEE, vol. 93, no. 2, pp. 216–231, 2005, special issue on “Program Generation, Optimization and Platform Adaptation.”
[12] “PAPI,” http://icl.cs.utk.edu/papi/.
[13] “PERUSE,” http://www.mpi-peruse.org/.
[14] T. Li, D. Baumberger, D. A. Koufaty, and S. Hahn, “Efficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architectures,” in SC ’07, 2007.
[15] L. Chai, Q. Gao, and D. K. Panda, “Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System,” in CCGrid ’07, 2007.
[16] J. Anderson, J. Calandrino, and U. Devi, “Real-time scheduling on multicore platforms,” in Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium, April 2006, pp. 179–190.
[17] A. Fedorova, M. Seltzer, and M. D. Smith, “Cache-fair thread scheduling for multicore processors,” Division of Engineering and Applied Sciences, Harvard University, Tech. Rep. TR-17-06, October 2006.
