
Multi-Core Execution of Future Applications in the APZ VM

ANDREAS SEHR and CARL BRING

Master of Science Thesis
Stockholm, Sweden 2011


Master's Thesis in Computer Science (30 ECTS credits)
at the School of Computer Science and Communication
Royal Institute of Technology, 2011

Supervisor at CSC was Mads Dam
Examiner was Johan Håstad

TRITA-CSC-E 2011:089

ISRN-KTH/CSC/E--11/089--SE
ISSN-1653-5715

Royal Institute of Technology
School of Computer Science and Communication

KTH CSC
SE-100 44 Stockholm, Sweden

URL: www.kth.se/csc


Abstract

The APZ VM is a subsystem in Ericsson's AXE platform, which is a telecommunication system that is widely used around the world. The APZ VM application is essentially a virtual machine developed by Ericsson to replace its predecessor, the APZ machine. This shift in technology was beneficial to Ericsson for several reasons, but the ability to utilize commercial hardware, which provides cost-efficient performance improvements in the APZ VM, was a particularly strong incentive. The APZ VM is already partially parallelized and divides the execution into two separate units referred to as the Signal Processor and the Instruction Processor. This enables the APZ VM to use two separate processor cores during execution. The commercial hardware of today does, however, provide far more processor cores than the APZ VM is able to exploit.

The main approach for exploiting additional processor cores in this thesis is parallelization of the Instruction Processor Unit (IPU). The idea is to relieve the IPU of workload by migrating some of its work to another processor core. This parallelized prototype of the APZ VM has to be completely backwards compatible and transparent to the other systems that the IPU communicates with. A full-scale implementation of a parallelized IPU prototype is a large project exceeding the scope of this master's thesis; the focus is therefore on constructing a small prototype with several delimitations which narrow the thesis. The potential of the fully parallelized IPU is instead investigated through simulations on commercial hardware.


Referat

Flerkärnig exekvering av framtida applikationer i APZ VM

The APZ VM is a subsystem in Ericsson's AXE platform, a telecommunication system that is widely spread around the world. The APZ VM itself is an application which acts as a virtual machine and was developed by Ericsson to replace its predecessor, the APZ machine. The technology shift was beneficial to Ericsson for several reasons, but above all it was the ability to use commercial hardware with cost-efficient performance in the APZ VM that was attractive. The APZ VM is partially parallelized and divides the execution into two separate units called the signal processor and the instruction processor. The commercial hardware available today, however, offers considerably more processor cores than the APZ VM can utilize. The main approach for exploiting additional processor cores in this report is parallelization of the instruction processor. The idea is to relieve the instruction processor by migrating part of the work to another processor core. This parallelized prototype of the APZ VM must also be fully backwards compatible and transparent to the other systems that the instruction processor communicates with. A full-scale implementation of a prototype with a parallelized instruction processor is a large project which exceeds the resources available for this report; the focus is therefore on the construction of a smaller prototype with several delimitations. The full potential of the project is investigated through simulations, on commercial hardware, of the potential performance improvements achieved with a fully parallelized instruction processor.


Preface

This thesis was written during the winter of 2010-2011 at Ericsson in Telefonplan, Stockholm.

We would like to thank our families, the coffee machine, Ericsson, KTH, the nearby pizzerias, Google Docs, LaTeX, Eclipse, Google Translate, GDB, SSH, Coca-Cola, Red Bull, HP support and our supervisors, who made this thesis possible.

Division of workload in the thesis:

Task                 Andreas   Carl
Background           X
Problem description  X         X
Theory               X         X
Implementation       X         X
Result               X         X
Conclusion           X         X

Division of workload in the implementations:

Task              Andreas   Carl
Spin experiment   X
Cache experiment  X
IPU Simulation    X
Prototype         X         X


Abbreviations

AKE - Swedish: Automatiska Kodväljarbaserade Elektronikväxlar
AP - Adjunct Processor
APT - Application system
APZ - Execution platform
ASAC - ASA210C JIT-Compiler
AXE - Automatic eXperimental Electronic switch
BC - Blade Cluster
BS - Blade System
BSC - Base Station Controller
CP - Central Processor
CPU - Central Processing Unit
FIFO - First In First Out
GCC - GNU Compiler Collection
GEP - Generic Processing Board
HAL - Hardware Abstraction Layer
HPC - High Performance Computing
HT - Hyper Threading
IO - Input/Output
IP - Instruction Processor
IPU - Instruction Processing Unit
ISA - Instruction Set Architecture
ISHM - Internally SHared Memory
ISP - Improved In-service Performance or Internet Service Provider
JIT - Just In Time
L1 - Level 1 (L2 stands for Level 2 and so forth)
LLC - Last Level Cache
LAN - Local Area Network
MSC - Mobile Switching Centre
Mutex - Mutual Exclusion
OS - Operating System
OSI - Operating System Interface
PHC - Program Handling Check
PLEX - Programming Language for EXchanges
RMP - Resource Module Platform
RP - Regional Processor
RTD - Real Time Debugger
SMP - Shared Memory Processors
SMT - Simultaneous Multi-Threading
SP - Signaling Processor
SPU - Signaling Processing Unit
VAS - Virtual Address Space
VM - Virtual Machine


Contents

Preface
Abbreviations

1 Background
  1.1 The Emergence of Ericsson
  1.2 AXE Digital Switching System
  1.3 Commercial Development of Parallelization
  1.4 Historical Parallelization of the Central Processor

2 Problem Definitions
  2.1 Delimitations
    2.1.1 Exclusively Intel Microarchitecture
    2.1.2 Restrict Usage of Locks in Platform Layer
    2.1.3 Restrict Parallelization in the Application Layer
    2.1.4 Restrict Prototype to Limited Number of Additional Processors
    2.1.5 Restrict Changes of Preexisting Code
    2.1.6 Restrict Offloading of Workload to One Job Buffer
  2.2 Purpose

3 System Descriptions
  3.1 Central Processor Unit Architecture
    3.1.1 Pipeline
    3.1.2 Cache Memory
    3.1.3 Nehalem Processor Architecture
  3.2 Parallel Computing in a Software Perspective
    3.2.1 Programs and Processes
    3.2.2 Synchronization
    3.2.3 Scheduling

4 Implementation
  4.1 Equipment
  4.2 Experiments
    4.2.1 Threading Interface
    4.2.2 Spin Experiment
    4.2.3 Cache Memory Experiment
    4.2.4 IPU Simulation Experiment
    4.2.5 Prototype of Parallelized APZ VM

5 Results
  5.1 Experiments
    5.1.1 Spin Experiment
    5.1.2 Cache Memory Experiment
    5.1.3 IPU Simulator
  5.2 Prototype
    5.2.1 Testing of the Prototype

6 Conclusions
  6.1 Experiments
  6.2 Prototype
  6.3 Recommendations
    6.3.1 Regarding Inter-thread Communication
    6.3.2 Further Code Optimization
    6.3.3 Determine if Block Is Runnable

7 Bibliography

A The APZ System


Chapter 1

Background

To understand the contents and purpose of this thesis it is important to have a basic overview of the large AXE system. The background aims to introduce the reader to the subject by gradually increasing the level of detail, thus giving the reader a logical explanation and an understanding of how this thesis fits into “the bigger picture”.

1.1 The Emergence of Ericsson

The company we now call Ericsson originally consisted of a small workshop in Stockholm, Sweden, where Lars Magnus Ericsson repaired various telegraph instruments for the Swedish government, railways and army. At the very same time, in 1876, Alexander Graham Bell filed his patent for the telephone in the United States of America. Surprisingly, the patent was never filed in Sweden even though the telephones of A. G. Bell were being sold there. It did not take long before L. M. Ericsson got his first request to repair such a telephone, and by doing so L. M. Ericsson realized the great potential of telephones.

This realization led L. M. Ericsson to build a better and less expensive telephone. The portfolio of products provided by Ericsson was initially limited to pairwise link telephones for internal usage, but eventually the need for commercial telecommunication networks boosted the development of telecommunication routers at Ericsson, which have continued to be among the core products of the company. By 1900 Ericsson had become one of the major suppliers of telecommunication equipment on the international market.

1.2 AXE Digital Switching System

The development of AXE (Automatic eXperimental Electronic switch) started in the early 1970s. AXE is the digital successor to the analogue switching system AKE (Swedish: Automatiska Kodväljarbaserade Elektronikväxlar) and is Ericsson's most widely deployed switch.


Figure 1.1: Lars Magnus Ericsson 1906.

The reason for the success of AXE was the initial possibility to shift from electromechanical communication to digital communication, which added technological advantages and reduced the logistical footprint. The continued success of AXE is strongly influenced by its flexible system and modular architecture, which is easily adapted to user requirements such as additional capacity and new functionality. The system is used in both new and old networks, which adds a further dimension of flexibility to AXE by providing backwards compatibility.

The AXE of today does not have much in common with the early versions of the AXE switching system, except that it retains the original architecture. The AXE system is no longer only a switching system for telephony, but includes a wide variety of services. The modern AXE platform provides support for data traffic using 3G and tailor-made applications for service providers. Overall functionality has also improved dramatically over the years, including support for new standard interfaces, increased processing power, and reduced physical space requirements and power consumption.

The AXE system is divided into two main subsystems according to their functionality:

Application System (APT) is the telephony subsystem of AXE, which includes a series of additional subsystems for traffic management and routing.

Execution platform (APZ) is the controller subsystem, which incorporates both hardware and software subsystems for operating the AXE system.


These subsystems are then used in different combinations to create nodes that provide specific functionality in the AXE system.

1.3 Commercial Development of Parallelization

Performance improvements have historically been achieved by introducing increasingly complex architectures with higher clock frequencies. Unfortunately, the technique of constantly increasing the clock frequency of processors is unsustainable because of physical constraints on power consumption and, consequently, heat generation. This development hit an upper limit in 2004, when a notable announcement was made describing this phenomenon: the processor manufacturer Intel decided to cancel the development of the forthcoming microprocessors Tejas and Jayhawk due to thermal problems¹.

This marked a change in Intel's development strategy: efforts were reprioritized towards increasing performance for commercial products mainly through introducing multiple processors (or cores) on a single die. The commercial computer industry now expects that future performance increases will largely come from increasing the number of cores on a die, rather than from making a single core go faster. It is important to emphasize that using multiple processors is not a completely new technique; it has been around for a long time in high performance computing (HPC) environments such as research and enterprise. This is a paradigm shift for the processor industry which changes the programming interface by enforcing parallelism on the programmer after decades of mostly sequential programming.

1.4 Historical Parallelization of the Central Processor

Improving performance of the CP is a continuously ongoing process at Ericsson. Many different solutions utilizing parallelization have been explored with various degrees of success. Here is a brief overview of the solutions that have been explored:

Replication of CP blades: This is a successful implementation of parallelism called Blade Cluster, where traffic that used to go to a single CP instead gets distributed to multiple CPs referred to as blades. These blades share resources with each other by using switches and network protocols for inter-blade communication.

Parallelism within the CP:

1. Workload divided between SPU and IPU: The SPU handles communication and protocols, while the IPU executes jobs which invoke applications in the APT.

¹ Laurie J. Flynn. 2004. News article: Intel halts development of two new microprocessors. http://www.nytimes.com/2004/05/08/business/08chip.html?ex=1399348800&en=98cc44ca97b1a562&ei=5007


Parallelism within the IPU is constrained by Ericsson: modifications are not allowed to affect the upper application layer (APT) and have to be transparent to the PLEX programmers.

1. Instruction level parallelism: Various techniques offered by modern processors, for example multiple functional units, pre-fetching or masking memory accesses, pipelining, out-of-order execution, very long instruction words, cache hierarchies etc., are exploited using appropriate compiler optimizations and program allocation strategies.

2. Task level parallelism: Distributing jobs to different processor cores in a manner that ensures sequencing requirements and maintains data consistency.

   a) Concurrent flow: Extracting concurrent flows from incoming protocol sequences and mapping each flow to different processor cores.

   b) Functional distribution (a special type of pipelining): Mapping different functions (PLEX application blocks) to different processor cores.

      This thesis: a restricted case of functional distribution that simplifies design by assigning only new parallelization-safe application blocks to one additional processor core.

   c) Combination of the above.

   d) Speculative execution: Executing jobs from a common job buffer as processor cores become available, even out of order, but committing them in order of arrival. If a data collision is detected, a rollback is performed for the affected jobs.

There have been functioning implementations of prototypes for all the techniques listed above, but only a fraction of these prototypes are used in commercial releases of the product. The reason for not including many of these techniques in commercial releases, in spite of having working prototypes, is mainly the continuous hardware improvements providing more cost-effective performance. But as the trend of hardware development changes towards adding additional processor cores instead of increasing clock frequency, the effort to utilize parallelization in the APZ VM is once again in focus. By concentrating on improving parallel performance for new functionality, the complexity of the prototype implementation is significantly reduced.


Chapter 2

Problem Definitions

The APZ VM is a real-time system where workload varies depending on several external parameters. The purpose of this thesis is to investigate how the APZ VM can use parallelization to find viable solutions which utilize multi-core processors more efficiently and scale performance dynamically according to the workload, while retaining the backward compatibility and real-time requirements of the system. Increasing performance to sustain higher workload can also have drawbacks such as increased power consumption and cooling requirements.

The main problem of the thesis is to determine at what level of workload the APZ VM benefits from distributing part of the load onto multiple processor cores, while considering the effects of key factors such as dynamic clock frequency, shared cache memory and thread communication overhead.

There are several sub-problems that have to be solved for this to be determined:

Parallelized prototype of APZ VM
To efficiently make use of new multi-core microprocessors and increase performance of the APZ VM, the system has to be modified in a way that supports real-time offloading of workload to additional processor cores without disturbing the execution. The workload has to be distributed asymmetrically to the processor cores due to constraints both in hardware and software. The software architecture of the APZ VM has to be modified to support this.

Utilize Hyper Threading
Hyper Threading is an Intel implementation of simultaneous multithreading (SMT) that utilizes the computational time of various delays in execution, such as fetching memory or waiting for input/output (IO) operations. This technique can potentially be used in the APZ VM to increase performance, as the latest processor architecture from Intel supports Hyper Threading.

Design rules for block execution


Offloading work to additional processor cores is only done for newly developed PLEX blocks which are safe to parallelize. This requires a set of design rules to be defined which specify how new blocks are to be executed.

Optimize power consumption
The processors in the Nehalem processor family from Intel are put to sleep during low workload to reduce power and cooling consumption. This technique can potentially be utilized by the APZ VM during periods of low workload.

Measure performance gain
Actual gain in performance is verified by performing measurements on the system. These measurements have to simulate various types of traffic and input data to generate reliable results.

2.1 Delimitations

“Make everything as simple as possible, but not simpler.” - Albert Einstein

2.1.1 Exclusively Intel Microarchitecture

Intel's microarchitecture is exclusively used for all tests in this thesis, despite there being many other microarchitecture manufacturers. The decision to use the Intel platform was made by Ericsson and is therefore a limitation of this thesis. This decision enables the usage of several Intel-specific features such as Hyper-Threading, Turbo Boost and Smart Cache. There are, however, similar features implemented in competing architectures as well. The Intel platform is therefore to be regarded as a sample platform from a rather generic collection of platforms.

2.1.2 Restrict Usage of Locks in Platform Layer

The design of the prototype follows the concept of the original APZ VM and tries to avoid the need for locks and semaphores in the system. The focus is instead on simplified synchronization with low overhead, using atomic operations which are realized through hardware.

2.1.3 Restrict Parallelization in the Application Layer

There is a huge number of applications that run on top of the APZ VM; some of these applications are very old and co-exist in a web of dependencies with each other. These applications are potentially designed in a way that is not parallelization-safe.

This thesis therefore restricts additional processor cores to only execute applications that guarantee parallelization safety. These guarantees make applications less complex to run in parallel, and the need for advanced synchronization techniques in the APT is avoided.


2.1.4 Restrict Prototype to Limited Number of Additional Processors

The internal communication of the APZ VM is today designed with buffers which are writable and readable by only one assigned thread. This design choice scales poorly, but is sufficient for a couple of threads. The implementation of the prototype utilizes a limited number of additional processor cores. This is due to the complexity of implementing internal communication between threads without locking mechanisms.
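As an illustration only of this single-writer/single-reader concept (the names, sizes and details below are hypothetical sketches and not taken from the APZ VM implementation), such a buffer can be written in C as a lock-free ring buffer:

#include <stddef.h>
#include <stdint.h>

#define BUF_SIZE 1024                    /* must be a power of two */

/* One thread only ever advances head (the writer), the other only ever
   advances tail (the reader), so no lock is needed: each index has exactly
   one writing thread. */
struct job_buffer {
    void *jobs[BUF_SIZE];
    volatile uint32_t head;              /* written by the producer only */
    volatile uint32_t tail;              /* written by the consumer only */
};

/* Returns 0 if the buffer is full, 1 on success. */
int buffer_put(struct job_buffer *b, void *job)
{
    if (b->head - b->tail == BUF_SIZE)
        return 0;
    b->jobs[b->head % BUF_SIZE] = job;
    __sync_synchronize();                /* publish the job before moving head */
    b->head++;
    return 1;
}

/* Returns NULL if the buffer is empty. */
void *buffer_get(struct job_buffer *b)
{
    void *job;
    if (b->head == b->tail)
        return NULL;
    job = b->jobs[b->tail % BUF_SIZE];
    __sync_synchronize();                /* read the job before releasing the slot */
    b->tail++;
    return job;
}

With more than one reader or writer per buffer, the indices would need atomic updates or locks, which is exactly the complexity this delimitation avoids.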

2.1.5 Restrict Changes of Preexisting Code

The APZ VM system serves as an operating platform for a huge number of applications with a very long history. The preexisting application layer running above the APZ VM cannot be modified in any way that risks breaking the original functionality. The existing APZ VM system is also kept as intact as possible by only modifying the essential parts.

2.1.6 Restrict Offloading of Workload to One Job Buffer

Incoming traffic is regarded as jobs in the APZ VM, which are divided into different job buffers with different priorities. The prototype is only to offload work from one of these job buffers, thereby eliminating the complexity of handling different priorities and the inter-communication required to guarantee that these are executed in the correct order.

2.2 Purpose

The IPU is currently restricted to execute on a single processor core and is therefore unable to effectively utilize multi-core processors. The purpose of this thesis is to investigate how the APZ VM can use parallelization to find viable solutions which utilize multi-core processors more efficiently while retaining the backward compatibility and real-time requirements of the system. By improving performance of the APZ VM, the system is also able to provide better service and a reduced environmental footprint through more efficient power and cooling consumption.


Chapter 3

System Descriptions

This chapter aims to introduce the theoretical subjects and systems related to this thesis and give the reader an understanding sufficient for the forthcoming results. The chapter is divided into three sections, explaining the system from different perspectives:

• Central Processor Unit Architecture

• Parallel Computing in a Software Perspective

• APZ VM

3.1 Central Processor Unit Architecture

This section investigates the hardware restrictions in terms of processor speed and access times. Features implemented in the Intel Nehalem microprocessor, such as the pipeline, cache memory and simultaneous multithreading, are also described here.

In modern computers the main microprocessor is called the Central Processing Unit (CPU) and it executes series of stored instructions called programs. These instructions are clearly defined in an Instruction Set Architecture (ISA) for the CPU. Common instructions in the CPU are described below:

• move memory

• addition

• subtraction

• multiplication

• division

• jump in the program code according to conditional statements


• wait

• more specialized and advanced calculations

The ISA has developed a lot over the years. In the first x86 microprocessors¹ there were only around a hundred instructions implemented in the CPU. Today there are far more instructions in modern processors, but due to backward compatibility all the original instructions are still available. Many of the new instructions are optimized variants of the most basic operations, specialized for different data types. During the paradigm shift to the 64-bit architecture, the manufacturers removed some of the oldest instructions to make room for new instructions in the CPU.

3.1.1 Pipeline

There are a few steps a processor has to go through when executing an instruction. These steps are called the instruction pipeline. In a simplified view, the processor's steps are divided into the following operations:

1. Fetch instruction

2. Decode instruction

3. Fetch data

4. Execute instruction

5. Store results

In the early Intel 8086 processor, all these steps had to be executed for one instruction before the next instruction was allowed to execute. This is inefficient, and processor designers searched for a way to improve it.

An instruction consists of multiple steps, as illustrated in figure 3.1, and since these steps are executed in different areas of the core it is possible to overlap the execution of instructions, which multiplies the instruction throughput. Today, most newly developed processors interleave the execution of instructions to improve performance.

One of the drawbacks of a processor pipeline is branching. The result of a branch is resolved at the end of the pipeline, and the processor can only guess whether the branch is taken or not taken. If the processor guesses wrong, the pipeline has to be flushed and execution has to be restarted at the branch. The cost of flushing the pipeline of its loaded instructions varies with the length of the pipeline itself. Therefore, the Nehalem pipeline has been shortened compared to its predecessors.

¹ x86 is a standard instruction set developed by Intel.


Figure 3.1: The processor pipeline.

3.1.2 Cache Memory

The CPU communicates with various forms of memory units during the execution of programs. The performance of the CPU's memory communication is largely dependent on the latency for reading and writing to memory. Therefore, the memory in modern computer architectures is divided into hierarchical levels, ordered by communication performance (cache in the processor, main memory, hard drives, external devices).

The purpose of cache memory is to reduce the latency overhead caused by stalls in the CPU when fetching from or writing to main memory. The performance improvement of cache memory is realized through faster access latencies, which the CPU utilizes as a preferred storage over the main memory.

The size of the cache is naturally much smaller than the actual main memory it is caching for. Therefore, each main memory block is mapped to one or multiple cache blocks. If a main memory block is mapped to only one cache block it is called direct mapping, and if it is mapped to multiple cache blocks it is called n-way set associative (usually n is two or four).
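As a small worked example of this mapping (the helper below is hypothetical and only assumes the Nehalem L1 data cache parameters given in figure 3.3: 32 KB per core, 8-way set associative, 64-byte cache lines):

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE  64
#define NUM_WAYS   8
#define CACHE_SIZE (32 * 1024)
#define NUM_SETS   (CACHE_SIZE / (NUM_WAYS * LINE_SIZE))   /* = 64 sets */

/* An address may be cached in any of the NUM_WAYS lines of exactly one set. */
static unsigned set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

int main(void)
{
    /* Addresses 4096 bytes apart map to the same set (4096/64 = 64 lines,
       a whole number of set cycles), so they compete for the same 8 ways. */
    printf("%u %u\n", set_index(0x1000), set_index(0x2000));   /* prints 0 0 */
    return 0;
}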

The cache functionality is easily explained through a simple usage scenario of reading or writing data in the main memory:

Cache miss The CPU first checks whether a copy of the data is located in the cache. If the data is found, the CPU immediately reads from or writes to the cache line. If the data is not found in the cache, there is a so-called “cache miss”. The CPU then has to load the requested data into the cache from the main memory. The CPU has to wait while this procedure finishes; the CPU is stalled. The CPU proceeds when the requested data is in the cache.

Data cache misses are less costly than instruction cache misses, because when a data cache miss occurs the processor can continue execution with another instruction which does not depend on the data; this is called out-of-order execution.

Figure 3.2: Cache levels

There are several strategies for storing useful data in cache memories and thereby avoiding cache misses. These strategies generally rely on some kind of prediction of what data the CPU will use, in combination with exploiting the fact that communication with main memory is more effective when reading large chunks of data.

3.1.3 Nehalem Processor Architecture

The following section describes central parts of Intel's Nehalem processor-specific functions. It is possible to use a combination of simultaneous multithreading and dynamic clock frequency for power saving by deactivating unused physical cores, which results in a higher clock frequency per core.

Simultaneous Multithreading

Simultaneous multithreading (SMT) is exploited in Intel's proprietary technology called Hyper Threading. The technology is an implementation of simultaneous multithreading meant to increase the performance of the processor by effectively utilizing stall times in the processor for computation. The processor pipeline implements dual instruction queues, creating two logical processors per physical processor. This is beneficial when an instruction stalls; the CPU is then able to switch to executing the other instruction queue instead. A stall occurs in the processor pipeline when the executing thread is waiting for memory to be fetched into the cache, or after a branch misprediction. Reports on the Nehalem architecture from the European Organization for Nuclear Research (CERN) indicate that a performance increase of 15-28% is observed when Hyper Threading is turned on².

Dynamic Clock Frequency

The technique for increasing the clock frequency dynamically is referred to as Turbo Boost technology by Intel. When the clock frequency of the CPU increases, the CPU also consumes more power, which in turn generates more heat. Too much heat generation results in damage to the cores and is the main reason why cores are not always able to run at maximum capacity. Intel has instead added a feature in the CPU which monitors the following parameters in each core:

• Load

• Number of active cores

• Estimated power consumption

• Estimated current consumption

• Temperature

If the load goes up and the monitored values are within accepted thresholds, the core increases the clock frequency in steps of 133.33 MHz. Power consumption and heat generation go up as the clock frequency increases, and if any of the monitored values exceeds its threshold, the core's clock frequency is reduced until the monitored values are within acceptable limits.

Dynamic Cache Allocation

Intel's Nehalem processors have three levels of cache. The first level (L1), closest to the CPU, is 64 KB per core and has the fastest access time, but it is also the smallest. The L1 cache is the only level where storage is divided equally between instructions and data. It is common for modern computer architectures to have separate cache memories for program instructions and program data, as the locality and reuse behavior of memory in these two areas differ. The second level (L2) cache is 256 KB per core and, like the L1 cache, is shared between the logical cores while the processor runs in Hyper Threading mode.

The third level (L3) cache is the last level cache (LLC) on the Nehalem CPU. The LLC is shared between all cores in the CPU and consists of 8 MB. The Nehalem processor is the first processor from Intel with dynamic cache allocation in the LLC. Dynamic cache allocation is what Intel calls Smart Cache and allows the core with the greatest need of data to allocate a larger area of the processor's LLC. In previous microarchitectures the LLC was statically divided between all the cores. For instance, one of the quad-core Core 2 processors from Intel has 12 MB of last level cache which is statically divided into 3 MB for each core.

² Evaluation of the Intel Nehalem-EX server processor. CERN openlab.


Cache           Size                   Shared                      Associativity
L1 Instruction  128 KB (32 KB/core)    Yes, between logical cores  8-way
L1 Data         128 KB (32 KB/core)    Yes, between logical cores  8-way
L2              1024 KB (256 KB/core)  Yes, between logical cores  8-way
L3              8 MB                   Yes, between all cores      16-way

Figure 3.3: The cache configuration of a quad-core Nehalem E5540 CPU.

3.2 Parallel Computing in a Software Perspective

Support for parallel computing is today a necessity for high performance software applications that are required to scale well with current and future hardware configurations. The major goal, and difficulty, of parallel computing is performing many calculations simultaneously. One way is to decompose large problems into smaller sub-problems which may be solved concurrently. The performance gained when decomposing a problem into sub-problems and solving them concurrently is limited partially by the physical hardware and its number of available processors, and partially by the maximum number of sub-problems available for execution, which sets the upper bound on how many processors may be utilized at a given time.

Ideally, a problem would always be decomposable into as many sub-problems as there are available processors in the system, and would thus scale linearly in performance. This is however achievable only in theory under perfect conditions; in reality, systems have overhead due to the management of parallel tasks and inter-communication.

Amdahl's Law The available concurrency in the underlying problem, and how much of this concurrency is decomposable, gives the potential improvement in performance from parallelization. This law was formulated by Gene Amdahl in the 1960s and describes how the parallelizable and non-parallelizable (sequential) parts of a program affect the speed-up from parallelization.

S = \frac{1}{1 - P} \qquad (3.1)

where S is the potential speed-up of the program if parallelized and P is the fraction of the program that is decomposable (parallelizable).
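As a hypothetical illustration (the number is chosen for the example and is not a measurement from this thesis): if three quarters of a program is decomposable, i.e. P = 0.75, then

S = \frac{1}{1 - 0.75} = 4

so regardless of how many processor cores are added, the sequential quarter alone limits the speed-up to at most a factor of four.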

3.2.1 Programs and Processes

An instance of a program executing in the operating system is regarded as a process. The main components that a process in the operating system consists of are the following:


Memory Executing a program requires several different types of memory functionality. The operating system provides the process with its own memory, which is unique to the corresponding process and may not be shared.

Virtual Address Space The VAS acts as an abstraction layer between the process and the physical memory, where the operating system controls how the virtual memory addresses are mapped to actual physical memory. This enables the operating system to ensure that memory is kept separated between processes.

Program instructions The actual program consists of a series of machine instructions which are to be executed by the process. These instructions are stored in a separate part of the memory.

Program stack The stack is used by the program to store return addresses and thereby enable function calls.

Program heap The heap is used for dynamic memory allocation by the program.

Process state information The process state is represented by registers in the executing processor, which have to be stored in memory when the operating system switches execution to another process.

Operating system attributes The process stores various information regarding the operating system:

File descriptors The process allocates file descriptors for basic input and output communication of the program.

Security attributes The process stores various security permissions which are inspected by the operating system before accessing files.

Task A task is regarded as an individual unit which may be executed by the operating system. A task represents either a whole process or a single thread within a process.

Most modern operating systems support multitasking. This is achieved by executing individual tasks on physical processors during small time slices. Multitasking is a crucial feature for both multi- and single-core processor systems and is used for running many applications in parallel, but also for time-sharing of the system's computational capacity.

Threads

Threads can be regarded as miniature processes, but with plenty of differentiating characteristics. Threads are contained inside a process and share many of the resources of the process, such as the virtual address space, program instructions, operating system attributes and program heap. Individual stacks are needed for the threads to be able to execute independent tasks within the program of the process. This makes threads the smallest unit of execution that is scheduled by the operating system.

Two of the main benefits of threads compared to processes are:

Shared memory communication One of the main benefits of threads is communication through shared memory, which eases design and improves performance compared to the alternatives available for process communication. Shared memory communication has various pitfalls as well, which are explained in section 3.2.2.

Reduced overhead Since threads share many of the resources within a process, the state information that has to be stored is significantly reduced when switching execution between threads in the process.

Linux handles each thread as an ordinary task when scheduling, and stores threads in the parent process's virtual address space. As the thread and the parent process share memory, there is no extra overhead in mapping up a new virtual address space when creating multiple threads.

3.2.2 Synchronization

There are risks of error when multiple threads operate on shared resources concurrently in a program without proper synchronization. The most common problem encountered in multi-threaded programs is called a race condition, which occurs when a sequence of inter-dependent instructions of the program is executed in an unintended order due to concurrent execution; this can result in corruption of data and unexpected behavior of the program. The technique of Mutual Exclusion (mutex) is used in concurrent programming to avoid simultaneous access to shared resources. Sections of code that access shared resources in concurrent environments are also referred to as critical sections. There exist various synchronization techniques used to ensure that shared resources are not corrupted during concurrent execution.

Atomic Operations

The simplest form of mutual exclusion is implemented through atomic operations in the hardware, which means that the hardware performs uninterruptible (atomic) operations. An atomic operation is regarded as a single machine instruction by the rest of the system. This is a sufficient technique for simple forms of synchronization which rely on reading or writing individual variables.

The x86 instruction set provides special support for atomic instructions, for example xchg (atomic swap), cmpxchg/cmpxchg8b (atomic test-and-set), xadd (atomic add) and other integer instructions³. Atomic instructions are implemented using three inter-dependent mechanisms:

³ Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A


Atomic instructions Basic memory instructions which are guaranteed to operate atomically on the memory location.

Bus locking Some memory instructions need to lock the bus to guarantee atomic operation, for example read-modify-write operations.

Cache coherency protocols Memory operations are frequently performed atomically inside cached locations. Here the cache coherency protocols ensure that cached memory locations are managed properly and synchronized with other cache memories while atomic operations are performed.
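As a simple illustration of synchronization through such hardware-realized atomic operations (a sketch, not code from the APZ VM; it uses the GCC __sync builtins, which on x86 compile to lock-prefixed instructions such as xadd):

#include <pthread.h>
#include <stdio.h>

static volatile long counter = 0;

/* Both threads increment the shared counter with an atomic add. */
static void *worker(void *arg)
{
    long i;
    for (i = 0; i < 1000000; i++)
        __sync_fetch_and_add(&counter, 1);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Always prints 2000000; with a plain counter++ the result would be
       unpredictable due to the race condition described above. */
    printf("counter = %ld\n", counter);
    return 0;
}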

Locks

Shared resources can also be synchronized by using software synchronization primitives which encapsulate atomic operations into more sophisticated functionality. These primitives (such as mutexes, semaphores, etc.) allow a single thread to own the lock of a critical section, while the other threads either spin or block depending on their timeout mechanism. Blocking results in costly context switching, whereas spinning results in wasteful use of the CPU's resources (unless used for a short duration). Non-blocking calls, on the other hand, allow the competing thread to return from an unsuccessful attempt at the lock and perform useful work between attempts.

One major disadvantage of locks is the possibility of deadlock and starvation. A deadlock happens when two different threads each hold one or more locks and end up in a state where they both want each other's lock. The probability of deadlocks increases as applications get more complex and utilize more locks. It is possible to resolve a deadlock by performing a rollback and redoing the execution, but this is hard to implement efficiently in real-time systems. Starvation occurs when tasks are denied execution because other tasks greedily use all resources.
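The difference between blocking and non-blocking acquisition can be sketched with the PThreads primitives as follows (the function names are illustrative):

#include <pthread.h>
#include <errno.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Blocking acquisition: the thread is suspended, with a costly context
   switch, if another thread owns the lock. */
void enter_blocking(void (*work)(void))
{
    pthread_mutex_lock(&lock);
    work();                              /* critical section */
    pthread_mutex_unlock(&lock);
}

/* Non-blocking acquisition: returns 0 immediately if the lock is busy, so
   the caller can perform useful work between attempts instead of blocking
   or spinning. */
int enter_try(void (*work)(void))
{
    if (pthread_mutex_trylock(&lock) == EBUSY)
        return 0;
    work();
    pthread_mutex_unlock(&lock);
    return 1;
}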

3.2.3 Scheduling

To be able to run more processes than there are physical processor cores available, the operating system implements scheduling of processor time. This is called multitasking, and its purpose is to utilize the processors efficiently without disturbing the user experience with slow IO operations. The component in the operating system which controls which process is allowed to execute next is called the scheduler. The scheduler is implemented with one or multiple queues where tasks wait to be processed.

Multitasking is implemented in two general ways:

Cooperative multitasking
In cooperative multitasking the process itself controls when to release the processor back to the scheduler. This type of multitasking risks getting stuck on processes which are waiting for resources, thus wasting processor time and affecting the runtime of other processes in the system. This way of multitasking is implemented in the APZ VM.

Preemptive multitasking
In preemptive multitasking the operating system controls which process executes and when, depending on different factors. This is how Linux and many other operating systems work today. Preemptive multitasking prevents greedy processes from disturbing the entire system.

The time a process is allowed to run before it is preempted is called a time slice and is predetermined before the process starts. The time slice is usually between 10 ms and 300 ms and is mostly determined by the process priority, as higher-prioritized processes are supposed to get more system resources. This way of multitasking is implemented in most modern operating systems.

There are several different ways to implement the scheduling priorities:

First In First Out
Tasks execute in the same order as they arrive.

Shortest Job First
Tasks that take a long time get interrupted by tasks that take a shorter time. This starves the tasks that need more time.

Round-robin
Each task gets an equal predefined time to run and is then put at the end of the queue. Linux implements this schedule internally on each queue level.

Multilevel scheduling
There are multiple queues, one for each priority level. In some Linux distributions there are as many as 256 levels. The APZ VM also implements this type of scheduling, where each priority queue is represented by a job buffer.

Speculative scheduling
The scheduler continuously evaluates the processes by sampling performance-related attributes such as Instructions Per Cycle (IPC). The scheduler then attempts to find the optimal mapping of processes to the available CPUs.

The APZ VM scheduler The APZ VM scheduler is described in appendix A.

The Linux 2.6 scheduler As mentioned before, Linux implements preemptive scheduling. Since kernel version 2.6, Linux is also able to schedule and preempt kernel tasks, which is quite unique to the Linux operating system. Unfortunately it is not possible to interrupt a kernel task at any given time; if the kernel task holds a lock it is not preempted until the lock is released. Kernel preemption is most interesting for real-time systems, where it is important that all tasks get enough time to run.

Linux monitors how a process utilizes its time slice, for example whether the process waits for I/O or utilizes its full processor time. The Linux scheduler from version 2.6 and up is designed to be highly responsive to user input. This means favoring tasks that are I/O dependent by giving such tasks a higher priority, which results in I/O-dependent tasks being scheduled more frequently than processor-demanding tasks. However, if an I/O task waits for slow resources or user input, the scheduler hands execution over to another task before the current task's time slice has ended. If a task still has time left of its time slice, it is placed at the end of the active queue, ready to run again in accordance with round-robin. If the process has used up its time slice, the kernel calculates the process's next time slice and puts the task in the expired queue.

This way the I/O tasks run for short periods very frequently, and the processor-demanding tasks are allowed to execute their entire time slice without any context switch. When all processes have run their time slices and the active queue is empty, the active queue and the expired queue are switched and the scheduling starts over.

Load Balancing in the Linux Kernel

The load balancer in Linux is called in two ways:

1. Timer: The load balancer runs once every millisecond when the processor core is idle and every 200 ms when it is working.

2. Queue flip: The scheduler calls the load balancing function before it flips the expired and active queues.

The load balancer attempts to move tasks between processor cores if the distribution of tasks is unbalanced, that is, if a core has approximately 25% more tasks than another core. The redistribution of tasks can result in both increased and decreased cache performance for a task, depending on how much data the tasks of the current core share.

Performance Factors

There are various factors that affect the performance of the IPU. The performance of an already parallelized application is usually measured by comparing its execution performance on a single core with its scaling behavior on multiple cores, where linear scaling is the goal. In the case of the IPU there is an additional factor to consider: the overhead required to convert the single-threaded IPU application into a parallel application, i.e. the difference between running the single-threaded IPU application on a single core and running the parallelized application on a single core. The final performance of the parallel application also has to consider the fact that the original unmodified IPU thread does not scale, which affects the potential performance gain from parallelization, as follows from equation 3.1. This is part of the requirements stated in subsection 2.1.5.

Issues that affect the final performance:

Software overhead
Overhead in the parallel application, such as the performance difference between the single-threaded IPU application and the parallelized application executing on a single core. This could be hidden behind existing execution, but requires further investigation.

Stalls while executing on multiple cores due to:

Decrease in cache hit rate relative to single-core execution, caused by cache memory synchronization.

Locks and synchronization, such as atomic operations in the APZ VM.

Load imbalance of job distribution across cores due to dynamically changing workload.

Ericsson has provided measurements which reflect the behavior of the IPU in the APZ VM. This data is used to perform approximate simulations of the IPU, which eases the implementation effort needed to perform various investigations and experiments, compared to implementing them in the actual APZ VM application. The performed simulations are based on the data in table 3.5. The frequency of jobs destined for certain blocks in the IPU is also similar to the normal distribution described in figure 3.4.


Figure 3.4: Normal distribution of jobs destined to 4000 different blocks, where the mean µ is 2000 and the variance σ² is 400.

Number of blocks in the APZ VM:   4000
Jobs per telephone call:          2000
Number of instructions per job:   1000
Memory operations per job:        80

Figure 3.5: This table shows average data of how the IPU behaves.
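For illustration, a job generator matching these figures could draw destination blocks from the normal distribution of figure 3.4 using the Box-Muller transform. The sketch below is an assumed realization; the actual generator of the IPU simulator is not shown here. Note that a variance of 400 corresponds to a standard deviation of 20.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_BLOCKS 4000
#define MEAN       2000.0
#define STDDEV     20.0                  /* square root of the variance 400 */

/* Draw a standard normal sample with the Box-Muller transform. */
static double std_normal(void)
{
    double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);   /* avoid log(0) */
    double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
}

/* Pick the destination block for the next simulated job. */
static int next_block(void)
{
    int b = (int)(MEAN + STDDEV * std_normal());
    if (b < 0) b = 0;
    if (b >= NUM_BLOCKS) b = NUM_BLOCKS - 1;
    return b;
}

int main(void)
{
    int i;
    for (i = 0; i < 10; i++)
        printf("job %d -> block %d\n", i, next_block());
    return 0;
}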


Chapter 4

Implementation

The main task is to create a prototype with an additional IPU thread in the APZ VM, which handles only newly developed application blocks. To further investigate various theories outside the scope of the prototype, a series of experiments is performed on hardware identical to that of the prototype.

4.1 Equipment

The specifications of the equipment used during the implementation of the prototype, as well as during the experiments, are described in this section. The equipment section includes the most central hardware and software utilities that are used in the implementation.

Hardware

Manufacturer:                            Intel
Architecture:                            Nehalem
Model:                                   Xeon E5540
Hyper Threading:                         Yes
Number of physical processors:           2
Number of physical cores per processor:  4
Number of logical CPUs per processor:    8
L1 cache per processor:                  64 KB per core
L2 cache per processor:                  256 KB per core
L3 cache per processor:                  8 MB shared between cores

Figure 4.1: This table shows the specifications of the CPUs used.


Memory type:          DDR3
Number of modules:    8
Size of each module:  4096 MB
Total system memory:  32 872 MB
Memory frequency:     1066 MHz
Cache line size:      64 B

Figure 4.2: This table shows specifications of the system memory used.

Software

Operating system:  Linux
Version:           2.6.32.12-0.7
Distribution:      SuSE SLES 11.1

Figure 4.3: This table shows specifications of the operating system used on the testingmachine.

Posix Threads (PThreads)

In shared memory multiprocessor (SMP) architectures, threads are used to implement parallelism. Historically, hardware vendors implemented their own proprietary versions of threads, making portability a concern for software developers. For UNIX systems, a standardized C language threads programming interface is specified by the IEEE POSIX 1003.1c standard.

GNU Compiler Collection (GCC)

GCC is an integrated distribution of compilers for several major programming languages. GCC is used to compile the C/C++ code that the APZ VM is written in.

4.2 Experiments

This subsection describes how the experiments are implemented and what they intend to measure. The purpose of the experiments is to simulate usage of the hardware and identify the hardware's bottlenecks. This is difficult to fully implement in the prototype of the APZ VM with the limited resources of this thesis, but the results of the prototype give a good indication.


4.2.1 Threading Interface

The threading interface provides a simple and robust implementation for executing programs on multiple CPUs in the operating system. The interface simplifies the initialization of new threads by hiding much of the needed configuration. The interface also provides monitoring functionality for the threads themselves, such as measurements of the runtime and the average CPU clock frequency. It is possible to map threads to specific CPUs using the threading interface by modifying the affinity value of threads in the operating system.

To initialize a new thread, the following parameters are required by the interface:

Thread function
A pointer to the function the thread executes.

Thread function parameters
A pointer to the function arguments (optional).

Thread CPU
An integer number specifying which CPU the thread is assigned to (optional, default is CPU0). The physical mapping of the Nehalem CPUs in the operating system is described in figure 4.4.

The purpose of the threading interface is to provide an easy-to-use interface enabling flexible parallelization in the following experiments.

Software mapping              Physical processor  Physical core
CPU0                          0                   0
CPU1                          1                   0
CPU2                          0                   2
CPU3                          1                   2
CPU4                          0                   1
CPU5                          1                   1
CPU6                          0                   3
CPU7                          1                   3
CPU8 (Hyper Thread of CPU0)   0                   0
CPU9 (Hyper Thread of CPU1)   1                   0
CPU10 (Hyper Thread of CPU2)  0                   2
CPU11 (Hyper Thread of CPU3)  1                   2
CPU12 (Hyper Thread of CPU4)  0                   1
CPU13 (Hyper Thread of CPU5)  1                   1
CPU14 (Hyper Thread of CPU6)  0                   3
CPU15 (Hyper Thread of CPU7)  1                   3

Figure 4.4: This table shows how the operating system maps CPUs to physical processor cores.
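A minimal sketch of how such an interface can be realized with PThreads and the Linux affinity extension is shown below; the name thread_start is illustrative and not the actual interface of the prototype.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Start fn(arg) pinned to the given CPU number, mirroring the three
   parameters the threading interface requires. */
int thread_start(pthread_t *t, void *(*fn)(void *), void *arg, int cpu)
{
    cpu_set_t set;
    pthread_attr_t attr;
    int rc;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);                  /* e.g. 0 for CPU0, 8 for its Hyper Thread */

    pthread_attr_init(&attr);
    /* Restrict the new thread to the chosen CPU before it starts. */
    pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

    rc = pthread_create(t, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return rc;
}

A thread running some function fn can then, for example, be placed on CPU2 with thread_start(&t, fn, NULL, 2).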


4.2.2 Spin Experiment

This experiment evaluates the instruction throughput that is achieved when two threads are distributed on different cores, both physical and logical. The idea is to run a loop which increments a thread-bound variable until a predefined limit is reached. The experiment is also executed with different localities of the thread-bound variables, to see how the execution performance varies when the variables are located on the same cache line.

This experiment also tests how Hyper Threading (HT) is implemented in the hardware: whether one thread starves the other, or whether the threads are able to run simultaneously without blocking each other. The design of the experiment strives to be as simple as possible, to fit in a few cache lines.

Figure 4.5: Structure of the spin experiment.

The thread loop is very simple and described below:

while var < cond do
    var ← var + 1
end while

The code is compiled to x86 assembler with the GCC compiler by using the -S flag. There are a few extra instructions when reading and writing the variable and comparing it with the condition variable:

.L3:
    movq -16(%rbp), %rax
    movq (%rax), %rax
    leaq 1(%rax), %rdx
    movq -16(%rbp), %rax
    movq %rdx, (%rax)
.L2:
    movq -16(%rbp), %rax
    movq (%rax), %rax
    cmpq -8(%rbp), %rax
    jb .L3

This instruction sequence of nine basic instructions is so small that it fits into the CPU's L1 cache the entire time. The program still has to be loaded into the CPU's cache when execution starts, so the test warms up by executing the single-threaded function twice.

Name                        Thread mode    Thread0    Thread1
Single core                 Single         CPU0       -
Single core, dual thread    Interleaved    CPU0       CPU0
Single core, dual thread    HT             CPU0       CPU8
Dual core, dual thread      Parallel       CPU0       CPU2

Figure 4.6: Thread setup of the spin experiment.

CPU0 and CPU8 are located on the same physical core, as described in figure 4.4. These two CPUs are therefore running HT, and the threads are assigned to one CPU each when running the program with HT. CPU0 and CPU2 are on the same processor but on separate cores, and the threads are assigned to one CPU each when running the program on separate cores.

This test is not applicable to a real computer program, as there are too few logical operations and memory accesses in the loop, but it gives a hint of how the hardware responds to a CPU-demanding process under different thread configurations and what instruction throughput they achieve.

When the code is compiled using the GCC -O3 optimization flag, the loop is removed and the value is increased directly. To retain full control of which instructions are produced by the compiler and executed, the experiment program is compiled without compiler optimizations.


4.2.3 Cache Memory Experiment

This experiment iterates bytewise over a memory array of 1024 MB that is divided equally between threads. The experiment intends to show how important memory locality is for the processor cache, and where the benefits lie when two threads are working with similar data. When data is too sparsely spread, the cache gets thrashed.

This test is also a performance test of the hardware and demonstrates how the new memory controller which is built into the Nehalem CPU performs. According to documents from Intel, the memory controller is able to perform at up to 25.6 GB/s. When reading the array sequentially, the CPU takes advantage of loading larger data blocks (cache lines) of 64 bytes from main memory. The 64 bytes in a cache line correspond to 64 array step operations in the CPU.

Name                        Thread mode    Thread0    Thread1    Iteration mode
Single core                 Single         CPU0       -          Full
Single core, dual thread    Interleaved    CPU0       CPU0       Alternate
Single core, dual thread    Interleaved    CPU0       CPU0       Half each
Single core, dual thread    Interleaved    CPU0       CPU0       Alternate reverse
Single core, dual thread    HT             CPU0       CPU8       Alternate
Single core, dual thread    HT             CPU0       CPU8       Half each
Single core, dual thread    HT             CPU0       CPU8       Alternate reverse
Dual core, dual thread      Parallel       CPU0       CPU2       Alternate
Dual core, dual thread      Parallel       CPU0       CPU2       Half each
Dual core, dual thread      Parallel       CPU0       CPU2       Alternate reverse

Figure 4.7: Thread setup of the cache memory experiment.

The single-threaded test runs on a single core that iterates over the whole array in sequence. The parallel tests divide the array between the two threads in equally large chunks. The division of data between the threads in the parallel tests consists of either large data clusters (half each) or many small ones (alternate).

The parallel test also iterates over the array in both directions simultaneously (parallel and reverse) to determine how the data locality affects the performance. The data access pattern of this experiment is described in figure 4.8, and the iteration modes are sketched below.
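A minimal sketch of the three iteration modes follows, assuming a 1024 MB byte array as in the experiment; the function names and signatures are illustrative.

    #include <cstddef>

    // "Half each": thread t walks its own contiguous chunk of the array.
    void iterate_half_each(volatile unsigned char* a, std::size_t n,
                           int t, int threads) {
        const std::size_t chunk = n / threads;
        for (std::size_t i = t * chunk; i < (t + 1) * chunk; ++i)
            a[i] += 1;
    }

    // "Alternate": thread t touches every threads-th byte, so both threads
    // operate inside the same 64-byte cache lines.
    void iterate_alternate(volatile unsigned char* a, std::size_t n,
                           int t, int threads) {
        for (std::size_t i = static_cast<std::size_t>(t); i < n; i += threads)
            a[i] += 1;
    }

    // "Reverse": walking backwards defeats the hardware prefetcher, which
    // assumes ascending sequential access.
    void iterate_reverse(volatile unsigned char* a, std::size_t n) {
        for (std::size_t i = n; i-- > 0; )
            a[i] += 1;
    }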


Figure 4.8: Structure of the access pattern in the cache memory experiment.

4.2.4 IPU Simulation Experiment

The IPU is simulated using a program that behaves similarly to the actual IPU in the APZ VM; this is mainly realized by following the guidelines of table 3.5. The idea of the program is to simulate a continuous flow of semi-random jobs which are executed in their corresponding function blocks. Each generated job executes a corresponding function block in the program store, which represents a compiled ASA-block that accesses various data in the data store. The program is described by dividing it into three central parts:

Data store
The data store consists of an 8 GB large array of data allocated on the heap. The data store is accessed by memory operations located in the function blocks of the program store. Each function block is assigned 1 MB of global data area and 1 MB of dynamic data area. The global data area is shared by all jobs and is accessed similarly for all jobs executing in the block. The dynamic data area stores job-specific data and therefore has a unique access pattern for each job executing in the function block.

Program store
The program store is a 10 MB large static library containing 4 000 similar functions labeled blockX - blockY. These functions represent jobs executing in blocks, where each block contains approximately 1 000 machine instructions which are free of branching, thus simulating the pattern of execution in the IPU. Each unique block performs 80 random memory operations within the block's corresponding area of the data store. The memory operations are equally divided between the global data area and the dynamic data area. The global data is accessed in a block-specific way, thus providing good cache coherency when the IPU Simulator makes repeated calls to the same block. The dynamic data is however accessed in a random pattern for each new job, and thus experiences worse cache coherency during repeated calls to the same block. (A sketch of such a block is given after this list.)

Main loop
The main loop generates semi-random jobs, executes the corresponding block and then increments the job counter. The main loop is initialized by several parameters:

1. The data store array where memory operations of the function blocks are performed.

2. The runtime of the program in seconds.

3. The interval of blocks for which jobs are to be generated, within the range 0 - 4000.

SPU Simulator
There is also an SPU Simulator included in the IPU Experiment which consumes all resources from one processor core (CPU6). The SPU Simulator performs various computations and listens to a socket to emulate the basic behavior of the real SPU in the APZ VM.
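A sketch of one simulated function block follows, assuming the 1 MB global and 1 MB dynamic data areas per block described above; the constant stride and function name are illustrative.

    #include <cstddef>
    #include <cstdlib>

    constexpr std::size_t kAreaSize = 1 << 20;   // 1 MB per data area

    // blockN: 40 global accesses in a fixed, block-specific pattern (cache
    // friendly on repeated calls) and 40 dynamic accesses at random offsets
    // (cache hostile, unique per job).
    void blockN(unsigned char* global_area, unsigned char* dynamic_area,
                unsigned job_seed) {
        for (std::size_t i = 0; i < 40; ++i)
            global_area[(i * 26041u) % kAreaSize] += 1;  // fixed stride pattern
        std::srand(job_seed);
        for (int i = 0; i < 40; ++i)
            dynamic_area[std::rand() % kAreaSize] += 1;  // random per job
    }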

The characteristics of this program are similar to the behavior of the IPU. The key effects which the simulation emulates are the following:

Normal distribution of jobs
The jobs are generated according to the normal distribution which is illustrated in figure 3.4. The normal distribution is described through the probability density function in formula 4.1. Real-time generation of values according to the normal distribution is however computationally costly; the normal distribution values are therefore calculated in advance into an array of 400 000 integers using the Box-Muller transformation described in formulas 4.2 and 4.3. The array of normally distributed values is then accessed in real time using a uniform random generator with a predetermined random seed to generate a sequence of normally distributed jobs. (A sketch of this precomputation is given after the formulas below.)

Instruction locality
The instruction locality of the IPU is emulated through the program store and its absence of branching constructions in the machine code.

Data locality
The data locality of the IPU is emulated through the data store which is accessed by the function blocks.


P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2} \qquad (4.1)

where µ represents the mean and σ² represents the variance.

z_1 = \sqrt{-2 \ln x_1}\, \cos(2\pi x_2) \qquad (4.2)

z_2 = \sqrt{-2 \ln x_1}\, \sin(2\pi x_2) \qquad (4.3)

where z₁ and z₂ are normally distributed values with µ = 0 and σ² = 1, and x₁ and x₂ are independent uniformly distributed values between 0 and 1.
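A sketch of the table precomputation using formulas 4.2-4.3 follows; the table size, mean and spread are parameters of the simulation, and the function names are illustrative.

    #include <cmath>
    #include <cstdlib>
    #include <vector>

    // Fill a lookup table with normally distributed block numbers using the
    // Box-Muller transform; each pair (z1, z2) comes from two uniform samples.
    std::vector<int> build_job_table(std::size_t size, double mean, double stddev) {
        std::vector<int> table(size);
        for (std::size_t i = 0; i < size; i += 2) {
            double x1 = (std::rand() + 1.0) / (RAND_MAX + 2.0);  // in (0, 1)
            double x2 = (std::rand() + 1.0) / (RAND_MAX + 2.0);
            double z1 = std::sqrt(-2.0 * std::log(x1)) * std::cos(2.0 * M_PI * x2);
            double z2 = std::sqrt(-2.0 * std::log(x1)) * std::sin(2.0 * M_PI * x2);
            // Values outside the valid block range (0 - 4000) would need
            // clamping in a real implementation.
            table[i] = static_cast<int>(mean + stddev * z1);
            if (i + 1 < size)
                table[i + 1] = static_cast<int>(mean + stddev * z2);
        }
        return table;
    }

    // At runtime a seeded uniform generator indexes the table, producing the
    // next job's block number in constant time.
    int next_job(const std::vector<int>& table) {
        return table[std::rand() % table.size()];
    }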

The purpose of the IPU Experiment is to produce simple measurements of performance by comparing the number of jobs executed by the IPU Simulator during a specified amount of time.

Figure 4.9: Overview of the IPU Simulator test.


4.2.5 Prototype of Parallelized APZ VM

The prototype is an extension of the APZ VM where the single-threaded execution of the Instruction Processor Unit (IPU) is divided into two simultaneously executing threads called the IPU Master and the IPU Slave. The IPU Master consists of a modified version of the original IPU in the APZ VM, while the IPU Slave is a new and simplified IP thread which only executes distributable jobs. The prototype is constructed for initial testing of the concept and to verify that an implementation is possible.

The prototype is built to meet the following delimitations presented in section 2.1:

IPU Threads
The prototype extends the execution to an additional processor core which runs an additional IPU thread called the IPU Slave.

Priority used in IPU Slave
The IPU Slave executes jobs located on the JBB priority level, which is the priority where a large portion of the traffic gets distributed.

Type of jobs executed
The IPU Slave executes jobs of the buffered type to avoid race conditions within the application layer.

The blocks
The blocks which the jobs execute in are to be constructed in a thread safe way, which means that only buffered jobs are used for communication. The blocks are also bound to a specific IPU thread.

The JIT-Compiler in parallel
The JIT-Compiler is not thread safe and therefore requires a workaround to avoid deadlocks when compiling code. The workaround is to run the blocks on the IPU Master once for compilation and run the compiled blocks on the IPU Slave.

Thread Safe Blocks

Jobs are executed on the IPU Slave in so-called chains of blocks, where the first and the last job in the chain always execute on a block distributed to the IPU Master. This enables a transparent integration of the IPU Slave with the IPU Master by having the IPU Master regard the first and last jobs as ordinary jobs. This also lets the first and last jobs perform flexible actions such as sending signals of the non-buffered type. The sequence of execution in the chain is illustrated in figure 4.10.


Figure 4.10: A chain of blocks that is partially executed on the IPU Slave.

Communication Model Between the IPU Master and IPU Slave

The communication model represents how jobs are stored and transferred between the two IPUs. The jobs are stored using circular FIFO buffers, which is a data structure specifically designed for storing jobs in the APZ VM. The circular FIFO buffers are described in appendix A.

The prototype is extended with three additional communication buffers which are illustrated in figure 4.11. Each communication buffer consists of a circular buffer which is a one-way transportation lane for jobs. The buffers are also illustrated as communication lanes in figure 4.12.

Circular buffer 1 (JBB2_SP)
This buffer is used as a communication lane for jobs from the IPU Master to the IPU Slave.

Circular buffer 2 (JBB2_IP)
This buffer is used as a communication lane for jobs from the IPU Slave to itself.

Circular buffer 3 (SlaveToMaster)
This buffer is used as a communication lane for jobs from the IPU Slave to the IPU Master.

The jobs located in the JBB2_SP and the JBB2_IP buffers are stamped with an incrementing sequence number. This sequencing mechanism is used to determine the correct interleaved order in which jobs are to be retrieved by the IPU Slave, ensuring fairness in the order of incoming jobs executed by the IPU Slave.


The status of all buffers is collectively represented in a bitmask variable called the occupancySummary. The occupancySummary is a 32-bit variable, where each bit represents the status of a circular buffer. The occupancySummary is wide enough to contain all the 21 currently existing circular buffers and the three additional buffers introduced in the prototype. The corresponding bit in the occupancySummary variable is updated when a new job is inserted into the buffer (a one) or when the last job in the buffer is retrieved (a zero). The variable is already manipulated by the SP thread and the IP thread using atomic hardware instructions. It is therefore possible to extend the variable manipulation further with additional threads. The atomic instructions used are logical AND/OR operations, which are slower than ordinary AND/OR operations.
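A minimal sketch of these bitmask updates follows, using C++ atomics as a stand-in for the atomic hardware instructions; the helper names are illustrative.

    #include <atomic>
    #include <cstdint>

    std::atomic<uint32_t> occupancySummary{0};   // one bit per circular buffer

    // Called after inserting a job into buffer 'id' (0-31): set the bit.
    void mark_nonempty(unsigned id) {
        occupancySummary.fetch_or(1u << id, std::memory_order_release);
    }

    // Called after retrieving the last job from buffer 'id': clear the bit.
    void mark_empty(unsigned id) {
        occupancySummary.fetch_and(~(1u << id), std::memory_order_release);
    }

    // A scheduler can test all buffers with a single load instead of
    // polling each buffer individually.
    bool any_jobs_pending() {
        return occupancySummary.load(std::memory_order_acquire) != 0;
    }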

The SlaveToMaster buffer does not use its own bit in the occupancySummary; the bit corresponding to the original JBB-priority buffer is used instead. The SlaveToMaster buffer shares its status bit with the JBB-buffer since they both contain jobs with the same priority. The SlaveToMaster buffer acts as a communication lane for the following types of jobs:

Compilation jobs
Jobs which have not yet executed on the IPU Slave and therefore have to be compiled on the IPU Master, since the JIT-compiler is not thread safe.

End jobs
When the chain of blocks executed on the IPU Slave has come to an end, the last job is sent to the IPU Master using the SlaveToMaster buffer.

Non-distributable jobs
The IPU Slave only executes buffered jobs, so if a chain contains any direct or combined jobs, the block sending the non-buffered job has to be executed on the main thread.

The prototype is extended with duplicated instances of the Scheduler class and the LevelDividedRegisters class. The additional Scheduler instance executes in a new loop-function called slaveScheduleJobs. This loop continuously polls the JBB2_SP and the JBB2_IP buffers for jobs to execute. The additional LevelDividedRegisters instance represents the original APZ registers which store the current state of execution; another instance of these registers is therefore required to avoid thrashing the registers when executing applications in parallel.

Work-around for the JIT-compiler

The JIT-Compiler, which produces assembly code corresponding to the ASA-blocks, is not thread safe. Re-designing the JIT-compiler to be thread safe is not within the scope of this thesis; hence the prototype requires that the blocks which are distributed to the IPU Slave are precompiled before they are executed.

This is solved by having the IPU Master as the designated compiler-thread. If the IPU Slave discovers a job addressed to an uncompiled block, it sends the job back to the IPU Master through the SlaveToMaster buffer and waits until the IPU Master unlocks the IPU Slave. The IPU Master retrieves and executes the job once, thus compiling the block at the same time. The IPU Slave is then unlocked and is from then on able to execute jobs for this compiled block.

Figure 4.11: Thread communication model of the prototype.

There are a number of features built into the system that make it difficult to know whether a block is available in a compiled state or not. This is discussed in more detail in subsection 6.3.3.

Executing Jobs in Parallel

Jobs are executed in parallel on pre-compiled assembly code corresponding to ASA10C-blocks. Two unique blocks execute in parallel without any synchronization issues, since each block is a separate unit with specific storage of data and instructions. When a block finishes execution it produces a return signal; on the IPU Slave this signal has to be of the buffered signal type, which is returned to the job buffer where the prototype has control of what is executed.


Figure 4.12: Model of the job flow between the two IPU threads in the prototype.

It is crucial for the prototype to keep track of which blocks are distributable and which IPU thread they are to be distributed to. This is managed by the BlockStatus interface, which consists of a byte array with a size corresponding to the number of blocks available to the prototype. There are 4 096 possible blocks and 8 bytes per element in the BlockStatus array. This creates a memory footprint of 32 KB, which is acceptable on this platform. The BlockStatus interface list is initialized at the start of the prototype. The BlockStatus interface controls if a job:

• is distributable to the IPU Slave

• has been executed once

• has been executed twice

• is the first job in the chain, where the next job is distributable

• needs trace, which means it runs compiled C++ code instead of JIT-compiled code.

The trace bits are activated only on the distributable blocks and on the first and last blocks in chains. The execution of classic non-distributable blocks is not modified.
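A sketch of how such a status array could be laid out follows; the flag layout is an assumption, and only the element size (8 bytes) and block count (4 096) come from the text above.

    #include <cstdint>

    // One 8-byte status word per block; 4096 * 8 bytes = 32 KB in total.
    enum BlockStatusFlags : uint64_t {
        DISTRIBUTABLE  = 1 << 0,  // job may be migrated to the IPU Slave
        EXECUTED_ONCE  = 1 << 1,  // block has run (and been compiled) once
        EXECUTED_TWICE = 1 << 2,  // block has run at least twice
        CHAIN_HEAD     = 1 << 3,  // first job in a chain; next is distributable
        NEEDS_TRACE    = 1 << 4,  // runs compiled C++ instead of JIT code
    };

    constexpr int kNumBlocks = 4096;
    static uint64_t blockStatus[kNumBlocks];  // initialized at prototype start

    inline bool is_distributable(int block) {
        return blockStatus[block] & DISTRIBUTABLE;
    }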

The rerouting of jobs is performed at three sections in the prototype's code and is determined by several tests, which are illustrated in figure 4.13 and described below:

The Signal distributor processes outgoing jobs sent from recently executed blocks.


If the IPU Master enters the signal distributor, the current job is executed in the regular manner if it is non-distributable. However, if the current job is distributable and the sending block of the job is distributable as well, it is a job where the block is to be run for compilation. Otherwise, the job is inserted into JBB2_SP for execution on the IPU Slave.

If the IPU Slave enters the signal distributor, the current job is executed and sent back to the IPU Slave itself through JBB2_IP if it is distributable; otherwise the job is not distributable and is sent back to the IPU Master through the SlaveToMaster buffer.

The JIT-compiler is optimized for the APZ VM and has sections where APZ VM code is inlined by the JIT-compiler to perform insertions into the job buffers of the APZ VM more effectively. The prototype modifies the JIT-compiler to inline code for insertion into the JBB2_IP buffer when compiling distributable blocks.

The Scheduler is divided into two separate instances that are executed on the IPU Master and the IPU Slave.

IPU Master
The Scheduler instance of the IPU Master is almost identical to the original IPU of the APZ VM. The only difference occurs during retrieval of jobs from the JBB priority. The IPU Master first checks if there is a job to retrieve from the SlaveToMaster buffer (which may contain jobs that are not distributable or jobs that are to be compiled for the IPU Slave). If a job is available it is retrieved for execution; otherwise the IPU Master checks if there is a regular JBB priority job pending for retrieval. The reason for picking the SlaveToMaster job first is to limit the wait time for the IPU Slave during compilation of a block on the IPU Master.

IPU Slave
The Scheduler instance of the IPU Slave retrieves jobs only from the JBB2_SP and the JBB2_IP buffers. When a job is successfully retrieved it is executed by the IPU Slave if the destination block has been compiled. Otherwise the block has to be compiled, and the IPU Slave sends the job to the IPU Master through the SlaveToMaster buffer and locks while the IPU Master compiles the block.
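A simplified sketch of the slaveScheduleJobs polling loop follows. The buffer names follow the text, but the types and helper functions here are stubs, not the APZ VM's real interfaces.

    #include <deque>

    struct Job { unsigned long seq; int block; };

    struct CircularBuffer {                  // stand-in for the real circular FIFO
        std::deque<Job> q;
        const Job* peek() const { return q.empty() ? nullptr : &q.front(); }
        Job pop() { Job j = q.front(); q.pop_front(); return j; }
        void push(const Job& j) { q.push_back(j); }
    };

    CircularBuffer jbb2_sp, jbb2_ip, slaveToMaster;

    bool is_compiled(int block);             // stub: queried via BlockStatus
    void execute(const Job& job);            // stub: run the pre-compiled block
    void wait_until_unlocked();              // stub: blocks during compilation
    void sleep_briefly();                    // stub: power saving when idle

    void slaveScheduleJobs() {
        for (;;) {
            // Pick the oldest job across the two inbound buffers, using the
            // sequence numbers to keep the interleaving fair.
            const Job* a = jbb2_sp.peek();
            const Job* b = jbb2_ip.peek();
            if (!a && !b) { sleep_briefly(); continue; }
            Job job = (a && (!b || a->seq < b->seq)) ? jbb2_sp.pop() : jbb2_ip.pop();

            if (is_compiled(job.block)) {
                execute(job);                // run on the IPU Slave
            } else {
                slaveToMaster.push(job);     // send back for JIT compilation
                wait_until_unlocked();       // resume once the master is done
            }
        }
    }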

Figure 4.13: Flow-based diagram of the two IPU-threads.

Testing of the Prototype

The prototype is connected to a cluster of CP-machines, which are briefly described in appendix A. The cluster consists of five active CPs based on an older hardware platform, while the test machine runs on the hardware described in table 4.1 and acts as the sixth CP in the cluster. All CPs in the cluster host an instance of the APZ VM, or in the case of the test machine, an instance of the prototype. The traffic to the cluster is load balanced by a dedicated machine which equalizes the traffic between the different CPs currently connected to the cluster.

The cluster receives traffic from a dedicated traffic generating server. This server is controlled by a user interface and is able to generate both GSM and UMTS traffic. The traffic generated by this server reflects realistic scenarios in telephony networks. An additional type of traffic has been added to the traffic generating server to test the prototype. This traffic consists of jobs destined for the thread safe testing blocks which are described in subsection 4.2.5.

The testing block included in the testing traffic is called PECL000 and is illustrated in figure 4.14. The chain starts with PECL000, which then calls PECL001 and so forth, all the way to PECL023. PECL000 is always executed on the IPU Master, since it is the first block in the chain and marked as not distributable. The blocks return control to the preceding block in the chain: PECL023 returns to PECL022, which returns to PECL021 and so forth. This makes PECL000 the ending block of the chain as well. The PECL000 block has two parameters which can be manipulated during runtime for testing purposes.

Figure 4.14: The new chain of PECL blocks is hooked to the ordinary traffic.

Parameter 1, Block chain length (1-24)
The first parameter controls how many PECL-blocks are to be included in the chain of PECL000.

Parameter 2, Block instructions (1-65535)
The second parameter controls how many instructions each PECL-block executes.

The traffic is load balanced between the six CPs, thus resulting in a sixth of the calls being sent to the prototype. The PECL000-block is then invoked five times per call in the prototype.

The testing is performed with four different kinds of APZ VM application setups:

Old APZ VM
The original version of the APZ VM is used as a reference point for the measurements of the prototype. In this setup the SPU executes on CPU0 and the IPU executes on CPU2.

Prototype with IPU Slave on separate core (Multi-core)
The prototype distributes the IPU Slave to a separate core on the same physical processor as the SPU and IPU Master execute on. The SPU executes on CPU0, the IPU Master executes on CPU2 and the IPU Slave executes on CPU4.

Prototype with IPU Slave on Hyper Thread with IPU Master (HT on IPU)
The prototype distributes the IPU Slave to the Hyper Thread of the core which the IPU Master executes on. The SPU executes on CPU0, the IPU Master executes on CPU2 and the IPU Slave executes on CPU10.

Prototype with IPU Slave on Hyper Thread with SPU (HT on SPU)
The prototype distributes the IPU Slave to the Hyper Thread of the core which the SPU executes on. The SPU executes on CPU0, the IPU Master executes on CPU2 and the IPU Slave executes on CPU8.

The performance is measured by examining the load of the CPUs. The load of the IPU Master always appears as 100% to the operating system and is therefore measured with an internal tool which calculates how much time the IPU Master spends at JBB (traffic) level and above, while execution spent below JBB level is regarded as maintenance (no load). The load of the IPU Slave is measured directly in the operating system with the program top.


Chapter 5

Results

5.1 Experiments

This section presents the results and the analysis of the experiments described in section 4.2.

5.1.1 Spin Experiment

The bars in figure 5.1 represent the real time it takes for the system to execute 16 million add operations on the same variable stored in memory.

The results are described below:

Test 1.1 Runtime: 3.87 s
This test consists of a single thread executing on CPU0.

Test 1.2 Runtime: 3.91 s
This test consists of two threads both executing on CPU0.

Test 1.3 Runtime: 4.52 s
This test consists of two threads which utilize Hyper Threading by executing on CPU0 and CPU8, which are physically mapped to the same processor core. The threads manipulate variables which are located on the same cache line.

Test 1.4 Runtime: 2.08 s
This test consists of two threads running on CPU0 and CPU2, which are physically mapped to two different processor cores. The threads manipulate variables which are located on the same cache line.

Test 1.5 Runtime: 3.61 s
This test consists of two threads which utilize Hyper Threading by executing on CPU0 and CPU8, which are physically mapped to the same processor core. The threads manipulate variables which are located on separate cache lines.


Figure 5.1: Test results of the spin experiment.

Test 1.6 Runtime: 1.95 s
This test consists of two threads running on CPU0 and CPU2, which are physically mapped to two different processor cores. The threads manipulate variables which are located on separate cache lines.

Analysis

The experiment shows how the hardware performs, and the observations are discussed below:

Increased clock frequency
The spin experiment is a CPU-heavy process with one or two concurrent threads. The CPU cores are fully utilized, which activates the dynamic clock frequency feature available in the Nehalem architecture. As mentioned in section 3.1.3, the clock frequency is increased for active cores when fewer cores in total are active. The additional performance from the increased clock frequency is however not sufficient to match the performance of two cores executing in parallel.

Data location impacts the performance of the test
The comparison between Test 1.3 and Test 1.5, and the comparison between Test 1.4 and Test 1.6, shows that both the HT and the dual-core configurations gain a boost in performance when the modified data is not on the same L1 cache line. Even though the performance increase is marginal, it is preferable to allocate the shared data at different locations in the address space to limit cache thrashing between threads.

Hyper Threading performs worse than the single thread in the tests where data is on the same cache line
A comparison between the reference (Test 1.1) and Test 1.3 indicates that there are situations where a program running HT performs worse than a single-threaded program. The exact reason is unknown, but it might be due to the core pipeline being flushed: when the data the two threads work with is on the same cache line, the CPU is signaled each time the cache line is updated and has to invalidate the current execution, flush the pipeline and redo the work.

Use physical cores instead of logical cores if performance and real-time response are essential
Test 1.3 and Test 1.6 show the large difference between the worst and the best scenario when running programs in parallel. The fact that the HT program performs even worse than a single-threaded program in certain situations is another reason to prefer physical cores.


5.1.2 Cache Memory Experiment

The cache memory experiment is, like the spin experiment, a CPU-heavy process which activates the increased clock frequency.

Figure 5.2: Test results of forward array iteration.

The following descriptions refer to figure 5.2:

Test 2.1 Runtime: 3.95 s
This test consists of a single thread running on CPU0. The single thread iterates over the entire array and is used as a reference.

Test 2.2 Runtime: 3.37 s
This test consists of two threads which utilize Hyper Threading by executing on CPU0 and CPU8, which are physically mapped to the same processor core. The threads alternate over the array.

Test 2.3 Runtime: 2.19 s
This test consists of two threads running on CPU0 and CPU2, which are physically mapped to two different processor cores. The threads alternate over the array.


Test 2.4 Runtime: 3.15 s
This test consists of two threads which utilize Hyper Threading by executing on CPU0 and CPU8, which are physically mapped to the same processor core. The threads iterate over separate parts of the array.

Test 2.5 Runtime: 2.01 s
This test consists of two threads running on CPU0 and CPU2, which are physically mapped to two different processor cores. The threads iterate over separate parts of the array.

Figure 5.3: Test results of reverse array iteration.

The following descriptions refer to figure 5.3, which shows the comparison between forward and reverse memory access, where lower numbers are better:

Test 2.6 Runtime: 4.71 s
This test consists of a single thread running on CPU0.

Test 2.7 Runtime: 3.74 s
This test consists of two threads which utilize Hyper Threading by executing on CPU0 and CPU8, which are physically mapped to the same processor core.


Test 2.8 Runtime: 2.41 s
This test consists of two threads running on CPU0 and CPU2, which are physically mapped to two different processor cores.

Figure 5.4: Illustration of how the runtime drops when activating more threads.

The following descriptions refer to figure 5.4:

Test 2.1 Runtime: 3.95 s
This test consists of a single thread running on CPU0. The single thread iterates over the entire array and is used as a reference.

Test 2.5 Runtime: 2.01 s
This test consists of two threads running on CPU0 and CPU2, which are physically mapped to two different processor cores. The threads iterate over separate parts of the array.

Test 2.9 Runtime: 1.36 s
This test consists of three threads running on CPU0, CPU2 and CPU4, which are physically mapped to three different processor cores. The threads iterate over separate parts of the array.


Test 2.10 Runtime: 1.02 s
This test consists of four threads running on CPU0, CPU2, CPU4 and CPU6, which are physically mapped to four different physical processor cores. The threads iterate over separate parts of the array.

Analysis

The cache experiment shows that the new memory controller is designed with bandwidth for the entire CPU, and that it is not possible to utilize the entire bandwidth when only a single core is active. When additional cores are activated, the speedup is almost linear.

A few other observations follow:

Separate data in memory for the different threads keeps the cache invalidations low
As with the spin test, it is better to let the threads be responsible for separate memory areas instead of having them operate on interleaved data. The difference between Test 2.2 and Test 2.4, and between Test 2.3 and Test 2.5, is marginal, but the divided data performs consistently better.

Prefetching
The reverse iterating tests (Test 2.6, Test 2.7 and Test 2.8) all perform worse than the corresponding configurations (Test 2.1, Test 2.2 and Test 2.3) which iterate forward over the array. Prefetching is a hardware optimization, and by reading memory in reverse it is not utilized. Applications prefer reading memory sequentially because memory is fetched from main memory into the CPU's cache in larger chunks. Therefore, when reading the next array element, there is a higher probability that it is in the cache if it is located at a memory address a few steps higher than the preceding one.

Divide the problem
If a task requires several memory accesses, there is much to gain by splitting the calculations into smaller tasks and allocating them to different physical cores, thus issuing more memory requests in parallel. This is shown in figure 5.4: the more cores the array is divided between, the better the performance of the experiment program. The division works best in applications with large data objects or data arrays where the data is scattered in memory. With too many memory accesses the cache is thrashed, and the speed of the memory controller becomes the performance limit of the application.


5.1.3 IPU Simulator

Testing of single core

Figure 5.5: Running the IPU Simulator on one processor core.

The following descriptions refer to figure 5.5:

Test 3.1 Total throughput: 307683 jobs/second
This test consists of a single IPU thread running on CPU0 and all 4000 blocks. The frequencies of generated jobs are uniformly distributed instead of normally distributed, as in the following tests.

Test 3.2 Total throughput: 900783 jobs/second
This test consists of a single IPU thread running on CPU0 and all 4000 blocks.

Test 3.3 Total throughput: 1267417 jobs/second
This test consists of two IPU threads which utilize Hyper Threading by executing on CPU0 and CPU8. The threads share all 4000 blocks.

Test 3.4 Total throughput: 1291267 jobs/second
This test consists of two IPU threads which utilize Hyper Threading by executing on CPU0 and CPU8. The threads divide the blocks equally between each other (CPU0: 0-2000, CPU8: 2001-4000).


Test 3.5 Total throughput: 1460183 jobs/second
This test consists of two IPU threads which utilize Hyper Threading by executing on CPU0 and CPU8. The threads divide the blocks unequally between each other (CPU0: 0-500, CPU8: 501-4000).

Testing of dual cores

Figure 5.6: Running the IPU Simulator on dual processor cores.

The following descriptions refer to figure 5.6:

Test 4.1 Total throughput: 1689883 jobs/second
This test consists of two IPU threads running on CPU0 and CPU2. The threads share all 4000 blocks.

Test 4.2 Total throughput: 1981417 jobs/second
This test consists of two IPU threads running on CPU0 and CPU2. The threads divide the blocks equally between each other (CPU0: 0-2000, CPU2: 2001-4000).

Test 4.3 Total throughput: 2842933 jobs/second
This test consists of two IPU threads running on CPU0 and CPU2. The threads divide the blocks unequally between each other (CPU0: 0-500, CPU2: 501-4000).

Testing of triple cores

Figure 5.7: Running the IPU Simulator on triple processor cores.

The following descriptions refer to figure 5.7:

Test 5.1 Total throughput: 2430333 jobs/second
This test consists of three IPU threads running on CPU0, CPU2 and CPU4. The threads share all 4000 blocks.

Test 5.2 Total throughput: 3322283 jobs/second
This test consists of three IPU threads running on CPU0, CPU2 and CPU4. The threads divide the blocks equally between each other (CPU0: 0-1333, CPU2: 1334-2666 and CPU4: 2667-4000).

Analysis

The experiment shows some interesting behaviors when running the IPU Simulator in parallel. The most noticeable observations are discussed below:


Hyper Threading performance
When utilizing Hyper Threading, the performance increases significantly, by about 30% between Test 3.2 and Test 3.3. The observed improvements from Hyper Threading have previously ranged between 15% and 28%, which places the IPU Simulator in the optimal area of utilization. This significant improvement in performance shows that there are many stalls which Hyper Threading is able to utilize.

Cache coherency
The IPU Simulator is very cache dependent, and performance triples between Test 3.1 and Test 3.2. The critical difference between these two tests is the probability distribution of the jobs. This improvement in performance indicates that cache misses represent a large limitation on performance. An improvement in performance of approximately 25% is also noticed between Test 4.1 and Test 4.2. This improvement indicates that when blocks are bound to specific cores and data sharing is reduced, the IPU benefits from fewer cache synchronizations.

Scalability
The scalability of increasing the number of processor cores that the IPU Simulator executes on is observed between Test 3.2, Test 4.1 and Test 5.1. There is an approximate drop in total performance of 5% for every additional processor core which the IPU Simulator executes on. The parallelization overhead is however outweighed when the sharing of data between processor cores is reduced, as shown in Test 4.2 and Test 5.2.


5.2 Prototype

This section presents the results of testing the prototype described in section 4.2.5. The testing of the different configurations is done in the same way, starting with a low call rate for the APZ VM to warm up with. The call rate is then increased in steps of 1000 to avoid overload. When the cluster starts failing, the traffic generation is stopped and the test is rerun from the beginning. The APZ VM optimizes the blocks when the load hits 60%, and the tests are therefore run twice to give fair and comparable results.

5.2.1 Testing of the Prototype

These default settings are used unless stated otherwise when collecting the results for the following graphs:

• Instructions per PECL-block: 1000

• Number of blocks in each chain: 24

• Calls per second: 1000 - 11000

Figure 5.8: When the load of the APZ VM hits 60%, the JIT-compiler starts to optimize the blocks by removing unused code, thus improving the performance. This is observed in the figure when the load is stepwise increased to the 60% threshold. The APZ VM afterwards reports lower load for the same traffic as before.


Figure 5.9: When the call rate is increased in steps of 1000, the different configurations respond differently in CPU load. The old APZ VM and the Multi-core configuration hit 60% load at around 6000 concurrent calls and are then optimized. The HT on IPU configuration scales better and is optimized at around 10000 concurrent calls. The CPU load of the IPU Slave in the HT on SPU configuration hits its upper limit at around 5000 concurrent calls and then starts to drop calls.


Figure 5.10: When the APZ VM has optimized its blocks, the CPU load increases almost linearly for all configurations. The HT on SPU configuration hits its limit before it gets optimized and is therefore removed from this figure. The last line represents the load of the Old APZ VM without the PECL-blocks. This is the optimal load for the IPU Master, where no overhead is created.

Figure 5.11: The offload of the IPU Master increases linearly with the call rate when running the Multi-core configuration of the prototype. Both the Old APZ VM and the Multi-core configuration run optimized blocks.


Figure 5.12: The offload of the IPU Master increases linearly with the call rate when running the two IPUs on the same core. Both the Old APZ VM and the HT on IPU configuration run optimized blocks.

Figure 5.13: The HT on SPU configuration performs very badly, and there is practically no offload. The Old APZ VM runs optimized blocks while the HT on SPU APZ VM runs non-optimized blocks.


Figure 5.14: The CPU load of the IPU Slave is considerably higher than the CPU load which is offloaded from the IPU Master in the different configurations. This means that the blocks executed on the IPU Slave consume more CPU power than running the same blocks on the IPU Master would. The HT on SPU configuration performs the worst and hits 100% CPU load already at 5000 concurrent calls. The HT on IPU configuration outperforms the other configurations (it is the most efficient IPU Slave), and an upper limit for the HT on IPU configuration is never reached with this cluster setup.


Figure 5.15: The call rate is kept constant at 4000 calls per second in this test. The number of instructions executed by the PECL-blocks obviously affects the CPU load of the Old APZ VM, since it runs all blocks on the same thread. The IPU Master of the two configurations Multi-core and HT on IPU is not as affected by an increased number of instructions executed by the PECL-blocks, because the blocks are migrated to the IPU Slave.

Figure 5.16: The call rate is kept constant at 4000 calls per second in this test. The number of instructions executed by the PECL-blocks which are migrated to the IPU Slave clearly affects the CPU load. However, the increase in CPU load of the IPU Slave diminishes as the executed instructions per block are increased.


Figure 5.17: The call rate is kept constant at 4000 calls per second in this test. The number of blocks in the PECL-chains obviously affects the CPU load of the Old APZ VM, since it runs all blocks on the same thread. The IPU Master of the two configurations Multi-core and HT on IPU is not as affected by an increased number of blocks in the PECL-chain, since the blocks are migrated to the IPU Slave.

Figure 5.18: The call rate is kept constant at 4000 calls per second in this test. The number of blocks in the PECL-chain which are migrated to the IPU Slave linearly increases the CPU load of both configurations in the prototype.


Chapter 6

Conclusions

This chapter presents the conclusions of this thesis and discusses future potentialimprovements for the prototype. The conclusions review the previously stated prob-lems and delimitations in chapter 2.

6.1 Experiments

The experiments are designed to simulate potential usage scenarios of the hardware and give an idea of where improvements are obtained in the design of a fully parallelized application.

All the experiments show better performance for the multi-core configuration, but the prototype scales better when using the HT on IPU configuration, which indicates that the prototype still has a fair amount of data shared between cores. The reason is likely the simple design of the experiments compared to the large and complex prototype; there are more parameters to consider in a full-scale application than in the simplified scenario. A full-scale prototype is necessary to reach conclusive results.

IPU Simulator

• The IPU Simulator is able to utilize the full potential of HT. This means that the APZ VM is likely receptive to utilizing this technique as well. The HT technique is able to increase performance by up to 30% for each core according to the IPU Simulator.

• The performance of the IPU Simulator depends heavily on the cache locality of the blocks which are executed. This is also observed when utilizing functional distribution to reduce data sharing between cores. The APZ VM is likely receptive to functional distribution, which enables utilization of additional cache memory without data sharing.


• The APZ VM is likely to experience marginal drops in effectiveness per core due to data sharing as additional cores are utilized by the application.

6.2 Prototype

The purpose of the prototype is to investigate a series of concepts and solve the previously stated problems in chapter 2.

The main problem is to determine at what level of workload the APZ VM benefits from distributing part of the load onto multiple processor cores.

• The results in figures 5.11 and 5.12 indicate that it is always worthwhile to offload work from the IPU Master to the IPU Slave for the HT on IPU and Multi-core configurations.

• The Multi-core configuration has a slightly lower overhead on the IPU Master compared to the HT on IPU configuration, which is illustrated in figure 5.10.

• The IPU Slave in the HT on IPU configuration is more efficient compared to the Multi-core configuration. Efficiency in this context represents the ratio between the load that the IPU Slave produces and the load offloaded from the IPU Master, which is illustrated in figures 5.11, 5.12 and 5.14.

• The number of instructions executed by each PECL-block has a diminishing performance cost on the IPU Slave, which is illustrated in figure 5.16.

The problem description in chapter 2 also lists several sub-problems, which are discussed below:

Parallelization prototype of the APZ VM
The prototype enables further parallelization by introducing the IPU Slave, which processes new workload introduced as parallelization-safe application blocks. The IPU Slave executes a varying degree of workload depending on the rate of calls and the composition of the blocks in the chain.

Utilize Hyper Threading
The IPU Slave is easily configured to run on any logical CPU available to the operating system. The configuration running the two IPU threads in HT mode shows the best results. The IPU Slave does not produce the same amount of overhead as when running on another physical CPU, which is likely due to the two IPU threads sharing more data than expected. The results also benefit from each core running at a higher frequency when fewer physical cores are active.
The test also shows how the threads' characteristics affect the result when pairing them together using HT. When the IPU Slave runs together with the SPU thread, which is a highly demanding IO thread, the performance of the IPU Slave is terrible. But when pairing the IPU Slave with the IPU Master, the overall result is even better than having the two threads run on different CPUs.

Design rules for block execution
The rules for the blocks executing on the IPU Slave are quite simple and designed with parallelization and backwards compatibility in mind:

Buffered job signals
The requirement of buffered job signals greatly simplifies parallelization. By using buffered job signals, the IPU Slave is able to regain control after every executed job and schedule the next job to ensure parallel safety.

Chains
The chains of jobs that are executed on the IPU Slave can be designed with varying lengths and provide a flexible way to configure how workload is to be shifted to the IPU Slave.

Start and end on the IPU Master
The requirement to start and end every chain on the IPU Master follows from the requirement of backwards compatibility.

Optimize power consumption
The IPU Slave continuously polls the buffers for jobs to execute, but if the buffers are empty the IPU Slave starts to sleep during short intervals. This results in reduced power and cooling consumption during periods of low workload for the IPU Slave, and is observed in the server's supervision tool.

Retain backward compatibility
The backwards compatibility is maintained in the prototype, and new features which are introduced in the prototype appear transparent to the original functionality of the APZ VM.

Measure performance gain
The test machine running the prototype is added to a cluster of machines running the ordinary APZ VM. The traffic is generated with a traffic generating program. The load of the machine is measured using varying traffic on four different configurations. The CPU load of the IPU Master is read internally from the APZ VM using a command that prints the workload. The CPU load of the IPU Slave is read using the Linux command top.

Section 2.1 states several delimitations for the thesis which are discussed below:

Exclusively Intel microarchitecture
The test machine that the prototype is developed for is based on an Intel Nehalem processor. Specifications of the processor are found in table 4.1.


Restrict usage of locks in platform layer
The prototype is free of locks during execution when warmed up, that is, when all blocks have been compiled for the IPU Slave. However, during the startup of the prototype a lock is utilized to halt the execution on the IPU Slave during compilation of its blocks.

Restrict parallelization in the application layer
By utilizing the BlockStatus interface, the prototype migrates only jobs which are destined for blocks which are stated safe to distribute to the IPU Slave.

Restrict prototype to limited number of additional processors
The prototype extends the APZ VM with one additional thread called the IPU Slave. The IPU Slave may be attached to any logical CPU of choice in the operating system.

Restrict changes of preexisting code
The prototype is completely backwards compatible and does not require any changes to the preexisting upper-layer application code. The prototype is also designed to create minimal impact on the original APZ VM, but does however make small modifications in the code. The modifications in the APZ VM are all placed within definitions which the preprocessor of the compiler either keeps or removes; this enables the application to be built with or without the IPU Slave by switching a compiler flag, as sketched below.
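A minimal sketch of such a preprocessor switch follows; the flag name APZ_IPU_SLAVE and the helper functions are assumptions, not the actual APZ VM identifiers.

    struct Job;
    void insertIntoJBB(const Job& job);      // original single-IPU job buffer
    void insertIntoJBB2_SP(const Job& job);  // prototype's master-to-slave buffer

    void routeJob(const Job& job) {
    #ifdef APZ_IPU_SLAVE
        insertIntoJBB2_SP(job);   // prototype path: job goes to the IPU Slave
    #else
        insertIntoJBB(job);       // original path: single IPU thread
    #endif
    }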

Restrict offloading of workload to one job buffer
The IPU Slave processes only jobs which are located on the JBB priority level, and thus offloads only jobs from one job buffer on the IPU Master.

The prototype adds some overhead to the IPU Master compared to the original APZ VM, since the prototype determines in multiple places in the code which thread is running and performs the appropriate action.

6.3 Recommendations

This section describes further recommendations and potential improvements for the subjects of this thesis.

6.3.1 Regarding Inter-thread Communication

The current solution provides communication between the two IPUs through a series of buffers, where one IPU thread is the master and the other is the slave. The IPU Slave sends all of its outbound traffic through the IPU Master. This communication model does not scale very well: the IPU Master will eventually limit the potential performance gain from parallelization of the system once the inter-communication between IPUs reaches a certain threshold. The limitation becomes even more obvious if additional IPUs are added, since they all have to communicate with each other through the IPU Master. The optimal solution would be some sort of parallelization-safe data structure which handles all inter-thread communication between the IPUs. How such a data structure can be implemented, and what the potential performance loss or gain of using it would be, is a large project to investigate and is reserved for further work on this topic.

6.3.2 Further Code Optimization

The tests and results of the modified APZ VM indicate that the IPU Master is relieved of work by the IPU Slave. But they also indicate that the increase of work on the IPU Slave does not match the reduction of work on the IPU Master. This might be due to the overhead added by the additional buffers.

The new blocks are not optimized when executed by the IPU Slave, and the insertion of the jobs is performed in pure C++ functions instead of optimized JIT code as in the ordinary blocks.

The slowdown of executing the blocks with trace is not considered in the test results, and this might be another source of the performance drop on the IPU Slave.

6.3.3 Determine if Block Is Runnable

There are a number of obstacles in the system that make it difficult to know whether a block is available in a compiled state or not. To begin with, the APZ VM has a garbage collector that removes pre-compiled blocks from memory after a while of low activity, which makes it hard to determine whether a job is pre-compiled in memory or not. The solution of doing all the compilation on the IPU Master the first time a distributable block runs is insufficient if the garbage collector starts to invalidate unused blocks. The APZ VM is however shipped with a lot more memory compared to a few years back. This lowers the importance of the garbage collector, and it is even possible to disable the functionality for some time without any complications.

The second APZ VM feature to consider is that the compilation of blocks is not guaranteed to be complete. When a block is executed, the JIT-compiler compiles only the code currently needed. The most obvious case of this occurs in branches in the code: the code of a branch is compiled only when the branch is actually chosen; otherwise the branch contains a stump of code that invokes the JIT-compiler. If a different path of execution occurs in a block, it is likely to reach a stump of code which invokes the JIT-compiler.

There are a few possible solutions for this which have to be considered, and eventually implemented if this thesis results in an actual product. As the compiler is bound to the IPU Master, there has to be some kind of lock or message passing implemented. It is hard to run and compile a partially uncompiled distributable block, but a few possible solutions follow:


Unwind and rerun the job on the IPU Master
To unwind a job, the variables that have changed when running the job have to be reverted. This requires some form of logging which is used when performing a rollback. The job is then sent back to the IPU Master for compilation and execution.

Activate the compiler and send compiled code back to the IPU Slave
Activate the JIT-compiler on the IPU Master only when code has to be compiled. The code is then sent back for execution on the IPU Slave. This removes some of the load from the IPU Master.

These proposed solutions are workarounds for an elementary problem in the APZ VM, and the optimal solution would be to have a thread safe JIT-compiler.

Thread Safe JIT-compiler

The current solution, where the IPU Master compiles and executes the distributable blocks for the first time, in combination with the difficulty of determining whether a block is compiled or not, is far from optimal. The problem with this solution originates from the JIT-compiler not being safe to run in parallel. But if the JIT-compiler is made thread safe, it would remove the limitation of compiling all blocks on the same thread in the APZ VM.

If the JIT-compiler is made thread safe and the design with distributed functionality is kept, there would be less overhead in the application, resulting in an increase of job throughput in the system. The problem with branching blocks would also be eliminated, as the IPU Slave would be able to compile and execute the new branches by itself.

Measuring Cache Misses on the Nehalem Architecture

Measuring the rate of cache misses in the APZ VM and the prototype would provide more exact conclusions regarding the effects of data sharing and data locality. Measuring the cache misses on the Nehalem architecture is however not yet supported by the tools available in the operating system used in this implementation. The measurements of cache misses therefore have to be left out and remain a recommendation for further work on this thesis.


Chapter 7

Bibliography

1. John Meurling and Richard Jeans. The Ericsson Chronicle. Informationsförlaget, Stockholm, Sweden. 2000. 479 pages. ISBN 91-7763-464-3.

2. Johan Erikson and Bo Lindell. The Execution Model of APZ/PLEX - An Informal Description. Mälardalen University. 2009. 48 pages.

3. Michael J. Flynn. Computer Architecture: Pipelined and Parallel Processor Design. Jones & Bartlett Learning. 1995. 808 pages. ISBN 978-0867202045.

4. Gordon E. Moore. Cramming More Components onto Integrated Circuits. Electronics, Volume 38, Number 8, April 19, 1965. 4 pages.

5. David E. Culler and Jaswinder Pal Singh. Parallel Computer Architecture. Morgan Kaufmann Publishers, Inc., USA. 1999. 1019 pages. ISBN 1-55860-343-3.

6. Omer Khan and Sandip Kundu. A Self-Adaptive Scheduler for Asymmetric Multi-cores. GLSVLSI'10, May 16-18, 2010, Providence, Rhode Island, USA. Pages 397-400.

7. Steven Hofmeyr, Costin Iancu and Filip Blagojevic. Load Balancing on Speed. PPoPP'10, January 9-14, 2010, Bangalore, India. Pages 147-157.

8. Daniel Hackenberg, Daniel Molka and Wolfgang E. Nagel. Comparing Cache Architectures and Coherency Protocols on x86-64 Multicore SMP Systems. MICRO'42, December 12-16, 2009. Pages 413-422.

9. Kevin J. Barker and Nikos P. Chrisochoides. An Evaluation of a Framework for the Dynamic Load Balancing of Highly Adaptive and Irregular Parallel Applications. SC'03, November 15-21, 2003, Phoenix, Arizona, USA. Pages 1-14.

10. Sverre Jarp, Alfio Lazzaro, Julien Leduc and Andrzej Nowak. Evaluation of the Intel Nehalem-EX Server Processor. CERN openlab. Technical article. May 2010. 24 pages.

11. Robert Love. Linux Kernel Development. 2004. 332 pages. ISBN 0-672-32512-8.

12. Gunnar Blom, Jan Enger, Gunnar Englund, Jan Grandell and Lars Holst. Sannolikhetsteori och statistikteori med tillämpningar. Studentlitteratur AB. 2005. 420 pages. ISBN 978-91-44-02442-4.

13. Avoiding and Identifying False Sharing Among Threads. Intel. Technical article. 2010. [http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads/]

14. Use Non-blocking Locks When Possible. Intel. Technical article. 2010. [http://software.intel.com/en-us/articles/use-non-blocking-locks-when-possible/]

15. Granularity and Parallel Performance. Intel. Technical article. 2010. [http://software.intel.com/en-us/articles/granularity-and-parallel-performance/]

16. Performance Insights to Intel Hyper-Threading Technology. Intel. Technical article. 2009. [http://software.intel.com/en-us/articles/performance-insights-to-intel-hyper-threading-technology/]


Appendix A

The APZ System

The appendix is not distributed with the report in agreement with Ericsson.
