
Cache Optimised Data Structures and Algorithms for Sparse Matrices
Cache-optimierte Datenstrukturen und Algorithmen für dünnbesetzte Matrizen



Fakultät für Informatik der Technischen Universität München

Bachelorarbeit in Informatik

Cache Optimised Data Structures and Algorithms for Sparse Matrices

Cache-optimierte Datenstrukturen und Algorithmen für dünnbesetzte Matrizen

Author: Alexander Friedrich Heinecke

Supervisor: Prof. Dr. Thomas Huckle

Advisor: Dr. Michael Bader

Submission Date: 15.04.2008


Ich versichere, dass ich diese Bachelorarbeit selbständig verfasst und nur die angegebenen Quellen und Hilfsmittel verwendet habe.

I assure that I composed this bachelor thesis single-handedly, supported only by the declared resources.

München, den 15.04.2008

Alexander Heinecke


Abstract

Several papers have been published in the last two years discussing how to utilize space-filling curves, namely the Peano curve, to implement powerful cache-oblivious algorithms for matrix matrix multiplication and matrix factorisations such as the LU decomposition.

This bachelor thesis describes the extensions and re-implementations of the matrix multiplication algorithm that are needed to implement a suitable multiplication algorithm for sparse matrices. This requires a new block layout of the matrices, which applies well-known concepts such as CSR (Compressed Sparse Row) at block level, and new routines that operate on this structure.

First performance comparisons with the established Intel Math Kernel Library show a highly competitive matrix multiplication algorithm when a sparse matrix is multiplied by a dense one. At the end of this thesis a possible parallelisation approach is presented and compared with the Intel Math Kernel Library.

The source code that was developed for this thesis is available at http://tifammy.sourceforge.net


Zusammenfassung

Mehrere Veröffentlichungen der letzten zwei Jahre beschreiben, auf welche Weise raumfüllende Kurven, namentlich die Peano-Kurve, verwendet werden können, um "Cache oblivious"-Algorithmen für Matrix-Matrix-Multiplikation und Matrix-Zerlegungen, wie die LR-Zerlegung, zu implementieren.

Diese Bachelorarbeit beschreibt Erweiterungen und Neuimplementierungen der Matrix-Matrix-Multiplikations-Algorithmik, um diesen Ansatz auch für dünnbesetzte Matrizen nutzen zu können. Hierfür müssen ein neues Blocklayout der Daten, das sich bekannter Formate wie CSR (Compressed Sparse Row) bedient, und Funktionen, welche auf diesem arbeiten, entworfen werden.

Erste Leistungsvergleiche mit der weit verbreiteten Intel Math Kernel Library zeigen, dass dieser Ansatz eine durchaus vergleichbare Leistung liefert, falls eine dünnbesetzte Matrix mit einer dichtbesetzten multipliziert wird. Der letzte Abschnitt dieser Bachelorarbeit diskutiert einen möglichen Ansatz zur Parallelisierung dieses Multiplikationsalgorithmus und führt einen Vergleich mit der Intel MKL durch.

Der Quellcode, der im Rahmen dieser Bachelorarbeit entstanden ist, ist im Internet verfügbar unter http://tifammy.sourceforge.net


Acknowledgments

I want to thank everyone who helped me write this bachelor thesis.

First I want to thank Dr. Michael Bader, who made it possible for me to write this thesis. He supported me for nearly two years in writing software for matrix applications using space-filling curves, and he constantly gave me valuable hints for tackling the problems I had with my implementations.

I also want to thank several people at chair X for their tips and for letting me use their infrastructure to benchmark my applications. They also gave me hints on implementation details. Thank you to Josef Weidendorfer, Tobias Klug and Michael Ott.

Special thanks go to my parents, who supported me throughout my whole bachelor studies.


Contents

Table of Contents
List of Figures
List of Tables
List of Listings

1 Introduction

2 Cache Architectures
  2.1 Overview
  2.2 Short technical Introduction to Cache Memories
  2.3 Cache Structures of current (Multi-Core) Processors

3 Existing Data Structures for Sparse and Dense Matrix Matrix Multiplication
  3.1 Matrix Matrix Multiplication
  3.2 Storage Schemes and Algorithms for Dense Matrices
    3.2.1 Row-major Storage
    3.2.2 Column-major Storage
    3.2.3 Matrix Matrix Multiplication on these Data Structures
  3.3 Storage Schemes and Algorithms for Sparse Matrices
    3.3.1 Storage Scheme Compressed Sparse Row (CSR)
    3.3.2 Other sparse Storage Schemes
    3.3.3 Matrix Matrix Multiplication on CSR
  3.4 Block Matrix Storage Approaches
    3.4.1 Block Matrix Storage for Dense Matrices
    3.4.2 Block Matrix Storage for Sparse Matrices
  3.5 Space-filling Curves, Peano Order
    3.5.1 Construction of the Peano Element Order
    3.5.2 Grammar of the Peano Element Order
    3.5.3 Matrix Element Ordering based on the Peano Curve
      3.5.3.1 Normal Array Storage
      3.5.3.2 Block Matrices as Elements
    3.5.4 Matrix Matrix Multiplication based on the Peano Curve

4 Developed Data Structures for Sparse Matrix Matrix Multiplication
  4.1 Extended Array Storage
  4.2 Traversed Tree Storage
  4.3 Sparse Matrix Matrix Multiplication
    4.3.1 Using Extended Array Storage
    4.3.2 Using Traversed Tree Storage

5 Locality Properties of the Peano-based Matrix Matrix Multiplication

6 Implementation
  6.1 Changes before Version 1.4.0
  6.2 Extensions to the old Implementation (v1.4.0)
    6.2.1 Data Storage of Blocks
    6.2.2 Matrix Matrix Multiplication
  6.3 Traversed Tree Storage (v2.0.0)
    6.3.1 Data Structure Setup
    6.3.2 Matrix Matrix Multiplication
  6.4 Parallelisation of Multiplication

7 Performance Analysis
  7.1 Single Thread Analysis
  7.2 Multi Thread Analysis
    7.2.1 Two Threads
    7.2.2 Four Threads and Eight Threads
    7.2.3 Parallel Efficiency
  7.3 Performance Counter Analysis

8 Conclusion

A Performance Graphs

B Algorithm Definitions
  B.1 PPP-Multiplication
  B.2 PRR-Multiplication
  B.3 QPQ-Multiplication
  B.4 QRS-Multiplication
  B.5 RQP-Multiplication
  B.6 RSR-Multiplication
  B.7 SQQ-Multiplication
  B.8 SSS-Multiplication

C Matrix Element Access Graphs

D Selected Code
  D.1 Memory Management
  D.2 Traversed Tree Storage
    D.2.1 Constructor of HybridMatrix
    D.2.2 Management Array Setup
    D.2.3 Row/Column Transformation
    D.2.4 Multiplication Algorithm

E Glossary

Bibliography


List of Figures

2.1 Common memory hierarchy in today's systems; characteristic values of latency and bandwidth for the several cache levels are given on the left side.
2.2 Cache structure of the Intel Xeon 5300/5400 series.
2.3 Cache structure of the AMD Opteron series.
3.1 Row-major storage scheme for matrices.
3.2 Column-major storage scheme for matrices.
3.3 Compressed sparse row storage scheme for sparse matrices.
3.4 Construction of the Peano curve and Peano element order; first three iterations shown.
3.5 Relationship between construction and grammar of the Peano order [8].
3.6 Recursive block element numbering of the matrix elements.
3.7 External padding for matrices; values in brackets are the Peano indices of the matrix elements.
3.8 Optimal order of execution.
3.9 PPP-Multiplication algorithm.
4.1 External padding for a matrix, with zero blocks in the inner structure to build sparse matrices.
4.2 External padding for a matrix, level-based zero storage (traversed tree storage).
4.3 Diagonally dominant matrix displayed using an octal tree with the Peano numbering scheme.
5.1 Memory access pattern of dense-dense matrix matrix multiplication.
5.2 Memory access pattern of sparse-dense matrix matrix multiplication; A 2D-Laplacian-like.
6.1 Multi-thread owner-computes approach of the Peano-based matrix multiplication [8].
7.1 Performance per core measured in MFlop/s for all three test systems and 1, 2, 4 and 8 threads.
7.2 Parallel efficiency for all three test systems and 1, 2, 4 and 8 threads.
A.1 Performance per core measured in MFlop/s for all three test systems and 1, 2, 4 and 8 threads.
A.2 Parallel efficiency for all three test systems and 1, 2, 4 and 8 threads.
A.3 Performance comparison with one thread between TifaMMy and Intel MKL 10 on the Clovertown workstation.
A.4 Performance comparison with one thread between TifaMMy and Intel MKL 10 on the Clovertown server.
A.5 Performance comparison with one thread between TifaMMy and Intel MKL 10 on the Barcelona server.
A.6 Performance comparison with two threads between TifaMMy and Intel MKL 10 on the Clovertown workstation.
A.7 Performance comparison with two threads between TifaMMy and Intel MKL 10 on the Clovertown server.
A.8 Performance comparison with two threads between TifaMMy and Intel MKL 10 on the Barcelona server.
A.9 Performance comparison with four threads between TifaMMy and Intel MKL 10 on the Clovertown workstation.
A.10 Performance comparison with four threads between TifaMMy and Intel MKL 10 on the Clovertown server.
A.11 Performance comparison with four threads between TifaMMy and Intel MKL 10 on the Barcelona server.
A.12 Performance comparison with eight threads between TifaMMy and Intel MKL 10 on the Clovertown workstation.
A.13 Performance comparison with eight threads between TifaMMy and Intel MKL 10 on the Clovertown server.
A.14 Performance comparison with eight threads between TifaMMy and Intel MKL 10 on the Barcelona server.
B.1 PPP-Multiplication
B.2 PRR-Multiplication
B.3 QPQ-Multiplication
B.4 QRS-Multiplication
B.5 RQP-Multiplication
B.6 RSR-Multiplication
B.7 SQQ-Multiplication
B.8 SSS-Multiplication
C.1 Memory access pattern of dense-dense matrix matrix multiplication.
C.2 Memory access pattern of dense-dense matrix matrix multiplication with external padding.
C.3 Memory access pattern of sparse-dense matrix matrix multiplication; A five diagonals.
C.4 Memory access pattern of sparse-dense matrix matrix multiplication with external padding; A five diagonals.
C.5 Memory access pattern of sparse-dense matrix matrix multiplication; A 2D-Laplacian-like.
C.6 Memory access pattern of sparse-dense matrix matrix multiplication with external padding; A 2D-Laplacian-like.


List of Tables

2.1 Data cache sizes of Intel's x86 product line since 1990.
2.2 Different types of the cache's associativity.
6.1 Overview of TifaMMy's versions.
7.1 Hardware events measured with one thread on the Clovertown dual-channel workstation.
7.2 Hardware events measured with eight threads on the Clovertown dual-channel workstation.


Listings

3.1 Straightforward naive matrix matrix multiplication.
3.2 Straightforward naive matrix matrix multiplication for quadratic sparse matrices A, B and a dense matrix C.
3.3 Traversal of the element order for the starting symbol P.
6.1 Allocator class to implement a mixed continuous data stream.
D.1 Allocator class to implement a mixed continuous data stream.
D.2 Constructor of the class HybridMatrix.
D.3 Traversed tree array setup (SIMA).
D.4 coord2linear function for calculating the Peano position.
D.5 Matrix matrix multiplication.


1 Introduction

Today, fast operations on matrices and vectors are needed in many implementations of fast numerical solvers. Especially when dealing with partial differential equations, sparse matrices play an important role.

For dense matrix operations a standard interface, called BLAS, is provided by several well-known implementations, such as the Intel Math Kernel Library or GotoBLAS. Sparse matrices are today supported by the Intel MKL and by specialised sparse matrix frameworks and libraries, but not by GotoBLAS. All these implementations have in common that they are heavily hardware-aware: they are designed to be executed on a specific processor.
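The BLAS routine behind dense matrix multiplication is dgemm, which computes C = alpha*A*B + beta*C. As a hedged illustration of that contract only (a naive reference loop, not MKL's or GotoBLAS's actual code, and using an assumed row-major layout), a sketch might look as follows:

```cpp
#include <cstddef>
#include <vector>

// Naive reference for the BLAS dgemm contract: C = alpha * A * B + beta * C.
// A is m x k, B is k x n, C is m x n, all row-major. Tuned libraries
// implement the same contract with blocking and vectorisation; this loop
// nest only fixes the semantics.
void dgemm_ref(std::size_t m, std::size_t n, std::size_t k,
               double alpha, const std::vector<double>& A,
               const std::vector<double>& B,
               double beta, std::vector<double>& C) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;                     // dot product of row i and column j
            for (std::size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}
```

Hardware-aware implementations differ only in how they schedule these loops, not in what they compute.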

Using today's hardware efficiently means implementing suitable algorithms that are able to exploit the offered technologies. Since the current trend leads towards many-core systems, implementations should contain a high amount of parallelism. Besides the increasing number of available CPUs, an algorithm should take the memory wall, which is a bottleneck, into account. Therefore an implementation is normally carefully tuned to the sizes of the CPU caches. An inefficient use of these caches results in an algorithm whose performance is limited by the bandwidth and latency of the system's main memory.

With the multi-/many-core CPUs released in the last years, the use of caches has changed. Since today's dual- or quad-core CPUs may work on the same problem, they might have to exchange data for synchronisation. Therefore current CPUs use shared cache levels: Intel, for example, uses a shared L2 cache for its dual-core CPUs, and AMD uses a shared L3 cache for its quad-core CPUs. These are two completely different ways of implementing the cache hierarchy. A fast hardware-aware BLAS implementation has to be carefully tuned to the different types of cache levels in order to achieve the highest performance.

This bachelor thesis describes an implementation method which abstracts as much as possible from the underlying hardware architecture and requires hardware-aware coding only in the parts where it is essential. This means a cache-oblivious approach is combined with an architecture-aware part. In this case, architecture-aware is a synonym for, e.g., the same ISA (Instruction Set Architecture) or the same implemented concepts, such as out-of-order execution or prefetching.

The presented multiplication algorithm for sparse matrices is an extension of the existing dense matrix matrix multiplication. It uses a cache-oblivious data structure that is motivated by the matrix matrix multiplication itself. This data structure implements a block-recursive matrix element numbering scheme based on space-filling curves, namely the Peano curve. The storage scheme was already available for dense matrices, and two different approaches for using it with sparse matrices are analysed in this bachelor thesis. The first is a simple extension of the available dense matrix implementation, whereas the second uses the traversal of an incomplete octal tree, which is generated by the Peano order if the zero elements of the matrix are taken into account. This block-recursive numbering scheme offers the advantage that a cache-oblivious matrix matrix multiplication can easily be applied, because it was designed with respect to the matrix matrix multiplication. This approach automatically uses all available cache levels in an optimal way, so an application does not have to be rewritten in order to run with sufficient performance on a new system or architecture. References [6, 7, 8, 9] describe how this can be done for dense matrix matrix multiplication and LU decomposition by using a block-recursive structured algorithm.

These previous publications have shown that this approach suffers from poor performance if scalars are used as the smallest units of data in the data structure. Considerably more performance can be achieved if block matrices are used as elements, because fast, hand-optimised kernels can then be used, so that the block operations take place in the fast cache memory.

The previously presented parallelisation scheme was also adapted for the sparse matrix matrix multiplication, although with this approach fair load balancing between the CPUs might not be possible in every case, because of the involved matrices' sparsity and the "owner computes" implementation.

The code implemented during this thesis is able to multiply a dense or sparse matrix with a dense or sparse matrix and store the result in a dense or sparse matrix. As mentioned above, this multiplication algorithm supports a quite simple way of parallelisation.

The MFlop/s rates achieved by this implementation are respectable: the implementation, called TifaMMy, is able to outperform Intel's MKL in all test cases and on all test platforms for the multiplication of a sparse matrix with a dense one.


The code which was implemented during the last two years and during this bachelor thesis can be downloaded at http://tifammy.sourceforge.net. It is published under the GPL.


2 Cache Architectures

Cache memories are used to narrow the performance gap between the main memory and the CPU. Their optimal use plays an important role when high-performance applications are implemented. As already mentioned, TifaMMy uses the cache hierarchy in an optimal way by design of its data structure. To explain the need for and the use of caches, this chapter introduces some basic knowledge about CPU cache structures. The motivation and working principles, as well as the cache structures of current multi-core systems, will be explained.

2.1 Overview

Since their introduction in the late sixties, microprocessors have suffered from a huge problem which is still growing. The computational power of a CPU (Central Processing Unit) grows by a factor of about 1.32 every year [18], whereas the speed of the main memory, also known as RAM (Random Access Memory), increases only by a factor of 1.08 per year. To overcome this problem, which was already known in the fifties [18], extremely fast buffer memories have been attached to the CPU since the late eighties. At first there was only one cache level with a very small size (4-8 KB). The size of this so-called level one cache increased in the following years, and cache memories were integrated into general-purpose CPUs such as the Intel x86 series. As processors became faster, a single buffer memory was no longer enough, so a second cache level was introduced. This level is several times the size of the level one cache, but it has a higher latency than the level one cache; its latency is still much smaller than the latency of the system's main memory. Today's CPUs use a complete cache hierarchy with up to three cache levels. Figure 2.1 describes the way caches are used in current systems.

The technology used for main memory is the reason for the performance gap between the CPU and the memory. Typically, DRAM¹ technology is used for RAM. This has the advantage that a large amount of memory can be packed onto a chip of very small size, but DRAM is quite slow in comparison to SRAM², because its cells need to be refreshed every 20 ms. In contrast to DRAM, SRAM is really fast, but it needs a lot of chip area to be implemented in hardware, so it is not feasible to build the main memory of a computer from SRAM.

¹Dynamic Random Access Memory; capacitors are used to implement this memory.

Figure 2.1: Common memory hierarchy in today's systems. Characteristic values of latency and bandwidth for the several cache levels are given on the left side.

In most cases these cache memories are "small" amounts of SRAM attached directly to the CPU on the CPU's die. As described above, the performance gap between memory and CPU is still increasing, so today's processor manufacturers use different ways to tackle this problem; with every generation of their chips they increase the amount of cache memory available to the CPU.

But the semiconductor industry also uses other mechanisms to reduce the influence of the memory's lack of performance. Only some keywords shall be mentioned here: out-of-order execution of instructions, prefetching of data into the caches to use them in an optimal way, branch prediction in the case of conditional jumps, and many more features that minimise memory traffic.

²Static Random Access Memory; today implemented using six transistors per cell.

However, all these improvements in the logic of current CPUs are not enough to close the performance gap sufficiently. Algorithms offering excellent locality properties, which operate as long as possible on the data stored in the caches, are needed to build fast applications on today's multi-core processors.

2.2 Short technical Introduction to Cache Memories

Today there are up to three cache levels available to manage the data on which the CPU operates; most CPUs have several cache levels (level one (L1), level two (L2) and level three (L3) cache). As a rule of thumb, the higher the level number, the lower the bandwidth that connects the cache with the level above it, while the latency and the size of the cache increase. Table 2.1 shows this for Intel CPUs since 1990.

CPU          L1 Cache   L2 Cache
486          none       none
Pentium      8KB        up to 512KB (motherboard)
Pentium II   8KB        512KB
Pentium III  16KB       512KB
Pentium IV   16KB       up to 2x2MB
Pentium M    16KB       up to 2x2MB
Core 2 Duo   2x32KB     up to 6MB
Core 2 Quad  4x32KB     up to 2x6MB

Table 2.1: Data cache sizes of Intel's x86 product line since 1990.

The word "cache" is derived from the French "cacher", which means "to hide". The several levels of cache are thus transparent to the programmer; they only become apparent when writing high-performance applications that use these cache levels by operating on blocked data structures.

Since caches are extremely limited in size (see the reasons above), the question arises which data should be stored in them. To answer it, the locality properties of memory accesses are taken into account. These accesses show two different kinds of locality:


• Temporal locality: In most cases there are several memory accesses to the same data within a short time slot (e.g. in loops). One heuristic therefore implies that if a piece of data was accessed once, it will probably be accessed a second time in the near future, so this data should be stored in the cache, and it is useful to keep it there as long as possible. Being extremely limited in size, the cache has to evict entries that can no longer be kept because newer data has to be buffered. A lot of theoretical and practical work has been done to keep the replacement policy as simple as possible; today's CPUs use a simplified version of LRU (least recently used), called pseudo-LRU, for this decision.

• Spatial locality: Normally the different parts of an application are not spread across the whole address space of the program. Common data structures, such as arrays, are often stored contiguously in memory. The second heuristic concerning data management implies that if an element stored at position i is accessed, an element stored at position i + k, with k very small, will probably be accessed within the next instructions. Therefore it is useful to also store the (i + k)-th element in the cache.
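To illustrate spatial locality, the following sketch (my own illustration, not from the thesis) sums a row-major 2D array in two loop orders. Both orders compute the same result, but the row-wise order walks memory sequentially and reuses each loaded cache line, while the column-wise order jumps by a full row length on every access.

```cpp
#include <cassert>
#include <vector>

// Sum a row-major matrix row by row: consecutive memory accesses,
// so every element of a fetched cache line is used before eviction.
double sum_row_wise(const std::vector<double>& m, int rows, int cols) {
    double s = 0.0;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            s += m[r * cols + c];
    return s;
}

// Sum the same matrix column by column: each access jumps 'cols'
// elements ahead, touching a new cache line almost every time.
double sum_col_wise(const std::vector<double>& m, int rows, int cols) {
    double s = 0.0;
    for (int c = 0; c < cols; ++c)
        for (int r = 0; r < rows; ++r)
            s += m[r * cols + c];
    return s;
}
```

Both functions return identical sums; only the access pattern, and hence the cache behaviour, differs.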

Due to spatial locality, caches are organised in blocks or lines of 64 to 512 bytes, depending on the cache level. If one byte is addressed, the cache loads the complete cache line that contains the needed value. This way, data stored at neighbouring addresses is automatically loaded into the cache. This feature is also known as a load prefetcher, because it prefetches data that will probably be used in the future. Besides these primitive conceptual prefetchers, current CPUs provide so-called level 2/3 stream prefetchers, which try to analyse the memory access pattern of the application and preload data. In addition, today's CPUs implement software prefetching, which allows the programmer to give the processor a hint about which data the application will need in the future.
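On x86 compilers, such a software prefetch hint can be given with the `_mm_prefetch` intrinsic. The sketch below is illustrative; the prefetch distance of 16 elements is an arbitrary choice, not a tuned value.

```cpp
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

// Sum an array while hinting the CPU to pull the data needed a few
// iterations ahead into the cache hierarchy. The hint is advisory:
// the program is correct with or without it.
double sum_with_prefetch(const double* a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) {
        if (i + 16 < n)
            _mm_prefetch(reinterpret_cast<const char*>(&a[i + 16]),
                         _MM_HINT_T0);
        s += a[i];
    }
    return s;
}
```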

The replacement policy decides which entry has to be written back to main memory if a new value has to be stored in the cache. If this algorithm is free to choose any line (also called a set) in the cache for the new values, the cache is called fully associative: every position of main memory can be stored at any set in the cache. The other extreme is a policy that allows only one set for a given main memory address; such a cache is called direct mapped. Most caches today use a mixture of these schemes: if, for example, a value can be stored at sixteen different sets in the cache, the cache is called 16-way set associative. Table 2.2 gives a rough idea of the different types of associativity, with m being the number of sets in the cache.


type                  | #sets | associativity
direct mapped         | m     | 1
n-way set associative | m/n   | n
fully associative     | 1     | m

Table 2.2: Different types of the cache's associativity.
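The mapping from a memory address to a set can be sketched as follows (line size and set count in the usage example are arbitrary illustrative choices):

```cpp
#include <cassert>
#include <cstdint>

// For a cache with 'num_sets' sets and lines of 'line_size' bytes,
// an address is mapped to a set by dropping the offset bits within
// the line and taking the line number modulo the number of sets.
// Direct mapped is the special case of one candidate set per address
// with one way; a fully associative cache behaves like num_sets = 1.
uint64_t cache_set(uint64_t address, uint64_t line_size, uint64_t num_sets) {
    return (address / line_size) % num_sets;
}
```

With 64-byte lines and 64 sets, for example, addresses 0 and 4096 collide in set 0: a direct mapped cache can keep only one of them, while a 16-way set associative cache can keep both.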

2.3 Cache Structures of current (Multi-Core) Processors

With the introduction of dual and quad-core CPUs, the performance gap became even more severe, because two or more cores communicate with the main memory through one bus connection. Therefore complex cache structures are used in today's processors. These structures usually contain shared cache levels that allow the cores to communicate with each other without using the main memory, saving a lot of overhead and keeping the performance at a high level. Nevertheless, the level one cache is exclusive to every core in today's multi-core processors. As a consequence, complex cache coherence protocols are needed between the exclusive caches, because every cache can contain a copy of the same data. If such data is updated, all other cores must notice that their cache lines storing this data are no longer valid. Today the MESI cache coherence protocol is used for these synchronisations.

In today's multi-core processors and multi-core multi-processor platforms, several different variants of the cache hierarchy shown in figure 2.1 are used. To understand these different cache concepts, figures 2.2 and 2.3 show the cache hierarchies of the current Intel Xeon DP series chips (codename: Clovertown / Harpertown) and of their AMD rival, the Opteron (codename: Barcelona). Both chip families are quad-core processors, which means that they have four separate processor cores in one package. However, they are built in completely different ways: Intel uses two dual-core chips in one package to build its quad-core Xeon, whereas AMD's Opteron is a monolithic chip with all four cores on the same die.

Figure 2.2 shows the cache structure of the Xeon processor. The dual-core building blocks can be identified easily: each of Intel's dual-core chips has a shared level two (L2) cache. Each pair of cores can only communicate with another pair via the FSB (Front Side Bus). This concept may become a bottleneck if a lot of synchronisation is needed during the execution of an application. A second disadvantage of this platform is the memory connection itself: all processors have to load and store their data via the MCH (Memory Controlling Hub). On Intel's multi-processor platform, the MCH becomes the bottleneck of the system, because all sixteen cores have to communicate with the main memory of the system through this one chip. See [8] for details.

Figure 2.2: Cache structure of the Intel Xeon 5300/5400 series.

AMD follows another approach to build its massively parallel x86 platforms. Figure 2.3 describes the layout of a common two-processor system. With all cores on one die, there is a shared level three (L3) cache for all four cores of a processor, so the amount of inter-processor communication might be smaller than on the Intel platform. There is no MCH in the AMD system; every CPU has a built-in memory controller. In addition, every CPU has its own memory, which can be accessed very fast, because in the optimal case it is used by only one processor and the connection to this memory need not be shared with another one. Being a shared memory system, a processor must also be able to access data stored in a memory region maintained by another CPU. For this purpose, AMD released the Hypertransport protocol, which is an open platform. These Hypertransport links are point-to-point connections, so they are not polluted by other processors. This system design might be the right one for small massively parallel systems (up to eight processors / sockets on one mainboard). However, if more chips are to be connected by point-to-point connections, the complexity increases rapidly.

Figure 2.3: Cache structure of the AMD Opteron series.


3 Existing Data Structures for Sparse and Dense Matrix Matrix Multiplication

There are different matrix storage schemes and algorithms available in today's BLAS packages. First, the definition of the matrix matrix multiplication is recalled. The next two parts of this chapter explain common methods for storing and multiplying dense and sparse matrices. These storage schemes and algorithms are relevant for this bachelor thesis because they are used in TifaMMy. In the third part, the block recursive numbering scheme for matrix elements and a multiplication algorithm for this storage scheme are introduced. The block recursive structure is explained very carefully because it is designed for a fast and cache oblivious implementation of the matrix matrix multiplication.

3.1 Matrix Matrix Multiplication

A matrix matrix multiplication is done by the well known formula shown in equation3.1.

A ·B = C (3.1)

There are some conditions on the three matrices A, B and C: if A is an n × m matrix, B must be an m × l matrix, and C will be an n × l matrix, with n, m, l ∈ N0.

Equation 3.2 shows the calculation of one element of matrix C:

c_{n,l} = Σ_{r=0}^{m−1} a_{n,r} · b_{r,l}   (3.2)


The product of two matrices A and B can be calculated by the simple program given in listing 3.1. This function can easily be adapted to the row or column major storage scheme.

void MM_mult(Matrix A, Matrix B, Matrix C)
{
    for (unsigned int p = 0; p < n; p++)
        for (unsigned int q = 0; q < l; q++)
            for (unsigned int r = 0; r < m; r++) {
                C[p,q] += A[p,r] * B[r,q];
            }
}

Listing 3.1: Straightforward naive matrix matrix multiplication.

3.2 Storage Schemes and Algorithms for Dense Matrices

Common dense BLAS implementations are based on one of the two following formats: row-major storage or column-major storage. Which of the storage schemes has to be used for matrices also depends on the programming language used for implementing the application: FORTRAN stores data in column-major order in most cases, while C uses the row-major storage scheme. For performance reasons, simple arrays are used to build these 'data structures', and they are passed by reference to the BLAS library functions.

3.2.1 Row-major Storage

Figure 3.1 shows the principle of the row-major storage scheme. Here the elements of the matrix are stored row-wise in memory. To build an n × m matrix (n rows and m columns), a region of memory has to be allocated that is able to hold n · m elements of the datatype chosen for the matrix. Usually a 1D array is used for this purpose.

Equation 3.3 gives a function for computing the position of an element E, given by its row and column, within the matrix.

E(row, column) = row ·m + column (3.3)

Figure 3.1: Row major storage scheme for matrices.

Formally, this is a function N² → N. Since every row holds m entries, the row position has to be multiplied by the number of elements in one row (which is the number of columns of the matrix) to get the beginning of the needed row. After that, the needed element is addressed by simply adding the column, i.e. the position to be read within that row.

3.2.2 Column-major Storage

The column-major storage is quite similar to the row-major scheme. The only difference is that the elements are stored column-wise in memory. Figure 3.2 describes this graphically.

Equation 3.4 gives a function that returns the position in the data array for a given (row/column) pair in column-major storage.

E(row, column) = col · n + row (3.4)

Figure 3.2: Column major storage scheme for matrices.

In contrast to row-major storage, here all columns hold n entries. First, the beginning of the needed column is calculated by multiplying the number of rows (n) by the column position of the needed element. To get the final position, the row of the element is added to the beginning position of the needed column.
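Equations 3.3 and 3.4 translate directly into index functions. The sketch below (function names are mine, not from the thesis) shows both:

```cpp
#include <cassert>

// Row-major: rows of length m (number of columns) lie consecutively.
unsigned row_major_index(unsigned row, unsigned col, unsigned m) {
    return row * m + col;            // equation 3.3
}

// Column-major: columns of length n (number of rows) lie consecutively.
unsigned col_major_index(unsigned row, unsigned col, unsigned n) {
    return col * n + row;            // equation 3.4
}
```

For a 3 × 4 matrix, element (1, 2) lies at index 1·4 + 2 = 6 in row-major order, but at 2·3 + 1 = 7 in column-major order.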

3.2.3 Matrix Matrix Multiplication on these Data Structures

The algorithm needed for multiplying two dense matrices stored in one of the explained formats is quite similar to listing 3.1 above. But a disadvantage of the presented code must be mentioned here: this method does not use any extensions for being cache efficient, such as blocking or reordering of the matrix elements. Therefore this naive implementation of the matrix matrix multiplication should only be used if small matrices (dimension less than or equal to 400) are multiplied. In this case all data can be stored in the last level caches of the processors and no blocking for an optimal cache use is needed.

Besides blocking techniques, there is a second opportunity to speed up the dense matrix matrix multiplication. The multiplication can be seen as a set of scalar products of two vectors, namely a row of A and a column of B. Today's general purpose processors provide so-called SIMD or vector units. By using these units, four or two elements of the row/column can be processed by one instruction. Since current Intel and AMD CPUs are able to execute such a vector instruction in only one cycle, a speedup by a factor of four or two can be achieved.
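The blocking idea can be sketched as a tiled version of listing 3.1 for square row-major matrices. This is my own illustration; the block size BS is an arbitrary parameter, not a tuned value from the thesis.

```cpp
#include <cassert>
#include <vector>

// Multiply two d x d row-major matrices in BS x BS tiles, so that the
// three tiles currently worked on stay resident in the cache.
void mm_mult_blocked(const std::vector<double>& A,
                     const std::vector<double>& B,
                     std::vector<double>& C, int d, int BS) {
    for (int i0 = 0; i0 < d; i0 += BS)
        for (int j0 = 0; j0 < d; j0 += BS)
            for (int k0 = 0; k0 < d; k0 += BS)
                // classical triple loop, restricted to one tile
                for (int i = i0; i < i0 + BS && i < d; ++i)
                    for (int j = j0; j < j0 + BS && j < d; ++j)
                        for (int k = k0; k < k0 + BS && k < d; ++k)
                            C[i * d + j] += A[i * d + k] * B[k * d + j];
}
```

The result is identical to the naive loop; only the order in which the partial products are accumulated changes, so that each tile of A, B and C is reused many times while it is cached.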

The implementation of TifaMMy presented in [7, 8] uses such a vector implementation for the multiplications of the block matrices. The well known BLAS implementations, such as Intel MKL, GotoBLAS and ATLAS, use these techniques as well.

3.3 Storage Schemes and Algorithms for Sparse Matrices

To begin with, a short and informal definition of a sparse matrix: a matrix is treated as sparse by a linear algebra software package (which means the zero entries are not stored) if doing so is profitable with respect to performance and memory usage. Consequently, both storage schemes above should only be used for matrices in which more than half of the elements are non-zero.
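This rule of thumb can be made concrete by comparing memory footprints. The sketch below assumes 8-byte values and 4-byte indices, and the CSR layout described in section 3.3.1:

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed to store an n x m matrix densely (8-byte doubles).
size_t dense_bytes(size_t n, size_t m) {
    return n * m * 8;
}

// Bytes needed in CSR: one double plus one 4-byte column index per
// non-zero, plus a row index array of n + 1 entries of 4 bytes each.
size_t csr_bytes(size_t n, size_t nnz) {
    return nnz * (8 + 4) + (n + 1) * 4;
}
```

Under these assumptions, CSR wins roughly as long as nnz < 2/3 · n·m; for a 1000 × 1000 matrix with half the entries non-zero, CSR needs about 6 MB versus 8 MB dense.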

The following section introduces sparse matrix storage schemes. Since CSR (Compressed Sparse Row) plays a very important role in TifaMMy's implementation, it is described in detail. Other storage schemes are mentioned but not explained.

3.3.1 Storage Scheme Compressed Sparse Row (CSR)

In the case of sparse matrices, other storage schemes should be evaluated in order to save memory. The scheme used in this bachelor thesis is an extended compressed sparse row scheme. Figure 3.3 demonstrates the implementation of the normal compressed sparse row storage. The grey areas in this figure mark elements which are zero and should not be stored in the data structure.

When using this method, three arrays of data are needed. Two of them are navigation arrays (of datatype unsigned integer); they are pure overhead, because they do not contain any information about the matrix elements' values. Hence this storage scheme should (in the case of double precision floating point values) only be used if less than half of the elements are non-zero.

Figure 3.3: Compressed sparse row storage scheme for sparse matrices.

In CSR, there is a normal value array that contains the values of the non-zero elements in row-wise order. Since the zero elements are counted in equation 3.3, that formula can no longer be used to determine the position of a specific element; therefore the two additional integer arrays are necessary. The simpler one is the column index array: for every element it records in which (logical) column of the matrix the element is placed. The number of entries in the column index array is identical to the number of entries in the value array. To store the row information of the matrix elements, the row index array is used. It has a size of n + 1, with n being the number of rows of the matrix. The i-th entry of this array contains the start position of the i-th row in the value and column index arrays. The (n + 1)-th position of the row index array is filled with the number of elements stored in the whole matrix.
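The interplay of the three arrays can be exercised with a sparse matrix-vector product, the most common CSR operation. This is a sketch using plain CSR, without TifaMMy's zero-row extension:

```cpp
#include <cassert>
#include <vector>

// y = A * x for an n x n matrix A in CSR form.
// row[i] .. row[i+1]-1 are the positions of row i's non-zeros in the
// val and col arrays; row has n + 1 entries, the last one being nnz.
std::vector<double> csr_matvec(const std::vector<int>& row,
                               const std::vector<int>& col,
                               const std::vector<double>& val,
                               const std::vector<double>& x) {
    int n = static_cast<int>(row.size()) - 1;
    std::vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int p = row[i]; p < row[i + 1]; ++p)
            y[i] += val[p] * x[col[p]];
    return y;
}
```

For the matrix [[4,0,1],[0,2,0],[3,0,5]] the arrays are row = {0,2,3,5}, col = {0,2,1,0,2}, val = {4,1,2,3,5}; multiplying with x = (1,1,1) gives y = (5,2,8).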

Remark: In the implementation "TifaMMy", an extended CSR format is used. For a proper implementation it is essential to allow "zero rows" in the CSR format, so a −1 in the row index array marks a row as a zero row.


3.3.2 Other sparse storage schemes

There are several other storage formats for sparse matrices (e.g. CCR, Skyline,...).They are not used in TifaMMy and will not be introduced here. They are describedin detail in [25].

3.3.3 Matrix Matrix Multiplication on CSR

The multiplication of matrices stored in CSR is quite similar to the normal matrix matrix multiplication printed in listing 3.1. Good algorithms try to take the sparsity information contained in the two additional arrays into account in order to reduce the computational complexity.

Listing 3.2 gives an example implementation of a matrix matrix multiplication of two sparse matrices A and B. The result of this multiplication is stored in the dense matrix C. Such an algorithm is used in the new release of TifaMMy developed in this bachelor thesis to implement the sparse block matrix multiplication.

// d is the dimension of all three matrices
void MM_mult_csr(Matrix A, Matrix B, Matrix C)
{
    unsigned int vec_elems_B = 0; // number of elements in a row of B
    unsigned int vec_elems_A = 0; // number of elements in a row of A
    int row_pos_B;                // current position in the row array of B
    int row_pos_A;                // current position in the row array of A
    int col_pos_B;                // column index of the current element of B

    // loop over the rows in B
    for (unsigned int i = 0; i < d; i++) {
        row_pos_B = b_row_array[i];
        if (row_pos_B != -1) {
            vec_elems_B = get_begin_pos_next_row(i, b_row_array) - row_pos_B;
            // loop over all elements in B for row i
            for (unsigned int j = 0; j < vec_elems_B; j++) {
                // loop over the rows of A and C
                for (unsigned int k = 0; k < d; k++) {
                    row_pos_A = a_row_array[k];
                    if (row_pos_A != -1) {
                        vec_elems_A = get_begin_pos_next_row(k, a_row_array) - row_pos_A;
                        col_pos_B = b_col_array[row_pos_B + j];
                        // loop over the row entries in A
                        for (unsigned int e = 0; e < vec_elems_A; e++) {
                            // if there is an entry in the fitting column of A
                            if (a_col_array[row_pos_A + e] == i) {
                                pArr[(k * d) + col_pos_B] +=
                                    a[row_pos_A + e] * b[row_pos_B + j];
                            }
                        }
                    }
                }
            }
        }
    }
}

Listing 3.2: Straightforward naive matrix matrix multiplication for quadratic sparse matrices A, B and a dense matrix C.

Intel's MKL also provides a sparse BLAS interface. Given the high dimensions of sparse matrices, it seems certain that Intel does not use this naive implementation without any modifications.

3.4 Block Matrix Storage Approaches

Since TifaMMy uses a block approach for storing the matrix, this section demonstrates that this is not a completely new idea introduced by this implementation of the matrix multiplication.


3.4.1 Block Matrix Storage for Dense Matrices

There are no well known implementations for dense matrix operations that use an explicit block layout. Nevertheless, current high performance BLAS implementations have to use some kind of blocking to achieve their high performance.

According to [11], a block layout is created "on the fly" during the matrix matrix multiplication. These implementations are hardware aware [11], because the temporarily used buffers are optimally tuned to the sizes of the provided level one and level two caches.

There is a second approach, called Hypermatrix, developed at the supercomputing center in Barcelona by Jose Herrero [14]. This matrix structure is quite similar to TifaMMy: small block matrices that fit perfectly into the level one cache of the CPU are used.

3.4.2 Block Matrix Storage for Sparse Matrices

The above mentioned Hypermatrix structure has also been extended for use with sparse matrices. [15] presents some details of the Cholesky factorisation of sparse matrices stored in the Hypermatrix scheme.

3.5 Space-filling Curves, Peano Order

Especially the row and column major storage schemes suffer from poor performance when the naive matrix matrix multiplication algorithm ("row dot column", printed in listing 3.1) is applied to them. In this case, far jumps in memory appear whenever the end of a row or column is reached.

This section describes another way of traversing a matrix, as presented in [6, 7, 8]: a block recursive numbering scheme of the elements based on space-filling curves. This implementation uses a Peano ordered element numbering inspired by the construction of the Peano curve.


3.5.1 Construction of the Peano Element Order

The Peano element order is defined by a recursive scheme described in [19]; figure 3.4 illustrates it.

• The quadratic area is divided into nine equal-sized sub-areas, which all have the same edge length.

• The parameter interval is divided equally into nine parts, one for each sector of the quadratic area; these parts are pairwise disjoint.

• Each sub-interval contains a suitably transformed Peano order. The whole Peano order is built by concatenating these suborders.

• The transformations necessary for building the suborders are only reflections at the horizontal and vertical axes of the quadratic (sub-)area.

Figure 3.4: Construction of the Peano curve and the Peano element order; first three iterations shown.


3.5.2 Grammar of the Peano Element Order

The four items above for generating the Peano element order can also be described by a grammar, which gives a formal description of the construction.

Figure 3.5: Relationship between construction and grammar of the Peano order [8].

The connection between the construction and the grammar is shown in figure 3.5. The four patterns P, Q, R, S differ only in their traversal direction. A grammar description consists of non-terminals, terminals and productions. For generating the Peano element order, the following symbols and productions are used:

• The set of non-terminals is defined as {P, Q, R, S}. These four symbols represent the patterns shown in figure 3.5. The starting symbol is P.

• The set of terminals is de�ned as {↑, ↓,←,→}.

• The set of productions can also be derived from figure 3.5. Equations 3.5 to 3.8 show the resulting productions. For a correct generation of the element numbering, one additional assumption has to be made: in each production step, all non-terminals have to be replaced simultaneously.

P ←− P ↓ Q ↓ P → S ↑ R ↑ S → P ↓ Q ↓ P (3.5)

Q ←− Q ↓ P ↓ Q ← R ↑ S ↑ R ← Q ↓ P ↓ Q (3.6)

R ←− R ↑ S ↑ R ← Q ↓ P ↓ Q ← R ↑ S ↑ R (3.7)

S ←− S ↑ R ↑ S → P ↓ Q ↓ P → S ↑ R ↑ S (3.8)


With this grammar it is quite simple to state the traversal of the ordering scheme; listing 3.3 shows it for the starting symbol P.

void P(level)
{
    if (level > 0) {
        P(level-1); down();
        Q(level-1); down();
        P(level-1); right();
        S(level-1); up();
        R(level-1); up();
        S(level-1); right();
        P(level-1); down();
        Q(level-1); down();
        P(level-1);
    }
}

Listing 3.3: Traversal of the element order for the starting symbol P.
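Generalising listing 3.3, all four patterns can be folded into one recursive function parameterised by a vertical and a horizontal traversal direction (P = (down, right), Q = (down, left), R = (up, left), S = (up, right)). The cursor-based sketch below is my own illustration, not TifaMMy code; it assigns every cell of a 3^k × 3^k grid its Peano index:

```cpp
#include <cassert>
#include <vector>

// Assigns Peano indices to a d x d grid (d a power of three) by
// walking the curve with a cursor. v/h are the vertical/horizontal
// directions of the current pattern: P=(+1,+1), Q=(+1,-1),
// R=(-1,-1), S=(-1,+1).
struct PeanoOrder {
    int d, next = 0, row = 0, col = 0;
    std::vector<int> index;

    explicit PeanoOrder(int dim) : d(dim), index(dim * dim, -1) {
        int level = 0;
        for (int s = 1; s < d; s *= 3) ++level;
        traverse(level, +1, +1);   // start with pattern P
    }

    void traverse(int level, int v, int h) {
        if (level == 0) { index[row * d + col] = next++; return; }
        // one production step; for P this is exactly listing 3.3:
        traverse(level - 1,  v,  h); row += v;   // P, down,
        traverse(level - 1,  v, -h); row += v;   // Q, down,
        traverse(level - 1,  v,  h); col += h;   // P, right,
        traverse(level - 1, -v,  h); row -= v;   // S, up,
        traverse(level - 1, -v, -h); row -= v;   // R, up,
        traverse(level - 1, -v,  h); col += h;   // S, right,
        traverse(level - 1,  v,  h); row += v;   // P, down,
        traverse(level - 1,  v, -h); row += v;   // Q, down,
        traverse(level - 1,  v,  h);             // P
    }
};
```

For d = 3 this reproduces the serpentine numbering of figure 3.6: column 0 receives indices 0, 1, 2 from top to bottom, column 1 receives 5, 4, 3, and column 2 receives 6, 7, 8.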

3.5.3 Matrix Element ordering based on the Peano Curve

Putting everything together, we can see that the naive row-major or column-major storage scheme for matrices is not a satisfactory solution, especially when dealing with matrices of high dimension. If, for example, a matrix multiplication is applied to row or column-major stored matrices, jumps occur, and the distances between the accessed memory cells grow with the dimension of the multiplied matrices. So the spatial locality of the data accesses is violated and the different cache levels are used in a suboptimal way.

Today's BLAS libraries use row or column-major storage for matrices. To exploit the caches optimally, hardware aware implementations are needed. According to [11], a common approach is to implement temporary buffers which are carefully tuned to the sizes of the different cache levels of a CPU. As a consequence, a new implementation has to be developed for every upcoming CPU.

This bachelor thesis introduces a storage scheme for matrices which is cache oblivious and able to handle dense and sparse matrices. For this purpose the matrix elements are stored using a numbering scheme based on the Peano curve. By using the Peano order for matrix storage in combination with adequate algorithms, the best possible spatial and temporal locality can be reached [6, 9].


Therefore we take a closer look at figure 3.4 again. As mentioned before, the area in which the Peano curve resides is divided into nine equal-sized sub-areas. This principle is now transferred to the elements of a matrix: the element storage and numbering of the matrix is constructed following the route of the Peano curve. The recursive scheme is now a system of square submatrices which together build the whole matrix. Figure 3.6 shows the element numbering for a matrix constructed in this way. In contrast to the row or column major storage scheme, there is spatial locality for both the rows and the columns.

Figure 3.6: Recursive block element numbering of the matrix elements.

As a consequence of the recursive construction, the submatrices in figure 3.6 all have a dimension of 3^n, n ∈ N0. The next subsection presents how to handle "normal" matrix dimensions with this approach. The advantage of this storage scheme is that one can easily apply a matrix multiplication algorithm which, by design, makes optimal use of all available cache levels. Such algorithms are called cache oblivious.


3.5.3.1 Normal Array Storage

The normal array storage was initially described by Christian Mayer in his diploma thesis [9]. It is a straightforward implementation of the recursive numbering scheme for matrix elements based on the Peano curve: the elements are stored in Peano order.

As already mentioned, the dimension of the matrices needs to be a power of three; in that case a complete and simple Peano based element numbering is possible. However, other dimensions are needed, too. To solve this problem, the smallest fitting dimension which is a power of three is selected, and the elements that are not needed (namely columns at the right and rows at the bottom) are simply not stored in the array. These changes have some effect on the optimality of the data structure: it is not as cache efficient as before, because some jumps in memory occur during the calculation of the matrix product, but it is still very good [7, 8]. See [9] for details of this structure, and figure 3.7.
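The choice of the padded dimension can be sketched with a trivial helper (my own illustration):

```cpp
#include <cassert>

// Smallest power of three that is >= n: the logical dimension used
// for the Peano numbering. The padded rows/columns at the bottom and
// right are never stored in the array.
int padded_dimension(int n) {
    int d = 1;
    while (d < n) d *= 3;
    return d;
}
```

A 5 × 5 matrix, for example, is numbered as if it were 9 × 9, which is why figure 3.7 shows Peano indices (in brackets) with gaps.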

Figure 3.7: External padding for matrices; values in brackets are the Peano indices of the matrix elements.


3.5.3.2 Block Matrices as Elements

Results of the last years show that it is not the best choice to use scalars at the lowest level of the recursive construction. Instead, square block matrices with a dimension carefully tuned to the L1 cache size of the executing processor are used; it is optimal if two of these block matrices fit into the L1 cache.

The storage schemes used for these block matrices are the classical ones described above: both row-major and column-major storage are used in the case of dense matrices. With the implementation done in this bachelor thesis, TifaMMy also supports CSR at this block level for building sparse matrices. Remember the remark on the zero rows!

Another advantage of these inner block elements is that an extremely optimised kernel can be implemented for each matrix operation. TifaMMy uses a hand-optimised assembler implementation for the multiplication of two dense blocks. As mentioned above, this kernel is written with SSE 3 (SIMD) support for current Intel and AMD processors.

The performance results presented in [7, 8] demonstrate that TifaMMy's combination of hardware aware and cache oblivious approaches is the right mixture: it is able to outperform well-known BLAS implementations.

3.5.4 Matrix Matrix Multiplication based on the Peano Curve

However, the algorithm presented in listing 3.1 is not cache optimal due to the jumps in memory at the end of every loop. Figure 3.9 therefore shows an algorithm for the multiplication of two matrices stored in Peano order.

All three matrices are in P-storage, so this multiplication is called the PPP-multiplication. It is the starting point for the recursive multiplication algorithm. If there are no scalars or block matrices at this level, 27 calls for all multiplications of the nine submatrices of A and B have to be done. Although there are a lot of possible combinations of the storage formats of the submatrices of A, B and C for the recursive calls, only eight combinations are needed. All eight varieties are shown in figures 3.9 and B.1 to B.8.

The efficient cache usage of the algorithm can be seen by looking carefully at these eight figures: the shown orderings have the best spatial and temporal behaviour. The only


In all three matrices the nine submatrices are numbered in Peano order (here 1 to 9):

    1 6 7
    2 5 8
    3 4 9

Written as triples (A-block, B-block, C-block), the 27 block operations of C = A · B are executed in the following order:

    1 1 1 - 2 1 2 - 3 1 3 - 4 2 3 - 5 2 2 - 6 2 1 - 7 3 1 -
    8 3 2 - 9 3 3 - 9 4 4 - 8 4 5 - 7 4 6 - 6 5 6 - 5 5 5 -
    4 5 4 - 3 6 4 - 2 6 5 - 1 6 6 - 1 7 7 - 2 7 8 - 3 7 9 -
    4 8 9 - 5 8 8 - 6 8 7 - 7 9 7 - 8 9 8 - 9 9 9

Figure 3.8: Optimal order of execution.

operators needed for implementing this system of recursive functions are the move-up and move-down operators in all three matrices. No matter at which step of the algorithm one looks: the elements of the previous step are only one element away, and so are the elements of the next step.

PC0 += PA0 PB0 → QC1 += QA1 PB0 → PC2 += PA2 PB0 → PC2 += RA3 QB1 →
QC1 += SA4 QB1 → PC0 += RA5 QB1 → PC0 += PA6 PB2 → QC1 += QA7 PB2 →
PC2 += PA8 PB2 → RC3 += PA8 RB3 → SC4 += QA7 RB3 → RC5 += PA6 RB3 →
RC5 += RA5 SB4 → SC4 += SA4 SB4 → RC3 += RA3 SB4 → RC3 += PA2 RB5 →
SC4 += QA1 RB5 → RC5 += PA0 RB5 → PC6 += PA0 PB6 → QC7 += QA1 PB6 →
PC8 += PA2 PB6 → PC8 += RA3 QB7 → QC7 += SA4 QB7 → PC6 += RA5 QB7 →
PC6 += PA6 PB8 → QC7 += QA7 PB8 → PC8 += PA8 PB8

Figure 3.9: PPP-Multiplication Algorithm: the 27 block operations in execution order; P, Q, R and S denote the storage variant of the respective submatrix, the subscript its Peano index.


4 Developed Data Structures for Sparse Matrix Matrix Multiplication

This chapter introduces the two implemented data structures and the resulting algorithms for the matrix matrix multiplication used in the sparse matrix implementation. First a quite "simple" structure is presented, which is only a tuned version of the old storage scheme; the second structure is more efficient, but has led to a complete re-implementation of the entire application.

4.1 Extended Array Storage

The first approach of this bachelor thesis extends the already described normal array storage scheme in order to store sparse matrices. It labels every element with a type (zero, non-zero), so that algorithms applied to this data structure are able to determine whether an element is a zero or a non-zero element. As this is only an extension of the already implemented structure, all existing algorithms can be reused if some simple checks of the element's type are added. However, for typical sparse matrices the number of non-zero elements grows only with about O(n), with n being the dimension of the sparse matrices, while this scheme still stores all elements. As a consequence this leads to a big overhead if matrices with dimensions greater than 10⁴ are processed; please see the performance analysis section for details. Figure 4.1 shows an example of this storage structure; N is the symbol for a non-zero element, Z for a zero element.

4.2 Traversed Tree Storage

As described in the previous section, using the extended array storage for sparse matrices with high dimensions can cause a big overhead. To overcome this


[Figure: the padded Peano layout of figure 3.7, with every stored block additionally labelled N (non-zero) or Z (zero).]

Figure 4.1: External padding for a matrix, with zero blocks in the inner structure to build sparse matrices.

problem, a new storage scheme called traversed tree storage is introduced. Here, for each of the nine elements at every level of the recursive construction it is stored whether there are any non-zero elements in the corresponding submatrices one level below. So it is possible to implement algorithms that take this information into account and reduce the overhead of processing big sparse matrices to a minimum.

Figure 4.2 displays which elements are still stored in the array and how they are stored. In addition to this data array, a second structure is needed which contains the information about the non-zeros mentioned above. The traversal used for this storage scheme is a mixture of depth-first and breadth-first traversal, so the information of one recursion level is stored close together. For every submatrix on every level of the recursive construction, the information of the nine following submatrices is stored in sequence in this data structure. This allows algorithms operating on a specific level of the recursive layout to determine for which of the nine possible submatrices they have to start recursive calls. Figure 4.3 illustrates this by presenting the matrix structure as an octal tree. Now it becomes clear why this storage scheme is between the two mentioned


[Figure: the padded layout of figure 3.7 with only the non-zero elements stored; the labels give the array position and, in brackets, the Peano index.]

Figure 4.2: External padding for a matrix, level based zero storage (traversed tree storage).

traversals. The information is stored level-wise from left to right. If the last level of the tree is reached during the traversal in an algorithm, this array contains the index of the corresponding element in the data array. So only the non-zero elements need to be stored in the data array; these elements are numbered by the Peano scheme and zero elements are simply ignored.

Besides the efficient implementation of sparse matrices, the traversed tree storage can also be used to tackle another problem of implementing matrix operations based on space filling curves: it can be used for the external padding, so the implementation of the algorithms becomes much easier. In the old implementation a lot of work had to be done to allow all matrix dimensions, using a complicated version of the matrix matrix multiplication algorithm. The logic about the structure was spread over the application: one part was in the data structure, which stores only those elements that are available, and a second part was in the algorithm(s), which had to deal with the fact that the data array need not contain a complete Peano traversal of the matrix, as shown in figure 3.7.


[Figure: the block structure of a diagonally dominant matrix as a tree; zero subtrees are pruned.]

Figure 4.3: Diagonally dominant matrix displayed as an octal tree with Peano numbering scheme.

4.3 Sparse Matrix Matrix Multiplication

The second part of this chapter covers the extensions of the matrix matrix multiplication algorithm needed to deal with the two data structures for sparse matrices.

4.3.1 Using Extended Array Storage

In this case the algorithm developed by Christian Mayer can be kept, but some small changes are needed: before a multiplication is applied to the elements, it is checked whether all three needed elements of A, B and C are non-zero. Only in this case the multiplication takes place.

Therefore the old algorithm can be reused without any change, because the checks are done in the block matrix multiplication algorithm (remember: TifaMMy uses block matrices as matrix elements due to the higher performance). However, this implementation has the disadvantage that was already mentioned above: by


increasing the dimensions of the three matrices, the overhead of the multiplication becomes higher. The reason is the complete traversal of all three matrices, which may contain many zero elements.

4.3.2 Using Traversed Tree Storage

For the traversed tree storage scheme a completely new algorithm had to be designed. In contrast to the old implementation, which checks for zero elements just before multiplying them, this new version checks for complete zero submatrices at every function call of the eight equations presented above. For this purpose the management information is used, which records the zero submatrices at every level of the block recursive construction of the matrix. Only if all three needed submatrices of A, B and C are non-zero matrices are the recursive calls done.

The advantage of this implementation is that not the complete octal tree has to be traversed when the multiplication is performed. So memory for storing per-element zero/non-zero information is saved by design, and fewer function calls take place during the multiplication. Using this new method it is possible to execute any sparse matrix matrix multiplication with a constant MFlop/s rate, as shown in the performance analysis section.


5 Locality Properties of the Peano based Matrix Matrix Multiplication

This short chapter demonstrates that the block recursive approach is also useful if sparse matrices are multiplied. Since the plots are large, most of the access patterns discussed are given in the appendix.

The following test cases are shown in the plots:

• A, B, C dense and quadratic with dimension 468, no external padding is needed (52 · 9, see [8] for details)

• A, B, C dense and quadratic with dimension 350, external padding is needed

• A sparse and quadratic with five diagonals and B, C dense and quadratic with dimension 468, no external padding is needed

• A sparse and quadratic with five diagonals and B, C dense and quadratic with dimension 350, external padding is needed

• A sparse and quadratic with a 2D Laplacian-like layout and B, C dense and quadratic with dimension 468, no external padding is needed

• A sparse and quadratic with a 2D Laplacian-like layout and B, C dense and quadratic with dimension 350, external padding is needed

Figure 5.1 shows the access pattern for a dense matrix matrix multiplication without external padding. We can see that there are no jumps in the access pattern.

The jumps that occur if the multiplication is executed on a set of matrices for which external padding is needed also do not influence the locality of the data in a destructive way, as shown in figure C.2. These jumps are not that far in the


[Plot: Peano index of the accessed element (0 to 80) over the operation number, with one curve each for A, B and C.]

Figure 5.1: Memory access pattern of dense dense matrix matrix multiplication.

memory. Therefore serious cache misses do not occur, because the zero elements are not stored; consequently these logical jumps do not occur in memory.

In figures C.3 and C.4 the access patterns for the five diagonal sparse matrix A are printed. We can see that the access pattern is nearly optimal in the matrices B and C. In matrix A a lot of jumps occur. However, these jumps do not occur in memory if the traversed tree storage is used for the matrices, because only the move-up and move-down operators are applied and logically separated elements are stored close together by this storage scheme. Even if external padding is needed for multiplying "general" matrices, there are no serious violations of this optimal access pattern.

Figures 5.2 and C.6 demonstrate the access pattern of the second sparse matrix multiplication; here a 2D Laplacian-like matrix A is tested. The first figure plots the access pattern of the sparse matrix multiplied by a dense one without external padding. The multiplication using external padding, shown in figure C.6, also has good temporal and spatial locality. The jumps which are obviously contained in these two sparse access patterns are also only logical jumps; therefore the caches are used in an optimal way.


[Plot: Peano index of the accessed elements (0 to 80) over the operation number, with one curve each for A, B and C.]

Figure 5.2: Memory access pattern of sparse dense matrix matrix multiplication; A 2D Laplacian-like.

These figures demonstrate that it is suitable to use the cache oblivious approach also for the processing of sparse matrices. The performance analysis section points out that these optimal memory access patterns lead to an extremely fast sparse dense matrix matrix multiplication.


6 Implementation

This chapter describes the implementation done in this bachelor thesis. Two versions of TifaMMy have been developed. The first one is an extension of the TifaMMy implemented by Christian Mayer and me during 2006 and 2007. The second implementation is a complete redesign of the framework from scratch, which implements the traversed tree storage for the matrix elements.

At the end, the parallelisation scheme described in [8] is revisited, because it can also be used with the sparse matrix matrix multiplication. Both versions discussed below support this kind of parallelisation.

The versions of TifaMMy that were implemented during this bachelor thesis are 1.4.0 and 2.0.0.

6.1 Changes before Version 1.4.0

The version of TifaMMy released with Christian Mayer's diploma thesis [9] was 1.0.0. It contains algorithms for dense matrix matrix multiplication (of all dimensions) and for LU decomposition (for matrices whose dimension is a power of three). Scalars are used as matrix elements. This library was released in January 2006.

Further development was done to improve the performance and functionality of TifaMMy. The first step was to enable the use of block matrices in TifaMMy, so that the first cache level can be exploited optimally. In version 1.1.0 the matrix matrix multiplication was improved by using block matrices with highly hand-tuned multiplication kernels utilizing SSE instructions. Version 1.2.0 enables the use of block matrices for the LU decomposition; for this, several block operators had to be implemented which enable the use of block matrices at the lowest level of the Peano recursion. Versions 1.1.0 and 1.2.0 support all four common floating point formats (float, double, complex and complex double) and were implemented during a SEP


(Systementwicklungsprojekt) by Alexander Heinecke, Stephan Günther and Robert Franz at Technische Universität München from September 2006 until May 2007.

In summer 2007 TifaMMy was extended a second time: support for SMP systems was added. In the case of the matrix matrix multiplication it was possible to implement an owner-computes approach which scales very well, as shown in [8]. A parallelisation of the LU decomposition has been done as well; however, it does not scale very well due to the synchronisation points. Therefore performance results of the parallel LU decomposition are not published yet and have to be reviewed. This is the last stable release of TifaMMy which supports all above mentioned floating point formats; it can be downloaded as version 1.3.2. Table 6.1 summarizes all these version infos.

Version  description
1.0.0    initial release; only scalars as elements are possible
1.1.0    block matrix support for matrix matrix multiplication
1.2.0    block matrix support for LU Decomposition
1.3.2    parallel version of matrix matrix multiplication
1.4.0    support for sparse matrix matrix multiplication
2.0.0    only support for sparse matrix matrix multiplication

Table 6.1: TifaMMy's versions overview.

6.2 Extensions to the old Implementation (v1.4.0)

Version 1.4.0 of TifaMMy uses the data structure and algorithm presented in section 4.1. Only "a few" changes were needed to implement this version. Version 1.4.0 supports sparse matrices only for the data type double.

6.2.1 Data Storage of Blocks

This subchapter explains the new data allocator used in TifaMMy versions 1.4.0 and 2.0.0. In order to redesign TifaMMy for dealing with sparse matrices, the following extension is needed: it is essential to allocate different types of block matrices in a continuous array to save them "in Peano order". Therefore the class described in listing D.1 implements an abstraction which allows storing the data of a dense block and a CSR block in the same array.


The smallest unit of data available, a byte, is used to calculate the size needed for storing the data of the different blocks. This is done by the functions set_Num_Dense_Blocks and set_Num_CompRow_Blocks. After that it is possible to get the number of needed bytes. In order to achieve best performance, every substream for every block is stored 64 byte aligned, which makes padding in the continuous array necessary.

By using iterator-like functions, a single block can be initialized by calling one of the following methods: get_next_pos_dense and get_next_pos_comprow.

6.2.2 Matrix Matrix Multiplication

The other changes needed to implement sparse matrices in 1.4.0 are straightforward. However, the changes required in the matrix matrix multiplication need to be described. Only the block operations have to be changed; the recursive Peano multiplication can be reused.

The following new block methods are needed to support a complete set of sparse matrices:

• dense · dense = dense

• sparse · dense = dense

• sparse · sparse = dense

• sparse · sparse = sparse

These different operators are simply integrated into the multiplication operator of the block matrix. Before the multiplication takes place, there is a check whether all three involved blocks are non-zero blocks. This leads to the code for the multiplication shown in listing 6.1. Only dense · dense = dense is implemented using SSE assembler instructions. In all other operations at least one matrix is a sparse matrix, so a vector approach is very difficult to implement and is not done in this first implementation.

void BlockMultiplication(BlockMatrix A, BlockMatrix B)
{
    if (A.get_leaf_type() == DENSE && B.get_leaf_type() == DENSE
        && leaf_type == DENSE)
    {
        // Matrix Multiplication
    }
    if (A.get_leaf_type() == COMPROW && B.get_leaf_type() == DENSE
        && leaf_type == DENSE)
    {
        // Matrix Multiplication
    }
    if (A.get_leaf_type() == DENSE && B.get_leaf_type() == COMPROW
        && leaf_type == DENSE)
    {
        // Matrix Multiplication
    }
    if (A.get_leaf_type() == COMPROW && B.get_leaf_type() == COMPROW
        && leaf_type == DENSE)
    {
        // Matrix Multiplication
    }
    if (A.get_leaf_type() == COMPROW && B.get_leaf_type() == COMPROW
        && leaf_type == COMPROW)
    {
        // Matrix Multiplication
    }
}

Listing 6.1: Element type checks in the block multiplication operator.

6.3 Traversed Tree Storage (v2.0.0)

This subchapter introduces the new features of the re-implementation of TifaMMy. The biggest changes have been made to the data structure; they are described in the upcoming subsections. The used multiplication algorithm and block operations are quite similar to the old ones, but not identical.


6.3.1 Data Structure Setup

Here the implementation of the data structure described in section 4.2 of this bachelor thesis is explained. The respective code listings are given in the appendix: D.2, D.3 and D.4.

The first of these three listings is the constructor of the so-called HybridMatrix, a class that implements block matrices stored in Peano order. In contrast to the old implementation, the already mentioned management array is needed, which records whether a block is a zero block; this array is called sima in the source code. bima is the array which contains the block information; its entries hold pointers into the continuous data stream as member variables to address the matrix data, stored in correct Peano order.

First of all, the number of needed blocks (the number of non-zero blocks) is calculated. This is important to get the number of needed bima entries.

The setup of the sima array is more complicated. At the beginning the number of its elements has to be determined. In this first implementation this number is the maximum possible number, which has the advantage that a simple array can be used to implement sima. Upcoming versions of TifaMMy will probably use the STL implementation of a vector to store the sima information.

After the allocation, the sima array has to be initialized. For this the function init_sima_bima is called, which also sets up the data structure of the sparse block matrices. The second code example shows this method. As you can see, this is again a recursive function, which is called for each of the nine subareas per level. If any non-zero element is stored in one of these subareas, the sima array contains a value greater than or equal to zero; this value is the position in the sima array where the information of the next recursion step (on the lower level) is stored. If the last level is reached, the values are the indexes of the corresponding blocks in the bima array. A value equal to -1 labels a subarea as a zero area, and no recursive or block multiplication calls are done for it.

Besides the setup of the data structure, a function is needed which allows "traditional" access to the data stored in the matrix. The function coord2linear provides this functionality and is given in the third code example. It works in quite a similar way to init_sima_bima.


6.3.2 Matrix Matrix Multiplication

The new implementation of the matrix matrix multiplication exploits the above described data structure by dramatically reducing the number of handled zero blocks. Listing D.5 displays the important parts of the new matrix matrix multiplication.

At the end of the example the function MulAdd_MM_PPP is printed. This method is a straightforward implementation of the scheme shown in figure 3.9. The function distinguishes two cases: because of the implementation of the sima array described above, it has to be checked whether the current level of the recursion is the last level. If this is the case, the block multiplication operator is called; in all other cases recursive calls for the nine subareas are done. In contrast to the old implementation, this new algorithm checks before every recursive call or block operation whether all three involved block matrices or submatrices contain non-zero elements. Only in this case the operation is performed.

With this implementation not only the "problem" of the sparse matrices can be tackled: it can also be used for implementing the external padding by the design of the data structure, instead of sophisticated extensions of the algorithm. This makes implementing new algorithms for Peano structured matrices very simple, because with the traversed tree storage all information about the matrix is stored in its data structure and need not be re-implemented in every single algorithm on the matrix.

[Figure: the Peano order of the result matrix C split into four equally sized parts, numbered 1 to 4, one per thread.]

Figure 6.1: Multi-threaded owner-computes approach of the Peano based matrix multiplication [8].


6.4 Parallelisation of Multiplication

The parallelisation already developed for the dense matrix matrix multiplication was also adopted for the sparse matrix multiplication. It works on the same principle as described in [8]: the Peano order in matrix C is split up into equally sized parts, and the number of parts should match the number of processors available to perform the matrix multiplication. Figure 6.1 illustrates this. This method of implementing the parallelisation is called "owner computes". It has the advantage that only one processor writes data into a specific result block of matrix C, so no synchronisation is needed and cache pollution is avoided.

However, this parallelisation only works and scales well if the sparsity pattern or access pattern of the multiplication matches the splitting of the result blocks among the processors. It might be possible that for other, untested cases this scheme of load balancing is not satisfactory.


7 Performance Analysis

The last chapter of this bachelor thesis compares the computational power of TifaMMy with that of Intel's Math Kernel Library. Only the test case sparse · dense = dense is regarded in this section. For the sparse matrix A a 2D Laplacian-like matrix is used. For this synthetic test, every element of matrix B is set to 1 and matrix C is completely zero before the multiplication of A and B.

The following systems were used to generate the plots:

• Dual Intel Xeon X5355 quad-core workstation (Clovertown, 2.66 GHz, 2-way) with 6 GB of FB-DIMM memory connected to the MCH via dual channel

• Dual Intel Xeon X5355 quad-core server (Clovertown, 2.66 GHz, 2-way) with 8 GB of FB-DIMM memory connected to the MCH via quad channel

• Dual AMD Opteron 2347 server (Barcelona, 1.9 GHz, TLB enabled, 2-way) with 16 GB of DDR2 memory connected via AMD's NUMA technology

The performance data of TifaMMy is examined from three different angles. First the single thread (single processor) performance is analysed.

In a second step, the parallel implementation of TifaMMy is run on this sparse test. Since Intel's and AMD's quad-core CPUs are used (Intel builds them by putting two dual-core CPUs into one package, while AMD's quad-cores are monolithic), three different analyses are done in the parallel case: with two, four and eight threads.

At the end, a performance counter analysis is done to prove that TifaMMy makes optimal use of the available caches.


7.1 Single Thread Analysis

Plot A.3 demonstrates that TifaMMy is about 50% faster than Intel's Math Kernel Library in version 10 on the Clovertown workstation. The differences between TifaMMy versions 1.4.0 and 2.0.0 are obvious: TifaMMy 1.4.0 shows the already mentioned growing lack of performance when increasing the dimensions of the test matrices. This is a result of the growing number of zero blocks which are processed by the logic of TifaMMy even though these blocks are not multiplied, and it leads to a falling chart. TifaMMy 2.0.0 instead produces a horizontal plot due to the better scaling data structure. The small zig-zags in all plots of the Clovertown workstation are probably a result of the used kernel (2.6.18), which is needed because of the VTune utility.

More performance can be achieved on the Clovertown server system. This system provides a quad channel memory interface, so more bandwidth is available, which explains the slightly higher MFlop/s rate in figure A.4. The difference between TifaMMy 2.0.0 and TifaMMy 1.4.0 becomes a little clearer on this machine. TifaMMy is able to achieve about twice of MKL's performance on this platform.

The test results on the Barcelona system are interesting: there is no difference between the two implementations of TifaMMy. This may have several reasons. The logic and arithmetic units seem to be much slower than on the Xeon processor; the performance gap between the Xeon and the Barcelona is larger than the difference in clock speed of the two processors. Therefore the advantage of the traversed tree storage in version 2.0.0 may not be exploited. Figure A.5 shows the plot of the one-thread test on the Barcelona. On the Barcelona, too, TifaMMy outperforms Intel's MKL.

7.2 Multi Thread Analysis

In this subchapter the parallel implementation, which was adopted from the dense case, is benchmarked. Since no specifically tuned implementation for sparse matrices is used here, the results might not be optimal. Nevertheless it is compared with the parallel version of Intel's MKL and shows competitive performance.


7.2.1 Two Threads

When using two threads, TifaMMy 2.0.0 runs optimally on the Xeon platforms, as demonstrated in figures A.6 and A.7. On the Barcelona (figure A.8) TifaMMy 1.4.0 still shows the same performance as TifaMMy 2.0.0. This is a further hint that on the AMD system some factor other than the memory connection is the limiting one.

On the Clovertown server the speed difference between the two TifaMMy implementations is obviously greater than on the Clovertown workstation. This is a further indicator of the memory boundedness of the sparse matrix matrix multiplication.

The comparison with Intel's MKL is identical on all three platforms: TifaMMy is able to outperform this common sparse matrix routine. On the Clovertown server system TifaMMy seems to have a slightly higher parallel efficiency. This aspect of the implementation's test is analysed in detail in section 7.2.3.

7.2.2 Four Threads and Eight Threads

Figures A.9 to A.14 present the highly parallel performance tests of TifaMMy. The performance measured on all three systems when using eight threads (or processors) scales very poorly. On the Xeon systems the performance with eight threads is even lower than with four threads. The Barcelona system yields a significantly better scalability; this might be a consequence of the level-three cache shared by four cores in the AMD system. The performance reached with eight threads is about five to six times higher than executing TifaMMy with one thread, whereas on the Xeon systems only about three times the single-thread performance can be achieved.

Nevertheless TifaMMy is able to calculate the matrix products faster than Intel's MKL. But with increasing dimension (the data of the sparse matrix no longer fits into the cache) the performance of TifaMMy also slows down heavily. Both TifaMMy and MKL suffer from a falling plot, which again indicates the memory boundedness of the sparse matrix matrix multiplication.


7.2.3 Parallel Efficiency

To compare the scalability of TifaMMy and MKL on the Xeon and the Barcelona platforms, the parallel efficiency is considered:

E(p) = F(p)/(p · F(1)) (7.1)

This formula gives the efficiency when using p threads, where F(1) denotes the average performance in terms of MFlop/s measured for a single thread, p is the number of threads, and F(p) is the average performance using p threads [8].

All values used to calculate the parallel e�ciency are the average values of theperformance plots given in the appendix.

[Bar chart omitted: MFlop/s per core for TifaMMy 1.4.0, TifaMMy 2.0.0 and Intel MKL 10 with 1, 2, 4 and 8 threads on the X5355 dual-channel, X5355 quad-channel and Opteron 2347 systems.]

Figure 7.1: Performance per core measured in MFlop/s for all three test systems and 1, 2, 4 and 8 threads.

Figure 7.1 gives the MFlop/s rates measured in the performance tests. These results confirm the impression gained from the plots analysed above: the Xeons have much more computational power, but their scalability is very poor. In contrast, the Opteron system benefits from its shared level-three cache and therefore scales very well for up to four threads/CPUs. If all eight CPUs are used, only the efficiency of the Opteron system is satisfactory.

[Bar chart omitted: parallel efficiency for TifaMMy 1.4.0, TifaMMy 2.0.0 and Intel MKL 10 with 1, 2, 4 and 8 threads on the X5355 dual-channel, X5355 quad-channel and Opteron 2347 systems.]

Figure 7.2: Parallel efficiency for all three test systems and 1, 2, 4 and 8 threads.

Figure 7.2 gives the parallel efficiency calculated with equation 7.1. In most cases TifaMMy is not as efficient as Intel's MKL. This is a consequence of using the average values of the performance plots: on the Xeon systems the zig-zag mentioned above lowers these averages. The figure demonstrates once more that the bottleneck of the Xeon systems is their shared memory interface. The quad-channel server system shows slightly better values than the dual-channel system.

7.3 Performance Counter Analysis

The measured results should be substantiated by some performance counter analysis. Cache hit rates and bus utilisation are presented in tables 7.1 and 7.2. The first table gives

51

Page 71: Cache Optimised Data Structures and Algorithms …Cache Optimised Data Structures and Algorithms for Sparse Matrices Cache-optimierte Datenstrukturen und Algorithmen für dünnbesetzte

7. Performance Analysis

Ratios/Event              TifaMMy        MKL
L1 Hit Rate               92.84%         98.04%
L2 Hit Rate               97.87%         95.88%
L2 Lines demanded         20.7 · 10^6    40.8 · 10^6
L2 Lines prefetched       336.2 · 10^6   327.3 · 10^6
Cache invalid             7.0 · 10^6     5.0 · 10^6
DTLB Hit Rate             99.99%         99.99%
Stall cycles ratio        0.26           0.23
Clocks per Instruction    0.515          0.527

Table 7.1: 1-threaded measured hardware events on the Clovertown dual-channel workstation.

Ratios/Event              TifaMMy        MKL
L1 Hit Rate               91.49%         97.94%
L2 Hit Rate               92.13%         92.10%
L2 Lines demanded         328.1 · 10^6   429.4 · 10^6
L2 Lines prefetched       136.9 · 10^7   175.6 · 10^7
Cache invalid             99.0 · 10^6    75.0 · 10^6
DTLB Hit Rate             99.99%         99.99%
Stall cycles ratio        0.76           0.69
Clocks per Instruction    1.792          1.465

Table 7.2: 8-threaded measured hardware events on the Clovertown dual-channel workstation.

the values for a single-threaded calculation. The second table presents the values of the same counters using all eight CPUs for the calculation of the matrix matrix product. The Clovertown workstation was used to collect these values.

The measured hardware counters are the same as the ones described in [8].

These results are very interesting. TifaMMy's level-one cache hit rate, which was perfect in the dense case, is not perfect anymore. This may be a consequence of the chosen block size of the inner block matrices. These can be carefully tuned to the L1 cache size when all blocks are dense. But if CSR blocks are used, the amount of data processed in one block operation may vary, and no optimal block size can be determined in a simple way.

Although MKL has several values that may seem better at first sight, one counter is extremely performance critical: L2 lines loaded on demand. These are the L2 misses where data is loaded on demand, i.e. the processor was not able to find the needed data in any available cache level and the data had to be transferred from main memory. The value of this counter is significantly lower for TifaMMy than for Intel's MKL. In addition, TifaMMy also has a better prefetch rate if only one core is used. These are exactly the expected results when using a space-filling-curve based algorithm.

The stall cycles ratio shows that the sparse matrix matrix multiplication is a memory-bounded application. Most of the CPU time is wasted waiting for data to be transferred from main memory to the CPU. The high stall rates of about 70% are responsible for the poor parallel efficiency shown above.

The fact that the processor spends a lot of time waiting for its data is also confirmed by the CPI rate. If only one processor executes the matrix multiplication routine, about two assembler instructions can be retired per CPU cycle. When all eight processors are used, however, this rate increases dramatically to values above 1.5, which means the processor needs more than one cycle to complete one instruction. As a consequence the pipeline does not work properly and pipeline stalls cannot be avoided.


8 Conclusion

This bachelor thesis demonstrates that a cache oblivious approach can be very suitable when dealing with sparse matrices. However, the algorithms have to be combined with a data structure that allows an efficient storage scheme for the matrices; this means storing as few zero-marked matrix blocks as possible.

Besides being a structure for sparse matrices, the traversed tree storage can be used to simplify the already implemented dense matrix matrix multiplication. There it supports the external (also called zero) padding by the design of the data structure, so no complicated implementation of the matrix matrix multiplication is needed to enable a multiplication with arbitrary dimensions.

TifaMMy is able to outperform an established library when sparse matrix algorithms are used. The performance analysis shows that when using only one thread, TifaMMy is up to 50% faster than the MKL, and even when the not yet well-optimised parallel implementation is executed, TifaMMy calculates faster.

By testing a new field of matrix applications, this bachelor thesis can be regarded as a starting point for several new aspects of the cache oblivious approach. For example, the already mentioned re-implementation of the dense matrix support with the traversed tree storage should be tested.

The performance results have shown that the implementation works quite well if only one thread is used. The multi-thread implementation suffers from imperfect load balancing between the several CPUs, and the CPI ratio is quite high for TifaMMy running on the Clovertown system. Here some improvements should be made.

The presented sparse matrix versions of TifaMMy only support double as floating point datatype. Next steps should enable the use of float, complex and double complex values, because today's applications need complex values in several cases.


Sparse systems of equations need to be preconditioned in some cases. The ILU matrix factorisation is a well known preconditioner and quite similar to the normal LU decomposition, which is already implemented in TifaMMy. Therefore a next step would be to integrate ILU into the sparse version of TifaMMy.


A Performance Graphs

[Bar chart omitted: MFlop/s per core for TifaMMy 1.4.0, TifaMMy 2.0.0 and Intel MKL 10 with 1, 2, 4 and 8 threads on the X5355 dual-channel, X5355 quad-channel and Opteron 2347 systems.]

Figure A.1: Performance per core measured in MFlop/s for all three test systems and 1, 2, 4 and 8 threads.


[Bar chart omitted: parallel efficiency for TifaMMy 1.4.0, TifaMMy 2.0.0 and Intel MKL 10 with 1, 2, 4 and 8 threads on the X5355 dual-channel, X5355 quad-channel and Opteron 2347 systems.]

Figure A.2: Parallel efficiency for all three test systems and 1, 2, 4 and 8 threads.


[Line plot omitted: MFlop/s over the matrix dimension for TifaMMy 1.4.0, TifaMMy 2.0.0 and Intel MKL 10.]

Figure A.3: Performance comparison with one thread between TifaMMy and Intel MKL 10 on the Clovertown workstation.


[Line plot omitted: MFlop/s over the matrix dimension for TifaMMy 1.4.0, TifaMMy 2.0.0 and Intel MKL 10.]

Figure A.4: Performance comparison with one thread between TifaMMy and Intel MKL 10 on the Clovertown Server.


[Line plot omitted: MFlop/s over the matrix dimension for TifaMMy 1.4.0, TifaMMy 2.0.0 and Intel MKL 10.]

Figure A.5: Performance comparison with one thread between TifaMMy and Intel MKL 10 on the Barcelona Server.


[Line plot omitted: MFlop/s over the matrix dimension for TifaMMy 1.4.0, TifaMMy 2.0.0 and Intel MKL 10.]

Figure A.6: Performance comparison with two threads between TifaMMy and Intel MKL 10 on the Clovertown workstation.


[Line plot omitted: MFlop/s over the matrix dimension for TifaMMy 1.4.0, TifaMMy 2.0.0 and Intel MKL 10.]

Figure A.7: Performance comparison with two threads between TifaMMy and Intel MKL 10 on the Clovertown Server.


[Line plot omitted: MFlop/s over the matrix dimension for TifaMMy 1.4.0, TifaMMy 2.0.0 and Intel MKL 10.]

Figure A.8: Performance comparison with two threads between TifaMMy and Intel MKL 10 on the Barcelona Server.


[Line plot omitted: MFlop/s over the matrix dimension for TifaMMy 1.4.0, TifaMMy 2.0.0 and Intel MKL 10.]

Figure A.9: Performance comparison with four threads between TifaMMy and Intel MKL 10 on the Clovertown workstation.


[Line plot omitted: MFlop/s over the matrix dimension for TifaMMy 1.4.0, TifaMMy 2.0.0 and Intel MKL 10.]

Figure A.10: Performance comparison with four threads between TifaMMy and Intel MKL 10 on the Clovertown Server.


[Line plot omitted: MFlop/s over the matrix dimension for TifaMMy 1.4.0, TifaMMy 2.0.0 and Intel MKL 10.]

Figure A.11: Performance comparison with four threads between TifaMMy and Intel MKL 10 on the Barcelona Server.


[Line plot omitted: MFlop/s over the matrix dimension for TifaMMy 1.4.0, TifaMMy 2.0.0 and Intel MKL 10.]

Figure A.12: Performance comparison with eight threads between TifaMMy and Intel MKL 10 on the Clovertown workstation.


[Line plot omitted: MFlop/s over the matrix dimension for TifaMMy 1.4.0, TifaMMy 2.0.0 and Intel MKL 10.]

Figure A.13: Performance comparison with eight threads between TifaMMy and Intel MKL 10 on the Clovertown Server.


[Line plot omitted: MFlop/s over the matrix dimension for TifaMMy 1.4.0, TifaMMy 2.0.0 and Intel MKL 10.]

Figure A.14: Performance comparison with eight threads between TifaMMy and Intel MKL 10 on the Barcelona Server.


B Algorithm Definitions

B.1 PPP-Multiplication

PC0 += PA0·PB0     SC4 += SA4·SB4  →  RC3 += RA3·SB4
      ↓                  ↑                  ↓
QC1 += QA1·PB0     RC5 += RA5·SB4     RC3 += PA2·RB5     PC8 += PA8·PB8
      ↓                  ↑                  ↓                  ↑
PC2 += PA2·PB0     RC5 += PA6·RB3     SC4 += QA1·RB5     QC7 += QA7·PB8
      ↓                  ↑                  ↓                  ↑
PC2 += RA3·QB1     SC4 += QA7·RB3     RC5 += PA0·RB5     PC6 += PA6·PB8
      ↓                  ↑                  ↓                  ↑
QC1 += SA4·QB1     RC3 += PA8·RB3     PC6 += PA0·PB6     PC6 += RA5·QB7
      ↓                  ↑                  ↓                  ↑
PC0 += RA5·QB1     PC2 += PA8·PB2     QC7 += QA1·PB6     QC7 += SA4·QB7
      ↓                  ↑                  ↓                  ↑
PC0 += PA6·PB2  →  QC1 += QA7·PB2     PC8 += PA2·PB6  →  PC8 += RA3·QB7

Figure B.1: PPP-Multiplication


B.2 PRR-Multiplication

RC0 += PA8·RB0     QC4 += SA4·QB4  →  PC3 += RA5·QB4
      ↓                  ↑                  ↓
SC1 += QA7·RB0     PC5 += RA3·QB4     PC3 += PA6·PB5     RC8 += PA0·RB8
      ↓                  ↑                  ↓                  ↑
RC2 += PA6·RB0     PC5 += PA2·PB3     QC4 += QA7·PB5     SC7 += QA1·RB8
      ↓                  ↑                  ↓                  ↑
RC2 += RA5·SB1     QC4 += QA1·PB3     PC5 += PA8·PB5     RC6 += PA2·RB8
      ↓                  ↑                  ↓                  ↑
SC1 += SA4·SB1     PC3 += PA0·PB3     RC6 += PA8·RB6     RC6 += RA3·SB7
      ↓                  ↑                  ↓                  ↑
RC0 += RA3·SB1     RC2 += PA0·RB2     SC7 += QA7·RB6     SC7 += SA4·SB7
      ↓                  ↑                  ↓                  ↑
RC0 += PA2·RB2  →  SC1 += QA1·RB2     RC8 += PA6·RB6  →  RC8 += RA5·SB7

Figure B.2: PRR-Multiplication

B.3 QPQ-Multiplication

QC0 += QA0·PB8     RC4 += RA4·SB4  →  SC3 += SA3·SB4
      ↓                  ↑                  ↓
PC1 += PA1·PB8     SC5 += SA5·SB4     SC3 += QA2·RB3     QC8 += QA8·PB0
      ↓                  ↑                  ↓                  ↑
QC2 += QA2·PB8     SC5 += QA6·RB5     RC4 += PA1·RB3     PC7 += PA7·PB0
      ↓                  ↑                  ↓                  ↑
QC2 += SA3·QB7     RC4 += PA7·RB5     SC5 += QA0·RB3     QC6 += QA6·PB0
      ↓                  ↑                  ↓                  ↑
PC1 += RA4·QB7     SC3 += QA8·RB5     QC6 += QA0·PB2     QC6 += SA5·QB1
      ↓                  ↑                  ↓                  ↑
QC0 += SA5·QB7     QC2 += QA8·PB6     PC7 += PA1·PB2     PC7 += RA4·QB1
      ↓                  ↑                  ↓                  ↑
QC0 += QA6·PB6  →  PC1 += PA7·PB6     QC8 += QA2·PB2  →  QC8 += SA3·QB1

Figure B.3: QPQ-Multiplication


B.4 QRS-Multiplication

SC0 += QA8·RB8     PC4 += RA4·QB4  →  QC3 += SA5·QB4
      ↓                  ↑                  ↓
RC1 += PA7·RB8     QC5 += SA3·QB4     QC3 += QA6·PB3     SC8 += QA0·RB0
      ↓                  ↑                  ↓                  ↑
SC2 += QA6·RB8     QC5 += QA2·PB5     PC4 += PA7·PB3     RC7 += PA1·RB0
      ↓                  ↑                  ↓                  ↑
SC2 += SA5·SB7     PC4 += PA1·PB5     QC5 += QA8·PB3     SC6 += QA2·RB0
      ↓                  ↑                  ↓                  ↑
RC1 += RA4·SB7     QC3 += QA0·PB5     SC6 += QA8·RB2     SC6 += SA3·SB1
      ↓                  ↑                  ↓                  ↑
SC0 += SA3·SB7     SC2 += QA0·RB6     RC7 += PA7·RB2     RC7 += RA4·SB1
      ↓                  ↑                  ↓                  ↑
SC0 += QA2·RB6  →  RC1 += PA1·RB6     SC8 += QA6·RB2  →  SC8 += SA5·SB1

Figure B.4: QRS-Multiplication

B.5 RQP-Multiplication

PC8 += RA0·QB0     SC4 += QA4·RB4  →  RC5 += PA3·RB4
      ↓                  ↑                  ↓
QC7 += SA1·QB0     RC3 += PA5·RB4     RC5 += RA2·SB5     PC0 += RA8·QB8
      ↓                  ↑                  ↓                  ↑
PC6 += RA2·QB0     RC3 += RA6·SB3     SC4 += SA1·SB5     QC1 += SA7·QB8
      ↓                  ↑                  ↓                  ↑
PC6 += PA3·PB1     SC4 += SA7·SB3     RC3 += RA0·SB5     PC2 += RA6·QB8
      ↓                  ↑                  ↓                  ↑
QC7 += QA4·PB1     RC5 += RA8·SB3     PC2 += RA0·QB6     PC2 += PA5·PB7
      ↓                  ↑                  ↓                  ↑
PC8 += PA5·PB1     PC6 += RA8·QB2     QC1 += SA1·QB6     QC1 += QA4·PB7
      ↓                  ↑                  ↓                  ↑
PC8 += RA6·QB2  →  QC7 += SA7·QB2     PC0 += RA2·QB6  →  PC0 += PA3·PB7

Figure B.5: RQP-Multiplication


B.6 RSR-Multiplication

RC8 += RA8·SB0     QC4 += QA4·PB4  →  PC5 += PA5·PB4
      ↓                  ↑                  ↓
SC7 += SA7·SB0     PC3 += PA3·PB4     PC5 += RA6·QB5     RC0 += RA0·SB8
      ↓                  ↑                  ↓                  ↑
RC6 += RA6·SB0     PC3 += RA2·QB3     QC4 += SA7·QB5     SC1 += SA1·SB8
      ↓                  ↑                  ↓                  ↑
RC6 += PA5·RB1     QC4 += SA1·QB3     PC3 += RA8·QB5     RC2 += RA2·SB8
      ↓                  ↑                  ↓                  ↑
SC7 += QA4·RB1     PC5 += RA0·QB3     RC2 += RA8·SB6     RC2 += PA3·RB7
      ↓                  ↑                  ↓                  ↑
RC8 += PA3·RB1     RC6 += RA0·SB2     SC1 += SA7·SB6     SC1 += QA4·RB7
      ↓                  ↑                  ↓                  ↑
RC8 += RA2·SB2  →  SC7 += SA1·SB2     RC0 += RA6·SB6  →  RC0 += PA5·RB7

Figure B.6: RSR-Multiplication

B.7 SQQ-Multiplication

QC8 += SA0·QB8     RC4 += PA4·RB4  →  SC5 += QA3·RB4
      ↓                  ↑                  ↓
PC7 += RA1·QB8     SC3 += QA5·RB4     SC5 += SA2·SB3     QC0 += SA8·QB0
      ↓                  ↑                  ↓                  ↑
QC6 += SA2·QB8     SC3 += SA6·SB5     RC4 += RA1·SB3     PC1 += RA7·QB0
      ↓                  ↑                  ↓                  ↑
QC6 += QA3·PB7     RC4 += RA7·SB5     SC3 += SA0·SB3     QC2 += SA6·QB0
      ↓                  ↑                  ↓                  ↑
PC7 += PA4·PB7     SC5 += SA8·SB5     QC2 += SA0·QB2     QC2 += QA5·PB1
      ↓                  ↑                  ↓                  ↑
QC8 += QA5·PB7     QC6 += SA8·QB6     PC1 += RA1·QB2     PC1 += PA4·PB1
      ↓                  ↑                  ↓                  ↑
QC8 += SA6·QB6  →  PC7 += RA7·QB6     QC0 += SA2·QB2  →  QC0 += QA3·PB1

Figure B.7: SQQ-Multiplication


B.8 SSS-Multiplication

SC8 += SA8·SB8     PC4 += PA4·PB4  →  QC5 += QA5·PB4
      ↓                  ↑                  ↓
RC7 += RA7·SB8     QC3 += QA3·PB4     QC5 += SA6·QB3     SC0 += SA0·SB0
      ↓                  ↑                  ↓                  ↑
SC6 += SA6·SB8     QC3 += SA2·QB5     PC4 += RA7·QB3     RC1 += RA1·SB0
      ↓                  ↑                  ↓                  ↑
SC6 += QA5·RB7     PC4 += RA1·QB5     QC3 += SA8·QB3     SC2 += SA2·SB0
      ↓                  ↑                  ↓                  ↑
RC7 += PA4·RB7     QC5 += SA0·QB5     SC2 += SA8·SB2     SC2 += QA3·RB1
      ↓                  ↑                  ↓                  ↑
SC8 += QA3·RB7     SC6 += SA0·SB6     RC1 += RA7·SB2     RC1 += PA4·RB1
      ↓                  ↑                  ↓                  ↑
SC8 += SA2·SB6  →  RC7 += RA1·SB6     SC0 += SA6·SB2  →  SC0 += QA5·RB1

Figure B.8: SSS-Multiplication


C Matrix Element Access Graphs

[Scatter plot omitted: Peano index of the accessed element over the operation number for the matrices A, B and C.]

Figure C.1: Memory access pattern of dense dense matrix matrix multiplication.


[Scatter plot omitted: Peano index of the accessed element over the operation number for the matrices A, B and C.]

Figure C.2: Memory access pattern of dense dense matrix matrix multiplication with external padding.


[Scatter plot omitted: Peano index of the accessed element over the operation number for the matrices A, B and C.]

Figure C.3: Memory access pattern of sparse dense matrix matrix multiplication; A: five diagonals.


[Scatter plot omitted: Peano index of the accessed element over the operation number for the matrices A, B and C.]

Figure C.4: Memory access pattern of sparse dense matrix matrix multiplication with external padding; A: five diagonals.


[Scatter plot omitted: Peano index of the accessed element over the operation number for the matrices A, B and C.]

Figure C.5: Memory access pattern of sparse dense matrix matrix multiplication; A: 2D Laplacian-like.


[Scatter plot omitted: Peano index of the accessed element over the operation number for the matrices A, B and C.]

Figure C.6: Memory access pattern of sparse dense matrix matrix multiplication with external padding; A: 2D Laplacian-like.


D Selected Code

Some important parts of TifaMMy's source code are printed here. They are referenced by the explanations of the implementation.

D.1 Memory Management

The following class implements the storage of different types of data (floating point, integer) in one contiguous array. It provides functions to return substreams of the whole stream.

template <class T>
class HybridAlign64Allocator : public Align64Allocator<T>
{
    char* data_ptr;          // pointer to the data stream
    size_t data_size;        // size of the data stream in bytes
    size_t size_dense;       // size of a padded dense block in bytes
    int elem_size_sparse;    // size of the elements in compressed row stored blocks
    int pos;                 // position of the iterator
    bool data_allocated;     // indicates if the stream is allocated

public:
    HybridAlign64Allocator() throw()
    {
        this->set_original_ptr_null();
        data_allocated = false;
        data_ptr = NULL;
        data_size = 0;
        size_dense = 0;
        pos = 0;
    }

    ~HybridAlign64Allocator() throw() {}

    // set the number of dense blocks in this matrix
    void set_Num_Dense_Blocks(int count, int block_elems, int elem_bytesize)
    {
        if (!data_allocated)
        {
            size_dense = (size_t)((block_elems*elem_bytesize)%64) + (size_t)(block_elems*elem_bytesize);
            data_size += (size_t)(size_dense * count);
        }
    }

    // set the number of compressed row stored blocks in this matrix
    void set_Num_CompRow_Blocks(int num_elems, int elem_bytesize, int num_rows)
    {
        if (!data_allocated)
        {
            size_t amount = (size_t)((num_elems*elem_bytesize)+(num_elems*4)+((num_rows+1)*4));
            size_t amount_align = (size_t)(((num_elems*elem_bytesize)+(num_elems*4)+((num_rows+1)*4))%64);
            data_size += (size_t)(amount + amount_align);
            elem_size_sparse = elem_bytesize;
        }
    }

    // create the space of the hybrid data stream
    void allocate_stream_data()
    {
        data_ptr = (char*) TifaMMy::alloc_aligned(data_size, 64);
        data_allocated = true;
    }

    // free the space of the hybrid data stream
    void deallocate_stream_data()
    {
        TifaMMy::free_aligned(data_ptr);
        data_allocated = false;
    }

    // get the next address of a dense block
    void* get_next_pos_dense()
    {
        void* ret = (void*)&data_ptr[pos];
        pos += (int)size_dense;
        return ret;
    }

    // get the next address of a compressed row stored block
    void* get_next_pos_comprow(int num_elems, int num_rows)
    {
        void* ret = (void*)&data_ptr[pos];
        int amount = ((num_elems*elem_size_sparse)+(num_elems*4)+((num_rows+1)*4));
        int amount_align = (((num_elems*elem_size_sparse)+(num_elems*4)+((num_rows+1)*4))%64);
        pos += amount + amount_align;
        return ret;
    }

    // set the iterator position to zero
    void reset_iterator()
    {
        pos = 0;
    }
};

Listing D.1: Allocator class implementing a mixed continuous data stream

D.2 Traversed Tree Storage

There are four interesting points in this implementation. The first one is the constructor of the data structure.


Besides this, the setup of the management data structure that contains the information about the zero elements / submatrices is printed.

In order to access the data in a common way, an interface is needed that calculates the transformation of the row and column coordinates into the position in the Peano order. This code is shown in the third code example.

The last code example shows a part of the multiplication algorithm. It is the PPP multiplication; all the others are done in the same way.

D.2.1 Constructor of HybridMatrix

HybridMatrix(const TifaMMyHybridLayout& MatrixLayout, unsigned int _n, unsigned int _m, bool transposed = false, const Allocator& a = Allocator()) :
    Matrix< data_type, HybridMatrix<data_type, Allocator> >(_n, _m, transposed),
    alloc(a)
{
    unsigned int temp_size_sima = 0;

    // set the number of processors that handle operations on this matrix
    cpus = 1;

    // set elements equal to the number of non-zero blocks in the matrix
    this->elements = MatrixLayout.get_num_dense_blocks() + MatrixLayout.get_num_csr_blocks();

    // allocate the space for the block information management array (BIMA)
    this->bima = alloc.allocate(this->elements);

    // equal distance separation of the Peano curve
    num_blocks_thread = ((this->elements) / cpus) + 1;

    // are there more rows than columns?
    if (_n > _m)
    {
        this->vir_dim = TifaMMy::power3(TifaMMy::log3(_n));
        this->iteration = TifaMMy::log3(_n);
    }
    else
    {
        this->vir_dim = TifaMMy::power3(TifaMMy::log3(_m));
        this->iteration = TifaMMy::log3(_m);
    }

    // calculate the max. possible size of the sparsity information management array (SIMA)
    for (unsigned int level = 1; level < iteration + 1; level++)
    {
        temp_size_sima += (TifaMMy::power3(level) * TifaMMy::power3(level));
    }

    // allocate the max. space for SIMA
    // this->sima = (int*) malloc(temp_size_sima*sizeof(int));
    this->sima = new int[temp_size_sima];
    // init sima
    for (unsigned int i = 0; i < temp_size_sima; i++)
    {
        this->sima[i] = -1;
    }

    // references to this matrix
    this->ref_count = new int(1);

    // init the sima array and the block infos (peano index, global row, global col)
    sima_pos = 0;
    sima_value = 0;
    init_sima_bima(MatrixLayout, iteration, -1, Peano::P, 0, this->n, 0, this->m);

    // set the size of DATA and allocate (CSR size already set by init_sima_bima)
    set_size_datastream_densepart(MatrixLayout.get_num_dense_blocks(), 2704, (int)sizeof(double));

    alloc.allocate_stream_data();

    // init the data structure & values in the blocks
    alloc.reset_iterator();
    for (unsigned int i = 0; i < this->elements; i++)
    {
        this->bima[i].init_block(MatrixLayout.elemfun, MatrixLayout.get_rows());
    }
}

Listing D.2: Constructor of the class HybridMatrix

D.2.2 Management Array Setup

void init_sima_bima(const TifaMMyHybridLayout& MatrixLayout, unsigned int level, int father_index, const Peano::Type type, unsigned int start_row, unsigned int end_row, unsigned int start_col, unsigned int end_col)
{
    // dimension of one of the nine sub-matrices
    const unsigned int tdim = TifaMMy::power3(level - 1);
    // elements in one sub-matrix
    const unsigned int elem = tdim*tdim;

    int* num_elem_level = new int[9];

    // set from father to child (bottom up)
    if (father_index >= 0)
        this->sima[father_index] = sima_pos;

    switch (type)
    {
    case Peano::P:
        // visit the blocks with the P numbering scheme
        // 1st block
        num_elem_level[0] = get_sima_block_info(MatrixLayout, level, start_row, start_row+tdim, start_col, start_col+tdim);
        // 2nd block
        num_elem_level[1] = get_sima_block_info(MatrixLayout, level, start_row+tdim, start_row+(2*tdim), start_col, start_col+tdim);
        // 3rd block
        num_elem_level[2] = get_sima_block_info(MatrixLayout, level, start_row+(2*tdim), start_row+(3*tdim), start_col, start_col+tdim);
        // 4th block
        num_elem_level[3] = get_sima_block_info(MatrixLayout, level, start_row+(2*tdim), start_row+(3*tdim), start_col+tdim, start_col+(2*tdim));
        // 5th block
        num_elem_level[4] = get_sima_block_info(MatrixLayout, level, start_row+tdim, start_row+(2*tdim), start_col+tdim, start_col+(2*tdim));
        // 6th block
        num_elem_level[5] = get_sima_block_info(MatrixLayout, level, start_row, start_row+tdim, start_col+tdim, start_col+(2*tdim));
        // 7th block
        num_elem_level[6] = get_sima_block_info(MatrixLayout, level, start_row, start_row+tdim, start_col+(2*tdim), start_col+(3*tdim));
        // 8th block
        num_elem_level[7] = get_sima_block_info(MatrixLayout, level, start_row+tdim, start_row+(2*tdim), start_col+(2*tdim), start_col+(3*tdim));
        // 9th block
        num_elem_level[8] = get_sima_block_info(MatrixLayout, level, start_row+(2*tdim), start_row+(3*tdim), start_col+(2*tdim), start_col+(3*tdim));

        if (level > 1)
        {
            if (num_elem_level[0] > -1)
                init_sima_bima(MatrixLayout, level-1, num_elem_level[0], Peano::P, start_row, start_row+tdim, start_col, start_col+tdim);
            if (num_elem_level[1] > -1)
                init_sima_bima(MatrixLayout, level-1, num_elem_level[1], Peano::Q, start_row+tdim, start_row+(2*tdim), start_col, start_col+tdim);
            if (num_elem_level[2] > -1)
                init_sima_bima(MatrixLayout, level-1, num_elem_level[2], Peano::P, start_row+(2*tdim), start_row+(3*tdim), start_col, start_col+tdim);
            if (num_elem_level[3] > -1)
                init_sima_bima(MatrixLayout, level-1, num_elem_level[3], Peano::R, start_row+(2*tdim), start_row+(3*tdim), start_col+tdim, start_col+(2*tdim));
            if (num_elem_level[4] > -1)
                init_sima_bima(MatrixLayout, level-1, num_elem_level[4], Peano::S, start_row+tdim, start_row+(2*tdim), start_col+tdim, start_col+(2*tdim));
            if (num_elem_level[5] > -1)
                init_sima_bima(MatrixLayout, level-1, num_elem_level[5], Peano::R, start_row, start_row+tdim, start_col+tdim, start_col+(2*tdim));
            if (num_elem_level[6] > -1)
                init_sima_bima(MatrixLayout, level-1, num_elem_level[6], Peano::P, start_row, start_row+tdim, start_col+(2*tdim), start_col+(3*tdim));
            if (num_elem_level[7] > -1)
                init_sima_bima(MatrixLayout, level-1, num_elem_level[7], Peano::Q, start_row+tdim, start_row+(2*tdim), start_col+(2*tdim), start_col+(3*tdim));
            if (num_elem_level[8] > -1)
                init_sima_bima(MatrixLayout, level-1, num_elem_level[8], Peano::P, start_row+(2*tdim), start_row+(3*tdim), start_col+(2*tdim), start_col+(3*tdim));
        }
        break;
    case Peano::Q:
        // see case Peano::P, only for Q storage here
        break;
    case Peano::R:
        // see case Peano::P, only for R storage here
        break;
    case Peano::S:
        // see case Peano::P, only for S storage here
        break;
    }

    delete[] num_elem_level;
}

Listing D.3: Traversed Tree Array Setup (SIMA)

D.2.3 Row/Column Transformation

int coord2linear(const unsigned int row, const unsigned int col, unsigned int pos_sima, unsigned int level, const Peano::Type type, unsigned int start_row, unsigned int start_col) const
{
    const unsigned int tdim = TifaMMy::power3(level - 1);
    const unsigned int elem = tdim*tdim;   // elements in one part of the matrix

    // determine the position of the coordinates in the nine blocks
    unsigned int entry_row;
    unsigned int entry_col;

    if (start_row <= row && row < (start_row + tdim))
        entry_row = 0;
    if ((start_row+tdim) <= row && row < (start_row + (2*tdim)))
        entry_row = 1;
    if ((start_row+(2*tdim)) <= row && row < (start_row + (3*tdim)))
        entry_row = 2;

    if (start_col <= col && col < (start_col + tdim))
        entry_col = 0;
    if ((start_col+tdim) <= col && col < (start_col + (2*tdim)))
        entry_col = 1;
    if ((start_col+(2*tdim)) <= col && col < (start_col + (3*tdim)))
        entry_col = 2;

    // recursion end
    if (level == 1)
    {
        switch (type)
        {
        case Peano::P:
            if (entry_row == 0 && entry_col == 0)
                return this->sima[pos_sima];
            if (entry_row == 1 && entry_col == 0)
                return this->sima[pos_sima+1];
            if (entry_row == 2 && entry_col == 0)
                return this->sima[pos_sima+2];
            if (entry_row == 2 && entry_col == 1)
                return this->sima[pos_sima+3];
            if (entry_row == 1 && entry_col == 1)
                return this->sima[pos_sima+4];
            if (entry_row == 0 && entry_col == 1)
                return this->sima[pos_sima+5];
            if (entry_row == 0 && entry_col == 2)
                return this->sima[pos_sima+6];
            if (entry_row == 1 && entry_col == 2)
                return this->sima[pos_sima+7];
            if (entry_row == 2 && entry_col == 2)
                return this->sima[pos_sima+8];
            break;
        case Peano::Q:
            // like case Peano::P, only for Q storage
            break;
        case Peano::R:
            // like case Peano::P, only for R storage
            break;
        case Peano::S:
            // like case Peano::P, only for S storage
            break;
        }
    }

    // recursive calls or cancel
    switch (type)
    {
    case Peano::P:
        if (entry_row == 0 && entry_col == 0)
        {
            if (this->sima[pos_sima] > 0)
                return coord2linear(row, col, this->sima[pos_sima], level-1, Peano::P, start_row + (tdim*entry_row), start_col+(tdim*entry_col));
        }
        if (entry_row == 1 && entry_col == 0)
        {
            if (this->sima[pos_sima+1] > 0)
                return coord2linear(row, col, this->sima[pos_sima+1], level-1, Peano::Q, start_row + (tdim*entry_row), start_col+(tdim*entry_col));
        }
        if (entry_row == 2 && entry_col == 0)
        {
            if (this->sima[pos_sima+2] > 0)
                return coord2linear(row, col, this->sima[pos_sima+2], level-1, Peano::P, start_row + (tdim*entry_row), start_col+(tdim*entry_col));
        }
        if (entry_row == 2 && entry_col == 1)
        {
            if (this->sima[pos_sima+3] > 0)
                return coord2linear(row, col, this->sima[pos_sima+3], level-1, Peano::R, start_row + (tdim*entry_row), start_col+(tdim*entry_col));
        }
        if (entry_row == 1 && entry_col == 1)
        {
            if (this->sima[pos_sima+4] > 0)
                return coord2linear(row, col, this->sima[pos_sima+4], level-1, Peano::S, start_row + (tdim*entry_row), start_col+(tdim*entry_col));
        }
        if (entry_row == 0 && entry_col == 1)
        {
            if (this->sima[pos_sima+5] > 0)
                return coord2linear(row, col, this->sima[pos_sima+5], level-1, Peano::R, start_row + (tdim*entry_row), start_col+(tdim*entry_col));
        }
        if (entry_row == 0 && entry_col == 2)
        {
            if (this->sima[pos_sima+6] > 0)
                return coord2linear(row, col, this->sima[pos_sima+6], level-1, Peano::P, start_row + (tdim*entry_row), start_col+(tdim*entry_col));
        }
        if (entry_row == 1 && entry_col == 2)
        {
            if (this->sima[pos_sima+7] > 0)
                return coord2linear(row, col, this->sima[pos_sima+7], level-1, Peano::Q, start_row + (tdim*entry_row), start_col+(tdim*entry_col));
        }
        if (entry_row == 2 && entry_col == 2)
        {
            if (this->sima[pos_sima+8] > 0)
                return coord2linear(row, col, this->sima[pos_sima+8], level-1, Peano::P, start_row + (tdim*entry_row), start_col+(tdim*entry_col));
        }
        break;
    case Peano::Q:
        // like case Peano::P, only for Q storage
        break;
    case Peano::R:
        // like case Peano::P, only for R storage
        break;
    case Peano::S:
        // like case Peano::P, only for S storage
        break;
    }

    // if no block was found in the structure
    return -1;
}

Listing D.4: coord2linear function for calculating the Peano position


D.2.4 Multiplication Algorithm

__forceinline void do_Elem_Multiplication()
{
    if ((bima_a > -1) && (bima_b > -1) && (bima_c > -1))
    {
        c[bima_c] += a[bima_a] * b[bima_b];
    }
}

__forceinline void do_Block_Recursive_Call_PPP()
{
    if ((sima_a[ind_a] > -1) && (sima_b[ind_b] > -1) && (sima_c[ind_c] > -1))
    {
        save_ind_a[iteration] = ind_a;
        save_ind_b[iteration] = ind_b;
        save_ind_c[iteration] = ind_c;

        ind_a = sima_a[ind_a];
        ind_b = sima_b[ind_b];
        ind_c = sima_c[ind_c];

        MulAdd_MM_PPP();

        ind_a = save_ind_a[iteration];
        ind_b = save_ind_b[iteration];
        ind_c = save_ind_c[iteration];
    }
}

// seven further function recursive calls
// ....

void MulAdd_MM_PPP(void)
{
    assert(this->iteration != 0);

    // end of recursion
    if (this->iteration == 1)
    {
        bima_a = sima_a[ind_a];   bima_b = sima_b[ind_b];   bima_c = sima_c[ind_c];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+1];                           bima_c = sima_c[ind_c+1];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+2];                           bima_c = sima_c[ind_c+2];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+3]; bima_b = sima_b[ind_b+1];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+4];                           bima_c = sima_c[ind_c+1];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+5];                           bima_c = sima_c[ind_c];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+6]; bima_b = sima_b[ind_b+2];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+7];                           bima_c = sima_c[ind_c+1];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+8];                           bima_c = sima_c[ind_c+2];
        do_Elem_Multiplication();                           bima_b = sima_b[ind_b+3]; bima_c = sima_c[ind_c+3];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+7];                           bima_c = sima_c[ind_c+4];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+6];                           bima_c = sima_c[ind_c+5];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+5]; bima_b = sima_b[ind_b+4];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+4];                           bima_c = sima_c[ind_c+4];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+3];                           bima_c = sima_c[ind_c+3];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+2]; bima_b = sima_b[ind_b+5];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+1];                           bima_c = sima_c[ind_c+4];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a];                             bima_c = sima_c[ind_c+5];
        do_Elem_Multiplication();                           bima_b = sima_b[ind_b+6]; bima_c = sima_c[ind_c+6];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+1];                           bima_c = sima_c[ind_c+7];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+2];                           bima_c = sima_c[ind_c+8];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+3]; bima_b = sima_b[ind_b+7];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+4];                           bima_c = sima_c[ind_c+7];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+5];                           bima_c = sima_c[ind_c+6];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+6]; bima_b = sima_b[ind_b+8];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+7];                           bima_c = sima_c[ind_c+7];
        do_Elem_Multiplication(); bima_a = sima_a[ind_a+8];                           bima_c = sima_c[ind_c+8];
        do_Elem_Multiplication();

        return;
    }

    this->iteration--;

    // recursive calls
    do_Block_Recursive_Call_PPP(); ind_a++; ind_c++;
    do_Block_Recursive_Call_QPQ(); ind_a++; ind_c++;
    do_Block_Recursive_Call_PPP(); ind_a++; ind_b++;
    do_Block_Recursive_Call_RQP(); ind_a++; ind_c--;
    do_Block_Recursive_Call_SQQ(); ind_a++; ind_c--;
    do_Block_Recursive_Call_RQP(); ind_a++; ind_b++;
    do_Block_Recursive_Call_PPP(); ind_a++; ind_c++;
    do_Block_Recursive_Call_QPQ(); ind_a++; ind_c++;
    do_Block_Recursive_Call_PPP(); ind_b++; ind_c++;
    do_Block_Recursive_Call_PRR(); ind_a--; ind_c++;
    do_Block_Recursive_Call_QRS(); ind_a--; ind_c++;
    do_Block_Recursive_Call_PRR(); ind_a--; ind_b++;
    do_Block_Recursive_Call_RSR(); ind_a--; ind_c--;
    do_Block_Recursive_Call_SSS(); ind_a--; ind_c--;
    do_Block_Recursive_Call_RSR(); ind_a--; ind_b++;
    do_Block_Recursive_Call_PRR(); ind_a--; ind_c++;
    do_Block_Recursive_Call_QRS(); ind_a--; ind_c++;
    do_Block_Recursive_Call_PRR(); ind_b++; ind_c++;
    do_Block_Recursive_Call_PPP(); ind_a++; ind_c++;
    do_Block_Recursive_Call_QPQ(); ind_a++; ind_c++;
    do_Block_Recursive_Call_PPP(); ind_a++; ind_b++;
    do_Block_Recursive_Call_RQP(); ind_a++; ind_c--;
    do_Block_Recursive_Call_SQQ(); ind_a++; ind_c--;
    do_Block_Recursive_Call_RQP(); ind_a++; ind_b++;
    do_Block_Recursive_Call_PPP(); ind_a++; ind_c++;
    do_Block_Recursive_Call_QPQ(); ind_a++; ind_c++;
    do_Block_Recursive_Call_PPP();

    this->iteration++;
}

// seven further functions
// ....

Listing D.5: Matrix Matrix Multiplication


E Glossary

Architecture, x86: describes the machine interface provided by a processor; x86 is a general-purpose platform by Intel.

BLAS: Basic Linear Algebra Subprograms; libraries with C or Fortran functions for vector/vector (level 1), matrix/vector (level 2) and matrix/matrix (level 3) operations.

Cache: fast buffer memory that bridges the performance gap between the main memory and the CPU.

Core: today's processors are multi-core processors; they have several identical CPUs on one die, or on two dies in one package. They can be regarded as "normal" SMP platforms with some extensions (shared cache levels, shared bus systems). Core is the name of a CPU on the die.

CPU: Central Processing Unit; same as processor.

DRAM: Dynamic Random Access Memory; built from capacitors that need to be refreshed every 20 ms; the technology used for main memories.

FSB: Front Side Bus; parallel high-speed bus which connects the CPU with the MCH.

MCH: Memory Controller Hub; interface for the CPUs to communicate with the main memory.

MFlop/s: millions of floating point operations per second.

NUMA: Non-Uniform Memory Access. Every CPU in a multi-processor system has its own memory, but can access the memories of the other CPUs by communicating with them.

SIMD: Single Instruction, Multiple Data; more than one value (typically two or four values) is processed by one instruction.


SMP: Symmetric Multi-Processing; today's normal multiprocessor system.

SRAM: Static Random Access Memory; very fast memory, but it needs a lot of space on a die. It is normally built using flip-flops and is used for cache memories.

SSE: vector units of the x86 architecture; available since the Pentium III in 1999.

UMA: Uniform Memory Access. There is "one big" main memory, which is accessed by all processors in the system. The memory accesses are managed by the MCH.


Bibliography

[1] Sedgewick, R.: Algorithms in C++. Addison-Wesley, 1998, 3rd edition, ISBN: 0-201-35088-2.

[2] Saad, Y.: Iterative Methods for Sparse Linear Systems, PWS Publishing Company, 1996, ISBN: 0-534-94776-X

[3] Meister, A.: Numerik linearer Gleichungssysteme, Vieweg, 2005, 2nd edition, ISBN: 3-528-13135-7

[4] Patterson, D. A., Hennessy, J. L.: Computer Organization and Design, the hardware/software interface, 3rd edition, Morgan Kaufmann Publishers, ISBN: 1558604286

[5] Louis, D.: C/C++ New Reference, Markt & Technik, 1999, ISBN: 3-8272-5592-9

[6] Bader, M. and Zenger, C.: Cache oblivious matrix multiplication using an element ordering based on a Peano curve. Linear Algebra Appl. 417 (2-3), 2006.

[7] Bader, M., Franz, R., Günther, S., and Heinecke, A.: Hardware-oriented Implementation of Cache Oblivious Matrix Operations Based on Space-filling Curves, Proceedings of the PPAM 2007, LNCS 4967, 2008, in print.

[8] Heinecke, A., Bader, M.: Parallel Matrix Multiplication based on Space-filling Curves on Shared Memory Multicore Platforms, Proceedings of the ACM Int. Conf. on Computing Frontiers 2008, accepted.

[9] Mayer, C.: Cache oblivious matrix operations using Peano curves, Diploma Thesis, Technische Universität München, 2006

[10] Elmroth, E., Gustavson, F., Jonsson, I., and Kågström, B.: Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review 46 (1), 2004.


[11] Goto, K., and van de Geijn, R.A.: Anatomy of a High-Performance Matrix Multiplication. Accepted for publication in ACM Transactions on Mathematical Software 34 (3), 2008, preprint: http://www.cs.utexas.edu/users/flame/pubs/openflame.pdf.

[12] Gustavson, F. G.: Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM Journal of Research and Development 41 (6), 1997.

[13] Kurzak, J. and Dongarra, J.: Implementation of the Mixed-Precision High Performance LINPACK Benchmark on the Cell Processor. LAPACK Working Note 177, 2006.

[14] Herrero, J., Navarro, J.: Adapting Linear Algebra Codes to the Memory Hierarchy Using a Hypermatrix Scheme. In Proc. Int. Conf. on Parallel Processing and Applied Mathematics (PPAM'05). LNCS 3911, pp. 1058-1065, June 2006

[15] Herrero, J., Navarro, J.: Analysis of a Sparse Hypermatrix Cholesky with Fixed-Sized Blocking. In Journal Applicable Algebra in Engineering, Communication and Computing, 18 (3), pages 279-295, 2007. DOI: 10.1007/s00200-007-0039-8.

[16] Smailbegovic, F., Gaydadjiev, G. N., Vassiliadis, S.: Sparse Matrix Storage Format, TU Delft.

[17] D'Azevedo, E. F., Fahey, M. R., Mills, R. T.: A Vectorized Sparse Matrix Multiply for Compressed Row Storage Format, Proceedings of the ICCS 2005.

[18] Jessen, E.: Lecture Notes of Computer Architectures, Technische Universität München, 2008, http://www.net.informatik.tu-muenchen.de//teaching/WS07/comparch/.

[19] Bader, M.: Lecture Notes of Algorithms in scientific computing, Technische Universität München, 2008, http://www5.in.tum.de/lehre/vorlesungen/algowiss/

[20] Intel Corporation: Intel x86 Architecture Volume 1: Basic Architecture; Intel Corporation 2007

[21] Intel Corporation: Intel x86 Architecture Volume 2A: Instruction Set Reference, A-M; Intel Corporation 2007


[22] Intel Corporation: Intel x86 Architecture Volume 2B: Instruction Set Reference, N-Z; Intel Corporation 2007

[23] Intel Corporation: Intel x86 Architecture Volume 3: System Programming Guide; Intel Corporation 2007

[24] Microsoft Corporation: Version 2005 (July) of the MSDN DVD, served by the Maniac Server at Technische Universität München. Microsoft Corporation 2005

[25] Dongarra, J.: Sparse Matrix Storage Formats, http://www.cs.utk.edu/~dongarra/etemplates/node372.html

[26] Wikipedia: CPU cache - Wikipedia, the free encyclopedia, 2008. http://en.wikipedia.org/w/index.php?title=CPU_cache.

[27] Intel Math Kernel Library, 2007. http://intel.com/cd/software/products/asmo-na/eng/perflib/mkl/

[28] TifaMMy (TifaMMy isn't the fastest Matrix Multiplication, yet), http://tifammy.sourceforge.net, version 1.3.2.
