
SMOReS: Sparse Matrix Omens of Reordering Success

Samantha Wood

Advisors: David G. Wonnacott and Michelle Mills Strout

(thanks to Eric Eaton for machine learning advice)

Spring 2011

Abstract

Despite their widespread use, sparse matrix computations exhibit poor performance, due to their memory-bandwidth bound nature. Techniques have been developed that help these computations take advantage of unexploited data reuse by transforming it into data locality, generally improving performance. One such technique is to reorder the matrix prior to running a computation on it. However, reordering a matrix takes time and does not always provide performance improvements. We present a classification model that predicts, with 82% accuracy and no significantly incorrect predictions, whether reordering a matrix will improve the performance of the matrix power kernel, A^k x [13]. Our classifier is an ensemble of decision stumps generated by the AdaBoost [6] learning algorithm and is trained on 60 matrices with a wide range of memory footprints and average number of nonzeros per row.


Contents

1 Introduction
    1.1 Experimental Framework
        1.1.1 Matrices
        1.1.2 Machines

2 Computations
    2.1 SpMV
    2.2 Power Kernel
        2.2.1 Sparse Iterative Solvers
        2.2.2 Straightforward Algorithm and its Drawbacks
        2.2.3 Communication-Avoiding Algorithm
        2.2.4 Performance of Communication-Avoiding Algorithm

3 Sparse Matrix Reorderings
    3.1 Reordering Heuristics
        3.1.1 Random
        3.1.2 BFS
        3.1.3 METIS
    3.2 Performance Results of Reordering

4 Deciding to Reorder
    4.1 Metrics
    4.2 Prediction Model
    4.3 Prediction Results

5 Conclusion
    5.1 Future Work


Chapter 1

Introduction

Sparse matrices, which typically contain less than 1% nonzero values, are commonly used in a variety of applications, including solvers for systems of linear equations, scientific simulations, and implementations of PageRank [14], an algorithm used in search engines. To help visualize a sparse matrix, we can look at its spyplot, which shows the structure of the matrix by plotting a point for each nonzero. Figure 1.1 shows a sample spyplot for a 525,825 × 525,825 sparse matrix with 3,674,625 nonzeros (0.0013% nonzeros) that takes up about 46 MB in memory in sparse form. This matrix is generated by a computational fluid dynamics problem and is from the University of Florida Sparse Matrix Collection [3].

Figure 1.1: Spyplot for parabolic_fem, a 525,825 × 525,825 sparse matrix.


Despite their widespread use, sparse computations are plagued by poor performance, typically achieving no more than 5-10% of peak performance on a machine. This relatively poor performance stems from the memory-bandwidth bound nature of sparse computations. Since memory access is substantially slower than arithmetic and the sparsity of the computations results in comparatively few arithmetic operations per datum produced, sparse computations have a high memory access to arithmetic ratio and can execute no faster than the speed of accessing slow memory.

To improve a sparse computation's performance, it is necessary to reduce the memory access to arithmetic ratio by decreasing the total number of accesses to slow memory. Sparse computations are characterized by data reuse that is difficult to exploit. In other words, although the same data values are reused multiple times over the course of the computation, this reuse does not occur prior to the data's eviction from cache. Therefore, each time a value is used, it must be read from slow memory.

There are two main classes of techniques to improve the performance of a sparse computation by taking advantage of data reuse:

1. develop a more efficient general algorithm, or

2. modify the computation's execution at runtime, tuning it to the specific matrix on which it is running.

Each of these approaches attempts to turn data reuse into data locality, so that data is used multiple times before eviction from cache and is read from slow memory as few times as possible.

Work has been done on developing more efficient general algorithms, such as the communication-avoiding power kernel [13], which will be discussed later. However, sometimes a more efficient general algorithm is too difficult to develop or is unable to provide significant performance improvements for all matrices. In such cases, runtime reordering transformations may provide further performance benefits. Iteration reordering transformations reorder the iterations of a computation on a particular matrix. Data reordering transformations reorder the matrix prior to running a computation on it. Both types of runtime reordering transformations attempt to cluster accesses to the same data values.

We focus on data reordering transformations, or matrix reorderings. A matrix reordering is a permutation of the rows and columns, which does not affect the correctness of the computation. For many matrices, reordering the matrix can significantly improve computation performance. However, finding the best reordering for a matrix is NP-hard because it is equivalent to minimal linear arrangement [11] or optimal linear ordering [7].


Reordering heuristics can help find a good reordering for a matrix, but no heuristic consistently provides the best orderings. Furthermore, in some cases, reordering does not actually benefit, and may in fact hurt, performance. Additionally, these heuristics take time that is not justifiable when the original matrix performs as well or better than the reordered matrix. Consequently, it is desirable to know prior to running reordering heuristics whether they will benefit performance. We present a model we have developed that accurately predicts, prior to running any reordering heuristics, whether or not to reorder a matrix.

To our knowledge, we are the first to make such a prediction. Most previous work (e.g. [1, 2, 4, 10, 12]) has focused on developing successful reordering heuristics. Other work, including that of Strout and Hovland [16], has used metrics to select a best reordering from several. However, this selection occurs post-reordering. Furthermore, they only address single-core performance improvements whereas we examine multicore performance.

Chapter 2 discusses the matrix computations we use to develop our model. Chapter 3 describes data reorderings in more detail, including the reordering heuristics we consider. Chapter 4 describes our prediction model and results, followed by conclusions and future work in Chapter 5.

1.1 Experimental Framework

Here we briefly describe the matrices and machines used in our experiments.

1.1.1 Matrices

We use 60 sparse matrices that span a wide range of structures, sizes, and densities. These matrices are from the University of Florida Sparse Matrix Collection [3], a collection of sparse matrices that have been produced by a variety of real applications.

1.1.2 Machines

The performance results used in this paper were collected on several types of machines, including Intel's Xeon E5450, Xeon X3450, Core2 Quad Q9550, and Core2 6300. Further information about these machines is described in Table 1.1. We gathered additional performance results that are not directly used in this paper from an Intel Xeon E5520, an Intel Xeon X5550, and an Intel Core2 Duo E8300. More information about these machines is available in Table 1.2.


The most important point to note is that although the exact performance values vary from machine to machine, the performance trends are preserved across all machines we used. Consequently, we use a sampling of machines in the displayed results. Results displayed in Chapter 2 were collected on an Intel Core2 Quad Q9550 running 4 threads in OpenMP. The results in Section 3.1 were collected on a Harpertown running 8 threads in OpenMP. All the remaining results were collected on a Lynnfield running 8 threads in OpenMP. All displayed results, however, were collected on all the machines in Table 1.1.

Processor                 Cores   L2 Cache   Nickname
Intel Xeon E5450          2 × 4   4 × 6 MB   Harpertown
Intel Xeon X3450          2 × 4   2 × 8 MB   Lynnfield
Intel Core2 Quad Q9550    4       2 × 6 MB
Intel Core2 6300          2       2 MB

Table 1.1: Processor information for machines directly used in results.

Processor                 Cores   L2 Cache
Intel Xeon E5520          2 × 4   2 × 8 MB
Intel Xeon X5550          2 × 4   2 × 8 MB
Intel Core2 Duo E8300     2       6 MB

Table 1.2: Processor information for machines not directly used in results.


Chapter 2

Computations

Sparse matrix applications employ a variety of core computations, including sparse matrix-vector multiply (SpMV) and the power kernel. Here, we explore each of these computations in more detail, without yet considering the effects of matrix reorderings, and investigate one algorithmic improvement to the power kernel, A^k x.

2.1 SpMV

Sparse matrix-vector multiplication is one of the most common and most fundamental sparse matrix computations. As the name implies, the computation is simply the multiplication of a sparse matrix, A, by a dense vector, x. Since it is an extremely common computation, many highly optimized implementations of SpMV are available. We do not use any of these implementations in our experiments. Instead, we use handwritten code with only basic optimizations because we believe there is value to having sleek, portable, and comprehensible code, even if it is not highly optimized. We focus on other optimization techniques that do not result in architecture-specific or unintelligible code.
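As an illustration of the style of code we have in mind (a sketch in Python, not the code actually used in our experiments), a basic SpMV over a matrix stored in compressed sparse row (CSR) form can be written as:

```python
import numpy as np

def csr_spmv(indptr, indices, data, x):
    """Basic SpMV, y = A x, with A stored in CSR form (illustrative sketch)."""
    n = len(indptr) - 1
    y = np.zeros(n)
    for i in range(n):                        # one output entry per row
        for p in range(indptr[i], indptr[i + 1]):
            y[i] += data[p] * x[indices[p]]   # accumulate a_ij * x_j
    return y
```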

2.2 Power Kernel

Another commonly used sparse computation is the power kernel, which computes

\[ A^k x , \]

where A is a sparse matrix, k an integer power, and x a dense vector.


2.2.1 Sparse Iterative Solvers

One of the power kernel's many uses is in iterative solvers for sparse systems of linear equations. Such solvers try to find a vector, x, such that Ax = b, given sparse matrix A and vector b. Solving this directly, however, requires taking the inverse of the very large matrix A, which is too costly in terms of time. Consequently, iterative solvers, such as GMRES (Generalized Minimal Residual Method) [15], find a reasonable approximation of x much more quickly. One of the core computations that occurs at each iteration in these solvers is the power kernel. Therefore, improving the power kernel's efficiency directly improves that of the iterative solver.

2.2.2 Straightforward Algorithm and its Drawbacks

The most straightforward algorithm for computing the power kernel groups the computation as

\[ (A \cdots (A(A(Ax)))) , \]

a series of sparse matrix-vector multiplications. This multiplies the vector, x, by the matrix, A, generating an intermediate vector, which is then again multiplied by A. This is repeated until k multiplications by A have occurred. Although this algorithm is straightforward, it is not particularly efficient, as we will see.
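As a sketch of this grouping (in Python, using SciPy's sparse types purely for illustration), the straightforward algorithm is just k back-to-back SpMVs:

```python
import numpy as np
import scipy.sparse as sp

def power_kernel_straightforward(A, x, k):
    """Straightforward A^k x: k back-to-back sparse matrix-vector products."""
    A = sp.csr_matrix(A)
    v = np.asarray(x, dtype=float)
    for _ in range(k):
        v = A @ v        # every multiplication re-reads all of A from slow memory
    return v
```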

The main problem with the straightforward algorithm is that it does not take advantage of spatial and temporal data reuse. In particular, although each value in A is used k times in the computation, it is only used once between being read into the cache and evicted from it. In other words, each value in A is read from slow memory each time it is used. This occurs because A and x are too large to fit into cache. Consequently, for every multiplication by A, A is read from slow memory, making the speed of the power kernel bounded by the speed of slow memory. A similar situation occurs for the vector x. A single value in x can be used multiple times in one multiplication by A. However, since A and x are too large to fit in cache, it is quite likely that values in x must be read into cache anew for each use, further ensuring that the speed of the power kernel is bounded by the speed of slow memory. One way to address the memory-bandwidth bound nature of this computation is to use an algorithm that takes advantage of the spatial and temporal data reuse inherent in the power kernel.


2.2.3 Communication-Avoiding Algorithm

One such algorithm is the communication-avoiding algorithm [13]. In this context, communication refers to, in the sequential case, the sharing of data between fast and slow memory or, in the parallel case, between cores. The communication-avoiding algorithm seeks to limit communication in both the sequential and parallel cases by replacing this communication with redundant computation. Specifically, the algorithm partitions the work of the power kernel computation among threads so that each thread is responsible for computing a unique subset of the resulting vector. Each thread now performs the straightforward algorithm on a subset of A and independently computes all intermediate values necessary for its designated portion of the resulting vector. If two threads require the same intermediate value, rather than one computing it and sending it to the other thread, they each compute it.

This may seem counterintuitive because we have now increased the amount of computation necessary for the power kernel. However, arithmetic is substantially faster than communication. Consequently, with low enough levels of redundant computation, the communication-avoiding algorithm can perform better than the straightforward algorithm, despite the increased amount of arithmetic. Note that the communication-avoiding algorithm, by working on a subset of A, takes advantage of both spatial and temporal data reuse.
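The sketch below illustrates the idea in Python (sequentially, with SciPy, and with full-length temporary vectors for clarity rather than memory efficiency): each block of rows works out which rows it needs at every level and recomputes those intermediate values itself, so no intermediate values are exchanged between blocks. This is only an illustration of the partitioning-plus-redundant-computation idea, not the algorithm of Mohiyuddin et al. [13] as actually implemented.

```python
import numpy as np
import scipy.sparse as sp

def ca_power_kernel(A, x, k, nblocks):
    """Illustrative A^k x with redundant computation: each block of rows
    independently recomputes every intermediate value it needs."""
    A = sp.csr_matrix(A)
    n = A.shape[0]
    result = np.zeros(n)
    for block in np.array_split(np.arange(n), nblocks):
        # Rows needed at each level, working outward from the block's own rows.
        needed = [set(block)]
        for _ in range(k):
            cols = set()
            for r in needed[-1]:
                cols.update(A.indices[A.indptr[r]:A.indptr[r + 1]])
            needed.append(needed[-1] | cols)
        # Compute the intermediate vectors, but only on the rows this block needs.
        v = np.asarray(x, dtype=float).copy()
        for level in range(k):
            rows = np.fromiter(needed[k - 1 - level], dtype=int)
            w = np.zeros(n)
            w[rows] = A[rows, :] @ v      # redundant work overlaps other blocks
            v = w
        result[block] = v[block]
    return result
```

With low enough redundancy, the extra arithmetic on the overlapping rows is cheaper than repeatedly re-reading A or exchanging intermediate values.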

2.2.4 Performance of Communication-Avoiding Algorithm

Here we examine some results for the straightforward algorithm versus the communication-avoiding algorithm. These results (Figure 2.1) were collected on an Intel Core2 Quad Q9550 running 4 threads in OpenMP. For the most part, the communication-avoiding algorithm performs substantially better than the straightforward algorithm. However, there are two matrices for which it performs worse: crankseg_2 and nd24k. Examining the spyplots for these matrices compared to those for the other matrices helps us understand this performance. In particular, in matrices such as cant (Figure 2.2a), pwtk (Figure 2.2b), and others, the nonzeros are clustered along the diagonal of the matrix. However, for crankseg_2 (Figure 2.3a) and nd24k (Figure 2.3b), the nonzeros are scattered much more widely throughout the matrix. Due to the structure of these matrices, the communication-avoiding algorithm performs so much redundant computation that it outweighs the benefits gained in limiting communication. Additionally, it may be the case that each thread's matrix and vector are still too large to fit in cache, so the algorithm has not solved all of the issues.


[Plot omitted: Gflops (higher is better) per matrix for the straightforward and communication-avoiding algorithms, over bmw7st-1, cage13, cant, cfd2, crankseg_2, ldoor, li, nd24k, parabolic_fem, pwtk, thermal2, and xenon2.]

Figure 2.1: Gflops for straightforward vs. communication-avoiding.


In general, this algorithm has improved the performance of the power kernel. However, we noted that there are a few cases for which this is not true. Consequently, we turn to run-time transformations to make further modifications to the execution of the computation.


Figure 2.2: Communication-avoiding-friendly matrices: (a) cant, (b) pwtk.

Figure 2.3: Communication-avoiding-unfriendly matrices: (a) crankseg_2, (b) nd24k.


Chapter 3

Sparse Matrix Reorderings

Recall that the goal of both the algorithmic improvements we just investigated for the power kernel and matrix reorderings is to turn unexploited data reuse into data locality (reuse before eviction from cache). To understand how reordering a matrix can accomplish this, we will consider an example simplified to a trivial size in which we wish to multiply a sparse matrix, A, by a dense vector, x.

\[
\begin{pmatrix}
a_{00} & 0 & 0 & a_{03} \\
0 & a_{11} & 0 & 0 \\
a_{20} & 0 & a_{22} & 0 \\
0 & 0 & 0 & a_{33}
\end{pmatrix}
\begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix}
a_{00}x_0 + a_{03}x_3 \\
a_{11}x_1 \\
a_{20}x_0 + a_{22}x_2 \\
a_{33}x_3
\end{pmatrix}
\]

We assume that the cache can hold at most 4 values (not including computed values), each value is read into cache separately, and a new value replaces the least recently used value in the cache. Notice that after multiplying the first row by x, the cache contains {a00, x0, a03, x3}, but upon multiplying the second row, x0 must be evicted from the cache to make room for x1. Then, when multiplying the third row, x0 must be read into cache again from slow memory. A similar issue occurs with x3.

Notice, however, that if we merely rearrange the rows and columns of A and x as below

\[
\begin{pmatrix}
a_{22} & a_{20} & 0 & 0 \\
0 & a_{00} & a_{03} & 0 \\
0 & 0 & a_{33} & 0 \\
0 & 0 & 0 & a_{11}
\end{pmatrix}
\begin{pmatrix} x_2 \\ x_0 \\ x_3 \\ x_1 \end{pmatrix}
=
\begin{pmatrix}
a_{20}x_0 + a_{22}x_2 \\
a_{00}x_0 + a_{03}x_3 \\
a_{33}x_3 \\
a_{11}x_1
\end{pmatrix} ,
\]

each value must only be read into cache once. Also notice that the result of the computation is the same, but permuted in the same way as the columns.


We have just performed a matrix reordering on A that increases the data locality of the computation.
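This cache argument can be checked mechanically. The sketch below simulates a 4-entry LRU cache and counts slow-memory reads for the two orderings, assuming each term's matrix entry and vector entry are accessed left to right within a row (an assumption made only for this illustration):

```python
from collections import OrderedDict

def count_slow_reads(accesses, cache_size=4):
    """Count slow-memory reads under an LRU cache of the given size."""
    cache = OrderedDict()
    slow_reads = 0
    for v in accesses:
        if v in cache:
            cache.move_to_end(v)           # reuse before eviction: a cache hit
        else:
            slow_reads += 1                # miss: value read from slow memory
            cache[v] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used value
    return slow_reads

original = ["a00", "x0", "a03", "x3", "a11", "x1",
            "a20", "x0", "a22", "x2", "a33", "x3"]
reordered = ["a22", "x2", "a20", "x0", "a00", "x0",
             "a03", "x3", "a33", "x3", "a11", "x1"]
print(count_slow_reads(original), count_slow_reads(reordered))   # 12 vs. 10
```

Under these assumptions the original ordering performs 12 slow-memory reads while the reordered one performs 10, one per distinct value, matching the discussion above.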

Matrix reorderings generally focus on improving intra-multiplication locality (e.g. within Ax) as opposed to inter-multiplication locality (e.g. across A^k x). This is in contrast to algorithmic improvements, such as the communication-avoiding power kernel, that also improve inter-multiplication locality. In a computation such as the straightforward power kernel, matrix reorderings can help improve data locality in one sparse matrix-vector multiplication but generally not across subsequent multiplications by A. In the communication-avoiding power kernel, reordering a matrix can provide additional performance benefits by reducing the amount of overlapping computation performed by partitions.

3.1 Reordering Heuristics

Although finding a good ordering for our example matrix was easy, in general, finding the best ordering for a matrix is NP-hard. Reordering heuristics can be used to find good, but not necessarily optimal, orderings for a matrix. We considered three reordering heuristics: random, breadth-first ordering (BFS), and METIS.

3.1.1 Random

Perhaps the most obvious reordering algorithm for a matrix is to randomly permute some or all of the rows of the matrix.

The random reordering performs a Knuth shuffle [5] on a portion or all of the matrix. Given a matrix, A, and a percentage, p, of rows to randomly permute, the algorithm randomly selects p% of the rows in A. Call the set of these selected rows R. The algorithm iterates through R, swapping the current row with a randomly selected row in R that has not yet been iterated over. The pseudocode for swapping the rows is as follows:

for i = |R| - 1 downto 1 do
    j <- random integer with 0 <= j <= i
    swap A[R[i]] and A[R[j]]
end for

This generates a reordering of A in which p% of the rows are randomly reordered.
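A minimal Python sketch of this partial shuffle, applied to the row order of a SciPy CSR matrix, might look as follows (applying the permutation to rows only, mirroring the pseudocode above):

```python
import numpy as np
import scipy.sparse as sp

def random_partial_reorder(A, p, seed=0):
    """Knuth-shuffle a randomly chosen p% of the rows of A (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    A = sp.csr_matrix(A)
    n = A.shape[0]
    R = rng.choice(n, size=int(n * p / 100.0), replace=False)   # rows to permute
    perm = np.arange(n)
    for i in range(len(R) - 1, 0, -1):        # Knuth shuffle over the selected rows
        j = rng.integers(0, i + 1)            # random j with 0 <= j <= i
        perm[R[i]], perm[R[j]] = perm[R[j]], perm[R[i]]
    return A[perm, :]
```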


Figure 3.1: bmw7st-1 randomly reordered: (a) original, (b) 1%, (c) 5%, (d) 10%.


Figure 3.1 shows the spyplots for some sample partial random reorderings. By simply looking at these matrices, we predict that the random reorderings will hurt performance. In general, the original orderings for matrices are significantly better than random. We have confirmed this prediction by running the communication-avoiding power kernel with increasing levels of randomization on a Harpertown running 8 threads in OpenMP. As can be seen in Figure 3.2, an increase in randomization results in a decrease in performance. All the matrices that we investigated come from the University of Florida Sparse Matrix Collection [3] and are produced by real-world applications. The original matrices are the direct output of various applications and have not been previously reordered. Due to the nature of the applications, the structure of the original matrices is better than random.

[Plot omitted: Gflops vs. percentage of rows randomly reordered (0-10%) for bmw, cfd2, li, pwtk, and xenon.]

Figure 3.2: Communication-avoiding performance with increasing randomization of matrices.


3.1.2 BFS

A sparse matrix can be interpreted as an adjacency graph in which each row is a node and an edge exists from node i to node j when A_ij ≠ 0. Another technique for reordering a matrix is to perform a breadth-first search on its adjacency graph, ordering the rows in the order they are encountered in the graph traversal [1].

Recall our example matrix, A:

\[
\begin{pmatrix}
a_{00} & 0 & 0 & a_{03} \\
0 & a_{11} & 0 & 0 \\
a_{20} & 0 & a_{22} & 0 \\
0 & 0 & 0 & a_{33}
\end{pmatrix} .
\]

The adjacency graph for this matrix is

        0
       / \
      2   3        1

A breadth-first traversal of this graph starts at node 0, and then visits, in order, nodes 2, 3, and 1. The BFS heuristic takes this ordering of the nodes, reverses it, and then rearranges the rows and columns in the resulting order:

\[
\begin{pmatrix}
a_{11} & 0 & 0 & 0 \\
0 & a_{33} & 0 & 0 \\
0 & 0 & a_{22} & a_{20} \\
0 & a_{03} & 0 & a_{00}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_3 \\ x_2 \\ x_0 \end{pmatrix} .
\]

If we examine, as before, the data locality of this ordering, we find that x3 must still be read from slow memory to cache twice. Before it is reused, it is evicted from cache to provide space for x0. However, unlike in our initial ordering, x0 must only be read into cache once and remains in cache between the multiplications by a20 and a00. Thus, the BFS ordering is better than the original but not as good as the hand-optimized ordering.
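A sketch of the BFS heuristic in Python, using SciPy's graph routines and restarting the traversal for disconnected components, might look like this (the reversal of the visit order follows the description above):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import breadth_first_order

def bfs_reorder(A):
    """Reorder A by a reversed breadth-first ordering of its adjacency graph."""
    A = sp.csr_matrix(A)
    n = A.shape[0]
    visited = np.zeros(n, dtype=bool)
    order = []
    for start in range(n):                    # restart for disconnected components
        if visited[start]:
            continue
        nodes, _ = breadth_first_order(A, start, directed=False,
                                       return_predecessors=True)
        visited[nodes] = True
        order.extend(nodes.tolist())
    perm = np.array(order[::-1])              # reversed BFS visit order
    return A[perm, :][:, perm]                # permute rows and columns together
```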

3.1.3 METIS

METIS [12] is a graph partitioning heuristic that can be used to partition the adjacency graph into k roughly equal-sized subgraphs such that the number of edges between different subgraphs is minimized. In reordering a sparse matrix with METIS, the goal is to create k partitions that are each smaller than the size of the cache.


Then, a sparse computation can work on each partition, reading it only once from slow memory, taking advantage of temporal and spatial data reuse.
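Once a partitioner has assigned each row to a partition (for example via a METIS binding such as pymetis, which is an assumed interface and not necessarily the tool used for the results in this thesis), turning the partitioning into a reordering is simply a matter of making each partition's rows contiguous:

```python
import numpy as np
import scipy.sparse as sp

def partition_reorder(A, parts):
    """Given one partition label per row (with the number of partitions chosen so
    each fits in cache), group each partition's rows together and apply the same
    permutation to the columns (illustrative sketch)."""
    A = sp.csr_matrix(A)
    perm = np.argsort(np.asarray(parts), kind="stable")   # rows grouped by partition
    return A[perm, :][:, perm]
```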

3.2 Performance Results of Reordering

In general, reordering matrices provides performance improvements, as anticipated. However, as Figure 3.3 shows, there are some matrices for which reordering provides little or no performance benefits. For both SpMV and the communication-avoiding power kernel, cant and cfd2 see no performance improvement from reordering. Looking at their spyplots (Figure 3.4), we notice that the nonzeros are clustered along the diagonal in their original orderings. This reaffirms our previous observations that such a structure is conducive to good performance.

In contrast, drastic performance improvements are gained by reordering matrices such as ldoor, parabolic_fem, and thermal2. Notice in Figure 3.5 that reordering transforms the nonzero structure of ldoor from scattered to clustered along the diagonal. Other matrices that see substantial performance improvements from reordering also undergo similar structural changes when reordered. Again, we notice that the closer a matrix is to a band matrix, the more likely it is to perform well.

From this observation, the question naturally follows as to whether or not we can predict if reordering a matrix will provide performance benefits. Running reordering heuristics takes time. If reordering provides no more than minimal performance benefits, the cost of running reordering heuristics may not be justified. Consequently, we wish to predict, prior to reordering, when it is beneficial to reorder.


[Plots omitted: Gflops (higher is better) per matrix for the orderings NONE, BFS, and METIS; (a) SpMV, (b) communication-avoiding power kernel.]

Figure 3.3: Gflops with reordering, collected on a Lynnfield running 8 threads in OpenMP.


Figure 3.4: Reordering unnecessary: (a) cant, (b) cfd2.

Figure 3.5: Reordering substantially improves performance: (a) ldoor original, (b) ldoor BFS, (c) ldoor METIS.


Chapter 4

Deciding to Reorder

We now turn to the problem of deciding, prior to reordering, whether to reorder a matrix. To make this prediction, we train a machine learning classification model on metrics evaluated on the original matrix. The generated model can then be used to classify a matrix as likely or unlikely to benefit from reordering.

4.1 Metrics

Our prediction model is based on five different metrics, or properties of a matrix: number of rows, number of nonzeros, density, bandwidth, and profile. The density of a matrix is the percent of the matrix that is nonzero. We define the width of row i as

\[ w_i = \max_{a_{ij} \neq 0 \,\wedge\, a_{ik} \neq 0} |j - k| . \]

Then we call the bandwidth of a matrix the maximum row width:

\[ \beta = \max_i w_i . \]

We define the profile of a matrix as the sum, over all rows, of the distance from the nonzero farthest from the center of a row to the center of the row:

\[ \delta = \sum_{i=1}^{n} \max_{a_{ij} \neq 0} |i - j| . \]

The bandwidth and profile metrics are adapted from those described by Gibbs et al. [8] to accommodate non-symmetric matrices.


Note that each of our metrics is simple enough that we can compute them rapidly for an arbitrary matrix. Depending on the storage format for the matrix, computing these metrics is either O(rows) or O(nonzeros).
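For a matrix in CSR form, all five metrics can be computed in one pass over the nonzeros. The following Python sketch (an illustration, not the code used to produce Table 4.1) does exactly that:

```python
import numpy as np
import scipy.sparse as sp

def matrix_metrics(A):
    """Compute rows, nonzeros, density (%), bandwidth, and profile from CSR A."""
    A = sp.csr_matrix(A)
    n_rows, n_cols = A.shape
    nnz = A.nnz
    density = 100.0 * nnz / (n_rows * n_cols)     # percent of entries that are nonzero
    bandwidth = 0
    profile = 0
    for i in range(n_rows):
        cols = A.indices[A.indptr[i]:A.indptr[i + 1]]
        if len(cols) == 0:
            continue                              # an empty row contributes nothing
        bandwidth = max(bandwidth, int(cols.max() - cols.min()))   # row width w_i
        profile += int(np.abs(cols - i).max())    # farthest nonzero from the diagonal
    return n_rows, nnz, density, bandwidth, profile
```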

We also considered using other metrics, including spatial locality [16], storage requirements for the matrix, and values of these metrics normalized by number of rows or number of nonzeros. However, we found that the model makes the most accurate predictions when using only the five metrics we selected.

4.2 Prediction Model

We use a supervised learning algorithm to construct a classifier for sparse matrices that predicts whether a given matrix will benefit from reordering. In our approach, we train this classifier on a set of sparse matrices that are labeled with the appropriate decision, and then use the classifier to predict the correct decision for new sparse matrices. For the purposes of classification, we represent a matrix m_i as a vector x_i ∈ R^5 that contains the values of the five previously described metrics on m_i. We then train the model on the set of instances X = {(x_1, y_1), ..., (x_N, y_N)} that are labeled with either y_i = + if m_i should be reordered or y_i = - if m_i should not be reordered. The class label y_i for each of the matrices was determined empirically as described in Section 4.3.

To train the classifier on X, we use the AdaBoost [6] learning algorithm to generate an ensemble of decision stumps. A decision stump is a one-level binary decision tree that attempts to split, as accurately as possible, all the training instances, X, into classes, + or -, using only one feature, or metric. AdaBoost runs multiple iterations, each iteration generating a new decision stump and increasing the weight of mispredicted instances in the training set for the next iteration. It then takes a weighted combination of all generated decision stumps to use as the final model; each decision stump is weighted as a function of its ability to classify the data. We chose the AdaBoost algorithm for its strong empirical performance in this application and its ability to avoid overfitting the training data. In our experiments, we use the implementation of AdaBoost on decision stumps included in the WEKA machine learning toolkit [9].
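For readers who want to reproduce the setup outside of WEKA, an equivalent model can be sketched with scikit-learn, whose AdaBoostClassifier uses depth-1 decision trees (decision stumps) as its default base learner; the file names below are placeholders rather than artifacts of this work:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Each row of X holds the five metrics for one matrix:
# [rows, nonzeros, density, bandwidth, profile]; y holds +1 (reorder) / -1 (don't).
X = np.loadtxt("matrix_metrics.csv", delimiter=",")   # placeholder input files
y = np.loadtxt("labels.csv", delimiter=",")

clf = AdaBoostClassifier(n_estimators=50, random_state=0)  # decision stumps by default
clf.fit(X, y)
print(clf.predict(X[:1]))   # predicted label for the first matrix
```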

Once we have generated our classifier, we need only evaluate the model on a matrix at runtime to predict whether or not to reorder the matrix. The model only needs to be generated once offline, and evaluating the model on a matrix (including the computation of the metrics) is much faster than reordering a matrix.


actual | predicted | matrix | rows | nonzeros | bandwidth | profile | % nonzeros | MB
- | - | 1d5pt | 1,000,000 | 4,999,994 | 4 | 2,000,000 | 0.0005 | 64.0
- | - | DK01R | 903 | 11,766 | 28 | 12,591 | 1.4430 | 0.1
+ | + | GT01R | 7,980 | 430,909 | 7,624 | 9,988,902 | 0.6767 | 5.2
+ | + | PR02R | 161,070 | 8,185,136 | 86,727 | 644,831,189 | 0.0315 | 98.9
- | + | Stanford_Berkeley | 683,446 | 7,583,376 | 678,132 | 3,696,254,145 | 0.0016 | 93.7
- | - | apache1 | 80,800 | 542,184 | 1,616 | 65,286,400 | 0.0083 | 6.8
- | + | apache2 | 715,176 | 4,817,870 | 68,245 | 1,722,396,924 | 0.0009 | 60.7
+ | + | audikw_1 | 943,695 | 77,651,847 | 928,967 | 509,927,131,174 | 0.0087 | 935.6
+ | + | bmw7st-1 | 141,347 | 7,339,667 | 126,396 | 2,755,263,297 | 0.0367 | 88.6
+ | + | bmwcra_1 | 148,770 | 10,644,002 | 141,053 | 2,229,336,334 | 0.0481 | 128.3
+ | + | cage13 | 445,315 | 7,479,343 | 318,787 | 35,934,876,665 | 0.0038 | 91.5
- | - | cant | 62,451 | 4,007,383 | 548 | 17,063,692 | 0.1028 | 48.3
- | + | cfd2 | 123,440 | 3,087,898 | 6,314 | 174,975,278 | 0.0203 | 37.5
- | - | consph | 83,334 | 6,010,480 | 46,481 | 356,560,602 | 0.0865 | 72.5
+ | + | cop20k_A | 121,192 | 2,624,331 | 121,052 | 7,926,785,959 | 0.0179 | 32.0
- | + | copter1 | 17,222 | 145,676 | 16,895 | 20,288,859 | 0.0491 | 1.8
+ | + | copter2 | 55,476 | 519,093 | 55,312 | 1,293,670,522 | 0.0169 | 6.5
+ | + | crankseg_2 | 63,838 | 14,148,858 | 62,040 | 2,698,030,587 | 0.3472 | 170.0
+ | + | cvxbqp1 | 50,000 | 349,968 | 44,440 | 1,320,759,788 | 0.0140 | 4.4
- | - | ex11 | 16,614 | 1,096,948 | 5,176 | 7,564,332 | 0.3974 | 13.2
+ | + | finance256 | 37,376 | 209,750 | 37,349 | 425,333,952 | 0.0150 | 2.7
+ | + | ford1 | 18,728 | 69,399 | 18,705 | 53,880,129 | 0.0198 | 0.9
+ | + | ford2 | 100,196 | 376,265 | 96,274 | 1,338,335,984 | 0.0037 | 4.9
- | - | gearbox | 153,746 | 6,071,064 | 137,953 | 318,504,177 | 0.0257 | 73.5
- | - | gridgena | 48,962 | 512,084 | 809 | 15,753,140 | 0.0214 | 6.3
+ | + | hood | 220,542 | 10,768,436 | 219,767 | 17,999,024,200 | 0.0221 | 130.1
+ | + | inline_1 | 503,712 | 36,816,342 | 502,406 | 70,758,218,052 | 0.0145 | 443.8
- | - | jnlbrng1 | 40,000 | 199,200 | 400 | 8,000,000 | 0.0124 | 2.6
- | - | kim2r | 456,976 | 11,330,020 | 2,708 | 615,095,118 | 0.0054 | 137.8
+ | + | ldoor | 952,203 | 46,522,475 | 687,560 | 170,289,658,287 | 0.0051 | 562.1
- | - | li | 22,695 | 1,215,181 | 2,469 | 21,032,091 | 0.2359 | 14.7
- | - | mac_econ_fwd500 | 206,500 | 1,273,389 | 2,090 | 72,485,971 | 0.0030 | 16.1
- | + | mc2depi | 525,825 | 2,100,225 | 770 | 247,247,616 | 0.0008 | 27.3
- | - | minsurfo | 40,806 | 203,622 | 404 | 8,242,408 | 0.0122 | 2.6
+ | + | nd24k | 72,000 | 28,715,634 | 71,042 | 2,124,197,519 | 0.5539 | 344.9
- | + | nd3k | 9,000 | 3,279,690 | 8,921 | 49,062,735 | 4.0490 | 39.4
- | - | nd6kg | 18,000 | 6,897,316 | 17,864 | 180,000,652 | 2.1288 | 82.8
- | - | obstclae | 40,000 | 197,608 | 400 | 7,920,396 | 0.0124 | 2.5
+ | - | oilpan | 73,752 | 3,597,188 | 7,286 | 52,910,663 | 0.0661 | 43.5
- | - | opt1 | 15,449 | 1,288,670 | 3,242 | 12,982,466 | 0.5399 | 15.5
+ | + | parabolic_fem | 525,825 | 3,674,625 | 525,820 | 176,303,446,485 | 0.0013 | 46.2
- | - | pdb1HYS | 36,417 | 4,344,765 | 34,475 | 161,126,579 | 0.3276 | 52.3
- | + | pds10 | 16,558 | 105,054 | 15,582 | 69,344,098 | 0.0383 | 1.3
- | + | pre2 | 659,033 | 5,834,044 | 232,486 | 5,675,656,136 | 0.0013 | 72.6
+ | + | pwt | 36,519 | 229,068 | 35,599 | 249,893,521 | 0.0172 | 2.9
+ | + | pwtk | 217,918 | 11,634,424 | 189,337 | 2,745,419,729 | 0.0245 | 140.5
- | - | ramage02 | 16,830 | 1,913,062 | 5,064 | 32,961,337 | 0.6754 | 23.0
+ | + | rma10 | 46,835 | 2,329,092 | 25,380 | 406,404,281 | 0.1062 | 28.1
- | - | s3dkq4m2 | 90,449 | 4,820,891 | 1,223 | 55,347,844 | 0.0589 | 58.2
- | - | s3dkt3m2 | 90,449 | 3,753,461 | 1,223 | 55,347,844 | 0.0459 | 45.4
+ | + | shipsec1 | 140,874 | 7,813,404 | 10,145 | 504,033,585 | 0.0394 | 94.3
+ | + | srb1 | 54,924 | 1,974,768 | 48,797 | 154,303,789 | 0.0655 | 23.9
+ | + | thermal2 | 1,228,045 | 8,580,313 | 1,226,628 | 197,878,621,915 | 0.0006 | 107.9
- | - | torsion1 | 40,000 | 197,608 | 400 | 7,920,396 | 0.0124 | 2.5
+ | + | torso1 | 116,158 | 8,516,500 | 112,127 | 420,092,343 | 0.0631 | 102.7
+ | - | vanbody | 47,072 | 2,336,898 | 5,413 | 54,557,830 | 0.1055 | 28.2
- | - | wathen100 | 30,401 | 471,601 | 608 | 7,961,300 | 0.0510 | 5.8
- | - | wathen120 | 36,441 | 565,761 | 608 | 9,541,360 | 0.0426 | 6.9
- | + | webbase-1M | 1,000,005 | 3,105,536 | 987,689 | 150,794,208,939 | 0.0003 | 41.3
+ | + | xenon2 | 157,464 | 3,866,688 | 17,783 | 763,954,479 | 0.0156 | 47.0

Table 4.1: Metrics and class labels for sample matrices.


In the event that we do not need to reorder the matrix, we have saved a significant amount of preprocessing time. If we do need to reorder the matrix, we have only introduced a very small overhead that is justified by the cases in which we do not need to reorder.

4.3 Prediction Results

In generating and evaluating our classifier, we use all 60 of the matrices we have gathered from the University of Florida Sparse Matrix Collection [3]. Table 4.1 shows our five metrics evaluated on each of the 60 sparse matrices, as well as the storage requirements and actual and predicted class labels for each matrix. Matrices with mispredicted class labels are those whose actual and predicted labels differ.

We determine the class label, y_i, for each matrix, m_i, based on the performance of the original matrix and reorderings on the communication-avoiding power kernel. Since running reordering heuristics is time consuming, we decide to train our classifier not to reorder matrices for which the best reordered performance is only marginally better than that of the original matrix. Figure 4.1 shows performance results for correctly classified matrices that should be reordered (Figure 4.1a) and that should not be reordered (Figure 4.1b). Figure 4.2 shows performance results for incorrectly classified matrices. Note that all three of these graphs are on the same scale. These results were collected with 8 threads in OpenMP running on a Lynnfield, one of the several types of machines on which we collected the results used in classifying the matrices. Although the exact performance values varied from machine to machine, we only show results from one sample machine because the relational differences between reordered and original matrix performance were, in general, preserved across machines.

We generate the classifier as described in Section 4.2. Figure 4.3 shows the classifier generated from our 60 training instances. Recall that the classifier is an ensemble of decision stumps, where each decision stump is weighted according to its ability to correctly classify the training data. In Figure 4.3, each decision stump is labeled above with and sized according to its weight. The predicted class for a matrix is determined by a weighted combination of the decisions made by each decision stump.

To test the accuracy of our model, we use leave-one-out cross-validation. This creates 60 different models, each time leaving out a different matrix from the training set to test on. The estimate of our model's accuracy is then given as the average of the accuracies of each of the 60 models on their respective testing instances. This methodology follows standard machine learning evaluation practice to produce a nearly unbiased estimate of the model's accuracy.
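Continuing the scikit-learn sketch from Section 4.2 (again only an illustration of the methodology, not the WEKA runs behind the reported numbers), leave-one-out cross-validation looks like this:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

X = np.loadtxt("matrix_metrics.csv", delimiter=",")    # placeholder files, as before
y = np.loadtxt("labels.csv", delimiter=",")

clf = AdaBoostClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())   # one held-out matrix per fold
print("estimated accuracy:", scores.mean())
```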


[Plots omitted: Gflops (higher is better) for the original and best reordered versions of each correctly predicted matrix.]

(a) Matrices classified as + (reorder).

(b) Matrices classified as - (don't reorder).

Figure 4.1: Gflops for correctly predicted matrices.


[Plot omitted: Gflops (higher is better) for the original and best reordered versions of each incorrectly classified matrix; actual/predicted -/+: apache2, cfd2, copter1, mc2depi, nd3k, pds10, pre2, Stanford_Berkeley, webbase-1M; actual/predicted +/-: oilpan, vanbody.]

Figure 4.2: Gflops for incorrectly classified matrices.


Based on this evaluation of our classifier, we achieve 82% accuracy. However, as we see in Figure 4.2, the classifier makes no significantly incorrect predictions; the original and best reordered performance is relatively similar for each incorrectly classified matrix. Although the actual class labels may be more desirable, none of the mispredictions will result in significant performance degradation. Thus, we are able to accurately predict whether or not reordering a matrix will benefit performance of the communication-avoiding power kernel.


[Diagram omitted: the ensemble's decision stumps, each labeled with and sized according to its weight: bandwidth ≤ 5,294.5 (weight 2.77); nonzeros ≤ 7,698,390 (weight 1.54); bandwidth (weight 1.39); nonzeros ≤ 2,856,114.5 (weight 1.0); profile ≤ 381,482,441.5 (weight 0.78); rows ≤ 592,429 (weight 0.64); rows ≤ 18,364 (weight 0.51).]

Figure 4.3: Generated classifier: a weighted combination of decision stumps.


Chapter 5

Conclusion

Although matrix reorderings can improve sparse matrix application performance by transforming unexploited data reuse into data locality, the performance improvements do not always justify the cost of generating a good reordering. We present a classifier that predicts, with 82% accuracy, when reordering a matrix will improve computation performance. Our classifier is an ensemble of decision stumps generated by the AdaBoost learning algorithm and trained on a set of 60 matrices. Of the mispredictions made by the classifier in testing, none of them are significantly incorrect. Mispredicted matrices suffer no more than minor decreases in performance. Consequently, our model accurately classifies matrices as likely or unlikely to gain significant performance improvements from matrix reorderings.

5.1 Future Work

One of the weaknesses of our classifier, however, is that class labels are determined solely by performance on the communication-avoiding power kernel and may not extend to other computations. We would like to develop a more robust classifier that incorporates, into the decision, information about the desired computation. This will enable generalization of predictions beyond the communication-avoiding power kernel. To further improve classification accuracy, we can incorporate additional metrics and reordering algorithms, such as Gpart [10], CPACK [4], and block diagonalization. To prevent false positives, we can investigate post-reordering checks. To avoid false negatives, we can reorder matrices with low-confidence predictions and run post-reordering checks. A related problem we wish to solve is to predict the reordering heuristic that will most improve performance for a matrix.


Bibliography

[1] I. Al-Furaih and S. Ranka. Memory hierarchy management for iterative graph structures. In IPPS '98: Proceedings of the 12th International Parallel Processing Symposium, page 298, Washington, DC, USA, 1998. IEEE Computer Society.

[2] E. Cuthill and J. McKee. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference, pages 157–172, New York, NY, USA, 1969. ACM.

[3] T. A. Davis and Y. Hu. The University of Florida Sparse Matrix Collection. ACM Transactions on Mathematical Software, (to appear). http://www.cise.ufl.edu/research/sparse/matrices.

[4] C. Ding and K. Kennedy. Improving cache performance in dynamic applications through data and computation reorganization at run time. In PLDI '99: Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, pages 229–241, New York, NY, USA, 1999. ACM.

[5] R. Durstenfeld. Algorithm 235: Random permutation. Commun. ACM, 7, July 1964.

[6] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[7] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Bell Telephone Laboratories, Incorporated, 1979.

[8] N. E. Gibbs, W. G. Poole, Jr., and P. K. Stockmeyer. An algorithm for reducing the bandwidth and profile of a sparse matrix. SIAM Journal on Numerical Analysis, 13(2):236–250, 1976.

[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, Nov. 2009.

[10] H. Han and C.-W. Tseng. A comparison of locality transformations for irregular codes. In Selected Papers from the 5th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers, LCR '00, pages 70–84, London, UK, 2000. Springer-Verlag.

[11] L. H. Harper. Optimal assignments of numbers to vertices. SIAM Journal, 12(1):131–135, 1964.

[12] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1999.

[13] M. Mohiyuddin, M. Hoemmen, J. Demmel, and K. Yelick. Minimizing communication in sparse matrix solvers. In SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1–12, New York, NY, USA, 2009. ACM.

[14] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999.

[15] Y. Saad and M. H. Schultz. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 7(3):856–869, 1986.

[16] M. M. Strout and P. D. Hovland. Metrics and models for reordering transformations. In Proceedings of the Second ACM SIGPLAN Workshop on Memory System Performance (MSP), pages 23–34, June 2004.
