Symbolic Factorisation of Sparse Matrix Using Elimination Trees
A Thesis Submitted for partial fulfillment of the Requirements for the Degree of Bachelor-Master of Technology (Dual Degree)
by
Peeyush Jain
Department of Computer Science and Engineering
Indian Institute of Technology Kanpur
Kanpur
Dedicated to My Parents and Teachers
Acknowledgements
I would like to take this opportunity to express my deep sense of gratitude to the person who has taught me what dedication is, my thesis supervisor Dr. Phalguni Gupta. His benevolent guidance, apt suggestions, unstinting help and constructive criticism have inspired me in the successful completion of the present work.
I also extend my sincere thanks to all the faculty members of the Department of Computer Science and Engineering, Indian Institute of Technology Kanpur, for the invaluable knowledge they have imparted to me and for teaching its principles in the most exciting and enjoyable way. My stay at Indian Institute of Technology Kanpur has been exciting and enlightening. The time I spent with my friends Gaurav, Mohit, Rahul, Ashish, Ashvin and Saeed is unforgettable. I am grateful for their continuous attachment, which strengthened me at difficult moments.
I take this opportunity to thank my parents for all that they have done for me.
Without their love, support and encouragement, I would never have reached this stage
in my life.
Peeyush Jain
Abstract
Many problems in science and engineering require solving linear systems of equations. As the problems get larger, it becomes increasingly important to exploit the sparsity inherent in many such linear systems. It is well recognized that finding a fill-reducing ordering is crucial to the success of the numerical solution of sparse linear systems. The use of a hybrid ordering partitioner is expected to significantly improve the fill-in of the factorized matrices, as well as the scalability of the elimination tree obtained by symbolic factorization. The most obvious way to get the required increase in performance would be to use parallel algorithms.
For dense symmetric matrices, there are quite a few well-known parallel algorithms that scale well and can be implemented efficiently on parallel computers. On the other hand, there are not many efficient, scalable parallel formulations of sparse matrix factorization using elimination trees.
The well-known sparse matrix ordering scheme PORD (Paderborn Ordering tool) uses the last element in the computation of the present element, which prevents parallelizing the whole algorithm globally. PORD spends most of its time splitting the graph into two parts and colouring it. This thesis therefore tries to parallelize the most frequently executed part of the factorization algorithm to obtain a better result in the symbolic factorization step. The approach given in this thesis might be useful in parallel graph computations that rely on a sequential ordering step. By some estimates, more than 90% of eigenvalue problems are real symmetric or complex Hermitian. This gives us the flexibility to use the ordering step in parallel with many other parallel matrix computation algorithms.
Several simple modifications to the minimum local fill-in ordering strategy have also been presented in this thesis; these strategies exploit readily available information about node adjacencies to improve the fill bounds used to select a node for elimination. In particular, this thesis describes two simple modifications to the well-known node selection strategy AMMF (approximate minimum mean local fill-in) that further improve ordering quality. It is demonstrated that the different node selection strategies produce fewer fronts, which yields a better ordering and hence lower subsequent factorization complexity.
Contents
Acknowledgements i
Abstract ii
Contents iv
List of Figures viii
1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Application used in our thesis . . . . . . . . . . . . . . . . . . 4
1.1.2 Parallel implementation part of our application . . . . . . . . 5
1.1.3 Node selection strategies of the proposed algorithm . . . . . . 6
1.2 Organization of the thesis . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Literature Review 8
2.1 Greedy ordering heuristics . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Graph-partitioning based heuristics . . . . . . . . . . . . . . . . . . . 10
2.3 Hybrid heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Parallel Construction of ordering scheme 13
3.1 Different types of ordering methods . . . . . . . . . . . . . . . . . . . 13
3.1.1 Bottom-up Methods . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.2 Top-down methods . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.2.1 Multilevel approach . . . . . . . . . . . . . . . . . . 14
3.1.2.2 Domain decomposition approach . . . . . . . . . . . 15
3.1.3 Hybrid methods . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Ordering Scheme used by PORD . . . . . . . . . . . . . . . . . . . . 17
3.3 Parallel implementation of the defined scheme . . . . . . . . . . . . . 18
3.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4 Node Selection Strategies for the Construction of Vertex Separators 25
4.1 Node selection strategy of PORD . . . . . . . . . . . . . . . . . . . . 26
4.2 Proposed approach for the node selection . . . . . . . . . . . . . . . . 27
4.2.1 First Modification . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2.2 Second Modification . . . . . . . . . . . . . . . . . . . . . . . 28
4.3 Result and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5 Software, tools and the configuration 32
5.1 Some Information About the Softwares Used . . . . . . . . . . . . . . 32
5.1.1 Blas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.1.2 Blacs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.1.3 Scalapack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1.4 MPICH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1.5 MUMPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2 How to use the system . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6 Conclusion and Scope for Future Work 36
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.2 Scope of future work . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Appendix A 42
A Symbolic Factorization and Elimination Tree 43
A.1 Cholesky Factorisation . . . . . . . . . . . . . . . . . . . . . . . . . . 43
A.2 Numerical Factorisation . . . . . . . . . . . . . . . . . . . . . . . . . 43
A.3 Symbolic Factorisation . . . . . . . . . . . . . . . . . . . . . . . . . . 44
A.4 Algorithms for symbolic factorisation . . . . . . . . . . . . . . . . . . 45
A.4.1 A graph representation of symbolic matrices . . . . . . . . . . 45
A.4.2 A basic algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 46
A.4.3 Fast symbolic Cholesky factorisation . . . . . . . . . . . . . . 47
A.5 Elimination tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Appendix B 48
B Elimination Graph and Quotient Graph 49
B.1 Elimination graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
B.2 Quotient graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Appendix C 52
C Message Passing Interface 53
C.1 Message Passing Interface . . . . . . . . . . . . . . . . . . . . . . . . 53
C.1.1 Point to Point Communication Routines . . . . . . . . . . . . 54
C.1.2 Collective Communication Routines . . . . . . . . . . . . . . . 54
C.1.3 Group and Communicator Management Routines . . . . . . . 55
Appendix D 55
D Global Array Toolkit 56
List of Figures
1.1 Conversion from symmetric matrix A to Cholesky factor LLT without
any ordering algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Conversion from symmetric matrix A to Cholesky factor LLT after
applying minimum degree algorithm . . . . . . . . . . . . . . . . . 3
3.1 Changes in runtime vs number of processor . . . . . . . . . . . . . 23
3.2 Number of fronts vs number of processor . . . . . . . . . . . . . . . 24
5.1 Dependencies of the softwares . . . . . . . . . . . . . . . . . . . . . 35
A.1 Graph induced by the sparse matrix . . . . . . . . . . . . . . . . . 46
A.2 Note that the forest happens to contain only one tree. . . . . . . . 48
B.1 Elimination graph, quotient graph, and matrix for first three steps 52
D.1 Structure of Global Array Toolkit . . . . . . . . . . . . . . . . . . . 57
Chapter 1
Introduction
1.1 Introduction
When solving large sparse symmetric linear systems of the form Ax = b, it is common
to precede the numerical factorization by a symmetric reordering. This reordering is
chosen so that pivoting down the diagonal in order on the resulting permuted matrix
PAP T = LLT produces much less fill-in and work than computing the factors of A by
pivoting down the diagonal in the original order. This reordering is computed using only
information on the matrix structure without taking account of numerical values and so
may not be stable for general matrices. However, if the matrix A is positive-definite, a Cholesky factorization (Appendix A) can safely be used. This technique of preceding the
numerical factorization with a symbolic analysis can also be extended to unsymmetric
systems although the numerical factorization phase must allow for subsequent numerical
pivoting. The goal of the preordering is to find a permutation matrix P so that the
subsequent factorization has the least fill-in. Unfortunately, this problem is NP-complete,
so heuristics are used.
The main challenge in sparse matrix ordering algorithms is to find a fill-minimizing
permutation without computing AT A or even its nonzero structure. While computing
the nonzero structure of AT A allows us to use existing symmetric ordering algorithms and
codes, it may be grossly inefficient. For example, when an n×n matrix A has non-zeros
only in the first row, first column (because of symmetry) and along the main diagonal, computing AT A takes Ω(n²) work, but factoring it takes only O(n) work (Figure 1.1).
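The effect of ordering on this "arrow" pattern can be checked with a small symbolic-elimination sketch (illustrative code, not from the thesis; `fill_in` is a hypothetical helper that counts the fill edges created by a given elimination order):

```python
def fill_in(adj, order):
    """Count fill edges created when eliminating vertices in `order`
    from the graph with adjacency sets `adj` (symbolic elimination)."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    eliminated, fill = set(), 0
    for v in order:
        nbrs = [u for u in adj[v] if u not in eliminated]
        # eliminating v turns its remaining neighbours into a clique
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
        eliminated.add(v)
    return fill

n = 8
# "arrow" pattern: nonzeros in row/column 0 and on the diagonal
arrow = {0: set(range(1, n))}
for v in range(1, n):
    arrow[v] = {0}

print(fill_in(arrow, list(range(n))))        # hub first: 21 fill edges
print(fill_in(arrow, list(range(n))[::-1]))  # hub last: 0 fill edges
```

Eliminating the dense row first turns the remaining vertices into a clique, while eliminating it last produces no fill at all, which is why the ordering step matters so much.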
Figure 1.1: Conversion from symmetric matrix A to Cholesky factor LLT without any ordering algorithm
Improving the run time and quality of ordering heuristics has been a subject of research for almost three decades. Two main classes of successful heuristics have evolved
over the years: (1) minimum-degree (MD)-based heuristics, and (2) graph-partitioning
(GP)-based heuristics. MD-based heuristics are local greedy heuristics that reorder the
columns of a symmetric sparse matrix such that the column with the fewest non-zeros at
a given stage of factorization is the next one to be eliminated at that stage. GP-based
heuristics regard the symmetric sparse matrix as the adjacency matrix of a graph and
follow a divide-and-conquer strategy to label the nodes of the graph by partitioning it
into smaller subgraphs.
The striking success of MD-based heuristics prompted intense research to improve
their run time and quality, and they have been the methods of choice among practi-
tioners. The multiple minimum-degree (MMD) algorithm by George and Liu[3] and
the approximate minimum-degree(AMD) algorithm by Davis, Amestoy, and Duff [34]
represent the state of the art in MD-based heuristics. Other widely used heuristics exist for the sparse matrix ordering step, such as the minimum local fill-in algorithm (MMF), approximate minimum mean local fill-in (AMMF), nested dissection (ND), and externally computed hybrid orderings such as Scotch, Metis, Pspases, Spooles and PORD.
The minimum degree ordering algorithm is one of the most widely used heuristics,
since it produces factors with relatively low fill-in on a wide range of matrices. Because
of this, the algorithm has received much attention over the past three decades. Since
the algorithm performs its pivot selection by choosing from a graph a node of minimum degree, some improvements have been made to reduce the memory complexity, so that the algorithm can operate within the storage of the original matrix, and to reduce the amount of work needed to keep track of the degrees of nodes in the graph (which is the most computationally intensive part of the algorithm). More recently,
several researchers have relaxed this heuristic by computing upper bounds on the degrees,
rather than the exact degrees, and selecting a node of minimum upper bound on the
degree.
Figure 1.2: Conversion from symmetric matrix A to Cholesky factor LLT after applying minimum degree algorithm
Nested Dissection is one more effective method of finding an elimination ordering.
The algorithm uses a divide and conquer strategy on the graph. Removal of a set of
vertices results in two new graphs on which this dissection may be performed separately.
The results for the two parts may then be combined to find the solution of the entire
graph. This algorithm is based on finding separators. The recursion of these algorithms
suggests a natural decomposition of graphs in terms of their separators. At the highest
level is a separator that divides the graph into components. These components them-
selves have separators, and so on. At the lowest levels are components that may not
be divided any further (possibly singleton vertex sets). This method has been shown to
result in good elimination orderings for certain classes of graphs.
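A toy sketch of this divide-and-conquer numbering on a chain graph, where the middle vertex is an obvious separator, is given below (illustrative only; general graphs require a real separator finder, which is the hard part of the method):

```python
def nested_dissection(vertices):
    """Nested dissection on a 1-D chain graph (vertices adjacent iff
    consecutive): the middle vertex is a separator; recurse on the two
    halves and number the separator last."""
    if len(vertices) <= 2:
        return list(vertices)
    mid = len(vertices) // 2
    left = nested_dissection(vertices[:mid])
    right = nested_dissection(vertices[mid + 1:])
    return left + right + [vertices[mid]]  # separator eliminated last

print(nested_dissection(list(range(7))))  # [0, 2, 1, 4, 6, 5, 3]
```

Numbering each separator after the components it divides is what makes the resulting elimination ordering fill-efficient for such graphs.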
In general, the above two algorithms can be combined to produce better order-
ing. This hybrid ordering is sometimes called incomplete nested dissection. Incomplete
nested dissection is used to produce more robust orderings. In this scheme the recursive
process of constructing vertex separators is terminated after a few levels, and the vertices
in the remaining subgraphs are ordered, using either multiple minimum degree (MMD),
or constrained minimum degree (CMD). Independently, Ashcraft and Liu[1, 2] and Rothberg [14] have shown that the quality of an incomplete nested dissection ordering can be
further improved when using minimum degree to order the separator vertices instead of
following the given nested dissection ordering. To summarize, two levels of hybridiza-
tion can be found in the literature: incomplete nested dissection, and minimum degree
post-processing on an incomplete nested dissection. The latter one has been applied
successfully in state-of-the-art ordering codes such as BEND, SPOOLES, and WGPP.
Ashcraft and Liu[26] present a more general classification of hybrid schemes known as
multi-section ordering.
PORD (Paderborn Ordering tool), which also uses the multi-section ordering defined by Ashcraft and Liu, presents an ordering algorithm that achieves a tighter coupling of bottom-up and top-down methods by interpreting vertex separators as the boundaries of the last elements in a bottom-up ordering. The fundamental idea of this algorithm is to generate a sequence of quotient graphs using a bottom-up
node selection strategy. The quality of a bottom-up scheme such as minimum degree is
quite sensitive to the way ties are broken when there is more than one vertex eligible for
elimination.
1.1.1 Application used in our thesis
Sparse matrix ordering is an important problem that has extensive applications in many
areas including scientific computing, VLSI design, task scheduling, geographical infor-
mation systems and operations research etc. MUMPS (‘MUltifrontal Massively Parallel
Solver’) also uses some sparse matrix ordering schemes for solving systems of linear equa-
tions of the form Ax = b, where the matrix A is sparse and can be either unsymmetric,
symmetric positive definite, or general symmetric. MUMPS uses a multi-frontal tech-
nique which is a direct method based on either the LU or LDLT factorization of the
matrix[18].
MUMPS distributes the work tasks among the processors, but an identified proces-
sor (the host) is required to perform most of the analysis phase, distribute the incoming
matrix to the other processors (slaves) in the case where the matrix is centralized, and
collect the solution. The system Ax = b is solved in three main steps:
1. Analysis – A range of orderings to preserve sparsity is available in the analysis
phase. The host performs an ordering based on the symmetrized pattern A + AT , and
carries out symbolic factorization (Appendix A). A mapping of the multifrontal compu-
tational graph is then computed, and symbolic information is transferred from the host
to the other processors. Using this information, the processors estimate the memory
necessary for factorization and solution.
2. Factorization – The original matrix is first distributed to processors that will
participate in the numerical factorization. The numerical factorization on each frontal
matrix is conducted by a master processor (determined by the analysis phase) and one
or more slave processors (determined dynamically). Each processor allocates an array
for contribution blocks and factors; the factors must be kept for the solution phase.
3. Solution – The right-hand side b is broadcast from the host to the other processors. These processors compute the solution x using the (distributed) factors computed during the previous step, and the solution is either assembled on the host or kept distributed on the processors.
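The structural symmetrization used in the analysis step can be sketched in a few lines (an illustrative helper, not the MUMPS API; `entries` is assumed to be the set of structurally nonzero positions of A):

```python
def symmetrized_pattern(entries):
    """Pattern of A + A^T: positions nonzero in A or in its transpose.
    Only the structure matters for the ordering/analysis phase."""
    pat = set()
    for i, j in entries:
        pat.add((i, j))
        pat.add((j, i))
    return pat

# unsymmetric 3x3 example: diagonal plus nonzeros at (0, 1) and (2, 0)
A = {(0, 0), (1, 1), (2, 2), (0, 1), (2, 0)}
print(sorted(symmetrized_pattern(A)))  # now also contains (1, 0) and (0, 2)
```

Working on the symmetrized pattern lets a symmetric ordering scheme such as PORD be applied even when the input matrix itself is unsymmetric.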
This thesis tries to reduce the complexity of the ordering step so as to reduce the fill-in produced in the analysis phase. Note that the host (a single processor) performs the analysis phase sequentially. In this thesis, we therefore try to carry out the analysis phase in parallel instead of assigning it to one processor. MUMPS can use different ordering schemes in the analysis phase; this thesis uses PORD as the main ordering scheme.
1.1.2 Parallel implementation part of our application
The most obvious way to get the required increase in performance would be to use parallel algorithms. However, the parallelism inherent in the above algorithms is not straightforward to exploit. The key observation is that nodes that are far apart can be eliminated (i.e., deleted, with cliques formed among their neighbours) in the same step. So codes that use multi-section ordering can eliminate several nodes in the same step, and on a parallel machine different processors could be responsible for eliminating each node. Unfortunately, this parallelism is very fine grained, which makes it difficult to avoid high communication costs.
Because of the tighter coupling in PORD, it is quite difficult to parallelize the whole algorithm globally, since it performs many node eliminations in a single step. This thesis therefore tries to parallelize the most frequently executed part in order to obtain a better result in the symbolic factorization step.
To parallelize the most frequently executed phase of the two-level method, our implementation first used MPI (Message Passing Interface) (Appendix C), which is employed by much software for parallel sparse matrix computation. But MPI only passes messages between processors; it has no notion of global shared memory. Each process checks the multisector and changes the representative of that multisector according to its vertex type, checksum and indistinguishability. Whenever one process changes a representative or vertex type, every other process has to learn about it, and gathering all this information correctly takes too much time. This degrades the performance of the ordering step so badly that the sequential version is far better than this parallel version.
To increase the performance of the parallel version, we have to exploit the possibility of providing global shared memory for some structures. For this purpose, the Global Array Toolkit (GA) (Appendix D) has been used, which provides a portable Non-Uniform Memory Access (NUMA) shared memory programming environment in the context of distributed array data structures (called 'global arrays').
This approach is explained in a later section. The approach given in this thesis might be useful in parallel graph computations that use a sequential ordering algorithm, to obtain better results.
1.1.3 Node selection strategies of the proposed algorithm
In multi-section ordering, selecting nodes based on their degree, checksum and vertex weight is an important issue for obtaining a better ordering. PORD uses different node selection strategies to get a good ordering in a small amount of time. An effective but less popular strategy for node selection is the minimum local fill (or minimum deficiency) algorithm, which always chooses a node whose elimination creates the least amount of fill. However, it takes a huge amount of time to compute the ordering.
There are two reasons why the minimum deficiency algorithm has not become as
popular as the minimum degree algorithm. First, the minimum deficiency algorithm
is typically much more expensive than the minimum degree algorithm. Second, it was
believed that the quality of minimum deficiency orderings is not much better than that
of minimum degree orderings. Contrary to popular belief, minimum local fill produces
significantly better orderings than minimum degree, albeit at a greatly increased runtime.
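The difference between the two metrics can be made concrete with a short sketch (illustrative code only; `deficiency(adj, v)` counts the fill edges that eliminating v would create, which is the quantity the minimum deficiency algorithm minimizes):

```python
def deficiency(adj, v):
    """Fill created by eliminating v: pairs of neighbours of v
    that are not already adjacent to each other."""
    nbrs = list(adj[v])
    return sum(1 for i, a in enumerate(nbrs)
                 for b in nbrs[i + 1:] if b not in adj[a])

# 4-cycle 0-1-2-3-0: vertex 0 has degree 2 but deficiency 1
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
# add chord 0-2: vertex 1 still has degree 2, yet deficiency 0
chorded = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(deficiency(cycle, 0), deficiency(chorded, 1))  # 1 0
```

This also shows why the two metrics can rank vertices differently: the deficiency can be zero even when the degree is not, so a minimum-degree code and a minimum-fill code may eliminate quite different vertices at the same step.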
PORD uses the AMMF node selection strategy to obtain a better ordering. Because of the better ordering, PORD spends less time in numerical factorization (Appendix A). This thesis describes two simple modifications to this heuristic that produce even better orderings. Our proposed approach thus explores simple approximations to these fill metrics with the goal of reducing the runtime of the known minimum local fill-in algorithm. The best of these approximations is no more difficult to compute than the degree, yet the orderings produced require noticeably less factorization work than those produced by the minimum degree strategy.
These modifications also give a better result in the number of fronts, which is quite a striking factor in the subsequent factorization.
1.2 Organization of the thesis
The present thesis is divided into six chapters, a brief detail of each is given as follows:
• The first chapter discusses general aspects of sparse matrix computation. It mainly discusses the pros and cons of different types of ordering algorithms. This also involves some understanding of symbolic factorization and elimination trees, which are explained in the appendices at the end of the thesis.
• In the second chapter, a complete literature review is presented to acquaint the reader with ongoing work in the area of ordering algorithms required for solving large sparse symmetric linear systems.
• In the third chapter, a parallel implementation of our proposed approach is presented. This chapter also discusses the different types of methods used for ordering a sparse matrix. It defines the methodology used by PORD, after which our parallel approach is presented. Results and some discussion of this implementation are also presented in this chapter.
• In the fourth chapter, different node selection strategies are presented. This chapter separately presents PORD's node selection strategy and, based on it, defines two new node selection strategies to obtain a better ordering. Results and a brief discussion of the effect of the node selection strategies are also presented in this chapter.
• In the fifth chapter, a set of software utilities is discussed, along with the environment in which our software runs and the dependencies between these utilities.
• The last chapter presents the conclusions drawn from the present work and its contribution to the literature. The scope for future work and the applications are also discussed in the same chapter.
Chapter 2
Literature Review
It is well known that ordering the rows and columns of a matrix is a crucial step in
the solution of sparse linear systems using elimination graph[12, 22]. The ordering can
drastically affect the amount of fill introduced during factorization and hence the cost
of computing the factorization. When the matrix is symmetric and positive definite,
the ordering step is independent of the numerical values and can be performed prior to
numerical factorization. An ideal choice, for example, is an ordering that introduces the
least fill.
As a matter of fact, ordering has an important place in solving sparse linear systems[29, 35]. But because of its NP-completeness, almost all ordering algorithms are heuristic in nature. Examples include reverse Cuthill-McKee, automatic nested dissection, and minimum degree.
2.1 Greedy ordering heuristics
A greedy ordering heuristic numbers columns successively by selecting at each step a
column with the optimal value of a metric. In the minimum degree algorithm of Tinney
and Walker [41], the metric is the number of operations in the rank-1 update associated
with a column in a right-looking, sparse Cholesky factorization. The algorithm can
be stated in terms of vertex eliminations in a graph representing the matrix. In this
framework, the number of operations in the rank-1 update is proportional to the square
of the degree of a vertex; consequently, implementations use the degree as the metric.
Efficient implementations of minimum degree are due to George and Liu[1, 2, 3] and
the minimum degree algorithm with multiple eliminations (MMD), due to Liu[28], has
become very popular in the last decade. Multiple independent vertices are eliminated at
a single step in MMD to reduce the ordering time. More recently, Amestoy, Davis, and
Duff [34] have developed the approximate minimum degree (AMD) algorithm. AMD uses
an approximation to the degree to further reduce the ordering time without degrading
the quality of orderings produced. Berman and Schnitger [37] have analytically shown
that the minimum degree algorithm can, in some rare cases, produce a poor ordering.
However, experiments have shown that the minimum degree algorithm and its variants are
effective heuristics for generating fill-reducing orderings. In fact, only some very recent
separator-based schemes have outperformed MMD for certain classes of sparse matrices.
Two of these new schemes are hybrids of a separator-based scheme and a greedy ordering
strategy such as the minimum degree algorithm.
One more greedy ordering heuristic that was also proposed by Tinney and Walker,
but has largely been ignored, is the minimum deficiency (or minimum fill) algorithm.
The minimum deficiency algorithm minimizes the number of fill entries introduced at
each step of sparse Cholesky factorization (or deficiency in graph terminology). Although
the metrics look similar, the minimum deficiency and minimum degree algorithms are
different. For example, the deficiency could well be zero even when the degree is not.
Some results by Rothberg demonstrate that minimum deficiency leads to significantly
better orderings than minimum degree. However, current implementations of the mini-
mum deficiency algorithm are slower than MMD by more than an order of magnitude.
Rothberg has investigated metrics for greedy ordering schemes based on approximations
to the deficiency.
A more recent greedy ordering algorithm was proposed by Ng and Raghavan[16]. They establish that many of the techniques used in efficient implementations of the minimum degree algorithm (namely, indistinguishable vertices, mass elimination and outmatching) also apply to the minimum deficiency algorithm. They also corroborate Rothberg's [15] empirical results establishing the superior performance of the minimum deficiency metric. They describe two heuristics based on approximations to the deficiency and the degree. Both metrics can be implemented using either the update mechanism in MMD or the faster scheme in AMD. In their paper, the correction term approximating shared edges is omitted because they restrict their work to partial cliques that are disjoint. However, the heuristic performs poorly if the correction term is absent.
2.2 Graph-partitioning based heuristics
Graph-partitioning-based heuristics are capable of producing better-quality orderings than MD-based heuristics for finite-element problems, while staying within a small constant factor of the run time of MD-based heuristics.
An important area where sparse-matrix orderings are used is that of linear program-
ming. Until now, with the exception of Rothberg and Hendrickson[5], most researchers
have focused on ordering sparse matrices arising in finite-element applications, and these
applications have guided the development of the ordering heuristics. The use of the
interior-point method for solving linear programming problems is relatively recent. As a result, the linear programming community has been using well-established heuristics that were not originally developed for their applications. A graph-partitioning-based sparse matrix ordering algorithm is capable of generating robust orderings of sparse matrices arising in linear programming, in addition to finite-element and finite-difference matrices.
Graph partitioning based ordering methods are more suitable for solving sparse
systems using direct methods on distributed-memory parallel computers than MD-based
methods, in two respects. First, there is strong theoretical and experimental evidence
that the process of graph partitioning and sparse-matrix ordering based on it can be
parallelized effectively. On the other hand, the only attempt to perform a minimum-
degree ordering in parallel that we are aware of was not successful in reducing the ordering
time over a serial implementation. Second, in addition to being parallelizable itself, a
graph partitioning based ordering also aids the parallelization of the factorization and
triangular solution phases of a direct solver.
Gupta, Karypis, and Kumar [17, 31] have proposed a highly scalable parallel for-
mulation of sparse Cholesky factorization. This algorithm derives a significant part of
its parallelism from the underlying partitioning of the graph of the sparse matrix. Gupta
and Kumar present efficient parallel algorithms for solving lower and upper-triangular
systems resulting from sparse factorization. In both parallel factorization and triangular
solutions, part of the parallelism would be lost if an MD-based heuristic were used to preorder the sparse matrix.
Recent research has shown multilevel algorithms to be fast and effective in com-
puting graph partitions. A typical multilevel graph-partitioning algorithm has four com-
ponents: coarsening, initial partitioning, uncoarsening, and refinement. Recently, Anshul Gupta[4] has presented a fast and effective graph-partitioning scheme which he called WGPP. In WGPP, with a graph-partitioning-based ordering, the matrix columns corresponding to the nodes of a separator usually tend to become denser during factorization than
the columns corresponding to the nodes of the subgraphs that the separator separates.
This is because the separator columns receive fill-in from the columns of both of the
subgraphs that they separate. On the other hand, the columns of each subgraph receive
fill-in from the nodes of only that subgraph. In addition, separators typically have fewer
nodes than the separated subgraphs. A large number of columns contributing fill-in to a
small number of separator columns results in the separator columns becoming relatively
dense during factorization. This is quite a large drawback for input matrices that are not very dense.
2.3 Hybrid heuristics
The quality of a greedy ordering scheme such as minimum degree is quite sensitive to the
way ties are broken when there is more than one vertex eligible for elimination. Berman
and Schnitger [37] describe a minimum degree elimination sequence for the k × k grid
so that the number of factor entries and the number of factorization operations is an
order of magnitude higher than optimal. On the other hand, a graph partitioning scheme
such as nested dissection produces asymptotically optimal orderings for these grids. The
situation changes completely when considering h×k grids with large aspect ratio. Here,
minimum degree outperforms nested dissection.
Hybridizing two heuristics, as in incomplete nested dissection and minimum degree
post-processing of an incomplete nested dissection, is an important way to improve the
ordering scheme. This approach has been used successfully in state-of-the-art ordering
codes such as BEND, SPOOLES, TAUCS and PSPASES. Ashcraft and Liu [6, 7, 8, 9, 10]
also present a more general classification of hybrid schemes known as multi-section
ordering.
Jurgen Schulze [26] presented a hybrid heuristic ordering scheme which he calls
PORD. PORD uses a two-level method to construct a vertex separator with the help
of domain decomposition, but it works as a sequential algorithm. Turning PORD into
a parallel algorithm is not an easy task because of its tight coupling of bottom-up and
top-down methods, with vertex separators interpreted as the boundaries of the last
elements in a bottom-up ordering. Multiple node elimination thus becomes the
bottleneck for the parallelization of PORD.
PORD also uses different types of node selection strategies for the construction
of vertex separators which improves the ordering further. In their methodology vertex
separators are interpreted as the boundaries of the last elements in a bottom-up ordering.
They are considered as a tool for guiding the elimination process. The shortcomings of
a bottom-up method such as minimum degree are largely due to the local nature of
the algorithm. On the other hand, vertex separators afford an insight into the global
structure of the graph. The observation is that the quality of an ordering is improved
if the elements created in the elimination process have smooth boundaries.
In PORD, since the removal of a vertex separator S partitions a graph in two
subgraphs, the variables corresponding to S constitute a large boundary segment that is
shared by two well aligned elements. When eliminating the vertex separators according
to the given nested dissection order, well-aligned elements are merged to form new well-
aligned elements. This is achieved by the recursive structure of the nested dissection
algorithm. However, the nested dissection order represents only one possibility to create
new well-aligned elements. PORD considers all orderings that can be created by
the rule: (a) eliminate all vertex separators in levels l + 1, ..., lev using the given nested
dissection order, and (b) eliminate all vertex separators in levels 0, ..., l using a bottom-
up algorithm, where lev denotes the depth of the elimination tree and l ∈ {0, ..., lev}.
Two different enhancements of this node selection strategy are presented in this
thesis; they simply decrease the number of fronts of the elimination tree obtained after
the ordering step.
Chapter 3
Parallel Construction of Ordering
Scheme
In this chapter, first the different types of sparse matrix ordering methods most
commonly used in linear programming are defined. The chapter then demonstrates
the ordering method used by PORD and the parallel implementation of that scheme,
with results and discussion.
3.1 Different types of ordering methods
Over the years, many heuristics have been proposed in order to obtain good
orderings. The most common methods used for ordering are presented here. We
also define the matrix-graph relation in this section.
3.1.1 Bottom-up Methods
There are many bottom-up ordering methods known in sparse matrix world. The mini-
mum degree algorithm is one of the most popular bottom-up ordering schemes[3, 6, 39].
Over the years many enhancements have been proposed to the basic algorithm that have
greatly improved its efficiency.
Perhaps one of the most important enhancements is the concept of supernodes.
Two vertices x, y of an elimination graph Gk belong to the same supernode if
adjGk(x) ∪ {x} = adjGk(y) ∪ {y}. In this context the vertices x, y are called
indistinguishable. Indistinguishable vertices possess two important properties: (a) they
can be eliminated consecutively in a minimum degree ordering, and (b) they remain
indistinguishable in all subsequent elimination graphs. As a consequence, all vertices
that belong to a supernode I can be replaced by a single logical node with weight |I|.
Thus, the runtime of the minimum degree algorithm is significantly reduced.
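As a small illustrative sketch (an assumption of this edit, not code taken from PORD), the test adjGk(x) ∪ {x} = adjGk(y) ∪ {y} can be checked directly with marker arrays over a graph stored in the compressed format used by the listings later in this chapter (xadj/adjncy):

```c
/* Sketch: return 1 if vertices x and y are indistinguishable, i.e.
 * adj(x) U {x} equals adj(y) U {y}.  Neighbours of v are stored in
 * adjncy[xadj[v] .. xadj[v+1]-1]; n is the number of vertices (<= 64
 * in this small sketch). */
static int indistinguishable(int n, const int *xadj, const int *adjncy,
                             int x, int y)
{
    int in_x[64] = {0}, in_y[64] = {0};
    int i, v;

    in_x[x] = 1;                                   /* adj(x) U {x} */
    for (i = xadj[x]; i < xadj[x+1]; i++) in_x[adjncy[i]] = 1;
    in_y[y] = 1;                                   /* adj(y) U {y} */
    for (i = xadj[y]; i < xadj[y+1]; i++) in_y[adjncy[i]] = 1;

    for (v = 0; v < n; v++)                        /* compare the two sets */
        if (in_x[v] != in_y[v]) return 0;
    return 1;
}
```

In a triangle, every pair of vertices is indistinguishable and can be merged into one supernode; on a path of three vertices the two endpoints are not.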
3.1.2 Top-down methods
The most efficient top-down ordering scheme is George's nested dissection algorithm [1].
Nested dissection is a divide-and-conquer strategy for ordering sparse matrices. Let Vs
be a set of vertices (called a separator) whose removal, along with all edges incident on
vertices in Vs, disconnects the graph into two remaining subgraphs, G1 = (V1, E1) and
G2 = (V2, E2). If the matrix is reordered so that the vertices within each subgraph are num-
bered contiguously and the vertices in the separator are numbered last, then the matrix
will have a bordered block diagonal format. This idea can be applied recursively, break-
ing each subgraph into smaller and smaller pieces with successive separators, giving a
nested sequence of dissections of the graph that inhibit fill and promote concurrency at
each level.
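The recursive numbering can be seen in a minimal sketch (an illustration written for this text, not George's actual code): nested dissection of a 1-D chain, where the middle vertex forms a one-vertex separator, the two halves are ordered recursively, and the separator is numbered last.

```c
/* Nested dissection on a 1-D chain lo..hi: the middle vertex is the
 * separator, the two halves are ordered recursively, and the separator
 * is numbered last.  perm[v] receives the new number of vertex v; next
 * is the next free number, and the updated value is returned. */
static int nd_chain(int lo, int hi, int *perm, int next)
{
    int mid;
    if (lo > hi) return next;
    if (hi - lo < 2) {                  /* small subgraph: number directly */
        for (mid = lo; mid <= hi; mid++) perm[mid] = next++;
        return next;
    }
    mid = (lo + hi) / 2;                /* one-vertex separator */
    next = nd_chain(lo, mid - 1, perm, next);
    next = nd_chain(mid + 1, hi, perm, next);
    perm[mid] = next++;                 /* separator numbered last */
    return next;
}

/* Convenience helper: the number assigned to the root separator of a
 * chain with n vertices (n <= 64 in this sketch). */
static int nd_chain_root(int n)
{
    int perm[64];
    nd_chain(0, n - 1, perm, 0);
    return perm[(n - 1) / 2];
}
```

For a chain of seven vertices the root separator (vertex 3) is numbered last, which is exactly the bordered block diagonal structure described above.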
The effectiveness of nested dissection in limiting fill depends on the size of the
separators that split the graph, with smaller separators obviously being better[32, 33].
The relative sizes of the resulting subgraphs are also important. Maximum benefit from
the divide-and-conquer approach is obtained when the remaining subgraphs are of about
the same size; an effective nested dissection algorithm should not permit an arbitrarily
skewed ratio between the sizes of the pieces.
In contrast to the bottom-up methods introduced above, the nested dissection al-
gorithm is quite ill-specified. Determining the separator is an important issue for a
good implementation. Some approaches are discussed below:
3.1.2.1 Multilevel approach
Multilevel algorithms have been applied successfully to the construction of edge sepa-
rators. Roughly speaking, a multilevel algorithm consists of three phases. In the first
phase the original graph G is approximated by a sequence of smaller graphs that main-
tain the essential properties of G (coarsening phase). Then, an initial edge separator is
constructed for the last graph in the sequence (partitioning phase). Finally, the edge
separator is projected backwards to the next larger graph in the sequence until G is
reached (uncoarsening phase). A local improvement heuristic such as Kernighan-Lin or
Fiduccia-Mattheyses is used to refine the edge separator after each uncoarsening step.
This approach is used prominently in graph-partitioning algorithms.
3.1.2.2 Domain decomposition approach
Domain decomposition is a widely known top-down approach used by several ordering
schemes. In contrast to the multilevel method, Ashcraft and Liu propose a two-level
approach to construct a vertex separator. Analogous to the domain decomposition
methods for solving PDEs (partial differential equations), the vertex set X of G is
partitioned into X = φ ∪ Ω1 ∪ ... ∪ Ωr with adjG(Ωi) ⊂ φ for all 1 ≤ i ≤ r, where the
Ωi are the domains. The set φ is called the multisector. The removal of φ splits G into
connected subgraphs G(Ω1), ..., G(Ωr). Once φ has been found, a color from {WHITE,
BLACK} is assigned to each Ωi. This induces a coloring of the vertices u ∈ φ.
color(u) =
    WHITE, if all Ωi with u ∈ adjG(Ωi) are colored WHITE
    BLACK, if all Ωi with u ∈ adjG(Ωi) are colored BLACK
    GRAY,  otherwise
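The rule above translates directly into code; the helper below is a hypothetical sketch (the colours of the adjacent domains are assumed to be passed in an array):

```c
enum { WHITE, BLACK, GRAY };

/* Colour of a multisector vertex from the colours of the ndom domains
 * it is adjacent to, following the rule of Ashcraft and Liu quoted
 * above: WHITE if all adjacent domains are WHITE, BLACK if all are
 * BLACK, GRAY otherwise. */
static int multisec_color(const int *dom_color, int ndom)
{
    int i, all_white = 1, all_black = 1;
    for (i = 0; i < ndom; i++) {
        if (dom_color[i] != WHITE) all_white = 0;
        if (dom_color[i] != BLACK) all_black = 0;
    }
    if (all_white) return WHITE;
    if (all_black) return BLACK;
    return GRAY;
}
```

Only the GRAY vertices enter the separator S below.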
According to Ashcraft and Liu, the set S = {u ∈ φ : color(u) = GRAY} constitutes
a vertex separator of G for every coloring of Ω1, ..., Ωr if and only if

∀ u, v ∈ φ : {u, v} ∈ E ⇒ ∃ Ωi with u, v ∈ adjG(Ωi)
In general, not all vertices u, v ∈ φ satisfy the above equation. These vertices
are then grouped into segments V ⊂ φ. As a result, one obtains a partitioning P =
{V1, ..., Vs}, φ = V1 ∪ ... ∪ Vs, of the multisector. The segments in P satisfy the following
condition:

∀ V, V′ ∈ P : adjG(V) ∩ V′ ≠ ∅ ⇒ ∃ Ωi with V ∩ adjG(Ωi) ≠ ∅ and V′ ∩ adjG(Ωi) ≠ ∅
So now coloring can be defined as:
color(V) =
    WHITE, if all Ωi with V ∩ adjG(Ωi) ≠ ∅ are colored WHITE
    BLACK, if all Ωi with V ∩ adjG(Ωi) ≠ ∅ are colored BLACK
    GRAY,  otherwise
This guarantees that S = {u ∈ φ : ∃ V ∈ P with u ∈ V and color(V) = GRAY}
constitutes a vertex separator of G for every coloring of Ω1, ..., Ωr.
Ashcraft and Liu use a block Fiduccia-Mattheyses scheme to determine a col-
oring of the sets Ω1, ..., Ωr that minimizes the size of the induced vertex separator
S. Once S has been found, a sophisticated network-flow algorithm is used to smooth S.
(This presentation follows Jurgen Schulze [26].)
3.1.3 Hybrid methods
Many hybrid methods exist that combine bottom-up and top-down methods. In
order to improve both run-times and ordering quality, this thesis uses a hybrid of
minimum degree and nested dissection. PORD hybridizes the methods in two ways.
The first is the standard incomplete nested dissection method. Starting with the
original graph, several levels of nested dissection are performed. Once the subgraphs
are smaller than a certain size, they are ordered using minimum degree. This allows
the ordering to reap the benefits of nested dissection at the top levels, where most of
the factorization work is performed, while obtaining the runtime advantages of
minimum degree on the smaller problems.
The second hybridization PORD uses is minimum degree post-processing on an
incomplete nested dissection ordering. The idea is to reorder the separator vertices
using minimum degree. A simple intuition behind this hybrid method is that nested
dissection makes an implicit assumption that recursive division of the problem is the
best approach to ordering. Allowing minimum degree to reorder the separator vertices
removes this assumption.
3.2 Ordering Scheme used by PORD
In PORD, vertex separators are interpreted as the boundaries of the last elements in
a bottom-up ordering. As a consequence, we are using quotient graphs (Appendix B)
and special node selection strategies for the construction of separators. This achieves a
tighter coupling of bottom-up and top-down methods.
Most ordering schemes use a matching technique to coarsen a graph G =
(X, E). However, in PORD, the coarsening process relies on quotient graphs. It starts
by constructing an initial quotient graph G0 from G. Based on G0 a sequence of quo-
tient graphs G1, ...,Gt is produced, where Gi is obtained from Gi−1, 1 ≤ i ≤ t, by the
elimination of certain variables.
PORD also applies the coloring scheme after finding the separator. PORD tries to
minimize the total weight of all the variables in the quotient graph, which increases the
probability of finding a small (i.e. lightly weighted) separator of Gi and of G. The scheme
consists of four steps.
1. Construction of the initial quotient graph – This starts with computing a
maximal independent set of the graph. This set is called a multisector, and PORD
tries to remove this vertex set from the initial graph as a separator. This separator
follows the rule of the domain-decomposition approach.
2. Construction of further quotient graphs – Once the initial separator has been
found, the implementation constructs the further quotient graphs Gi with the help
of Gi−1. Each merging operation corresponds to an elimination step in a bottom-up
algorithm. Next, all remaining variables that are adjacent to exactly one element
are merged with that element. Once a new quotient graph has been constructed,
all variables that are adjacent to the same set of elements can be replaced by a
single supervariable. This further reduces the number of nodes in the quotient
graph.
3. Coloring of quotient graphs – After finding the separator of each quotient
graph, the scheme colors the vertices with the help of the coloring scheme defined
by the domain-decomposition approach. This assigns one of three colors (white,
black and gray) to each supervariable of the quotient graph.
4. Smoothing the final separator – After coloring the vertices, this method tries
to balance the weight of the white and the black vertices to obtain a good separator
from the quotient graph. Often the separator can be improved by exchanging a
boundary segment with the vertices of a domain. The whole process is repeated
until neither of the two minimum weighted vertex covers (black and white) improves
the actual separator.
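Step 1 above starts from a maximal independent set; a greedy sketch over the compressed graph format is given below. (This simple vertex-order greedy is an assumption for illustration only; PORD's actual selection strategy is more refined.)

```c
/* Greedy maximal independent set on a graph in compressed format:
 * a vertex joins the set if none of its neighbours has joined before.
 * Returns the size of the set; in_mis[v] is set to 1 for chosen
 * vertices.  Neighbours of v are adjncy[xadj[v] .. xadj[v+1]-1]. */
static int greedy_mis(int n, const int *xadj, const int *adjncy, int *in_mis)
{
    int v, i, count = 0;
    for (v = 0; v < n; v++) in_mis[v] = 0;
    for (v = 0; v < n; v++) {
        int blocked = 0;
        for (i = xadj[v]; i < xadj[v+1]; i++)
            if (in_mis[adjncy[i]]) { blocked = 1; break; }
        if (!blocked) { in_mis[v] = 1; count++; }
    }
    return count;
}
```

On a path of four vertices the greedy set is {0, 2}; on a triangle only one vertex can be chosen, since any two are adjacent.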
The performance of this quotient graph method crucially depends on the perfor-
mance of the separator function. This function represents the entry point of our
optimization algorithm.
3.3 Parallel implementation of the defined scheme
PORD uses the two-level method to construct a vertex separator with the help of domain
decomposition. In the previous section, we saw that the separator function is the
basic function of the whole ordering algorithm. Parallelizing the scheme globally is
quite a difficult task because of the tight coupling of the ordering scheme and the
multivertex elimination in a single step. So we choose the most frequently executed
function and try to parallelize it.
For large sparse matrices, the Shrink Domain Decomposition step, which is a part
of the separator function, takes a lot of time. This thesis tries to parallelize this
function so that the performance increases and a better result is obtained. The
function mainly deals with the multisector properties. First it eliminates all multisectors
that are adjacent to only one domain; then it merges all indistinguishable multisectors
according to their checksums, which are calculated from the degree of the multisector
vertex under different score types such as QMRDV (maximal relative decrease of variables
in quotient graph), QMD (minimum degree in quotient graph) or QRAND (randomly
generated degree).
This shrinking of the domain decomposition continues until the number of domains
in the remaining graph falls below the provided minimum, or the number of edges falls
below the number of vertices of the remaining graph. The function works like the
coarsening phase of the multilevel method described above, which is used in many
sparse matrix orderings. So parallelizing this shrinking method increases the
performance of the symbolic factorization of the matrix.
This gives us some possibility of parallelization. To parallelize the shrinking method
of the defined approach, we first tried MPI (Message Passing Interface) [13, 30, 36, 38, 40],
which is used by much software for parallel sparse matrix computation. But MPI only
works by passing messages between processors; it has no notion of global shared memory.
The shrinking process inspects each multisector and changes its representative according
to its vertex type, checksum and indistinguishability. Whenever a representative or
vertex type is changed by one process, the other processes have to learn about it, and
gathering all this information correctly takes too much time. This degrades the
performance of the ordering step so much that the sequential version is far better than
the parallel one.
To improve the performance of the parallel version, we need a global shared memory
space for some of the structures. For this, the Global Arrays toolkit (GA) [24, 25] has
been used, which provides a portable
Non-Uniform Memory Access (NUMA) shared memory programming environment in
the context of distributed array data structures (called ‘global arrays’). From the user
perspective, a global array can be used as if it were stored in shared memory. All details
of the data distribution, addressing, and data access are encapsulated in the global
array objects. Information about the actual data distribution and locality can be easily
obtained and taken advantage of whenever data locality is important. The primary target
architectures for which GA was developed are massively-parallel distributed-memory and
scalable shared-memory systems.
Global array divides logically shared data structures into ‘local’ and ‘remote’ por-
tions. It recognizes variable data transfer costs required to access the data depending
on the proximity attributes. A local portion of the shared memory is assumed to be
faster to access and the remainder (remote portion) is considered slower to access. In
addition, any process can access a local portion of the shared data directly/in-place,
like any other data in process local memory. Access to other portions of the shared
data must be done through the GA library calls. The Global Arrays library supports
two programming styles: task-parallel and data-parallel. The GA task-parallel model
of computations is based on the explicit remote memory copy: The remote portion of
shared data has to be copied into the local memory area of a process before it can be
used in computations by that process. Of course, the ‘local’ portion of shared data can
always be accessed directly, thus avoiding the memory copy. The data distribution and
locality control are provided to the programmer.
In this implementation, many functions have been adapted with the help of global
arrays to improve the run-time of the given ordering scheme. A sequential algorithm
for merging the indistinguishable multisectors is given below.
/* ------------- merge indistinguishable multisecs ------------ */
for (k = 0; k < nlist; k++)
 { u = msvtxlist[k];
   if (vtype[u] == 2)
    { chk = checksum[u];
      v = bin[chk]; bin[chk] = -1;      /* examine all multisecs in bin[chk] */
      while (v != -1)
       { istart = xadj[v]; istop = xadj[v+1];
         for (i = istart; i < istop; i++)
           tmp[rep[adjncy[i]]] = flag;
         ulast = v; u = next[v];        /* v is principal and u is a potential */
         while (u != -1)                /* nonprincipal variable               */
          { keepon = TRUE;
            if (key[u] != key[v])
              keepon = FALSE;
            if (keepon)
             { istart = xadj[u]; istop = xadj[u+1];
               for (i = istart; i < istop; i++)
                 if (tmp[rep[adjncy[i]]] != flag)
                  { keepon = FALSE; break; }
             }
            if (keepon)                 /* found it! mark u as nonprincipal */
             { rep[u] = v; vtype[u] = 4;
               u = next[u]; next[ulast] = u;        /* remove u from bin */
             }
            else                        /* failed */
             { ulast = u; u = next[u]; }
          }
         v = next[v];                   /* no more variables can be absorbed by v */
         flag++;                        /* clear tmp vector for next round */
       }
    }
 }
Now we simply insert GA function calls after dividing the vertices among the
processors with the help of their checksums.
/* ----- merge indistinguishable multisecs with global array functionality ----- */
for (k = 0; k < nlist; k++)
 { u = msvtxlist[k];
   if (vt[u] == 2)
    { chk = checksum[u];
      if ((chk >= lo) && (chk < hi+1))       /* this processor owns chk */
       { v = bin[chk]; bin[chk] = -1;        /* examine all multisecs in bin[chk] */
         while (v != -1)
          { istart = xadj[v]; istop = xadj[v+1];
            for (i = istart; i < istop; i++)
              tmp[rp[adjncy[i]]] = flag;
            ulast = v; u = next[v];          /* v is principal and u is a potential */
            while (u != -1)
             { keepon = TRUE;
               if (key[u] != key[v])
                 keepon = FALSE;
               if (keepon)
                { istart = xadj[u]; istop = xadj[u+1];
                  for (i = istart; i < istop; i++)
                    if (tmp[rp[adjncy[i]]] != flag)
                     { keepon = FALSE; break; }
                }
               if (keepon)                   /* found it! mark u as nonprincipal */
                { NGA_Put(rep, &u, &u, &v, &nnew);     /* publish new representative */
                  NGA_Put(vtype, &u, &u, &newv, &nnew);
                  rp[u] = v; vt[u] = 4;
                  u = next[u]; next[ulast] = u;        /* remove u from bin */
                }
               else                          /* failed */
                { ulast = u; u = next[u]; }
             }
            v = next[v];                     /* no more variables can be absorbed by v */
            flag++;                          /* clear tmp vector for next round */
          }
       }
    }
 }
Here, in the above function, ‘lo’ and ‘hi’ are the lower and upper indexes of the
checksum values for the given processor of the processor group. We have defined two
global arrays, ‘vtype’ and ‘rep’, which give us the flexibility to restrict the computation
according to the checksum values. Every time a processor enters this piece of code, it
fetches the values of these two global arrays and stores them into the two local arrays
‘vt’ and ‘rp’.
So whenever a processor changes some value in a global array, the value of that
array changes immediately, independent of the other processors. Some data structures
have also been changed globally to obtain better results. (Changing a global data
structure here means that we have incorporated additional data arrays into that structure.)
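The thesis does not give the code that produces ‘lo’ and ‘hi’; one natural choice, sketched here as a hypothetical helper, is a block distribution of the checksum range over the processors:

```c
/* Hypothetical sketch: block distribution of the checksum range
 * [0, nchk) over nproc processors.  Processor me handles all checksums
 * chk with chk_lo(..) <= chk < chk_hi(..) + 1, matching the inclusive
 * test (chk >= lo) && (chk < hi+1) in the listing above. */
static int chk_lo(int nchk, int nproc, int me)
{
    int base = nchk / nproc, rest = nchk % nproc;  /* spread the remainder */
    return me * base + (me < rest ? me : rest);
}

static int chk_hi(int nchk, int nproc, int me)
{
    return chk_lo(nchk, nproc, me + 1) - 1;        /* inclusive upper index */
}
```

For 10 checksum values on 4 processors this gives the ranges [0,2], [3,5], [6,7] and [8,9], so every checksum is owned by exactly one processor.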
3.4 Results and Discussion
Results of the proposed parallel implementation of the algorithms are given below. The
proposed approach looks at the methods for computing elimination trees (in short, at
the symbolic factorization phase) [19, 20, 27], at the runtime of the symbolic
factorization phases, and at the effect of different choices of real symmetric sample
matrices from the Harwell-Boeing matrix collection [23].
A complete analysis of the proposed parallel multilevel algorithm has to account
for the communication overhead in each shrinking step and the idling overhead that
results from waiting for the different processors to deliver their results. This performance
evaluation has been carried out on 8 or fewer processors.
This thesis also observes that, when more processors are used with a smaller load
per processor, the runtime decreases as the number of processors increases, as can be
seen in Figure 3.1. From this change in runtime it can be inferred that the runtime can
be further improved by increasing the number of processors.
It has been seen that using more processors gives a noticeable reduction in runtime.
From this parallel implementation it can also be inferred that, for some methods, the
random order of computation across the processors sometimes yields fewer fronts in the
resulting elimination tree.
Figure 3.1: Changes in runtime vs. number of processors
This provides us with a better ordering at the end of the parallel implementation. For
one matrix (bcsstk17.rsa) the number of fronts is shown in Figure 3.2.
Figure 3.2 shows that, as the number of processors increases, fewer fronts are
obtained at the end of the ordering step. If the number of processors is increased
further, we can expect a better result with lower runtime in the ordering phase. With
the structures defined globally, synchronization of the global data across the whole
parallel implementation is an important bottleneck. This synchronization takes
relatively less time when each processor has a large amount of data to work on, which
is only the case for large sparse matrices. In other words, our parallel implementation
performs well in terms of runtime when the sparse matrix is large.
So this part of the thesis shows that, with parallel computing, the proposed
implementation obtains fewer fronts and lower runtime when a large number of
processors is used. The scheme also benefits the computation steps that follow matrix
ordering, because of the better ordering. The further reduction in the number of fronts
is discussed in the next chapter of the thesis.
Figure 3.2: Number of fronts vs. number of processors
If we obtain a better ordering in the ordering phase, then less time has to be invested
in solving the matrix. This simply reduces the total runtime of the linear programming.
Chapter 4
Node Selection Strategies for the
Construction of Vertex Separators
PORD uses a node selection strategy in the construction of quotient graphs, which is
the most time-consuming part of the algorithm. Generally, PORD uses the AMMF
(approximate minimum mean fill-in) strategy to select the node to eliminate.
Basically, the minimum local fill or minimum deficiency heuristic uses the exact
amount of fill, rather than an upper bound on it, to select a node for elimination [11, 21].
This approach is generally thought to provide limited quality advantages over minimum
degree while requiring significantly higher runtime. The minimum local fill heuristic
has received less attention in the literature than minimum degree, primarily because its
runtime is prohibitive. To compute the fill that would result from the elimination of a
node k, it has to be determined which pairs of nodes in Adj(k) are already adjacent,
and this is much more expensive than simply computing |Adj(k)|. To compound the
problem, while the elimination of a node k can only affect the degrees of nodes in Adj(k),
it can affect fill counts for both nodes in Adj(k) and their neighbors. While many of
the enhancements described for minimum degree are applicable to minimum local fill
(particularly supernodes), run-times are still prohibitive. But in recent years, some
algorithms for minimum local fill have been developed which produce good orderings
with a limited amount of run-time.
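The cost difference is easy to see in code: computing |Fill(k)| requires an adjacency test for every pair of neighbours of k, whereas the degree is just a subtraction of row pointers. The brute-force sketch below is illustrative only, not an actual minimum local fill implementation.

```c
/* Edge test in compressed adjacency format (linear scan of the row). */
static int is_edge(const int *xadj, const int *adjncy, int u, int v)
{
    int i;
    for (i = xadj[u]; i < xadj[u+1]; i++)
        if (adjncy[i] == v) return 1;
    return 0;
}

/* |Fill(k)|: the number of pairs u, v in Adj(k) that are not yet
 * adjacent, i.e. the edges created when k is eliminated. */
static int fill_count(const int *xadj, const int *adjncy, int k)
{
    int i, j, fill = 0;
    for (i = xadj[k]; i < xadj[k+1]; i++)
        for (j = i + 1; j < xadj[k+1]; j++)
            if (!is_edge(xadj, adjncy, adjncy[i], adjncy[j]))
                fill++;
    return fill;
}
```

Eliminating the centre of a star with three leaves creates three fill edges (the leaves become a clique), while eliminating a leaf creates none; this is the quadratic pairwise work the text refers to.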
4.1 Node selection strategy of PORD
In the construction of the quotient graph sequence, Gi+1 is obtained from a quotient
graph Gi by eliminating a set of independent variables U ⊂ Vi. This coarsening scheme
leads to the following interesting questions: What node selection strategy should be
used to find U, and how does the strategy influence the construction of the separators?
Each separator Si = {Vp1, ..., Vpt} of Gi induces a separator S = Vp1 ∪ ... ∪ Vpt of
G. Thus, S is composed of variables that belong to the boundaries of certain elements.
Since our primary goal is to find a small (i.e. lightly weighted) separator S, the elements
of a quotient graph should be merged so that the number and the weight of the variables
that are adjacent to the newly formed elements are minimized. This corresponds exactly
to the strategy of the minimum degree algorithm. Therefore, a suitable node selection
strategy can be as follows: For each variable V ∈ Vi compute its degree

deg(V) = ∑_{U ∈ M_V} weight(U)        (4.1)

with M_V = {V′ ∈ Vi − {V} : ∃ D ∈ adjGi(V) with V′ ∈ adjGi(D)}. Sort the variables
according to their degrees in ascending order and fill the independent set U starting
with the first one in that order.
In order to accelerate the degree computations, an approximation of equation (4.1)
is used: for each element D ∈ Di set

deg(D) = ∑_{V′ ∈ adjGi(D)} weight(V′)        (4.2)

A score function is then defined for the approximate minimum degree approach for
the vertices in the quotient graph:

scoreQAMD(V) = ∑_{D ∈ adjGi(V)} (deg(D) − weight(V))        (4.3)
This score function is called approximate-minimum-degree-in-quotient-graph (QAMD).
A more direct way to produce elements with light boundaries is to eliminate an
independent set of heavily weighted variables. Unfortunately, this node selection strategy
can lead to strong growth of only a few elements. Typically, a heavily weighted variable
V represents a large boundary segment shared by two large elements/domains. When
removing V, the two elements are merged together with V to form an even larger ele-
ment/domain. This unbalanced growth of elements cripples our optimization algorithm.
Therefore, we penalize the growth of large elements by relating the weight of V to the
weight of the newly formed element. This motivates a node selection strategy based on
scoreQMRDV(V) = (1 / weight(V)) · ∑_{D ∈ adjGi(V)} weight(D)        (4.4)
This score function is called maximal-relative-decrease-of-variables-in-quotient-graph
(QMRDV). It has been demonstrated that the QMRDV strategy is very effective in
absorbing the vertices of a graph. A random elimination strategy also enables a fast
coarsening of G.
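Equation (4.4) can be sketched directly in code; the data layout here is assumed for illustration (the variable's adjacent elements are passed as indices into an element-weight array):

```c
/* score_QMRDV for a variable V in the quotient graph (Eq. 4.4): the
 * summed weight of the elements adjacent to V, divided by the weight
 * of V.  elems[] lists the nelems adjacent elements by index into
 * elem_weight[]. */
static double score_qmrdv(double var_weight, const int *elems, int nelems,
                          const double *elem_weight)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < nelems; i++)
        sum += elem_weight[elems[i]];
    return sum / var_weight;
}
```

A heavy variable shared by light elements gets a low score, which is exactly the penalty on unbalanced element growth described above.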
4.2 Proposed approach for the node selection
This section describes several modifications to the minimum local fill algorithm that
improve the quality of the computed orderings. Furthermore, an intuitive explanation
of their effectiveness is given, together with an attempt at a more formal explanation.
To easily describe these new heuristics, ordering algorithms introduce a function
score(K) that captures the cost of eliminating an uneliminated supernode K. The
ordering algorithm always chooses a node with minimum score to eliminate next. In
the case of minimum degree, score(K) = |Adj(K)|; for minimum local fill, score(K) =
|Fill(K)|, where Fill(K) is the set of edges that would be added if K is eliminated.
In the approximate minimum mean fill-in (AMMF) algorithm, the score function
differs from that of standard minimum fill-in. Eliminating a supernode K corresponds
to |K| single-node eliminations, so the average fill associated with each elimination is

score(K) = |Fill(K)| / |K|        (4.5)
Since this node selection strategy is widely used in sparse matrix ordering algo-
rithms, two enhancements of it are presented in this thesis.
This thesis thus considers modifications to the minimum fill-in node selection strat-
egy. Our goal is to introduce some of the flavor of minimum local fill without also
introducing the prohibitive cost.
4.2.1 First Modification
The first modification is motivated by the observation that eliminating a supernode K
corresponds to |K| single-node eliminations. But instead of the average fill per
elimination, the score is used in a different manner, with the score function still
depending on the size of the eliminated supernode K. In the first function the following
modification has been made:

score(K) = |Fill(K)| / log2(|K| + 1)        (4.6)
4.2.2 Second Modification
The second modification is motivated by the observation that a growing number of
nonzero entries defeats the previous modification. So for large inputs this thesis
introduces one more modification of AMMF. In the previous and other modifications
of MMF, the score function is divided by a slowly increasing function of |K|. The
second modified function is aimed at inputs with a large number of non-zero entries
and is defined below:

score(K) = |Fill(K)| / exp(|K|)        (4.7)
These two modifications simply decrease the number of fronts of the elimination
tree obtained after the ordering step. This thesis demonstrates that, as the number of
non-zero entries in the input matrices increases, a better ordering is obtained if the
score function of MMF is divided by a faster-growing function of |K|.
4.3 Result and Discussion
To evaluate the effectiveness of these scoring functions, this thesis looks at ordering
quality over a set of more than 40 sparse symmetric matrices from the Harwell-Boeing
sparse matrix test set [23]. To reduce the effect of tie-breaking strategies, all nonzero
and operation counts are obtained by ordering each matrix several times (randomly
permuting the rows and columns before each ordering) and taking the median. For
the minimum fill variants, our algorithm takes the median over three permutations,
while for the less costly approximate minimum fill variants, it takes the median over
eleven permutations.
The first modification of the node selection strategy is evaluated in Table 4.1. This
table gives the nonzero entries in the input matrix A and the number of fronts in the
lower triangle of L required to perform the further factorization, after applying the first
modification of the node-selection strategies.
Table 4.1: nfronts after first modification
Matrix      NZ in A    nfronts in AMMF    nfronts in first modification
bcsstk13      42945        612                604
bcsstk15      60882       1303               1251
bcsstk16     147631        728                690
bcsstk17     219812       2595               2567
bcsstk21      15100       2661               2643
bcsstk24      81736        425                415
bcsstk25     133840       7820               7801
bcsstk33        300       1270               1197
bcsstm27      28675        140                135
dwt 992        8868        284                280
bcsstk08       7017        830                828
bcsstk09       9760        459                457
bcsstk11      17857        408                405
This modification gives better results in the number of fronts, which is a striking
factor for the subsequent ordering. But for some matrices with more than 100000
non-zero entries, this work has found that the second modification gives far better
results than the first. The second modification is shown in Table 4.2, which lists the
nonzero entries in the input matrix A and the number of fronts in the lower triangle
of L required to perform the further factorization after applying the second
modification of the node-selection strategy.
These tables demonstrate that increasing the multiplying factor yields a better
ordering scheme; the multiplying factor should, however, decrease as K increases.
Recall that maintaining exact fill information requires updating the scores of the
neighbors of the eliminated nodes as well. Since the approximate fill variants only
update the scores for eliminated nodes, computing exact fill on these nodes gives an
upper bound on the improvement that can be obtained by refining our approximations.

Table 4.2: nfronts after second modification

Matrix      NZ in A    nfronts in AMMF    nfronts in first mod.    nfronts in second mod.
bcsstk18      80519        7843                 7845                    7681
bcsstk19       3835         489                  489                     482
bcsstk23      24156        1579                 1588                    1530
bcsstk25     133840        7820                 7801                    7488
bcsstk29     316740        3139                 3171                    2980
eris1176       9864         903                  903                     883
bcspwr03        297         115                  115                     110
bcspwr04        943         246                  246                     240
bcspwr05       1033         433                  433                     425
bcspwr06       3377        1431                 1431                    1412
bcspwr07       3718        1584                 1584                    1566
bcspwr08       3837        1594                 1594                    1577
bcspwr09       4117        1686                 1686                    1671
bcspwr10      13571        4979                 4979                    4939
This thesis believes that the main problem with minimum fill-in is that the cliques
created during elimination often have non-smooth boundaries; it has been shown that
non-smooth boundaries can lead to asymptotically suboptimal orderings. Note that
AMMF generates significantly smaller cliques than AMD for a given number of interior
nodes, which means that its cliques have smoother boundaries.
So, the alternative scoring functions produce smoother clique boundaries because
of the way they form large cliques. Recall that the approximate fill scoring functions
are more willing to select nodes that already belong to large cliques; as a result,
these variants tend to grow large cliques into larger ones. In contrast, AMD forms
large cliques by merging smaller ones. The AMMF approach exhibits significant local
growth in clique sizes, whereas clique sizes in AMD grow more smoothly.
We find that our modified strategy generally grows a clique further than AMF does.
This is understandable, since growing a clique often creates supernodes within the
current clique; because these supernodes have reduced scores in our strategy, the
clique continues to grow. Apparently, growing larger cliques than those grown by AMF
is beneficial. We also experimented with scoring functions that encouraged cliques
to continue growing beyond the point where they would stop under our strategies.
Looking at clique growth patterns for the exact fill variants, we have found that
they actually grow cliques less than the approximate fill variants do. Clearly, they
use a different mechanism to compute good orderings. We believe one important
property that the exact fill scores capture is clique alignment.
To summarize, this thesis conjectures that AMF is more effective than AMD because
the process of growing cliques creates smoother clique boundaries than the process of
merging smaller cliques. AMMF is even more effective because it allows the clique-
growing process to continue longer. MF is more effective still because cliques must
eventually be merged, and exact fill scores capture some notion of clique alignment,
which leads to smoother clique boundaries. This reasoning lends further support to
our modification of the node-selection strategy.
Chapter 5
Software, Tools and Configuration
5.1 Some Information About the Software Used
For the implementation, several libraries first had to be built to configure the
linear solver MUMPS. This section provides some information to give a better feel for
these libraries.
5.1.1 BLAS
The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard build-
ing blocks for performing basic vector and matrix operations. The Level 1 BLAS perform
scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector
operations, and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS
are efficient, portable, and widely available, they are commonly used in the
development of high quality linear algebra software; LAPACK is an example.
5.1.2 BLACS
The BLACS (Basic Linear Algebra Communication Subprograms) project is an effort to
create a linear algebra oriented message passing interface that can be implemented
efficiently and uniformly across a large range of distributed memory platforms. The
length of time required to implement efficient distributed memory algorithms makes it
impractical to rewrite programs for every new parallel machine. The BLACS exist to
make linear algebra applications both easier to program and more portable.
5.1.3 ScaLAPACK
The ScaLAPACK (or Scalable LAPACK) library includes a subset of LAPACK routines
redesigned for distributed memory MIMD parallel computers. It is currently written in
a Single-Program-Multiple-Data style using explicit message passing for interprocessor
communication. It assumes matrices are laid out in a two-dimensional block cyclic
decomposition. ScaLAPACK is designed for heterogeneous computing and is portable
to any computer that supports MPI or PVM. Like LAPACK, the ScaLAPACK routines
are based on block-partitioned algorithms in order to minimize the frequency of data
movement between different levels of the memory hierarchy. For such machines, the
memory hierarchy includes the off-processor memory of other processors, in addition to
the hierarchy of registers, cache, and local memory on each processor. The fundamental
building blocks of the ScaLAPACK library are distributed memory versions (PBLAS)
of the Level 1, 2 and 3 BLAS, and a set of BLACS for communication tasks that arise
frequently in parallel linear algebra computations. In the ScaLAPACK routines, all
interprocessor communication occurs within the PBLAS and the BLACS. One of the
design goals of ScaLAPACK is to have the ScaLAPACK routines resemble their LAPACK
equivalents as much as possible.
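The two-dimensional block cyclic decomposition mentioned above boils down to simple index arithmetic in each dimension. The sketch below shows the standard mapping from a global index to its owning process coordinate and local index (the same arithmetic as ScaLAPACK's indexing utilities such as `INDXG2P`/`INDXG2L`, assuming a zero source offset); it is illustrative, not taken from the thesis code.

```python
def owner(g, nb, p):
    """Process coordinate (in one dimension) that owns global index g,
    for block size nb over p processes (all 0-based)."""
    return (g // nb) % p

def local_index(g, nb, p):
    """Index of global entry g within its owning process's local storage."""
    return (g // (nb * p)) * nb + g % nb
```

Applying the mapping independently to row and column indices gives the full 2D block-cyclic layout over a process grid.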
5.1.4 MPICH
MPICH is a freely available implementation of the MPI standard that runs on a wide
variety of systems. The MPICH implementation provides tools that simplify creating
MPI executables. Because MPICH programs may require special libraries and compile
options, the commands that MPICH provides for compiling and linking programs have to
be used. When MPICH is configured, the installation process normally looks for a
Fortran 90 compiler and, if it finds one, builds two different versions of the MPI
module: one includes only the MPI routines that do not take 'choice' arguments, while
the other includes all MPI routines.
The relevant information about MPI and the Global Array Toolkit is given in the
Appendices of this thesis.
5.1.5 MUMPS
The solution of large sparse linear systems lies at the heart of most calculations in
computational science and engineering and is of increasing importance in computations
in the financial and business sectors. Today, systems of equations with more than one
million unknowns need to be solved. To solve such large systems in a reasonable time
requires the use of powerful parallel computers. To date, only limited software for such
systems has been generally available. The MUMPS software addresses this issue.
The original MUMPS package was only designed for real matrices but, in the new
version, complex symmetric and complex unsymmetric systems are permitted. If there
is sufficient demand, a version for complex Hermitian systems might be developed in
the future. The MUMPS software is written in Fortran 90. It requires MPI for mes-
sage passing and makes use of BLAS, LAPACK, BLACS, and ScaLAPACK subroutines.
However, in recognition that some users prefer the C programming environment, a C
interface has been developed for the new release, and a version has been written that
avoids the use of MPI, BLACS, and ScaLAPACK. This would be suitable for running in
a single processor environment, perhaps for testing and development purposes.
5.2 How to use the system
We have used 8 distributed parallel processors to run our code. First, a Fortran
compiler was installed on all the machines. Then MPICH was built with the Fortran
compiler and the gcc compiler. We use the ssh protocol for parallel communication
with MPI, and we verified that the sample MPI programs work on those machines.

Next, the following libraries were built on top of the MPI interface; they are in
turn used to build MUMPS.

• BLAS is built with the Fortran compiler.

• BLACS is built with the MPICH installed on every machine.

• ScaLAPACK is built with MPICH and the BLAS and BLACS libraries.

After this, MUMPS (which is used to solve the system of linear equations) was
installed with the help of the three libraries above, the Fortran compiler and MPICH.
Figure 5.1: Dependencies of the software packages
In the multifrontal method, the last step of the factorization phase consists of
the factorization of a dense matrix. MUMPS uses ScaLAPACK for this final node.
Unfortunately, ScaLAPACK does not offer the possibility of computing the inertia of a
dense matrix (and in fact it does not offer the possibility of performing an LDLT
factorization of a dense matrix either, so ScaLAPACK LU is used on that final node).
We need to decide which system (sequential or parallel) we want to use at
configuration time, because the sequential and parallel libraries cannot coexist in
the same application: they have the same interface. We must therefore decide at
compilation time which library to install. If the parallel version is installed after
the sequential one (or vice versa), be sure to run 'make clean' in between. If we plan
to run MUMPS sequentially from a parallel MPI application, we need to install the
parallel version of MUMPS and pass a communicator containing a single processor to the
MUMPS library. The reason for this behavior is that the sequential MUMPS uses a
special library, libmpiseq.a, instead of the true MPI, BLACS and ScaLAPACK. As this
library implements all the symbols needed by MUMPS in a sequential environment, it
cannot coexist with MPI/BLACS/ScaLAPACK.
Chapter 6
Conclusion and Scope for Future
Work
6.1 Conclusion
The ordering phase in solving systems of linear equations has become increasingly
interesting over the last three decades, and many algorithms have been proposed to
perform the ordering step well in a reasonable amount of time. PORD has become popular
because of its hybrid scheme and tighter coupling, so we have tried to improve its
ordering with some modifications.
In the first part of the present work, a new parallel two-way scheme for the
construction of vertex separators has been presented. The fundamental idea of our
algorithm is to generate, in parallel, a sequence of quotient graphs using a bottom-up
node selection strategy. Our computational experiments indicate that small vertex
separators can be obtained by this approach. Once all vertex separators have been
obtained, we use them as a skeleton for the computation of several bottom-up
orderings. The motivation is that the recursion in the nested dissection algorithm
offers many possibilities for merging well-aligned elements into new well-aligned
elements; we feel that the exploration of other merging strategies can lead to further
improvements. By performing the recursion in parallel, this thesis tries to obtain
better results.
In the second part of the present work, several simple modifications to the minimum
local fill ordering heuristic have been presented; these exploit readily available
information about node adjacencies to improve the fill bounds used to select a node
for elimination. Perhaps the most practical of these modifications, called AMMF,
reduces floating-point operation counts very significantly. This thesis improves the
ordering step by reducing the number of fronts through a modification of the AMMF node
selection strategy. This part of the thesis introduces two modifications to the
current implementation of PORD that give better results in terms of the number of
fronts. Our computational experiments indicate that, as the number of non-zero entries
in the input matrix for symbolic factorization increases, the function with the
higher-valued multiplying factor works better than the usual AMMF strategy.
Together, these two approaches give a better ordering with a parallel implementation
of the ordering scheme. We have parallelized some of the most frequently executed
functions of the scheme, which keeps the communication overhead low. We have defined
global arrays only for a few important variables because of their synchronization
costs, giving a partial parallelization of the ordering scheme. We tested this
implementation with different choices of global array variables and kept whichever
choice gave the better results. Many candidate modifications of the node selection
strategy were also tried in this implementation; of these, the two modifications that
improve the ordering most within a limited amount of time were selected.
The capabilities of the present methodologies are demonstrated using various ex-
amples in the respective chapters.
6.2 Scope for Future Work
Further research should focus on improving the speedup of the parallel elimination tree
computation, since this part of the algorithm limits the speedup of the ordering scheme.
We give some suggestions for further research in the parallel implementation of the
ordering step.
In this thesis, all processors have full knowledge of the elimination tree after com-
puting it in parallel. However, not all processors use all that information for performing
the symbolic factorization. An improvement of the proposed algorithm could be to dis-
tribute the elimination tree over the processors on a need-to-know basis. A reduction of
the communication volume could be obtained and performance could hence be improved.
Another improvement could come from improving the movement of the first nonzero
of each column in a row block. These are currently moved based on parent information,
which leads to suboptimal moves. Better moves would require more information on larger
ancestors and thus more communication, but an improvement may still be possible.
We could also use an interface other than MPI that has a built-in notion of globally
shared memory. However, this is only useful when the other computation is carried out
in that environment as well. Creating and destroying any global array must be done by
all the processors at the same time, which slows down the algorithm; an alternative
method should be devised to solve this problem.
One more improvement can be made in the node selection strategy. We saw that
increasing the multiplying factor gives better results on matrices with a larger
number of non-zero elements. With some further computation, we could predict at what
number of non-zero elements each strategy gives the better result, and identify the
dependencies underlying that prediction.
References
[1] A. George. Nested dissection of a regular finite element mesh. SIAM J. Numer.
Anal., 10(2):p345–363, 1973.
[2] A. George and J. W. H. Liu. An automatic nested dissection algorithm for irregular
finite element problems. SIAM J. Numer. Anal., 15(5):p1053–1069, 1978.
[3] A. George and J. W. H. Liu. The evolution of the minimum degree ordering algo-
rithm. SIAM Review, 31(1):p1–19, 1989.
[4] A. Gupta. Fast and effective algorithms for graph partitioning and sparse-matrix
ordering. IBM Journal of research and development, 41(1/2), 1997.
[5] Bruce Hendrickson and Ed Rothberg. Effective sparse matrix ordering just around
the bend. Eighth SIAM conf. Parallel processing for Scientific Computing.
[6] C. Ashcraft. Compressed graphs and the minimum degree algorithm. SIAM. J.
Matrix Anal. Appl., 16:pp1404–1411, 1995.
[7] C. Ashcraft and J. W. H. Liu. Robust ordering of sparse matrices using multisection.
SIAM. J. Matrix Anal. Appl., 19:p816–832, 1998.
[8] C. Ashcraft and J. W. H. Liu. A partition improvement algorithm for general-
ized nested dissection. Techn. Rep. BCSTECH-94-020, Boeing Computer Services,
Seattle, 1994.
[9] C. Ashcraft and J. W. H. Liu. Generalized nested dissection: Some recent progress.
Mini Symposium 5th SIAM Conference on Applied Linear Algebra, Snowbird, Utah,,
1994.
[10] C. Ashcraft and J. W. H. Liu. Using domain decomposition to find graph bisectors.
BIT, 37:p506–534, 1997.
[11] C. Meszaros. The inexact minimum local fill-in ordering algorithm. Techn. Report
WP 95 7, Computer and Automation Research Institute, Hungarian Academy of
Sciences, Budapest, 1995.
[12] D. J. Rose. A graph-theoretic study of the numerical solution of sparse positive
definite systems of linear equations. in Graph Theory and Computing, R. Read, ed.,
Academic Press, New York, pages pp183–217, 1972.
[13] D. Walker. The design of a standard message-passing interface for distributed mem-
ory concurrent computers. Parallel Computing,, 20(4):pp657–73, 1994.
[14] E. Rothberg. Robust ordering of sparse matrices: a minimum degree, nested dis-
section hybrid. Silicon Graphics manuscript, 1995.
[15] E. Rothberg and S. C. Eisenstat. Node selection strategies for bottom-up sparse
matrix ordering. SIAM J. Matrix Anal. Appl., 19(3):p682–695, 1998.
[16] Esmond G. Ng and Padma Raghavan. Performance of greedy ordering heuristics for
sparse cholesky factorization. Siam J. Matrix Anal. Appl., 20(4):pp.902–914, 1999.
[17] George Karypis and Vipin Kumar. A parallel algorithm for multilevel graph parti-
tioning and sparse matrix ordering. University of Minnesota, Department of Com-
puter Science/ Army HPC Research Center, Minneapolis,MN 55455, Technical Re-
port: 95-036, 1998.
[18] Gregoire Richard. Coupling mumps and ordering software. CERFACS report:
WN/PA/02/24, 2002.
[19] H. M. Markowitz. The elimination form of the inverse and its application to linear
programming. Management Sci., 3:pp255–269, 1957.
[20] Hans L. Bodlaender, John R. Gilbert, Hjalmtyr Hafsteinsson, and Ton kloks. Ap-
proximating treewidth, pathwidth and minimum elimination tree height. Technical
Report RUU-CS-91-1, 1991.
[21] I. A. Cavers. Using deficiency measure for tie-breaking the minimum degree algo-
rithm. Tech. report 89-2, Department of Computer Science, University of British
Columbia, Vancouver, B.C., 1989.
[22] I. S. Duff, A. M. Erisman, and J. K. Reid. Direct methods for sparse matrices.
Oxford University Press, Oxford., 1987.
[23] I. S. Duff, R. G. Grimes, and J. G. Lewis. Users guide for the harwell-boeing
sparse matrix collection. Technical Report TR/PA/92/86, Res. and Techn. Division,
Boeing Computer Services, Seattle, 1992.
[24] Jaroslaw Nieplocha, Robert J. Harrison, and Richard J. Littlefield. Global arrays:
A portable shared-memory programming model for distributed memory computers.
Pacific Northwest Laboratory, Richland WA 99352, 1994.
[25] Jarek Nieplocha, Jialin Ju, Manoj Kumar Krishnan, Bruce Palmer, and Vinod
Tipparaju. Global array toolkit. Pacific Northwest National Laboratory Technical
Report No. PNNL-13130, 2002.
[26] Jurgen Schulze. Towards a tighter coupling of bottom-up and top-down sparse
matrix ordering methods. BIT Numerical Mathematics, 41(4):p800–841, 2001.
[27] Jeroen van Grondelle. Symbolic sparse cholesky factorisation using elimination trees.
Master’s Thesis, Department of Mathematics, Utrecht University, 1999.
[28] J. W. H. Liu. Modification of the minimum-degree algorithm by multiple elimina-
tion. ACM Trans. Math. Software, 11(2):p141–153, 1985.
[29] N. I. M. Gould, Y. Hu, and J. A. Scott. A numerical evaluation of sparse di-
rect solvers for the solution of large sparse, symmetric linear systems of equations.
Council for the Central Laboratory of the Research Councils, 2005.
[30] Neil MacDonald, Elspeth Minty, Tim Harding, and Simon Brown. Writing message-
passing parallel programs with mpi. Edinburgh Parallel Computing Centre, The
University of Edinburgh.
[31] Mahesh Joshi, George Karypis, Vipin Kumar, Anshul Gupta, and Fred Gustavson.
PSPASES: Building a high performance scalable parallel direct solver for sparse
linear systems. 2003.
[32] Manpreet S. Khaira, Gary L. Miller, and Thomas J. Sheffler. Nested dissection: A
survey and comparison of various nested dissection algorithms. CMU-CS-92-106R,
School of Computer Science, Carnegie Mellon University, Pittsburgh,, 1992.
[33] Michael T. Heath and Padma Raghavan. A cartesian parallel nested dissection al-
gorithm. Department of Computer Science and National Center for Supercomputing
Applications, University of Illinois.
[34] P. R. Amestoy, T. A. Davis, and I. S. Duff. An approximate minimum degree
ordering algorithm. SIAM J. Matrix Anal. Appl., 17:p886–905, 1996.
[35] Patrick R. Amestoy, Abdou Guermouche, Jean-Yves L’Excellent, and Stephane
Pralet. Hybrid scheduling for the parallel solution of linear systems. Technical
Report TR/PA/04/140, 2004.
[36] P. R. Amestoy, I. S. Duff, J.-Y. L'Excellent, and J. Koster. Multifrontal
massively parallel solver, users guide. 2003.
[37] P. Berman and G. Schnitger. On the performance of the minimum degree algorithm
for gaussian elimination. SIAM J. Matrix Anal. Appl., 11:pp83–88, 1990.
[38] Peter S. Pacheco. A user's guide to MPI. Department of Mathematics, University
of San Francisco, 1998.
[39] Tzu-Yi Chen, John R. Gilbert, and Sivan Toledo. Toward an efficient column
minimum degree code for symmetric multiprocessors. 2003.
[40] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable parallel
programming with the Message-Passing Interface. MIT Press, 1994.
[41] W. F. Tinney and J. W. Walker. Direct solutions of sparse network equations by
optimally ordered triangular factorization. Proc. of the IEEE, 55:pp. 1801–1809,
1967.
Appendix A
Symbolic Factorization and
Elimination Tree
A.1 Cholesky Factorisation
Cholesky factorisation is a technique for solving linear systems Ax = b where A is a
positive definite symmetric matrix. Such computations appear frequently in, for
instance, the interior point method, an iterative alternative to the simplex method
used for linear programming, a widely used optimization technique.
Definition 1.1.1 (Cholesky Factorisation) Given a symmetric positive definite
matrix A, the Cholesky factor L is the lower triangular matrix that satisfies
L L^T = A        (A.1)
After factoring A, we can first solve L y = b and then L^T x = y. Because the upper
and lower triangular systems are easy to solve, Cholesky factorisation is a convenient
way of solving a symmetric linear system Ax = b.
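The two triangular solves can be written down directly. The sketch below is a minimal pure-Python illustration of the forward and back substitution steps (it assumes L is given as a dense list of rows; a real solver would of course use sparse storage).

```python
def cholesky_solve(L, b):
    """Solve A x = b given the Cholesky factor L (A = L L^T):
    forward substitution for L y = b, then back substitution for L^T x = y."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                      # L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):            # L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x
```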
A.2 Numerical Factorisation
We can calculate such a matrix L as follows. If we assume that the defining equation
(A.1) of the Cholesky factorisation holds, then
a_ij = Σ_{k=0}^{n−1} L_ik (L^T)_kj = Σ_{k=0}^{j} l_ik l_jk = Σ_{k=0}^{j−1} l_ik l_jk + l_ij l_jj        (A.2)

where 0 ≤ j ≤ i < n. For i = j this leads to

l_jj = ( a_jj − Σ_{k=0}^{j−1} l_jk² )^{1/2}        (A.3)

and for all 0 ≤ j < i < n to

l_ij = (1 / l_jj) ( a_ij − Σ_{k=0}^{j−1} l_ik l_jk ).        (A.4)
Algorithm 1.1 is a right-looking algorithm based on equations (A.3) and (A.4). It
is formulated in terms of dense matrices and is called right-looking because it adds a
column to all the columns it should be added to, which are all to its right. A left-
looking algorithm takes a column and adds to it all the columns that should be added
to it; these columns are all to its left.
Algorithm 1.1 A dense numerical Cholesky factorisation algorithm
Input: A = lower(A_0)
Output: A, A = L such that L L^T = A_0

for k := 0 to n−1 do
    a_kk := sqrt(a_kk)
    for i := k+1 to n−1 do
        a_ik := a_ik / a_kk
    for j := k+1 to n−1 do
        for i := j to n−1 do
            a_ij := a_ij − a_ik · a_jk
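For reference, Algorithm 1.1 can be transcribed directly into executable code. This is a dense toy version operating on a list of rows, not the sparse implementation used elsewhere in the thesis.

```python
def dense_cholesky(a):
    """Right-looking dense Cholesky factorisation (Algorithm 1.1): the lower
    triangle of `a` is overwritten with L such that L L^T equals the
    original matrix."""
    n = len(a)
    for k in range(n):
        a[k][k] **= 0.5                     # a_kk := sqrt(a_kk)
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]              # scale column k
        for j in range(k + 1, n):           # update the columns to the right
            for i in range(j, n):
                a[i][j] -= a[i][k] * a[j][k]
    return a
```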
A.3 Symbolic Factorisation
Throughout this thesis, we will be factoring large sparse matrices. Sparse matrices
have many zero coefficients; in general, at most twenty percent of the entries of a
sparse matrix are non-zero. Taking advantage of the sparsity of matrices allows us to
compute Cholesky factors using far fewer floating point operations (flops) than we
would need when factoring dense matrices of the same size. We can also store sparse
matrices using less memory than their dense counterparts would require.
When factoring these matrices, we will see that Cholesky factors are often much
denser than the original matrices. These new non-zeros, or fill-in, are for instance
generated by adding two columns with different nonzero positions. Because we use data
structures that only store the non-zeros, it is useful to know the structure of the
Cholesky factor before factoring it; then we can reserve space in our data structure
for the fill-in.
Symbolic factorisation determines the structure of the Cholesky factor. Because
we are not interested in the numerical values of the entries in the factor, this can be done
much faster than a full numerical factorisation.
A.4 Algorithms for symbolic factorisation
In the previous section we mentioned symbolic factorisation. In this section we deduce
a fast symbolic algorithm from the numerical algorithm in the previous section.
A.4.1 A graph representation of symbolic matrices
When dealing with symbolic factorisation, the algorithm can be formulated
conveniently in the language of graph theory.
Definition 1.4.1 An n × n matrix A induces the graph G_A = (V_A, E_A) where
V_A = {0, ..., n−1} and E_A = {(i, j) | 0 ≤ i, j < n ∧ a_ij ≠ 0}.
E_A is called the set of edges and V_A the set of vertices. Because A is assumed
symmetric, the graph need not be directed: if (i, j) ∈ E_A, then (j, i) ∈ E_A. And
because A is positive definite, all the diagonal elements of A are positive. As each
such entry would only make a vertex point to itself, these elements are generally
omitted. Therefore we rewrite the definition of G_A for symmetric positive definite
matrices A.
Definition 1.4.2 A symmetric positive definite n × n matrix A induces the graph
G_A = (V_A, E_A) where V_A = {0, ..., n−1} and
E_A = {(i, j) | 0 ≤ i, j < n ∧ i ≠ j ∧ a_ij ≠ 0}.
Figure A.1: Graph induced by the sparse matrix
A.4.2 A basic algorithm
Now we will transform algorithm 1.1 into a symbolic factorisation algorithm. To do this,
we simply remove all operations from the algorithm that do not introduce new non-zeros
or destroy existing non-zeros.
There are basically three operations in algorithm 1.1: a square root computation,
a division by a_kk and the actual column addition. The first two operations clearly
neither introduce new non-zeros nor destroy existing ones, so the only operation we
have to implement in symbolic factorisation is the column addition. Algorithm 1.2
implements this operation in graph notation.
Algorithm 1.2 A basic symbolic Cholesky factorisation algorithm
Input: G_A = (V_A, E_A)
Output: G = (V, E) where G = G_{L+L^T} with L L^T = A

for k := 0 to n−1 do
    for all j : k < j < n ∧ (j, k) ∈ E do
        for all i : j < i < n ∧ (i, k) ∈ E do
            E := E ∪ {(i, j)}
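Algorithm 1.2 is easy to make concrete. The sketch below works on the strict lower triangle of A represented as a set of (i, j) pairs with i > j, and returns the pattern of L (the original entries plus the fill edges).

```python
def symbolic_basic(edges, n):
    """Algorithm 1.2: basic symbolic Cholesky factorisation.  For each column
    k, every pair of nonzeros (j, k), (i, k) with i > j produces the fill
    edge (i, j)."""
    E = set(edges)
    for k in range(n):
        col_k = sorted(i for (i, c) in E if c == k)   # nonzeros in column k
        for idx, j in enumerate(col_k):               # (j, k) in E
            for i in col_k[idx + 1:]:                 # (i, k) in E, i > j
                E.add((i, j))                         # fill edge
    return E
```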
This algorithm has approximately the same complexity as the numerical factori-
sation, O(nc²), where c is the average number of non-zeros in each row. We can reduce
this runtime significantly, as we will show in the next section.
A.4.3 Fast symbolic Cholesky factorisation
In this section we will introduce a symbolic factorisation algorithm that is a factor c faster
than the basic symbolic Cholesky factorisation algorithm. We will follow the treatment
before but first we need a definition:
Definition 1.4.3 (Parent) Every column is said to have a parent column. The
parent of column k is defined as:

parent(k) = min {i : k < i < n ∧ l_ik ≠ 0} = min {i : (i, k) ∈ E}

Furthermore, parent(k) = ∞ if this minimum does not exist. This implies that
∀ i ∈ {0, ..., n−1} : i < parent(i).
The proof is trivial, since the parent of column i is defined as the row index of
the first non-zero below the diagonal in column i; this index is necessarily greater
than i. Applying this property repeatedly at each stage of the calculation reduces the
runtime considerably. Algorithm 1.3 implements this technique.
Algorithm 1.3 The fast symbolic Cholesky factorisation algorithm
Input: G_A = (V_A, E_A)
Output: G = (V, E) where G = G_{L+L^T} with L L^T = A

for k := 0 to n−1 do
    parent(k) := min {i : k < i < n ∧ (i, k) ∈ E}
    for all i : k < i < n ∧ (i, k) ∈ E do
        E := E ∪ {(i, parent(k))}
In each one of n steps, a column of c elements is added to its parent, therefore
this algorithm has a runtime complexity of O (nc). This is a factor c faster than the
basic algorithm from the previous subsection. From now on, when we refer to symbolic
Cholesky factorisation, we mean the factorisation by algorithm 1.3.
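The fast algorithm can be sketched in the same set representation as before: each column only propagates its pattern to its parent column, and the rest of the clique is rediscovered when the parent itself is processed.

```python
def symbolic_fast(edges, n):
    """Algorithm 1.3: fast symbolic Cholesky factorisation, O(nc).
    `edges` holds strict lower-triangle pairs (i, j) with i > j."""
    E = set(edges)
    parent = {}
    for k in range(n):
        col_k = sorted(i for (i, c) in E if c == k)
        if col_k:
            parent[k] = col_k[0]          # first nonzero below the diagonal
            for i in col_k[1:]:
                E.add((i, parent[k]))     # add column k to its parent column
    return E, parent
```

On the small example from the previous subsection this produces exactly the same factor pattern as the basic algorithm, while touching each column entry only once per elimination step.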
A.5 Elimination Tree
In the previous section, we saw that there exists a parent-child relation between columns.
At the end of the sequential factorisation algorithm, the parent of each column is known.
In this section, we will take a closer look at these relations and try to compute them in
advance.
Definition 1.4.4 (Elimination forest) The elimination forest associated with the
Cholesky factor L is the directed graph G = (V, E) with V = {0, ..., n−1} containing
all column numbers of L, and

E = {(i, j) ∈ V × V : i = parent(j)}
Figure A.2: Note that the forest happens to contain only one tree.
We will assume that the elimination forest contains only one tree and refer to it
as the elimination tree. This assumption is reasonable since, if necessary, a simple
preprocessing step divides the matrix into submatrices that each have a unique
elimination tree. Apart from the parent relation, the elimination tree captures a more
general relation between columns.
Appendix B
Elimination Graph and Quotient
Graph
B.1 Elimination graphs
The nonzero pattern of a symmetric n × n matrix A can be represented by a graph
G^0 = (V^0, E^0) with nodes V^0 = {1, ..., n} and edges E^0. An edge (i, j) is in E^0
if and only if a_ij ≠ 0 and i ≠ j. Since A is symmetric, G^0 is undirected.
The elimination graph G^k = (V^k, E^k) describes the nonzero pattern of the sub-
matrix still to be factorized after the first k pivots have been chosen and eliminated.
It is undirected, since the matrix remains symmetric as it is factorized. At step k,
the graph G^k depends on G^{k−1} and the selection of the kth pivot. To find G^k, the
kth pivot node p is selected from V^{k−1}. Edges are added to E^{k−1} to make the
nodes adjacent to p in G^{k−1} a clique (a fully connected subgraph). This addition of
edges (fill-in) means that we cannot know the storage requirements in advance. The
edges added correspond to fill-in caused by the kth step of factorization: a fill-in
is a nonzero entry L_ij where (PAP^T)_ij is zero. The pivot node p and its incident
edges are then removed from the graph G^{k−1} to yield the graph G^k. Let Adj_{G^k}(i)
denote the set of nodes adjacent to i in the graph G^k. When the kth pivot is
eliminated, the graph G^k is given by
V^k = V^{k−1} \ {p}

and

E^k = ( E^{k−1} ∪ (Adj_{G^{k−1}}(p) × Adj_{G^{k−1}}(p)) ) ∩ (V^k × V^k)
The minimum degree algorithm selects node p as the kth pivot such that the degree
of p, t_p ≡ |Adj_{G^{k−1}}(p)|, is minimum (where |...| denotes the size of a set or
the number of nonzeros in a matrix, depending on the context). The minimum degree
algorithm is a non-optimal greedy heuristic for reducing the number of new edges
(fill-ins) introduced during the factorization; we have already noted that finding an
optimal ordering is NP-complete. By minimizing the degree, the algorithm minimizes the
upper bound on the fill-in caused by the kth pivot: selecting p as pivot creates at
most (t_p² − t_p)/2 new edges in G.
B.2 Quotient graphs
In contrast to the elimination graph, the quotient graph models the factorization of $A$
using an amount of storage that never exceeds the storage for the original graph $G^0$.
The quotient graph is also referred to as the generalized element model. An important
component of a quotient graph is the clique, which is a particularly economical structure
since a clique is represented by a list of its members rather than by a list of all the edges
in the clique. Following the generalized element model, we refer to nodes removed from
the elimination graph as elements (George and Liu refer to them as eliminated nodes).
We use the term variable to refer to uneliminated nodes.
The quotient graph, $\mathcal{G}^k = (V^k, \bar{V}^k, E^k, \bar{E}^k)$, implicitly represents the elimination
graph $G^k$, where $\mathcal{G}^0 = G^0$, $V^0 = V$, $\bar{V}^0 = \emptyset$, $E^0 = E$ and $\bar{E}^0 = \emptyset$. For clarity, we
drop the superscript $k$ in the following. The nodes in $\mathcal{G}$ consist of variables (the set $V$)
and elements (the set $\bar{V}$). The edges are divided into two sets: edges between variables,
$E \subseteq V \times V$, and edges between variables and elements, $\bar{E} \subseteq V \times \bar{V}$. Edges between elements
are not required, since we could generate the elimination graph from the quotient graph
without them. The sets $\bar{V}^0$ and $\bar{E}^0$ are empty.
We use the following set notation ($\mathcal{A}$, $\mathcal{E}$ and $\mathcal{L}$) to describe the quotient graph
model and our approximate degree bounds. Let $\mathcal{A}_i$ be the set of variables adjacent to
variable $i$ in $\mathcal{G}$, and let $\mathcal{E}_i$ be the set of elements adjacent to variable $i$ in $\mathcal{G}$ (we refer to
$\mathcal{E}_i$ as element list $i$). That is, if $i$ is a variable in $V$, then
$$\mathcal{A}_i \equiv \{j : (i, j) \in E\} \subseteq V,$$
$$\mathcal{E}_i \equiv \{e : (i, e) \in \bar{E}\} \subseteq \bar{V},$$
and
$$\mathrm{Adj}_{\mathcal{G}}(i) \equiv \mathcal{A}_i \cup \mathcal{E}_i \subseteq V \cup \bar{V}.$$
The set $\mathcal{A}_i$ refers to a subset of the nonzero entries in row $i$ of the original matrix
$A$ (thus the notation $\mathcal{A}$). That is, $\mathcal{A}_i^0 \equiv \{j : a_{ij} \neq 0\}$, and $\mathcal{A}_i^k \subseteq \mathcal{A}_i^{k-1}$ for $1 \leq k \leq n$.
Let $\mathcal{L}_e$ denote the set of variables adjacent to element $e$ in $\mathcal{G}$. That is, if $e$ is an element
in $\bar{V}$, then we define
$$\mathcal{L}_e \equiv \mathrm{Adj}_{\mathcal{G}}(e) = \{i : (i, e) \in \bar{E}\} \subseteq V.$$
The edges $E$ and $\bar{E}$ in the quotient graph are represented using the sets $\mathcal{A}_i$ and $\mathcal{E}_i$
for each variable in $\mathcal{G}$, and the sets $\mathcal{L}_e$ for each element in $\mathcal{G}$. We will use $\mathcal{A}$, $\mathcal{E}$, and $\mathcal{L}$
to denote the three collections containing all $\mathcal{A}_i$, $\mathcal{E}_i$, and $\mathcal{L}_e$, respectively, for all variables $i$ and all
elements $e$. George and Liu show that the quotient graph takes no more storage than
the original graph ($|\mathcal{A}^k| + |\mathcal{E}^k| + |\mathcal{L}^k| \leq |\mathcal{A}^0|$ for all $k$).
The quotient graph $\mathcal{G}$ and the elimination graph $G$ are closely related. If $i$ is a
variable in $G$, it is also a variable in $\mathcal{G}$, and
$$\mathrm{Adj}_{G}(i) = \left(\mathcal{A}_i \cup \bigcup_{e \in \mathcal{E}_i} \mathcal{L}_e\right) \setminus \{i\},$$
where $\setminus$ is the standard set subtraction operator. When variable $p$ is selected as
the $k$th pivot, element $p$ is formed (variable $p$ is removed from $V$ and added to $\bar{V}$). The
set $\mathcal{L}_p = \mathrm{Adj}_{G}(p)$ is found using the above equation. The set $\mathcal{L}_p$ represents a permuted
nonzero pattern of the $k$th column of $L$ (thus the notation $\mathcal{L}$). If $i \in \mathcal{L}_p$, where $p$ is the
$k$th pivot, and variable $i$ will become the $m$th pivot (for some $m > k$), then the entry
$L_{mk}$ will be nonzero.
The above equation implies that $\mathcal{L}_e \setminus \{p\} \subseteq \mathcal{L}_p$ for all elements $e$ adjacent to
variable $p$. This means that all variables adjacent to an element $e \in \mathcal{E}_p$ are adjacent to
the element $p$, and these elements $e \in \mathcal{E}_p$ are no longer needed. They are absorbed into
the new element $p$ and deleted, and references to them are replaced by references to the new
element $p$. The new element $p$ is added to the element lists $\mathcal{E}_i$ for all variables $i$
adjacent to element $p$. Absorbed elements, $e \in \mathcal{E}_p$, are removed from all element lists.
The sets $\mathcal{A}_p$ and $\mathcal{E}_p$, and $\mathcal{L}_e$ for all $e \in \mathcal{E}_p$, are deleted. Finally, any entry $j$ in $\mathcal{A}_i$,
where both $i$ and $j$ are in $\mathcal{L}_p$, is redundant and is deleted. The set $\mathcal{A}_i$ is thus disjoint
from any set $\mathcal{L}_e$ for $e \in \mathcal{E}_i$. In other words, $\mathcal{A}_i^k$ is the pattern of those entries in row $i$ of
$A$ that are not modified by steps $1$ through $k$ of the Cholesky factorization of $PAP^T$.
The net result is that the new graph $\mathcal{G}$ takes the same, or less, storage than before the
$k$th pivot was selected.
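The adjacency equation relating the two graphs can be read off directly from the three set collections. The following is a small illustrative Python sketch (the names are assumptions, not thesis code): A[i] holds the variables adjacent to variable i, E[i] its element list, and L[e] the variables adjacent to element e.

```python
def quotient_adjacency(i, A, E, L):
    """Recover the elimination-graph adjacency of variable i from a
    quotient graph: Adj(i) = (A_i union all L_e for e in E_i) minus {i}."""
    adj = set(A[i])
    for e in E[i]:
        adj |= L[e]      # variables reachable through element e
    adj.discard(i)       # a variable is never adjacent to itself
    return adj
```

For example, if A[1] = {2}, E[1] = {10} and L[10] = {1, 3, 4}, then the adjacency of variable 1 is {2, 3, 4}.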
The following equations summarize how the sets $\mathcal{L}$, $\mathcal{E}$, and $\mathcal{A}$ change when pivot $p$
is chosen and eliminated. The new element $p$ is added, old elements are absorbed, and
redundant entries are deleted:
$$\mathcal{L}^k = \left(\mathcal{L}^{k-1} \setminus \bigcup_{e \in \mathcal{E}_p} \{\mathcal{L}_e\}\right) \cup \{\mathcal{L}_p\}$$
$$\mathcal{E}^k = \left(\mathcal{E}^{k-1} \setminus \bigcup_{e \in \mathcal{E}_p} \{e\}\right) \cup \{p\}$$
$$\mathcal{A}^k = \left(\mathcal{A}^{k-1} \setminus (\mathcal{L}_p \times \mathcal{L}_p)\right) \cap \left(V^k \times V^k\right)$$
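One elimination step on the quotient graph, including element absorption and pruning of redundant entries, can be sketched as follows (illustrative Python, not the thesis implementation; A, E and L are dictionaries as in the set notation above).

```python
def eliminate_quotient(p, A, E, L):
    """Eliminate pivot variable p in place. A[i]: variables adjacent to
    variable i; E[i]: element list of i; L[e]: variables adjacent to e."""
    # L_p = (A_p union all L_e for e in E_p) minus {p}
    Lp = set(A[p])
    for e in E[p]:
        Lp |= L[e]
    Lp.discard(p)
    absorbed = set(E[p])
    for e in absorbed:        # absorbed elements are deleted ...
        del L[e]
    L[p] = Lp                 # ... and replaced by the new element p
    for i in Lp:
        E[i] = (E[i] - absorbed) | {p}
        # Entries of A_i inside L_p (and p itself) are now redundant.
        A[i] = A[i] - Lp - {p}
    del A[p], E[p]
```

On the path graph 1–2–3 with empty element lists, eliminating variable 2 yields L[2] = {1, 3}, empties A[1] and A[3], and puts the new element 2 on both element lists; the total storage does not grow.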
Figure B.1: Elimination graph, quotient graph, and matrix for first three steps
Appendix C
Message Passing Interface
C.1 Message Passing Interface
The Message-Passing Interface or MPI is a library of functions and macros that can be
used in C, FORTRAN and C++ programs. As its name implies, MPI is intended for use
in programs that exploit the existence of multiple processors by message-passing. Mes-
sage passing is a programming paradigm used widely on parallel computers, especially
Scalable Parallel Computers (SPCs) with distributed memory and on Networks of Work-
stations (NOWs). Although there are many variations, the basic concept of processes
communicating through messages is well understood.
The major goal of MPI, as with most standards, is a degree of portability across
different machines. The expectation is for a degree of portability comparable to that
given by programming languages such as Fortran. This means that the same message-
passing source code can be executed on a variety of machines as long as the MPI library
is available, while some tuning might be needed to take best advantage of the features of
each system. Though message passing is often thought of in the context of distributed-
memory parallel computers, the same code can run well on a shared-memory parallel
computer. Knowing that efficient MPI implementations exist across a wide variety of com-
puters gives a high degree of flexibility in code development, debugging and in choosing
a platform for production runs.
The goal of the Message Passing Interface, simply stated, is to develop a widely used
standard for writing message-passing programs. As such the interface should establish a
practical, portable, efficient and flexible standard for message passing.
There are three main types of communication routines in MPI:
C.1.1 Point to Point Communication Routines
MPI point-to-point operations typically involve message passing between two, and only
two, different MPI tasks. One task is performing a send operation and the other task is
performing a matching receive operation. There are different types of send and receive
routines used for different purposes. For example:
• Synchronous send
• Blocking send / blocking receive
• Non-blocking send / non-blocking receive
• Buffered send
• Combined send/receive
• Ready send
Any type of send routine can be paired with any type of receive routine. MPI also
provides several routines associated with send-receive operations, such as those used to
wait for a message’s arrival or probe to find out if a message has arrived.
C.1.2 Collective Communication Routines
Collective communication must involve all processes in the scope of a communicator.
All processes are, by default, members of the communicator MPI_COMM_WORLD. It
is the programmer's responsibility to ensure that all processes within a communicator
participate in any collective operations. There are three types of collective operations
in the message passing interface:
• Synchronization - processes wait until all members of the group have reached the
synchronization point.
• Data Movement - broadcast, scatter/gather, all to all.
• Collective Computation (reductions) - one member of the group collects data from
the other members and performs an operation (min, max, add, multiply, etc.) on
that data.
Collective operations are blocking. Collective communication routines do not take
message tag arguments. Collective operations within subsets of processes are accom-
plished by first partitioning the subsets into new groups and then attaching the new
groups to new communicators.
C.1.3 Group and Communicator Management Routines
A group is an ordered set of processes. Each process in a group is associated with a
unique integer rank. Rank values start at zero and go to N-1, where N is the number
of processes in the group. In MPI, a group is represented within system memory as an
object. It is accessible to the programmer only by a handle. A group is always associated
with a communicator object.
A communicator encompasses a group of processes that may communicate with
each other. All MPI messages must specify a communicator. In the simplest
sense, the communicator is an extra tag that must be included with MPI calls. Like
groups, communicators are represented within system memory as objects and are acces-
sible to the programmer only by handles. For example, the handle for the communicator
that comprises all tasks is MPI_COMM_WORLD. From the programmer's perspective,
a group and a communicator are one. The group routines are primarily used to specify
which processes should be used to construct a communicator.
Appendix D
Global Array Toolkit
Portability, efficiency and ease of coding are all important considerations in choosing
the programming model for a scalable parallel application. The message-passing pro-
gramming model is widely used because of its portability, but some applications are
too complex to code in it while also trying to maintain a balanced computation load
and avoid redundant computations. The shared-memory programming model simplifies
coding, but it is not portable and often provides little control over interprocessor data
transfer costs.
Global Arrays (GA) combines the better features of both
models, leading to both simple coding and efficient execution. The key concept of
GA is that it provides a portable interface through which each process in a MIMD parallel
program can asynchronously access logical blocks of physically distributed matrices with
no need for explicit cooperation by other processes.
The Global Arrays (GA) toolkit provides a shared memory style programming
environment in the context of distributed array data structures (called global arrays).
From the user perspective, a global array can be used as if it were stored in shared memory.
All details of the data distribution, addressing, and data access are encapsulated in
the global array objects. Information about the actual data distribution and locality
can be easily obtained and taken advantage of whenever data locality is important.
The primary target architectures for which GA was developed are massively-parallel
distributed-memory and scalable shared-memory systems.
GA divides logically shared data structures into local and remote portions. It recog-
nizes the variable data transfer costs required to access the data, depending on its proximity.
A local portion of the shared memory is assumed to be faster to access, and
the remainder (the remote portion) is considered slower to access. These differences do not
hinder ease of use, since the library provides uniform access mechanisms for all the
shared data regardless of where the referenced data is located. In addition, any process
can access a local portion of the shared data directly, in place, like any other data in
process-local memory. Access to other portions of the shared data must be done through
the GA library calls.

Figure D.1: Structure of Global Array Toolkit
GA was designed to complement rather than substitute the message-passing model,
and it allows the user to combine shared-memory and message-passing styles of pro-
gramming in the same program. GA inherits an execution environment from a message-
passing library (w.r.t. processes, file descriptors etc.) that started the parallel program.
The basic shared memory operations supported include get, put, scatter and gather.
They are complemented by atomic read-and-increment, accumulate (reduction operation
that combines data in local memory with data in the shared memory location), and lock
operations. However, these operations can only be used to access data in global arrays
rather than arbitrary memory locations. At least one global array has to be created before
data transfer operations can be used. These GA operations are truly one-sided/unilateral
and will complete regardless of actions taken by the remote process(es) that own(s) the
referenced data. In particular, GA does not offer or rely on a polling operation or
require inserting any other GA library calls to assure communication progress on the
remote side.
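The one-sided flavour of these operations can be illustrated with a deliberately simple, single-process Python toy (a hypothetical stand-in, not the real GA API): a 1-D array is split into blocks, one per "owning process", yet get, put and accumulate address any element by its global index, with ownership resolved inside the library.

```python
class ToyGlobalArray:
    """Toy model of a 1-D global array block-distributed over nprocs owners."""

    def __init__(self, n, nprocs):
        self.block = -(-n // nprocs)  # ceiling division: block size per owner
        self.data = [[0] * self.block for _ in range(nprocs)]

    def _locate(self, i):
        """Map a global index to (owning process, local offset)."""
        return divmod(i, self.block)

    def put(self, i, value):          # one-sided write into the owner's block
        owner, off = self._locate(i)
        self.data[owner][off] = value

    def get(self, i):                 # one-sided read from the owner's block
        owner, off = self._locate(i)
        return self.data[owner][off]

    def accumulate(self, i, value):   # combine a contribution into the shared slot
        owner, off = self._locate(i)
        self.data[owner][off] += value
```

In the real library the owner's block may live in another process's memory and the calls complete without any action by that process; the point here is only the global-to-local addressing that the library encapsulates.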