
Effortless and Efficient Distributed Data-partitioning in Linear Algebra

Carlos de Blas Cartón, Dpto. Informática, University of Valladolid, Spain. Email: [email protected]

Arturo Gonzalez-Escribano, Dpto. Informática, University of Valladolid, Spain. Email: [email protected]

Diego R. Llanos, Dpto. Informática, University of Valladolid, Spain. Email: [email protected]

Abstract—This paper introduces a new technique to exploit compositions of different data-layout techniques with Hitmap, a library for hierarchical-tiling and automatic mapping of arrays. We show how Hitmap is used to implement block-cyclic layouts for a parallel LU decomposition algorithm. The paper compares the well-known ScaLAPACK implementation of LU, as well as other carefully optimized MPI versions, with a Hitmap implementation. The comparison is made in terms of both performance and code length. Our results show that the Hitmap version outperforms the ScaLAPACK implementation and is almost as efficient as our best manual MPI implementation. The insertion of this composition technique in the automatic data-layouts of Hitmap allows the programmer to develop parallel programs with both a significant reduction of the development effort and a negligible loss of efficiency.

Index Terms—Automatic data partition; layouts; distributed systems.

I. INTRODUCTION

Data distribution and layout is a key part of any parallel algorithm in distributed environments. To improve performance and scalability, many parallel algorithms need to compose different data-partition and layout techniques at different levels.

A typical example is the block-cyclic partition, useful in many linear algebra problems, such as the LU decomposition. Block-cyclic can be seen as a composition of two basic layouts: a blocking layout (divide the data into same-sized blocks) and a cyclic layout (distribute the blocks cyclically to the processes). LU decomposition is a matrix factorization that transforms a matrix into the product of a lower triangular matrix and an upper triangular matrix. Applications of this decomposition include solving linear equations, matrix inverse calculation, and determinant computation.

While many languages offer a predefined set of one-level, data-parallel partition techniques (e.g. HPF [1], OpenMP [2], UPC [3]), it is not straightforward to use them to create a multiple-level distributed layout that adapts automatically to different grain levels. On the other hand, message-passing interfaces for distributed environments (such as MPI [4]) allow programmers to develop code carefully tuned for a given platform, but at the cost of a high development effort [5], [6]. The reason is that these interfaces only provide basic, simple tools to calculate the data distribution and communication details.

In this paper we present a technique to introduce compositions of different layout techniques at different levels in Hitmap, a library for hierarchical tiling and mapping of arrays. The Hitmap library is a basic tool in the back-end of the Trasgo compilation system [7]. It is designed as a supporting library for parallel compilation techniques, offering a collection of functionalities for the management, mapping, and communication of tiles. Hitmap has an extensibility philosophy similar to the Fortress framework [10], and it can be used to build higher levels of abstraction. Other tiling array libraries, offering abstract data types for distributed tiles (such as HTA [8]), could be reprogrammed and extended using Hitmap. The tile decomposition in Hitmap is based on index-domain partitions, in a similar way as proposed in the Chapel language [9].

In Hitmap, data-layout and load-balancing techniques are independent modules that belong to a plug-in system. The techniques are invoked from the code and applied at run time when needed, using internal information about the target system topology to distribute the data. The programmer does not need to reason in terms of the number of physical processors. Instead, the programmer uses highly abstract communication patterns for the distributed tiles at any grain level. Thus, coding and debugging operations with entire data structures is easy. We show how the use of layout compositions at different levels, together with hierarchical Hitmap tiles, can also automatically benefit the sequential part of the code, improving data locality and exploiting the memory hierarchy for better performance. This composition technique makes it possible to use Hitmap to program highly efficient distributed computations at a low development cost.

We use the LU decomposition as a case study, focusing on the ScaLAPACK implementation as reference. ScaLAPACK is a software library of high-performance linear algebra routines for distributed-memory, message-passing clusters, built on top of a message-passing interface (such as MPI or PVM). ScaLAPACK is a parallel version of the LAPACK project, which in turn is the successor of the LINPACK benchmark to measure supercomputer performance. Currently, there exists another parallel evolution of LINPACK, named HPL (High Performance LINPACK). HPL is a benchmark to measure performance in modern supercomputers, and the 500 fastest systems are published on the top500.org website. Thus, the algorithms and implementations used in ScaLAPACK are a key reference to measure performance in supercomputing environments. Our experimental results show that the Hitmap implementation of LU offers better performance than the ScaLAPACK version, and almost the same performance as a carefully-tuned, manual MPI implementation, with a significantly lower development effort.


$$
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n}\\
a_{21} & a_{22} & \cdots & a_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}
=
\begin{pmatrix}
1 & 0 & \cdots & 0\\
l_{21} & 1 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
l_{n1} & l_{n2} & \cdots & 1
\end{pmatrix}
\times
\begin{pmatrix}
u_{11} & u_{12} & \cdots & u_{1n}\\
0 & u_{22} & \cdots & u_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & u_{nn}
\end{pmatrix}
$$

Figure 1: LU decomposition.

II. A CASE STUDY: THE LU ALGORITHM

A. LU decomposition

LU decomposition [11], [12] writes a matrix A as the product of two matrices: a lower triangular matrix L with 1's on the diagonal and an upper triangular matrix U. See Fig. 1.

LU decomposition is used to solve systems of linear equations or to calculate determinants. The advantage of solving systems with this method is that, once the matrix is factorized, it can be reused to compute solutions for different right-hand sides.

As shown in Fig. 2, there are three different forms of designing the LU matrix factorization: right-looking, left-looking, and Crout LU. There exist iterative and recursive versions of the associated algorithms. In this paper we focus on the iterative, right-looking version, since it provides better load balancing, as we will see in Sect. IV.

Figure 2 shows the state associated with a particular iteration of the corresponding algorithms. Striped portions refer to elements that will get their final value at the end of the iteration. Grey areas refer to elements read in left-looking and Crout, and to elements updated in right-looking. These elements will be modified in subsequent iterations.

Algorithm 1 shows the right-looking LU decomposition. As can be seen, the algorithm may be implemented with three loops. It is worth noting that the division in line 4 may cause precision errors if the divisor is near 0 when using finite precision. Thus, some implementations perform row swapping to avoid small divisors. In each iteration the algorithm searches for the largest a_ik with i > k and swaps the i-th and k-th rows in the matrix and in the vector. After solving the system, it is necessary to undo the row swapping in the solution vector.

Figure 2: Different versions of LU decomposition (left-looking, right-looking, and Crout LU). Activity at a middle iteration.

Algorithm 1 LU right-looking decomposition.
Require: A, an n × n matrix
Ensure: L and U, n × n matrices overwriting A, storing neither the zeros of both matrices nor the ones of the diagonal of L
 1: for k = 0 to n − 1 do
 2:   {elements of matrix L}
 3:   for i = k + 1 to n − 1 do
 4:     a_ik ← a_ik / a_kk
 5:   end for
 6:   {update rest of elements}
 7:   for j = k + 1 to n − 1 do
 8:     for i = k + 1 to n − 1 do
 9:       a_ij ← a_ij − a_kj · a_ik
10:     end for
11:   end for
12: end for
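As a point of reference for the distributed versions discussed later, a minimal sequential C sketch of Algorithm 1 follows. It is our own illustration (not code from Hitmap or ScaLAPACK), assuming the matrix is stored row-major in a flat array and that no pivoting is performed:

    #include <stddef.h>

    /* In-place right-looking LU (Algorithm 1), no pivoting.
     * On exit, the strict lower triangle of a holds L (unit diagonal implicit)
     * and the upper triangle holds U. The matrix is stored row-major in a[n*n]. */
    void lu_right_looking(double *a, size_t n)
    {
        for (size_t k = 0; k < n; k++) {
            /* elements of matrix L: scale the k-th column below the diagonal */
            for (size_t i = k + 1; i < n; i++)
                a[i * n + k] /= a[k * n + k];

            /* update the rest of the elements (trailing submatrix) */
            for (size_t i = k + 1; i < n; i++)
                for (size_t j = k + 1; j < n; j++)
                    a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
    }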

B. Solution of a system of linear equations

After the decomposition, it is possible to solve the system Ax = b ≡ LUx = b in two steps. Let Ux = y. First we solve Ly = b using forward substitution, and then we solve Ux = y using backward substitution.

Definition. A system Ax = b with A lower triangular is solved using forward substitution as follows:

$$x_0 = \frac{b_0}{a_{00}}, \qquad x_i = \frac{b_i - \sum_{j=0}^{i-1} a_{ij} x_j}{a_{ii}} \quad \text{for } i = 1, \ldots, n-1.$$

Definition. A system Ax = b with A upper triangular is solved using backward substitution as follows:

$$x_{n-1} = \frac{b_{n-1}}{a_{n-1,n-1}}, \qquad x_i = \frac{b_i - \sum_{j=i+1}^{n-1} a_{ij} x_j}{a_{ii}} \quad \text{for } i = n-2, \ldots, 0.$$
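A direct C transcription of the two substitution formulas could look as follows; these helpers are our own illustration, assuming dense row-major storage of the triangular matrices:

    #include <stddef.h>

    /* Forward substitution: solve L y = b with L lower triangular (row-major). */
    void forward_subst(const double *L, const double *b, double *y, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            double s = b[i];
            for (size_t j = 0; j < i; j++)
                s -= L[i * n + j] * y[j];
            y[i] = s / L[i * n + i];
        }
    }

    /* Backward substitution: solve U x = y with U upper triangular (row-major). */
    void backward_subst(const double *U, const double *y, double *x, size_t n)
    {
        for (size_t ii = n; ii-- > 0; ) {   /* i = n-1, ..., 0 */
            double s = y[ii];
            for (size_t j = ii + 1; j < n; j++)
                s -= U[ii * n + j] * x[j];
            x[ii] = s / U[ii * n + ii];
        }
    }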

The cost of the decomposition is O(n³) and the cost of the two substitutions is O(n²). Summing up, the cost of the whole algorithm is O(n³) and the cost of the solution step is not significant. Thus, LU decomposition is useful when solving for several right-hand sides. The solving algorithm can be implemented browsing the matrix by rows or by columns, with the same asymptotic cost.


Figure 3: Examples of one-dimensional data distributions: (a) block distribution, (b) cyclic distribution, (c) block-cyclic distribution.


III. BLOCK-CYCLIC LAYOUT

Parallel implementations of algorithms need to distribute the data across different processes. The way of distributing the data is known as the layout. Each layout provides good balance for a different family of algorithms, so it is very important to choose the most suitable one. For our case study, the most appropriate is the block-cyclic layout. This layout can be seen as the composition of the following layouts. Let N be the number of elements and P the number of processes.

1) Block layout (Fig. 3a). Each process receives a block of ⌈N/P⌉ contiguous elements.

2) Cyclic layout (Fig. 3b). The data is distributed cyclically among the processes, so element i is assigned to process i mod P.

The block-cyclic layout [13] is built in two stages: blocks and cyclic (see Fig. 3c). First, a blocking is made, but dividing the data into portions of the same size S, not necessarily equal to ⌈N/P⌉. Then, the blocks are distributed cyclically to the processes. Thus, element i is assigned to process ⌊i/S⌋ mod P. These three layouts can be easily extended to process arrangements in n dimensions.
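The three mappings can be summarized as owner functions from a global element index to a process rank. The sketch below is our own illustration of the formulas above, independent of the Hitmap API:

    #include <stddef.h>

    /* Owner process of element i under a block layout of N elements over P
     * processes (blocks of ceil(N/P) contiguous elements). */
    size_t owner_block(size_t i, size_t N, size_t P)
    {
        size_t block = (N + P - 1) / P;   /* ceil(N/P) */
        return i / block;
    }

    /* Owner process of element i under a cyclic layout over P processes. */
    size_t owner_cyclic(size_t i, size_t P)
    {
        return i % P;
    }

    /* Owner process of element i under a block-cyclic layout with block size S
     * over P processes: blocks of S elements are dealt out cyclically. */
    size_t owner_block_cyclic(size_t i, size_t S, size_t P)
    {
        return (i / S) % P;
    }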

IV. RIGHT-LOOKING LU IN DISTRIBUTED ENVIRONMENTS

Among the three LU decomposition algorithms described above, the right-looking LU is the one that benefits most from load-balancing techniques when implemented in distributed environments. For this algorithm, the block-cyclic layout is the most appropriate load-balancing technique. The cyclic distribution of the blocks ensures that all processes have roughly the same workload, because the trailing matrix to be updated is distributed among the processes. In addition, distributing blocks instead of single elements improves data locality [14], [15].

The left-looking and Crout versions are not as useful in distributed environments, since it is much harder to find a suitable data distribution.

Figure 4: Steps in a blocked LU parallel algorithm: (1) factorization of the diagonal block, (2) update of the L column blocks, (3) update of the U row blocks, and (4) update of the trailing matrix.

In the left-looking LU algorithm, since only one column per iteration is updated, only the column of processes which owns that column performs computations, whereas the rest only communicate. Something similar happens with Crout.

Besides, right-looking implies less communication than the other versions, because only a column and a row are needed per iteration to update the trailing matrix, whereas in the other versions almost the whole matrix is needed locally.

The right-looking LU algorithm is usually computed on process arrangements of two dimensions, also called a grid of processes. Algorithm 2 shows an outline of a block-cyclic, right-looking, parallel LU decomposition algorithm. The functions minedim1(i) and minedim2(i) in that algorithm return TRUE if i belongs to the local process. The algorithm iterates by blocks, and each iteration is divided into four steps, as shown in Fig. 4.

The main steps of the factorization algorithm are shown in Fig. 4. In the figure, dark grey portions refer to elements updated and light grey ones to elements read in each step.

1) Factorize the diagonal block (Fig. 4.1).

2) Update the blocks in the same column as the diagonal block in matrix L, using the diagonal block (Fig. 4.2).

3) Update the blocks in the same row as the diagonal block in matrix U, using the diagonal block (Fig. 4.3).

4) Update the trailing matrix using the blocks previously updated in steps 2 and 3 (Fig. 4.4).

Steps 2 to 4 may imply block communication among processes. Since all the updated blocks will be needed by other processes of the same row or column, most communications are multicasts.
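To make the four steps concrete, the following sequential sketch shows the corresponding block-level kernels. It is our own illustration (the Hitmap code in Fig. 8 calls analogous routines such as updateMatrixL and updateTrailingMatrix), assuming S × S blocks stored contiguously in row-major order and no pivoting:

    /* Step 1: factorize the diagonal block in place (unblocked LU, no pivoting). */
    static void factorize_diag(double *Akk, int S)
    {
        for (int k = 0; k < S; k++) {
            for (int i = k + 1; i < S; i++)
                Akk[i * S + k] /= Akk[k * S + k];
            for (int i = k + 1; i < S; i++)
                for (int j = k + 1; j < S; j++)
                    Akk[i * S + j] -= Akk[i * S + k] * Akk[k * S + j];
        }
    }

    /* Step 2: update a block Aik below the diagonal block (L column):
     * Aik <- Aik * inv(Ukk), where Ukk is the upper triangle of the
     * factorized diagonal block Akk. */
    static void update_L_block(double *Aik, const double *Akk, int S)
    {
        for (int j = 0; j < S; j++)
            for (int i = 0; i < S; i++) {
                double s = Aik[i * S + j];
                for (int t = 0; t < j; t++)
                    s -= Aik[i * S + t] * Akk[t * S + j];
                Aik[i * S + j] = s / Akk[j * S + j];
            }
    }

    /* Step 3: update a block Akj to the right of the diagonal block (U row):
     * Akj <- inv(Lkk) * Akj, where Lkk is the unit lower triangle of Akk. */
    static void update_U_block(double *Akj, const double *Akk, int S)
    {
        for (int i = 0; i < S; i++)
            for (int t = 0; t < i; t++)
                for (int j = 0; j < S; j++)
                    Akj[i * S + j] -= Akk[i * S + t] * Akj[t * S + j];
    }

    /* Step 4: update a trailing block: Aij <- Aij - Aik * Akj (block GEMM). */
    static void update_trailing_block(double *Aij, const double *Aik,
                                      const double *Akj, int S)
    {
        for (int i = 0; i < S; i++)
            for (int t = 0; t < S; t++)
                for (int j = 0; j < S; j++)
                    Aij[i * S + j] -= Aik[i * S + t] * Akj[t * S + j];
    }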

Figure 5 shows how the processes are arranged into an R × C grid. In a given iteration, each process in the grid performs exactly one of the following four tasks (see Alg. 2).


Algorithm 2 LU parallel algorithm. Input: n × n matrix A and n × 1 vector b. Output: n × n matrices L and U overwriting A, stored together, ruling out the zeros and the ones of the diagonal of L, and n × 1 vector x overwriting b. The matrices are partitioned in S × S blocks and the processes form an R × C grid.

 1: for k = 0 to ⌈N/S⌉ − 1 do {Iterating by blocks}
 2:   if minedim1(k) ∧ minedim2(k) then
 3:     {TASK TYPE 1: Process with the diagonal block}
 4:     Factorize diagonal block A[k, k]
 5:     Update blocks A[i, k] with i > k with block A[k, k]
 6:     Update blocks A[k, j] with j > k with block A[k, k]
 7:     Update blocks of submatrix
 8:   else if minedim1(k) then
 9:     {TASK TYPE 2: Process in the same column of the grid as the diagonal block process}
10:     Update blocks A[i, k] with i > k with block A[k, k]
11:     Update blocks of submatrix
12:   else if minedim2(k) then
13:     {TASK TYPE 3: Process in the same row of the grid as the diagonal block process}
14:     Update blocks A[k, j] with j > k with block A[k, k]
15:     Update blocks of submatrix
16:   else
17:     {TASK TYPE 4: Process neither in the same row nor in the same column of the grid as the diagonal block process}
18:     Update blocks of submatrix
19:   end if
20: end for

1) There is one process that owns the diagonal block. This process executes the four steps and communicates in steps 2, 3, and 4 (process P[1,2] in Fig. 5).

2) There are R − 1 processes that own blocks in the same column as the diagonal block. These processes execute steps 2 and 4. In step 4 they communicate the blocks updated in step 2, receiving updated blocks that belong to the same row as the diagonal block (processes P[0,2], P[2,2], and P[3,2] in Fig. 5).

3) There are C − 1 processes that own blocks in the same row as the diagonal block. These processes execute steps 3 and 4. In step 4 they communicate the blocks updated in step 3, receiving updated blocks that belong to the same column as the diagonal block (processes P[1,0], P[1,1], P[1,3], and P[1,4] in Fig. 5).

4) The remaining (R · C − R − C + 1) processes execute step 4, receiving the blocks updated in steps 2 and 3 from one process in the same row and one process in the same column.
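A minimal sketch of this task selection follows; it is our own illustration, not Hitmap code. With a cyclic distribution of blocks over the R × C grid, the diagonal block of iteration k is owned by the process at grid row k mod R and grid column k mod C, so each process can classify itself by comparing its grid coordinates against those values:

    /* Task types of the parallel right-looking LU (see Alg. 2). */
    enum task_type { TASK_DIAGONAL = 1, TASK_SAME_COLUMN, TASK_SAME_ROW, TASK_REMAINING };

    /* Classify the local process (at grid position (my_row, my_col) in an R x C
     * grid) for iteration k, assuming block row/column k is owned by grid row
     * k % R and grid column k % C (cyclic distribution of blocks). */
    enum task_type classify(int my_row, int my_col, int R, int C, int k)
    {
        int diag_row = k % R;   /* grid row owning block row k      */
        int diag_col = k % C;   /* grid column owning block column k */

        if (my_row == diag_row && my_col == diag_col) return TASK_DIAGONAL;
        if (my_col == diag_col)                       return TASK_SAME_COLUMN;
        if (my_row == diag_row)                       return TASK_SAME_ROW;
        return TASK_REMAINING;
    }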

A. An example

In Fig. 6a we can see an example of an n × n matrix, divided into 7 × 7 blocks and distributed to a 2 × 2 grid of processes. In this example, the computation is at the start of the second iteration (the blocks in column 0 and row 0 are already updated). Figure 6b shows the communications needed at this point. The diagonal block is B[1,1] and belongs to process P[1,1]. Process P[1,1] factorizes this block and sends it to processes P[1,0] and P[0,1].

Figure 5: Example of a 3 × 4 grid of processes where process P[1,2] owns the diagonal block.

Processes P[1,1] and P[1,0] update the blocks in the same row as the diagonal block in matrix U (blocks B[1,2], B[1,3], ..., B[1,6]), and processes P[1,1] and P[0,1] update the blocks in the same column as the diagonal block in matrix L (blocks B[2,1], B[3,1], ..., B[6,1]).

Then, all processes update the blocks of the submatrix with block indexes B[2..6,2..6]. To do so, processes P[1,0] and P[0,1] need blocks from P[1,1], and process P[0,0] needs blocks from P[0,1] and P[1,0]. Recall the update expression a_ij = a_ij − a_kj · a_ik in line 9 of Alg. 1. A process only needs blocks from other processes in the same row or column of the grid. Thus, communications are limited to rows or columns.

The right-hand side vector has only one dimension and is mapped to a row or column of processes. Solving the system using the LU factorization is straightforward and needs no further explanation.


Figure 6: Example of a matrix of 7 × 7 blocks distributed in a two-dimensional block-cyclic layout over a 2 × 2 grid of processes: (a) distribution of blocks; (b) communications needed in iteration 2 of the example on the left.

V. TWO IMPLEMENTATION APPROACHES

In this section we describe two different approaches to implement the LU decomposition in distributed systems. The first one uses a message-passing library directly; the well-known ScaLAPACK implementation is a good representative of this approach. Instead of managing the data partition and the MPI communications directly, a different approach is to use a library with a higher level of abstraction. We present an LU implementation using Hitmap, a library to efficiently map and communicate hierarchical tile arrays. Hitmap is used in the back-ends of the Trasgo compilation system [7]. After reviewing both approaches, we compare them in terms of performance and development effort.

A. ScaLAPACK

ScaLAPACK [16], [17] is a parallel library of high-performance linear algebra routines for message-passing systems. These routines are written in Fortran using the BLACS communications library. BLACS is defined as an interface, and has different implementations depending on the communications library used, such as PVM, MPI, MPL, and NX. ScaLAPACK is an evolution of the previous sequential library, LAPACK, designed for shared-memory and vector supercomputers. Its goals are efficiency, scalability, reliability, portability, flexibility, and ease of use.

One of the ScaLAPACK routines, called pdgesv, computes the solution of a real system of linear equations Ax = b. This routine first computes a decomposition of an n × n matrix A into a product of two triangular matrices L · U, where L is a lower triangular matrix with ones on the diagonal and U is an upper triangular one. This decomposition then makes the later resolution of the system Ax = b easier, where x is the n × 1 solution vector of the system.

ScaLAPACK implements the non-recursive, right-looking version of the LU algorithm, since it is the best option for block-cyclic layouts, as explained before. The matrices are distributed using a two-dimensional block-cyclic layout, and the local parts of the matrices after distribution are stored contiguously in memory.

B. LU with Hitmap library

Hitmap is a library to support hierarchical tiling, mapping, and manipulation [7]. It is designed to be used with an SPMD parallel programming language, allowing the creation, manipulation, distribution, and efficient communication of tiles and tile hierarchies. The focus of Hitmap is to provide an abstract layer for tiling in parallel computations. Thus, it may be used directly, or to build higher-level constructions for tiling in parallel languages.

Although Hitmap is designed with an object-oriented approach, it is implemented in the C language. It defines several object classes implemented with C structures and functions. A Signature, represented by a HitSig object, is a tuple of three numbers representing a selection of array indexes in a one-dimensional domain. The three numbers represent the starting index, the ending index, and a stride to specify a non-contiguous but regular subset of indexes. A Shape (HitShape object) is a set of n signatures representing a selection of indexes in a multidimensional domain. Finally, a Tile, represented by a HitTile object, is an array whose indexes are defined by a shape. Tiles may be declared directly, or as a selection of another tile whose indexes are defined by a new shape. Thus, they may be created hierarchically or recursively. Tiles have two coordinate systems to access their elements: (1) tile coordinates and (2) array coordinates. Tile coordinates always start at index 0 and have no stride. Array coordinates associate elements with their original domain indexes, as defined by the shape of the root tile. In Hitmap, a tile may be defined with or without allocated memory. This makes it possible to declare and partition arrays without memory, finally allocating only the parts mapped to a given processor.

Hitmap also provides a HitTopology object class that encapsulates simple functions to create virtual topologies from hidden physical topology information, directly obtained by the library from the lower implementation layer. HitLayout objects store the results of the partition or distribution of a shape domain over a topology of virtual processors. They are created by calling one of the layout plug-in modules included in the library.


1  HitTopology topo = hit_topology( plug_topArray2DComplete );
2  int numberBlocks = ceil_int( matrixSize / blockSize );
3  HitTile_HitTile matrixTile, matrixSelection;
4  hit_tileDomainHier( & matrixTile, sizeof( HitTile ), 2, numberBlocks, numberBlocks );
5  HitLayout matrixLayout = hit_layout( plug_layCyclic,
6                                       topo,
7                                       hit_tileShape( matrixTile ) );
8  hit_tileSelect( & matrixSelection, & matrixTile, hit_layShape( matrixLayout ) );
9  hit_tileAlloc( & matrixSelection );
10 HitTile_double block;
11 hit_tileDomainAlloc( & block, sizeof( double ), 2, blockSize, blockSize );
12 hit_tileFill( & matrixSelection, & block );
13 hit_tileFree( block );
14 hit_comAllowDims( matrixLayout );

Figure 7: Hitmap code for data partitioning.

There are some predefined layout modules, but the library also allows new ones to be created easily. The resulting layout objects have information about the local part of the input domain, neighbor relationships, and methods to get information about the part of any other virtual process. The layout modules encapsulate mapping decisions for load balancing and neighborhood information for either data or task parallelism. The information in the layout objects may be exploited by several abstract communication functionalities which build HitComm objects. These objects may be composed into reusable patterns represented by HitPattern objects. The library is built on top of the MPI communication library, internally exploiting several MPI techniques for efficient buffering, marshalling, and communication.

Block-cyclic layout in Hitmap library

In this section we show how to use the Hitmap tools to implement a block-cyclic layout. We will use a two-level tile hierarchy and a cyclic Hitmap layout to distribute the upper-level blocks among the processors.

Hitmap uses HitTile objects to represent multidimensional arrays, or array subselections. HitTile is an abstract type which is specialized with the base type of the desired array elements. Thus, HitTile_double is the type for tiles of double elements. In this work, we include a new derived type in the library: HitTile_HitTile. The base elements of tiles of this type are themselves tiles of any derived type. Thus, heterogeneous hierarchies of tile types are easily created using HitTile_HitTile objects on the branches, and other specific derived types on the leaves of the structure. In Fig. 9 we show the internal structure of a two-level hierarchy of tiles. The HitTile_HitTile variable is internally a C structure that records the information needed to access the elements of the array and to manipulate the whole tile. One of its fields is a pointer to the data; in this case, it is a pointer to an array of HitTile elements. We have added other information fields to the structure to record details about the hierarchical structure at a given level. Other Hitmap functionalities, related to the communication and input/output of tiles, have been reprogrammed to support marshalling/demarshalling of whole hierarchical structures.

The rest of this section shows how the HitTile_HitTile type is used to define and lay out aggregations of subselection tiles (e.g. array blocks), allowing compositions of layouts to be easily applied at different levels.
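To convey the idea only, a two-level hierarchy can be pictured as a tile descriptor whose element array holds further tile descriptors. The following toy C structure is a simplified sketch, not the actual Hitmap definition:

    /* Simplified picture of a two-level tile hierarchy; the real HitTile
     * structure in Hitmap contains additional fields. */
    typedef struct ToyTile {
        void  *data;        /* pointer to the element array          */
        size_t baseExtent;  /* size in bytes of one element          */
        int    card[2];     /* number of elements per dimension      */
    } ToyTile;

    /* Upper level: a ToyTile whose data points to an array of ToyTile
     * elements (baseExtent = sizeof(ToyTile)); each of those, in turn,
     * has data pointing to a contiguous block of doubles
     * (baseExtent = sizeof(double)), as depicted in Fig. 9. */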

Data distribution

In Fig. 7 we show how the Hitmap tile creation and layout functions can be used in a parallel environment to automatically generate the structure represented in Fig. 9, with a HitTile_double element for each block assigned to the local parallel processor. Line 1 creates a virtual 2D topology of processors using a Hitmap plug-in. Topology plug-ins automatically obtain information from the physical topology. The number of blocks is calculated in line 2, using the matrix size and the block size parameters.

Line 3 declares the HitTile variables needed. In line 4, we define the domain of a tile representing the global array of blocks. This declaration does not allocate memory for the blocks, or for the upper-level array of tiles. Instead, it creates the HitTile template with the array sizes. It is used in line 5 to call a cyclic layout plug-in that distributes the tile elements among the processors. The returned HitLayout object contains information about the elements assigned to any processor, specifically the local one. This information is used in line 8 to select the appropriate local elements of the original tile. The memory for these selected local blocks is allocated in line 9.

The following four lines define and allocate a new tile with the sizes of a block, using it as a pattern to fill up the local selection of the matrix of blocks. The function hit_tileFill makes clones of its second parameter for every data element of the first tile parameter. The last line activates the possibility of declaring 1D collective communications in the virtual topology associated with the layout.

Factorization

Figure 8 shows an excerpt of the matrix factorization code. Inside the main loop, we show the full code for Task Type 2, which updates (a) the blocks of the L matrix in the same column as the diagonal block, and (b) the blocks of the trailing matrix.

Before the task code, we can see the declaration of the Hitmap communication objects used in that task during the given iteration. Hitmap provides functions to declare HitCom objects containing information about the subtile (set of blocks) to communicate, and the sender/receiver processor in virtual topology coordinates. The communications needed in this algorithm are always broadcasts of blocks along a given axis of the 2D virtual topology associated with the layout object. In the task code, the communications associated with a HitCom object are invoked, any number of times, with the hit_comDo primitive.


1  HitCom commRowSend, commRowRecv, commColSend, commColRecv, commBlockSend, commBlockRecv;
2  for( i = 0; i < numberBlocks; i++ ) {
3
4    // 1. COMMUNICATION OBJECTS DECLARATION
5    commBlockSend = ...;
6    commBlockRecv = hit_comBroadcastDimSelect( matrixLayout, 0, diagProc[0], & bufferDim0,
7                        hit_shape( 1, hit_sigIndex( currentBlock[1] ) ),
8                        HIT_COM_TILECOORDS, HIT_DOUBLE );
9    commRowSend = ...;
10   commRowRecv = hit_comBroadcastDimSelect( matrixLayout, 0, diagProc[0], & bufferDim1,
11                       hit_shape( 1, hit_tileSigDimTail( matrixSelection, 1, currentBlock[1] ) ),
12                       HIT_COM_TILECOORDS, HIT_DOUBLE );
13   commColSend = hit_comBroadcastDimSelect( matrixLayout, 1, hit_topDimRank( topo, 1 ),
14                       & matrixSelection, hit_shape( 2,
15                           hit_tileSigDimTail( matrixSelection, 0, currentBlock[0] ),
16                           hit_sigIndex( currentBlock[1] ) ),
17                       HIT_COM_TILECOORDS, HIT_DOUBLE );
18   commColRecv = ...;
19
20   // 2. COMPUTATION
21   // 2.1. TASK TYPE 1: UPDATE DIAGONAL BLOCK, U ROW, L COLUMN, AND TRAILING MATRIX
22   if ( hit_topAmIDim( topo, 0, diagProc[0] )
23        && hit_topAmIDim( topo, 1, diagProc[1] ) ) {
24     ...
25   }
26
27   // 2.2. TASK TYPE 2: UPDATE L COLUMN, AND TRAILING MATRIX
28   else if ( hit_topAmIDim( topo, 1, diagProc[1] ) ) {
29     hit_comDo( & commBlockRecv );
30     for( j = currentBlock[0]; j < hit_tileDimCard( matrixSelection, 0 ); j++ ) {
31       updateMatrixL( matrixBlockAt( bufferDim1, 1, currentBlock[1] ),
32                      matrixBlockAt( matrixSelection, j, currentBlock[1] ) );
33     }
34     hit_comDo( & commRowRecv );
35     hit_comDo( & commColSend );
36     for( j = currentBlock[0]; j < hit_tileDimCard( matrixSelection, 0 ); j++ ) {
37       for( k = currentBlock[1]; k < hit_tileDimCard( matrixSelection, 1 ); k++ ) {
38         updateTrailingMatrix( matrixBlockAt( matrixSelection, j, currentBlock[1] ),
39                               matrixBlockAt( bufferDim1, 1, k ),
40                               matrixBlockAt( matrixSelection, j, k ) );
41       }
42     }
43   }
44
45   // 2.3. TASK TYPE 3: UPDATE U ROW, AND TRAILING MATRIX
46   else if ( hit_topAmIDim( topo, 0, diagProc[0] ) ) {
47     ...
48   }
49   // 2.4. TASK TYPE 4: UPDATE ONLY TRAILING MATRIX
50   else {
51     ...
52   }
53   hit_comFree( commRowSend ); hit_comFree( commRowRecv ); hit_comFree( commColSend );
54   hit_comFree( commColRecv ); hit_comFree( commBlockSend ); hit_comFree( commBlockRecv );
55 }

Figure 8: Hitmap factorization code.


VI. EXPERIMENTAL RESULTS

In this section we evaluate the approaches described in the previous section. The evaluation is carried out in terms of performance and ease of programming.

ScaLAPACK's LU version is originally written in Fortran, while Hitmap is a C library. To ensure a fair comparison, we translated ScaLAPACK's LU version to C, using the MPI library for communications. Since the C language uses row-major storage instead of Fortran's column-major storage, we have inverted loops and indexes for matrix accesses, and also adapted the solving part of the algorithm. Moreover, the row-swapping operation is not fairly comparable in terms of performance with the C versions due to the different storage order. Therefore, we have eliminated the row-swapping stage from both the ScaLAPACK LU routine and our C programs. This translation process also helped us to estimate the effort of developing a hand-made, right-looking, message-passing LU.


Figure 9: Two-level support for block-cyclic layouts in the Hitmap library: a HitTile_HitTile whose data field points to an array of HitTile elements (baseExtent = sizeof(HitTile)), each of them a HitTile_double whose data field points to its block of double elements (baseExtent = sizeof(double)).


The hierarchical nature of a two-level Hitmap tile exploits memory access locality in a natural way. To isolate this effect, we developed two modified versions of the manual C code with different optimizations. These changes represent a good trade-off between development effort and performance. Both modified versions are described below.

Blocks browsing: The ScaLAPACK LU main loop iterates on a block-by-block basis, but the inner loops browse the remaining part of the matrix by columns. Thus, several block columns are updated before proceeding to the next column. More data locality can be achieved in the sequential parts of the code by browsing the matrices by blocks, updating a whole block before proceeding to the following one.

Blocks storing: ScaLAPACK stores the local parts of the global matrix contiguously in memory: an n × m matrix is declared as a two-dimensional n × m array in which all the column elements are contiguous. Storing each block contiguously and using a matrix of pointers to locate the blocks improves both locality and performance.
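As an illustration of this blocks-storing scheme (our own sketch, not the code used in the experiments), the local part of the matrix can be kept as an array of pointers to contiguously allocated S × S blocks:

    #include <stdlib.h>

    /* Local storage for the blocks owned by one process: an array of pointers,
     * one per local block position, each pointing to a contiguous S x S buffer. */
    typedef struct {
        int      local_rows;   /* local blocks per dimension */
        int      local_cols;
        int      S;            /* block size                  */
        double **blocks;       /* local_rows * local_cols pointers */
    } BlockMatrix;

    static int block_matrix_init(BlockMatrix *m, int local_rows, int local_cols, int S)
    {
        m->local_rows = local_rows;
        m->local_cols = local_cols;
        m->S = S;
        m->blocks = malloc((size_t)local_rows * local_cols * sizeof(double *));
        if (!m->blocks) return -1;
        for (int b = 0; b < local_rows * local_cols; b++) {
            m->blocks[b] = malloc((size_t)S * S * sizeof(double));
            if (!m->blocks[b]) return -1;
        }
        return 0;
    }

    /* Access the contiguous buffer of local block (i, j). */
    static double *block_at(const BlockMatrix *m, int i, int j)
    {
        return m->blocks[i * m->local_cols + j];
    }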

Experiment design and environment

The codes have been run on a shared-memory machine called Geopar. Geopar is an Intel S7000FC4URE server, equipped with four quad-core Intel Xeon MPE7310 processors at 1.6 GHz and 32 GB of RAM. Geopar runs OpenSolaris 2008.05, with the Sun Studio 12 compiler suite. We use up to 15 cores simultaneously, to avoid interference from the operating system running on the last core.

We present results with input data matrices of 15 000 × 15 000 elements to achieve good scalability. The optimal block size is platform-dependent. In Geopar, we have found that the best block size is 64 × 64 elements.

Table I: Comparison of the code lines devoted to parallelism and the total code lines between Hitmap and the MPI reference implementation.

                     Par. code lines   Total lines   % par. lines
  MPI Ref. Version         151             646          23.37%
  Hitmap Version           117             586          19.97%
  Reduction               22.52%           9.29%

Results

Figure 10 (left) shows execution times in seconds for the different implementations. Figure 10 (right) shows the corresponding speed-ups obtained. The plots show that our direct translation of the ScaLAPACK code to C presents performance similar to the original Fortran code only when executed sequentially. When executing in parallel, our results show that the optimization effort applied to the communication stages in the ScaLAPACK routines cannot be matched by a naive translation to C. However, the data-locality optimizations introduced in the blocks-browsing and blocks-storing versions clearly outperform the original ScaLAPACK routines. The speed-up plots also show that they scale properly. The buffering of data needed during communications also benefits from the blocks-storing scheme.

Regarding the Hitmap version, it achieves performance figures similar to the carefully-optimized C+MPI implementations with data-locality improvements. The reason is that, as stated before, the Hitmap development techniques exploit locality in a natural way. The flexible and general Hitmap tile management only introduces a small overhead in the access to tile elements.

Finally, in Table I we show a comparison of the Hitmap version with the best optimized C version in terms of lines of code. The comparison takes into account not only the total number of lines of code, but also the lines devoted to parallelism, including data partition, communication initializations, and communication calls. As the table shows, the use of the Hitmap library leads to a 9.29% reduction in the total number of code lines. This reduction mainly affects the lines directly related to parallelism (22.52%). The reason is that Hitmap greatly simplifies the programmer's effort for data distribution and communication, compared with the equivalent code needed when using MPI routines directly. Moreover, the use of Hitmap tile-management primitives eliminates some more lines in the sequential treatment of matrices and blocks.

VII. CONCLUSIONS

In this paper we present a technique that allows Hitmap, a library for hierarchical array tiling and mapping, to support layout compositions at different levels. The technique takes advantage of a new derived type that makes it possible to represent heterogeneous tile hierarchies. This technique improves the applicability of Hitmap to new problem classes and programming paradigms, such as multilevel adaptive PDE solvers or recursive domain partitions, and to other scenarios, such as heterogeneous platforms. The improved features of Hitmap allow the programmer to work at a higher abstraction level than using message-passing programming directly, thus significantly reducing coding and debugging effort.


Figure 10: Execution times in seconds (left) and speed-ups (right) for a 15 000 × 15 000 matrix with 64 × 64 blocks, for the ScaLAPACK, naive C translation, C blocks-browsing, C blocks-storing, and Hitmap versions.


We show how this new type makes it possible to create data-layout compositions. We use the block-cyclic layout in the well-known parallel LU decomposition as a case study. We compare the reference implementation of LU in ScaLAPACK with different carefully-optimized MPI versions, as well as with a Hitmap implementation. Our experimental results show that the highly efficient techniques used in Hitmap for tile management, memory access, data locality, buffering, marshalling, and communication make it possible to achieve better performance than the ScaLAPACK implementation. The performance obtained is almost the same as that of a manual, carefully-optimized MPI version, with a fraction of its associated development cost.

ACKNOWLEDGEMENTS

This research is partly supported by the Ministerio de Educación y Ciencia, Spain (TIN2007-62302), Ministerio de Industria, Spain (FIT-350101-2007-27, FIT-350101-2006-46, TSI-020302-2008-89, CENIT MARTA, CENIT OASIS, CENIT OCEAN LEADER), Junta de Castilla y León, Spain (VA094A08), and also by the Dutch government STW/PROGRESS project DES.6397. Part of this work was carried out under the HPC-EUROPA project (RII3-CT-2003-506079), with the support of the European Community - Research Infrastructure Action under the FP6 "Structuring the European Research Area" Programme. The authors wish to thank the members of the Trasgo Group for their support.

REFERENCES

[1] S. Hiranandani, K. Kennedy, and C. Tseng, "Compiling Fortran D for MIMD distributed-memory machines," Commun. ACM, vol. 35, no. 8, pp. 66–80, 1992.

[2] R. Chandra, L. Dagum, D. Kohr, D. Maydan, J. McDonald, and R. Menon, Parallel Programming in OpenMP. Morgan Kaufmann Publishers, 1st ed., 2001. ISBN 1-55860-671-8.

[3] W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, and K. Warren, "Introduction to UPC and language specification," Tech. Rep. CCS-TR-99-157, IDA Center for Computing Sciences, 1999.

[4] W. Gropp, E. Lusk, and A. Skjellum, Using MPI - 2nd Edition: Portable Parallel Programming with the Message Passing Interface. The MIT Press, 2nd ed., Nov. 1999.

[5] S. Gorlatch, "Send-Recv considered harmful? Myths and truths about parallel programming," in PaCT'2001 (V. Malyshkin, ed.), vol. 2127 of LNCS, pp. 243–257, Springer-Verlag, 2001.

[6] K. Gatlin, "Trials and tribulations of debugging concurrency," ACM Queue, vol. 2, pp. 67–73, Oct. 2004.

[7] A. Gonzalez-Escribano and D. Llanos, "Trasgo: A nested parallel programming system," Journal of Supercomputing, doi:10.1007/s11227-009-0367-5, 2009.

[8] G. Bikshandi, J. Guo, D. Hoeflinger, G. Almasi, B. Fraguela, M. Garzarán, D. Padua, and C. von Praun, "Programming for parallelism and locality with hierarchical tiled arrays," in PPoPP'06, pp. 48–57, ACM Press, March 2006.

[9] B. Chamberlain, D. Callahan, and H. Zima, "Parallel programmability and the Chapel language," International Journal of High Performance Computing Applications, vol. 21, pp. 291–312, Aug. 2007.

[10] G. S. Jr., "Parallel programming and code selection in Fortress," in PPoPP'06, pp. 1–1, ACM Press, March 2006.

[11] G. H. Golub and C. F. V. Loan, Matrix Computations. JHU Press, Oct. 1996.

[12] M. T. Heath, Scientific Computing. McGraw-Hill Science/Engineering/Math, 2nd ed., 2001.

[13] J. Choi and J. J. Dongarra, "Scalable linear algebra software libraries for distributed memory concurrent computers," in Proceedings of the 5th IEEE Workshop on Future Trends of Distributed Computing Systems, p. 170, IEEE Computer Society, 1995.

[14] J. Choi, J. J. Dongarra, L. S. Ostrouchov, A. P. Petitet, D. W. Walker, and R. C. Whaley, "Design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines," Sci. Program., vol. 5, no. 3, pp. 173–184, 1996.

[15] R. Reddy, A. Lastovetsky, and P. Alonso, "Parallel solvers for dense linear systems for heterogeneous computational clusters," in Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, pp. 1–8, IEEE Computer Society, 2009.

[16] J. Dongarra and A. Petitet, "ScaLAPACK tutorial," in Proceedings of the Second International Workshop on Applied Parallel Computing, Computations in Physics, Chemistry and Engineering Science, pp. 166–176, Springer-Verlag, 1996.

[17] L. S. Blackford, A. Cleary, J. Choi, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users' Guide. SIAM, 1997.
