
16 December 2005 Universidad de Murcia 1

Research in parallel routines optimization

Domingo Giménez, Dpto. de Informática y Sistemas

Javier Cuenca, Dpto. de Ingeniería y Tecnología de Computadores

Universidad de Murcia, http://dis.um.es/~domingo

... and more: J. González (Intel Barcelona), L.P. García (Politécnica Cartagena), A.M. Vidal (Politécnica Valencia), G. Carrillo (?), P. Alberti (U. Magallanes), P. Alonso (Politécnica Valencia), J.P. Martínez (U. Miguel Hernández), J. Dongarra (U. Tennessee), K. Roche (?)


16 December 2005 Universidad de Murcia 2

Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing


16 December 2005 Universidad de Murcia 3

Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing


16 December 2005 Universidad de Murcia 4

A little history Parallel optimization in the past:

Hand-optimization for each platform:
time consuming
incompatible with hardware evolution
incompatible with changes in the system (architecture and basic libraries)
unsuitable for systems with variable workloads
misuse by non-expert users


16 December 2005 Universidad de Murcia 5

A little history Initial solutions to this situation:

Problem-specific solutions Polyalgorithms Installation tests


16 December 2005 Universidad de Murcia 6

A little history Problem-specific solutions:

Brewer (1994): Sorting Algorithms, Differential Equations

Frigo (1997): FFTW: The Fastest Fourier Transform in the West

LAWRA (1997): Linear Algebra With Recursive Algorithms


16 December 2005 Universidad de Murcia 7

A little history Polyalgorithms:

Brewer FFTW PHiPAC (1997): Linear Algebra


16 December 2005 Universidad de Murcia 8

A little history Installation tests:

ATLAS (2001): Dense Linear Algebra, sequential

Carrillo + Giménez (2000): Gauss elimination, heterogeneous algorithm

I-LIB (2000): some parallel linear algebra routines


16 December 2005 Universidad de Murcia 9

A little history Parallel optimization today:

Optimization based on computational kernels

Systematic development of routines Auto-optimization of routines Middleware for auto-optimization


16 December 2005 Universidad de Murcia 10

A little history Optimization based on computational kernels:
Efficient kernels (BLAS) and algorithms based on these kernels
Auto-optimization of the basic kernels (ATLAS)


16 December 2005 Universidad de Murcia 11

A little history Systematic development of routines:
FLAME project: R. van de Geijn + E. Quintana + …, Dense Linear Algebra, based on Object Oriented Design
LAWRA: Dense Linear Algebra, for shared memory systems


16 December 2005 Universidad de Murcia 12

A little history Auto-optimization of routines:

At installation time: ATLAS (Dongarra + Whaley), I-LIB (Kanada + Katagiri + Kuroda), SOLAR (Cuenca + Giménez + González), LFC (Dongarra + Roche)
At execution time: solve a reduced problem in each processor (Kalinov + Lastovetsky), or use a system evaluation tool (NWS)


16 December 2005 Universidad de Murcia 13

A little history Middleware for auto-optimization:

LFC: middleware for Dense Linear Algebra software in clusters
Hierarchy of autotuning libraries: include in the libraries installation routines to be used in the development of higher level libraries
FIBER: proposal of general middleware, evolution of I-LIB
mpC: for heterogeneous systems


16 December 2005 Universidad de Murcia 14

A little history Parallel optimization in the future?:
Skeletons and languages
Heterogeneous and variable-load systems
Distributed systems
P2P computing


16 December 2005 Universidad de Murcia 15

A little history Skeletons and languages:
Develop skeletons for parallel algorithmic schemes together with execution time models, and provide the users with these libraries (MALLBA, Málaga-La Laguna-Barcelona) or languages (P3L, Pisa)


16 December 2005 Universidad de Murcia 16

A little history Heterogeneous and variable-load systems:
Heterogeneous algorithms: unbalanced distribution of data (static or dynamic)
Homogeneous algorithms: more processes than processors, and assignment of processes to processors (static or dynamic)
Variable-load systems treated as dynamically heterogeneous systems


16 December 2005 Universidad de Murcia 17

A little history Distributed systems:
Intrinsically heterogeneous and variable-load
Very high cost of communications
Special middleware is necessary (Globus, NWS)
There can be servers to attend client queries


16 December 2005 Universidad de Murcia 18

A little history P2P computing:
Users can join and leave dynamically
All users are of the same type (initially)
It is distributed, heterogeneous and variable-load
But special middleware is necessary


16 December 2005 Universidad de Murcia 19

Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing


16 December 2005 Universidad de Murcia 20

Modelling Linear Algebra Routines

Necessary to predict accurately the execution time and select:
the number of processes
the number of processors
which processors
the number of rows and columns of processes (the topology)
the processes-to-processors assignment
the computational block size (in linear algebra algorithms)
the communication block size
the algorithm (polyalgorithms)
the routine or library (polylibraries)
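As a sketch of how these choices can be handled in code, the Algorithmic Parameters can be grouped in one structure that an autotuning routine fills in before calling the computational routine (a minimal C sketch; the field names are illustrative, not part of the talk):

    typedef struct {
        int p;          /* number of processes                          */
        int r, c;       /* rows and columns of the logical process mesh */
        int b;          /* computational block size                     */
        int comm_b;     /* communication block size                     */
        int algorithm;  /* which algorithm (polyalgorithms)             */
        int library;    /* which routine or library (polylibraries)     */
    } AlgorithmicParameters;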


16 December 2005 Universidad de Murcia 21

Modelling Linear Algebra Routines

Cost of a parallel program:
tparallel = tarith + tcomm + toverhead - toverlap
where tarith is the arithmetic time, tcomm the communication time, toverhead the overhead due to synchronization, imbalance, process creation, ..., and toverlap the overlapping of communication and computation.


16 December 2005 Universidad de Murcia 22

Modelling Linear Algebra Routines

Estimation of the time:
tparallel = tarith + tcomm
Considering computation and communication divided into a number of steps:
tparallel = tarith,1 + tcomm,1 + tarith,2 + tcomm,2 + ...
and for each part of the formula, the value of the process which gives the highest value is taken.
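A minimal sketch in C of this estimation rule, assuming the modelled per-step, per-process arithmetic and communication times are already available (the array layout and names are illustrative):

    /* Estimate tparallel as the sum over steps of the slowest process,
     * separately for the arithmetic and the communication part of each step. */
    double estimate_parallel_time(int steps, int procs,
                                  const double *t_arith,  /* steps x procs */
                                  const double *t_comm)   /* steps x procs */
    {
        double total = 0.0;
        for (int s = 0; s < steps; s++) {
            double max_a = 0.0, max_c = 0.0;
            for (int p = 0; p < procs; p++) {
                double a = t_arith[s * procs + p];
                double c = t_comm[s * procs + p];
                if (a > max_a) max_a = a;
                if (c > max_c) max_c = c;
            }
            total += max_a + max_c;
        }
        return total;
    }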


16 December 2005 Universidad de Murcia 23

Modelling Linear Algebra Routines

The time depends on the problem size (n) and the system size (p):
tparallel(n, p)
But also on some ALGORITHMIC PARAMETERS, like the block size (b) and the number of rows (r) and columns (c) of processors in algorithms for a mesh of processors:
tparallel(n, b, r, c)


16 December 2005 Universidad de Murcia 24

Modelling Linear Algebra Routines

And on some SYSTEM PARAMETERS which reflect the computation and communication characteristics of the system, typically the cost of an arithmetic operation (tc) and the start-up (ts) and word-sending (tw) times:
tparallel(n, AP, SP)


16 December 2005 Universidad de Murcia 25

Modelling Linear Algebra Routines

LU factorisation (Golub - Van Loan), by blocks: A = L U, with A, L and U partitioned into blocks Aij, Lij (L lower triangular) and Uij (U upper triangular).

Step 1 (LU factorisation without blocks): A11 = L11 U11
Step 2 (multiple lower triangular systems): A1i = L11 U1i
Step 3 (multiple upper triangular systems): Ai1 = Li1 U11
Step 4 (update of the south-east blocks): Aij = Aij - Li1 U1j
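A minimal sequential sketch in C of these four steps (no pivoting, and plain loops in place of the BLAS kernels, so it only illustrates the structure of the blocked algorithm, not an optimised implementation):

    /* In-place blocked LU factorisation (no pivoting) of the n x n row-major
     * matrix a, with block size b; L is unit lower triangular, U upper
     * triangular, both stored over a. */
    void block_lu(double *a, int n, int b)
    {
        for (int k = 0; k < n; k += b) {
            int bs = (k + b < n) ? b : n - k;   /* current block size       */
            int e  = k + bs;                    /* end of the diagonal block */

            /* Step 1: unblocked LU of the diagonal block, A11 = L11 U11 */
            for (int p = k; p < e; p++)
                for (int i = p + 1; i < e; i++) {
                    a[i*n + p] /= a[p*n + p];
                    for (int j = p + 1; j < e; j++)
                        a[i*n + j] -= a[i*n + p] * a[p*n + j];
                }

            /* Step 2: multiple lower triangular systems, A1i = L11 U1i */
            for (int j = e; j < n; j++)
                for (int i = k; i < e; i++)
                    for (int p = k; p < i; p++)
                        a[i*n + j] -= a[i*n + p] * a[p*n + j];

            /* Step 3: multiple upper triangular systems, Ai1 = Li1 U11 */
            for (int i = e; i < n; i++)
                for (int j = k; j < e; j++) {
                    for (int p = k; p < j; p++)
                        a[i*n + j] -= a[i*n + p] * a[p*n + j];
                    a[i*n + j] /= a[j*n + j];
                }

            /* Step 4: update of the south-east blocks, Aij = Aij - Li1 U1j */
            for (int i = e; i < n; i++)
                for (int j = e; j < n; j++)
                    for (int p = k; p < e; p++)
                        a[i*n + j] -= a[i*n + p] * a[p*n + j];
        }
    }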


16 December 2005 Universidad de Murcia 26

Modelling Linear Algebra Routines

The execution time is
tsequential(n) = (2/3) n³ tc
if the blocks are of size 1, so that the operations are all with individual elements. If the block size is b, the cost keeps a (2/3) k3 n³ leading term plus lower order terms that depend on b, with k3 and k2 the cost of the operations performed with BLAS 3 or BLAS 2.


16 December 2005 Universidad de Murcia 27

Modelling Linear Algebra Routines

But the cost of different operations of the same level is different, and the theoretical cost could be better modelled by weighting each term with the cost of the basic routine involved: k3,dgemm for the n³ term, and k3,dtrsm and k2,dgetf2 for the lower order terms.
Thus, the number of SYSTEM PARAMETERS increases (one for each basic routine), and ...


16 December 2005 Universidad de Murcia 28

Modelling Linear Algebra Routines

The value of each System Parameter can depend on the problem size (n) and on the value of the Algorithmic Parameters (b), so the terms are weighted by k3,dgemm(n, b), k3,dtrsm(n, b) and k2,dgetf2(n, b).
The formula has the form:
t(n, AP, SP(n, AP))
And what we want is to obtain the values of AP with which the lowest execution time is obtained.


16 December 2005 Universidad de Murcia 29

The values of the System Parameters could be obtained:
with installation routines associated to each linear algebra routine
from information stored when the library was installed in the system, thus generating a hierarchy of libraries with auto-optimization
at execution time, by testing the system conditions prior to the call to the routine

Modelling Linear Algebra Routines


16 December 2005 Universidad de Murcia 30

These values can be obtained as simple values (the traditional method) or as functions of the Algorithmic Parameters.
In this case a multidimensional table of values, as a function of the problem size and the Algorithmic Parameters, is stored.
When a problem of a particular size is solved, the execution time is estimated with the values stored for the size closest to the real size,
and the problem is solved with the values of the Algorithmic Parameters which predict the lowest execution time.

Modelling Linear Algebra Routines
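A minimal sketch in C of this run-time selection, assuming the installation phase stored k3 values indexed by problem size and block size (the table below repeats, for every stored size, the ATLAS values measured on the Pentium III platform later in the talk, and the cost function keeps only the dominant term of the model; everything else is illustrative):

    #include <float.h>
    #include <stdlib.h>

    #define NSIZES  3
    #define NBLOCKS 4
    static const int sizes[NSIZES]   = {512, 1024, 2048};
    static const int blocks[NBLOCKS] = {16, 32, 64, 128};
    /* k3 (microseconds per operation) stored at installation time; the
     * same measured row is repeated here only for brevity. */
    static const double k3[NSIZES][NBLOCKS] = {
        {0.0038, 0.0033, 0.0030, 0.0030},
        {0.0038, 0.0033, 0.0030, 0.0030},
        {0.0038, 0.0033, 0.0030, 0.0030},
    };

    /* Dominant term of the modelled sequential cost, in microseconds. */
    static double predicted_time(int n, double k3val)
    {
        return (2.0 / 3.0) * (double)n * n * n * k3val;
    }

    /* Select the block size predicting the lowest time, using the SP values
     * stored for the installed problem size closest to the real one. */
    int select_block_size(int n)
    {
        int closest = 0;
        for (int i = 1; i < NSIZES; i++)
            if (abs(sizes[i] - n) < abs(sizes[closest] - n))
                closest = i;

        int best_b = blocks[0];
        double best_t = DBL_MAX;
        for (int j = 0; j < NBLOCKS; j++) {
            double t = predicted_time(n, k3[closest][j]);
            if (t < best_t) { best_t = t; best_b = blocks[j]; }
        }
        return best_b;
    }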


16 December 2005 Universidad de Murcia 31

Parallel block LU factorisation: [figure showing the matrix, the processors, and the distribution of computations in the first step]

Modelling Linear Algebra Routines


16 December 2005 Universidad de Murcia 32

Distribution of computations on successive steps: [figures for the second and third steps]

Modelling Linear Algebra Routines


16 December 2005 Universidad de Murcia 33

Modelling Linear Algebra Routines

The cost of parallel block LU factorisation:
Tuning Algorithmic Parameters: block size b; 2D mesh of p processors, p = r x c, d = max(r, c)
System Parameters: cost of arithmetic operations k2,getf2, k3,trsm, k3,gemm; communication parameters ts, tw
TARI: arithmetic cost, with dominant term (2/3) k3,gemm n³ / p plus lower order terms in k3,trsm and k2,getf2 that depend on n, b, r and c
TCOM: communication cost, in terms of n, b, d and p, weighted by ts and tw


16 December 2005 Universidad de Murcia 34

Modelling Linear Algebra Routines

The cost of parallel block QR factorisation:
Tuning Algorithmic Parameters: block size b; 2D mesh of p processors, p = r x c
System Parameters: cost of arithmetic operations k2,geqr2, k2,larft, k3,gemm, k3,trmm; communication parameters ts, tw
TARI: arithmetic cost, of order n³/p, in terms of k3,gemm, k3,trmm, k2,geqr2 and k2,larft and of n, b, r and c
TCOM: communication cost, in terms of n, b, r and c, weighted by ts and tw, with logarithmic broadcast terms


16 December 2005 Universidad de Murcia 35

The same basic operations appear repeatedly in different higher level routines: the information generated for one routine (let’s say LU) could be stored and used for other routines (e.g. QR), and a common format is necessary to store the information

Modelling Linear Algebra Routines

Page 36: 16 December 2005Universidad de Murcia1 Research in parallel routines optimization Domingo Giménez Dpto. de Informática y Sistemas Javier Cuenca Dpto. de

16 December 2005 Universidad de Murcia 36

Modelling Linear Algebra Routines

[Figure: Parallel QR factorisation on an IBM-SP2 with 8 processors; execution time in seconds against problem size (512 to 3584) for the “mean”, “model” and “optimum” parameter choices.]

“mean” refers to the mean of the execution times with representative values of the Algorithmic Parameters (execution time which could be obtained by a non-expert user)

“optimum” is the lowest time of all the executions performed with representative values of the Algorithmic Parameters

“model” is the execution time with the values selected with the model


16 December 2005 Universidad de Murcia 37

Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing


16 December 2005 Universidad de Murcia 38

Installation Routines

In the formulas of the parallel block LU factorisation, TARI(n, b, r, c) and TCOM(n, b, r, c), the values of the System Parameters (k2,getf2, k3,trsm, k3,gemm, ts, tw) must be estimated as functions of the problem size (n) and of the Algorithmic Parameters (b, r, c): k3,gemm(n, b, r, c), k3,trsm(n, b, r, c), k2,getf2(n, b, r, c), ts(n, b, r, c), tw(n, b, r, c).


16 December 2005 Universidad de Murcia 39

Installation Routines

By running, at installation time, Installation Routines associated to the linear algebra routine, and storing the information generated to be used at running time.
Each linear algebra routine must be designed together with the corresponding installation routines, and the installation process must be detailed.


16 December 2005 Universidad de Murcia 40

Installation Routines

k3,gemm(n, b, r, c) is estimated by performing matrix-matrix multiplications and updates of size (n/r x b)(b x n/c).
Because during the execution the size of the matrix to work with decreases, different values can be estimated for different problem sizes, and the formula can be modified to include the possibility of these estimations with different values, for example by splitting the formula into four formulas with different problem sizes.


16 December 2005 Universidad de Murcia 41

Installation Routines

For k3,trsm(n, b, r, c), two multiple triangular systems are solved, one upper triangular of size b x n/c, and another lower triangular of size n/r x b.
Thus, two parameters are estimated, one of them depending on n, b and c, and the other depending on n, b and r.
As for the previous parameter, values can be obtained for different problem sizes.


16 December 2005 Universidad de Murcia 42

Installation Routines

k2,getf2(n, b, r, c) corresponds to a level 2 sequential LU factorisation of size b x b.
At installation time each of the basic routines is executed varying the value of the parameters they depend on, with representative values (selected by the routine designer or the system manager),
and the information generated is stored in a file to be used at running time, or in the code of the linear algebra routine before its installation.
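A minimal sketch in C of such an installation routine, assuming a CBLAS implementation is available: it times DGEMM updates like those of the LU algorithm for a few representative block sizes and writes the estimated k3 values to a file (the file name, the representative sizes and the output format are illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    /* Time C = C - A*B with A (m x b) and B (b x m), as in the LU update,
     * and store the estimated cost per basic operation for each block size. */
    int main(void)
    {
        const int m = 1024;                     /* representative panel size */
        const int bs[] = {16, 32, 64, 128};
        FILE *f = fopen("lu_install_k3.dat", "w");
        if (!f) return 1;

        for (int i = 0; i < 4; i++) {
            int b = bs[i];
            double *A = malloc(sizeof(double) * m * b);
            double *B = malloc(sizeof(double) * b * m);
            double *C = malloc(sizeof(double) * m * m);
            for (int j = 0; j < m * b; j++) A[j] = B[j] = 1.0;
            for (int j = 0; j < m * m; j++) C[j] = 1.0;

            clock_t t0 = clock();
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        m, m, b, -1.0, A, b, B, m, 1.0, C, m);
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

            /* k3 = time per floating-point operation (2*m*m*b flops) */
            fprintf(f, "b=%d k3=%e\n", b, secs / (2.0 * m * m * b));
            free(A); free(B); free(C);
        }
        fclose(f);
        return 0;
    }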


16 December 2005 Universidad de Murcia 43

Installation Routines

ts(n, b, r, c) and tw(n, b, r, c) appear in communications of three types:
in one of them a block of size b x b is broadcast in a row, and the parameter depends on b and c;
in another a block of size b x b is broadcast in a column, and the parameter depends on b and r;
and in the other, blocks of sizes b x n/c and n/r x b are broadcast in each one of the columns and rows of processors; these parameters depend on n, b, r and c.


16 December 2005 Universidad de Murcia 44

In practice each System Parameter depends on a smaller number of Algorithmic Parameters, but this is known only after the installation process is completed.
The routine designer also designs the installation process, and can use his experience to guide the installation.
The basic installation process can be designed allowing the intervention of the system manager.

Installation Routines


16 December 2005 Universidad de Murcia 45

Some results in different systems (physical and logical platform)

Values of k3_DTRMM (≈ k3_DGEMM) on the different platforms (in microseconds)

Installation Routines

System   Library    n              b=16     b=32     b=64     b=128
SUN1     refBLAS    512,...,4096   0.0200   0.0200   0.0220   0.0280
SUN1     macBLAS    512,...,4096   0.0120   0.0110   0.0110   0.0110
SUN1     ATLAS      512,...,4096   0.0070   0.0060   0.0060   0.0060
SUN5     refBLAS    512,...,4096   0.0120   0.0130   0.0140   0.0150
SUN5     macBLAS    512,...,4096   0.0060   0.0050   0.0050   0.0050
SUN5     ATLAS      512,...,4096   0.0040   0.0032   0.0025   0.0025
PIII     ATLAS      512,...,4096   0.0038   0.0033   0.0030   0.0030
PPC      macBLAS    512,...,4096   0.0023   0.0019   0.0018   0.0018
R10K     macBLAS    512,...,4096   0.0070   0.0030   0.0025   0.0025


16 December 2005 Universidad de Murcia 46

Installation Routines

Values of k2_DGEQR2 (≈ k2_DLARFT) on the different platforms (in microseconds), for n = 512,...,4096:
SUN1: refBLAS 0.0200, macBLAS 0.0500, ATLAS 0.0700
SUN5: refBLAS 0.0050, macBLAS 0.0300, ATLAS 0.0500
PIII (ATLAS): 0.0150
PPC (macBLAS): 0.0100
R10K (macBLAS): 0.0250


16 December 2005 Universidad de Murcia 47

Typically the values of the communication parameters are well estimated with a ping-pong

Installation Routines

System       Library    n              ts / tw
cSUN1        MPICH      512,...,4096   170 / 7.0
cPIII        MPICH      512,...,4096   60 / 0.7
IBM-SP2      mac-MPI    512,...,4096   75 / 0.3
Origin 2K    mac-MPI    512,...,4096   20 / 0.1
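A minimal ping-pong sketch in C with MPI that estimates ts and tw from two message sizes (a two-point fit of the linear model t = ts + n*tw; the sizes and repetition count are illustrative):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Ping-pong between ranks 0 and 1; fit t = ts + n*tw from two sizes. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int sizes[2] = {1024, 1024 * 1024};   /* bytes */
        const int reps = 100;
        double t[2];
        char *buf = malloc(sizes[1]);

        for (int k = 0; k < 2; k++) {
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < reps; i++) {
                if (rank == 0) {
                    MPI_Send(buf, sizes[k], MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, sizes[k], MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, sizes[k], MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, sizes[k], MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            t[k] = (MPI_Wtime() - t0) / (2.0 * reps);  /* one-way message time */
        }

        if (rank == 0) {
            double tw = (t[1] - t[0]) / (sizes[1] - sizes[0]);
            double ts = t[0] - tw * sizes[0];
            printf("ts = %e s, tw = %e s/byte\n", ts, tw);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }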


16 December 2005 Universidad de Murcia 48

Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing


16 December 2005 Universidad de Murcia 49

Autotuning routines

Life cycle of an autotuned routine:
DESIGN: modelling the Linear Algebra Routine (LAR)
INSTALLATION: obtaining information from the system
RUN-TIME: selection of the parameter values and execution of the LAR


16 December 2005 Universidad de Murcia 50

DESIGN PROCESS

LAR: Linear Algebra Routine, made by the LAR designer.
Example of LAR: parallel block LU factorisation.


16 December 2005 Universidad de Murcia 51

Modelling the LAR

DESIGN: LAR, modelling the LAR, MODEL.


16 December 2005 Universidad de Murcia 52

Modelling the LAR

MODEL: Texec = f(SP, AP, n)
SP: System Parameters; AP: Algorithmic Parameters; n: problem size
Made by the LAR designer, only once per LAR.


16 December 2005 Universidad de Murcia 53

Modelling the LAR

For the example LAR (parallel block LU factorisation):
SP: k3, k2, ts, tw
AP: p = r x c, b
n: problem size


16 December 2005 Universidad de Murcia 54

Implementation of SP-Estimators

DESIGN: from the LAR and its MODEL, the SP-Estimators are implemented.


16 December 2005 Universidad de Murcia 55

Implementation of SP-Estimators

Estimators of the Arithmetic-SP: computation kernel of the LAR, similar storage scheme, similar quantity of data.
Estimators of the Communication-SP: communication kernel of the LAR, similar kind of communication, similar quantity of data.


16 December 2005 Universidad de Murcia 56

INSTALLATION PROCESS

Installation process: only once per platform, done by the system manager.


16 December 2005 Universidad de Murcia 57

Estimation of the Static-SP

INSTALLATION: the SP-Estimators are run, using the Basic Libraries and the Installation-File, and their results are stored in the Static-SP-File.


16 December 2005 Universidad de Murcia 58

Estimation of the Static-SP

Basic Libraries:
basic communication library: MPI, PVM
basic linear algebra library: reference BLAS, machine-specific BLAS, ATLAS
Installation File: the SP values are obtained using the information (n and AP values) of this file.


16 December 2005 Universidad de Murcia 59

Estimation of the Static-SP

Platform: cluster of Pentium III + Fast Ethernet. Basic libraries: ATLAS and MPI.

Estimation of the static-SP tw-static (in μsec):
Message size (KBytes):  32      256     1024    2048
tw-static:              0.700   0.690   0.680   0.675

Estimation of the static-SP k3-static (in μsec):
Block size:  16      32      64      128
k3-static:   0.0038  0.0033  0.0030  0.0027


16 December 2005 Universidad de Murcia 60

RUN-TIME PROCESS

At run time, the MODEL, the Static-SP-File and the Basic Libraries Installation-File produced in the previous phases are used.


16 December 2005 Universidad de Murcia 61

RUN-TIME PROCESS

Selection of the optimum AP: from the MODEL and the stored SP information, the Optimum-AP values are obtained.


16 December 2005 Universidad de Murcia 62

RUN-TIME PROCESS

Selection of the optimum AP, and execution of the LAR with the selected values.


16 December 2005 Universidad de Murcia 63

Autotuning routines

Experiments:
LAR: block LU factorisation
Platforms: IBM SP2, SGI Origin 2000, NoW
Basic libraries: reference BLAS, machine BLAS, ATLAS


16 December 2005 Universidad de Murcia 64

Autotuning routines

LU on IBM SP2: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the value of the parameters).
[Figure: quotients for the sequential case (SEQ) and the parallel cases with 4 and 8 processors (PAR4, PAR8), on a scale from 0 to 1.4.]


16 December 2005 Universidad de Murcia 65

Autotuning routines

LU on Origin 2000: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the value of the parameters).
[Figure: quotients for the sequential case (SEQ) and the parallel cases with 4, 8 and 16 processors (PAR4, PAR8, PAR16), on a scale from 0 to 1.4.]


16 December 2005 Universidad de Murcia 66

Autotuning routines

LU on NoW: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the value of the parameters).
[Figure: quotients between 0.96 and 1.1 for problem sizes 512, 1024, 1536 and 2048, for SEQ BLAS, SEQ ATLAS, PAR4 BLAS and PAR4 ATLAS.]


16 December 2005 Universidad de Murcia 67

Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing


16 December 2005 Universidad de Murcia 68

Modifications to libraries’ hierarchy

In the optimization of routines, individual basic operations appear repeatedly:
LU: TARI in terms of k3,gemm, k3,trsm and k2,getf2, and TCOM in terms of ts and tw (the model shown above)
QR: TARI in terms of k3,gemm, k3,trmm, k2,geqr2 and k2,larft, and TCOM in terms of ts and tw (the model shown above)


16 December 2005 Universidad de Murcia 69

Modifications to libraries’ hierarchy

The information generated to install a routine could be used for a different routine without additional experiments:
ts and tw are obtained when the communication library (MPI, PVM, …) is installed
k3,gemm is obtained when the basic computational library (BLAS, ATLAS, …) is installed


16 December 2005 Universidad de Murcia 70

Modifications to libraries’ hierarchy

To determine:
the type of experiments necessary for the different routines in the library: are ts and tw obtained with a ping-pong, a broadcast, ...? is k3,gemm obtained for small block sizes, ...?
the format in which the data will be stored, to facilitate their use when installing other routines


16 December 2005 Universidad de Murcia 71

Modifications to libraries’ hierarchy

The method could be valid not only for one library (the one I am developing) but also for other libraries that I or somebody else will develop in the future, so the type of experiments and the format in which the data will be stored must be decided by the Parallel Linear Algebra Community
... and the typical hierarchy of libraries would change


16 December 2005 Universidad de Murcia 72

Modifications to libraries’ hierarchy

Typical hierarchy of Parallel Linear Algebra libraries: ScaLAPACK, PBLAS, LAPACK, BLAS, BLACS, Communications.


16 December 2005 Universidad de Murcia 73

Modifications to libraries’ hierarchy

To include installation information in the lowest levels of the hierarchy: self-optimisation information is attached to the BLAS and communications levels.


16 December 2005 Universidad de Murcia 74

Modifications to libraries’ hierarchy

When installing libraries at a higher level this information can be used, and new information is generated (self-optimisation information now also at the LAPACK and BLACS levels).


16 December 2005 Universidad de Murcia 75

Modifications to libraries’ hierarchy

And so on at higher levels (self-optimisation information also at the PBLAS and ScaLAPACK levels).


16 December 2005 Universidad de Murcia 76

Modifications to libraries’ hierarchy

And new libraries with autotuning capacity could be developed on top of the hierarchy, each with its own self-optimisation information: an Inverse Eigenvalue Problem solver, a Least Square Problem solver, a PDE solver.


16 December 2005 Universidad de Murcia 77

Modifications to libraries’ hierarchy

Movement of information between routines in the different levels of the hierarchy:
GETRF, from LAPACK (level 1), has a GETRF_manager with its k3 information and a model of the form Texec = f(k3, n, b).
GEMM, from BLAS (level 0), has a GEMM_manager with its k3 information and its model, Texec = 2 k3 n³.
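A minimal sketch in C of this reuse of information between levels: the higher-level manager reads the k3 value stored when the BLAS level was installed (the file name and format follow the installation sketch above, and the GEMM model Texec = 2*k3*n³ simply weights the 2n³ floating-point operations of a matrix-matrix multiplication by k3):

    #include <stdio.h>

    /* Read the k3 value that the BLAS-level installation wrote for a given
     * block size (format assumed: lines "b=<int> k3=<double>"). */
    static double read_k3(const char *path, int b)
    {
        FILE *f = fopen(path, "r");
        int fb;
        double k3, found = -1.0;
        if (!f) return -1.0;
        while (fscanf(f, "b=%d k3=%lf\n", &fb, &k3) == 2)
            if (fb == b) found = k3;
        fclose(f);
        return found;
    }

    /* Level-0 model: a matrix-matrix multiplication performs 2*n^3 flops. */
    static double gemm_model(double k3, int n)
    {
        return 2.0 * k3 * (double)n * n * n;
    }

    int main(void)
    {
        /* The level-1 manager reuses the k3 information generated when the
         * BLAS level was installed, instead of running new experiments. */
        double k3 = read_k3("lu_install_k3.dat", 64);
        if (k3 > 0.0)
            printf("predicted GEMM time for n=2048: %e s\n", gemm_model(k3, 2048));
        return 0;
    }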


16 December 2005 Universidad de Murcia 78

Modifications to libraries’ hierarchy

Movement of information between routines in the different levels of the hierarchy


16 December 2005 Universidad de Murcia 79

Modifications to libraries’ hierarchy

Movement of information between routines in the different levels of the hierarchy


16 December 2005 Universidad de Murcia 80

Modifications to libraries’ hierarchy

Movement of information between routines in the different levels of the hierarchy


16 December 2005 Universidad de Murcia 81

Modifications to libraries’ hierarchy

Architecture ofa Self OptimizedLinear AlgebraRoutine manager

SP1_information

SP1_manager

Installation_SP1_values

AP1 .......... APz

n1 SP11,1 .... SP1

1,z

nw SP1w,1 .... SP1

w,z

Current_SP1_values

AP1 .......... APz

nc SP1c,1 .... SP1

c,z

SP1_information

SP1_manager

Installation_SP1_values

AP1 .......... APz

n1 SP11,1 .... SP1

1,z

nw SP1w,1 .... SP1

w,z

Current_SP1_values

AP1 .......... APz

nc SP1c,1 .... SP1

c,z

. . .

LAR(n, AP){

...}

Model

Texec = f (SP,AP, n)SP = f(AP,n)

Installation_information

n1 ... nw

AP1

...APz

Current_problem_sizenc

Current_system_informationCurrent_CPUs_availability

%CPU1 ... %CPUp

Current_network_availability

% net1-1 ...%net1-p

...% netP-1 ..%netp-p

SOLAR_manager

Optimum_AP AP0

SP1_information

SP1_manager

Installation_SP1_values

AP1 .......... APz

n1 SP11,1 .... SP1

1,z

nw SP1w,1 .... SP1

w,z

Current_SP1_values

AP1 .......... APz

nc SP1c,1 .... SP1

c,z

SPt_information

SPt_manager

Installation_SP1_values

AP1 .......... APz

n1 SPt1,1 .... SPt

1,z

nw SPtw,1 .... SPt

w,z

Current_SP1_values

AP1 .......... APz

nc SPtc,1 .... SPt

c,z


16 December 2005 Universidad de Murcia 82

Outline ● A little history ● Modelling Linear Algebra Routines ● Installation routines ● Autotuning routines ● Modifications to libraries’ hierarchy ● Polylibraries ● Algorithmic schemes ● Heterogeneous systems ● Hybrid programming ● Peer to peer computing


16 December 2005 Universidad de Murcia 83

Polylibraries

Different basic libraries can be available: reference BLAS, machine-specific BLAS, ATLAS, ...; MPICH, machine-specific MPI, PVM, ...; reference LAPACK, machine-specific LAPACK, ...; ScaLAPACK, PLAPACK, ...
The idea: to use a number of different basic libraries to develop a polylibrary.


16 December 2005 Universidad de Murcia 84

Polylibraries

Typical parallel linear algebra libraries hierarchy: ScaLAPACK, PBLAS, LAPACK, BLAS, BLACS, MPI, PVM, ...


16 December 2005 Universidad de Murcia 85

Polylibraries

A possible parallel linear algebra polylibraries hierarchy: at the BLAS level, reference BLAS, machine BLAS or ATLAS can be used.


16 December 2005 Universidad de Murcia 86

Polylibraries

A possible parallel linear algebra polylibraries hierarchy: at the message-passing level, machine MPI, LAM, MPICH or PVM can be used.


16 December 2005 Universidad de Murcia 87

Polylibraries

A possible parallel linear algebra polylibraries hierarchy: at the LAPACK level, machine LAPACK, ESSL or reference LAPACK can be used.


16 December 2005 Universidad de Murcia 88

Polylibraries

And at the ScaLAPACK level, machine ScaLAPACK, ESSL or reference ScaLAPACK can be used, on top of PBLAS, BLACS and the libraries of the lower levels.


16 December 2005 Universidad de Murcia 89

Polylibraries

The advantages of polylibraries:
a library optimised for the system might not be available
the characteristics of the system can change
which library is the best may vary according to the routines and the systems
even for different problem sizes or different data access schemes the preferred library can change
in parallel systems the file system can be shared by processors of different types


16 December 2005 Universidad de Murcia 90

Architecture of a Polylibrary

Library_1


16 December 2005 Universidad de Murcia 91

Architecture of a Polylibrary

The installation of Library_1 generates its installation file LIF_1.


16 December 2005 Universidad de Murcia 92

Architecture of a Polylibrary

LIF_1 stores performance information for each routine; for example, for DGEMM, the Mflops obtained for each combination of m and n in {20, 40, 80}.


16 December 2005 Universidad de Murcia 93

Architecture of a Polylibrary

And, for DROT, the Mflops obtained for each n in {100, 200, 400} and each leading dimension in {1, 100, 200}.


16 December 2005 Universidad de Murcia 94

Architecture of a Polylibrary

A second library, Library_2, is added.


16 December 2005 Universidad de Murcia 95

Architecture of a Polylibrary

The installation of Library_2 generates LIF_2.


16 December 2005 Universidad de Murcia 96

Architecture of a Polylibrary

A third library, Library_3, is added.


16 December 2005 Universidad de Murcia 97

Architecture of a Polylibrary

The installation of Library_3 generates LIF_3.


16 December 2005 Universidad de Murcia 98

Architecture of a Polylibrary

The PolyLibrary offers interface routines (interface routine_1, interface routine_2, ...) on top of the installed libraries and their LIFs.


16 December 2005 Universidad de Murcia 99

Architecture of a Polylibrary

Each interface routine of the PolyLibrary decides, from the installation information of the LIFs, which library to call:

interface routine_1:
    if n < value
        call routine_1 from Library_1
    else
        depending on the data storage
            call routine_1 from Library_1
            or call routine_1 from Library_2
    ...
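A minimal sketch in C of such an interface routine, with a hypothetical routine_1 offered by two libraries and the selection data filled at installation time from the LIFs (names, threshold and selection rule are illustrative assumptions):

    /* Two hypothetical versions of the same routine coming from different
     * basic libraries; trivial stubs here, real library entry points in a
     * real polylibrary. */
    static void routine_1_lib1(double *x, int n) { for (int i = 0; i < n; i++) x[i] *= 2.0; }
    static void routine_1_lib2(double *x, int n) { for (int i = 0; i < n; i++) x[i] *= 2.0; }

    /* Selection information filled at installation time from LIF_1 and LIF_2. */
    typedef struct {
        int n_threshold;              /* below this size Library_1 is preferred */
        int prefer_lib2_for_strided;  /* data-storage dependent preference      */
    } routine_1_lif;

    /* Interface routine: chooses the library from the problem size and the
     * data access scheme, as recorded in the installation information. */
    void routine_1(double *x, int n, int stride, const routine_1_lif *lif)
    {
        if (n < lif->n_threshold)
            routine_1_lib1(x, n);
        else if (stride > 1 && lif->prefer_lib2_for_strided)
            routine_1_lib2(x, n);
        else
            routine_1_lib1(x, n);
    }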


16 December 2005 Universidad de Murcia 100

Polylibraries

Combining polylibraries with other optimisation techniques:
polyalgorithms
Algorithmic Parameters: block size, number of processors, logical topology of processors


16 December 2005 Universidad de Murcia 101

Experimental Results

Routines of different levels in the hierarchy:
lowest level: GEMM (matrix-matrix multiplication)
medium level: LU and QR factorisations
highest level: a Lift-and-Project algorithm to solve the inverse additive eigenvalue problem, and an algorithm to solve the Toeplitz least square problem


16 December 2005 Universidad de Murcia 102

Experimental Results

The platforms: SGI Origin 2000, IBM-SP2, and different networks of processors (SUN workstations + Ethernet, PCs + Fast Ethernet, PCs + Myrinet).


16 December 2005 Universidad de Murcia 103

Experimental Results: GEMM

Routine: GEMM (matrix-matrix multiplication)
Platform: five SUN Ultra 1 / one SUN Ultra 5
Libraries: refBLAS, macBLAS, ATLAS1, ATLAS2, ATLAS5
Algorithms and parameters: Strassen (base size), by blocks (block size), direct method


16 December 2005 Universidad de Murcia 104

Experimental Results: GEMM

MATRIX-MATRIX MULTIPLICATION INTERFACE:
if processor is SUN Ultra 5
    if problem size < 600: solve using ATLAS5 and the Strassen method with base size half of the problem size
    else if problem size < 1000: solve using ATLAS5 and the block method with block size 400
    else: solve using ATLAS5 and the Strassen method with base size half of the problem size
else if processor is SUN Ultra 1
    if problem size < 600: solve using ATLAS5 and the direct method
    else if problem size < 1000: solve using ATLAS5 and the Strassen method with base size half of the problem size
    else: solve using ATLAS5 and the direct method
endif


16 December 2005 Universidad de Murcia 105

Experimental Results: GEMM

Times (in seconds) with ATLAS5 and the direct method, the lowest experimental times (Low) and the times with the method, library and parameter selected by the model (Mod):

n      ATLAS5 direct   Low (method, library)         Mod (method, library)
200    0.04            0.04 (direct, ATL5)           0.04 (Strassen s2, ATL5)
600    1.06            1.06 (direct, ATL5)           1.11 (block 400, ATL5)
1000   4.83            4.68 (Strassen s2, ATL5)      4.68 (Strassen s2, ATL5)
1400   13.50           12.53 (Strassen s2, ATL2)     12.58 (Strassen s2, ATL5)
1600   31.02           20.03 (block 400, ATL5)       26.57 (Strassen s2, ATL5)

(s2: Strassen with base size half of the problem size; 400: block size 400)


16 December 2005 Universidad de Murcia 106

Experimental Results: LU

Routine: LU factorisation
Platform: 4 Pentium III + Myrinet
Libraries: ATLAS, BLAS for Pentium II, BLAS for Pentium III


16 December 2005 Universidad de Murcia 107

Experimental Results: LU

The cost of parallel block LU factorisation (the model shown above):
Tuning Algorithmic Parameters: block size b; 2D mesh of p processors, p = r x c, d = max(r, c)
System Parameters: cost of arithmetic operations k2,getf2, k3,trsm, k3,gemm; communication parameters ts, tw


16 December 2005 Universidad de Murcia 108

Experimental Results: LU

Block size (b) and execution time (in seconds) with the parameters selected by the model (mod.), the lowest experimental values (low.) and the theoretical prediction (the.), for each BLAS version:

Library    n      mod. (b, time)   low. (b, time)   the. (b, time)
BLAS-III   512    32, 0.11         32, 0.11         32, 0.13
BLAS-III   1024   32, 0.70         32, 0.70         32, 0.77
BLAS-III   1536   32, 2.13         32, 2.13         32, 2.30
BLAS-II    512    32, 0.11         32, 0.11         32, 0.13
BLAS-II    1024   32, 0.71         32, 0.71         32, 0.77
BLAS-II    1536   32, 2.13         32, 2.13         32, 2.30
ATLAS      512    32, 0.12         32, 0.12         32, 0.13
ATLAS      1024   32, 0.74         32, 0.74         32, 0.79
ATLAS      1536   32, 2.27         64, 2.21         32, 2.36


16 December 2005 Universidad de Murcia 109

Experimental Results: L&P

Routine: Lift-and-Project method for the Inverse Additive Eigenvalue Problem
Platform: dual Pentium III
Library combinations:
La_Re+B_In_Th: reference LAPACK and the installed BLAS which uses threads
La_Re+B_II_Th: reference LAPACK and a freely available BLAS for Pentium II using threads
La_In_Th+B_In_Th: LAPACK and BLAS installed for the use of threads
La_Re+B_In: reference LAPACK and the installed BLAS
La_Re+B_II: reference LAPACK and a freely available BLAS for Pentium II
La_Re+B_III: reference LAPACK and a freely available BLAS for Pentium III
La_In+B_In: LAPACK and BLAS installed in the system and supposedly optimized for the machine


16 December 2005 Universidad de Murcia 110

Experimental Results: L&P

The theoretical model of the sequential algorithm cost: a sum of terms in n³ and in lower powers of n, multiplied by the number of iterations and weighted by the System Parameters:
ksyev (LAPACK)
k3,gemm, k3,diaggemm (BLAS-3)
k1,dot, k1,scal, k1,axpy (BLAS-1)


16 December 2005 Universidad de Murcia 111

Experimental Results: L&P


16 December 2005 Universidad de Murcia 112

Experimental Results: L&P

[Table: execution time in seconds of the different parts of the algorithm (TRACE, ADK, EIGEN, ..., MATMAT, ZKAOA and TOTAL) for each LAPACK/BLAS combination, marking the lowest value per part, with and without threads; the combination giving the lowest time is not the same for every part of the algorithm.]


16 December 2005 Universidad de Murcia 113

Outline ● A little history ● Modelling Linear Algebra Routines ● Installation routines ● Autotuning routines ● Modifications to libraries’ hierarchy ● Polylibraries ● Algorithmic schemes ● Heterogeneous systems ● Hybrid programming ● Peer to peer computing


16 December 2005 Universidad de Murcia 114

Algorithmic schemes

To study ALGORITHMIC SCHEMES, and not individual routines. The study could be useful to:
design libraries to solve problems in different fields: Divide and Conquer, Dynamic Programming, Branch and Bound (La Laguna)
develop SKELETONS which could be used in parallel programming languages: Skil, Skipper, CARAML, P3L, ...


16 December 2005 Universidad de Murcia 115

Dynamic Programming

There are different Parallel Dynamic Programming Schemes. The simple scheme of the “coins problem” is used:
given a quantity C and n coin types of values v = (v1, v2, ..., vn), with a quantity q = (q1, q2, ..., qn) of each type, minimize the number of coins used to give C.
But the granularity of the computation has been varied to study the scheme, not the problem.


Dynamic Programming
●Sequential scheme:
for i = 1 to number_of_decisions
  for j = 1 to problem_size
    obtain the optimum solution with i decisions and problem size j
  endfor
endfor
Complete the table (rows 1…n, one per coin type; columns 1…N, one per quantity) with the formula:
Change[i, j] = min { k + Change[i−1, j − k·v_i] },   k = 0, 1, …, ⌊j / v_i⌋
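A minimal sequential C sketch of this scheme for the coins problem (illustrative code, not the original implementation; it assumes a large quantity q_i of each coin type, so the recurrence above can be applied without the q_i limits):

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

/* Sequential dynamic programming scheme for the coins problem:
 * Change[i][j] = min_{k=0..j/v[i]} ( k + Change[i-1][j - k*v[i]] )
 * Only two rows of the table are kept in memory. */
int min_coins(int C, int n, const int v[])
{
    int *prev = malloc((C + 1) * sizeof(int));
    int *cur  = malloc((C + 1) * sizeof(int));
    prev[0] = 0;                                /* row 0: only quantity 0 */
    for (int j = 1; j <= C; j++) prev[j] = INT_MAX;

    for (int i = 0; i < n; i++) {               /* one decision per coin type */
        for (int j = 0; j <= C; j++) {          /* one column per quantity    */
            int best = prev[j];                 /* k = 0 coins of type i      */
            for (int k = 1; k * v[i] <= j; k++)
                if (prev[j - k * v[i]] != INT_MAX &&
                    prev[j - k * v[i]] + k < best)
                    best = prev[j - k * v[i]] + k;
            cur[j] = best;
        }
        int *tmp = prev; prev = cur; cur = tmp;
    }
    int result = prev[C];
    free(prev); free(cur);
    return result;
}

int main(void)
{
    int v[] = { 1, 2, 5, 10 };
    printf("%d\n", min_coins(38, 4, v));        /* prints 6 (10+10+10+5+2+1) */
    return 0;
}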


Dynamic Programming
●Parallel scheme:
for i = 1 to number_of_decisions
  In Parallel:
    for j = 1 to problem_size
      obtain the optimum solution with i decisions and problem size j
    endfor
  endInParallel
endfor
Each row of the table is computed in parallel: the columns 1 … j … N are distributed among the processes P0, P1, …, PK, and the entries of row i depend on the entries of row i−1 at distances v_i, 2·v_i, 3·v_i, …


Dynamic Programming
●Message-passing scheme:
In each processor Pj:
  for i = 1 to number_of_decisions
    communication step
    obtain the optimum solution with i decisions and the problem sizes Pj has assigned
  endfor
endInEachProcessor
The N columns of the table are distributed by blocks among the processors P0, P1, …, PK; before computing row i the processors exchange the data of the previous row in the communication step.
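A simplified MPI sketch of this scheme for the coins problem (C code; the data distribution and the content of the communication step are assumptions of this sketch: here the complete previous row is assembled on every process with an all-gather before each computation step):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

int main(int argc, char *argv[])
{
    int p, rank, C = 10000, n = 4, v[] = { 1, 2, 5, 10 };
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Contiguous block of columns 0..C assigned to each process. */
    int *counts = malloc(p * sizeof(int)), *displs = malloc(p * sizeof(int));
    for (int r = 0; r < p; r++) {
        counts[r] = (C + 1) / p + (r < (C + 1) % p ? 1 : 0);
        displs[r] = r ? displs[r - 1] + counts[r - 1] : 0;
    }
    int first = displs[rank], mycols = counts[rank];

    int *prev = malloc((C + 1) * sizeof(int));   /* complete previous row  */
    int *mine = malloc(mycols * sizeof(int));    /* local part of the row  */
    prev[0] = 0;
    for (int j = 1; j <= C; j++) prev[j] = INT_MAX;
    for (int j = 0; j < mycols; j++) mine[j] = prev[first + j];

    for (int i = 0; i < n; i++) {
        /* Communication step: every process assembles the previous row. */
        MPI_Allgatherv(mine, mycols, MPI_INT,
                       prev, counts, displs, MPI_INT, MPI_COMM_WORLD);
        /* Computation step: the columns this process has assigned. */
        for (int j = 0; j < mycols; j++) {
            int col = first + j, best = prev[col];
            for (int k = 1; k * v[i] <= col; k++)
                if (prev[col - k * v[i]] != INT_MAX &&
                    prev[col - k * v[i]] + k < best)
                    best = prev[col - k * v[i]] + k;
            mine[j] = best;
        }
    }
    if (rank == p - 1)                           /* owner of column C       */
        printf("minimum number of coins: %d\n", mine[mycols - 1]);

    free(counts); free(displs); free(prev); free(mine);
    MPI_Finalize();
    return 0;
}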


Dynamic Programming
●Theoretical model:
The parallel time is decomposed step by step, with one arithmetic and one communication term per step:
t_parallel = t_arith,1 + t_comm,1 + t_arith,2 + t_comm,2 + …
● Sequential cost: t_c · Σ_{i=1..n} C² / (2·v_i)
● Computational parallel cost (q_i large): the arithmetic cost of the most loaded process (the previous work divided among the p processes)
● Communication cost (process P_p): a function of p and C in t_s (start-up time) and t_w (word-sending time)
●The only AP is p
●The SPs are t_c, t_s and t_w
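Once the SPs have been estimated, the AP p can be selected by evaluating the modelled time for every feasible number of processes and keeping the best one. The C sketch below is illustrative: the model is passed as a function pointer, and the example_model used in main is only an assumed shape for demonstration, not the exact formula of the slide.

#include <stdio.h>

typedef struct { double tc, ts, tw; } sp_t;   /* System Parameters */

/* The execution-time model t(C, p) is supplied as a function pointer. */
typedef double (*model_t)(long C, int n, const int v[], int p, sp_t sp);

/* Select the AP p (1..max_p) that minimizes the modelled parallel time. */
int select_p(long C, int n, const int v[], int max_p, sp_t sp, model_t model)
{
    int best_p = 1;
    double best_t = model(C, n, v, 1, sp);
    for (int p = 2; p <= max_p; p++) {
        double t = model(C, n, v, p, sp);
        if (t < best_t) { best_t = t; best_p = p; }
    }
    return best_p;
}

/* Illustrative model only: arithmetic work divided among the p processes
 * plus one communication per step.  NOT the exact formula of the slides. */
static double example_model(long C, int n, const int v[], int p, sp_t sp)
{
    double arith = 0.0;
    for (int i = 0; i < n; i++) arith += (double)C * C / (2.0 * v[i]);
    return sp.tc * arith / p + n * (p - 1) * (sp.ts + C * sp.tw / p);
}

int main(void)
{
    int v[] = { 1, 2, 5, 10 };
    sp_t sp = { 1e-8, 1e-4, 1e-7 };
    printf("selected p = %d\n", select_p(100000, 4, v, 8, sp, example_model));
    return 0;
}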


Dynamic Programming
●How to estimate arithmetic SPs:
● Solving a small problem
●How to estimate communication SPs:
● Using a ping-pong (CP1)
● Solving a small problem varying the number of processors (CP2)
● Solving problems of selected sizes in systems of selected sizes (CP3)
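A sketch of the CP1 estimation in C with MPI (illustrative code): a ping-pong between two processes is timed for two message sizes, and ts and tw are obtained from the linear model t(n) = ts + n·tw:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Time one one-way message of nwords doubles between processes 0 and 1. */
static double pingpong(int nwords, int reps)
{
    int rank;
    double *buf = malloc(nwords * sizeof(double));
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int r = 0; r < reps; r++) {
        if (rank == 0) {
            MPI_Send(buf, nwords, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nwords, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nwords, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nwords, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */
    free(buf);
    return t;
}

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double t1 = pingpong(1, 1000), t2 = pingpong(10000, 1000);
    double tw = (t2 - t1) / (10000 - 1);   /* word-sending time */
    double ts = t1 - tw;                   /* start-up time     */
    if (rank == 0)
        printf("ts = %g s   tw = %g s/word\n", ts, tw);
    MPI_Finalize();
    return 0;
}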


Dynamic Programming
●Experimental results:
● Systems:
● SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster) + Ethernet
● PenFE: seven Pentium III + FastEthernet
● Varying:
● The problem size C = 10000, 50000, 100000, 500000
● Large value of q_i
● The granularity of the computation (the cost of a computational step)


Dynamic Programming
●Experimental results:
● CP1:
● ping-pong (point-to-point communication)
● Does not reflect the characteristics of the system
● CP2:
● Executions with the smallest problem (C = 10000), varying the number of processors
● Reflects the characteristics of the system, but the time also changes with C
● Larger installation time (6 and 9 seconds)
● CP3:
● Executions with selected problem sizes (C = 10000, 100000) and system sizes (p = 2, 4, 6), and linear interpolation for other sizes
● Larger installation time (76 and 35 seconds)


Dynamic Programming
●Parameter selection: number of processors giving the lowest time (LT) and number selected with each estimation method (CP1, CP2, CP3), for each problem size C and granularity (gra):

SUNEt:
        C=500000         C=100000         C=50000          C=10000
gra     CP3 CP2 CP1 LT   CP3 CP2 CP1 LT   CP3 CP2 CP1 LT   CP3 CP2 CP1 LT
100      5   1   6   1    5   1   6   5    5   1   6   6    6   1   6   6
 50      4   1   6   1    4   1   6   1    4   1   6   1    6   1   6   6
 10      1   1   1   1    1   1   1   1    1   1   1   1    1   1   1   1

PenFE:
        C=500000         C=100000         C=50000          C=10000
gra     CP3 CP2 CP1 LT   CP3 CP2 CP1 LT   CP3 CP2 CP1 LT   CP3 CP2 CP1 LT
100      7   7   7   6    7   7   7   6    7   7   7   7    7   5   7   6
 50      6   1   7   7    6   1   7   4    6   1   7   7    7   1   7   5
 10      1   1   6   1    1   1   6   1    1   1   6   1    1   1   6   1


Dynamic Programming●Quotient between the execution time with the parameter selected by each one of the selection methods and the lowest execution time, in SUNEt:


Dynamic Programming●Quotient between the execution time with the parameter selected by each one of the selection methods and the lowest execution time, in PenFE:


Dynamic Programming
●Three types of users are considered:
● GU (greedy user):
● Uses all the available processors.
● CU (conservative user):
● Uses half of the available processors.
● EU (expert user):
● Uses a different number of processors depending on the granularity:
● 1 for low granularity
● Half of the available processors for middle granularity
● All the processors for high granularity


Dynamic Programming●Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in SUNEt:


Dynamic Programming●Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in PenFE:


Outline●A little history●Modelling Linear Algebra Routines●Installation routines●Autotuning routines●Modifications to libraries’ hierarchy●Polylibraries●Algorithmic schemes●Heterogeneous systems●Hybrid programming●Peer to peer computing


Heterogeneous algorithms
●Necessary new algorithms with unbalanced distribution of data:
● Different SPs for different processors
● The APs include a vector of selected processors and a vector of block sizes
● Example: Gauss elimination with a block-cyclic distribution of columns, with different block sizes b0, b1, b2, … for the different processors


Heterogeneous algorithms
●Parameter selection:
● RI-THE: obtains p and b from the formula (homogeneous distribution)
● RI-HOM: obtains p and b through a reduced number of executions (homogeneous distribution)
● RI-HET: obtains p and b through a reduced number of executions, and each processor i receives a block size proportional to its relative speed s_i:
b_i = p · b · s_i / Σ_{j=1..p} s_j
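A small C sketch of the RI-HET distribution, assuming (as reconstructed above) that each processor receives a block size proportional to its relative speed while the average block size remains b; all names are illustrative:

#include <stdio.h>

/* RI-HET (sketch): b_i = p * b * s[i] / sum_j s[j], rounded to an integer. */
void het_block_sizes(int p, int b, const double s[], int bi[])
{
    double total = 0.0;
    for (int j = 0; j < p; j++) total += s[j];
    for (int i = 0; i < p; i++)
        bi[i] = (int)(p * b * s[i] / total + 0.5);
}

int main(void)
{
    /* Example: five SUN Ultra 1 (speed 1.0) and one SUN Ultra 5 (2.5x). */
    double s[] = { 1.0, 1.0, 1.0, 1.0, 1.0, 2.5 };
    int bi[6];
    het_block_sizes(6, 32, s, bi);
    for (int i = 0; i < 6; i++) printf("b_%d = %d\n", i, bi[i]);
    return 0;
}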


Heterogeneous algorithms
●Quotient with respect to the lowest experimental execution time for RI-THEO, RI-HOMO and RI-HETE (matrix sizes from 500 to 3000, quotients shown between 0 and 2), in three configurations:
● Homogeneous system: five SUN Ultra 1
● Hybrid system: five SUN Ultra 1 and one SUN Ultra 5
● Heterogeneous system: two SUN Ultra 1 (one manages the file system) and one SUN Ultra 5

Parameter selection at running time
●At DESIGN/INSTALLATION time: the LAR (Linear Algebra Routine) is modelled (MODEL), the SP-Estimators are implemented, and the Static-SP values are estimated with the Basic Libraries Installation-File, producing the Static-SP-File.
●At RUN-TIME: NWS is called and the NWS Information is used for the Dynamic Adjustment of the SPs (Current-SP); with them the Selection of the Optimum AP is carried out (Optimum-AP) and the LAR is executed.


The NWS is called and it reports:
● the fraction of available CPU (fCPU)
● the current word-sending time (tw_current) for a specific problem size and AP values (n0, AP0)
From these values the fraction of available network is calculated.
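A minimal sketch (in C, names illustrative) of how this fraction could be computed, assuming it is defined as the ratio between the word-sending time stored at installation time for (n0, AP0) and the value currently reported by NWS:

/* Fraction of available network (assumed definition): ratio between the
 * installation-time word-sending time for (n0, AP0) and the current one
 * reported by NWS.  A value of 1.0 means an unloaded network.           */
double network_fraction(double tw_static_n0_ap0, double tw_current)
{
    return tw_static_n0_ap0 / tw_current;
}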


Parameter selection at running time
●Situations of the platform load considered (CPU availability and current word-sending time for each group of nodes):

Situation   nodes 1-4          nodes 5-6          nodes 7-8
A           100%   0.7 sec     100%   0.7 sec     100%   0.7 sec
B            80%   0.8 sec     100%   0.7 sec     100%   0.7 sec
C            60%   1.8 sec     100%   0.7 sec     100%   0.7 sec
D            60%   1.8 sec     100%   0.7 sec      80%   0.8 sec
E            60%   1.8 sec     100%   0.7 sec      50%   4.0 sec



The values of the SPs are tuned according to the current situation (Dynamic Adjustment of SP → Current-SP).
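A sketch, under the same assumptions, of a dynamic adjustment that produces the Current-SP values from the Static-SP-File and the fractions of available CPU and network (the exact adjustment used by the framework is not detailed in the slides):

typedef struct { double tc, ts, tw; } sp_t;   /* arithmetic, start-up, word-sending */

/* Dynamic Adjustment of SP (assumed form): scale the static values with the
 * fractions of available CPU and network obtained at run time.             */
sp_t adjust_sp(sp_t static_sp, double f_cpu, double f_network)
{
    sp_t current = static_sp;
    current.tc /= f_cpu;       /* less CPU available -> slower arithmetic   */
    current.ts /= f_network;   /* loaded network     -> larger start-up     */
    current.tw /= f_network;   /* loaded network     -> larger word time    */
    return current;
}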



Parameter selection at running time
●APs selected for each situation of the platform load:

Block size:
   n      A    B    C    D    E
 1024    32   32   64   64   64
 2048    64   64   64  128  128
 3072    64   64  128  128  128

Number of nodes to use, p = r x c:
   n      A     B     C     D     E
 1024   4x2   4x2   2x2   2x2   2x1
 2048   4x2   4x2   2x2   2x2   2x1
 3072   4x2   4x2   2x2   2x2   2x1
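A sketch (in C; every name is illustrative, and the toy_model used in main is only an assumed shape, not the real formula of the modelled routine) of the Selection of Optimum AP step: the execution-time model, evaluated with the Current-SP values, is swept over candidate block sizes and r x c meshes:

#include <stdio.h>

typedef struct { double tc, ts, tw; } sp_t;

/* The LAR execution-time model t(n, b, r, c; SPs) is passed as a pointer. */
typedef double (*lar_model_t)(int n, int b, int r, int c, sp_t sp);

/* Selection of Optimum AP: evaluate the model with the dynamically adjusted
 * SPs over candidate block sizes and candidate r x c meshes of nodes.      */
void select_ap(int n, int max_p, const int blocks[], int nblocks,
               sp_t current_sp, lar_model_t model,
               int *best_b, int *best_r, int *best_c)
{
    double best_t = -1.0;
    for (int i = 0; i < nblocks; i++)
        for (int r = 1; r <= max_p; r++)
            for (int c = 1; r * c <= max_p; c++) {
                double t = model(n, blocks[i], r, c, current_sp);
                if (best_t < 0.0 || t < best_t) {
                    best_t = t;
                    *best_b = blocks[i]; *best_r = r; *best_c = c;
                }
            }
}

/* Illustrative model only, NOT the formula of the modelled routine. */
static double toy_model(int n, int b, int r, int c, sp_t sp)
{
    double p = (double)r * c, steps = (double)n / b;
    return 2.0 * sp.tc * n * (double)n * n / p      /* arithmetic  */
         + (r + c) * steps * sp.ts                  /* start-ups   */
         + (r + c) * steps * b * n * sp.tw / p;     /* words sent  */
}

int main(void)
{
    sp_t current_sp = { 1e-9, 1e-4, 1e-7 };
    int blocks[] = { 32, 64, 128 }, b = 0, r = 0, c = 0;
    select_ap(2048, 8, blocks, 3, current_sp, toy_model, &b, &r, &c);
    printf("selected block size %d and mesh %d x %d\n", b, r, c);
    return 0;
}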


Parameter selection at running time
●Charts compare the Static Model and the Dynamic Model for n = 1024, 2048 and 3072 across the platform load situations A–E (values shown in %).


Work distribution
●There are different possibilities in heterogeneous systems:
● Heterogeneous algorithms (Gauss elimination).
● Homogeneous algorithms and assignation of:
● One process to each processor (LU factorization)
● A variable number of processes to each processor, depending on the relative speed
The general assignation problem is NP-hard ⇒ use of heuristic approximations


Work distribution
●Dynamic Programming (the coins problem scheme): homogeneous algorithm + heterogeneous distribution
● With the homogeneous distribution the columns of the table are assigned directly to the processors P0, P1, …, PK.
● With the heterogeneous distribution the columns are assigned to processes p0, p1, …, pr, which are mapped to processors, possibly several processes per processor (e.g. p0, p1 → P0; p2 → P1; p3, p4, p5 → P3; …; pr−1, pr → PK).


Work distribution
●The model:
t(n, C, v, q, tc(n,C,v,q,p,b,d), ts(n,C,v,q,p,b,d), tw(n,C,v,q,p,b,d))
● Problem size:
● n: number of types of coins
● C: value to give
● v: array of values of the coins
● q: quantity of coins of each type
● Algorithmic parameters:
● p: number of processes
● b: block size (here n/p)
● d: processes-to-processors assignment
● System parameters:
● tc: cost of basic arithmetic operations
● ts: start-up time
● tw: word-sending time


Work distribution
●Theoretical model: the same as for the homogeneous case, because the same homogeneous algorithm is used.
● Sequential cost: t_c · Σ_{i=1..n} C² / (2·v_i)
● Computational parallel cost (q_i large): the arithmetic cost of the most loaded process
● Communication cost (process P_p): a function of p and C in t_s and t_w
●There is a new AP: d
●SPs are now unidimensional (t_c) or bidimensional (t_s, t_w) tables


Work distribution
●Assignment tree (P types of processors and p processes):
● Each level of the tree assigns one more process to one of the P types of processors (1, 2, …, P), so a node at depth k represents an assignment of k processes.
● Some limit in the height of the tree (the number of processes) is necessary.


Work distribution
●Assignment tree (P types of processors and p processes):
● P = 2 and p = 3: 10 nodes
● In general, the number of nodes is the number of combinations with repetition of up to p processes among P types: C(P+p, p)


Work distribution
●Assignment tree. SUNEt: P = 2 types of processors (five SUN Ultra 1 + one SUN Ultra 5):
● Number of nodes: (p+1)(p+2)/2
● When more processes than available processors are assigned to a type of processor, the costs of the operations (SPs) change; in the rest of the tree there is one process per processor.


Work distribution
●Assignment tree. TORC, used with P = 4 types of processors:
● one 1.7 GHz Pentium 4 (only one process can be assigned): type 1
● one 1.2 GHz AMD Athlon: type 2
● one 600 MHz single Pentium III: type 3
● eight 550 MHz dual Pentium III: type 4
● Assignments that would put a second process on the Pentium 4 are not in the tree; when two consecutive processes are assigned to a same node, the values of the SPs change.


Work distribution
●Use Branch and Bound or Backtracking (with node elimination) to search through the tree:
● Use the theoretical execution model to estimate the cost at each node, taking the highest values of the SPs among those of the types of processors considered, multiplied by the number of processes assigned to the most charged processor of that type:
t_c = max_{i=1..p, d_i≠0} { n_{d_i} · t_{c,d_i} }
t_s = max_{i,j=1..p, d_i,d_j≠0} { t_{s,d_i,d_j} },   t_w = max_{i,j=1..p, d_i,d_j≠0} { t_{w,d_i,d_j} }


Work distribution
●Use Branch and Bound or Backtracking (with node elimination) to search through the tree:
● Use the theoretical execution model to obtain a lower bound for each node. For example, with an array of types of processors (1,1,1,2,2,2,3,3,3,4,4,4) with relative speeds s_i, and an array of assignations a = (2,2,3), the array of possible assignations is pa = (0,0,0,1,1,0,1,1,1,1,1,1), and the maximum achievable speed is
s_T = Σ_i pa_i · s_i
The minimum arithmetic cost is obtained from this speed, and the lowest communication costs are obtained from those between the processors in the array of assignations.
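The search can be sketched as a backtracking over that tree. In the C code below the modelled cost and the lower bound are passed as callbacks, since their exact formulas are the ones of the two previous slides; the toy_cost/toy_bound pair and the platform constants used in main are purely illustrative (they even ignore the limit on the number of processors of each type).

#include <stdio.h>

#define MAXP 16

/* d[0..k-1] holds, in non-decreasing order, the type (1..P) of the processor
 * assigned to each process; every node of the tree is a candidate AP.       */
typedef double (*assign_cost_t)(const int d[], int k);
typedef double (*assign_bound_t)(const int d[], int k);

static void search(int P, int max_p, int d[], int k,
                   assign_cost_t cost, assign_bound_t bound,
                   int best_d[], int *best_k, double *best_t)
{
    if (k > 0) {
        double t = cost(d, k);
        if (t < *best_t) {
            *best_t = t; *best_k = k;
            for (int i = 0; i < k; i++) best_d[i] = d[i];
        }
    }
    if (k == max_p) return;                 /* limit on the height of the tree */
    int first = (k == 0) ? 1 : d[k - 1];    /* non-decreasing: no repeated nodes */
    for (int type = first; type <= P; type++) {
        d[k] = type;
        if (bound(d, k + 1) < *best_t)      /* node elimination */
            search(P, max_p, d, k + 1, cost, bound, best_d, best_k, best_t);
    }
}

/* Purely illustrative platform: P = 2 types with relative speeds 1.0 and 2.5. */
static const double SPEED[3] = { 0.0, 1.0, 2.5 };
static const double WORK = 1.0e6, COMM = 50.0;
static const int    NPROC = 6;

static double toy_cost(const int d[], int k)
{
    double s = 0.0;
    for (int i = 0; i < k; i++) s += SPEED[d[i]];
    return WORK / s + COMM * k * (k - 1) / 2.0;   /* arithmetic + communications */
}

/* Lower bound: maximum achievable speed and minimum number of communications
 * of any descendant, in the spirit of the previous slide.                    */
static double toy_bound(const int d[], int k)
{
    double s = 0.0;
    for (int i = 0; i < k; i++) s += SPEED[d[i]];
    return WORK / (s + (NPROC - k) * 2.5) + COMM * k * (k - 1) / 2.0;
}

int main(void)
{
    int d[MAXP], best_d[MAXP], best_k = 0;
    double best_t = 1.0e30;
    search(2, NPROC, d, 0, toy_cost, toy_bound, best_d, &best_k, &best_t);
    printf("best assignment uses %d processes, types:", best_k);
    for (int i = 0; i < best_k; i++) printf(" %d", best_d[i]);
    printf("  (modelled time %.1f)\n", best_t);
    return 0;
}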


Work distribution
●Theoretical model:
t_parallel = t_arith,1 + t_comm,1 + t_arith,2 + t_comm,2 + …   (one arithmetic and one communication term per step)
● Sequential cost: t_c · Σ_{i=1..n} C² / (2·v_i)
● Computational parallel cost (q_i large): the cost of the most loaded process, with the work of the columns each process has assigned weighted by the t_c of the processor it runs on
● Communication cost: a function of p and C in t_s and t_w, taking the maximum values of the SPs
●The APs are p and the assignation array d
●The SPs are the unidimensional array t_c and the bidimensional arrays t_s and t_w


Work distribution
●How to estimate arithmetic SPs:
● Solving a small problem on each type of processor
●How to estimate communication SPs:
● Using a ping-pong between each pair of processors, and between processes in the same processor (CP1)
● Does not reflect the characteristics of the system
● Solving a small problem varying the number of processors, with linear interpolation (CP2)
● Larger installation time


Work distribution
●Three types of users are considered:
● GU (greedy user):
● Uses all the available processors, with one process per processor.
● CU (conservative user):
● Uses half of the available processors (the fastest), with one process per processor.
● EU (user expert in the problem, the system and heterogeneous computing):
● Uses a different number of processes and processors depending on the granularity:
● 1 process in the fastest processor, for low granularity
● A number of processes equal to half of the available processors, in the appropriate processors, for middle granularity
● A number of processes equal to the number of processors, in the appropriate processors, for large granularity


Work distribution●Quotient between the execution time with the parameters selected by each one of the selection methods and the modelled users and the lowest execution time, in SUNEt:


Work distribution●Parameters selection, in TORC, with CP2:

C gra LT CP2

50000 10 (1,2) (1,2)

50000 50 (1,2) (1,2,4,4)

50000 100 (1,2) (1,2,4,4)

100000 10 (1,2) (1,2)

100000 50 (1,2) (1,2,4,4)

100000 100 (1,2) (1,2,4,4)

500000 10 (1,2) (1,2)

500000 50 (1,2) (1,2,3,4)

500000 100 (1,2) (1,2,3,4)


Work distribution
●Parameters selection, in TORC (without the 1.7 GHz Pentium 4), with CP2:
● one 1.2 GHz AMD Athlon: type 1
● one 600 MHz single Pentium III: type 2
● eight 550 MHz dual Pentium III: type 3

C gra LT CP2

50000 10 (1,1,2) (1,1,2,3,3,3,3,3,3)

50000 50 (1,1,2) (1,1,2,3,3,3,3,3,3,3,3)

50000 100 (1,1,3,3) (1,1,2,3,3,3,3,3,3,3,3)

100000 10 (1,1,2) (1,1,2)

100000 50 (1,1,3) (1,1,2,3,3,3,3,3,3,3,3)

100000 100 (1,1,3) (1,1,2,3,3,3,3,3,3,3,3)

500000 10 (1,1,2) (1,1,2)

500000 50 (1,1,2) (1,1,2,3)

500000 100 (1,1,2) (1,1,2)


Work distribution●Quotient between the execution time with the parameters selected by each one of the selection methods and the modelled users and the lowest execution time, in TORC:


Work distribution●Quotient between the execution time with the parameters selected by each one of the selection methods and the modelled users and the lowest execution time, in TORC (without the 1.7 Ghz Pentium 4):


Outline●A little history●Modelling Linear Algebra Routines●Installation routines●Autotuning routines●Modifications to libraries’ hierarchy●Polylibraries●Algorithmic schemes●Heterogeneous systems●Hybrid programming●Peer to peer computing


Hybrid programming
OpenMP:
● Fine-grain parallelism
● Efficient in SMP
● Sequential and parallel codes are similar
● Tools for development and parallelisation
● Allows run-time scheduling
● Memory allocation can reduce performance
MPI:
● Coarse-grain parallelism
● More portable
● Parallel code very different from the sequential code
● Development and debugging more complex
● Static assignment of processes
● Local memories, which facilitates efficient use


Hybrid programming
Advantages of Hybrid Programming:
● To improve scalability
● When too many tasks produce load imbalance
● Applications with both fine- and coarse-grain parallelism
● Reduction of the code development time
● When the number of MPI processes is fixed
● In case of a mixture of functional and data parallelism


Hybrid programming
Hybrid Programming in the literature:
● Most of the papers are about particular applications
● Some papers present hybrid models
● No theoretical models of the execution time are available


Hybrid programming
Systems:
● Networks of dual Pentiums
● HPC160 (each node with four processors)
● IBM SP
● Blue Horizon (144 nodes, each with 8 processors)
● Earth Simulator (640x8 vector processors)


Hybrid programming


Hybrid programming
Models:
● MPI+OpenMP: OpenMP used for loop parallelisation
● OpenMP+MPI: unsafe threads
● MPI and OpenMP processes in SPMD model: reduces the cost of communications


Hybrid programming


Hybrid programming
●Example: MPI+OpenMP computation of pi (MPI distributes the iterations among the processes, OpenMP parallelises the local loop):

program main
include 'mpif.h'
double precision mypi, pi, h, sum, x, f, a
integer n, myid, numprocs, i, ierr
f(a) = 4.d0 / (1.d0 + a*a)
call MPI_INIT( ierr )
call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
call MPI_BCAST(n,1,MPI_INTEGER,0, & MPI_COMM_WORLD,ierr)
h = 1.0d0/n
sum = 0.0d0
!$OMP PARALLEL DO REDUCTION (+:sum) PRIVATE (x)
do 20 i = myid+1, n, numprocs
x = h * (dble(i) - 0.5d0)
sum = sum + f(x)
20 enddo
!$OMP END PARALLEL DO
mypi = h * sum
call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION, &MPI_SUM,0,MPI_COMM_WORLD,ierr)
call MPI_FINALIZE(ierr)
stop
end


Hybrid programming
It is not clear whether with hybrid programming the execution time would be lower.
Lanucara, Rovida: Conjugate-Gradient


Hybrid programming
It is not clear whether with hybrid programming the execution time would be lower.
Djomehri, Jin: CFD Solver


Hybrid programming
It is not clear whether with hybrid programming the execution time would be lower.
Viet, Yoshinaga, Abderazek, Sowa: Linear system


Hybrid programming
●Matrix-matrix multiplication: pure MPI versus SPMD MPI+OpenMP; the model must decide which is preferable.
● MPI+OpenMP: less memory and fewer communications, but it may have worse memory use.
● In the pure MPI version each node Ni runs several processes (p0, p1, …); in the MPI+OpenMP version each node runs one process with several threads.


Hybrid programming
●In the time theoretical model more Algorithmic Parameters appear:
● 8 processors:
● pure MPI: p = r x s, with 1x8, 2x4, 4x2, 8x1
● hybrid: p = r x s (1x4, 2x2, 4x1) MPI processes and q = u x v (1x2, 2x1) OpenMP threads: a total of 6 configurations
● 16 processors:
● pure MPI: p = r x s, with 1x16, 2x8, 4x4, 8x2, 16x1
● hybrid: p = r x s (1x4, 2x2, 4x1) and q = u x v (1x4, 2x2, 4x1): a total of 9 configurations
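A small sketch of the enumeration of these configurations, assuming the 8 processors of the example are 4 dual-processor nodes (so p = r x s MPI processes over the nodes and q = u x v OpenMP threads inside each node); each configuration would then be evaluated with the execution-time model:

#include <stdio.h>

int main(void)
{
    int nodes = 4, threads_per_node = 2;   /* 8 processors, as in the slide */
    int count = 0;
    for (int r = 1; r <= nodes; r++)
        if (nodes % r == 0)                          /* r x s meshes of nodes */
            for (int u = 1; u <= threads_per_node; u++)
                if (threads_per_node % u == 0) {     /* u x v meshes of threads */
                    int s = nodes / r, v = threads_per_node / u;
                    printf("p = %d x %d processes, q = %d x %d threads\n",
                           r, s, u, v);
                    count++;
                }
    printf("total %d configurations\n", count);      /* prints 6 */
    return 0;
}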


Hybrid programming●And more System Parameters:

● The cost of communications is different inside and outside a node (similar to the heterogeneous case with more than one process per processor)

● The cost of arithmetic operations can vary when the number of threads in the node varies

●Consequently, the algorithms must be recoded and new models of the execution time must be obtained


Hybrid programming
●… and the formulas change: communications between nodes and synchronizations between the threads inside each node appear in the model.
● For some systems 6x1 nodes and 1x6 threads could be better, and for others 1x6 nodes and 6x1 threads.


Hybrid programming●Open problem

● Is it possible to generate automatically MPI+OpenMP programs from MPI programs? Maybe for the SPMD model.

● Or at least for some type of programs, such as matricial problems in meshes of processors?

● And is it possible to obtain the execution time of the MPI+OpenMP program from that of the MPI program and some description of how the time model has been obtained?


Outline●A little history●Modelling Linear Algebra Routines●Installation routines●Autotuning routines●Modifications to libraries’ hierarchy●Polylibraries●Algorithmic schemes●Heterogeneous systems●Hybrid programming●Peer to peer computing


Peer to peer computing●Distributed systems:

● They are inherently heterogeneous and dynamic● But there are other problems:

● Higher communication cost● Special middleware is necessary

● The typical paradigms are master/slave, client/server, where different types of processors (users) are considered.


Peer to peer computing
Peer-to-Peer Computing. Dejan S. Milojicic, Vana Kalogeraki, Rajan Lukose, Kiran Nagaraja, Jim Pruyne, Bruno Richard, Sami Rollins, Zhichen Xu. HP Laboratories Palo Alto, 2002.


Peer to peer computing●Peer to peer:

● All the processors (users) are at the same level (at least initially)● The community selects, in a democratic and continuous way, the

topology of the global network

●Would it be interesting to have a P2P system for computing?●Is some system of this type available?


Peer to peer computing
●Would it be interesting to have a P2P system for computing?
● I think it would be interesting to develop a system of this type
● And to leave the community to decide, in a democratic and continuous way, if it is worthwhile
●Is some system of this type available?
● I think there is no pure P2P system dedicated to computation


Peer to peer computing●… and other people seem to think the same:

● Lichun Ji (2003): “… P2P networks seem to outperform other approaches largely due to the anonymity of the participants in the peer-network, low network costs and the inexpensive disk-space. Trying to apply P2P principles in the area of distributed computation was significantly less successful”

● Arjav J. Chakravarti, Gerald Baumgartner, Mario Lauria (2004): “… current approaches to utilizing desktop resources require either centralized servers or extensive knowledge of the underlying system, limiting their scalability”


Peer to peer computing●There are a lot of tools for Grid Computing:

● Globus (of course), but does Globus provide computational P2P capacity or is it a tool with which P2P computational systems can be developed?

● Netsolve/Gridsolve. Uses a client/server structure.● PlanetLab (at present 387 nodes and 162 sites). In each site one Principal Researcher and one System Administrator.


Peer to peer computing
●For Computation on P2P the shared resources are:
● Information: books, papers, …, in the typical way.
● Libraries: one peer takes a library from another peer.
● A description of the library and of the system is necessary to know if the library fulfils our requests.
● Computation: one peer collaborates to solve a problem proposed by another peer.
● This is the central idea of Computation on P2P…


Peer to peer computing
●Two peers collaborate in the solution of a computational problem, each using its own hierarchy of parallel linear algebra libraries (PLAPACK or ScaLAPACK, PBLAS, BLACS, reference or machine LAPACK, ATLAS, BLAS, and the reference or machine MPI).


Peer to peer computing
●There are:
● Different global hierarchies
● Different libraries in each peer (PLAPACK, ScaLAPACK, PBLAS, BLACS, reference or machine LAPACK, ATLAS, BLAS, reference or machine MPI)


Peer to peer computing
●And the installation information (Inst. Inform.) associated with each library in each peer varies, which makes the efficient use of the theoretical model more difficult than in the heterogeneous case.


Peer to peer computing
●Trust problems appear:
● Does the library solve the problems we require to be solved?
● Is the library optimized for the system it claims to be optimized for?
● Is the installation information correct?
● Is the system stable?
There are trust algorithms for P2P systems; are they (or some modification of them) applicable to these trust problems?


Peer to peer computing
●Each peer would have the possibility of establishing a policy of use:
● The use of the resources could be payable
● The percentage of CPU dedicated to computations for the community
● The type of problems it is interested in
And the MAIN PROBLEM: is it interesting to develop a P2P system for the management and optimization of computational codes?