16 December 2005 Universidad de Murcia 1
Research in parallel routines optimization
Domingo Giménez, Dpto. de Informática y Sistemas
Javier Cuenca, Dpto. de Ingeniería y Tecnología de Computadores
Universidad de Murcia, http://dis.um.es/~domingo
... and more: J. González (Intel Barcelona), L.P. García (Politécnica Cartagena), A.M. Vidal (Politécnica Valencia), G. Carrillo (?), P. Alberti (U. Magallanes), P. Alonso (Politécnica Valencia), J.P. Martínez (U. Miguel Hernández), J. Dongarra (U. Tennessee), K. Roche (?)
Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing
Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing
A little history
Parallel optimization in the past: hand-optimization for each platform:
- Time consuming
- Incompatible with hardware evolution
- Incompatible with changes in the system (architecture and basic libraries)
- Unsuitable for systems with variable workloads
- Misuse by non-expert users
A little history
Initial solutions to this situation: problem-specific solutions, polyalgorithms, installation tests.
A little history
Problem-specific solutions:
Brewer (1994): Sorting Algorithms, Differential Equations
Frigo (1997): FFTW: The Fastest Fourier Transform in the West
LAWRA (1997): Linear Algebra With Recursive Algorithms
A little history
Polyalgorithms: Brewer, FFTW, PHiPAC (1997, linear algebra)
A little history
Installation tests:
ATLAS (2001): dense linear algebra, sequential
Carrillo + Giménez (2000): Gauss elimination, heterogeneous algorithm
I-LIB (2000): some parallel linear algebra routines
A little history
Parallel optimization today:
- Optimization based on computational kernels
- Systematic development of routines
- Auto-optimization of routines
- Middleware for auto-optimization
A little history
Optimization based on computational kernels:
- Efficient kernels (BLAS) and algorithms based on these kernels
- Auto-optimization of the basic kernels (ATLAS)
A little history
Systematic development of routines:
- FLAME project (R. van de Geijn + E. Quintana + …): dense linear algebra, based on object-oriented design
- LAWRA: dense linear algebra, for shared memory systems
A little history
Auto-optimization of routines:
- At installation time: ATLAS (Dongarra + Whaley), I-LIB (Kanada + Katagiri + Kuroda), SOLAR (Cuenca + Giménez + González), LFC (Dongarra + Roche)
- At execution time: solve a reduced problem in each processor (Kalinov + Lastovetsky), or use a system evaluation tool (NWS)
A little history
Middleware for auto-optimization:
- LFC: middleware for dense linear algebra software in clusters
- Hierarchy of autotuning libraries: include in the libraries installation routines to be used in the development of higher level libraries
- FIBER: proposal of general middleware, evolution of I-LIB
- mpC: for heterogeneous systems
A little history
Parallel optimization in the future?
- Skeletons and languages
- Heterogeneous and variable-load systems
- Distributed systems
- P2P computing
A little history
Skeletons and languages: develop skeletons for parallel algorithmic schemes together with execution time models, and provide the users with these libraries (MALLBA, Málaga-La Laguna-Barcelona) or languages (P3L, Pisa).
A little history
Heterogeneous and variable-load systems:
- Heterogeneous algorithms: unbalanced distribution of data (static or dynamic)
- Homogeneous algorithms: more processes than processors, and assignment of processes to processors (static or dynamic)
- Variable-load systems treated as dynamically heterogeneous
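As an illustration of the static unbalanced distribution mentioned above, a small Python sketch; the function name, speeds and sizes are hypothetical, not from the slides:

```python
# Hypothetical sketch: static unbalanced data distribution for a
# heterogeneous algorithm. Each processor receives a share of the n rows
# proportional to its measured relative speed.

def heterogeneous_distribution(n, speeds):
    """Return the number of rows assigned to each processor,
    proportional to its relative speed, summing exactly to n."""
    total = sum(speeds)
    shares = [n * s // total for s in speeds]
    # Hand out the rows lost to integer truncation, fastest first.
    leftover = n - sum(shares)
    for i in sorted(range(len(speeds)), key=lambda i: -speeds[i])[:leftover]:
        shares[i] += 1
    return shares

rows = heterogeneous_distribution(1000, [4, 2, 1, 1])
# the 4x processor receives about half of the matrix rows
```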
A little history
Distributed systems:
- Intrinsically heterogeneous and variable-load
- Very high cost of communications
- Special middleware is necessary (Globus, NWS)
- There can be servers to attend queries from clients
A little history
P2P computing:
- Users can join and leave dynamically
- All the users are of the same type (initially)
- It is distributed, heterogeneous and variable-load
- But special middleware is necessary
Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing
Modelling Linear Algebra Routines
It is necessary to predict the execution time accurately and to select:
- The number of processes
- The number of processors
- Which processors
- The number of rows and columns of processes (the topology)
- The processes-to-processors assignment
- The computational block size (in linear algebra algorithms)
- The communication block size
- The algorithm (polyalgorithms)
- The routine or library (polylibraries)
Modelling Linear Algebra Routines
Cost of a parallel program:

t_parallel = t_arith + t_comm + t_overhead - t_overlap

- t_arith: arithmetic time
- t_comm: communication time
- t_overhead: overhead, for synchronization, imbalance, processes creation, ...
- t_overlap: overlapping of communication and computation
Modelling Linear Algebra Routines
Estimation of the time, considering computation and communication divided in a number of steps:

t_parallel = t_arith + t_comm
t_parallel = t_arith,1 + t_comm,1 + t_arith,2 + t_comm,2 + ...

taking, for each part of the formula, the value of the process which gives the highest one.
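The per-step estimation can be sketched in a few lines of Python (illustrative only; the timings below are made up):

```python
# Minimal sketch of the per-step estimation: the modelled parallel time is
# the sum over steps of the highest per-process arithmetic time plus the
# highest per-process communication time.

def modelled_parallel_time(steps):
    """steps: one list per step of (t_arith, t_comm) pairs, one pair per
    process; for each part the process with the highest value is taken."""
    return sum(max(ta for ta, _ in step) + max(tc for _, tc in step)
               for step in steps)

t = modelled_parallel_time([
    [(1.0, 0.2), (1.5, 0.3), (0.9, 0.2)],   # step 1: 1.5 + 0.3
    [(2.0, 0.1), (1.2, 0.4), (1.0, 0.5)],   # step 2: 2.0 + 0.5
])
```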
Modelling Linear Algebra Routines
The time depends on the problem size (n) and the system size (p):

t_parallel(n, p)

But also on some ALGORITHMIC PARAMETERS, like the block size (b) and the number of rows (r) and columns (c) of processors in algorithms for a mesh of processors:

t_parallel(n, b, r, c)
Modelling Linear Algebra Routines
And on some SYSTEM PARAMETERS, which reflect the computation and communication characteristics of the system. Typically the cost of an arithmetic operation (tc) and the start-up (ts) and word-sending (tw) times:

t_parallel(n, AP, SP)
Modelling Linear Algebra Routines
LU factorisation (Golub - Van Loan): the matrix is partitioned into blocks, A_ij = (L U)_ij, with blocks L_ii unit lower triangular and U_ii upper triangular:

- Step 1: A11 = L11 U11 (LU factorisation without blocks)
- Step 2: A_i1 = L_i1 U11 (multiple lower triangular systems)
- Step 3: A_1j = L11 U_1j (multiple upper triangular systems)
- Step 4: A_ij is updated to A_ij - L_i1 U_1j (update of the south-east blocks)
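A minimal NumPy sketch of the four steps (no pivoting, so a diagonally dominant input is assumed; an illustration, not the code of the slides):

```python
# Illustrative block LU factorisation (no pivoting) following the four
# steps above; b is the computational block size.
import numpy as np

def block_lu(A, b):
    """Block LU: returns a matrix whose strict lower triangle holds L
    (unit diagonal implied) and whose upper triangle holds U."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        # Step 1: unblocked LU of the diagonal block A[k:e, k:e]
        for j in range(k, e):
            A[j+1:e, j] /= A[j, j]
            A[j+1:e, j+1:e] -= np.outer(A[j+1:e, j], A[j, j+1:e])
        # Step 2: triangular solves for the block column (L_i1)
        for j in range(k, e):
            A[e:, j] /= A[j, j]
            A[e:, j+1:e] -= np.outer(A[e:, j], A[j, j+1:e])
        # Step 3: triangular solves for the block row (U_1j)
        for j in range(k, e):
            A[j+1:e, e:] -= np.outer(A[j+1:e, j], A[j, e:])
        # Step 4: rank-b update of the south-east blocks
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A
```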
Modelling Linear Algebra Routines
The execution time: if the blocks are of size 1, the operations are all with individual elements and

t_sequential(n) = (2/3) n³ tc

but if the block size is b the cost is

t_sequential(n) = (2/3) k3 n³ + k3 n² b + (1/3) k2 n b²

with k3 and k2 the cost of the operations performed with BLAS 3 or BLAS 2.
Modelling Linear Algebra Routines
But the cost of different operations of the same level is different, and the theoretical cost can be modelled better as:

t_sequential(n) = (2/3) k3,dgemm n³ + k3,dtrsm n² b + (1/3) k2,dgetf2 n b²

Thus, the number of SYSTEM PARAMETERS increases (one for each basic routine), and ...
Modelling Linear Algebra Routines
The value of each System Parameter can depend on the problem size (n) and on the value of the Algorithmic Parameters (b). The formula takes the form:

t_sequential(n) = (2/3) k3,dgemm(n, b) n³ + k3,dtrsm(n, b) n² b + (1/3) k2,dgetf2(n, b) n b²

that is, t(n, AP, SP(n, AP)). And what we want is to obtain the values of AP with which the lowest execution time is obtained.
Modelling Linear Algebra Routines
The values of the System Parameters can be obtained:
- With installation routines associated to each linear algebra routine
- From information stored when the library was installed in the system, thus generating a hierarchy of libraries with auto-optimization
- At execution time, by testing the system conditions prior to the call to the routine
Modelling Linear Algebra Routines
These values can be obtained as simple values (the traditional method) or as functions of the Algorithmic Parameters. In this case a multidimensional table of values, as a function of the problem size and the Algorithmic Parameters, is stored. When a problem of a particular size is solved, the execution time is estimated with the values of the stored size closest to the real size, and the problem is solved with the values of the Algorithmic Parameters which predict the lowest execution time.
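A hedged sketch of this selection step in Python; the model used here is a simplified stand-in, not the exact formula of the slides, and all table values are hypothetical:

```python
# Hedged sketch of the run-time selection: System Parameter values are read
# from an installation table indexed by problem size (the stored size
# closest to the real one is used), and the Algorithmic Parameters are
# chosen by minimising the modelled time.

def closest_stored(n, table):
    """Return the table entry whose stored problem size is closest to n."""
    return table[min(table, key=lambda m: abs(m - n))]

def model_time(n, b, r, c, sp):
    # Simplified stand-in model: dominant arithmetic term, a start-up term
    # that shrinks with the block size, and a communication-volume term
    # that grows with the mesh imbalance max(r, c).
    k3, ts, tw = sp
    p = r * c
    return ((2 / 3) * k3 * n**3 / p
            + 2 * (n / b) * ts
            + 2 * (n**2 * max(r, c) / p) * tw)

def select_ap(n, p, sp_table, blocks=(16, 32, 64, 128)):
    """Grid-search the block size b and the r x c mesh shape minimising
    the modelled execution time."""
    best = None
    for b in blocks:
        sp = closest_stored(n, sp_table)[b]     # SP may depend on n and b
        for r in (r for r in range(1, p + 1) if p % r == 0):
            t = model_time(n, b, r, p // r, sp)
            if best is None or t < best[0]:
                best = (t, b, r, p // r)
    return best
```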
Modelling Linear Algebra Routines
Parallel block LU factorisation. [Figures: distribution of the matrix and of the computations among the processors in the first, second and third steps.]
Modelling Linear Algebra Routines
The cost of parallel block LU factorisation:

T_ARI = (2/3) k3,gemm n³/p + k3,trsm n² b (r + c)/p + (1/3) k2,getf2 n b²

T_COM = 2 (n/b) ts + 2 (n² d/p) tw

Tuning Algorithmic Parameters: block size b; 2D mesh of p processors, p = r × c, d = max(r, c).
System Parameters: cost of the arithmetic operations (k2,getf2, k3,trsm, k3,gemm); communication parameters (ts, tw).
Modelling Linear Algebra Routines
The cost of parallel block QR factorisation:

T_ARI = (4/3) k3,gemm n³/p + [lower-order terms in n² b and n b², weighted by k3,trmm, k2,geqr2 and k2,larft]
T_COM = [terms in ts and tw, with log r and log c factors from the broadcasts in the rows and columns of the mesh]

Tuning Algorithmic Parameters: block size b; 2D mesh of p processors, p = r × c.
System Parameters: cost of the arithmetic operations (k2,geqr2, k2,larft, k3,gemm, k3,trmm); communication parameters (ts, tw).
The same basic operations appear repeatedly in different higher level routines: the information generated for one routine (say, LU) could be stored and used for other routines (e.g. QR), and a common format is necessary to store the information.
Modelling Linear Algebra Routines
Modelling Linear Algebra Routines
[Figure: Parallel QR factorisation, IBM-SP2, 8 processors: execution time (seconds) versus problem size (512 to 3584) for the "mean", "model" and "optimum" strategies.]
“mean” refers to the mean of the execution times with representative values of the Algorithmic Parameters (the execution time a non-expert user could be expected to obtain)
“optimum” is the lowest time of all the executions performed with representative values of the Algorithmic Parameters
“model” is the execution time with the values selected with the model
Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing
Installation Routines
In the formulas (parallel block LU factorisation)

T_ARI = (2/3) k3,gemm(n, b, r, c) n³/p + k3,trsm(n, b, r, c) n² b (r + c)/p + (1/3) k2,getf2(n, b, r, c) n b²

T_COM = 2 (n/b) ts(n, b, r, c) + 2 (n² d/p) tw(n, b, r, c)

the values of the System Parameters (k2,getf2, k3,trsm, k3,gemm, ts, tw) must be estimated as functions of the problem size (n) and the Algorithmic Parameters (b, r, c).
Installation Routines
The System Parameters are estimated by running, at installation time, the installation routines associated to the linear algebra routine, and by storing the information generated, to be used at running time. Each linear algebra routine must be designed together with the corresponding installation routines, and the installation process must be detailed.
Installation Routines
k3,gemm(n, b, r, c) is estimated by performing matrix-matrix multiplications and updates of size (n/r × b)(b × n/c). Because the size of the matrix to work with decreases during the execution, different values can be estimated for different problem sizes, and the formula can be modified to include the possibility of these estimations with different values, for example by splitting the formula into four formulas with different problem sizes.
Installation Routines
For k3,trsm(n, b, r, c), two multiple triangular systems are solved: one upper triangular of size b × n/c, and another lower triangular of size n/r × b. Thus two parameters are estimated, one depending on n, b and c, and the other depending on n, b and r. As for the previous parameter, values can be obtained for different problem sizes.
Installation Routines
k2,getf2(n, b, r, c) corresponds to a level 2 sequential LU factorisation of size b × b. At installation time each of the basic routines is executed, varying the values of the parameters they depend on, with representative values (selected by the routine designer or the system manager). The information generated is stored in a file to be used at running time, or in the code of the linear algebra routine before its installation.
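A possible shape for such an installation routine, sketched in Python with a pure-Python stand-in for the b × b level 2 factorisation kernel; the file name and the representative block sizes are illustrative:

```python
# Hypothetical sketch of an installation routine: time a b x b level 2 LU
# kernel for representative block sizes and store the derived cost per
# basic operation (k2) for use at run time.
import json, time

def getf2(A):
    """Unblocked LU without pivoting on a list-of-lists matrix
    (a pure-Python stand-in for DGETF2)."""
    n = len(A)
    for j in range(n):
        for i in range(j + 1, n):
            A[i][j] /= A[j][j]
            for k in range(j + 1, n):
                A[i][k] -= A[i][j] * A[j][k]
    return A

def _timed(f, A):
    t0 = time.perf_counter()
    f(A)
    return time.perf_counter() - t0

def install_k2_getf2(block_sizes=(16, 32, 64, 128), reps=3):
    values = {}
    for b in block_sizes:
        A = [[1.0 if i == j else 0.01 for j in range(b)] for i in range(b)]
        best = min(_timed(getf2, [row[:] for row in A]) for _ in range(reps))
        values[b] = best / (2 * b**3 / 3)   # seconds per basic operation
    return values

# The generated information is stored in a file read at running time, e.g.:
# json.dump(install_k2_getf2(), open("getf2_k2.json", "w"))
```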
Installation Routines
ts(n, b, r, c) and tw(n, b, r, c) appear in communications of three types:
- In one of them, a block of size b × b is broadcast in a row; this parameter depends on b and c.
- In another, a block of size b × b is broadcast in a column; the parameter depends on b and r.
- In the other, blocks of sizes b × n/c and n/r × b are broadcast in each one of the columns and rows of processors; these parameters depend on n, b, r and c.
Installation Routines
In practice each System Parameter depends on a smaller number of Algorithmic Parameters, but this is known only after the installation process is completed. The routine designer also designs the installation process, and can use previous experience to guide it. The basic installation process can be designed to allow the intervention of the system manager.
Some results in different systems (physical and logical platform)
Values of k3_DTRMM (≈ k3_DGEMM) on the different platforms (in microseconds)
Installation Routines
For n = 512, ..., 4096:

System  Library   b=16    b=32    b=64    b=128
SUN1    refBLAS   0.0200  0.0200  0.0220  0.0280
SUN1    macBLAS   0.0120  0.0110  0.0110  0.0110
SUN1    ATLAS     0.0070  0.0060  0.0060  0.0060
SUN5    refBLAS   0.0120  0.0130  0.0140  0.0150
SUN5    macBLAS   0.0060  0.0050  0.0050  0.0050
SUN5    ATLAS     0.0040  0.0032  0.0025  0.0025
PIII    ATLAS     0.0038  0.0033  0.0030  0.0030
PPC     macBLAS   0.0023  0.0019  0.0018  0.0018
R10K    macBLAS   0.0070  0.0030  0.0025  0.0025
Installation Routines
Values of k2_DGEQR2 (≈ k2_DLARFT) on the different platforms (in microseconds), independent of the block size, for n = 512, ..., 4096:

System  Library   k2_DGEQR2
SUN1    refBLAS   0.0200
SUN1    macBLAS   0.0500
SUN1    ATLAS     0.0700
SUN5    refBLAS   0.0050
SUN5    macBLAS   0.0300
SUN5    ATLAS     0.0500
PIII    ATLAS     0.0150
PPC     macBLAS   0.0100
R10K    macBLAS   0.0250
Installation Routines
Typically the values of the communication parameters are well estimated with a ping-pong:
System     Library   ts / tw (microseconds), n = 512, ..., 4096
Origin 2K  Mac-MPI   20 / 0.1
IBM-SP2    Mac-MPI   75 / 0.3
cPIII      MPICH     60 / 0.7
cSUN1      MPICH     170 / 7.0
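Given the measured ping-pong times for several message sizes, ts and tw follow from a least-squares fit of t(m) = ts + m · tw; a small illustrative sketch (the sample values are synthetic, not measurements from the slides):

```python
# Illustrative post-processing of ping-pong measurements: fit the linear
# model t(m) = ts + m * tw to (message size, one-way time) pairs by
# ordinary least squares.

def fit_ts_tw(samples):
    """samples: list of (words, seconds). Returns (ts, tw)."""
    n = len(samples)
    sx = sum(m for m, _ in samples)
    sy = sum(t for _, t in samples)
    sxx = sum(m * m for m, _ in samples)
    sxy = sum(m * t for m, t in samples)
    tw = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    ts = (sy - tw * sx) / n
    return ts, tw

# Synthetic data generated with ts = 60e-6 s and tw = 0.7e-6 s per word:
data = [(m, 60e-6 + 0.7e-6 * m) for m in (1024, 4096, 16384, 65536)]
ts, tw = fit_ts_tw(data)
```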
Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing
Autotuning routines
Life cycle: modelling the Linear Algebra Routine (LAR) at DESIGN time; obtaining information from the system at INSTALLATION time; selection of the parameter values and execution of the LAR at RUN-TIME.
DESIGN PROCESS
LAR: Linear Algebra Routine, made by the LAR designer. Example of LAR: parallel block LU factorisation.
Modelling the LAR: from the LAR, a MODEL is built.
Modelling the LAR
MODEL: Texec = f(SP, AP, n), with SP the System Parameters, AP the Algorithmic Parameters, and n the problem size. Made by the LAR designer, only once per LAR.
Modelling the LAR
For the parallel block LU factorisation: SP: k3, k2, ts, tw; AP: p = r × c, b; n: problem size.
Implementation of SP-Estimators: from the MODEL, estimation routines are implemented for the System Parameters.
Implementation of SP-Estimators
Estimators of the Arithmetic SP: the computation kernel of the LAR, with a similar storage scheme and a similar quantity of data.
Estimators of the Communication SP: the communication kernel of the LAR, with a similar kind of communication and a similar quantity of data.
INSTALLATION PROCESS
The installation process is carried out only once per platform, by the system manager.
Estimation of Static-SP: the SP-Estimators are executed against the basic libraries, guided by an Installation File, and the results are stored in a Static-SP-File.
Estimation of Static-SP
Basic libraries: basic communication library (MPI, PVM); basic linear algebra library (reference BLAS, machine-specific BLAS, ATLAS).
Installation File: the SP values are obtained using the information (n and AP values) of this file.
Estimation of Static-SP
Platform: cluster of Pentium III + Fast Ethernet. Basic libraries: ATLAS and MPI.

Estimation of the static SP tw (in microseconds):
Message size (Kbytes)   32      256     1024    2048
tw-static               0.700   0.690   0.680   0.675

Estimation of the static SP k3 (in microseconds):
Block size   16      32      64      128
k3-static    0.0038  0.0033  0.0030  0.0027
RUN-TIME PROCESS: at run time, the information generated at design and installation time is used.
RUN-TIME PROCESS: the optimum AP values are selected using the MODEL and the Static-SP-File.
RUN-TIME PROCESS: after the selection of the optimum AP, the LAR is executed with those values.
Autotuning routines
Experiments. LAR: block LU factorisation. Platforms: IBM SP2, SGI Origin 2000, NoW. Basic libraries: reference BLAS, machine BLAS, ATLAS.
Autotuning routines
LU on IBM SP2: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the values of the parameters). [Bar chart for SEQ, PAR4 and PAR8; vertical axis from 0 to 1.4.]
Autotuning routines
LU on Origin 2000: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the values of the parameters). [Bar chart for SEQ, PAR4, PAR8 and PAR16; vertical axis from 0 to 1.4.]
Autotuning routines
LU on NoW: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the values of the parameters). [Bar chart for SEQ BLAS, SEQ ATLAS, PAR4 BLAS and PAR4 ATLAS at n = 512, 1024, 1536 and 2048; vertical axis from 0.96 to 1.10.]
Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing
Modifications to libraries’ hierarchy
In the optimization of routines the same individual basic operations appear repeatedly. LU:

T_ARI = (2/3) k3,gemm n³/p + k3,trsm n² b (r + c)/p + (1/3) k2,getf2 n b²
T_COM = 2 (n/b) ts + 2 (n² d/p) tw

QR:

T_ARI = (4/3) k3,gemm n³/p + [lower-order terms in n² b and n b², weighted by k3,trmm, k2,geqr2 and k2,larft]
T_COM = [terms in ts and tw, with log r and log c factors from the broadcasts in the rows and columns of the mesh]
Modifications to libraries’ hierarchy
The information generated to install a routine could be used for a different routine without additional experiments: ts and tw are obtained when the communication library (MPI, PVM, ...) is installed, and k3,gemm is obtained when the basic computational library (BLAS, ATLAS, ...) is installed.
Modifications to libraries’ hierarchy
To determine:
- the type of experiments necessary for the different routines in the library: are ts and tw obtained with a ping-pong, a broadcast, ...? Is k3,gemm obtained for small block sizes, ...?
- the format in which the data will be stored, to facilitate their use when installing other routines.
Modifications to libraries’ hierarchy
The method could be valid not only for one library (the one I am developing) but also for other libraries that I or somebody else will develop in the future. Therefore the type of experiments and the format in which the data will be stored must be decided by the parallel linear algebra community ... and the typical hierarchy of libraries would change.
Modifications to libraries’ hierarchy
typical hierarchy of Parallel Linear Algebra libraries
ScaLAPACK
LAPACK
BLAS
PBLAS
BLACS
Communications
Modifications to libraries’ hierarchy
To include installation information in the lowest levels of the hierarchy. [Diagram: ScaLAPACK / LAPACK / PBLAS / BLACS over BLAS and Communications, with Self-Optimisation Information attached to BLAS and Communications.]
Modifications to libraries’ hierarchy
When installing libraries at a higher level this information can be used, and new information is generated. [Diagram: Self-Optimisation Information propagating to the intermediate levels of the hierarchy.]
Modifications to libraries’ hierarchy
And so on at higher levels. [Diagram: Self-Optimisation Information at all levels of the hierarchy, up to ScaLAPACK and LAPACK.]
Modifications to libraries’ hierarchy
And new libraries with autotuning capability could be developed. [Diagram: new higher-level libraries (Inverse Eigenvalue Problem, Least Squares Problem, PDE Solver) built on the hierarchy, each with its own Self-Optimisation Information.]
Modifications to libraries’ hierarchy
Movement of information between routines in the different levels of the hierarchy. [Diagram: GETRF from LAPACK (level 1), whose GETRF_manager holds k3 information and a model T_exec with a (2/3) k3 n³ term plus a lower-order n² b term; GEMM from BLAS (level 0), whose GEMM_manager holds k3 information and the model T_exec = 2 k3 n³.]
Modifications to libraries’ hierarchy
Architecture of a Self-Optimized Linear Algebra Routine manager. [Diagram: a SOLAR_manager combines the LAR(n, AP), the model Texec = f(SP, AP, n) with SP = f(AP, n), the installation information (one SP_manager per System Parameter SP1 ... SPt, each holding Installation_SP_values and Current_SP_values indexed by problem sizes n1 ... nw and AP values AP1 ... APz), the current problem size nc, and the current system information (CPU and network availability), to produce the Optimum_AP AP0.]
Outline●A little history●Modelling Linear Algebra Routines●Installation routines●Autotuning routines●Modifications to libraries’ hierarchy●Polylibraries●Algorithmic schemes●Heterogeneous systems●Hybrid programming●Peer to peer computing
Polylibraries
Different basic libraries can be available:
● Reference BLAS, machine-specific BLAS, ATLAS, …
● MPICH, machine-specific MPI, PVM, …
● Reference LAPACK, machine-specific LAPACK, …
● ScaLAPACK, PLAPACK, …
The goal: use a number of different basic libraries to develop a polylibrary.
Polylibraries
Typical parallel linear algebra library hierarchy
ScaLAPACK
LAPACK
BLAS
PBLAS
BLACS
MPI, PVM, ...
Polylibraries
A possible parallel linear algebra polylibrary hierarchy
ScaLAPACK
LAPACK PBLAS
BLACS
MPI, PVM, ...
ref. BLAS
mac. BLAS
ATLAS
Polylibraries
A possible parallel linear algebra polylibrary hierarchy
ScaLAPACK
LAPACK PBLAS
BLACS
ref. BLAS
mac. BLAS
ATLAS
mac. MPI
LAM
MPICH
PVM
Polylibraries
A possible parallel linear algebra polylibrary hierarchy
ScaLAPACK
mac. LAPACK
PBLAS
BLACS
ref. BLAS
mac. BLAS
ATLAS
mac. MPI
LAM
MPICH
PVM
ESSL
ref. LAPACK
Polylibraries
BLACS
PBLAS
ref. BLAS
mac. BLAS
ATLAS
mac. MPI
LAM
MPICH
PVM
mac. LAPACK
ESSL
ref. LAPACK
mac. ScaLAPACK
ESSL
ref. ScaLAPACK
Polylibraries
The advantages of polylibraries:
● A library optimised for the system might not be available
● The characteristics of the system can change
● Which library is best may vary with the routine and the system
● Even for different problem sizes or different data access schemes the preferred library can change
● In parallel systems the file system may be shared by processors of different types
Architecture of a Polylibrary
Library_1
Architecture of a Polylibrary
Library_1
LIF_1
Installation
Architecture of a Polylibrary
Library_1
LIF_1
Installation
Routine: DGEMM. [Table in LIF_1: Mflops obtained for each combination of matrix sizes m, n in {20, 40, 80}.]
Architecture of a Polylibrary
Library_1
LIF_1
Installation
Routine: DROT. [Table in LIF_1: Mflops obtained for n in {100, 200, 400} and leading dimensions 1, 100, 200.]
Architecture of a Polylibrary
Library_2Library_1
LIF_1
Installation
Architecture of a Polylibrary
Library_2
LIF_2
Library_1
LIF_1
Installation Installation
Architecture of a Polylibrary
Library_2
LIF_2
Library_3Library_1
LIF_1
Installation Installation
Architecture of a Polylibrary
Library_2
LIF_2
Library_3
LIF_3
Installation
Library_1
LIF_1
Installation Installation
Architecture of a Polylibrary
PolyLibrary
interface routine_1interface routine_2
...
Library_2
LIF_2
Library_3
LIF_3
Installation
Library_1
LIF_1
Installation Installation
Architecture of a Polylibrary
PolyLibrary
interface routine_1interface routine_2
...
interface routine_1:
    if n < value
        call routine_1 from Library_1
    else
        depending on data storage
            call routine_1 from Library_1
            or call routine_1 from Library_2
    ...
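One possible Python rendering of such an interface routine, dispatching to the library that the installation files (LIF) recorded as fastest for the current problem size and data layout; all names here are illustrative:

```python
# A possible shape for a polylibrary interface routine: dispatch to the
# basic library that the Library Installation Files (LIF) recorded as
# fastest for the current problem size and data layout.

def make_interface(lif):
    """lif maps (library, layout) -> {problem size: Mflops}, as recorded
    in the installation files."""
    def routine(n, layout, implementations):
        def recorded_speed(lib):
            perf = lif[(lib, layout)]
            closest = min(perf, key=lambda m: abs(m - n))
            return perf[closest]
        best = max(implementations, key=recorded_speed)
        return implementations[best](n)
    return routine
```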
Library_2
LIF_2
Library_3
LIF_3
Installation
Library_1
LIF_1
Installation Installation
Polylibraries●Combining Polylibraries with other Optimisation Techniques:
● Polyalgorithms● Algorithmic Parameters
● Block size● Number of processors● Logical topology of processors
Experimental Results
Routines of different levels in the hierarchy:● Lowest level:
● GEMM: matrix-matrix multiplication● Medium level:
● LU and QR factorisations● Highest level:
● a Lift-and-Project algorithm to solve the inverse additive eigenvalue problem
● an algorithm to solve the Toeplitz least square problem
Experimental Results
The platforms: ● SGI Origin 2000● IBM-SP2● Different networks of processors
● SUN Workstations + Ethernet● PCs + Fast-Ethernet● PCs + Myrinet
Experimental Results: GEMM
Routine: GEMM (matrix-matrix multiplication). Platform: five SUN Ultra 1 + one SUN Ultra 5.
Libraries: refBLAS, macBLAS, ATLAS1, ATLAS2, ATLAS5.
Algorithms and parameters: Strassen (base size), by blocks (block size), direct method.
Experimental Results: GEMM
MATRIX-MATRIX MULTIPLICATION INTERFACE:
if processor is SUN Ultra 5
    if problem-size < 600
        solve using ATLAS5 and Strassen method with base size half of problem size
    else if problem-size < 1000
        solve using ATLAS5 and block method with block size 400
    else
        solve using ATLAS5 and Strassen method with base size half of problem size
    endif
else if processor is SUN Ultra 1
    if problem-size < 600
        solve using ATLAS5 and direct method
    else if problem-size < 1000
        solve using ATLAS5 and Strassen method with base size half of problem size
    else
        solve using ATLAS5 and direct method
    endif
endif
16 December 2005 Universidad de Murcia 105
Experimental Results: GEMM

Method, library and parameter with the lowest time (Low.) and with the model selection (Mod.), and the time of the direct method with ATLAS5, for each size n (times in seconds):

n      200                   600                  1000                  1400                   1600
Low.   ATL5 direct, 0.04     ATL5 direct, 1.06    ATL5 Strassen(n/2), 4.68   ATL2 Strassen(n/2), 12.53   ATL5 block(400), 20.03
Mod.   ATL5 Strassen(n/2), 0.04   ATL5 block(400), 1.11   ATL5 Strassen(n/2), 4.68   ATL5 Strassen(n/2), 12.58   ATL5 Strassen(n/2), 26.57
ATLAS5 direct:  0.04         1.06                 4.83                  13.50                  31.02
16 December 2005 Universidad de Murcia 106
Experimental Results: LU
Routine: LU factorisation
Platform: 4 Pentium III + Myrinet
Libraries: ATLAS, BLAS for Pentium II, BLAS for Pentium III
16 December 2005 Universidad de Murcia 107
The cost of parallel block LU factorisation:
Tuning algorithmic parameters:
  block size: b
  2D mesh of p processors: p = r × c, d = max(r, c)
System parameters:
  cost of arithmetic operations: k2,getf2, k3,trsm, k3,gemm
  communication parameters: ts, tw
Experimental Results: LU

T_ARI = (2n³ / 3p) k3,gemm + lower-order terms of order n²b in k3,trsm and nb² in k2,getf2
T_COM = 2 (n/b) d ts + (2 n² d / p) tw, with d = max(r, c)
16 December 2005 Universidad de Murcia 108
Experimental Results: LU

Block size b and time (in seconds) with the theoretically selected parameters (the.), the lowest experimental time (low.) and the model-selected parameters (mod.):

            n = 512                  n = 1024                 n = 1536
            mod.     low.    the.    mod.     low.    the.    mod.     low.    the.
BLAS-III   32/0.11  32/0.11 32/0.13  32/0.70  32/0.70 32/0.77  32/2.13  32/2.13 32/2.30
BLAS-II    32/0.11  32/0.11 32/0.13  32/0.71  32/0.71 32/0.77  32/2.13  32/2.13 32/2.30
ATLAS      32/0.12  32/0.12 32/0.13  32/0.74  32/0.74 32/0.79  32/2.27  64/2.21 32/2.36
16 December 2005 Universidad de Murcia 109
Experimental Results: L&P
Routine: Lift-and-Project method for the Inverse Additive Eigenvalue Problem
Platform: dual Pentium III
Library combinations:
● La_In+B_In: LAPACK and BLAS installed in the system and supposedly optimized for the machine
● La_Re+B_III: reference LAPACK and a freely available BLAS for Pentium III
● La_Re+B_II: reference LAPACK and a freely available BLAS for Pentium II
● La_Re+B_In: reference LAPACK and the installed BLAS
● La_In_Th+B_In_Th: LAPACK and BLAS installed for the use of threads
● La_Re+B_II_Th: reference LAPACK and a freely available BLAS for Pentium II using threads
● La_Re+B_In_Th: reference LAPACK and the installed BLAS which uses threads
16 December 2005 Universidad de Murcia 110
Experimental Results: L&P

The theoretical model of the sequential algorithm cost has the form
  t = iter × [ LAPACK and BLAS-3 part: terms of order n³ in ksyev and k3,gemm and of order n² in k3,diaggemm ]
    + iter × [ BLAS-1 part: terms of order n² and nL in k1,dot, k1,scal and k1,axpy ]

System parameters:
  ksyev (LAPACK)
  k3,gemm, k3,diaggemm (BLAS-3)
  k1,dot, k1,scal, k1,axpy (BLAS-1)
16 December 2005 Universidad de Murcia 111
Experimental Results: L&P
16 December 2005 Universidad de Murcia 112
Experimental Results: L&P
Times (in seconds) of the parts of the routine with each library combination:

                      TOTAL    ZKAO    AMAT    MATMAT   EIG     EIGENADK  TRACE
Lowest                197.06    9.99    6.66    0.62    165.81   12.86     1.10
Lowest with threads   281.70    9.99    6.66    0.62    249.59   13.71     1.10
La_Re+B_In_Th         290.68   11.90   13.74    0.62    249.59   13.71     1.10
La_Re+B_II_Th         288.66    9.99    6.66    0.79    254.34   15.68     1.16
La_In_Th+B_In_Th      308.80   12.34   14.13    0.66    266.63   13.92     1.10
Lowest no threads     201.64   10.44   10.52    0.83    165.81   12.86     1.16
La_Re+B_In            497.59   18.03  123.73    1.21    336.49   16.41     1.69
La_Re+B_II            293.85   10.44   10.52    0.86    255.20   15.65     1.16
La_Re+B_III           264.89   10.46   26.70    0.83    210.85   14.87     1.16
La_In+B_In            294.32   14.22   98.79    0.94    165.81   12.86     1.69
16 December 2005 Universidad de Murcia 113
16 December 2005 Universidad de Murcia 114
Algorithmic schemes
● To study ALGORITHMIC SCHEMES, and not individual routines. The study could be useful to:
  ● Design libraries to solve problems in different fields:
    ● Divide and Conquer, Dynamic Programming, Branch and Bound (La Laguna)
  ● Develop SKELETONS which could be used in parallel programming languages:
    ● Skil, Skipper, CARAML, P3L, …
16 December 2005 Universidad de Murcia 115
Dynamic Programming
● There are different parallel Dynamic Programming schemes.
● The simple scheme of the "coins problem" is used:
  ● Given a quantity C and n coin types of values v = (v1, v2, …, vn), with a quantity q = (q1, q2, …, qn) of each type, minimize the number of coins used to give C.
  ● The granularity of the computation is varied to study the scheme, not the problem.
16 December 2005 Universidad de Murcia 116
Dynamic Programming
● Sequential scheme:

for i = 1 to number_of_decisions
  for j = 1 to problem_size
    obtain the optimum solution with i decisions and problem size j
  endfor
endfor

The table (n rows, one per coin type, and N columns) is completed with the formula:

Change[i][j] = min over k = 0, 1, …, ⌊j/vi⌋ of { k + Change[i−1][j − k·vi] }
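The sequential scheme with the recurrence above can be sketched in C. Unlimited quantities qi are assumed to keep the sketch short, and only two table rows are kept at a time:

```c
#include <limits.h>
#include <stdlib.h>

/* Coins problem: minimum number of coins of values v[0..n-1]
   needed to give the quantity C, using the recurrence
   Change[i][j] = min_{k=0..j/v[i]} { k + Change[i-1][j - k*v[i]] }.
   Returns INT_MAX if C cannot be given. */
int min_coins(const int *v, int n, int C) {
    int *prev = malloc((C + 1) * sizeof(int));
    int *cur  = malloc((C + 1) * sizeof(int));
    /* 0 coin types: only the quantity 0 can be given */
    for (int j = 0; j <= C; j++) prev[j] = (j == 0) ? 0 : INT_MAX;
    for (int i = 0; i < n; i++) {              /* one row per coin type */
        for (int j = 0; j <= C; j++) {
            int best = INT_MAX;
            for (int k = 0; k * v[i] <= j; k++)    /* k = 0..j/v_i */
                if (prev[j - k * v[i]] != INT_MAX &&
                    prev[j - k * v[i]] + k < best)
                    best = prev[j - k * v[i]] + k;
            cur[j] = best;
        }
        int *tmp = prev; prev = cur; cur = tmp;    /* next row */
    }
    int result = prev[C];
    free(prev); free(cur);
    return result;
}
```

In the parallel schemes that follow, the inner j loop is the one distributed among the processes.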
16 December 2005 Universidad de Murcia 117
Dynamic Programming
● Parallel scheme:

for i = 1 to number_of_decisions
  In Parallel:
    for j = 1 to problem_size
      obtain the optimum solution with i decisions and problem size j
    endfor
  endInParallel
endfor

(Each row i of the table is computed in parallel, with the columns j distributed among processors P0, P1, …, PK.)
16 December 2005 Universidad de Murcia 118
Dynamic Programming
● Message-passing scheme:

In each processor Pj:
  for i = 1 to number_of_decisions
    communication step
    obtain the optimum solution with i decisions and the problem sizes Pj has assigned
  endfor
endInEachProcessor

(The N columns of the table are distributed in blocks among processors P0, P1, …, PK.)
16 December 2005 Universidad de Murcia 119
Dynamic Programming
● Theoretical model:
  Sequential cost (one step i): (C² / 2vi) tc
  Computational parallel cost (qi large), per process: (C² / 2vi p) tc
  Communication cost: of the form (p(p−1)/2) ts + (p(p−1)/2)(C/p) tw
  t_parallel = t_arith,1 + t_comm,1 + t_arith,2 + t_comm,2 + …
● The only AP is p
● The SPs are tc, ts and tw
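Selecting the only AP, p, then amounts to evaluating the modelled time for each candidate and keeping the best. A minimal sketch; the exact model form used here (work divided by p plus start-ups) is an illustrative assumption:

```c
/* Illustrative execution-time model: computation spread over p
   processes plus (p-1) start-ups per step.  The constants play the
   role of the SPs (tc, ts); the form is an assumption. */
double model_time(double arith, double tc, double ts, int p) {
    return arith / p * tc + (p - 1) * ts;
}

/* Autotuning step: pick the p in 1..max_p with the lowest
   modelled time. */
int select_p(double arith, double tc, double ts, int max_p) {
    int best_p = 1;
    double best_t = model_time(arith, tc, ts, 1);
    for (int p = 2; p <= max_p; p++) {
        double t = model_time(arith, tc, ts, p);
        if (t < best_t) { best_t = t; best_p = p; }
    }
    return best_p;
}
```

With cheap communications the selection grows towards the maximum p; with expensive ones it collapses to 1, which is exactly the behaviour the granularity experiments below explore.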
16 December 2005 Universidad de Murcia 120
Dynamic Programming
● How to estimate arithmetic SPs: solving a small problem
● How to estimate communication SPs:
  ● Using a ping-pong (CP1)
  ● Solving a small problem varying the number of processors (CP2)
  ● Solving problems of selected sizes in systems of selected sizes (CP3)
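CP3 measures the SPs only for selected problem and system sizes and interpolates for the others. A minimal linear-interpolation sketch (names are illustrative):

```c
/* Linear interpolation of a system parameter measured only at the
   selected sizes sizes[0..m-1] (increasing), with measured values
   values[0..m-1]; sizes outside the measured range are clamped. */
double interpolate_sp(const double *sizes, const double *values,
                      int m, double size) {
    if (size <= sizes[0]) return values[0];
    for (int i = 1; i < m; i++)
        if (size <= sizes[i]) {
            double f = (size - sizes[i-1]) / (sizes[i] - sizes[i-1]);
            return values[i-1] + f * (values[i] - values[i-1]);
        }
    return values[m-1];   /* beyond the last measured size */
}
```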
16 December 2005 Universidad de Murcia 121
Dynamic Programming
● Experimental results:
  ● Systems:
    ● SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster) + Ethernet
    ● PenFE: seven Pentium III + Fast Ethernet
  ● Varying:
    ● The problem size: C = 10000, 50000, 100000, 500000
    ● Large values of qi
    ● The granularity of the computation (the cost of a computational step)
16 December 2005 Universidad de Murcia 122
Dynamic Programming
● Experimental results:
  ● CP1:
    ● ping-pong (point-to-point communication)
    ● does not reflect the characteristics of the system
  ● CP2:
    ● executions with the smallest problem (C = 10000), varying the number of processors
    ● reflects the characteristics of the system, but the time also changes with C
    ● larger installation time (6 and 9 seconds)
  ● CP3:
    ● executions with selected problem sizes (C = 10000, 100000) and system sizes (p = 2, 4, 6), and linear interpolation for other sizes
    ● larger installation time (76 and 35 seconds)
16 December 2005 Universidad de Murcia 123
Dynamic Programming

Parameter selection: number of processors selected with CP1, CP2, CP3 and with the lowest time (LT), for each granularity gra and problem size C:

SUNEt     C = 500.000       100.000           50.000            10.000
gra       CP3 CP2 CP1 LT    CP3 CP2 CP1 LT    CP3 CP2 CP1 LT    CP3 CP2 CP1 LT
100        5   1   6   1     5   1   6   5     5   1   6   6     6   1   6   6
50         4   1   6   1     4   1   6   1     4   1   6   1     6   1   6   6
10         1   1   1   1     1   1   1   1     1   1   1   1     1   1   1   1

PenFE
100        7   7   7   6     7   7   7   6     7   7   7   7     7   5   7   6
50         6   1   7   7     6   1   7   4     6   1   7   7     7   1   7   5
10         1   1   6   1     1   1   6   1     1   1   6   1     1   1   6   1
16 December 2005 Universidad de Murcia 124
Dynamic Programming●Quotient between the execution time with the parameter selected by each one of the selection methods and the lowest execution time, in SUNEt:
16 December 2005 Universidad de Murcia 125
Dynamic Programming●Quotient between the execution time with the parameter selected by each one of the selection methods and the lowest execution time, in PenFE:
16 December 2005 Universidad de Murcia 126
Dynamic Programming
● Three types of users are considered:
  ● GU (greedy user): uses all the available processors
  ● CU (conservative user): uses half of the available processors
  ● EU (expert user): uses a different number of processors depending on the granularity:
    ● 1 for low granularity
    ● half of the available processors for middle granularity
    ● all the processors for high granularity
16 December 2005 Universidad de Murcia 127
Dynamic Programming●Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in SUNEt:
16 December 2005 Universidad de Murcia 128
Dynamic Programming●Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in PenFE:
16 December 2005 Universidad de Murcia 129
16 December 2005 Universidad de Murcia 130
Heterogeneous algorithms
● New algorithms with unbalanced distribution of data are necessary:
  ● different SPs for different processors
  ● the APs include:
    ● a vector of selected processors
    ● a vector of block sizes
(Gauss elimination: cyclic distribution of column blocks of sizes b0, b1, b2, b0, b1, b2, …)
16 December 2005 Universidad de Murcia 131
Heterogeneous algorithms
● Parameter selection:
  ● RI-THE: obtains p and b from the formula (homogeneous distribution)
  ● RI-HOM: obtains p and b through a reduced number of executions (homogeneous distribution)
  ● RI-HET: obtains p and b through a reduced number of executions, and each processor receives a block size proportional to its relative speed:

    bi = ( si / Σ j=1..p sj ) · b · p
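The RI-HET distribution formula can be sketched directly; rounding to integer block sizes is ignored here for clarity:

```c
/* Heterogeneous block sizes: each of the p processors receives a
   block proportional to its relative speed s[i], keeping the total
   work of one cycle, b*p, constant: b_i = b * p * s_i / sum(s_j). */
void het_block_sizes(const double *s, int p, double b, double *bi) {
    double total = 0.0;
    for (int i = 0; i < p; i++) total += s[i];
    for (int i = 0; i < p; i++)
        bi[i] = b * p * s[i] / total;
}
```

With equal speeds this degenerates to the homogeneous distribution bi = b.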
16 December 2005 Universidad de Murcia 132
Heterogeneous algorithms
● Quotient with respect to the lowest experimental execution time of RI-THEO, RI-HOMO and RI-HETE (charts for sizes 500 to 3000), on three systems:
  ● Homogeneous system: five SUN Ultra 1
  ● Hybrid system: five SUN Ultra 1, one SUN Ultra 5
  ● Heterogeneous system: two SUN Ultra 1 (one manages the file system), one SUN Ultra 5
16 December 2005 Universidad de Murcia 133
Parameter selection at running time

DESIGN / INSTALLATION / RUN-TIME scheme:
● Design: the LAR (linear algebra routine) is modelled, giving the MODEL, and the SP-estimators are implemented.
● Installation: the SP-estimators and the Basic Libraries Installation-File are used for the estimation of the static SPs (Static-SP-File).
16 December 2005 Universidad de Murcia 134
16 December 2005 Universidad de Murcia 135
Parameter selection at running time

At run time the NWS is called and it reports:
● the fraction of available CPU (fCPU)
● the current word-sending time (tw_current) for a specific n and AP values (n0, AP0)
Then the fraction of available network is calculated as the quotient between the static word-sending time for (n0, AP0) and tw_current.
16 December 2005 Universidad de Murcia 136
Parameter selection at running time

                 nodes 1-4              nodes 5-6              nodes 7-8
Situation A   CPU 100%, tw 0.7s     CPU 100%, tw 0.7s     CPU 100%, tw 0.7s
Situation B   CPU  80%, tw 0.8s     CPU 100%, tw 0.7s     CPU 100%, tw 0.7s
Situation C   CPU  60%, tw 1.8s     CPU 100%, tw 0.7s     CPU 100%, tw 0.7s
Situation D   CPU  60%, tw 1.8s     CPU 100%, tw 0.7s     CPU  80%, tw 0.8s
Situation E   CPU  60%, tw 1.8s     CPU 100%, tw 0.7s     CPU  50%, tw 4.0s
16 December 2005 Universidad de Murcia 137
16 December 2005 Universidad de Murcia 138
16 December 2005 Universidad de Murcia 139
Parameter selection at running time

Dynamic adjustment of SP: with the NWS information, the static values of the SPs are tuned to the current situation (Current-SP), scaling them by the available fractions of CPU and network.
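A minimal sketch of the adjustment; the exact scaling rule (dividing the static values by the available fractions) is an assumption consistent with the NWS quantities above:

```c
/* Dynamic adjustment of the SPs with the NWS report.  Assumption:
   tc scales with the inverse of the available CPU fraction, and the
   communication SPs with the inverse of the available network
   fraction f_net = tw_static / tw_current. */
typedef struct { double tc, ts, tw; } sp_t;

sp_t adjust_sp(sp_t st, double f_cpu, double tw_current, double tw_static) {
    sp_t cur;
    double f_net = tw_static / tw_current;  /* fraction of available network */
    cur.tc = st.tc / f_cpu;    /* less available CPU -> larger arithmetic cost */
    cur.ts = st.ts / f_net;
    cur.tw = st.tw / f_net;    /* equals tw_current when st.tw == tw_static */
    return cur;
}
```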
16 December 2005 Universidad de Murcia 140
16 December 2005 Universidad de Murcia 141
16 December 2005 Universidad de Murcia 142
Parameter selection at running time

Selection of the optimum AP from the adjusted SPs:

Block size b                         Situation of the platform load
n        A     B     C     D     E
1024    32    32    64    64    64
2048    64    64    64   128   128
3072    64    64   128   128   128

Number of nodes to use, p = r × c     Situation of the platform load
n        A     B     C     D     E
1024   4×2   4×2   2×2   2×2   2×1
2048   4×2   4×2   2×2   2×2   2×1
3072   4×2   4×2   2×2   2×2   2×1
16 December 2005 Universidad de Murcia 143
16 December 2005 Universidad de Murcia 144
Parameter selection at running time

Complete scheme: at design time the LAR is modelled (MODEL) and the SP-estimators are implemented; at installation time the static SPs are estimated (Static-SP-File); at run time the NWS is called, the SPs are adjusted (Current-SP), the optimum AP is selected (Optimum-AP) and the LAR is executed.
16 December 2005 Universidad de Murcia 145
Parameter selection at running time

(Charts for n = 1024, 2048 and 3072 comparing the static model and the dynamic model under the platform load situations A to E.)
16 December 2005 Universidad de Murcia 146
Work distribution
● There are different possibilities in heterogeneous systems:
  ● Heterogeneous algorithms (Gauss elimination)
  ● Homogeneous algorithms and assignation of:
    ● one process to each processor (LU factorization)
    ● a variable number of processes to each processor, depending on the relative speed
● The general assignation problem is NP-complete, so heuristic approximations are used.
16 December 2005 Universidad de Murcia 147
Work distribution
● Dynamic Programming (the coins problem scheme): homogeneous algorithm + heterogeneous distribution.
(The columns of the table are distributed in blocks among processes p0, p1, …, pr, and several processes can be assigned to a same processor: P0 P0 P1 P3 P3 P3 … PS … PK PK.)
16 December 2005 Universidad de Murcia 148
Work distribution
● The model:
  t(n, C, v, q, tc(n,C,v,q,p,b,d), ts(n,C,v,q,p,b,d), tw(n,C,v,q,p,b,d))
● Problem size:
  ● n: number of types of coins
  ● C: value to give
  ● v: array of values of the coins
  ● q: quantity of coins of each type
● Algorithmic parameters:
  ● p: number of processes
  ● b: block size (here n/p)
  ● d: processes-to-processors assignment
● System parameters:
  ● tc: cost of basic arithmetic operations
  ● ts: start-up time
  ● tw: word-sending time
16 December 2005 Universidad de Murcia 149
Work distribution
● Theoretical model: the same as for the homogeneous case, because the same homogeneous algorithm is used:
  Sequential cost (one step i): (C² / 2vi) tc
  Computational parallel cost (qi large), per process: (C² / 2vi p) tc
  Communication cost: of the form (p(p−1)/2) ts + (p(p−1)/2)(C/p) tw
● There is a new AP: d
● The SPs are now unidimensional (tc) or bidimensional (ts, tw) tables
16 December 2005 Universidad de Murcia 150
Work distribution
● Assignment tree (P types of processors and p processes): each level assigns one more process to a type of processor, branching over the types in non-decreasing order so that equivalent assignments are not repeated.
Some limit on the height of the tree (the number of processes) is necessary.
16 December 2005 Universidad de Murcia 151
Work distribution
● Assignment tree (P types of processors and p processes):
  P = 2 and p = 3: 10 nodes
  In general, the number of nodes (the number of assignments of at most p processes among P types, including the root) is C(p+P, P).
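The node count can be checked by enumerating the non-decreasing assignments directly. A sketch (each call counts the current partial assignment as one node):

```c
/* Count the nodes of the assignment tree: partial assignments of up
   to max_p processes among types first_type..P, branching in
   non-decreasing type order.  Call with first_type = 1; the root
   (empty assignment) is counted too.  Matches C(p+P, P). */
long count_nodes(int P, int max_p, int first_type) {
    long nodes = 1;                        /* the current (partial) node */
    if (max_p == 0) return nodes;
    for (int t = first_type; t <= P; t++)  /* non-decreasing branches */
        nodes += count_nodes(P, max_p - 1, t);
    return nodes;
}
```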
16 December 2005 Universidad de Murcia 152
Work distribution
● Assignment tree in SUNEt, P = 2 types of processors (five SUN Ultra 1 + one SUN Ultra 5):
  number of nodes: (p+2)(p+1)/2
  With one process to each processor the tree branches between U1 and U5; when more processes than available processors are assigned to a type of processor, the costs of operations (SPs) change.
16 December 2005 Universidad de Murcia 153
Work distribution
● Assignment tree in TORC, using P = 4 types of processors:
  ● Type 1: one 1.7 GHz Pentium 4 (only one process can be assigned)
  ● Type 2: one 1.2 GHz AMD Athlon
  ● Type 3: one 600 MHz single Pentium III
  ● Type 4: eight 550 MHz dual Pentium III
  Branches assigning a second process to Type 1 are not in the tree; when two consecutive processes are assigned to a same node, the values of the SPs change.
16 December 2005 Universidad de Murcia 154
Work distribution
● Use Branch and Bound or Backtracking (with node elimination) to search through the tree:
  ● Use the theoretical execution model to estimate the cost at each node with the highest values of the SPs among the types of processors considered, multiplying by the number of processes assigned to the processor of this type with most charge:
    tc = max over i = 1, …, p with di ≠ 0 of { pi · tc,di }
    ts = max over i, j = 1, …, p with di, dj ≠ 0 of ts,di,dj
    tw = max over i, j = 1, …, p with di, dj ≠ 0 of tw,di,dj
16 December 2005 Universidad de Murcia 155
Work distribution
● Use Branch and Bound or Backtracking (with node elimination) to search through the tree:
  ● Use the theoretical execution model to obtain a lower bound for each node. For example, with an array of types of processors (1,1,1,2,2,2,3,3,3,4,4,4) with relative speeds si and an array of assignations a = (2,2,3), the array of possible assignations is pa = (0,0,0,1,1,0,1,1,1,1,1,1), and the maximum achievable speed is
    sT = Σ i = 1, …, p of pai · si
  The minimum arithmetic cost is obtained from this speed, and the lowest communication costs are obtained from those between the processors in the array of assignations:
    ts = max over i, j with ai, aj ≠ 0 of ts,ai,aj
    tw = max over i, j with ai, aj ≠ 0 of tw,ai,aj
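The maximum achievable speed sT of a partial assignment can be sketched as follows; the pa computation follows the example above (assigned processors plus the free ones of types not yet ruled out by the non-decreasing order):

```c
/* Maximum achievable speed of a partial assignment a[0..na-1]
   (non-decreasing types) over processors of types type[0..nproc-1]
   with relative speeds s[0..nproc-1].  pa_i = 1 for processors of a
   type >= the last assigned type, and for the processors already
   holding a process of an earlier type.  Types are assumed < 16. */
double max_achievable_speed(const int *type, const double *s, int nproc,
                            const int *a, int na) {
    int last = (na > 0) ? a[na - 1] : 1;
    int assigned[16] = {0}, taken[16] = {0};
    for (int i = 0; i < na; i++) assigned[a[i]]++;
    double sT = 0.0;
    for (int i = 0; i < nproc; i++) {
        int t = type[i];
        if (t >= last) sT += s[i];            /* type still usable: pa_i = 1 */
        else if (taken[t] < assigned[t]) {    /* only the assigned ones count */
            sT += s[i]; taken[t]++;
        }
    }
    return sT;
}
```

With the slide's example (all speeds 1) this gives sT = 8, the number of ones in pa.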
16 December 2005 Universidad de Murcia 156
Work distribution
● Theoretical model:
  Sequential cost (one step i): (C² / 2vi) tc
  Computational parallel cost (qi large), one step: the cost of the portion of columns assigned to each process, with the tc of the processor it runs on; the estimation uses the maximum values of the SPs
  Communication cost: of the form (p(p−1)/2) ts + (p(p−1)/2)(C/p) tw, with ts and tw the maximum values among the processors used
● The APs are p and the assignation array d
● The SPs are the unidimensional array tc and the bidimensional arrays ts and tw
16 December 2005 Universidad de Murcia 157
Work distribution
● How to estimate arithmetic SPs: solving a small problem on each type of processor
● How to estimate communication SPs:
  ● Using a ping-pong between each pair of processors, and between processes in the same processor (CP1): does not reflect the characteristics of the system
  ● Solving a small problem varying the number of processors, with linear interpolation (CP2): larger installation time
16 December 2005 Universidad de Murcia 158
Work distribution
● Three types of users are considered:
  ● GU (greedy user): uses all the available processors, with one process per processor
  ● CU (conservative user): uses half of the available processors (the fastest), with one process per processor
  ● EU (user expert in the problem, the system and heterogeneous computing): uses a different number of processes and processors depending on the granularity:
    ● 1 process on the fastest processor for low granularity
    ● half of the available processors, the appropriate ones, for middle granularity
    ● as many processes as processors, on the appropriate processors, for large granularity
16 December 2005 Universidad de Murcia 159
Work distribution●Quotient between the execution time with the parameters selected by each one of the selection methods and the modelled users and the lowest execution time, in SUNEt:
16 December 2005 Universidad de Murcia 160
Work distribution
● Parameter selection in TORC, with CP2:
C gra LT CP2
50000 10 (1,2) (1,2)
50000 50 (1,2) (1,2,4,4)
50000 100 (1,2) (1,2,4,4)
100000 10 (1,2) (1,2)
100000 50 (1,2) (1,2,4,4)
100000 100 (1,2) (1,2,4,4)
500000 10 (1,2) (1,2)
500000 50 (1,2) (1,2,3,4)
500000 100 (1,2) (1,2,3,4)
16 December 2005 Universidad de Murcia 161
Work distribution
● Parameter selection in TORC (without the 1.7 GHz Pentium 4), with CP2:
  ● Type 1: one 1.2 GHz AMD Athlon
  ● Type 2: one 600 MHz single Pentium III
  ● Type 3: eight 550 MHz dual Pentium III
C gra LT CP2
50000 10 (1,1,2) (1,1,2,3,3,3,3,3,3)
50000 50 (1,1,2) (1,1,2,3,3,3,3,3,3,3,3)
50000 100 (1,1,3,3) (1,1,2,3,3,3,3,3,3,3,3)
100000 10 (1,1,2) (1,1,2)
100000 50 (1,1,3) (1,1,2,3,3,3,3,3,3,3,3)
100000 100 (1,1,3) (1,1,2,3,3,3,3,3,3,3,3)
500000 10 (1,1,2) (1,1,2)
500000 50 (1,1,2) (1,1,2,3)
500000 100 (1,1,2) (1,1,2)
16 December 2005 Universidad de Murcia 162
Work distribution●Quotient between the execution time with the parameters selected by each one of the selection methods and the modelled users and the lowest execution time, in TORC:
16 December 2005 Universidad de Murcia 163
Work distribution●Quotient between the execution time with the parameters selected by each one of the selection methods and the modelled users and the lowest execution time, in TORC (without the 1.7 Ghz Pentium 4):
16 December 2005 Universidad de Murcia 164
16 December 2005 Universidad de Murcia 165
Hybrid programming

OpenMP:                                       MPI:
● fine-grain parallelism                      ● coarse-grain parallelism
● efficient in SMP                            ● more portable
● sequential and parallel codes are similar   ● parallel code very different from the sequential one
● tools for development and parallelisation   ● development and debugging more complex
● allows run-time scheduling                  ● static assignment of processes
● memory allocation can reduce performance    ● local memories, which facilitates their efficient use
16 December 2005 Universidad de Murcia 166
Hybrid programming

Advantages of hybrid programming:
● to improve scalability
● when too many tasks produce load imbalance
● applications with both fine and coarse-grain parallelism
● reduction of the code development time
● when the number of MPI processes is fixed
● in case of a mixture of functional and data parallelism
16 December 2005 Universidad de Murcia 167
Hybrid programming

Hybrid programming in the literature:
● Most of the papers are about particular applications
● Some papers present hybrid models
● No theoretical models of the execution time are available
16 December 2005 Universidad de Murcia 168
Hybrid programming

Systems:
● networks of dual Pentiums
● HPC160 (each node has four processors)
● IBM SP
● Blue Horizon (144 nodes, 8 processors each)
● Earth Simulator (640 × 8 vector processors)
● …
16 December 2005 Universidad de Murcia 169
Hybrid programming
16 December 2005 Universidad de Murcia 170
Hybrid programming

Models:
● MPI+OpenMP: OpenMP used for loop parallelisation inside each MPI process
● OpenMP+MPI: unsafe threads
● MPI and OpenMP processes in SPMD model: reduces the cost of communications
16 December 2005 Universidad de Murcia 171
Hybrid programming
16 December 2005 Universidad de Murcia 172
Hybrid programming

      program main
      include 'mpif.h'
      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
      h = 1.0d0 / n
      sum = 0.0d0
!$OMP PARALLEL DO REDUCTION (+:sum) PRIVATE (x)
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   enddo
!$OMP END PARALLEL DO
      mypi = h * sum
      call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION,
     &                MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      call MPI_FINALIZE(ierr)
      stop
      end
16 December 2005 Universidad de Murcia 173
Hybrid programming

It is not clear whether hybrid programming lowers the execution time (Lanucara, Rovida: Conjugate Gradient).
16 December 2005 Universidad de Murcia 174
Hybrid programming

It is not clear whether hybrid programming lowers the execution time (Djomehri, Jin: CFD solver).
16 December 2005 Universidad de Murcia 175
Hybrid programming

It is not clear whether hybrid programming lowers the execution time (Viet, Yoshinaga, Abderazek, Sowa: linear system).
16 December 2005 Universidad de Murcia 176
Hybrid programming
● Matrix-matrix multiplication: decide which is preferable, pure MPI or SPMD MPI+OpenMP.
  MPI+OpenMP: less memory and fewer communications, but it may make worse use of the memory.
(Figure: the matrix blocks N0, N1, N2 assigned to nodes, with processes/threads p0, p1 inside each node.)
16 December 2005 Universidad de Murcia 177
Hybrid programming
● In the theoretical time model more algorithmic parameters appear:
  8 processors: pure MPI: p = r×s: 1×8, 2×4, 4×2, 8×1
    hybrid: node meshes p = r×s: 1×4, 2×2, 4×1 and thread meshes q = u×v: 1×2, 2×1: total 6 configurations
  16 processors: pure MPI: p = r×s: 1×16, 2×8, 4×4, 8×2, 16×1
    hybrid: p = r×s: 1×4, 2×2, 4×1 and q = u×v: 1×4, 2×2, 4×1: total 9 configurations
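Counting the candidate topologies is just a matter of enumerating the factorizations; a sketch reproducing the counts above:

```c
/* Number of logical 2D meshes r x s with r*s = q: one per divisor
   r of q. */
int count_meshes(int q) {
    int count = 0;
    for (int r = 1; r <= q; r++)
        if (q % r == 0) count++;
    return count;
}

/* Hybrid run: every node mesh can be combined with every thread
   mesh inside a node. */
int count_hybrid_configs(int nodes, int threads_per_node) {
    return count_meshes(nodes) * count_meshes(threads_per_node);
}
```

For 8 processors as 4 dual nodes this gives 3 × 2 = 6 configurations, and for 16 processors as 4 quad nodes 3 × 3 = 9, matching the counts above.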
16 December 2005 Universidad de Murcia 178
Hybrid programming
● And more system parameters:
  ● The cost of communications is different inside and outside a node (similar to the heterogeneous case with more than one process per processor)
  ● The cost of arithmetic operations can vary when the number of threads in the node varies
● Consequently, the algorithms must be recoded and new models of the execution time must be obtained
16 December 2005 Universidad de Murcia 179
Hybrid programming
… and the formulas change: some communications between processes (P0 … P6) become synchronizations between threads inside a node (nodes 1 … 6). For some systems 6×1 nodes and 1×6 threads could be better, and for others 1×6 nodes and 6×1 threads.
16 December 2005 Universidad de Murcia 180
Hybrid programming
● Open problems:
  ● Is it possible to generate MPI+OpenMP programs automatically from MPI programs? Maybe for the SPMD model, or at least for some types of programs, such as matricial problems on meshes of processors.
  ● And is it possible to obtain the execution time of the MPI+OpenMP program from that of the MPI program and some description of how the time model was obtained?
16 December 2005 Universidad de Murcia 181
16 December 2005 Universidad de Murcia 182
Peer to peer computing
● Distributed systems:
  ● They are inherently heterogeneous and dynamic
  ● But there are other problems:
    ● higher communication cost
    ● special middleware is necessary
  ● The typical paradigms are master/slave and client/server, where different types of processors (users) are considered
16 December 2005 Universidad de Murcia 183
Peer to peer computing

Peer-to-Peer Computing. Dejan S. Milojicic, Vana Kalogeraki, Rajan Lukose, Kiran Nagaraja, Jim Pruyne, Bruno Richard, Sami Rollins, Zhichen Xu. HP Laboratories Palo Alto, 2002.
16 December 2005 Universidad de Murcia 184
Peer to peer computing
● Peer to peer:
  ● All the processors (users) are at the same level (at least initially)
  ● The community selects, in a democratic and continuous way, the topology of the global network
● Would it be interesting to have a P2P system for computing?
● Is some system of this type available?
16 December 2005 Universidad de Murcia 185
Peer to peer computing
● Would it be interesting to have a P2P system for computing?
  ● I think it would be interesting to develop a system of this type
  ● And to let the community decide, in a democratic and continuous way, if it is worthwhile
● Is some system of this type available?
  ● I think there is no pure P2P system dedicated to computation
16 December 2005 Universidad de Murcia 186
Peer to peer computing●… and other people seem to think the same:
● Lichun Ji (2003): “… P2P networks seem to outperform other approaches largely due to the anonymity of the participants in the peer-network, low network costs and the inexpensive disk-space. Trying to apply P2P principles in the area of distributed computation was significantly less successful”
● Arjav J. Chakravarti, Gerald Baumgartner, Mario Lauria (2004): “… current approaches to utilizing desktop resources require either centralized servers or extensive knowledge of the underlying system, limiting their scalability”
16 December 2005 Universidad de Murcia 187
Peer to peer computing
● There are a lot of tools for Grid computing:
  ● Globus (of course); but does Globus provide computational P2P capacity, or is it a tool with which P2P computational systems can be developed?
  ● NetSolve/GridSolve: uses a client/server structure
  ● PlanetLab (at present 387 nodes and 162 sites): in each site one Principal Researcher and one System Administrator
16 December 2005 Universidad de Murcia 188
Peer to peer computing
● For computation on P2P the shared resources are:
  ● Information: books, papers, …, in the typical way
  ● Libraries: one peer takes a library from another peer
    ● A description of the library and the system is necessary to know if the library fulfils our requests
  ● Computation: one peer collaborates to solve a problem proposed by another peer
    ● This is the central idea of computation on P2P
16 December 2005 Universidad de Murcia 189
Peer to peer computing
● Two peers collaborate in the solution of a computational problem using the hierarchy of parallel linear algebra libraries, each with its own stack: ScaLAPACK or PLAPACK on top of PBLAS and BLACS, the reference or the machine LAPACK, BLAS or ATLAS, and the reference or the machine MPI.
16 December 2005 Universidad de Murcia 190
Peer to peer computing
● There are different global hierarchies and different libraries in each peer.
16 December 2005 Universidad de Murcia 191
Peer to peer computing
● And the installation information varies (each library in each peer has its own installation information), which makes the efficient use of the theoretical model more difficult than in the heterogeneous case.
16 December 2005 Universidad de Murcia 192
Peer to peer computing
● Trust problems appear:
  ● Does the library solve the problems we require to be solved?
  ● Is the library optimized for the system it claims to be optimized for?
  ● Is the installation information correct?
  ● Is the system stable?
● There are trust algorithms for P2P systems; are they (or some modification of them) applicable to these trust problems?
16 December 2005 Universidad de Murcia 193
Peer to peer computing
● Each peer would have the possibility of establishing a policy of use:
  ● The use of the resources could be payable
  ● The percentage of CPU dedicated to computations for the community
  ● The types of problems it is interested in
● And the MAIN PROBLEM: is it interesting to develop a P2P system for the management and optimization of computational codes?