16 December 2005 Universidad de Murcia 1
Research in parallel routines optimization
Domingo Giménez, Dpto. de Informática y Sistemas
Javier Cuenca, Dpto. de Ingeniería y Tecnología de Computadores
Universidad de Murcia, http://dis.um.es/~domingo
... and more: J. González (Intel Barcelona), L.P. García (Politécnica Cartagena), A.M. Vidal (Politécnica Valencia), G. Carrillo (?), P. Alberti (U. Magallanes), P. Alonso (Politécnica Valencia), J.P. Martínez (U. Miguel Hernández), J. Dongarra (U. Tennessee), K. Roche (?)
Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing
Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing
A little history
Parallel optimization in the past: hand-optimization for each platform:
- Time consuming
- Incompatible with hardware evolution
- Incompatible with changes in the system (architecture and basic libraries)
- Unsuitable for systems with variable workloads
- Misuse by non-expert users
A little history
Initial solutions to this situation: problem-specific solutions, polyalgorithms, installation tests.
A little history
Problem-specific solutions:
Brewer (1994): Sorting Algorithms, Differential Equations
Frigo (1997): FFTW: The Fastest Fourier Transform in the West
LAWRA (1997): Linear Algebra With Recursive Algorithms
A little history
Polyalgorithms: Brewer, FFTW, PHiPAC (1997, linear algebra)
A little history
Installation tests:
ATLAS (2001): dense linear algebra, sequential
Carrillo + Giménez (2000): Gauss elimination, heterogeneous algorithm
I-LIB (2000): some parallel linear algebra routines
A little history
Parallel optimization today:
- Optimization based on computational kernels
- Systematic development of routines
- Auto-optimization of routines
- Middleware for auto-optimization
A little history
Optimization based on computational kernels:
- Efficient kernels (BLAS) and algorithms based on these kernels
- Auto-optimization of the basic kernels (ATLAS)
A little history
Systematic development of routines:
- FLAME project (R. van de Geijn + E. Quintana + …): dense linear algebra, based on object-oriented design
- LAWRA: dense linear algebra, for shared memory systems
A little history
Auto-optimization of routines:
- At installation time: ATLAS (Dongarra + Whaley), I-LIB (Kanada + Katagiri + Kuroda), SOLAR (Cuenca + Giménez + González), LFC (Dongarra + Roche)
- At execution time: solve a reduced problem in each processor (Kalinov + Lastovetsky), or use a system evaluation tool (NWS)
A little history
Middleware for auto-optimization:
- LFC: middleware for dense linear algebra software in clusters
- Hierarchy of autotuning libraries: include in the libraries installation routines to be used in the development of higher level libraries
- FIBER: proposal of general middleware, evolution of I-LIB
- mpC: for heterogeneous systems
A little history
Parallel optimization in the future?
- Skeletons and languages
- Heterogeneous and variable-load systems
- Distributed systems
- P2P computing
A little history
Skeletons and languages: develop skeletons for parallel algorithmic schemes together with execution time models, and provide the users with these libraries (MALLBA, Málaga-La Laguna-Barcelona) or languages (P3L, Pisa).
A little history
Heterogeneous and variable-load systems:
- Heterogeneous algorithms: unbalanced distribution of data (static or dynamic)
- Homogeneous algorithms: more processes than processors, and assignment of processes to processors (static or dynamic)
- Variable-load systems treated as dynamically heterogeneous
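As an illustration of the static unbalanced distribution mentioned above, a small Python sketch; the function name, speeds and sizes are hypothetical, not from the slides:

```python
# Hypothetical sketch: static unbalanced data distribution for a
# heterogeneous algorithm. Each processor receives a share of the n rows
# proportional to its measured relative speed.

def heterogeneous_distribution(n, speeds):
    """Return the number of rows assigned to each processor,
    proportional to its relative speed, summing exactly to n."""
    total = sum(speeds)
    shares = [n * s // total for s in speeds]
    # Hand out the rows lost to integer truncation, fastest first.
    leftover = n - sum(shares)
    for i in sorted(range(len(speeds)), key=lambda i: -speeds[i])[:leftover]:
        shares[i] += 1
    return shares

rows = heterogeneous_distribution(1000, [4, 2, 1, 1])
# the 4x processor receives about half of the matrix rows
```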
A little history
Distributed systems:
- Intrinsically heterogeneous and variable-load
- Very high cost of communications
- Special middleware is necessary (Globus, NWS)
- There can be servers to attend queries from clients
A little history
P2P computing:
- Users can join and leave dynamically
- All the users are of the same type (initially)
- It is distributed, heterogeneous and variable-load
- But special middleware is necessary
Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing
Modelling Linear Algebra Routines
It is necessary to predict the execution time accurately and to select:
- The number of processes
- The number of processors
- Which processors
- The number of rows and columns of processes (the topology)
- The processes-to-processors assignment
- The computational block size (in linear algebra algorithms)
- The communication block size
- The algorithm (polyalgorithms)
- The routine or library (polylibraries)
Modelling Linear Algebra Routines
Cost of a parallel program:

t_parallel = t_arith + t_comm + t_overhead - t_overlap

- t_arith: arithmetic time
- t_comm: communication time
- t_overhead: overhead, for synchronization, imbalance, processes creation, ...
- t_overlap: overlapping of communication and computation
Modelling Linear Algebra Routines
Estimation of the time, considering computation and communication divided in a number of steps:

t_parallel = t_arith + t_comm
t_parallel = t_arith,1 + t_comm,1 + t_arith,2 + t_comm,2 + ...

taking, for each part of the formula, the value of the process which gives the highest one.
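The per-step estimation can be sketched in a few lines of Python (illustrative only; the timings below are made up):

```python
# Minimal sketch of the per-step estimation: the modelled parallel time is
# the sum over steps of the highest per-process arithmetic time plus the
# highest per-process communication time.

def modelled_parallel_time(steps):
    """steps: one list per step of (t_arith, t_comm) pairs, one pair per
    process; for each part the process with the highest value is taken."""
    return sum(max(ta for ta, _ in step) + max(tc for _, tc in step)
               for step in steps)

t = modelled_parallel_time([
    [(1.0, 0.2), (1.5, 0.3), (0.9, 0.2)],   # step 1: 1.5 + 0.3
    [(2.0, 0.1), (1.2, 0.4), (1.0, 0.5)],   # step 2: 2.0 + 0.5
])
```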
Modelling Linear Algebra Routines
The time depends on the problem size (n) and the system size (p):

t_parallel(n, p)

But also on some ALGORITHMIC PARAMETERS, like the block size (b) and the number of rows (r) and columns (c) of processors in algorithms for a mesh of processors:

t_parallel(n, b, r, c)
Modelling Linear Algebra Routines
And on some SYSTEM PARAMETERS, which reflect the computation and communication characteristics of the system. Typically the cost of an arithmetic operation (tc) and the start-up (ts) and word-sending (tw) times:

t_parallel(n, AP, SP)
Modelling Linear Algebra Routines
LU factorisation (Golub - Van Loan): the matrix is partitioned into blocks, A_ij = (L U)_ij, with blocks L_ii unit lower triangular and U_ii upper triangular:

- Step 1: A11 = L11 U11 (LU factorisation without blocks)
- Step 2: A_i1 = L_i1 U11 (multiple lower triangular systems)
- Step 3: A_1j = L11 U_1j (multiple upper triangular systems)
- Step 4: A_ij is updated to A_ij - L_i1 U_1j (update of the south-east blocks)
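A minimal NumPy sketch of the four steps (no pivoting, so a diagonally dominant input is assumed; an illustration, not the code of the slides):

```python
# Illustrative block LU factorisation (no pivoting) following the four
# steps above; b is the computational block size.
import numpy as np

def block_lu(A, b):
    """Block LU: returns a matrix whose strict lower triangle holds L
    (unit diagonal implied) and whose upper triangle holds U."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        # Step 1: unblocked LU of the diagonal block A[k:e, k:e]
        for j in range(k, e):
            A[j+1:e, j] /= A[j, j]
            A[j+1:e, j+1:e] -= np.outer(A[j+1:e, j], A[j, j+1:e])
        # Step 2: triangular solves for the block column (L_i1)
        for j in range(k, e):
            A[e:, j] /= A[j, j]
            A[e:, j+1:e] -= np.outer(A[e:, j], A[j, j+1:e])
        # Step 3: triangular solves for the block row (U_1j)
        for j in range(k, e):
            A[j+1:e, e:] -= np.outer(A[j+1:e, j], A[j, e:])
        # Step 4: rank-b update of the south-east blocks
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A
```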
Modelling Linear Algebra Routines
The execution time: if the blocks are of size 1, the operations are all with individual elements and

t_sequential(n) = (2/3) n³ tc

but if the block size is b the cost is

t_sequential(n) = (2/3) k3 n³ + k3 n² b + (1/3) k2 n b²

with k3 and k2 the cost of the operations performed with BLAS 3 or BLAS 2.
Modelling Linear Algebra Routines
But the cost of different operations of the same level is different, and the theoretical cost can be modelled better as:

t_sequential(n) = (2/3) k3,dgemm n³ + k3,dtrsm n² b + (1/3) k2,dgetf2 n b²

Thus, the number of SYSTEM PARAMETERS increases (one for each basic routine), and ...
Modelling Linear Algebra Routines
The value of each System Parameter can depend on the problem size (n) and on the value of the Algorithmic Parameters (b). The formula takes the form:

t_sequential(n) = (2/3) k3,dgemm(n, b) n³ + k3,dtrsm(n, b) n² b + (1/3) k2,dgetf2(n, b) n b²

that is, t(n, AP, SP(n, AP)). And what we want is to obtain the values of AP with which the lowest execution time is obtained.
Modelling Linear Algebra Routines
The values of the System Parameters can be obtained:
- With installation routines associated to each linear algebra routine
- From information stored when the library was installed in the system, thus generating a hierarchy of libraries with auto-optimization
- At execution time, by testing the system conditions prior to the call to the routine
Modelling Linear Algebra Routines
These values can be obtained as simple values (the traditional method) or as functions of the Algorithmic Parameters. In this case a multidimensional table of values, as a function of the problem size and the Algorithmic Parameters, is stored. When a problem of a particular size is solved, the execution time is estimated with the values of the stored size closest to the real size, and the problem is solved with the values of the Algorithmic Parameters which predict the lowest execution time.
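A hedged sketch of this selection step in Python; the model used here is a simplified stand-in, not the exact formula of the slides, and all table values are hypothetical:

```python
# Hedged sketch of the run-time selection: System Parameter values are read
# from an installation table indexed by problem size (the stored size
# closest to the real one is used), and the Algorithmic Parameters are
# chosen by minimising the modelled time.

def closest_stored(n, table):
    """Return the table entry whose stored problem size is closest to n."""
    return table[min(table, key=lambda m: abs(m - n))]

def model_time(n, b, r, c, sp):
    # Simplified stand-in model: dominant arithmetic term, a start-up term
    # that shrinks with the block size, and a communication-volume term
    # that grows with the mesh imbalance max(r, c).
    k3, ts, tw = sp
    p = r * c
    return ((2 / 3) * k3 * n**3 / p
            + 2 * (n / b) * ts
            + 2 * (n**2 * max(r, c) / p) * tw)

def select_ap(n, p, sp_table, blocks=(16, 32, 64, 128)):
    """Grid-search the block size b and the r x c mesh shape minimising
    the modelled execution time."""
    best = None
    for b in blocks:
        sp = closest_stored(n, sp_table)[b]     # SP may depend on n and b
        for r in (r for r in range(1, p + 1) if p % r == 0):
            t = model_time(n, b, r, p // r, sp)
            if best is None or t < best[0]:
                best = (t, b, r, p // r)
    return best
```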
Modelling Linear Algebra Routines
Parallel block LU factorisation. [Figures: distribution of the matrix and of the computations among the processors in the first, second and third steps.]
Modelling Linear Algebra Routines
The cost of parallel block LU factorisation:

T_ARI = (2/3) k3,gemm n³/p + k3,trsm n² b (r + c)/p + (1/3) k2,getf2 n b²

T_COM = 2 (n/b) ts + 2 (n² d/p) tw

Tuning Algorithmic Parameters: block size b; 2D mesh of p processors, p = r × c, d = max(r, c).
System Parameters: cost of the arithmetic operations (k2,getf2, k3,trsm, k3,gemm); communication parameters (ts, tw).
Modelling Linear Algebra Routines
The cost of parallel block QR factorisation:

T_ARI = (4/3) k3,gemm n³/p + [lower-order terms in n² b and n b², weighted by k3,trmm, k2,geqr2 and k2,larft]
T_COM = [terms in ts and tw, with log r and log c factors from the broadcasts in the rows and columns of the mesh]

Tuning Algorithmic Parameters: block size b; 2D mesh of p processors, p = r × c.
System Parameters: cost of the arithmetic operations (k2,geqr2, k2,larft, k3,gemm, k3,trmm); communication parameters (ts, tw).
The same basic operations appear repeatedly in different higher level routines: the information generated for one routine (say, LU) could be stored and used for other routines (e.g. QR), and a common format is necessary to store the information.
Modelling Linear Algebra Routines
Modelling Linear Algebra Routines
[Figure: Parallel QR factorisation, IBM-SP2, 8 processors: execution time (seconds) versus problem size (512 to 3584) for the "mean", "model" and "optimum" strategies.]
“mean” refers to the mean of the execution times with representative values of the Algorithmic Parameters (the execution time a non-expert user could be expected to obtain)
“optimum” is the lowest time of all the executions performed with representative values of the Algorithmic Parameters
“model” is the execution time with the values selected with the model
Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing
Installation Routines
In the formulas (parallel block LU factorisation)

T_ARI = (2/3) k3,gemm(n, b, r, c) n³/p + k3,trsm(n, b, r, c) n² b (r + c)/p + (1/3) k2,getf2(n, b, r, c) n b²

T_COM = 2 (n/b) ts(n, b, r, c) + 2 (n² d/p) tw(n, b, r, c)

the values of the System Parameters (k2,getf2, k3,trsm, k3,gemm, ts, tw) must be estimated as functions of the problem size (n) and the Algorithmic Parameters (b, r, c).
Installation Routines
The System Parameters are estimated by running, at installation time, the installation routines associated to the linear algebra routine, and by storing the information generated, to be used at running time. Each linear algebra routine must be designed together with the corresponding installation routines, and the installation process must be detailed.
Installation Routines
k3,gemm(n, b, r, c) is estimated by performing matrix-matrix multiplications and updates of size (n/r × b)(b × n/c). Because the size of the matrix to work with decreases during the execution, different values can be estimated for different problem sizes, and the formula can be modified to include the possibility of these estimations with different values, for example by splitting the formula into four formulas with different problem sizes.
Installation Routines
For k3,trsm(n, b, r, c), two multiple triangular systems are solved: one upper triangular of size b × n/c, and another lower triangular of size n/r × b. Thus two parameters are estimated, one depending on n, b and c, and the other depending on n, b and r. As for the previous parameter, values can be obtained for different problem sizes.
Installation Routines
k2,getf2(n, b, r, c) corresponds to a level 2 sequential LU factorisation of size b × b. At installation time each of the basic routines is executed, varying the values of the parameters they depend on, with representative values (selected by the routine designer or the system manager). The information generated is stored in a file to be used at running time, or in the code of the linear algebra routine before its installation.
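A possible shape for such an installation routine, sketched in Python with a pure-Python stand-in for the b × b level 2 factorisation kernel; the file name and the representative block sizes are illustrative:

```python
# Hypothetical sketch of an installation routine: time a b x b level 2 LU
# kernel for representative block sizes and store the derived cost per
# basic operation (k2) for use at run time.
import json, time

def getf2(A):
    """Unblocked LU without pivoting on a list-of-lists matrix
    (a pure-Python stand-in for DGETF2)."""
    n = len(A)
    for j in range(n):
        for i in range(j + 1, n):
            A[i][j] /= A[j][j]
            for k in range(j + 1, n):
                A[i][k] -= A[i][j] * A[j][k]
    return A

def _timed(f, A):
    t0 = time.perf_counter()
    f(A)
    return time.perf_counter() - t0

def install_k2_getf2(block_sizes=(16, 32, 64, 128), reps=3):
    values = {}
    for b in block_sizes:
        A = [[1.0 if i == j else 0.01 for j in range(b)] for i in range(b)]
        best = min(_timed(getf2, [row[:] for row in A]) for _ in range(reps))
        values[b] = best / (2 * b**3 / 3)   # seconds per basic operation
    return values

# The generated information is stored in a file read at running time, e.g.:
# json.dump(install_k2_getf2(), open("getf2_k2.json", "w"))
```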
Installation Routines
ts(n, b, r, c) and tw(n, b, r, c) appear in communications of three types:
- In one of them, a block of size b × b is broadcast in a row; this parameter depends on b and c.
- In another, a block of size b × b is broadcast in a column; the parameter depends on b and r.
- In the other, blocks of sizes b × n/c and n/r × b are broadcast in each one of the columns and rows of processors; these parameters depend on n, b, r and c.
Installation Routines
In practice each System Parameter depends on a smaller number of Algorithmic Parameters, but this is known only after the installation process is completed. The routine designer also designs the installation process, and can use previous experience to guide it. The basic installation process can be designed to allow the intervention of the system manager.
Some results in different systems (physical and logical platform)
Values of k3_DTRMM (≈ k3_DGEMM) on the different platforms (in microseconds)
Installation Routines
For n = 512, ..., 4096:

System  Library   b=16    b=32    b=64    b=128
SUN1    refBLAS   0.0200  0.0200  0.0220  0.0280
SUN1    macBLAS   0.0120  0.0110  0.0110  0.0110
SUN1    ATLAS     0.0070  0.0060  0.0060  0.0060
SUN5    refBLAS   0.0120  0.0130  0.0140  0.0150
SUN5    macBLAS   0.0060  0.0050  0.0050  0.0050
SUN5    ATLAS     0.0040  0.0032  0.0025  0.0025
PIII    ATLAS     0.0038  0.0033  0.0030  0.0030
PPC     macBLAS   0.0023  0.0019  0.0018  0.0018
R10K    macBLAS   0.0070  0.0030  0.0025  0.0025
Installation Routines
Values of k2_DGEQR2 (≈ k2_DLARFT) on the different platforms (in microseconds), independent of the block size, for n = 512, ..., 4096:

System  Library   k2_DGEQR2
SUN1    refBLAS   0.0200
SUN1    macBLAS   0.0500
SUN1    ATLAS     0.0700
SUN5    refBLAS   0.0050
SUN5    macBLAS   0.0300
SUN5    ATLAS     0.0500
PIII    ATLAS     0.0150
PPC     macBLAS   0.0100
R10K    macBLAS   0.0250
Installation Routines
Typically the values of the communication parameters are well estimated with a ping-pong:
System     Library   ts / tw (microseconds), n = 512, ..., 4096
Origin 2K  Mac-MPI   20 / 0.1
IBM-SP2    Mac-MPI   75 / 0.3
cPIII      MPICH     60 / 0.7
cSUN1      MPICH     170 / 7.0
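Given the measured ping-pong times for several message sizes, ts and tw follow from a least-squares fit of t(m) = ts + m · tw; a small illustrative sketch (the sample values are synthetic, not measurements from the slides):

```python
# Illustrative post-processing of ping-pong measurements: fit the linear
# model t(m) = ts + m * tw to (message size, one-way time) pairs by
# ordinary least squares.

def fit_ts_tw(samples):
    """samples: list of (words, seconds). Returns (ts, tw)."""
    n = len(samples)
    sx = sum(m for m, _ in samples)
    sy = sum(t for _, t in samples)
    sxx = sum(m * m for m, _ in samples)
    sxy = sum(m * t for m, t in samples)
    tw = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    ts = (sy - tw * sx) / n
    return ts, tw

# Synthetic data generated with ts = 60e-6 s and tw = 0.7e-6 s per word:
data = [(m, 60e-6 + 0.7e-6 * m) for m in (1024, 4096, 16384, 65536)]
ts, tw = fit_ts_tw(data)
```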
Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing
Autotuning routines
Life cycle: modelling the Linear Algebra Routine (LAR) at DESIGN time; obtaining information from the system at INSTALLATION time; selection of the parameter values and execution of the LAR at RUN-TIME.
DESIGN PROCESS
LAR: Linear Algebra Routine, made by the LAR designer. Example of LAR: parallel block LU factorisation.
Modelling the LAR: from the LAR, a MODEL is built.
Modelling the LAR
MODEL: Texec = f(SP, AP, n), with SP the System Parameters, AP the Algorithmic Parameters, and n the problem size. Made by the LAR designer, only once per LAR.
Modelling the LAR
For the parallel block LU factorisation: SP: k3, k2, ts, tw; AP: p = r × c, b; n: problem size.
Implementation of SP-Estimators: from the MODEL, estimation routines are implemented for the System Parameters.
Implementation of SP-Estimators
Estimators of the Arithmetic SP: the computation kernel of the LAR, with a similar storage scheme and a similar quantity of data.
Estimators of the Communication SP: the communication kernel of the LAR, with a similar kind of communication and a similar quantity of data.
INSTALLATION PROCESS
The installation process is carried out only once per platform, by the system manager.
Estimation of Static-SP: the SP-Estimators are executed against the basic libraries, guided by an Installation File, and the results are stored in a Static-SP-File.
Estimation of Static-SP
Basic libraries: basic communication library (MPI, PVM); basic linear algebra library (reference BLAS, machine-specific BLAS, ATLAS).
Installation File: the SP values are obtained using the information (n and AP values) of this file.
Estimation of Static-SP
Platform: cluster of Pentium III + Fast Ethernet. Basic libraries: ATLAS and MPI.

Estimation of the static SP tw (in microseconds):
Message size (Kbytes)   32      256     1024    2048
tw-static               0.700   0.690   0.680   0.675

Estimation of the static SP k3 (in microseconds):
Block size   16      32      64      128
k3-static    0.0038  0.0033  0.0030  0.0027
RUN-TIME PROCESS: at run time, the information generated at design and installation time is used.
RUN-TIME PROCESS: the optimum AP values are selected using the MODEL and the Static-SP-File.
RUN-TIME PROCESS: after the selection of the optimum AP, the LAR is executed with those values.
Autotuning routines
Experiments. LAR: block LU factorisation. Platforms: IBM SP2, SGI Origin 2000, NoW. Basic libraries: reference BLAS, machine BLAS, ATLAS.
Autotuning routines
LU on IBM SP2: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the values of the parameters). [Bar chart for SEQ, PAR4 and PAR8; vertical axis from 0 to 1.4.]
Autotuning routines
LU on Origin 2000: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the values of the parameters). [Bar chart for SEQ, PAR4, PAR8 and PAR16; vertical axis from 0 to 1.4.]
Autotuning routines
LU on NoW: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the values of the parameters). [Bar chart for SEQ BLAS, SEQ ATLAS, PAR4 BLAS and PAR4 ATLAS at n = 512, 1024, 1536 and 2048; vertical axis from 0.96 to 1.10.]
Outline A little history Modelling Linear Algebra Routines Installation routines Autotuning routines Modifications to libraries’ hierarchy Polylibraries Algorithmic schemes Heterogeneous systems Hybrid programming Peer to peer computing
Modifications to libraries’ hierarchy
In the optimization of routines the same individual basic operations appear repeatedly. LU:

T_ARI = (2/3) k3,gemm n³/p + k3,trsm n² b (r + c)/p + (1/3) k2,getf2 n b²
T_COM = 2 (n/b) ts + 2 (n² d/p) tw

QR:

T_ARI = (4/3) k3,gemm n³/p + [lower-order terms in n² b and n b², weighted by k3,trmm, k2,geqr2 and k2,larft]
T_COM = [terms in ts and tw, with log r and log c factors from the broadcasts in the rows and columns of the mesh]
Modifications to libraries’ hierarchy
The information generated to install a routine could be used for a different routine without additional experiments: ts and tw are obtained when the communication library (MPI, PVM, ...) is installed, and k3,gemm is obtained when the basic computational library (BLAS, ATLAS, ...) is installed.
Modifications to libraries’ hierarchy
To determine:
- the type of experiments necessary for the different routines in the library: are ts and tw obtained with a ping-pong, a broadcast, ...? Is k3,gemm obtained for small block sizes, ...?
- the format in which the data will be stored, to facilitate their use when installing other routines.
Modifications to libraries’ hierarchy
The method could be valid not only for one library (the one I am developing) but also for other libraries that I or somebody else will develop in the future. Therefore the type of experiments and the format in which the data will be stored must be decided by the parallel linear algebra community ... and the typical hierarchy of libraries would change.
Modifications to libraries’ hierarchy
typical hierarchy of Parallel Linear Algebra libraries
ScaLAPACK
LAPACK
BLAS
PBLAS
BLACS
Communications
Modifications to libraries’ hierarchy
To include installation information in the lowest levels of the hierarchy. [Diagram: ScaLAPACK / LAPACK / PBLAS / BLACS over BLAS and Communications, with Self-Optimisation Information attached to BLAS and Communications.]
Modifications to libraries’ hierarchy
When installing libraries at a higher level this information can be used, and new information is generated. [Diagram: Self-Optimisation Information propagating to the intermediate levels of the hierarchy.]
Modifications to libraries’ hierarchy
And so on at higher levels. [Diagram: Self-Optimisation Information at all levels of the hierarchy, up to ScaLAPACK and LAPACK.]
Modifications to libraries’ hierarchy
And new libraries with autotuning capability could be developed. [Diagram: new higher-level libraries (Inverse Eigenvalue Problem, Least Squares Problem, PDE Solver) built on the hierarchy, each with its own Self-Optimisation Information.]
Modifications to libraries’ hierarchy
Movement of information between routines in the different levels of the hierarchy. [Diagram: GETRF from LAPACK (level 1), whose GETRF_manager holds k3 information and a model T_exec with a (2/3) k3 n³ term plus a lower-order n² b term; GEMM from BLAS (level 0), whose GEMM_manager holds k3 information and the model T_exec = 2 k3 n³.]
Modifications to libraries’ hierarchy
Architecture of a Self-Optimized Linear Algebra Routine manager. [Diagram: a SOLAR_manager combines the LAR(n, AP), the model Texec = f(SP, AP, n) with SP = f(AP, n), the installation information (one SP_manager per System Parameter SP1 ... SPt, each holding Installation_SP_values and Current_SP_values indexed by problem sizes n1 ... nw and AP values AP1 ... APz), the current problem size nc, and the current system information (CPU and network availability), to produce the Optimum_AP AP0.]
Outline●A little history●Modelling Linear Algebra Routines●Installation routines●Autotuning routines●Modifications to libraries’ hierarchy●Polylibraries●Algorithmic schemes●Heterogeneous systems●Hybrid programming●Peer to peer computing
Polylibraries
Different basic libraries can be available:
● Reference BLAS, machine-specific BLAS, ATLAS, …
● MPICH, machine-specific MPI, PVM, …
● Reference LAPACK, machine-specific LAPACK, …
● ScaLAPACK, PLAPACK, …
The goal: use a number of different basic libraries to develop a polylibrary.
Polylibraries
Typical parallel linear algebra library hierarchy
ScaLAPACK
LAPACK
BLAS
PBLAS
BLACS
MPI, PVM, ...
Polylibraries
A possible parallel linear algebra polylibrary hierarchy
ScaLAPACK
LAPACK PBLAS
BLACS
MPI, PVM, ...
ref. BLAS
mac. BLAS
ATLAS
Polylibraries
A possible parallel linear algebra polylibrary hierarchy
ScaLAPACK
LAPACK PBLAS
BLACS
ref. BLAS
mac. BLAS
ATLAS
mac. MPI
LAM
MPICH
PVM
Polylibraries
A possible parallel linear algebra polylibrary hierarchy
ScaLAPACK
mac. LAPACK
PBLAS
BLACS
ref. BLAS
mac. BLAS
ATLAS
mac. MPI
LAM
MPICH
PVM
ESSL
ref. LAPACK
Polylibraries
BLACS
PBLAS
ref. BLAS
mac. BLAS
ATLAS
mac. MPI
LAM
MPICH
PVM
mac. LAPACK
ESSL
ref. LAPACK
mac. ScaLAPACK
ESSL
ref. ScaLAPACK
Polylibraries
The advantages of polylibraries:
● A library optimised for the system might not be available
● The characteristics of the system can change
● Which library is best may vary with the routine and the system
● Even for different problem sizes or different data access schemes the preferred library can change
● In parallel systems the file system may be shared by processors of different types
Architecture of a Polylibrary
Library_1
Architecture of a Polylibrary
Library_1
LIF_1
Installation
Architecture of a Polylibrary
Library_1
LIF_1
Installation
Routine: DGEMM. [Table in LIF_1: Mflops obtained for each combination of matrix sizes m, n in {20, 40, 80}.]
Architecture of a Polylibrary
Library_1
LIF_1
Installation
Routine: DROT. [Table in LIF_1: Mflops obtained for n in {100, 200, 400} and leading dimensions 1, 100, 200.]
Architecture of a Polylibrary
Library_2Library_1
LIF_1
Installation
Architecture of a Polylibrary
Library_2
LIF_2
Library_1
LIF_1
Installation Installation
Architecture of a Polylibrary
Library_2
LIF_2
Library_3Library_1
LIF_1
Installation Installation
Architecture of a Polylibrary
Library_2
LIF_2
Library_3
LIF_3
Installation
Library_1
LIF_1
Installation Installation
Architecture of a Polylibrary
PolyLibrary
interface routine_1interface routine_2
...
Library_2
LIF_2
Library_3
LIF_3
Installation
Library_1
LIF_1
Installation Installation
Architecture of a Polylibrary
PolyLibrary
interface routine_1interface routine_2
...
interface routine_1:
    if n < value
        call routine_1 from Library_1
    else
        depending on data storage
            call routine_1 from Library_1
            or call routine_1 from Library_2
    ...
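One possible Python rendering of such an interface routine, dispatching to the library that the installation files (LIF) recorded as fastest for the current problem size and data layout; all names here are illustrative:

```python
# A possible shape for a polylibrary interface routine: dispatch to the
# basic library that the Library Installation Files (LIF) recorded as
# fastest for the current problem size and data layout.

def make_interface(lif):
    """lif maps (library, layout) -> {problem size: Mflops}, as recorded
    in the installation files."""
    def routine(n, layout, implementations):
        def recorded_speed(lib):
            perf = lif[(lib, layout)]
            closest = min(perf, key=lambda m: abs(m - n))
            return perf[closest]
        best = max(implementations, key=recorded_speed)
        return implementations[best](n)
    return routine
```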
Library_2
LIF_2
Library_3
LIF_3
Installation
Library_1
LIF_1
Installation Installation
Polylibraries●Combining Polylibraries with other Optimisation Techniques:
● Polyalgorithms● Algorithmic Parameters
● Block size● Number of processors● Logical topology of processors
Experimental Results
Routines of different levels in the hierarchy:● Lowest level:
● GEMM: matrix-matrix multiplication● Medium level:
● LU and QR factorisations● Highest level:
● a Lift-and-Project algorithm to solve the inverse additive eigenvalue problem
● an algorithm to solve the Toeplitz least square problem
Experimental Results
The platforms: ● SGI Origin 2000● IBM-SP2● Different networks of processors
● SUN Workstations + Ethernet● PCs + Fast-Ethernet● PCs + Myrinet
Experimental Results: GEMM
Routine: GEMM (matrix-matrix multiplication). Platform: five SUN Ultra 1 + one SUN Ultra 5.
Libraries: refBLAS, macBLAS, ATLAS1, ATLAS2, ATLAS5.
Algorithms and parameters: Strassen (base size), by blocks (block size), direct method.
Experimental Results: GEMM
MATRIX-MATRIX MULTIPLICATION INTERFACE:
if processor is SUN Ultra 5
    if problem-size < 600
        solve using ATLAS5 and Strassen method with base size half of problem size
    else if problem-size < 1000
        solve using ATLAS5 and block method with block size 400
    else
        solve using ATLAS5 and Strassen method with base size half of problem size
    endif
else if processor is SUN Ultra 1
    if problem-size < 600
        solve using ATLAS5 and direct method
    else if problem-size < 1000
        solve using ATLAS5 and Strassen method with base size half of problem size
    else
        solve using ATLAS5 and direct method
    endif
endif
16 December 2005 Universidad de Murcia 105
Experimental Results: GEMM

Method, library and parameter with the lowest time (Low.) and with the model selection (Mod.), and the time of the direct method with ATLAS5, for each size n (times in seconds):

n      200                   600                  1000                  1400                   1600
Low.   ATL5 direct, 0.04     ATL5 direct, 1.06    ATL5 Strassen(n/2), 4.68   ATL2 Strassen(n/2), 12.53   ATL5 block(400), 20.03
Mod.   ATL5 Strassen(n/2), 0.04   ATL5 block(400), 1.11   ATL5 Strassen(n/2), 4.68   ATL5 Strassen(n/2), 12.58   ATL5 Strassen(n/2), 26.57
ATLAS5 direct:  0.04         1.06                 4.83                  13.50                  31.02
16 December 2005 Universidad de Murcia 106
Experimental Results: LU
Routine: LU factorisation
Platform: 4 Pentium III + Myrinet
Libraries: ATLAS, BLAS for Pentium II, BLAS for Pentium III
16 December 2005 Universidad de Murcia 107
The cost of parallel block LU factorisation:
Tuning algorithmic parameters:
  block size: b
  2D mesh of p processors: p = r × c, d = max(r, c)
System parameters:
  cost of arithmetic operations: k2,getf2, k3,trsm, k3,gemm
  communication parameters: ts, tw
Experimental Results: LU

T_ARI = (2n³ / 3p) k3,gemm + lower-order terms of order n²b in k3,trsm and nb² in k2,getf2
T_COM = 2 (n/b) d ts + (2 n² d / p) tw, with d = max(r, c)
16 December 2005 Universidad de Murcia 108
Experimental Results: LU

Block size b and time (in seconds) with the theoretically selected parameters (the.), the lowest experimental time (low.) and the model-selected parameters (mod.):

            n = 512                  n = 1024                 n = 1536
            mod.     low.    the.    mod.     low.    the.    mod.     low.    the.
BLAS-III   32/0.11  32/0.11 32/0.13  32/0.70  32/0.70 32/0.77  32/2.13  32/2.13 32/2.30
BLAS-II    32/0.11  32/0.11 32/0.13  32/0.71  32/0.71 32/0.77  32/2.13  32/2.13 32/2.30
ATLAS      32/0.12  32/0.12 32/0.13  32/0.74  32/0.74 32/0.79  32/2.27  64/2.21 32/2.36
16 December 2005 Universidad de Murcia 109
Experimental Results: L&P
Routine: Lift-and-Project method for the Inverse Additive Eigenvalue Problem
Platform: dual Pentium III
Library combinations:
● La_In+B_In: LAPACK and BLAS installed in the system and supposedly optimized for the machine
● La_Re+B_III: reference LAPACK and a freely available BLAS for Pentium III
● La_Re+B_II: reference LAPACK and a freely available BLAS for Pentium II
● La_Re+B_In: reference LAPACK and the installed BLAS
● La_In_Th+B_In_Th: LAPACK and BLAS installed for the use of threads
● La_Re+B_II_Th: reference LAPACK and a freely available BLAS for Pentium II using threads
● La_Re+B_In_Th: reference LAPACK and the installed BLAS which uses threads
16 December 2005 Universidad de Murcia 110
Experimental Results: L&P

The theoretical model of the sequential algorithm cost has the form
  t = iter × [ LAPACK and BLAS-3 part: terms of order n³ in ksyev and k3,gemm and of order n² in k3,diaggemm ]
    + iter × [ BLAS-1 part: terms of order n² and nL in k1,dot, k1,scal and k1,axpy ]

System parameters:
  ksyev (LAPACK)
  k3,gemm, k3,diaggemm (BLAS-3)
  k1,dot, k1,scal, k1,axpy (BLAS-1)
16 December 2005 Universidad de Murcia 111
Experimental Results: L&P
16 December 2005 Universidad de Murcia 112
Experimental Results: L&P
Times (in seconds) of the parts of the routine with each library combination:

                      TOTAL    ZKAO    AMAT    MATMAT   EIG     EIGENADK  TRACE
Lowest                197.06    9.99    6.66    0.62    165.81   12.86     1.10
Lowest with threads   281.70    9.99    6.66    0.62    249.59   13.71     1.10
La_Re+B_In_Th         290.68   11.90   13.74    0.62    249.59   13.71     1.10
La_Re+B_II_Th         288.66    9.99    6.66    0.79    254.34   15.68     1.16
La_In_Th+B_In_Th      308.80   12.34   14.13    0.66    266.63   13.92     1.10
Lowest no threads     201.64   10.44   10.52    0.83    165.81   12.86     1.16
La_Re+B_In            497.59   18.03  123.73    1.21    336.49   16.41     1.69
La_Re+B_II            293.85   10.44   10.52    0.86    255.20   15.65     1.16
La_Re+B_III           264.89   10.46   26.70    0.83    210.85   14.87     1.16
La_In+B_In            294.32   14.22   98.79    0.94    165.81   12.86     1.69
16 December 2005 Universidad de Murcia 113
16 December 2005 Universidad de Murcia 114
Algorithmic schemes
● To study ALGORITHMIC SCHEMES, and not individual routines. The study could be useful to:
  ● Design libraries to solve problems in different fields:
    ● Divide and Conquer, Dynamic Programming, Branch and Bound (La Laguna)
  ● Develop SKELETONS which could be used in parallel programming languages:
    ● Skil, Skipper, CARAML, P3L, …
16 December 2005 Universidad de Murcia 115
Dynamic Programming
● There are different parallel Dynamic Programming schemes.
● The simple scheme of the "coins problem" is used:
  ● Given a quantity C and n coin types of values v = (v1, v2, …, vn), with a quantity q = (q1, q2, …, qn) of each type, minimize the number of coins used to give C.
  ● The granularity of the computation is varied to study the scheme, not the problem.
16 December 2005 Universidad de Murcia 116
Dynamic Programming
● Sequential scheme:

for i = 1 to number_of_decisions
  for j = 1 to problem_size
    obtain the optimum solution with i decisions and problem size j
  endfor
endfor

The table (n rows, one per coin type, and N columns) is completed with the formula:

Change[i][j] = min over k = 0, 1, …, ⌊j/vi⌋ of { k + Change[i−1][j − k·vi] }
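The sequential scheme with the recurrence above can be sketched in C. Unlimited quantities qi are assumed to keep the sketch short, and only two table rows are kept at a time:

```c
#include <limits.h>
#include <stdlib.h>

/* Coins problem: minimum number of coins of values v[0..n-1]
   needed to give the quantity C, using the recurrence
   Change[i][j] = min_{k=0..j/v[i]} { k + Change[i-1][j - k*v[i]] }.
   Returns INT_MAX if C cannot be given. */
int min_coins(const int *v, int n, int C) {
    int *prev = malloc((C + 1) * sizeof(int));
    int *cur  = malloc((C + 1) * sizeof(int));
    /* 0 coin types: only the quantity 0 can be given */
    for (int j = 0; j <= C; j++) prev[j] = (j == 0) ? 0 : INT_MAX;
    for (int i = 0; i < n; i++) {              /* one row per coin type */
        for (int j = 0; j <= C; j++) {
            int best = INT_MAX;
            for (int k = 0; k * v[i] <= j; k++)    /* k = 0..j/v_i */
                if (prev[j - k * v[i]] != INT_MAX &&
                    prev[j - k * v[i]] + k < best)
                    best = prev[j - k * v[i]] + k;
            cur[j] = best;
        }
        int *tmp = prev; prev = cur; cur = tmp;    /* next row */
    }
    int result = prev[C];
    free(prev); free(cur);
    return result;
}
```

In the parallel schemes that follow, the inner j loop is the one distributed among the processes.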
16 December 2005 Universidad de Murcia 117
Dynamic Programming
● Parallel scheme:

for i = 1 to number_of_decisions
  In Parallel:
    for j = 1 to problem_size
      obtain the optimum solution with i decisions and problem size j
    endfor
  endInParallel
endfor

(Each row i of the table is computed in parallel, with the columns j distributed among processors P0, P1, …, PK.)
16 December 2005 Universidad de Murcia 118
Dynamic Programming
● Message-passing scheme:

In each processor Pj:
  for i = 1 to number_of_decisions
    communication step
    obtain the optimum solution with i decisions and the problem sizes Pj has assigned
  endfor
endInEachProcessor

(The N columns of the table are distributed in blocks among processors P0, P1, …, PK.)
16 December 2005 Universidad de Murcia 119
Dynamic Programming
● Theoretical model:
  Sequential cost (one step i): (C² / 2vi) tc
  Computational parallel cost (qi large), per process: (C² / 2vi p) tc
  Communication cost: of the form (p(p−1)/2) ts + (p(p−1)/2)(C/p) tw
  t_parallel = t_arith,1 + t_comm,1 + t_arith,2 + t_comm,2 + …
● The only AP is p
● The SPs are tc, ts and tw
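Selecting the only AP, p, then amounts to evaluating the modelled time for each candidate and keeping the best. A minimal sketch; the exact model form used here (work divided by p plus start-ups) is an illustrative assumption:

```c
/* Illustrative execution-time model: computation spread over p
   processes plus (p-1) start-ups per step.  The constants play the
   role of the SPs (tc, ts); the form is an assumption. */
double model_time(double arith, double tc, double ts, int p) {
    return arith / p * tc + (p - 1) * ts;
}

/* Autotuning step: pick the p in 1..max_p with the lowest
   modelled time. */
int select_p(double arith, double tc, double ts, int max_p) {
    int best_p = 1;
    double best_t = model_time(arith, tc, ts, 1);
    for (int p = 2; p <= max_p; p++) {
        double t = model_time(arith, tc, ts, p);
        if (t < best_t) { best_t = t; best_p = p; }
    }
    return best_p;
}
```

With cheap communications the selection grows towards the maximum p; with expensive ones it collapses to 1, which is exactly the behaviour the granularity experiments below explore.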
16 December 2005 Universidad de Murcia 120
Dynamic Programming
● How to estimate arithmetic SPs: solving a small problem
● How to estimate communication SPs:
  ● Using a ping-pong (CP1)
  ● Solving a small problem varying the number of processors (CP2)
  ● Solving problems of selected sizes in systems of selected sizes (CP3)
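CP3 measures the SPs only for selected problem and system sizes and interpolates for the others. A minimal linear-interpolation sketch (names are illustrative):

```c
/* Linear interpolation of a system parameter measured only at the
   selected sizes sizes[0..m-1] (increasing), with measured values
   values[0..m-1]; sizes outside the measured range are clamped. */
double interpolate_sp(const double *sizes, const double *values,
                      int m, double size) {
    if (size <= sizes[0]) return values[0];
    for (int i = 1; i < m; i++)
        if (size <= sizes[i]) {
            double f = (size - sizes[i-1]) / (sizes[i] - sizes[i-1]);
            return values[i-1] + f * (values[i] - values[i-1]);
        }
    return values[m-1];   /* beyond the last measured size */
}
```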
16 December 2005 Universidad de Murcia 121
Dynamic Programming
● Experimental results:
  ● Systems:
    ● SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster) + Ethernet
    ● PenFE: seven Pentium III + Fast Ethernet
  ● Varying:
    ● The problem size: C = 10000, 50000, 100000, 500000
    ● Large values of qi
    ● The granularity of the computation (the cost of a computational step)
16 December 2005 Universidad de Murcia 122
Dynamic Programming
● Experimental results:
  ● CP1:
    ● ping-pong (point-to-point communication)
    ● does not reflect the characteristics of the system
  ● CP2:
    ● executions with the smallest problem (C = 10000), varying the number of processors
    ● reflects the characteristics of the system, but the time also changes with C
    ● larger installation time (6 and 9 seconds)
  ● CP3:
    ● executions with selected problem sizes (C = 10000, 100000) and system sizes (p = 2, 4, 6), and linear interpolation for other sizes
    ● larger installation time (76 and 35 seconds)
16 December 2005 Universidad de Murcia 123
Dynamic Programming

Parameter selection: number of processors selected with CP1, CP2, CP3 and with the lowest time (LT), for each granularity gra and problem size C:

SUNEt     C = 500.000       100.000           50.000            10.000
gra       CP3 CP2 CP1 LT    CP3 CP2 CP1 LT    CP3 CP2 CP1 LT    CP3 CP2 CP1 LT
100        5   1   6   1     5   1   6   5     5   1   6   6     6   1   6   6
50         4   1   6   1     4   1   6   1     4   1   6   1     6   1   6   6
10         1   1   1   1     1   1   1   1     1   1   1   1     1   1   1   1

PenFE
100        7   7   7   6     7   7   7   6     7   7   7   7     7   5   7   6
50         6   1   7   7     6   1   7   4     6   1   7   7     7   1   7   5
10         1   1   6   1     1   1   6   1     1   1   6   1     1   1   6   1
16 December 2005 Universidad de Murcia 124
Dynamic Programming●Quotient between the execution time with the parameter selected by each one of the selection methods and the lowest execution time, in SUNEt:
16 December 2005 Universidad de Murcia 125
Dynamic Programming●Quotient between the execution time with the parameter selected by each one of the selection methods and the lowest execution time, in PenFE:
16 December 2005 Universidad de Murcia 126
Dynamic Programming
● Three types of users are considered:
  ● GU (greedy user): uses all the available processors
  ● CU (conservative user): uses half of the available processors
  ● EU (expert user): uses a different number of processors depending on the granularity:
    ● 1 for low granularity
    ● half of the available processors for middle granularity
    ● all the processors for high granularity
16 December 2005 Universidad de Murcia 127
Dynamic Programming●Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in SUNEt:
16 December 2005 Universidad de Murcia 128
Dynamic Programming●Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in PenFE:
16 December 2005 Universidad de Murcia 129
16 December 2005 Universidad de Murcia 130
Heterogeneous algorithms
● New algorithms with unbalanced distribution of data are necessary:
  ● different SPs for different processors
  ● the APs include:
    ● a vector of selected processors
    ● a vector of block sizes
(Gauss elimination: cyclic distribution of column blocks of sizes b0, b1, b2, b0, b1, b2, …)
16 December 2005 Universidad de Murcia 131
Heterogeneous algorithms
● Parameter selection:
  ● RI-THE: obtains p and b from the formula (homogeneous distribution)
  ● RI-HOM: obtains p and b through a reduced number of executions (homogeneous distribution)
  ● RI-HET: obtains p and b through a reduced number of executions, and each processor receives a block size proportional to its relative speed:

    bi = ( si / Σ j=1..p sj ) · b · p
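The RI-HET distribution formula can be sketched directly; rounding to integer block sizes is ignored here for clarity:

```c
/* Heterogeneous block sizes: each of the p processors receives a
   block proportional to its relative speed s[i], keeping the total
   work of one cycle, b*p, constant: b_i = b * p * s_i / sum(s_j). */
void het_block_sizes(const double *s, int p, double b, double *bi) {
    double total = 0.0;
    for (int i = 0; i < p; i++) total += s[i];
    for (int i = 0; i < p; i++)
        bi[i] = b * p * s[i] / total;
}
```

With equal speeds this degenerates to the homogeneous distribution bi = b.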
16 December 2005 Universidad de Murcia 132
Heterogeneous algorithms
● Quotient with respect to the lowest experimental execution time of RI-THEO, RI-HOMO and RI-HETE (charts for sizes 500 to 3000), on three systems:
  ● Homogeneous system: five SUN Ultra 1
  ● Hybrid system: five SUN Ultra 1, one SUN Ultra 5
  ● Heterogeneous system: two SUN Ultra 1 (one manages the file system), one SUN Ultra 5
16 December 2005 Universidad de Murcia 133
Parameter selection at running time

DESIGN / INSTALLATION / RUN-TIME scheme:
● Design: the LAR (linear algebra routine) is modelled, giving the MODEL, and the SP-estimators are implemented.
● Installation: the SP-estimators and the Basic Libraries Installation-File are used for the estimation of the static SPs (Static-SP-File).
16 December 2005 Universidad de Murcia 134
16 December 2005 Universidad de Murcia 135
Parameter selection at running time

At run time the NWS is called and it reports:
● the fraction of available CPU (fCPU)
● the current word-sending time (tw_current) for a specific n and AP values (n0, AP0)
Then the fraction of available network is calculated as the quotient between the static word-sending time for (n0, AP0) and tw_current.
16 December 2005 Universidad de Murcia 136
Parameter selection at running time

                 nodes 1-4              nodes 5-6              nodes 7-8
Situation A   CPU 100%, tw 0.7s     CPU 100%, tw 0.7s     CPU 100%, tw 0.7s
Situation B   CPU  80%, tw 0.8s     CPU 100%, tw 0.7s     CPU 100%, tw 0.7s
Situation C   CPU  60%, tw 1.8s     CPU 100%, tw 0.7s     CPU 100%, tw 0.7s
Situation D   CPU  60%, tw 1.8s     CPU 100%, tw 0.7s     CPU  80%, tw 0.8s
Situation E   CPU  60%, tw 1.8s     CPU 100%, tw 0.7s     CPU  50%, tw 4.0s
16 December 2005 Universidad de Murcia 137
16 December 2005 Universidad de Murcia 138
16 December 2005 Universidad de Murcia 139
Parameter selection at running time

Dynamic adjustment of SP: with the NWS information, the static values of the SPs are tuned to the current situation (Current-SP), scaling them by the available fractions of CPU and network.
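A minimal sketch of the adjustment; the exact scaling rule (dividing the static values by the available fractions) is an assumption consistent with the NWS quantities above:

```c
/* Dynamic adjustment of the SPs with the NWS report.  Assumption:
   tc scales with the inverse of the available CPU fraction, and the
   communication SPs with the inverse of the available network
   fraction f_net = tw_static / tw_current. */
typedef struct { double tc, ts, tw; } sp_t;

sp_t adjust_sp(sp_t st, double f_cpu, double tw_current, double tw_static) {
    sp_t cur;
    double f_net = tw_static / tw_current;  /* fraction of available network */
    cur.tc = st.tc / f_cpu;    /* less available CPU -> larger arithmetic cost */
    cur.ts = st.ts / f_net;
    cur.tw = st.tw / f_net;    /* equals tw_current when st.tw == tw_static */
    return cur;
}
```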
16 December 2005 Universidad de Murcia 140
16 December 2005 Universidad de Murcia 141
16 December 2005 Universidad de Murcia 142
Parameter selection at running time

Selection of the optimum AP from the adjusted SPs:

Block size b                         Situation of the platform load
n        A     B     C     D     E
1024    32    32    64    64    64
2048    64    64    64   128   128
3072    64    64   128   128   128

Number of nodes to use, p = r × c     Situation of the platform load
n        A     B     C     D     E
1024   4×2   4×2   2×2   2×2   2×1
2048   4×2   4×2   2×2   2×2   2×1
3072   4×2   4×2   2×2   2×2   2×1
16 December 2005 Universidad de Murcia 143
16 December 2005 Universidad de Murcia 144
Parameter selection at running time

Complete scheme: at design time the LAR is modelled (MODEL) and the SP-estimators are implemented; at installation time the static SPs are estimated (Static-SP-File); at run time the NWS is called, the SPs are adjusted (Current-SP), the optimum AP is selected (Optimum-AP) and the LAR is executed.
16 December 2005 Universidad de Murcia 145
Parameter selection at running time

(Charts for n = 1024, 2048 and 3072 comparing the static model and the dynamic model under the platform load situations A to E.)
16 December 2005 Universidad de Murcia 146
Work distribution
● There are different possibilities in heterogeneous systems:
  ● Heterogeneous algorithms (Gauss elimination)
  ● Homogeneous algorithms and assignation of:
    ● one process to each processor (LU factorization)
    ● a variable number of processes to each processor, depending on the relative speed
● The general assignation problem is NP-complete, so heuristic approximations are used.
16 December 2005 Universidad de Murcia 147
Work distribution
● Dynamic Programming (the coins problem scheme): homogeneous algorithm + heterogeneous distribution.
(The columns of the table are distributed in blocks among processes p0, p1, …, pr, and several processes can be assigned to a same processor: P0 P0 P1 P3 P3 P3 … PS … PK PK.)
16 December 2005 Universidad de Murcia 148
Work distribution
● The model:
  t(n, C, v, q, tc(n,C,v,q,p,b,d), ts(n,C,v,q,p,b,d), tw(n,C,v,q,p,b,d))
● Problem size:
  ● n: number of types of coins
  ● C: value to give
  ● v: array of values of the coins
  ● q: quantity of coins of each type
● Algorithmic parameters:
  ● p: number of processes
  ● b: block size (here n/p)
  ● d: processes-to-processors assignment
● System parameters:
  ● tc: cost of basic arithmetic operations
  ● ts: start-up time
  ● tw: word-sending time
16 December 2005 Universidad de Murcia 149
Work distribution
● Theoretical model: the same as for the homogeneous case, because the same homogeneous algorithm is used:
  Sequential cost (one step i): (C² / 2vi) tc
  Computational parallel cost (qi large), per process: (C² / 2vi p) tc
  Communication cost: of the form (p(p−1)/2) ts + (p(p−1)/2)(C/p) tw
● There is a new AP: d
● The SPs are now unidimensional (tc) or bidimensional (ts, tw) tables
16 December 2005 Universidad de Murcia 150
Work distribution
● Assignment tree (P types of processors and p processes): each level assigns one more process to a type of processor, branching over the types in non-decreasing order so that equivalent assignments are not repeated.
Some limit on the height of the tree (the number of processes) is necessary.
16 December 2005 Universidad de Murcia 151
Work distribution
● Assignment tree (P types of processors and p processes):
  P = 2 and p = 3: 10 nodes
  In general, the number of nodes (the number of assignments of at most p processes among P types, including the root) is C(p+P, P).
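The node count can be checked by enumerating the non-decreasing assignments directly. A sketch (each call counts the current partial assignment as one node):

```c
/* Count the nodes of the assignment tree: partial assignments of up
   to max_p processes among types first_type..P, branching in
   non-decreasing type order.  Call with first_type = 1; the root
   (empty assignment) is counted too.  Matches C(p+P, P). */
long count_nodes(int P, int max_p, int first_type) {
    long nodes = 1;                        /* the current (partial) node */
    if (max_p == 0) return nodes;
    for (int t = first_type; t <= P; t++)  /* non-decreasing branches */
        nodes += count_nodes(P, max_p - 1, t);
    return nodes;
}
```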
16 December 2005 Universidad de Murcia 152
Work distribution
● Assignment tree in SUNEt, P = 2 types of processors (five SUN Ultra 1 + one SUN Ultra 5):
  number of nodes: (p+2)(p+1)/2
  With one process to each processor the tree branches between U1 and U5; when more processes than available processors are assigned to a type of processor, the costs of operations (SPs) change.
16 December 2005 Universidad de Murcia 153
Work distribution
● Assignment tree in TORC, using P = 4 types of processors:
  ● Type 1: one 1.7 GHz Pentium 4 (only one process can be assigned)
  ● Type 2: one 1.2 GHz AMD Athlon
  ● Type 3: one 600 MHz single Pentium III
  ● Type 4: eight 550 MHz dual Pentium III
  Branches assigning a second process to Type 1 are not in the tree; when two consecutive processes are assigned to a same node, the values of the SPs change.
16 December 2005 Universidad de Murcia 154
Work distribution
● Use Branch and Bound or Backtracking (with node elimination) to search through the tree:
  ● Use the theoretical execution model to estimate the cost at each node with the highest values of the SPs among the types of processors considered, multiplying by the number of processes assigned to the processor of this type with most charge:
    tc = max over i = 1, …, p with di ≠ 0 of { pi · tc,di }
    ts = max over i, j = 1, …, p with di, dj ≠ 0 of ts,di,dj
    tw = max over i, j = 1, …, p with di, dj ≠ 0 of tw,di,dj
16 December 2005 Universidad de Murcia 155
Work distribution
● Use Branch and Bound or Backtracking (with node elimination) to search through the tree:
  ● Use the theoretical execution model to obtain a lower bound for each node. For example, with an array of types of processors (1,1,1,2,2,2,3,3,3,4,4,4) with relative speeds si and an array of assignations a = (2,2,3), the array of possible assignations is pa = (0,0,0,1,1,0,1,1,1,1,1,1), and the maximum achievable speed is
    sT = Σ i = 1, …, p of pai · si
  The minimum arithmetic cost is obtained from this speed, and the lowest communication costs are obtained from those between the processors in the array of assignations:
    ts = max over i, j with ai, aj ≠ 0 of ts,ai,aj
    tw = max over i, j with ai, aj ≠ 0 of tw,ai,aj
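The maximum achievable speed sT of a partial assignment can be sketched as follows; the pa computation follows the example above (assigned processors plus the free ones of types not yet ruled out by the non-decreasing order):

```c
/* Maximum achievable speed of a partial assignment a[0..na-1]
   (non-decreasing types) over processors of types type[0..nproc-1]
   with relative speeds s[0..nproc-1].  pa_i = 1 for processors of a
   type >= the last assigned type, and for the processors already
   holding a process of an earlier type.  Types are assumed < 16. */
double max_achievable_speed(const int *type, const double *s, int nproc,
                            const int *a, int na) {
    int last = (na > 0) ? a[na - 1] : 1;
    int assigned[16] = {0}, taken[16] = {0};
    for (int i = 0; i < na; i++) assigned[a[i]]++;
    double sT = 0.0;
    for (int i = 0; i < nproc; i++) {
        int t = type[i];
        if (t >= last) sT += s[i];            /* type still usable: pa_i = 1 */
        else if (taken[t] < assigned[t]) {    /* only the assigned ones count */
            sT += s[i]; taken[t]++;
        }
    }
    return sT;
}
```

With the slide's example (all speeds 1) this gives sT = 8, the number of ones in pa.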
16 December 2005 Universidad de Murcia 156
Work distribution
● Theoretical model:
  Sequential cost (one step i): (C² / 2vi) tc
  Computational parallel cost (qi large), one step: the cost of the portion of columns assigned to each process, with the tc of the processor it runs on; the estimation uses the maximum values of the SPs
  Communication cost: of the form (p(p−1)/2) ts + (p(p−1)/2)(C/p) tw, with ts and tw the maximum values among the processors used
● The APs are p and the assignation array d
● The SPs are the unidimensional array tc and the bidimensional arrays ts and tw
16 December 2005 Universidad de Murcia 157
Work distribution
● How to estimate arithmetic SPs: solving a small problem on each type of processor
● How to estimate communication SPs:
  ● Using a ping-pong between each pair of processors, and between processes in the same processor (CP1): does not reflect the characteristics of the system
  ● Solving a small problem varying the number of processors, with linear interpolation (CP2): larger installation time
16 December 2005 Universidad de Murcia 158
Work distribution
● Three types of users are considered:
  ● GU (greedy user): uses all the available processors, with one process per processor
  ● CU (conservative user): uses half of the available processors (the fastest), with one process per processor
  ● EU (user expert in the problem, the system and heterogeneous computing): uses a different number of processes and processors depending on the granularity:
    ● 1 process on the fastest processor for low granularity
    ● half of the available processors, the appropriate ones, for middle granularity
    ● as many processes as processors, on the appropriate processors, for large granularity
16 December 2005 Universidad de Murcia 159
Work distribution●Quotient between the execution time with the parameters selected by each one of the selection methods and the modelled users and the lowest execution time, in SUNEt:
16 December 2005 Universidad de Murcia 160
Work distribution
● Parameter selection in TORC, with CP2:
C gra LT CP2
50000 10 (1,2) (1,2)
50000 50 (1,2) (1,2,4,4)
50000 100 (1,2) (1,2,4,4)
100000 10 (1,2) (1,2)
100000 50 (1,2) (1,2,4,4)
100000 100 (1,2) (1,2,4,4)
500000 10 (1,2) (1,2)
500000 50 (1,2) (1,2,3,4)
500000 100 (1,2) (1,2,3,4)
16 December 2005 Universidad de Murcia 161
Work distribution
● Parameter selection in TORC (without the 1.7 GHz Pentium 4), with CP2:
  ● Type 1: one 1.2 GHz AMD Athlon
  ● Type 2: one 600 MHz single Pentium III
  ● Type 3: eight 550 MHz dual Pentium III
C gra LT CP2
50000 10 (1,1,2) (1,1,2,3,3,3,3,3,3)
50000 50 (1,1,2) (1,1,2,3,3,3,3,3,3,3,3)
50000 100 (1,1,3,3) (1,1,2,3,3,3,3,3,3,3,3)
100000 10 (1,1,2) (1,1,2)
100000 50 (1,1,3) (1,1,2,3,3,3,3,3,3,3,3)
100000 100 (1,1,3) (1,1,2,3,3,3,3,3,3,3,3)
500000 10 (1,1,2) (1,1,2)
500000 50 (1,1,2) (1,1,2,3)
500000 100 (1,1,2) (1,1,2)
16 December 2005 Universidad de Murcia 162
Work distribution●Quotient between the execution time with the parameters selected by each one of the selection methods and the modelled users and the lowest execution time, in TORC:
16 December 2005 Universidad de Murcia 163
Work distribution●Quotient between the execution time with the parameters selected by each one of the selection methods and the modelled users and the lowest execution time, in TORC (without the 1.7 Ghz Pentium 4):
16 December 2005 Universidad de Murcia 164
16 December 2005 Universidad de Murcia 165
Hybrid programming

OpenMP:                                       MPI:
● fine-grain parallelism                      ● coarse-grain parallelism
● efficient in SMP                            ● more portable
● sequential and parallel codes are similar   ● parallel code very different from the sequential one
● tools for development and parallelisation   ● development and debugging more complex
● allows run-time scheduling                  ● static assignment of processes
● memory allocation can reduce performance    ● local memories, which facilitates their efficient use
16 December 2005 Universidad de Murcia 166
Hybrid programming

Advantages of hybrid programming:
● to improve scalability
● when too many tasks produce load imbalance
● applications with both fine and coarse-grain parallelism
● reduction of the code development time
● when the number of MPI processes is fixed
● in case of a mixture of functional and data parallelism
16 December 2005 Universidad de Murcia 167
Hybrid programming

Hybrid programming in the literature:
● Most of the papers are about particular applications
● Some papers present hybrid models
● No theoretical models of the execution time are available
16 December 2005 Universidad de Murcia 168
Hybrid programming

Systems:
● networks of dual Pentiums
● HPC160 (each node has four processors)
● IBM SP
● Blue Horizon (144 nodes, 8 processors each)
● Earth Simulator (640 × 8 vector processors)
● …
16 December 2005 Universidad de Murcia 169
Hybrid programming
16 December 2005 Universidad de Murcia 170
Hybrid programming

Models:
● MPI+OpenMP: OpenMP used for loop parallelisation inside each MPI process
● OpenMP+MPI: unsafe threads
● MPI and OpenMP processes in SPMD model: reduces the cost of communications
16 December 2005 Universidad de Murcia 171
Hybrid programming
16 December 2005 Universidad de Murcia 172
Hybrid programming

      program main
      include 'mpif.h'
      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
      h = 1.0d0 / n
      sum = 0.0d0
!$OMP PARALLEL DO REDUCTION (+:sum) PRIVATE (x)
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   enddo
!$OMP END PARALLEL DO
      mypi = h * sum
      call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION,
     &                MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      call MPI_FINALIZE(ierr)
      stop
      end
16 December 2005 Universidad de Murcia 173
Hybrid programming

It is not clear whether hybrid programming lowers the execution time (Lanucara, Rovida: Conjugate Gradient).
16 December 2005 Universidad de Murcia 174
Hybrid programming

It is not clear whether hybrid programming lowers the execution time (Djomehri, Jin: CFD solver).
16 December 2005 Universidad de Murcia 175
Hybrid programming

It is not clear whether hybrid programming lowers the execution time (Viet, Yoshinaga, Abderazek, Sowa: linear system).
16 December 2005 Universidad de Murcia 176
Hybrid programming
● Matrix-matrix multiplication: decide which is preferable, pure MPI or SPMD MPI+OpenMP.
  MPI+OpenMP: less memory and fewer communications, but it may make worse use of the memory.
(Figure: the matrix blocks N0, N1, N2 assigned to nodes, with processes/threads p0, p1 inside each node.)
16 December 2005 Universidad de Murcia 177
Hybrid programming
● In the theoretical time model more algorithmic parameters appear:
  8 processors: pure MPI: p = r×s: 1×8, 2×4, 4×2, 8×1
    hybrid: node meshes p = r×s: 1×4, 2×2, 4×1 and thread meshes q = u×v: 1×2, 2×1: total 6 configurations
  16 processors: pure MPI: p = r×s: 1×16, 2×8, 4×4, 8×2, 16×1
    hybrid: p = r×s: 1×4, 2×2, 4×1 and q = u×v: 1×4, 2×2, 4×1: total 9 configurations
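Counting the candidate topologies is just a matter of enumerating the factorizations; a sketch reproducing the counts above:

```c
/* Number of logical 2D meshes r x s with r*s = q: one per divisor
   r of q. */
int count_meshes(int q) {
    int count = 0;
    for (int r = 1; r <= q; r++)
        if (q % r == 0) count++;
    return count;
}

/* Hybrid run: every node mesh can be combined with every thread
   mesh inside a node. */
int count_hybrid_configs(int nodes, int threads_per_node) {
    return count_meshes(nodes) * count_meshes(threads_per_node);
}
```

For 8 processors as 4 dual nodes this gives 3 × 2 = 6 configurations, and for 16 processors as 4 quad nodes 3 × 3 = 9, matching the counts above.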
16 December 2005 Universidad de Murcia 178
Hybrid programming
● And more system parameters:
  ● The cost of communications is different inside and outside a node (similar to the heterogeneous case with more than one process per processor)
  ● The cost of arithmetic operations can vary when the number of threads in the node varies
● Consequently, the algorithms must be recoded and new models of the execution time must be obtained
16 December 2005 Universidad de Murcia 179
Hybrid programming
… and the formulas change: some communications between processes (P0 … P6) become synchronizations between threads inside a node (nodes 1 … 6). For some systems 6×1 nodes and 1×6 threads could be better, and for others 1×6 nodes and 6×1 threads.
16 December 2005 Universidad de Murcia 180
Hybrid programming
● Open problems:
  ● Is it possible to generate MPI+OpenMP programs automatically from MPI programs? Maybe for the SPMD model, or at least for some types of programs, such as matricial problems on meshes of processors.
  ● And is it possible to obtain the execution time of the MPI+OpenMP program from that of the MPI program and some description of how the time model was obtained?
16 December 2005 Universidad de Murcia 181
16 December 2005 Universidad de Murcia 182
Peer to peer computing
● Distributed systems:
  ● They are inherently heterogeneous and dynamic
  ● But there are other problems:
    ● higher communication cost
    ● special middleware is necessary
  ● The typical paradigms are master/slave and client/server, where different types of processors (users) are considered
16 December 2005 Universidad de Murcia 183
Peer to peer computing

Peer-to-Peer Computing. Dejan S. Milojicic, Vana Kalogeraki, Rajan Lukose, Kiran Nagaraja, Jim Pruyne, Bruno Richard, Sami Rollins, Zhichen Xu. HP Laboratories Palo Alto, 2002.
16 December 2005 Universidad de Murcia 184
Peer to peer computing
● Peer to peer:
  ● All the processors (users) are at the same level (at least initially)
  ● The community selects, in a democratic and continuous way, the topology of the global network
● Would it be interesting to have a P2P system for computing?
● Is some system of this type available?
16 December 2005 Universidad de Murcia 185
Peer to peer computing
● Would it be interesting to have a P2P system for computing?
  ● I think it would be interesting to develop a system of this type
  ● And to let the community decide, in a democratic and continuous way, if it is worthwhile
● Is some system of this type available?
  ● I think there is no pure P2P system dedicated to computation
16 December 2005 Universidad de Murcia 186
Peer to peer computing●… and other people seem to think the same:
● Lichun Ji (2003): “… P2P networks seem to outperform other approaches largely due to the anonymity of the participants in the peer-network, low network costs and the inexpensive disk-space. Trying to apply P2P principles in the area of distributed computation was significantly less successful”
● Arjav J. Chakravarti, Gerald Baumgartner, Mario Lauria (2004): “… current approaches to utilizing desktop resources require either centralized servers or extensive knowledge of the underlying system, limiting their scalability”
16 December 2005 Universidad de Murcia 187
Peer to peer computing
● There are a lot of tools for Grid computing:
  ● Globus (of course); but does Globus provide computational P2P capacity, or is it a tool with which P2P computational systems can be developed?
  ● NetSolve/GridSolve: uses a client/server structure
  ● PlanetLab (at present 387 nodes and 162 sites): in each site one Principal Researcher and one System Administrator
16 December 2005 Universidad de Murcia 188
Peer to peer computing
● For computation on P2P the shared resources are:
  ● Information: books, papers, …, in the typical way
  ● Libraries: one peer takes a library from another peer
    ● A description of the library and the system is necessary to know if the library fulfils our requests
  ● Computation: one peer collaborates to solve a problem proposed by another peer
    ● This is the central idea of computation on P2P
16 December 2005 Universidad de Murcia 189
Peer to peer computing
● Two peers collaborate in the solution of a computational problem using the hierarchy of parallel linear algebra libraries, each with its own stack: ScaLAPACK or PLAPACK on top of PBLAS and BLACS, the reference or the machine LAPACK, BLAS or ATLAS, and the reference or the machine MPI.
16 December 2005 Universidad de Murcia 190
Peer to peer computing
● There are different global hierarchies and different libraries in each peer.
16 December 2005 Universidad de Murcia 191
Peer to peer computing
● And the installation information varies (each library in each peer has its own installation information), which makes the efficient use of the theoretical model more difficult than in the heterogeneous case.
16 December 2005 Universidad de Murcia 192
Peer to peer computing
● Trust problems appear:
  ● Does the library solve the problems we require to be solved?
  ● Is the library optimized for the system it claims to be optimized for?
  ● Is the installation information correct?
  ● Is the system stable?
● There are trust algorithms for P2P systems; are they (or some modification of them) applicable to these trust problems?
16 December 2005 Universidad de Murcia 193
Peer to peer computing
● Each peer would have the possibility of establishing a policy of use:
  ● The use of the resources could be payable
  ● The percentage of CPU dedicated to computations for the community
  ● The types of problems it is interested in
● And the MAIN PROBLEM: is it interesting to develop a P2P system for the management and optimization of computational codes?