Paradigmas de Computação Paralela
Optimising performance (MPI)
João Luís Ferreira Sobral, Departamento de Informática
Universidade do Minho
10 Nov 2015
Computação Paralela 2
Performance of parallel applications
Performance models
- Make it possible to compare algorithms, analyse scalability and identify bottlenecks before considerable time is invested in an implementation.

What is the definition of performance?
- There are multiple alternatives: execution time, efficiency, scalability, memory requirements, throughput, latency, project/development costs, portability, reuse potential.
- The importance of each one depends on the concrete application.
- The most common measure for parallel applications is speed-up: Tseq/Tpar.
Amdahl's law
- The sequential component of an application limits the maximum speed-up: if s is the sequential fraction of an algorithm, then the maximum possible gain is 1/s.
- Reinforces the idea that we should prefer algorithms suitable for parallel execution: think parallel.
Performance models
- Should explain observations and predict behaviour.
- Defined as a function of the problem size, the number of processors, the number of tasks, etc.
- Execution time
  - Time measured from the moment the first processor (core) starts executing until the last processor terminates:
      Texec = Tcomp + Tcomm + Tfree
  - Computation time (Tcomp): time spent in computations, excluding communication/synchronization and idle time. The sequential version can be used to estimate Tcomp.
  - Idle time (Tfree): time during which a processor is starved (without work). It can be complex to measure, since it depends on the order in which tasks execute. It can be minimized with an adequate load distribution and/or by overlapping computation and communication.
Performance models (cont.)
- Communication time (Tcomm): time that processes spend sending/receiving data. Modelled using the communication latency (ts) and the per-byte transfer time (tw, the inverse of the throughput). For a message of L bytes:
      Tmsg = ts + tw·L
- ts and tw can be obtained experimentally, from a ping-pong test followed by a linear regression.
Measured latency (lat, µs) and bandwidth (bw, presumably Mbit/s) as a function of message size (bytes):

Size     MPI (lat)  MPI (bw)  MPI-MX (lat)  MPI-MX (bw)  MPP (lat)  MPP (bw)  KaRMI (lat)  KaRMI (bw)
1            31       0.3        3.4            2.3         46        0.2        63           0.1
8            31       2.6        3.4           23.7         46        1.7        75           1.1
100          31      25.6        4.7          168.7         48       16.6        75          10.7
1000         55     146.3        8.1          983.9         94      106.4       200          40.0
10000       258     310.3       18.4         4355.9        279      286.0       387         206.0
100000     1136     703.8      125.5         6373.0       1017      786.5      1137         703.0
1E+06      9859     811.4      970.2         8246.0       8282      953.2      9787         817.0

Fitted parameters:
ts (µs):      MPI 31,    MPI-MX 3,     MPP 46,    KaRMI 63
tw (µs/byte): MPI 0.010, MPI-MX 0.001, MPP 0.008, KaRMI 0.010
[Figure: log-log plot of the latencies (µs) and bandwidths in the table above as a function of message size, for MPI, MPP, KaRMI and MPI-MX.]
Performance models – Example: the Jacobi method
- Iterative method; at each iteration, each matrix element is recomputed from its current value and the values of its four neighbours:

      x(t+1)[i][j] = a·x(t)[i][j] + b·(x(t)[i-1][j] + x(t)[i][j-1] + x(t)[i+1][j] + x(t)[i][j+1])

      for (int t = 0; t < Niter; t++) {
          for (int i = 1; i < N-1; i++)
              for (int j = 1; j < N-1; j++)
                  r[i][j] = a*x[i][j] + b*(x[i-1][j] + x[i+1][j] + x[i][j-1] + x[i][j+1]);
          // x and r are swapped for the next iteration
      }
- Execution time of each Jacobi iteration, for an N×N matrix on P processors, with a partition by columns (N/P columns per processor):

      Tcomp = operations per element × elements per processor × tc
            = 6 × (N × N/P) × tc          (tc = time for a single operation)
            = 6tcN²/P

      Tcomm = messages per processor × time per message
            = 2 × (ts + twN)

      Tfree = 0, since in this problem the workload is well distributed

      Texec = Tcomp + Tcomm + Tfree
            = 6tcN²/P + 2ts + 2twN
            = O(N²/P + N)
Performance models – Example (cont.)
- In certain cases, execution time may not be the most adequate performance measure.
- Speed-up and efficiency are two related metrics.
- Speed-up (S) indicates the reduction in execution time attained on P processors:
  - the ratio between the best sequential algorithm and the execution time of the parallel version:
      S = Tseq / Tpar
- Efficiency (E) gives the fraction of time that processors perform useful work:
      E = Tseq / (P × Tpar)
- Jacobi case:

      S = 6tcN²P / (6tcN² + 2Pts + 2PtwN)

      E = 6tcN² / (6tcN² + 2Pts + 2PtwN)
Scalability analysis
- Execution time, speed-up and efficiency can be used for a quantitative analysis of performance.
- Jacobi example: Texec = 6tcN²/P + 2ts + 2twN
  - Execution time decreases when P increases, but it is bounded below by the time to exchange the two boundary columns (2ts + 2twN).
  - Execution time increases with N, tc, ts and tw.
  - Efficiency decreases when P, ts and tw increase.
  - Efficiency increases with N and tc.
- Scalability for problems of fixed size:
  - analysis of Texec and E as P increases;
  - in general, E decreases; Texec can even increase if the model contains a term with a positive power of P.
- Scalability for problems of variable size:
  - in some cases, more processors are used to solve larger problems while keeping the same efficiency level;
  - isoefficiency indicates the required increase in problem size to keep the same efficiency when the number of processors increases.
      E = 6tcN² / (6tcN² + 2Pts + 2PtwN)
Scalability analysis (cont.)

[Figure: execution time (ms) vs number of processors for the Jacobi method (N=1024, Tc=5 ns/op, Ts=1 ms, Tw=5 µs/byte), plotting each cost term of three communication models:]

      Tcomp = 6·Tc·N·N/P
      Tcom1 = 2(Ts + N·Tw)
      Tcom2 = 2(Ts + N·Tw) + Ts·P
      Tcom3 = 2(Ts + N·Tw) + Ts·log2(P)
[Figure: total execution time (ms) vs number of processors for Tcomp+Tcom1, Tcomp+Tcom2 and Tcomp+Tcom3, Jacobi method (N=1024, Tc=5 ns/op, Ts=1 ms, Tw=5 µs/byte).]
[Figure: speed-up vs number of processors for Tcomp+Tcom1, Tcomp+Tcom2 and Tcomp+Tcom3, Jacobi method (N=1024, Tc=5 ns/op, Ts=1 ms, Tw=5 µs/byte).]
[Figure: efficiency vs number of processors for Tcomp+Tcom1, Tcomp+Tcom2 and Tcomp+Tcom3, Jacobi method (N=1024, Tc=5 ns/op, Ts=1 ms, Tw=5 µs/byte).]
Measuring time in MPI
- Timing functions in MPI:
  - double MPI_Wtime() – returns the wall-clock time (high resolution)
  - double MPI_Wtick() – returns the clock resolution (in seconds)
- Wall-clock time can differ from process to process:
  - there is no notion of "global time";
  - each machine provides its own local wall-clock time.
- The application execution time should be the wall-clock time of the slowest process.
- Note: in some parallel algorithms, detecting process termination is not trivial.
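A minimal sketch of the usual measurement pattern: synchronise with a barrier, time the region with MPI_Wtime(), then reduce the per-process elapsed times with MPI_MAX so that rank 0 reports the slowest process. (Building and running this requires an MPI implementation, e.g. mpicc and mpirun.)

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* synchronise before timing */
    double t0 = MPI_Wtime();

    /* ... parallel work being measured ... */

    double local = MPI_Wtime() - t0;      /* per-process elapsed time */
    double slowest;
    /* application time = wall time of the slowest process */
    MPI_Reduce(&local, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("execution time: %f s (clock resolution %g s)\n",
               slowest, MPI_Wtick());
    MPI_Finalize();
    return 0;
}
```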
Experimental study and evaluation of implementations
- Parallel computing has a strong experimental component:
  - many problems are too complex for an analysis based on models alone;
  - performance models can be calibrated with experimental data (e.g., tc).
- How to ensure that results are precise and reproducible?
  - perform multiple experiments and verify the clock resolution;
  - results should vary little between runs: by less than 2-3%.
- Execution profile:
  - gathers several kinds of performance data: number of messages, volume of data transmitted;
  - can be obtained with specific tools or by directly instrumenting the code;
  - there is always some overhead introduced in the base application.
- Speed-up anomalies:
  - superlinear speed-up (greater than the number of processors) is in most cases due to cache effects.
Techniques for measuring an application's time profile (profiling)
- Sampling (polling): the application is periodically interrupted to collect performance data.
- Instrumentation: code is introduced (by the programmer or by tools) to collect performance data about the events of interest.
- Instrumentation tends to produce better results, but also causes more interference (e.g., overhead).
- Example: Vampir.
Distributed memory (MPI) vs shared memory (OpenMP) optimisation
- Distributed memory vs shared memory:
  - data placement is explicit (vs implicit);
  - static scheduling is preferred (vs dynamic);
  - synchronization is costly (available only through global barriers & message sends).
- Improving scalability on distributed memory:
  - minimise communication among processes, possibly by duplicating computation;
  - minimise idle time with a good load distribution.
- Practical advice:
  - measure the communication overhead;
  - measure the load balance;
  - avoid centralised control.
Exercise
Computing prime numbers with the sieve of Eratosthenes
- Algorithm that computes all the primes up to a given maximum.
- Can be implemented as a pipeline of activities, where each element filters out its multiples.
- Numbers are sent into the pipeline in increasing order. Each number that reaches the end of the pipeline is prime and is appended to the end of the pipeline as a new filter.
- Each parallel activity has a computation-to-communication ratio of one integer arithmetic operation (a division) per message:
  - a ratio far too low for most distributed-memory platforms.
[Figure: pipeline of parallel tasks, one per prime found so far (2, 3, 5); numbers flow through the chain via send/recv messages and each task filters out its multiples.]
Message Passing
Example: computing prime numbers (RMI vs MPI)
- Java RMI – the generator invokes the filter method on each filter; between filters, the filter method is also invoked remotely.
- MPI – the init parameters are passed on the command line (or through an initial message). The packets of numbers must be received explicitly and, after processing, sent to the next filter.
[Figure: a Generator and three PrimeFilter objects/processes; the generator first calls init on each filter and then calls filter repeatedly along the chain, which in MPI corresponds to explicit send/recv pairs between neighbouring processes.]
Example: computing prime numbers (RMI vs MPI), cont.
- Chain of three objects/processes computing the prime numbers:

      int MAXP = 1000000;
      int SMAXP = 1000;

      PrimeServer *ps1 = new PrimeServer();
      PrimeServer *ps2 = new PrimeServer();
      PrimeServer *ps3 = new PrimeServer();

      ps1->minitFilter(1, SMAXP/3, SMAXP);
      ps2->minitFilter(SMAXP/3+1, 2*SMAXP/3, SMAXP);
      ps3->minitFilter(2*SMAXP/3+1, SMAXP, SMAXP);

      int pack = MAXP/10;
      int *ar = new int[pack/2];
      for (int i = 0; i < 10; i++) {
          generate(i*pack, (i+1)*pack, ar);
          ps1->mprocess(ar, pack/2);
          ps2->mprocess(ar, pack/2);
          ps3->mprocess(ar, pack/2);
      }
      ps3->end();
- The equivalent MPI skeleton:

      int myrank = comm.rank();
      …
      if (myrank == 0) {
          … // create and initialise the local filter
          … // generate packets of numbers
          … // process
          comm.send(…);
      } else if (myrank == 1) {
          … // create and initialise the local filter
          comm.recv(…);
          … // process
          comm.send(…);
      } else {
          … // create and initialise the local filter
          comm.recv(…);
          … // process
      }