Implementation and Evaluation of 2.5D Matrix Multiplication on the K computer

Daichi Mukunoki and Toshiyuki Imamura, RIKEN Advanced Institute for Computational Science
Nov. 14, 2017, PRACE@SC17, Denver
Nov. 14, 14:30-14:40 @ PRACE booth
Introduction

■ 2.5D matrix multiplication (2.5D-PDGEMM) [Solomonik & Demmel 2011]
  • A communication-avoiding algorithm for parallel matrix multiplication (PDGEMM) to improve strong-scaling performance
  • On highly parallel systems, the performance of PDGEMM can become communication-bound when the problem size is not sufficiently large
  • 2.5D-PDGEMM reduces the number of communications by utilizing data redundancy in a 2.5D distribution on a 3D process grid
The algorithm proceeds in three phases (c: stack size):
(1) Stacking: matrices with a 2D distribution are stacked vertically in a 3D process grid (the 2.5D distribution)
(2) Computing: 1/c of the 2D-PDGEMM algorithm is performed on each level, so the number of steps (and communications) is reduced
(3) Reducing: the final result is computed by vertically reducing the intermediate results on each level
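As a rough illustration of the step reduction (a back-of-envelope calculation using the sqrt(p/c^3) round count that appears in the performance model later in this deck; the numbers are illustrative, not measurements), for p = 16384 processes:

\[
c = 1:\ \sqrt{p} = 128 \text{ rounds}, \qquad
c = 4:\ \sqrt{p/c^{3}} = \sqrt{16384/64} = 16, \qquad
c = 16:\ \sqrt{p/c^{3}} = \sqrt{16384/4096} = 2
\]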
Introduction (cont'd)

■ 2D-compatible 2.5D-PDGEMM
  • To compute matrices distributed in 2D on a 2D process grid, the matrices must be redistributed into 2.5D on a logical 3D process grid created on top of the 2D process grid
  • A 2D-compatible 2.5D-PDGEMM, which includes this matrix redistribution between 2D and 2.5D, is needed as a substitute for the conventional PDGEMM
  • Several studies have implemented and evaluated 2.5D-PDGEMM, but such a 2D-compatible implementation has not been studied
■ Contributions of this study
  • Proposed a 2D-compatible 2.5D-PDGEMM implementation that includes the matrix redistribution between 2D and 2.5D
  • Analyzed its performance on up to 16384 nodes of the K computer, providing a performance breakdown and a performance estimation (new update)
[Figure: redistribution flow 2D → 2.5D → 2D, shown (a) on a 3D process grid and (b) on a 2D process grid]
Implementation

■ 2D-compatible 2.5D-PDGEMM based on SUMMA (a sketch follows the figure below)
  1. A logical 3D process grid is created on the initial 2D process grid, and vertical & horizontal sub-communicators are created using MPI_Comm_split
  2. Matrices A & B are redistributed from 2D into 2.5D using MPI_Allgather on the vertical sub-communicators
  3. The matrix multiplication is computed by 2.5D-SUMMA using MPI_Bcast on the horizontal sub-communicators
  4. Matrix C is computed by reducing the intermediate results on each level of the logical 3D process grid using MPI_Allreduce on the vertical sub-communicators
[Figure: communication pattern on the logical 3D process grid of stack size c — MPI_Allgather (redistribution), MPI_Bcast (SUMMA), MPI_Allreduce (reduction)]
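The following is a minimal sketch of steps 1-4 in C with MPI. It is not the authors' code: the names pdgemm_25d and dgemm_local are hypothetical, the rank-to-grid mapping and the block layout used for the 2D↔2.5D redistribution are simplifying assumptions (n divisible by q, q divisible by c, nb*nb divisible by c), and the "horizontal" sub-communicators are split into the row and column communicators that SUMMA's two broadcasts per round require.

/* Sketch (not the authors' code) of the four steps above.  Assumes
 * p = q*q*c MPI ranks; block indexing is simplified and dgemm_local()
 * stands in for a call to an optimized DGEMM. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder for an optimized local DGEMM: C += A * B (nb x nb). */
static void dgemm_local(int nb, const double *A, const double *B, double *C)
{
    for (int i = 0; i < nb; i++)
        for (int k = 0; k < nb; k++)
            for (int j = 0; j < nb; j++)
                C[i*nb + j] += A[i*nb + k] * B[k*nb + j];
}

void pdgemm_25d(int n, int q, int c,
                const double *A2d, const double *B2d, double *C2d,
                MPI_Comm comm)               /* comm has q*q*c ranks */
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    int level = rank / (q * q);          /* position in logical grid */
    int row   = (rank / q) % q;
    int col   = rank % q;
    int nb    = n / q;                   /* 2.5D block is nb x nb    */
    int n2d   = nb * nb / c;             /* elements per 2D piece    */

    /* Step 1: vertical and per-level row/column sub-communicators.  */
    MPI_Comm vert, rowc, colc;
    MPI_Comm_split(comm, row * q + col, level, &vert);  /* vertical  */
    MPI_Comm_split(comm, level * q + row, col, &rowc);  /* along row */
    MPI_Comm_split(comm, level * q + col, row, &colc);  /* along col */

    double *A = malloc(nb * nb * sizeof *A), *Abuf = malloc(nb * nb * sizeof *Abuf);
    double *B = malloc(nb * nb * sizeof *B), *Bbuf = malloc(nb * nb * sizeof *Bbuf);
    double *C = calloc(nb * nb, sizeof *C);

    /* Step 2: 2D -> 2.5D, gather the c pieces of each block vertically. */
    MPI_Allgather(A2d, n2d, MPI_DOUBLE, A, n2d, MPI_DOUBLE, vert);
    MPI_Allgather(B2d, n2d, MPI_DOUBLE, B, n2d, MPI_DOUBLE, vert);

    /* Step 3: 2.5D-SUMMA; each level performs q/c of the q rounds.   */
    for (int k = level * (q / c); k < (level + 1) * (q / c); k++) {
        if (col == k) memcpy(Abuf, A, nb * nb * sizeof *A);
        if (row == k) memcpy(Bbuf, B, nb * nb * sizeof *B);
        MPI_Bcast(Abuf, nb * nb, MPI_DOUBLE, k, rowc);  /* horizontal */
        MPI_Bcast(Bbuf, nb * nb, MPI_DOUBLE, k, colc);
        dgemm_local(nb, Abuf, Bbuf, C);
    }

    /* Step 4: sum the partial results across levels; every rank then
     * holds the full nb x nb block and keeps its 1/c share as the 2D
     * result, which restores the original 2D distribution.           */
    MPI_Allreduce(MPI_IN_PLACE, C, nb * nb, MPI_DOUBLE, MPI_SUM, vert);
    memcpy(C2d, C + level * n2d, n2d * sizeof *C);

    free(A); free(Abuf); free(B); free(Bbuf); free(C);
    MPI_Comm_free(&vert); MPI_Comm_free(&rowc); MPI_Comm_free(&colc);
}

Note how each of the four numbered steps maps to exactly one MPI call pattern; in particular, a single MPI_Allreduce on the vertical communicator both reduces the partial results and leaves every level with the data needed to restore the 2D distribution of C.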
Performance Modeling

Execution time of our 2D-compatible 2.5D-PDGEMM:

T_Pdgemm25D = 2 T_Allgather(c, e*n^2/p)                                              … redistribution of matrices A & B
            + sqrt(p/c^3) * (2 T_Bcast(sqrt(p/c), c*e*n^2/p) + T_Dgemm(n/sqrt(p/c)))  … 2.5D-SUMMA
            + T_Allreduce(c, c*e*n^2/p)                                              … redistribution & reduction of matrix C

Component models:

T_Bcast(p, m)     = (p-1)(l + s/b) + l + t0 + (m/s) * max(l, s/b)
T_Allgather(p, m) = (p-1)(l + g*m/b) + l
T_Allreduce(p, m) = T_Bcast(p, m) + T_Reduce(p, m), where
T_Reduce(p, m)    = (p-1)(l + s/b + s/b_comp) + l + (g*m/s) * max(l + s/b_comp, s/b)
T_Dgemm(n)        = 2n^3 / s_Dgemm

Model parameters on the K computer:
• p : number of processes
• m : message size [bytes]
• e : element size [bytes] (8 for double precision)
• l = 1.6 [µsec] : MPI latency
• s = 16384 [bytes] : segment size
• b = 3000 [MB/sec] : bandwidth
• b_comp = 10000 [MB/sec] : throughput of reduction
• t0 = 8.37 [sec] : overhead for selecting the optimal algorithm (incl. one Allreduce)
• s_Dgemm = 115.2 [GFlops] : DGEMM performance
• g = 1.5 sqrt(c) : effect of congestion
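To make the model concrete, here is a small evaluator of these formulas (a sketch, not the authors' estimation code). It works in microseconds and bytes; the element size e = 8 bytes is an assumption, and t0 is interpreted in microseconds here for consistency with the microsecond latency l.

/* Sketch of the performance model above.  Times in microseconds,
 * sizes in bytes.  Assumptions: e = 8 bytes (double precision) and
 * t0 taken as microseconds.  Not the authors' estimation code. */
#include <math.h>
#include <stdio.h>

static const double l   = 1.6;     /* MPI latency [usec]                  */
static const double s   = 16384.0; /* segment size [bytes]                */
static const double b   = 3000.0;  /* bandwidth [bytes/usec] (= MB/sec)   */
static const double bcp = 10000.0; /* reduction throughput [bytes/usec]   */
static const double t0  = 8.37;    /* algorithm-selection overhead [usec] */
static const double sd  = 115.2e3; /* DGEMM speed [flops/usec] = 115.2 GF */
static const double e   = 8.0;     /* element size [bytes], assumed       */

static double max2(double x, double y) { return x > y ? x : y; }

/* Congestion factor g = 1.5 sqrt(c); the vertical communicators in the
 * model have c processes, so the process count p equals c here. */
static double g(double p) { return 1.5 * sqrt(p); }

static double t_bcast(double p, double m)
{ return (p - 1) * (l + s / b) + l + t0 + (m / s) * max2(l, s / b); }

static double t_allgather(double p, double m)
{ return (p - 1) * (l + g(p) * m / b) + l; }

static double t_reduce(double p, double m)
{ return (p - 1) * (l + s / b + s / bcp) + l
       + (g(p) * m / s) * max2(l + s / bcp, s / b); }

static double t_allreduce(double p, double m)
{ return t_bcast(p, m) + t_reduce(p, m); }

static double t_dgemm(double n) { return 2.0 * n * n * n / sd; }

/* Estimated execution time [usec] of the 2D-compatible 2.5D-PDGEMM. */
static double t_pdgemm25d(double n, double p, double c)
{
    double q = sqrt(p / c);   /* the SUMMA grid is q x q on each level */
    return 2.0 * t_allgather(c, e * n * n / p)
         + sqrt(p / (c * c * c))
             * (2.0 * t_bcast(q, c * e * n * n / p) + t_dgemm(n / q))
         + t_allreduce(c, c * e * n * n / p);
}

int main(void)
{   /* Example: n = 32768 on p = 16384 processes with stack size c = 16. */
    printf("estimated time: %.1f ms\n",
           t_pdgemm25d(32768.0, 16384.0, 16.0) / 1000.0);
    return 0;
}

Varying c in this evaluator is a quick way to see the tradeoff noted in the summary: a larger stack size shrinks the broadcast (SUMMA) term but grows the redistribution and reduction terms.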
Performance Evaluation on the K computer

■ The K computer
  • 10th fastest supercomputer in the TOP500 as of Nov. 2017
  • 11.28 PFlops peak with 88128 nodes (this study used up to 16384 of them)
  • Each node has a SPARC64 VIIIfx (8-core, 128 GFlops in double precision)
  • Torus fusion (Tofu) interconnect (6D mesh & torus, 5 GB/s per direction)
■ Evaluation
  • We evaluated our implementation with different stack sizes: c = 1, 4, and 16
    - When c = 1, our code is equivalent to the conventional 2D-SUMMA
  • Even though the K computer has a 3D network topology, we ran our programs as 2D process jobs in order to mimic the situation where the 2D-compatible 2.5D-PDGEMM is used in a 2D application
Performance (strong scaling, n=32768)

[Figure: % of theoretical peak vs. number of processes (p = 256 to 65536) at fixed problem size n=32768 (higher is better); measured and estimated curves for SUMMA with c=1, 4, and 16, plus measured ScaLAPACK]

† SUMMA (c=1) is equivalent to 2D-SUMMA
† MPI communicator setup cost is excluded
† ScaLAPACK was executed with nb=128
Performance on the K computer (n=32768)

[Figure: the same plot — % of theoretical peak vs. number of processes, n=32768]

2.5D-PDGEMM improves the strong-scaling performance even when the cost of the matrix redistribution between 2D and 2.5D is included.
Performance Breakdown (strong scaling, n=32768)

[Figure: execution-time breakdown (%) vs. number of processes (256 to 65536) for c=1, c=4, and c=16; one row of panels for measured performance and one for estimated performance. Components: DGEMM, MPI_Bcast (B/L), MPI_Allgather (B/L), MPI_Allreduce (B/L)]

• MPI_Bcast: horizontal communication (the computation of SUMMA)
• MPI_Allgather & MPI_Allreduce: vertical communication (redistribution & reduction)
† (L): latency factor, (B): bandwidth factor
Summary

■ Summary
  • Proposed the "2D-compatible 2.5D-PDGEMM", which is designed to perform computations on 2D-distributed matrices on a 2D process grid
  • Evaluated its performance on the K computer, and analyzed it by providing a performance breakdown and a performance estimation
■ Conclusion
  • 2.5D-PDGEMM is effective in improving strong-scaling performance even when the cost of the matrix redistribution between 2D and 2.5D is included
  • A 2D-compatible implementation would be a good alternative to 2D-PDGEMM on highly parallel systems, in particular when applications using the conventional PDGEMM are ported to future systems with greater parallelism
  • The redistribution cost becomes non-negligible as the stack size of the 2.5D algorithm increases; there is a tradeoff between the communication cost of the matrix multiplication and the redistribution & reduction costs
[Latest publication]Daichi Mukunoki and Toshiyuki Imamura: “Implementation and Performance Analysis of 2.5D-PDGEMM on the K Computer”, 12th International Conference on Parallel Processing and Applied Mathematics (PPAM2017), Sep. 2017 (accepted).