Implementation and Evaluation of 2.5D Matrix Multiplication on the K computer

Daichi Mukunoki and Toshiyuki Imamura, RIKEN Advanced Institute for Computational Science
Nov. 14, 2017, PRACE@SC17, Denver
Nov. 14, 14:30-14:40 @ PRACE booth
Introduction

■ 2.5D matrix multiplication (2.5D-PDGEMM) [Solomonik & Demmel 2011]
  • A communication-avoiding algorithm for parallel matrix multiplication (PDGEMM) to improve strong-scaling performance
  • On highly parallel systems, the performance of PDGEMM can become communication-bound when the problem size is not sufficiently large
  • 2.5D-PDGEMM reduces the number of communications by utilizing data redundancy in a 2.5D distribution on a 3D process grid
The algorithm proceeds in three phases (c: stack size):
(1) Stacking: matrices with a 2D distribution are stacked vertically in a 3D process grid (the 2.5D distribution)
(2) Computing: 1/c of the 2D-PDGEMM algorithm is performed on each level, so the number of steps (and communications) is reduced
(3) Reducing: the final result is computed by vertically reducing the intermediate results on each level
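As a rough illustration of the step reduction (a back-of-envelope calculation using the sqrt(p/c^3) round count that appears in the performance model later in this deck; the numbers are illustrative, not measurements), for p = 16384 processes:

\[
c = 1:\ \sqrt{p} = 128 \text{ rounds}, \qquad
c = 4:\ \sqrt{p/c^{3}} = \sqrt{16384/64} = 16, \qquad
c = 16:\ \sqrt{p/c^{3}} = \sqrt{16384/4096} = 2
\]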
Introduction (cont'd)

■ 2D-compatible 2.5D-PDGEMM
  • To compute matrices distributed in 2D on a 2D process grid, the matrices must be redistributed into 2.5D on a logical 3D process grid created on top of the 2D process grid
  • A 2D-compatible 2.5D-PDGEMM, which includes this matrix redistribution between 2D and 2.5D, is needed as a substitute for the conventional PDGEMM
  • Several studies have implemented and evaluated 2.5D-PDGEMM, but such a 2D-compatible implementation has not been studied
■ Contributions of this study
  • Proposed a 2D-compatible 2.5D-PDGEMM implementation that includes the matrix redistribution between 2D and 2.5D
  • Analyzed its performance on up to 16384 nodes of the K computer, providing a performance breakdown and a performance estimation (new update)
[Figure: redistribution flow 2D → 2.5D → 2D, shown (a) on a 3D process grid and (b) on a 2D process grid]
Implementation

■ 2D-compatible 2.5D-PDGEMM based on SUMMA (a sketch follows the figure below)
  1. A logical 3D process grid is created on the initial 2D process grid, and vertical & horizontal sub-communicators are created using MPI_Comm_split
  2. Matrices A & B are redistributed from 2D into 2.5D using MPI_Allgather on the vertical sub-communicators
  3. The matrix multiplication is computed by 2.5D-SUMMA using MPI_Bcast on the horizontal sub-communicators
  4. Matrix C is computed by reducing the intermediate results on each level of the logical 3D process grid using MPI_Allreduce on the vertical sub-communicators
[Figure: communication pattern on the logical 3D process grid of stack size c — MPI_Allgather (redistribution), MPI_Bcast (SUMMA), MPI_Allreduce (reduction)]
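The following is a minimal sketch of steps 1-4 in C with MPI. It is not the authors' code: the names pdgemm_25d and dgemm_local are hypothetical, the rank-to-grid mapping and the block layout used for the 2D↔2.5D redistribution are simplifying assumptions (n divisible by q, q divisible by c, nb*nb divisible by c), and the "horizontal" sub-communicators are split into the row and column communicators that SUMMA's two broadcasts per round require.

/* Sketch (not the authors' code) of the four steps above.  Assumes
 * p = q*q*c MPI ranks; block indexing is simplified and dgemm_local()
 * stands in for a call to an optimized DGEMM. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder for an optimized local DGEMM: C += A * B (nb x nb). */
static void dgemm_local(int nb, const double *A, const double *B, double *C)
{
    for (int i = 0; i < nb; i++)
        for (int k = 0; k < nb; k++)
            for (int j = 0; j < nb; j++)
                C[i*nb + j] += A[i*nb + k] * B[k*nb + j];
}

void pdgemm_25d(int n, int q, int c,
                const double *A2d, const double *B2d, double *C2d,
                MPI_Comm comm)               /* comm has q*q*c ranks */
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    int level = rank / (q * q);          /* position in logical grid */
    int row   = (rank / q) % q;
    int col   = rank % q;
    int nb    = n / q;                   /* 2.5D block is nb x nb    */
    int n2d   = nb * nb / c;             /* elements per 2D piece    */

    /* Step 1: vertical and per-level row/column sub-communicators.  */
    MPI_Comm vert, rowc, colc;
    MPI_Comm_split(comm, row * q + col, level, &vert);  /* vertical  */
    MPI_Comm_split(comm, level * q + row, col, &rowc);  /* along row */
    MPI_Comm_split(comm, level * q + col, row, &colc);  /* along col */

    double *A = malloc(nb * nb * sizeof *A), *Abuf = malloc(nb * nb * sizeof *Abuf);
    double *B = malloc(nb * nb * sizeof *B), *Bbuf = malloc(nb * nb * sizeof *Bbuf);
    double *C = calloc(nb * nb, sizeof *C);

    /* Step 2: 2D -> 2.5D, gather the c pieces of each block vertically. */
    MPI_Allgather(A2d, n2d, MPI_DOUBLE, A, n2d, MPI_DOUBLE, vert);
    MPI_Allgather(B2d, n2d, MPI_DOUBLE, B, n2d, MPI_DOUBLE, vert);

    /* Step 3: 2.5D-SUMMA; each level performs q/c of the q rounds.   */
    for (int k = level * (q / c); k < (level + 1) * (q / c); k++) {
        if (col == k) memcpy(Abuf, A, nb * nb * sizeof *A);
        if (row == k) memcpy(Bbuf, B, nb * nb * sizeof *B);
        MPI_Bcast(Abuf, nb * nb, MPI_DOUBLE, k, rowc);  /* horizontal */
        MPI_Bcast(Bbuf, nb * nb, MPI_DOUBLE, k, colc);
        dgemm_local(nb, Abuf, Bbuf, C);
    }

    /* Step 4: sum the partial results across levels; every rank then
     * holds the full nb x nb block and keeps its 1/c share as the 2D
     * result, which restores the original 2D distribution.           */
    MPI_Allreduce(MPI_IN_PLACE, C, nb * nb, MPI_DOUBLE, MPI_SUM, vert);
    memcpy(C2d, C + level * n2d, n2d * sizeof *C);

    free(A); free(Abuf); free(B); free(Bbuf); free(C);
    MPI_Comm_free(&vert); MPI_Comm_free(&rowc); MPI_Comm_free(&colc);
}

Note how each of the four numbered steps maps to exactly one MPI call pattern; in particular, a single MPI_Allreduce on the vertical communicator both reduces the partial results and leaves every level with the data needed to restore the 2D distribution of C.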
Performance Modeling

Execution time of our 2D-compatible 2.5D-PDGEMM:

T_Pdgemm25D = 2 T_Allgather(c, e*n^2/p)                                              … redistribution of matrices A & B
            + sqrt(p/c^3) * (2 T_Bcast(sqrt(p/c), c*e*n^2/p) + T_Dgemm(n/sqrt(p/c)))  … 2.5D-SUMMA
            + T_Allreduce(c, c*e*n^2/p)                                              … redistribution & reduction of matrix C

Component models:

T_Bcast(p, m)     = (p-1)(l + s/b) + l + t0 + (m/s) * max(l, s/b)
T_Allgather(p, m) = (p-1)(l + g*m/b) + l
T_Allreduce(p, m) = T_Bcast(p, m) + T_Reduce(p, m), where
T_Reduce(p, m)    = (p-1)(l + s/b + s/b_comp) + l + (g*m/s) * max(l + s/b_comp, s/b)
T_Dgemm(n)        = 2n^3 / s_Dgemm

Model parameters on the K computer:
• p : number of processes
• m : message size [bytes]
• e : element size [bytes] (8 for double precision)
• l = 1.6 [µsec] : MPI latency
• s = 16384 [bytes] : segment size
• b = 3000 [MB/sec] : bandwidth
• b_comp = 10000 [MB/sec] : throughput of reduction
• t0 = 8.37 [sec] : overhead for selecting the optimal algorithm (incl. one Allreduce)
• s_Dgemm = 115.2 [GFlops] : DGEMM performance
• g = 1.5 sqrt(c) : effect of congestion
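To make the model concrete, here is a small evaluator of these formulas (a sketch, not the authors' estimation code). It works in microseconds and bytes; the element size e = 8 bytes is an assumption, and t0 is interpreted in microseconds here for consistency with the microsecond latency l.

/* Sketch of the performance model above.  Times in microseconds,
 * sizes in bytes.  Assumptions: e = 8 bytes (double precision) and
 * t0 taken as microseconds.  Not the authors' estimation code. */
#include <math.h>
#include <stdio.h>

static const double l   = 1.6;     /* MPI latency [usec]                  */
static const double s   = 16384.0; /* segment size [bytes]                */
static const double b   = 3000.0;  /* bandwidth [bytes/usec] (= MB/sec)   */
static const double bcp = 10000.0; /* reduction throughput [bytes/usec]   */
static const double t0  = 8.37;    /* algorithm-selection overhead [usec] */
static const double sd  = 115.2e3; /* DGEMM speed [flops/usec] = 115.2 GF */
static const double e   = 8.0;     /* element size [bytes], assumed       */

static double max2(double x, double y) { return x > y ? x : y; }

/* Congestion factor g = 1.5 sqrt(c); the vertical communicators in the
 * model have c processes, so the process count p equals c here. */
static double g(double p) { return 1.5 * sqrt(p); }

static double t_bcast(double p, double m)
{ return (p - 1) * (l + s / b) + l + t0 + (m / s) * max2(l, s / b); }

static double t_allgather(double p, double m)
{ return (p - 1) * (l + g(p) * m / b) + l; }

static double t_reduce(double p, double m)
{ return (p - 1) * (l + s / b + s / bcp) + l
       + (g(p) * m / s) * max2(l + s / bcp, s / b); }

static double t_allreduce(double p, double m)
{ return t_bcast(p, m) + t_reduce(p, m); }

static double t_dgemm(double n) { return 2.0 * n * n * n / sd; }

/* Estimated execution time [usec] of the 2D-compatible 2.5D-PDGEMM. */
static double t_pdgemm25d(double n, double p, double c)
{
    double q = sqrt(p / c);   /* the SUMMA grid is q x q on each level */
    return 2.0 * t_allgather(c, e * n * n / p)
         + sqrt(p / (c * c * c))
             * (2.0 * t_bcast(q, c * e * n * n / p) + t_dgemm(n / q))
         + t_allreduce(c, c * e * n * n / p);
}

int main(void)
{   /* Example: n = 32768 on p = 16384 processes with stack size c = 16. */
    printf("estimated time: %.1f ms\n",
           t_pdgemm25d(32768.0, 16384.0, 16.0) / 1000.0);
    return 0;
}

Varying c in this evaluator is a quick way to see the tradeoff noted in the summary: a larger stack size shrinks the broadcast (SUMMA) term but grows the redistribution and reduction terms.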
Performance Evaluation on the K computer

■ The K computer
  • 10th fastest supercomputer in the TOP500 as of Nov. 2017
  • 11.28 PFlops peak with 88128 nodes (this study used up to 16384 of them)
  • Each node has a SPARC64 VIIIfx (8-core, 128 GFlops in double precision)
  • Torus fusion (Tofu) interconnect (6D mesh & torus, 5 GB/s per direction)
■ Evaluation
  • We evaluated our implementation with different stack sizes: c = 1, 4, and 16
    - When c = 1, our code is equivalent to the conventional 2D-SUMMA
  • Even though the K computer has a 3D network topology, we ran our programs as 2D process jobs in order to mimic the situation where the 2D-compatible 2.5D-PDGEMM is used in a 2D application
Performance (strong scaling, n=32768)

[Figure: % of theoretical peak vs. number of processes (p = 256 to 65536) at fixed problem size n=32768 (higher is better); measured and estimated curves for SUMMA with c=1, 4, and 16, plus measured ScaLAPACK]

† SUMMA (c=1) is equivalent to 2D-SUMMA
† MPI communicator setup cost is excluded
† ScaLAPACK was executed with nb=128
Performance on the K computer (n=32768)

[Figure: the same plot — % of theoretical peak vs. number of processes, n=32768]

2.5D-PDGEMM improves the strong-scaling performance even when the cost of the matrix redistribution between 2D and 2.5D is included.
Performance Breakdown (strong scaling, n=32768)

[Figure: execution-time breakdown (%) vs. number of processes (256 to 65536) for c=1, c=4, and c=16; one row of panels for measured performance and one for estimated performance. Components: DGEMM, MPI_Bcast (B/L), MPI_Allgather (B/L), MPI_Allreduce (B/L)]

• MPI_Bcast: horizontal communication (the computation of SUMMA)
• MPI_Allgather & MPI_Allreduce: vertical communication (redistribution & reduction)
† (L): latency factor, (B): bandwidth factor
Summary

■ Summary
  • Proposed the "2D-compatible 2.5D-PDGEMM", which is designed to perform computations on 2D-distributed matrices on a 2D process grid
  • Evaluated its performance on the K computer, and analyzed it by providing a performance breakdown and a performance estimation
■ Conclusion
  • 2.5D-PDGEMM is effective in improving strong-scaling performance even when the cost of the matrix redistribution between 2D and 2.5D is included
  • A 2D-compatible implementation would be a good alternative to 2D-PDGEMM on highly parallel systems, in particular when applications using the conventional PDGEMM are ported to future systems with greater parallelism
  • The redistribution cost becomes non-negligible as the stack size of the 2.5D algorithm increases; there is a tradeoff between the communication cost of the matrix multiplication and the redistribution & reduction costs
[Latest publication]Daichi Mukunoki and Toshiyuki Imamura: “Implementation and Performance Analysis of 2.5D-PDGEMM on the K Computer”, 12th International Conference on Parallel Processing and Applied Mathematics (PPAM2017), Sep. 2017 (accepted).