
High Level Abstractions for Irregular Computations

David Padua
Department of Computer Science
University of Illinois at Urbana-Champaign

1. INTRODUCTION: THE PROBLEM

December 4, 2017

Programming for performance

• Is laborious
  – Particularly so when the target machine is parallel.
• Must decide what to execute in parallel
• In what order to traverse the data
• How to partition the data
• …
• Today, tuning must be done manually.
• Maintenance is difficult
  – and the result is not portable.

Programming for performance

• There is still much room for progress in high-performance programming methodology.
• Our goal is to facilitate (automate) tuning and enable machine-independent programming.

2. IRREGULAR COMPUTATIONS

What are irregular computations?

• Irregular computations are difficult to define.
• One possible definition is:
  – Computations on “irregular” data structures.
  – Irregular data structures = not dense arrays.
    • Pingali et al. The tao of parallelism in algorithms. PLDI ’11.
  – And subscript expressions non-linear.

What are irregular computations?

• Examples include
  – Subscripted subscripts:
    • A[C[i]]
  – Linked list traversal:
    • leftCell = leftCell->ColNext;
    • result += leftCell->Value;
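The linked-list traversal in the last bullet is the inner loop of a sparse matrix-vector product over rows stored as chains of cells. A minimal Python sketch of that access pattern (the field names Value, ColIndex, and ColNext come from the snippet; the Cell class and the example data are invented for illustration):

```python
# Sketch of the linked-cell traversal shown above; Cell and its fields
# mirror the slide's C snippet, the example matrix row is made up.

class Cell:
    def __init__(self, col_index, value, col_next=None):
        self.ColIndex = col_index
        self.Value = value
        self.ColNext = col_next  # next nonzero in the same row

def row_dot(left_cell, right):
    """Dot product of one sparse row (a linked list of cells) with a dense vector."""
    result = 0.0
    while left_cell is not None:
        result += left_cell.Value * right[left_cell.ColIndex]
        left_cell = left_cell.ColNext  # irregular: the next address is itself data
    return result

# The row (0, 2.0, 0, 3.0) stored as a linked list of its nonzeros.
row = Cell(1, 2.0, Cell(3, 3.0))
assert row_dot(row, [1.0, 10.0, 1.0, 100.0]) == 320.0  # 2*10 + 3*100
```

The irregularity is visible in the while loop: the address of the next operand is data, so the compiler cannot predict the access pattern.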

Programming irregular computations for performance

• Today, programming of irregular computations is typically at a low level of abstraction.

G. Hager and G. Wellein. Introduction to High Performance Computing for Scientists and Engineers. CRC Press.

do diag=1, Nj, 2   ! two-way unroll and jam
  diagLen = min( (jd_ptr(diag+1)-jd_ptr(diag)), &
                 (jd_ptr(diag+2)-jd_ptr(diag+1)) )
  offset1 = jd_ptr(diag) - 1
  offset2 = jd_ptr(diag+1) - 1
  do i=1, diagLen
    C(i) = C(i) + val(offset1+i)*B(col_idx(offset1+i))
    C(i) = C(i) + val(offset2+i)*B(col_idx(offset2+i))
  end do
  ! peeled-off iterations
  offset1 = jd_ptr(diag) - 1
  do i=(diagLen+1), (jd_ptr(diag+1)-jd_ptr(diag))
    C(i) = C(i) + val(offset1+i)*B(col_idx(offset1+i))
  end do
end do

Programming irregular computations for performance

• Programming and optimizing irregular computations is more challenging than for linear-subscript computations.
• Want to have compiler support for programming at a low level of abstraction to enable
  – Automatically mapping computation onto diverse machine classes
  – Locality enhancement
  – Other compiler optimizations (e.g., common subexpression elimination)
• In addition, often using higher levels of abstraction and applying optimizations at that level is a better solution.
• Both approaches will be discussed in this presentation.

When computations are not irregular

• Compiler can do a good job at transforming programs:
  – Loop interchange
  – Loop distribution
  – Tiling
  – …
• And we have the technology to generate efficient code for a variety of platforms even when the initial code is sequential
  – Vector extensions
  – Multicores
  – Distributed-memory machines
  – GPUs

Example: Automatic transformation of a regular computation

• Consider the loop

for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    a[i][j] = a[i][j-1] + a[i-1][j];
  }
}

Example: Automatic transformation of a regular computation

• Compute dependences (part 1)

for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    a[i][j] = a[i][j-1] + a[i-1][j];
  }
}

[Figure: the first iterations unrolled (i=1, i=2; j=1..4), with arrows marking the dependences through a[i][j-1]:
  a[1][1]=a[1][0]+a[0][1]    a[2][1]=a[2][0]+a[1][1]
  a[1][2]=a[1][1]+a[0][2]    a[2][2]=a[2][1]+a[1][2]
  a[1][3]=a[1][2]+a[0][3]    a[2][3]=a[2][2]+a[1][3]
  a[1][4]=a[1][3]+a[0][4]    a[2][4]=a[2][3]+a[1][4]]

Example: Automatic transformation of a regular computation

• Compute dependences (part 2)

for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    a[i][j] = a[i][j-1] + a[i-1][j];
  }
}

[Figure: the same unrolled iterations, now with arrows marking the dependences through a[i-1][j], i.e., from each i=1 iteration to the i=2 iteration with the same j.]

Example: Automatic transformation of a regular computation

for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    a[i][j] = a[i][j-1] + a[i-1][j];
  }
}

[Figure: the iteration space drawn as a grid of points (i, j), i, j = 1, 2, 3, 4, …, with dependence arrows from (i, j-1) and (i-1, j) into each point, starting at (1,1).]

Example: Automatic transformation of a regular computation

• Find parallelism

for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    a[i][j] = a[i][j-1] + a[i-1][j];
  }
}

[Figure: the iteration-space grid with the anti-diagonals i + j = constant highlighted; the points on each anti-diagonal are mutually independent and can execute in parallel.]

Example: Automatic transformation of a regular computation

• Transform the code

for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    a[i][j] = a[i][j-1] + a[i-1][j];
  }
}

becomes

for (k=4; k<2*n; k++)
  forall (i=max(2,k-n):min(n,k-2))
    a[i][k-i] = ...
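The equivalence of the two forms can be checked directly: skewing by k = i + j turns the anti-diagonal wavefronts into the inner loop. A Python sketch (loop bounds are the 0-based equivalents of the slide's, which are written 1-based):

```python
def original(a):
    n = len(a)
    for i in range(1, n):
        for j in range(1, n):
            a[i][j] = a[i][j-1] + a[i-1][j]
    return a

def wavefront(a):
    # Skewed form: all iterations with i + j == k are mutually independent,
    # so the inner loop could run as a parallel forall.
    n = len(a)
    for k in range(2, 2*n - 1):
        for i in range(max(1, k - (n - 1)), min(n - 1, k - 1) + 1):
            a[i][k-i] = a[i][k-i-1] + a[i-1][k-i]
    return a

grid = [[(3*i + j) % 7 for j in range(6)] for i in range(6)]
copy = [row[:] for row in grid]
assert original(grid) == wavefront(copy)   # both forms compute the same values
```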

Analyzing irregular computations

• Analysis of the following code is difficult and sometimes impossible:

for (i=1; i<n; i++) {
  a[c[i]] = a[d[i]] + 1;
}

• Dependences are a function of the values of c and d.
• In the absence of additional information, the compiler must assume the worst.
• And this precludes automatic transformations (parallelization, vectorization, data distribution, …).
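To make the dependence problem concrete, here is an illustrative runtime test (not from the slides) that is sufficient, though not necessary, for the loop above to be fully parallel: every written location is distinct and no written location is also read.

```python
def parallel_safe(c, d):
    """Conservative test for  a[c[i]] = a[d[i]] + 1 : no output dependence
    (all writes distinct) and no flow/anti dependence (no written location
    is also read). Sufficient, not necessary."""
    writes = set(c)
    return len(writes) == len(c) and writes.isdisjoint(d)

assert parallel_safe([0, 1, 2], [3, 4, 5])       # distinct, disjoint: parallel
assert not parallel_safe([0, 1, 2], [1, 3, 4])   # a[1] written and read
assert not parallel_safe([0, 0, 2], [3, 4, 5])   # a[0] written twice
```

The test itself depends on the contents of c and d at run time, which is exactly why a purely static compiler must assume the worst.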

Irregularity is a beast programming language researchers have been fighting for a long time.

It has defeated us sometimes

Ignoring the importance of irregular computations was one of the reasons for the failure of High Performance Fortran.

Taming the irregular beast

• Compilers and runtime (Section 3)
• Better notations (Section 4)

3. COMPILER AND RUNTIME SOLUTIONS FOR LOW-LEVEL IRREGULAR PROGRAMMING

Two approaches to dealing with irregular computations

• Analyzing irregular codes directly
  – Statically
  – Dynamically
• Converting into a higher level of abstraction (dense array computations)

Static analysis

• In some cases, it is possible to analyze irregular algorithms written in conventional notation.
• More effort than for regular/linear-subscript computations.

Static analysis

Dependence analysis

do k = 1, n
  q = 0
  do i = 1, p
    if ( x(i) > 0 ) then
      q = q + 1
      ind(q) = i
    end if
  end do
  do j = 1, q
    x(ind(j)) = x(ind(j)) * y(ind(j))
  end do
end do

Need information about ind(j).
It can be found by analyzing the program.
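The fact the analysis must discover is that ind(1:q) is filled with strictly increasing, hence distinct, values, so the iterations of the j loop write disjoint elements of x and are independent. A Python rendering of one iteration of the outer k loop, with that fact asserted where the compiler would prove it:

```python
def kernel(x, y):
    # The i loop: gather the indices of the positive elements.
    ind = [i for i in range(len(x)) if x[i] > 0]
    # What static analysis proves: ind is strictly increasing, so distinct.
    assert all(ind[j] < ind[j+1] for j in range(len(ind) - 1))
    # The j loop: updates touch disjoint locations, hence parallelizable.
    for j in ind:
        x[j] = x[j] * y[j]
    return x

assert kernel([1.0, -2.0, 3.0], [10.0, 10.0, 10.0]) == [10.0, -2.0, 30.0]
```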

Static analysis

Array privatization

do i=1, n
  do j=2, iblen(i)
    do k=1, j-1
      x(pptr(i)+k-1) = ...
    end do
  end do
  do j=1, iblen(i)-1
    do k=1, j
      ... = x(iblen(i)+pptr(i)+k-j-1)
    end do
  end do
end do

To decide if x can be privatized, it is sufficient to show that a write precedes each read.

Static analysis

• A possible approach is to query the control flow graph, bounding the search.

Yuan Lin and David Padua. Compiler analysis of irregular memory accesses. PLDI ’00.

Dynamic analysis

• Embedded dependence analysis
• Inspector-executor
• Speculation

Embedded dependence analysis

• This is a clever mechanism due to Zhu and Yew.

Zhu, C. Q., & Yew, P. C. (1987). A scheme to enforce data dependence on large multiprocessor systems. IEEE Transactions on Software Engineering, SE-13(6), 726-739.

for (i=1; i<n; i++)
  a(k(i)) = c(i) + 1

becomes

a_tag(k(:)) = 0
parallel for (i=1; i<n; i++)
  critical a(k(i))
    if a_tag(k(i)) < i
      a(k(i)) = c(i) + 1
      a_tag(k(i)) = i
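A sketch of what the scheme guarantees: per-element tags make the final values independent of execution order, so a shuffled order (standing in for a parallel loop, with the test-and-write treated as one critical section) produces the serial result:

```python
import random

def zhu_yew(n, k, c):
    """Sketch of the tag scheme for  a(k(i)) = c(i) + 1 .
    An iteration writes only if its index exceeds the location's tag, so
    the last serial writer wins no matter how the iterations are ordered."""
    a = [0.0] * n
    tag = [-1] * n
    order = list(range(len(k)))
    random.shuffle(order)          # adversarial "parallel" order
    for i in order:
        # In a real parallel run this test-and-write is a critical section.
        if tag[k[i]] < i:
            a[k[i]] = c[i] + 1
            tag[k[i]] = i
    return a

k = [0, 2, 0, 1]                   # iterations 0 and 2 collide on a[0]
c = [10.0, 20.0, 30.0, 40.0]
assert zhu_yew(3, k, c) == [31.0, 41.0, 21.0]   # same as the serial loop
```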

Inspector-executor

• Parallel schedule and communication aggregation computed at execution time
  – Analyzing the memory access pattern of a code segment (typically a loop)
• Overhead reduced when the code segment (loop) is executed multiple times with the same access pattern

See Encyclopedia of Parallel Computing, Run Time Parallelization (Saltz and Das).
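A toy inspector-executor for a loop of the form a[w[i]] = a[r[i]] + 1 (the wavefront representation here is illustrative, not Saltz and Das's actual data structures): the inspector groups iterations into wavefronts whose members touch disjoint locations; the executor runs the wavefronts in order, and the iterations inside one wavefront could run in parallel.

```python
def inspector(writes, reads):
    """Assign each iteration to the earliest wavefront that respects flow,
    anti, and output dependences on the accessed locations."""
    lastW, lastR = {}, {}          # location -> wavefront of last write / read
    wavefronts = []
    for i, (w, r) in enumerate(zip(writes, reads)):
        wf = 1 + max(lastW.get(w, -1), lastR.get(w, -1), lastW.get(r, -1))
        if wf == len(wavefronts):
            wavefronts.append([])
        wavefronts[wf].append(i)
        lastW[w] = wf
        lastR[r] = max(lastR.get(r, -1), wf)
    return wavefronts

def executor(a, writes, reads, wavefronts):
    for wf in wavefronts:          # wavefronts run in order...
        for i in wf:               # ...iterations inside one could run in parallel
            a[writes[i]] = a[reads[i]] + 1
    return a

writes, reads = [0, 1, 0, 2], [3, 3, 1, 0]
wfs = inspector(writes, reads)
assert wfs == [[0, 1], [2], [3]]
serial = executor([0]*4, writes, reads, [[i] for i in range(4)])
assert executor([0]*4, writes, reads, wfs) == serial
```

When the same index arrays are reused across outer iterations, the inspection cost is paid once and the schedule is reused, which is the point of the second bullet.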

Speculation

• There is an extensive body of literature on speculation.
• A code segment is executed in parallel in the hope that there is no problem.
• If there is, execution backtracks and the code segment is re-executed serially.

See Lawrence Rauchwerger, David A. Padua: The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization. PLDI 1995: 218-232.
Encyclopedia of Parallel Computing: Thread Level Speculation (Torrellas), Transactional Memories (Herlihy).

Converting into dense array computations

• The objective is to convert sparse computations into equivalent dense form.
• This increases the number of operations involving zeroes (or identity)
  – Sometimes making the computation impossible
• But it allows the application of compile-time transformations.
• Sparsity is recovered by applying a final transformation (see below).

Converting into dense array computations

for (col=0; col<cols; col++) {
  for (row=0; row<dimensions; row++) {
    leftCell = left.Rows[row];
    while (leftCell != NULL) {
      result[row][col] +=
        leftCell->Value * right[leftCell->ColIndex][col];
      leftCell = leftCell->ColNext;
    }
  }
}

becomes

for (col = 0; col < cols; col++)
  for (row = 0; row < dimensions; row++)
    result[row][col] += A_Valuep[row][:] * right[:][col];

Converting into dense array computations

Harmen L. A. van der Spek, Harry A. G. Wijshoff: Sublimation: Expanding Data Structures to Enable Data Instance Specific Optimizations. LCPC 2010: 106-120.

Anand Venkat, Mary Hall, and Michelle Strout. 2015. Loop and data transformations for sparse matrix code. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '15).

Where do we stand?

• There is an extensive body of literature on compilation and runtime techniques to facilitate the parallel programming of irregular computations.
• Although more ideas are important, what we need most urgently is an evaluation of their effectiveness.
  – A means for refinement and progress.
• All compiler techniques suffer from a lack of understanding of their effectiveness.

Need tools to evaluate progress

• Measuring advances in compiler technology is crucial for progress.
• An idea: a publicly available repository
  – Containing extensive and representative code collections.
  – Keeping track of results for compilers/techniques over time.
• Ongoing work on one such repository:
  – LORE: A Loop Repository for the Evaluation of Compilers, by Z. Chen, Z. Gong, J. Szaday, D. Won, D. Padua, A. Nicolau, A. Veidenbaum, N. Watkinson, Z. Sura, S. Maleki, J. Torrellas, G. DeJong. IISWC 2017.

4. NOTATIONS

4. NOTATIONS
4.1. ABSTRACTION

What are programming abstractions?

§ An abstraction is formed by reducing information.
§ In everyday life, it forms the world of ideas where we can reason. No (natural language) statement can be made without abstractions.

What are programming abstractions?

§ Abstract machine language / lower level implementation
§ Scalar computations:
  § a = b*c + d^3.4
  § No loads, stores, register allocation.
§ Scalar loops:
  § for (i = …) {…}
  § No if statements, branch/goto instructions.
  § Gives structure to the program.
§ Abstract algorithm implementation
  § min(A(1:n))
  § it = find(myvec.begin(), myvec.end(), 30);
  § Hide algorithm, data representation.

What are the benefits of using abstractions?

• Programmer productivity (expressiveness)
  – Codes are easier to write
  – To understand, maintain
• Portability
• Optimization (manual and automatic)
  – Programs are easier to analyze
  – Easier to transform
  – Deeper transformations are enabled

Abstractions and irregular computations

• What abstractions should we use for programming irregular computations at a higher level?
• The best answer seems to be to represent irregular algorithms in terms of:

Dense array operations

4. NOTATIONS
4.2. ARRAY NOTATION AND IRREGULAR COMPUTATIONS

Why array notation

• Could access arrays in scalar mode
• But operations on aggregates are (often) easier to read.

for (i=0; i<n; i++)
  for (j=0; j<n; j++)
    a[i][j] = b[i][j] + c[i][j]

versus

a = b + c

• They are well-understood abstractions.
• Powerful.

Array operations and performance

• But the usefulness of array notation goes beyond expressiveness.
• There are at least three reasons to use array abstractions for performance
  – To represent parallelism
  – To reduce the overhead of interpretation
  – To enable automatic optimizations

Representation of parallelism

• Popular in the early days:

Illiac IV Fortran:

   do 10 i = 1, 100, 2
     M(i) = .true.
     M(i+1) = .false.
10 continue
   A(*) = B(*) + A(*)
   C(M(*)) = B(M(*)) + A(M(*))

• Used today but not widely:

Cilk Plus:

C[i][j] = __sec_reduce_add( a[i][:] * b[:][j] );

• TensorFlow
• Futhark
  – T. Henriksen, N. Serup, M. Elsman, F. Henglein, and C. Oancea. Futhark: Purely Functional GPU-Programming with Nested Parallelism and In-Place Array Updates. PLDI 2017.

Array operations and optimizations: an example

• Applying arithmetic rules such as associativity of matrix-matrix multiplication to reduce the number of operations.
  – Which one is better?
    • (((A*B)*C)*D)*E
    • (A*((B*C)*D))*E
    • (A*B)*((C*D)*E)
    • …
  – Find the best using dynamic programming.
  – Yoichi Muraoka and David J. Kuck. 1973. On the time required for a sequence of matrix products. Commun. ACM 16, 1 (January 1973), 22-26.
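The dynamic-programming search can be sketched as follows (a textbook formulation in the spirit of Muraoka and Kuck; the function name and interface are invented):

```python
from functools import lru_cache

def matrix_chain_cost(dims):
    """Minimum scalar multiplications to compute A1*A2*...*Ak, where
    Ai has shape dims[i-1] x dims[i]."""
    @lru_cache(maxsize=None)
    def best(i, j):                # cheapest way to multiply Ai..Aj
        if i == j:
            return 0
        return min(best(i, k) + best(k + 1, j) + dims[i-1]*dims[k]*dims[j]
                   for k in range(i, j))
    return best(1, len(dims) - 1)

# (A*B)*C with A:10x100, B:100x5, C:5x50 costs 7500 multiplications;
# A*(B*C) would cost 75000. The DP finds the cheaper parenthesization.
assert matrix_chain_cost((10, 100, 5, 50)) == 7500
```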

Reduce overhead of interpretation

• See Haichuan Wang, David A. Padua, Peng Wu: Vectorization of Apply to Reduce Interpretation Overhead of R. OOPSLA 2015: 400-415.

Array operations and irregular computations

• Array operations are an ideal notation to manipulate sparse arrays.

Sparse arrays in MATLAB

S = sparse(m,n)

load west0479.mat
A = west0479;
S = A * A' + speye(size(A));
pct = 100 / numel(A);
figure
spy(S)
title('A Sparse Symmetric Matrix')
nz = nnz(S);
xlabel(sprintf('nonzeros = %d (%.3f%%)', nz, nz*pct));

• Arrays appear as dense, but are represented internally as sparse.
• The interpreter and libraries handle the sparseness.

Sparse arrays can be abstracted as dense arrays in compiled languages

• Bik and Wijshoff* developed a strategy to translate dense array programs onto sparse equivalents.

do i=1,n
  acc = acc + a(i,:) * <1:n>

do i=1,n
  do j=1,n
    if (i,j) ∈ E_a then
      acc = acc + a'(σ_a(i,j)) * j

do i=1,n
  do ad ∈ PAD_a
    j = π2(σ_a^-1(ad))
    acc = acc + a'(ad) * j

do i=1,n
  do ad = alow(i), ahigh(i)
    j = aind(ad)
    acc = acc + aval(ad) * j

*Aart J. C. Bik, Harry A. G. Wijshoff: Compilation Techniques for Sparse Matrix Computations. International Conference on Supercomputing 1993: 416-424.
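The last form above is essentially compressed sparse row storage: aval holds the nonzeros row by row, aind their column indices (1-based, as on the slide), and alow/ahigh delimit each row's slice. A Python sketch with invented example data:

```python
def csr_dot_iota(aval, aind, alow, ahigh):
    """Sparse form of  acc = acc + a(i,:) * <1:n>  summed over all rows,
    matching the final loop generated above."""
    acc = 0.0
    for i in range(len(alow)):
        for ad in range(alow[i], ahigh[i] + 1):
            j = aind[ad]
            acc += aval[ad] * j
    return acc

# The 3x3 matrix [[2,0,0],[0,0,3],[1,0,4]] stored sparsely; its dense
# row-by-row dot with <1:3>, summed, is 2*1 + 3*3 + (1*1 + 4*3) = 24.
aval = [2.0, 3.0, 1.0, 4.0]
aind = [1, 3, 1, 3]
alow, ahigh = [0, 1, 2], [0, 1, 3]
assert csr_dot_iota(aval, aind, alow, ahigh) == 24.0
```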

Array operations and irregular computations

• Not only can array notation be used to represent dense computations, it can be used for a number of other domains.
• A wide range of domains.

How wide?

• At the time (ca. 1959), Illinois, with the first of its ‘ILLIAC’ supercomputers, was a great centre of computing, and Golub showed his affection for his Alma Mater by endowing a chair there 50 years later. Rumour has it that the funds for the gift came from Google stock acquired in exchange for some advice on linear algebra. Google’s PageRank search technology starts from a matrix computation, an eigenvalue problem with dimensions in the billions. Hardly surprising, Golub would have said: everything is linear algebra. (Nature, 2007)

New operators: arrays for graph algorithms

• A simple algorithm for SSSP (Bellman-Ford) can be represented as follows:

for (i=1; i<=n; i++)
  d = d ⨁ A ⨂ d

Jeremy Kepner and John Gilbert. 2011. Graph Algorithms in the Language of Linear Algebra. SIAM Press.
T. Mattson, D. Bader, J. Berry, A. Buluc, J. Dongarra, C. Faloutsos, J. Feo, J. Gilbert, J. Gonzalez, B. Hendrickson, J. Kepner, C. Leiserson, A. Lumsdaine, D. Padua, S. Poole, Steve Reinhardt, M. Stonebraker, S. Wallach, A. Yoo. Standards for Graph Algorithm Primitives.
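In the tropical (min, +) interpretation, ⨁ is elementwise min and A ⨂ d relaxes every edge. A Python sketch of the slide's iteration (the example graph is invented):

```python
INF = float("inf")

def sssp_semiring(A, src):
    """Bellman-Ford as  d = d (+) A^T (x) d  repeated n times, with
    (+) = min and (x) = +. A[i][j] is the weight of edge i -> j, INF if absent."""
    n = len(A)
    d = [INF] * n
    d[src] = 0
    for _ in range(n):
        # (A^T (x) d)[j] = min over i of A[i][j] + d[i]: relax every edge.
        relaxed = [min(A[i][j] + d[i] for i in range(n)) for j in range(n)]
        d = [min(a, b) for a, b in zip(d, relaxed)]   # the (+) step
    return d

A = [[INF, 1.0, 4.0],
     [INF, INF, 2.0],
     [INF, INF, INF]]
assert sssp_semiring(A, 0) == [0.0, 1.0, 3.0]   # 0 -> 1 -> 2 beats the direct edge
```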

4. NOTATIONS
4.3. EXTENSIONS TO ARRAY NOTATION FOR PARALLELISM

Communication as array operations

repmat(h, [1, 3])
circshift(h, [0, -1])
transpose(h)

Cannon's algorithm

[Figure: Cannon's algorithm on a 3×3 grid of blocks. The blocks A00..A22 of A and B00..B22 of B are first skewed so that each processor holds an aligned pair (A00 B00, A01 B11, A02 B22; A11 B10, A12 B21, A10 B02; A22 B20, A20 B01, A21 B12); each subsequent step multiplies and accumulates the local pair and then shifts (shift-multiply-add).]

Cannon's algorithm

for k=1:n
  c = c + a × b;
  a = circshift(a, [0 -1]);
  b = circshift(b, [-1 0]);
end

James C. Brodman, G. Carl Evans, Murat Manguoglu, Ahmed H. Sameh, María Jesús Garzarán, David A. Padua: A Parallel Numerical Solver Using Hierarchically Tiled Arrays. LCPC 2010: 46-61.

Barrier elimination

X(1:4) = Y(1:4)*2;
Z(1:2) = X([1,3]);
Z(3:4) = X([2,4]);

[Figure: the statements above spread across four processors, X(1)=Y(1)*2, X(2)=Y(2)*2, X(3)=Y(3)*2, X(4)=Y(4)*2, then Z(1)=X(1), Z(2)=X(3), Z(3)=X(2), Z(4)=X(4), showing which assignments read an element produced on another processor.]


Asynchronous semantics

• In some cases, it could be desirable to delay updates across the whole machine. At some point, a global update is made.
• This is a particular example of communication in asynchronous algorithms where data assignments suffer delays.

Tiled Linear Algebra – SSSP example

// the distance vector
d = inf; d(1) = 0;
// the mask vector
L = 0; L(1) = 1;
while (L.nnz())
  for i = 1:LOC_ITER                    // local computation
    r <- d + (trans(A) * (d *. L));     // partial multiplication
    L <- (r != d);
    d <- r;
  r += d;                               // global reduction
  L <- (r != d);
  d <- r;

Saeed Maleki, G. Carl Evans, David A. Padua: Tiled Linear Algebra a System for Parallel Graph Algorithms. LCPC 2014: 116-130.

4. NOTATIONS
4.4. WHERE DO WE STAND?

Array notation

• Array notation has a long history.
• It is a powerful notation that can be used for a wide range of applications.
• For parallelism, it is an unfulfilled promise.
• The highly mathematical notation is a challenge.
• We see increased use of the notation and expect it to become popular also for irregular algorithms.
• Challenges:
  – Develop the notation
  – Design compiler optimizations

5. EPILOG

• In 1976, not much was known about parallel programming of irregular computations.
• 40 years later, it is impressive how much has been learned about compilation and notations.
• But the development of widely used tools based on these ideas remains a challenge.
