Sputnik: Automated Decomposition on Heterogeneous Clusters of Multiprocessors
Sean Philip Peisert ([email protected], http://www.sdsc.edu/~peisert)
March 20, 2000
Photo courtesy of NASA.

Sputnik - University of California, Davis · thesispresentation.ppt · Author: Sean Peisert · Created: 3/22/2000 3:53:23 PM


Page 1 (title slide)

Page 2

Copyright © 2000 Sean Philip Peisert

Thesis Committee

• Scott B. Baden, advisor
• Larry Carter
• Jeanne Ferrante
• Sidney Karin

Page 3

Motivation

• Supercomputers in science are evolving such that fewer and fewer are vector machines and mainframe multicomputers. Most are clusters of multiprocessors.

• A multiprocessor is a shared-memory machine, whereas a multicomputer is a “shared-nothing,” or distributed-memory, machine.

Page 4

Diagram: a multiprocessor. Four processors (0–3), each with its own L1 and L2 caches, share main memory over a common memory bus.

Page 5

Clusters of Multiprocessors

• Multiprocessors are built largely using component parts. They are also very modular.

• Easy to upgrade a portion of the nodes in the cluster with new nodes.

• Having different-speed nodes in the cluster means that programs have to be written differently.

Photo courtesy of SGI.

Page 6

Heterogeneous Cluster of Multiprocessors

Page 7

Heterogeneity Problem

• A parallel program will only run as fast as the slowest node.

• For example, if one adds new nodes to a cluster that run faster than the existing nodes, the new nodes will finish first and idle until the slower nodes finish.
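This bound can be made concrete with a small sketch (the function and variable names are illustrative, not part of Sputnik): under an equal split of the work, every node gets the same chunk, so the slowest node sets the overall run time while the faster nodes finish early and idle.

```cpp
#include <algorithm>
#include <vector>

// Run time of an equal split of `workUnits` across nodes of differing
// speeds: each node gets the same chunk, so overall time is set by the
// slowest node; faster nodes finish early and sit idle.
// (Illustrative sketch, not part of the Sputnik API.)
double equalSplitRunTime(double workUnits,
                         const std::vector<double>& nodeSpeeds) {
    double chunk = workUnits / nodeSpeeds.size();
    double worst = 0.0;
    for (double speed : nodeSpeeds)
        worst = std::max(worst, chunk / speed);  // slowest node dominates
    return worst;
}
```

For instance, 120 work units on three nodes of relative speed 2, 1, and 1 give each node 40 units; the fast node finishes in 20 time units but the cluster still takes 40.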

Page 8

Heterogeneity Problem

• If processors are idle, they are wasting processing power when they could be processing data.

• The program is not performing optimally and could be run faster.

• The machine is not being used efficiently.

Page 9

Related Work

• Fink and Baden: KeLP2
• Foster and Karonis: Grid-Enabled MPI
• Anglano, Schopf, Wolski and Berman: Zoom
• Crandall and Quinn: Decomposition Advisory System
• Wolski, Spring and Peterson: Network Weather Service

Page 10

Heterogeneity Solution

• Optimize the program individually for each separate node based on prior information about each node.

• This is not easy.

Page 11

Goals of Sputnik

• Allow a programmer to write software for a heterogeneous cluster as if the cluster were homogeneous; in other words, without adding much more complexity.

• Improve the performance of the program being run and the utilization and efficiency of the cluster.

Page 12

The Sputnik Model

• A two-stage process for optimizing performance on a heterogeneous cluster:

• ClusterDiscovery
• ClusterOptimization

Page 13

ClusterDiscovery

• Performs a “resource discovery” (a search of a defined parameter space) to understand how the application in question runs on each individual node in the cluster.

• Runs the kernel repeatedly inside a “shell” to determine the best-performing optimizations.

Page 14

ClusterDiscovery

• Saves the best optimization data inside a file for future use.

Page 15

ClusterOptimization

• Makes the specific optimizations for each node based on what the first stage has discovered.

• Some possible optimizations include: adjusting the number of threads per node, cache tiling, data partitioning, and machine and data “class” optimization.

Page 16

The Sputnik API

• Focuses on just a few specific optimizations.

• The Sputnik API is built on top of KeLP, which is used for data description and inter-node communication.

• OpenMP is used for intra-node parallelism.

Page 17

ClusterDiscovery (API)

• Instead of searching the entire parameter space for possible optimizations, I focused on two:
  – Adjusting the number of OpenMP threads.
  – Repartitioning the data.

Page 18

ClusterDiscovery (API)

• Runs the kernel repeatedly with different numbers of threads on each node. When it finds the optimal number of threads for a given node, it saves the timing for that node.

• The saved timing is compared with the other timings in the cluster, and ratios are formed.

Page 19

ClusterOptimization (API)

• The ratios from ClusterDiscovery are used to repartition the data so that each node works on a fraction of the data equal to the ratio of its measured power to that of the rest of the cluster.

Page 20

ClusterOptimization (API)

Diagram: three nodes (0–2). Example: node 0 runs twice as fast as node 1 and node 2. The original partitioning gives each node an equal share; the new partitioning gives node 0 half of the data and nodes 1 and 2 one quarter each.
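The arithmetic behind this repartitioning can be sketched as follows (the function and variable names are hypothetical, not the Sputnik API): treat each node's speed as the reciprocal of its measured time on an equal-sized chunk, and give it a share of the data equal to its speed divided by the cluster's total speed.

```cpp
#include <vector>

// Given per-node kernel timings on equal-sized chunks, compute the
// fraction of the data each node should own: its speed (1/time)
// divided by the total speed of the cluster.
// (Hypothetical sketch, not the Sputnik API.)
std::vector<double> proportionalShares(const std::vector<double>& times) {
    double totalSpeed = 0.0;
    for (double t : times) totalSpeed += 1.0 / t;
    std::vector<double> shares;
    for (double t : times) shares.push_back((1.0 / t) / totalSpeed);
    return shares;
}
```

For timings of 1, 2, and 2 seconds (node 0 twice as fast as the others), the shares come out to 1/2, 1/4, and 1/4, matching the example above.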

Page 21

API Limitations/Assumptions

• One-dimensional decomposition.
• Confined to two tiers of parallelism.
• Assumes that a node given a smaller chunk of data will run at the same MFLOPS rate as with the original-size chunk.
• The problems do not fit into the highest level of cache.

Page 22

API Limitations/Assumptions

• Assumes no node is less than half as fast as any other node.

Page 23

main()

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);   // Initialize MPI
    InitKeLP(argc, argv);     // Initialize KeLP

    // Call Sputnik's main routine, which in turn
    // will then call SputnikMain().
    SputnikGo(argc, argv);

    MPI_Finalize();           // Close off MPI
    return 0;
}

Page 24

SputnikGo()

// Phase 1: double the thread count until the time stops improving.
while (i < MAX_THREADS &&
       time[last iteration] < time[second-to-last iteration]) {
    omp_set_num_threads(i);
    time[i] = SputnikMain(argc, argv, NULL);
    i = i * 2;
}

// Phase 2: linear search starting just before the best count found above.
i = iteration before the best we found in the previous loop;
while (time[last iteration] < time[second-to-last iteration]) {
    omp_set_num_threads(i);
    time[i] = SputnikMain(argc, argv, NULL);
    i = i + 1;
}

// Final run with the optimal thread count and the saved timings.
omp_set_num_threads(optimal number);
time[i] = SputnikMain(argc, argv, bestTimes);
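The same doubling-then-refine search can be sketched as a self-contained function. This is a hedged sketch, not the actual SputnikGo code: `timeKernel` stands in for a timed kernel run at a given thread count, and `maxThreads` caps the search.

```cpp
#include <functional>
#include <map>

// Doubling phase: try 1, 2, 4, ... threads until the time stops
// improving; then linearly scan upward from just above half of the
// best doubling step. (Illustrative sketch, not SputnikGo itself.)
int findBestThreadCount(const std::function<double(int)>& timeKernel,
                        int maxThreads) {
    std::map<int, double> times;
    int best = 1;
    for (int n = 1; n <= maxThreads; n *= 2) {
        times[n] = timeKernel(n);
        if (times[n] < times[best]) best = n;
        else if (n > 1) break;          // got slower: stop doubling
    }
    // Linear refinement from the step before the best doubling result.
    for (int n = best / 2 + 1; n <= maxThreads; ++n) {
        if (!times.count(n)) times[n] = timeKernel(n);
        if (times[n] < times[best]) best = n;
        else if (n > best) break;       // past the minimum
    }
    return best;
}
```

With a convex timing curve, the doubling phase brackets the minimum and the linear phase pins it down without timing every thread count.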

Page 25

SputnikMain()

double SputnikMain(int argc, char **argv, double *SputnikTimes) {
    double start, finish;
    ...
    <declarations, initializations>
    ...
    start = MPI_Wtime();    // start timing
    kernel();               // call the kernel function
    finish = MPI_Wtime();   // finish timing
    ...
    return finish - start;
}

Page 26

Application Study

• The purpose of the experiment is to determine the effect of these optimizations.

• A kernel that solves Poisson’s equation using the Gauss-Seidel method with red-black ordering was used.

Page 27

!$OMP PARALLEL PRIVATE(jj,ii,k,j,i,jk)
      do jj = ul1+1, uh1-1, sj
       do ii = ul0+1, uh0-1, si
!$OMP DO SCHEDULE(STATIC)
        do k = ul2+1, uh2-1
         do j = jj, min(jj+sj-1,uh1-1)
          jk = mod(j+k,2)
          do i = ii+jk, min(ii+jk+si-1,uh0-1), 2
           u(i,j,k) = c *
     2      ((u(i-1,j,k) + u(i+1,j,k)) + (u(i,j-1,k) +
     3      u(i,j+1,k)) + (u(i,j,k+1) + u(i,j,k-1) -
     4      c2*rhs(i,j,k)))
          end do
         end do
        end do
!$OMP END DO
       end do
      end do
!$OMP END PARALLEL

Page 28

Image courtesy of SGI.

Page 29

Computing Hardware: SGI Origin2000s

balder.ncsa.uiuc.edu
• 256 250-MHz R10000 processors
• 128 GB main memory

aegir.ncsa.uiuc.edu
• 128 250-MHz R10000 processors
• 64 GB main memory

Page 30

Virtual Cluster

• The Sputnik API is designed for commodity clusters. None were available, so a pair of SGI Origin 2000s at NCSA was used.

• The API allows the number of OpenMP threads to be set manually.

• Different numbers of threads were used on each Origin to simulate heterogeneity.

Page 31

Predicted Time

Page 32

Chart: redblack3D, 48 threads on balder. X-axis: threads on aegir (24–48); Y-axis: time in seconds (0–70). Series: balder Original Computation, aegir Original Computation, balder New Computation, aegir New Computation, Predicted Computation.

Page 33

Chart: redblack3D speedup with 48 threads on balder. X-axis: threads on aegir (24–48); Y-axis: speedup (0–1.6). Series: Speedup, Theoretical Speedup.

Page 34

Chart: redblack3D, 32 threads on balder. X-axis: number of threads on aegir (16–32); Y-axis: time in seconds (0–100). Series: balder Original Computation, aegir Original Computation, balder New Computation, aegir New Computation, Predicted Time.

Page 35

Chart: redblack3D speedup with 32 threads on balder. X-axis: number of threads on aegir (16–32); Y-axis: speedup (0–1.6). Series: Speedup, Theoretical Speedup.

Page 36

Validation

• Results for the API indicate better than 35% improvement in the situations where balder is running twice as many threads as aegir.

• The model and the API both succeed in the goals of being easy to program and improving performance.

Page 37

Anomalies

• The code demonstrated scaling, but ran several times slower than with MPI alone (without OpenMP).

• OpenMP thread binding and memory distribution are both complex issues on the Origin 2000 and are the probable causes of the slowdown.

• The real target of the Sputnik API is a commodity cluster.

Page 38

Image courtesy of SGI.

Page 39

Future Work

• Tests on more applications and computing hardware, especially a cluster of Sun servers.

• Dynamic repartitioning for grid/metacomputing applications.

• Supporting Phenomenally Heterogeneous Clusters (PHCs), not just multicomputer-based clusters.

Page 40

Future Work

• Different types of optimizations (not just repartitioning and adjusting the number of OpenMP threads).

Page 41

Thank you to my family and friends, everyone in the UCSD Computer Science Department, the Parallel Computation Lab, SDSC, NCSA, my thesis committee, and my thesis advisor, Scott Baden.

Page 42 (closing title slide, repeating the title and contact information from Page 1)