
Dynamic Load Distributions for Adaptive Computations on MIMD Machines using Hybrid Genetic Algorithms

(a subset)

TIMOTHY H. KAISER

PhD, Computer Science - UNM

MS, Electrical Engineering (Applied Physics) - UCSD

BS, Physics - UMR

2

Introduction

• Difficult talk to put together

• Subset of full dissertation defense

• Full dissertation can be found at www.arc.unm.edu/~tkaiser

• Discuss parallel adaptive grid programs

• Dynamic load balancing - requirements for effectiveness

• Scope of research effort

• Genetic algorithms

• Sine qua nons for GA effectiveness

• Development

• Application

3

What are adaptive grid programs?

• Simulate a region of space and time using a grid of cells

• Often solving a PDE

• Weather modeling, fluid flow, aerodynamics

• Regions where higher spatial fidelity is required have finer gridding

• Would want finer gridding at a weather storm front

• Would want finer gridding at a shock front

• New regions requiring higher spatial fidelity can move or be created

• Tornado

• Explosion

• Also called AMR or adaptive mesh refinement programs

4

An example grid program

• Numerical solution to Euler’s equations

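• $$\frac{\partial U}{\partial t} + \frac{\partial F}{\partial x} + \frac{\partial G}{\partial y} = 0$$

• $$U = \begin{pmatrix} \rho \\ \rho u \\ \rho v \\ E \end{pmatrix}, \quad F = \begin{pmatrix} \rho u \\ \rho u^2 + p \\ \rho u v \\ E u + p u \end{pmatrix}, \quad G = \begin{pmatrix} \rho v \\ \rho u v \\ \rho v^2 + p \\ E v + p v \end{pmatrix}$$

• ρ = density, u = x velocity, v = y velocity, E = total energy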

• These equations model the flow of fluids and gas
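As a minimal sketch of how the state vector U and the flux vectors F and G above could be evaluated for one cell, the following Python function takes the pressure p as an input, because the slides do not state an equation of state. The function name `euler_fluxes` and the sample values are assumptions made for illustration only.

```python
def euler_fluxes(rho, u, v, E, p):
    """Return (U, F, G) for one cell of the 2D Euler equations.

    rho = density, u = x velocity, v = y velocity, E = total energy.
    p (pressure) is passed in; no equation of state is assumed here.
    """
    U = (rho, rho * u, rho * v, E)
    F = (rho * u, rho * u * u + p, rho * u * v, E * u + p * u)
    G = (rho * v, rho * u * v, rho * v * v + p, E * v + p * v)
    return U, F, G


# Illustrative values only: gas moving in the +x direction.
U, F, G = euler_fluxes(rho=1.0, u=2.0, v=0.0, E=5.0, p=1.0)
```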

5

Parallelizing grid programs for speed up

• Assign portions of the grid to various processors

• Requires: Communication between processors

• Desirable

• Good load balance

• Low communication cost

• Problem

• Getting perfect load balance and optimal communication is NP-Hard

• Various heuristics are used to “solve” this problem

6

Dynamic load balancing of adaptive grid programs

• Why is dynamic load balancing important?

• May start with balanced load and low communication

• The grid is refined in some region

• Processor holding refined region becomes heavily loaded

• Need to dynamically balance to get good performance

• Why is this difficult?

• Arbitrarily moving regions to new processors can cause high communication

• Must be done again when the grid changes

• Still NP-Hard

7

Scope of this research effort

• Needed to select a subset of all problems on which to work

• Adaptive grid program solves:

• Euler’s fluid equations

• 2 dimensions

• Lower levels of parallelism

• Up to 16 nodes connected together by a high speed network (SP1 and SP2)

• The committee felt such configurations are readily available

• Research should be applicable to other problems and architectures

• For very large problems framework would need modification

8

Thesis statement

• A genetic algorithm can be effectively used to decrease the run time of adaptive grid programs by maintaining good load balance and maintaining good communication performance.

9

Tools developed for thesis verification

• PLIFE - framework for a parallel adaptive grid or mesh program

• Darwin - framework for a parallel genetic algorithm

• Combined the two to create a parallel adaptive grid hydrodynamics simulation with dynamic load balancing and communication reduction

10

What is a Genetic Algorithm?

• A “suboptimization” system

• Find good, but maybe not optimal, solutions to difficult problems

• Often used on NP-Hard or combinatorial optimization problems

• Requirements

• Solution(s) to the problem represented as a string

• A fitness function

• Takes as input the solution string

• Outputs the desirability of the solution

• A method of combining solution strings to generate new solutions
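The three requirements can be illustrated for the cell-to-processor problem with a short sketch. Here the solution string is assumed to be a list whose entry i is the processor that owns cell i, the fitness function is a simple load imbalance (lower is better), and strings are combined with one-point crossover. These particular choices are assumptions for illustration and are not taken from the dissertation.

```python
import random
from collections import Counter


def imbalance(assignment, n_procs):
    """Fitness (lower is better): heaviest load minus the average load."""
    loads = Counter(assignment)
    heaviest = max(loads[p] for p in range(n_procs))
    return heaviest - len(assignment) / n_procs


def crossover(parent_a, parent_b, rng):
    """One-point crossover: head of one parent joined to the tail of the other."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


rng = random.Random(0)
a = [0, 0, 0, 0, 0, 0, 1, 1]   # processor 0 is overloaded
b = [0, 0, 1, 1, 1, 1, 1, 1]   # processor 1 is overloaded
child = crossover(a, b, rng)   # a new candidate assignment
```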

11

More details on Genetic Algorithms

• Find solutions to problems by Darwinian evolution

• Potential solutions are thought of as living entities in a population

• The strings are the genetic codes of the individuals

• Individuals are evaluated for their fitness

• The fittest individuals are allowed to live and “sexually” reproduce

• There may be some mutation

• Parents die and kids start the next generation
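The evolutionary cycle described above can be sketched as a small generational loop. This is a toy version under stated assumptions: the function name `evolve`, the truncation selection of the fitter half, the parameter defaults, and the example fitness `load_spread` are all illustrative and are not the Darwin framework.

```python
import random


def evolve(fitness, n_genes, n_procs, pop_size=40, generations=60,
           mutation_rate=0.02, seed=0):
    """Toy generational GA; a lower fitness value means a fitter individual."""
    rng = random.Random(seed)
    # Each individual is a string of processor numbers, one gene per cell.
    population = [[rng.randrange(n_procs) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[:pop_size // 2]      # the fittest live to reproduce
        children = []
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)       # one-point crossover
            child = mother[:cut] + father[cut:]
            for i in range(n_genes):              # occasional mutation
                if rng.random() < mutation_rate:
                    child[i] = rng.randrange(n_procs)
            children.append(child)
        population = children                     # parents die, kids carry on
    return min(population, key=fitness)


def load_spread(assignment, n_procs=4):
    """Example fitness: heaviest load minus lightest load."""
    loads = [assignment.count(p) for p in range(n_procs)]
    return max(loads) - min(loads)


best = evolve(load_spread, n_genes=32, n_procs=4)
```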

12

Use of the GA

• After cells split, GA does the assignment of cells to processors

• For a particular potential assignment of cells to processors the GA:

• Counts the communication cost of the distribution

• Measures the imbalance of the distribution

• Fitness = weight of communication * (communication function) + weight of balance * (balance function)

• GAs have not been used in the past because they are too slow
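A weighted fitness of this form might look like the sketch below. The slides do not define the communication or balance functions, so this example assumes the communication function counts pairs of neighboring cells held by different processors and the balance function is the heaviest load minus the average load. The equal weights are also an assumption.

```python
def communication_cost(owner):
    """Count pairs of 4-neighbour cells owned by different processors."""
    rows, cols = len(owner), len(owner[0])
    cost = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols and owner[r][c] != owner[r][c + 1]:
                cost += 1
            if r + 1 < rows and owner[r][c] != owner[r + 1][c]:
                cost += 1
    return cost


def imbalance(owner, n_procs):
    """Heaviest processor load minus the average load (0 when balanced)."""
    loads = [0] * n_procs
    for row in owner:
        for p in row:
            loads[p] += 1
    return max(loads) - sum(loads) / n_procs


def fitness(owner, n_procs, w_comm=0.5, w_bal=0.5):
    """Weighted sum of the two terms; lower is better."""
    return w_comm * communication_cost(owner) + w_bal * imbalance(owner, n_procs)


# Four processors, each holding one corner of a 4 x 4 grid.
grid = [[0, 0, 1, 1],
        [0, 0, 1, 1],
        [2, 2, 3, 3],
        [2, 2, 3, 3]]
```

For this corner layout the load is perfectly balanced, so the fitness comes entirely from the eight cell pairs that straddle a processor boundary.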

13

For the GA to be effective

• Run fast

• Return good solutions

• Good load balance

• Low communication

• Key finding of dissertation:

• To enable fast and good solutions observe 2 sine qua nons

• Improvements must be made to the basic GA algorithm

• Domain specific knowledge must be used

14

Simulation used for the development of the sine qua nons

• Used a four node SP1 calculation

• Initial grid is 128 x 128

• Each processor held a corner of the grid

• Exploded a bomb in the bottom left corner of the grid

• Restrictions

• One level of adaptation

• Only split cells are moved by the GA

• Cells are moved in clusters of 16 cells

15

First sine qua non - Use of domain specific knowledge

• Larger contiguous blocks cause less communication

• Clusters of cells are generated at irregular rates

• Future load can be estimated

• The most important piece of domain specific knowledge. . .

16

Cells split in the region of the shock front

[Figure: a shock wave propagating to the right. The boundary between split and unsplit cells is where communication will occur if the split cells are moved.]

17

Cells split in the region of the shock front

• Implication

• Cells behind the shock split and may have been moved by the GA

• Cells in front of shock are not yet split and not yet moved by GA

• Conflict

• To balance load GA wants to move split cells

• To maintain low communication GA wants to not move cells

• Note:

• Next time GA is called more cells will have split

• Because shock propagates, cells in front of the shock will soon split

• GA will not want to move cells if those to the left have not moved

18

Exploitation of this domain specific knowledge

• Don’t count communication between split and unsplit cells

• Short term effect:

• Increases communication time

• GA is freer to move cells to balance load

• Long term effect:

• Contiguous blocks of cells on the same processor are larger

• Dramatic decrease in communication time
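One way to express this rule in a communication count is sketched below. The grid layout, the function name, and the `ignore_split_boundary` flag are assumptions for illustration. The only idea taken from the slides is that a processor boundary lying between a split cell and an unsplit cell is not charged.

```python
def communication_cost(owner, split, ignore_split_boundary=True):
    """Count neighbouring cell pairs on different processors.

    owner[r][c] is the processor holding the cell; split[r][c] is True
    if the cell has been refined. With ignore_split_boundary set, pairs
    made of one split and one unsplit cell are not counted.
    """
    rows, cols = len(owner), len(owner[0])
    cost = 0
    for r in range(rows):
        for c in range(cols):
            for rr, cc in ((r, c + 1), (r + 1, c)):
                if rr >= rows or cc >= cols:
                    continue
                if owner[r][c] == owner[rr][cc]:
                    continue
                if ignore_split_boundary and split[r][c] != split[rr][cc]:
                    continue  # split/unsplit pair: not charged
                cost += 1
    return cost


# Shock moving right: the two left columns have split and were moved to
# processor 1, the unsplit cells ahead of the shock stay on processor 0.
owner = [[1, 1, 0, 0],
         [1, 1, 0, 0]]
split = [[True, True, False, False],
         [True, True, False, False]]
```

With the rule on, moving the split cells costs nothing in the fitness function. With it off, the same move is charged for the two pairs along the shock front.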

19

Results

• Without using this domain specific knowledge

• Communication time = 348 seconds

• Run time = 1510 seconds

• With domain specific knowledge in use

• Communication time = 228 seconds

• Run time = 1390 seconds

• Run time reduced by 2 minutes

• Show Movies

20

Second sine qua non - Improve the GA algorithm

• Run GA as a parallel application

• Developed a fast mutation methodology

• Use the Mansour-Fox algorithm

• Improve the Mansour-Fox algorithm

21

The Mansour-Fox algorithm

• Developed for static allocations

• Variable fitness function

• Initially put more weight on communication minimization

• In the end put more weight on balanced load

• About 50% - 75% through generations do the following:

• Call hill climbing routine for each member of GA population

• Greedy algorithm

• If cell has neighbors on different processor find best processor

• Increase mutation rate

• Mutate sections of gene which represent cells on boundaries
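Two of the ingredients above, the variable fitness weighting and the greedy hill climbing pass, can be sketched as follows. This is an illustration under assumptions, not the published Mansour-Fox code: the linear weight schedule, its end points, and the `cut_edges` cost function are all hypothetical.

```python
def comm_weight(generation, n_generations, w_start=0.6, w_end=0.5):
    """Move the communication weight linearly from w_start to w_end,
    so that later generations put relatively more weight on balance."""
    frac = generation / max(n_generations - 1, 1)
    return w_start + frac * (w_end - w_start)


def cut_edges(owner):
    """Example cost: neighbouring cell pairs on different processors."""
    rows, cols = len(owner), len(owner[0])
    return sum(
        (c + 1 < cols and owner[r][c] != owner[r][c + 1])
        + (r + 1 < rows and owner[r][c] != owner[r + 1][c])
        for r in range(rows) for c in range(cols))


def hill_climb(owner, cost):
    """Greedy pass: a cell with a neighbour on another processor is moved
    to whichever neighbouring processor lowers cost(owner) the most."""
    rows, cols = len(owner), len(owner[0])
    for r in range(rows):
        for c in range(cols):
            neighbours = {owner[rr][cc]
                          for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                          if 0 <= rr < rows and 0 <= cc < cols}
            current = owner[r][c]
            best_p, best_cost = current, cost(owner)
            for p in sorted(neighbours - {current}):
                owner[r][c] = p                 # try the move
                trial = cost(owner)
                if trial < best_cost:
                    best_p, best_cost = p, trial
            owner[r][c] = best_p                # keep the best choice found
    return owner


grid = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]
```

Re-evaluating the full cost for every trial move is what makes a pass like this expensive, which matches the slowness noted on the next slide.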

22

Mansour-Fox algorithm continued

• Problems:

• Slow because the hill climbing routine is slow

• Can not terminate early because of variable fitness function

23

Solution?

• Run as a parallel application

• Call the hill climbing routine every N generations with N>1

• The hope is:

• Calling hill climbing routine every N generations speeds up routine

• Calling hill climbing routine every N generations will not greatly degrade the quality of the solution

• Hope was borne out by experiment

• Improved Mansour-Fox algorithm reduced the run time of the adaptive grid simulation
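The change amounts to a different schedule for the hill climbing calls, which the following driver sketch makes concrete. The names `run_ga`, `step`, and `hill_climb`, and the choice to start hill climbing halfway through the generations, are assumptions for illustration.

```python
def run_ga(population, step, hill_climb, generations,
           start_frac=0.5, every_n=4):
    """Hypothetical GA driver.

    step(population, gen) advances the population by one generation.
    hill_climb(individual) returns a locally improved individual.
    Hill climbing only runs late in the run, and then only on every
    every_n-th generation rather than on every generation.
    """
    calls = 0
    for gen in range(generations):
        population = step(population, gen)
        late = gen >= start_frac * generations
        if late and gen % every_n == 0:
            population = [hill_climb(ind) for ind in population]
            calls += 1
    return population, calls


# Placeholder callbacks that leave the population unchanged.
def identity_step(pop, gen):
    return pop


def no_op(ind):
    return ind


_, calls_every_4 = run_ga([[0]], identity_step, no_op, generations=100, every_n=4)
_, calls_every_1 = run_ga([[0]], identity_step, no_op, generations=100, every_n=1)
```

With these settings, 100 generations trigger 12 hill climbing passes at N = 4 compared with 50 at N = 1.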

24

What was learned to this point?

• What sine qua nons must be observed to make GA effective

• Use domain specific knowledge

• Improve the algorithm of the GA

• Important specifics:

• Ignore communication between split and unsplit cells in fitness function

• Use improved Mansour-Fox algorithm

25

Application of heuristics

• Problem run on 16 nodes of SP1 and SP2

• Initial grid is 256 x 256

• Multiple levels of adaptation

• Nonuniform distribution of mass

• Nonuniform distribution of energy

[Figure: initial pressure (cycle = 0, t = 0) plotted over x and y, each from 0 to 2; color scale from 1.0 to 3.3]

26

Initial results using normal Mansour-Fox algorithm

• Parallel run time without GA = 1115 seconds

• Balance load only run time = 1240 seconds

• Mansour-Fox GA parameters:

• Population size = 320 Generations = 100

• Adjusted weighting for cost of communication

Wall time (s)  Speed up (%)  GA (s)  Return  Balance  Comm. time (s)  Weight
975            14.3          40      3.5     10.6     257             0.6-0.5
928            20.2          44      4.3     5.8      303             0.5-0.4
952            17.2          48      3.4     3.5      337             0.4-0.3
942            18.3          48      3.7     2.6      358             0.3-0.2
955            16.8          47      3.4     2.3      367             0.2-0.1

• Run time with Mansour-Fox algorithm = 928 seconds with 44 seconds GA run time
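The slides do not define the "Speed up" and "Return" columns. The tabulated values are, however, consistent to within about 0.1 (presumably rounding of the wall times) with the following inferred definitions: speed up is the time saved relative to the 1115 second run without the GA, as a percentage of the new wall time, and return is the time saved divided by the GA run time. The sketch below treats those definitions as an assumption.

```python
def speed_up_percent(baseline, wall):
    """Time saved relative to the baseline, as a percentage of the new wall time."""
    return 100.0 * (baseline - wall) / wall


def ga_return(baseline, wall, ga_time):
    """Seconds of run time saved per second spent running the GA."""
    return (baseline - wall) / ga_time


baseline = 1115  # parallel run time without the GA, in seconds
runs = [(975, 40), (928, 44), (952, 48), (942, 48), (955, 47)]  # (wall, GA) seconds
derived = [(speed_up_percent(baseline, w), ga_return(baseline, w, g))
           for w, g in runs]
```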

27

Ran using the improved Mansour-Fox algorithm

• Allow the hill climbing routine to be called less often

• Tuned the GA

• Adjust to find optimum:

• Population size

• Number of generations

• Frequency of hill climbing routine in improved Mansour-Fox algorithm

28

Results with GA population size = 480

Frequency  Size  Gen  Wall time (s)  Speed up (%)  GA (s)  Return  Balance  Comm. time (s)
1          480   100  956            16.7          60      2.6     5.5      321
1          480   150  1004           11.0          89      1.2     6.1      302
1          480   200  999            11.6          117     1.0     5.7      308
2          480   100  930            19.9          42      4.4     6.3      304
2          480   150  934            19.4          60      3.0     5.7      299
2          480   200  974            14.4          83      1.7     5.5      319
3          480   100  924            20.7          38      5.1     5.5      307
3          480   150  928            20.2          52      3.6     5.2      305
3          480   200  971            14.9          70      2.0     5.2      309
4          480   100  918            21.4          33      6.0     5.8      308
4          480   150  922            20.9          47      4.1     6.1      293
4          480   200  968            15.2          76      1.9     5.1      313
5          480   100  943            18.2          32      5.3     5.8      320
5          480   150  921            21.1          45      4.3     6.1      297
5          480   200  955            16.8          60      2.7     6.0      314

29


Results with GA population size = 320

Frequency  Size  Gen  Wall time (s)  Speed up (%)  GA (s)  Return  Balance  Comm. time (s)
1          320   100  898            24.2          42      5.2     5.9      283
1          320   150  964            15.7          61      2.5     6.5      314
1          320   200  951            17.3          81      2.0     5.6      299
2          320   100  915            21.8          30      6.8     6.4      297
2          320   150  923            20.7          42      4.5     5.2      307
2          320   200  945            18.0          56      3.0     6.2      304
3          320   100  889            25.4          27      8.4     5.7      293
3          320   150  920            21.2          37      5.3     5.9      308
3          320   200  952            17.1          50      3.3     6.7      299
4          320   100  886            25.8          24      9.6     5.0      288
4          320   150  907            22.9          34      6.2     5.0      299
4          320   200  926            20.5          44      4.3     5.5      319
5          320   100  908            22.8          23      9.2     5.9      301
5          320   150  909            22.7          32      6.4     4.9      310
5          320   200  947            17.7          44      3.8     5.1      325


30


Results with GA population size = 240

Frequency  Size  Gen  Wall time (s)  Speed up (%)  GA (s)  Return  Balance  Comm. time (s)
1          240   100  926            20.5          32      5.8     6.4      312
1          240   150  959            16.3          49      3.2     5.7      332
1          240   200  953            17.0          63      2.6     6.3      309
2          240   100  907            23.0          24      8.8     5.9      307
2          240   150  927            20.2          35      5.3     6.7      296
2          240   200  955            16.8          45      3.5     6.6      320
3          240   100  901            23.8          22      9.9     6.0      300
3          240   150  921            21.1          29      6.7     6.3      306
3          240   200  943            18.3          44      3.9     5.3      323
4          240   100  900            23.9          19      11.2    5.0      312
4          240   150  912            22.2          27      7.4     6.1      298
4          240   200  910            22.6          35      5.9     5.9      301
5          240   100  908            22.8          18      11.2    5.6      317
5          240   150  940            18.6          26      6.9     6.3      317
5          240   200  904            23.3          34      6.2     6.0      293


31

Summary of results

• GA run time was in the range 18 to 117 seconds

• Simulation time was in the range of 886 to 1004 seconds

• Communication time was in the range of 283 to 325 seconds

• Run time reduced from 1115 seconds to 886 seconds by using the improved Mansour-Fox algorithm

• Speed up = 26%

• Population size = 320

• Generations for GA =100

• Frequency of calling hill climbing routine = 4

32

Improved Mansour-Fox algorithm

• By calling the hill climbing routine less often:

• GA run time is reduced from about 117 to about 20 seconds

• The algorithm still returned a good solution.

• Best run times with frequency of 4, longest with frequency of 1

[Figure: simulation run time (seconds, axis from 900 to 980) versus frequency of hill climbing routine (every 1 to every 5 generations), with one curve each for Generations = 100, 150, and 200]

33

Use of tuning the Genetic Algorithm

• Apply to other architectures

• Apply to other problems

34

Tuning for SP1 applied to SP2

• Ran same simulation on the MHPCC SP2

• Used same GA control parameters except:

• Conjecture is that we need a smaller weighting for communication, down from 0.5-0.4

Wall time (s)  Speed up (%)  GA (s)  Return  Balance  Comm. time (s)  Weight
892            0             0       0       24.6     13              0
551            61.9          27      12.8    4.8      75              0.5-0.4
511            74.7          26      14.8    2.2      80              0.3-0.2
512            74.5          26      14.8    1.9      81              0.2-0.1
541            65.0          35      10.1    1.8      93              0.1-0.0
547            63.1          30      11.6    1.8      102             0.0-0.0

• Using tuning data from SP1 runs enabled 75% speed up on SP2

• Conjecture is shown to be true

• Showed that tuning for one machine can be applied to another

35

Did comparison to bisection methods

• The Hybrid GA with improved Mansour-Fox algorithm produces the best results

Machine  Algorithm             Run time (s)  Comm. time (s)  Balance
SP1      Best GA               886           288             5
SP1      Centroid Bisection    1014          452             4.7
SP1      Coordinate Bisection  1122          544             4.8
SP2      Best GA               511           80              2.2
SP2      Centroid Bisection    526           104             3.6
SP2      Coordinate Bisection  571           119             3.8
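The slides name the bisection methods without describing them. For orientation, a textbook form of recursive coordinate bisection is sketched below: cells are sorted along one axis, cut into two equal-count halves, and each half is cut again along the other axis. This generic version is an assumption and may differ from the implementation used in the comparison. The function name and the power-of-two restriction are illustrative.

```python
def coordinate_bisection(cells, n_parts, axis=0):
    """Split (x, y) cell centres into n_parts groups of nearly equal size.

    n_parts is assumed to be a power of two. The cut axis alternates
    between x (axis 0) and y (axis 1) at each level of recursion.
    """
    if n_parts == 1:
        return [cells]
    ordered = sorted(cells, key=lambda xy: xy[axis])
    half = len(ordered) // 2
    other = 1 - axis
    return (coordinate_bisection(ordered[:half], n_parts // 2, other)
            + coordinate_bisection(ordered[half:], n_parts // 2, other))


# A 4 x 4 grid of cell centres divided among four processors.
cells = [(x + 0.5, y + 0.5) for x in range(4) for y in range(4)]
parts = coordinate_bisection(cells, 4)
```

A method of this kind balances cell counts by construction but does not look at which cells communicate, which is one plausible reading of the higher communication times in the table.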

36

Application to a different problem on SP1

• Useful for performing trade studies

• Similar simulation

• Uniform blobs of energy and mass

• Deposited at different times

• Used best GA control parameters from the previous simulation (Frequency = 4, Size = 320, Generations = 100)

Algorithm  Wall time (s)  Speed up (%)  GA (s)  Return  Balance  Comm. time (s)
Baseline   1489           0             0       0       41.4     66
Bisection  1664           -10.5         0       0       4.4      908
GA         1309           13.8          43      4.2     7.5      539

• Enabled 14% improvement in run time

• Recursive bisection caused a slow down

37

Future directions

• Using the GA

• Different architectures

• Have done simulations of SP2 SMP nodes

• 3d problems

• Oil field studies

• Improve the adaptive grid framework

• 3d

• Polygonal cells

• Edge based scheme

38

Summary

• Developed frameworks for

• Parallel adaptive mesh program

• Parallel Genetic Algorithm

• Studied the use of Genetic Algorithm to perform dynamic allocation of grids to processors

• Discovered two sine qua nons to enable the GA to be effective

• Use domain specific knowledge

• Improve the algorithm of the GA

• Showed the hybrid GA to be effective

• Using:

• Improved Mansour-Fox algorithm

• Domain Knowledge of propagating shock waves

• Improved run time of an example calculation by 75%
