
1


NSF/DARPA OPAAL

Adaptive Parallelization Strategies using Data-driven Objects


Laxmikant Kale

First Annual Review

27-28 October 1999, Iowa City

2


Outline

Quench and solidification codes
Coarse grain parallelization of the quench code
Adaptive parallelization techniques
Dynamic variations
Adaptive load balancing
Finite element framework with adaptivity
Preliminary results

3


Coarse grain parallelization

Structure of the current sequential quench code:
—2-D array of elements (each independently refined)
—Within-row dependence
—Independent rows, but they share global variables
Parallelization using Charm++:
—3 hours of effort (after a false start)
—About 20 lines of change to the F90 code
—A 100-line Charm++ wrapper
Observations:
—Global variables that are defined and used within inner-loop iterations are easily dealt with in Charm++ (each object keeps its own copy; see the sketch below), in contrast to OpenMP
—Dynamic load balancing is possible, but was unnecessary
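To make the decomposition concrete, here is a minimal plain-C++ sketch of the idea (the names are hypothetical; the actual code is the existing F90 kernel driven by a roughly 100-line Charm++ wrapper). Each row becomes an object that owns private copies of the formerly global scratch variables, which is why those variables cause no trouble when rows are processed, and migrated, independently.

    #include <vector>

    // Hypothetical sketch: one work object per row of the 2-D element array.
    // Scratch state that was global in the sequential F90 code becomes
    // per-object data, so no privatization directives are needed.
    struct RowWorker {
      int row;                       // which row of elements this object owns
      std::vector<double> elements;  // element values for this row
      double scratch;                // formerly a global work variable

      RowWorker(int r, int nElems)
          : row(r), elements(nElems, 0.0), scratch(0.0) {}

      // Within-row dependence forces a sequential sweep inside the row.
      void advance() {
        for (std::size_t i = 1; i < elements.size(); ++i) {
          scratch = 0.5 * (elements[i - 1] + elements[i]);  // stand-in kernel
          elements[i] = scratch;
        }
      }
    };

    int main() {
      const int nRows = 64, nElemsPerRow = 128;   // assumed sizes
      std::vector<RowWorker> rows;
      for (int r = 0; r < nRows; ++r) rows.emplace_back(r, nElemsPerRow);
      // In the Charm++ version each RowWorker would be a migratable object;
      // here the rows are simply advanced one after another.
      for (RowWorker& w : rows) w.advance();
      return 0;
    }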

4


Performance results

[Figure: Speedup for Micro1D: speedup vs. number of processors.]

Contributors:

Engineering: N. Sobh, R. Haber

Computer Science: M. Bhandarkar, R. Liu, L. Kale

5


OpenMP experience

Work by: J. Hoeflinger, D. Padua, with N. Sobh, R. Haber, J. Dantzig, N. Provatas
Solidification code: parallelized using OpenMP
Relatively straightforward, after a key decision:
—Parallelize by rows only

6


OpenMP experience: Quench code on Origin2000

Privatization of variables is needed, as the outer loop was parallelized (see the sketch below)
Unexpected initial difficulties with OpenMP:
—Led initially to a large slowdown in the parallelized code
—Traced to unnecessary locking in the MATMUL intrinsic
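For illustration only, a small OpenMP-in-C++ sketch of the privatization point (the variable names are hypothetical, not taken from the quench code): when the outer row loop is parallelized, any scratch variable written inside the loop body must be declared private, otherwise the threads race on the shared copy.

    #include <cstdio>
    #include <vector>

    int main() {
      const int nRows = 256, nCols = 256;       // assumed problem size
      std::vector<double> result(nRows, 0.0);
      double work = 0.0;                        // shared by default

      // Parallelizing the outer (row) loop: 'work' must be privatized,
      // or every thread would update the same variable concurrently.
      #pragma omp parallel for private(work)
      for (int i = 0; i < nRows; ++i) {
        work = 0.0;
        for (int j = 0; j < nCols; ++j)
          work += static_cast<double>(i) * j;   // stand-in inner kernel
        result[i] = work;
      }

      std::printf("result[%d] = %g\n", nRows - 1, result[nRows - 1]);
      return 0;
    }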

[Figure: execution time in seconds vs. number of processors for the quench code on the Origin2000.]

7


Adaptive Strategies

Advanced codes model dynamic and irregular behavior
Solidification: adaptive grid refinement
Quench:
—Complex dependencies
—Parallelization within elements
To parallelize these effectively, adaptive runtime strategies are necessary

8


Multi-partition decomposition:

Idea: decompose the problem into a number of partitions, independent of the number of processors
# Partitions > # Processors
The system maps partitions to processors (a minimal mapping sketch follows)
The system should be able to map and re-map objects as needed
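A minimal sketch of the mapping idea, with assumed names and a deliberately trivial placement policy: the number of partitions is chosen independently of the processor count, the initial map is just a table, and remapping amounts to rewriting entries of that table.

    #include <cstdio>
    #include <vector>

    // Map nPartitions work units onto nProcs processors (nPartitions > nProcs).
    // A round-robin initial placement is shown; the runtime is free to
    // change any entry later when it migrates a partition.
    std::vector<int> initialMap(int nPartitions, int nProcs) {
      std::vector<int> partitionToProc(nPartitions);
      for (int p = 0; p < nPartitions; ++p)
        partitionToProc[p] = p % nProcs;
      return partitionToProc;
    }

    int main() {
      const int nPartitions = 32, nProcs = 8;   // # partitions > # processors
      std::vector<int> map = initialMap(nPartitions, nProcs);

      // "Migration" is an update of the table; the partition carries its
      // own data, so only its owning processor changes.
      map[5] = 3;                               // move partition 5 to processor 3

      for (int p = 0; p < nPartitions; ++p)
        std::printf("partition %2d -> processor %d\n", p, map[p]);
      return 0;
    }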

9

Charm++

A parallel C++ library
Supports data-driven objects:
—Singleton objects, object arrays, groups
Many objects per processor, with method execution scheduled by the availability of data (sketched below)
The system supports automatic instrumentation and object migration
Works with other paradigms: MPI, OpenMP, ...
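A plain-C++ sketch of the execution model described above (this is not the Charm++ API; the names are made up for illustration): many objects live on a processor, incoming messages are queued, and a scheduler repeatedly picks a message and runs the corresponding method on the target object, so execution is driven by the availability of data.

    #include <cstdio>
    #include <queue>
    #include <vector>

    // One object type with a method that consumes a message's payload.
    struct Worker {
      int id;
      void compute(double x) { std::printf("object %d received %g\n", id, x); }
    };

    // A message names its target object and carries data for the method.
    struct Message {
      int target;
      double payload;
    };

    int main() {
      std::vector<Worker> objects;              // many objects per processor
      for (int i = 0; i < 4; ++i) objects.push_back({i});

      std::queue<Message> messageQ;             // per-processor message queue
      messageQ.push({2, 3.14});
      messageQ.push({0, 1.0});
      messageQ.push({3, 2.5});

      // Scheduler loop: method execution happens when its data arrives.
      while (!messageQ.empty()) {
        Message m = messageQ.front();
        messageQ.pop();
        objects[m.target].compute(m.payload);
      }
      return 0;
    }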

10


Data-driven execution in Charm++

[Diagram: per-processor schedulers and message queues.]

11


Load Balancing Framework

Aimed at handling:
—Continuous (slow) load variation
—Abrupt load variation (refinement)
—Workstation clusters in multi-user mode
Measurement based (see the sketch below):
—Exploits temporal persistence of computation and communication structures
—Very accurate instrumentation (compared with estimation) is possible via Charm++/Converse
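A hedged sketch of a measurement-based strategy (a simple greedy heuristic with hypothetical names; the framework's actual strategies are more sophisticated and pluggable): each object's load measured in the last window is taken as its predicted future load, relying on temporal persistence, and objects are then assigned greedily to the currently least-loaded processor.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct ObjRecord {
      int id;
      double measuredLoad;   // seconds observed in the last measurement window
    };

    // Greedy rebalance: place heavy objects first, each on the processor
    // with the smallest accumulated load so far.
    std::vector<int> rebalance(std::vector<ObjRecord> objs, int nProcs) {
      std::sort(objs.begin(), objs.end(),
                [](const ObjRecord& a, const ObjRecord& b) {
                  return a.measuredLoad > b.measuredLoad;
                });
      std::vector<double> procLoad(nProcs, 0.0);
      std::vector<int> assignment(objs.size(), 0);
      for (const ObjRecord& o : objs) {
        int best = static_cast<int>(
            std::min_element(procLoad.begin(), procLoad.end()) - procLoad.begin());
        assignment[o.id] = best;
        procLoad[best] += o.measuredLoad;
      }
      return assignment;
    }

    int main() {
      // Measured loads for six objects, to be spread over two processors.
      std::vector<ObjRecord> objs = {{0, 0.9}, {1, 0.2}, {2, 0.5},
                                     {3, 0.4}, {4, 0.8}, {5, 0.1}};
      std::vector<int> where = rebalance(objs, 2);
      for (std::size_t i = 0; i < where.size(); ++i)
        std::printf("object %zu -> processor %d\n", i, where[i]);
      return 0;
    }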

12


Object balancing framework

13


Utility of the framework: workstation clusters

Cluster of 8 machines; one machine gets another job
The parallel job slows down on all machines
Using the framework:
—Detection mechanism
—Migrate objects away from the overloaded processor
—Restored almost the original throughput!

14


Performance on timeshared clusters

Another user logged on at about 28 seconds into a parallel run on 8 workstations. Throughput dipped from 10 steps per second to 7. The load balancer intervened at 35 seconds, and restored throughput to almost its initial value.

15


Utility of the framework: intrinsic load imbalance

To test the abilities of the framework:
—A simple problem: Gauss-Jacobi iterations (sketched below)
—Refine selected sub-domains
ConSpector: a web-based tool
—Submit parallel jobs
—Monitor performance and application behavior
—Interact with running jobs via GUI interfaces
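A sketch of the shape of the synthetic benchmark, with assumed details (1-D domain, chunk sizes, and refinement factor chosen for illustration): the domain is split into chunks, Jacobi relaxation is applied chunk by chunk, and refining one chunk multiplies its cell count and hence its per-step cost, which is exactly the kind of imbalance the framework is asked to remove.

    #include <cstdio>
    #include <vector>

    int main() {
      const int nChunks = 8, cellsPerChunk = 1000, steps = 50;

      // Refinement factor per chunk: chunk 3 is "interactively refined"
      // to 4x the cells, so it does roughly 4x the work per step.
      std::vector<int> refinement(nChunks, 1);
      refinement[3] = 4;

      std::vector<std::vector<double>> chunk;
      for (int c = 0; c < nChunks; ++c)
        chunk.emplace_back(cellsPerChunk * refinement[c], 1.0);
      chunk[0].front() = 0.0;                  // fixed boundary value
      chunk[nChunks - 1].back() = 2.0;         // fixed boundary value

      // Jacobi relaxation, chunk by chunk (chunk boundaries are ignored
      // here for brevity; a real benchmark would exchange ghost cells).
      for (int s = 0; s < steps; ++s) {
        for (int c = 0; c < nChunks; ++c) {
          std::vector<double>& u = chunk[c];
          std::vector<double> next = u;
          for (std::size_t i = 1; i + 1 < u.size(); ++i)
            next[i] = 0.5 * (u[i - 1] + u[i + 1]);
          u = next;
        }
      }
      std::printf("chunk 3 has %zu cells; the others have %d\n",
                  chunk[3].size(), cellsPerChunk);
      return 0;
    }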

16


AppSpector view of the load balancer on the synthetic Jacobi relaxation benchmark. Imbalance is introduced by interactively refining a subset of cells around 9 seconds. The resultant load imbalance brings the utilization down to 80% from the peak of 96%. The load balancer kicks in around t = 16, and restores utilization to around 94%.

17


Using the Load Balancing Framework

[Diagram: Charm++ and Converse with the load database and balancer; MPI-on-Charm (Irecv+), automatic conversion from MPI, and the FEM and structured frameworks with cross-module interpolation; labeled migration path and framework path.]

18


Example application

Crack propagation (P. Geubelle et al.)
—Similar in structure to the Quench components
—1900 lines of F90
Rewritten using the FEM framework in C++:
—1200 lines of C++ code
—Framework: 500 lines of code, reused by all applications
Parallelization is handled completely by the framework (see the sketch below)
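To make the division of labor concrete, here is a hedged plain-C++ sketch of the kind of serial per-chunk code an application supplies (the names and structure are invented for illustration and are not the FEM framework's actual interface): the application writes loops over its local elements and nodes, while mesh partitioning, communication of contributions on shared nodes, and object migration are left to the framework.

    #include <algorithm>
    #include <vector>

    // Hypothetical per-chunk data: a chunk sees only its local elements
    // and nodes; anything shared with other chunks is the framework's job.
    struct Element { int n0, n1, n2; };   // a triangle's three node indices

    struct Chunk {
      std::vector<Element> elements;      // elements local to this chunk
      std::vector<double> nodeForce;      // one value per local node

      void computeStep() {
        std::fill(nodeForce.begin(), nodeForce.end(), 0.0);
        for (const Element& e : elements) {
          double f = 1.0;                 // stand-in for the element integral
          nodeForce[e.n0] += f;
          nodeForce[e.n1] += f;
          nodeForce[e.n2] += f;
        }
        // Contributions on nodes shared between chunks would be combined
        // by the framework after this loop.
      }
    };

    int main() {
      Chunk c;
      c.nodeForce.assign(4, 0.0);
      c.elements = {{0, 1, 2}, {1, 2, 3}};   // two triangles, four nodes
      c.computeStep();
      return 0;
    }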

19


Crack Propagation

Decomposition into 16 chunks (left) and 128 chunks, 8 per PE (right). The middle area contains cohesive elements. Both decompositions were obtained using Metis. Pictures: S. Breitenfeld and P. Geubelle

20


“Overhead” of multi-partition method

[Figure: measurements plotted against the number of partitions (1 to 1000).]

21


Overhead study on 8 processors

[Figure: execution time on 8 processors vs. number of chunks per processor (1 to 100).]

When running on 8 processors, using multiple partitions per processor is also beneficial, due to cache behavior.

22


Cross-approach comparison

[Figure: performance comparison across approaches: execution time in seconds vs. number of partitions (1 to 128) for the original MPI-F90 code, the Charm++ framework version (all C++), and the F90 + Charm++ library version.]

23


Load balancer in action

24


Summary and Planned Research

Use the adaptive FEM framework:
—To parallelize the Quench code further
—Quad-tree based solidification code:
—First phase: parallelize each phase separately
—Parallelize across refinement phases
Refine the FEM framework:
—Use feedback from applications
—Support for implicit solvers and multigrid