Programming Models for Blue Gene/L: Charm++, AMPI and Applications. Laxmikant Kale, Parallel Programming Laboratory, Dept. of Computer Science, University of Illinois at Urbana-Champaign. http://charm.cs.uiuc.edu


Page 1: blueGeneLTahoeAug2002.ppt

Programming Models for Blue Gene/L :

Charm++, AMPI and Applications

Laxmikant Kale
Parallel Programming Laboratory

Dept. of Computer Science

University of Illinois at Urbana-Champaign

http://charm.cs.uiuc.edu

Page 2: blueGeneLTahoeAug2002.ppt

Outline

• Scaling to BG/L
  – Communication
  – Mapping, reliability/FT
  – Critical paths & load imbalance
• The virtualization model
  – Basic ideas
  – Charm++ and AMPI
  – Virtualization, a magic bullet:
    • logical decomposition, software engineering, flexible mapping
  – Message-driven execution
  – Principle of persistence
  – Runtime optimizations
• BG/L programming development environment
  – Emulation setup
  – Simulation and performance prediction
    • Timestamp correction
    • Sdag and determinacy
  – Applications using BG/C, BG/L
    • NAMD on Lemieux
    • LeanMD
    • 3D FFT
• Ongoing research:
  – Load balancing
  – Communication optimization
  – Other models: Converse
  – Compiler support

Page 3: blueGeneLTahoeAug2002.ppt

Technical Approach

• Seek optimal division of labor between "system" and programmer:

[Figure: a specialization-to-automation spectrum across the levels of decomposition, mapping, scheduling, and expression, with MPI near the specialization end, HPF near the automation end, and Charm++ in between]

• Decomposition done by programmer, everything else automated

Page 4: blueGeneLTahoeAug2002.ppt

Object-based Decomposition

• Idea:
  – Divide the computation into a large number of pieces
    • Independent of the number of processors
    • Typically larger than the number of processors
  – Let the system map objects to processors
• Old idea? Fox ('86?), DRMS, ...
• Our approach is "virtualization++"
  – Language and runtime support for virtualization
  – Exploitation of virtualization to the hilt

Page 5: blueGeneLTahoeAug2002.ppt

Object-based Parallelization

[Figure: user view -- a collection of interacting objects; system implementation -- the same objects mapped onto processors]

User is only concerned with interaction between objects

Page 6: blueGeneLTahoeAug2002.ppt

Realizations: Charm++

• Charm++:
  – Parallel C++ with data-driven objects (chares)
  – Object arrays / object collections
  – Object groups:
    • a global object with a "representative" on each PE
  – Asynchronous method invocation
    • Prioritized scheduling
  – Information-sharing abstractions: readonly, tables, ...
  – Mature, robust, portable (http://charm.cs.uiuc.edu)

(A minimal interface sketch follows.)
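As a concrete illustration, here is a minimal sketch of what such declarations look like in a Charm++ interface (.ci) file; the module and class names are hypothetical, chosen only for this example:

  // hello.ci -- declares a main chare and a 1D chare array with an
  // asynchronous entry method; the Charm++ translator generates proxies from this.
  mainmodule hello {
    mainchare Main {
      entry Main(CkArgMsg *m);
    };
    array [1D] Hello {
      entry Hello();
      entry void sayHi(int from);   // invoked asynchronously; no matching "receive"
    };
  };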

Page 7: blueGeneLTahoeAug2002.ppt

Chare Arrays

• Elements are data-driven objects
• Elements are indexed by a user-defined data type --
  [sparse] 1D, 2D, 3D, tree, ...
• Send messages to an index; receive messages at the element.
  Reductions and broadcasts across the array
• Dynamic insertion, deletion, migration -- and everything still has to work!

(A usage sketch follows.)
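A matching C++ sketch (names are the hypothetical ones from the interface sketch above; the proxy calls and contribute() are standard Charm++ API) showing an indexed send, a broadcast, and a reduction:

  #include "hello.decl.h"   // generated from hello.ci by the Charm++ translator

  class Hello : public CBase_Hello {
  public:
    Hello() {}
    Hello(CkMigrateMessage *m) {}          // constructor used when an element migrates
    void sayHi(int from) {
      // ... element thisIndex does its piece of work, then joins a reduction
      int one = 1;
      contribute(sizeof(int), &one, CkReduction::sum_int);  // result goes to a reduction client
    }
  };

  // Somewhere in the main chare:
  //   CProxy_Hello arr = CProxy_Hello::ckNew(1000); // 1000 elements, on any number of PEs
  //   arr[42].sayHi(0);   // asynchronous message to one element, by index
  //   arr.sayHi(0);       // broadcast to all elements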

Page 8: blueGeneLTahoeAug2002.ppt

Object Arrays

• A collection of data-driven objects (aka chares),
  – with a single global name for the collection, and
  – each member addressed by an index
  – Mapping of element objects to processors is handled by the system

[Figure: user's view -- A[0] A[1] A[2] A[3] A[..]]

Page 9: blueGeneLTahoeAug2002.ppt

Object Arrays

• A collection of chares,
  – with a single global name for the collection, and
  – each member addressed by an index
  – Mapping of element objects to processors is handled by the system

[Figure: user's view -- A[0] A[1] A[2] A[3] A[..]; system view -- elements such as A[0] and A[3] placed on particular processors]


Page 11: blueGeneLTahoeAug2002.ppt

Comparison with MPI

• Advantage: Charm++
  – Modules/abstractions are centered on application data structures, not processors
  – Several others...
• Advantage: MPI
  – Highly popular, widely available, industry standard
  – "Anthropomorphic" view of the processor
    • Many developers find this intuitive
• But mostly:
  – There is no hope of weaning people away from MPI
  – There is no need to choose between them!

Page 12: blueGeneLTahoeAug2002.ppt

Adaptive MPI

• A migration path for legacy MPI codes
  – Gives them the dynamic load-balancing capabilities of Charm++
• AMPI = MPI + dynamic load balancing
• Uses Charm++ object arrays and migratable threads
• Minimal modifications to convert existing MPI programs (see the sketch below)
  – Automated via AMPizer
    • Based on the Polaris compiler framework
• Bindings for
  – C, C++, and Fortran90

Page 13: blueGeneLTahoeAug2002.ppt


AMPI:

7 MPI processes

Page 14: blueGeneLTahoeAug2002.ppt


AMPI:

Real Processors

7 MPI “processes”

Implemented as virtual processors (user-level migratable threads)

Page 15: blueGeneLTahoeAug2002.ppt

II: Consequences of Virtualization

• Better software engineering
• Message-driven execution
• Flexible and dynamic mapping to processors

Page 16: blueGeneLTahoeAug2002.ppt

Modularization

• Logical units decoupled from the "number of processors"
  – E.g. oct-tree nodes for particle data
  – No artificial restriction on the number of processors
    • (such as a cube of a power of 2)
• Modularity:
  – Software engineering: cohesion and coupling
  – MPI's "are on the same processor" is a bad coupling principle
  – Objects liberate you from that:
    • E.g. solid and fluid modules in a rocket simulation

Page 17: blueGeneLTahoeAug2002.ppt

Rocket Simulation

• Large collaboration headed by Mike Heath
  – DOE-supported center
• Challenge:
  – Multi-component code, with modules from independent researchers
  – MPI was the common base
• AMPI: new wine in an old bottle
  – Easier to convert
  – Can still run the original codes on MPI, unchanged

Page 18: blueGeneLTahoeAug2002.ppt

Rocket simulation via virtual processors

[Figure: many Rocflo, Rocface, and Rocsolid virtual processors, decoupled from the physical processors they are mapped to]

Page 19: blueGeneLTahoeAug2002.ppt

AMPI and Roc*: Communication

[Figure: communication among the Rocflo, Rocface, and Rocsolid virtual processors]

Page 20: blueGeneLTahoeAug2002.ppt

Message Driven Execution

[Figure: two processors, each with a scheduler and a message queue; the scheduler picks the next message and invokes the corresponding object]

Page 21: blueGeneLTahoeAug2002.ppt

Adaptive Overlap via Data-driven Objects

• Problem:
  – Processors wait too long at "receive" statements
• Routine communication optimizations in MPI
  – Move sends up and receives down
  – Sometimes: use irecvs, but be careful
• With data-driven objects
  – Adaptive overlap of computation and communication
  – No object or thread holds up the processor
  – No need to guess which message is likely to arrive first (see the sketch below)
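A minimal sketch of that last point, using a hypothetical chare class (its entry methods would also be declared in a .ci file): whichever ghost message arrives first is processed first, and the scheduler runs other objects in the meantime.

  class Cell : public CBase_Cell {
    bool gotLeft = false, gotRight = false;
  public:
    // Entry methods: invoked by the scheduler when the corresponding message arrives.
    void leftGhost(int n, double *data)  { /* copy into halo */ gotLeft = true;  tryCompute(); }
    void rightGhost(int n, double *data) { /* copy into halo */ gotRight = true; tryCompute(); }
  private:
    void tryCompute() {
      if (gotLeft && gotRight) {        // compute only once all dependences are met
        // ... stencil update for this object ...
        gotLeft = gotRight = false;     // ready for the next iteration
      }
    }
  };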

Page 22: blueGeneLTahoeAug2002.ppt


Adaptive overlap and modules

SPMD and Message-Driven Modules (From A. Gursoy, Simplified expression of message-driven programs and quantification of their impact on performance, Ph.D Thesis, Apr 1994.)

Page 23: blueGeneLTahoeAug2002.ppt

Handling OS Jitter via MDE

• MDE encourages asynchrony
  – Asynchronous reductions, for example
  – Only data dependence should force synchronization
• One benefit:
  – Consider an algorithm with N steps
    • Each step i has a different load balance: Tij is the time for step i on processor j
    • Loose dependence between steps (on neighbors, for example)
  – Sum-of-max (MPI) vs max-of-sum (MDE), as spelled out below
• OS jitter:
  – Causes random processors to add delays in each step
  – Handled automatically by MDE
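To spell out the sum-of-max vs. max-of-sum point (notation mine, not from the slide): with Tij the time processor j spends in step i, lockstep (MPI-style) execution waits for the slowest processor at every step, while message-driven execution only needs roughly the slowest total:

\[
  T_{\text{lockstep}} \;=\; \sum_{i=1}^{N} \max_{j} T_{ij}
  \;\;\ge\;\;
  \max_{j} \sum_{i=1}^{N} T_{ij} \;\approx\; T_{\text{MDE}} ,
\]

so random per-step delays (OS jitter) inflate the left-hand side far more than the right.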

Page 24: blueGeneLTahoeAug2002.ppt

Virtualization/MDE leads to predictability

• Ability to predict:
  – which data is going to be needed, and
  – which code will execute
  – based on the ready queue of object method invocations
• So, we can:
  – Prefetch data accurately
  – Prefetch code if needed
  – Out-of-core execution
  – Caches vs controllable SRAM

[Figure: scheduler and message queue on each processor]

Page 25: blueGeneLTahoeAug2002.ppt

Flexible Dynamic Mapping to Processors

• The system can migrate objects between processors
  – Vacate a workstation used by a parallel program
  – Deal with extraneous loads on shared workstations
  – Shrink and expand the set of processors used by an app
    • Adaptive job scheduling
    • Better system utilization
  – Adapt to speed differences between processors
    • E.g. a cluster with 500 MHz and 1 GHz processors
• Automatic checkpointing
  – Checkpointing = migrate to disk!
  – Restart on a different number of processors

Page 26: blueGeneLTahoeAug2002.ppt

Principle of Persistence

• Once the application is expressed in terms of interacting objects:
  – Object communication patterns and computational loads tend to persist over time
  – In spite of dynamic behavior
    • Abrupt and large, but infrequent changes (e.g. AMR)
    • Slow and small changes (e.g. particle migration)
• Parallel analog of the principle of locality
  – A heuristic that holds for most CSE applications
  – Learning / adaptive algorithms
  – Adaptive communication libraries
  – Measurement-based load balancing

Page 27: blueGeneLTahoeAug2002.ppt

Measurement Based Load Balancing

• Based on the principle of persistence
• Runtime instrumentation
  – Measures communication volume and computation time
• Measurement-based load balancers
  – Use the instrumented database periodically to make new decisions
  – Many alternative strategies can use the database
    • Centralized vs distributed
    • Greedy improvements vs complete reassignments
    • Taking communication into account
    • Taking dependences into account (more complex)

(A sketch of what migratability asks of an object follows.)
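For the balancer to move an object, the object has to be migratable. A minimal sketch with a hypothetical Patch class; pup(), usesAtSync, AtSync() and ResumeFromSync() are the standard Charm++ mechanisms as I understand them:

  #include <vector>
  // (plus "pup_stl.h" so that operator| works on std::vector)

  class Patch : public CBase_Patch {
    int nAtoms;
    std::vector<double> coords;
  public:
    Patch() { usesAtSync = true; }        // opt in to measurement-based balancing
    Patch(CkMigrateMessage *m) {}         // migration constructor
    void pup(PUP::er &p) {                // serialize state for migration/checkpoint
      CBase_Patch::pup(p);
      p | nAtoms;
      p | coords;
    }
    void endOfTimestep() { AtSync(); }    // let the balancer run between steps
    void ResumeFromSync() { /* continue the next step, possibly on a new PE */ }
  };

Note that checkpointing "for free" (migrate to disk, restart on a different processor count) reuses exactly this pup() routine.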

Page 28: blueGeneLTahoeAug2002.ppt

Load balancer in action

[Chart: "Automatic Load Balancing in Crack Propagation" -- number of iterations per second vs. iteration number (1-91), annotated: 1. elements added, 2. load balancer invoked, 3. chunks migrated]

Page 29: blueGeneLTahoeAug2002.ppt

"Overhead" of Virtualization

[Chart: time (seconds) per iteration vs. number of chunks per processor, from 1 to 2048]

Page 30: blueGeneLTahoeAug2002.ppt

Optimizing for Communication Patterns

• The parallel-objects runtime system can observe, instrument, and measure communication patterns
  – Communication is from/to objects, not processors
  – Load balancers use this to optimize object placement
  – Communication libraries can optimize
    • by substituting the most suitable algorithm for each operation
• Learning at runtime
  – E.g. each-to-all individualized sends
    • Performance depends on many runtime characteristics
    • The library switches between different algorithms (see the sketch below)

V. Krishnan, MS Thesis, 1996
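The Charm++ communication library itself is not shown on the slide; the following is only an illustrative sketch of the switching idea, with made-up strategy names and thresholds:

  #include <cstddef>

  enum class A2AStrategy { Direct, Mesh2D, Hypercube };

  // Pick an all-to-all algorithm at runtime from observed message size and PE count.
  A2AStrategy chooseStrategy(int nProcs, std::size_t bytesPerMsg) {
    if (bytesPerMsg > 64 * 1024) return A2AStrategy::Direct;     // large messages: send directly
    if (nProcs > 512)            return A2AStrategy::Hypercube;  // many PEs, tiny messages: combine
    return A2AStrategy::Mesh2D;                                  // medium: two-phase mesh routing
  }

Because the runtime already instruments message sizes per object, such a decision can be revisited as the run proceeds.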

Page 31: blueGeneLTahoeAug2002.ppt

Example: All to all on Lemieux

[Chart: time (ms) vs. number of processors (0-2500), comparing MPI with the Converse-based implementation]

Page 32: blueGeneLTahoeAug2002.ppt

The Other Side: Pipelining

• A sends a large message to B, whereupon B computes
  – Problem: B is idle for a long time while the message gets there
  – Solution: pipelining
    • Send the message in multiple pieces, triggering a computation on each
• Objects make this easy to do (see the sketch below)
• Example:
  – Ab initio computations using the Car-Parrinello method
  – Multiple 3D FFT kernels

Recent collaboration with: R. Car, M. Klein, G. Martyna, M. Tuckerman, N. Nystrom, J. Torrellas
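An illustrative sketch of the pipelining idea with hypothetical chare classes (the entry method would be declared in a .ci file): instead of one large message, the sender sends k planes and the receiver computes on each as it arrives.

  class Sender : public CBase_Sender {
  public:
    void sendPlanes(CProxy_Receiver b, double *grid, int nPlanes, int planeSize) {
      for (int p = 0; p < nPlanes; p++)
        b.receivePlane(p, planeSize, grid + p * planeSize);   // k small asynchronous messages
    }
  };

  class Receiver : public CBase_Receiver {
    int planesDone = 0, expectedPlanes = 64;
  public:
    void receivePlane(int p, int n, double *plane) {
      // e.g. one 1D FFT pass over this plane, overlapped with the remaining sends
      if (++planesDone == expectedPlanes) { /* finish the 3D FFT transpose */ }
    }
  };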


Page 34: blueGeneLTahoeAug2002.ppt

Effect of Pipelining

[Chart: time (seconds) vs. objects per processor (0-45), for multiple concurrent 3D FFTs on 64 processors of Lemieux]

V. Ramkumar (PPL)

Page 35: blueGeneLTahoeAug2002.ppt

Control Points: learning and tuning

• The RTS can automatically optimize the degree of pipelining
  – if it is given a control point (knob) to tune
  – by the application

Controlling pipelining between a pair of objects: S. Krishnan, PhD Thesis, 1994

Controlling degree of virtualization: Orchestration Framework, M. Bhandarkar, PhD thesis, 2002

Page 36: blueGeneLTahoeAug2002.ppt

So, What Are We Doing About It?

• How do we develop a programming environment for a machine that isn't built yet?
• Blue Gene/C emulator using Charm++
  – Completed last year
  – Implements the low-level BG/C API
    • Packet sends, extract packets from communication buffers
  – Emulation runs on machines with hundreds of "normal" processors
• Charm++ on the Blue Gene/C emulator


Page 38: blueGeneLTahoeAug2002.ppt

Structure of the Emulators

[Figure: emulator software stacks -- the BG/C low-level API implemented over Charm++ and Converse, with Charm++ in turn layered on the emulated BG/C low-level API]


Page 40: blueGeneLTahoeAug2002.ppt

Emulation on a Parallel Machine

[Figure: each simulating (host) processor holds many emulated BG/C nodes, each with multiple hardware threads]


Page 42: blueGeneLTahoeAug2002.ppt

Extensions to Charm++ for BG/C

• Microtasks:
  – Objects may fire microtasks that can be executed by any thread on the same node
  – Increases parallelism
  – Overhead: sub-microsecond
• Issue:
  – Object affinity: map to a thread or to a node?
    • Thread, currently
    • Microtasks alleviate the load-balancing problem within a node


Page 44: blueGeneLTahoeAug2002.ppt

Emulation efficiency

• How much time does it take to run an emulation?
  – 8 million processors being emulated on about 100 physical processors
  – In addition, lower cache performance
  – Lots of tiny messages
• On a Linux cluster:
  – Emulation shows good speedup


Page 46: blueGeneLTahoeAug2002.ppt

Emulation efficiency

[Chart: "Emulation Time on Linux Cluster" -- emulation time (secs) vs. number of processors (up to 64), for 1000 BG/C nodes (10x10x10), each with 200 threads (200,000 user-level threads in total)]

But data is preliminary, based on one simulation


Page 48: blueGeneLTahoeAug2002.ppt

Emulator to Simulator

• Step 1: Coarse-grained simulation
  – Simulation: performance prediction capability
  – Models contention for the processor/thread
  – Also models communication delay based on distance
  – Doesn't model memory accesses on the chip, or the network
  – How to do this in spite of out-of-order message delivery?
    • Rely on determinism of Charm++ programs
    • Time-stamped messages and threads
    • Parallel time-stamp correction algorithm (see the sketch below)
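A hedged sketch of the detection step behind timestamp correction; the data structure and names here are hypothetical, not the PPL implementation:

  #include <vector>

  // One entry per message already "executed" on a simulated thread's timeline.
  struct LogEntry { double recvTime; int msgId; };

  // A message is "late" if its receive timestamp precedes work already logged;
  // in that case the tail of the log must be recomputed and correction messages
  // sent to everything that was spawned from the affected entries.
  bool arrivesOutOfOrder(const std::vector<LogEntry> &log, double recvTime) {
    return !log.empty() && recvTime < log.back().recvTime;
  }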


Page 50: blueGeneLTahoeAug2002.ppt

Emulator to Simulator

• Step 2: Add fine-grained processor simulation
  – Sarita Adve: RSIM-based simulation of a node
    • SMP node simulation: completed
  – Also: simulation of the interconnection network
  – Millions of thread units/caches to simulate in detail?
• Step 3: Hybrid simulation
  – Instead: use detailed simulation to build a model
  – Drive the coarse simulation using model behavior
  – Further help from compiler and RTS


Page 52: blueGeneLTahoeAug2002.ppt

Applications on the current system

• Using BG Charm++
• LeanMD:
  – Research-quality molecular dynamics
  – Version 0: only electrostatics + van der Waals
• Simple AMR kernel
  – Adaptive tree to generate millions of objects
    • Each holding a 3D array
  – Communication with "neighbors"
    • The tree makes it harder to find neighbors, but Charm makes it easy

Page 53: blueGeneLTahoeAug2002.ppt

Modeling layers

[Figure: layered stack -- applications, libraries/RTS, chip architecture, network model]

For each layer: we need a detailed simulation and a simpler (e.g. table-driven) "model", and methods for combining them


Page 55: blueGeneLTahoeAug2002.ppt

Timestamp correction

• Basic execution:
  – Timestamped messages
• Correction needed when:
  – A message arrives with an earlier timestamp than messages already "processed"
• Cases:
  – Messages to handlers or simple objects
  – MPI-style threads, without wildcards or irecvs
  – Charm++ with dependences expressed via structured dagger (Sdag)

Page 56: blueGeneLTahoeAug2002.ppt

Timestamps Correction

[Figure: messages M1-M7 on the execution timeline, ordered by receive time; message M8 has not yet been placed]

Page 57: blueGeneLTahoeAug2002.ppt

Timestamps Correction

[Figure: M8 inserted into the execution timeline at the position given by its receive time]

Page 58: blueGeneLTahoeAug2002.ppt

Timestamps Correction

[Figure: M8 arrives late; the execution timeline is adjusted and a correction message is sent to the dependents of the affected messages]

Page 59: blueGeneLTahoeAug2002.ppt

Timestamps Correction

[Figure: a correction message for M4 arrives; M4 is re-placed on the execution timeline, the receive times of later messages (M5, M6, ...) are adjusted, and further correction messages are propagated]

Page 60: blueGeneLTahoeAug2002.ppt

Performance of the Correction Algorithm

• Without correction
  – 15 seconds to emulate an 18 ms timestep
  – 10x10x10 nodes with k threads each (200?)
• With correction
  – Version 1: 42 minutes per step!
  – Version 2:
    • "Chase" and correct messages still in queues
    • Optimize the search for messages in the log data
    • Currently at 30 seconds per step

Page 61: blueGeneLTahoeAug2002.ppt

Applications on the current system

• Using BG Charm++
• LeanMD:
  – Research-quality molecular dynamics
  – Version 0: only electrostatics + van der Waals
• Simple AMR kernel
  – Adaptive tree to generate millions of objects
    • Each holding a 3D array
  – Communication with "neighbors"
    • The tree makes it harder to find neighbors, but Charm makes it easy

Page 62: blueGeneLTahoeAug2002.ppt

Example: Molecular Dynamics in NAMD

• Collection of [charged] atoms, with bonds
  – Newtonian mechanics
  – Thousands of atoms (1,000 - 500,000)
  – 1 femtosecond time-step, millions needed!
• At each time-step
  – Calculate forces on each atom
    • Bonded
    • Non-bonded: electrostatic and van der Waals
  – Calculate velocities and advance positions (a sketch follows below)
  – Multiple time stepping: PME (3D FFT) every 4 steps

Collaboration with K. Schulten, R. Skeel, and coworkers
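A minimal, self-contained sketch of one such timestep (not NAMD code; simplified all-pairs electrostatics with constants folded in, bonded terms omitted):

  #include <vector>
  #include <cmath>

  struct Atom { double x[3], v[3], f[3], mass, charge; };

  void timestep(std::vector<Atom> &atoms, double dt, double cutoff) {
    for (auto &a : atoms) a.f[0] = a.f[1] = a.f[2] = 0.0;

    // Non-bonded forces: pairwise Coulomb within the cutoff radius.
    for (std::size_t i = 0; i < atoms.size(); i++)
      for (std::size_t j = i + 1; j < atoms.size(); j++) {
        double d[3], r2 = 0.0;
        for (int k = 0; k < 3; k++) { d[k] = atoms[i].x[k] - atoms[j].x[k]; r2 += d[k] * d[k]; }
        if (r2 > cutoff * cutoff) continue;                 // faraway charges ignored
        double r = std::sqrt(r2);
        double fmag = atoms[i].charge * atoms[j].charge / (r2 * r);  // q1*q2/r^2, along d/r
        for (int k = 0; k < 3; k++) { atoms[i].f[k] += fmag * d[k]; atoms[j].f[k] -= fmag * d[k]; }
      }

    // Advance velocities and positions (simple explicit integration).
    for (auto &a : atoms)
      for (int k = 0; k < 3; k++) {
        a.v[k] += dt * a.f[k] / a.mass;
        a.x[k] += dt * a.v[k];
      }
  }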

Page 63: blueGeneLTahoeAug2002.ppt

NAMD: Molecular Dynamics

• Collection of [charged] atoms, with bonds
• Newtonian mechanics
• At each time-step
  – Calculate forces on each atom
    • Bonded
    • Non-bonded: electrostatic and van der Waals
  – Calculate velocities and advance positions
• 1 femtosecond time-step, millions needed!
• Thousands of atoms (1,000 - 100,000)

Collaboration with K. Schulten, R. Skeel, and coworkers

Page 64: blueGeneLTahoeAug2002.ppt

Further MD

• Use of a cut-off radius to reduce work
  – 8 - 14 Å
  – Faraway charges ignored!
• 80-95% of the work is non-bonded force computation
• Some simulations need faraway contributions

Page 65: blueGeneLTahoeAug2002.ppt

Scalability

• The program should scale up to use a large number of processors
  – But what does that mean?
• An individual simulation isn't truly scalable
• Better definition of scalability:
  – If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size

Page 66: blueGeneLTahoeAug2002.ppt

Isoefficiency

• Quantify scalability
• How much increase in problem size is needed to retain the same efficiency on a larger machine?
• Efficiency: Seq. Time / (P · Parallel Time)
  – Parallel time = computation + communication + idle

(Spelled out below.)
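Writing the same definition as a formula, with N the problem size and P the number of processors:

\[
  E(N,P) \;=\; \frac{T_{\text{seq}}(N)}{P \cdot T_{\text{par}}(N,P)},
  \qquad
  T_{\text{par}} \;=\; T_{\text{comp}} + T_{\text{comm}} + T_{\text{idle}} .
\]

Isoefficiency then asks how fast N must grow with P so that E(N,P) stays constant.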

Page 67: blueGeneLTahoeAug2002.ppt

Traditional Approaches

• Replicated data:
  – All atom coordinates stored on each processor
  – Non-bonded forces distributed evenly
  – Analysis: assume N atoms, P processors
    • Computation: O(N/P)
    • Communication: O(N log P)
    • Communication/computation ratio: O(P log P) (derivation below)
    • The fraction of communication increases with the number of processors, independent of problem size!
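Making the quoted ratio explicit:

\[
  \frac{T_{\text{comm}}}{T_{\text{comp}}}
  \;=\; \frac{O(N \log P)}{O(N/P)}
  \;=\; O(P \log P) .
\]

Because the ratio grows with P independently of N, efficiency cannot be recovered by enlarging the problem, which is why replicated data does not scale.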

Page 68: blueGeneLTahoeAug2002.ppt

Atom Decomposition

• Partition the atoms array across processors
  – Nearby atoms may not be on the same processor
  – Communication: O(N) per processor
  – Communication/computation: O(P)

Page 69: blueGeneLTahoeAug2002.ppt

Force Decomposition

• Distribute the force matrix to processors
  – The matrix is sparse and non-uniform
  – Each processor has one block
  – Communication: O(N/sqrt(P))
  – Ratio: O(sqrt(P))
• Better scalability (can use 100+ processors)
  – Hwang, Saltz, et al.: 6% on 32 PEs, 36% on 128 processors

Page 70: blueGeneLTahoeAug2002.ppt

Spatial Decomposition

• Allocate close-by atoms to the same processor
• Three variations possible:
  – Partitioning into P boxes, 1 per processor
    • Good scalability, but hard to implement
  – Partitioning into fixed-size boxes, each a little larger than the cutoff distance
  – Partitioning into smaller boxes
• Communication: O(N/P)

Page 71: blueGeneLTahoeAug2002.ppt

Spatial Decomposition in NAMD

• NAMD 1 used spatial decomposition
• Good theoretical isoefficiency, but for a fixed-size system, load balancing problems
• For midsize systems, got good speedups up to 16 processors...
• Use the symmetry of Newton's 3rd law to facilitate load balancing

Page 72: blueGeneLTahoeAug2002.ppt


Spatial Decomposition

Page 73: blueGeneLTahoeAug2002.ppt


Spatial Decomposition

Page 74: blueGeneLTahoeAug2002.ppt

FD + SD

• Now, we have many more objects to load balance:
  – Each diamond can be assigned to any processor
  – Number of diamonds (3D): 14 · (number of patches)
    • (each patch interacts with its 26 neighbors; counting each pair once and adding the patch's own interactions gives 26/2 + 1 = 14)

Page 75: blueGeneLTahoeAug2002.ppt

Bond Forces

• Multiple types of forces:
  – Bonds (2), angles (3), dihedrals (4), ...
  – Luckily, each involves atoms in neighboring patches only
• Straightforward implementation:
  – Send a message to all neighbors,
  – receive forces from them
  – 26*2 messages per patch!

Page 76: blueGeneLTahoeAug2002.ppt

Bonded Forces:

• Assume one patch per processor

[Figure: patches A, B, C and the placement of the bonded-force computation among them]

Page 77: blueGeneLTahoeAug2002.ppt

Optimizations in scaling to 1000

• Parallelization is based on parallel objects
  – Charm++: a parallel C++
• A series of optimizations was implemented to scale performance to 1000+ processors
• Examples:
  – Load balancing:
    • Grainsize distributions

Page 78: blueGeneLTahoeAug2002.ppt

Grainsize and Amdahl's law

• A variant of Amdahl's law, for objects:
  – The fastest time can be no shorter than the time for the biggest single object!
• How did it apply to us?
  – Sequential step time was 57 seconds
  – To run on 2k processors, no object should take more than 28 msecs (57 s / 2048 ≈ 28 ms)
  – Analysis using our tools showed:

Page 79: blueGeneLTahoeAug2002.ppt

Grainsize analysis

[Chart: grainsize distribution -- number of objects vs. grainsize in milliseconds (1-43 ms); the problem is a tail of objects whose grainsize is too large]

Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms

Page 80: blueGeneLTahoeAug2002.ppt

Grainsize reduced

[Chart: grainsize distribution after splitting -- number of objects vs. grainsize in msecs (1-25 ms)]

Page 81: blueGeneLTahoeAug2002.ppt

NAMD performance using virtualization

• Written in Charm++
• Uses measurement-based load balancing
• Object-level performance feedback
  – using the "projections" tool for Charm++
  – Identifies problems at the source level easily
  – Almost suggests fixes
• Attained unprecedented performance


Page 83: blueGeneLTahoeAug2002.ppt

Namd2 Speedup on Lemieux

[Chart: speedup vs. number of processors (4 to 2046), for apoa1 (cutoff), apoa1 (PME), atpase (cutoff), and atpase (PME)]

Page 84: blueGeneLTahoeAug2002.ppt

PME parallelization

[Figure placeholder -- original speaker note: "Import picture from SC02 paper (Sindhura's)"]


Page 87: blueGeneLTahoeAug2002.ppt

Performance: NAMD on Lemieux

                      Time (ms)              Speedup             GFLOPS
 Procs  Per Node    Cut     PME     MTS    Cut   PME   MTS     Cut    PME    MTS
     1      1      24890   29490   28080     1     1     1    0.494  0.434  0.48
   128      4      207.4   249.3   234.6   119   118   119      59     51     57
   256      4      105.5   135.5   121.9   236   217   230     116     94    110
   512      4       55.4    72.9    63.8   448   404   440     221    175    211
   510      3       54.8    69.5    63     454   424   445     224    184    213
  1024      4       33.4    45.1    36.1   745   653   778     368    283    373
  1023      3       29.8    38.7    33.9   835   762   829     412    331    397
  1536      3       21.2    28.2    24.7  1175  1047  1137     580    454    545
  1800      3       18.6    25.8    22.3  1340  1141  1261     661    495    605
  2250      3       15.6    23.5    18.4  1599  1256  1527     789    545    733

ATPase: 320,000+ atoms including water

Page 88: blueGeneLTahoeAug2002.ppt

LeanMD for BG/L

• Need many more objects:
  – Generalize the hybrid decomposition scheme
    • 1-away to k-away


Page 90: blueGeneLTahoeAug2002.ppt

Role of compilers

• New uses of compiler analysis
  – Apparently simple, but then again, data flow analysis must have seemed simple
  – Supporting threads
  – Shades of global variables
  – Minimizing state at migration
  – Border fusion
  – Split-phase semantics (UPC)
  – Components (separately compiled)
• Compiler – RTS collaboration needed!

Page 91: blueGeneLTahoeAug2002.ppt

Summary

• Virtualization as a magic bullet
  – Logical decomposition, better software engineering
  – Flexible and dynamic mapping to processors
• Message-driven execution:
  – Adaptive overlap, modularity, predictability
• Principle of persistence
  – Measurement-based load balancing
  – Adaptive communication libraries
• Future:
  – Compiler support
  – Realize the potential: strategies and applications

More info: http://charm.cs.uiuc.edu

Page 92: blueGeneLTahoeAug2002.ppt

Component Frameworks

• Seek optimal division of labor between "system" and programmer:
  – Decomposition done by programmer, everything else automated
  – Develop a standard library of reusable parallel components

[Figure: the same specialization-to-automation spectrum (MPI, Charm++, HPF) across decomposition, mapping, scheduling, and expression, now with domain-specific frameworks at the automation end]