30
NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC Programming Environment and Performance Modeling for million-processor machines Laxmikant (Sanjay) Kale Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana-Champaign Http://charm.Cs.uiuc.edu

Programming Environment and Performance Modeling for million-processor machines

  • Upload
    ryann

  • View
    36

  • Download
    0

Embed Size (px)

DESCRIPTION

Programming Environment and Performance Modeling for million-processor machines. Laxmikant (Sanjay) Kale Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana-Champaign Http://charm.Cs.uiuc.edu. Context: Group Mission and Approach. - PowerPoint PPT Presentation

Citation preview

Page 1: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Programming Environment and Performance Modeling for

million-processor machines

Laxmikant (Sanjay) KaleParallel Programming Laboratory

Department of Computer Science

University of Illinois at Urbana-Champaign

Http://charm.Cs.uiuc.edu

Page 2: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Context: Group Mission and Approach

• To enhance Performance and Productivity in programming complex parallel applications– Performance: scalable to very large number of processors

– Productivity: of human programmers

– Complex: irregular structure, dynamic variations

• Approach: Application Oriented yet CS centered research– Develop enabling technology, for a wide collection of apps.

– Develop, use and test it in the context of real applications

– Develop standard library of reusable parallel components

Page 3: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Project Objective and Overview• Focus on extremely large parallel machines

– Exemplified by Blue Gene/Cyclops

• Issues:– Programming Environment:

• Objects, threads, compiler support

– Runtime performance adaptation– Performance modeling

• Coarse grained models• Fine grained models• Hybrid

– Applications: • Unstructured Meshes (FEM/Crack Propagation), ..

David Padua

Sanjay Kale

Sarita Adve

Phillipe Geubelle

Page 4: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Project Objective and Overview

• Focus on extremely large parallel machines– Exemplified by Blue Gene/Cyclops

• Issues:– Programming Environment– Runtime performance adaptation– Performance modeling

• Coarse grained models• Fine grained models• Hybrid

– Applications: • Unstructured Meshes (FEM/Crack

Propagation), ..

David Padua

Sanjay Kale

Sarita Adve

Phillipe Geubelle

Page 5: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Multi-partition Decomposition

• Idea: divide the computation into a large number of pieces– Independent of number of processors– Typically larger than number of processors– Let the system map entities to processors

• Optimal division of labor between “system” and programmer:

• Decomposition done by programmer,

• Everything else automated

Page 6: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Object-based Parallelization

User View

System implementation

User is only concerned with interaction between objects

Page 7: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Charm++

• Parallel C++ with Data Driven Objects• Object Arrays/ Object Collections• Object Groups:

– Global object with a “representative” on each PE

• Asynchronous method invocation• Prioritized scheduling• Information sharing abstractions: readonly, tables,..• Mature, robust, portable• http://charm.cs.uiuc.edu

Page 8: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Data driven execution

Scheduler Scheduler

Message Q Message Q

Page 9: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Load Balancing Framework

• Based on object migration – Partitions implemented as objects (or threads) are

mapped to available processors by LB framework

• Measurement based load balancers:– Principle of persistence

• Computational loads and communication patterns

– Runtime system measures actual computation times of every partition, as well as communication patterns

• Variety of “plug-in” LB strategies available– Scalable to a few thousand processors– Including those for situations when principle of

persistence does not apply

Page 10: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Building on Object-based Parallelism

• Application induced load imbalances• Environment induced performance issues:

– Dealing with extraneous loads on shared m/cs

– Vacating workstations

– Heterogeneous clusters

– Shrinking and Expanding jobs to available Pes

• Object “migration”: novel uses– Automatic checkpointing

– Automatic prefetching for out-of-core execution

• Reuse: object based components

Page 11: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Applications

• Charm++ developed in the context of real applications

• Current applications we are involved with:– Molecular dynamics– Crack propagation– Rocket simulation: fluid dynamics + structures +– QM/MM: Material properties via quantum mech– Cosmology simulations: parallel analysis+viz– Cosmology: gravitational with multiple timestepping

Page 12: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Molecular Dynamics

• Collection of [charged] atoms, with bonds

• Newtonian mechanics

• At each time-step– Calculate forces on each atom

• Bonds:

• Non-bonded: electrostatic and van der Waal’s

– Calculate velocities and advance positions

• 1 femtosecond time-step, millions needed!

• Thousands of atoms (1,000 - 100,000)

Page 13: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Object Based Parallelization for MD

Page 14: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Performance Data: SC2000

Speedup on ASCI Red: BC1 (200k atoms)

0

200

400

600

800

1000

1200

1400

0 500 1000 1500 2000 2500

Processors

Spe

edup

Page 15: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Charm++ Is a Good Match for M-PIM

• Encapsulation : objects• Cost model:

– Object data, read-only data, remote data

• Migration and resource management: automatic• One sided communication: since the beginning• Asynchronous global operations (reductions, ..)• Modularity:

– see 1996 paper for why DD Objects enable modularity

• Acceptability: – C++– Now also: AMPI on top of charm++

Page 16: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

AMPI: Goals• Runtime adaptivity for MPI programs

– Based on multi-domain decomposition and dynamic load balancing features of Charm++

– Minimal changes to the original MPI code

– Full MPI 1.1 standard compliance

– Additional support for coupled codes

– Automatic conversion of existing MPI programs: Polaris/Padua

Original MPI Code AMPI Code

AMPI Runtime

AMPIzer

Page 17: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

How Good Is the Programmability

• I.E. Do programmers find it easy/good– We think so – Certainly a good intermediate level model

• Higher level abstractions can be built on it

• But what kinds of abstractions?

• We think domain-specific ones

Page 18: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Specialization

MPIexpression

Scheduling

Mapping

Decomposition

HPFCharm++

Domain specific

frameworks

/AMPI

Page 19: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Further Match With MPIM

• Ability to predict:– Which data is going to be needed and– Which code will execute– Based on the ready queue of object method

invocations– So, we can:

• Prefetch data accurately• Prefetch code if needed

S SQ Q

Page 20: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

So, What Are We Doing About It?

• How to develop any programming environment for a machine that isn’t built yet

• Blue Gene/C emulator using charm++– Completed last year– Implememnts low level BG/C API

• Packet sends, extract packet from comm buffers

– Emulation runs on machines with hundreds of “normal” processors

• Charm++ on blue Gene /C Emulator

Page 21: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Structure of the Emulators

Blue Gene/CLow-level API

Charm++

Converse

Converse

Charm++

BG/C low level API

Charm++

Page 22: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Emulation on a Parallel Machine

Simulating (Host) Processor

BG/C Nodes

Hardware thread

Page 23: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Extensions to Charm++ for BG/C

• Microtasks:– Objects may fire microtasks that can be

executed by any thread on the same node– Increases parallelism– Overhead: sub-microsecond

• Issue:– Object affinity: map to thread or node?

• Thread, currently.

• Microtasks alleviate load balancing within a node

Page 24: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Emulation efficiency

• How much time does it take to run an emulation?– 8 Million processors being emulated on 100– In addition, lower cache performance– Lots of tiny messages

• On a Linux cluster:– Emulation shows good speedup

Page 25: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Emulation efficiencyEmulation Time on Linux Cluster

0

2

4

6

8

10

12

14

0 10 20 30 40 50 60 70

Num of processors

Tim

e (S

ecs)

1000 BG/C nodes (10x10x10)

Each with 200 threads

(total of 200,000 user-level threads)

But Data is preliminary, based on

one simulation

Page 26: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Emulator to Simulator

• Step 1: Coarse grained simulation– Simulation: performance prediction capability– Models contention for processor/thread– Also models communication delay based on distance– Doesn’t model memory access on chip, or network– How to do this in spite of out-of-order message

delivery?• Rely on determinism of Charm++ programs

• Time stamped messages and threads

• Parallel time-stamp correction algorithm

Page 27: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Applications on the current system

• Using BG Charm++

• LeanMD:– Research quality Molecular Dyanmics– Version 0: only electrostatics + van der Vaal

• Simple AMR kernel– Adaptive tree to generate millions of objects

• Each holding a 3D array

– Communication with “neighbors”• Tree makes it harder to find nbrs, but Charm makes it easy

Page 28: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Emulator to Simulator

• Step 2: Add fine grained procesor simulation– Sarita Adve: RSIM based simulation of a node

• SMP node simulation: completed

– Also: simulation of interconnection network– Millions of thread units/caches to simulate in detail?

• Step 3: Hybrid simulation– Instead: use detailed simulation to build model– Drive coarse simulation using model behavior– Further help from compiler and RTS

Page 29: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Modeling layers

Applications

Libraries/RTS

Chip Architecture Network model

For each: need a detailed simulation and a simpler

(e.g. table-driven) “model”

And methods for combining

them

Page 30: Programming Environment and Performance Modeling for  million-processor machines

NGS Workshop: Feb 2002 PPL-Dept of Computer Science, UIUC

Summary

• Charm++ (data-driven migratable objects)– is a well-matched candidate programming model for

M-PIMs

• We have developed an Emulator/Simulator – For BG/C– Runs on parallel machines

• We have Implemented multi-million object applications using Charm++– And tested on emulated Blue Gene/C

• More info: http://charm.cs.uiuc.edu– Emulator is available for download, along with Charm