WIMPS: Feb 2002 PPL-Dept of Computer Science, UIUC
How to Create Applications With Multi-million-way Parallelism
Laxmikant (Sanjay) Kale
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
http://charm.cs.uiuc.edu
Group Mission and Approach
• To enhance performance and productivity in programming complex parallel applications
– Performance: scalable to a very large number of processors
– Productivity: of human programmers
– Complex: irregular structure, dynamic variations
• Approach: application-oriented yet CS-centered research
– Develop enabling technology for a wide collection of applications
– Develop, use, and test it in the context of real applications
– Develop a standard library of reusable parallel components
Multi-partition Decomposition
• Idea: divide the computation into a large number of pieces
– Independent of the number of processors
– Typically larger than the number of processors
– Let the system map entities to processors
• Optimal division of labor between “system” and programmer:
– Decomposition done by the programmer
– Everything else automated
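The division of labor above can be illustrated with a minimal Python sketch. It is not Charm++ code: `decompose` and `map_round_robin` are hypothetical names, and round-robin placement is only a stand-in for whatever mapping the runtime system actually chooses. The point is that the piece count is fixed by the programmer without reference to the processor count.

```python
def decompose(num_items, num_pieces):
    """Programmer's job: split range(num_items) into near-equal contiguous pieces."""
    base, extra = divmod(num_items, num_pieces)
    pieces, start = [], 0
    for i in range(num_pieces):
        size = base + (1 if i < extra else 0)  # first `extra` pieces get one more item
        pieces.append(range(start, start + size))
        start += size
    return pieces


def map_round_robin(num_pieces, num_procs):
    """Stand-in for the system's job: piece i is placed on processor i % num_procs."""
    return {piece: piece % num_procs for piece in range(num_pieces)}


# The decomposition is chosen once, independent of the processor count.
pieces = decompose(1000, 64)

# The same 64 pieces can then be mapped onto 4 or 16 processors unchanged.
mapping_4 = map_round_robin(len(pieces), 4)
mapping_16 = map_round_robin(len(pieces), 16)
```

Because the pieces outnumber the processors, the mapping step has room to move work around later without touching the decomposition.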
Object-based Parallelization
[Figure: user view vs. system implementation]
The user is only concerned with interaction between objects.
Charm++
• Parallel C++ with data-driven objects
• Object arrays / object collections
• Object groups:
– Global object with a “representative” on each PE
• Asynchronous method invocation
• Prioritized scheduling
• Information sharing abstractions: readonly, tables, ...
• Mature, robust, portable
• http://charm.cs.uiuc.edu
Data driven execution
[Figure: two schedulers, each with its own message queue]
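A rough Python sketch of data-driven execution with asynchronous, prioritized method invocation follows. It models a single processor's scheduler and message queue; the `Scheduler` and `Counter` classes and the convention that a lower number means higher priority are assumptions made for illustration, not the Charm++ interface.

```python
import heapq
import itertools


class Scheduler:
    """Toy per-processor scheduler: repeatedly picks a queued message and runs it."""

    def __init__(self):
        self._queue = []
        self._order = itertools.count()  # tie-breaker: FIFO among equal priorities

    def send(self, obj, method, *args, priority=0):
        # Asynchronous invocation: enqueue the message and return immediately.
        # Assumption: a lower priority number is served first.
        heapq.heappush(self._queue, (priority, next(self._order), obj, method, args))

    def run(self):
        # Data-driven loop: the next method to run is decided by the queue contents.
        while self._queue:
            _, _, obj, method, args = heapq.heappop(self._queue)
            getattr(obj, method)(*args)


class Counter:
    """Hypothetical object whose methods are invoked only via messages."""

    def __init__(self, log):
        self.total = 0
        self.log = log

    def add(self, n):
        self.total += n
        self.log.append(("add", n))

    def report(self):
        self.log.append(("report", self.total))


log = []
counter = Counter(log)
sched = Scheduler()
sched.send(counter, "report", priority=5)  # sent first, but low priority
sched.send(counter, "add", 2)
sched.send(counter, "add", 3)
sched.run()
```

In this run the two `add` messages execute before `report`, even though `report` was sent first, because the queue is ordered by priority rather than by send order.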
Load Balancing Framework
• Based on object migration
– Partitions implemented as objects (or threads) are mapped to available processors by the LB framework
• Measurement-based load balancers:
– Principle of persistence: computational loads and communication patterns tend to persist over time
– Runtime system measures actual computation times of every partition, as well as communication patterns
• Variety of “plug-in” LB strategies available
– Including those for situations when the principle of persistence does not apply
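As one illustration of a measurement-based strategy, the sketch below uses a simple greedy heuristic: objects are placed heaviest first, each on the currently least-loaded processor. This is a generic heuristic chosen for brevity, not necessarily one of the framework's plug-in strategies. The function name and the sample loads are hypothetical, and communication patterns are ignored here.

```python
import heapq


def greedy_rebalance(measured_loads, num_procs):
    """Map objects to processors using measured per-object compute times.

    measured_loads: dict of object id -> compute time measured in recent steps
    (relying on persistence: recent measurements predict the near future).
    Returns (assignment dict, list of per-processor total loads).
    """
    heap = [(0.0, proc) for proc in range(num_procs)]  # (current load, processor)
    heapq.heapify(heap)
    assignment = {}
    by_weight = sorted(measured_loads.items(), key=lambda kv: kv[1], reverse=True)
    for obj, load in by_weight:
        proc_load, proc = heapq.heappop(heap)  # least-loaded processor so far
        assignment[obj] = proc
        heapq.heappush(heap, (proc_load + load, proc))
    totals = [0.0] * num_procs
    for obj, proc in assignment.items():
        totals[proc] += measured_loads[obj]
    return assignment, totals


# Hypothetical measurements for five objects, rebalanced onto two processors.
loads = {"a": 8.0, "b": 7.0, "c": 6.0, "d": 5.0, "e": 4.0}
assignment, totals = greedy_rebalance(loads, 2)
```

Migrating an object then amounts to changing its entry in `assignment`; the decomposition itself is untouched.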
Building on Object-based Parallelism
• Application-induced load imbalances
• Environment-induced performance issues:
– Dealing with extraneous loads on shared machines
– Vacating workstations
– Heterogeneous clusters
– Shrinking and expanding jobs to available PEs
• Object “migration”: novel uses
– Automatic checkpointing
– Automatic prefetching for out-of-core execution
• Reuse: object-based components
Applications
• Charm++ developed in the context of real applications
• Current applications we are involved with:
– Molecular dynamics
– Crack propagation
– Rocket simulation: fluid dynamics + structures +
– QM/MM: material properties via quantum mechanics
– Cosmology simulations: parallel analysis + visualization
– Cosmology: gravitational with multiple timestepping
Molecular Dynamics
• Collection of [charged] atoms, with bonds
• Newtonian mechanics
• At each time-step
– Calculate forces on each atom
• Bonds:
• Non-bonded: electrostatic and van der Waals
– Calculate velocities and advance positions
• 1 femtosecond time-step, millions needed!
• Thousands of atoms (1,000 - 100,000)
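The per-time-step loop can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: a one-dimensional system of two atoms, a single harmonic bond as the only force, reduced units, and a semi-implicit Euler update. Production MD codes use more careful integrators and real force fields.

```python
def spring_forces(positions, k=1.0, rest_length=1.0):
    """Hypothetical bonded force: one harmonic bond between two atoms in 1-D."""
    x0, x1 = positions
    stretch = (x1 - x0) - rest_length
    return [k * stretch, -k * stretch]  # equal and opposite forces


def time_step(positions, velocities, masses, force_fn, dt):
    """One step: compute forces, update velocities, then advance positions."""
    forces = force_fn(positions)
    velocities = [v + (f / m) * dt for v, f, m in zip(velocities, forces, masses)]
    positions = [x + v * dt for x, v in zip(positions, velocities)]
    return positions, velocities


positions, velocities = [0.0, 1.5], [0.0, 0.0]
masses = [1.0, 1.0]
for _ in range(1000):
    positions, velocities = time_step(positions, velocities, masses, spring_forces, dt=0.01)

# With equal and opposite forces, total momentum should stay at zero.
total_momentum = sum(m * v for m, v in zip(masses, velocities))

# Why "millions needed": 1 ns of simulated time at 1 fs per step.
STEPS_PER_NANOSECOND = 10**15 // 10**9  # femtoseconds per second / nanoseconds per second
```

At one femtosecond per step, a single nanosecond of simulated time already takes a million steps, so the cost of each step dominates.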
BC1 complex: 200k atoms
Performance Data: SC2000
Speedup on ASCI Red: BC1 (200k atoms)
[Figure: speedup (y-axis, 0 to 1400) versus processors (x-axis, 0 to 2500)]
Charm++ Is a Good Match for M-PIM
• Encapsulation: objects
• Cost model:
– Object data, read-only data, remote data
• Migration and resource management: automatic
• One-sided communication: since the beginning
• Asynchronous global operations (reductions, ...)
• Modularity: see 1996 paper
• Acceptability:
– C++
– Now also: AMPI on top of Charm++
AMPI: Goals
• Runtime adaptivity for MPI programs
– Based on multi-domain decomposition and dynamic load balancing features of Charm++
– Minimal changes to the original MPI code
– Full MPI 1.1 standard compliance
– Additional support for coupled codes
– Automatic conversion of existing MPI programs
[Figure labels: original MPI code, AMPIzer, AMPI code, AMPI runtime]
How Good Is the Programmability?
• I.e., do programmers find it easy/good?
– We think so
– Certainly a good intermediate-level model
• Higher level abstractions can be built on it
• But what kinds of abstractions?
• We think domain-specific ones
Specialization
[Figure labels: MPI, HPF, Charm++, domain-specific frameworks; expression, scheduling, mapping, decomposition]
Further Match With M-PIM
• Ability to predict:
– Which data is going to be needed and
– Which code will execute
– Based on the ready queue of object method invocations
Remember data driven execution?
[Figure: two schedulers, each with its own message queue]
Further Match With M-PIM
• Ability to predict:
– Which data is going to be needed and
– Which code will execute
– Based on the ready queue of object method invocations
– So, we can:
• Prefetch data accurately
• Prefetch code if needed
[Figure: schedulers (S) and message queues (Q)]
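The prediction idea can be sketched as a look-ahead over the ready queue. In this hypothetical Python model each queued invocation is a `(target object, method name)` pair; `prefetch_plan`, the look-ahead window, and the sample names are all assumptions for illustration.

```python
from collections import deque
from itertools import islice


def prefetch_plan(ready_queue, lookahead):
    """Inspect the next `lookahead` queued invocations without running them.

    Returns (objects whose data will be needed, methods whose code will run),
    each deduplicated and in first-use order.
    """
    data, code = [], []
    for target, method in islice(ready_queue, lookahead):
        if target not in data:
            data.append(target)
        if method not in code:
            code.append(method)
    return data, code


# Hypothetical ready queue of pending object method invocations.
queue = deque([
    ("patch7", "compute_forces"),
    ("patch2", "compute_forces"),
    ("patch7", "integrate"),
    ("patch9", "integrate"),
])
data, code = prefetch_plan(queue, lookahead=3)
```

Since the queue already names both the target object and the method, the prefetch targets are known before execution rather than guessed from past access patterns.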
So, What Are We Doing About It?
• How to develop any programming environment for a machine that isn’t built yet?
• Blue Gene/C emulator using Charm++
– Completed last year
– Implements low-level BG/C API
• Packet sends, extract packet from comm buffers
– Emulation runs on machines with hundreds of “normal” processors
• Charm++ on Blue Gene/C emulator
Structure of the Emulators
[Figure: two layered stacks. One lists Blue Gene/C low-level API, Charm++, Converse. The other lists Converse, Charm++, BG/C low-level API, Charm++.]
Emulation on a Parallel Machine
[Figure labels: simulating (host) processor, BG/C nodes, hardware thread]
Extensions to Charm++ for BG/C
• Microtasks:
– Objects may fire microtasks that can be executed by any thread on the same node
– Increases parallelism
– Overhead: sub-microsecond
• Issue:
– Object affinity: map to thread or node?
• Thread, currently.
• Microtasks alleviate load imbalance within a node
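The microtask idea can be modeled loosely in Python with a shared pool of worker threads standing in for a node's hardware threads. The `Node` class and its `fire` method are hypothetical names, not the BG/C or Charm++ interface, and Python threads will not show the real performance behavior. The sketch only shows the structure: any idle thread on the node may pick up a fired task.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    """Toy node: a fixed set of threads sharing one pool of fired microtasks."""

    def __init__(self, num_threads):
        self._pool = ThreadPoolExecutor(max_workers=num_threads)

    def fire(self, fn, *args):
        # The firing object does not choose a thread; any idle one may run the task.
        return self._pool.submit(fn, *args)

    def shutdown(self):
        self._pool.shutdown(wait=True)


def partial_sum(values):
    """A small unit of work an object might split off as a microtask."""
    return sum(values)


node = Node(num_threads=4)
data = list(range(1000))

# One object splits its work into ten microtasks instead of doing it on its own thread.
futures = [node.fire(partial_sum, data[i:i + 100]) for i in range(0, 1000, 100)]
total = sum(f.result() for f in futures)
node.shutdown()
```

Keeping the object bound to one thread while letting its microtasks float is one way to balance load inside a node without migrating the object itself.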
Emulation efficiency
• How much time does it take to run an emulation?
– 8 million processors being emulated on 100
– In addition, lower cache performance
– Lots of tiny messages
• On a Linux cluster:
– Emulation shows good speedup
Emulator to Simulator
• Step 1: Coarse-grained simulation
– I.e., performance prediction capability
– Models contention for processor/thread
– Also models communication delay based on distance
– Doesn’t model memory access on chip, or network
– How to do this in spite of out-of-order message delivery?
• Rely on determinism of Charm++ programs
• Time-stamped messages and threads
• Parallel time-stamp correction algorithm
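A very simplified, sequential Python sketch of the coarse-grained timing model follows. It is not the parallel time-stamp correction algorithm itself. It assumes independent time-stamped messages, a fixed latency per hop, and one message processed at a time per destination thread. All field names and numbers are hypothetical.

```python
def predict_finish_times(events, latency_per_hop):
    """Predict when each message's handler finishes on its destination thread.

    events: list of dicts with keys
      send_time - time stamp when the message was sent
      hops      - network distance to the destination
      dest      - destination thread id
      work      - handler execution time
    The order of `events` is the (possibly out-of-order) order in which the
    emulator happened to deliver them. Predicted times depend only on the time
    stamps, so the result is the same for any delivery order.
    """
    arrivals = sorted(
        (e["send_time"] + latency_per_hop * e["hops"], i) for i, e in enumerate(events)
    )
    free_at = {}  # thread id -> time at which the thread becomes idle
    finish = {}
    for arrival, i in arrivals:
        dest = events[i]["dest"]
        start = max(arrival, free_at.get(dest, 0.0))  # contention for the thread
        finish[i] = start + events[i]["work"]
        free_at[dest] = finish[i]
    return finish


events = [
    {"send_time": 0, "hops": 4, "dest": "t0", "work": 5},  # far away: arrives at 4
    {"send_time": 1, "hops": 1, "dest": "t0", "work": 2},  # nearby: arrives at 2
    {"send_time": 0, "hops": 2, "dest": "t1", "work": 3},
]
finish = predict_finish_times(events, latency_per_hop=1.0)
```

Here message 1 is listed after message 0 but its time stamp puts it earlier on thread `t0`, so the prediction runs it first and message 0 finishes at time 9.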
Applications on the current system
• Using BG Charm++
• LeanMD:
– Research-quality molecular dynamics
– Version 0: only electrostatics + van der Waals
• Simple AMR kernel
– Adaptive tree to generate millions of objects
• Each holding a 3D array
– Communication with “neighbors”
• Tree makes it harder to find neighbors, but Charm++ makes it easy
Emulator to Simulator
• Step 2: Add fine-grained simulation
– Sarita Adve: RSIM-based simulation of a node
• SMP node first
– Millions of thread units/caches to simulate in detail?
• Step 3: Hybrid simulation
– Instead: use detailed simulation to build model
– Drive coarse simulation using model behavior
Summary
• Charm++ (data-driven migratable objects)
– Is a well-matched candidate programming model for M-PIMs
• We have developed an emulator/simulator
– For BG/C
– Runs on parallel machines
• We have implemented multi-million object applications using Charm++
– And tested on emulated Blue Gene/C
• More info: http://charm.cs.uiuc.edu