Distributed Adaptive Simulations using Structured Adaptive Mesh-Refinement
(SAMR)
Manish Parashar
The Applied Software Systems Laboratory
ECE/CAIP, Rutgers University
http://www.caip.rutgers.edu/TASSL
(Ack: NSF, DoE, NIH, DoD)
Overview
• Computational engines for SAMR applications
  – distributed, dynamic data-management
• Runtime (reactive and proactive) management
  – dynamic (application- and system-sensitive) partitioning and load-balancing
  – AHMP: Adaptive Hierarchical Meta-Partitioning
  – Dispatch: addressing point-wise varying loads
• Conclusion
Adaptive Mesh Refinement

• Start with a base coarse grid with the minimum acceptable resolution
• Tag regions in the domain requiring additional resolution, cluster the tagged cells, and fit finer grids over these clusters
• Proceed recursively, so that regions on the finer grid requiring more resolution are similarly tagged and even finer grids are overlaid on these regions
• The resulting grid structure is a dynamic adaptive grid hierarchy
The Berger-Oliger Algorithm

Recursive Procedure Integrate(level)
    If (RegridTime) Regrid
    Step Δt on all grids at level “level”
    If (level + 1 exists)
        Integrate(level + 1)
        Update(level, level + 1)
    End if
End Recursion

level = 0
Integrate(level)
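The recursion can be sketched in runnable form. This is a minimal illustration, not GrACE code: the grid hierarchy, Regrid, and Update operations are hypothetical stubs, and the finer level is advanced refinement-factor times with a proportionally smaller Δt, as Berger-Oliger subcycling requires.

```python
# Minimal sketch of the Berger-Oliger recursion (illustrative only; the
# hierarchy, regrid, and update operations are hypothetical stand-ins).

REFINEMENT_FACTOR = 2  # each finer level takes 2 smaller steps per coarse step

def integrate(hierarchy, level, dt, step, trace):
    """Advance all grids at `level`, recursing into finer levels."""
    if regrid_time(step):
        regrid(hierarchy, level)
    trace.append((level, dt))                # "Step dt on all grids at level"
    if level + 1 < len(hierarchy):
        # the finer level subcycles: REFINEMENT_FACTOR steps of dt/r
        for sub in range(REFINEMENT_FACTOR):
            integrate(hierarchy, level + 1, dt / REFINEMENT_FACTOR,
                      step * REFINEMENT_FACTOR + sub, trace)
        update(hierarchy, level, level + 1)  # project fine solution onto coarse

def regrid_time(step):
    return step % 4 == 0                     # e.g. re-tag/re-cluster every 4 steps

def regrid(hierarchy, level):
    pass                                     # re-tag cells, re-cluster, re-fit grids

def update(hierarchy, coarse, fine):
    pass                                     # average fine grids onto level `coarse`

trace = []
integrate([0, 1, 2], level=0, dt=1.0, step=0, trace=trace)
# With 3 levels and refinement factor 2, one coarse step triggers
# 1 step at level 0, 2 at level 1, and 4 at level 2.
```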
Structured Adaptive Mesh Refinement (SAMR)
Related Work: SAMR Infrastructures
• SAMRAI, Lawrence Livermore National Lab
  – Object-oriented structured adaptive mesh refinement application infrastructure
  – Modules handle visualization, mesh management, integration, geometry, etc.
• Chombo, Lawrence Berkeley National Lab
  – Set of tools for implementing finite difference methods for PDE solutions
  – Distributed infrastructure for parallel calculations over block-structured, adaptively refined grids
• Paramesh, NASA Goddard Space Flight Center
  – Fortran 90 subroutines to extend existing serial code into parallel AMR code
  – Hierarchy of Cartesian mesh grids which form the nodes of a tree data-structure
• Batsrus, University of Michigan
  – Block-based approach with adaptation distributed over processors in the computational pool in phases
• GrACE, Rutgers University
  – Adaptive computational and data-management engine for structured grids
  – Distributed adaptive grid hierarchy, grid function, and geometry abstractions
  – Parallel support for AMR computations in various scientific domains
GrACE: Adaptive Computational Engine for SAMR
• Semantically Specialized DSM
  – Application-centric programming abstractions
  – Regular access semantics to dynamic, heterogeneous, and physically distributed data objects
    • Encapsulate distribution, communication, and interaction
  – Coupling/interactions between multiple physics, models, structures, scales
• Distributed Shared Objects
  – virtual Hierarchical Distributed Dynamic Array
    • Hierarchical index-space + extendible hashing + heterogeneous objects
  – Multifaceted objects
    • Integration of computation + data + visualization + interaction
• Adaptive Run-time Management
  – Application- and system-sensitive management
    • Algorithms, partitioners, load-balancing, communications, etc.
    • Policy-based automated adaptations

1024x128x128, 3 levels, 2K PEs: Time ~15%, Memory ~25%
Richtmyer Meshkov (3D)
IPARS Multi-block Oil Reservoir Simulation
Data-Management for Adaptive Applications
• Application requirements
  – Adaptive finite difference
    • Large hierarchical objects; dynamic size, orientation, and interactions
  – Adaptive finite element
    • Dynamic number of objects of dynamic size
  – Adaptive fast multipole
    • Dynamic number of small objects with dynamic interactions
• Traditional data-management for computation: multi-dimensional arrays
  – data set, index space, injective function
• Data-management abstraction for distributed adaptive applications
  – An extended definition of an array where
    • Each element of the array can itself be an array
    • Each element of the array can be an object of arbitrary and variable size
    • The array can grow and shrink dynamically
    • The array is distributed
  – Hierarchical Distributed Dynamic Array
• Performance, Performance, Performance
• Locality, Locality, Locality
“A Common Data Management Infrastructure for Parallel Adaptive Algorithms for PDE Solutions,” M. Parashar, J. C. Browne, C. Edwards, and K. Klimkowsky, Proceedings of Supercomputing ’97, San Jose, CA, November 1997.
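The extended-array definition above (elements may themselves be arrays, elements may be objects of arbitrary size, dynamic growth, distribution) can be illustrated with a toy structure. This is a conceptual sketch only; the class, its methods, and the owner field are invented for illustration and are not the HDDA/DAGH implementation.

```python
# Toy hierarchical dynamic array: a dict keyed by index tuples, where a
# stored value may itself be another HierArray (one per refinement level).
# Conceptual sketch only -- not the HDDA/DAGH data structure.

class HierArray:
    def __init__(self, owner=0):
        self.cells = {}        # index tuple -> object, or a nested HierArray
        self.owner = owner     # processor that would store this (sub)array

    def set(self, index, value):
        self.cells[index] = value          # the array grows dynamically

    def get(self, index):
        return self.cells[index]

    def refine(self, index, owner=0):
        """Replace a cell with a nested array, i.e. a finer level."""
        child = HierArray(owner)
        self.cells[index] = child
        return child

    def __len__(self):                     # number of elements at this level
        return len(self.cells)

root = HierArray()
root.set((0, 0), 1.5)                      # elements of arbitrary type/size
fine = root.refine((0, 1))                 # an element that is itself an array
fine.set((0, 0), 2.5)
fine.set((0, 1), 3.5)
```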
MACE: Supporting Dynamic Coupling/Interactions
• High-Performance Geometry-based Shared Spaces
  – Models/numerics, as well as the interactions, are typically based on the geometry of the discretized domain
  – Use SFCs to create a distributed directory of shared geometric regions
  – Processors can create shared regions and can read and write objects related to a shared region, e.g. a mortar grid
  – Complements MPI, OpenMP, PVM, etc.
• Multi-numerics, Multi-physics, Multi-scale

“A Dynamic Geometry-Based Shared Space Interaction Framework for Parallel Scientific Applications,” L. Zhang and M. Parashar, Proceedings of the 11th International Conference on High Performance Computing (HiPC 2004), Bangalore, India, December 2004.
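A minimal sketch of the idea of an SFC-backed directory of shared geometric regions: a region is keyed by the Morton (Z-order) index of its lower corner, and that key is hashed to the processor that owns its directory entry. The encoding, the `home`/`put`/`get` interface, and the region format are all illustrative assumptions, not the MACE API.

```python
# Sketch: an SFC (Morton/Z-order) key maps each shared geometric region to
# the processor that manages its directory entry. Illustrative only.

def morton2d(x, y, bits=16):
    """Interleave the bits of (x, y) into a Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

class GeometricDirectory:
    def __init__(self, nprocs):
        self.nprocs = nprocs
        self.entries = {}                  # owner -> list of registered objects

    def home(self, region):
        """Directory owner for a region, keyed by its lower corner."""
        (x, y), _extent = region
        return morton2d(x, y) % self.nprocs

    def put(self, region, obj):            # e.g. register a mortar-grid object
        self.entries.setdefault(self.home(region), []).append(obj)

    def get(self, region):
        return self.entries.get(self.home(region), [])

d = GeometricDirectory(nprocs=4)
d.put(((2, 3), (4, 4)), "mortar-grid-A")   # any processor can now look it up
```

Because nearby regions get nearby Morton keys, lookups for geometrically close regions tend to land on the same directory owner, which is the locality property the SFC provides.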
A Selection of SAMR Applications Enabled
• Multi-block grid structure and oil concentration contours (IPARS, M. Peszynska, UT Austin)
• Blast wave in the presence of a uniform magnetic field, 3 levels of refinement (Zeus + GrACE + Cactus, P. Li, NCSA, UCSD)
• Mixture of H2 and air in stoichiometric proportions with a non-uniform temperature field (GrACE + CCA, Jaideep Ray, SNL, Livermore)
• Richtmyer-Meshkov: detonation in a deforming tube, 3 levels; the z=0 plane is visualized on the right (VTF + GrACE, R. Samtaney, CIT)
SAMR: Spatial and Temporal Heterogeneity and Dynamics
[Figure: snapshots of the RM3D adaptive grid hierarchy at regrid steps 5, 96, 114, 176, and 201 (200 regrid steps, size=256x64x64), together with a plot of total load (in units of 100k) versus regrid step.]
Spatial and Temporal Heterogeneity and Load Dynamics of a 3D Richtmyer-Meshkov Simulation using SAMR
Analysis of Computation and Communication Patterns of Distributed SAMR Applications
[Figure: timing diagram showing alternating computation and communication time slots on processors P1 and P2, with an enlarged, more detailed view. The number in each time-slot box denotes the refinement level of the load under processing; in this case, the number of refinement levels is 3 and the refinement factor is 2. The communication time consists of three types: intra-level, inter-level, and synchronization cost.]
Timing Diagram for Distributed SAMR
Runtime Management for SAMR Applications
• Partitioning/load-balancing strategy
  – maximize parallelism; minimize inter/intra-level communication; maintain inter/intra-level locality; support efficient repartitioning; …
  – the partitioning/load-balancing strategy depends on the structure of the grid hierarchy and the current application/system state [IEEE TPDS 2002]
• Granularity
  – patch size, AMR efficiency, comm./comp. ratio, overhead, node performance, load balance, …
• Number of processors / load per processor
  – Dynamic allocation/configuration/management
    • 1000+ processors from the beginning, or “on-demand”
• Hierarchical “emergent” distributions using dynamic processor groups
• Communication optimizations / latency tolerance / multithreading
• Availability, capabilities, and state of system resources
Partitioning Approaches
Ack. X. Li, OSU
SAMR – Partitioning Systems
System        Execution mode             Granularity                Partitioner organization                  Decomposition         Institute
CHARM         Comp-intensive             Coarse-grained             Static single-partitioner                 Domain-based          UIUC
Chombo        Comp-intensive             Fine- and coarse-grained   Static single-partitioner                 Domain-based          LBNL
HRMS/GrACE    Comp-intensive             Fine- and coarse-grained   Adaptive hierarchical multi-partitioner,  Domain-based, hybrid  Rutgers
                                                                    hybrid strategies
Nature+Fable  Comp-intensive             Coarse-grained             Single meta-partitioner                   Domain-based, hybrid  Sandia
ParaMesh      Comp-intensive             Fine- and coarse-grained   Static single-partitioner                 Domain-based          NASA
ParMetis      Comp- and comm-intensive   Fine-grained               Static single-partitioner                 Graph-based           Minnesota
PART          Comp-intensive             Coarse-grained             Static single-partitioner                 Domain-based          Northwestern
SAMRAI        Comp- and comm-intensive   Fine- and coarse-grained   Static single-partitioner                 Patch-based           LLNL
Ack. X. Li, OSU
Proactive & Reactive Runtime Management
• Reactively and proactively manage and optimize application execution using the current system and application state and predictive models of system behavior and application performance
  – Runtime sensing of the current system and application state
  – Analyze, characterize, and anticipate system and application behavior
  – Reactively and proactively adapt application execution
• Application-sensitive adaptation
  – Characterizes the current application state
  – Determines resource allocation, partitioning/mapping of application components, granularity, load-balancing, and communication mechanisms
• System-sensitive adaptation
  – Driven by system state and system performance predictions
  – Determines application granularity, communication strategies based on bandwidth, and the nature of refinements based on the availability and “health” of computing elements
• Performance prediction functions (S. Hariri, Univ. of AZ)
“Investigating Autonomic Runtime Management Strategies for SAMR Applications”, S. Chandra, J. Yang, Y. Zhang, M. Parashar, and S. Hariri, International Journal of Parallel Processing, Editor: F. Darema, Kluwer Academic Publishers, 2005.
ARMaDA: Adaptive Application-Sensitive Management for SAMR Applications
• Identify and characterize cliques
• Define management objective and strategy
• Hierarchically partition, map, and tune
[Figure: ARMaDA architecture. A dynamic driver application feeds application-state characterization (computation/communication behavior, application dynamics, nature of adaptation), which identifies and characterizes clique regions and issues runtime prescriptions for partitioning, scheduling, mapping, distribution, and redistribution. An optimization repository supplies partitioning algorithms (ISP, LPA, HPA, G-MISP, …), load-balancing algorithms (greedy, binpack, level-based), communication strategies (staggered sends, delayed waits), clustering algorithms (segmentation/level based), and space-time hybrid schemes (application-level pipelining, application-level out-of-core). State analysis considers data migration, application locality, communication costs, load balancing, adaptive partitioning, adaptation overheads, memory requirements, and granularity control.]
*A Clique Region is a relatively homogeneous region in the SAMR grid hierarchy
ARMaDA: Adaptive Application-Sensitive Management for SAMR Applications
• Runtime application monitoring and characterization
  – computation/communication requirements, application dynamics, nature of adaptation, etc.
• Deduction
  – map partitioners to application state
• Adaptation (meta-partitioner)
  – dynamically select, configure, and invoke the “best” partitioner at runtime
Characterizing Application State at Runtime
• Application state is characterized using operations on the geometry of the grid hierarchy
  – Computation/communication requirements
    • computationally intensive or communication dominated
  – Application dynamics
    • speed of changes in application refinement patterns
  – Nature of adaptation
    • scattered or localized refinements, affecting overheads
• Fast and efficient characterization algorithms minimize overheads
“Towards Autonomic Application-Sensitive Partitioning for SAMR Applications”, S. Chandra and M. Parashar, Journal of Parallel and Distributed Computing, Academic Press, Vol. 65, Issue 4, pp. 519 – 531, April 2005.
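As a toy illustration of geometry-based state characterization, the sketch below derives two of the indicators named above from patch geometry alone: a volume-to-surface ratio as a computation/communication proxy, and a patch count as a scatter indicator. The specific metrics are illustrative assumptions, not ARMaDA’s actual characterization formulas.

```python
# Toy characterization of application state from patch geometry.
# The metrics are illustrative assumptions, not ARMaDA's formulas.

def patch_volume(lo, hi):
    return (hi[0] - lo[0]) * (hi[1] - lo[1]) * (hi[2] - lo[2])

def patch_surface(lo, hi):
    dx, dy, dz = (hi[i] - lo[i] for i in range(3))
    return 2 * (dx * dy + dy * dz + dx * dz)

def comp_comm_ratio(patches):
    """Volume/surface ratio: high -> computationally intensive,
    low -> communication dominated (ghost exchange scales with surface)."""
    vol = sum(patch_volume(lo, hi) for lo, hi in patches)
    surf = sum(patch_surface(lo, hi) for lo, hi in patches)
    return vol / surf

def adaptation_scatter(patches):
    """Number of disjoint refined patches: high -> scattered refinements."""
    return len(patches)

# one large patch and one small, distant patch (bounding boxes: lo, hi)
patches = [((0, 0, 0), (8, 8, 8)), ((16, 0, 0), (20, 4, 4))]
```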
Reactive System Sensitive Partitioning
• Cost model used to calculate relative capacities of nodes in terms of CPU, memory, and bandwidth availability
• Relative capacity for node k:

    C_k = w_p * P_k + w_m * M_k + w_b * B_k,  with  w_p + w_m + w_b = 1

  – where P_k, M_k, and B_k are node k’s relative CPU, memory, and bandwidth availability, and w_p, w_m, and w_b are the associated weights
• Evaluation
  – Linux-based 32-node Beowulf cluster with synthetic load generators
  – RM3D kernel, 128x32x32 base grid, 3 refinement levels, regrid every 4 steps
  – 18% improvement in execution time over the non-system-sensitive scheme
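The capacity model C_k = w_p·P_k + w_m·M_k + w_b·B_k (weights summing to 1) transcribes directly to code; the per-node availability values and weight choices below are made up for illustration.

```python
# Relative capacity C_k = w_p*P_k + w_m*M_k + w_b*B_k, with
# w_p + w_m + w_b = 1. Example node availabilities are invented.

def relative_capacities(nodes, wp, wm, wb):
    assert abs(wp + wm + wb - 1.0) < 1e-9    # weights must sum to 1
    caps = [wp * p + wm * m + wb * b for p, m, b in nodes]
    total = sum(caps)
    return [c / total for c in caps]         # normalized shares, sum to 1

# (CPU, memory, bandwidth) availability per node, each in [0, 1]
nodes = [(0.9, 0.8, 0.7), (0.5, 0.6, 0.4), (0.2, 0.4, 0.3)]
shares = relative_capacities(nodes, wp=0.5, wm=0.3, wb=0.2)
# each node then receives a fraction of the workload proportional
# to its share, so loaded or weak nodes get smaller partitions
```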
[Figure: the heterogeneous system-sensitive partitioner. A resource monitoring tool supplies CPU, memory, and bandwidth availability to a capacity calculator, which combines them with application-supplied weights to compute the available capacities; the partitioner uses these capacities to generate the partitions.]
“Adaptive System-Sensitive Partitioning of AMR Applications on Heterogeneous Clusters,” S. Sinha and M. Parashar, Cluster Computing: The Journal of Networks, Software Tools, and Applications, Kluwer Academic Publishers, Vol. 5, Issue 4, pp. 343-352, 2002.
Adaptive Hierarchical Multiple-Partitioner Strategy (AHMP)
• Rationale
  – Extends the Hierarchical Partitioning Algorithm (HPA)
    • reduce global communication overheads
    • enable incremental repartitioning
    • expose more concurrent communication and computation
  – Addresses spatial heterogeneity
• Approach: divide-and-conquer
  – Identify clique regions and characterize their states through clustering
  – Select an appropriate partitioner for each clique region, matching the characteristics of partitioners and cliques
  – Adapt to the states of the computational domains and systems
  – Repartition and reschedule within the local resource group
[Figure: AHMP flowchart. Clustering (LBC or SBC) transforms the grid hierarchy into a clique hierarchy; then, recursively for each clique, the clique is characterized, a partitioner is selected from the partitioner repository according to the selection policies, and the clique is partitioned. Repartitioning restarts the cycle.]
“Using Clustering to Address the Heterogeneity and Dynamism in Parallel SAMR Applications”, X. Li and M. Parashar, Proceedings of the 12th International Conference on High Performance Computing, Goa, India, December 2005.
AHMP: Operation
[Figure: AHMP operation across resource groups RG1-RG4 (RG: Resource Group). Within each resource group, a clique is partitioned and scheduled by a selected partitioner (e.g. GPA, LPA, ALP, ALOC); repartitioning and rescheduling are performed within the local resource group.]

Let T_exe^i denote the estimated execution time for processor i in a resource group. The load imbalance factor (LIF) of resource group RG_k is defined as

    LIF(RG_k) = (max_i T_exe^i - min_i T_exe^i) / (avg_i T_exe^i)
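The LIF definition transcribes directly to code; the example execution-time estimates and the trigger threshold mentioned in the comment are illustrative, not values from the AHMP evaluation.

```python
# Load imbalance factor for a resource group:
#   LIF = (max_i T_exe^i - min_i T_exe^i) / avg_i T_exe^i
# Example execution-time estimates are invented for illustration.

def load_imbalance_factor(t_exe):
    avg = sum(t_exe) / len(t_exe)
    return (max(t_exe) - min(t_exe)) / avg

balanced = load_imbalance_factor([10.0, 10.0, 10.0])   # perfectly balanced
skewed = load_imbalance_factor([5.0, 10.0, 15.0])      # spread equals the mean
# A natural use is to trigger repartitioning/rescheduling within the
# resource group when its LIF exceeds some chosen threshold.
```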
AHMP: Experimental Evaluation
• Experiment setup
  – IBM SP4 cluster (DataStar at the San Diego Supercomputer Center, 1632 processors in total)
  – SP4 (p655) node: 8 processors (1.5 GHz), 16 GB memory, 6.0 GFlops
• Overall performance, RM3D application (1000 time steps, size=256x64x64, refinement levels = 3, refinement factor = 2)
  – Performance gain: AHMP 30% - 42%, LPA 20% - 29% (GPA: greedy, LPA: level-based)

[Figure: execution time (seconds) versus number of processors (64 to 1280) for GPA, LPA, and SBC+AHMP on the RM3D application.]
SBC Clustering Time
On average, the clustering cost is less than 0.01 second, while the execution time between regridding steps is typically > 10 seconds.

• RM3D: turbulence, Caltech
• RM2D: turbulence, Caltech
• BL3D: oil-water flow, UT Austin
• TP2D: heat transport
• BL2D: oil-water flow, UT Austin
• ENO2D: VTF, Caltech

[Figure: SBC clustering time (microseconds) for the rm3d, rm2d, bl3d, tp2d, bl2d, and eno2d SAMR applications.]
Dispatch: Heterogeneous Workload Simulations
• Applications with computational heterogeneity
  – reactive flows, such as simulation of hydrocarbon flames
  – pointwise processes operate at different timescales than diffusive and convective processes, and are hence approximately decoupled
  – operator-split integration methods in PDEs
  – highly uneven distribution of workload as a function of space
  – traditional partitioning/load-balancing approaches are not suitable
  – preserving spatial coupling reduces communication costs
• Dispatch strategy
  – dynamic structured partitioning for parallel applications with computational heterogeneity
  – integrated with the GrACE computational framework
  – combines inverse space-filling-curve-based partitioning with in-situ weighted global load balancing
“Dynamic Structured Partitioning for Parallel Scientific Applications with Pointwise Varying Workloads”, S. Chandra, M. Parashar and J. Ray, Proceedings of 20th IEEE/ACM International Parallel and Distributed Processing Symposium, IEEE Computer Society Press, April 2006 .
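A minimal sketch of the idea behind weighted inverse-SFC partitioning: order cells along a space-filling curve, then cut the resulting 1-D weighted sequence into contiguous chunks of near-equal load. The Morton ordering and the greedy cut below are illustrative assumptions, not Dispatch’s exact algorithm.

```python
# Sketch of inverse-SFC partitioning with pointwise weights: cells are
# ordered along a Morton curve, and the 1-D weighted list is cut into
# contiguous near-equal-load chunks. Illustrative, not Dispatch itself.

def morton2d(x, y, bits=8):
    """Interleave the bits of (x, y) into a Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

def partition(cells, weights, nparts):
    """cells: list of (x, y); weights: per-cell work (e.g. chemistry cost)."""
    order = sorted(range(len(cells)), key=lambda i: morton2d(*cells[i]))
    target = sum(weights) / nparts
    parts, current, load = [], [], 0.0
    for i in order:
        current.append(cells[i])
        load += weights[i]
        if load >= target and len(parts) < nparts - 1:
            parts.append(current)          # close this chunk, start the next
            current, load = [], 0.0
    parts.append(current)
    return parts

cells = [(x, y) for y in range(4) for x in range(4)]
# a "hot" quadrant where pointwise chemistry is 10x more expensive
weights = [10.0 if (x < 2 and y < 2) else 1.0 for (x, y) in cells]
parts = partition(cells, weights, nparts=2)
# A naive equal-cell split along the curve would put 44 of the 52 load
# units in one half; the weighted cut yields 30 vs 22 here, while each
# partition remains spatially contiguous along the curve.
```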
Dispatch: Illustrative Reactive-Diffusion Kernel
• R-D application
  – the model approximates the ignition of a CH4-air mixture in a non-uniform temperature field with 3 hot-spots
  – high dynamism, varying workloads, space-time heterogeneity
  – reactive processes near the flame front have high computation requirements
  – solves equations of the form [equation shown in slide figure]
Dispatch: Experimental Evaluation
• Evaluation setup
  – 256x256 base grid resolution, 200 iterations of the R-D kernel
  – 8-128 processors on the SDSC IBM SP4 “DataStar”
  – single uniform level
  – compare the performance of Dispatch and Homogeneous
• Evaluation results
  – Dispatch improves execution time by 11.23% - 46.34%
  – Dispatch considers the weights of pointwise processes, achieving smaller deviation in compute times and reduced synchronization times
Conclusion
• Adaptive and interactive simulations can enable accurate solutions of physically realistic models of complex phenomena
  – Large-scale, efficient parallel/distributed implementations present significant challenges
• Conceptual and implementation solutions for enabling adaptive and interactive simulations
  – Computational engines
    • HDDA/DAGH/GrACE/MACE
  – Adaptive runtime management/optimization
    • PRAGMA/ARMaDA
  – Interactive/collaborative monitoring/steering: computation collaboratories
    • Discover/DIOS
• More information, publications, software
  – www.caip.rutgers.edu/~parashar/
  – [email protected]
The Team
• TASSL, Rutgers University
  – Viraj Bhat, Andrez Quiroz Hernandez, Nanyan Jiang, Zhen Li (Jenny), Vincent Matossian, Sumir Chandra, Mingliang Wang, Li Zhang
• Key CS collaborators
  – HPDC, University of Arizona: Salim Hariri
  – Biomedical Informatics, The Ohio State University: Tahsin Kurc, Joel Saltz
  – CS, University of Maryland: Alan Sussman
• Key applications collaborators
  – CSM, University of Texas at Austin: Hector Klie, Mary Wheeler
  – IG, University of Texas at Austin: Mrinal Sen, Paul Stoffa
  – PPPL: R. Samtaney
  – CRL, Sandia National Laboratory, Livermore: Jaideep Ray, Johan Steensland
  – University of Arizona: T.-C. Jim Yeh
  – Rutgers University: S. Garofilini, A. Cutinho, N. Zabusky