
Page 1: FASTMath: Summary of Portable Performance Strategies

FASTMath Team

Lori Diachin, Institute Director

FASTMath SciDAC Institute

LLNL-PRES-501654

Page 2: FASTMath defines portable performance from two perspectives

• End user of FASTMath software: the same piece of code runs on different architectures with ‘good’ performance.

• Developer of FASTMath software: a relatively small amount of effort is needed to get good performance, within advertised (algorithmic or performance) tolerances, across both current and future architectures.

We are particularly targeting portability for two classes of architectures:
• Hybrid multicore (CPU/GPU) systems
• Manycore systems

Page 3: FASTMath libraries are already working toward performance portability

We surveyed 12 key FASTMath libraries to determine:
• Challenges/strategies to support performance portability within the library
• Challenges/strategies associated with portable performance when using multiple packages together
• Key areas of future investigation

Libraries: PETSc, Hypre, mueLu, SUNDIALS, Eigensolvers, SuperLU, Chombo, BoxLib, PUMI/PARMA/PCU, MOAB, Zoltan/graph algorithms, VisIt (guest library)

Page 4: Summary of on-node performance challenges within a single library

Data movement
• NUMA – multilevel memory management techniques (tiling and smart task/data placement); see the tiling sketch below
• Data motion distance
• Cache coherence

Thread management
• Placement
• Interacting thread pools
• Oversubscription
• Thread collectives/synchronization techniques
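As a concrete illustration of the tiling technique mentioned above, here is a minimal cache-blocking sketch; the function name, array, and tile size are illustrative and not taken from any FASTMath package.

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Illustrative cache-blocked (tiled) traversal of an N x N grid.
// TILE is a tunable blocking factor chosen so a tile's working set
// stays resident in L1/L2 cache; nothing here is FASTMath-specific.
void scale_grid_tiled(std::vector<double>& u, std::size_t N, double alpha) {
  constexpr std::size_t TILE = 64;  // assumed tile size; tune per architecture
  for (std::size_t jb = 0; jb < N; jb += TILE) {
    for (std::size_t ib = 0; ib < N; ib += TILE) {
      const std::size_t jend = std::min(jb + TILE, N);
      const std::size_t iend = std::min(ib + TILE, N);
      // All work inside this tile touches a small, cache-resident block.
      for (std::size_t j = jb; j < jend; ++j)
        for (std::size_t i = ib; i < iend; ++i)
          u[j * N + i] *= alpha;
    }
  }
}
```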

Page 5: Early thoughts on addressing on-node performance challenges within a single library

Execution models
• MPI + X + Y
• Lightweight MPI – lightweight communication libraries
• EAVL runtime

Algorithmic changes
• Changes in algorithms to reduce communication/increase arithmetic intensity
• Fusion – pipelining – both with and without communication
• Compute on the fly using fast small memory to reduce storage costs? Compression

Multiple kernel support
• Hand-coded kernels
• Code generation

Data and execution abstractions (e.g., templated C++ approaches (Kokkos, RAJA) or other libraries (TiDA)); see the Kokkos sketch below

Compiler-based approaches (e.g., ROSE, CHILL)

Just-in-time compilation

Interfaces/Tools
• API for communicating data layout and location (pinning)
• Interfaces for thread pool/pinning information
• Zoltan load balancing/partitioning/coloring

Embedded performance models

Standards ‘influence’ – OpenMP, C++, etc.
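As a small illustration of the templated C++ data/execution abstractions mentioned above, the Kokkos sketch below expresses one vector update that runs unchanged on whichever backend (OpenMP, CUDA, etc.) is selected at build time; the kernel, array names, and sizes are illustrative, not code from any FASTMath package.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1 << 20;
    // View: a portable multidimensional array whose memory space
    // follows the default execution space (host or device).
    Kokkos::View<double*> x("x", N), y("y", N);

    // One parallel_for works unchanged across OpenMP, CUDA, HIP, ... backends.
    Kokkos::parallel_for("axpy", N, KOKKOS_LAMBDA(const int i) {
      y(i) = 2.0 * x(i) + y(i);
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```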

Page 6: Summary of challenges when using multiple libraries/software packages

Thread management
• Oversubscription of threads
• Programming model consistency

Data management
• Data affinity/use of data as laid out by others – this depends on the type of data
• The need to track the original layout so that it can be used as needed in ‘re-layout’
• Understanding the costs of moving data vs. using layouts as given
• Transitioning data among software components
• How do we do more work on data while it’s in cache (L1 or L2)? This can cross library boundaries (e.g., mesh/solver interactions)

Addressing algorithmic improvements that result from closer coupling of mesh/solver
• Using mesh information to develop improved matrix-free solution algorithms, but this introduces software issues

Performance diagnostics
• User decisions can significantly impact data remap costs – we can make this information transparent so that the user knows the implications of the choices they are making
• Communicating performance to the user – what should they expect?

Resource management
• Making libraries small enough not to swamp memory – heavily templated libraries are a particular concern
• If there is a lot of per-process data that cannot be shared – with, say, 50 processes, how much is private and must be replicated, and how much of the data can be shared to minimize memory costs?
• Libraries sharing/coordinating use of fast memory and threads

Page 7: Early thoughts on strategies for addressing the challenges associated with multiple libraries/packages

Data management across libraries
• Do we need a ‘data mediator/coupler’? It would not ‘manage’ all the data (we have concerns about a full data manager). Not sure if this is needed overall or only pairwise – accepting data in an ‘intermediate form’ that can be quickly optimized for each component’s use
• Fusion between libraries – different scales of granularity (e.g., smaller chunks of data to allow data reuse while in cache)
• Optimize data layouts across multiple components a priori? We are not even doing the simple things well right now
• Producer and consumer of the data preserve some locality across the interface

Thread management across libraries
• Thread communicator API – allows multiple programming models to communicate information about threads and thread pools (a hypothetical sketch appears below)
• Need to associate thread information with data layout

Testing solutions
• Looking at sub-components of the problem to simplify the process to some extent – e.g., look only at solvers rather than a whole PDE solver
• Development of mini-apps that combine software tools to explore these ideas and solution strategies

Resource management
• Phased communication and resource splitting/allocation for different parts of the solution process
• Need good shared library practices
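The thread communicator API above is still a proposal; the sketch below is a purely hypothetical illustration of the kind of shared thread-pool description two libraries might exchange. All of these names are invented for illustration and do not come from an existing FASTMath interface.

```cpp
#include <vector>
#include <cstdio>

// Hypothetical "thread communicator": a small object handed to every
// library so they share one description of the node's thread pool
// instead of each spawning (and oversubscribing) their own.
struct ThreadComm {
  int num_threads;             // size of the shared pool
  std::vector<int> core_of;    // core_of[t] = core that thread t is pinned to
  std::vector<int> numa_of;    // numa_of[t] = NUMA domain owning thread t's data
};

// A library receiving the communicator can query placement instead of
// guessing, e.g. to co-locate its work with data laid out by another package.
void report_pool(const ThreadComm& tc) {
  std::printf("shared pool of %d threads\n", tc.num_threads);
  for (int t = 0; t < tc.num_threads; ++t)
    std::printf("  thread %d -> core %d, NUMA domain %d\n",
                t, tc.core_of[t], tc.numa_of[t]);
}

int main() {
  // Example: 4 threads, two per NUMA domain (illustrative numbers).
  ThreadComm tc{4, {0, 1, 2, 3}, {0, 0, 1, 1}};
  report_pool(tc);
  return 0;
}
```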

Page 8: Library Summaries

FASTMath SciDAC Institute

Page 9: PETSc (POC: Barry Smith)

Current Strategies:
• Thread communicator concept to allow passing of thread and data information among packages
• Algorithmic changes to fuse/pipeline operations
• MPI+OpenMP; ViennaCL for GPU support

Future Strategies:
• ? Didn’t capture this

Page 10: Hypre (POC: Rob Falgout)

Current Strategies:
• Reducing communication
• Changing math algorithms
• Focus on portability (minimize external dependencies)
• Using OpenMP; will investigate OpenACC

Future Strategies:
• Compiler-based approaches
• Reorder operations to exploit better convergence properties of Gauss-Seidel (GS) within a node
• Code generation approaches for reordering operations to explore communication/computation tradeoffs

Page 11: Zoltan2 (POC: Karen Devine)

Current Strategies:
• Task placement to get locality and reduce communication costs
• Partitioning to reduce the number of neighbors

Future Strategies:
• Whether/how Kokkos plays a role
• Partitioning to divide work between CPU and GPU
• Partitioning for fixed-size memories

Note:
• Zoltan could provide a suite of tools that many others in FASTMath can use – perhaps Kokkos can benefit from using Zoltan. It might be able to provide a bridge between tiling tools and higher-level libraries – mapping application topology onto machine topology using task placement strategies

Page 12: mueLu (POC: Jonathan Hu)

Current Strategies:
• Reducing communication
• Changing math libraries

Future Strategies:
• Higher-level kernels built on Kokkos
• Fused kernels (prolongator smoothing)
• Task-based parallelism for more expensive aggregation schemes
• Data layout for different devices
• Could use tools like HWloc (gives the layout of the node and allows you to pin threads) for large-scale systems, portable across platforms; see the sketch below
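As an illustration of the HWloc idea above, the sketch below uses hwloc to discover the cores on a node and pin the calling thread to one of them; error handling is elided and the choice of core (index 0) is arbitrary. This is a generic sketch, not mueLu code.

```cpp
#include <hwloc.h>
#include <cstdio>

int main() {
  hwloc_topology_t topo;
  hwloc_topology_init(&topo);   // build a description of the node
  hwloc_topology_load(topo);

  int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
  std::printf("cores on this node: %d\n", ncores);

  if (ncores > 0) {
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
    // Bind the current thread to the chosen core's cpuset.
    hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_THREAD);
  }

  hwloc_topology_destroy(topo);
  return 0;
}
```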

Page 13: Chombo (POC: Brian Van Straalen)

Current Strategies:
• MPI+OpenMP
• Coarse-grained (released) and fine-grained parallelism; experiments with smaller kernel codes with fine locking, where it does perform better for geometric multigrid – a co-scheduled moving wavefront – needs to be done by hand or with the CHILL compiler to be successful
• Using different wavefront algorithms for different architectures
• Several DAG-based parallel strategies (HPX, Charm++, etc.); limited success so far
• Tried tiling but have not had success yet with our version (now trying TiDA)

Future Strategies:
• Need a compiler to help with fusion (tiling may mitigate this by providing data locality)
• DSL for AMR (coded up in C++ and Fortran; it does reduce the amount of code by quite a bit) – separates the dimensionality and the way the data is laid out from the algorithm – can be fed to the tiler and the communication code generation

Page 14: SuperLU (POC: Sherry Li)

Current Strategies:
• ? Didn’t capture this

Future Strategies:
• DAG scheduling tool rather than manual scheduling (static is fine; dynamic would mess up the data structures – could be a coloring problem) – could use Zoltan
• Interested in a new programming model or library to use heterogeneous cores. Don’t need help on the CPU side – need help with CPU/GPU transfer and with making the API more uniform across architectures; currently have cudaMemcpy and would like a generic ‘GPUMemCopy’ (see the sketch below)
• Would like better performance models – need to be broader than on-node, also need across-node
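A hypothetical sketch of the more uniform copy API wished for above: one call site, with the vendor-specific transfer chosen at build time. The wrapper name and the USE_CUDA flag are invented; only cudaMemcpy is an existing API here, and this is not SuperLU code.

```cpp
#include <cstring>
#include <cstddef>
#ifdef USE_CUDA
#include <cuda_runtime.h>
#endif

enum class CopyDir { HostToDevice, DeviceToHost };

// Hypothetical uniform device-copy wrapper: callers write one call,
// and the build configuration decides which transfer is issued.
void device_copy(void* dst, const void* src, std::size_t bytes, CopyDir dir) {
#ifdef USE_CUDA
  cudaMemcpy(dst, src, bytes,
             dir == CopyDir::HostToDevice ? cudaMemcpyHostToDevice
                                          : cudaMemcpyDeviceToHost);
#else
  (void)dir;                 // host-only build: a plain copy stands in
  std::memcpy(dst, src, bytes);
#endif
}
```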

Page 15: PUMI/PARMA/PCU (POC: Mark Shephard)

Current Strategies:
• Lightweight API for MPI and threads – we have an inexpensive abstraction for this
• API for the data structures to allow switching among different data structures for different architectures – link against different ones that are hand coded

Future Strategies:
• Optimal use cases all involve using these tools with solvers, etc. – will need to be consistent across packages
• Could use portable HWloc to pin threads/memory

Page 16: BoxLib (POC: Ann Almgren)

Current Strategies:
• Using tiling/TiDA
• Load balancing – accurate assessment of the workload, the cost of moving the data, and the communication patterns

Future Strategies:
• Zoltan might be able to help with some of this
• Establish detailed profiling to understand current bottlenecks/costs
• Wish for a fast malloc for threads – would be private to a thread; haven’t tried tcmalloc yet (does it work for Fortran?) – see the sketch below
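As a rough illustration of the thread-private fast allocation wished for above (not BoxLib code, and not tcmalloc), a C++ thread_local scratch buffer gives each thread its own reusable arena so hot-loop allocations avoid contending on the global allocator; the function name and element type are illustrative.

```cpp
#include <vector>
#include <cstddef>

// Each thread gets its own arena; the buffer grows once to the largest
// requested size and is reused on later calls, so steady-state use
// performs no allocation at all.
inline double* thread_scratch(std::size_t n) {
  thread_local std::vector<double> arena;   // private to the calling thread
  if (arena.size() < n) arena.resize(n);    // grow once, reuse afterwards
  return arena.data();
}
```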

Page 17: PARPACK/Eigensolvers (POC: Chao Yang)

Current Strategies:
• Using hybrid MPI/OpenMP
• New algorithms to increase concurrency

Future Strategies:
• Partners are interested in GPUs; the calculations are very memory intensive, which is problematic – may need mixed CPU/GPU. They don’t want to do this themselves and explicitly copy things back and forth – worried about maintenance/keeping it lightweight
• Want to explore matrix-free methods to reduce memory use – compute-on-the-fly algorithms (see the sketch below)
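As an illustration of the matrix-free, compute-on-the-fly idea above, the sketch below applies a 1D Laplacian without ever storing its sparse matrix; an iterative eigensolver or linear solver only needs this operator action. The operator, names, and boundary treatment are illustrative, not code from any FASTMath package.

```cpp
#include <vector>
#include <cstddef>

// Matrix-free application of y = A*x for A = tridiag(-1, 2, -1)
// (1D Laplacian with Dirichlet boundaries): the matrix entries are
// generated on the fly, so no sparse matrix is ever stored.
void apply_laplacian_1d(const std::vector<double>& x, std::vector<double>& y) {
  const std::size_t n = x.size();
  for (std::size_t i = 0; i < n; ++i) {
    const double left  = (i > 0)     ? x[i - 1] : 0.0;
    const double right = (i + 1 < n) ? x[i + 1] : 0.0;
    y[i] = 2.0 * x[i] - left - right;   // row i of A times x
  }
}
```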

Page 18: Segmental Refinement Multigrid (POC: Mark Adams)

Current Strategies:
• ? Didn’t capture this

Future Strategies:
• Thread communicator in Chombo; using threaded PETSc in Chombo
• Restructure the algorithm to better utilize the buffering strategy in SR; never store the entire problem – compute fine-grid data on the fly

Page 19: MOAB (POC: Vijay Mahadevan)

Current Strategies:
• Using low-level APIs – we don’t know how the data structures are laid out or how they will be used
• OpenARC
• Writing a mini-app

Future Strategies:
• Need to expose a higher-level API for data structures

Page 20: SUNDIALS (POC: Carol Woodward)

Current Strategies:
• Using a bit of threading for the vector operations we supply to users
• Using CVODE in a threadsafe manner
• SUNDIALS doesn’t hold much data at all – it uses callbacks and can rely heavily on user/other code for threading – this works well for cases where you don’t need a lot of fusing for performance. Can the callbacks be changed to return fused operations rather than BLAS-1 operations?

Future Strategies:
• Reducing communication through vector kernel fusion (see the sketch below)
• GPU interacting with the vector supplied by the user – need to play well with what the user supplies and the libraries underneath use
• Tools: code generation tools for kernel code – architecture aware/specific for vectors. What are the fused kernels that are useful for linear algebra? Are there too many? Are there a reasonable number that can still be used for callbacks?
• Fusion that includes communication
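As an illustration of the vector kernel fusion idea above (a generic sketch, not SUNDIALS code), the routine below folds a scale, an axpy, and a dot product into a single pass over the vectors, substantially reducing memory traffic compared with separate BLAS-1 style calls; the particular combination and names are illustrative.

```cpp
#include <vector>
#include <cstddef>

// Fused kernel: y = a*y + b*x and dot = y.x computed in one sweep.
// Separate scale/axpy/dot kernels would stream y and x through memory
// several times; the fused loop reads each vector once.
double fused_update_and_dot(std::vector<double>& y,
                            const std::vector<double>& x,
                            double a, double b) {
  double dot = 0.0;
  const std::size_t n = y.size();
  for (std::size_t i = 0; i < n; ++i) {
    y[i] = a * y[i] + b * x[i];  // scale + axpy in one pass
    dot += y[i] * x[i];          // reduction fused into the same loop
  }
  return dot;
}
```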

Page 21: VisIt (Guest Package) (POC: Mark Miller)

Current Strategies:
• Prototyping GPU support with the multilevel EAVL (a lightweight C++ tool from ORNL) – like Kokkos but operating at two levels of abstraction (structured and unstructured mesh)
• Implement a few key VTK objects in EAVL; they will then run on CPU, GPU, etc. transparently, so all the expression templates work on GPUs

Future Strategies:
• ? Didn’t capture this

Page 22: Next Steps

Capture the notes here in a text document – assignments for folks to flesh out

Next meeting:
• Piggyback on the IDEAS meeting January 27-29 in the Bay Area
• Piggyback on CSE15; ask Chris Johnson for space at Utah

Thread communicator – Barry, Jed, others?
• Develop and present a new C API – Barry/Jed/others?
• Create a working group to provide feedback, early use cases, etc.