
Page 1

Working Group on Methodology for Optimizing Multilevel Parallelism

Fialho, Gimenez, Tallent, Welton, Morris, Malony, Montoya and Browne

Page 2

Working Assumptions: “Optimal” Parallelism = Optimum Productivity

• Formulate the performance optimization problem as finding “optimal” parallelism

• Best possible balance of the several modes of parallelism:
  • Intra-core
  • Intra-chip
  • Intra-node
  • Inter-node

• Multiple interacting factors, each with many options:
  • Intra-chip memory access
  • Intra-node memory access
  • Concurrency (threading, vectorization, acceleration)
  • Inter-node communication
  • Load balance

• Optimization must take the interactions among these factors into account (a combined sketch follows below)
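To make the balance across these modes concrete, below is a minimal illustrative sketch, assuming an MPI library and an OpenMP-capable C compiler, that combines all four in one kernel: MPI ranks for inter-node parallelism, OpenMP threads for intra-node and intra-chip parallelism, and a simd clause for intra-core vectorization. The problem size and build line are placeholder assumptions.

/* Illustrative sketch: one vector-update kernel exercising the modes above.
 * Inter-node: MPI ranks.  Intra-node / intra-chip: OpenMP threads.
 * Intra-core: compiler vectorization of the inner loop (simd clause).
 * Assumed build line: mpicc -O3 -fopenmp hybrid.c -o hybrid
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000                /* elements owned by each rank (illustrative size) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double *x = malloc(N * sizeof(double));
    double *y = malloc(N * sizeof(double));
    double local_sum = 0.0;

    /* Intra-node: threads share this rank's arrays.
     * Intra-core: the simd clause asks the compiler to vectorize the body. */
    #pragma omp parallel for simd reduction(+:local_sum)
    for (long i = 0; i < N; i++) {
        x[i] = (double)(i % 100);
        y[i] = 2.0 * x[i] + 1.0;
        local_sum += y[i];
    }

    /* Inter-node: combine the per-rank partial sums. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d sum=%g\n",
               nranks, omp_get_max_threads(), global_sum);

    free(x);
    free(y);
    MPI_Finalize();
    return 0;
}

How many ranks, threads per rank, and vector lanes such a kernel actually uses is exactly the balance the workflow on the later slides tries to choose.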

Page 3

Current Status of Tools

• Separate tools for optimizing each factor
• Separate tools for optimizing each mode of parallelism
• Several different tools for each factor or mode of parallelism are available
• Frameworks for integration of tools and/or creating “workflows” are available
• How do we determine appropriate and consistent workflows or framework instances from the tools?

Page 4

Apply a Conceptual Process

1. Specify what is to be optimized
2. Specify the metrics needed to diagnose the bottleneck and recommend the optimization
3. Define the algorithms for diagnosing bottlenecks and recommending optimizations in terms of the metrics
4. Determine the information needed to evaluate those metrics
5. Specify how to obtain the information

Generate a methodology (workflow) from the conceptual process; a worked instance of the steps follows below
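As a hypothetical worked instance of the five steps for a single bottleneck: (1) the target is thread-level load balance in a parallel loop; (2) the metric is an imbalance ratio, the maximum per-thread time divided by the mean per-thread time; (3) the diagnosis rule is an assumed threshold, e.g. a ratio above 1.10 suggests a different loop schedule or data decomposition; (4) the information needed is per-thread wall time; (5) it can be obtained with omp_get_wtime, as in the sketch below. The test loop, the 256-thread cap, and the threshold are all assumptions made for illustration.

/* Hypothetical instantiation of the conceptual process for one bottleneck.
 * Assumed build line: gcc -O2 -fopenmp imbalance.c -lm -o imbalance
 */
#include <omp.h>
#include <stdio.h>
#include <math.h>

#define MAX_THREADS 256          /* assumes at most 256 OpenMP threads */

int main(void) {
    double tthread[MAX_THREADS] = {0};
    double total = 0.0;
    int nthreads = 1;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp single
        nthreads = omp_get_num_threads();

        double t0 = omp_get_wtime();
        /* Triangular loop: with a static schedule, higher iteration numbers do
         * more work, so threads finish at different times. */
        #pragma omp for schedule(static) reduction(+:total) nowait
        for (int i = 0; i < 20000; i++)
            for (int j = 0; j < i; j++)
                total += sin((double)j);
        tthread[tid] = omp_get_wtime() - t0;   /* steps 4-5: per-thread time */
    }

    double tmax = 0.0, tsum = 0.0;
    for (int t = 0; t < nthreads; t++) {
        if (tthread[t] > tmax) tmax = tthread[t];
        tsum += tthread[t];
    }
    double imbalance = tmax / (tsum / nthreads);   /* step 2: the metric */
    printf("threads=%d checksum=%g imbalance=%.2f%s\n", nthreads, total,
           imbalance, imbalance > 1.10 ? " -> rebalance suggested" : "");
    return 0;
}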

Page 5

Two Cases

• “Optimize” parallelism of the application for a given execution environment and input data set with only “local” restructuring
  • Only “local” source code changes
  • No algorithm changes

• Re-structure/re-engineer the application to attain “optimal” parallelism on (possible) execution environments
  • Componentize the code
  • Choose different algorithms
  • Evaluate different component parts and optimize across “components”

• The workflows for the two cases are different, but they certainly overlap

Page 6

Optimization Information Requirements

• Need to incorporate multiple types of information
• “Optimize” with only “local” modification:
  • Source code
  • Execution environment
  • Runtime behavior

• Optimize with restructuring:
  • Domain
  • Algorithm
  • Source code / execution environment / runtime behavior

Page 7

Conceptual Workflow: Local (Inside-Out) Optimization Workflow

Assumptions: application structure, execution environment, and initial conditions/inputs are fixed.

1. Ensure load balance and choose optimal affinity mappings, etc.
2. Maximize intra-node efficiency:
   1. Intra-core – maximize vectorization and core-local memory access
   2. Intra-chip – optimize chip-local memory access
   3. Intra-node – minimize NUMA accesses
   4. Intra-node – choose the optimal number of tasks/threads
3. Minimize inter-node communication cost
4. If nodes are at the “roofline” for computation or memory bandwidth, then optimize inter-node communication (a sketch of this roofline test follows below)
5. If nodes are not bottlenecked on either computation or memory bandwidth, then reallocate data to minimize the number of nodes used
6. Go to step 2 and repeat
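Steps 4 and 5 turn on whether a node is at its compute or memory-bandwidth roofline. Below is a minimal sketch of that test, using the standard roofline bound min(peak FLOP rate, arithmetic intensity × peak bandwidth); the peak figures, the measured counts, and the 80% threshold for "at the roofline" are placeholder assumptions that the machine specification and a profiling tool would supply.

/* Hypothetical roofline check behind steps 4-5.  All numbers are placeholders;
 * in a real workflow the peaks come from the machine specification and the
 * measured counts from a profiling tool.
 */
#include <stdio.h>

int main(void) {
    /* Node capabilities (assumed). */
    double peak_gflops = 2000.0;          /* peak GFLOP/s */
    double peak_bw     = 200.0;           /* peak DRAM bandwidth, GB/s */

    /* Measured totals for one timed phase on one node (assumed). */
    double flops   = 4.0e12;              /* floating-point operations */
    double bytes   = 8.0e11;              /* bytes moved to/from DRAM */
    double seconds = 4.5;

    double ai       = flops / bytes;                       /* arithmetic intensity, flop/byte */
    double bw_roof  = ai * peak_bw;                        /* bandwidth ceiling in GFLOP/s */
    double roof     = peak_gflops < bw_roof ? peak_gflops : bw_roof;
    double achieved = flops / seconds / 1.0e9;             /* achieved GFLOP/s */
    double fraction = achieved / roof;

    printf("AI=%.2f flop/byte  achieved=%.1f GFLOP/s  roof=%.1f GFLOP/s  (%.0f%% of roof)\n",
           ai, achieved, roof, fraction * 100.0);

    if (fraction > 0.8)   /* assumed threshold for "at the roofline" */
        printf("-> near the node roofline: optimize inter-node communication (step 4)\n");
    else
        printf("-> node has headroom: reallocate data to use fewer nodes (step 5)\n");
    return 0;
}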

Page 8

Questions for Further Discussion

• What is the model for restructuring applications to attain “optimal” parallelism?
• Can we construct “roofline” analytical models for factors such as vectorization, threading, and communication? (One candidate form is sketched after this list.)
• How can we combine software restructuring tools with performance optimization tools to get an “optimal” restructuring workflow?
• Roles for offline and online optimization?
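On the second question, a natural starting point is the classical single-node roofline bound,

\[
P_{\text{attainable}} \;=\; \min\bigl(P_{\text{peak}},\; I \cdot B_{\text{mem}}\bigr),
\qquad
I \;=\; \frac{\text{flops}}{\text{bytes moved to/from memory}} .
\]

One possible extension to the factors named in the question (offered here only as a hedged sketch for discussion, not an established model) is a minimum over additional ceilings:

\[
P_{\text{attainable}} \;\le\; \min\bigl(w_{\text{vec}} \cdot P_{\text{scalar}},\;\; t \cdot P_{\text{core}},\;\; I_{\text{mem}} \cdot B_{\text{mem}},\;\; I_{\text{net}} \cdot B_{\text{net}}\bigr),
\]

where \(w_{\text{vec}}\) is the achieved vector width, \(P_{\text{scalar}}\) the scalar peak of one core, \(t\) the number of usable threads, \(P_{\text{core}}\) the single-core peak, \(B_{\text{net}}\) the node's network injection bandwidth, and \(I_{\text{net}}\) the flops per byte communicated between nodes.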