
Application of Mixed-Mode Programming in a Real-World Scientific Code

Case Study:

PRACE Autumn School in HPC Programming Techniques, 25-28 November 2014, Athens, Greece

Nikos Tryfonidis, Aristotle University of Thessaloniki

What is this all about?

You are given a scientific code, parallelized with MPI.

Question: Any possible performance benefits from a mixed-mode implementation?

We will go through the steps of preparing, implementing and evaluating the addition of threads to the code.

The Code

MG: General Purpose Computational Fluid Dynamics Code (~20,000 lines). Written in C and parallelized with MPI, using a communication library written by the author.

Developed by Mantis Numerics and provided by Prof. Sam Falle (director of the company and author of the code).

MG has been used professionally for research in Astrophysics and for simulations of liquid CO2 in pipelines, non-ideal detonations, groundwater flow, etc.

Outline

1. Preparation: Code description, initial benchmarks.

2. Implementation: Introduction of threads into the code. Application of some interesting OpenMP concepts:

- Parallelizing linked list traversals
- OpenMP Tasks
- Avoiding race conditions

3. Results - Conclusion

Preparation: Code Description

Make It Hybrid: How do you start?

Step 1: Inspection of the code, discussion with the author.

Step 2: Run some initial benchmarks to get an idea of the program’s (pure MPI) runtime and scaling.

Step 3: Use profiling to gain some insight into the code’s hotspots/bottlenecks.

What does the code do?

Computational domain: Consists of cells (yellow boxes) and joins (arrows).

The code performs computational work by looping through all cells and joins.

(Diagram, 1D example: 1st Cell, 2nd Cell, …, Last Cell, connected by the 1st Join, 2nd Join, …, Last Join.)

…and in parallel?

Cells are distributed to all MPI Processes, using a 1D decomposition (each Process gets a contiguous group of cells and joins).

(Diagram: halo communication between Proc. 1 and Proc. 2 at the boundary of their cell groups.)
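As a rough illustration of such a contiguous 1D split (a minimal sketch; the function and variable names are hypothetical, not taken from MG):

#include <mpi.h>

/* Sketch: contiguous 1D block decomposition of ncells cells among
   the MPI ranks. Each rank gets cells [*first, *last] (inclusive). */
void my_cell_range(int ncells, int *first, int *last)
{
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    int base = ncells / nproc;   /* cells every rank gets            */
    int rem  = ncells % nproc;   /* leftover cells, one per low rank */

    *first = rank * base + (rank < rem ? rank : rem);
    *last  = *first + base + (rank < rem ? 1 : 0) - 1;
}

Each rank would then also hold a layer of halo cells next to each neighbouring rank, kept up to date by the halo communication sketched above.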

Structure of (the relevant part of) the code

Computational hotspot of the code: “step” function (~500 lines).

“step” determines the stable time step and then advances the solution over that time step.

It mainly consists of halo communication and multiple loops over cells and joins (separate loops for each).

Basic Structure of “step” Function

1st Order Step:
- Halo Communication (Calls to MPI)
- Loops through Cells and Joins (Computational Work)

2nd Order Step:
- Halo Communication (Calls to MPI)
- Loops through Cells, Halo Cells and Joins: Multiple Loops (Heavier Computational Work)

Preparation: Initial Benchmarks and Profiling

Initial Benchmarks

Initial benchmarks were run, using a test case suggested by the code author.

A 3D computational domain was used. Various domain sizes were tested (100³, 200³ and 300³ cells), for 10 computational steps.

Representative performance results will be shown here.

Initial Benchmarks: Execution Time

Figure 1: Execution time (in seconds) versus number of MPI Processes (size: 300³)

Initial Benchmarks: Speedup

Figure 2: Speedup versus number of MPI Processes (all sizes).

Initial Profiling

Profiling of the code was done using CrayPAT.

Four profiling runs were performed, with different numbers of processors (2, 4, 128, 256) and a grid size of 200³ cells.

Most relevant result of the profiling runs for the purpose of this presentation: percentage of time spent in MPI functions.

Initial Profiling: % MPI Time

Figure 3: Percentage of time spent in MPI communication, for 2, 4, 128 and 256 processors (200³ cells)

Initial Benchmarks and Profiling Results

The performance of the code scales poorly as the number of processors increases.

Performance actually becomes worse after a certain point.

Profiling shows that MPI communication dominates the runtime for high processor counts.

Mixed-Mode: Why It May Work Here

A smaller number of MPI Processes means:
- Fewer calls to MPI.
- Cheaper MPI collective communications. MG uses a lot of these (written in its communication library).
- Fewer halo cells (less data communicated, less memory required).

Note: The simple 1D decomposition of the domain requires more halo cells per MPI Process than a 2D or 3D domain decomposition would (with 1D slabs of a 200³ grid, for example, every interior process holds two full 200×200 halo faces, however thin its slab becomes). Mixed-Mode, requiring fewer halo cells overall, helps here.

Mixed-Mode: Why It May Not Work So Well Here.

Addition of OpenMP code: the extra synchronization (barriers, critical regions, etc.) that the threads may need is bad for performance!

Only one thread (the master) is used for communication, which means we will not be using the system’s maximum bandwidth potential.
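Since only the master thread makes MPI calls in such a mixed-mode scheme, the MPI library must be initialized with (at least) MPI_THREAD_FUNNELED support. A minimal sketch of that initialization, using standard MPI calls (the surrounding program structure is illustrative, not MG's):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;

    /* FUNNELED: threads may exist, but only the thread that called
       MPI_Init_thread (the master) will make MPI calls.             */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library does not provide MPI_THREAD_FUNNELED\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... computation in OpenMP parallel regions,
       MPI communication done by the master thread only ... */

    MPI_Finalize();
    return 0;
}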

Implementation: The Actual Work

So What’s The Plan?

And What’s Wrong With It?

All loops in “step” function are linked list traversals!

Linked List example (pseudocode):

pointer = first cell
while (pointer != NULL) {
    /* Do Work on current pointer / cell */
    pointer = next cell
}
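In C, such a traversal looks roughly like the sketch below. The CELL structure, its fields and do_work are invented here for illustration; MG's actual data structures are different.

#include <stddef.h>

/* Hypothetical cell type, for illustration only */
typedef struct cell {
    double       u;      /* some per-cell quantity        */
    struct cell *next;   /* next cell in the linked list  */
} CELL;

static void do_work(CELL *c) { c->u *= 0.5; }   /* placeholder "work" */

static void traverse(CELL *first_cell)
{
    CELL *p = first_cell;
    while (p != NULL) {      /* loop length unknown in advance */
        do_work(p);          /* work on the current cell       */
        p = p->next;         /* next cell only known here      */
    }
}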

And What’s Wrong With Linked List Traversals?

Linked list traversals use a while loop.

Iterations continue until the final element of the linked list is reached.

In other words: the next element that the loop will work on is not known until the end of the current iteration.

No well-defined loop boundaries!

We can’t use simple OpenMP “parallel for” directives to parallelize these loops!

Plan B?

Implementation: Manual Parallelization of Linked List Traversals

And How Do We Parallelize This?

Straightforward way to parallelize a linked list traversal: transform the while loop into a for loop.

The resulting for loop can then be parallelized with a “parallel for” directive!

Manual Parallelization of Linked List Traversals

1. Count number of cells (1 loop needed)

2. Allocate array of pointers of appropriate size

3. Point to every cell (1 loop needed)

4. Rewrite the original while loop as a for loop

Manual Parallelization of Linked List Traversals: Pseudocode

BEFORE:

pointer = first cell
while (pointer != NULL) {
    /* Do Work on current pointer / cell */
    pointer = next cell
}

AFTER:

counter = 0
pointer = first cell
while (pointer != NULL) { counter += 1; pointer = next cell }

Allocate pointer array (size of counter)

pointer = first cell
for (i = 0; i < counter; i++) { pointer_array[i] = pointer; pointer = next cell }

for (i = 0; i < counter; i++) { pointer = pointer_array[i]; /* Do Work */ }
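A self-contained C sketch of this transformation, including the OpenMP worksharing directive that the following slides add (the CELL type, do_work and all names are again hypothetical, not MG's):

#include <stdlib.h>

typedef struct cell {
    double       u;
    struct cell *next;
} CELL;

static void do_work(CELL *c) { c->u *= 0.5; }   /* placeholder, independent per cell */

static void traverse_parallel(CELL *first_cell)
{
    CELL *p;
    int   counter = 0, i;

    /* 1. Count the cells */
    for (p = first_cell; p != NULL; p = p->next)
        counter++;

    /* 2. Allocate an array of pointers of the appropriate size */
    CELL **cell_ptr = malloc(counter * sizeof *cell_ptr);

    /* 3. Point to every cell */
    for (i = 0, p = first_cell; p != NULL; p = p->next)
        cell_ptr[i++] = p;

    /* 4. The while loop is now a for loop with known bounds,
          so an ordinary OpenMP worksharing loop can be used.   */
    #pragma omp parallel for default(none) shared(cell_ptr, counter) \
            private(i) schedule(static)
    for (i = 0; i < counter; i++)
        do_work(cell_ptr[i]);

    free(cell_ptr);
}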


Manual Parallelization of Linked List Traversals: Adding OpenMP

After verifying that the code still produces correct results, we are ready to introduce OpenMP to the “for” loops we wrote.

In similar fashion to plain OpenMP, we must pay attention to:
- The data scope of the variables.
- Data dependencies that may lead to race conditions.

#pragma omp parallel shared (cptr_ptr, ...)            \
                     private (t, cptr, ...)            \
                     firstprivate (cptr_counter, ...)  \
                     default (none)
{
    #pragma omp for schedule(type, chunk)
    for (t = 0; t < cptr_counter; t++) {

        cptr = cptr_ptr[t];

        /* Do Work */
        /* ( . . . ) */
    }
}

Manual Parallelization Performance Tests

After introducing OpenMP to the code and verifying correctness, performance tests were carried out in order to evaluate its performance as a plain OpenMP code.

Tests were run for different problem sizes, using different numbers of threads (1,2,4,8).

Manual Parallelization Performance Results: Execution Time

Figure 4: Execution time versus number of threads, for second-order step loops (size: 200³ cells)

Manual Parallelization Performance Results: Speedup

Figure 5: Speedup versus number of threads, for second-order step loops (size: 200³ cells)

Manual Parallelization Performance Results: Thoughts

Almost ideal speedup for up to 4 threads. With 8 threads, the two heaviest loops continue to show decent speedup.

Similar results for the smaller problem size (100³ cells), only with less speedup.


In mixed mode the cells will be distributed among the MPI processes, so it will be interesting to see whether we still get this speedup there.

Implementation: Parallelization of Linked List Traversals Using OpenMP Tasks

Alternative Parallelization Method for Linked Lists: OpenMP Tasks

OpenMP Tasks: a feature introduced with OpenMP 3.0.

The Task construct basically wraps up a block of code and its corresponding data, and schedules it for execution by a thread.

OpenMP Tasks allow the parallelization of a wider variety of loops, making OpenMP more flexible.

What Can Tasks Do For Us Here?

The Task construct is the right tool for parallelizing a “while” loop with OpenMP.

Each iteration of the “while” loop can be a Task.

Using Tasks is an elegant method for our case, leading to cleaner code with minimal additions.

OpenMP Tasks For Linked List Traversals - Pseudocode

BEFORE:

pointer = first cell
while (pointer != NULL) {
    /* Do Work on current pointer / cell */
    pointer = next cell
}

AFTER:

#pragma omp parallel
{
    #pragma omp single
    {
        pointer = first cell
        while (pointer != NULL) {
            #pragma omp task
            {
                /* Do Work on current pointer / cell */
            }
            pointer = next cell
        }
    }
}
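A compilable C sketch of the same pattern, with the data scope handled explicitly (CELL and do_work are the same illustrative stand-ins as before, not MG's real data structures):

#include <stddef.h>

typedef struct cell {
    double       u;
    struct cell *next;
} CELL;

static void do_work(CELL *c) { c->u *= 0.5; }   /* placeholder, independent per cell */

static void traverse_tasks(CELL *first_cell)
{
    #pragma omp parallel
    {
        /* One thread walks the list and creates the tasks ...      */
        #pragma omp single
        {
            for (CELL *p = first_cell; p != NULL; p = p->next) {
                /* ... while the other threads execute them. Each task
                   captures its own copy of the pointer (firstprivate). */
                #pragma omp task firstprivate(p)
                do_work(p);
            }
        }   /* implicit barrier: all generated tasks complete here */
    }
}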

OpenMP Tasks For Linked List Traversals

Using OpenMP Tasks, we were able to parallelize the linked list traversal by just adding OpenMP directives!

Fewer additions to the code, elegant method.

Usual OpenMP work still applies: data scope and dependencies need to be resolved.

How Did Tasks Perform, Though? Execution Time

Figure 6: Execution time versus number of threads, for second-order step loops, using Tasks (size: 200³ cells)

How Did Tasks Perform, Though? Speedup

Figure 7: Speedup versus number of threads, for second-order step loops, using Tasks (size: 200³ cells)

Why So Bad?

Figure 8: OpenMP Task creation and dispatch overhead versus number of Threads¹.

¹ J.M. Bull, F. Reid, N. McDonnell, “A Microbenchmark Suite for OpenMP Tasks.” 8th International Workshop on OpenMP (IWOMP 2012), Rome, Italy, June 11-13, 2012, Proceedings.

Why So Bad?

For the current code, performance tests show that creating and dispatching the Tasks takes roughly as long as completing them, even on one thread.

With more threads it gets much worse (remember the logarithmic axis in the previous graph).

OpenMP Tasks: Conclusion

The problem: a very large number of Tasks, none of them heavy enough to justify the large overhead.

Despite their elegance and clarity, OpenMP Tasks are clearly not the way to go here.

Could try different strategies (e.g. grouping Tasks together), but that would cancel the benefits of Tasks (elegance and clarity).

And The Winner Is…

Manual Parallelization of linked list traversals will be used for our mixed-mode MPI+OpenMP implementation with this particular code.

It may be ugly and inelegant, but it can get things done.

In defense of Tasks: If the code had been written with the intent of using OpenMP Tasks, things could have been different.

Implementation: Avoiding Race Conditions Without Losing The Race

Avoiding Race Conditions

Additional synchronization required by OpenMP can prove to be very harmful for the performance of the mixed-mode code.

While race conditions need to be avoided at all costs, this must be done in the least expensive way possible.

Race Condition Example: Find Maximum

At a certain point, the code needs to find the maximum value of an array.

While trivial in serial, with OpenMP this is a race condition waiting to happen.

Part of the loop to be parallelized with OpenMP:

for (i = 0; i < n; i++) {
    if (a[i] > max) {
        max = a[i];
    }
}

What happens if (when) 2 or more threads try to write to “max” at the same time?

Two ways to tackle this:
1. Critical Regions
2. Manually (Temporary Shared Arrays)

Race Condition Example: Using Critical Regions

With a Critical Region we can easily avoid the race conditions.

However, Critical Regions are very bad for performance.

Question: should the Critical Region enclose the whole loop, or just the update?

for (i = 0; i < n; i++) {
    #pragma omp critical
    {
        if (a[i] > max) {
            max = a[i];
        }
    }
}

Now only one thread at a time can be inside the critical block.

Avoiding Critical Regions With Temporary Shared Arrays

(Diagram: a shared data array, split among 4 threads.)

- Each thread finds the maximum of its own part of the data and writes it to the corresponding element of a small temporary shared array (one element per thread).
- A single thread then picks out the total maximum from the temporary array.
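A self-contained C sketch of this technique (the function name and the exact arrangement are illustrative; it assumes n >= 1):

#include <stdlib.h>
#include <omp.h>

/* Find the maximum of a[0..n-1] without a critical region:
   each thread writes its own maximum into its own slot of a small
   shared array, then a single thread reduces that array.          */
static double find_max(const double *a, int n)
{
    double *local_max = NULL;
    double  max = a[0];
    int     nthreads = 1;

    #pragma omp parallel default(none) shared(a, n, local_max, max, nthreads)
    {
        #pragma omp single
        {
            nthreads  = omp_get_num_threads();
            local_max = malloc(nthreads * sizeof *local_max);
        }   /* implicit barrier: local_max allocated before use */

        double my = a[0];

        #pragma omp for
        for (int i = 0; i < n; i++)
            if (a[i] > my) my = a[i];

        local_max[omp_get_thread_num()] = my;   /* one slot per thread: no race */

        #pragma omp barrier
        #pragma omp single
        {
            for (int t = 0; t < nthreads; t++)
                if (local_max[t] > max) max = local_max[t];
            free(local_max);
        }
    }
    return max;
}

Each thread only ever writes its own element of local_max, so apart from the barriers no further synchronization (and no critical region) is needed.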

Critical Region vs Temporary Shared Array

Benchmarks were carried out, measuring execution time for the “find maximum” loop only.

Three cases were tested:
1. Critical Region with just the single “find max” instruction inside.
2. Critical Region with the whole “find max” loop inside.
3. Temporary Arrays.

Critical Region vs Temporary Shared Array: Time

Figure 9: Execution time versus number of threads (size: 200³ cells).

Critical Region vs Temporary Shared Array: Speedup

Figure 10: Speedup versus number of threads (size: 200³ cells).

Critical Region vs Temporary Shared Array: Results

The temporary array method is clearly the winner.

However:
- Additional code is needed for this method.
- Smaller problem sizes give smaller performance gains with more threads (nothing we can do about that, though).

Results: Mixed-Mode Performance

Mixed-Mode Performance Tests

The code was tested in mixed-mode with 2, 4 and 8 threads per MPI Process.

Same variation in problem size as before (100³, 200³, 300³ cells).

Representative results will be shown here.

Mixed-Mode: Execution Time (size: 200³ cells)

Figure 11: Time versus number of threads, 2 threads per MPI Proc.

Mixed-Mode: Execution Time (size: 200³ cells)

Figure 12: Time versus number of threads, 4 threads per MPI Proc.

Mixed-Mode: Execution Time (size: 200³ cells)

Figure 13: Time versus number of threads, 8 threads per MPI Proc.

Mixed-Mode: Speedup (size: 100³ cells)

Figure 14: Speedup versus number of threads, all combinations

Mixed-Mode: Speedup (size: 200³ cells)

Figure 15: Speedup versus number of threads, all combinations

Mixed-Mode: Speedup (size: 300³ cells)

Figure 16: Speedup versus number of threads, all combinations

Mixed-Mode Performance: Results

Mixed-Mode outperforms the original MPI-only implementation for the higher processor numbers tested.

MPI-only performs better than (or about the same as) mixed mode for the lower processor numbers tested.

Mixed-Mode with 4 threads/MPI Process is the best choice for problem sizes tested.

Mixed-Mode versus pure MPI: Memory Usage

Figure 17: Memory usage versus number of PEs, 8 threads per MPI Process (200³ cells)

Conclusion: Was Mixed-Mode Any Good Here?

Conclusion: Mixed-Mode Versus Pure MPI

For the problem sizes and processor numbers tested: Mixed-Mode performed as well as, or better than, pure MPI.

Higher processor numbers: Mixed-Mode manages to achieve speedup where pure MPI slows down.

Mixed-Mode required significantly less memory.

So, To Answer Our Very First Question…

Any possible performance benefits from a Mixed-Mode implementation for this code?

Answer: Yes. For larger numbers of processors (> 256), a mixed-mode implementation of this code:
- Provides speedup instead of slow-down.
- Uses less memory.

Thank You!