Optimizing Thread Performance for a
Genomics Variant Caller
This talk
• Introduce two tools that can help improve the performance of multithreaded code
• Apply the tools to a real-world genomics code
Tool 1: Allinea Performance Reports – benchmarking and characterization
Tool 2: Allinea Forge – debugging and profiling
• Debug and profile from one interface and one configuration
• Secure native remote and local access
• Rapidly switch between the tasks
• Edit, build, commit, debug, profile, optimize..
• Small data files
• <5% slowdown
• No instrumentation
• No recompilation
Our profiler finds the performance bottlenecks
Our debugger helps fix bugs and improve performance
• Observe why workload is imbalanced
• Observe why particular code paths are followed
• .. And fix any bugs that optimization creates!
Above all…
• The tools are aimed at any performance problem that matters
– Focus on time: the ultimate judge of performance
• Do not prejudge the problem
– Don’t assume it’s MPI messages, threads or I/O before profiling!
• If there’s a problem..
– Allinea Performance Reports shows it, and advises you on solutions
– Allinea Forge’s profiler shows it, next to your code
6 steps to improve performance
Get a realistic test case
• Performance on real data matters
• Keep the test case for reference and re-use
Profile your code
• Add “-g” flag to your compilation
• Run with a profiler
Look for the significant
• Which part/phase of the code dominates time?
• Is there any unexpected significant time use?
What is the nature of the problem?
• Compute? I/O? MPI? Thread synchronization?
• Display the metrics that show the problem best
Apply brain to solve
• MPI – can you balance the work better?
• Compute – is memory time dominant? Can you improve data layout?
Think of the future
• Try larger process or thread counts to watch for scalability problems
• Keep the profile (.map file) for future comparison
Example: Improving Thread Usage in Genomics
• DISCOVAR
– Variant caller and small genome assembler
– Sub-mammalian sized genomes
– Newer DISCOVAR de novo for larger genomes
• C++ and OpenMP
• Developed by the Broad Institute of MIT and Harvard
A first look – on real hardware
• It’s not I/O intensive
• Good quantity of OpenMP time
• No vectorization
OpenMP in detail
• Physical cores are 200% loaded: hyperthreading is on
• 17% of parallel region time is synchronization
• .. That's quite high
Investigating the OpenMP synchronization
• Horizontal time axis: colour coded
– Dark green – single core
– Light green – OpenMP work
– Light blue – pthread synchronization
– Gray – idle
• Vertical axis – #cores doing something
• Something’s very wrong towards the end – with all the gray
Zoom in on the region
• Stacks, code, regions, and time are all focused on the zoomed area
• Key observation:
– The OpenMP region with "omp critical" is where the time is being wasted
Fixing
• #pragma omp critical
– Executes exactly one thread at a time to ensure safety
• Is costing too much
– Passes a "token" from thread to thread to do small pieces of work
• Run the whole section on one thread instead – has the same semantics (see the sketch below)
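A minimal sketch of the pattern and of the fix, assuming the critical section guards tiny per-item updates; the names below are illustrative, not taken from the DISCOVAR source:

// critical_fix.cpp – build with: g++ -fopenmp -O2 critical_fix.cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Stand-in for the small piece of work done per item.
static int classify(unsigned x) { return static_cast<int>((x * 2654435761u) % 16u); }

int main() {
    const int n = 1 << 20;
    std::vector<long> hist(16, 0);

    // Before: every iteration funnels through one critical section, so the
    // threads effectively pass a "token" around for tiny pieces of work and
    // spend most of their time in synchronization.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp critical
        ++hist[classify(i)];
    }

    std::fill(hist.begin(), hist.end(), 0L);

    // After: only one thread could ever make progress at a time anyway, so
    // running the whole section on one thread gives the same result and
    // removes the thread-to-thread handoff cost.
    for (int i = 0; i < n; ++i)
        ++hist[classify(i)];

    std::printf("bucket 0 count: %ld\n", hist[0]);
    return 0;
}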
Impact of change
• Runtime down by 7%
As a performance report
• Improvements in
– Runtime
– Synchronization overhead
Let’s try something bigger – into Amazon cloud!
• c4.8xlarge
– 36 hyperthreaded cores
– 60 GB RAM
– Xeon E5-2666 v3 (Haswell)
– 25 MB cache
– 2.6 GHz
vs
• Our physical server
– 24 hyperthreaded cores
– 24 GB RAM
– Xeon E5-2407 v2
– 10 MB cache
– 2.4 GHz
$ ./runme.sh
discovar version: Discovar r52488
loadaverage: 0.05 0.98 1.36 1/790 16317
2015-07-27 07:57 PERF: REAL 835.857 USER 36.188 SYSTEM 5.441 PERC 4.71
835 seconds to run on EC2
… vs …
~448 seconds on our physical server
Why?
Profile with Allinea Forge to find where the problem is
• Focus on the initial 300 seconds: something must be wrong here
• Serious lack of good "green" compute
In detail…
• 36 threads, waiting… but who is using madvise?!
Why is glibc so bad?
• madvise system call in _int_free()
– At least two context switches each call ..
– This glibc version has issues…?
• What other options are there?
Maybe Google TCMalloc?
• Optimized for multi-threaded applications
• No win
– Same run time
– Issue is use of sys_futex, not madvise
• .. Not optimized for this multithreaded application!
Jemalloc?
• As recommended by the Broad Institute
• … same runtime
Jemalloc – same problem
• Source proves the issue again…
Can Intel libraries help?
• We try the Intel TBB multithreaded allocator – see the linking sketch below
• 14 minutes down to 10 minutes!
• .. But this code still has scope for more…
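The talk does not show how the TBB allocator was wired in; one common route is TBB's malloc proxy, which substitutes the scalable allocator for the global malloc/free and new/delete without changing any allocation call sites. A minimal sketch, assuming the proxy library is installed:

// tbb_alloc.cpp – build with: g++ -O2 tbb_alloc.cpp -ltbbmalloc_proxy
// (an unmodified binary can get the same substitution at run time with
//  LD_PRELOAD=libtbbmalloc_proxy.so)
#include <tbb/tbbmalloc_proxy.h>  // redirects malloc/free and new/delete to tbbmalloc

#include <cstdio>
#include <vector>

int main() {
    // Every allocation below now goes through the TBB scalable allocator,
    // which is designed to reduce cross-thread contention in malloc/free.
    std::vector<std::vector<int>> buffers(64, std::vector<int>(1024, 1));
    std::printf("allocated %zu buffers\n", buffers.size());
    return 0;
}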
Real optimization of OpenMP regions
• NB – still profiling the first 300 seconds only
• Significant inactivity in the final 60 seconds
• OpenMP region
– #pragma omp parallel for
• Is it working?
– No – the threads are idle
• Let's remove it
After the first fix…
• Now able to run to completion
– 358 seconds
• Still inactivity at end of run
Zoomed to the inactivity…
• Another OpenMP region
• Quick edit: comment out the OpenMP, again!
… and the impact
• Down to 304 seconds
Finally… something to sort out
• Recursive, in-place multithreaded sorter
• Not scaling well as thread counts grow
• Options?
– Re-engineer
– Replace
– Tune
Let’s tune
• Try limiting the thread pool to 8 workers – see the sketch below
– Better than 36 clashing threads?
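A minimal sketch of such a cap, assuming the sorter's pool is driven by OpenMP (the talk does not show the sorter's internals, so this is illustrative only):

// cap_threads.cpp – build with: g++ -fopenmp -O2 cap_threads.cpp
#include <omp.h>
#include <cstdio>

int main() {
    // Cap the pool at 8 workers instead of one per hardware thread; the same
    // effect can be had by setting OMP_NUM_THREADS=8 before the run.
    omp_set_num_threads(8);

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("workers in pool: %d\n", omp_get_num_threads());
    }
    return 0;
}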
Result…
• Runtime 4.7 minutes
• 3x improvement on
original
• #1 position on the
Broad Benchmark list
for a sub-$2 / hour
system!
Lessons learned
• Real codes exhibit many different performance patterns
– Profiling real data sets at real scales is vital to target the effort
– Small test cases do not expose all the problems
– Small thread counts can be too small to find real problems
• Changing code can be simple
– Use threads wisely – more threads will not always be faster
– Changing libraries – someone else might have fixed your problem
• Re-engineering is sometimes necessary
– Take advantage of vector units
– Take advantage of threads
Increase the performance of your software
Analyze and tune with Allinea Performance Reports
Develop, profile and debug applications with Allinea Forge
With professional support when you need it most
Read more!