Optimizing Thread Performance for a
Genomics Variant Caller
This talk
• Introduce two tools that can help improve the performance of multithreaded code
• Apply the tools to a real-world genomics code
Tool 1: Allinea Performance Reports – benchmarking and characterization
Tool 2: Allinea Forge – debugging and profiling
• Debug and profile from one interface and one configuration
• Secure native remote and local access
• Rapidly switch between the tasks
• Edit, build, commit, debug, profile, optimize..
• Small data files
• <5% slowdown
• No instrumentation
• No recompilation
Our profiler finds the performance bottlenecks
Our debugger helps fix bugs and improve performance
• Observe why workload is imbalanced
• Observe why particular code paths are followed
• .. And fix any bugs that optimization creates!
Above all…
• The tools are aimed at any performance problem that matters
– Focus on time: the ultimate judge of performance
• Do not prejudge the problem
– Don’t assume it’s MPI messages, threads or I/O before profiling!
• If there’s a problem..
– Allinea Performance Reports shows it, and advises you on solutions
– Allinea Forge’s profiler shows it, next to your code
6 steps to improve performance
Get a realistic test case
• Performance on real data matters
• Keep the test case for reference and re-use
Profile your code
• Add “-g” flag to your compilation
• Run with a profiler
Look for the significant
• Which part/phase of the code dominates time?
• Is there any unexpected significant time use?
What is the nature of the problem?
• Compute? I/O? MPI? Thread synchronization?
• Display the metrics that show the problem best
Apply brain to solve
• MPI – can you balance the work better?
• Compute – is memory time dominant? Can you improve data layout?
Think of the future
• Try larger process or thread counts to watch for scalability problems
• Keep the profile (.map file) for future comparison
Example: Improving Thread Usage in Genomics
• DISCOVAR
– Variant caller and small genome assembler
– Sub-mammalian sized genomes
– Newer DISCOVAR de novo for larger genomes
• C++ and OpenMP
• Developed by the Broad Institute of MIT and Harvard
A first look – on real hardware
• It’s not I/O intensive
• Good quantity of OpenMP time
• No vectorization
OpenMP in detail
• Physical cores are 200% loaded: hyperthreading is on
• 17% of parallel region time is synchronization
• .. That's quite high
Investigating the OpenMP synchronization
• Horizontal time axis: colour coded
– Dark green – single core
– Light green – OpenMP work
– Light blue – pthread synchronization
– Gray – idle
• Vertical axis – #cores doing something
• Something’s very wrong towards the end – with all the gray
Zoom in on the region
• Stacks, code, regions, and time are all focused on the zoomed area
• Key observation:
– The OpenMP region with "omp critical" is where the time is being wasted
Fixing
• #pragma omp critical
– Executes exactly one thread at a time to ensure safety
• Is costing too much
– Passes a "token" from thread to thread to do small pieces of work
• Run the whole section on one thread instead – has the same semantics (see the sketch below)
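A minimal sketch of the pattern and of the fix, assuming the critical section guards tiny per-item updates; the names below are illustrative, not taken from the DISCOVAR source:

// critical_fix.cpp – build with: g++ -fopenmp -O2 critical_fix.cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Stand-in for the small piece of work done per item.
static int classify(unsigned x) { return static_cast<int>((x * 2654435761u) % 16u); }

int main() {
    const int n = 1 << 20;
    std::vector<long> hist(16, 0);

    // Before: every iteration funnels through one critical section, so the
    // threads effectively pass a "token" around for tiny pieces of work and
    // spend most of their time in synchronization.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp critical
        ++hist[classify(i)];
    }

    std::fill(hist.begin(), hist.end(), 0L);

    // After: only one thread could ever make progress at a time anyway, so
    // running the whole section on one thread gives the same result and
    // removes the thread-to-thread handoff cost.
    for (int i = 0; i < n; ++i)
        ++hist[classify(i)];

    std::printf("bucket 0 count: %ld\n", hist[0]);
    return 0;
}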
Impact of change
• Runtime down by 7%
As a performance report
• Improvements in
– Runtime
– Synchronization overhead
Let’s try something bigger – into Amazon cloud!
• c4.8xlarge
– 36 hyperthreaded cores
– 60 GB RAM
– Xeon E5-2666 v3 (Haswell)
– 25 MB cache
– 2.6 GHz
vs
• Our physical server
– 24 hyperthreaded cores
– 24 GB RAM
– Xeon E5-2407 v2
– 10 MB cache
– 2.4 GHz
$ ./runme.sh
discovar version: Discovar r52488
loadaverage: 0.05 0.98 1.36 1/790 16317
2015-07-27 07:57 PERF: REAL 835.857 USER 36.188 SYSTEM 5.441 PERC 4.71
835 seconds to run on EC2
… vs …
~448 seconds on our physical server
Why?
Profile with Allinea Forge to find where the problem is
• Focus on the initial 300 seconds: something must be wrong here
• Serious lack of good "green" compute
In detail…
• 36 threads, waiting… but who is using madvise?!
Why is glibc so bad?
• madvise system call in _int_free()
– At least two context switches each call ..
– This glibc version has issues…?
• What other options are there?
Maybe Google TCMalloc?
• Optimized for multi-threaded applications
• No win
– Same run time
– Issue is use of sys_futex, not madvise
• .. Not optimized for this multithreaded application!
Jemalloc?
• As recommended by the Broad Institute
• … same runtime
Jemalloc – same problem
• Source proves the issue again…
Can Intel libraries help?
• We try the Intel TBB multithreaded allocator – see the linking sketch below
• 14 minutes down to 10 minutes!
• .. But this code still has scope for more…
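The talk does not show how the TBB allocator was wired in; one common route is TBB's malloc proxy, which substitutes the scalable allocator for the global malloc/free and new/delete without changing any allocation call sites. A minimal sketch, assuming the proxy library is installed:

// tbb_alloc.cpp – build with: g++ -O2 tbb_alloc.cpp -ltbbmalloc_proxy
// (an unmodified binary can get the same substitution at run time with
//  LD_PRELOAD=libtbbmalloc_proxy.so)
#include <tbb/tbbmalloc_proxy.h>  // redirects malloc/free and new/delete to tbbmalloc

#include <cstdio>
#include <vector>

int main() {
    // Every allocation below now goes through the TBB scalable allocator,
    // which is designed to reduce cross-thread contention in malloc/free.
    std::vector<std::vector<int>> buffers(64, std::vector<int>(1024, 1));
    std::printf("allocated %zu buffers\n", buffers.size());
    return 0;
}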
Real optimization of OpenMP regions
• NB – still profiling the first 300 seconds only
• Significant inactivity in the final 60 seconds
• OpenMP region
– #pragma omp parallel for
• Is it working?
– No – the threads are idle
• Let's remove it
After the first fix…
• Now able to run to completion
– 358 seconds
• Still inactivity at end of run
Zoomed to the inactivity…
• Another OpenMP region
• Quick edit: comment out the OpenMP, again!
… and the impact
• Down to 304 seconds
Finally… something to sort out
• Recursive, in-place multithreaded sorter
• Not scaling well as thread counts grow
• Options?
– Re-engineer
– Replace
– Tune
Let’s tune
• Try limiting the thread pool to 8 workers – see the sketch below
– Better than 36 clashing threads?
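A minimal sketch of such a cap, assuming the sorter's pool is driven by OpenMP (the talk does not show the sorter's internals, so this is illustrative only):

// cap_threads.cpp – build with: g++ -fopenmp -O2 cap_threads.cpp
#include <omp.h>
#include <cstdio>

int main() {
    // Cap the pool at 8 workers instead of one per hardware thread; the same
    // effect can be had by setting OMP_NUM_THREADS=8 before the run.
    omp_set_num_threads(8);

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("workers in pool: %d\n", omp_get_num_threads());
    }
    return 0;
}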
Result…
• Runtime 4.7 minutes
• 3x improvement on
original
• #1 position on the
Broad Benchmark list
for a sub-$2 / hour
system!
Lessons learned
• Real codes exhibit many different performance patterns
– Profiling real data sets at real scales is vital to target the effort
– Small test cases do not expose all the problems
– Small thread counts can be too small to find real problems
• Changing code can be simple
– Use threads wisely – more threads will not always be faster
– Changing libraries – someone else might have fixed your problem
• Re-engineering is sometimes necessary
– Take advantage of vector units
– Take advantage of threads
Increase the performance of your software
Analyze and tune with Allinea Performance Reports
Develop, profile and debug applications with Allinea Forge
With professional support when you need it most
Read more!