Multithreaded Programming Quickstart
A Dr. Dobb’s Journal Vendor Perspectives NetSeminar
Sponsored by Intel
Tuesday, May 9, 2006, 9AM PT / 12PM ET
Multithreaded Programming Quickstart
Software & Solutions Group
Charles Congdon, Senior Software Engineer
May 9, 2006
Agenda
Motivation for Threading
Concepts in Parallelism
Implementing Parallelism
Existing Code
Common Threading Challenges
Summary
Intel® Integrated Performance Primitives, Intel® Math Kernel Library, Intel VTune™ Performance Analyzer, Intel® Threading Tools, Intel® Thread Profiler, Intel® Thread Checker, Intel® C++ Compiler, Intel® Fortran Compiler, Intel, and the Intel Logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.
Hardware Architecture
The trend toward multi-core mobile, desktop, and server processors is expected to continue into the foreseeable future, and software must be threaded to take full advantage of it.
Why Thread Your Application?
Increased responsiveness and worker productivity
• Increased application responsiveness when different tasks run in parallel
Improved performance in parallel environments
• When running computations on multiple processors
More computation per cubic foot of data center
• Web-based apps are often multi-threaded by nature
Performance + responsiveness makes it easier to add new features
Taking full advantage of Multi-Core hardware requires multi-threaded software
Agenda
Motivation for Threading
Concepts in Parallelism
Implementing Parallelism
Existing Code
Common Threading Challenges
Summary
Hardware and Software Threading
Hyper-Threading (HT) Technology
• Hardware technology to increase processor performance by improving CPU utilization
Dual and Multi-Core
• Hardware technology to increase processor performance by placing multiple CPU cores in a single processor package
Multi-threading
• Software technology to improve software functionality & increase software performance by utilizing multiple (logical) CPUs
• This is what we have traditionally seen in multitasking operating systems that run multiple applications and processes at once
What is Parallel Computing?
More than one thread of control
More than one processor
• Multiple Threads executing concurrently
• Coordinated work division
• Single problem
Shared Memory Parallelism
• Most common implementation
• Scheduling handled by the OS
• Sharing a single address space
• Requires a system w/ shared memory and multiple CPUs
Types of Parallel Computing
Instruction-Level Parallelism (ILP)
Data-Level Parallelism (MMX™ Technology; SSE, SSE2, and SSE3 instructions)
Thread-Level Parallelism (TLP)
Process-Level Parallelism (“batch queue”)
Multi-computer distributed computing
• Clusters
• Grids
• SETI@Home*
* Other brands and names may be claimed as the property of others.
Partitioning Methods
Functional Decomposition
• Task Parallelism: Each thread performs a unique job

Domain Decomposition
• Data Parallelism: Same operation applied to all data

[Figure: coupled Ocean Model, Surface Model, Hydro Model, and Atmosphere Model; grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, ERDC]
GOAL: Identify independent computations / primitive tasks
Most Code Contains Parallelism
Task parallelism: independent subprograms

```fortran
call fluxx(fv, fx)
call fluxy(fv, fy)
call fluxz(fv, fz)
```

Data parallelism: independent loop iterations

```c
for (y = 0; y < nLines; y++)
    genLine(model, im[y]);
```
Lock/Synchronization Object
Working definition:
• A programmatic construct that coordinates multithreaded access to shared global data
Or in less flashy terms:
• Something that allows the programmer to keep two threads from updating the same variable at once.
Granularity
Granularity of parallel work
• Finding the right-sized “chunks” of parallel work can be challenging
  – Too large can lead to load imbalance
  – Too small can lead to synchronization overhead
• Adjust dynamically based on data and system to help keep the balance right and reduce synchronization
Granularity of synchronization/locking
• Synchronization should cover as small a region as possible
  – Too large and execution becomes serial as other threads wait for the lock
• Synchronization should happen as infrequently as possible
  – Too often and synchronization overhead can dominate
Parallel Overhead
Synchronization Overhead
• Arises when multiple threads try to acquire the same lock at once
  – Minimize data sharing across threads
  – When sharing is necessary, keep critical sections as short as possible and outside of tight loops
Thread-Creation overhead
• Thread creation is very expensive and should be done infrequently
  – Use reusable threads and thread pools
False-sharing overhead
• Cache “ping-pong” when different threads access adjacent data
  – Have threads work on different sections of the problem
Intel® Thread Profiler and Intel® VTune can help you detect these issues.
[Diagram: Thread 1 and Thread 2 alternating points P0, P1, P2, … on the same cache line, vs. each thread taking a contiguous half P0…PN/2 and PN/2+1…PN-1]
Load Balancing
Give each thread equal-sized chunks of work
• For task parallelism, equal-sized tasks
• For data parallelism, equal splitting of the data
For task parallelism in particular:
• Can be data-dependent – may need to adjust dynamically
• One thread might get several tasks vs. one task
• Use Intel® VTune™ Performance Analyzer to help assess load
For Both:
• May need to use smaller chunks of work to load-balance better
  – Which can increase synchronization overhead…
Agenda
Motivation for Threading
Concepts in Parallelism
Implementing Parallelism
Existing Code
Common Threading Challenges
Summary
Start early with multicore for best results
Understand threading concepts, parallel software architectures and patterns
Learn about threading technologies like OpenMP*, Win32 threads, PThreads, etc.
Mentor your dev team with any SMP developers you have
Understand fundamental scale/coordination limiters in your code
Understand coordination overhead
Search for algorithms which are more parallel-friendly
Determine optimal thread count, set it dynamically
Avoid spin loops: sleep or use threading sync mechanisms
Don’t overlook growing datasets!

Key Resources:
• Intel® Software Network
• Intel® Software College
• Professional programming books, including Intel® Press
Repeatable Benchmarks Required
Measure CPU hotspots, I/O hotspots, and the degree of parallelism in your application before/during/after threading
• Windows Performance Monitor*
• Linux vmstat*, sar*, mpstat*, iostat*
• Intel® VTune™ Performance Analyzer
• Intel® Thread Profiler
Use your knowledge of the algorithm to identify opportunities for parallelism
• Decompose processing into compute threads
• Consider partitioning methods
Verify with tools
• Tools will identify dependencies you overlooked
• Tools will help identify regions for greatest ROI
• Tools will improve your productivity
Developer knowledge of the algorithms is important!
Candidate Areas for Threading
Loops in hotspot code
• Each iteration needs to be independent
• Iterations with dependencies may be candidates for pipelining
Hotspot function that contains unrelated tasks with no data dependencies
• Each of these tasks could be placed on a separate thread
Sub-tree of application call-graph profiling
• Use Intel® VTune™ call-graph functionality to understand execution flow
Frequently-executed repetitive tasks
• Each iteration of task must use different data
• Use performance analysis to determine if these happen often enough to justify effort
Options for Adding Parallelism
Explicitly Thread your program using Win32*/POSIX* threading APIs
Use a Compiler to automatically parallelize code
Use a Programming Language API (C#*, Java*, etc.)
Programming Language Extension (OpenMP*)
• Use OpenMP* directives to tell the compiler how to decompose parts of a serial program for parallel execution

Use an internally-threaded runtime library for common tasks
• Intel® Integrated Performance Primitives (Intel® IPP) and Intel® Math Kernel Library (Intel® MKL)
• Parallel memory managers like MicroQuill SmartHeap* and Hoard*
These options are not mutually exclusive: mix and match as needed
Threading with OpenMP*
• About OpenMP*
  – OpenMP is a directive-based set of language extensions to C, C++, and Fortran
    • Requires an OpenMP*-enabled compiler
  – Easily parallelizes independent countable loops (Fortran DO or restricted C for)
  – Coarser-grained parallelism possible via worksharing directives
  – Advanced features include API functions to get thread information and locks, plus some subtle directives and clauses
• You can use OpenMP* and Intel® Threading Tools to prototype threaded algorithms very quickly.

Regardless of how you ultimately implement your threaded application, OpenMP* provides a quick way to get started.
Parallel region
• A parallel region is the basic concept of OpenMP*
• After a PARALLEL directive, every thread is executing the same region (master thread plus slave threads)
• At the end of the parallel region, slave threads (conceptually) disappear, leaving only the master thread
• Nested parallelism complicates matters
• Makes it possible to add parallelism incrementally
```c
// Single thread of execution (the master thread)
#pragma omp parallel
{
    // Parallel: multiple threads of execution
} // End parallel: back to a single thread
```
Example – OpenMP* Threads
```c
// Divide work of the outer loop between all processors on the system
#pragma omp parallel for private(x, y)
for (x = 0; x < width; x++)
    for (y = 0; y < height; y++)
        C(x, y) = F(A(x, y), B(x, y));
do_something_else();
```
Number of threads used by OpenMP* is determined at initialization time (number of processors).
If disabled, code looks and runs like single threaded code
Development Cycle
Analysis – verify timings, verify dependencies
• Intel® VTune™ Performance Analyzer

Design (introduce threads) – use a threaded library
• e.g. Intel® Performance Libraries: IPP and MKL
• OpenMP* (Intel® Compiler)
• Explicit threading (Win32*, Pthreads*)

Analyze for correctness
• Intel® Thread Checker
• Intel® Debugger

Tune performance
• Intel® Thread Profiler
• Intel® VTune™ Performance Analyzer
Agenda
Motivation for Threading
Concepts in Parallelism
Implementing Parallelism
Existing Code
Common Threading Challenges
Summary
Starting Small with Existing Code
If full-blown threading looks too hard…
• Look at the code where you spend the most time - hotspots
• Identify code regions that would benefit from parallelism
• Try to use the Intel® Compiler to parallelize tight inner loops
  – /Qparallel* and /Qx* options
  – OpenMP* directives and the /Qopenmp* option
• Use OpenMP*, Intel® Thread Checker, and Intel® Thread Profiler to prototype possible threading implementations
  – Once you have a good algorithm, you can rewrite in a native threading API like Win32* or Pthreads* as desired.
Starting Small with Existing Code (continued)
If full-blown threading looks too hard (continued)
• Replace calls to large common functions with calls to internally parallel libraries such as:
  – Intel® Integrated Performance Primitives
  – Intel® Math Kernel Library
• Make your libraries thread-safe in anticipation of their being called by threaded code
  – Code a simple multi-threaded test harness with OpenMP*
• Consider “functional” (task) parallelism
  – Try to separate computation from unrelated tasks (such as the GUI, printing, etc.)
Agenda
Motivation for Threading
Concepts in Parallelism
Implementing Parallelism
Existing Code
Common Threading Challenges
Summary
Challenges to Implementing Parallelism
Correctness
• Shared resource identification
• Threading the right code correctly
• Difficult to debug
  – Data races
  – Deadlocks

Performance
• Program decomposition (functional/task vs. data)
• Overhead (thread management and synchronization)
• Resource utilization (load balancing)
  – Adequate memory and I/O bandwidth
  – Task priorities
Problem: Program Gives Incorrect Results
Some possible explanations:
• Race condition or storage conflicts
  – More than one thread accesses memory without synchronization
  – Locking used, but is too local to be effective
• Other components (such as 3rd-party APIs) may not be “thread safe” in certain use cases
Debug via:
• Intel® Thread Checker
• A tool like Rational Purify* or Compuware BoundsChecker*
Data Race Example
Suppose you have global variables a = 1, b = 2:

Thread 1: x = a + b
Thread 2: b = 42

The end result differs depending on ordering:
• Thread 1 runs before Thread 2: x = 3
• Thread 2 runs before Thread 1: x = 43

Execution order is not guaranteed unless updates to “b” and its use are protected by a synchronization mechanism.
Problem: Threaded Program Hangs
Possible explanations:
• Deadlocks
  – Actual (A waits on B, which is waiting on A)
  – Potential (will happen under the “right” conditions)
• Thread stalls and waits
  – Dangling locks (a thread exits holding a lock requested by another thread)
    • Thread exit does not automatically release held locks
    • A lock must be released from the same thread that acquired (entered) it

Debug using:
• Intel® Thread Checker
• A standard debugger
• Print statements (be sure to identify the thread)
Deadlock Example
```c
ThreadFunc1() {
    lock(A);
    globalX++;
    lock(B);     /* suppose Thread 1 reached here first, holding A... */
    globalY++;
    unlock(B);
    unlock(A);
}

ThreadFunc2() {
    lock(B);
    globalY++;
    lock(A);     /* ...then Thread 2 hangs here holding B.
                    Deadlock: both threads are now frozen. */
    globalX++;
    unlock(A);
    unlock(B);
}
```
To fix: both functions must acquire and release locks in the same order
Problem: Poor Threaded Performance

The program runs more slowly on a multi-processor machine after it was threaded.
Common issues:
• Locking granularity
  – Too small: parallel overhead dominates
  – Too large: not enough parallel work, or the last grain runs alone
• Synchronization
  – Excessive use of shared data
  – Contention for the same synchronization object
• Load balance
  – Improper distribution of parallel work

Diagnose the problem with:
• System performance monitors
• Intel® Thread Profiler
• Intel® VTune™ Performance Analyzer
Problem: Poor Scaling
The application doesn’t run much faster on a multi-processor system after it is threaded
Common issues:
• Large periods of serial execution may dominate
• Your threads may be contending, imbalanced, or starved
• The hottest code may not have been threaded
• May have exceeded the memory bandwidth of the machine

To diagnose, use:
• System performance monitors
• Intel® Thread Profiler
• Intel® VTune™ Performance Analyzer
• Schedule periodic scaling studies on 1, 2, 4, … N processor systems (N = twice the most common customer configuration)
Poor Parallel Performance Example

Slowdowns:
• Contention over “Lock L” stalls thread T3 between events E4 and E5.
• The runtime of thread T3 is lengthened by “Event E.”

To fix:
• Try not to hold “Lock L” as long.
• Question whether the resources/code protected by “Lock L” need to be locked.
• Remove the processing of “Event E” from the critical path of execution.

[Timeline diagram: threads T1–T3 across events E0–E12, showing T3 waiting for Lock L until it is released, T1 waiting to join T2 and T3, and a wait for external Event E]
Problem: Still Poor Scaling
Scaling may still be poor even after algorithmic issues are solved – consider architecture issues.

One possible cause – false sharing:
• Arises when two processors are working on adjacent data elements that fall in the same cache line
• Data must bounce back and forth between processors as each tries to read and write the data it is working on

To diagnose:
• Use the Intel® VTune™ Performance Analyzer and performance-monitoring events to locate most of your false-sharing problems
  – Look for “Memory Order Machine Clears”
• Solve the problem by:
  – Changing data placement – e.g. adding padding data
  – Altering the patterns by which threads access data – don’t work on the same cache line (see backup slides)
[Diagram: Thread 1 and Thread 2 alternating adjacent points P0, P1, P2, … on the same cache line]
Agenda
Motivation for Threading
Concepts in Parallelism
Implementing Parallelism
Existing Code
Common Threading Challenges
Summary
Summary
Multi-core and multi-processor systems are the new standard
Concurrent compute threads are required to take full advantage of multi-processor/multi-core systems
Properly threading your code is challenging
• But you don’t have to do it all at once
• Focus on areas that will have the most impact on overall application performance
• Monitor scaling on systems with more processors than the current common customer configuration
A thorough understanding of actual application behavior under typical loads is necessary before you consider threading
Intel continues to invest in software to ease the transition to threaded code
Q&A

Please submit your questions now.
Resources

Intel Performance: http://www.ddj.com/intel/26.htm
Intel® Software Network: http://www.ddj.com/intel/27.htm
Intel Developer Center – Threading: http://www.ddj.com/intel/28.htm
Intel Multi-Core Processing: http://www.ddj.com/intel/29.htm
Intel® Developer Solutions Catalog: http://www.ddj.com/intel/30.htm
Backup
But Don’t Take MY Word for It
“We are optimizing RenderMan’s core to be very scalable for future multi-core architectures. Intel’s Threading Tools have accelerated our development cycle dramatically. Intel’s Thread Checker for example, helped identify potential threading issues very quickly, in days compared to weeks if done otherwise. Thread Profiler, on the other hand, has helped us understand threading performance problems so we could fix them to improve scalability. The Intel Threading Tools are now an integral part of our development process.”
– Dana Batali, Director of RenderMan Development, Pixar
But Don’t Take MY Word for It
“We found Intel ThreadChecker to be an indispensable aid for analyzing threaded code. We were impressed at how well it handled an application as large and complex as Maya. Based on this experience I plan to use this tool on future threading projects. Intel ThreadProfiler was very useful for analyzing bottlenecks in our threaded code. ThreadProfiler . . . showed us the reasons for the slowdown, so we were able to restructure the code for better threaded performance.”
– Martin Watt, Software Architect, Alias
But Don’t Take MY Word for It
"We used Intel Threading Tools including Intel Thread Profiler to realize improved threaded application performance of Omni Page 15 running on Intel multi-core platforms. We look forward to using Intel Thread Profiler with its critical path analysis and selective magnification of important time regions on future thread optimization projects."
– Gyorgy Varszegi, Scansoft
Thread Development: Lessons Learned
Find thread balance
• Pick the “right” level of function/data granularity to balance threads, then test it
• Decide if you can/should adjust thread balance at runtime

Remove threading bugs
• Don’t mix Intel and Microsoft compilers when using OpenMP (with a few exceptions)
• Race conditions & deadlocks
  – Pre-threaded code (libraries) can reduce these
  – Intel® Thread Checker flags these bugs

Remove thread coordination issues
• Test algorithm scale
• Minimize real and false data sharing, share safely, and check that you got it right
• Intel® Thread Profiler helps find these
Amdahl’s Law
• Describes the upper bound of parallel speedup (scaling)
• Helps you think about the effects of overhead
• Examples:
  – If only half of a process can take advantage of parallelism, the maximum possible scalability is two – assuming an infinite number of processors and perfect efficiency.
  – If only two processors are available, the maximum possible speedup is 1.33, assuming perfect efficiency.
Load Imbalance
Unequal work loads lead to idle threads and wasted time
[Diagram: per-thread busy/idle time bars over time]
Granularity

[Diagram: serial and parallelizable portions at coarse vs. fine granularity, with observed scalings of ~3X, ~2.5X, ~1.10X, and ~1.05X]
Intel® Software Development Products
Intel® VTune™ Performance Analyzer
• Identify “hot spots” of code that may benefit from threading
• Shows callgraph to help identify threading candidates
Intel® Threading Tools
• Locate thread performance bottlenecks
• Estimate achievable/available performance
• Quickly validate designs and create prototypes
• Locate positions of data race conditions (read/write, write/read, write/write)
  – Isolate deadlock
  – Identify inappropriate API arguments
Intel® Thread Checker Interface
Intel® Thread Profiler Interface
Shows over-time application behavior, impact of each synchronization object, concurrency, etc.
Threading with OpenMP*
For performance-hungry applications – parallel threads use all cores
Use OpenMP* compiler for automated thread creation
[Comparison chart: MRTE threads, C/C++ threads, and C/C++ OpenMP* rated on portability (“mostly”), scalability (“not always”), threading for latency hiding, performance orientation (“work for it”), incremental parallelism, high-level abstraction, keeping serial code intact, and verifiable correctness]
Driving Parallel Computing: Clustering, Grids, OpenMP*...

• OpenIB.org Industry Alliance – Intel, Dell, Mellanox, Voltaire, Topspin, Oracle, Infinicon…
  – Common set of InfiniBand drivers in the Linux 2.6 kernel (kernel.org)
• Parallelization tools
  – Intel® MPI libraries for Ethernet, InfiniBand…
  – Cluster OpenMP*: demonstrated scaling to 100 nodes
  – Interconnect software: OpenIB and DET
• Advanced computing solutions
  – Network storage solutions for speeds over 10Gb/s interconnects with Remote Direct Memory Access (RDMA)
• End-user engagements on advanced computing requirements
  – Cluster, grid access
  – Engineering engagements
Intel is investing in tools, labs, architecture, standards
Example: Non-Threaded Application
```csharp
public class Report {
    public void Compute(int x) {
        x = x * x; // ... long task
    }
}

static void Main() {
    Report r = new Report();
    r.Compute(5);
}
```
Example: Worker Thread
```csharp
public class Report {
    MethodInvoker mi;
    int x;

    public void Compute(int x) {
        this.x = x;
        mi = new MethodInvoker(AsyncCompute);
        mi.BeginInvoke(null, null);
    }

    void AsyncCompute() {
        x = x * x; // ... long task
    }
}

static void Main() {
    Report r = new Report();
    r.Compute(5);
}
```
Example: Thread Delegate

```csharp
public class Report {
    public delegate void OnDone(int x);

    MethodInvoker mi;
    int x;
    OnDone ondone;

    public void Compute(int x, OnDone ondone) {
        this.x = x;
        this.ondone = ondone;
        mi = new MethodInvoker(AsyncCompute);
        mi.BeginInvoke(null, null);
    }

    void AsyncCompute() {
        x = x * x; // ... long task
        ondone(x);
    }
}
```
Example: Delegate Invocation in GUI

```csharp
void Output(int x) {
    if (InvokeRequired)
        BeginInvoke(new Report.OnDone(OutputImpl), new object[] { x });
    else
        OutputImpl(x);
}

void OutputImpl(int x) {
    Console.Write(x);
}
```
Control.InvokeRequired property: Gets a value indicating whether the caller must call an invoke method when making method calls to the control, because the caller is on a different thread than the one the control was created on.

Control.BeginInvoke method: Executes the specified delegate asynchronously with the specified arguments, on the thread that the control's underlying handle was created on.
Example: Abort Task on Worker Thread
```csharp
public class Report {
    MethodInvoker mi;
    IAsyncResult ar;
    int x;
    bool bAbort;

    public void Compute(int x) {
        this.x = x;
        bAbort = false;
        mi = new MethodInvoker(AsyncCompute);
        ar = mi.BeginInvoke(null, null);
    }

    public void CancelComputing() {
        if (mi != null) {
            bAbort = true;
            mi.EndInvoke(ar);
            mi = null;
        }
    }

    void AsyncCompute() {
        if (bAbort) return;
        x = x * x; // ... long task
    }
}
```
Example: Locking a Unique Resource
```csharp
public class Report {
    public void Compute(int[] xs) {
        foreach (int x in xs)
            ThreadPool.QueueUserWorkItem(new WaitCallback(ComputeOne), x);
    }

    void ComputeOne(object param) {
        int x = (int)param;
        lock (sqlConn) {
            func(x); // allow only one thread to run here
        }
    }

    void func(int x) {
        // ... long task using a unique resource,
        // e.g. SQL connection sqlConn
    }
}
```
False Sharing
False sharing can occur when two threads access distinct or independent data that fall into the same cache line.

When two processor cores share a single memory system and have caches:
• A given piece of data may be in memory and in both processor cores’ caches
• All copies of one cache line must have the same data
• Reading stale data yields incorrect results, so…
• Processor cores make sure no reads occur to a cache line after someone else writes the line (MESI snooping protocol) – a bottleneck
• Frequently this kind of line sharing is unintentional and can be avoided

False sharing hurts performance on multi-processor and multi-core systems, as well as systems with Hyper-Threading Technology.
False Sharing: Example

Two threads divide the work by alternating 3-component vertices. The two threads then update contiguous vertices, which fall on the same cache line (the common case): each vertex is 12 bytes, so vertices Pk and Pk+1 (Xk Yk Zk Xk+1 Yk+1 Zk+1) share one 64-byte cache line.

[Diagram: Thread 1 and Thread 2 alternating points P0, P1, P2, … P9 across shared 64-byte cache lines]

False Sharing: Example Fix

Let each thread handle half of the vertices by dividing the data into equal (or near-equal) halves.

[Diagram: Thread 1 takes P0…PN/2, Thread 2 takes PN/2+1…PN-1]
Some Books and Training Resources
Training
• Intel® Software College
• Programs from other vendors

Unix reference documentation
• “Unix Systems Programming: Communication, Concurrency and Threads, Second Edition” by Kay Robbins, Steve Robbins
• “Threads Primer: A Guide to Multithreaded Programming” by Bil Lewis, Daniel J. Berg
• “Programming With Threads” by Steve Kleiman, Devang Shah, Bart Smaalders
• “Advanced Programming in the UNIX(R) Environment” by W. Richard Stevens

Windows reference documentation
• “Multithreading Applications in Win32: The Complete Guide to Threads” by Jim Beveridge and Robert Wiener
• “Advanced Windows” by Jeffrey Richter
• “Debugging Applications for Microsoft .NET and Microsoft Windows” by John Robbins
Some Online Resources
Intel Developer Services Threading Center
Intel® Thread Tools web site
Threading Methodology: Principles and Practices
Developing Multithreaded Applications: A Platform Consistent Approach
Multiple Approaches to Multithreaded Applications
Advanced Multi-Threaded Programming
Multithreading for Experts - inside a parallel application
Techniques to Improve Performance of Multithreaded Applications
Improve Performance with Thread Aware Memory Allocators
Common Concurrent Programming Errors (Linux Magazine*, March 2002)
Multithreaded Programming with OpenMP*
Advanced OpenMP* programming
Prototyping with OpenMP*