Multithreaded Programming Quickstart
A Dr. Dobb’s Journal Vendor Perspectives NetSeminar
Sponsored by Intel
Tuesday, May 9, 2006, 9AM PT / 12PM ET
Multithreaded Programming Quickstart
Software & Solutions Group
Charles Congdon, Senior Software Engineer
May 9, 2006
Agenda
Motivation for Threading
Concepts in Parallelism
Implementing Parallelism
Existing Code
Common Threading Challenges
Summary
Intel® Integrated Performance Primitives, Intel® Math Kernel Library, Intel VTune™ Performance Analyzer, Intel® Threading Tools, Intel® Thread Profiler, Intel® Thread Checker, Intel® C++ Compiler, Intel® Fortran Compiler, Intel, and the Intel Logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.
Hardware Architecture
The trend toward multi-core mobile, desktop, and server processors is expected to continue into the foreseeable future, and software must be threaded to take full advantage of it.
Why Thread Your Application?
Increased responsiveness and worker productivity
• Increased application responsiveness when different tasks run in parallel
Improved performance in parallel environments
• When running computations on multiple processors
More computation per cubic foot of data center
• Web-based apps are often multi-threaded by nature
Performance + responsiveness makes it easier to add new features
Taking full advantage of Multi-Core hardware requires multi-threaded software
Agenda
Motivation for Threading
Concepts in Parallelism
Implementing Parallelism
Existing Code
Common Threading Challenges
Summary
Hardware and Software Threading
Hyper-Threading (HT) Technology
• Hardware technology to increase processor performance by improving CPU utilization
Dual and Multi-Core
• Hardware technology to increase processor performance by placing multiple CPU cores in a single processor package
Multi-threading
• Software technology to improve software functionality & increase software performance by utilizing multiple (logical) CPUs
• This is what we have traditionally seen in multitasking operating systems that run multiple applications and processes at once
What is Parallel Computing?
More than one thread of control
More than one processor
• Multiple Threads executing concurrently
• Coordinated work division
• Single problem
Shared Memory Parallelism
• Most common implementation
• Scheduling handled by the OS
• Sharing a single address space
• Requires a system w/ shared memory and multiple CPUs
Types of Parallel Computing
Instruction-Level Parallelism (ILP)
Data-Level Parallelism (MMX™ Technology; SSE, SSE2, and SSE3 instructions)
Thread-Level Parallelism (TLP)
Process-Level Parallelism (“batch queue”)
Multi-computer distributed computing
• Clusters
• Grids
• SETI@Home*
* Other brands and names may be claimed as the property of others.
Partitioning Methods
Functional Decomposition
• Task Parallelism: Each thread performs a unique job

Domain Decomposition
• Data Parallelism: Same operation applied to all data

[Figure: coupled Ocean Model, Surface Model, Hydro Model, and Atmosphere Model; grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, ERDC]
GOAL: Identify independent computations / primitive tasks
Most Code Contains Parallelism
Task parallelism: independent subprograms

```fortran
call fluxx(fv, fx)
call fluxy(fv, fy)
call fluxz(fv, fz)
```

Data parallelism: independent loop iterations

```c
for (y = 0; y < nLines; y++)
    genLine(model, im[y]);
```
Lock/Synchronization Object
Working definition:
• A programmatic construct that coordinates multithreaded access to shared global data
Or in less flashy terms:
• Something that allows the programmer to keep two threads from updating the same variable at once.
Granularity
Granularity of parallel work
• Finding the right-sized “chunks” of parallel work can be challenging
  – Too large can lead to load imbalance
  – Too small can lead to synchronization overhead
• Adjust dynamically based on data and system to help keep the balance right and reduce synchronization
Granularity of synchronization/locking
• Synchronization should cover as small a region as possible
  – Too large and execution becomes serial as other threads wait for the lock
• Synchronization should happen as infrequently as possible
  – Too often and synchronization overhead can dominate
Parallel Overhead
Synchronization Overhead
• Arises when multiple threads try to acquire the same lock at once
  – Minimize data sharing across threads
  – When sharing is necessary, keep critical sections as short as possible and outside of tight loops
Thread-Creation overhead
• Thread creation is very expensive and should be done infrequently
  – Use reusable threads and thread pools
False-sharing overhead
• Cache “ping-pong” when different threads access adjacent data
  – Have threads work on different sections of the problem
Intel® Thread Profiler and Intel® VTune can help you detect these issues.
[Diagram: Thread 1 and Thread 2 alternating points P0, P1, P2, … on the same cache line, vs. each thread taking a contiguous half P0…PN/2 and PN/2+1…PN-1]
Load Balancing
Give each thread equal-sized chunks of work
• For task parallelism, equal-sized tasks
• For data parallelism, equal splitting of the data
For task parallelism in particular:
• Can be data-dependent – may need to adjust dynamically
• One thread might get several tasks vs. one task
• Use Intel® VTune™ Performance Analyzer to help assess load
For Both:
• May need to use smaller chunks of work to load-balance better
  – Which can increase synchronization overhead…
Agenda
Motivation for Threading
Concepts in Parallelism
Implementing Parallelism
Existing Code
Common Threading Challenges
Summary
Start early with multicore for best results
Understand threading concepts, parallel software architectures and patterns
Learn about threading technologies like OpenMP*, Win32 threads, PThreads, etc.
Mentor your dev team with any SMP developers you have
Understand fundamental scale/coordination limiters in your code
Understand coordination overhead
Search for algorithms which are more parallel-friendly
Determine optimal thread count, set it dynamically
Avoid spin loops: sleep or use threading sync mechanisms
Don’t overlook growing datasets!

Key Resources:
• Intel® Software Network
• Intel® Software College
• Professional programming books, including Intel® Press
Repeatable Benchmarks Required
Measure CPU hotspots, I/O hotspots, and the degree of parallelism in your application before/during/after threading
• Windows Performance Monitor*
• Linux vmstat*, sar*, mpstat*, iostat*
• Intel® VTune™ Performance Analyzer
• Intel® Thread Profiler
Use your knowledge of the algorithm to identify opportunities for parallelism
• Decompose processing into compute threads
• Consider partitioning methods
Verify with tools
• Tools will identify dependencies you overlooked
• Tools will help identify regions for greatest ROI
• Tools will improve your productivity
Developer knowledge of the algorithms is important!
Candidate Areas for Threading
Loops in hotspot code
• Each iteration needs to be independent
• Iterations with dependencies may be candidates for pipelining
Hotspot function that contains unrelated tasks with no data dependencies
• Each of these tasks could be placed on a separate thread
Sub-tree of application call-graph profiling
• Use Intel® VTune™ call-graph functionality to understand execution flow
Frequently-executed repetitive tasks
• Each iteration of task must use different data
• Use performance analysis to determine if these happen often enough to justify effort
Options for Adding Parallelism
Explicitly Thread your program using Win32*/POSIX* threading APIs
Use a Compiler to automatically parallelize code
Use a Programming Language API (C#*, Java*, etc.)
Programming Language Extension (OpenMP*)
• Use OpenMP* directives to tell the compiler how to decompose parts of a serial program for parallel execution

Use an internally-threaded runtime library for common tasks
• Intel® Integrated Performance Primitives (Intel® IPP) and Intel® Math Kernel Library (Intel® MKL)
• Parallel memory managers like MicroQuill SmartHeap* and Hoard*
These options are not mutually exclusive: mix and match as needed
Threading with OpenMP*
• About OpenMP*
  – OpenMP is a directive-based set of language extensions to C, C++, and Fortran
    • Requires an OpenMP*-enabled compiler
  – Easily parallelizes independent countable loops (Fortran DO or restricted C for)
  – Coarser-grained parallelism possible via worksharing directives
  – Advanced features include API functions to get thread information and locks, plus some subtle directives and clauses
• You can use OpenMP* and Intel® Threading Tools to prototype threaded algorithms very quickly.

Regardless of how you ultimately implement your threaded application, OpenMP* provides a quick way to get started.
Parallel region
• A parallel region is the basic concept of OpenMP*
• After a PARALLEL directive, every thread is executing the same region (master thread plus slave threads)
• At the end of the parallel region, slave threads (conceptually) disappear, leaving only the master thread
• Nested parallelism complicates matters
• Makes it possible to add parallelism incrementally
```c
// Single thread of execution (the master thread)
#pragma omp parallel
{
    // Parallel: multiple threads of execution
} // End parallel: back to a single thread
```
Example – OpenMP* Threads
```c
// Divide work of the outer loop between all processors on the system
#pragma omp parallel for private(x, y)
for (x = 0; x < width; x++)
    for (y = 0; y < height; y++)
        C(x, y) = F(A(x, y), B(x, y));
do_something_else();
```
Number of threads used by OpenMP* is determined at initialization time (number of processors).
If disabled, code looks and runs like single threaded code
Development Cycle
Analysis – verify timings, verify dependencies
• Intel® VTune™ Performance Analyzer

Design (introduce threads) – use a threaded library
• e.g. Intel® Performance Libraries: IPP and MKL
• OpenMP* (Intel® Compiler)
• Explicit threading (Win32*, Pthreads*)

Analyze for correctness
• Intel® Thread Checker
• Intel® Debugger

Tune performance
• Intel® Thread Profiler
• Intel® VTune™ Performance Analyzer
Agenda
Motivation for Threading
Concepts in Parallelism
Implementing Parallelism
Existing Code
Common Threading Challenges
Summary
Starting Small with Existing Code
If full-blown threading looks too hard…
• Look at the code where you spend the most time - hotspots
• Identify code regions that would benefit from parallelism
• Try to use the Intel® Compiler to parallelize tight inner loops
  – /Qparallel* and /Qx* options
  – OpenMP* directives and the /Qopenmp* option
• Use OpenMP*, Intel® Thread Checker, and Intel® Thread Profiler to prototype possible threading implementations
  – Once you have a good algorithm, you can rewrite in a native threading API like Win32* or Pthreads* as desired.
Starting Small with Existing Code (continued)
If full-blown threading looks too hard (continued)
• Replace calls to large common functions with calls to internally parallel libraries such as:
  – Intel® Integrated Performance Primitives
  – Intel® Math Kernel Library
• Make your libraries thread-safe in anticipation of their being called by threaded code
  – Code a simple multi-threaded test harness with OpenMP*
• Consider “functional” (task) parallelism
  – Try to separate computation from unrelated tasks (such as the GUI, printing, etc.)
Agenda
Motivation for Threading
Concepts in Parallelism
Implementing Parallelism
Existing Code
Common Threading Challenges
Summary
Challenges to Implementing Parallelism
Correctness
• Shared resource identification
• Threading the right code correctly
• Difficult to debug
  – Data races
  – Deadlocks

Performance
• Program decomposition (functional/task vs. data)
• Overhead (thread management and synchronization)
• Resource utilization (load balancing)
  – Adequate memory and I/O bandwidth
  – Task priorities
Problem: Program Gives Incorrect Results
Some possible explanations:
• Race condition or storage conflicts
  – More than one thread accesses memory without synchronization
  – Locking used, but is too local to be effective
• Other components (such as 3rd-party APIs) may not be “thread safe” in certain use cases
Debug via:
• Intel® Thread Checker
• A tool like Rational Purify* or Compuware BoundsChecker*
Data Race Example
Suppose you have global variables a = 1, b = 2:

Thread 1: x = a + b
Thread 2: b = 42

The end result differs depending on ordering:
• Thread 1 runs before Thread 2: x = 3
• Thread 2 runs before Thread 1: x = 43

Execution order is not guaranteed unless updates to “b” and its use are protected by a synchronization mechanism.
Problem: Threaded Program Hangs
Possible explanations:
• Deadlocks
  – Actual (A waits on B, which is waiting on A)
  – Potential (will happen under the “right” conditions)
• Thread stalls and waits
  – Dangling locks (a thread exits holding a lock requested by another thread)
    • Thread exit does not automatically release held locks
    • A lock must be released from the same thread that acquired (entered) it

Debug using:
• Intel® Thread Checker
• A standard debugger
• Print statements (be sure to identify the thread)
Deadlock Example
```c
ThreadFunc1() {
    lock(A);
    globalX++;
    lock(B);     /* suppose Thread 1 reached here first, holding A... */
    globalY++;
    unlock(B);
    unlock(A);
}

ThreadFunc2() {
    lock(B);
    globalY++;
    lock(A);     /* ...then Thread 2 hangs here holding B.
                    Deadlock: both threads are now frozen. */
    globalX++;
    unlock(A);
    unlock(B);
}
```
To fix: both functions must acquire and release locks in the same order
Problem: Poor Threaded Performance

The program runs more slowly on a multi-processor machine after it was threaded.
Common issues:
• Locking granularity
  – Too small: parallel overhead dominates
  – Too large: not enough parallel work, or the last grain runs alone
• Synchronization
  – Excessive use of shared data
  – Contention for the same synchronization object
• Load balance
  – Improper distribution of parallel work

Diagnose the problem with:
• System performance monitors
• Intel® Thread Profiler
• Intel® VTune™ Performance Analyzer
Problem: Poor Scaling
The application doesn’t run much faster on a multi-processor system after it is threaded
Common issues:
• Large periods of serial execution may dominate
• Your threads may be contending, imbalanced, or starved
• The hottest code may not have been threaded
• May have exceeded the memory bandwidth of the machine

To diagnose, use:
• System performance monitors
• Intel® Thread Profiler
• Intel® VTune™ Performance Analyzer
• Schedule periodic scaling studies on 1, 2, 4, … N processor systems (N = twice the most common customer configuration)
Poor Parallel Performance Example

Slowdowns:
• Contention over “Lock L” stalls thread T3 between events E4 and E5.
• The runtime of thread T3 is lengthened by “Event E.”

To fix:
• Try not to hold “Lock L” as long.
• Question whether the resources/code protected by “Lock L” need to be locked.
• Remove the processing of “Event E” from the critical path of execution.

[Timeline diagram: threads T1–T3 across events E0–E12, showing T3 waiting for Lock L until it is released, T1 waiting to join T2 and T3, and a wait for external Event E]
Problem: Still Poor Scaling
Scaling may still be poor even after algorithmic issues are solved – consider architecture issues.

One possible cause – false sharing:
• Arises when two processors are working on adjacent data elements that fall in the same cache line
• Data must bounce back and forth between processors as each tries to read and write the data it is working on

To diagnose:
• Use the Intel® VTune™ Performance Analyzer and performance-monitoring events to locate most of your false-sharing problems
  – Look for “Memory Order Machine Clears”
• Solve the problem by:
  – Changing data placement – e.g. adding padding data
  – Altering the patterns by which threads access data – don’t work on the same cache line (see backup slides)
[Diagram: Thread 1 and Thread 2 alternating adjacent points P0, P1, P2, … on the same cache line]
Agenda
Motivation for Threading
Concepts in Parallelism
Implementing Parallelism
Existing Code
Common Threading Challenges
Summary
Summary
Multi-core and multi-processor systems are the new standard
Concurrent compute threads are required to take full advantage of multi-processor/multi-core systems
Properly threading your code is challenging
• But you don’t have to do it all at once
• Focus on areas that will have the most impact on overall application performance
• Monitor scaling on systems with more processors than the current common customer configuration
A thorough understanding of actual application behavior under typical loads is necessary before you consider threading
Intel continues to invest in software to ease the transition to threaded code
Q&A

Please submit your questions now.
Resources

Intel Performance: http://www.ddj.com/intel/26.htm
Intel® Software Network: http://www.ddj.com/intel/27.htm
Intel Developer Center – Threading: http://www.ddj.com/intel/28.htm
Intel Multi-Core Processing: http://www.ddj.com/intel/29.htm
Intel® Developer Solutions Catalog: http://www.ddj.com/intel/30.htm
Backup
But Don’t Take MY Word for It
“We are optimizing RenderMan’s core to be very scalable for future multi-core architectures. Intel’s Threading Tools have accelerated our development cycle dramatically. Intel’s Thread Checker for example, helped identify potential threading issues very quickly, in days compared to weeks if done otherwise. Thread Profiler, on the other hand, has helped us understand threading performance problems so we could fix them to improve scalability. The Intel Threading Tools are now an integral part of our development process.”
– Dana Batali, Director of RenderMan Development, Pixar
But Don’t Take MY Word for It
“We found Intel ThreadChecker to be an indispensable aid for analyzing threaded code. We were impressed at how well it handled an application as large and complex as Maya. Based on this experience I plan to use this tool on future threading projects. Intel ThreadProfiler was very useful for analyzing bottlenecks in our threaded code. ThreadProfiler . . . showed us the reasons for the slowdown, so we were able to restructure the code for better threaded performance.”
– Martin Watt, Software Architect, Alias
But Don’t Take MY Word for It
"We used Intel Threading Tools including Intel Thread Profiler to realize improved threaded application performance of Omni Page 15 running on Intel multi-core platforms. We look forward to using Intel Thread Profiler with its critical path analysis and selective magnification of important time regions on future thread optimization projects."
– Gyorgy Varszegi, Scansoft
Thread Development: Lessons Learned
Find thread balance
• Pick the “right” level of function/data granularity to balance threads, then test it
• Decide if you can/should adjust thread balance at runtime

Remove threading bugs
• Don’t mix Intel and Microsoft compilers when using OpenMP (with a few exceptions)
• Race conditions & deadlocks
  – Pre-threaded code (libraries) can reduce these
  – Intel® Thread Checker flags these bugs

Remove thread coordination issues
• Test algorithm scale
• Minimize real and false data sharing, share safely, and check that you got it right
• Intel® Thread Profiler helps find these
Amdahl’s Law
• Describes the upper bound of parallel speedup (scaling)
• Helps you think about the effects of overhead
• Examples:
  – If only half of a process can take advantage of parallelism, the maximum possible scalability is two – assuming an infinite number of processors and perfect efficiency.
  – If only two processors are available, the maximum possible speedup is 1.33, assuming perfect efficiency.
Load Imbalance
Unequal work loads lead to idle threads and wasted time
[Diagram: per-thread busy/idle time bars over time]
Granularity

[Diagram: serial and parallelizable portions at coarse vs. fine granularity, with observed scalings of ~3X, ~2.5X, ~1.10X, and ~1.05X]
Intel® Software Development Products
Intel® VTune™ Performance Analyzer
• Identify “hot spots” of code that may benefit from threading
• Shows callgraph to help identify threading candidates
Intel® Threading Tools
• Locate thread performance bottlenecks
• Estimate achievable/available performance
• Quickly validate designs and create prototypes
• Locate positions of data race conditions (read/write, write/read, write/write)
  – Isolate deadlock
  – Identify inappropriate API arguments
Intel® Thread Checker Interface
Intel® Thread Profiler Interface
Shows over-time application behavior, impact of each synchronization object, concurrency, etc.
Threading with OpenMP*
For performance-hungry applications – parallel threads use all cores
Use OpenMP* compiler for automated thread creation
[Comparison chart: MRTE threads, C/C++ threads, and C/C++ OpenMP* rated on portability (“mostly”), scalability (“not always”), threading for latency hiding, performance orientation (“work for it”), incremental parallelism, high-level abstraction, keeping serial code intact, and verifiable correctness]
Driving Parallel Computing: Clustering, Grids, OpenMP*...

• OpenIB.org Industry Alliance – Intel, Dell, Mellanox, Voltaire, Topspin, Oracle, Infinicon…
  – Common set of InfiniBand drivers in the Linux 2.6 kernel (kernel.org)
• Parallelization tools
  – Intel® MPI libraries for Ethernet, InfiniBand…
  – Cluster OpenMP*: demonstrated scaling to 100 nodes
  – Interconnect software: OpenIB and DET
• Advanced computing solutions
  – Network storage solutions for speeds over 10Gb/s interconnects with Remote Direct Memory Access (RDMA)
• End-user engagements on advanced computing requirements
  – Cluster, grid access
  – Engineering engagements
Intel is investing in tools, labs, architecture, standards
Example: Non-Threaded Application
```csharp
public class Report {
    public void Compute(int x) {
        x = x * x; // ... long task
    }
}

static void Main() {
    Report r = new Report();
    r.Compute(5);
}
```
Example: Worker Thread
```csharp
public class Report {
    MethodInvoker mi;
    int x;

    public void Compute(int x) {
        this.x = x;
        mi = new MethodInvoker(AsyncCompute);
        mi.BeginInvoke(null, null);
    }

    void AsyncCompute() {
        x = x * x; // ... long task
    }
}

static void Main() {
    Report r = new Report();
    r.Compute(5);
}
```
Example: Thread Delegate

```csharp
public class Report {
    public delegate void OnDone(int x);

    MethodInvoker mi;
    int x;
    OnDone ondone;

    public void Compute(int x, OnDone ondone) {
        this.x = x;
        this.ondone = ondone;
        mi = new MethodInvoker(AsyncCompute);
        mi.BeginInvoke(null, null);
    }

    void AsyncCompute() {
        x = x * x; // ... long task
        ondone(x);
    }
}
```
Example: Delegate Invocation in GUI

```csharp
void Output(int x) {
    if (InvokeRequired)
        BeginInvoke(new Report.OnDone(OutputImpl), new object[] { x });
    else
        OutputImpl(x);
}

void OutputImpl(int x) {
    Console.Write(x);
}
```
Control.InvokeRequired property: Gets a value indicating whether the caller must call an invoke method when making method calls to the control, because the caller is on a different thread than the one the control was created on.

Control.BeginInvoke method: Executes the specified delegate asynchronously with the specified arguments, on the thread that the control's underlying handle was created on.
Example: Abort Task on Worker Thread
```csharp
public class Report {
    MethodInvoker mi;
    IAsyncResult ar;
    int x;
    bool bAbort;

    public void Compute(int x) {
        this.x = x;
        bAbort = false;
        mi = new MethodInvoker(AsyncCompute);
        ar = mi.BeginInvoke(null, null);
    }

    public void CancelComputing() {
        if (mi != null) {
            bAbort = true;
            mi.EndInvoke(ar);
            mi = null;
        }
    }

    void AsyncCompute() {
        if (bAbort) return;
        x = x * x; // ... long task
    }
}
```
Example: Locking a Unique Resource
```csharp
public class Report {
    public void Compute(int[] xs) {
        foreach (int x in xs)
            ThreadPool.QueueUserWorkItem(new WaitCallback(ComputeOne), x);
    }

    void ComputeOne(object param) {
        int x = (int)param;
        lock (sqlConn) {
            func(x); // allow only one thread to run here
        }
    }

    void func(int x) {
        // ... long task using a unique resource,
        // e.g. SQL connection sqlConn
    }
}
```
False Sharing
False sharing can occur when two threads access distinct or independent data that fall into the same cache line.

When two processor cores share a single memory system and have caches:
• A given piece of data may be in memory and in both processor cores’ caches
• All copies of one cache line must have the same data
• Reading stale data yields incorrect results, so…
• Processor cores make sure no reads occur to a cache line after someone else writes the line (MESI snooping protocol) – a bottleneck
• Frequently this kind of line sharing is unintentional and can be avoided

False sharing hurts performance on multi-processor and multi-core systems, as well as systems with Hyper-Threading Technology.
False Sharing: Example

Two threads divide the work by alternating 3-component vertices. The two threads then update contiguous vertices, which fall on the same cache line (the common case): each vertex is 12 bytes, so vertices Pk and Pk+1 (Xk Yk Zk Xk+1 Yk+1 Zk+1) share one 64-byte cache line.

[Diagram: Thread 1 and Thread 2 alternating points P0, P1, P2, … P9 across shared 64-byte cache lines]

False Sharing: Example Fix

Let each thread handle half of the vertices by dividing the data into equal (or near-equal) halves.

[Diagram: Thread 1 takes P0…PN/2, Thread 2 takes PN/2+1…PN-1]
Some Books and Training Resources
Training
• Intel® Software College
• Programs from other vendors

Unix reference documentation
• “Unix Systems Programming: Communication, Concurrency and Threads, Second Edition” by Kay Robbins, Steve Robbins
• “Threads Primer: A Guide to Multithreaded Programming” by Bil Lewis, Daniel J. Berg
• “Programming With Threads” by Steve Kleiman, Devang Shah, Bart Smaalders
• “Advanced Programming in the UNIX(R) Environment” by W. Richard Stevens

Windows reference documentation
• “Multithreading Applications in Win32: The Complete Guide to Threads” by Jim Beveridge and Robert Wiener
• “Advanced Windows” by Jeffrey Richter
• “Debugging Applications for Microsoft .NET and Microsoft Windows” by John Robbins
Some Online Resources
Intel Developer Services Threading Center
Intel® Thread Tools web site
Threading Methodology: Principles and Practices
Developing Multithreaded Applications: A Platform Consistent Approach
Multiple Approaches to Multithreaded Applications
Advanced Multi-Threaded Programming
Multithreading for Experts - inside a parallel application
Techniques to Improve Performance of Multithreaded Applications
Improve Performance with Thread Aware Memory Allocators
Common Concurrent Programming Errors (Linux Magazine*, March 2002)
Multithreaded Programming with OpenMP*
Advanced OpenMP* programming
Prototyping with OpenMP*