
Programming with Shared Memory

CONTENT

• Introduction
• Cilk
• TBB
• OpenMP

• Using heavyweight processes
• Using threads, e.g. Pthreads
• Using a completely new programming language for parallel programming – not popular, e.g. Ada
• Using library routines with an existing sequential programming language
• Modifying the syntax of an existing sequential programming language to create a parallel programming language
• Using an existing sequential programming language supplemented with compiler directives for specifying parallelism, e.g. OpenMP

Alternatives for Programming Shared Memory

FAMILY TREE

(Timeline figure, 1988–2006, of influences on Intel® TBB.
Libraries: Chare Kernel – small tasks; STL – generic programming; STAPL – recursive ranges; JSR-166 (FJTask) – containers; ECMA .NET* – parallel iteration classes.
Languages: Cilk – space-efficient scheduler, cache-oblivious algorithms; Threaded-C – continuation tasks, task stealing.
Pragmas: OpenMP* – fork/join tasks; OpenMP taskqueue – while & recursion.)

Operating systems are often based upon the notion of a process.

Processor time-shares between processes, switching from one process to another. This might occur at regular intervals or when an active process becomes delayed.

Offers the opportunity to deschedule processes blocked from proceeding for some reason, e.g. waiting for an I/O operation to complete.

The concept could be used for parallel programming. It is not much used because of the overhead, but the fork/join concepts are used elsewhere.

Using Heavyweight Processes

FORK-JOIN construct

UNIX System Calls

SPMD model with different code for the master process and the forked slave processes.


Differences between a process and threads

SAS (SHARED ADDRESS SPACE) PROGRAMMING MODEL

(Figure: two threads/processes issuing read(X) and write(X) on a shared variable X held by the system.)

A routine is thread safe if it can be called simultaneously from multiple threads and still produce correct results.

Standard I/O is thread safe: characters from different output messages are not interleaved.

System calls that return a time may not be thread safe.

Routines that access shared data need to be specially designed to ensure that they are thread safe.

Thread-Safe Routines

Consider two processes, each adding 1 to a shared variable x: each first reads x, then computes x + 1, and finally writes the result back.

Accessing Shared Data

Conflict in accessing shared variable

A critical section comprises code and the resources it involves. Establishing a critical section ensures that only one process accesses a particular resource at any time.

This mechanism is also called mutual exclusion.

Critical Section

The simplest mutual-exclusion mechanism is a lock.

One kind of lock is a 1-bit variable: 1 indicates that a process has entered the critical section; 0 indicates that no process is in the critical section.

Analogy with a door lock: a process arrives at the "door" of the critical section and, finding it open, enters and locks the door. When it finishes its operation, it opens the door and leaves the critical section.

Locks

Control of critical sections through busy waiting

Locks are implemented in Pthreads with mutually exclusive lock variables, or "mutex" variables:

pthread_mutex_lock(&mutex1);
    /* critical section */
pthread_mutex_unlock(&mutex1);

If a thread reaches a mutex lock and finds it locked, it will wait for the lock to open. If more than one thread is waiting for the lock when it opens, the system will select one thread to be allowed to proceed. Only the thread that locks a mutex can unlock it.

Pthread Lock Routines

Deadlock occurs when, after process P1 has locked resource R1, it requests resource R2, which is locked by P2, while P2 simultaneously requests R1.

Deadlock

Deadlock can also occur in a circular chain of locks, as follows:

Deadlock (deadly embrace)

Pthreads offers one routine that can test whether a lock is actually closed without blocking the thread:

pthread_mutex_trylock()

will lock an unlocked mutex and return 0, or will return EBUSY if the mutex is already locked – this might find a use in overcoming deadlock.

Pthreads

A semaphore is a positive integer (including zero) operated upon by two operations:

P operation on semaphore s: waits until s is greater than zero, then decrements s by one and allows the process to continue.

V operation on semaphore s: increments s by one and releases one of the waiting processes (if any).

Semaphores

P and V operations are performed indivisibly.

The mechanism for activating waiting processes is also implicit in the P and V operations. Though the exact algorithm is not specified, it is expected to be fair. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.

Mutual exclusion of critical sections can be achieved with one semaphore having the value 0 or 1 (a binary semaphore), which acts as a lock variable, but the P and V operations include a process scheduling mechanism:

Process 1             Process 2             Process 3
Noncritical section   Noncritical section   Noncritical section
...                   ...                   ...
P(s)                  P(s)                  P(s)
Critical section      Critical section      Critical section
V(s)                  V(s)                  V(s)
...                   ...                   ...
Noncritical section   Noncritical section   Noncritical section

A general (or counting) semaphore can take on positive values other than zero and one. It provides, for example, a means of recording the number of "resource units" available or used, and can be used to solve producer/consumer problems.

General semaphore (or counting semaphore)

A monitor is a suite of procedures that provides the only way to access a shared resource. Only one process can use a monitor procedure at any instant.

A monitor could be implemented using a semaphore or lock to protect entry, i.e.,

monitor_proc1() {
    lock(x);
    /* monitor body */
    unlock(x);
    return;
}

Monitor

Often, a critical section is to be executed if a specific global condition exists; for example, if a certain value of a variable has been reached.

With locks, the global variable would need to be examined at frequent intervals ("polled") within a critical section.

This is a very time-consuming and unproductive exercise.

It can be overcome by introducing so-called condition variables.

Condition Variables

Language Constructs for Parallelism: Shared Data

Shared memory variables might be declared as shared with, say,

shared int x;

par Construct

For specifying concurrent statements:

par {
    S1;
    S2;
    ...
    Sn;
}

forall Construct

To start multiple similar processes together:

forall (i = 0; i < n; i++) {
    S1;
    S2;
    ...
    Sm;
}

which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, ..., Sm. Each process uses a different value of i.

Example

forall (i = 0; i < 5; i++)
    a[i] = 0;

clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.

DESIGN FOR MULTITHREADING

Good design is critical. Bad multithreading can be worse than no multithreading: deadlocks, synchronization bugs, poor performance, etc.

BAD MULTITHREADING

(Figure: Threads 1–5 over time, with a Game Thread repeatedly handing off to a Rendering Thread.)

GOOD MULTITHREADING

(Figure: work such as input, AI, physics, animation/skinning, particle systems, networking, file I/O, rendering, and present, distributed across a Main Thread, Game Thread, and Rendering Thread over frames 2–4.)

ANOTHER PARADIGM: CASCADES

(Figure: Threads 1–5 process successive frames in a cascade, each thread one stage behind the previous, starting from Frame 1.)

Advantages: synchronization points are few and well-defined.

Disadvantages: increases latency (for constant frame rate); needs simple (one-way) data flow.

MULTITHREADED PROGRAMMING IN CILK

CONTENT

• Introduction
• Inlets
• Abort

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen

CILK IN ONE SLIDE

Extends C to support parallelism while preserving the original serial semantics. A programming model oriented around fork-join task creation, very well suited to recursive algorithms (e.g. branch-and-bound). Rests on a solid theoretical foundation: its performance can be proven.

Basic keywords:
cilk – marks a function as a "cilk" function that can be spawned
spawn – spawns a cilk function; only 2 to 5 times the cost of a regular function call
sync – wait until immediately spawned children functions return

Advanced keywords:
inlet – define a function to handle return values from a cilk task
cilk_fence – a portable memory fence
abort – terminate all currently existing spawned tasks

RECURSION IS AT THE HEART OF CILK

Spawning new tasks in Cilk is very convenient. Instead of looping, many tasks are produced recursively. This creates a nested queue of tasks, and the scheduler uses work stealing to keep all the cores busy.

With Cilk, the programmer worries about expressing concurrency, not the details of how it is implemented.

INTRODUCTION

A Cilk program is a set of procedures.

A procedure is a sequence of threads.

Cilk threads are represented by nodes in the dag.

FIBONACCI – AN EXAMPLE

C code:

int fib (int n) {
    if (n<2) return (n);
    else {
        int x,y;
        x = fib(n-1);
        y = fib(n-2);
        return (x+y);
    }
}

Cilk code:

cilk int fib (int n) {
    if (n<2) return (n);
    else {
        int x,y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
    }
}

Cilk provides no new data types.

BASIC CILK KEYWORDS

cilk int fib (int n) {
    if (n<2) return (n);
    else {
        int x,y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
    }
}

cilk – declares a Cilk function or procedure; such a procedure can be spawned in parallel.

spawn – spawns a child thread; the child can execute in parallel with the parent procedure.

sync – control cannot pass this point until all spawned children have returned.

DYNAMIC MULTITHREADING

cilk int fib (int n) {
    if (n<2) return (n);
    else {
        int x,y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
    }
}

The computation dag unfolds dynamically.

Example: fib(4)

(Figure: the dag of fib(4), with nodes labeled 4; 3, 2; 2, 1, 1; 1, 0, 0.)

MULTITHREADED COMPUTATION

• A directed acyclic graph G = (V, E) represents the parallel instruction stream.
• Each vertex v represents a (Cilk) thread: a maximal instruction sequence not containing parallel control instructions (spawn, sync, return).
• Each edge e can be a spawn edge, a return edge, or a continue edge.

(Figure: a dag with an initial thread, a final thread, and spawn, return, and continue edges.)

CACTUS STACK

(Figure: procedure A spawns B and C; C spawns D and E. Views of the stack: B sees A,B; D sees A,C,D; E sees A,C,E.)

Cilk supports C's rule for pointers: a pointer to stack space can be passed from parent to child, but not from child to parent. (Cilk also supports malloc.)

Cilk's cactus stack supports several views in parallel.

OPERATING ON RETURNED VALUES

• Cilk achieves this functionality using an internal function, called an inlet, which is executed as a secondary thread on the parent frame when the child returns.

• The inlet keyword defines a void internal function to be an inlet.

x += spawn foo(a,b,c);

SEMANTICS OF INLETS

cilk int fib (int n)
{
    int x = 0;
    inlet void summer (int result)
    {
        x += result;
        return;
    }
    if (n<2) return n;
    else {
        summer(spawn fib (n-1));
        summer(spawn fib (n-2));
        sync;
        return (x);
    }
}

1. The Cilk procedure fib(i) is spawned.
2. Control passes to the next statement.
3. When fib(i) returns, summer() is invoked.

SEMANTICS OF INLETS

• In the current implementation of Cilk, the inlet definition may not contain a spawn, and only the first argument of the inlet may be spawned at the call site.


IMPLICIT INLETS

cilk int wfib(int n) {
    if (n == 0) {
        return 0;
    } else {
        int i, x = 1;
        for (i=0; i<=n-2; i++) {
            x += spawn wfib(i);
        }
        sync;
        return x;
    }
}

For an assignment operation like this, the Cilk compiler automatically generates an implicit inlet.

COMMON PATTERN FOR CILK

Consider a program containing a loop:

void vadd (real *A, real *B, int n)
{
    int i;
    for (i=0; i<n; i++) A[i] += B[i];
}

• Convert it into a recursive structure, splitting the range in half until each piece is small enough:

void vadd (real *A, real *B, int n)
{
    if (n<MIN) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    } else {
        vadd(A, B, n/2);
        vadd(A+n/2, B+n/2, n-n/2);
    }
}

• Then add the Cilk keywords: cilk before the function, spawn before the two recursive calls, and sync after them.

COMPUTING A PRODUCT

p = ∏ A[i], for i = 0 to n−1

int product(int *A, int n)
{
    int i, p=1;
    for (i=0; i<n; i++) {
        p *= A[i];
    }
    return p;
}

Optimization: if a partial result is 0, terminate the computation early:

int product(int *A, int n)
{
    int i, p=1;
    for (i=0; i<n; i++) {
        p *= A[i];
        if (p == 0) break;
    }
    return p;
}

COMPUTING A PRODUCT IN PARALLEL

p = ∏ A[i], for i = 0 to n−1

cilk int prod(int *A, int n)
{
    int p = 1;
    if (n == 1) {
        return A[0];
    } else {
        p *= spawn prod(A, n/2);
        p *= spawn prod(A+n/2, n-n/2);
        sync;
        return p;
    }
}

How do we terminate the computation early?

CILK'S ABORT FEATURE

1. Recode the implicit inlet to make it explicit:

cilk int product(int *A, int n)
{
    int p = 1;
    inlet void mult(int x) {
        p *= x;
        return;
    }
    if (n == 1) {
        return A[0];
    } else {
        mult( spawn product(A, n/2) );
        mult( spawn product(A+n/2, n-n/2) );
        sync;
        return p;
    }
}

2. Check for 0 within the inlet:

cilk int product(int *A, int n)
{
    int p = 1;
    inlet void mult(int x) {
        p *= x;
        if (p == 0) {
            abort;  /* Aborts existing children, */
        }           /* but not future ones. */
        return;
    }
    if (n == 1) {
        return A[0];
    } else {
        mult( spawn product(A, n/2) );
        mult( spawn product(A+n/2, n-n/2) );
        sync;
        return p;
    }
}

3. Don't spawn further work after an abort:

cilk int product(int *A, int n)
{
    int p = 1;
    inlet void mult(int x) {
        p *= x;
        if (p == 0) {
            abort;  /* Aborts existing children, */
        }           /* but not future ones. */
        return;
    }
    if (n == 1) {
        return A[0];
    } else {
        mult( spawn product(A, n/2) );
        if (p == 0) {   /* Don't spawn if we've */
            return 0;   /* already aborted! */
        }
        mult( spawn product(A+n/2, n-n/2) );
        sync;
        return p;
    }
}

MUTUAL EXCLUSION

Cilk's solution to mutual exclusion is no better than anybody else's.

Cilk provides a library of locks declared with Cilk_lockvar.
• To avoid deadlock with the Cilk scheduler, a lock should only be held within a Cilk thread.
• I.e., spawn and sync should not be executed while a lock is held.

LOCKING

Cilk_lockvar data type:

#include <cilk-lib.h>

Cilk_lockvar mylock;

{
    Cilk_lock_init(mylock);
    Cilk_lock(mylock);    /* begin critical section */
    ...
    Cilk_unlock(mylock);  /* end critical section */
}

KEY IDEAS

Cilk is simple: cilk, spawn, sync, SYNCHED, inlet, abort.

JCilk is simpler.

Work & span.

INTEL'S THREADING BUILDING BLOCKS

THREADING BUILDING BLOCKS LIBRARY CHARACTERISTICS

• C++ library.
• Targets threading for performance (designed to parallelize computationally intensive work).
• Is compatible with other threading packages.
• Emphasizes scalable data-parallel programming.
• Specifies templates and tasks instead of threads – the library schedules tasks onto threads and manages load balancing.

COMPONENTS OF TBB (VERSION 2.1)

Parallel algorithms: parallel_for (improved), parallel_reduce (improved), parallel_do (new), pipeline (improved), parallel_sort, parallel_scan

Concurrent containers: concurrent_hash_map, concurrent_queue, concurrent_vector (all improved)

Synchronization primitives: atomic operations, various flavors of mutexes (improved)

Task scheduler: with new functionality

Memory allocators: tbb_allocator (new), cache_aligned_allocator, scalable_allocator

Utilities: tick_count, tbb_thread (new)

C++ REVIEW: FUNCTION TEMPLATE

A type-parameterized function. Strongly typed; obeys scope rules; actual arguments evaluated exactly once; not redundantly instantiated.

template<typename T>
void swap( T& x, T& y ) {
    T z = x;
    x = y;
    y = z;
}

void reverse( float* first, float* last ) {
    while( first < last-1 )
        swap( *first++, *--last );
}

The compiler instantiates template swap with T=float. [first,last) defines a half-open interval.

GENERICITY OF SWAP

template<typename T>
void swap( T& x, T& y ) {
    T z = x;   // Construct z
    x = y;     // Assignment
    y = z;     // Assignment
}              // Destroy z

Requirements for T:
T(const T&)                     Copy constructor
void T::operator=(const T&)     Assignment
~T()                            Destructor

C++ REVIEW: TEMPLATE CLASS

A type-parameterized class.

template<typename T, typename U>
class pair {
public:
    T first;
    U second;
    pair( const T& x, const U& y ) : first(x), second(y) {}
};

pair<string,int> x;
x.first = "abc";
x.second = 42;

The compiler instantiates template pair with T=string and U=int.

TBB LIBRARY ALGORITHM – PARALLEL_FOR

parallel_for is a template function provided by the library:

template <typename Range, typename Body>
void parallel_for( const Range& range, const Body& body, partitioner );

Requirements for a Range R:
R(const R&)                     Copy a range
R::~R()                         Destroy a range
bool R::empty() const           Is range empty?
bool R::is_divisible() const    Can range be split?
R::R(R& r, split)               Split r into two subranges

The library provides blocked_range, blocked_range2d, and blocked_range3d; the programmer can define new kinds of ranges.

REQUIREMENTS FOR FUNCTOR

template <typename Range, typename Functor>
void parallel_for( const Range& range, const Functor& func, partitioner );

Requirements for Functor func:
F::F( const F& )                      Copy constructor
F::~F()                               Destructor
void F::operator()(Range& subrange)   Apply F to subrange

EXAMPLE – PARALLEL_FOR

• Example: concurrently apply a function to each element in an array.

• Serial version:

void SerialApplyFoo( float a[], size_t n ) {
    for( size_t i=0; i<n; ++i )
        Foo(a[i]);
}

• The iteration space is 0…(n-1).

TBB LIBRARY ALGORITHM – PARALLEL_FOR, CONTINUED

The parallel version requires two steps:

#include "tbb/blocked_range.h"

class ApplyFoo {
    float *const my_a;
public:
    ApplyFoo( float a[] ) : my_a(a) {}
    void operator()( const blocked_range<size_t>& r ) const {
        float *a = my_a;
        for( size_t i=r.begin(); i!=r.end(); ++i )
            Foo(a[i]);
    }
};

TBB LIBRARY ALGORITHM – PARALLEL_FOR, CONTINUED

parallel_for breaks the iteration space into chunks, each of which is run on a separate thread.

blocked_range<T>(begin,end,grainsize) is a recursively divisible range. grainsize specifies the number of iterations in a "reasonable size" chunk to deal out to a processor. If the iteration space has more than grainsize iterations, parallel_for splits it into separate subranges that are scheduled separately.

operator() processes a chunk.

#include "tbb/parallel_for.h"

void ParallelApplyFoo( float a[], size_t n ) {
    parallel_for( blocked_range<size_t>(0,n,IdealGrainSize), ApplyFoo(a) );
}

TBB LIBRARY – ALGORITHMS

• parallel_reduce – math operations over the elements of an array in parallel
• parallel_do – loops over iteration spaces of indeterminate length
• parallel_* – several others…

WORK STEALING

(Figure: each of four threads has its own deque of tasks and a mailbox.)

Task-selection order for a thread:
0. Override: do an explicitly specified task.
1. Locality: take the youngest task from my own deque.
2. Cache affinity: steal a task advertised in my mailbox.
3. Load balance: steal the oldest task from a random victim.

HOW THIS WORKS

(Figure: the range is split recursively until the pieces reach grainsize.)

WORK DEPTH FIRST; STEAL BREADTH FIRST

(Figure: a victim thread's task tree against its L1/L2 caches. The best choice for theft is a big piece of work whose data is far from the victim's hot data; a smaller, shallower task is second best.)

PARALLEL SORT EXAMPLE (WITH WORK STEALING): QUICKSORT

tbb::parallel_sort (color, color+64);

(Figure sequence: 64 values are sorted cooperatively by four threads.)

Step 1: Thread 1 starts with the initial data.
Step 2: Thread 1 partitions/splits its data; Thread 2 gets work by stealing from Thread 1, then both partition/split their data.
Step 3: Thread 3 gets work by stealing from Thread 1, and Thread 4 gets work by stealing from Thread 2. Threads 1, 2, and 4 sort the rest of their data, while Thread 3 partitions/splits its data.
Step 4: Thread 1 gets more work by stealing from Thread 3; Thread 3 sorts the rest of its data.
Step 5: Thread 1 partitions/splits its data.
Step 6: Thread 2 gets more work by stealing from Thread 1; Threads 1 and 2 sort the rest of their data.
Step 7: DONE – the 64 values are fully sorted.

TBB LIBRARY – PIPELINE

TBB implements the pipeline pattern. Data flows through a series of pipeline stages, and each stage processes the data in some way.

PARALLEL PIPELINE

(Figure: numbered items flow through a parallel stage sandwiched between two serial stages.)

• A parallel stage scales because it can process items in parallel or out of order.
• A serial stage processes items one at a time, in order; items wait their turn in a serial stage.
• Incoming items are tagged with sequence numbers, which are used to recover order for serial stages.
• Excessive parallelism is controlled by limiting the total number of items flowing through the pipeline.
• Throughput is limited by the throughput of the slowest serial stage.

EXAMPLE

Sample problem: read a text file (sequential), capitalize the first letter of each word (parallel), and write the modified text to a new file (sequential).

TBB LIBRARY – PIPELINE, CONTINUED

// Create the pipeline
tbb::pipeline pipeline;

// Create file-reading stage and add it to the pipeline
MyInputFilter input_filter( input_file );
pipeline.add_filter( input_filter );

// Create capitalization stage and add it to the pipeline
MyTransformFilter transform_filter;
pipeline.add_filter( transform_filter );

// Create file-writing stage and add it to the pipeline
MyOutputFilter output_filter( output_file );
pipeline.add_filter( output_filter );

// Run the pipeline
pipeline.run( MyInputFilter::n_buffer );

// Must remove filters from pipeline before they are implicitly destroyed.
pipeline.clear();

TBB – PIPELINE, CONTINUED

// Filter that writes each buffer to a file.
class MyOutputFilter: public tbb::filter {
    FILE* my_output_file;
public:
    MyOutputFilter( FILE* output_file );
    /*override*/ void* operator()( void* item );
};

MyOutputFilter::MyOutputFilter( FILE* output_file )
    : tbb::filter(serial), my_output_file(output_file) {}

void* MyOutputFilter::operator()( void* item ) {
    MyBuffer& b = *static_cast<MyBuffer*>(item);
    fwrite( b.begin(), 1, b.size(), my_output_file );
    return NULL;
}

// Changes the first letter of each word from lower case to upper case.
class MyTransformFilter: public tbb::filter {
public:
    MyTransformFilter();
    /*override*/ void* operator()( void* item );
};

MyTransformFilter::MyTransformFilter() : tbb::filter(parallel) {}

/*override*/ void* MyTransformFilter::operator()( void* item ) {
    // a for loop and 'toupper()' go here…
}

TBB - TIMING tick_count class

92

using namespace tbb;

void Foo() {
    tick_count t0 = tick_count::now();
    ... action being timed ...
    tick_count t1 = tick_count::now();
    printf("time for action = %g seconds\n", (t1-t0).seconds());
}
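For comparison, the same wall-clock timing pattern can be written with standard C++ `<chrono>` (a sketch; `steady_clock` plays the role of `tick_count`, and the timed loop is a placeholder action of our own):

```cpp
#include <chrono>
#include <cstdio>

// Time an action with std::chrono, mirroring the tick_count pattern.
double time_action() {
    auto t0 = std::chrono::steady_clock::now();
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; ++i) x = x + 1.0;   // action being timed
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::printf("time for action = %g seconds\n", seconds);
    return seconds;
}
```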

TBB: TASK SCHEDULER

Intel Propaganda: the task scheduler is the engine that powers the loop templates. When practical, use the loop templates instead of the task scheduler, because the templates hide the complexity of the scheduler.

However, if you have an algorithm that does not naturally map onto one of the high-level templates, use the task scheduler. All of the scheduler functionality used by the high-level templates is available for you to use directly, so you can build new high-level templates that are just as powerful as the existing ones.

TBB – TASK SCHEDULER

Maps tasks to threads. Handles load balancing and scheduling. Hides threading details – just think in terms of tasks. Any code using the task scheduler must have an initialized tbb::task_scheduler_init object in scope:

#include "tbb/task_scheduler_init.h"
using namespace tbb;

int main() {
    task_scheduler_init init;
    ...
    return 0;
}


EXAMPLE: NAIVE FIBONACCI CALCULATION

Recursion typically used to calculate Fibonacci number F(n)=F(n-1)+F(n-2)

long SerialFib( long n ) {
    if( n < 2 )
        return n;
    else
        return SerialFib(n-1) + SerialFib(n-2);
}


EXAMPLE: NAIVE FIBONACCI CALCULATION

Can envision Fibonacci computation as a task graph

[Task graph: SerialFib(4) at the root; each SerialFib(n) node has children SerialFib(n-1) and SerialFib(n-2), down to the SerialFib(1) and SerialFib(0) leaves]


FIBONACCI - TASK SPAWNING SOLUTION

Use TBB tasks to manage creation and execution of the task graph

Create a new root task: allocate the task object, then construct the task

Spawn (execute) task, wait for completion

long ParallelFib( long n ) {
    long sum;
    FibTask& a = *new( task::allocate_root() ) FibTask(n,&sum);
    task::spawn_root_and_wait(a);
    return sum;
}


class FibTask: public task {
public:
    const long n;
    long* const sum;
    FibTask( long n_, long* sum_ ) : n(n_), sum(sum_) {}
    task* execute() {          // Overrides virtual function task::execute
        if( n < CutOff ) {
            *sum = SerialFib(n);
        } else {
            long x, y;
            FibTask& a = *new( allocate_child() ) FibTask(n-1,&x);
            FibTask& b = *new( allocate_child() ) FibTask(n-2,&y);
            set_ref_count(3);  // 3 = 2 children + 1 for wait
            spawn( b );
            spawn_and_wait_for_all( a );
            *sum = x + y;
        }
        return NULL;
    }
};

FIBONACCI - TASK SPAWNING SOLUTION

Derived from TBB task class

Create new child tasks to compute (n-1)th and (n-2)th Fibonacci numbers

Reference count is used to know when spawned tasks have completed

Set before spawning any children

Spawn task; return immediately

Can be scheduled at any time

Spawn task; block until all children have completed execution

The execute method does the computation of a task
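The same fork/join shape can be sketched without TBB using standard C++ `std::async` (a didactic sketch only, not TBB's scheduler: TBB tasks are far cheaper than the threads `std::async` may create, and `ParallelFibSketch` is our own name):

```cpp
#include <future>

// Serial base case, as on the slides.
long SerialFib( long n ) {
    return n < 2 ? n : SerialFib(n-1) + SerialFib(n-2);
}

// Fork/join: spawn one child asynchronously, compute the other in the
// current thread, then join - mirroring spawn()/spawn_and_wait_for_all()
// in the FibTask example.
long ParallelFibSketch( long n ) {
    const long CutOff = 16;            // below this, spawning overhead dominates
    if ( n < CutOff ) return SerialFib(n);
    auto child = std::async(std::launch::async, ParallelFibSketch, n-2);
    long x = ParallelFibSketch(n-1);   // work in the current thread
    return x + child.get();            // join
}
```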


CONCURRENT CONTAINERS

The TBB library provides highly concurrent containers. STL containers are not concurrency-friendly: an attempt to modify them concurrently can corrupt the container.

Standard practice is to wrap a lock around an STL container, which turns the container into a serial bottleneck.

The library instead provides fine-grained locking or lock-free implementations: worse single-thread performance, but better scalability. They can be used with the TBB library, OpenMP, or native threads.

TBB - CONTAINERS

concurrent_hash_map<Key,T,HashCompare>
concurrent_queue<T,Allocator>
concurrent_vector<T,Allocator>

These support concurrent access, concurrent operations, and parallel iteration. The TBB library retains control over memory allocation.

CONCURRENCY-FRIENDLY INTERFACES

Some STL interfaces are inherently not concurrency-friendly

For example, suppose two threads each execute:

extern std::queue<Item> q;

if( !q.empty() ) {     // at this instant, another thread
    item = q.front();  // might pop the last element...
    q.pop();           // ...so front()/pop() can fail
}

Solution: concurrent_queue has pop_if_present.


CONCURRENT QUEUE CONTAINER

concurrent_queue<T>

Preserves local FIFO order: if one thread pushes two values and another thread pops those two values, they come out in the same order that they went in.

Method push(const T&) places a copy of the item on the back of the queue.

Two kinds of pops: blocking – pop(T&); non-blocking – pop_if_present(T&).

Method size() returns a signed integer: the difference between pushes and pops. If size() returns –n, it means n pops await corresponding pushes.

Method empty() returns size() == 0. It may return true while pops are pending.


CONCURRENT QUEUE CONTAINER EXAMPLE

Simple example to enqueue and print integers

Constructor for queue

Push items onto queue

While more things are on the queue: pop an item off and print it

#include "tbb/concurrent_queue.h"
#include <stdio.h>
using namespace tbb;

int main () {
    concurrent_queue<int> queue;
    int j;

    for (int i = 0; i < 10; i++)
        queue.push(i);

    while (!queue.empty()) {
        queue.pop(j);
        printf("from queue: %d\n", j);
    }
    return 0;
}


TBB - ALLOCATION

tbb_allocator<T>: allocates and frees memory via the TBB malloc library if available, otherwise it reverts to using malloc and free.

scalable_allocator<T>: allocates and frees memory in a way that scales with the number of processors.

others…


SCALABLE MEMORY ALLOCATORS

Serial memory allocation can easily become a bottleneck in multithreaded applications: threads require mutual exclusion on the shared heap.

TBB offers two choices for scalable memory allocation, both similar to the STL template class std::allocator:

scalable_allocator – offers scalability, but no protection from false sharing. Memory is returned to each thread from a separate pool.

cache_aligned_allocator – offers both scalability and false-sharing protection.


METHODS FOR SCALABLE_ALLOCATOR

#include "tbb/scalable_allocator.h"
template<typename T> class scalable_allocator;

Scalable versions of malloc, free, realloc, calloc:
void *scalable_malloc( size_t size );
void  scalable_free( void *ptr );
void *scalable_realloc( void *ptr, size_t size );
void *scalable_calloc( size_t nobj, size_t size );

STL allocator functionality:
T* A::allocate( size_type n, void* hint=0 ) – allocate space for n values
void A::deallocate( T* p, size_t n ) – deallocate n values from p
void A::construct( T* p, const T& value )
void A::destroy( T* p )


SCALABLE ALLOCATORS EXAMPLE

#include "tbb/scalable_allocator.h"

typedef char _Elem;
typedef std::basic_string<_Elem,
        std::char_traits<_Elem>,
        tbb::scalable_allocator<_Elem> > MyString;
. . .
{
    . . .
    int *p;
    MyString str1 = "qwertyuiopasdfghjkl";
    MyString str2 = "asdfghjklasdfghjkl";
    p = tbb::scalable_allocator<int>().allocate(24);
    . . .
}

Use TBB scalable allocator for STL basic_string class

Use TBB scalable allocator to allocate 24 integers


TBB: SYNCHRONIZATION PRIMITIVES

Parallel tasks must sometimes touch shared data. When data updates might overlap, use mutual exclusion to avoid a race.

atomic<T> is a high-level generic abstraction for hardware atomic operations: it atomically protects the update of a single variable.

Critical regions of code are protected by scoped locks. The range of the lock is determined by its lifetime (scope); leaving the lock's scope calls the destructor, making it exception safe. Minimizing lock lifetime avoids possible contention.

Several mutex behaviors are available:
Spin vs. queued ("are we there yet" vs. "wake me when we get there")
Writer vs. reader/writer (supports multiple readers / single writer)
Scoped wrappers of native mutual exclusion functions


ATOMIC EXECUTION

atomic<T>
T should be an integral type or a pointer type. Full type-safe support for 8-, 16-, 32-, and 64-bit integers.

Example:
atomic<int> i;
. . .
int z = i.fetch_and_add(2);

Operations:
'= x' and 'x =' – read/write the value of x
x.fetch_and_store(y)    – z = x, x = y, return z
x.fetch_and_add(y)      – z = x, x += y, return z
x.compare_and_swap(y,p) – z = x, if (x==p) x=y; return z
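Standard C++ `std::atomic` provides the same operations under different names; a sketch mapping the table above to `fetch_add`, `exchange`, and `compare_exchange_strong`:

```cpp
#include <atomic>

// Mirror the TBB atomic<T> operations with std::atomic<int>.
int atomic_demo() {
    std::atomic<int> x(10);

    int z1 = x.fetch_add(2);     // like x.fetch_and_add(2): returns 10, x becomes 12
    int z2 = x.exchange(100);    // like x.fetch_and_store(100): returns 12, x becomes 100

    int expected = 100;
    x.compare_exchange_strong(expected, 7);  // like x.compare_and_swap(7,100): x becomes 7

    return z1 + z2 + x.load();   // 10 + 12 + 7
}
```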


SUMMARY

Intel® Threading Building Blocks is a parallel programming model for C++ applications

Used for computationally intensive code, with a focus on data-parallel programming.

Intel® Threading Building Blocks provides:
Generic parallel algorithms
Highly concurrent containers
Low-level synchronization primitives
A task scheduler that can be used directly


SHARED MEMORY PROGRAMMING WITH OPENMP


CS267 Lecture 6 119

INTRODUCTION TO OPENMP

What is OpenMP?
Open specification for Multi-Processing: a "standard" API for defining multi-threaded shared-memory programs. openmp.org – talks, examples, forums, etc.

A high-level API:
Preprocessor (compiler) directives (~80%)
Library calls (~19%)
Environment variables (~1%)

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 120

OPENMP* OVERVIEW: a sampler of OpenMP directives, library calls, and environment settings:

omp_set_lock(lck)

#pragma omp parallel for private(A, B)

#pragma omp critical

C$OMP parallel do shared(a, b, c)

C$OMP PARALLEL REDUCTION (+: A, B)

call OMP_INIT_LOCK (ilok)

call omp_test_lock(jlok)

setenv OMP_SCHEDULE “dynamic”

CALL OMP_SET_NUM_THREADS(10)

C$OMP DO lastprivate(XX)

C$OMP ORDERED

C$OMP SINGLE PRIVATE(X)

C$OMP SECTIONS

C$OMP MASTER

C$OMP ATOMIC

C$OMP FLUSH

C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)

C$OMP THREADPRIVATE(/ABC/)

C$OMP PARALLEL COPYIN(/blk/)

Nthrds = OMP_GET_NUM_PROCS()

!$OMP BARRIER

OpenMP: An API for Writing Multithreaded Applications

A set of compiler directives and library routines for parallel application programmers

Makes writing multi-threaded applications in Fortran, C and C++ as easy as we can make it.

Standardizes last 20 years of SMP practice

* The name “OpenMP” is the property of the OpenMP Architecture Review Board.



THE ESSENCE OF OPENMP

Create threads that execute in a shared address space:
The only way to create threads is with the "parallel construct". Once created, all threads execute the code inside the construct.

Split up the work between threads by one of two means:
SPMD (Single Program Multiple Data) – all threads execute the same code, and you use the thread ID to assign work to a thread.
Workshare constructs split up loops and tasks between threads.

Manage the data environment to avoid data access conflicts:
Synchronization, so correct results are produced regardless of how threads are scheduled.
Carefully manage which data are private (local to each thread) and which are shared.


A PROGRAMMER'S VIEW OF OPENMP

OpenMP is a portable, threaded, shared-memory programming specification with "light" syntax. Exact behavior depends on the OpenMP implementation! Requires compiler support (C or Fortran).

OpenMP will:
Allow a programmer to separate a program into serial regions and parallel regions, rather than reason about T concurrently-executing threads.
Hide stack management.
Provide synchronization constructs.

OpenMP will not:
Parallelize automatically.
Guarantee speedup.
Provide freedom from data races.


MOTIVATION – OPENMP

int main() {

    // Do this part in parallel
    printf( "Hello, World!\n" );

    return 0;
}


MOTIVATION – OPENMP

int main() {

    omp_set_num_threads(16);

    // Do this part in parallel
    #pragma omp parallel
    {
        printf( "Hello, World!\n" );
    }

    return 0;
}


OPENMP EXECUTION MODEL

Fork-Join parallelism:
The master thread spawns a team of threads as needed.
Parallelism is added incrementally until performance goals are met, i.e. the sequential program evolves into a parallel program.

[Figure: parallel regions separated by sequential parts; master thread in red; one nested parallel region]

PROGRAMMING MODEL – CONCURRENT LOOPS

OpenMP easily parallelizes loops. Requires: no data dependencies (read/write or write/write pairs) between iterations!

The compiler calculates the loop bounds for each thread directly from the serial source:

#pragma omp parallel for
for( i=0; i < 25; i++ ) {
    printf("Foo");
}


PROGRAMMING MODEL – LOOP SCHEDULING

The schedule clause determines how loop iterations are divided among the thread team:

static([chunk]) divides iterations statically between threads. Each thread receives [chunk] iterations, rounding as necessary to account for all iterations. Default [chunk] is ceil(#iterations / #threads).

dynamic([chunk]) allocates [chunk] iterations per thread, allocating an additional [chunk] iterations when a thread finishes. Forms a logical work queue consisting of all loop iterations. Default [chunk] is 1.

guided([chunk]) allocates dynamically, but [chunk] is exponentially reduced with each allocation.
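The static-schedule arithmetic can be sketched directly (our own helper name `static_range`; real implementations differ in rounding details): with the default chunk, thread t gets iterations [t*chunk, min((t+1)*chunk, n)).

```cpp
#include <algorithm>
#include <utility>

// Compute the iteration range a given thread receives under a
// static schedule with default chunk = ceil(n / nthreads).
std::pair<int,int> static_range(int n, int nthreads, int tid) {
    int chunk = (n + nthreads - 1) / nthreads;   // ceil division
    int begin = std::min(tid * chunk, n);
    int end   = std::min(begin + chunk, n);
    return {begin, end};
}
```

For n = 25 and 4 threads, chunk = 7, so the threads get [0,7), [7,14), [14,21), and [21,25): the last thread absorbs the rounding.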


PROGRAMMING MODEL – DATA SHARING

Parallel programs often employ two types of data:
Shared data, visible to all threads, similarly named.
Private data, visible to a single thread (often stack-allocated).

PThreads:
Global-scoped variables are shared.
Stack-allocated variables are private.

// shared, globals
int bigdata[1024];

void* foo(void* bar) {
    // private, stack
    int tid;

    /* Calculation goes here */
}

OpenMP: shared variables are shared; private variables are private.

int bigdata[1024];

void* foo(void* bar) {
    int tid;

    #pragma omp parallel \
        shared ( bigdata ) \
        private ( tid )
    {
        /* Calc. here */
    }
}


PROGRAMMING MODEL - SYNCHRONIZATION

OpenMP critical sections – named or unnamed; no explicit locks / mutexes:

#pragma omp critical
{ /* Critical code here */ }

Barrier directives:

#pragma omp barrier

Explicit lock functions – when all else fails; may require the flush directive:

omp_set_lock( &l );
/* Code goes here */
omp_unset_lock( &l );

Single-thread regions within parallel regions – master, single directives:

#pragma omp single
{ /* Only executed once */ }


EXAMPLE PROBLEM: NUMERICAL INTEGRATION

Mathematically, we know that:

    ∫₀¹ 4.0/(1+x²) dx = π

We can approximate the integral as a sum of rectangles:

    Σ_{i=0}^{N} F(x_i) Δx ≈ π

where each rectangle has width Δx and height F(x_i) taken at the middle of interval i.

[Figure: F(x) = 4.0/(1+x²) on [0,1], falling from 4.0 at x = 0 to 2.0 at x = 1]


PI PROGRAM: AN EXAMPLE

static long num_steps = 100000;
double step;

void main () {
    int i;
    double x, pi, sum = 0.0;

    step = 1.0/(double) num_steps;
    x = 0.5 * step;
    for (i = 0; i < num_steps; i++) {
        sum += 4.0/(1.0 + x*x);
        x += step;
    }
    pi = step * sum;
}
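A self-contained, checkable version of the slide's program (wrapped in a function of our own, `compute_pi`, so the result can be returned; the midpoint rule converges to π):

```cpp
// Midpoint-rule approximation of pi = integral of 4/(1+x^2) over [0,1].
double compute_pi(long num_steps) {
    double step = 1.0 / (double) num_steps;
    double sum = 0.0;
    double x = 0.5 * step;              // midpoint of the first interval
    for (long i = 0; i < num_steps; i++) {
        sum += 4.0 / (1.0 + x * x);
        x += step;
    }
    return step * sum;
}
```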


PI PROGRAM: IDENTIFY CONCURRENCY

static long num_steps = 100000;
double step;

void main () {
    int i;
    double x, pi, sum = 0.0;

    step = 1.0/(double) num_steps;
    x = 0.5 * step;
    for (i = 0; i < num_steps; i++) {   // loop iterations can in principle
        sum += 4.0/(1.0 + x*x);         // be executed concurrently
        x += step;
    }
    pi = step * sum;
}


PI PROGRAM: EXPOSE CONCURRENCY, PART 1

static long num_steps = 100000;
double step;

void main () {
    double pi, sum = 0.0;

    step = 1.0/(double) num_steps;

    int i; double x;
    for (i = 0; i < num_steps; i++) {
        x = (i+0.5)*step;           // redefine x to remove the
        sum += 4.0/(1.0 + x*x);     // loop-carried dependence
    }
    pi = step * sum;
}

Isolate data that must be shared from data local to a task.

The accumulation into sum is called a reduction: results from each iteration are accumulated into a single global.


PI PROGRAM: EXPOSE CONCURRENCY, PART 2 – DEAL WITH THE REDUCTION

static long num_steps = 100000;
#define NUM 4   // expected max thread count
double step;

void main () {
    double pi, sum[NUM] = {0.0};

    step = 1.0/(double) num_steps;

    int i, ID = 0; double x;
    for (i = 0; i < num_steps; i++) {
        x = (i+0.5)*step;
        sum[ID] += 4.0/(1.0 + x*x);
    }
    pi = 0.0;
    for (int i = 0; i < NUM; i++)
        pi += step * sum[i];
}

Common trick: promote the scalar "sum" to an array indexed by thread number, creating thread-local copies of shared data.


PI PROGRAM: EXPRESS CONCURRENCY USING OPENMP

#include <omp.h>
static long num_steps = 100000;
#define NUM 4
double step;

void main () {
    double pi, sum[NUM] = {0.0};

    step = 1.0/(double) num_steps;
    #pragma omp parallel num_threads(NUM)   // create NUM threads; each thread
    {                                       // executes the code in this block
        int i, ID; double x;
        ID = omp_get_thread_num();
        for (i = ID; i < num_steps; i += NUM) {  // simple mod to the loop to deal
            x = (i+0.5)*step;                    // out iterations to threads
            sum[ID] += 4.0/(1.0 + x*x);
        }
    }
    pi = 0.0;
    for (int i = 0; i < NUM; i++)
        pi += step * sum[i];
}

Variables defined inside a parallel block are private to each thread; automatic variables defined outside a parallel region are shared between threads.


PI PROGRAM: FIXING THE NUM THREADS BUG

NUM is a requested number of threads, but the OS can choose to give you fewer. Hence you need a bit of code to get the actual number of threads:

#include <omp.h>
static long num_steps = 100000;
#define NUM 4
double step;

void main () {
    double pi, sum[NUM] = {0.0};

    step = 1.0/(double) num_steps;
    #pragma omp parallel num_threads(NUM)
    {
        int nthreads = omp_get_num_threads();
        int i, ID; double x;
        ID = omp_get_thread_num();
        for (i = ID; i < num_steps; i += nthreads) {
            x = (i+0.5)*step;
            sum[ID] += 4.0/(1.0 + x*x);
        }
    }
    pi = 0.0;
    for (int i = 0; i < NUM; i++)
        pi += step * sum[i];
}


INCREMENTAL PARALLELISM

Software development with incremental parallelism:
Apply behavior-preserving transformations to expose concurrency.
Express concurrency incrementally by adding OpenMP directives; in a large program you can do this loop by loop, evolving the original program into a parallel OpenMP program.
Build and time the program; optimize as needed with behavior-preserving transformations until you reach the desired performance.


PI PROGRAM: EXECUTE CONCURRENCY

Build this program and execute it on parallel hardware. Performance can suffer on some systems due to false sharing of sum[ID]: independent elements of the sum array share a cache line, and hence every update requires a cache-line transfer between threads.

#include <omp.h>
static long num_steps = 100000;
#define NUM 4
double step;

void main () {
    double pi, sum[NUM] = {0.0};

    step = 1.0/(double) num_steps;
    #pragma omp parallel num_threads(NUM)
    {
        int nthreads = omp_get_num_threads();
        int i, ID; double x;
        ID = omp_get_thread_num();
        for (i = ID; i < num_steps; i += nthreads) {
            x = (i+0.5)*step;
            sum[ID] += 4.0/(1.0 + x*x);   // false sharing: sum elements share a cache line
        }
    }
    pi = 0.0;
    for (int i = 0; i < NUM; i++)
        pi += step * sum[i];
}


PI PROGRAM: SAFE UPDATE OF SHARED DATA

Replace the sum array with a local/private version of sum (psum): no more false sharing. Use a critical section so only one thread at a time updates sum, i.e. the psum values are combined safely.

#include <omp.h>
static long num_steps = 100000;
#define NUM 4
double step;

int main () {
    double pi, sum = 0.0;

    step = 1.0/(double) num_steps;
    #pragma omp parallel num_threads(NUM)
    {
        int i, ID; double x, psum = 0.0;
        int nthreads = omp_get_num_threads();
        ID = omp_get_thread_num();
        for (i = ID; i < num_steps; i += nthreads) {
            x = (i+0.5)*step;
            psum += 4.0/(1.0 + x*x);
        }
        #pragma omp critical
        sum += psum;
    }
    pi = step * sum;
    return 0;
}
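The same private-partial-sum-plus-critical-section pattern can be sketched with standard C++ threads and a mutex, so it runs without OpenMP (function name `pi_threads` is our own):

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Each thread accumulates a private partial sum, then adds it to the
// shared total under a mutex - the std::thread analogue of
// "#pragma omp critical" protecting "sum += psum".
double pi_threads(long num_steps, int nthreads) {
    double step = 1.0 / (double) num_steps;
    double sum = 0.0;
    std::mutex m;

    std::vector<std::thread> team;
    for (int id = 0; id < nthreads; ++id) {
        team.emplace_back([&, id] {
            double psum = 0.0;                    // private partial sum
            for (long i = id; i < num_steps; i += nthreads) {
                double x = (i + 0.5) * step;
                psum += 4.0 / (1.0 + x * x);
            }
            std::lock_guard<std::mutex> lock(m);  // critical section
            sum += psum;
        });
    }
    for (auto& t : team) t.join();
    return step * sum;
}
```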


PI PROGRAM: MAKING LOOP-SPLITTING AND REDUCTIONS EVEN EASIER

#include <omp.h>
static long num_steps = 100000;
double step;

void main () {
    int i; double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    #pragma omp parallel for private(i, x) reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = step * sum;
}

The reduction clause manages the dependency on sum; the private clause creates data local to each thread.


SYNCHRONIZATION: BARRIER

Barrier: each thread waits until all threads arrive.

#pragma omp parallel shared (A, B, C) private(id)
{
    id = omp_get_thread_num();
    A[id] = big_calc1(id);
#pragma omp barrier
#pragma omp for
    for (i=0; i<N; i++) { C[i] = big_calc3(i, A); }
    // implicit barrier at the end of a for worksharing construct
#pragma omp for nowait
    for (i=0; i<N; i++) { B[i] = big_calc2(C, i); }
    // no implicit barrier due to nowait
    A[id] = big_calc4(id);
}
// implicit barrier at the end of a parallel region


PUTTING THE MASTER THREAD TO WORK

The master construct denotes a structured block that is executed only by the master thread. The other threads just skip it (no synchronization is implied).

#pragma omp parallel
{
    do_many_things();
#pragma omp master
    { exchange_boundaries(); }
#pragma omp barrier
    do_many_other_things();
}


RUNTIME LIBRARY ROUTINES AND ICVS

To use a known, fixed number of threads in a program: (1) tell the system that you don't want dynamic adjustment of the number of threads, (2) set the number of threads, then (3) save the number you actually got.

#include <omp.h>

void main() {
    int num_threads;
    omp_set_dynamic( 0 );                        // disable dynamic adjustment of the number of threads
    omp_set_num_threads( omp_get_num_procs() );  // request as many threads as you have processors
#pragma omp parallel
    {
        int id = omp_get_thread_num();
#pragma omp single                               // protect this store: memory stores are not atomic
        num_threads = omp_get_num_threads();
        do_lots_of_stuff(id);
    }
}

Internal Control Variables (ICVs) define the state of the runtime system as seen by a thread. Consistent pattern: set with an "omp_set" call or an environment variable, read with "omp_get".


OPTIMIZING LOOP PARALLEL PROGRAMS

Short-range force computation for a particle system using the cut-off method:

#include <omp.h>
#pragma omp parallel
{
    // define neighborhood as the num_neigh[i] particles
    // within "cutoff" of each particle "i"
    #pragma omp for
    for( int i = 0; i < n; i++ ) {
        Fx[i] = 0.0; Fy[i] = 0.0;
        for (int j = 0; j < num_neigh[i]; j++) {
            neigh_ind = neigh[i][j];
            Fx[i] += forceX(i, neigh_ind);
            Fy[i] += forceY(i, neigh_ind);
        }
    }
}

Particles may be unevenly distributed, i.e. different particles have different numbers of neighbors. Evenly spreading out loop iterations may fail to balance the load among threads. We need a way to tell the compiler how best to distribute the load.


THE SCHEDULE CLAUSE

The schedule clause affects how loop iterations are mapped onto threads:

schedule(static [,chunk]) – deal out blocks of iterations of size "chunk" to each thread.

schedule(dynamic [,chunk]) – each thread grabs "chunk" iterations off a queue until all iterations have been handled.

schedule(guided [,chunk]) – threads dynamically grab blocks of iterations; the size of the block starts large and shrinks down to size "chunk" as the calculation proceeds.

schedule(runtime) – schedule and chunk size taken from the OMP_SCHEDULE environment variable (or from the runtime library, as of OpenMP 3.0).


OPTIMIZING LOOP PARALLEL PROGRAMS

The same force computation with schedule(dynamic, 10): divide the range of n into chunks of size 10. Each thread computes a chunk, then goes back to get its next chunk of 10 iterations. This dynamically balances the load between threads.

#include <omp.h>
#pragma omp parallel
{
    // define neighborhood as the num_neigh[i] particles
    // within "cutoff" of each particle "i"
    #pragma omp for schedule(dynamic, 10)
    for( int i = 0; i < n; i++ ) {
        Fx[i] = 0.0; Fy[i] = 0.0;
        for (int j = 0; j < num_neigh[i]; j++) {
            neigh_ind = neigh[i][j];
            Fx[i] += forceX(i, neigh_ind);
            Fy[i] += forceY(i, neigh_ind);
        }
    }
}


LOOP WORK-SHARING CONSTRUCTS: THE SCHEDULE CLAUSE

Schedule clause – when to use:

STATIC – work per iteration is pre-determined and predictable by the programmer. Least work at runtime: scheduling is done at compile time.

DYNAMIC – unpredictable, highly variable work per iteration. Most work at runtime: complex scheduling logic is used at run time.

GUIDED – special case of dynamic, to reduce scheduling overhead.


SECTIONS WORK-SHARING CONSTRUCT

The sections work-sharing construct gives a different structured block to each thread.

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        x_calculation();
        #pragma omp section
        y_calculation();
        #pragma omp section
        z_calculation();
    }
}

By default, there is a barrier at the end of "omp sections". Use the "nowait" clause to turn off the barrier.


SINGLE WORK-SHARING CONSTRUCT

The single construct denotes a block of code that is executed by only one thread.

A barrier is implied at the end of the single block.

#pragma omp parallel
{
    do_many_things();
#pragma omp single
    { exchange_boundaries(); }
    do_many_other_things();
}


SUMMARY OF OPENMP’S KEY CONSTRUCTS

The only way to create threads is with the parallel construct:
#pragma omp parallel
All threads execute the instructions in a parallel construct.

Split work between threads by:
SPMD – use the thread ID to control execution.
Worksharing constructs to split loops (simple loops only):
#pragma omp for
Combined parallel/workshare as a shorthand:
#pragma omp parallel for

High-level synchronization is safest:
#pragma omp critical
#pragma omp barrier
