Chapter 19 OpenMP

Speaker: Lung-Sheng Chien

Reference: [1] OpenMP C and C++ Application Program Interface v2.0

[2] OpenMP C and C++ Application Program Interface v3.0

[3] OpenMP forum, http://www.openmp.org/forum/

[4] OpenMP tutorial: https://computing.llnl.gov/tutorials/openMP/

[5] Getting Started with OpenMP: http://rac.uits.iu.edu/hpc/openmp_tutorial/C/

Outline

• OpenMP introduction
  - shared memory architecture
  - multi-thread
• Example 1: hello world
• Example 2: vector addition
• Enable OpenMP in vc2005
• Example 3: vector addition + QTime
• Example 4: matrix multiplication
• Example 5: matrix multiplication (block version)

What is OpenMP

• OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared memory multiprocessing programming in C/C++ and Fortran on many architectures, including Unix and Microsoft Windows platforms. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

• OpenMP is a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.

• An application built with the hybrid model of parallel programming can run on a computer cluster using both OpenMP and Message Passing Interface (MPI). OpenMP targets shared memory; MPI targets distributed memory.

http://en.wikipedia.org/wiki/OpenMP

History of OpenMP

• The OpenMP Architecture Review Board (ARB) published its first API specification, OpenMP for Fortran 1.0, in October 1997. In October of the following year it released the C/C++ specification.

• 2000 saw version 2.0 of the Fortran specifications with version 2.0 of the C/C++ specifications being released in 2002.

• Version 2.5 is a combined C/C++/Fortran specification that was released in 2005.

• Version 3.0, released in May, 2008, is the current version of the API specifications. Included in the new features in 3.0 is the concept of tasks and the task construct. These new features are summarized in Appendix F of the OpenMP 3.0 specifications.

http://www.openmp.org/mp-documents/spec30.pdf

Goals of OpenMP

• Standardization: Provide a standard among a variety of shared memory architectures/platforms.

• Lean and Mean: establish a simple and limited set of directives for programming shared memory machines. Significant parallelism can be implemented by using just 3 or 4 directives.

• Ease of Use:
  - Provide the capability to incrementally parallelize a serial program, unlike message-passing libraries, which typically require an all-or-nothing approach
  - Provide the capability to implement both coarse-grain and fine-grain parallelism

• Portability:
  - Supports Fortran (77, 90, and 95), C, and C++
  - Public forum for API and membership

Website: http://openmp.org/wp/

OpenMP forum: http://www.openmp.org/forum/

Please register on this forum and browse the articles in the “General” section.

Multithreading

• OpenMP is an implementation of multithreading, a method of parallelization whereby the master "thread" (a series of instructions executed consecutively) "forks" a specified number of slave "threads" and a task is divided among them. The threads then run concurrently, with the runtime environment allocating threads to different processors.

• The runtime environment allocates threads to processors depending on usage, machine load and other factors. The number of threads can be assigned by the runtime environment based on environment variables, or in code using functions. The OpenMP functions are declared in the header file "omp.h" in C/C++.

Core elements

A compiler directive in C/C++ is called a pragma (pragmatic information). It is a preprocessor directive, so it begins with a hash (#). Compiler directives specific to OpenMP in C/C++ begin with #pragma omp.

OpenMP programming model [1]

• Shared Memory, Thread Based Parallelism: OpenMP is based upon the existence of multiple threads in the shared memory programming paradigm. A shared memory process consists of multiple threads.

• Explicit Parallelism: OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization.

• Fork-Join Model:
  - OpenMP uses the fork-join model of parallel execution.
  - All OpenMP programs begin as a single process: the master thread. The master thread executes sequentially until the first parallel region construct is encountered.
  - FORK: the master thread then creates a team of parallel threads.
  - The statements in the program that are enclosed by the parallel region construct are then executed in parallel among the various team threads.
  - JOIN: when the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.

OpenMP programming model [2]

• Compiler Directive Based: OpenMP parallelism is specified through the use of compiler directives.

• Nested Parallelism Support:
  - The API provides for the placement of parallel constructs inside of other parallel constructs.
  - Implementations may or may not support this feature.

• Dynamic Threads:
  - The API provides for dynamically altering the number of threads which may be used to execute different parallel regions.
  - Implementations may or may not support this feature.

• I/O:
  - OpenMP specifies nothing about parallel I/O. This is particularly important if multiple threads attempt to write to or read from the same file.
  - If every thread conducts I/O to a different file, the issues are not as significant.
  - It is entirely up to the programmer to ensure that I/O is conducted correctly within the context of a multi-threaded program.

• FLUSH Often?:
  - OpenMP provides a "relaxed-consistency" and "temporary" view of thread memory (in their words). In other words, threads can "cache" their data and are not required to maintain exact consistency with real memory all of the time.
  - When it is critical that all threads view a shared variable identically, the programmer is responsible for ensuring that the variable is FLUSHed by all threads as needed.

Outline

• Example 1: hello world
  - parallel construct

Example 1: hello world [1]

hello.c

Makefile

The #pragma directives offer a way for each compiler to offer machine- and operating system-specific features while retaining overall compatibility with the C and C++ languages. Pragmas are machine- or operating system-specific by definition, and are usually different for every compiler.

If the compiler finds a pragma it does not recognize, it issues a warning, but compilation continues.

MSDN library 2005

man icpc

The header file “omp.h” is required to call the OpenMP library routines.
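The hello.c listing itself did not survive in these notes. The following is only a minimal sketch of a typical OpenMP hello world, not the original file. The helper name hello_parallel and the fallback definitions in the #else branch are assumptions of this sketch; the fallbacks merely let it build when OpenMP support is switched off (with GCC, OpenMP is switched on by -fopenmp).

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>                 /* OpenMP library routines */
#else
/* Fallbacks (an assumption of this sketch) so it also builds without OpenMP. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical helper: every thread of the team prints one line.
   Returns the team size observed by the master thread. */
int hello_parallel(void)
{
    int nthreads = 1;            /* shared, written by the master thread only */

    #pragma omp parallel
    {
        /* Declared inside the region, so each thread has its own copy. */
        int th_id = omp_get_thread_num();
        printf("hello world from thread %d\n", th_id);
        if (th_id == 0)
            nthreads = omp_get_num_threads();
    }   /* implied barrier; afterwards only the master thread continues */

    return nthreads;
}
```

Calling hello_parallel() from main prints one line per thread, in no guaranteed order. Compiled without OpenMP, the pragma is ignored and exactly one line appears.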

Example 1: hello world [2]

hello.c

Machine quartet2 has 4 cores.

Machine octet1 has 8 cores (two quad-core processors).

Question 1: How can we impose the number of threads on this code?

Use the environment variable OMP_NUM_THREADS.

hello.c


Question 2: How can we run the same code in sequential mode?

hello.c Makefile

sequential version

octet1

quartet2

only one core executes


Question 3: How can we specify the number of threads explicitly in code?

hello.c

Use 5 threads (explicit) to execute concurrently.

Every thread has its own copy of th_id.

Synchronization: wait until all 5 threads have executed the “printf” statement.
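The three annotations above (an explicit team of 5 threads, a private th_id, and a synchronization point) can be sketched as follows. This is an illustration under assumptions, not the original hello.c: the helper name hello_five, the array seen and the macro MAX_THREADS are hypothetical, and the #else fallbacks only exist so the sketch builds without OpenMP.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallbacks (an assumption of this sketch) so it also builds without OpenMP. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

#define MAX_THREADS 5

/* Hypothetical helper: requests a team of 5 threads, marks in seen[] which
   thread ids reported, and returns the team size that was actually granted. */
int hello_five(int seen[MAX_THREADS])
{
    int th_id, nthreads = 0;

    for (th_id = 0; th_id < MAX_THREADS; th_id++)
        seen[th_id] = 0;

    /* private(th_id): each thread works on its own copy of th_id. */
    #pragma omp parallel private(th_id) num_threads(MAX_THREADS)
    {
        th_id = omp_get_thread_num();
        printf("hello world from thread %d\n", th_id);
        seen[th_id] = 1;

        /* Wait until every thread of the team has printed its line. */
        #pragma omp barrier

        if (th_id == 0) {
            nthreads = omp_get_num_threads();
            printf("there are %d threads\n", nthreads);
        }
    }
    return nthreads;
}
```

Because of the barrier, the "there are ... threads" line is printed only after all greetings. The runtime may grant fewer than 5 threads, so the sketch returns the granted team size rather than assuming 5.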


quartet2, octet1

[Figure: five threads on cores 0 to 4, each holding its own copy of th_id]

Directive Format

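The syntax of an OpenMP directive is formally specified by the grammar in the OpenMP specification. Informally, a directive has the form: #pragma omp directive-name [clause[ [,] clause] ...] new-line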

Each directive starts with #pragma omp, to reduce the potential for conflict with other (non-OpenMP or vendor extensions to OpenMP) pragma directives with the same names. White space can be used before and after the #, and sometimes white space must be used to separate the words in a directive. Preprocessing tokens following the #pragma omp are subject to macro replacement.

Conditional compilation

PARALLEL construct

Work-sharing construct

for Directive

sections Directive

workshare Directive

single Directive

Parallel construct

• The number of physical processors hosting the threads is implementation-defined. Once created, the number of threads in the team remains constant for the duration of that parallel region.

• When a thread reaches a PARALLEL directive, it creates a team of threads and becomes the master of the team. The master is a member of that team and has thread number 0 within that team.

• Starting from the beginning of this parallel region, the code is duplicated and all threads will execute that code.

• There is an implied barrier at the end of a parallel region. Only the master thread of the team continues execution at the end of a parallel region.

How many threads

• The number of threads in a parallel region is determined by the following factors, in order of precedence:
  - evaluation of the IF clause
  - setting of the NUM_THREADS clause
  - use of the omp_set_num_threads() library function
  - setting of the OMP_NUM_THREADS environment variable
  - implementation default, usually the number of CPUs on a node, though it could be dynamic

• Threads are numbered from 0 (master thread) to N-1.

• Master thread is numbered as 0.
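The first three levels of this precedence list can be observed with a short sketch. The function name team_size and its parameters are hypothetical, and the #else fallbacks are assumptions that only let the sketch build without OpenMP. A false IF clause serializes the region (team of one); otherwise the NUM_THREADS clause takes priority over the earlier omp_set_num_threads() call.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallbacks (an assumption of this sketch) so it also builds without OpenMP. */
static void omp_set_num_threads(int n) { (void)n; }
static int  omp_get_thread_num(void)   { return 0; }
static int  omp_get_num_threads(void)  { return 1; }
#endif

/* Hypothetical helper: returns the size of the team that executes a
   parallel region configured at three different precedence levels. */
int team_size(int enable, int library_threads, int clause_threads)
{
    int size = 1;

    /* Precedence level 3: library function. */
    omp_set_num_threads(library_threads);

    /* Level 1 (if clause) and level 2 (num_threads clause). */
    #pragma omp parallel if(enable) num_threads(clause_threads)
    {
        if (omp_get_thread_num() == 0)       /* master thread is number 0 */
            size = omp_get_num_threads();
    }
    return size;
}
```

For example, team_size(0, 4, 3) returns 1 because the IF clause evaluates to false, while team_size(1, 4, 3) returns at most 3 even though 4 threads were requested through the library call. The runtime is still free to grant fewer threads than requested.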

Question 4: How can we write parallel code that is independent of the number of cores of the host machine?

Question 5: What happens if the number of threads is larger than the number of cores of the host machine?

Private clause

The PRIVATE clause declares variables in its list to be private to each thread.

A “private variable” means that each thread has its own copy; threads cannot exchange information through it.

• PRIVATE variables behave as follows:
  - a new object of the same type is declared once for each thread in the team
  - all references to the original object are replaced with references to the new object
  - variables declared PRIVATE are uninitialized for each thread

Exercise 1: modify the code of hello.c to show that “every thread has its own private variable th_id”, that is, show that th_id has 5 copies.

Exercise 2: modify the code of hello.c by removing the clause “private (th_id)” from the #pragma directive. What happens? Can you explain?

Outline

• Example 2: vector addition
  - work-sharing construct: for Directive

Work-sharing construct

• A work-sharing construct divides the execution of the enclosed code region among the members of the team that encounter it

• A work-sharing construct must be enclosed dynamically within a parallel region in order for the directive to execute in parallel

• Work-sharing constructs do not launch new threads

• There is no implied barrier upon entry to a work-sharing construct, however there is an implied barrier at the end of a work sharing construct

for: shares iterations of a loop across the team. A type of data parallelism.

sections: breaks work into separate, discrete sections. Each section is executed by a thread. A type of functional parallelism.

single: serializes a section of code.
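As a rough illustration of the sections and single constructs just described, the sketch below fills two arrays with two independent sections and prints one message from a single thread. The function name fill_two and the fill formulas are hypothetical, chosen only for this example.

```c
#include <stdio.h>

/* Hypothetical helper: fills x and y with different formulas,
   one section per array (functional parallelism). */
void fill_two(float *x, float *y, int n)
{
    int i;

    #pragma omp parallel private(i)
    {
        /* Executed by exactly one thread of the team;
           the other threads wait at the implied barrier after it. */
        #pragma omp single
        printf("filling %d entries per array\n", n);

        /* Each section is executed once, by some thread of the team. */
        #pragma omp sections
        {
            #pragma omp section
            for (i = 0; i < n; i++)
                x[i] = 2.0f * i;

            #pragma omp section
            for (i = 0; i < n; i++)
                y[i] = i + 1.0f;
        }   /* implied barrier at the end of the work-sharing construct */
    }
}
```

With only two sections, at most two threads do useful work here, which is why sections suit a small number of distinct tasks while the for directive suits large loops.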

Example 2: vector addition [1]

vecadd.c, walltime.c

walltime.c is a tool for measuring time; it is only valid on Linux systems.

Example 2: vector addition [2]

vecadd.c

Makefile

“-O0” means no optimization.
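The vecadd.c listing is missing from these notes. A minimal sketch of a vector addition using a parallel region and the for work-sharing directive might look like the following; it is an assumption-laden reconstruction, not the original file. The function name vecadd and its signature are hypothetical; the scoping (i private; a, b, c, n shared) follows the discussion in this example.

```c
/* Hypothetical sketch: c[i] = a[i] + b[i] for i = 0 .. n-1. */
void vecadd(const float *a, const float *b, float *c, int n)
{
    int i;

    /* One copy of a, b, c and n is visible to all threads;
       each thread has its own loop index i. */
    #pragma omp parallel shared(a, b, c, n) private(i)
    {
        /* The iterations of the loop are divided among the team. */
        #pragma omp for schedule(static)
        for (i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }   /* implied barrier: c is complete when the region ends */
}
```

No iteration touches an entry of c that another iteration writes, so the shared arrays need no further protection.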

shared clause and default clause

The SHARED clause declares variables in its list to be shared among all threads in the team

• A shared variable exists in only one memory location and all threads can read or write to that address (every thread can “see” the shared variable)

• It is the programmer's responsibility to ensure that multiple threads properly access SHARED variables (such as via CRITICAL sections)
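To make the responsibility for shared variables concrete, here is a small sketch in which every thread adds a private partial sum into one shared total inside a CRITICAL section. The function name sum_array and the variable names are hypothetical, used only for illustration.

```c
/* Hypothetical helper: sums x[0 .. n-1] using a private partial sum
   per thread and one shared total protected by a critical section. */
double sum_array(const double *x, int n)
{
    int i;
    double partial;              /* private: one copy per thread */
    double total = 0.0;          /* shared: a single memory location */

    #pragma omp parallel shared(x, n, total) private(i, partial)
    {
        partial = 0.0;           /* private copies start uninitialized */

        #pragma omp for
        for (i = 0; i < n; i++)
            partial += x[i];

        /* Only one thread at a time may update the shared variable. */
        #pragma omp critical
        total += partial;
    }
    return total;
}
```

Without the critical section, two threads could read the same old value of total and one contribution would be lost.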

Question 6: Why must index i be a private variable, while a, b, c and N can be shared variables? What happens if we change i to a shared variable? What happens if we change a, b, c and N to private variables?

The DEFAULT clause allows the user to specify a default PRIVATE, SHARED, or NONE scope for all variables in the lexical extent of any parallel region.

Work-Sharing construct: for Directive

• SCHEDULE: describes how iterations of the loop are divided among the threads in the team.
  - static: loop iterations are divided into pieces of size chunk and then statically assigned to threads. If chunk is not specified, the iterations are evenly (if possible) divided contiguously among the threads.
  - dynamic: loop iterations are divided into pieces of size chunk and dynamically scheduled among the threads; when a thread finishes one chunk, it is dynamically assigned another. The default chunk size is 1.

• nowait: if specified, threads do not synchronize at the end of the parallel loop.
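A short sketch of the nowait clause, under the assumption that the two loops touch different arrays and are therefore independent. The function name scale_and_shift is hypothetical.

```c
/* Hypothetical helper: doubles every entry of a and adds 1 to every
   entry of b. The two loops are independent of each other. */
void scale_and_shift(float *a, float *b, int n)
{
    int i;

    #pragma omp parallel private(i)
    {
        /* nowait: a thread that finishes its share of this loop moves
           on to the next loop without waiting for the other threads. */
        #pragma omp for nowait
        for (i = 0; i < n; i++)
            a[i] = 2.0f * a[i];

        /* This loop keeps its implied barrier at the end. */
        #pragma omp for
        for (i = 0; i < n; i++)
            b[i] = b[i] + 1.0f;
    }
}
```

Dropping the barrier is only safe because the second loop never reads a; if it did, a thread could see entries that another thread has not yet updated.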

Example of static schedule

Assume we have 16 array elements, say a[16], b[16] and c[16], and use 4 threads.

Case 1: no chunk is specified; the 16 elements are divided into 4 contiguous pieces, one per thread.

index of a:   0  1  2  3 |  4  5  6  7 |  8  9 10 11 | 12 13 14 15
thread ID:         0     |      1      |      2      |      3

Case 2: chunk = 2

index of a:   0  1 |  2  3 |  4  5 |  6  7 |  8  9 | 10 11 | 12 13 | 14 15
thread ID:     0   |   1   |   2   |   3   |   0   |   1   |   2   |   3
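The chunked assignment can be checked by recording which thread executes each iteration. The sketch below is only an illustration: the names record_owners, owner and NITER are hypothetical, and the #else fallbacks are assumptions that let it build without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallbacks (an assumption of this sketch) so it also builds without OpenMP. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

#define NITER 16

/* Hypothetical helper: owner[i] receives the id of the thread that ran
   iteration i under schedule(static, chunk). Returns the team size. */
int record_owners(int owner[NITER], int chunk)
{
    int i, nthreads = 1;

    #pragma omp parallel num_threads(4) private(i)
    {
        if (omp_get_thread_num() == 0)
            nthreads = omp_get_num_threads();

        /* Pieces of `chunk` iterations are handed out round-robin,
           in order of thread number. */
        #pragma omp for schedule(static, chunk)
        for (i = 0; i < NITER; i++)
            owner[i] = omp_get_thread_num();
    }
    return nthreads;
}
```

If the runtime grants all 4 threads, record_owners(owner, 2) produces the pattern 0 0 1 1 2 2 3 3 0 0 1 1 2 2 3 3, and chunk = 4 produces the contiguous layout of the first case. In general iteration i is run by thread (i / chunk) mod T, where T is the granted team size.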

Results of example 2

$N = 2 \times 10^8$, compiler: Intel C compiler icpc 10.0, compiler option: -O0

Number of threads   quartet2    octet1
1                   1.6571 s    1.5451 s
2                   0.9064 s    0.9007 s
4                   0.5433 s    0.5165 s
8                   0.6908 s    0.4830 s
16                  0.7694 s    0.5957 s
32                  0.9263 s    0.7098 s
64                  0.9625 s    0.7836 s

octet1: $T_{\text{single}} / T_{8\,\text{core}} = 1.5451 / 0.483 = 3.199$

quartet2: $T_{\text{single}} / T_{4\,\text{core}} = 1.6571 / 0.5433 = 3.05$

Question 7: The performance improvement is limited to a factor of about 3. Why? Can you use a different configuration of the schedule clause to improve this number?

Outline

• Enable OpenMP in vc2005
  - vc2005 supports OpenMP 2.0
  - vc 6.0 does not support OpenMP

Example 1 (hello world) in vc2005 [1]

Step 1: create an empty console application

Step 2: copy hello.c to this project and add hello.c to the project manager

Step 3: change the platform to x64 (choose option “x64” and update the platform as “x64”)

Step 4: enable “openmp” support (vc2005 supports OpenMP 2.0)

Step 5: compile and execute

Example 2 (vector addition) in vc2005 [1]

walltime.c only works on Linux machines since there is no “sys/time.h” on Windows.

The time.h of ANSI C has no function “gettimeofday”, hence we give up walltime.c.

Example 2 (vector addition) in vc2005 [2]

time_t time( time_t *tp)

returns the current calendar time or -1 if the time is not available. If tp is not NULL, the return value is also assigned to *tp.

double difftime( time_t time_2, time_t time_1)

returns time_2 – time_1 expressed in seconds
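A minimal sketch of timing with these two ANSI C functions, which works on both Linux and Windows. The names elapsed_seconds and spin are hypothetical; spin is just a placeholder workload so that there is something to time. Note that time() has a resolution of one second, so this approach only suits long-running computations.

```c
#include <time.h>

/* Hypothetical placeholder workload, used only to have something to time. */
void spin(void)
{
    volatile double x = 0.0;
    long i;
    for (i = 0; i < 1000000L; i++)
        x += 1.0;
}

/* Hypothetical helper: wall-clock seconds spent in work(),
   measured with time() and difftime(). */
double elapsed_seconds(void (*work)(void))
{
    time_t t_start, t_end;

    time(&t_start);              /* calendar time before the work */
    work();
    time(&t_end);                /* calendar time after the work  */

    return difftime(t_end, t_start);   /* t_end - t_start in seconds */
}
```

A call such as elapsed_seconds(spin) returns 0 for short workloads because of the one-second resolution. The OpenMP library routine omp_get_wtime() is another portable option when finer resolution is needed.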

vecadd.cpp

Outline

• Example 3: vector addition + QTime

Example 3: vector addition (QTime) [1]

• A QTime object contains a clock time, i.e. the number of hours, minutes, seconds, and milliseconds since midnight

• QTime uses the 24-hour clock format; it has no concept of AM/PM. It operates in local time; it knows nothing about time zones or daylight savings time.

• QTime can be used to measure a span of elapsed time using the start(), restart(), and elapsed() functions

vecadd.cpp

QTime() constructs the time 0 hours, minutes, seconds and milliseconds, i.e. 00:00:00.000 (midnight). This is a valid time.


vecadd.cpp


generate project file vecadd_qt.pro

generate Makefile

Makefile


Embed Qt 3.2.1 non-commercial version into vc2005

Step 1: set up an empty project

Step 2: copy vecadd.cpp into this project

Step 3: add item “vecadd.cpp” in the project manager

Step 4: project properties → C/C++ → General → Additional Include Directories

.;$(QTDIR)\include;C:\Qt\3.2.1NonCommercial\mkspecs\win32-msvc

Step 5: project properties → C/C++ → Preprocessor → Preprocessor Definitions

WIN32;_DEBUG;_CONSOLE;_MBCS;UNICODE;QT_DLL;QT_THREAD_SUPPORT

Step 6: project properties → C/C++ → Language → OpenMP Support

Step 7: project properties → Linker → General → Additional Library Directories

$(QTDIR)\lib;C:\Program Files (x86)\Microsoft Visual Studio 8\VC\lib

Step 8: project properties → Linker → Input → Additional Dependencies

"qt-mtnc321.lib" "qtmain.lib" "kernel32.lib"

Step 9: compile and execute

Restriction: Qt3 on Windows only supports 32-bit applications, so we must choose the platform “Win32”. We will solve this problem after installing Qt4.

Outline

• Example 4: matrix multiplication

Example 4: matrix multiplication [1]

matrixMul.h

matrixMul.cpp

$c_{ij} = \sum_{k=1}^{wA} a_{ik} b_{kj}$

row-major index:

$a_{ik} = A[i \cdot wA + k]$

$b_{kj} = B[k \cdot wB + j]$

$c_{ij} = C[i \cdot wC + j]$

sequential version

Example 4: matrix multiplication [2]

matrixMul.cpp

parallel version
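The matrixMul.cpp listings are not reproduced in these notes. The following sketch shows one plausible shape of the sequential and parallel versions using the row-major indexing defined for this example. It is not the original code: the function names matrixMul_seq_sketch and matrixMul_parallel_sketch, the temporaries a and b, and the choice to parallelize the outermost loop are assumptions of this sketch.

```c
/* C = A * B in row-major storage.
   A is hA x wA, B is wA x wB, C is hA x wB (so wC = wB). */
void matrixMul_seq_sketch(float *C, const float *A, const float *B,
                          int hA, int wA, int wB)
{
    int i, j, k;
    for (i = 0; i < hA; i++)
        for (j = 0; j < wB; j++) {
            double sum = 0.0;
            for (k = 0; k < wA; k++)
                sum += (double)A[i * wA + k] * (double)B[k * wB + j];
            C[i * wB + j] = (float)sum;
        }
}

/* Same computation; the rows of C are divided among the threads. */
void matrixMul_parallel_sketch(float *C, const float *A, const float *B,
                               int hA, int wA, int wB)
{
    int i, j, k;
    float a, b;
    double sum;

    /* Every loop index and temporary must be private, otherwise
       threads would overwrite each other's intermediate values. */
    #pragma omp parallel shared(A, B, C, hA, wA, wB) private(i, j, k, sum, a, b)
    {
        #pragma omp for
        for (i = 0; i < hA; i++)
            for (j = 0; j < wB; j++) {
                sum = 0.0;
                for (k = 0; k < wA; k++) {
                    a = A[i * wA + k];
                    b = B[k * wB + j];
                    sum += (double)a * (double)b;
                }
                C[i * wB + j] = (float)sum;
            }
    }
}
```

Each thread writes a disjoint set of rows of C, so the shared output array needs no critical section.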

Question 8: We have three for-loops, one for “i”, one for “j” and one for “k”. Which one is parallelized by the OpenMP directive?

Question 9: Explain why the variables i, j, k, sum, a, b are declared as private. Can we move some of them to the shared clause?

Example 4: matrix multiplication [3]main.cpp

use QT timer

Example 4: matrix multiplication [4]main.cpp

use qmake to generate Makefile


Let BLOCK_SIZE = 16 and size(A) = size(B) = size(C) = $(N \times \text{BLOCK\_SIZE})^2$

total memory usage = sizeof(float) × (size(A) + size(B) + size(C))

N     Total size   1 thread       2 threads      4 threads      8 threads
16    0.75 MB      53 ms          31 ms          21 ms          24 ms
32    3 MB         434 ms         237 ms         121 ms         90 ms
64    12 MB        17,448 ms      8,964 ms       6,057 ms       2,997 ms
128   48 MB        421,854 ms     312,983 ms     184,695 ms     92,862 ms
256   192 MB       4,203,536 ms   2,040,448 ms   1,158,156 ms   784,623 ms

Platform: octet1, with compiler icpc 10.0, -O2

There is a large performance gap among N = 32, N = 64 and N = 128, so this algorithm is NOT good. Besides, the improvement from multi-threading is not significant.


running

Use command “top” to see resource usage

CPU usage is 800 %, 8 cores are busy

Exercise 3: verify subroutine matrixMul_parallel

matrixMul.cpp

matrixMul.cpp

Combined parallel work-sharing constructs

Exercise 4: verify the following subroutine matrix_parallel, which parallelizes loop-j, not loop-i.

1. Compare the performance of loop-i and loop-j.

2. Why do we declare index i as a shared variable? What happens if we declare index i as a private variable?

matrixMul.cpp

Outline

• Example 5: matrix multiplication (block version)

Example 5: matrix multiplication (block version) [1]

[Figure: $A \in \mathbb{R}^{6 \times 4}$, $B \in \mathbb{R}^{4 \times 6}$ and $C \in \mathbb{R}^{6 \times 6}$ partitioned into 2 × 2 blocks. Blocks are labelled (x, y): A has blocks (0,0) to (1,2), B has blocks (0,0) to (2,1), and C has blocks (0,0) to (2,2).]

[Figure: the 24 entries of A (height hA = 6, width wA = 4) numbered 0 to 23 in row-major order. Each block has index (bx, by), and each entry inside a block has thread index (tx, ty), from Thread (0,0) to Thread (1,1).]

$(bx, by, tx, ty) \mapsto (\text{blocksize} \cdot bx + tx,\ \text{blocksize} \cdot by + ty) \mapsto$ row-major global index

global index


matrixMul_block.cpp

Shared memory in GPU

[Figure: $A \in \mathbb{R}^{6 \times 4}$ and $B \in \mathbb{R}^{4 \times 6}$ partitioned into 2 × 2 blocks, giving hA_grid = 3, wA_grid = 2 and wB_grid = 3.]


matrixMul_block.cpp

aBegin: physical index of the first entry in block A(0,1)

bBegin: physical index of the first entry in block B(1,0)

Copy global data to a small block. Why?

Example 5: matrix multiplication (block version) [4]

matrixMul_block.cpp

[Figure: block partition of $A \in \mathbb{R}^{6 \times 4}$, $B \in \mathbb{R}^{4 \times 6}$ and $C \in \mathbb{R}^{6 \times 6}$.]

$C(i,j) = \sum_{k=1}^{wA} A(i,k)\, B(k,j)$ for all $(i,j) \in$ block (1,1)

or equivalently $C(1,1) = A(0,1)\, B(1,0) + A(1,1)\, B(1,1)$

Compute submatrix of C sequentially
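The block algorithm can be sketched as follows. This is not the matrixMul_block.cpp listing, which is missing from these notes. The function name matrixMul_block_sketch is hypothetical, BLOCK_SIZE is set to 2 only to match the 6 × 4 figure, and parallelizing over the block rows of C is an assumption of this sketch. The names aBegin, bBegin, As and Bs follow the annotations in this example; Cs is an extra local name.

```c
#define BLOCK_SIZE 2

/* C = A * B by blocks, row-major storage.
   hA, wA and wB are assumed to be multiples of BLOCK_SIZE. */
void matrixMul_block_sketch(float *C, const float *A, const float *B,
                            int hA, int wA, int wB)
{
    int by, bx;

    /* Each thread computes whole block rows of C. */
    #pragma omp parallel for private(bx)
    for (by = 0; by < hA / BLOCK_SIZE; by++) {
        for (bx = 0; bx < wB / BLOCK_SIZE; bx++) {
            /* Declared inside the loop body, hence private to a thread. */
            float As[BLOCK_SIZE][BLOCK_SIZE], Bs[BLOCK_SIZE][BLOCK_SIZE];
            float Cs[BLOCK_SIZE][BLOCK_SIZE] = {{0.0f}};
            int kb, tx, ty, k;

            for (kb = 0; kb < wA / BLOCK_SIZE; kb++) {
                /* Physical index of the first entry of block A(kb, by)
                   and of block B(bx, kb). */
                int aBegin = (by * BLOCK_SIZE) * wA + kb * BLOCK_SIZE;
                int bBegin = (kb * BLOCK_SIZE) * wB + bx * BLOCK_SIZE;

                /* Copy global data to small blocks. */
                for (ty = 0; ty < BLOCK_SIZE; ty++)
                    for (tx = 0; tx < BLOCK_SIZE; tx++) {
                        As[ty][tx] = A[aBegin + ty * wA + tx];
                        Bs[ty][tx] = B[bBegin + ty * wB + tx];
                    }

                /* Cs += As * Bs, computed sequentially. */
                for (ty = 0; ty < BLOCK_SIZE; ty++)
                    for (tx = 0; tx < BLOCK_SIZE; tx++)
                        for (k = 0; k < BLOCK_SIZE; k++)
                            Cs[ty][tx] += As[ty][k] * Bs[k][tx];
            }

            /* Write block C(bx, by) back using the global row-major index. */
            for (ty = 0; ty < BLOCK_SIZE; ty++)
                for (tx = 0; tx < BLOCK_SIZE; tx++)
                    C[(by * BLOCK_SIZE + ty) * wB + bx * BLOCK_SIZE + tx] = Cs[ty][tx];
        }
    }
}
```

For block (1,1) of C with the 6 × 4 and 4 × 6 matrices of the figure, the kb loop runs twice and accumulates A(0,1) B(1,0) and then A(1,1) B(1,1), matching the block formula above.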


Parallel version

GPU code


Let BLOCK_SIZE = 16 and size(A) = size(B) = size(C) = $(N \times \text{BLOCK\_SIZE})^2$

total memory usage = sizeof(float) × (size(A) + size(B) + size(C))

Block version, BLOCK_SIZE = 16

N     Total size   1 thread       2 threads      4 threads      8 threads
16    0.75 MB      40 ms          34 ms          34 ms          44 ms
32    3 MB         301 ms         309 ms         240 ms         219 ms
64    12 MB        2,702 ms       2,310 ms       1,830 ms       1,712 ms
128   48 MB        24,548 ms      19,019 ms      15,296 ms      13,920 ms
256   192 MB       198,362 ms     151,760 ms     129,754 ms     110,540 ms

Platform: octet1, with compiler icpc 10.0, -O2

Non-block version

N     Total size   1 thread       2 threads      4 threads      8 threads
16    0.75 MB      53 ms          31 ms          21 ms          24 ms
32    3 MB         434 ms         237 ms         121 ms         90 ms
64    12 MB        17,448 ms      8,964 ms       6,057 ms       2,997 ms
128   48 MB        421,854 ms     312,983 ms     184,695 ms     92,862 ms
256   192 MB       4,203,536 ms   2,040,448 ms   1,158,156 ms   784,623 ms

Question 10: The non-block version is much slower than the block version. Why?


Block version, BLOCK_SIZE = 512

N     Total size   1 thread       2 threads      4 threads      8 threads
2     12 MB        3,584 ms       1,843 ms       961 ms         453 ms
4     48 MB        27,582 ms      14,092 ms      7,040 ms       3,533 ms
8     192 MB       222,501 ms     110,975 ms     55,894 ms      28,232 ms

Block version, BLOCK_SIZE = 16

N     Total size   1 thread       2 threads      4 threads      8 threads
64    12 MB        2,702 ms       2,310 ms       1,830 ms       1,712 ms
128   48 MB        24,548 ms      19,019 ms      15,296 ms      13,920 ms
256   192 MB       198,362 ms     151,760 ms     129,754 ms     110,540 ms

Question 11: A larger BLOCK_SIZE gives better performance when using multiple threads. Why?

Question 12: A small BLOCK_SIZE is better with a single thread. Why?

Question 13: Matrix-matrix multiplication has complexity $O(N^3)$. Which algorithm is “good” at achieving this property?


The cache has 4 MB, so we can have a large BLOCK_SIZE.

The cache line is 64 bytes (16 floats).

In CPU: BLOCK_SIZE = 512, size(Bs) = size(As) = $512^2$ floats = $1024^2$ bytes = 1 MB

In GPU: BLOCK_SIZE = 16, size(Bs) = size(As) = $16^2$ floats = 1 kB

Exercise 5: verify subroutine matrixMul_block_seq against the non-block version; you can use a high-precision package.

Exercise 6: if we use “double”, how should we choose the value of BLOCK_SIZE? Show your experimental results.

Exercise 7: Can you modify subroutine matrixMul_block_parallel to improve its performance?

Exercise 8: compare parallel computation between CPU and GPU on your host machine.
