View
225
Download
3
Embed Size (px)
Citation preview
Introduction to OpenMP
For a more detailed tutorial see:http://www.openmp.org
Look at the presentations
also see:https://computing.llnl.gov/tutorials/openMP/
Concepts
• An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism
• Directive based programming– declare properties of language structures
• sections of code, loops
– scope variables
• A few service routines– get information
• Compiler options• Environment variables
Open MP Programming Model
• Directive– #pragma omp directive [clause list]– C$OMP construct [clause [clause] … ]
• Program executes serially until it encounters a parallel directive– #pragma omp parallel [clause list]– /* structured block of code */
• Clause list is used to specify conditions– Conditional parallelism - if (cond)– Degree of concurrency - num_threads(int)– Data Handling -
• private(vlist), firstprivate(vlist), shared(vlist)
OpenMP Programming Model
• fork-join parallelism• Master thread spawns a team of threads
as needed.
Typical OpenMP Use
• Generally used to parallelize loops– Find most time consuming loops– Split iterations up between threads
void main(){ double Res[1000]; for(int i=0;i<1000;i++) { do_huge_comp(Res[i]); }}
void main(){ double Res[1000];
#pragma omp parallel for for(int i=0;i<1000;i++) { do_huge_comp(Res[i]); }}
Thread Interaction
• OpenMP operates using shared memory– Threads communicate via shared variables
• Unintended sharing can lead to race conditions– output changes due to thread scheduling
• Control race conditions using synchronization– synchronization is expensive– change the way data is stored to minimize the
need for synchronization
Syntax format
• Compiler directives– C/C++
• #pragma omp construct [clause [clause] …]
– Fortran• C$OMP construct [clause [clause] … ]• !$OMP construct [clause [clause] … ]• *$OMP construct [clause [clause] … ]
• Since we use directives, no changes need to be made to a program for a compiler that doesn’t support OpenMP
Using OpenMP• Some compilers can automatically place directives with
option– -qsmp=auto (IBM xlc)– some loops may speed up, some may slow down
• Compiler option required when you use directives– icc -openmp (Linux – Intel compiler)– gcc -fopenmp (gcc versions >= 4.2)
• Scoping variables is the hard part!– shared variables, thread private variables
OpenMP Example#include <stdio.h>#include <omp.h>
main (){
int nthreads, tid;
/* Fork a team of threads giving them their own copies of variables */#pragma omp parallel private(tid) {
/* Obtain and print thread id */ tid = omp_get_thread_num(); printf("Hello World from thread = %d\n", tid);
/* Only master thread does this */ if (tid == 0) { nthreads = omp_get_num_threads(); printf("Number of threads = %d\n", nthreads); }
} /* All threads join master thread and terminate */
}
[snell@atropos openmp]$ gcc-4.2 -fopenmp hello.c -o hello[snell@atropos openmp]$ ./hello Hello World from thread = 0Hello World from thread = 1Number of threads = 2[snell@atropos openmp]$
[snell@atropos openmp]$ gcc-4.2 -fopenmp hello.c -o hello[snell@atropos openmp]$ ./hello Hello World from thread = 0Hello World from thread = 1Number of threads = 2[snell@atropos openmp]$
OpenMP Directives
• 5 categories– Parallel Regions– Worksharing– Data Environment– Synchronization– Runtime functions / environment
variables
• Basically the same between C/C++ and Fortran
Parallel Regions
• Create threads with omp parallel
• Threads share A (default behavior)• Master thread creates the threads• Threads all start at same time then synchronize at
a barrier at the end to continue with code.
double A[1000]omp_set_num_threads(4);#pragma omp parallel{
int ID = omp_get_thread_num();dosomething(ID, A);
}
Parallel Regions
#pragma omp parallel [clause ...] newline structured_block
if (scalar_expression) private (list) shared (list) default (shared | none) firstprivate (list) reduction (operator: list) copyin (list) num_threads (integer-expression)
Clauses
Parallel Regions
• The number of threads in a parallel region is determined by the following factors, in order of precedence:– 1. Evaluation of the IF clause – 2. Setting of the NUM_THREADS clause – 3. Use of the omp_set_num_threads() library function – 4. Setting of the OMP_NUM_THREADS environment variable – 5. Implementation default - usually the number of CPUs on
a node, though it could be dynamic.
• Threads are numbered from 0 (master thread) to N-1
Sections construct
• The sections construct gives a different structured block to each thread
• By default there is a barrier at the end. Use the nowait clause to turn off.
#pragma omp parallel#pragma omp sections{
X_calculation();#pragma omp section
y_calculation();#pragma omp section
z_calculation();}
Work-sharing constructs
• the for construct splits up loop iterations
• By default, there is a barrier at the end of the “omp for”. Use the “nowait” clause to turn off the barrier.
#pragma omp parallel#pragma omp for
for (I=0;I<N;I++){
NEAT_STUFF(I);}
Short-hand notation
• Can combine parallel and work sharing constructs
• There is also a “parallel sections” construct
#pragma omp parallel forfor (I=0;I<N;I++){
NEAT_STUFF(I);}
#pragma omp parallel forfor (I=0;I<N;I++){
NEAT_STUFF(I);}
A Rule
• In order to be made parallel, a loop must have canonical “shape”
for (index=start; index end; )
<<=>=>
index++;++index;index--;--index;index += inc;index -= inc;index = index + inc;index = inc + index;index = index – inc;
An example
#pragma omp parallel for private(j)for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
for (j = 0; j < n; j++)a[i][j] = MIN(a[i][j], a[i][k] + tmp[j])
By definition, private variable values are undefined at loop entry and exit
To change this behavior, you can use the firstprivate(var) and lastprivate(var)
clauses
x[0] = complex_function();#pragma omp parallel for private(j) firstprivate(x)for (i = 0; i < n; i++)
for (j = 0; j < m; j++)x[j] = g(i, x[j-1]);
answer[i] = x[j] – x[i];
Scheduling Iterations• The schedule clause effects how loop
iterations are mapped onto threads• schedule(static [,chunk])
– Deal-out blocks of iterations of size “chunk” to each thread.
• schedule(dynamic[,chunk])– Each thread grabs “chunk” iterations off a queue until all
iterations have been handled.• schedule(guided[,chunk])
– Threads dynamically grab blocks of iterations. The size of he block starts large and shrinks down to size “chunk” as the calculation proceeds.
• schedule(runtime)– Schedule and chunk size taken from the OMP_SCHEDULE
environment variable.
An example
#pragma omp parallel for private(j) schedule(static, 2)for (i = 0; i < n; i++)
for (j = 0; j < m; j++)x[j][j] = g(i, x[j-1]);
You can play with the chunk size to meet load balancing issues, etc.
Synchronization Directives
• BARRIER– inside PARALLEL, all threads synchronize
• CRITICAL (lock) / END CRITICAL (lock)– section that can be executed by one
thread only– lock is optional name to distinguish
several critical constructs from each other
An example
double area, pi, x;int i, n;
area = 0.0;
#pragma omp parallel for private(x)for (i = 0; i < n; i++){
x = (i + 0.5)/n;
#pragma omp criticalarea += 4.0/(1.0 + x*x);
}pi = area / n;
Reductions
• Sometimes you want each thread to calculate part of a value then collapse all that into a single value
• Done with reduction clausearea = 0.0;#pragma omp parallel for private(x) reduction (+:area)for (i = 0; i < n; i++){
x = (i + 0.5)/n;area += 4.0/(1.0 + x*x);
}pi = area / n;
Another Example/* A Monte Carlo algorithm for calculating pi */int count; /* points inside the unit quarter circle */unsigned short xi[3]; /* random number seed */int i; /* loop index */int samples; /* Number of points to generate */double x,y; /* Coordinates of points */double pi; /* Estimate of pi */
xi[0] = 1; /* These statements set up the random seed */xi[1] = 1;xi[2] = 0;count = 0;for (i = 0; i < samples; i++){
x = erand48(xi);y = erand48(xi);if (x*x + y*y <= 1.0) count++;
}pi = 4.0 * count / samples;printf(“Estimate of pi: %7.5f\n”, pi);
OpenMP Issues
Each thread needs differentrandom number seeds
count is sharedwe need the aggregate
OpenMP Version/* A Monte Carlo algorithm for calculating pi */int count; /* points inside the unit quarter circle */unsigned short xi[3]; /* random number seed */int i; /* loop index */int samples; /* Number of points to generate */double x,y; /* Coordinates of points */double pi; /* Estimate of pi */
omp_set_num_threads(omp_get_num_procs());xi[0] = 1; xi[1] = 1; xi[2] = omp_get_thread_num();count = 0;
#pragma omp parallel for firstprivate(xi) private(x,y) reduction(+:count)for (i = 0; i < samples; i++){
x = erand48(xi);y = erand48(xi);if (x*x + y*y <= 1.0) count++;
}pi = 4.0 * count / samples;printf(“Estimate of pi: %7.5f\n”, pi);
What is wrong with this?What is wrong with this?
#include <stdio.h>#include <stdlib.h>#include <omp.h>
main(int argc, char *argv[]){ /* A Monte Carlo algorithm for calculating pi */ int count; /* points inside the unit quarter circle */ int i; /* loop index */ int samples; /* Number of points to generate */ double x,y; /* Coordinates of points */ double pi; /* Estimate of pi */
samples = atoi(argv[1]);
#pragma omp parallel{ unsigned short xi[3]; /* random number seed */ xi[0] = 1; /* These statements set up the random seed */ xi[1] = 1; xi[2] = omp_get_thread_num(); count = 0; printf("I am thread %d\n", xi[2]);#pragma omp for firstprivate(xi) private(x,y) reduction(+:count) for (i = 0; i < samples; i++) { x = erand48(xi); y = erand48(xi); if (x*x + y*y <= 1.0) count++; }} pi = 4.0 * (double)count / (double)samples; printf("Count = %d, Samples = %d, Estimate of pi: %7.5f\n", count, samples, pi);}
Corrected VersionCorrected Version
[snell@atropos openmp]$ time ./montecarlopi.single 50000000I am thread 0Count = 39268604, Samples = 50000000, Estimate of pi: 3.14149
real 0m4.628suser 0m4.568ssys 0m0.030s[snell@atropos openmp]$ time ./montecarlopi 50000000I am thread 1I am thread 0Count = 39272420, Samples = 50000000, Estimate of pi: 3.14179
real 0m2.480suser 0m4.620ssys 0m0.030s[snell@atropos openmp]$
[snell@atropos openmp]$ time ./montecarlopi.single 50000000I am thread 0Count = 39268604, Samples = 50000000, Estimate of pi: 3.14149
real 0m4.628suser 0m4.568ssys 0m0.030s[snell@atropos openmp]$ time ./montecarlopi 50000000I am thread 1I am thread 0Count = 39272420, Samples = 50000000, Estimate of pi: 3.14179
real 0m2.480suser 0m4.620ssys 0m0.030s[snell@atropos openmp]$
An alternate version…
#pragma omp parallel private(xi, t, x, y, local_count){
xi[0] = 1; xi[1] = 1; xi[2] = tid = omp_get_thread_num();t = omp_get_num_threads();local_count = 0;
for (i = tid; i < samples; i += t){
x = erand48(xi);y = erand48(xi);if (x*x + y*y <= 1.0) local_count++;
}#pragma omp critical
count += local_count;} pi = 4.0 * count / samples; printf(“Estimate of pi: %7.5f\n”, pi);}
[snell@atropos openmp]$ time ./montecarlopi2 50000000I am thread 0I am thread 11: local_count is 164610031: local_count is 16392925Count = 32853927, Samples = 50000000, Estimate of pi: 2.62831
real 0m5.053suser 0m9.697ssys 0m0.053s[snell@atropos openmp]$
[snell@atropos openmp]$ time ./montecarlopi2 50000000I am thread 0I am thread 11: local_count is 164610031: local_count is 16392925Count = 32853927, Samples = 50000000, Estimate of pi: 2.62831
real 0m5.053suser 0m9.697ssys 0m0.053s[snell@atropos openmp]$
Problems!Problems!
Corrected Version /* A Monte Carlo algorithm for calculating pi */ int count; /* points inside the unit quarter circle */ int local_count; /* points inside the unit quarter circle */ int t, tid; int i; /* loop index */ int samples; /* Number of points to generate */ double x,y; /* Coordinates of points */ double pi; /* Estimate of pi */
samples = atoi(argv[1]);
#pragma omp parallel private(i,t,x,y,local_count) reduction(+:count){ unsigned short xi[3]; /* random number seed */ xi[0] = 1; /* These statements set up the random seed */ xi[1] = 1; xi[2] = tid = omp_get_thread_num(); t = omp_get_num_threads(); count = 0; //printf("I am thread %d\n", xi[2]); for (i = tid; i < samples; i+=t) { x = erand48(xi); y = erand48(xi); if (x*x + y*y <= 1.0) count++; } //printf("%d: count is %d\n", tid, count);} pi = 4.0 * (double)count / (double)samples; printf("Count = %d, Samples = %d, Estimate of pi: %7.5f\n", count, samples, pi);
Corrections:•i should be private!•random number seed array(xi) should also be private
Use reduction
Corrections:•i should be private!•random number seed array(xi) should also be private
Use reduction
[snell@atropos openmp]$ time ./montecarlopi3 50000000Count = 39269476, Samples = 50000000, Estimate of pi: 3.14156
real 0m2.321suser 0m4.559ssys 0m0.014s
[snell@atropos openmp]$ time ./montecarlopi3 50000000Count = 39269476, Samples = 50000000, Estimate of pi: 3.14156
real 0m2.321suser 0m4.559ssys 0m0.014s
Serial Directives
• MASTER / END MASTER– executed by master thread only
• DO SERIAL / END DO SERIAL– loop immediately following should not
be parallelized– useful with -qsmp=omp:auto
• SINGLE– only one thread executes the block
Example Serial Execution/* A Monte Carlo algorithm for calculating pi */…omp_set_num_threads(omp_get_num_procs());xi[0] = 1; xi[1] = 1; xi[2] = omp_get_thread_num();count = 0;
#pragma omp parallel for firstprivate(xi) private(x,y) reduction(+:count)for (i = 0; i < samples; i++){
x = erand48(xi);y = erand48(xi);if (x*x + y*y <= 1.0) count++;
#pragma omp single{printf(“Loop Iteration: %d\n”, i);}
}pi = 4.0 * count / samples;printf(“Estimate of pi: %7.5f\n”, pi);
Conditional Execution
• Overhead of fork/join is high• If a loop is small, you don’t want to
parallellize• But, you may not know how big until
runtime• Conditional clause for parallel execution
– if ( expression )area = 0.0;#pragma omp parallel for private(x) reduction (+:area) if (n > 5000)for (i = 0; i < n; i++){
x = (i + 0.5)/n;area += 4.0/(1.0 + x*x);
}pi = area / n;
Scope Rules
• Shared memory programming model– most variables are shared by default
• Global variables are shared• But not everything is shared
– loop index variables– stack variables in called functions from parallel
region
• variable set and then used in for-loop is PRIVATE
• array whose subscript is constant w.r.t. PARALLEL for-loop and is set and then used within the for-loop is PRIVATE
Scope Clauses
• for/DO directive has extra clauses, the most important– PRIVATE (variable list)– REDUCTION (op: variable list)
• op is sum, min, max• variable is scalar, XLF allows array
Scope Clauses (2)
• PARALLEL and PARALELL for/DO and PARALLEL SECTIONS have also– DEFAULT (variable list)
• scope determined by rules
– SHARED (variable list)– IF (scalar logical expression)
• directives are like programming language extension, not compiler option
integer i,j,n
real*8 a(n,n), b(n)
read (1) b
!$OMP PARALLEL DO
!$OMP PRIVATE (i,j) SHARED (a,b,n)
do j=1,n
do i=1,n
a(i,j) = sqrt(1.d0 + b(j)*i)
end do
end do
!$OMP END PARALLEL DO
Matrix Multiply
!$OMP PARALLEL DO PRIVATE(i,j,k)
do j=1,n
do i=1,n
do k=1,n
c(i,j) = c(i,j) + a(i,k) * b(k,j)
end do
end do
end do
Analysis
• Outer loop is parallel: columns of c• Not optimal for cache use• Can put more directives for each loop• Then granularity might be too fine
OMP Functions
• int omp_get_num_procs()• int omp_get_num_threads()• int omp_get_thread_num()• void omp_set_num_threads(int)
OpenMP Environment Variables
• OpenMP parallelism may be controlled via environment variables– OMP_NUM_THREADS
• Sets number of threads in parallel sections
– OMP_DYNAMIC• When = TRUE, allows number of threads to be set at
runtime
– OMP_NESTED• When = TRUE, enables nested parallelism
– OMP_SCHEDULE• Controls the scheduling assignment • Example - export OMP_SCHEDULE=“static,4”
Fortran Parallel Directives
• PARALLEL / END PARALLEL• PARALLEL SECTIONS / SECTION /
SECTION / END PARALLEL SECTIONS• DO / END DO
– work sharing directive for DO loop immediately following
• PARALLEL DO / END PARALLEL DO– combined section and work sharing
Pthread Translation