Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
PARALLEL PROGRAMMING ON MULTICORESThreading Subroutines
Nikos P. PitsianisXiaobai Sun
DEPARTMENT OF COMPUTER SCIENCEDUKE UNIVERSITY
09/29/2010
1 / 38
mailto:[email protected]
MY EXPERIENCE
BACKGROUNDPh.D. in Mathematics; basic knowledge of C
WHAT I HAVE GAINED
I Learned Fortran, Pthread, and OpenMPI Parallel Fast Multipole Method (FMM)
1. 1st parallel version on multicore desktop & laptop2. Initial code has 12k lines; current code has 18k lines3. In both Pthread and OpenMP4. Problem size: 20M (laptop); 100M (workstation)5. Speedup: 8X on an oct-core workstation
2 / 38
USING EXISTING SUBROUTINES
You have a working sequential code, but the parallel version sees
1. Segmentation fault
2. Bus error
3. Heap collapse
4. Inconsistent results
What happened?
3 / 38
USE OF EXISTING SUBROUTINES
Potential Problems, not THREAD-SAFE
Concurrent updates to global variables
Concurrent use of local scratch space
Stack overflow
4 / 38
USE OF EXISTING SUBROUTINES
Concurrent updates to global variablesdouble balance;void deposit (double s) {
balance += s;}
void withdraw (double s) {balance -= s;
}
int main () {balance = 0.0deposit (s1);deposit (s2);withdraw (s3);. . .printf("%f\n",balance);return 0;
}
double balance;pthread_mutex_t balance_mutex;void deposit (double s) {
pthread_mutex_lock (&balance_mutex);balance += s;pthread_mutex_unlock (&balance_mutex);
}
void withdraw (double s) {pthread_mutex_lock (&balance_mutex);balance -= s;pthread_mutex_unlock (&balance_mutex);
}
int main () {balance = 0.0;. . .return 0;
}
Use mutex variable
5 / 38
USE OF EXISTING SUBROUTINES
Concurrent updates to global variablesdouble balance;void deposit (double s) {
balance += s;}
void withdraw (double s) {balance -= s;
}
int main () {balance = 0.0deposit (s1);deposit (s2);withdraw (s3);. . .printf("%f\n",balance);return 0;
}
double balance;pthread_mutex_t balance_mutex;void deposit (double s) {
pthread_mutex_lock (&balance_mutex);balance += s;pthread_mutex_unlock (&balance_mutex);
}
void withdraw (double s) {pthread_mutex_lock (&balance_mutex);balance -= s;pthread_mutex_unlock (&balance_mutex);
}
int main () {balance = 0.0;. . .return 0;
}
Use mutex variable
6 / 38
USE OF EXISTING SUBROUTINES
Concurrent updates to global variablesdouble balance;void deposit (double s) {
balance += s;}
void withdraw (double s) {balance -= s;
}
int main () {balance = 0.0deposit (s1);deposit (s2);withdraw (s3);. . .printf("%f\n",balance);return 0;
}
double balance;pthread_mutex_t balance_mutex;void deposit (double s) {
pthread_mutex_lock (&balance_mutex);balance += s;pthread_mutex_unlock (&balance_mutex);
}
void withdraw (double s) {pthread_mutex_lock (&balance_mutex);balance -= s;pthread_mutex_unlock (&balance_mutex);
}
int main () {balance = 0.0;. . .return 0;
}
Use mutex variable
7 / 38
USE OF EXISTING SUBROUTINES
Concurrent use of local scratch spaceint main () {
int s1, s2, s, work[100];s1 = task1(work);s2 = task2(work);s = s1+s2;printf("%d\n", s);return 0;
}
int main () {int s1, s2, s, work1[100], work2[100];s1 = task1(work1);s2 = task2(work2);s = s1+s2;printf("%d\n", s);return 0;
}
Allocate for each thread; trim it down when necessary
8 / 38
USE OF EXISTING SUBROUTINES
Concurrent use of local scratch spaceint main () {
int s1, s2, s, work[100];s1 = task1(work);s2 = task2(work);s = s1+s2;printf("%d\n", s);return 0;
}
int main () {int s1, s2, s, work1[100], work2[100];s1 = task1(work1);s2 = task2(work2);s = s1+s2;printf("%d\n", s);return 0;
}
Allocate for each thread; trim it down when necessary
9 / 38
USE OF EXISTING SUBROUTINES
Concurrent use of local scratch spaceint main () {
int s1, s2, s, work[100];s1 = task1(work);s2 = task2(work);s = s1+s2;printf("%d\n", s);return 0;
}
int main () {int s1, s2, s, work1[100], work2[100];s1 = task1(work1);s2 = task2(work2);s = s1+s2;printf("%d\n", s);return 0;
}
Allocate for each thread; trim it down when necessary
10 / 38
USE OF EXISTING SUBROUTINES
Stack overflow
Caused by recursive call of subroutine
Unfold recursion
void foo (int i, int *s) {int ia[100];if ( i > 0)
foo(i-1, s);*s += compute(ia, i);
}
int main () {int s, i = 2;foo(i, &s);return 0;
}
int i, sMain
int ia[100]
foo(2)
int ia[100]
foo(1)
int ia[100]
foo(0)
Stack
int main () {int i, s, ia[100];for ( i = 0; i < 2; i++ )
s += compute(ia,i);return 0;
}int i, s
ia[100]
11 / 38
USE OF EXISTING SUBROUTINES
Stack overflow
Caused by recursive call of subroutine
Unfold recursion
void foo (int i, int *s) {int ia[100];if ( i > 0)
foo(i-1, s);*s += compute(ia, i);
}
int main () {int s, i = 2;foo(i, &s);return 0;
}int i, s
Main
int ia[100]
foo(2)
int ia[100]
foo(1)
int ia[100]
foo(0)
Stack
int main () {int i, s, ia[100];for ( i = 0; i < 2; i++ )
s += compute(ia,i);return 0;
}int i, s
ia[100]
12 / 38
USE OF EXISTING SUBROUTINES
Stack overflow
Caused by recursive call of subroutine
Unfold recursion
void foo (int i, int *s) {int ia[100];if ( i > 0)
foo(i-1, s);*s += compute(ia, i);
}
int main () {int s, i = 2;foo(i, &s);return 0;
}int i, s
Main
int ia[100]
foo(2)
int ia[100]
foo(1)
int ia[100]
foo(0)
Stack
int main () {int i, s, ia[100];for ( i = 0; i < 2; i++ )
s += compute(ia,i);return 0;
}int i, s
ia[100]
13 / 38
USE OF EXISTING SUBROUTINES
Stack overflow
Caused by recursive call of subroutine
Unfold recursion
void foo (int i, int *s) {int ia[100];if ( i > 0)
foo(i-1, s);*s += compute(ia, i);
}
int main () {int s, i = 2;foo(i, &s);return 0;
}int i, s
Main
int ia[100]
foo(2)
int ia[100]
foo(1)
int ia[100]
foo(0)
Stack
int main () {int i, s, ia[100];for ( i = 0; i < 2; i++ )
s += compute(ia,i);return 0;
}int i, s
ia[100]
14 / 38
USE OF EXISTING SUBROUTINES
Stack overflow
Caused by recursive call of subroutine
Unfold recursion
void foo (int i, int *s) {int ia[100];if ( i > 0)
foo(i-1, s);*s += compute(ia, i);
}
int main () {int s, i = 2;foo(i, &s);return 0;
}int i, s
Main
int ia[100]
foo(2)
int ia[100]
foo(1)
int ia[100]
foo(0)
Stack
int main () {int i, s, ia[100];for ( i = 0; i < 2; i++ )
s += compute(ia,i);return 0;
}int i, s
ia[100]
15 / 38
USE OF EXISTING SUBROUTINES
Stack overflow
Caused by recursive call of subroutine
Unfold recursion
void foo (int i, int *s) {int ia[100];if ( i > 0)
foo(i-1, s);*s += compute(ia, i);
}
int main () {int s, i = 2;foo(i, &s);return 0;
}
int i, sMain
int ia[100]
foo(2)
int ia[100]
foo(1)
int ia[100]
foo(0)
Stack
int main () {int i, s, ia[100];for ( i = 0; i < 2; i++ )
s += compute(ia,i);return 0;
}int i, s
ia[100]
16 / 38
USE OF EXISTING SUBROUTINES
Stack overflow
Caused by memory allocation within subroutine
Allocate within main(), pass pointer to thread
int ps[3];
void *foo (void *threadid) {long tid = (long)threadid;int ia[100];ps[tid] = compute(ia,i);pthread_exit(NULL);
}
int main () {long i;int s;pthread_t threads[3];for ( i = 0; i < 3; i++ )
pthread_create(&threads[i], NULL, foo, (void *i));for ( i = 0; i < 3; i++ )
pthread_join (threads[t], NULL);s = ps[0]+ps[1]+ps[2];return 0;
}
17 / 38
USE OF EXISTING SUBROUTINES
Stack overflow
Caused by memory allocation within subroutine
Dynamic allocation with malloc()
Use pthread library pthread_attr_setstacksize()
Allocate within main(), pass pointer to thread
18 / 38
USE OF EXISTING SUBROUTINES
Stack overflow
Caused by memory allocation within subroutine
Allocate within main(), pass pointer to thread
int ps[3];int *gia;
void *foo (void *threadid) {long tid = (long)threadid;int *ia = &gia[tid*100];ps[tid] = compute(ia,i);pthread_exit(NULL);
}
int main () {long i;int s;pthread_t threads[3];gia = (int *)malloc(sizeof(int)*3*100);for ( i = 0; i < 3; i++ )
pthread_create(&threads[i], NULL, foo, (void *i));for ( i = 0; i < 3; i++ )
pthread_join (threads[t], NULL);s = ps[0]+ps[1]+ps[2];free(gia);return 0;
}
19 / 38
ALGORITHMIC REDUCTION OF CRITICAL SECTIONS
READING & WRITING ARE NOT SYMMETRIC
1. Concurrent reading is fast & safe2. Concurrent writing must be AVOIDED
20 / 38
ALGORITHMIC REDUCTION OF CRITICAL SECTIONSExample: TML of FMM
B1
B2?
B3
Need O(N) Locks!!!
B4
Need O(3d − 2d) LocksNeed ZERO Locks
•
•
• • ••
••
•
•• •
••
•
•
•
••
•• •
•
•
•
•••
•
••
•
•
•
••
•
• ••
•• •
•
•• •
•
•
•••
•
••
•
•
•
•
•
•
•
••
21 / 38
ALGORITHMIC REDUCTION OF CRITICAL SECTIONSExample: TML of FMM
B1
B2?
B3
Need O(N) Locks!!!
B4
Need O(3d − 2d) LocksNeed ZERO Locks
22 / 38
ALGORITHMIC REDUCTION OF CRITICAL SECTIONSExample: TML of FMM
B1
B2?
B3
Need O(N) Locks!!!
B4
Need O(3d − 2d) LocksNeed ZERO Locks
23 / 38
ALGORITHMIC REDUCTION OF CRITICAL SECTIONSExample: TML of FMM
B1
B2
?
B3
Need O(N) Locks!!!
B4
Need O(3d − 2d) LocksNeed ZERO Locks
24 / 38
ALGORITHMIC REDUCTION OF CRITICAL SECTIONSExample: TML of FMM
B1
B2?
B3
Need O(N) Locks!!!
B4
Need O(3d − 2d) LocksNeed ZERO Locks
25 / 38
ALGORITHMIC REDUCTION OF CRITICAL SECTIONSExample: TML of FMM
B1
B2
?
B3
Need O(N) Locks!!!
B4
Need O(3d − 2d) LocksNeed ZERO Locks
26 / 38
ALGORITHMIC REDUCTION OF CRITICAL SECTIONSExample: TML of FMM
B1
B2
?
B3
Need O(N) Locks!!!
B4
Need O(3d − 2d) Locks
Need ZERO Locks
27 / 38
ALGORITHMIC REDUCTION OF CRITICAL SECTIONSExample: TML of FMM
B1
B2
?
B3
Need O(N) Locks!!!
B4
Need O(3d − 2d) Locks
Need ZERO Locks
28 / 38
DEVELOPING TIPS
Save a copy of your working code
Do sequential version first
Work on one subroutine at a timePick the most time-critical subroutine first
29 / 38
DEVELOPING TIPS
Study software structures
lmaxlmax − 1 lmaxlmax − 2 lmax − 1lmax − 3 lmax − 2 lmax − 2
.........
.........
.........2 3 31 2 20
12.........
lmax − 3lmax − 2lmax − 1
lmax
All l’s All l’s
All Particles
OPERATIONS
Upward P
ass:TSM,TMM
List2: TME
, TEE,TEL
List3: TMT
or TST
List5 &
Evaluate
Local: TLL
& TLT
List4: TSL
or TSTList
1: TST
SumLoca
l & Direct
TIME STEP
30 / 38
CODING TIPS
Do NOT use unless necessary
No goto statement in Fortran
Loops in Fortran
No break statement
UTILIZE COMPILER INSTEAD OF BLOCKING IT
31 / 38
CODING TIPS
Do NOT use unless necessary
No goto statement in Fortran
Loops in Fortran
DO .............
ENDDO
DO 10 .............
10 CONTINUE
No break statement
UTILIZE COMPILER INSTEAD OF BLOCKING IT
32 / 38
CODING TIPS
Do NOT use unless necessary
No goto statement in Fortran
Loops in Fortran
No break statement
while (cond) {.......cond = foo();
}
while (1) {......break;
}
UTILIZE COMPILER INSTEAD OF BLOCKING IT
33 / 38
TESTING & TUNING TIPS
Makefile
Split Fortran subroutines into individual file
Debugging & profiling tools
Shell script for batch test
34 / 38
TESTING & TUNING TIPS
Makefile
Split Fortran subroutines into individual file√
Fsplit
http://developers.sun.com/sunstudio/documentation/ss12/mr/man1/fsplit.1.html√
Use individual compilation flag for each routine
Debugging & profiling tools
Shell script for batch test
35 / 38
http://developers.sun.com/sunstudio/documentation/ss12/mr/man1/fsplit.1.html
TESTING & TUNING TIPS
Makefile
Split Fortran subroutines into individual file
Debugging & profiling tools√
gprofhttp://www.cs.utah.edu/dept/old/texinfo/as/gprof_toc.html√
DDDhttp://www.gnu.org/software/ddd√
valgrindhttp://valgrind.org/
Shell script for batch test
36 / 38
http://www.cs.utah.edu/dept/old/texinfo/as/gprof_toc.htmlhttp://www.gnu.org/software/dddhttp://valgrind.org/
TESTING & TUNING TIPS
Makefile
Split Fortran subroutines into individual file
Debugging & profiling tools
Shell script for batch test#! /bin/bashfor flag in -O -O2 -O4do
make cleanmake CFLAGS=$flag fFFLAGS=$flag sFFLAGS=-Ofor (( t=2; t
SUMMARY
I Threading subroutinesI Algorithmic reduction of critical sectionsI Programming development tips
Thank you for your attentionPlease send your comments to [email protected]
38 / 38
mailto:[email protected]