Symbolic Program Consistency Checking of OpenMP Parallel
Programs with Relaxed Memory Models
Fang Yu
National Chengchi University
Shun-Ching Yang
Guan-Cheng Chen
Che-Chang Chan
National Taiwan University
Based on an LCTES 2012 paper.
Farn Wang
National Taiwan University
& Academia Sinica
Outline
• Introduction
  – Motivation
  – Parallel program correctness
  – Related work
• 2-step program consistency checking
  – Step 1: Static race constraint solution
  – Step 2: Guided simulation
• Extended finite-state machine (EFSM), relaxed memory models
• Implementation
• Experiments
• Conclusion
Motivation (1/4)
• Parallel programming
  – Multi-cores
  – General-purpose computation on GPUs (GPGPU)
  – Distributed computing, cloud computing
• Challenges:
  – Parallel loops, chunk sizes, numbers of threads, schedules
  – Arrays, pointer aliases
  – Relaxed memory models
Motivation (2/4)
A running example in C with OpenMP:

for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) \
          private(i,j) schedule(static,1) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];
    for (j = k+1; j < size; j++) {
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
    }
  }
}
Motivation (3/4)
With chunk size c and 4 threads:
Thread 1: k+1, …, k+1+c-1
Thread 2: k+1+c, …, k+1+2c-1
Thread 3: k+1+2c, …, k+1+3c-1
Thread 4: k+1+3c, …, k+1+4c-1
Thread 1: k+1+4c, …, k+1+5c-1
…

for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) \
          private(i,j) schedule(static,c) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];
    for (j = k+1; j < size; j++) {
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
    }
  }
}
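The round-robin chunk layout above can be checked with a small helper. This is an illustrative sketch of the schedule(static,c) rule only; the function name static_chunk_thread and the 0-based thread numbering are my own conventions, not part of OpenMP.

```c
#include <assert.h>

/* Under schedule(static,c) with m threads, a loop starting at iteration
 * `first` is cut into chunks of c consecutive iterations, and the chunks
 * are dealt out to the threads round-robin.  Thread t (0-based here; the
 * slide numbers threads from 1) gets chunks t, t+m, t+2m, ...           */
static int static_chunk_thread(int i, int first, int c, int m)
{
    int chunk = (i - first) / c;   /* which chunk iteration i falls in */
    return chunk % m;              /* round-robin chunk-to-thread map  */
}
```

For c = 1, m = 4 and first = k+1, iterations k+1, …, k+4 land on threads 0…3 and k+5 wraps back to thread 0, matching the table above.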
Motivation (4/4)
Many programming supports:
• forks & joins
• P-threads
• Open Multi-Processing (OpenMP)
• Thread Building Blocks
• Microsoft …
Parallel Program Correctness (1/4)
Program level, what users care about:
• Determinism: for all inputs, all executions yield the same output.
• Consistency: all executions yield the same output as the sequential execution.
• Race-freedom: parallel executions do not yield different results.
All three are seemingly equivalent at the program level, unless the sequential execution is not itself one of the parallel executions.
Parallel Program Correctness (2/4)
• Check the correctness property of each parallel region (PR).
• Correctness at all PRs implies correctness of the program.

[Figure: a program structured as a sequence of parallel regions: parallel for, parallel while, parallel for, parallel for]
Parallel Program Correctness (3/4)
In practice:
• It may be unclear what the program result is.
• Instead, properties for correctness at the PR level are usually checked:
  – determinism
  – consistency
  – race-freedom
• At the read/write (RW) schedule level, values do not count:
  – linearizability (transaction level)
Parallel Program Correctness (4/4)
linearizability (transaction level)
  ⟹ race-freedom (PR RW level)
  ⟹ determinism (PR level) = consistency (PR level)
  ⟹ race-freedom (program level) = determinism (program level) = consistency (program level)
  ⟹ program correctness
Related Work (1/4)
• Thread analyzer of Sun Studio [Lin 2008]
  – Static race detection; no arrays
• Intel Thread Checker [Petersen & Shah 2003]
  – Dynamic approach
• Instrumentation approach on client-server for race detection [Kang et al. 2009]
  – Run-time monitoring of OpenMP programs
• ompVerify [Basupalli et al. 2011]
  – Polyhedral analysis for affine control loops
Related Work in PLDI 2012 (2/4)
(no simulation as the 2nd step)
• Detecting races via liquid effects [Kawaguchi, Rondon, Bakst, Jhala]
  – Type inference for precise race detection
  – No arrays
• Speculative Linearizability [Guerraoui, Kuncak, Losa]
• Reasoning about Relaxed Programs [Carbin, Kim, Misailovic, Rinard]
• Parallelizing Top-Down Interprocedural Analysis [Albarghouthi, Kumar, Nori, Rajamani]

Related Work in PLDI 2012 (3/4)
(no simulation as the 2nd step)
• Sound and Precise Analysis of Parallel Programs through Schedule Specialization [Wu, Tang, Hu, et al.]
• Race Detection for Web Applications [Petrov, Vechev, Sridharan, Dolby]
• Concurrent Data Representation Synthesis [Hawkins, Aiken, Fisher, et al.]
• Dynamic Synthesis for Relaxed Memory Models [Liu, Nedev, Prisadnikov, et al.]

Related Work in PLDI 2012 (4/4)
(no simulation as the 2nd step)
Tools:
• Parcae [Raman, Zaks, Lee, et al.]
• Chimera [Lee, Chen, Flinn, Narayanasamy]
• Janus [Tripp, Manevich, Field, Sagiv]
• Reagents [Turon]
Methodology (1/2)
Assumptions:
• Arrays do not overlap.
• No pointers other than arrays.
• Fixed number of threads, chunk size, and scheduling policy.
  – We analyze consistency of the program implementation.
• Focus on OpenMP.
  – The techniques should be applicable to other frameworks.
• Output result prescribed by users.
Why OpenMP ?
• Complicated enough
• Practical enough
  – Parallelizes programs automatically
  – Is an industry-standard application programming interface (API)
  – Is supported by Sun Studio, Intel Parallel Studio, Visual C++, and the GNU Compiler Collection (GCC)
Methodology (2/2)
2-step program consistency checking:
  C program with OpenMP
  → Step 1: potential race analysis at the PR level (produces a potential race report)
  → Step 2: guided simulation for program consistency violations
  → end
Step 1: Potential Races at PR level
Necessary conditions as Presburger formulas:
• A race constraint between each pair of memory references to the same location by different threads.
• Solution of the pairwise constraints via Presburger formula solving.

Step 1 flow: C program with OpenMP → pairwise constraints generator → pairwise race constraints → constraint solver → satisfiable? If no: race-freedom. If yes: potential races (a truth assignment).
Potential Race Constraint
A potential race constraint = thread path condition Λ race condition
• Thread path condition
  – Necessary for a thread to access a memory location in a statement
  – Obtained by symbolic postcondition analysis
• Race condition
  – The necessary condition of an access by two threads in a parallel region
Running example
With chunk size c and 4 threads:
Thread 1: k+1, …, k+1+c-1
Thread 2: k+1+c, …, k+1+2c-1
Thread 3: k+1+2c, …, k+1+3c-1
Thread 4: k+1+3c, …, k+1+4c-1
Thread 1: k+1+4c, …, k+1+5c-1
…

for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) \
          private(i,j) schedule(static,c) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];
    for (j = k+1; j < size; j++) {
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
    }
  }
}
Thread Path Condition of L[i][k]
for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) \
          private(i,j) schedule(static,c) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];          /* the write to L[i][k] */
    for (j = k+1; j < size; j++) {
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
    }
  }
}

Thread 1 (with c = 1 and 4 threads):
  (i_t1 - (k+1)) mod 4 = 0  Λ  k+1 ≤ i_t1 < size
Thread Path Conditions of L[i-1][k]
Thread 2 (with c = 1 and 4 threads):
  (i_t2 - (k+1) - 1) mod 4 = 0  Λ  k+1 ≤ i_t2 < size  Λ  k+1 ≤ j_t2 < size

for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) \
          private(i,j) schedule(static,c) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];
    for (j = k+1; j < size; j++) {
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];   /* the read of L[i-1][k] */
    }
  }
}
Race Condition of L[i][k] & L[i-1][k]
for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) \
          private(i,j) schedule(static,c) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];
    for (j = k+1; j < size; j++) {
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
    }
  }
}

  (i_t1 - (k+1)) mod 4 = 0  Λ  k+1 ≤ i_t1 < size
Λ (i_t2 - (k+1) - 1) mod 4 = 0  Λ  k+1 ≤ i_t2 < size  Λ  k+1 ≤ j_t2 < size
Λ k = k  Λ  i_t1 = i_t2 - 1
Potential Race Constraint Solving
Potential races (Omega lib.):
  i_1 = k+1+4alpha
  i_2 = k+2+4alpha
  i_2 = i_1+1
  i_1 < size
  i_2 < size
  k+1 <= i_1
  k+1 <= i_2
  k+1 <= j_2
  j_2 < size
  i_1 in [0,), not_tight
  i_2 in [0,), not_tight

  (i_t1 - (k+1)) mod 4 = 0  Λ  k+1 ≤ i_t1 < size
Λ (i_t2 - (k+1) - 1) mod 4 = 0  Λ  k+1 ≤ i_t2 < size  Λ  k+1 ≤ j_t2 < size
Λ k = k  Λ  i_t1 = i_t2 - 1
All Presburger
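The satisfiability of the pairwise race constraint can be cross-checked by brute force for a small `size`. This enumeration (function name mine) is only a sanity-check sketch, not how Pathg or the Omega library actually solve the Presburger formula.

```c
#include <assert.h>

/* Count solutions of the pairwise race constraint from the slide, for a
 * fixed k and a small size (chunk size 1, 4 threads):
 *   (i1-(k+1)) % 4 == 0      /\ k+1 <= i1 < size   (thread 1 writes L[i1][k])
 *   (i2-(k+1)-1) % 4 == 0    /\ k+1 <= i2 < size
 *   /\ k+1 <= j2 < size                            (thread 2 reads L[i2-1][k])
 *   /\ i1 == i2 - 1                                (same array cell)          */
static int count_race_solutions(int k, int size)
{
    int n = 0;
    for (int i1 = k + 1; i1 < size; i1++)
        for (int i2 = k + 1; i2 < size; i2++)
            for (int j2 = k + 1; j2 < size; j2++)
                if ((i1 - (k + 1)) % 4 == 0 &&
                    (i2 - (k + 1) - 1) % 4 == 0 &&
                    i1 == i2 - 1)
                    n++;
    return n;
}
```

For k = 0 and size = 6, the pair i1 = 1, i2 = 2 satisfies every conjunct for any j2 in [1, 6), so the constraint is satisfiable and a potential race is reported, matching the Omega output above.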
Step 2: Guided symbolic simulation
• Program models:
  – Extended finite-state machine (EFSM)
  – Relaxed memory model
• Simulator of EFSM
  – Stepwise execution, backtracking, fixed-point detection
• Witness of program consistency violations
  – Comparison with the sequential execution result

Guided simulation flow: the C program with OpenMP and the potential races from step 1 feed a model generator, which produces the EFSM model. Simulation then loops: if consistency is violated, report a consistency violation; otherwise, if a fixed point is reached, report consistency (with benign races); otherwise keep simulating.
C Program Model Construction (1/2)
Example: #pragma omp for schedule(static,c) num_threads(m) for (x = i; x <= j; x++) S

EFSM for thread t (y is an auxiliary local variable for the chunk; t is the serial number of the thread):
  start → S:  (true)                          x = (t-1)*c + i;  y = 0;
  S → S:      (x < j  Λ  y < c-1)             x++;  y++;
  S → S:      (x-y+m*c ≤ j  Λ  y = c-1)       x = x-y+m*c;  y = 0;
  S → stop:   (x-y+m*c > j  Λ  y = c-1)
  S → stop:   (x > j)
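The EFSM's chunked iteration pattern can be replayed directly. The sketch below (function name and conventions mine) enumerates, for one thread, the x values at which the body S would fire; it implements the intended schedule(static,c) semantics of the transitions rather than the literal guard syntax.

```c
#include <assert.h>

/* Enumerate the loop indices executed by thread t (1-based, as on the
 * slide) for  for(x=i; x<=j; x++)  under schedule(static,c) with m
 * threads: x starts at (t-1)*c + i, y counts the position inside the
 * current chunk, and at a chunk boundary x jumps ahead by m*c - y.
 * Indices are written to out[] (up to max_out); the count is returned. */
static int efsm_iterations(int t, int c, int m, int i, int j,
                           int *out, int max_out)
{
    int n = 0;
    int x = (t - 1) * c + i, y = 0;
    while (x <= j) {
        if (n < max_out) out[n] = x;      /* body S runs with this x   */
        n++;
        if (y < c - 1) { x++; y++; }      /* stay inside the chunk     */
        else if (x - y + m * c <= j) {    /* jump to the next chunk    */
            x = x - y + m * c; y = 0;
        } else break;                     /* no further chunk: stop    */
    }
    return n;
}
```

With t = 1, c = 1, m = 4, i = 1, j = 10, the thread executes x = 1, 5, 9, reproducing the round-robin table of the running example.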
C Program Model Construction (2/2)

To model races in a C statement y = f(x1, x2, …, xn):
• Assume it reads x1, x2, …, xn in order.
  – Other orders can also be modeled.
• Translate it into the following n+1 EFSM transitions:
    a1 = x1;  a2 = x2;  …;  an = xn;  y = f(a1, …, an);
• a1, a2, …, an are auxiliary variables in the EFSM.
Relaxed Memory Models
• Out-of-order execution of memory accesses for hardware efficiency
  – Local caches, multiprocessors
  – For customized synchronizations, controlled races
• May lead to unexpected results.

A classical example, initially x = 0, y = 0:
  thread 1: x = 1;        thread 2: y = 1;
            z = y;                  w = x;
  assert z = 1 ∨ w = 1
Relaxed Memory Models
A classical example, initially x = 0, y = 0:
  thread 1: x = 1; z = y;      thread 2: y = 1; w = x;
  assert z = 1 ∨ w = 1

With a cache per thread between the threads and memory (stores and loads go through cache 1 and cache 2), the assertion can fail:
  x.c1 = 1;  y.c2 = 1;
  load(w.c2, x);  load(z.c1, y);      (both loads still read 0 from memory)
  store(x.c1);  x = x.c1;
  store(y.c2);  y = y.c2;
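The failing trace can be replayed deterministically by making the store buffers explicit. The sketch below hard-codes one such legal schedule (all names are mine); it is the intuition behind the example, not the paper's EFSM encoding.

```c
/* Replay one store-buffer schedule of the classical example: each
 * thread's store is parked in its private buffer while the other
 * thread's load reads main memory, so both loads return the initial
 * zeros; the buffers drain to memory afterwards.                     */
static void tso_store_buffer_schedule(int *x, int *y, int *z, int *w)
{
    int mem_x = 0, mem_y = 0;   /* main memory, initially x = y = 0   */
    int buf1_x, buf2_y;         /* per-thread store buffers           */

    buf1_x = 1;                 /* thread 1: x = 1 (buffered, unseen) */
    buf2_y = 1;                 /* thread 2: y = 1 (buffered)         */
    *z = mem_y;                 /* thread 1: z = y  -> reads 0        */
    *w = mem_x;                 /* thread 2: w = x  -> reads 0        */
    mem_x = buf1_x;             /* store buffers drain to memory      */
    mem_y = buf2_y;
    *x = mem_x;
    *y = mem_y;
}
```

The outcome z = w = 0 violates the assertion even though both stores eventually reach memory.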
Relaxed Memory Models
Total store order (TSO)
• From SPARC
• Adapted to the Intel 80x86 series
• Description:
  – Local reads can use pending writes in the local store buffer.
  – Local stores must be FIFO.
• Problem: peer reads are not aware of the local pending writes.
Modeling TSO w. m threads (1/4)
• An array x[0..m] for each shared variable x
  – x[0] is the memory copy.
  – x[i] is the cache copy of x of thread i ∈ [1, m].
  – x now becomes an address variable instead of the value variable for x.
Modeling TSO w. m threads (2/4)
• An array ls[0..n] of objects for the load-store (LS) buffer of size n+1
  – ls_st[k]: status of load-store buffer cell k
      0: not used, 1: load, 2: store
  – ls_th[k]: thread that uses load-store buffer cell k
  – ls_dst[k], ls_src[k]: destination and source addresses
  – ls_value[k]: value to store
• A single shared LS array is purely for convenience. It can be changed to m load-store buffers, one per thread, but then the mapping from threads to cores must be known.
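The LS array and its "compact" step can be sketched in plain C. The field names follow the slide; the buffer size, the enqueue helper, and its behavior on a full buffer are my own assumptions for illustration.

```c
#define LS_SIZE 8
enum { LS_IDLE = 0, LS_LOAD = 1, LS_STORE = 2 };

/* One shared array of load-store (LS) objects, as on the slide.       */
static int ls_st[LS_SIZE];     /* 0: not used, 1: load, 2: store       */
static int ls_th[LS_SIZE];     /* owning thread, 0 when idle           */
static int ls_dst[LS_SIZE], ls_src[LS_SIZE], ls_value[LS_SIZE];

/* Enqueue a store by thread j into the smallest idle LS cell; returns
 * the cell index, or -1 if the buffer is full.  Filling cells from the
 * low end keeps the pending stores in FIFO order, as TSO requires.    */
static int ls_enqueue_store(int j, int dst, int value)
{
    for (int k = 0; k < LS_SIZE; k++)
        if (ls_st[k] == LS_IDLE) {
            ls_st[k] = LS_STORE; ls_th[k] = j;
            ls_dst[k] = dst; ls_value[k] = value;
            return k;
        }
    return -1;
}

/* "compact LS array": after the head cell retires, shift the remaining
 * cells down one slot so the FIFO order of pending stores is kept.    */
static void ls_compact(void)
{
    for (int k = 0; k + 1 < LS_SIZE; k++) {
        ls_st[k] = ls_st[k+1]; ls_th[k] = ls_th[k+1];
        ls_dst[k] = ls_dst[k+1]; ls_src[k] = ls_src[k+1];
        ls_value[k] = ls_value[k+1];
    }
    ls_st[LS_SIZE-1] = LS_IDLE; ls_th[LS_SIZE-1] = 0;
}
```

After two stores are enqueued in cells 0 and 1, one ls_compact() drops the head and moves the second store into cell 0, which is exactly the "compact LS array" action in the transition tables below.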
Modeling TSO w. m threads (3/4)
Load a ← x by thread j ('a' is private).

With a pending write (PW) to x (Q must be the largest pending-write LS object):
  Step 1. Thread j: !load@Q   ls_src@(Q) = &x; ls_dst = &a;
          LS Q:     ?load@j   ls_th = j; ls_status = 1;
  Step 2. Thread j: ?load_finish
          LS Q:     !load_finish@(ls_th)   ls_dst[0] = ls_value; ls_th = 0; ls_status = 0; compact LS array;

With no pending write (Q must be the smallest idle LS object):
  Step 1. Thread j: !load@Q   ls_src@(Q) = &x; ls_dst = &a;
          LS Q:     ?load@j   ls_th = j; ls_status = 1;
  Step 2. Thread j: ?load_finish
          LS Q:     !load_finish@(ls_th)   ls_dst[0] = ls_src[0]; ls_th = 0; ls_status = 0; compact LS array;
Modeling TSO w. m threads (4/4)
Store a → x by thread j ('a' is private).
  Step 1. Thread j: !store@Q   ls_dst@(Q) = &x; ls_value = a;
          LS Q (must be the smallest idle LS object): ?store@j   ls_th = j; ls_status = 2;
  Step 2. LS Q: ls_dst[0] = ls_value; ls_th = 0; ls_status = 0; compact LS array;
Guided Simulation
• For each pairwise race-condition truth assignment, perform a simulation session.
• Use a stack to explore the simulation paths.
• Explore all paths compatible with the truth assignment.
• Check consistency at the end of each path.
• Mark benign races.
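A deliberately tiny stand-in for that check: the real simulator pushes EFSM states on a stack and backtracks over many micro-steps, whereas the sketch below (all names mine) enumerates just the two interleavings of one racy write/read pair and compares each outcome with the sequential result.

```c
/* Thread A performs the write (the L[i][k] assignment of the running
 * example); thread B performs the read (the L[i-1][k] use).  In the
 * sequential execution A's write happens before B's read, so the
 * sequential result is new_val.  Count the interleavings whose result
 * differs from the sequential one.                                    */
static int count_inconsistent_paths(int old_val, int new_val)
{
    int orders[2][2] = { {0, 1},     /* A's write, then B's read */
                         {1, 0} };   /* B's read, then A's write */
    int sequential = new_val;        /* sequential order: write, read  */
    int bad = 0;
    for (int p = 0; p < 2; p++) {
        int L = old_val, seen = 0;
        for (int s = 0; s < 2; s++) {
            if (orders[p][s] == 0) L = new_val;   /* A writes L       */
            else seen = L;                        /* B reads L        */
        }
        if (seen != sequential) bad++;   /* consistency check at path end */
    }
    return bad;
}
```

When both orders produce the sequential result (old_val == new_val), no path is inconsistent: that is the situation marked as a benign race in the last bullet.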
Implementation
Pathg (path generator)
• Potential race constraint solving
  – Presburger Omega library
• Model construction:
  – REDLIB for EFSMs with synchronizations, arrays, variable declarations, address arithmetic
• Guided EFSM simulation
  – REDLIB semi-symbolic simulator
  – Step, backtrack, check fixpoint/consistency
Implementation: Guided Symbolic Simulation

[Figure: the sequential execution (golden model, master thread) and the guided multi-threaded simulation (master thread plus parallel tasks 1-3) each produce a memory accessing sequence of reads and writes such as Read: L[2][1], Write: L[2][1], …; the two outputs are then compared.]
Implementation: Potential Race Report

===tg:i_4,i_1===
==tw:i_4
Race:: L[5][1]
===tg:i_3,i_4===
==tw:i_3
Race:: L[4][1]
===tg:i_2,i_3===
==tw:i_2
Race:: L[3][1]
===tg:i_1,i_2===
==tw:i_1
Race:: L[2][1]

• tg indicates the threads involved in the race.
• tw indicates the thread that WRITEs the memory address.
• Race is where the race condition is.
• We enumerate variables to limit the solutions.
Experiments
• Environment
  – Ubuntu 9.10, 64-bit
  – Intel Core i5-760 (2.8 GHz), 2 GB RAM
• Benchmarks
  – OpenMP Source Code Repository (OmpSCR)
  – NAS Parallel Benchmarks (NPB)
Constraint Solving of OmpSCR
Bug v1: races manually introduced (between any two threads dealing with the consecutive iterations)
Bug v2: rare races introduced (only between two specific threads on a particular shared memory)
Fixed: a barrier statement manually inserted (removes the race in Bug v2)

            | Original           | Bug v1             | Bug v2             | Fixed
Benchmark   | #Const #Sat Time   | #Const #Sat Time   | #Const #Sat Time   | #Const #Sat Time
c_lu.c      |   71    0   0.18s  |  629   29   1.810s |  935   30   4.110s |  935    0   5.15s
c_ja01.c    |   95    0   0.39s  |   95    8   0.42s  |  155    1   0.75s  |   95    0   0.77s
c_ja02.c    |   95    0   0.03s  |   95    8   0.35s  |  155    1   0.67s  |   95    0   1.03s
c_loopA.c   |   17    0   0.04s  |   47    4   0.07s  |   95    1   0.32s  |   17    0   0.84s
c_loopB.c   |   17    0   0.03s  |   29    4   0.08s  |   95    1   0.15s  |   17    0   1.13s
c_md.c      |   65    0   0.25s  |   77    4   0.30s  |  131    1   0.53s  |   65    0   1.25s
Symbolic Simulation of OmpSCR
• Blind simulation needs to explore (much) more traces to hit a consistency violation.
• Standard OpenMP tools fail to report races of these benchmarks.

               | Guided simulation  | Random simulation   | Sun Studio | Intel Thread Checker
Benchmark      | #Traces   Time     | #Traces   Time      | race       | races/total
c_lu_bug1      |    1      23.35s   |   25.3    52.11s    | N          | 4/10
c_lu_bug2      |    1      23.22s   |  178.9   110.58s    | N          | 1/10
c_ja01_bug1    |    1       6.65s   |   10.6    26.60s    | N          | 4/10
c_ja01_bug2    |    1      13.91s   |   42.1    58.16s    | N          | 3/10
c_ja02_bug1    |    1      14.86s   |   25      28.83s    | N          | 2/10
c_ja02_bug2    |    1      15.19s   |   41.3    52.25s    | N          | 2/10
c_loopA_bug1   |    1      10.76s   |   11.7    36.82s    | N          | 3/10
c_loopA_bug2   |    1      56.86s   |   27.6    98.40s    | N          | 2/10
c_loopB_bug1   |    1      14.54s   |    9.4    29.58s    | N          | 2/10
c_loopB_bug2   |    1      41.50s   |   38.6    66.48s    | N          | 2/10
c_md_bug1      |    1      12.19s   |   10.4    26.21s    | N          | 3/10
c_md_bug2      |    1      19.38s   |   44.3    83.52s    | N          | 2/10
NAS Parallel Benchmarks
• Middle-size benchmarks (1200+ to 3500+ LOC)
• Efficient race constraint solving
  – e.g., 150000+ race constraints solved in 38 minutes by the Omega library
• Rarely satisfiable constraints
  – e.g., 8 of 85067 constraints of nas_lu.c

Benchmark | #loc | #Access | #Const. | #Sat | Time
nas_lu.c  | 3481 | 13736   |  85067  |  8   | 27m30.37s
bt.c      | 3616 | 15916   | 157047  |  0   | 37m33.32s
mg.c      | 1250 |  4636   |   2269  |  0   | 0m17.19s
sp.c      | 2983 | 13604   |  45209  |  0   | 4m0.32s
nas_lu.c
• Slice the program to the segment of the parallel region with satisfiable race conditions.
• Construct the symbolic model of the sliced segment:
  – 35 modes (EFSM)
  – Reaches the fixed point without a consistency violation after 205 steps and 16.93 s
• Benign races
  – All of them are used as mutual-exclusion semaphores.
  – nas_lu.c is consistent.
Conclusion
• Static analysis of program consistency
  – For real C/C++ programs with OpenMP directives
• Highly automated solution
  – Constraint solving
  – Symbolic simulation
• High precision: relaxed memory models
• High efficiency
• Future work: extension to TBB and other memory models? Partial order reduction?