Symbolic Program Consistency Checking of OpenMP Parallel
Programs with Relaxed Memory Models
Fang Yu
National Chengchi University
Shun-Ching Yang
Guan-Cheng Chen
Che-Chang Chan
National Taiwan University
Based on an LCTES 2012 paper.
Farn Wang
National Taiwan University
& Academia Sinica
Outline
• Introduction
  – Motivation
  – Parallel program correctness
  – Related work
• 2-step program consistency checking
  – Step 1: Static race constraint solution
  – Step 2: Guided simulation
• Extended finite-state machine (EFSM), relaxed memory models
• Implementation
• Experiments
• Conclusion
Motivation (1/4)
• Parallel programming
  – Multi-cores
  – General-purpose computation on GPUs (GPGPU)
  – Distributed computing, cloud computing
• Challenges:
  – Parallel loops, chunk sizes, numbers of threads, schedules
  – Arrays, pointer aliases
  – Relaxed memory models
Motivation (2/4)
A running example in C with OpenMP:

for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) \
          private(i,j) schedule(static,1) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];
    for (j = k+1; j < size; j++) {
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
    }
  }
}
Motivation (3/4)
With chunk size c and 4 threads:
Thread 1: k+1, …, k+1+c-1
Thread 2: k+1+c, …, k+1+2c-1
Thread 3: k+1+2c, …, k+1+3c-1
Thread 4: k+1+3c, …, k+1+4c-1
Thread 1: k+1+4c, …, k+1+5c-1
…

for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) \
          private(i,j) schedule(static,c) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];
    for (j = k+1; j < size; j++) {
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
    }
  }
}
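The round-robin chunk layout above can be checked with a small helper. This is an illustrative sketch of the schedule(static,c) rule only; the function name static_chunk_thread and the 0-based thread numbering are my own conventions, not part of OpenMP.

```c
#include <assert.h>

/* Under schedule(static,c) with m threads, a loop starting at iteration
 * `first` is cut into chunks of c consecutive iterations, and the chunks
 * are dealt out to the threads round-robin.  Thread t (0-based here; the
 * slide numbers threads from 1) gets chunks t, t+m, t+2m, ...           */
static int static_chunk_thread(int i, int first, int c, int m)
{
    int chunk = (i - first) / c;   /* which chunk iteration i falls in */
    return chunk % m;              /* round-robin chunk-to-thread map  */
}
```

For c = 1, m = 4 and first = k+1, iterations k+1, …, k+4 land on threads 0…3 and k+5 wraps back to thread 0, matching the table above.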
Motivation (4/4)
Many programming supports:
• forks & joins
• P-threads
• Open Multi-Processing (OpenMP)
• Thread Building Blocks
• Microsoft …
Parallel Program Correctness (1/4)
Program level, what users care about:
• Determinism: for all inputs, all executions yield the same output.
• Consistency: all executions yield the same output as the sequential execution.
• Race-freedom: parallel executions do not yield different results.
All three are seemingly equivalent at the program level, unless the sequential execution is not itself one of the parallel executions.
Parallel Program Correctness (2/4)
• Check the correctness property of each parallel region (PR).
• Correctness at all PRs implies correctness of the program.

[Figure: a program structured as a sequence of parallel regions: parallel for, parallel while, parallel for, parallel for]
Parallel Program Correctness (3/4)
In practice:
• It may be unclear what the program result is.
• Instead, properties for correctness at the PR level are usually checked:
  – determinism
  – consistency
  – race-freedom
• At the read/write (RW) schedule level, values do not count:
  – linearizability (transaction level)
Parallel Program Correctness (4/4)
linearizability (transaction level)
  ⟹ race-freedom (PR RW level)
  ⟹ determinism (PR level) = consistency (PR level)
  ⟹ race-freedom (program level) = determinism (program level) = consistency (program level)
  ⟹ program correctness
Related Work (1/4)
• Thread analyzer of Sun Studio [Lin 2008]
  – Static race detection; no arrays
• Intel Thread Checker [Petersen & Shah 2003]
  – Dynamic approach
• Instrumentation approach on client-server for race detection [Kang et al. 2009]
  – Run-time monitoring of OpenMP programs
• ompVerify [Basupalli et al. 2011]
  – Polyhedral analysis for affine control loops
Related Work in PLDI 2012 (2/4)
(no simulation as the 2nd step)
• Detecting races via liquid effects [Kawaguchi, Rondon, Bakst, Jhala]
  – Type inference for precise race detection
  – No arrays
• Speculative Linearizability [Guerraoui, Kuncak, Losa]
• Reasoning about Relaxed Programs [Carbin, Kim, Misailovic, Rinard]
• Parallelizing Top-Down Interprocedural Analysis [Albarghouthi, Kumar, Nori, Rajamani]

Related Work in PLDI 2012 (3/4)
(no simulation as the 2nd step)
• Sound and Precise Analysis of Parallel Programs through Schedule Specialization [Wu, Tang, Hu, et al.]
• Race Detection for Web Applications [Petrov, Vechev, Sridharan, Dolby]
• Concurrent Data Representation Synthesis [Hawkins, Aiken, Fisher, et al.]
• Dynamic Synthesis for Relaxed Memory Models [Liu, Nedev, Prisadnikov, et al.]

Related Work in PLDI 2012 (4/4)
(no simulation as the 2nd step)
Tools:
• Parcae [Raman, Zaks, Lee, et al.]
• Chimera [Lee, Chen, Flinn, Narayanasamy]
• Janus [Tripp, Manevich, Field, Sagiv]
• Reagents [Turon]
Methodology (1/2)
Assumptions:
• Arrays do not overlap.
• No pointers other than arrays.
• Fixed number of threads, chunk size, and scheduling policy.
  – We analyze consistency of the program implementation.
• Focus on OpenMP.
  – The techniques should be applicable to other frameworks.
• Output result prescribed by users.
Why OpenMP ?
• Complicated enough
• Practical enough
  – Parallelizes programs automatically
  – Is an industry-standard application programming interface (API)
  – Is supported by Sun Studio, Intel Parallel Studio, Visual C++, and the GNU Compiler Collection (GCC)
Methodology (2/2)
2-step program consistency checking:
  C program with OpenMP
  → Step 1: potential race analysis at the PR level (produces a potential race report)
  → Step 2: guided simulation for program consistency violations
  → end
Step 1: Potential Races at PR level
Necessary conditions as Presburger formulas:
• A race constraint between each pair of memory references to the same location by different threads.
• Solution of the pairwise constraints via Presburger formula solving.

Step 1 flow: C program with OpenMP → pairwise constraints generator → pairwise race constraints → constraint solver → satisfiable? If no: race-freedom. If yes: potential races (a truth assignment).
Potential Race Constraint
A potential race constraint = thread path condition Λ race condition
• Thread path condition
  – Necessary for a thread to access a memory location in a statement
  – Obtained by symbolic postcondition analysis
• Race condition
  – The necessary condition of an access by two threads in a parallel region
Running example
With chunk size c and 4 threads:
Thread 1: k+1, …, k+1+c-1
Thread 2: k+1+c, …, k+1+2c-1
Thread 3: k+1+2c, …, k+1+3c-1
Thread 4: k+1+3c, …, k+1+4c-1
Thread 1: k+1+4c, …, k+1+5c-1
…

for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) \
          private(i,j) schedule(static,c) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];
    for (j = k+1; j < size; j++) {
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
    }
  }
}
Thread Path Condition of L[i][k]
for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) \
          private(i,j) schedule(static,c) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];          /* the write to L[i][k] */
    for (j = k+1; j < size; j++) {
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
    }
  }
}

Thread 1 (with c = 1 and 4 threads):
  (i_t1 - (k+1)) mod 4 = 0  Λ  k+1 ≤ i_t1 < size
Thread Path Conditions of L[i-1][k]
Thread 2 (with c = 1 and 4 threads):
  (i_t2 - (k+1) - 1) mod 4 = 0  Λ  k+1 ≤ i_t2 < size  Λ  k+1 ≤ j_t2 < size

for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) \
          private(i,j) schedule(static,c) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];
    for (j = k+1; j < size; j++) {
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];   /* the read of L[i-1][k] */
    }
  }
}
Race Condition of L[i][k] & L[i-1][k]
for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) \
          private(i,j) schedule(static,c) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];
    for (j = k+1; j < size; j++) {
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
    }
  }
}

  (i_t1 - (k+1)) mod 4 = 0  Λ  k+1 ≤ i_t1 < size
Λ (i_t2 - (k+1) - 1) mod 4 = 0  Λ  k+1 ≤ i_t2 < size  Λ  k+1 ≤ j_t2 < size
Λ k = k  Λ  i_t1 = i_t2 - 1
Potential Race Constraint Solving
Potential races (Omega lib.):
  i_1 = k+1+4alpha
  i_2 = k+2+4alpha
  i_2 = i_1+1
  i_1 < size
  i_2 < size
  k+1 <= i_1
  k+1 <= i_2
  k+1 <= j_2
  j_2 < size
  i_1 in [0,), not_tight
  i_2 in [0,), not_tight

  (i_t1 - (k+1)) mod 4 = 0  Λ  k+1 ≤ i_t1 < size
Λ (i_t2 - (k+1) - 1) mod 4 = 0  Λ  k+1 ≤ i_t2 < size  Λ  k+1 ≤ j_t2 < size
Λ k = k  Λ  i_t1 = i_t2 - 1
All Presburger
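The satisfiability of the pairwise race constraint can be cross-checked by brute force for a small `size`. This enumeration (function name mine) is only a sanity-check sketch, not how Pathg or the Omega library actually solve the Presburger formula.

```c
#include <assert.h>

/* Count solutions of the pairwise race constraint from the slide, for a
 * fixed k and a small size (chunk size 1, 4 threads):
 *   (i1-(k+1)) % 4 == 0      /\ k+1 <= i1 < size   (thread 1 writes L[i1][k])
 *   (i2-(k+1)-1) % 4 == 0    /\ k+1 <= i2 < size
 *   /\ k+1 <= j2 < size                            (thread 2 reads L[i2-1][k])
 *   /\ i1 == i2 - 1                                (same array cell)          */
static int count_race_solutions(int k, int size)
{
    int n = 0;
    for (int i1 = k + 1; i1 < size; i1++)
        for (int i2 = k + 1; i2 < size; i2++)
            for (int j2 = k + 1; j2 < size; j2++)
                if ((i1 - (k + 1)) % 4 == 0 &&
                    (i2 - (k + 1) - 1) % 4 == 0 &&
                    i1 == i2 - 1)
                    n++;
    return n;
}
```

For k = 0 and size = 6, the pair i1 = 1, i2 = 2 satisfies every conjunct for any j2 in [1, 6), so the constraint is satisfiable and a potential race is reported, matching the Omega output above.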
Step 2: Guided symbolic simulation
• Program models:
  – Extended finite-state machine (EFSM)
  – Relaxed memory model
• Simulator of EFSM
  – Stepwise execution, backtracking, fixed-point detection
• Witness of program consistency violations
  – Comparison with the sequential execution result

Guided simulation flow: the C program with OpenMP and the potential races from step 1 feed a model generator, which produces the EFSM model. Simulation then loops: if consistency is violated, report a consistency violation; otherwise, if a fixed point is reached, report consistency (with benign races); otherwise keep simulating.
C Program Model Construction (1/2)
Example: #pragma omp for schedule(static,c) num_threads(m) for (x = i; x <= j; x++) S

EFSM for thread t (y is an auxiliary local variable for the chunk; t is the serial number of the thread):
  start → S:  (true)                          x = (t-1)*c + i;  y = 0;
  S → S:      (x < j  Λ  y < c-1)             x++;  y++;
  S → S:      (x-y+m*c ≤ j  Λ  y = c-1)       x = x-y+m*c;  y = 0;
  S → stop:   (x-y+m*c > j  Λ  y = c-1)
  S → stop:   (x > j)
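The EFSM's chunked iteration pattern can be replayed directly. The sketch below (function name and conventions mine) enumerates, for one thread, the x values at which the body S would fire; it implements the intended schedule(static,c) semantics of the transitions rather than the literal guard syntax.

```c
#include <assert.h>

/* Enumerate the loop indices executed by thread t (1-based, as on the
 * slide) for  for(x=i; x<=j; x++)  under schedule(static,c) with m
 * threads: x starts at (t-1)*c + i, y counts the position inside the
 * current chunk, and at a chunk boundary x jumps ahead by m*c - y.
 * Indices are written to out[] (up to max_out); the count is returned. */
static int efsm_iterations(int t, int c, int m, int i, int j,
                           int *out, int max_out)
{
    int n = 0;
    int x = (t - 1) * c + i, y = 0;
    while (x <= j) {
        if (n < max_out) out[n] = x;      /* body S runs with this x   */
        n++;
        if (y < c - 1) { x++; y++; }      /* stay inside the chunk     */
        else if (x - y + m * c <= j) {    /* jump to the next chunk    */
            x = x - y + m * c; y = 0;
        } else break;                     /* no further chunk: stop    */
    }
    return n;
}
```

With t = 1, c = 1, m = 4, i = 1, j = 10, the thread executes x = 1, 5, 9, reproducing the round-robin table of the running example.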
C Program Model Construction (2/2)

To model races in a C statement y = f(x1, x2, …, xn):
• Assume it reads x1, x2, …, xn in order.
  – Other orders can also be modeled.
• Translate it into the following n+1 EFSM transitions:
    a1 = x1;  a2 = x2;  …;  an = xn;  y = f(a1, …, an);
• a1, a2, …, an are auxiliary variables in the EFSM.
Relaxed Memory Models
• Out-of-order execution of memory accesses for hardware efficiency
  – Local caches, multiprocessors
  – For customized synchronizations, controlled races
• May lead to unexpected results.

A classical example, initially x = 0, y = 0:
  thread 1: x = 1;        thread 2: y = 1;
            z = y;                  w = x;
  assert z = 1 ∨ w = 1
Relaxed Memory Models
A classical example, initially x = 0, y = 0:
  thread 1: x = 1; z = y;      thread 2: y = 1; w = x;
  assert z = 1 ∨ w = 1

With a cache per thread between the threads and memory (stores and loads go through cache 1 and cache 2), the assertion can fail:
  x.c1 = 1;  y.c2 = 1;
  load(w.c2, x);  load(z.c1, y);      (both loads still read 0 from memory)
  store(x.c1);  x = x.c1;
  store(y.c2);  y = y.c2;
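The failing trace can be replayed deterministically by making the store buffers explicit. The sketch below hard-codes one such legal schedule (all names are mine); it is the intuition behind the example, not the paper's EFSM encoding.

```c
/* Replay one store-buffer schedule of the classical example: each
 * thread's store is parked in its private buffer while the other
 * thread's load reads main memory, so both loads return the initial
 * zeros; the buffers drain to memory afterwards.                     */
static void tso_store_buffer_schedule(int *x, int *y, int *z, int *w)
{
    int mem_x = 0, mem_y = 0;   /* main memory, initially x = y = 0   */
    int buf1_x, buf2_y;         /* per-thread store buffers           */

    buf1_x = 1;                 /* thread 1: x = 1 (buffered, unseen) */
    buf2_y = 1;                 /* thread 2: y = 1 (buffered)         */
    *z = mem_y;                 /* thread 1: z = y  -> reads 0        */
    *w = mem_x;                 /* thread 2: w = x  -> reads 0        */
    mem_x = buf1_x;             /* store buffers drain to memory      */
    mem_y = buf2_y;
    *x = mem_x;
    *y = mem_y;
}
```

The outcome z = w = 0 violates the assertion even though both stores eventually reach memory.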
Relaxed Memory Models
Total store order (TSO)
• From SPARC
• Adapted to the Intel 80x86 series
• Description:
  – Local reads can use pending writes in the local store buffer.
  – Local stores must be FIFO.
• Problem: peer reads are not aware of the local pending writes.
Modeling TSO w. m threads (1/4)
• An array x[0..m] for each shared variable x
  – x[0] is the memory copy.
  – x[i] is the cache copy of x of thread i ∈ [1, m].
  – x now becomes an address variable instead of the value variable for x.
Modeling TSO w. m threads (2/4)
• An array ls[0..n] of objects for the load-store (LS) buffer of size n+1
  – ls_st[k]: status of load-store buffer cell k
      0: not used, 1: load, 2: store
  – ls_th[k]: thread that uses load-store buffer cell k
  – ls_dst[k], ls_src[k]: destination and source addresses
  – ls_value[k]: value to store
• A single shared LS array is purely for convenience. It can be changed to m load-store buffers, one per thread, but then the mapping from threads to cores must be known.
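The LS array and its "compact" step can be sketched in plain C. The field names follow the slide; the buffer size, the enqueue helper, and its behavior on a full buffer are my own assumptions for illustration.

```c
#define LS_SIZE 8
enum { LS_IDLE = 0, LS_LOAD = 1, LS_STORE = 2 };

/* One shared array of load-store (LS) objects, as on the slide.       */
static int ls_st[LS_SIZE];     /* 0: not used, 1: load, 2: store       */
static int ls_th[LS_SIZE];     /* owning thread, 0 when idle           */
static int ls_dst[LS_SIZE], ls_src[LS_SIZE], ls_value[LS_SIZE];

/* Enqueue a store by thread j into the smallest idle LS cell; returns
 * the cell index, or -1 if the buffer is full.  Filling cells from the
 * low end keeps the pending stores in FIFO order, as TSO requires.    */
static int ls_enqueue_store(int j, int dst, int value)
{
    for (int k = 0; k < LS_SIZE; k++)
        if (ls_st[k] == LS_IDLE) {
            ls_st[k] = LS_STORE; ls_th[k] = j;
            ls_dst[k] = dst; ls_value[k] = value;
            return k;
        }
    return -1;
}

/* "compact LS array": after the head cell retires, shift the remaining
 * cells down one slot so the FIFO order of pending stores is kept.    */
static void ls_compact(void)
{
    for (int k = 0; k + 1 < LS_SIZE; k++) {
        ls_st[k] = ls_st[k+1]; ls_th[k] = ls_th[k+1];
        ls_dst[k] = ls_dst[k+1]; ls_src[k] = ls_src[k+1];
        ls_value[k] = ls_value[k+1];
    }
    ls_st[LS_SIZE-1] = LS_IDLE; ls_th[LS_SIZE-1] = 0;
}
```

After two stores are enqueued in cells 0 and 1, one ls_compact() drops the head and moves the second store into cell 0, which is exactly the "compact LS array" action in the transition tables below.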
Modeling TSO w. m threads (3/4)
Load a ← x by thread j ('a' is private).

With a pending write (PW) to x (Q must be the largest pending-write LS object):
  Step 1. Thread j: !load@Q   ls_src@(Q) = &x; ls_dst = &a;
          LS Q:     ?load@j   ls_th = j; ls_status = 1;
  Step 2. Thread j: ?load_finish
          LS Q:     !load_finish@(ls_th)   ls_dst[0] = ls_value; ls_th = 0; ls_status = 0; compact LS array;

With no pending write (Q must be the smallest idle LS object):
  Step 1. Thread j: !load@Q   ls_src@(Q) = &x; ls_dst = &a;
          LS Q:     ?load@j   ls_th = j; ls_status = 1;
  Step 2. Thread j: ?load_finish
          LS Q:     !load_finish@(ls_th)   ls_dst[0] = ls_src[0]; ls_th = 0; ls_status = 0; compact LS array;
Modeling TSO w. m threads (4/4)
Store a → x by thread j ('a' is private).
  Step 1. Thread j: !store@Q   ls_dst@(Q) = &x; ls_value = a;
          LS Q (must be the smallest idle LS object): ?store@j   ls_th = j; ls_status = 2;
  Step 2. LS Q: ls_dst[0] = ls_value; ls_th = 0; ls_status = 0; compact LS array;
Guided Simulation
• For each pairwise race-condition truth assignment, perform a simulation session.
• Use a stack to explore the simulation paths.
• Explore all paths compatible with the truth assignment.
• Check consistency at the end of each path.
• Mark benign races.
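A deliberately tiny stand-in for that check: the real simulator pushes EFSM states on a stack and backtracks over many micro-steps, whereas the sketch below (all names mine) enumerates just the two interleavings of one racy write/read pair and compares each outcome with the sequential result.

```c
/* Thread A performs the write (the L[i][k] assignment of the running
 * example); thread B performs the read (the L[i-1][k] use).  In the
 * sequential execution A's write happens before B's read, so the
 * sequential result is new_val.  Count the interleavings whose result
 * differs from the sequential one.                                    */
static int count_inconsistent_paths(int old_val, int new_val)
{
    int orders[2][2] = { {0, 1},     /* A's write, then B's read */
                         {1, 0} };   /* B's read, then A's write */
    int sequential = new_val;        /* sequential order: write, read  */
    int bad = 0;
    for (int p = 0; p < 2; p++) {
        int L = old_val, seen = 0;
        for (int s = 0; s < 2; s++) {
            if (orders[p][s] == 0) L = new_val;   /* A writes L       */
            else seen = L;                        /* B reads L        */
        }
        if (seen != sequential) bad++;   /* consistency check at path end */
    }
    return bad;
}
```

When both orders produce the sequential result (old_val == new_val), no path is inconsistent: that is the situation marked as a benign race in the last bullet.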
Implementation
Pathg (path generator)
• Potential race constraint solving
  – Presburger Omega library
• Model construction:
  – REDLIB for EFSMs with synchronizations, arrays, variable declarations, address arithmetic
• Guided EFSM simulation
  – REDLIB semi-symbolic simulator
  – Step, backtrack, check fixpoint/consistency
Implementation: Guided Symbolic Simulation

[Figure: the sequential execution (golden model, master thread) and the guided multi-threaded simulation (master thread plus parallel tasks 1-3) each produce a memory accessing sequence of reads and writes such as Read: L[2][1], Write: L[2][1], …; the two outputs are then compared.]
Implementation: Potential Race Report

===tg:i_4,i_1===
==tw:i_4
Race:: L[5][1]
===tg:i_3,i_4===
==tw:i_3
Race:: L[4][1]
===tg:i_2,i_3===
==tw:i_2
Race:: L[3][1]
===tg:i_1,i_2===
==tw:i_1
Race:: L[2][1]

• tg indicates the threads involved in the race.
• tw indicates the thread that WRITEs the memory address.
• Race is where the race condition is.
• We enumerate variables to limit the solutions.
Experiments
• Environment
  – Ubuntu 9.10, 64-bit
  – Intel Core i5-760 (2.8 GHz), 2 GB RAM
• Benchmarks
  – OpenMP Source Code Repository (OmpSCR)
  – NAS Parallel Benchmarks (NPB)
Constraint Solving of OmpSCR
Bug v1: races manually introduced (between any two threads dealing with the consecutive iterations)
Bug v2: rare races introduced (only between two specific threads on a particular shared memory)
Fixed: a barrier statement manually inserted (removes the race in Bug v2)

            | Original           | Bug v1             | Bug v2             | Fixed
Benchmark   | #Const #Sat Time   | #Const #Sat Time   | #Const #Sat Time   | #Const #Sat Time
c_lu.c      |   71    0   0.18s  |  629   29   1.810s |  935   30   4.110s |  935    0   5.15s
c_ja01.c    |   95    0   0.39s  |   95    8   0.42s  |  155    1   0.75s  |   95    0   0.77s
c_ja02.c    |   95    0   0.03s  |   95    8   0.35s  |  155    1   0.67s  |   95    0   1.03s
c_loopA.c   |   17    0   0.04s  |   47    4   0.07s  |   95    1   0.32s  |   17    0   0.84s
c_loopB.c   |   17    0   0.03s  |   29    4   0.08s  |   95    1   0.15s  |   17    0   1.13s
c_md.c      |   65    0   0.25s  |   77    4   0.30s  |  131    1   0.53s  |   65    0   1.25s
Symbolic Simulation of OmpSCR
• Blind simulation needs to explore (much) more traces to hit a consistency violation.
• Standard OpenMP tools fail to report races of these benchmarks.

               | Guided simulation  | Random simulation   | Sun Studio | Intel Thread Checker
Benchmark      | #Traces   Time     | #Traces   Time      | race       | races/total
c_lu_bug1      |    1      23.35s   |   25.3    52.11s    | N          | 4/10
c_lu_bug2      |    1      23.22s   |  178.9   110.58s    | N          | 1/10
c_ja01_bug1    |    1       6.65s   |   10.6    26.60s    | N          | 4/10
c_ja01_bug2    |    1      13.91s   |   42.1    58.16s    | N          | 3/10
c_ja02_bug1    |    1      14.86s   |   25      28.83s    | N          | 2/10
c_ja02_bug2    |    1      15.19s   |   41.3    52.25s    | N          | 2/10
c_loopA_bug1   |    1      10.76s   |   11.7    36.82s    | N          | 3/10
c_loopA_bug2   |    1      56.86s   |   27.6    98.40s    | N          | 2/10
c_loopB_bug1   |    1      14.54s   |    9.4    29.58s    | N          | 2/10
c_loopB_bug2   |    1      41.50s   |   38.6    66.48s    | N          | 2/10
c_md_bug1      |    1      12.19s   |   10.4    26.21s    | N          | 3/10
c_md_bug2      |    1      19.38s   |   44.3    83.52s    | N          | 2/10
NAS Parallel Benchmarks
• Middle-size benchmarks (1200+ to 3500+ LOC)
• Efficient race constraint solving
  – e.g., 150000+ race constraints solved in 38 minutes by the Omega library
• Rarely satisfiable constraints
  – e.g., 8 of 85067 constraints of nas_lu.c

Benchmark | #loc | #Access | #Const. | #Sat | Time
nas_lu.c  | 3481 | 13736   |  85067  |  8   | 27m30.37s
bt.c      | 3616 | 15916   | 157047  |  0   | 37m33.32s
mg.c      | 1250 |  4636   |   2269  |  0   | 0m17.19s
sp.c      | 2983 | 13604   |  45209  |  0   | 4m0.32s
nas_lu.c
• Slice the program to the segment of the parallel region with satisfiable race conditions.
• Construct the symbolic model of the sliced segment:
  – 35 modes (EFSM)
  – Reaches the fixed point without a consistency violation after 205 steps and 16.93 s
• Benign races
  – All of them are used as mutual-exclusion semaphores.
  – nas_lu.c is consistent.
Conclusion
• Static analysis of program consistency
  – For real C/C++ programs with OpenMP directives
• Highly automated solution
  – Constraint solving
  – Symbolic simulation
• High precision: relaxed memory models
• High efficiency
• Future work: extension to TBB and other memory models? Partial order reduction?