Efficient Data Race Detection for Distributed Memory Parallel Programs


Page 1: Title

Efficient Data Race Detection for Distributed Memory Parallel Programs
CS267 Spring 2012 (3/8). Originally presented at Supercomputing 2011 (Seattle, WA).
Chang-Seo Park and Koushik Sen, University of California, Berkeley
Paul Hargrove and Costin Iancu, Lawrence Berkeley National Laboratory

Page 2: Current State of Parallel Programming

Parallelism everywhere!
• Top supercomputer has 500K+ cores
• Quad-core standard on desktops / laptops
• Dual-core smartphones

Parallelism and concurrency make programming harder
• Scheduling non-determinism may cause subtle bugs

But testing and correctness tools see limited use
• We like hero programmers
• Hero programmers can find bugs (in sequential code)
• Tools are hard to find and use

Page 3: Outline

• Introduction
• Example Bug and Motivation
• Efficient Data Race Detection with Active Testing
  - Prediction phase
  - Confirmation phase
• HOWTO: Primer on using UPC-Thrille
• Conclusion
• Q&A and Project Ideas

Page 4: Example Parallel Program

Simple matrix-vector multiply

[Figure: c = A × b]

Page 5: Example Parallel Program in UPC

 1: void matvec(shared [N] double A[N][N],
                shared double B[N], shared double C[N]) {
 2:   double sum[N];
 3:   upc_forall(int i = 0; i < N; i++; &C[i]) {
 4:     sum[i] = 0;
 5:     for (int j = 0; j < N; j++)
 6:       sum[i] += A[i][j] * B[j];
 7:   }
 8:   upc_forall(int i = 0; i < N; i++; &C[i]) {
 9:     C[i] = sum[i];
10:   }
11: } // assert (C == A*B)

[Figure: c = A × b with A all ones and b = (1,1); c not yet computed]

Page 6: Example Parallel Program in UPC

[Figure: C = (2,2) = A × B with A all ones and B = (1,1)]

(matvec code as on Page 5)

Page 7: UPC Example: Problem?

[Figure: C = (2,2) = A × B, as on Page 6]

(matvec code as on Page 5)

No apparent bug in this program.

Page 8: UPC Example: Data Race

(matvec code as on Page 5)

No apparent bug in this program. But what if we call

matvec(A,B,B)?

Data Race!

[Figure: matvec(A,B,B): the output vector aliases B = (1,1)]

Page 9: UPC Example: Data Race

[Figure: in-place update: B = (2,1) after B[0] has been overwritten]

(matvec code as on Page 5)

Data Race! There is no apparent bug in this program, but what if we call

matvec(A,B,B)?

Page 10: UPC Example: Data Race

[Figure: final result B = (2,3), not the expected (2,2)]

(matvec code as on Page 5)

Data Race! There is no apparent bug in this program, but what if we call

matvec(A,B,B)?

Page 11: UPC Example: Trace

Example Trace:
4: sum[0] = 0;
4: sum[1] = 0;
6: sum[0] += A[0][0]*B[0];
6: sum[1] += A[1][0]*B[0];
6: sum[0] += A[0][1]*B[1];
6: sum[1] += A[1][1]*B[1];
9: B[0] = sum[0];
9: B[1] = sum[1];

(matvec code as on Page 5)

Data Race?

Page 12: UPC Example: Trace

Example Trace:
4: sum[0] = 0;
4: sum[1] = 0;
6: sum[0] += A[0][0]*B[0];
6: sum[0] += A[0][1]*B[1];
6: sum[1] += A[1][0]*B[0];
9: B[0] = sum[0];
6: sum[1] += A[1][1]*B[1];
9: B[1] = sum[1];

(matvec code as on Page 5)

Would be nice to have a trace exhibiting the data race

Data Race!

Page 13: UPC Example: Trace

(matvec code as on Page 5)

Would be nice to have a trace exhibiting the assertion failure

Example Trace:
4: sum[0] = 0;
4: sum[1] = 0;
6: sum[0] += A[0][0]*B[0];
6: sum[0] += A[0][1]*B[1];
9: B[0] = sum[0];
6: sum[1] += A[1][0]*B[0];
6: sum[1] += A[1][1]*B[1];
9: B[1] = sum[1];

Data Race!

C != A*B

Page 14: Desiderata

Would be nice to have a trace:
• Showing a data race (or some other concurrency bug)
• Showing an assertion violation due to a data race (or some other visible artifact)

Page 15: Active Testing

Would be nice to have a trace:
• Showing a data race (or some other concurrency bug)
• Showing an assertion violation due to a data race (or some other visible artifact)

Leverage program analysis to make testing quickly find real concurrency bugs:
• Phase 1: Use imprecise static or dynamic program analysis to find bug patterns where a potential concurrency bug can happen (Race Detector)
• Phase 2: Directed testing to confirm potential bugs as real (Race Tester)

Page 16: Active Testing: Phase 1

(matvec code as on Page 5)

Page 17: Active Testing: Phase 1

(matvec code as on Page 5)

1. Insert instrumentation at compile time

Page 18: Active Testing: Phase 1

2. Run instrumented program normally and obtain trace

Generated Trace:
4: sum[0] = 0;
4: sum[1] = 0;
6: sum[0] += A[0][0]*B[0];
6: sum[1] += A[1][0]*B[0];
6: sum[0] += A[0][1]*B[1];
6: sum[1] += A[1][1]*B[1];
9: B[0] = sum[0];
9: B[1] = sum[1];

(matvec code as on Page 5)

Page 19: Active Testing: Phase 1

(generated trace as on Page 18)

3. Algorithm detects data races

(matvec code as on Page 5)

Page 20: Active Testing: Phase 1

3. Potential race between statements 6 and 9

(generated trace as on Page 18)

(matvec code as on Page 5)

Page 21: Active Testing: Phase 2

Goal 1. Confirm races
Goal 2. Create assertion failure

(matvec code as on Page 5)

Page 22: Active Testing: Phase 2

Generate this execution:
4: sum[0] = 0;
4: sum[1] = 0;
6: sum[0] += A[0][0]*B[0];
6: sum[0] += A[0][1]*B[1];
9: B[0] = sum[0];
6: sum[1] += A[1][0]*B[0];
6: sum[1] += A[1][1]*B[1];
9: B[1] = sum[1];

Data Race!

Goal 1. Confirm races
Goal 2. Create assertion failure

(matvec code as on Page 5)

Page 23: Active Testing: Phase 2

Control scheduler knowing that (6,9) could race

Trace:
4: sum[1] = 0;

(matvec code as on Page 5)

Page 24: Active Testing: Phase 2

Control scheduler knowing that (6,9) could race

Trace:
4: sum[1] = 0;
4: sum[0] = 0;

(matvec code as on Page 5)

Page 25: Active Testing: Phase 2

Control scheduler knowing that (6,9) could race

Trace:
4: sum[1] = 0;
4: sum[0] = 0;
6: sum[1] += A[1][0]*B[0];

(matvec code as on Page 5)

Page 26: Active Testing: Phase 2

Control scheduler knowing that (6,9) could race

Trace:
4: sum[1] = 0;
4: sum[0] = 0;

Postponed: { 6: sum[1] += A[1][0]*B[0]; }

Do not postpone if that would cause a deadlock

(matvec code as on Page 5)

Page 27: Active Testing: Phase 2

Control scheduler knowing that (6,9) could race

Postponed: { 6: sum[1] += A[1][0]*B[0]; }

Trace:
4: sum[1] = 0;
4: sum[0] = 0;
6: sum[0] += A[0][0]*B[0];

Do not postpone if that would cause a deadlock

(matvec code as on Page 5)

Page 28: Active Testing: Phase 2

Control scheduler knowing that (6,9) could race

Trace:
4: sum[1] = 0;
4: sum[0] = 0;
6: sum[0] += A[0][0]*B[0];
6: sum[0] += A[0][1]*B[1];

Do not postpone if that would cause a deadlock

Postponed: { 6: sum[1] += A[1][0]*B[0]; }

(matvec code as on Page 5)

Page 29: Active Testing: Phase 2

Control scheduler knowing that (6,9) could race

Trace:
4: sum[1] = 0;
4: sum[0] = 0;
6: sum[0] += A[0][0]*B[0];
6: sum[0] += A[0][1]*B[1];
9: B[0] = sum[0];

Postponed: { 6: sum[1] += A[1][0]*B[0]; }

(matvec code as on Page 5)

Page 30: Active Testing: Phase 2

Control scheduler knowing that (6,9) could race

Trace:
4: sum[1] = 0;
4: sum[0] = 0;
6: sum[0] += A[0][0]*B[0];
6: sum[0] += A[0][1]*B[1];
9: B[0] = sum[0];

Postponed: { 6: sum[1] += A[1][0]*B[0]; }

(matvec code as on Page 5)

Race? Yes.

Page 31: Active Testing: Phase 2

Control scheduler knowing that (6,9) could race

Trace:
4: sum[1] = 0;
4: sum[0] = 0;
6: sum[0] += A[0][0]*B[0];
6: sum[0] += A[0][1]*B[1];

Postponed: {}

(matvec code as on Page 5)

9: B[0] = sum[0];
6: sum[1] += A[1][0]*B[0];

Page 32: Active Testing: Phase 2

Achieved Goal 1: Confirmed race.

Trace:
4: sum[1] = 0;
4: sum[0] = 0;
6: sum[0] += A[0][0]*B[0];
6: sum[0] += A[0][1]*B[1];
9: B[0] = sum[0];
6: sum[1] += A[1][0]*B[0];

(matvec code as on Page 5)

Racing Statements Temporally Adjacent

Page 33: Active Testing: Phase 2

(matvec code as on Page 5)

Trace:
4: sum[1] = 0;
4: sum[0] = 0;
6: sum[0] += A[0][0]*B[0];
6: sum[0] += A[0][1]*B[1];
9: B[0] = sum[0];
6: sum[1] += A[1][0]*B[0];
6: sum[1] += A[1][1]*B[1];
9: B[1] = sum[1];

Achieved Goal 2: Assertion failure.

C != A*B
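To make the postponement strategy concrete, here is a minimal, self-contained C sketch of the idea (an illustration under simplifying assumptions, not UPC-Thrille's actual implementation): two simulated threads execute the event sequences from the example, a thread reaching either side of the predicted pair (6,9) is postponed, and when another thread reaches the matching side the two statements are run back to back; if every runnable thread is postponed, one is released to avoid deadlock.

#include <stdio.h>

#define STEPS 4
typedef struct { int stmt; const char *text; } event_t;

/* The two threads' statement sequences from the matvec example. */
static event_t prog[2][STEPS] = {
    { {4, "sum[0] = 0"},             {6, "sum[0] += A[0][0]*B[0]"},
      {6, "sum[0] += A[0][1]*B[1]"}, {9, "B[0] = sum[0]"} },
    { {4, "sum[1] = 0"},             {6, "sum[1] += A[1][0]*B[0]"},
      {6, "sum[1] += A[1][1]*B[1]"}, {9, "B[1] = sum[1]"} },
};

static int pc[2];

static void run(int t) { printf("T%d: %s\n", t, prog[t][pc[t]].text); pc[t]++; }

int main(void) {
    int parked = -1;   /* index of the (at most one) postponed thread */
    while (pc[0] < STEPS || pc[1] < STEPS) {
        int progress = 0;
        for (int t = 0; t < 2; t++) {
            if (pc[t] >= STEPS || t == parked) continue;
            int s = prog[t][pc[t]].stmt;
            if (parked >= 0) {
                int ps = prog[parked][pc[parked]].stmt;
                /* Simplification: match on statement numbers only; the
                 * real tool also requires overlapping addresses. */
                if ((s == 6 && ps == 9) || (s == 9 && ps == 6)) {
                    printf("-- potential race (6,9): run back to back --\n");
                    run(t); run(parked);   /* racing statements adjacent */
                    parked = -1; progress = 1; continue;
                }
            } else if (s == 6 || s == 9) {
                parked = t;                /* postpone one side of the pair */
                progress = 1; continue;
            }
            run(t); progress = 1;
        }
        if (!progress && parked >= 0) {    /* everyone stuck: do not postpone */
            run(parked); parked = -1;
        }
    }
    return 0;
}

Running the two sides temporally adjacent is what lets the tester both confirm the race and steer the program toward the assertion failure shown above.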

Page 34: UPC-Thrille

Thread Interposition Library and Lightweight Extensions: a framework for active testing of UPC programs
• Instruments UPC source code at compile time
  - Using macro expansions, adds hooks for analyses (see the sketch below)
• Phase 1: Race detector
  - Observes the execution and predicts which accesses may potentially have a data race
• Phase 2: Race tester
  - Re-executes the program while controlling the scheduler to create the actual data race scenarios predicted in phase 1
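As a rough illustration of what compile-time instrumentation by macro expansion can look like (the hook and macro names here are hypothetical, not Thrille's actual API):

#include <stddef.h>
#include <stdio.h>

/* Hypothetical analysis hooks (illustrative names only). */
static void thrille_on_read(void *addr, size_t size, int line) {
    printf("read  %zu bytes @ %p (line %d)\n", size, addr, line);
}
static void thrille_on_write(void *addr, size_t size, int line) {
    printf("write %zu bytes @ %p (line %d)\n", size, addr, line);
}

/* Macro wrappers inserted around shared accesses at compile time. */
#define SHARED_READ(x)  (thrille_on_read((void*)&(x), sizeof(x), __LINE__), (x))
#define SHARED_WRITE(x, v) \
    (thrille_on_write((void*)&(x), sizeof(x), __LINE__), (x) = (v))

int main(void) {
    double sum = 0, b = 2.0;
    SHARED_WRITE(sum, SHARED_READ(b) * 3.0);   /* instrumented accesses */
    printf("sum = %f\n", sum);
    return 0;
}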

Page 35: UPC-Thrille

Extension to the Berkeley UPC compiler and runtime:
• Unfortunately, disabled by default on NERSC clusters
• Fortunately, compilers with Thrille enabled are available:
  - /global/homes/p/parkcs/hopper/bin/thrille_upcc
  - /global/homes/p/parkcs/franklin/bin/thrille_upcc
• You can also build Berkeley upcc with Thrille enabled from source by following the steps at http://upc.lbl.gov/thrille

Add the switch "-thrille=[mode]" to a (Thrille-enabled) upcc, where [mode] is:
• empty (default, no instrumentation)
• racer (phase 1 of race detection, predicts racy statement pairs)
• tester (phase 2, tries to create a race on a given statement pair)
You can also add "default_options = -thrille=[mode]" to ~/.upccrc

Page 36: UPC-Thrille

No changes needed to the source file(s)
Separate binary for each analysis phase:
• Including one "empty" uninstrumented version
• Run each phase with the corresponding binary

[Diagram: hello.upc is compiled three times with upcc: with no switch into a.out, with -thrille=racer into b.out, and with -thrille=tester into c.out; each binary is launched with upcrun]

Page 37: UPC-Thrille: racer

-thrille=racer
• Finds potential data race pairs
• Records them in upct.race.N files (N=1,2,3,…)
• Only works with a static number of threads for now (needs -T n); this limitation will be lifted soon

Example

$ upcc -T4 -thrille=racer matvec.upc -o matvec-racer
$ upcrun matvec-racer   (in an interactive batch job)
…
[2] Potential race #1 found:
[2] Read from [0x3ff7004,0x3ff7008) by thread 2 at phase 4 (matvec.upc:17)
[3] Write to [0x3ff7004,0x3ff7008) by thread 3 at phase 4 (matvec.upc:26)
…

Page 38: UPC-Thrille: tester

-thrille=tester
• Confirms data races predicted in phase 1
• Reads in the upct.race.N files (N=1,2,3,…) and tests each race individually
• The provided upctrun script automatically tests all races and skips equivalent ones
• A specific race can also be tested by setting the environment variable UPCT_RACE_ID=n

Example
$ upcc -T4 -thrille=tester matvec.upc -o matvec-tester
$ upctrun matvec-tester
…
('matvec.upc:17', 'matvec.upc:26') : (8, 1, True)
…

In the result tuple (8, 1, True): the number of equivalent races (8), the number of pairs tested (1), and whether the race was confirmed (True).

Page 39: Limitations

Limitations of the prediction phase:
• Dynamic analysis can only analyze collected data
• Cannot predict races in code that was not executed
• Cannot predict races in binary-only libraries whose source was not instrumented

Limitations of the confirmation phase:
• Non-confirmation does not guarantee race freedom
• "Benign" data races

Page 40: Conclusion

Active testing for finding bugs in parallel programs:
• Combines dynamic analysis with testing
• Observe executions for potential concurrency bugs
• Re-execute to confirm bugs

UPC-Thrille is an efficient, scalable, and extensible analysis framework for UPC:
• Currently provides race detection analysis
• Other analyses in progress (class projects?)

http://upc.lbl.gov/thrille
[email protected]


Page 42: Optimization 1: Distributed Checking

Minimize interaction between threads:
• Store shared memory accesses locally
• At a barrier boundary, send access information to the respective owner of the memory
• Conflict checking is distributed among threads (see the sketch after the figure below)

[Figure: synchronization timelines for threads T1 and T2 showing when a shared access may run concurrently with another thread's accesses: shared access after wait, shared access after notify, and shared access between barriers]
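A minimal sketch of owner-distributed conflict checking (an assumed scheme following the bullets above; the real tool logs richer access records and ships them over the network at barriers):

#include <stdio.h>

#define NTHREADS 2
#define MAXLOG   16

/* One logged shared access: the owner of the target memory, the
 * address, the accessing thread, and the access type. */
typedef struct { int owner, addr, thread, is_write; } acc_t;

static acc_t logs[NTHREADS][MAXLOG];   /* per-thread local access logs */
static int   nlog[NTHREADS];

static void record(int t, int owner, int addr, int is_write) {
    logs[t][nlog[t]++] = (acc_t){ owner, addr, t, is_write };
}

/* At a barrier, entries are (conceptually) shipped to their owner, and
 * each thread checks only accesses that target its own memory. */
static void check_owner(int owner) {
    for (int t1 = 0; t1 < NTHREADS; t1++)
        for (int i = 0; i < nlog[t1]; i++)
            for (int t2 = t1 + 1; t2 < NTHREADS; t2++)
                for (int j = 0; j < nlog[t2]; j++) {
                    acc_t a = logs[t1][i], b = logs[t2][j];
                    if (a.owner == owner && b.owner == owner &&
                        a.addr == b.addr && (a.is_write || b.is_write))
                        printf("potential race on addr %d (T%d vs T%d)\n",
                               a.addr, a.thread, b.thread);
                }
}

int main(void) {
    record(0, 1, 42, 1);   /* T0 writes an address owned by T1 */
    record(1, 1, 42, 0);   /* T1 reads the same address */
    for (int o = 0; o < NTHREADS; o++)
        check_owner(o);    /* each owner checks only its own region */
    return 0;
}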

Page 43: Optimization 2: Filter Redundancy

Information collected up to a synchronization point may be redundant:
• Reading and writing the same memory address
• Accessing the same memory with different sizes or different locksets

Page 44: Optimization 2: Filter Redundancy

Information collected up to a synchronization point may be redundant:
• Reading and writing the same memory address
• Accessing the same memory with different sizes or different locksets

(Extended) weaker-than relation:
• Only keep the least protected accesses
• Prune provably redundant accesses [Choi et al. '02]
• Also reduces superfluous race reports

e1 ⊑ e2 (access e1 is weaker than e2) iff:
• larger memory range (e1.m ⊇ e2.m)
• accessed by more threads (e1.t = * ∨ e1.t = e2.t)
• smaller lockset (e1.L ⊆ e2.L)
• weaker access type (e1.a = Write ∨ e1.a = e2.a)
(see the sketch below)
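Under an assumed concrete representation (an interval for the memory range, a bitmask for the lockset; not Thrille's real data structures), the relation can be checked like this:

#include <stdbool.h>
#include <stdint.h>

#define ANY_THREAD (-1)   /* the '*' wildcard from the relation above */

typedef struct {
    uintptr_t lo, hi;     /* accessed memory range [lo, hi) */
    int thread;           /* accessing thread, or ANY_THREAD */
    uint64_t locks;       /* lockset as a bitmask */
    bool is_write;        /* Write is the weaker access type */
} access_t;

/* e1 is weaker than e2: keeping only e1 loses no potential races. */
static bool weaker_than(const access_t *e1, const access_t *e2) {
    return e1->lo <= e2->lo && e2->hi <= e1->hi               /* e1.m ⊇ e2.m */
        && (e1->thread == ANY_THREAD || e1->thread == e2->thread)
        && (e1->locks & ~e2->locks) == 0                      /* e1.L ⊆ e2.L */
        && (e1->is_write || e1->is_write == e2->is_write);    /* weaker type */
}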

Page 45: Optimization 3: Sampling

Scientific applications have tight loops:
• Same computation and communication pattern each time
• Inefficient to check for races at every loop iteration

Reduce overhead by sampling (see the sketch below):
• Probabilistically sample each instrumentation point
• Reduce the probability after each unsuccessful check
• Set the probability to 0 when a race is found (disable the check)
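A minimal sketch of per-site sampling with exponential backoff (the halving factor is an assumption; the slide only says the probability is reduced after each unsuccessful check):

#include <stdlib.h>

typedef struct { double p; } site_t;   /* one per instrumentation point */

static int should_check(const site_t *s) {
    return (double)rand() / RAND_MAX < s->p;
}

static void after_check(site_t *s, int race_found) {
    if (race_found) s->p = 0.0;        /* race found: disable this site */
    else            s->p *= 0.5;       /* unsuccessful: back off */
}

At each instrumentation point, the runtime would call should_check, run the (expensive) race check only when it returns nonzero, and report the outcome via after_check.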

Page 46: How Well Does it Scale?

Maximum 8% slowdown at 8K cores
• Franklin: Cray XT4 supercomputer at NERSC
• Quad-core 2.3GHz CPU and 8GB RAM per node
• Portals interconnect

Optimizations for scalability:
• Efficient data structures
• Minimized communication
• Sampling with exponential backoff

[Figure: synchronization timelines, as on Page 42]

Page 47: Active Testing Cartoon: Phase I

Potential Collision

Page 48: Active Testing Cartoon: Phase II

Page 49: New Landscape for HPC

Shared memory for scalability and utilization:
• Hybrid programming models: MPI + OpenMP
• PGAS: UPC, CAF, X10, etc.
• Asynchronous access to shared data is likely to cause bugs

Unified Parallel C (UPC):
• Parallel extensions to the ISO C99 standard for shared and distributed memory hardware
• Single Program Multiple Data (SPMD) + Partitioned Global Address Space (PGAS)
• Shared memory concurrency (see the sketch below):
  - Transparent access using pointers to shared data (arrays)
  - Bulk transfers with memcpy, memput, memget
  - Fine-grained (lock) and bulk (barrier) synchronization
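A small illustrative UPC program touching each of these features (a sketch for orientation, not taken from the slides):

#include <upc.h>
#include <stdio.h>

shared int counter;                /* shared scalar, affinity to thread 0 */
shared int data[100 * THREADS];    /* shared array distributed over threads */

int main(void) {
    int local[100];
    upc_lock_t *lock = upc_all_lock_alloc();  /* collectively allocate one lock */

    upc_lock(lock);                /* fine-grained synchronization */
    counter += 1;                  /* transparent access to shared data */
    upc_unlock(lock);

    /* bulk transfer: copy the first block into private memory */
    upc_memget(local, data, 100 * sizeof(int));

    upc_barrier;                   /* bulk synchronization */
    if (MYTHREAD == 0) printf("counter = %d\n", counter);
    return 0;
}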

Page 50: Phase 1: Checking for Conflicts

To predict possible races:
• Need to check all shared accesses for conflicts
• Collect information through instrumentation

Page 51: Phase 1: Checking for Conflicts

(Recap of Page 50, plus the formal conflict condition:)

Two accesses e1 = (m1, t1, L1, a1, p1, s1) and e2 = (m2, t2, L2, a2, p2, s2) are in conflict when:
• memory ranges overlap (m1 ∩ m2 ≠ ∅)
• accesses come from different threads (t1 ≠ t2)
• no common locks are held (L1 ∩ L2 = ∅)
• at least one is a write (a1 = Write ∨ a2 = Write)
• they may happen in parallel w.r.t. barriers (p1 || p2)

A conflict implies that (s1, s2) is a potential data race pair (see the sketch below).
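A sketch of that pairwise test under an assumed representation (memory range as an interval, lockset as a bitmask); the barrier-phase condition p1 || p2 is omitted for brevity, and these are not Thrille's real types:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uintptr_t lo, hi;   /* accessed memory range [lo, hi) */
    int thread;         /* accessing thread */
    uint64_t locks;     /* lockset held during the access (bitmask) */
    bool is_write;      /* access type */
} access_t;

static bool conflict(const access_t *e1, const access_t *e2) {
    return e1->lo < e2->hi && e2->lo < e1->hi    /* m1 ∩ m2 ≠ ∅ */
        && e1->thread != e2->thread              /* t1 ≠ t2 */
        && (e1->locks & e2->locks) == 0          /* L1 ∩ L2 = ∅ */
        && (e1->is_write || e2->is_write);       /* a1 = Write ∨ a2 = Write */
}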

Page 52: Differences and Challenges for UPC

Previous work targeted Java and pthreads programs:
• Synchronization with locks and condition variables
• Single node

UPC has a different programming model (SPMD):
• Large scale
• Bulk communication (memory regions)
• Non-blocking communication
• Collective operations with data movement
• Different memory consistency model

Page 53: Differences and Challenges for UPC

(Bullets as on Page 52, plus:)

Optimizations for scalability:
• Distribute analysis and coalesce queries
• Efficient data structures for memory interval reasoning
• Reduce communication through filtering and sampling

[Figure: synchronization timelines, as on Page 42]

Page 54: Differences and Challenges for UPC

(Content repeats Page 53.)

Page 55: Differences and Challenges for UPC

(Content repeats Page 53, plus the weaker-than relation from Page 44.)

Page 56: Results on Single Node

* 4 threads on a quad-core 2.66GHz CPU / 8GB RAM

Benchmark   LoC    Runtime   Racer overhead   Pot. races   Tester overhead   Conf. races
guppie       227    2.094s   12%              2            1.7%              2
knapsack     191    2.099s   14.9%            2            1.8%              2
laplace      123    2.101s   16.3%            0            -                 -
mcop         358    2.183s   0.7%             0            -                 -
psearch      777    2.982s   1.8%             3            3.8%              2
FT 2.3      2306    8.711s   6.1%             2            4.8%              2
CG 2.4      1939    3.812s   0.5%             0            -                 -
EP 2.4       763    10.02s   0.9%             0            -                 -
FT 2.4      2374    7.036s   0.1%             1            4.2%              1
IS 2.4      1449    3.073s   1.1%             0            -                 -
MG 2.4      2314    4.895s   3.1%             2            1.2%              2
BT 3.3      9626    48.78s   0.5%             8            0.8%              0
LU 3.3      6311    37.05s   0.5%             0            -                 -
SP 3.3      5691    59.56s   0.2%             8            3.0%              0

Low overhead: < 20%

Unconfirmed bugs due to custom synchronization

Page 57: Scalability Results on Franklin*

[Chart: speedup (log scale, 10 to 1000) of the Normal, Racer, and Tester runs of NPB BT, LU, MG, and SP (Class C and Class D) from 16 up to 8K cores]

* Cray XT4 supercomputer at NERSC: quad-core 2.3GHz CPU / 8GB RAM per node / Portals interconnect

Maximum 8% slowdown at 8K cores

Page 58: Bugs Found

In NPB 2.3 FT (see the fix sketch after the snippet below):
• The wrong lock allocation function causes real races in validation code
• Spurious validation failure errors

shared dcomplex *dbg_sum;
static upc_lock_t *sum_write;

sum_write = upc_global_lock_alloc();  // wrong function

upc_lock(sum_write);
{
    dbg_sum->real = dbg_sum->real + chk.real;
    dbg_sum->imag = dbg_sum->imag + chk.imag;
}
upc_unlock(sum_write);
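Why this is a bug, with a sketch of the fix (based on standard UPC semantics, not on the NPB source itself): upc_global_lock_alloc() is non-collective, so when every thread executes that line, each thread allocates its own distinct lock and the "critical section" excludes nothing. The collective upc_all_lock_alloc() returns the same lock to all threads:

/* Sketch of the fix: allocate one lock shared by all threads. */
sum_write = upc_all_lock_alloc();   /* collective: same lock everywhere */

upc_lock(sum_write);                /* now actually mutually exclusive */
dbg_sum->real += chk.real;
dbg_sum->imag += chk.imag;
upc_unlock(sum_write);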

Page 59: Bugs Found

In SPLASH2 lu (see the fix sketch after the snippet below):
• Multiple initialization of a vector without locks
• Different results on different executions
• Performance bug

void InitA()
{
    …
    for (j = 0; j < n; j++) {
        for (i = 0; i < n; i++) {
            rhs[i] += a[i + j*n];   // executed by all threads
        }
    }
}
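One possible repair, sketched in the UPC style of this deck (an assumption, not the benchmark's actual patch): let a single thread perform the accumulation and synchronize with a barrier, so rhs is initialized exactly once:

void InitA()
{
    /* ... */
    if (MYTHREAD == 0) {
        for (j = 0; j < n; j++)
            for (i = 0; i < n; i++)
                rhs[i] += a[i + j*n];   /* now executed by one thread only */
    }
    upc_barrier;   /* other threads wait for initialization to finish */
}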

Page 60: Conclusion

Need correctness tool support for HPC:
• Effective correctness tools are scarce

Our proposal: Active testing
• Combines dynamic analysis with testing
• Low overhead (< 10%)
• Scalable (> 8K cores)
• General algorithm, applicable to other programming models:

• MPI, CUDA, OpenMP

http://upc.lbl.gov/thrille
PGAS @ Booth 124