Sound and Precise Analysis of Parallel Programs through Schedule Specialization

Jingyue Wu, Yang Tang, Gang Hu, Heming Cui, Junfeng Yang
Columbia University

Motivation

[Figure: soundness (# of analyzed schedules / # of total schedules) versus precision, locating Static Analysis (Total Schedules) and Dynamic Analysis (Analyzed Schedules); a "?" marks the position neither of them reaches.]

• Analyzing parallel programs is difficult.

Schedule Specialization

• Precision: Analyze the program over a small set of schedules.
• Soundness: Enforce these schedules at runtime.

[Figure: the same soundness versus precision plot, now with Schedule Specialization placed alongside Static Analysis and Dynamic Analysis; Analyzed Schedules and Enforced Schedules are drawn within Total Schedules.]

Enforcing Schedules Using Peregrine

• Deterministic multithreading
  – e.g. DMP (ASPLOS ’09), Kendo (ASPLOS ’09), CoreDet (ASPLOS ’10), Tern (OSDI ’10), Peregrine (SOSP ’11), DTHREADS (SOSP ’11)
  – Performance overhead
    • e.g. Kendo: 16%, Tern & Peregrine: 39.1%
• Peregrine
  – Record schedules, and reuse them on a wide range of inputs.
  – Represent schedules explicitly.
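To make "represent schedules explicitly" and "enforce at runtime" concrete, here is a minimal single-threaded sketch of a turn-based sequencer over a recorded order of synchronizations. All names (`rec_event`, `sequencer`, `sequencer_try_advance`) are hypothetical and chosen for illustration; this is not Peregrine's actual interface or mechanism, only one simple way such enforcement could be modeled.

```c
#include <stddef.h>

/* Hypothetical types for illustration; not Peregrine's real API. */
enum rec_op { REC_CREATE, REC_JOIN, REC_LOCK, REC_UNLOCK };

struct rec_event {
    int tid;         /* thread that performs the synchronization */
    enum rec_op op;  /* which synchronization it performs */
};

struct sequencer {
    const struct rec_event *order; /* recorded total order */
    size_t len;                    /* number of recorded events */
    size_t next;                   /* index of the next event allowed to run */
};

/* Returns 1 and advances if (tid, op) is the next recorded event.
 * Returns 0 if the caller has to wait for its turn, or if the
 * recorded schedule is already exhausted. */
int sequencer_try_advance(struct sequencer *s, int tid, enum rec_op op) {
    if (s->next >= s->len)
        return 0;
    if (s->order[s->next].tid != tid || s->order[s->next].op != op)
        return 0;
    s->next++;
    return 1;
}
```

In a real runtime the waiting side would block rather than poll, and the structure would need its own synchronization; both are left out here to keep the idea visible.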


Framework

• Extract control flow and data flow enforced by a set of schedules

[Diagram: Schedule Specialization takes a Program (C/C++ program with Pthread) and a Schedule (total order of synchronizations) and produces a Specialized Program plus extra def-use chains.]
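A schedule as "a total order of synchronizations" can be pictured as a plain array of (thread, operation) pairs. The sketch below uses hypothetical names (`sync_event`, `schedule_count`, `schedule_lock_discipline_ok`) and one assumed ordering for the two-worker running example that follows; the slides do not spell out the exact interleaving, so treat the array contents as an assumption.

```c
#include <stddef.h>

enum sync_op { SYNC_CREATE, SYNC_JOIN, SYNC_LOCK, SYNC_UNLOCK };

struct sync_event {
    int tid;          /* thread performing the operation */
    enum sync_op op;  /* the synchronization operation */
};

/* One possible schedule for two workers (assumed order). */
static const struct sync_event example_schedule[] = {
    {0, SYNC_CREATE}, {0, SYNC_CREATE},
    {1, SYNC_LOCK},   {1, SYNC_UNLOCK},
    {2, SYNC_LOCK},   {2, SYNC_UNLOCK},
    {0, SYNC_JOIN},   {0, SYNC_JOIN},
};
static const size_t example_len =
    sizeof example_schedule / sizeof example_schedule[0];

/* Count how often thread tid performs op in the schedule. */
size_t schedule_count(const struct sync_event *s, size_t n,
                      int tid, enum sync_op op) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (s[i].tid == tid && s[i].op == op)
            count++;
    return count;
}

/* Check a single mutex is used consistently in the total order:
 * acquired only while free, released only by its holder, free at the end. */
int schedule_lock_discipline_ok(const struct sync_event *s, size_t n) {
    int holder = -1; /* -1 means the mutex is free */
    for (size_t i = 0; i < n; i++) {
        if (s[i].op == SYNC_LOCK) {
            if (holder != -1)
                return 0;
            holder = s[i].tid;
        } else if (s[i].op == SYNC_UNLOCK) {
            if (holder != s[i].tid)
                return 0;
            holder = -1;
        }
    }
    return holder == -1;
}
```

Because the order is total, questions such as "how many threads does main create on this schedule?" reduce to a scan over the array.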

Outline

• Example
• Control-Flow Specialization
• Data-Flow Specialization
• Results
• Conclusion

Running Example

int results[p_max];
int global_id = 0;

int main(int argc, char *argv[]) {
  int i;
  int p = atoi(argv[1]);
  for (i = 0; i < p; ++i)
    pthread_create(&child[i], 0, worker, 0);
  for (i = 0; i < p; ++i)
    pthread_join(child[i], 0);
  return 0;
}

void *worker(void *arg) {
  pthread_mutex_lock(&global_id_lock);
  int my_id = global_id++;
  pthread_mutex_unlock(&global_id_lock);
  results[my_id] = compute(my_id);
  return 0;
}

[Schedule diagram: Thread 0 issues create, create, join, join; Thread 1 and Thread 2 each issue lock, unlock. Race-free?]

Control-Flow Specialization

[Figure, built up over four slides: next to the code of main from the running example, the schedule for Thread 0 (create, create, join, join) and the control-flow graph of main (atoi; i = 0; i < p; create; ++i; i = 0; i < p; join; ++i; return). Following the schedule one synchronization at a time unrolls both loops, so that each create and each join in the schedule gets its own copy in a straight-line control-flow graph.]

Control-Flow Specialized Program

int main(int argc, char *argv[]) {
  int i;
  int p = atoi(argv[1]);
  i = 0;
  // i < p == true
  pthread_create(&child[i], 0, worker.clone1, 0);
  ++i;
  // i < p == true
  pthread_create(&child[i], 0, worker.clone2, 0);
  ++i;
  // i < p == false
  i = 0;
  // i < p == true
  pthread_join(child[i], 0);
  ++i;
  // i < p == true
  pthread_join(child[i], 0);
  ++i;
  // i < p == false
  return 0;
}

[Figure: the unrolled control-flow graph: atoi, i = 0, i < p, create, ++i, i < p, create, ++i, i < p, i = 0, i < p, join, ++i, i < p, join, ++i, i < p, return.]
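The `// i < p == ...` comments in the specialized program follow mechanically from the schedule: every create or join that main performs implies one taken loop test, and each loop ends with one failed test. A minimal sketch of that derivation is below; `main_op` and `branch_outcomes` are hypothetical names, and the function only handles the creates-then-joins shape of this particular example rather than general control flow.

```c
#include <stddef.h>

enum main_op { MAIN_CREATE, MAIN_JOIN };

/* Given main's synchronizations in schedule order, write the outcome of
 * each `i < p` test along the unrolled path into out[] (1 = true, 0 = false).
 * Returns the number of outcomes written, or 0 if out[] is too small or
 * the operations are not "all creates, then all joins". */
size_t branch_outcomes(const enum main_op *ops, size_t n,
                       int *out, size_t cap) {
    size_t k = 0; /* outcomes written so far */
    size_t i = 0; /* position in ops */
    for (int phase = 0; phase < 2; phase++) {
        enum main_op want = (phase == 0) ? MAIN_CREATE : MAIN_JOIN;
        /* One taken branch per loop iteration seen in the schedule. */
        while (i < n && ops[i] == want) {
            if (k == cap)
                return 0;
            out[k++] = 1;
            i++;
        }
        /* One failed test when the loop exits. */
        if (k == cap)
            return 0;
        out[k++] = 0;
    }
    return (i == n) ? k : 0;
}
```

For the schedule create, create, join, join this yields true, true, false, true, true, false, matching the comments in the specialized main above.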

More Challenges on Control-Flow Specialization

• Ambiguity
  [Diagram: a Caller with two call sites into the same Callee, the callee's ret, and synchronizations S1 and S2.]
• A schedule has too many synchronizations

Data-Flow Specialization

int global_id = 0;

void *worker.clone1(void *arg) {
  pthread_mutex_lock(&global_id_lock);
  int my_id = global_id++;
  pthread_mutex_unlock(&global_id_lock);
  results[my_id] = compute(my_id);
  return 0;
}

void *worker.clone2(void *arg) {
  pthread_mutex_lock(&global_id_lock);
  int my_id = global_id++;
  pthread_mutex_unlock(&global_id_lock);
  results[my_id] = compute(my_id);
  return 0;
}

[Schedule diagram, built up over several slides: Thread 0 issues create, create, join, join; Thread 1 and Thread 2 each issue lock, unlock. Starting from global_id = 0, the statements "my_id = global_id; global_id++" are resolved in schedule order: Thread 1 gets my_id = 0 and global_id = 1, then Thread 2 gets my_id = 1 and global_id = 2. Propagating these constants gives the specialized workers below.]

int global_id = 0;

void *worker.clone1(void *arg) {
  pthread_mutex_lock(&global_id_lock);
  global_id = 1;
  pthread_mutex_unlock(&global_id_lock);
  results[0] = compute(0);
  return 0;
}

void *worker.clone2(void *arg) {
  pthread_mutex_lock(&global_id_lock);
  global_id = 2;
  pthread_mutex_unlock(&global_id_lock);
  results[1] = compute(1);
  return 0;
}
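The constants in the specialized workers can be reproduced by replaying the critical sections in the order the schedule fixes. The sketch below is a sequential simulation, not the framework's actual data-flow analysis; `replay_global_id` and its parameters are hypothetical names introduced for illustration.

```c
#include <stddef.h>

/* Replay `my_id = global_id++` for each worker in lock-acquisition order.
 * lock_order[k] is the index of the worker that takes the lock k-th.
 * my_id_of_worker[w] receives the value worker w would observe.
 * Returns the final value of global_id. */
int replay_global_id(const int *lock_order, size_t n, int *my_id_of_worker) {
    int global_id = 0; /* initial value from the running example */
    for (size_t k = 0; k < n; k++) {
        my_id_of_worker[lock_order[k]] = global_id++;
    }
    return global_id;
}
```

With the order "first clone, then second clone" the replay assigns 0 and 1, as on the slides; reversing the order swaps the two values, which is why the result is only valid for the enforced schedule.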

More Challenges on Data-Flow Specialization

• Must/May alias analysis
  – global_id
• Reasoning about integers
  – results[0] = compute(0)
  – results[1] = compute(1)
• Many def-use chains
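Why integer reasoning matters for the race question can be shown with a deliberately conservative conflict check: once the indices into results[] are known constants, two writes from different threads can be told apart. The names below (`array_access`, `may_conflict`, `INDEX_UNKNOWN`) are hypothetical, and the check ignores happens-before ordering entirely, so it is a sketch of the idea rather than a race detector.

```c
#define INDEX_UNKNOWN (-1)

struct array_access {
    int tid;      /* accessing thread */
    int index;    /* constant index into results[], or INDEX_UNKNOWN */
    int is_write; /* nonzero for a store */
};

/* Conservative answer to: may these two accesses touch the same element
 * from different threads, with at least one of them a write? */
int may_conflict(struct array_access a, struct array_access b) {
    if (a.tid == b.tid)
        return 0; /* same thread: program order applies */
    if (!a.is_write && !b.is_write)
        return 0; /* two reads never conflict */
    if (a.index == INDEX_UNKNOWN || b.index == INDEX_UNKNOWN)
        return 1; /* without constants we must assume overlap */
    return a.index == b.index;
}
```

Before specialization both workers write results[my_id] with an unknown my_id, so the check has to report a possible conflict; after specialization the writes become results[0] and results[1] and the report disappears.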

Evaluation

• Applications
  – Static race detector
  – Alias analyzer
  – Path slicer
• Programs
  – PBZip2 1.1.5
  – aget 0.4.1
  – 8 programs in SPLASH2
  – 7 programs in PARSEC

Static Race Detector: # of False Positives

Program           Original   Specialized
aget                    72             0
PBZip2                 125             0
fft                     96             0
blackscholes             3             0
swaptions              165             0
streamcluster            4             0
canneal                 21             0
bodytrack                4             0
ferret                   6             0
raytrace               215             0
cholesky                31             7
radix                   53            14
water-spatial         2447          1799
lu-contig               18            18
barnes                 370           369
water-nsquared         354           333
ocean                  331           292
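For readers who want to work with these numbers, the table can be encoded directly and summarized. The sketch below copies the rows as they appear on the slide; the helper names (`fp_row`, `fp_total`, `fp_fully_eliminated`) are hypothetical, and any aggregate computed this way is derived here rather than reported by the authors.

```c
#include <stddef.h>

struct fp_row {
    const char *program;
    int original;    /* false positives on the original program */
    int specialized; /* false positives on the specialized program */
};

/* Rows copied from the false-positive table above. */
static const struct fp_row fp_table[] = {
    {"aget", 72, 0},           {"PBZip2", 125, 0},
    {"fft", 96, 0},            {"blackscholes", 3, 0},
    {"swaptions", 165, 0},     {"streamcluster", 4, 0},
    {"canneal", 21, 0},        {"bodytrack", 4, 0},
    {"ferret", 6, 0},          {"raytrace", 215, 0},
    {"cholesky", 31, 7},       {"radix", 53, 14},
    {"water-spatial", 2447, 1799}, {"lu-contig", 18, 18},
    {"barnes", 370, 369},      {"water-nsquared", 354, 333},
    {"ocean", 331, 292},
};
static const size_t fp_rows = sizeof fp_table / sizeof fp_table[0];

/* Sum one column: pass 0 for Original, nonzero for Specialized. */
int fp_total(int use_specialized) {
    int total = 0;
    for (size_t i = 0; i < fp_rows; i++)
        total += use_specialized ? fp_table[i].specialized
                                 : fp_table[i].original;
    return total;
}

/* Count programs whose false positives drop from nonzero to zero. */
size_t fp_fully_eliminated(void) {
    size_t count = 0;
    for (size_t i = 0; i < fp_rows; i++)
        if (fp_table[i].original > 0 && fp_table[i].specialized == 0)
            count++;
    return count;
}
```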

Static Race Detector: Harmful Races Detected

• 4 in aget
• 2 in radix
• 1 in fft

Precision of Schedule-Aware Alias Analysis

[Figure: chart not preserved in the text.]

Conclusion and Future Work

• Designed and implemented schedule specialization framework
  – Analyzes the program over a small set of schedules
  – Enforces these schedules at runtime
• Built and evaluated three applications
  – Easy to use
  – Precise
• Future work
  – More applications
  – Similar specialization ideas on sequential programs

Related Work

• Program analysis for parallel programs
  – Chord (PLDI ’06), RADAR (PLDI ’08), FastTrack (PLDI ’09)
• Slicing
  – Horgan (PLDI ’90), Bouncer (SOSP ’07), Jhala (PLDI ’05), Weiser (PhD thesis), Zhang (PLDI ’04)
• Deterministic multithreading
  – DMP (ASPLOS ’09), Kendo (ASPLOS ’09), CoreDet (ASPLOS ’10), Tern (OSDI ’10), Peregrine (SOSP ’11), DTHREADS (SOSP ’11)
• Program specialization
  – Consel (POPL ’93), Gluck (ISPL ’95), Jørgensen (POPL ’92), Nirkhe (POPL ’92), Reps (PDSPE ’96)

Backup Slides

Specialization Time

[Figure: chart not preserved in the text.]

Handling Races

• We do not assume data-race freedom.
• We could if our only goal is optimization.

Input Coverage

• Use runtime verification for the inputs not covered

• A small set of schedules can cover a wide range of inputs
