DCatch: Automatically Detecting Distributed Concurrency ...shanlu/paper/Dcatch-final.pdfMachine1...

Preview:

Citation preview

1

DCatch: Automatically Detecting Distributed Concurrency Bugs

in Cloud Systems

Haopeng Liu, Guangpu Li, Jeffrey Lukman, Jiaxin Li,

Shan Lu, Haryadi Gunawi, and Chen Tian*

*

2

Cloud systems

3

Cloud systems

4

Distributed concurrency bugs (DCbugs)

5

Distributed concurrency bugs (DCbugs)

• Unexpected timing among distributed operations

6

Distributed concurrency bugs (DCbugs)

• Unexpected timing among distributed operations

• Example

BA C

MapReduce-3274

7

Distributed concurrency bugs (DCbugs)

• Unexpected timing among distributed operations

• Example

BA C BA C

MapReduce-3274

hang

8

DCbugs need to be tackled

• Common in distributed systems [1, 2, 3]

– 26% failures caused by non-deterministic [1]

– 6% software bugs in clouds system [2]

[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16

9

DCbugs need to be tackled

• Common in distributed systems [1, 2, 3]

• Difficult to avoid, expose and diagnose

[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16

10

DCbugs need to be tackled

• Common in distributed systems [1, 2, 3]

• Difficult to avoid, expose and diagnose

“That is one monster of a race!”

Hadoop Map/Reduce / MAPREDUCE-3274

[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16

11

DCbugs need to be tackled

• Common in distributed systems [1, 2, 3]

• Difficult to avoid, expose and diagnose

“That is one monster of a race!”

Hadoop Map/Reduce / MAPREDUCE-3274

“There isn’t a week going by without new bugs about races.”

HBase / HBASE-4397

[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16

12

DCbugs need to be tackled

• Common in distributed systems [1, 2, 3]

• Difficult to avoid, expose and diagnose

“That is one monster of a race!”

Hadoop Map/Reduce / MAPREDUCE-3274

“There isn’t a week going by without new bugs about races.”

HBase / HBASE-4397

“We have already fix many cases, however it seems exist many other [racing] cases.”

HBase / HBASE-6147

[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16

13

DCbugs need to be tackled

• Common in distributed systems [1, 2, 3]

• Difficult to avoid, expose and diagnose

“That is one monster of a race!”

Hadoop Map/Reduce / MAPREDUCE-3274

“There isn’t a week going by without new bugs about races.”

HBase / HBASE-4397

“We have already found and fix many cases, however it seems exist many other cases.”

HBase / HBASE-6147

“This has become quite messy, sigh.”

Hadoop Map/Reduce / MAPREDUCE-4819

[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16

14

DCbugs need to be tackled

• Common in distributed systems [1, 2, 3]

• Difficult to avoid, expose and diagnose

“That is one monster of a race!”

Hadoop Map/Reduce / MAPREDUCE-3274

“There isn’t a week going by without new bugs about races.”

HBase / HBASE-4397

“We have already found and fix many cases, however it seems exist many other cases.”

HBase / HBASE-6147

“This has become quite messy, sigh.”

Hadoop Map/Reduce / MAPREDUCE-4819

“Great catch, Sid! Apologies for missing the race condition.”

Hadoop Map/Reduce / MAPREDUCE-4099

[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16

15

[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16

DCbugs need to be tackled

• Common in distributed systems [1, 2, 3]

• Difficult to avoid, expose and diagnose

“That is one monster of a race!”

Hadoop Map/Reduce / MAPREDUCE-3274

“There isn’t a week going by without new bugs about races.”

HBase / HBASE-4397

“We have already found and fix many cases, however it seems exist many other cases.”

HBase / HBASE-6147

“This has become quite messy, sigh.”

Hadoop Map/Reduce / MAPREDUCE-4819

“Great catch, Sid! Apologies for missing the race condition.”

Hadoop Map/Reduce / MAPREDUCE-4099

“We [prefer] debug crashes instead of hanging jobs.”

Hadoop Map/Reduce / MAPREDUCE-3634

16

[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16

DCbugs need to be tackled

• Common in distributed systems [1, 2, 3]

• Difficult to avoid, expose and diagnose

“That is one monster of a race!”

Hadoop Map/Reduce / MAPREDUCE-3274

“There isn’t a week going by without new bugs about races.”

HBase / HBASE-4397

“We have already found and fix many cases, however it seems exist many other cases.”

HBase / HBASE-6147

“This has become quite messy, sigh.”

Hadoop Map/Reduce / MAPREDUCE-4819

“Great catch, Sid! Apologies for missing the race condition.”

Hadoop Map/Reduce / MAPREDUCE-4099

“We [prefer] debug crashes instead of hanging jobs.”

Hadoop Map/Reduce / MAPREDUCE-3634

Can we detect DCbugs before they manifest?

17

Previous work

• Model checking

– Work on abstracted models

– Face state-space explosion issue

18

Our idea

• Follow the philosophy of traditional concurrency bug detection

19

Our idea

• Follow the philosophy of traditional concurrency bug detection

Machine 2Machine 1 Machine 3 Machine 4

20

Our idea

• Follow the philosophy of traditional concurrency bug detection

Machine 2Machine 1 Machine 3 Machine 4

21

Our idea

• Follow the philosophy of traditional concurrency bug detection

Machine 2Machine 1 Machine 3 Machine 4

22

Example

BA C

23

Example

BA C

B

//UnReg thread

void unReg(jID){

jMap.remove(jID);

....

}

//RPC thread

Task getTask(jID){

...

return jMap.get(jID);

}

24

Local concurrency bug detection

25

Local concurrency bug detection

26

Local concurrency bug detection

Is the problem solved?

27

Local concurrency bug detection

Trace

T1 T2 T3

28

Local concurrency bug detection

Trace

T1 T2 T3

C1

C1: How to handle the hugeamount of mem accesses?

Challenges

29

Local concurrency bug detection

.

.

.

Trace HB

T1 T2 T3

C1

C1: How to handle the hugeamount of mem accesses?

Challenges

30

Local concurrency bug detection

.

.

.

Trace HB

T1 T2 T3

C1

C1: How to handle the hugeamount of mem accesses?

Challenges

rw

31

Local concurrency bug detection

.

.

.

Trace HB

T1 T2 T3

C1

C2

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

Challenges

rw

32

Local concurrency bug detection

.

.

.

Trace TriageHB

T1 T2 T3 .

.

.

C1

C2

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

Challenges

assert(r)

rw

rw

33

Local concurrency bug detection

.

.

.

Trace TriageHB

T1 T2 T3 .

.

.

C1

C2

C3

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

Challenges

assert(r)

rw

rw

34

Local concurrency bug detection

.

.

.

T1 T2 T3 .

.

.

.

.

.

Trace Triage TriggerHB

C1

C2

C3

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

Challenges

+sleeprw

rw

assert(r)

35

Local concurrency bug detection

.

.

.

Trace Triage TriggerHB

T1 T2 T3 .

.

.

.

.

.

C1

C2

C3

C4

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

C4: How to trigger withdistributed time manipulation?

Challenges

+sleeprw

assert(r)

rw

36

Contribution

• A comprehensive HB Model for distributed systems

37

Contribution

Trace Triage TriggerHB

C1

C2

C3

C4

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-before model?

C3: How to estimate thedistributed impact of a race?

C4: How to trigger withdistributed time manipulation?

ChallengesSolved by

DCatch

• A comprehensive HB Model for distributed systems

• DCatch tool detects DCbugs from correct runs

38

Contribution

• A comprehensive HB Model for distributed systems

• DCatch tool detects DCbugs from correct runs

• Evaluate on 4 systems

• Report 32 DCbugs, with 20 of them being truly harmful

39

Outline

• Motivation

• DCatch Happens-before Model

• DCatch tool

• Evaluation

• Conclusion

40

DCatch Happens-before Model

HMaster

Thd

w

41

DCatch Happens-before Model

Thd

HMaster

Thd

w

42

DCatch Happens-before Model

ThdThd

HMaster HRegionServer

Thd

w

43

DCatch Happens-before Model

Thd Event threadThd

HMaster

e

HRegionServer

Thd

w

44

DCatch Happens-before Model

Thd Event threadThd

HMaster

e

HRegionServer

Thd

ZK Coordinator

w

45

DCatch Happens-before Model

ThdThd Event threadThd

HMaster

e

HRegionServer

Thd

ZK Coordinator

w

46

DCatch Happens-before Model

ThdThd Event threadThd

HMaster

e

HRegionServer

Thd

ZK Coordinator

w

r

47

DCatch Happens-before Model

ThdThd Event threadThd

HMaster

e

HRegionServer

Thd

ZK Coordinator

w

r

Where is HB model for distributed systems?

48

DCatch Happens-before Model

ThdThd Eve thdThd

HMaster HRegionServer

Thd

ZK Coordinator

w

r

49

DCatch Happens-before Model

Dist.Loc.

ThdThd Eve thdThd

HMaster HRegionServer

Thd

ZK Coordinator

w

r

Dist.

50

DCatch Happens-before Model

Dist.Loc.

Async.

Sync.ThdThd Eve thdThd

HMaster HRegionServer

Thd

ZK Coordinator

w

r

Dist.Async.

51

DCatch Happens-before Model

Stand.

Dist.Loc.Cust.

Async.

Sync.ThdThd Eve thdThd

HMaster HRegionServer

Thd

ZK Coordinator

w

r

Dist.Async.

Cust.

52

Distributed rules

DistributedLocal

StandardCustom

Async.

Sync.

53

Distributed rule #1

• Logical time clock (Leslie Lamport, 1978)

Send

Recv

Machine1 Machine2

Standard

Asy

nch

.

Customize

Syn

ch.

Socket

54

Distributed rule #1

• Logical time clock (Leslie Lamport, 1978)

Send

Recv

Machine1 Machine2

Standard

Asy

nch

.

Customize

Syn

ch.

Socket

55

Distributed rule #2

RPC-call

RPC-begin

Machine1 Machine2

RPC-rtRPC-end

Standard

Asy

nch

.

Customize

Syn

ch.

RPC

Socket

56

Distributed rule #2

RPC-call

RPC-begin

Machine1 Machine2

RPC-rtRPC-end

Standard

Asy

nch

.

Customize

Syn

ch.

RPC

Socket

waiting

57

Distributed rule #2

RPC-call

RPC-begin

Machine1 Machine2

RPC-rtRPC-end

Standard

Asy

nch

.

Customize

Syn

ch.

RPC

Socket

waiting

58

Distributed rule #3

//Thread1

flag = True;

//Thread

while(!flag){

}

...

In multi-threaded systems:

Standard

Asy

nch

.

Customize

Syn

ch.

Socket

RPC

Dist. while-loop

59

Distributed rule #3

//Thread1

flag = True;

//Thread

while(!flag){

}

...

In multi-threaded systems:

Standard

Asy

nch

.

Customize

Syn

ch.

Socket

RPC

Dist. while-loop

60

Distributed rule #3

//Thread1

flag = True;

//Thread

while(!flag){

}

...

In multi-threaded systems:

Standard

Asy

nch

.

Customize

Syn

ch.

Socket

RPC

Dist. while-loop

In distributed systems:

61

Distributed rule #3

//Thread1

flag = True;

//Thread

while(!getFlag()){

}

...

//Thread2

bool getFlag(){

return flag;

}

Machine BMachine A

In distributed systems:

Standard

Asy

nch

.

Customize

Syn

ch.

Socket

RPC

Dist. while-loop

62

Distributed rule #3

//Thread1

flag = True;

//Thread

while(!getFlag()){

}

...

//Thread2

bool getFlag(){

return flag;

}

Machine BMachine A

In distributed systems:

Standard

Asy

nch

.

Customize

Syn

ch.

Socket

RPC

Dist. while-loop

63

Distributed rule #4

ThdThd Eve thdThdThd

ZK Coordinator

w

r

Standard

Asy

nch

.

Customize

Syn

ch.

Socket

RPC

ZooKeeper Service

64

Distributed rule #4

Standard

Asy

nch

.

Customize

Syn

ch.

ZooKeeper Service

Socket

RPCThdThd Eve thdThdThd

ZK Coordinator

w

r

65

Distributed rules

StandardCustomized

Asy

nch

ron

ou

sSy

nch

ron

ou

sRPCDist. While-loop

SocketZookeeper Service

66

Local rules

DistributedLocal

67

Local rules

StandardCustomized

Asy

nch

ron

ou

sSy

nch

ron

ou

s

Event-relatedn/a

68

Local rules

StandardCustomized

Asy

nch

ron

ou

sSy

nch

ron

ou

s

Event-related

Thread fork/joinWhile-loop

n/a

69

Local rules

StandardCustomized

Asy

nch

ron

ou

sSy

nch

ron

ou

sThread fork/joinWhile-loop

Event-relatedn/a

70

Outline

• Motivation

• DCatch Happens-before Model

• DCatch tool

• Evaluation

• Conclusion

71

Triage TriggerTrace HB

72

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

C4: How to trigger withdistributed time manipulation?

Challenges

Triage TriggerTrace HB

Selective tracing: only mem accesses inEvent/message handlers and their callee.

73

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

C4: How to trigger withdistributed time manipulation?

Challenges

Triage TriggerTrace HB

Local Distributed

[1] Raychev. Effective Race Detection for Event-Driven Programs. In OOPSLA’13

74

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

C4: How to trigger withdistributed time manipulation?

Challenges

Triage TriggerTrace HB

Machine B

//RPC thread

Task getTask(jID){

...

return jMap.get(jID);

}

//UnReg thread

void unReg(jID){

jMap.remove(jID);

....

}

75

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

C4: How to trigger withdistributed time manipulation?

Challenges

Triage TriggerTrace HB

Machine BMachine A

//RPC thread

Task getTask(jID){

...

return jMap.get(jID);

}

//UnReg thread

void unReg(jID){

jMap.remove(jID);

....

}

while(!getTask(jID)){

}

76

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

C4: How to trigger withdistributed time manipulation?

Challenges

Triage TriggerTrace HB

Machine BMachine A

//RPC thread

Task getTask(jID){

...

return jMap.get(jID);

}

//UnReg thread

void unReg(jID){

jMap.remove(jID);

....

}

while(!getTask(jID)){

}

77

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

C4: How to trigger withdistributed time manipulation?

Challenges

Triage TriggerTrace HB

Machine BMachine A

//RPC thread

Task getTask(jID){

...

return jMap.get(jID);

}

//UnReg thread

void unReg(jID){

jMap.remove(jID);

....

}

while(!getTask(jID)){

}

hang

78

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

C4: How to trigger withdistributed time manipulation?

Challenges

Triage TriggerTrace HB

Thd ThdEvent thread

e2

e1

w

r

79

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

C4: How to trigger withdistributed time manipulation?

Challenges

Triage TriggerTrace HB

Thd ThdEvent thread

e2

e1

w

r +sleep

80

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

C4: How to trigger withdistributed time manipulation?

Challenges

Triage TriggerTrace HB

Thd ThdEvent thread

e2

e1

w

r

+sleep

81

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

C4: How to trigger withdistributed time manipulation?

Challenges

Triage TriggerTrace HB

Thd

Machine A

Thd

Machine C

Thd ThdEvent thread

e2

e1

w

r

Machine B

82

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

C4: How to trigger withdistributed time manipulation?

Challenges

Triage TriggerTrace HB

Thd

Machine A

Thd

Machine C

Thd ThdEvent thread

e2

e1

w

r

Machine B

+sleep

+sleep

83

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

C4: How to trigger withdistributed time manipulation?

Challenges

Triage TriggerTrace HB

Thd

Machine A

Thd

Machine C

Thd ThdEvent thread

e2

e1

w

r

Machine B

+sleep

+sleep+sleep

+sleep

84

Outline

• Motivation

• DCatch Happens-before Model

• DCatch tool

• Evaluation

• Conclusion

85

• Benchmarks:

– 7 real-world DCbugs from TaxDC[1]

– 4 distributed systems

Methodology

[1] Leesatapornwongsa. TaxDC. In ASPLOS’16

86

Overall resultsBugID Detected? #. Bugs #. Benign #. false-pos

CA-1011 ✔ 3 0 0

HB-4539 ✔ 3 0 1

HB-4729 ✔ 4 1 0

MR-3274

✔ 2 0 4

MR-4637

✔ 1 2 4

ZK-1144 ✔ 5 1 1

ZK-1270 ✔ 6 2 0

Total 20 5 7

87

Overall resultsBugID Detected? #. Bugs #. Benign #. false-pos

CA-1011 ✔ 3 0 0

HB-4539 ✔ 3 0 1

HB-4729 ✔ 4 1 0

MR-3274

✔ 2 0 4

MR-4637

✔ 1 2 4

ZK-1144 ✔ 5 1 1

ZK-1270 ✔ 6 2 0

Total 20 5 7

88

Overall resultsBugID Detected? #. Bugs #. Benign #. false-pos

CA-1011 ✔ 3 0 0

HB-4539 ✔ 3 0 1

HB-4729 ✔ 4 1 0

MR-3274

✔ 2 0 4

MR-4637

✔ 1 2 4

ZK-1144 ✔ 5 1 1

ZK-1270 ✔ 6 2 0

Total 20 5 7

= 12 + 8

89

Overall resultsBugID Detected? #. Bugs #. Benign #. false-pos

CA-1011 ✔ 3 0 0

HB-4539 ✔ 3 0 1

HB-4729 ✔ 4 1 0

MR-3274

✔ 2 0 4

MR-4637

✔ 1 2 4

ZK-1144 ✔ 5 1 1

ZK-1270 ✔ 6 2 0

Total 20 5 7

90

Other results in our paper

• Performance overhead

• Trace compositions

• HB model impact

– False-positive

– False-negatives

• …

91

Outline

• Motivation

• DCatch Happens-before Model

• DCatch tool

• Evaluation

• Conclusion

92

Conclusion

• A HB Model for distributed systems

• DCatch detects DCbugs from correct runs with low false positive rates.

Trace Triage TriggerHB

C1

C2

C3

C4

Local Distributed

93

Thank you!Q&A

Recommended