DCatch: Automatically Detecting Distributed Concurrency ...shanlu/paper/Dcatch-final.pdfMachine1...

DCatch: Automatically Detecting Distributed Concurrency Bugs

in Cloud Systems

Haopeng Liu, Guangpu Li, Jeffrey Lukman, Jiaxin Li,

Shan Lu, Haryadi Gunawi, and Chen Tian*

Cloud systems

Distributed concurrency bugs (DCbugs)

• Unexpected timing among distributed operations

• Example

MapReduce-3274

• Example

BA C BA C

MapReduce-3274

DCbugs need to be tackled

• Common in distributed systems [1, 2, 3]

– 26% failures caused by non-deterministic [1]

– 6% software bugs in clouds system [2]

[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16

• Difficult to avoid, expose and diagnose

“That is one monster of a race!”

Hadoop Map/Reduce / MAPREDUCE-3274

“There isn’t a week going by without new bugs about races.”

HBase / HBASE-4397

“We have already fix many cases, however it seems exist many other [racing] cases.”

HBase / HBASE-6147

HBase / HBASE-4397

“We have already found and fix many cases, however it seems exist many other cases.”

HBase / HBASE-6147

“This has become quite messy, sigh.”

HBase / HBASE-4397

HBase / HBASE-6147

“Great catch, Sid! Apologies for missing the race condition.”

HBase / HBASE-4397

HBase / HBASE-6147

“We [prefer] debug crashes instead of hanging jobs.”

HBase / HBASE-4397

HBase / HBASE-6147

“We [prefer] debug crashes instead of hanging jobs.”

Can we detect DCbugs before they manifest?

Previous work

• Model checking

– Work on abstracted models

– Face state-space explosion issue

Our idea

• Follow the philosophy of traditional concurrency bug detection

Our idea

Machine 2Machine 1 Machine 3 Machine 4

Our idea

Example

//UnReg thread

void unReg(jID){

jMap.remove(jID);

//RPC thread

Task getTask(jID){

return jMap.get(jID);

Local concurrency bug detection

Is the problem solved?

T1 T2 T3

C1: How to handle the hugeamount of mem accesses?

Challenges

Trace HB

T1 T2 T3

Challenges

Trace HB

T1 T2 T3

Challenges

Trace HB

T1 T2 T3

C2: What’s the happens-beforemodel?

Challenges

Trace TriageHB

T1 T2 T3 .

Challenges

assert(r)

Trace TriageHB

T1 T2 T3 .

C3: How to estimate thedistributed impact of a race?

Challenges

assert(r)

T1 T2 T3 .

Trace Triage TriggerHB

Challenges

+sleeprw

assert(r)

T1 T2 T3 .

C4: How to trigger withdistributed time manipulation?

Challenges

+sleeprw

assert(r)

Contribution

• A comprehensive HB Model for distributed systems

Contribution

C2: What’s the happens-before model?

ChallengesSolved by

DCatch

• DCatch tool detects DCbugs from correct runs

Contribution

• DCatch tool detects DCbugs from correct runs

• Evaluate on 4 systems

• Report 32 DCbugs, with 20 of them being truly harmful

Outline

• Motivation

• DCatch Happens-before Model

• DCatch tool

• Evaluation

• Conclusion

DCatch Happens-before Model

HMaster

ThdThd

HMaster HRegionServer

Thd Event threadThd

HMaster

HRegionServer

Thd Event threadThd

HMaster

HRegionServer

ZK Coordinator

ThdThd Event threadThd

HMaster

HRegionServer

ZK Coordinator

HMaster

HRegionServer

ZK Coordinator

HMaster

HRegionServer

ZK Coordinator

Where is HB model for distributed systems?

ThdThd Eve thdThd

ZK Coordinator

Dist.Loc.

ThdThd Eve thdThd

ZK Coordinator

Dist.Loc.

Async.

Sync.ThdThd Eve thdThd

ZK Coordinator

Dist.Async.

Stand.

Dist.Loc.Cust.

Async.

Sync.ThdThd Eve thdThd

ZK Coordinator

Dist.Async.

Distributed rules

DistributedLocal

StandardCustom

Async.

Distributed rule #1

• Logical time clock (Leslie Lamport, 1978)

Machine1 Machine2

Standard

Customize

Socket

Distributed rule #1

• Logical time clock (Leslie Lamport, 1978)

Machine1 Machine2

Standard

Customize

Socket

Distributed rule #2

RPC-call

RPC-begin

Machine1 Machine2

RPC-rtRPC-end

Standard

Customize

Socket

Distributed rule #2

RPC-call

RPC-begin

Machine1 Machine2

RPC-rtRPC-end

Standard

Customize

Socket

waiting

Distributed rule #2

RPC-call

RPC-begin

Machine1 Machine2

RPC-rtRPC-end

Standard

Customize

Socket

waiting

Distributed rule #3

//Thread1

flag = True;

//Thread

while(!flag){

In multi-threaded systems:

Standard

Customize

Socket

Dist. while-loop

Distributed rule #3

//Thread1

flag = True;

//Thread

while(!flag){

Standard

Customize

Socket

Dist. while-loop

Distributed rule #3

//Thread1

flag = True;

//Thread

while(!flag){

Standard

Customize

Socket

Dist. while-loop

In distributed systems:

Distributed rule #3

//Thread1

flag = True;

//Thread

while(!getFlag()){

//Thread2

bool getFlag(){

return flag;

Machine BMachine A

Standard

Customize

Socket

Dist. while-loop

Distributed rule #3

//Thread1

flag = True;

//Thread

while(!getFlag()){

//Thread2

bool getFlag(){

return flag;

Machine BMachine A

Standard

Customize

Socket

Dist. while-loop

Distributed rule #4

ThdThd Eve thdThdThd

ZK Coordinator

Standard

Customize

Socket

ZooKeeper Service

Distributed rule #4

Standard

Customize

ZooKeeper Service

Socket

RPCThdThd Eve thdThdThd

ZK Coordinator

Distributed rules

StandardCustomized

sRPCDist. While-loop

SocketZookeeper Service

Local rules

DistributedLocal

Local rules

StandardCustomized

Event-relatedn/a

Local rules

StandardCustomized

Event-related

Thread fork/joinWhile-loop

Local rules

StandardCustomized

sThread fork/joinWhile-loop

Event-relatedn/a

Outline

• Motivation

• DCatch tool

• Evaluation

• Conclusion

Triage TriggerTrace HB

Challenges

Selective tracing: only mem accesses inEvent/message handlers and their callee.

Challenges

Local Distributed

[1] Raychev. Effective Race Detection for Event-Driven Programs. In OOPSLA’13

Challenges

Machine B

//RPC thread

Task getTask(jID){

//UnReg thread

void unReg(jID){

jMap.remove(jID);

Challenges

Machine BMachine A

//RPC thread

Task getTask(jID){

//UnReg thread

void unReg(jID){

jMap.remove(jID);

while(!getTask(jID)){

Challenges

Machine BMachine A

//RPC thread

Task getTask(jID){

//UnReg thread

void unReg(jID){

jMap.remove(jID);

Challenges

Machine BMachine A

//RPC thread

Task getTask(jID){

//UnReg thread

void unReg(jID){

jMap.remove(jID);

Challenges

Thd ThdEvent thread

Challenges

Thd ThdEvent thread

r +sleep

Challenges

Thd ThdEvent thread

+sleep

Challenges

Machine A

Machine C

Thd ThdEvent thread

Machine B

Challenges

Machine A

Machine C

Thd ThdEvent thread

Machine B

+sleep

Challenges

Machine A

Machine C

Thd ThdEvent thread

Machine B

+sleep

+sleep+sleep

+sleep

Outline

• Motivation

• DCatch tool

• Evaluation

• Conclusion

• Benchmarks:

– 7 real-world DCbugs from TaxDC[1]

– 4 distributed systems

Methodology

[1] Leesatapornwongsa. TaxDC. In ASPLOS’16

Overall resultsBugID Detected? #. Bugs #. Benign #. false-pos

CA-1011 ✔ 3 0 0

HB-4539 ✔ 3 0 1

HB-4729 ✔ 4 1 0

MR-3274

✔ 2 0 4

MR-4637

✔ 1 2 4

ZK-1144 ✔ 5 1 1

ZK-1270 ✔ 6 2 0

Total 20 5 7

CA-1011 ✔ 3 0 0

HB-4539 ✔ 3 0 1

HB-4729 ✔ 4 1 0

MR-3274

✔ 2 0 4

MR-4637

✔ 1 2 4

ZK-1144 ✔ 5 1 1

ZK-1270 ✔ 6 2 0

Total 20 5 7

CA-1011 ✔ 3 0 0

HB-4539 ✔ 3 0 1

HB-4729 ✔ 4 1 0

MR-3274

✔ 2 0 4

MR-4637

✔ 1 2 4

ZK-1144 ✔ 5 1 1

ZK-1270 ✔ 6 2 0

Total 20 5 7

= 12 + 8

CA-1011 ✔ 3 0 0

HB-4539 ✔ 3 0 1

HB-4729 ✔ 4 1 0

MR-3274

✔ 2 0 4

MR-4637

✔ 1 2 4

ZK-1144 ✔ 5 1 1

ZK-1270 ✔ 6 2 0

Total 20 5 7

Other results in our paper

• Performance overhead

• Trace compositions

• HB model impact

– False-positive

– False-negatives

• …

Outline

• Motivation

• DCatch tool

• Evaluation

• Conclusion

Conclusion

• A HB Model for distributed systems

• DCatch detects DCbugs from correct runs with low false positive rates.

Local Distributed

Thank you!Q&A

DCatch: Automatically Detecting Distributed Concurrency ...shanlu/paper/Dcatch-final.pdfMachine1...

Documents

The Decorator Construction Principle · Mobile robots navigate between machines Robot2 Machine1 Machine2 Machine3 Robot1 Machine4. 22 Examples of robots

Kinetic lightning machine2

PCatch: Automatically Detecting Performance …shanlu/paper/eurosys18.pdfPCatch: Automatically Detecting Performance Cascading Bugs in Cloud Systems EuroSys ’18, April 23–26, 2018,

Distributed Database: Part 2. Distributed DBMS Distributed database requires distributed DBMS Distributed database requires distributed DBMS Functions

Distributed Transactions 7. Transaction …Distributed Transactions 7. Transaction Management for distributed databases 28 Distributed Transactions 29 Distributed Commit 30 Distributed

Models TR8, TR10, TR12 - Pequea Machine2 INTRODUCTION Thank-You for choosing the Pequea Turbo Rake. Your rake is the result of years of research and development work. This Operator’s

DCatch: Automatically Detecting Distributed …shanlu/paper/asplos17-preprint.pdfDCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems ... University of Chicago

Interruptible Tasks: Treating Memory Pressure As ...shanlu/paper/sosp15-itask.pdfInterruptible Tasks: Treating Memory Pressure As Interrupts for Highly Scalable Data-Parallel Programs

View-Centric Performance Optimization for Database-Backed ...shanlu/paper/panorama.pdf · ones. This user study result, as well as the fact that these optimizations save computation

NDG Security: Distributed Governance, Distributed Access Control, Distributed Data

Outline Introduction Background Distributed DBMS Architecture Distributed Database Design Distributed Query Processing Distributed Transaction

The Byzantine Generals Problempeople.cs.uchicago.edu/~shanlu/teaching/33100_wi15/papers/byz.pdfNow consider another scenario, shown in Figure 2, in which the commander is a traitor

ConSeq: Detecting Concurrency Bugs through Sequential Errorspeople.cs.uchicago.edu/~shanlu/preprint/asplos191-zhang.pdf · 2016-03-29 · ConSeq: Detecting Concurrency Bugs through

DCatch: Automatically Detecting Distributed Concurrency ...people.cs.uchicago.edu/~shanlu/paper/asplos17-preprint.pdf · In big data and cloud computing era, distributed cloud soft-ware

Leaflet 3 General New - krishbiotech.comkrishbiotech.com/download/leaflet_3__general_new.pdf · Title: Leaflet_3_ General_New Author: Machine2 Created Date: 7/24/2015 2:30:45 PM

Repro Machine2-20181213173400 - Denefield School Accounts 2017 - 2018.pdf · 2019-01-07 · Title: Repro Machine2-20181213173400 Created Date: 20181213173400Z

FCatch: Automatically Detecting Time-of-fault Bugs in ...shanlu/paper/asplos18_fcatch.pdf · system for small time windows where a particular node’s crash is improperly handled

Skyway: Connecting Managed Heaps in Distributed Big Data ...shanlu/paper/asplos18_skyway.pdfdependent meta data stored in an object without changing the object format. This output

View-Centric Performance ... - University of Chicagopeople.cs.uchicago.edu/~shanlu/paper/panorama.pdf · View-Centric Performance Optimization for Database-Backed Web Applications

Veritas Cluster Server - Cisco · † The /etc/hosts file on Machine2 should appear as follows: 127.0.0.1 localhost 1.1.2.1 Machine2 loghost 10.1.1.2 Machine2 SM_REP2 REP_2_NIC_1