
How do we troubleshoot this?bestchai/papers/nsdi14_netcheck... · 2014. 6. 23. · Blackbox Tracing mechanism How Model Return Values Impact Trace Ordering Trace Ordering: linear


How do we troubleshoot this?

How does Esmeralda know how to fix this?


Goal

• Find bugs in networked applications
• Large, complex, unknown applications
• Large, complex, unknown networks
• Understandable output / fix

Motivation

[Figure: Chrome client and Apache server]

Motivation: probing with ping

[Figure: Chrome client and Apache server]

Different traffic (ICMP), often a different result.

Motivation: packet capture

[Figure: Chrome client and Apache server]

Requires detailed protocol / app knowledge.

Motivation: model the apps (Magpie, Xtrace, Pip, ...)

[Figure: Chrome client and Apache server, each with a model]

Need a model per application.

Motivation: network config analysis (Header Space Analysis, etc.)

[Figure: Chrome client, Apache server, and four "Model & Config" boxes]

Need detailed network knowledge: HW + config.

Motivation

[Figure: Chrome client, Apache server, and a question mark]

NetCheck

[Figure: Chrome client and Apache server, each with a programmer]

Model: the programmer's understanding (Deutsch's Fallacies)

Outline

• Motivation
• NetCheck Overview
• Trace Ordering
• Network Model
• Fault Classification
• Results / Conclusion

NetCheck overview

[Figure: an application fails; traces are gathered (ktrace, strace); NetCheck turns the traces into likely faults]

[Figure: NetCheck components: ordering algorithm, network model, diagnoses engine. Input: host traces. Output: diagnosis. Arrow labels: syscall, simulation result, simulation state, errors]

[Figure: sample NetCheck output with sections "Network Configuration Issues", "Traffic Statistics", and "Problem Detected"]

Trace Ordering

Traces

• Series of locally ordered system calls
• Don’t want to modify apps or use a global clock
• Gathered by strace, ktrace, systrace, truss, etc.
• Call arguments and “return values”

socket() = 3
bind(3, …) = 0
listen(3, 1) = 0
accept(3, …) = 4
recv(4, "HTTP", …) = 4
close(4) = 0

[Figure labels on the trace: call arguments, return values, return buffer]
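A minimal sketch of reading such a trace, assuming the simplified `name(args) = ret` line format shown on the slide; real strace or ktrace output carries more fields. The names `TraceEntry` and `parse_line` are hypothetical and not part of NetCheck.

```python
import re
from typing import NamedTuple, Optional


class TraceEntry(NamedTuple):
    name: str             # syscall name, e.g. "recv"
    args: str             # raw text between the parentheses
    ret: int              # integer return value
    errno: Optional[str]  # symbolic error such as "ECONNREFUSED", if present


# Matches lines such as:  recv(4, "HTTP", ...) = 4
#                    or:  connect(3, ...) = -1, ECONNREFUSED
_LINE = re.compile(r'^\s*(\w+)\((.*)\)\s*=\s*(-?\d+)(?:,?\s*([A-Z]+))?\s*$')


def parse_line(line: str) -> TraceEntry:
    """Parse one simplified trace line into its call, arguments, and result."""
    m = _LINE.match(line)
    if m is None:
        raise ValueError(f"unrecognized trace line: {line!r}")
    name, args, ret, err = m.groups()
    return TraceEntry(name, args.strip(), int(ret), err)


# The server-side trace from the slide, one entry per line.
server_trace = [parse_line(line) for line in (
    'socket() = 3',
    'bind(3, ...) = 0',
    'listen(3, 1) = 0',
    'accept(3, ...) = 4',
    'recv(4, "HTTP", ...) = 4',
    'close(4) = 0',
)]
```

Each host yields one such list; the entries keep their local order, which is all the ordering step has to work with.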


What we see is this:

Node A                           Node B
1. socket() = 3                  1. socket() = 3
2. bind(3, ...) = 0              2. connect(3, ...) = 0
3. listen(3, 1) = 0              3. send(3, "Hello", ...) = 5
4. accept(3, ...) = 4            4. close(3) = 0
5. recv(4, "Hello", ...) = 5
6. close(4) = 0

- one trace per host
- local order but no global order

Q: how do we reconstruct what really happened?

What we want is this: the ground truth

A1. socket() = 3
B1. socket() = 3
A2. bind(3, ...) = 0
A3. listen(3, 1) = 0
B2. connect(3, ...) = 0
A4. accept(3, ...) = 4
B3. send(3, "Hello", ...) = 5
A5. recv(4, "Hello", ...) = 5
B4. close(3) = 0
A6. close(4) = 0

Goal: find an equivalent interleaving

Observation 1: Order Equivalence

The socket() calls are not visible to the other side, so some orders are equivalent!

Observation 2: Return Values Guide Ordering

One valid ordering: all syscalls returned successfully.

A2. bind(3, ...) = 0
A3. listen(3, 1) = 0
B2. connect(3, ...) = 0

A second valid ordering: connect failed with ECONNREFUSED.

A2. bind(3, ...) = 0
B2. connect(3, ...) = -1, ECONNREFUSED
A3. listen(3, 1) = 0

A call’s return value may-depend-on a remote call’s action. The result indicates the order of calls.
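The rule on this slide can be sketched as a small function. It is an illustration under the simplified view used here, where a refused connect is taken to mean that nobody was listening yet; the function name `connect_vs_listen` and its return strings are hypothetical.

```python
from typing import Optional


def connect_vs_listen(connect_ret: int, connect_errno: Optional[str] = None) -> str:
    """Infer where the client's connect falls relative to the server's listen.

    Only the traced result of connect is used; no clocks are compared.
    """
    if connect_ret == 0:
        # A successful connect needs a listening socket on the other side.
        return "listen-before-connect"
    if connect_errno == "ECONNREFUSED":
        # Simplified assumption: the connection was refused because
        # the server had not called listen yet.
        return "connect-before-listen"
    return "unknown"
```

For the two orderings above, `connect_vs_listen(0)` gives the first and `connect_vs_listen(-1, "ECONNREFUSED")` gives the second.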


Deciding call order

[Figure: full set of may-depend-on relations among the calls: socket; bind; listen; accept; connect; send, sendto, sendmsg, write, writev, sendfile; recv, recvfrom, recvmsg, read; close, shutdown; getsockopt, setsockopt; getsockname; getpeername; poll, select]

Ordering Algorithm

[Figure sequence: the algorithm consumes the input traces and builds the output ordering one call at a time]

Input traces:
- Host A: socket, bind, listen, accept, recv
- Host B: socket, connect, send

Steps shown:
1. Try socket on host A: accepted. Output ordering: socket (A).
2. Try connect on host B: rejected. Output ordering so far: socket (A), socket (B), bind (A).
3. Try listen on host A: accepted. Output ordering: socket (A), socket (B), bind (A), listen (A).
4. Try recv on host A: rejected. Output ordering so far: socket (A), socket (B), bind (A), listen (A), connect (B), accept (A). TCP buffer: “”; the traced recv expects “Hola!”.
5. Try send on host B: accepted. Output ordering gains send (B). TCP buffer: “Hello”.
6. Try recv on host A: fatal error. TCP buffer: “Hello”; the traced recv expects “Hola!”.
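The walkthrough above can be sketched as a greedy merge: offer each host's next call to a model and append it to the output when the model accepts it. This is a simplified illustration, not NetCheck's actual implementation. `ToyModel`, `order_traces`, and the `(name, ret, data)` tuple layout are assumptions made for this example, and the model covers a single TCP connection only.

```python
from collections import deque

ACCEPT, REJECT, FATAL = "accept", "reject", "fatal"


class ToyModel:
    """Toy model of one TCP connection: host A listens, host B connects and sends."""

    def __init__(self):
        self.listening = False
        self.pending = 0     # connections made by connect but not yet accepted
        self.buffer = b""    # bytes sent by the client and not yet received

    def simulate(self, call):
        name, ret, data = call
        if name in ("socket", "bind", "close"):
            return ACCEPT                  # no remote dependency in this toy
        if name == "listen":
            self.listening = True
            return ACCEPT
        if name == "connect":
            if ret == 0 and not self.listening:
                return REJECT              # success requires a listener first
            if ret == 0:
                self.pending += 1
            return ACCEPT                  # a failed connect is taken at face value
        if name == "accept":
            if self.pending == 0:
                return REJECT              # nothing to accept yet
            self.pending -= 1
            return ACCEPT
        if name == "send":
            self.buffer += data
            return ACCEPT
        if name == "recv":
            if self.buffer.startswith(data):
                self.buffer = self.buffer[len(data):]
                return ACCEPT              # correct buffer
            if data.startswith(self.buffer):
                return REJECT              # incomplete buffer: more data may arrive
            return FATAL                   # incorrect buffer: can never match
        return FATAL                       # call not modeled in this sketch


def order_traces(traces):
    """Merge per-host traces into one ordering; returns (ordering, error or None)."""
    model = ToyModel()
    queues = {host: deque(calls) for host, calls in traces.items()}
    ordering = []
    while any(queues.values()):
        progressed = False
        for host, queue in queues.items():
            if not queue:
                continue
            outcome = model.simulate(queue[0])
            if outcome == ACCEPT:
                ordering.append((host, queue.popleft()[0]))
                progressed = True
            elif outcome == FATAL:
                return ordering, f"fatal error at {host}:{queue[0][0]}"
        if not progressed:
            return ordering, "no call can be processed (all rejected)"
    return ordering, None


# Traces from the earlier slides, as (name, return value, data) tuples.
traces = {
    "A": [("socket", 3, b""), ("bind", 0, b""), ("listen", 0, b""),
          ("accept", 4, b""), ("recv", 5, b"Hello"), ("close", 0, b"")],
    "B": [("socket", 3, b""), ("connect", 0, b""),
          ("send", 5, b"Hello"), ("close", 0, b"")],
}
ordering, error = order_traces(traces)
```

If the traced recv on host A carries "Hola!" instead of "Hello", the same function stops with a fatal error right after send (B), as in the last step of the walkthrough. A rejected call is simply retried on a later round, which is why connect (B) ends up after listen (A).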


Network Model

● Simulates invocation of a syscall
  ○ datagrams sent/lost
  ○ reordering / duplication is notable
  ○ track pending connections
  ○ buffer lengths and contents
  ○ send → put data into buffer
  ○ recv → pop data from buffer

● Simulation outcome
  ○ Accept → can process (correct buffer)
  ○ Reject → wrong order (incomplete buffer)
  ○ Permanent reject → abnormal behavior (incorrect buffer)

[Figure: the model returns Accept, Reject, or Fatal Error]
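For the datagram side of the model, a minimal sketch of how sent datagrams might be tracked so that loss, reordering, and duplication become visible. The class `DatagramModel` and its method names are hypothetical, and the sketch assumes one UDP flow with distinct payloads.

```python
from collections import Counter


class DatagramModel:
    """Toy model of one UDP flow: records sends and flags notable deliveries."""

    def __init__(self):
        self.sent = []             # payloads in send order
        self.received = Counter()  # payload -> number of deliveries
        self.last_index = -1       # send-order index of the furthest delivery so far
        self.notes = []            # notable events: ("duplicate" | "reordered", payload)

    def sendto(self, payload):
        self.sent.append(payload)
        return "accept"

    def recvfrom(self, payload):
        if payload not in self.sent:
            return "reject"        # not sent yet: retry after more sends are processed
        self.received[payload] += 1
        index = self.sent.index(payload)
        if self.received[payload] > self.sent.count(payload):
            self.notes.append(("duplicate", payload))
        elif index < self.last_index:
            self.notes.append(("reordered", payload))
        self.last_index = max(self.last_index, index)
        return "accept"

    def lost(self):
        """Payloads that were sent but never delivered (call once traces are exhausted)."""
        return [p for p in self.sent if self.received[p] == 0]
```

Unlike the TCP buffer, nothing here is a fatal error: datagrams may legitimately be lost, duplicated, or reordered, so the sketch only records them as notable.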


Network Model

● Simulates invocation of a syscall
● Captures programmer assumptions
● Assumes a simplified network view
  • Assume transitive connectivity
  • Little, random loss
  • No middle boxes
● Assumes a uniform platform
  • Flag OS differences

How Model Return Values Impact Trace Ordering

● Blackbox tracing mechanism
● Trace ordering: linear running time, (total trace length) * (number of traces)

[Figure: NetCheck components: ordering algorithm, network model, diagnoses engine. Input: host traces. Output: diagnosis. Arrow labels: syscall, simulation result, simulation state, errors]


Fault Classifier

● Goal: decide what to output
● Problem: show relevant information
● Fault classifier: global (rather than local) view
  ○ Uncovers high-level patterns by extracting low-level features
  ○ Examples: middleboxes, non-transitive connectivity, MTU, mobility, network disconnection
  ○ All look like loss, but have different patterns in the context of other flows
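One of the listed patterns, non-transitive connectivity, only shows up when flows between several hosts are viewed together. A minimal sketch of such a global check follows; the function name and the `can_reach` input format are assumptions for illustration, not NetCheck's classifier.

```python
from itertools import permutations


def non_transitive_triples(hosts, can_reach):
    """Return triples (a, b, c) where a reaches b and b reaches c, but a does not reach c.

    `can_reach` is a set of (src, dst) pairs that were observed to communicate.
    """
    return [
        (a, b, c)
        for a, b, c in permutations(hosts, 3)
        if (a, b) in can_reach and (b, c) in can_reach and (a, c) not in can_reach
    ]


# A and C both talk to B, but never to each other.
observed = {("A", "B"), ("B", "A"), ("B", "C"), ("C", "B")}
suspects = non_transitive_triples(["A", "B", "C"], observed)
```

Looked at one flow at a time, the failed A to C traffic is indistinguishable from loss; the triple only appears in the context of the other flows.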


Fault Classifier

● Options to show different levels of detail
  ● Network admins / developers: detailed info
  ● End users: classification, recommendations

[Figure: sample output with sections "Network Configuration Issues", "Traffic Statistics", and "Problem Detected"]


Evaluation: Production Application Bugs

● Reproduce reported bugs from bug trackers (Python, Apache, Ruby, Firefox, etc.)
  ○ A total of 71 bugs
  ○ Grouped into 23 categories
    ■ Virtualization-incurred / portability bugs
    ■ SO_REUSEADDR behaves differently across OSes
    ■ accept inherits O_NONBLOCK
    ■ …
  ○ Correct analysis of >95% of bugs

Evaluation: Observed Network Faults

● Twenty faults observed in practice on a live network
  ○ MTU bug
    ■ Intermediary device
  ○ Port forward
    ■ Traffic sent to non-relevant addresses
  ○ Provide supplemental info
    ■ packet loss
    ■ buffers being closed with data still in them
  ○ 90% of cases correctly detected

General Findings in Practice

● Middle boxes
  ○ Multiple unaccepted connections
    ■ client behind NAT in FTP
● TCP/UDP
  ○ non-transitive connectivity in VLC
● Complex failures
  ○ VirtualBox: send data larger than buffer size
  ○ Pidgin: returned IP different from bind
  ○ Skype: NAT + close socket from a different thread
● Used on Seattle Testbed: seattle.poly.edu

NetCheck Performance Overhead

[Figure: overhead for Firefox, Skype, Telnet, SSH, and VLC]

Conclusion

Built and evaluated NetCheck, a tool to diagnose network failures in complex apps

● Key insights:
  ○ model the programmer’s misconceptions
  ○ relation between calls → reconstruct order
● NetCheck is effective
  ○ Everyday applications & networks
  ○ Real network / application bugs
  ○ No per-network knowledge
  ○ No per-application knowledge

Try it here: https://netcheck.poly.edu/