Network Path and Application Diagnostics Matt Mathis John Heffner Ragu Reddy 7/19/05 http://www.psc.edu/~mathis/papers/ PathDiag20050719.ppt






Page 1: Network Path and Application Diagnostics Matt Mathis John Heffner Ragu Reddy 7/19/05  PathDiag20050719.ppt

Network Path and Application Diagnostics

Matt Mathis
John Heffner
Ragu Reddy

7/19/05

http://www.psc.edu/~mathis/papers/PathDiag20050719.ppt


Outline
• What is the real problem?
  – Lessons from Web100
  – A new perspective
• Application and upper layer diagnosis
  – LAN bench testing
• Path and lower layer diagnosis
  – The pathdiag tool
  – A diagnostic server
• Future work
  – Alpha deployment
  – Open research questions


TCP tuning requires expert knowledge
• By design TCP/IP hides the ‘net from upper layers
  – TCP/IP provides basic reliable data delivery
  – The “hour glass” between applications and networks
• This is a good thing, because it allows:
  – Invisible recovery from data loss, etc.
  – Old applications to use new networks
  – New applications to use old networks
• But then (nearly) all problems have the same symptom
  – Less than expected performance
  – The details are hidden from nearly everyone


TCP tuning is painful debugging
• All problems reduce performance
  – But the specific symptoms are hidden
• Any one problem can prevent good performance
  – Completely masking all other problems
• Trying to fix the weakest link of an invisible chain
  – General tendency is to guess and “fix” random parts
  – Repairs are sometimes “random walks”
  – Repair one problem at a time, at best


The Web100 project
• When there is a problem, just ask TCP
  – TCP has the ideal vantage point
    • In between the application and the network
  – TCP already “measures” key network parameters
    • Round Trip Time (RTT), available data capacity, etc.
    • Can add many more
  – TCP can identify the bottleneck
    • Why did it stop sending data?
  – TCP can even adjust itself
    • “autotuning” eliminates one major class of flaws

See: www.web100.org


The next step
• Web100 tools still require too much expertise
  – They are not really end user tools
  – Too easy to overlook problems
  – Current diagnostic procedures are still cumbersome
• New insight from Web100 experience
  – Nearly all symptoms scale with round trip time
• New NSF funding
  – Network Path and Application Diagnosis (NPAD)
  – 3 years; we are at the midpoint


Nearly all symptoms scale with RTT

• For example
  – TCP buffer space, network loss and reordering, etc.
  – On a short path TCP can compensate for the flaw
• Local client to server: all applications work
  – Including all standard diagnostics
• Remote client to server: all applications fail
  – Leading to faulty implication of other components


Examples of flaws that scale
• Chatty application (e.g., 50 transactions per request)
  – On a 1 ms LAN, this adds 50 ms to user response time
  – On a 100 ms WAN, this adds 5 s to user response time
• Fixed TCP socket buffer space (e.g., 32 kBytes)
  – On a 1 ms LAN, limits throughput to 200 Mb/s
  – On a 100 ms WAN, limits throughput to 2 Mb/s
• Packet loss (e.g., 0.1% loss at 1500 bytes)
  – On a 1 ms LAN, models predict 300 Mb/s
  – On a 100 ms WAN, models predict 3 Mb/s
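The three examples can be reproduced with back-of-the-envelope arithmetic. The sketch below is an illustration, not taken from the slides. It assumes one round trip per transaction, one socket buffer in flight per RTT, and the macroscopic TCP model rate ≈ (MSS/RTT)·C/√p with C = 1. The function names are made up for this example. It gives about 256 Mb/s and 380 Mb/s on the LAN, the same order of magnitude as the rounded figures on the slide. Every result gets 100 times worse on the 100 ms path.

```python
import math

def chatty_delay_s(transactions, rtt_s):
    """Added response time, assuming each transaction costs one round trip."""
    return transactions * rtt_s

def window_limited_bps(buffer_bytes, rtt_s):
    """Throughput ceiling when at most one buffer of data is in flight per RTT."""
    return buffer_bytes * 8 / rtt_s

def loss_limited_bps(packet_bytes, rtt_s, loss_prob, c=1.0):
    """Macroscopic TCP model: rate ~ (packet size / RTT) * C / sqrt(p).
    C = 1.0 is an assumption; published values of the constant vary."""
    return (packet_bytes * 8 / rtt_s) * c / math.sqrt(loss_prob)

for label, rtt in (("1 ms LAN", 0.001), ("100 ms WAN", 0.100)):
    print(label)
    print("  chatty app adds   %.2f s" % chatty_delay_s(50, rtt))
    print("  32 kB buffer caps %.1f Mb/s" % (window_limited_bps(32_000, rtt) / 1e6))
    print("  0.1%% loss caps    %.1f Mb/s" % (loss_limited_bps(1500, rtt, 0.001) / 1e6))
```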


The confounded problems
• For nearly all network flaws
  – The only symptom is reduced performance
  – But the reduction is scaled by RTT
• On short paths, most flaws are undetectable
  – False pass for even the best conventional diagnostics
  – Leads to faulty inductive reasoning about flaw locations
  – This is the essence of the “end-to-end” problem
  – Current state-of-the-art diagnosis relies on tomography and complicated inference techniques


The solutions
• New diagnostic techniques to compensate for “symptom scaling”
• For applications (and upper layers)
  – Bench test over an (emulated) ideal long path
• For path testing (and lower layers)
  – Test path sections using an instrumented application that can extrapolate test results to a long path


Diagnosing applications
• Goal: Tools to “bench test” applications in the lab
  – Client and server on the same LAN
    • App developer has easy access to all components
  – Emulate a long ideal path between client and server
    • Also checks some OS and TCP features
    • Several different techniques (next topic)
• Developer gets first hand experience with delay
  – If it fails in the lab, it will not work on a WAN
  – Cannot blame the network
  – Cannot repeal the speed of light
  – Has to fix the application


Emulating delay
• Multiple techniques to emulate long paths
  – Scenic routing via tunnels
  – Kernel delays (e.g. netem, nistnet, dummynet)
  – Application (pipe) delay via a proxy
• We have ~5 techniques prototyped/under test
  – Kernel hacking vs non-privileged users
  – Ease of use/ease of installation
  – Maximum data rate
  – Authenticity of the delay


The high end
• GRE tunnel
  – Encapsulate in either client or server
  – Decapsulate in a carrier grade router
• Can support authentic delays at rates up to 10 Gb/s
  – PSC uses this to test ETF apps and infrastructure
• Very expensive
  – Requires carrier grade gear
  – Additional “tunnel pic”
  – Awkward (non-scalable) configuration management
  – Generally requires privileged access to critical HW


The low end
• Simple user mode application proxy to forward connections
  – Connect from client to proxy
  – Proxy connects to server
  – (Think “man in the middle” attack)
• Proxy copies application data between the two sockets
  – But with a scheduler and timers to delay the data
  – Implements pure “application delay” over ideal TCP
• Cheap and easy for low rate applications
  – Non-privileged “C” program (POSIX(?))
  – Can be controlled and configured with a web form
• But beware of the security implications
• Interacts badly with authentication, etc.
  – All of the same problems as NAT
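The delay proxy idea can be sketched in a few lines. This is an illustration in Python with asyncio, not the NPAD proxy (which the slide describes as a C program). All names here (`delayed_copy`, `handle_client`, `start_proxy`) are hypothetical. Each chunk is stamped with a release time on arrival and forwarded once that time has passed, so the delay applies per direction without capping throughput at one chunk per delay interval.

```python
import asyncio

async def delayed_copy(reader, writer, delay_s):
    """Copy bytes from reader to writer, releasing each chunk delay_s after it arrived."""
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()  # unbounded: acceptable for a low rate sketch only

    async def pump_in():
        while True:
            data = await reader.read(65536)
            await queue.put((loop.time() + delay_s, data))
            if not data:  # empty read means EOF; queue it as an end marker
                return

    async def pump_out():
        while True:
            release_at, data = await queue.get()
            wait = release_at - loop.time()
            if wait > 0:
                await asyncio.sleep(wait)
            if not data:
                break
            writer.write(data)
            await writer.drain()
        if writer.can_write_eof():
            writer.write_eof()  # propagate the (delayed) half-close

    await asyncio.gather(pump_in(), pump_out())

async def handle_client(client_reader, client_writer, server_host, server_port, delay_s):
    """Open the onward connection and relay both directions with the same one-way delay."""
    server_reader, server_writer = await asyncio.open_connection(server_host, server_port)
    try:
        await asyncio.gather(
            delayed_copy(client_reader, server_writer, delay_s),
            delayed_copy(server_reader, client_writer, delay_s),
        )
    finally:
        client_writer.close()
        server_writer.close()

async def start_proxy(listen_port, server_host, server_port, delay_s):
    """Listen on localhost and forward every connection to the real server."""
    async def on_connect(reader, writer):
        await handle_client(reader, writer, server_host, server_port, delay_s)
    return await asyncio.start_server(on_connect, "127.0.0.1", listen_port)
```

A one-way delay of 0.05 s adds roughly 100 ms to each application-level round trip. As the slide warns, the server sees the proxy's address rather than the client's, so anything that depends on endpoint identity may misbehave.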


Starting to look at common applications
• Classic ssh & scp
  – Session setup takes many RTTs
    • Authentication, negotiating options & encryption, etc.
    • 24 RTTs in one typical(?) example
  – Internal flow control
    • Subject of Chris Rapier’s “HPN-SSH” talk this afternoon
    • (This work predates NPAD)
• Thinking about X11 and others


Testing the path
• Need to test short sections to localize a flaw
  – But “symptom scaling” normally hides a failing section
• New tool (“pathdiag”):
  – Measure the performance of each short section
    • Use Web100 to collect detailed statistics
    • Loss, delay, queuing properties, etc.
  – Use models to extrapolate results to the full path
    • Assume that the rest of the path is ideal
    • You have to specify the end-to-end performance goal
      – Data rate and RTT
  – Pass/Fail on the basis of the extrapolated performance
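One way to picture the extrapolation step: turn the end-to-end goal (data rate and RTT) into budgets that a short section must meet, then compare the section's measurements against them. The sketch below is an assumption-laden illustration and not pathdiag's actual models. It uses the bandwidth-delay product for the window budget and inverts the macroscopic TCP model, rate ≈ (MSS/RTT)·C/√p, for the loss budget. The function names, the 1460 byte MSS, and C = 1 are all choices made for this example.

```python
def required_window_bytes(target_bps, target_rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to fill the target path."""
    return target_bps * target_rtt_s / 8

def max_tolerable_loss(target_bps, target_rtt_s, mss_bytes=1460, c=1.0):
    """Largest loss probability that still allows the target rate under
    rate = (MSS / RTT) * C / sqrt(p). MSS and C are assumed values."""
    window_packets = required_window_bytes(target_bps, target_rtt_s) / mss_bytes
    return (c / window_packets) ** 2

def extrapolate_section(measured_loss, measured_window_bytes, target_bps, target_rtt_s):
    """Return (check name, passed) pairs for one short section,
    assuming the rest of the path is ideal."""
    loss_budget = max_tolerable_loss(target_bps, target_rtt_s)
    window_needed = required_window_bytes(target_bps, target_rtt_s)
    return [
        ("loss rate", measured_loss <= loss_budget),
        ("window/buffer", measured_window_bytes >= window_needed),
    ]

# Hypothetical goal: 100 Mb/s over a 70 ms end-to-end path.
# A section with 0.01% loss looks fine when tested locally but fails the extrapolated loss budget.
for name, passed in extrapolate_section(1e-4, 1_000_000, 100e6, 0.070):
    print(name, "PASS" if passed else "FAIL")
```

For this goal the loss budget comes out near 3 in a million packets, which shows why a loss rate that is harmless on a 1 ms section becomes a showstopper once the result is scaled to the full path.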


Deploy as a Diagnostic Server

• Use pathdiag in a Diagnostic Server (DS)
• Specify end-to-end target performance
  – From server (S) to client (C) (RTT and data rate)
• Measure the performance from DS to C
  – Use Web100 in the DS to collect detailed statistics
  – Extrapolate performance assuming ideal backbone
• Pass/Fail on the basis of extrapolated performance


Example 1 - good news


Example 1, continued


Example 2 - not so good


Example 2, continued


Key pathdiag/DS features
• Coverage for a majority of OS and network flaws
  – Most of the remaining flaws can be detected with pathdiag in the client or traceroute
  – Eliminates nearly all(?) false pass results
• Tests become more sensitive on shorter paths
  – Conventional diagnostics become less sensitive
  – Depending on models, perhaps too sensitive
    • New problem is false fail (e.g. queue space tests)
• Flaws no longer completely mask other flaws
  – A single test often detects several flaws
    • E.g. find both OS and network flaws in the same test
  – They can be repaired concurrently


Key features, continued
• Results are specific and less geeky
  – Intended for end-users
  – Provides a list of action items to be corrected
• Failed tests are showstoppers for HPN apps
  – Details for escalation to network or system admins
• Archived DS results include raw Web100 data
  – Can reprocess with updated reporting SW
    • New reports from old data
  – Critical feedback for the NPAD project
    • We really want to collect “interesting” failures


Blast from the past
• Same base algorithm as “Windowed Ping” [Mathis, INET’94]
  – Aka “mping”
  – See http://www.psc.edu/~mathis/wping/
  – Killer diagnostic in use at PSC in the early 90s
  – Stopped working with the advent of “fast path” routers
• Use a simple fixed window protocol
  – Scan window size in 1 second steps
  – Measure data rate, loss rate, RTT, etc. as window changes
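What a window scan is expected to reveal can be shown with an idealized single-bottleneck model. This is a sketch for intuition only and is not the mping or pathdiag code. It assumes a fixed bottleneck rate in packets per second, a fixed base RTT, and a drop-tail queue of known size; the function and parameter names are hypothetical. Below the pipe size the rate grows with the window, above it the RTT grows instead, and beyond pipe plus queue the path starts to drop packets.

```python
def scan_window(bottleneck_pps, base_rtt_s, queue_packets, max_window):
    """Model the steady state seen at each fixed window size (in packets)."""
    pipe = bottleneck_pps * base_rtt_s  # packets in flight that fill the path without queueing
    rows = []
    for window in range(1, max_window + 1):
        queued = min(max(window - pipe, 0.0), queue_packets)
        rows.append({
            "window": window,
            "rate_pps": min(window / base_rtt_s, bottleneck_pps),
            "rtt_s": base_rtt_s + queued / bottleneck_pps,
            "overflow": window > pipe + queue_packets,  # queue full: expect loss
        })
    return rows

# Hypothetical path: 1000 packets/s bottleneck, 10 ms base RTT, 5 packet queue.
for row in scan_window(1000, 0.010, 5, 20):
    print(row)
```

Stepping the window one value at a time and recording rate, RTT, and loss at each step separates these three regimes, which a single throughput number cannot do.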


Plan for the Future
• Application bench testing
  – Considering running a tunnel service for SC2005
    • Contact us if this would be useful
  – Will release the delay proxy software
  – Plan to release cookbooks for non-proxy methods
    • Using tunnels, netem, etc.
• Pathdiag/DS
  – Hope to deploy across I2
    • Co-locate with NDT servers
    • Collect interesting test results to refine the tool
  – Release pathdiag and DS sources
• Bridging the Gap workshop
  – http://e2epi.internet2.edu/btg/