Network Path and Application Diagnostics Matt Mathis John Heffner Ragu Reddy 7/19/05 PathDiag20050719.ppt

Network Path and Application Diagnostics

Matt Mathis, John Heffner, Ragu Reddy

7/19/05

http://www.psc.edu/~mathis/papers/PathDiag20050719.ppt

Outline
• What is the real problem?
  – Lessons from Web100
  – A new perspective
• Application and upper layer diagnosis
  – LAN bench testing
• Path and lower layer diagnosis
  – The pathdiag tool
  – A diagnostic server
• Future work
  – Alpha deployment
  – Open research questions

TCP tuning requires expert knowledge
• By design TCP/IP hides the ‘net from upper layers
  – TCP/IP provides basic reliable data delivery
  – The “hour glass” between applications and networks
• This is a good thing, because it allows:
  – Invisible recovery from data loss, etc.
  – Old applications to use new networks
  – New applications to use old networks
• But then (nearly) all problems have the same symptom
  – Less than expected performance
  – The details are hidden from nearly everyone

TCP tuning is painful debugging
• All problems reduce performance
  – But the specific symptoms are hidden
• Any one problem can prevent good performance
  – Completely masking all other problems
• Trying to fix the weakest link of an invisible chain
  – General tendency is to guess and “fix” random parts
  – Repairs are sometimes “random walks”
  – Repair one problem at a time, at best

The Web100 project
• When there is a problem, just ask TCP
  – TCP has the ideal vantage point
    • In between the application and the network
  – TCP already “measures” key network parameters
    • Round Trip Time (RTT), available data capacity, etc.
    • Can add many more
  – TCP can identify the bottleneck
    • Why did it stop sending data?
  – TCP can even adjust itself
    • “autotuning” eliminates one major class of flaws
• See: www.web100.org

The next step
• Web100 tools still require too much expertise
  – They are not really end user tools
  – Too easy to overlook problems
  – Current diagnostic procedures are still cumbersome
• New insight from Web100 experience
  – Nearly all symptoms scale with round trip time
• New NSF funding
  – Network Path and Application Diagnosis (NPAD)
  – 3 years, we are at the midpoint

Nearly all symptoms scale with RTT
• For example
  – TCP buffer space, network loss and reordering, etc.
  – On a short path TCP can compensate for the flaw
• Local Client to Server: all applications work
  – Including all standard diagnostics
• Remote Client to Server: all applications fail
  – Leading to faulty implication of other components

Examples of flaws that scale
• Chatty application (e.g., 50 transactions per request)
  – On a 1 ms LAN, this adds 50 ms to user response time
  – On a 100 ms WAN, this adds 5 s to user response time
• Fixed TCP socket buffer space (e.g., 32 kBytes)
  – On a 1 ms LAN, limits throughput to roughly 200 Mb/s
  – On a 100 ms WAN, limits throughput to roughly 2 Mb/s
• Packet loss (e.g., 0.1% loss at 1500 bytes)
  – On a 1 ms LAN, models predict roughly 300 Mb/s
  – On a 100 ms WAN, models predict roughly 3 Mb/s
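
The three examples above follow from simple arithmetic. The sketch below reproduces them under stated assumptions: each transaction costs one serialized round trip, a fixed buffer allows at most one buffer of data in flight per RTT, and the loss case uses the Mathis et al. model, rate ≈ (MSS / RTT) · C / √p, with C set to 1.0. The slides do not say which model or constant was used, and the function names are illustrative only.

```python
import math

def chatty_delay_s(transactions: int, rtt_s: float) -> float:
    """Added response time when each transaction costs one serialized round trip."""
    return transactions * rtt_s

def window_limited_bps(buffer_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput with at most one buffer of data in flight per RTT."""
    return buffer_bytes * 8 / rtt_s

def loss_limited_bps(mss_bytes: int, rtt_s: float, loss: float, c: float = 1.0) -> float:
    """Mathis et al. model: rate ~ (MSS / RTT) * C / sqrt(p). C = 1.0 is an assumption."""
    return mss_bytes * 8 / rtt_s * c / math.sqrt(loss)

for label, rtt in (("1 ms LAN", 0.001), ("100 ms WAN", 0.100)):
    print(label)
    print(f"  chatty app, 50 transactions: +{chatty_delay_s(50, rtt):.2f} s")
    print(f"  32 kByte buffer: {window_limited_bps(32 * 1024, rtt) / 1e6:.1f} Mb/s")
    print(f"  0.1% loss, 1500 bytes: {loss_limited_bps(1500, rtt, 0.001) / 1e6:.1f} Mb/s")
```

The computed limits (about 262 and 2.6 Mb/s for the buffer, about 380 and 3.8 Mb/s for loss with C = 1.0) are somewhat higher than the rounded slide figures, but the point is the same: every limit drops by a factor of 100 when the RTT grows from 1 ms to 100 ms.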

The confounded problems
• For nearly all network flaws
  – The only symptom is reduced performance
  – But the reduction is scaled by RTT
• On short paths, most flaws are undetectable
  – False pass for even the best conventional diagnostics
  – Leads to faulty inductive reasoning about flaw locations
  – This is the essence of the “end-to-end” problem
  – Current state-of-the-art diagnosis relies on tomography and complicated inference techniques

The solutions
• New diagnostic techniques to compensate for “symptom scaling”
• For applications (and upper layers)
  – Bench test over an (emulated) ideal long path
• For path testing (and lower layers)
  – Test path sections using an instrumented application that can extrapolate test results to a long path

Diagnosing applications
• Goal: Tools to “bench test” applications in the lab
  – Client and server on the same LAN
    • App developer has easy access to all components
  – Emulate a long ideal path between client and server
    • Also checks some OS and TCP features
    • Several different techniques (next topic)
• Developer gets first hand experience with delay
  – If it fails in the lab, it will not work on a WAN
  – Cannot blame the network
  – Cannot repeal the speed of light
  – Has to fix the application

Emulating delay
• Multiple techniques to emulate long paths
  – Scenic routing via tunnels
  – Kernel delays (e.g. netem, nistnet, dummynet)
  – Application (pipe) delay via a proxy
• We have ~5 techniques prototyped/under test
  – Kernel hacking vs non-privileged users
  – Ease of use/ease of installation
  – Maximum data rate
  – Authenticity of the delay

The high end
• GRE tunnel
  – Encapsulate in either client or server
  – Decapsulate in a carrier grade router
• Can support authentic delays at rates up to 10 Gb/s
  – PSC uses this to test ETF apps and infrastructure
• Very expensive
  – Requires carrier grade gear
  – Additional “tunnel PIC”
  – Awkward (non-scalable) configuration management
  – Generally requires privileged access to critical HW

The low end
• Simple user mode application proxy to forward connections
  – Connect from client to proxy
  – Proxy connects to server
  – (Think “man in the middle” attack)
• Proxy copies application data between the two sockets
  – But with a scheduler and timers to delay the data
  – Implements pure “application delay” over ideal TCP
• Cheap and easy for low rate applications
  – Non-privileged C program (POSIX(?))
  – Can be controlled and configured with a web form
• But beware of the security implications
• Interacts badly with authentication, etc.
  – All of the same problems as NAT
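
The proxy idea above can be sketched in a few dozen lines. This is not the authors' C program, which the slides do not show; it is a minimal Python asyncio illustration of the same technique: accept a client, connect to the server, and copy data in both directions while holding each chunk for a fixed one-way delay. The names `pump`, `run_together`, `start_delay_proxy` and the `CHUNK` size are hypothetical, and the unbounded queue makes it suitable only for low-rate tests.

```python
import asyncio

CHUNK = 65536  # bytes per read; an arbitrary choice for this sketch

async def run_together(*coros):
    """Run coroutines concurrently; if one fails or is cancelled, cancel the rest."""
    tasks = [asyncio.ensure_future(c) for c in coros]
    try:
        await asyncio.gather(*tasks)
    finally:
        for task in tasks:
            task.cancel()

async def pump(reader, writer, delay_s):
    """Copy reader -> writer, releasing each chunk delay_s after it arrived."""
    loop = asyncio.get_running_loop()
    pending = asyncio.Queue()  # unbounded: acceptable for low rates only

    async def receive():
        while True:
            data = await reader.read(CHUNK)
            await pending.put((loop.time() + delay_s, data))
            if not data:  # empty read is EOF; queue it so it is delayed too
                return

    async def release():
        while True:
            due, data = await pending.get()
            await asyncio.sleep(max(0.0, due - loop.time()))
            if not data:
                if writer.can_write_eof():
                    writer.write_eof()  # pass the half-close along
                return
            writer.write(data)
            await writer.drain()

    await run_together(receive(), release())

async def start_delay_proxy(listen_port, server_host, server_port, one_way_delay_s):
    """Accept clients on localhost and relay them to the server with added delay."""
    async def handle(client_reader, client_writer):
        server_writer = None
        try:
            server_reader, server_writer = await asyncio.open_connection(
                server_host, server_port)
            await run_together(
                pump(client_reader, server_writer, one_way_delay_s),
                pump(server_reader, client_writer, one_way_delay_s),
            )
        except OSError:
            pass  # peer reset or server unreachable: just drop the connection
        finally:
            client_writer.close()
            if server_writer is not None:
                server_writer.close()

    return await asyncio.start_server(handle, "127.0.0.1", listen_port)
```

To emulate a 100 ms RTT, start the proxy with `one_way_delay_s=0.05` and point the client at the proxy's port instead of the server's. Because chunks are timestamped on arrival and released by a separate task, the delay behaves like a longer pipe rather than a throughput cap.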

Starting to look at common applications
• Classic ssh & scp
  – Session setup takes many RTTs
    • Authentication, negotiating options & encryption, etc.
    • 24 RTTs in one typical(?) example
  – Internal flow control
    • Subject of Chris Rapier’s “HPN-SSH” talk this afternoon
    • (This work predates NPAD)
• Thinking about X11 and others

Testing the path
• Need to test short sections to localize a flaw
  – But “symptom scaling” normally hides a failing section
• New tool (“pathdiag”):
  – Measure the performance of each short section
    • Use Web100 to collect detailed statistics
    • Loss, delay, queuing properties, etc.
  – Use models to extrapolate results to the full path
    • Assume that the rest of the path is ideal
    • You have to specify the end-to-end performance goal
      – Data rate and RTT
  – Pass/Fail on the basis of the extrapolated performance
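
One way to picture the extrapolation step: take the end-to-end goal (data rate and RTT), work out the window and the loss rate that goal implies, and compare them with what was measured on the short section. The sketch below assumes the Mathis et al. loss model with C = 1.0 and a 1460-byte MSS. The slides do not say which models pathdiag uses, so the names `Target`, `loss_budget` and `check_section` and the two checks are hypothetical illustrations, not pathdiag's actual tests.

```python
from dataclasses import dataclass

@dataclass
class Target:
    rate_bps: float        # end-to-end data rate goal, bits per second
    rtt_s: float           # end-to-end round trip time goal, seconds
    mss_bytes: int = 1460  # assumed segment size

def required_window_bytes(t: Target) -> float:
    """Bandwidth-delay product: data that must be in flight to fill the target path."""
    return t.rate_bps * t.rtt_s / 8

def loss_budget(t: Target, c: float = 1.0) -> float:
    """Largest loss probability that still allows the target rate.

    Solves rate = (MSS / RTT) * C / sqrt(p) for p, giving p = (C / W)^2,
    where W is the required window in packets.
    """
    window_packets = required_window_bytes(t) / t.mss_bytes
    return (c / window_packets) ** 2

def check_section(measured_loss: float, buffer_bytes: int, t: Target) -> list:
    """Return a list of failure messages; an empty list means the section passes."""
    failures = []
    if measured_loss > loss_budget(t):
        failures.append(
            f"loss {measured_loss:.2e} exceeds budget {loss_budget(t):.2e}")
    if buffer_bytes < required_window_bytes(t):
        failures.append(
            f"buffer {buffer_bytes} B is below the required window "
            f"{required_window_bytes(t):.0f} B")
    return failures

goal = Target(rate_bps=100e6, rtt_s=0.100)
for problem in check_section(measured_loss=1e-4, buffer_bytes=64 * 1024, t=goal):
    print("FAIL:", problem)
```

With these assumptions, a section showing 0.01% loss would sustain far more than 100 Mb/s over a 1 ms RTT, yet it fails a 100 Mb/s, 100 ms goal, which is the symptom scaling the test is meant to expose.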

Deploy as a Diagnostic Server
• Use pathdiag in a Diagnostic Server (DS)
• Specify End to End target performance
  – From server (S) to client (C) (RTT and data rate)
• Measure the performance from DS to C
  – Use Web100 in the DS to collect detailed statistics
  – Extrapolate performance assuming ideal backbone
• Pass/Fail on the basis of extrapolated performance

Example 1 - good news

Example 1, continued

Example 2 - not so good

Example 2, continued

Key pathdiag/DS features
• Coverage for a majority of OS and network flaws
  – Most of the remaining flaws can be detected with pathdiag in the client or traceroute
  – Eliminates nearly all(?) false pass results
• Tests become more sensitive on shorter paths
  – Conventional diagnostics become less sensitive
  – Depending on models, perhaps too sensitive
    • New problem is false fail (e.g. queue space tests)
• Flaws no longer completely mask other flaws
  – A single test often detects several flaws
    • E.g. find both OS and network flaws in the same test
  – They can be repaired concurrently

Key features, continued
• Results are specific and less geeky
  – Intended for end-users
  – Provides a list of action items to be corrected
    • Failed tests are showstoppers for HPN apps
  – Details for escalation to network or system admins
• Archived DS results include raw Web100 data
  – Can reprocess with updated reporting SW
    • New reports from old data
  – Critical feedback for the NPAD project
    • We really want to collect “interesting” failures

Blast from the past
• Same base algorithm as “Windowed Ping” [Mathis, INET’94]
  – Aka “mping”
  – See http://www.psc.edu/~mathis/wping/
  – Killer diagnostic in use at PSC in the early 90s
  – Stopped working with the advent of “fast path” routers
• Use a simple fixed window protocol
  – Scan window size in 1 second steps
  – Measure data rate, loss rate, RTT, etc. as window changes
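
To show why scanning a fixed window is informative, here is a toy simulation, not the mping or pathdiag algorithm. It assumes an idealized path with a single bottleneck, a drop-tail queue, and steady state at each window size. `ToyPath`, `probe` and `scan` are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass

@dataclass
class ToyPath:
    """Idealized single-bottleneck path (an assumption for illustration)."""
    bottleneck_bps: float
    base_rtt_s: float
    queue_packets: int
    packet_bytes: int = 1500

    def pipe_packets(self) -> float:
        """Packets in flight needed to fill the path with an empty queue."""
        return self.bottleneck_bps * self.base_rtt_s / (self.packet_bytes * 8)

def probe(path: ToyPath, window: int):
    """Steady-state (rate, RTT, loss fraction) for a fixed window of packets."""
    pipe = path.pipe_packets()
    capacity = pipe + path.queue_packets         # most packets the path can hold
    held = min(window, capacity)
    queued = max(0.0, held - pipe)               # packets waiting at the bottleneck
    rtt = path.base_rtt_s + queued * path.packet_bytes * 8 / path.bottleneck_bps
    rate = held * path.packet_bytes * 8 / rtt
    loss = max(0.0, window - capacity) / window  # overflow is dropped
    return rate, rtt, loss

def scan(path: ToyPath, windows):
    """One measurement per window size, as in a stepwise window scan."""
    return [(w, *probe(path, w)) for w in windows]

path = ToyPath(bottleneck_bps=100e6, base_rtt_s=0.010, queue_packets=50)
for w, rate, rtt, loss in scan(path, range(20, 201, 20)):
    print(f"window={w:3d}  rate={rate / 1e6:6.1f} Mb/s  "
          f"rtt={rtt * 1e3:5.1f} ms  loss={loss:.3f}")
```

In this model the scan shows three regimes: the rate grows linearly while the window is below the pipe size, then the rate flattens and the RTT rises as the queue fills, and finally loss appears once the queue overflows. The two transition points give estimates of the bottleneck rate and queue space.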

Plan for the Future
• Application bench testing
  – Considering running a tunnel service for SC2005
    • Contact us if this would be useful
  – Will release the delay proxy software
  – Plan to release cookbooks for non-proxy methods
    • Using tunnels, netem, etc.
• Pathdiag/DS
  – Hope to deploy across I2
    • Co-locate with NDT servers
    • Collect interesting test results to refine the tool
  – Release pathdiag and DS sources
• Bridging the Gap workshop
  – http://e2epi.internet2.edu/btg/
