Network Path and Application Diagnostics
Matt Mathis
John Heffner
Ragu Reddy
4/24/06
http://www.psc.edu/~mathis/papers/PathDiag20060424.ppt
Outline
• What is the real problem?
– Lessons from Web100
– A new perspective
• Path and lower layer diagnosis
– The pathdiag tool
– A diagnostic server
• Application and upper layer diagnosis
– LAN bench testing
• Future plans
TCP tuning requires expert knowledge
• By design, TCP/IP hides the ‘net from upper layers
– TCP/IP provides basic reliable data delivery
– The “hourglass” between applications and networks
• This is a good thing, because it allows:
– Invisible recovery from data loss, etc.
– Old applications to use new networks
– New applications to use old networks
• But then (nearly) all problems have the same symptom
– Less than expected performance
– The details are hidden from nearly everyone
TCP tuning is painful debugging
• All problems reduce performance
– But the specific symptoms are hidden
• Any one problem can prevent good performance
– Completely masking all other problems
• Trying to fix the weakest link of an invisible chain
– General tendency is to guess and “fix” random parts
– Repairs are sometimes “random walks”
– Repair one problem at a time, at best
The Web100 project
• When there is a problem, just ask TCP
– TCP has the ideal vantage point
  • In between the application and the network
– TCP already “measures” key network parameters
  • Round Trip Time (RTT), available data capacity, etc.
  • Can add many more
– TCP can identify the bottleneck
  • Why did it stop sending data?
– TCP can even adjust itself
  • “autotuning” eliminates one major class of flaws
See: www.web100.org
The next step
• Web100 tools still require too much expertise
– They are not really end user tools
– Too easy to overlook problems
– Current diagnostic procedures are still cumbersome
• New insight from Web100 experience
– Nearly all symptoms scale with round trip time
• New NSF-funded project: Network Path and Application Diagnosis (NPAD)
Nearly all symptoms scale with RTT
• For example
– TCP buffer space, network loss and reordering, etc.
– On a short path TCP can compensate for the flaw
• Local client to server: all applications work
– Including all standard diagnostics
• Remote client to server: all applications fail
– Leading to faulty implication of other components
Examples of flaws that scale
• Chatty application (e.g., 50 transactions per request)
– On a 1ms LAN, this adds 50ms to user response time
– On a 100ms WAN, this adds 5s to user response time
• Fixed TCP socket buffer space (e.g., 32kBytes)
– On a 1ms LAN, limits throughput to about 200Mb/s
– On a 100ms WAN, limits throughput to about 2Mb/s
• Packet loss (e.g., 0.1% loss at 1500 bytes)
– On a 1ms LAN, models predict about 300Mb/s
– On a 100ms WAN, models predict about 3Mb/s
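The three examples can be reproduced with back-of-the-envelope formulas. The following is a minimal sketch, not taken from the slides: the function names are hypothetical, the loss formula is the common rate ≈ C·MSS/(RTT·√p) approximation, and the constant C = 1.0 is an assumption (the slides do not say which model or constant they used). The results land in the same range as the slide's rounded figures and, more to the point, show the 100x gap between the two paths.

```python
import math

def chatty_delay_s(round_trips, rtt_s):
    """Extra response time when a request needs `round_trips` sequential exchanges."""
    return round_trips * rtt_s

def window_limited_bps(buffer_bytes, rtt_s):
    """At most one socket buffer of data can be in flight per round trip."""
    return buffer_bytes * 8 / rtt_s

def loss_limited_bps(mss_bytes, rtt_s, loss_rate, c=1.0):
    """Rate ~ C * MSS / (RTT * sqrt(p)); C is an assumed constant of order 1."""
    return c * mss_bytes * 8 / (rtt_s * math.sqrt(loss_rate))

# Same flaw, two paths: only the RTT differs.
for label, rtt_s in (("1 ms LAN", 0.001), ("100 ms WAN", 0.100)):
    print(label,
          f"chatty: +{chatty_delay_s(50, rtt_s):.2f} s,",
          f"32 kB buffer: {window_limited_bps(32 * 1024, rtt_s) / 1e6:.1f} Mb/s,",
          f"0.1% loss: {loss_limited_bps(1500, rtt_s, 0.001) / 1e6:.1f} Mb/s")
```

Every quantity is either proportional or inversely proportional to RTT, which is why a flaw that is harmless on the LAN becomes crippling on the WAN.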
The confounded problems
• For nearly all network flaws
– The only symptom is reduced performance
– But the reduction is scaled by RTT
• On short paths, most flaws are undetectable
– False pass for even the best conventional diagnostics
– Leads to faulty inductive reasoning about flaw locations
– This is the essence of the “end-to-end” problem
– Current state-of-the-art diagnosis relies on tomography and complicated inference techniques
The solutions
• New diagnostic techniques to compensate for “symptom scaling”
• For path testing (and lower layers)
– Test path sections using an instrumented application that can extrapolate test results to a long path
• For applications (and upper layers)
– Bench test over an (emulated) ideal long path
Testing the path
• Need to test short path sections to localize a flaw
– But “symptom scaling” normally hides a failing section
• New tool (“pathdiag”):
– Measure the performance of each short section
  • Use Web100 to collect detailed statistics
  • Loss, delay, queuing properties, etc.
– Use models to extrapolate results to the full path
  • Assume that the rest of the path is ideal
  • You have to specify the end-to-end performance goal
    – Data rate and RTT
– Pass/Fail on the basis of the extrapolated performance
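The extrapolation idea can be illustrated with a short sketch. This is not pathdiag's actual algorithm: the function names are hypothetical, and it assumes the same rate ≈ C·MSS/(RTT·√p) loss model with C = 1.0 and a 1500-byte segment. Given an end-to-end goal (data rate and RTT), it derives the window the path must carry and the largest loss rate the goal can tolerate, then passes or fails the loss measured on the short section against that budget.

```python
def required_window_bytes(target_bps, target_rtt_s):
    """Data that must be in flight to sustain the target rate over the target RTT."""
    return target_bps * target_rtt_s / 8

def max_loss_for_target(target_bps, target_rtt_s, mss_bytes=1500, c=1.0):
    """Invert rate = C * MSS / (RTT * sqrt(p)) to get the largest tolerable p."""
    return (c * mss_bytes * 8 / (target_bps * target_rtt_s)) ** 2

def extrapolated_pass(measured_loss, target_bps, target_rtt_s,
                      mss_bytes=1500, c=1.0):
    """Pass only if the section's loss would still allow the end-to-end goal,
    assuming the rest of the path is ideal (adds delay but no loss)."""
    budget = max_loss_for_target(target_bps, target_rtt_s, mss_bytes, c)
    return measured_loss <= budget

# Hypothetical goal: 100 Mb/s over a 100 ms path.
goal_bps, goal_rtt_s = 100e6, 0.100
print("window needed (bytes):", required_window_bytes(goal_bps, goal_rtt_s))
print("loss budget:", max_loss_for_target(goal_bps, goal_rtt_s))
print("0.1% loss passes?", extrapolated_pass(0.001, goal_bps, goal_rtt_s))
```

The loss budget shrinks with the square of the target rate times RTT, so a section that looks clean when tested locally can still fail the extrapolated goal.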
Deploy as a Diagnostic Server
• Use pathdiag in a Diagnostic Server (DS)
• Specify end-to-end target performance
– From server (S) to client (C) (RTT and data rate)
• Measure the performance from DS to C
– Use Web100 in the DS to collect detailed statistics
– Extrapolate performance assuming ideal backbone
• Pass/Fail on the basis of extrapolated performance
Example 1 - good news
Example 1, continued
Example 2 - not so good
Example 2, continued
Key pathdiag/DS features
• Results are intended for end-users
– Provides a list of specific items to be corrected
  • Failed tests are showstoppers for HPN apps
– Includes explanations and tutorial information
– Details for escalation to network or system admins
• Coverage for a majority of OS and network flaws
– Most of the remaining flaws can be detected with pathdiag in the client or traceroute
– Eliminates nearly all(?) false pass results
• Tests become more sensitive on shorter paths
– Conventional diagnostics become less sensitive
– Depending on models, perhaps too sensitive
  • New problem is false fail (e.g. queue space tests)
Key features, continued
• Flaws no longer completely mask other flaws
– A single test often detects several flaws
  • E.g. find both OS and network flaws in the same test
– They can be repaired concurrently
• Archived DS results include raw Web100 data
– Can reprocess with updated reporting SW
  • New reports from old data
– Critical feedback for the NPAD project
  • We really want to collect “interesting” failures
Status
• Public servers are now online. See:
– http://www.psc.edu/networking/projects/pathdiag/
• Version 1.0 available for download
– Follow the download link
– Requires current Web100 kernel patches
– The server should be faster than its clients
• Version 1.1 is coming soon
– Better support for non-local testing
– Better support for TeraGrid scale testing
Blast from the past
• Same base algorithm as “Windowed Ping” [Mathis, INET’94]
– Aka “mping”
– See http://www.psc.edu/~mathis/wping/
– Killer diagnostic in use at PSC in the early 90s
– Stopped working with the advent of “fast path” routers
• Use a simple fixed window protocol
– Scan window size in 1 second steps
– Measure data rate, loss rate, RTT, etc. as window changes
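What a window scan reveals can be shown with a toy model. The sketch below does not send any packets and is not mping: it assumes a single bottleneck with a fixed service rate and a drop-tail queue, and all names and numbers are hypothetical. For each window size it reports the data rate, the RTT, and whether the queue overflows, which is the kind of curve a fixed window scan would trace out.

```python
def scan_windows(max_window_pkts, base_rtt_s, bottleneck_pps, queue_pkts):
    """Toy model of a fixed-window scan over one bottleneck.

    Returns a list of (window, rate_pps, rtt_s, overflow) tuples,
    one per window size from 1 to max_window_pkts.
    """
    bdp_pkts = bottleneck_pps * base_rtt_s          # packets the pipe itself holds
    results = []
    for window in range(1, max_window_pkts + 1):
        # Packets beyond pipe + queue capacity are dropped.
        in_flight = min(window, bdp_pkts + queue_pkts)
        # Below the BDP the window limits the rate; above it the bottleneck does.
        rate_pps = min(window / base_rtt_s, bottleneck_pps)
        # Once the pipe is full, extra packets wait in the queue and inflate RTT.
        rtt_s = max(base_rtt_s, in_flight / bottleneck_pps)
        overflow = window > bdp_pkts + queue_pkts
        results.append((window, rate_pps, rtt_s, overflow))
    return results

# Hypothetical path: 10 ms RTT, 1000 packets/s bottleneck, 5-packet queue.
for window, rate, rtt, overflow in scan_windows(20, 0.010, 1000, 5):
    print(f"window={window:2d} rate={rate:7.1f} pps rtt={rtt * 1000:5.1f} ms loss={overflow}")
```

In this model the rate rises linearly until the window reaches the bandwidth-delay product, then RTT grows as the queue fills, then loss begins: three regimes that separately expose the bottleneck rate, the queue size, and the loss onset.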
Diagnosing applications
• Goal: Tools to “bench test” applications in the lab
– Client and server on the same LAN
  • App developer has easy access to all components
– Emulate a long ideal path between client and server
  • Also checks some OS and TCP features
  • Several different techniques (next topic)
• Developer gets first hand experience with delay
– If it fails in the lab, it will not work on a WAN
– Cannot blame the network
– Cannot repeal the speed of light
– Has to fix the application
Emulating delay
• Multiple techniques to emulate long paths
– Scenic routing via tunnels
– Kernel delays (e.g. netem, nistnet, dummynet)
– Application (pipe) delay via a proxy
• We have ~5 techniques prototyped/under test
– Kernel hacking vs non-privileged users
– Ease of use/ease of installation
– Maximum data rate
– Authenticity of the delay
• Not ready for prime time
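The core of the application-level (proxy) approach is a delay line: data is timestamped on arrival and released only after a fixed interval, without limiting how much can be in flight. The sketch below is an illustration only, not one of the prototyped techniques; the class name `DelayLine` and its methods are hypothetical, and the caller supplies the clock so the behavior can be checked without real sockets or sleeping.

```python
import collections

class DelayLine:
    """Hold each chunk for a fixed one-way delay, preserving order."""

    def __init__(self, delay_s):
        self.delay_s = delay_s
        self._queue = collections.deque()   # entries are (release_time, data)

    def push(self, data, now_s):
        """Accept data at time now_s; it becomes deliverable delay_s later."""
        self._queue.append((now_s + self.delay_s, data))

    def pop_ready(self, now_s):
        """Return, in arrival order, every chunk whose release time has passed."""
        ready = []
        while self._queue and self._queue[0][0] <= now_s:
            ready.append(self._queue.popleft()[1])
        return ready

# Emulate a 50 ms one-way delay (100 ms RTT if used in both directions).
line = DelayLine(0.050)
line.push(b"request-1", now_s=0.000)
line.push(b"request-2", now_s=0.010)
print(line.pop_ready(now_s=0.040))   # nothing is due yet
print(line.pop_ready(now_s=0.050))   # first chunk is released
print(line.pop_ready(now_s=0.100))   # second chunk is released
```

A real proxy would wrap one such line around each direction of a relayed connection. Because the queue is unbounded, the delay does not cap throughput, which matters when the goal is an ideal long path rather than a congested one.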
Try it!
http://www.psc.edu/networking/projects/pathdiag/