Detailed diagnosis in enterprise networks
Srikanth Kandula, Ratul Mahajan, Patrick Verkaik (UCSD), Sharad Agarwal, Jitu Padhye, Victor Bahl
Network diagnosis
Explaining faulty behavior
ratul | sigcomm | '09
Current landscape of network diagnosis systems
Big enterprises, large ISPs
Small enterprises: ??
(Axis: network size)
Why study small enterprise networks separately?
Big enterprises, large ISPs
Small enterprises
• Less sophisticated admins
• Less rich connectivity
• Many shared components
IIS, SQL, Exchange, …
Our work
1. Shows that small enterprises need “detailed diagnosis”
   • Not enabled by current systems that focus on scale
2. Develops NetMedic for detailed diagnosis
   • Diagnoses application faults without application knowledge
Understanding problems in small enterprises
100+ cases
Symptoms, root causes
Symptom                            % of cases
App-specific                       60 %
Failed initialization              13 %
Poor performance                   10 %
Hang or crash                      10 %
Unreachability                      7 %

Identified cause                   % of cases
Non-app config (e.g., firewall)    30 %
Software/driver bug                21 %
App config                         19 %
Overload                            4 %
Hardware fault                      2 %
Unknown                            25 %
And the survey says …
Detailed diagnosis
Handle app-specific as well as generic faults
Identify culprits at a fine granularity
Example problem 1: Server misconfig
Web server
Browser
Browser
Server config
Example problem 2: Buggy client
SQL server
SQL client C2
SQL client C1
Requests
Current formulations sacrifice detail (to scale)
Dependency graph based formulations (e.g., Sherlock [SIGCOMM2007])
• Model the network as a dependency graph at a coarse level
• Simple dependency model
Example problem 1: Server misconfig
Web server
Browser
Browser
Server config
The network model is too coarse in current formulations
Example problem 2: Buggy client
SQL server
SQL client C2
SQL client C1
Requests
The dependency model is too simple in current formulations
A formulation for detailed diagnosis
Dependency graph of fine-grained components
Component state is a multi-dimensional vector
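[Figure: dependency graph over SQL svr, Exch. svr, IIS svr, IIS config, SQL client C1, and SQL client C2, with machines broken into Process, OS, and Config components]
[Example state variables: % CPU time, IO bytes/sec, Connections/sec, 404 errors/sec]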
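A minimal sketch of how such a formulation could be represented, assuming a directed graph whose nodes carry a time series of state vectors. The class names, the particular edges, and the numeric values below are hypothetical illustrations, not taken from NetMedic.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    # One entry per time bin; each entry maps a state-variable name to its value.
    history: list = field(default_factory=list)


@dataclass
class DependencyGraph:
    components: dict = field(default_factory=dict)
    # edges[src] is the set of components that src may directly affect.
    edges: dict = field(default_factory=dict)

    def add_component(self, name):
        self.components[name] = Component(name)
        self.edges.setdefault(name, set())
        return self.components[name]

    def add_dependency(self, src, dst):
        """Record that the state of src may directly influence dst."""
        self.edges[src].add(dst)

    def neighbors_affecting(self, dst):
        """Names of components with a direct edge into dst, sorted."""
        return sorted(s for s, ds in self.edges.items() if dst in ds)


def build_example():
    # Hypothetical wiring of a few components named on the slide.
    g = DependencyGraph()
    for n in ["IIS config", "IIS svr", "SQL svr", "SQL client C1", "SQL client C2"]:
        g.add_component(n)
    g.add_dependency("IIS config", "IIS svr")
    for client in ["SQL client C1", "SQL client C2"]:
        g.add_dependency(client, "SQL svr")   # requests flow to the server
        g.add_dependency("SQL svr", client)   # responses flow back
    # Made-up state vector for one time bin.
    g.components["IIS svr"].history.append({
        "% CPU time": 12.0,
        "IO bytes/sec": 3.0e4,
        "Connections/sec": 5.0,
        "404 errors/sec": 0.0,
    })
    return g
```

Keeping the state as an opaque name-to-value map reflects the point that diagnosis does not rely on what the variables mean.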
The goal of diagnosis
[Figure: Svr, C1, and C2, with Process, OS, and Config components]
Identify likely culprits for components of interest
Without using semantics of state variables
• No application knowledge
Using joint historical behavior to estimate impact
[Figure: historical state vectors of D and S, one row per time bin]
Time    State of D         State of S
0       d0a d0b d0c        s0a s0b s0c s0d
1       d1a d1b d1c        s1a s1b s1c s1d
. . .   . . .              . . .
n       dna dnb dnc        sna snb snc snd
Identify time periods when the state of S was “similar”
Measure how “similar”, on average, the states of D are at those times
[Figure: Svr with C1 and C2, annotated with request rate (low or high), response time (high), and labels H, H, L]
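The two steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not NetMedic's actual estimator: state values are assumed to be already scaled to [0, 1], similarity is one minus the mean absolute difference, and the `k` nearest time bins stand in for "similar" periods. The function names, `k`, and the sample numbers are hypothetical.

```python
def distance(u, v):
    """Mean absolute difference between two equal-length state vectors in [0, 1]."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)


def estimate_impact(s_history, d_history, s_now, d_now, k=2):
    """Estimate how strongly S appears to be impacting D right now.

    s_history[t] and d_history[t] are the states of S and D in time bin t.
    """
    # Step 1: pick the k historical bins in which S looked most like it does now.
    order = sorted(range(len(s_history)),
                   key=lambda t: distance(s_history[t], s_now))
    similar_times = order[:k]
    # Step 2: average how similar D's state in those bins is to D's state now.
    sims = [1.0 - distance(d_history[t], d_now) for t in similar_times]
    return sum(sims) / len(sims)


# Hypothetical history: S is a client's request rate, D is the server's response time.
s_hist = [[0.1], [0.9], [0.1], [0.8]]
d_hist = [[0.2], [0.9], [0.1], [0.8]]

# Client sending at a high rate while the server is slow: history agrees, impact is high.
busy_client = estimate_impact(s_hist, d_hist, s_now=[0.9], d_now=[0.9])
# Client sending at a low rate while the server is slow: history disagrees, impact is low.
idle_client = estimate_impact(s_hist, d_hist, s_now=[0.1], d_now=[0.9])
```

With these made-up numbers the busy client scores well above the idle one, which matches the intuition that only a client whose past behavior explains the server's current state should be blamed.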
Robust implementation of impact estimation
• Ignore state variables that represent redundant info
• Place higher weight on state variables likely related to faults being diagnosed
• Ignore state variables irrelevant to interaction with neighbor
• Account for aggregate relationships among state variables of neighboring components
• Account for disparate ranges of state variables
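As an illustration of the last point only, one common way to handle disparate ranges is to rescale each variable by its own observed minimum and maximum. The sketch below assumes that approach; the slides do not say how NetMedic normalizes, and the function name and sample values are hypothetical.

```python
def normalize_history(history):
    """Rescale every state variable to [0, 1] using its own min and max in the history.

    history is a list of dicts, one per time bin, all with the same keys.
    Variables that never change are mapped to 0.0.
    """
    names = list(history[0].keys())
    lo = {n: min(row[n] for row in history) for n in names}
    hi = {n: max(row[n] for row in history) for n in names}
    normalized = []
    for row in history:
        normalized.append({
            n: 0.0 if hi[n] == lo[n] else (row[n] - lo[n]) / (hi[n] - lo[n])
            for n in names
        })
    return normalized


# Made-up readings: CPU is in percent, IO is in bytes/sec, so raw ranges differ widely.
raw = [
    {"% CPU time": 10.0, "IO bytes/sec": 1000.0},
    {"% CPU time": 50.0, "IO bytes/sec": 5000.0},
    {"% CPU time": 30.0, "IO bytes/sec": 3000.0},
]
scaled = normalize_history(raw)
```

After rescaling, a change in IO bytes/sec no longer swamps a change in % CPU time when states are compared.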
Diagnose: a. edge impact, b. path impact
Implementation of NetMedic
Target components, diagnosis time, reference time
Monitor components
Component states
Ranked list of likely culprits
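A rough sketch of how per-edge impacts might be combined into path impacts and a ranked list. It assumes a path's weight is the geometric mean of its edge impacts and that a component's score is its best acyclic path to the target; the slides do not spell out this rule, so treat it as an illustrative choice. The function names and edge values are hypothetical.

```python
def best_path_weight(edge_impact, src, dst):
    """Largest geometric mean of edge impacts over acyclic paths from src to dst.

    edge_impact maps (from_component, to_component) to an impact in [0, 1].
    Returns 0.0 if dst cannot be reached from src.
    """
    best = 0.0
    # Each stack entry: current node, product of weights so far, hop count, visited set.
    stack = [(src, 1.0, 0, frozenset([src]))]
    while stack:
        node, prod, hops, seen = stack.pop()
        if node == dst and hops > 0:
            best = max(best, prod ** (1.0 / hops))
            continue
        for (a, b), w in edge_impact.items():
            if a == node and b not in seen:
                stack.append((b, prod * w, hops + 1, seen | {b}))
    return best


def rank_culprits(edge_impact, target):
    """Order all other components by their estimated impact on target, highest first."""
    nodes = {n for edge in edge_impact for n in edge} - {target}
    scores = {n: best_path_weight(edge_impact, n, target) for n in nodes}
    return sorted(scores, key=lambda n: (-scores[n], n))


# Made-up edge impacts for the buggy-client example: C1 floods Svr, and C2 suffers.
edges = {
    ("C1", "Svr"): 0.9,
    ("C2", "Svr"): 0.2,
    ("Svr", "C1"): 0.8,
    ("Svr", "C2"): 0.8,
}
ranking = rank_culprits(edges, target="C2")  # ["C1", "Svr"]
```

The exhaustive path search is exponential in the worst case, so it is only suitable for small illustrative graphs like this one.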
Evaluation setup
IIS, SQL, Exchange, …
10 actively used desktops
Diverse set of faults observed in the logs
#components: ~1000
#dimensions per component (avg): 35
NetMedic assigns low ranks to actual culprits
[Figure: cumulative % of faults (y-axis, 0 to 100) vs. rank of actual culprit (x-axis, 0 to 100), for NetMedic and Coarse]
NetMedic handles concurrent faults well
2 simultaneous faults
[Figure: cumulative % of faults (y-axis, 0 to 100) vs. rank of actual culprit (x-axis, 0 to 100), for NetMedic and Coarse]
Other results in the paper
NetMedic needs a modest amount (~60 mins) of history
It compares favorably with a method that understands variable semantics
Conclusions
NetMedic enables detailed diagnosis in enterprise networks w/o application knowledge
Think small: Small enterprise networks deserve more attention