Detailed diagnosis in enterprise networks

Srikanth Kandula, Ratul Mahajan, Patrick Verkaik (UCSD), Sharad Agarwal, Jitu Padhye, Victor Bahl

Network diagnosis

Explaining faulty behavior

ratul | sigcomm | '09

Current landscape of network diagnosis systems

Figure (by network size): big enterprises and large ISPs at the large end; small enterprises marked “??”.

Why study small enterprise networks separately?

Big enterprises, large ISPs

Small enterprises
• Less sophisticated admins
• Less rich connectivity
• Many shared components (IIS, SQL, Exchange, …)

Our work

1. Shows that small enterprises need “detailed diagnosis”
• Not enabled by current systems that focus on scale

2. Develops NetMedic for detailed diagnosis
• Diagnoses application faults without application knowledge

Understanding problems in small enterprises

100+ cases

Symptoms, root causes

Symptom                            % of cases
App-specific                       60 %
Failed initialization              13 %
Poor performance                   10 %
Hang or crash                      10 %
Unreachability                      7 %

Identified cause                   % of cases
Non-app config (e.g., firewall)    30 %
Software/driver bug                21 %
App config                         19 %
Overload                            4 %
Hardware fault                      2 %
Unknown                            25 %

And the survey says …

Detailed diagnosis

Handle app-specific as well as generic faults

Identify culprits at a fine granularity

Example problem 1: Server misconfig

Figure: a web server (with its server config) and two browsers.

Example problem 2: Buggy client

Figure: SQL clients C1 and C2 send requests to a SQL server.

Current formulations sacrifice detail (to scale)

Dependency graph based formulations (e.g., Sherlock [SIGCOMM2007])

• Model the network as a dependency graph at a coarse level
• Simple dependency model

Example problem 1: Server misconfig


The network model is too coarse in current formulations

Example problem 2: Buggy client


The dependency model is too simple in current formulations

A formulation for detailed diagnosis

Dependency graph of fine-grained components

Component state is a multi-dimensional vector

Figure: example dependency graph with components SQL svr, Exch. svr, IIS svr, IIS config, Process, OS, Config, SQL client C1, SQL client C2.

Example state variables: % CPU time, IO bytes/sec, Connections/sec, 404 errors/sec.
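As a rough illustration of this formulation, the sketch below stores each component's state as a named vector and keeps directed dependency edges between components. The class names, the edge directions and all numeric values are assumptions made for the example; the slides name only the components and the state variables.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A fine-grained component whose state is a multi-dimensional vector."""
    name: str
    state: dict[str, float] = field(default_factory=dict)  # variable -> value


@dataclass
class DependencyGraph:
    components: dict[str, Component] = field(default_factory=dict)
    edges: set[tuple[str, str]] = field(default_factory=set)  # (source, destination)

    def add_component(self, name, state=None):
        self.components[name] = Component(name, dict(state or {}))

    def add_edge(self, src, dst):
        """Record that src may affect dst."""
        for name in (src, dst):
            if name not in self.components:
                raise KeyError(f"unknown component: {name}")
        self.edges.add((src, dst))

    def neighbors_affecting(self, dst):
        """Names of components with an edge into dst, in sorted order."""
        return sorted(s for s, d in self.edges if d == dst)


graph = DependencyGraph()
# Illustrative values only; they are not taken from the talk.
graph.add_component("IIS svr", {
    "% CPU time": 12.0,
    "IO bytes/sec": 4096.0,
    "Connections/sec": 30.0,
    "404 errors/sec": 0.5,
})
graph.add_component("IIS config")
graph.add_edge("IIS config", "IIS svr")  # assumed direction: config affects server
```

Note that nothing in the structure interprets the variable names, which matches the stated aim of diagnosing without application knowledge.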

The goal of diagnosis

Identify likely culprits for components of interest

• Without using semantics of state variables
• No application knowledge

Figure: Svr with clients C1 and C2, and the components Process, OS, Config.

Using joint historical behavior to estimate impact

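Joint history of two components D and S. At each past time $t = 0, 1, \ldots, n$, the state of D is the vector $(d_t^a, d_t^b, d_t^c)$ and the state of S is the vector $(s_t^a, s_t^b, s_t^c, s_t^d)$:

$$
D:\ \begin{pmatrix}
d_0^a & d_0^b & d_0^c \\
d_1^a & d_1^b & d_1^c \\
\vdots & \vdots & \vdots \\
d_n^a & d_n^b & d_n^c
\end{pmatrix}
\qquad
S:\ \begin{pmatrix}
s_0^a & s_0^b & s_0^c & s_0^d \\
s_1^a & s_1^b & s_1^c & s_1^d \\
\vdots & \vdots & \vdots & \vdots \\
s_n^a & s_n^b & s_n^c & s_n^d
\end{pmatrix}
$$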

Identify time periods when state of S was “similar”

Measure how “similar”, on average, the states of D are at those times

Figure: Svr with clients C1 and C2. Annotations: request rate (low), response time (high); request rate (high), response time (high); request rate (high). Edge labels: H, H, L.
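The two steps on this slide can be sketched as follows: pick the past times when S looked most like it does now, then average how close D's state at those times is to D's current state. This is a simplified illustration, not NetMedic's actual procedure; the function names, the mean-absolute-difference distance, the choice of `k` and the assumption that all values are already scaled to [0, 1] are mine.

```python
def distance(u, v):
    """Mean absolute difference between two equal-length state vectors."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)


def edge_impact(history_s, history_d, s_now, d_now, k=3):
    """Estimate the impact of S on D from their joint history.

    history_s[t] and history_d[t] are the states of S and D at the same
    past time t. All values are assumed to be scaled to [0, 1].
    """
    # Step 1: find the k past times when S's state was most similar to now.
    times = sorted(range(len(history_s)),
                   key=lambda t: distance(history_s[t], s_now))[:k]
    # Step 2: average how similar D's state at those times is to D's state now.
    similarities = [1.0 - distance(history_d[t], d_now) for t in times]
    return sum(similarities) / len(similarities)


# Hypothetical history: S is a client (request rate), D is a server (response time).
history_s = [[0.1], [0.2], [0.1], [0.9], [0.8], [0.9]]
history_d = [[0.1], [0.2], [0.1], [0.9], [0.9], [0.8]]

# A client sending many requests now, while the server is slow: high impact.
high = edge_impact(history_s, history_d, s_now=[0.9], d_now=[0.9])
# A client sending few requests now, while the server is slow: low impact.
low = edge_impact(history_s, history_d, s_now=[0.1], d_now=[0.9])
```

In this toy history, a slow server has only ever coincided with a busy client, so the busy client receives the higher impact estimate.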

Robust implementation of impact estimation

• Ignore state variables that represent redundant info
• Place higher weight on state variables likely related to faults being diagnosed
• Ignore state variables irrelevant to interaction with neighbor
• Account for aggregate relationships among state variables of neighboring components
• Account for disparate ranges of state variables
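To make the last point concrete, here is one possible way to put state variables with very different ranges on a common scale before comparing them. Min-max scaling against the historical range is an assumption for illustration; the slides do not say how NetMedic handles this, and `normalize_by_history` is a hypothetical helper.

```python
def normalize_by_history(history, current):
    """Scale each state variable to [0, 1] using its range in the history.

    history is a list of past state vectors; current is the vector to scale.
    """
    n_vars = len(current)
    lows = [min(row[i] for row in history) for i in range(n_vars)]
    highs = [max(row[i] for row in history) for i in range(n_vars)]
    scaled = []
    for value, low, high in zip(current, lows, highs):
        if high == low:
            # A variable that never changed carries no information here.
            scaled.append(0.0)
        else:
            # Clamp so values outside the historical range stay in [0, 1].
            scaled.append(min(1.0, max(0.0, (value - low) / (high - low))))
    return scaled


# Hypothetical history of (% CPU time, IO bytes/sec) for one component.
history = [[10.0, 1_000_000.0], [50.0, 3_000_000.0], [90.0, 5_000_000.0]]
scaled = normalize_by_history(history, [50.0, 4_000_000.0])
```

After scaling, a change in IO bytes/sec no longer dominates a change in % CPU time simply because its raw numbers are larger.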

Implementation of NetMedic

Inputs: target components, diagnosis time, reference time

Monitor components → Component states → Diagnose (a. edge impact, b. path impact) → Ranked list of likely culprits
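The last two stages of the pipeline can be sketched as follows, under stated assumptions: the impact of a path is taken to be the product of its edge impacts, the impact of one component on another is the best such path, and culprits are ranked by that value alone. The slides do not specify how edge impacts are combined or what else feeds the ranking, so `path_impact`, `rank_culprits` and the edge values are hypothetical.

```python
def path_impact(edge_impacts, source, target, visited=None):
    """Best product of edge impacts over simple paths from source to target."""
    if source == target:
        return 1.0
    visited = (visited or set()) | {source}
    best = 0.0
    for (src, dst), weight in edge_impacts.items():
        if src == source and dst not in visited:
            rest = path_impact(edge_impacts, dst, target, visited)
            best = max(best, weight * rest)
    return best


def rank_culprits(edge_impacts, target):
    """Order all other components by their path impact on the target."""
    components = {c for edge in edge_impacts for c in edge} - {target}
    scored = [(path_impact(edge_impacts, c, target), c) for c in components]
    # Highest impact first; ties are broken by name for a stable order.
    return [c for _, c in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Hypothetical edge impacts, keyed by (source, destination).
edge_impacts = {
    ("C1", "Svr"): 0.9,
    ("C2", "Svr"): 0.2,
    ("Svr config", "Svr"): 0.5,
    ("Svr", "C2"): 0.8,
}
ranking = rank_culprits(edge_impacts, target="C2")
```

With these numbers, C1 reaches the affected client C2 only through Svr, so its path impact is 0.9 × 0.8 = 0.72 and it ranks just below Svr itself.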

Evaluation setup

• IIS, SQL, Exchange, …
• 10 actively used desktops
• Diverse set of faults observed in the logs
• #components: ~1000
• #dimensions per component (avg): 35

NetMedic assigns low ranks to actual culprits

Figure: cumulative % of faults (y-axis, 0 to 100) versus rank of actual culprit (x-axis, 0 to 100), for NetMedic and Coarse.
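For readers unfamiliar with this kind of plot, the sketch below shows how such a curve can be computed from the rank that a diagnosis system assigned to the true culprit in each fault. The function name and the example ranks are made up for illustration and are not results from the paper.

```python
def cumulative_percent_by_rank(culprit_ranks, max_rank):
    """For each rank r in 1..max_rank, the % of faults whose actual
    culprit was ranked r or better (a lower rank is better)."""
    total = len(culprit_ranks)
    return [
        100.0 * sum(1 for rank in culprit_ranks if rank <= r) / total
        for r in range(1, max_rank + 1)
    ]


# Hypothetical ranks of the actual culprit across five injected faults.
example_ranks = [1, 1, 2, 3, 5]
curve = cumulative_percent_by_rank(example_ranks, max_rank=5)
```

A curve that rises quickly toward 100 % means the actual culprit usually appears near the top of the ranked list.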

NetMedic handles concurrent faults well

2 simultaneous faults

Figure: cumulative % of faults (y-axis, 0 to 100) versus rank of actual culprit (x-axis, 0 to 100), for NetMedic and Coarse.

Other results in the paper

NetMedic needs only a modest amount of history (~60 minutes)

It compares favorably with a method that understands variable semantics


Conclusions

NetMedic enables detailed diagnosis in enterprise networks without application knowledge

Think small: Small enterprise networks deserve more attention
