Detailed diagnosis in enterprise networks Srikanth Kandula, Ratul Mahajan, Patrick Verkaik (UCSD), Sharad Agarwal, Jitu Padhye, Victor Bahl


Page 1

Detailed diagnosis in enterprise networks

Srikanth Kandula, Ratul Mahajan, Patrick Verkaik (UCSD), Sharad Agarwal, Jitu Padhye, Victor Bahl

Page 2

Network diagnosis

Explaining faulty behavior

ratul | sigcomm | '09

Page 3

Current landscape of network diagnosis systems

[Figure: diagnosis systems placed along a "Network size" axis. Big enterprises and large ISPs are covered; small enterprises are marked "??".]

Page 4

Why study small enterprise networks separately?

Big enterprises, large ISPs

Small enterprises
• Less sophisticated admins
• Less rich connectivity
• Many shared components
• IIS, SQL, Exchange, …

Page 5

Our work

1. Shows that small enterprises need “detailed diagnosis”
   • Not enabled by current systems that focus on scale

2. Develops NetMedic for detailed diagnosis
   • Diagnoses application faults without application knowledge

Page 6

Understanding problems in small enterprises


100+ cases

Symptoms, root causes

Page 7

And the survey says …

Symptom
  App-specific                      60 %
  Failed initialization             13 %
  Poor performance                  10 %
  Hang or crash                     10 %
  Unreachability                     7 %

Identified cause
  Non-app config (e.g., firewall)   30 %
  Software/driver bug               21 %
  App config                        19 %
  Overload                           4 %
  Hardware fault                     2 %
  Unknown                           25 %

Detailed diagnosis
• Handle app-specific as well as generic faults
• Identify culprits at a fine granularity

Page 8

Example problem 1: Server misconfig

[Figure labels: Web server, Server config, Browser, Browser]

Page 9

Example problem 2: Buggy client

[Figure labels: SQL server, SQL client C1, SQL client C2, Requests]

Page 10

Current formulations sacrifice detail (to scale)

Dependency graph based formulations (e.g., Sherlock [SIGCOMM2007])

• Model the network as a dependency graph at a coarse level
• Simple dependency model

Page 11

Example problem 1: Server misconfig

[Figure labels: Web server, Server config, Browser, Browser]

The network model is too coarse in current formulations

Page 12

Example problem 2: Buggy client

[Figure labels: SQL server, SQL client C1, SQL client C2, Requests]

The dependency model is too simple in current formulations

Page 13

A formulation for detailed diagnosis

Dependency graph of fine-grained components

Component state is a multi-dimensional vector

[Figure: dependency graph. Components: SQL svr, Exch. svr, IIS svr, IIS config, Process, OS, Config, SQL client C1, SQL client C2. Example state variables: % CPU time, IO bytes/sec, Connections/sec, 404 errors/sec.]
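The formulation on this slide can be pictured as a small data structure. The following is only a minimal sketch, not NetMedic's actual representation: the class names `Component` and `DependencyGraph`, the method names, the edge direction, and every numeric value are assumptions made for illustration. The component and variable names are borrowed from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Component:
    """A fine-grained component; its state is a vector of named variables."""
    name: str
    state: Dict[str, float] = field(default_factory=dict)


@dataclass
class DependencyGraph:
    """Hypothetical container: components plus directed 'may affect' edges."""
    components: Dict[str, Component] = field(default_factory=dict)
    # edges[src] lists the components whose state src may directly affect
    edges: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, component: Component) -> None:
        self.components[component.name] = component
        self.edges.setdefault(component.name, [])

    def depends_on(self, dst: str, src: str) -> None:
        """Record that the state of `dst` may be affected by `src`."""
        self.edges[src].append(dst)


graph = DependencyGraph()
# State values below are invented purely to show the shape of the data.
graph.add(Component("IIS svr", {"% CPU time": 12.0, "404 errors/sec": 0.0}))
graph.add(Component("IIS config"))
graph.add(Component("SQL client C1", {"Connections/sec": 3.0}))
# Assumed edge: the server process depends on its configuration.
graph.depends_on("IIS svr", "IIS config")
```

Keeping state as an open-ended mapping of variable name to value fits the slide's point that a component's state is multi-dimensional, and it avoids baking any application-specific meaning into the structure.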

Page 14

The goal of diagnosis

Identify likely culprits for components of interest
• Without using semantics of state variables
• No application knowledge

[Figure labels: Svr, C1, C2, Process, OS, Config]

Page 15

Using joint historical behavior to estimate impact

Joint history of the states of components D and S:

Time    D                          S
0       d_0^a  d_0^b  d_0^c        s_0^a  s_0^b  s_0^c  s_0^d
1       d_1^a  d_1^b  d_1^c        s_1^a  s_1^b  s_1^c  s_1^d
. . .
n       d_n^a  d_n^b  d_n^c        s_n^a  s_n^b  s_n^c  s_n^d

1. Identify time periods when the state of S was “similar”
2. Measure how “similar”, on average, the states of D are at those times

[Figure: Svr, C1, C2. Annotations: "Request rate (low), Response time (high)"; "Request rate (high), Response time (high)"; "Request rate (high)". Labels: H, H, L.]
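The two steps above can be sketched in code. This is a rough illustration under stated assumptions, not the NetMedic algorithm itself: the function names, the choice of a range-scaled absolute difference as the similarity measure, the parameter `k` (how many similar past times to use), and all numbers are invented for the example.

```python
def variable_ranges(history):
    """Historical range (max - min) of each state variable."""
    return [max(col) - min(col) for col in zip(*history)]


def distance(x, y, ranges):
    """Mean per-variable difference, each scaled by its range and capped at 1."""
    diffs = [min(abs(a - b) / r, 1.0) if r > 0 else 0.0
             for a, b, r in zip(x, y, ranges)]
    return sum(diffs) / len(diffs)


def estimate_impact(s_hist, d_hist, s_now, d_now, k=2):
    """Estimate how strongly S is affecting D right now, in [0, 1].

    Step 1: pick the k past times when S's state was most similar to s_now.
    Step 2: average how similar D's state at those times is to d_now.
    """
    s_ranges = variable_ranges(s_hist + [s_now])
    d_ranges = variable_ranges(d_hist + [d_now])
    order = sorted(range(len(s_hist)),
                   key=lambda t: distance(s_hist[t], s_now, s_ranges))
    chosen = order[:k]
    sims = [1.0 - distance(d_hist[t], d_now, d_ranges) for t in chosen]
    return sum(sims) / len(sims)


# Made-up history: S has one variable (request rate), D has one (response time).
s_hist = [[10.0], [12.0], [90.0], [95.0]]
d_hist = [[0.10], [0.12], [0.90], [1.00]]

# D looks now as it did whenever S looked like this: S likely explains D.
high = estimate_impact(s_hist, d_hist, s_now=[92.0], d_now=[0.95])
# D looks nothing like it did at those times: S is an unlikely explanation.
low = estimate_impact(s_hist, d_hist, s_now=[92.0], d_now=[0.10])
```

Nothing in the sketch interprets what a variable means; it only compares numbers against their own history, which is the sense in which the approach needs no application knowledge.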

Page 16

Robust implementation of impact estimation

• Ignore state variables that represent redundant info
• Place higher weight on state variables likely related to faults being diagnosed
• Ignore state variables irrelevant to interaction with neighbor
• Account for aggregate relationships among state variables of neighboring components
• Account for disparate ranges of state variables
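Two of these measures, weighting variables and handling disparate ranges, lend themselves to a short sketch. The code below is one plausible reading rather than the authors' implementation: the `abnormality` weight (a squashed z-score) is a hypothetical choice, and the histories and values are made up.

```python
from statistics import mean, pstdev


def range_normalize(value, history):
    """Map a value onto its historical range so variables become comparable."""
    lo, hi = min(history), max(history)
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def abnormality(value, history):
    """Hypothetical weight in [0, 1): larger when value is far from its usual level."""
    sd = pstdev(history)
    if sd == 0:
        return 0.0
    z = abs(value - mean(history)) / sd
    return z / (1.0 + z)


def weighted_difference(now, then, histories):
    """Weighted mean of range-normalized differences between two state vectors."""
    weights = [abnormality(v, h) for v, h in zip(now, histories)]
    if sum(weights) == 0:
        weights = [1.0] * len(now)  # nothing looks abnormal: weigh all equally
    diffs = [min(abs(range_normalize(a, h) - range_normalize(b, h)), 1.0)
             for a, b, h in zip(now, then, histories)]
    return sum(w * d for w, d in zip(weights, diffs)) / sum(weights)


# Made-up histories for two variables with very different ranges.
histories = [[10.0, 12.0, 11.0, 13.0],           # e.g. % CPU time
             [1000.0, 1200.0, 1100.0, 1300.0]]   # e.g. IO bytes/sec
diff = weighted_difference(now=[12.0, 5000.0], then=[11.0, 1100.0],
                           histories=histories)
```

Without the normalization, the second variable would dominate simply because its raw numbers are larger; with the weights, the variable that is currently behaving unusually counts for more.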

Page 17

Implementation of NetMedic

[Figure: NetMedic pipeline]
• Inputs: Target components, Diagnosis time, Reference time
• Monitor components
• Component states
• Diagnose: a. edge impact, b. path impact
• Ranked list of likely culprits
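The step from edge impact to path impact to a ranked list can be sketched as follows. This is an illustration under explicit assumptions, not the system's actual scoring: it assumes a path's score is the geometric mean of its edge impacts, that the best acyclic path to the target determines a component's score, and that ranking uses that score alone. The function names and all edge weights are invented.

```python
import math


def path_impacts(edge_impact, target):
    """Best path score from every component that can reach `target`.

    edge_impact[(src, dst)] is an estimated impact of src on dst in [0, 1].
    Assumption: a path's score is the geometric mean of its edge impacts.
    """
    incoming = {}
    for (src, dst), w in edge_impact.items():
        incoming.setdefault(dst, []).append((src, w))

    best = {}

    def walk(node, weights, visited):
        # Walk dependency edges backwards from the target, skipping cycles.
        for src, w in incoming.get(node, []):
            if src in visited:
                continue
            ws = weights + [w]
            score = math.prod(ws) ** (1.0 / len(ws))
            if src not in best or score > best[src]:
                best[src] = score
            walk(src, ws, visited | {src})

    walk(target, [], {target})
    return best


def rank_culprits(edge_impact, target):
    """Components ordered from most to least likely culprit for `target`."""
    scores = path_impacts(edge_impact, target)
    return sorted(scores, key=scores.get, reverse=True)


# Made-up edge impacts for a tiny graph; C1 is the component of interest.
edges = {
    ("Config", "Svr"): 0.9,
    ("C2", "Svr"): 0.2,
    ("Svr", "C1"): 0.8,
}
ranking = rank_culprits(edges, target="C1")
```

With these invented weights, the server's configuration ranks ahead of the server and of client C2, because the path Config → Svr → C1 is strong on both edges while C2's edge into the server is weak.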

Page 18

Evaluation setup

• IIS, SQL, Exchange, …
• 10 actively used desktops
• Diverse set of faults observed in the logs
• #components: ~1000
• #dimensions per component (avg): 35

Page 19

NetMedic assigns low ranks to actual culprits

[Figure: cumulative distribution plot. x-axis: Rank of actual culprit (0 to 100); y-axis: Cumulative % of faults (0 to 100); curves: NetMedic, Coarse.]

Page 20

NetMedic handles concurrent faults well

[Figure: cumulative distribution plot for 2 simultaneous faults. x-axis: Rank of actual culprit (0 to 100); y-axis: Cumulative % of faults (0 to 100); curves: NetMedic, Coarse.]

Page 21

Other results in the paper

NetMedic needs a modest amount (~60 mins) of history

It compares favorably with a method that understands variable semantics


Page 22

Conclusions

NetMedic enables detailed diagnosis in enterprise networks w/o application knowledge

Think small: Small enterprise networks deserve more attention
