Automatic Misconfiguration Diagnosis with PeerPressure
Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang
Microsoft Research
OSDI 2004, San Francisco, CA
Misconfiguration Diagnosis
• Technical support contributes 17% of TCO [Tolly2000]
• Much application malfunctioning comes from misconfiguration
• Why?
  – Shared configuration data (e.g., the Registry) with uncoordinated access and updates from different applications
• How about maintaining a golden config state? Very hard [Larsson2001]
  – Complex software components and compositions
  – Third-party applications
  – …
Outline
• Motivation
• Goals
• Design
• Prototype
• Evaluation results
• Future work
• Concluding remarks
Goals
• Effectiveness
  – A small set of sick configuration candidates that contains the root-cause entries
• Automation
  – No second-party involvement
  – No need to remember or identify what is healthy
Intuition behind PeerPressure
• Assumption: applications function correctly on most machines -- malfunctioning is the anomaly
• Succumb to the peer pressure
An Example
Suspect   Mine   P1's   P2's   P3's   P4's
e1        0      1      1      1      1
e2        on     on     on     on     off
e3        57     4      0      100    34

• Is e1 sick? Most likely
• Is e2 sick? Probably not
• Is e3 sick? Maybe not
  – e3 looks like an operational state
• We use Bayesian statistics to estimate the sick probability of a suspect -- our ranking metric
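The ranking can be made concrete on this table. A minimal Python sketch (illustrative only: the sick-probability formula is the one given later in the deck, and "cardinality" here is just the number of distinct values observed among these five machines):

```python
def sick_probability(N, c, t, m):
    # Bayesian ranking metric from the deck:
    # P(sick) = (N + c) / (N + c*t + c*m*(t - 1))
    return (N + c) / (N + c * t + c * m * (t - 1))

table = {                       # suspect -> (my value, peer values P1..P4)
    "e1": ("0",  ["1", "1", "1", "1"]),
    "e2": ("on", ["on", "on", "on", "off"]),
    "e3": ("57", ["4", "0", "100", "34"]),
}

t = len(table)                  # number of suspects
scores = {}
for name, (mine, peers) in table.items():
    N = len(peers)                             # peer samples
    c = len(set(peers) | {mine})               # observed cardinality
    m = sum(1 for v in peers if v == mine)     # peers matching my value
    scores[name] = sick_probability(N, c, t, m)

# e1 (unanimous mismatch, low cardinality) ranks most suspicious;
# e3's high cardinality discounts its mismatch; e2 mostly conforms.
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these counts the ordering comes out e1, then e3, then e2, matching the slide's intuition.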
Registry Entry Suspects
0HKLM\System\Setup\...
OnHKLM\Software\Msft\...
nullHKCU\%\Software\...
DataEntry
PeerPressure
Search& Fetch
StatisticalAnalyzer
CanonicalizerPeer-to-Peer
TroubleshootingCommunity
Database
Troubleshooting Result
0.2HKLM\System\Setup\...
0.6HKLM\Software\Msft\...
0.003HKCU\%\Software\...
Prob.Entry
AppTracer
Run the faulty app
System Overview
The Sick Probability
• P(sick) = (N + c) / (N + ct + cm(t - 1))
  – N: the number of samples
  – c: the cardinality of the entry
  – t: the number of suspects
  – m: the number of samples whose value matches the suspect entry's value
• Properties:
  – As m increases, P decreases
  – As c increases, P decreases; when m = 0, a smaller c implies a larger P (a mismatch on a low-cardinality entry is more suspicious)
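As a sanity check on these properties, a direct Python transcription of the slide's formula (variable names as above):

```python
def sick_probability(N, c, t, m):
    """P(sick) for one suspect entry, per the slide's formula.

    N: number of samples in the peer database
    c: cardinality (distinct values observed for this entry)
    t: number of suspect entries
    m: number of samples matching the suspect entry's value
    """
    return (N + c) / (N + c * t + c * m * (t - 1))

# Conformity with peers lowers suspicion: more matches, lower P(sick).
p_no_match   = sick_probability(N=87, c=2, t=10, m=0)
p_many_match = sick_probability(N=87, c=2, t=10, m=80)
```

Evaluating these shows the monotonic behavior the slide claims: P falls as m grows, and P falls as c grows for a fixed m.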
The PeerPressure Prototype
• Database of 87 live Windows XP registry snapshots as our sample pool
  – The Registry: hierarchical persistent storage for named, typed entries
• PeerPressure troubleshooter implemented in C#
• Needed to "sanitize" entry values
  – e.g., 1, "1", and "#1" all denote the same value
  – Heuristics: unifying values of entries with different types
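The deck gives only the 1 / "1" / "#1" example, and the prototype's actual heuristics (implemented in C#) are not specified. A hypothetical canonicalizer in the same spirit, sketched in Python, might look like this (the truthy/falsy spellings are assumptions, not from the paper):

```python
def canonicalize(value):
    """Normalize a registry entry value so equivalent spellings compare equal.

    Hypothetical sketch only; the prototype's real heuristics are not given.
    """
    s = str(value).strip().lstrip("#").lower()   # 1, "1", "#1" -> "1"
    if s in {"on", "true", "yes", "1"}:          # assumed boolean spellings
        return "1"
    if s in {"off", "false", "no", "0"}:
        return "0"
    return s
```

With a mapping like this, values from peers with different entry types can be compared directly when counting matches.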
Outline
Motivation
Goals
Design
Prototype
• Evaluation results
• Future work
• Concluding remarks
Windows Registry Characteristics
• Max snapshot size: 333,193 entries
• Min size: 77,517
• Average size: 198,376
• Median size: 198,608
• Cardinality: 87% of entries have cardinality 1; 94% have cardinality <= 2
• Distinct canonicalized entries in GeneBank: 1,476,665
• Common canonicalized entries: 43,913
• Distinct data-sanitized entries: 1,820,706
Evaluation Data Set
• 87 live Windows XP registry snapshots (in the database)
  – Half of these snapshots are from three diverse organizations within Microsoft: the Operations and Technology Group (OTG) Helpdesk in Colorado, MSR-Asia, and MSR-Redmond
  – The other half are from machines across Microsoft that were reported to have potential Registry problems
• 20 real-world troubleshooting cases with known root causes
Response Time
• Number of suspects: 8 to 26,308, with a median of 1,171
• 45 seconds on average, with SQL Server hosted on a workstation with a 2.4 GHz CPU and 1 GB RAM
• Sequential database queries dominate the response time
[Chart: response time in seconds (y-axis, 0 to 250) vs. number of suspects (x-axis, 8 to 5,483)]
Troubleshooting Effectiveness
• Metric: the rank of the root-cause entry among the suspects

• Results:
  – Rank = 1 for 12 cases
  – Rank = 2 for 3 cases
  – Ranks of 3, 9, 12, and 16 for the remaining 4 cases
  – One case could not be solved
Source of False Positives
• Nature of the root-cause entry
  – The root-cause entry has a large cardinality
• How unique the other suspects are
  – A highly customized machine likely produces more noise
• The database is not pristine
Impact of the Sample Set Size
• A larger sample set doesn't necessarily yield better accuracy
  – Strong conformity doesn't depend on the number of samples
  – An operational state doesn't depend on the number of samples
  – More samples only help with a non-pristine sample set
• 10 samples are enough for most cases
Related Work
• Blackbox-based techniques
  – Strider: needs to identify the healthy state [Wang '03]
  – Hardware and software component dependencies [Brown '01]
• Much prior work on leveraging statistics to pinpoint anomalies
  – Bugs as deviant behavior [Engler et al., SOSP '01]
  – Host-based intrusion detection based on system calls [Forrest '96] and on registry behavior [Apap et al. '99]
Future Work
• We've only scratched the surface!
• Multiple root cause entries
• Cross-application troubleshooting
• Database maintenance
• Privacy– Friends Troubleshooting Network
Concluding Remarks
• Automatic misconfiguration diagnosis is possible
  – Use statistics from the mass to automate the manual identification of the healthy state
  – Initial results are promising