Automatic Misconfiguration Diagnosis with PeerPressure
Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang
Microsoft Research
OSDI 2004, San Francisco, CA
Misconfiguration Diagnosis
• Technical support contributes 17% of TCO [Tolly2000]
• Much application malfunctioning comes from misconfiguration
• Why?
  – Shared configuration data (e.g., the Registry) with uncoordinated access and updates from different applications
• How about maintaining a golden config state? Very hard [Larsson2001]
  – Complex software components and compositions
  – Third-party applications
  – …
Outline
• Motivation
• Goals
• Design
• Prototype
• Evaluation results
• Future work
• Concluding remarks
Goals
• Effectiveness
  – A small set of sick configuration candidates that contains the root-cause entries
• Automation
  – No second-party involvement
  – No need to remember or identify what is healthy
Intuition behind PeerPressure
• Assumption: applications function correctly on most machines -- malfunctioning is the anomaly
• Succumb to the peer pressure
An Example
Suspect   Mine   P1's   P2's   P3's   P4's
e1        0      1      1      1      1
e2        on     on     on     on     off
e3        57     4      0      100    34

• Is e1 sick? Most likely
• Is e2 sick? Probably not
• Is e3 sick? Maybe not
  – e3 looks like an operational state
• We use Bayesian statistics to estimate the sick probability of a suspect -- our ranking metric
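The ranking can be made concrete on this table. A minimal Python sketch (illustrative only: the sick-probability formula is the one given later in the deck, and "cardinality" here is just the number of distinct values observed among these five machines):

```python
def sick_probability(N, c, t, m):
    # Bayesian ranking metric from the deck:
    # P(sick) = (N + c) / (N + c*t + c*m*(t - 1))
    return (N + c) / (N + c * t + c * m * (t - 1))

table = {                       # suspect -> (my value, peer values P1..P4)
    "e1": ("0",  ["1", "1", "1", "1"]),
    "e2": ("on", ["on", "on", "on", "off"]),
    "e3": ("57", ["4", "0", "100", "34"]),
}

t = len(table)                  # number of suspects
scores = {}
for name, (mine, peers) in table.items():
    N = len(peers)                             # peer samples
    c = len(set(peers) | {mine})               # observed cardinality
    m = sum(1 for v in peers if v == mine)     # peers matching my value
    scores[name] = sick_probability(N, c, t, m)

# e1 (unanimous mismatch, low cardinality) ranks most suspicious;
# e3's high cardinality discounts its mismatch; e2 mostly conforms.
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these counts the ordering comes out e1, then e3, then e2, matching the slide's intuition.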
Registry Entry Suspects
0HKLM\System\Setup\...
OnHKLM\Software\Msft\...
nullHKCU\%\Software\...
DataEntry
PeerPressure
Search& Fetch
StatisticalAnalyzer
CanonicalizerPeer-to-Peer
TroubleshootingCommunity
Database
Troubleshooting Result
0.2HKLM\System\Setup\...
0.6HKLM\Software\Msft\...
0.003HKCU\%\Software\...
Prob.Entry
AppTracer
Run the faulty app
System Overview
The Sick Probability
• P(sick) = (N + c) / (N + ct + cm(t - 1))
  – N: the number of samples
  – c: the cardinality of the entry
  – t: the number of suspects
  – m: the number of samples whose value matches the suspect entry's value
• Properties:
  – As m increases, P decreases
  – As c increases, P decreases; when m = 0, a smaller c implies a larger P (a mismatch on a low-cardinality entry is more suspicious)
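As a sanity check on these properties, a direct Python transcription of the slide's formula (variable names as above):

```python
def sick_probability(N, c, t, m):
    """P(sick) for one suspect entry, per the slide's formula.

    N: number of samples in the peer database
    c: cardinality (distinct values observed for this entry)
    t: number of suspect entries
    m: number of samples matching the suspect entry's value
    """
    return (N + c) / (N + c * t + c * m * (t - 1))

# Conformity with peers lowers suspicion: more matches, lower P(sick).
p_no_match   = sick_probability(N=87, c=2, t=10, m=0)
p_many_match = sick_probability(N=87, c=2, t=10, m=80)
```

Evaluating these shows the monotonic behavior the slide claims: P falls as m grows, and P falls as c grows for a fixed m.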
The PeerPressure Prototype
• Database of 87 live Windows XP registry snapshots as our sample pool
  – The Registry: hierarchical persistent storage for named, typed entries
• PeerPressure troubleshooter implemented in C#
• Needed to "sanitize" entry values
  – e.g., 1, "1", and "#1" all denote the same value
  – Heuristics: unifying values of entries with different types
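The deck gives only the 1 / "1" / "#1" example, and the prototype's actual heuristics (implemented in C#) are not specified. A hypothetical canonicalizer in the same spirit, sketched in Python, might look like this (the truthy/falsy spellings are assumptions, not from the paper):

```python
def canonicalize(value):
    """Normalize a registry entry value so equivalent spellings compare equal.

    Hypothetical sketch only; the prototype's real heuristics are not given.
    """
    s = str(value).strip().lstrip("#").lower()   # 1, "1", "#1" -> "1"
    if s in {"on", "true", "yes", "1"}:          # assumed boolean spellings
        return "1"
    if s in {"off", "false", "no", "0"}:
        return "0"
    return s
```

With a mapping like this, values from peers with different entry types can be compared directly when counting matches.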
Outline
Motivation
Goals
Design
Prototype
• Evaluation results
• Future work
• Concluding remarks
Windows Registry Characteristics
• Max snapshot size: 333,193 entries
• Min size: 77,517
• Average size: 198,376
• Median size: 198,608
• Cardinality: 87% of entries have cardinality 1; 94% have cardinality <= 2
• Distinct canonicalized entries in GeneBank: 1,476,665
• Common canonicalized entries: 43,913
• Distinct data-sanitized entries: 1,820,706
Evaluation Data Set
• 87 live Windows XP registry snapshots (in the database)
  – Half of these snapshots are from three diverse organizations within Microsoft: the Operations and Technology Group (OTG) Helpdesk in Colorado, MSR-Asia, and MSR-Redmond
  – The other half are from machines across Microsoft that were reported to have potential Registry problems
• 20 real-world troubleshooting cases with known root causes
Response Time
• Number of suspects: 8 to 26,308, with a median of 1,171
• 45 seconds on average, with SQL Server hosted on a workstation with a 2.4 GHz CPU and 1 GB RAM
• Sequential database queries dominate the response time
[Chart: response time in seconds (y-axis, 0 to 250) vs. number of suspects (x-axis, 8 to 5,483)]
Troubleshooting Effectiveness
• Metric: the rank of the root-cause entry among the suspects

• Results:
  – Rank = 1 for 12 cases
  – Rank = 2 for 3 cases
  – Ranks of 3, 9, 12, and 16 for the remaining 4 cases
  – One case could not be solved
Source of False Positives
• Nature of the root-cause entry
  – The root-cause entry has a large cardinality
• How unique the other suspects are
  – A highly customized machine likely produces more noise
• The database is not pristine
Impact of the Sample Set Size
• A larger sample set doesn't necessarily yield better accuracy
  – Strong conformity doesn't depend on the number of samples
  – An operational state doesn't depend on the number of samples
  – More samples only help with a non-pristine sample set
• 10 samples are enough for most cases
Related Work
• Blackbox-based techniques
  – Strider: needs to identify the healthy state [Wang '03]
  – Hardware and software component dependencies [Brown '01]
• Much prior work on leveraging statistics to pinpoint anomalies
  – Bugs as deviant behavior [Engler et al., SOSP '01]
  – Host-based intrusion detection based on system calls [Forrest '96] and on registry behavior [Apap et al. '99]
Future Work
• We've only scratched the surface!
• Multiple root cause entries
• Cross-application troubleshooting
• Database maintenance
• Privacy– Friends Troubleshooting Network
Concluding Remarks
• Automatic misconfiguration diagnosis is possible
  – Use statistics from the mass to automate the manual identification of the healthy state
  – Initial results are promising