R. Barret, P. Maglio, E. Kandogan, J. Bailey, Usable Autonomic Computing Systems: the Administrators' Perspective, ICAC 2004; Brown and J. Hellerstein, Reducing the Cost of IT Operations - Is Automation Always the Answer?, HOTOS 2005; Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang, Automatic Misconfiguration Troubleshooting with PeerPressure, OSDI ’04


Page 1:

• R. Barret, P. Maglio, E. Kandogan, J. Bailey, Usable Autonomic Computing Systems: the Administrators' Perspective, ICAC 2004

• Brown and J. Hellerstein, Reducing the Cost of IT Operations - Is Automation Always the Answer?, HOTOS 2005.

• Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang, Automatic Misconfiguration Troubleshooting with PeerPressure, OSDI ’04

Page 2:

• R. Barret, P. Maglio, E. Kandogan, J. Bailey, Usable Autonomic Computing Systems: the Administrators' Perspective, ICAC 2004

Page 3:

• Motivation
  – the problem of administering highly complex systems
  – managing complexity through automation
    • from low-level configuration settings to high-level business-oriented policies
  – the risk of making management harder
    • systems change more rapidly
    • administrator controls affect more systems
    • so administrator controls will be both more powerful and more dangerous

• Goal: inform the design of autonomic computing (AC)

• Methodology: ethnographic field study!

Page 4:

• What do system administrators do?
  – rehearsal and planning
  – maintaining situation awareness
  – managing multitasking, interruptions and diversions

Page 5:

Tools

• command-line based console
  – command-line interfaces (CLIs)
  – multitasking, history, scripting
  – fast and reliable probing of disparate parts of the system
  – easy to customize!

• standalone graphical applications
  – graphical user interfaces (GUIs)
  – good for unfamiliar tasks and novice users
  – depend on graphics support; insufficient support for multitasking

• web-based management tools
  – don’t depend on graphics support
  – can be integrated to provide an organized suite

Page 6:

Analysis and Guidelines for AC

• Phases
  – rehearsal and planning
  – maintaining situation awareness
  – managing multitasking, interruptions and diversions

Page 7:

• Rehearsing and Planning
  – necessary for critical systems because of both the chance of human error and the danger of unforeseen consequences
  – AC may increase both of these dangers
    • as the scale and degree of coupling within complex systems increase, new patterns of failure may develop through a series of several smaller failures
    • as autonomic managers automatically reconfigure subsystems, the effects on the overall system may be difficult to predict
  – Guidelines
    • it should be easy to build test systems
    • systems should be designed so that changes can be undone quickly
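
The "quickly undo changes" guideline can be illustrated with a small sketch. This is not taken from the paper; it assumes a hypothetical key-value configuration store (`ReversibleConfig`, an illustrative name) that journals the previous value of every setting it changes so that recent changes can be rolled back in reverse order.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

_MISSING = object()  # sentinel: the key did not exist before the change


@dataclass
class ReversibleConfig:
    """Hypothetical configuration store that journals every change."""
    settings: Dict[str, Any] = field(default_factory=dict)
    journal: List[Tuple[str, Any]] = field(default_factory=list)

    def set(self, key: str, value: Any) -> None:
        # Record the previous value (or the sentinel) before overwriting.
        self.journal.append((key, self.settings.get(key, _MISSING)))
        self.settings[key] = value

    def undo(self, steps: int = 1) -> None:
        # Roll back the most recent changes, newest first.
        for _ in range(min(steps, len(self.journal))):
            key, previous = self.journal.pop()
            if previous is _MISSING:
                self.settings.pop(key, None)
            else:
                self.settings[key] = previous


cfg = ReversibleConfig({"max_connections": 100})
cfg.set("max_connections", 500)
cfg.set("cache", "on")
cfg.undo(2)  # both changes are reverted
print(cfg.settings)
```

The same journal could also be replayed against a test system before touching production, which ties the two guidelines together.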

Page 8:

• Situation Awareness
  – Administrators deal with dynamic and complex processes at many different levels of abstraction
  – They need to be aware of systems that are not only complex, but that also change frequently
  – Each system had its own management interface and so gaining overall situation awareness was very difficult
  – Guidelines
    • Automation has made operators more passive
    • Automated systems typically hide details from operators
      – Consequently, operator workload decreases during normal operating conditions, but increases during critical conditions
    • Must provide facilities for rapidly gaining deeper situation awareness when problems arise

Page 9:

• Multitasking, Interruptions, Diversions
  – conventional systems
    • Working with many components, but each component works relatively independently
  – Guidelines
    • because each level affects a component’s operation, it will be difficult to design a general workflow for debugging
    • Therefore AC interfaces should allow multiple simultaneous views of system components and aggregates to support interaction at multiple levels

Page 10:

• Brown and J. Hellerstein, Reducing the Cost of IT Operations - Is Automation Always the Answer?, HOTOS 2005.

Page 11:

Is Automation Always the Answer?

No!

Why?

Page 12:

• Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang, Automatic Misconfiguration Troubleshooting with PeerPressure, OSDI ’04

Page 13:

Misconfiguration Diagnosis

• Technical support contributes 17% of TCO [Tolly2000]

• Much of application malfunctioning comes from misconfigurations

• Why?
  – Shared configuration data (e.g., Registry) and uncoordinated access and update from different applications

• How about maintaining the golden config state?
  – Very hard [Larsson2001]
    • Complex software components and compositions
    • Third party applications
    • …

Page 14:

Outline

• Motivation

• Goals

• Design

• Prototype

• Evaluation results

• Future work

• Concluding remarks

Page 15:

Goals

• Effectiveness
  – Small set of sick configuration candidates that contain the root-cause entries

• Automation
  – No second party involvement
  – No need to remember or identify what is healthy

Page 16:

Intuition behind PeerPressure

• Assumption
  – Applications function correctly on most machines -- malfunctioning is the anomaly

• Succumb to the peer pressure

Page 17:

An Example

Suspects   Mine   P1’s   P2’s   P3’s   P4’s
e1         0      1      1      1      1
e2         on     on     on     on     off
e3         57     4      0      100    34

• Is e1 sick? Most likely
• Is e2 sick? Probably not
• Is e3 sick? Maybe not
  – e3 looks like an operational state

• We use Bayesian statistics to estimate the sick probability of a suspect -- our ranking metric
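
The ranking idea on this slide can be sketched in a few lines. This is a minimal sketch of one way to set up such a Bayesian estimate, not necessarily the exact estimator used in PeerPressure. It assumes that exactly one of the t suspects is sick (prior 1/t), that a sick entry is equally likely to hold any of the c distinct values observed for it, and that the healthy likelihood of my value is the smoothed peer frequency (M + 1)/(N + c), where N is the number of peers and M the number of peers sharing my value. All function and variable names are illustrative.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def sick_probability(mine: Hashable, peers: Sequence[Hashable], t: int) -> float:
    """Posterior probability that a suspect entry is sick (sketch, assumptions above)."""
    n = len(peers)                             # N: number of peer samples
    m = sum(1 for v in peers if v == mine)     # M: peers that share my value
    c = len(set(peers) | {mine})               # c: distinct values observed (assumed cardinality)
    p_sick = 1.0 / t                           # prior: one of the t suspects is the root cause
    like_sick = 1.0 / c                        # a sick entry may hold any value with equal chance
    like_healthy = (m + 1) / (n + c)           # smoothed frequency of my value among peers
    evidence = like_sick * p_sick + like_healthy * (1.0 - p_sick)
    return like_sick * p_sick / evidence


def rank_suspects(
    table: Dict[str, Tuple[Hashable, List[Hashable]]]
) -> List[Tuple[str, float]]:
    """Sort suspects from most to least likely to be sick."""
    t = len(table)
    scored = [(name, sick_probability(mine, peers, t))
              for name, (mine, peers) in table.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Values from the example table: (mine, [P1, P2, P3, P4]).
example = {
    "e1": (0, [1, 1, 1, 1]),
    "e2": ("on", ["on", "on", "on", "off"]),
    "e3": (57, [4, 0, 100, 34]),
}
for name, prob in rank_suspects(example):
    print(f"{name}: {prob:.2f}")
```

Under these assumptions the first entry ranks highest, the third entry is penalized less than its unique value alone would suggest because its many distinct peer values make it look like operational state, and the second entry ranks lowest.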

Page 18:

System Overview

[Figure: system diagram. Labels: Run the faulty app; AppTracer; Registry Entry Suspects; PeerPressure (Canonicalizer, Search & Fetch, Statistical Analyzer); Database; Peer-to-Peer Troubleshooting Community; Troubleshooting Result]

Registry Entry Suspects

Entry                     Data
HKLM\System\Setup\...     0
HKLM\Software\Msft\...    On
HKCU\%\Software\...       null

Troubleshooting Result

Entry                     Prob.
HKLM\System\Setup\...     0.2
HKLM\Software\Msft\...    0.6
HKCU\%\Software\...       0.003

Page 19:

Evaluation Data Set

• 87 live Windows XP registry snapshots (in the database)
  – Half of these snapshots are from three diverse organizations within Microsoft: Operations and Technology Group (OTG) Helpdesk in Colorado, MSR-Asia, and MSR-Redmond.
  – The other half are from machines across Microsoft that were reported to have potential Registry problems

• 20 real-world troubleshooting cases with known root-causes

Page 20:

Response Time

• # of suspects: 8 to 26,308, with a median of 1171
• 45 seconds on average for SQL server hosted on a 2.4 GHz CPU workstation with 1 GB RAM
• Sequential database queries dominate

[Figure: response time in seconds (y-axis, 0.00 to 250.00) versus # of suspects (x-axis, 8 to 5483)]

Page 21:

Troubleshooting Effectiveness

• Metric: root cause ranking

• Results:
  – Rank = 1 for 12 cases
  – Rank = 2 for 3 cases
  – Rank = 3, 9, 12, 16 for 4 cases, respectively
  – cannot solve one case
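
The results above can be summarized as "root cause within the top k" fractions. The sketch below only restates the slide's numbers; the helper name `solved_within` is illustrative, and the unsolved case is represented as `None`.

```python
from typing import List, Optional

# Root-cause rank per case, as listed on the slide; None marks the unsolved case.
ranks: List[Optional[int]] = [1] * 12 + [2] * 3 + [3, 9, 12, 16] + [None]


def solved_within(ranks: List[Optional[int]], k: int) -> float:
    """Fraction of cases whose root cause is ranked in the top k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


for k in (1, 3, 16):
    print(f"top-{k}: {solved_within(ranks, k):.0%}")
```

The counts add up to the 20 troubleshooting cases: 12 of 20 have the root cause ranked first, 16 of 20 within the top three, and 19 of 20 are solved at all.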

Page 22:

Concluding Remarks

• Automatic misconfiguration diagnosis is possible
  – Use statistics from the mass to automate manual identification of the healthy
  – Initial results promising