31
Method-based Problem Diagnosis Presented by: Paul Offord FBCS CITP Development Director, Advance7 Chair of the itSMF Problem Management SIG

Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Method-based

Problem DiagnosisProblem Diagnosis

Presented by:

Paul Offord FBCS CITP

Development Director, Advance7

Chair of the itSMF Problem Management SIG

Page 2: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Agenda

• What’s the issue?

• What is everyone else doing?

• Does ITIL or COBIT help?

• RPR – an IT-specific diagnosis method• RPR – an IT-specific diagnosis method

• Case studies

• Other methods

24-Apr-09 © Advance Seven Limited. All rights reserved. 2

Page 3: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Routing of Problems

24-Apr-09 © Advance Seven Limited. All rights reserved. 3

Page 4: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Grey Problems

80-90%

Grey problems:

• Bounce between technical

support teams

• Productivity hit + Loss of sales +

High IT workload = High cost

24-Apr-09 © Advance Seven Limited. All rights reserved. 4

High IT workload = High cost

• Lack of method = Slow progress

Page 5: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Handling Grey Problems

?

24-Apr-09 © Advance Seven Limited. All rights reserved. 5

?

Page 6: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

IT Problem Solving

Symptom

KnowledgeMonitorsLogs

Other info

Knowledge

Experience

KEDB

Documentation

Knowledgebase

ApplySkill

ToAccess

Fix

Pass

24-Apr-09 © Advance Seven Limited. All rights reserved. 6

Other info Knowledgebase

Fix Unknown

Causing Technology Unknown

Pass

OwnershipTo Causing

Technology

Owner

Try

Again

Page 7: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Typical Activity

• Brainstorm -> Limited use for unusual problems

• Theorise then Make a Change -> Risky

• Perform a Health Check -> Slow & unreliable• Perform a Health Check -> Slow & unreliable

• Theorise then Upgrade -> Slow & costly

• Put on the “too difficult to deal with” pile

24-Apr-09 © Advance Seven Limited. All rights reserved. 7

Page 8: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

ITIL Solution to Recurring Problems

24-Apr-09 © Advance Seven Limited. All rights reserved. 8

“The actual solving of problems is likely to be undertaken

by one or more technical support groups and/or suppliers

or support contractors under the coordination of the

Problem Manager.”

Page 9: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

“ITIL” Reality

“Prove

not the

“Prove

not the

“Prove

not the

“Prove

not the

Worst Case => No coordination, left to bounce between teams

Best Case

24-Apr-09 © Advance Seven Limited. All rights reserved. 9

e you are

e cause”

e you are

e cause”

e you are

e cause”

e you are

e cause”

Response => Perform a healthcheck

Result => Everything is OK!

Page 10: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Changing Tack

Root Cause

Symptom Fix

Workaround

Live with it

24-Apr-09 © Advance Seven Limited. All rights reserved. 10

Symptom Fix

When to switch:

“We’re just going to try one more thing”

“We made a change and it’s improved a bit”

Page 11: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Bridging The Gap

IT Operations

Service Technology

Procedural

Skills

Technical

Skills

Gap

24-Apr-09 © Advance Seven Limited. All rights reserved. 11

Incident Mgmt

Problem Mgmt

CCR Mgmt

Etc.

Desktop Support

Server Support

Network Support

Etc.

RPR

Page 12: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

RPR – What to Do & How to Do It

Supporting TechniquesCore Process

• Discover- Gather & review existing information- Reach an agreed understanding

• Investigate- Create & execute a diagnostic plan- Analyse & iterate if necessary - Identify Root Cause

Fix

24-Apr-09 © Advance Seven Limited. All rights reserved. 12

Enter from

PM Process

Gain detailed & accurate

understanding

of problem symptoms

Agreed

understanding?

Exit to

PM Process

Choose

one symptom

to investigate

Share, gather,

explain & sort

Agree diagnostic

objective and plan

capture of definitive data

Gain accurate

understanding of the

symptom environment

Execute the

diagnostic capture plan

Analyse the

captured diagnostics

No

Work with the owning

Support Team to determine

the fix

Implement the fix

and re-activate the

diagnostic capture plan

Translate diagnostic data

and present to the

Support Team owning the

RC technology

No

NoRoot Cause

identified?Yes Fixed?

New

Root Cause

?

No

YesYes

Adequate

diagnostics

?

Yes

No

Analyse

captured data

Review the

captured diagnostics

Is Quality

Acceptable?

Yes

No

No

• Fix- Translate diagnostic data- Determine & implement fix- Confirm Root Cause addressed

Page 13: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

RPR Core Process

24-Apr-09 © Advance Seven Limited. All rights reserved. 13

Page 14: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

ITIL Problem Mgmt Process Alignment

Prioritization

CMS

Investigation & Diagnosis

Change Needed?

Resolution

Change

Management

24-Apr-09 © Advance Seven Limited. All rights reserved. 14

Workaround?

Create Known

Error Record

Known

Error

Database

Closure

Major

Problem?Prioritization

End

RPR as a

sub-process

Page 15: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

RPR Features & Benefits

• Process-based method

- Controllable, repeatable & scalable

• Very reliable (99.6%)

- Reduced IT support costs Reduces- Reduced IT support costs

• Fast – cuts fix time by up to 97%

- Reduced downtime cost

• Evidence-based

- Reduces wasted cap-ex

24-Apr-09 © Advance Seven Limited. All rights reserved. 15

Support Staff

Frazzle

Page 16: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

RPR Deployment

24-Apr-09 © Advance Seven Limited. All rights reserved. 16

RPR Supporting Techniques

Fast & Effective Communication

End to End Approach

Page 17: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Case Study 1 - Leisure System

24-Apr-09 © Advance Seven Limited. All rights reserved. 17

“It’s slow”

Page 18: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Finger Pointing

• Must be a network problem because

other branch applications are slow

• Must be a database problem because• Must be a database problem because

we’ve had other similar problems

• It can’t be a database problem because

we’ve profiled all Stored Procedures

24-Apr-09 © Advance Seven Limited. All rights reserved. 18

Page 19: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Step 1 – Understand the problem

Click on Appointments in the menu bar it

intermittently takes 10+ seconds to respond

24-Apr-09 © Advance Seven Limited. All rights reserved. 19

intermittently takes 10+ seconds to respond

It should take less than 5 seconds

Page 20: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Step 3 – Understand Environment

24-Apr-09 © Advance Seven Limited. All rights reserved. 20

Not used by Appointments transaction

Page 21: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Step 6 – Diag. Objective & Capture Plan

24-Apr-09 © Advance Seven Limited. All rights reserved. 21

Prove that the cause is or is not one of these

Page 22: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Step 10 – Analyse the diagnostics

User PC 0.873

WAN 0.085

Web Server

- Central DB 10.779

Time Contribution (s)

Two database stored procedure calls:

dbo.usp_gr_sel_StaffListWithAppointmentByDate

dbo.usp_gr_sel_AppointmentListByDate

24-Apr-09 © Advance Seven Limited. All rights reserved. 22

- Central DB 10.779

- Branch DB 2.476

- Other 0.000

- Application 0.787

14.042

15.000

Shouldn’t be there

Page 23: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Case Study 2 – Banking System

Lost productivity cost

Number of users impacted per incident 20

Number of incidents per day 20

Time lost 10 mins (0.17 hours)

Total hours lost per day 66.67 hours per day

Estimated loaded cost of admin staff £10.23 per hour

Total cost per day £682.03

IT support recovery cost

Without RPR:

• Problem Duration => 12 months

• Total Cost => £272,219

24-Apr-09 © Advance Seven Limited. All rights reserved. 23

IT support recovery cost

Number of resets per day 20 resets per day

Time to reset 5 mins

Total workload per day 100 mins (1.67 hours)

Estimated loaded cost of IT support staff £39 per hour

Total cost per day £65.13

Grand total per day £747.16

With RPR:

• RC determined in 9.2 days

• Approx IT support cost => £800

• Potential savings => £263,939*

* Based on switching to RPR after 10 days of investigation

Page 24: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Case Study 3 – Banking System

• RPR Proof of Concept project

• Corporate client system

• Slow production of reports

• Very high profile => 10,000 users impacted

• Problem duration => 3 months

• 24 man hours / day on conference calls alone

• 360 man days effort

- Our client says this is conservative

• Server upgrade proposed24-Apr-09 © Advance Seven Limited. All rights reserved. 24

Page 25: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Case Study 3 – Banking System

• Used RPR to determine RC in 17.5 man days

• Server upgrade would not solve it• Server upgrade would not solve it

• Direct evidence provided of IIS issue

• Being pursued with Microsoft

24-Apr-09 © Advance Seven Limited. All rights reserved. 25

Page 26: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Error

26%

Recurring Problem Types

( Most Recent 100 Problems )

24-Apr-09 © Advance Seven Limited. All rights reserved. 26

Incorrect Output

2%Other

1%

Performance

71%

Page 27: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Bug Fix

20%Redesign

5%

Upgrade

10%

Recurring Problem Resolution

( Most Recent 100 Problems - All Problem Type )

24-Apr-09 © Advance Seven Limited. All rights reserved. 27

Config Change

42%

Other

5%

Programming

18%

Page 28: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

12

14

16

18

20

Nu

mb

er

of

Pro

ble

ms

Distribution of % IT Workload Saved for Recurring Problem RCI

( Most Recent 100 Problems - Average is 69% )

Projected

Minimum

47%

Projected

Maximum

91%

24-Apr-09 © Advance Seven Limited. All rights reserved. 28

0

2

4

6

8

10

0 - 10% 10 - 20% 20 - 30% 30 - 40% 40 - 50% 50 - 60% 60 - 70% 70 - 80% 80 - 90% 90 - 100%

Nu

mb

er

of

Pro

ble

ms

Percentage of Man Days Saved

Page 29: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

30

35

40

45

50

Nu

mb

er

of

Pro

ble

ms

Distribution of % Problem Duration Saved for Recurring Problem RCI

( Most Recent 100 Problems - Average is 83% )

Projected

Minimum

64%

24-Apr-09 © Advance Seven Limited. All rights reserved. 29

0

5

10

15

20

25

0 - 10% 10 - 20% 20 - 30% 30 - 40% 40 - 50% 50 - 60% 60 - 70% 70 - 80% 80 - 90% 90 - 100%

Nu

mb

er

of

Pro

ble

ms

Percentage of Man Days Saved

Page 30: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Other MethodsRCA & Pattern-based Methods

- Structured analysis of past problems+Good use of historic data

- Includes non-technical RCs+ Environmental+ Organisational+ Commercial / contractual

RPR

- Completely evidence based+ Very reliable

- Includes what to do & how+ Fast

- Uses IT standard techniques

24-Apr-09 © Advance Seven Limited. All rights reserved. 30

+ Commercial / contractual+ Other

- Uses IT standard techniques+ Easily integrated into support

- Process-driven+ Better management and control

Better suited to:

historic / forensic analysis,non-technical issues andprevention of similar problems

Better suited to

determining the Root Cause ofongoing & recurring problems

Page 31: Method-based Problem Diagnosis - British Computer Society · 4/22/2009  · Case Study 2 –Banking System Lost productivity cost Number of users impacted per incident 20 Number of

Thank You

Questions?

24-Apr-09 © Advance Seven Limited. All rights reserved. 31

Questions?