Analysing Incidents with PageRank
Running head: USING PAGERANK TO ANALYSE INCIDENT DATA
Using PageRank to Analyse Incident Data.
by
Patrick Collins
A Research Project submitted in partial fulfilment of the requirements for the degree
of Master of Science in Software and Information Systems.
NUI Galway
Department of Information Technology
August, 2008
Head of Department: Prof. Gerard Lyons
Project Supervisor: Owen Molloy
Final Project/Thesis Submission
MSc in Software & Information Systems
Department of Information Technology
National University of Ireland, Galway
Student Name: Patrick Collins
Telephone: +353874189477
Email: [email protected]
Date of Submission: 29 August 2008.
Title of Submission: Using PageRank to Analyse Incident Data
Supervisor Name: Owen Molloy
Certification of Authorship:
I hereby certify that I am the author of this document and that any assistance I
received in its preparation is fully acknowledged and disclosed in the document. I have
also cited all sources from which I obtained data, ideas or words that are copied directly
or paraphrased in the document. Sources are properly credited according to accepted
standards for professional publications. I also certify that this paper was prepared by me
for the purpose of partial fulfilment of requirements for the Degree Programme.
Signed: Date: 29 August 2008
“The important thing in science is not so much to obtain new facts as to
discover new ways of thinking about them.”
Sir William Bragg
British physicist (1862 – 1942)
Acknowledgements
Eugene Maxwell & John Atherly of the IT team,
American Power Conversion,
for their assistance in finding source data.
Owen Molloy,
Thesis Supervisor,
for guidance along the way.
Noel Fegan, Ray Fallon, Paul Bohan and Martina Kiely,
for reviews and comments.
All past and present members of the Software Development team,
American Power Conversion, Galway,
for inspiration, assistance and camaraderie.
Martina Kiely,
for motivation and encouragement.
Table of Contents
Acknowledgements........................................................................................................4
1. Abstract....................................................................................................................10
2. Introduction..............................................................................................................11
2.1. Research Objectives.........................................................................................11
3. Review of Literature.................................................................................................12
3.1. Information Technology Infrastructure Library...............................................12
3.2. Incident Management.......................................................................................14
3.3. Problem Management.......................................................................................15
3.4. Problem Management in Practice.....................................................................17
3.4.1. Reactive Problem Management................................................................18
3.4.2. Proactive Problem Management...............................................................19
3.5. Incident Analysis..............................................................................................22
3.5.1. Model Based Approaches.........................................................................22
3.5.2. Rule Based Approaches............................................................................24
3.5.3. Codebook Approaches..............................................................................25
3.5.4. Case Based Approaches............................................................................26
3.5.5. Other Approaches.....................................................................................30
4. Methodology............................................................................................................32
4.1. Proposed approach............................................................................................32
4.2. Introduction to PageRank.................................................................................33
4.3. IncidentRank.....................................................................................................34
5. Methods....................................................................................................................37
5.1. Data...................................................................................................................37
5.2. Experiment 1: Pareto Analysis.........................................................................38
5.2.1. Data...........................................................................................................38
5.2.2. Apparatus..................................................................................................38
5.2.3. Procedure..................................................................................................38
5.2.4. Results.......................................................................................................39
5.2.4.1. Pareto of Category.............................................................................39
5.2.4.2. Pareto of Type...................................................................................40
5.2.4.3. Pareto of Item....................................................................................41
5.3. Experiment 2: PageRank Analysis...................................................................42
5.3.1. Data...........................................................................................................42
5.3.2. Apparatus..................................................................................................42
5.3.2.1. Database............................................................................................42
5.3.2.2. Tagging.............................................................................................43
5.3.2.3. PageRank Algorithm.........................................................................45
5.3.3. Procedure..................................................................................................48
5.3.3.1. Associating Tags with Incidents.......................................................48
5.3.3.2. Running the PageRank Algorithm....................................................49
5.3.3.3. PageRank Reports.............................................................................49
6. Results......................................................................................................................50
6.1. Experiment 1: Pareto Results...........................................................................50
6.2. Experiment 2: PageRank Results.....................................................................50
6.3. Comparison.......................................................................................................53
6.3.1. On PageRank and Tags.............................................................................53
6.3.1.1. Comparison of Incident Count to PageRank.....................................53
6.3.1.2. Comparison of Incident Count to Resolution Time..........................54
6.3.1.3. Comparison of PageRank to Resolution Time..................................55
6.3.1.4. Relationships between Tags..............................................................56
6.3.2. On PageRank and Incidents......................................................................58
6.3.2.1. Comparison of PageRank to Resolution Time..................................58
6.3.2.2. Comparison of PageRank to Urgency...............................................59
6.3.2.3. Comparison of PageRank to Priority................................................60
6.3.2.4. Comparison of PageRank to Impact.................................................61
6.3.2.5. Relationships between Priority, Urgency, Impact and Resolution
Time...............................................................................................................62
7. Discussion................................................................................................................66
7.1. Realisation of Project Aims..............................................................................66
7.2. Big Science.......................................................................................................66
7.3. Diagnosing the IT System................................................................................70
7.4. On Categorisation.............................................................................................71
7.5. The Wisdom of Crowds....................................................................................74
7.6. Analysis of Tags...............................................................................................74
7.7. Further Research...............................................................................................75
8. References................................................................................................................78
Appendix: Source Data Definition...............................................................................82
List of Tables
Table 1: Top 80% of Incidents by Category................................................................42
Table 2: Top 80% of Oracle Incidents by Type...........................................................43
Table 3: Top 80% of Oracle Operations by Item.........................................................44
Table 4: Top Ten Tags Ordered by PageRank.............................................................52
Table 5: Known Issue Tags..........................................................................................54
Table 6: Relationships between Priority, Urgency, Impact and Resolution Time.......65
Table 7: Deviation of Resolution Times by Priority....................................................66
List of Illustrations
Illustration 1: Pareto by Category................................................................................39
Illustration 2: Pareto of Oracle Types..........................................................................40
Illustration 3: Pareto of Oracle Operations Items.........................................................41
Illustration 4: Database Design....................................................................................43
Illustration 5: Tagging User Interface..........................................................................44
Illustration 6: Incident Tag Graph................................................................................45
Illustration 7: Tags Ordered by PageRank...................................................................51
Illustration 8: Comparison of Incident Count to PageRank.........................................54
Illustration 9: Correlation of Incident Count to Resolution Time................................55
Illustration 10: Correlation of PageRank to Resolution Time......................................56
Illustration 11: Example Venn of Tag Relationships...................................................57
Illustration 12: Comparison of PageRank to Resolution Time for Incidents...............59
Illustration 13: Comparison of PageRank to Urgency for Incidents............................60
Illustration 14: Comparison of PageRank to Priority for Incidents..............................61
Illustration 15: Comparison of PageRank to Impact for Incidents...............................62
1. Abstract
In this paper we discuss approaches to Incident and Problem Management
within the context of IT Service Management and its de facto standard, the
Information Technology Infrastructure Library (ITIL). We show how the Problem
Management process attempts to diagnose problem root causes by applying various
analysis techniques to historical incident data.
We propose a new categorisation mechanism. We break the data free from its
hierarchical categorisation scheme through the use of a free-form tagging system. By
allowing all system users to categorise incidents using their own terms, we show that
while individuals may differ, the aggregate metadata produced for each incident
stabilises.
Further, by applying PageRank analysis to the relationships between tags and
incidents, we hope to show useful and interesting correlations. While these may or
may not be indicative of a causal relationship, they are nonetheless new facts about
the system under scrutiny.
We conclude by showing that the approach has some merit, assuming a certain
set of minimum system requirements is met. If these requirements can be met, then
this approach can become another tool in the system administrator's arsenal of
system analysis techniques.
2. Introduction
In recent years, the IT Infrastructure Library (ITIL) has become the de facto
standard for IT service management within organisations of all sizes. It recognises
that businesses are increasingly dependent on information systems and software to
meet strategic and tactical business and end-user needs.
Within the many processes defined by ITIL, the Incident and Problem
Management processes are significant. They attempt to provide a framework for
businesses to maintain a cost-effective and high-quality IT service, benefiting both
internal departments and external end users and customers.
The Incident Management process attempts to return malfunctioning systems
to nominal operating parameters by collecting and analysing help desk incidents, or
trouble tickets. Problem Management provides analysis and trending to highlight
areas of recurrent incidents.
The effectiveness of Problem Management is directly influenced by how the
individual incidents are classified. By using a new classification mechanism, we hope
to show that the insight offered by Problem Management can also be influenced.
2.1. Research Objectives
Within the framework defined by ITIL, this thesis aims to:
● Show how incident classification influences problem management.
● Propose a new classification mechanism for incidents.
● Explore the analysis opportunities for the new classification
mechanism.
3. Review of Literature
3.1. Information Technology Infrastructure Library
The Information Technology Infrastructure Library (ITIL) was published by the
Office of Government Commerce in 2000. It is variously described as a set of
common, or best, practices for IT system operation. It introduces itself as follows
(OGC, 2000):
The ethos behind the development of the IT Infrastructure Library (ITIL)
is the recognition that organisations are increasingly dependent upon IT to
satisfy their corporate aims and meet their business needs. This growing
dependency leads to growing needs for quality IT services – quality that is
matched to business needs and user requirements as they emerge.
This is true no matter what type or size of organisation, be it national
government, a multinational conglomerate, a decentralised office with
either a local or centralised IT provision, an outsourced service provider,
or a single office environment with one person providing IT support. In
each case there is the requirement to provide an economical service that is
reliable, consistent and of the highest quality.
IT Service Management is concerned with delivering and supporting IT
services that are appropriate to the business requirements of the
organisation. ITIL provides a comprehensive, consistent and coherent set
of best practices for IT Service Management processes, promoting a
quality approach to achieving business effectiveness and efficiency in the
use of information systems. ITIL processes are intended to be
implemented so that they underpin but do not dictate the business
processes of an organisation. IT service providers will be striving to
improve the quality of the service, but at the same time they will be trying
to reduce the costs or, at a minimum, maintain costs at the current level.
ITIL goes on to define processes for the majority of IT service provision.
These extend to:
● Business Continuity Management
● Partnerships and Outsourcing
● Surviving Change
● Transformation of Business Practice through Radical Change
● Capacity Management
● Financial Management for IT Services
● Availability Management
● Service Level Management
● IT Service Continuity Management
● Customer Relationship Management
● Service Desk
● Incident Management
● Problem Management
● Configuration Management
● Change Management
● Release Management
● Network Service Management
● Operations Management
● Management of Local Processors
● Computer Installation and Acceptance
● Systems Management
For our purposes, we will concentrate on Incident and Problem management,
and to a certain extent, the Service Desk where it interfaces with these processes.
3.2. Incident Management
Any business which uses IT must learn to deal with the day-to-day minor
disruptions which occur when running an IT system. Hard drives fill, network links
fail, operators make configuration changes which sometimes misconfigure a system,
software (no matter how well tested) is assumed to contain defects, and users fail to
use systems correctly through poor usability or insufficient training.
Many businesses develop an Incident Management process for dealing with
these disruptions. The goal of the service desk is to resolve as many of these
incidents as possible before the quality of the IT service being delivered suffers.
ITIL defines an Incident management process and states its goal as being
(OGC, 2000):
The primary goal of Incident Management process is to restore normal
service operation as quickly as possible and minimise the adverse impact
on business operations, thus ensuring that the best possible levels of
service quality and availability are maintained. 'Normal service operation'
is defined here as service operation within Service Level Agreement
(SLA) limits.
ITIL recommends that as much detail as possible be maintained about each
incident, such as reporter, time, systems affected, relationships with similar incidents,
etc. This information is annotated over time as the incident is resolved, before finally
being archived in a knowledge base. This can then become the basis for faster
resolution of the same or similar incidents in future.
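To make the recommendation concrete, the following Python sketch models an incident record carrying the kind of detail ITIL recommends. The field names and structure here are our own illustration, not a schema taken from ITIL or from the source data used later in this thesis:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class Incident:
    """Illustrative incident record capturing the detail ITIL recommends."""
    incident_id: str
    reporter: str
    reported_at: datetime
    affected_systems: List[str]
    related_incidents: List[str] = field(default_factory=list)
    resolution_notes: List[str] = field(default_factory=list)
    resolved: bool = False

    def annotate(self, note: str) -> None:
        # Detail is annotated over time as the incident is worked on.
        self.resolution_notes.append(note)

inc = Incident("INC-0001", "pcollins", datetime(2008, 8, 1), ["oracle-db"])
inc.annotate("Restarted database listener; monitoring for recurrence.")
```

Once resolved, records like this can be archived in a knowledge base and searched when the same or similar incidents recur.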
3.3. Problem Management
While Incident Management defines a process for quickly returning a system
to normal operational levels, problem management is defined by ITIL as (OGC,
2000):
The goal of Problem Management is to minimise the adverse impact of
Incidents and Problems on the business that are caused by errors within
the IT Infrastructure, and to prevent recurrence of Incidents related to
these errors. In order to achieve this goal, Problem Management seeks to
get to the root cause of Incidents and then initiate actions to improve or
correct the situation.
The Problem Management process has both reactive and proactive
aspects. The reactive aspect is concerned with solving Problems in
response to one or more Incidents. Proactive Problem Management is
concerned with identifying and solving Problems and Known Errors
before Incidents occur in the first place.
ITIL takes a very IT-centric view of both Incident and Problem management.
Given the above definitions for both Incident and Problem management, one may
conclude that the solution to all IT system problems is a change in IT infrastructure.
This is not necessarily the case, as there are many reasons why a business may not
be in a position to make an IT infrastructure change: the system may have been
supplied by a third-party vendor, or there may be time constraints, cost constraints,
or a lack of skills within the organisation. In these cases, workarounds, process
changes, publishing of additional documentation and user training may be easier or
less costly to implement as a solution to IT service incidents.
Kajko-Mattsson (2002) discusses the role of problem management from a
more holistic perspective. She breaks problem management into two areas, Software
Problem Management and Continuous Software Management Process Improvement,
stating:
Software problem management process is the dominating process within
corrective maintenance. Its main role is to attend to the reported software
problems in software products. This is mainly achieved by collecting
information on software problems, identifying their underlying defects
and removing these defects. Within the scope of this role, problem
management process should additionally provide data relevant for
assessing product quality and reliability.
The other role of problem management is to provide a basis for
continuous process analysis, process improvement and defect prevention.
This can be achieved by studying the defects, and by identifying and
analysing the process states during which these defects were injected.
Identification and analysis of the process steps may then aid in diagnosing
the root causes of these defects. This should in turn result in appropriate
process improvement actions to prevent the defects from recurring.
Kajko-Mattsson goes further than ITIL to suggest tracking and improvement
of the problem management processes themselves. This supports the findings of
Oppenheimer & Patterson (2002), who show that operator error is a leading cause of
system failure, stating:
From a study of 62 user-visible failures in three large-scale Internet
services, we observe that front-ends are a more significant problem than is
commonly believed, that operator error and network problems are leading
contributors to user-visible failures, and that more thoroughly exposing
and handling component failures would reduce failure rates in at least one
service.
Kajko-Mattsson's approach also challenges ITIL's assumption that all IT
service issues can be solved via a change to the IT infrastructure. In the majority of
cases it holds true that IT changes solve IT issues. In a minority of cases, for
example where a business is running a software system which it cannot change, IT
issues could be solved by providing better documentation or giving the user
community additional training on system use.
3.4. Problem Management in Practice
Problem management, under the ITIL definition, takes the records of Incidents
submitted through the help or service desk, and attempts to apply some analysis on
them to uncover underlying root causes. These root causes are then addressed
through planned changes to the structure of the underlying IT infrastructure, or
through changes to the operational processes.
In order to facilitate this, ITIL requires a certain Incident management
protocol to allow for easier incident analysis. ITIL recommends (OGC, 2000), “an
effective automated registration of Incidents, with an effective classification, is
fundamental for the success of Problem Management”. ITIL also recognises some
risks to problem management, such as:
● Absence of a good Incident control process, and thus the absence
of detailed historical data on Incidents (necessary for the correct
identification of Problems).
● Failure to link Incident records with Problem/Error records means a failure to
gain many of the potential benefits. This is a key feature in
moving from reactive support to a more planned and proactive
support approach.
● Failure to set aside time to build and maintain the knowledge base
will restrict the delivery of benefits.
● An inability to determine accurately the business impact of
Incidents and Problems. Consequently the business-critical
Incidents and Problems are not given the correct priority.
3.4.1. Reactive Problem Management
Reactive problem management is largely taken care of by the Incident
Management process. The goal here is to (Microsoft, 2007):
● Identify and take ownership of problems affecting infrastructure
and service.
● Take steps to reduce the impact of incidents and problems.
● Identify the root cause of problems and initiate activity aimed at
establishing workarounds or permanent solutions to these
identified problems.
3.4.2. Proactive Problem Management
While reactive problem management, or incident management, attempts to
reduce the impact of system failures as quickly as possible, proactive problem
management takes a longer view. Using recorded problem and incident data, trend
analysis can be performed to predict future problems and enable prioritisation of
problem management activities (Microsoft, 2007).
When businesses invest in IT systems, they would like to be assured that the
new system will be operational when needed. The availability of a system is a
measure used to give an indication of the probability that a system will be available at
any given moment. This is usually a major consideration for IT system purchasers.
Gokhale, Crigler, Farr & Wallace (2005) discuss the relationship between
availability and reliability saying:
The reliability of a system may be defined as the probability of failure-free
operation for a specified period of time in a specified environment.
Reliability is a key metric for many life-critical systems that are required
to operate without failure for a given period of time. Many systems,
however, are capable of tolerating some failures and continue to operate
despite failures, perhaps in a degraded mode. Also, even though a failure
causes total loss of service, the underlying fault may be repaired in order
to restore the system back into operation. For such repairable systems as
well as for systems which are capable of operating in a degraded mode,
availability is a more relevant metric than reliability. The availability of a
system is defined as the ability of the system to be in a state to perform a
required function at a given instant of time or any instant of time within a
given time interval. A crucial difference between reliability and
availability is that reliability refers to failure-free operation during an
entire interval, while availability refers to failure-free operation at a given
instant of time. This time is usually the instant when the system is first
accessed to provide a required function or service.
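The steady-state availability described above is commonly computed from mean time between failures (MTBF) and mean time to repair (MTTR). A minimal sketch follows; the figures used are invented purely for illustration:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the long-run fraction of time a
    repairable system spends in an operational state."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A system failing every 500 hours on average, and repaired in 2 hours
# on average, is available roughly 99.6% of the time.
print(round(availability(500.0, 2.0), 4))  # 0.996
```

Note that this is the availability of a repairable system at steady state; it says nothing about reliability, i.e. the probability of surviving a whole interval without failure.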
Mockus (2006) discusses how the availability of a system can be estimated
given the data recorded by the help desk, stating:
Our primary contribution is to propose a method to assess empirically
software availability and reliability based on information from operational
customer support and inventory systems. In addition, the novelty of our
approach has several aspects. The precise information about the system
population, configuration, and age is linked to the outage information in
order to produce more accurate estimates of availability. The
methodology of data collection to estimate availability of software is
proposed. The experiences and findings applying the approach to a large
enterprise communication system are discussed. We ask several practical
and theoretical questions and evaluate them based on the obtained results.
In particular, we compare samples to obtain approximations of reliability
with more accurate, but harder to obtain estimates. We also evaluate if
the common reliability measure of mean time between failures (MTBF) is
appropriate for varying system run times by investigating the hazard
function.
Mockus' approach could be considered appropriate for a business analysing an
IT system which is to be deployed but has not been developed in house. It is
arguable that a Model-Based Approach (discussed in the next sections) would give
the best indication of a system's availability, notwithstanding the difficulties this
approach imposes.
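In the spirit of Mockus' empirical approach, failure timestamps recorded by a help desk can yield a simple MTBF estimate. The sketch below is only a plain average of inter-failure gaps, not Mockus' full methodology (which also accounts for system population, configuration and age), and the outage dates are invented:

```python
from datetime import datetime

def mean_time_between_failures(failure_times):
    """Estimate MTBF in hours from a list of recorded failure timestamps."""
    ts = sorted(failure_times)
    if len(ts) < 2:
        raise ValueError("need at least two failures to estimate MTBF")
    # Average the gaps between consecutive failures, converted to hours.
    gaps = [(b - a).total_seconds() / 3600.0 for a, b in zip(ts, ts[1:])]
    return sum(gaps) / len(gaps)

outages = [datetime(2008, 1, 1), datetime(2008, 1, 11), datetime(2008, 1, 31)]
print(mean_time_between_failures(outages))  # 360.0
```

As Mockus notes, a single MTBF figure assumes a constant hazard rate; examining the hazard function over varying run times tests whether that assumption holds.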
Regardless of how availability is estimated, a real value will, given time, be
arrived at by measuring the system as it is used. Businesses deploy Incident
Management and Problem Management processes to ensure that the availability of a
system is as high as possible and that disruption to customers and business processes
is minimised. Jantti & Eerola (2006) note:
The primary goal of problem management is to minimise the impact of
problems on the business and to identify the root cause of problems.
According to a recent IT service management survey, the problem
management process is one of the main development targets for many
organisations in the near future. Many organisations have realised the
value of problem management in preventing failures and problems.
However, IT organisations do not have a clear understanding of the basic
concepts of problem management process and the relationships between
the concepts. This is mainly due to the complex IT service management
standards that cause difficulties in the implementation of problem
management. A well-designed problem management model helps
organisations to prevent problems (proactive problem management), to
resolve reported problems effectively (reactive problem management),
and also takes into consideration cost, effort and quality aspects.
3.5. Incident Analysis
Given a set of IT service incidents collected over time by the Service Desk, or
other means, the problem management process mandates analysing these incidents in
an effort to uncover the root causes, and implement permanent solutions to them.
Hanemann, Sailer and Schmitz (2004) discuss various incident correlation approaches.
They classify these into four areas: Model-Based Reasoning (MBR), Rule-Based
Reasoning (RBR), a Codebook approach, and Case-Based Reasoning (CBR).
3.5.1. Model Based Approaches
Hanemann, Sailer and Schmitz (2004) note of model-based approaches to
incident correlation and problem discovery:
Model-based reasoning (MBR) represents a system by modelling each of
its components. A model can either represent a physical entity or a logical
entity (e.g. LAN, WAN, domain, service, business process, etc.). The
models for physical entities are called functional model, while the models
for all logical entities are called logical model. A description of each
model contains three categories of information: attributes, relations to
other models, and behaviour. The event correlation is a result of the
collaboration among models.
Gokhale, Crigler, Farr & Wallace (2005) show how a model-based approach is
used to produce a closed-form expression of system availability, stating:
We present a system availability model which considers failure severities
in conjunction with system structure. Based on the model, we obtain a
closed form expression which relates system availability to the failure and
repair parameters of the components. We then describe availability
analysis of a satellite system using the model based on the data collected
during the acceptance testing of the system.
They also note that not all failure states carry the same weight with regard to
system availability, saying (Gokhale, Crigler, Farr & Wallace, 2005):
In the literature, model-based analysis has regarded all the failures of all
the components to be equivalent. The consequence of each failure on the
services provided by the system is considered to be the same. In other
words, each failure is considered to be the same level of severity. As a
result, redundancy is used to tolerate some failures and provide degraded
mode of operation and repair/restoration is used to bring the system back
into a completely operational state. In many real-life systems, however,
all failures do not always have the same impact on system services. In
fact, failures are typically classified into multiple severity levels, where
failures belonging to the highest severity level cause a complete loss of
service, while failures belonging to levels below the highest level enable
the system to operate in a degraded mode. Thus, the system is capable of
tolerating low severity failures without employing any other means such
as redundancy. This makes it necessary to consider the influence of
failure severities on system availability in conjunction with the system
structure.
An advantage of a model based approach is that a model of the system is
developed, showing its various states and state transitions. This can be used to
analyse the system on an ongoing basis as a Markov model, and allows a
deterministic approach to system management. The disadvantage of this approach is
that creating and maintaining a system model becomes exponentially more difficult
as the system grows. As such, apart from several niche applications, such as the
satellite systems discussed by Gokhale, Crigler, Farr & Wallace (2005), model based
reasoning is generally not used by system administrators.
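For a single component, the Markov view described above can be made concrete with a minimal two-state (up/down) model. The following sketch is illustrative only: the failure and repair rates are assumed values, not figures from any cited study.

```java
// Minimal two-state Markov availability sketch: a component is either Up
// or Down, failing at rate lambda and being repaired at rate mu. The
// steady-state availability of such a chain is mu / (lambda + mu).
public class AvailabilityModel {

    /** Steady-state probability of the Up state for a two-state Markov chain. */
    static double steadyStateAvailability(double lambda, double mu) {
        return mu / (lambda + mu);
    }

    public static void main(String[] args) {
        // Assumed rates: one failure per 1000 hours, 10-hour average repair.
        double lambda = 1.0 / 1000.0;   // failure rate per hour
        double mu = 1.0 / 10.0;         // repair rate per hour
        System.out.printf("Availability: %.4f%n", steadyStateAvailability(lambda, mu));
    }
}
```

For realistic systems the chain has many states (one per combination of component states), which is exactly why the modelling effort grows so quickly.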
3.5.2. Rule Based Approaches
Hanemann, Sailer and Schmitz (2004) discuss rule based approaches, stating:
Rule-based reasoning (RBR) uses a set of rules for event correlation. The
rules have the form of “conclusion, if condition”. The condition uses
received events and information about the system, while the conclusion
contains actions which can either lead to system changes or use system
parameters to choose the next rule.
An advantage of a rule based approach is the possibility of automatic system
healing: given the system's state, the state of any incidents raised, and rules developed
over time, the system can begin to take automatic recovery steps.
An example of this is noted by BMC Software (2006): when a hard disk
partition approaches a critical high-water mark for disk space, an incident is
automatically logged to the incident management system. This in turn sends a
response to the system asking it to delete any temporary or unused files, thus
automatically freeing some disk space.
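The "conclusion, if condition" rule structure, applied to the disk-space example above, might be sketched as follows. The 90% threshold, the rule name, and the cleanup flag are illustrative assumptions, not details of BMC's product.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Sketch of rule-based reasoning: each rule pairs a condition on the
// observed system state with a conclusion (a recovery action).
public class RuleEngine {

    /** Observed system state; only the fields needed for this example. */
    static class SystemState {
        double diskUsage;          // fraction of the partition in use
        boolean cleanupTriggered;  // set when the recovery action fires
        SystemState(double diskUsage) { this.diskUsage = diskUsage; }
    }

    /** A rule in "conclusion, if condition" form. */
    record Rule(String name, Predicate<SystemState> condition, Consumer<SystemState> conclusion) {}

    /** Fire the conclusion of every rule whose condition holds. */
    static void evaluate(List<Rule> rules, SystemState state) {
        for (Rule r : rules) {
            if (r.condition().test(state)) {
                r.conclusion().accept(state);
            }
        }
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(new Rule(
                "free-disk-space",
                s -> s.diskUsage > 0.90,            // condition: past the high-water mark
                s -> s.cleanupTriggered = true));   // conclusion: delete temporary files
        SystemState state = new SystemState(0.95);
        evaluate(rules, state);
        System.out.println("Cleanup triggered: " + state.cleanupTriggered);
    }
}
```

The maintenance burden discussed below follows directly from this shape: every architectural change potentially invalidates some predicate or action in the rule set.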
An approach like this allows many of the most frequent and most easily fixed
incidents to be handled automatically, freeing system administration staff to deal
with the less frequent though more involved incidents.
Rule based reasoning does have its disadvantages. As with the model based
approaches, the system rules need to be constantly kept in line with the system
architecture. As problems are resolved, patches deployed, new system features
introduced, and other day-to-day changes made to the IT system, the set of system
rules needs to be revalidated each time. This imposes significant effort on system
administrators, who must ensure that applying the system rules does not lead the
system into a misconfigured or suboptimal state.
3.5.3. Codebook Approaches
Hanemann, Sailer and Schmitz (2004) define the codebook approach as:
The codebook approach has similarities to RBR, but takes a further step
and encodes the rules into a correlation matrix.
The approach starts using a dependency graph with two kinds of nodes for
the modelling. The first kind of node are the faults (problems/root
causes) which have to be detected, while the second kind of nodes are
observable events (symptoms/incidents) which are caused by the faults or
other events. The dependencies between the nodes are denoted as directed
edges. It is possible to choose weights for the edges, e.g., a weight for the
probability that fault/event A causes event B.
After a final input graph is chosen, the graph is transformed into a
correlation matrix where the columns contain the faults and the rows
contain the events. If there is a dependency in the graph, the weight of the
corresponding edge is put into the according matrix cell.
The codebook approach has the advantage that it uses long-term
experience with graphs and coding. This experience is used to minimize
the dependency graph and to select an optimal group of events with
respect to processing time and robustness against noise. A disadvantage
of the approach could be that similar to RBR frequent changes in the
environment make it necessary to frequently edit the input graph.
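The transformation from dependency graph to correlation matrix described above can be sketched as follows. The fault names, event names, and edge weights are hypothetical examples, not data from any cited system.

```java
// Sketch of the codebook transformation: weighted edges from faults to
// the events they cause are written into a matrix whose columns are the
// faults and whose rows are the events.
public class Codebook {

    static final String[] FAULTS = {"disk-failure", "link-down"};
    static final String[] EVENTS = {"db-timeout", "page-unreachable", "backup-missed"};

    /** Build the correlation matrix from {eventIndex, faultIndex, weight} edges. */
    static double[][] buildMatrix(double[][] edges) {
        double[][] m = new double[EVENTS.length][FAULTS.length];
        for (double[] e : edges) {
            m[(int) e[0]][(int) e[1]] = e[2];   // weight = P(fault causes event)
        }
        return m;
    }

    public static void main(String[] args) {
        double[][] edges = {
            {0, 0, 0.9},  // disk-failure causes db-timeout with probability 0.9
            {2, 0, 0.5},  // disk-failure causes backup-missed
            {1, 1, 0.8},  // link-down causes page-unreachable
        };
        double[][] m = buildMatrix(edges);
        System.out.println("P(db-timeout | disk-failure) = " + m[0][0]);
    }
}
```

Decoding then consists of matching an observed vector of events against the columns of this matrix to find the most likely fault.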
3.5.4. Case Based Approaches
Finally, Hanemann, Sailer and Schmitz (2004) discuss case based reasoning,
stating:
In contrast to other approaches case-based reasoning (CBR) systems do
not use any knowledge about the system structure. The knowledge base
saves cases with their values for system parameters and successful
recovery actions for these cases. The recovery actions are not performed
by the CBR system in the first place, but in most cases by a human
operator.
Of these approaches, ITIL recommends only case based reasoning for
analysing and solving incidents. ITIL's processes are structured around Service Desk
personnel and system operators creating, maintaining, and using a knowledge base to
diagnose incidents and problems, implement workarounds, and apply permanent fixes.
The CBR approach to incident management has some advantages, in that a
body of knowledge (knowledge base) is built up over time as service incidents are
logged and corrected. While this serves the needs of Incident Management, additional
processing is required for root cause analysis, as mandated by proactive Problem
Management.
Card (1993, 1998) discusses defect causal analysis. He offers the following
advice on the classification and analysis of problems (Card, 1998):
Classifying or grouping problems helps to identify clusters in which
systematic errors are likely to be found. You should select the
classification schemes to be used when you set up the Defect Causal
Analysis (DCA) process. Moreover, the meeting itself will go faster if
you classify the problems to be analysed according to a predefined
classification scheme. Ideally, each problem should be classified by the
programmer when implementing its fix. Alternatively, the moderator may
classify the problems prior to the group meeting. Three dimensions are
especially useful for classifying problems:
● When was the defect that caused the problem inserted into the
software?
● When was the problem detected?
● What type of mistake was made or defect introduced?
The first two classification dimensions correspond to activities or phases
in the software process. The last dimension reflects the nature of the work
performed and the opportunities for making mistakes. Some commonly
used error types include interface, computational, logical, initialisation
and data structure. Depending on the project's nature, you can add other
classes such as documentation and user interface.
You can produce tables or Pareto charts to help identify problem clusters.
A Pareto chart is a bar chart that shows the count of problems by type in
decreasing order of frequency.
Leszak, Perry & Stoll (2000) show how this type of analysis can be used
during a case study of the development cycle of a software application. It is
interesting to note that in their study, Leszak, Perry & Stoll (2000) state:
An important study decision was to allow for several root causes to be
specified during analysis of each modification request (MR). The
intuition is that there may well be several factors contributing to the
occurrence of a defect. Thus, in addition to phase, we have allowed
human, project and review root causes to be specified. These four
dimensional root cause classifications give indications as to what played a
role in a defect's occurrence.
Pareto and Ishikawa analyses are also suggested by ITIL (OGC, 2000) as means of
problem discovery, as they note:
Categorisation of Incidents and Problems and creative analysis may reveal
trends and lead to the identification of specific (potential) Problem areas
that need further investigation. For instance, analysis may indicate that
Incidents related to the usability of recently installed client-server systems
is the Problem area that has the most growth in terms of negative impact
on business.
Analysis – for example of events from System Management tools,
literature, conferences and feedback from User groups – can also reveal
possible Problems deserving further investigation. Organising workshops
with prominent Customers or conducting Customer surveys can also lead
to the identification of trends and (potential) Problem areas.
Analysis of Problem Management data may reveal:
● That Problems occurring on one platform may occur on another
platform – for example, a Problem concerning network software
on a midrange system may well be of significance on a mainframe
system.
● The existence of recurring Problems – for example, if three routers
are substituted serially, because of the same failure, it may indicate
that the router type concerned is not appropriate and should be
replaced by another type, or when a software application is
involved then complete redevelopment might be necessary which
would be classed as a major Change.
Pareto and Ishikawa approaches to defect analysis have one major advantage:
they are not specific to software defects, and it is likely that administrators and
management within businesses understand and are comfortable using them.
While Pareto charts and Ishikawa diagrams may be widely used, they are
labour intensive. ITIL recommends their use, but it also recommends that problem
management committees meet to discuss and analyse incidents, and to do the work of
prioritising efforts towards permanently fixing problems or the root causes of incidents.
Given that this approach requires much human effort and is not easily
automated, Barash, Bartolini & Wu (2007) discuss how best to organise problem
handling processes so that incidents and problems are resolved in the most efficient
way. CBR relies on human operators to change the IT system so as to effect a fix for
an incident or problem. For large IT systems, it is not uncommon for teams of
administrators to be spread across geographies and time zones.
Barash, Bartolini & Wu (2007) provide a system of metrics to analyse how
incidents are routed between groups of administrators, so as to help define better,
more efficient incident handling protocols. The ultimate goal of their model is to
reduce the time to fix for an incident, while also reducing the business disruption
caused by it.
Their approach is less about the incidents than the human operators who have
to deal with them. While other authors deal with correct recording, and classification
of incidents, and analysis of problems to uncover root causes, Barash, Bartolini & Wu
(2007) are more interested in understanding “the improvements brought about by
restructuring the support organisation by increasing or cutting staffing levels, moving
operators around support groups (possibly on retraining), and even implementing
different prioritisation policies for the technician when dealing with queues of
incidents”.
3.5.5. Other Approaches
Talluru & Deshmukh (1995) introduce a decision support system for the
purpose of aiding problem management, stating that “a decision support system (DSS) is
characterised as a computer based information processing system which allows the
decision maker to interact with the problem solving process”. Although they only
develop a prototype system, they report some success in their findings, stating:
Using this DSS, managers can solve recurring problems by referring to
previously solved problems. It aids advanced users by providing all the
semantics. It helps the decision maker to do background analysis and
expert opinion sampling. The decision maker can also communicate
effectively with the quantitative analysts and other intermediaries. As the
problem manager stores all the problem situations, solutions, and in
between transformations, experts can use this information to fine tune the
knowledge periodically. It can improve the productivity of the decision
maker by cutting down turnaround times. This facility can also be used to
train entry level managers.
Apostolov (2006) monitors a network of electrical devices using continuous
automatic monitoring. While his application is hardware based, the analogous
software technique is periodic service probing. In this scenario, a software agent
would periodically probe or test a system service to ensure it is both functionally
correct and meeting service levels. Any non-conformances would automatically be
recorded to the Incident Management process incident database. BMC's Performance
Manager uses a technique akin to this for proactive incident and problem
management (BMC Software, 2006).
4. Methodology
This section outlines the approach taken to answer the basic research questions
identified in Chapter 1.
4.1. Proposed approach.
As previously shown, several means of incident analysis exist, each with
advantages and disadvantages for any given system. Any new means of incident
analysis would have to be generally applicable if it were to achieve adoption. Given
the dominance of Pareto and Ishikawa as incident analysis tools, the results of any
new means of analysis would also have to be as good as or better than those
approaches. Any reduction in the time it takes to analyse historical incident data
would also be well received.
In the literature there is a tacit acceptance that underlying root causes are not
obvious, and express themselves only through incidents in which software services
fail to meet expected functionality or service levels. Much as a doctor diagnoses a
disease by correlating symptoms, so system administrators hope to uncover the
underlying causes of system failures through the analysis of incident reports.
The standard software disruption metric takes the number of occurrences of an
incident, together with their severity, into account. While this is useful in an Incident
Management process for deciding which incidents should be prioritised, the Problem
Management process would also like to take incident correlations into consideration.
The literature shows an understanding that a problem's root cause can express itself
by causing more than one type of incident record to be logged, as root causes affect
different systems and users differently.
If the ITIL advice is followed and, as new incidents are logged, they are
related to existing and historical incident records, a “web of incident citations”
emerges. Additional metadata can be captured here through the use of tagging,
whereby any user of the system can categorise an incident using free-form keywords.
Over time, as users categorise and tag incidents, a body of metadata grows around
each incident. This web of citations between incidents and tags can also be analysed.
Using the PageRank algorithm to analyse these webs, much as it analyses the
web of interconnected Internet pages, should reveal where incidents are clustering.
Using this information, a new metric based on PageRank can be produced, to be used
in both the Incident and Problem Management processes to direct system
administrators and managers in deploying their resources for maximum impact in
finding and resolving problem root causes.
4.2. Introduction to PageRank
Brin & Page (1998) developed the PageRank algorithm while studying at
Stanford University. They introduce it, writing:
Academic citation literature has been applied to the web, largely by
counting citations or backlinks to a given page. This gives some
approximation of a page's importance or quality. PageRank extends this
idea by not counting links from all pages equally, and by normalizing by
the number of links on a page. PageRank is defined as follows:
We assume page A has pages T1...Tn which point to it (i.e., are citations).
The parameter d is a damping factor which can be set between 0 and 1.
We usually set d to 0.85. Also C(A) is defined as the number of links
going out of page A. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
PageRank or PR(A) can be calculated using a simple iterative algorithm,
and corresponds to the principal eigenvector of the normalized link matrix
of the web. Also, a PageRank for 26 million web pages can be computed
in a few hours on a medium size workstation.
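The quoted formula can be realised directly as the simple iterative algorithm Brin & Page describe. The following is a minimal sketch, not the thesis's implementation: the example link graph is hypothetical, and dangling nodes (with no outgoing links) simply do not redistribute their rank here.

```java
// Minimal iterative PageRank over an adjacency list, following the quoted
// formula: PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti)) over pages Ti linking to A.
public class PageRank {

    /**
     * links[i] holds the indices of the pages that page i links TO.
     * Returns each page's rank after the given number of iterations.
     */
    static double[] compute(int[][] links, double d, int iterations) {
        int n = links.length;
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, 1.0);              // common starting value
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, 1.0 - d);    // the (1 - d) term
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) continue;  // dangling node: rank not redistributed here
                double share = pr[i] / links[i].length;   // PR(Ti) / C(Ti)
                for (int j : links[i]) next[j] += d * share;
            }
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        // Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
        int[][] links = {{1, 2}, {2}, {0}};
        double[] pr = compute(links, 0.85, 50);
        for (int i = 0; i < pr.length; i++) {
            System.out.printf("PR(%d) = %.4f%n", i, pr[i]);
        }
    }
}
```

In this graph page 2 receives links from both other pages and so ends with the highest rank, matching the intuition that rank accumulates at heavily cited nodes.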
4.3. IncidentRank
The benefit of the PageRank algorithm is that each web page in a set of web
pages has its importance or quality measured by the number of incoming links or
citations. It is this characteristic of the PageRank algorithm that the MRSA
researchers leveraged in order to find which interactions within a ward had the most
effect in spreading MRSA. Simonite (2008), writing on Shepherd's research, notes:
“Our new model is based very much on the way Google has achieved
number one status among search engines. When Google's spiders crawl
the web they build up a connectivity matrix of links between pages”.
Shepherd's idea is to build a similar matrix describing all interactions
between people and objects in a hospital ward, based on observing normal
daily activity. “Obviously nurses move among patients and that can
spread infection, but they also touch light switches and lots of other
surfaces too. If you observe a network of all those interactions you can
build a matrix of which nodes in the network are in contact with which
other nodes”.
Shepherd has started testing the technique using data gathered for another
study. "We sussed out in one ward that the chief node was a light switch,"
he says. "It could potentially distribute infection to the rest of the ward
very quickly."
This approach has many parallels with incident and problem management in
software systems. In a codebook approach we would model a graph of problems with
related incidents. Given the Pareto principal, one would expect that the majority of
incidents or infections logged come from a minority of root causes. Finding and
addressing the root causes is the goal of Problem Management. As previously
discussed, a variety of correlation techniques can be brought to bear on the data.
ITIL and other problem management frameworks recommend that incidents,
when analysed, should be linked either with known problems or with other incidents
to aid root cause analysis. It is our proposal that the PageRank algorithm be brought
to bear on these records.
Using a similar approach when analysing linked incident records should
reveal similar results in finding which incidents have the most 'knock-on' effects in a
software system. It is hoped that the use of the PageRank algorithm in this instance
would help narrow the focus of the problem management team to the issues which
cause, or could potentially cause, the most system disruption.
When the problem management team tries to decide which incidents and
problems to prioritise, an incident's assigned rank can be considered. A prioritisation
metric could be developed by combining the incident rank with a ranking of potential
business disruption. The product of these would give a good prioritisation score for
each incident, directing system administrators in focusing their efforts.
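As a sketch of such a combined metric: the 1-5 disruption rating scale below is an illustrative assumption, not part of the thesis data.

```java
// Sketch of the proposed prioritisation metric: the product of an
// incident's PageRank-style rank and a business-disruption rating.
public class Prioritisation {

    /** Higher scores mean the incident deserves attention sooner. */
    static double priority(double incidentRank, int disruptionRating) {
        return incidentRank * disruptionRating;
    }

    public static void main(String[] args) {
        // A moderately ranked incident with a severe (4 of 5) disruption rating.
        System.out.println(priority(1.4, 4));
    }
}
```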
By analysing how incidents are related, one would expect this approach to:
● Show which incidents have the greatest, or potentially greatest, system impact.
● Be as effective as or better than existing Pareto approaches.
● Allow automation of as much of the incident/problem analysis process as
possible.
● Require a minimum of operational process change. That is, we do not propose
a completely new way of working, but use the data already available from
existing procedures.
5. Methods
5.1. Data
This section briefly describes the source data used for all experiments in this
thesis.
The source data was kindly donated by Eugene Maxwell & John Atherly. It
consists of 19220 rows of incident data, with each row representing a separate
incident logged to an Incident Management system. The time period covered by this
data is approximately the first 3.5 months of 2008. A brief description of each
column and its type is given in the Appendix.
Since the source software system is designed for Pareto analysis, each row is
categorised into a hierarchy. The 'Category', 'Type', and 'Item' fields capture a high,
mid and low level categorisation of the incident.
These organise each incident into a tree style categorisation. For example, the
Category could specify 'Operating System', the Type, 'Windows XP', and the Item,
'Spyware'. Thus incidents are grouped into various sets, and related to each other.
Business intelligence style reports can roll up or drill down through the sets of
incidents by including and excluding various values for Category, Type and Item.
The 'Case Category Type' field provides a concatenation of all three sub fields for
convenience.
Other fields within the data support other reporting opportunities, such as the
Arrival Time, Assigned Time, Work In Progress Time, etc. As the incident is
progressed through the analysis and resolution work flow, the incident is transitioned
through several statuses. As the status of the incident changes, timestamps are
recorded, and the amount of time the incident spent in each phase can be deduced.
This is helpful in supporting the analysis of which individual incidents, and which
groups of incidents, take the most time to resolve.
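Deducing the time spent in each phase is then a matter of subtracting consecutive status timestamps. A minimal sketch follows; the phase names echo the columns described above, and the example timestamps are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Sketch of deducing how long an incident spent in each phase from its
// status-change timestamps.
public class PhaseDurations {

    /** Time spent in a phase, given when the incident entered and left it. */
    static Duration timeInPhase(LocalDateTime entered, LocalDateTime left) {
        return Duration.between(entered, left);
    }

    public static void main(String[] args) {
        LocalDateTime assigned = LocalDateTime.of(2008, 2, 1, 9, 0);
        LocalDateTime workInProgress = LocalDateTime.of(2008, 2, 1, 10, 30);
        LocalDateTime resolved = LocalDateTime.of(2008, 2, 1, 16, 0);
        System.out.println("Queued:  " + timeInPhase(assigned, workInProgress));
        System.out.println("Working: " + timeInPhase(workInProgress, resolved));
    }
}
```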
5.2. Experiment 1: Pareto Analysis
5.2.1. Data
The system from which the source data was procured already uses Pareto
analysis for high level management of incident lists. As such, the data is readily
amenable to Pareto analysis.
As mentioned previously, the 'Category', 'Type' and 'Item' fields define a
hierarchy of categories by which the data can be analysed. Within each Category
several Types exist, and within each Type, several Items.
5.2.2. Apparatus
This experiment was run using Microsoft ® Excel on a standard desktop PC.
5.2.3. Procedure
Using the Pivot Table feature of Microsoft ® Excel, one can produce a
frequency analysis of the list of incidents. This orders the incidents into their
respective categories and gives the number of incidents within each category.
The incidents were first analysed by the contents of the Category field. This is
the highest level category under which the incidents are classified. Based on the
result of this analysis, the category with the most incidents was analysed by the
contents of the Type field. Finally, the incident type with the most occurrences was
analysed in a similar fashion, by the contents of the Item field.
Based on these results, several Pareto charts could be produced. Dorner
(1997) shows how this can be achieved.
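The same frequency analysis can be expressed outside Excel. The following is a minimal sketch; the incident list is a hypothetical stand-in for the Category column of the source data.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the Pareto frequency analysis performed with pivot tables:
// count incidents per category, sort by descending frequency, and report
// the cumulative percentage of each category.
public class ParetoAnalysis {

    /** Returns category -> cumulative percentage, ordered by descending frequency. */
    static Map<String, Double> pareto(List<String> categories) {
        Map<String, Long> counts = categories.stream()
                .collect(Collectors.groupingBy(c -> c, Collectors.counting()));
        List<Map.Entry<String, Long>> sorted = counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toList());
        Map<String, Double> result = new LinkedHashMap<>();
        double total = categories.size(), running = 0;
        for (Map.Entry<String, Long> e : sorted) {
            running += e.getValue();
            result.put(e.getKey(), 100.0 * running / total);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> incidents = List.of("Oracle", "Oracle", "Oracle", "Notes", "Notes", "Hardware");
        pareto(incidents).forEach((cat, cum) ->
                System.out.printf("%-10s cumulative %.1f%%%n", cat, cum));
    }
}
```

Reading down the resulting cumulative column until it passes 80% identifies the categories that dominate the incident load.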
5.2.4. Results
The results of the Pareto analysis are given in the following sections.
5.2.4.1. Pareto of Category
The highest level of categorisation in the hierarchical scheme applied to the
source data is by the 'Category' field. Illustration 1 shows the Pareto chart of all
incidents by the contents of the Category field.
It can be seen that the majority of incidents recorded by the system are
grouped into a few top-level categories. The top 80% of incidents fall into the
categories shown in Table 1.
Illustration 1: Pareto by Category (bar chart of incident percentage and cumulative percentage by Category)
Rank Category Frequency % of Total Cumulative %
1 Oracle 3601 18.74% 18.74%
2 Notes 2474 12.87% 31.61%
3 Hardware 2220 11.55% 43.16%
4 Network 1764 9.18% 52.34%
5 Remote Access 1320 6.87% 59.21%
6 InTouch 1235 6.43% 65.64%
7 Telecom 1059 5.51% 71.15%
8 Application 1045 5.44% 76.58%
9 Software 988 5.14% 81.73%
Table 1: Top 80% of Incidents by Category
5.2.4.2. Pareto of Type
Knowing that Oracle is the highest Pareto ranked Category, we can drill
down to the Type level. Illustration 2 shows the Pareto of the Oracle Category by the
Type field. The top 80% of incidents by type are tabulated in Table 2.
Illustration 2: Pareto of Oracle Types (bar chart of incident percentage and cumulative percentage by Type)
Rank Type Frequency % of Total Cumulative %
1 Operations 1625 45.13% 45.13%
2 EM Alerts 957 26.58% 71.70%
3 Applications 903 25.08% 96.78%
Table 2: Top 80% of Oracle Incidents by Type
5.2.4.3. Pareto of Item
Finally, we run the same analysis on the Operations type, as it is the highest
ranked incident type within the Oracle category. Using the Item field we can further
analyse these incidents to find the highest ranked incident item. These are charted in
Illustration 3 and the top 80% of items tabulated in Table 3.
As can be seen from Illustration 3 and Table 3, the highest ranked item is
'Access Issues'. This categorises various incidents logged by users who were having
trouble accessing the Oracle system and its applications.
Illustration 3: Pareto of Oracle Operations Items (bar chart of incident percentage and cumulative percentage by Item).
Rank Item Frequency % of Total Cumulative %
1 Access Issues 636 38.71% 38.71%
2 Application Issues 314 19.11% 57.82%
3 Password Reset 208 12.66% 70.48%
4 Database Issues 166 10.10% 80.58%
Table 3: Top 80% of Oracle Operations by Item.
5.3. Experiment 2: PageRank Analysis
5.3.1. Data
As shown in section 5.2.4, some 636 incidents were flagged as access issues
for the Oracle system. These became the input data for experiment two.
5.3.2. Apparatus
5.3.2.1. Database
To prepare the data, the selected rows were imported from the source
Microsoft Excel spreadsheet into a Microsoft Access database, using the Import
functionality of Microsoft Access. A new table (IssueTbl) was created from the
imported data.
Once in database format, some additional tables could be built around this
data. Illustration 4 shows the database design. The IssueTbl table contains the source
data, imported from its original spreadsheet. The TagTbl contains the definitions of
tags which have been applied to incidents. The IssueTagTbl provides the bridging
table, allowing a many-to-many relationship between incidents and tags: each
incident can have multiple tags associated with it, while each tag can be applied to
multiple different incidents.
Using this structure, a graph of the interrelationships between incidents and
tags can be produced. It is this graph which is analysed using the PageRank algorithm
to rank the relative importance of each incident and tag within the system.
5.3.2.2. Tagging
In order to ease the process of adding tags to incidents, a simple user interface
was developed using the features of Microsoft Access. This is shown in Illustration 5
below. On the left, the Category, Description, Summary and Work Log are displayed.
Reading these gives the user a sense of the issue, the root cause, if recorded, and the
Illustration 4: Database Design (IssueTbl, TagTbl, and the IssueTagTbl bridging table)
steps taken in trying to resolve the issue. On the right, the set of tags which have been
applied to the incident is displayed.
When adding a tag, previous tags are looked up, and autocompletion is used
to facilitate ease and speed of tagging. If a previously undefined tag is added, the
system asks if the user wishes to add this tag to the system. The new tag is then added
to the TagTbl table, while the link is added to the IssueTagTbl table.
Over time as issues are tagged, a graph of the interrelationships between
incidents and tags can be produced. A simple example of this is shown in Illustration
6. This becomes the basis for the next step in analysis, the application of the
PageRank algorithm to the linked incident data.
Illustration 5: Tagging User Interface
As can be seen in Illustration 6, each link between a tag and an incident is
assumed to be bidirectional. This is important for the calculation of the associated
ranks, and is discussed further below.
5.3.2.3. PageRank Algorithm
Both the IssueTbl and the TagTbl contain a PageRank field. This field is
introduced to hold the ranking given to each tag and incident by the PageRank
algorithm.
The PageRank algorithm was implemented as a Java application which
connected to the Microsoft Access database, queried for incidents and tags, and ran
the process of generating the rank for each. Austin (2008) gives a good account of
how the PageRank algorithm can be implemented, which is briefly discussed here.
Illustration 6: Incident Tag Graph
[Example graph of three incident nodes linked to two tag nodes.]
To begin the process, a link matrix (L) is produced. If the number of incidents
in the system is Ni, and the number of tags in the system is Nt, then the link matrix is
a square matrix of size (Ni + Nt) × (Ni + Nt). Each row and column corresponds to
either an incident or a tag. When an incident is linked with a tag, the value 1 is
assigned to the cells where those incidents and tags intersect, that is, to L(i, j) and
L(j, i), where i and j represent the row and column numbers of the respective incident
and tag.
Based on the Link Matrix, a probability matrix (H) is produced. This encodes
the probability of moving from one node in the link graph to another, based on the
number of outgoing links from the current node. That is, if a node has three outgoing
links to other nodes, then it is assumed by the algorithm that any link could be chosen
with equal probability. The probability matrix encodes this by counting the number
of links from each node, and storing the appropriate probability.
A second probability matrix (A) is needed for those nodes which have no
explicit links to other nodes. The PageRank algorithm assumes that if a node has no
links defined, then it is linked to all other nodes with equal probability. The A matrix
encodes this by searching the link matrix for non-linked nodes and replacing their
columns with a probability of 1/(Ni + Nt).
The PageRank algorithm is based on stochastic matrices (Austin, 2008). A
property of stochastic matrices is that each column of the matrix sums to 1. In our
system, the stochastic matrix (S) is given by:
S = H + A.
Therefore our S matrix is made of columns representing the probability of
moving to another node on the graph based on explicit links, or implicitly by being
linked to all nodes.
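The construction described above can be sketched as follows. This is a minimal illustration, not the thesis implementation; the toy graph and all names are assumptions. For brevity, H and A are folded into one pass: columns with outgoing links are normalised by out-degree, and dangling columns receive the uniform probability 1/(Ni + Nt).

```java
// Sketch: build the column-stochastic matrix S = H + A from a link matrix L.
public class StochasticMatrixSketch {
    static double[][] buildS(int[][] links, int n) {
        double[][] s = new double[n][n];
        for (int j = 0; j < n; j++) {
            int outDegree = 0;
            for (int i = 0; i < n; i++) outDegree += links[i][j];
            for (int i = 0; i < n; i++) {
                // H: share probability among outgoing links; A: uniform for dangling nodes.
                s[i][j] = (outDegree == 0) ? 1.0 / n : (double) links[i][j] / outDegree;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        // Nodes 0-1: incidents, node 2: a tag, node 3: an untagged (dangling) incident.
        // Links are bidirectional, so L is symmetric.
        int n = 4;
        int[][] l = new int[n][n];
        l[0][2] = l[2][0] = 1;
        l[1][2] = l[2][1] = 1;
        double[][] s = buildS(l, n);
        // Each column of S sums to 1, the defining property of a stochastic matrix.
        for (int j = 0; j < n; j++) {
            double colSum = 0;
            for (int i = 0; i < n; i++) colSum += s[i][j];
            System.out.printf("column %d sums to %.1f%n", j, colSum);
        }
    }
}
```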
A damping factor (a) is introduced to model the movement of a 'random
surfer' on the graph. The algorithm assumes that a surfer will move from one node to
a linked node with probability a, but move to a random node with probability 1 − a.
Finally, the PageRank matrix (I) is a column vector of Ni + Nt rows. This is
initialised with one of the nodes given all the rank (represented by a value of 1). As
iterations of the algorithm run, this initial rank will be shared and spread across the
graph. Eventually the values of rank converge to a stable matrix. Several factors are
involved in this, as discussed by Austin (2008). One can continue to run the
algorithm until the values converge, or use a fixed number of iterations to arrive at a
reasonable approximation, as Austin (2008) states:
With the value of a chosen to be near 0.85, Brin and Page report that 50 to
100 iterations are required to obtain a sufficiently good approximation to
I.
Thus the Google matrix (G) defines each iteration as:
Ik+1 = GIk = aSIk + ((1 − a) / n)1
where Ik represents the kth iteration of the algorithm, n = Ni + Nt, and 1 is a column
vector of ones (the equality holds because the entries of Ik sum to 1).
Once all iterations have been complete, the I matrix holds the PageRank
values (a value between 0 and 1) for each incident and tag. The final step is to save
these values with their respective incidents and tags using the PageRank field in the
IssueTbl and TagTbl, as discussed previously. A report can then be generated
showing which tags and incidents are ranked highest, and how they relate to each
other.
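The iteration can be sketched as below, assuming the stochastic matrix S has already been built. The three-node graph (two incidents both linked to one tag) is hypothetical, and this is not the thesis code.

```java
// Sketch of the power iteration: I(k+1) = a*S*I(k) + (1 - a)/n,
// run for a fixed number of iterations, e.g. 100 with a = 0.85.
public class PageRankIterationSketch {
    static double[] pageRank(double[][] s, int iterations, double a) {
        int n = s.length;
        double[] rank = new double[n];
        rank[0] = 1.0;                 // all rank starts on one node
        for (int k = 0; k < iterations; k++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0;
                for (int j = 0; j < n; j++) sum += s[i][j] * rank[j];
                next[i] = a * sum + (1 - a) / n;   // random-surfer teleportation term
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical graph: incidents 0 and 1 both linked to tag 2 (bidirectionally).
        double[][] s = {
            {0.0, 0.0, 0.5},
            {0.0, 0.0, 0.5},
            {1.0, 1.0, 0.0}
        };
        double[] rank = pageRank(s, 100, 0.85);
        // The tag, linked from both incidents, receives the highest rank.
        System.out.printf("incidents %.3f %.3f, tag %.3f%n", rank[0], rank[1], rank[2]);
    }
}
```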
Austin (2008), speaking of Google's implementation of the PageRank
algorithm on the graph of websites gathered by the company, notes that "The calculation is
reported to take a few days to complete". Running the PageRank algorithm on our
source data took far less time, on the order of five minutes, as we are dealing with
considerably less data than Google.
5.3.3. Procedure
Once the apparatus was in place and sufficiently tested, the procedure for
experiment two followed these steps:
1. Associate tags with the source incident set.
2. Run the PageRank generation algorithm.
3. Generate reports showing the highest ranked incidents and tags.
These are discussed in more detail below.
5.3.3.1. Associating Tags with Incidents.
Using the user interface described in Illustration 5, tags can be added to
incidents. Tags are free-form keywords. They are created and used by all participants
in the incident's life cycle. They form a set of metadata associated with the incident,
and can be used as a form of categorisation for incidents.
Based on the description, summary and work log of each incident, a set of tags
was associated with each incident. These represent various aspects of the incident
including:
● the symptoms of the issue as reported by the end user,
● the approaches taken by the IT staff in attempting to resolve the issue,
● the root cause or causes,
● the change made that fixed this incident,
● any other relevant information.
5.3.3.2. Running the PageRank Algorithm
Running the PageRank algorithm involved creating an ODBC data source
within Microsoft Windows. The Java executable which encoded the algorithm was
then run. This connected via a JDBC-ODBC bridge, queried for incidents and
tags, and generated the PageRank numbers before saving them to the database.
5.3.3.3. PageRank Reports
PageRank reports were generated using Microsoft Access query and reporting
functionality. These queried both tags and incidents and ordered them in descending
order of PageRank value. These are presented in section 6.
6. Results
This section shows the results from both experiments.
6.1. Experiment 1: Pareto Results
The results of experiment 1 are discussed in section 5.2.4, and are
summarised here.
On application of Pareto analysis to the source dataset, it was found that the
top-level category of Oracle had the most incidents, with an incident count of 3,601,
accounting for 18.74% of all incidents. Within the Oracle category, the largest mid-level
subcategory was found to be Operations, with an incident count of 1,625, which
accounted for 45.13% of all Oracle incidents. Finally, within the Operations category,
the largest low-level subcategory was found to be Access Issues, with an incident
count of 636, accounting for 38.71% of all Operations issues.
6.2. Experiment 2: PageRank Results
A summary of the top-ranked tags is given in the following tables and charts.
Rank  Tag                          PageRank (x1000)  %       Cumulative %
1     ResetPassword                49.16             10.40%  10.40%
2     NewResponsibilities          37.82             8.00%   18.40%
3     UserRoles                    35.44             7.50%   25.90%
4     ApplicationError             27.69             5.86%   31.76%
5     OAR (Oracle Access Request)  24.63             5.21%   36.97%
6     IECache                      20.85             4.41%   41.38%
7     NewUser                      18.59             3.93%   45.31%
8     DesktopIssue                 15.69             3.32%   48.63%
9     LostResponsibilities         15.64             3.31%   51.94%
10    UserTraining                 15.31             3.24%   55.18%
Table 4: Top Ten Tags Ordered by PageRank
Some 57 new categories were introduced through the tagging of the source
data set. Despite this, the majority of incidents are contained within a minority of
categories. As Table 4 shows, the top ten tags account for over 55% of the total
allocated PageRank for tags.
On reading through the source data, it was clear that the majority of incidents
fell into a small set of high level categories. Those were:
1. Application/system errors.
2. Loss of access due to scheduled downtime.
3. Loss of access due to changing user circumstances.
4. Routine loss of access rights.
One might expect that application or system errors would account for the
majority of incidents logged. The tacit assumption of ITIL is that all incidents are
caused by some defect within the IT system. This does not seem to be borne out by
Illustration 7: Tags Ordered by PageRank
[Pareto-style bar chart of all tags in descending order of PageRank (x1000), from ResetPassword down to ClearTempFiles, with a cumulative percentage line on a secondary axis.]
the PageRank analysis. Tags associated with application errors are ranked at 4
(ApplicationError), 6 (IECache) and 8 (DesktopIssue).
The standard response from IT staff is to return the user's client to a known
good configuration by clearing various caches (IECache), and this usually fixes the
access issue. The ApplicationError tag was introduced as a catchall tag for those
incidents which did not provide sufficient information for other tags to be applied.
Losses of access due to known issues, such as system upgrades, scheduled
downtime, and report generation, were captured by the tags in Table 5.
Rank  Tag               PageRank (x1000)  %
11    UpgradeIssues     14.66             3.10%
19    Maintenance       8.48              1.79%
39    ReportInProgress  2.53              0.54%
Table 5: Known Issue Tags
These issues represent a minority of cases, with upgrade issues being the
highest ranked. Several incidents were logged when Oracle instances were upgraded
and IT staff worked to restore user roles and access rights. Outside of this once-off
event, access issues related to known issues are ranked quite low.
The final two categories refer to routine loss of access rights. These are due to
the individual business rules which govern the operation of the various Oracle
systems. The Human Resources department define access rights based on the user's
job role. Several of the top ten tags are associated with operation of this policy.
Those are:
● NewResponsibilities, users request new access rights through the
incident handling system.
● UserRoles, users request changes to their roles through the incident
handling system.
● OAR, the Oracle Access Rights system. Users are directed to request
changes in access rights through a separate system.
● NewUser, a new user has difficulties accessing Oracle.
● LostResponsibilities, typically a user has lost some rights due to
changing job role, or their access rights have reached their time limit and must
be renewed.
● UserTraining, due to the operational rules, IT staff can only address
these issues through training users in the operational policies.
Finally, we must note the top-ranked tag of ResetPassword, which was applied
to any incident resulting in a password being reset. This seems to be a standard IT
staff response to an access issue, and perhaps represents the overuse of a particular
incident resolution strategy.
6.3. Comparison
In this section we compare various aspects of Pareto and PageRank analysis.
6.3.1. On PageRank and Tags
We begin our comparison by looking at the PageRank values given to the tags
defined in the system.
6.3.1.1. Comparison of Incident Count to PageRank
As Dorner (1997) shows, Pareto charts are based on the number of incidents
within a particular category. By comparing the number of incidents associated with
each tag with that tag's PageRank value, we can assess if the PageRank analysis
differs from Pareto. A scatter graph of Incident Count versus PageRank is displayed
in Illustration 8.
As can be seen in Illustration 8, a roughly linear relationship exists between
incident count and PageRank value for tags. This is confirmed when the correlation
coefficient is computed, giving a value of ρ = 0.98.
With the correlation coefficient this close to 1, we can say that PageRank
analysis is quite similar to Pareto in this regard, though it must be said that Pareto
counts each incident in only one category, where PageRank can count the same
incident in multiple categories.
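A correlation of this kind can be computed with the standard Pearson formula; the thesis does not name its method, so Pearson is assumed here. In the sketch below the incident counts are invented, and only the PageRank values echo Table 4.

```java
// Sketch: Pearson correlation coefficient between per-tag incident counts
// and PageRank values.
public class CorrelationSketch {
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i]; sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i]; sumY2 += y[i] * y[i];
        }
        // r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2)(n*Syy - Sy^2))
        double num = n * sumXY - sumX * sumY;
        double den = Math.sqrt((n * sumX2 - sumX * sumX) * (n * sumY2 - sumY * sumY));
        return num / den;
    }

    public static void main(String[] args) {
        double[] counts = {40, 30, 28, 22, 20};                // incidents per tag (illustrative)
        double[] ranks  = {49.16, 37.82, 35.44, 27.69, 24.63}; // PageRank x1000, from Table 4
        System.out.printf("correlation = %.2f%n", pearson(counts, ranks));
    }
}
```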
6.3.1.2. Comparison of Incident Count to Resolution Time
For each tag we can find the associated incident count, and we can also find
the sum of the resolution time for those associated incidents. In Illustration 9 we
graph this relationship.
Illustration 8: Comparison of Incident Count to PageRank
[Scatter chart of incident count per tag (x-axis, 0 to 45) against PageRank (y-axis, 0 to 60).]
As previously mentioned, Pareto analysis associates each incident with only
one category. In this system it is obvious then that the categories with the most
incidents are likely to be the categories which have the highest total resolution time.
As can be seen from Illustration 9, the relationship is less well defined than in
the previous example, yet can still be said to be roughly linear. The correlation
coefficient of this data set is ρ = 0.92.
Although not as high as the previous example, it shows that a strong
relationship exists between the Tag categories and the resolution times. Again, those
categories with the most incidents are most likely to be the categories which show the
longest resolution times.
6.3.1.3. Comparison of PageRank to Resolution Time
Similarly, for each tag we can assess the relationship between the assigned
PageRank value and the resolution time for the associated incidents. There is no
Illustration 9: Correlation of Incident Count to Resolution Time
[Scatter chart of incident count per tag (x-axis, 0 to 45) against the sum of time to resolve (y-axis, 0 to 350).]
direct comparison to be made with Pareto here, as Pareto does not rank categories in
any way other than through incident counts.
A comparison of PageRank to resolution time is displayed in Illustration 10.
This shows increased scatter compared with the previous examples, but can again be
said to be roughly linear.
The correlation coefficient for this relationship is ρ = 0.90. Based on this value,
we can say that the PageRank values for tags are strongly correlated with the
resolution times for associated incidents. This shows that PageRank analysis can
have some value, as those tags with the highest ranks are likely to be those requiring
the most time to fix.
6.3.1.4. Relationships between Tags
An attribute of the tagging categorisation system is that incidents can be
simultaneously associated with multiple tags. This is in contrast with the single
Illustration 10: Correlation of PageRank to Resolution Time
[Scatter chart of resolution time (x-axis, 0 to 350) against PageRank (y-axis, 0 to 60).]
category an incident may be related to in a fixed or hierarchical categorisation
scheme.
This allows us to analyse incident counts using Venn diagrams, as shown in
Illustration 11. The numbers represent the incident counts for that area of the
diagram; that is, 37 incidents are tagged with ResetPassword alone, while 3
incidents are tagged with both ResetPassword and IECache.
While this is a simple example with three tags, more sophisticated Venn
diagrams, or other analysis mechanisms, can be applied to the relationships
between tags. This shows the flexibility of the tagging system in allowing multiple
analysis techniques to be brought to bear on the metadata which tagging
introduces on the source data. By analysing these relationships, system administrators
Illustration 11: Example Venn of Tag Relationships
[Three-circle Venn diagram showing incident counts for each region of overlap between three tags.]
can gain further insight into how users categorise incidents, and can begin to learn
how people react to the system when errors or other unexpected behaviours occur.
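Counting the regions of such a diagram amounts to grouping incidents by their exact tag sets. A sketch follows, with hypothetical incident-tag assignments standing in for the contents of IssueTagTbl.

```java
import java.util.*;

// Sketch: count incidents per exact tag combination, giving the numbers
// for each region of a Venn-style breakdown.
public class TagOverlapSketch {
    static Map<Set<String>, Integer> overlapCounts(Map<Integer, Set<String>> incidentTags) {
        // Group incidents by their exact tag set and count each combination.
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (Set<String> tags : incidentTags.values()) {
            counts.merge(tags, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> incidentTags = new HashMap<>();
        incidentTags.put(101, Set.of("ResetPassword"));
        incidentTags.put(102, Set.of("ResetPassword", "IECache"));
        incidentTags.put(103, Set.of("ResetPassword"));
        Map<Set<String>, Integer> counts = overlapCounts(incidentTags);
        System.out.println(counts.get(Set.of("ResetPassword")));            // -> 2 (tagged alone)
        System.out.println(counts.get(Set.of("ResetPassword", "IECache"))); // -> 1 (tagged with both)
    }
}
```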
6.3.2. On PageRank and Incidents
We continue our comparison by looking at the PageRank values given to
incidents.
6.3.2.1. Comparison of PageRank to Resolution Time.
As can be seen in section 6.3.1.3, the PageRank values assigned to tags were a
good indicator of the amount of time required to resolve the associated incidents.
We now consider if this also holds true for the PageRank values assigned to incidents.
As can be seen in Illustration 12, the values of PageRank and resolution time
for incidents are widely scattered, seemingly at random. Calculating the correlation
coefficient gives a value of ρ = 3.07×10⁻³. With a value so close to zero, we can
confidently say that no relationship exists between PageRank values for
incidents and their resolution times.
While PageRank values showed considerable correlation with the resolution
times for tags, the same cannot be said for incidents. It would appear that ranking
incidents by their PageRank values gives no indication of how long those incidents
will take to resolve.
6.3.2.2. Comparison of PageRank to Urgency.
When incidents are logged to the help desk, they are assigned an urgency
value of Low (1), Medium (2), High (3) or Urgent (4). We now analyse the
correlation between the PageRank values assigned to incidents and their associated
urgency values.
As can be seen from Illustration 13, the PageRank values are scattered widely.
Calculating the correlation coefficient shows it to be ρ = 7.26×10⁻³. With a value so
close to zero, we can confidently say that the PageRank value assigned to an incident
is not an indicator of the incident's urgency.
We can only speculate as to why this may be. Assuming that PageRank is a
true indicator of an incident's importance, it could be that there is no consistent
understanding of what the urgency field of an incident is used for. In this case
different people may mark similar incidents with vastly different urgency values.
Illustration 12: Comparison of PageRank to Resolution Time for Incidents
[Scatter chart of PageRank (x-axis, 0 to 7) against resolution time (y-axis, 0 to 25) for incidents.]
It could also be argued that the urgency values assigned to incidents are valid.
In that case, we would have to argue that the PageRank values assigned are not a good
indicator of the incident's true importance.
6.3.2.3. Comparison of PageRank to Priority.
In a similar way to the Urgency values previously discussed, Priority values
are also assigned to incidents by those who log them. An incident's priority field can
take the values of Low (1), Medium (2), High (3) or Urgent (4). We now analyse the
relationship between PageRank values and Priority values for incidents.
When plotted on a chart, as in Illustration 14, we can see that the values of
PageRank versus priority are quite scattered. Calculating the correlation coefficient
shows a value of ρ = 0.14. Being so close to zero, we can say that there is no
meaningful relationship between PageRank values and the assigned priority values.
Illustration 13: Comparison of PageRank to Urgency for Incidents
[Scatter chart of PageRank (x-axis, 0 to 7) against urgency (y-axis, 0 to 4.5) for incidents.]
Similar to the urgency values, we can assume that PageRank is a true rank of
relative importance, and then argue that the priority values are not assigned
consistently to incidents. This would account for the lack of a relationship between
the two values. We can also argue that the priority values are valid, and that the
PageRank values therefore do not represent a meaningful measure of an incident's
relative priority.
6.3.2.4. Comparison of PageRank to Impact.
Similar to Urgency and Priority, an incident is given an impact value when
created. This can have values of Low (1), Medium (2) or High (3). We now analyse
the relationship between PageRank and Impact values for incidents.
As can be seen from the chart in Illustration 15, the values of PageRank versus
impact are scattered. A correlation coefficient of ρ = 0.12 shows an extremely weak
relationship. Thus we can say that PageRank values for incidents are not a good
indication of an incident's impact.
Illustration 14: Comparison of PageRank to Priority for Incidents
[Scatter chart of PageRank (x-axis, 0 to 7) against priority (y-axis, 0 to 4.5) for incidents.]
Again, we can assume that the PageRank values are a true rank of an incident's
impact, and thus argue that the impact values are not consistently applied by the
human system users. We can also argue that the impact values are valid, and the
PageRank values are of no value as indicators of an incident's impact.
6.3.2.5. Relationships between Priority, Urgency, Impact and Resolution Time
In the previous sections we have left open the question of the quality of the
PageRank values versus the quality of the priority, urgency and impact data. By
analysing the relationships between these values, we can show whether system users
are consistent in applying similar values for these fields, and thus whether the
PageRank values for incidents are meaningless.
By calculating the correlation coefficients between these values we arrive at
the results in Table 6.
Illustration 15: Comparison of PageRank to Impact for Incidents
[Scatter chart of PageRank (x-axis, 0 to 7) against impact (y-axis, 0 to 3.5) for incidents.]
Relationship                     Correlation Coefficient (ρ)
Priority versus Urgency          0.02
Priority versus Impact           0.37
Priority versus Resolution Time  0.07
Urgency versus Impact            0.16
Urgency versus Resolution Time   0.16
Impact versus Resolution Time    0.10
Table 6: Relationships between Priority, Urgency, Impact and Resolution Time
Based on the values in Table 6, we can see that the strongest relationship is
that between priority and impact. With a correlation coefficient of ρ = 0.37, the best
we can say is that it is a weak relationship.
The other relationships between Priority, Urgency and Impact show values
close to zero. These values show clear evidence that the system users who create and
manipulate incidents do not provide consistent values for Priority, Urgency and
Impact. While this does not argue in favour of the idea that PageRank values for
incidents represent a meaningful measure, it does show that we cannot discount those
PageRank values as meaningless with respect to true measures of incident priority,
urgency or impact.
As an aside, it is also interesting to note the relationships between Resolution
Time and the values of Priority, Urgency and Impact. These also show very weak
relationships.
Within a correctly run ITIL incident management system, one would expect
that every effort would be made to resolve higher priority incidents quickly, and that
they would therefore show faster resolution times. This does not appear to be the case
in the system which produced the source data for this analysis.
Several factors may explain this discrepancy:
1. Assigned values of Priority, Impact and Urgency provide no value and
are ignored by IT staff.
2. IT staff may have so low a throughput of incidents that they
effectively resolve them on a first come, first served basis. That is, the help
desk workload is so light that it does not require prioritisation of incoming incidents.
3. Resolution Time is highly variable. That is, regardless of how
incidents are prioritised, or in what order they are resolved, resolution time is
so variable as to show no relationship with priority.
The first factor is unlikely. While the values of priority, urgency
and impact may not hold much intrinsic value, they are respected by IT staff, as service
level agreements tie IT staff to fixed resolution times associated with each priority
level. The second factor is also unlikely, as an analysis of the data shows an average
of 178 incidents being logged daily.
Finally, the third factor may have some merit. If we calculate the standard
deviation of resolution time within the priority groups, we arrive at the values in
Table 7.
Priority  Standard Deviation of Resolution Time
Low       4.32 hours
Medium    4.96 hours
High      4.87 hours
Urgent    4.81 hours
Table 7: Deviation of Resolution Times by Priority
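The per-priority spread can be computed by grouping resolution times by priority and taking the standard deviation within each group. A sketch follows, with invented sample data in place of the incident records.

```java
import java.util.*;

// Sketch: population standard deviation of resolution times, per priority group.
public class ResolutionSpreadSketch {
    static double stdDev(List<Double> values) {
        double mean = values.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double var = values.stream()
                           .mapToDouble(v -> (v - mean) * (v - mean))
                           .average().orElse(0);
        return Math.sqrt(var);
    }

    public static void main(String[] args) {
        // priority -> resolution times in hours (illustrative)
        Map<String, List<Double>> byPriority = Map.of(
            "Low",    List.of(1.0, 5.0, 12.0),
            "Urgent", List.of(0.5, 4.0, 11.0));
        byPriority.forEach((priority, times) ->
            System.out.printf("%s: %.2f hours%n", priority, stdDev(times)));
    }
}
```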
The values are both large and similar. This shows that two incidents arriving
at the same time can have resolution times which differ by over 4 hours, or half a
business day, regardless of the priority associated. This would account for the weak
correlation between resolution time and priority which we discussed earlier.
7. Discussion
7.1. Realisation of Project Aims
The aims of the project as stated in section 2.1 are:
1. Show how incident classification influences effective problem
management.
2. Propose a new classification mechanism for incidents.
3. Explore the analysis opportunities for the new classification
mechanism.
The new classification mechanism for incidents we propose is that of tagging
with PageRank analysis. We will argue that tagging provides additional flexibility in
categorisation beyond fixed or hierarchical categorisation schemes.
We have shown that PageRank analysis on a small data set is at least as good
as a corresponding Pareto analysis. With regard to the analysis opportunities for the
new classification mechanism, we will further argue that, as the data set grows, the
quality of correlations which can be produced from the data will allow much better
insight into the relationships which govern the operation of an IT system.
7.2. Big Science
As a race, we produce and consume ever-growing amounts of data each year.
As we move ever closer to the vision of ubiquitous computing, this can only rise, as
sensors and computing devices become smaller and cheaper. They will become
embedded in a vast array of locations, allowing us to gather ever more detailed
measurements of the physical world.
With so much data, Kevin Kelly (2008) suggests we need new methods of
science to analyse and make use of this data, stating:
There's a dawning sense that extremely large databases of information,
starting in the petabyte level, could change how we learn things. The
traditional way of doing science entails constructing a hypothesis to match
observed data or to solicit new data. Here's a bunch of observations; what
theory explains the data sufficiently so that we can predict the next
observation?
It may turn out that tremendously large volumes of data are sufficient to
skip the theory part in order to make a predicted observation. Google was
one of the first to notice this. For instance, take Google's spell checker.
When you misspell a word when googling, Google suggests the proper
spelling. How does it know this? How does it predict the correctly
spelled word? It is not because it has a theory of good spelling, or has
mastered spelling rules. In fact Google knows nothing about spelling
rules at all.
Instead Google operates a very large dataset of observations which show
that for any given spelling of a word, x number of people say "yes" when
asked if they meant to spell word "y." Google's spelling engine consists
entirely of these data points, rather than any notion of what correct
English spelling is. That is why the same system can correct spelling in
any language.
The traditional goal of science has been to build a better model of the physical
world. Taking all known facts about a particular system, a scientist creates a model
for that system. Using this model, they hypothesise about the behaviour of the model
under new circumstances. Experiments allow scientists to test the models under the
new circumstances, and thus validate or nullify their model.
With the growth of data, sensors and meta data, do we need to continue to
produce models? With enough data, can we not analyse the real, physical world,
without having to create a simplified abstraction of it? Anderson (2008) argues we
can, stating:
Faced with massive data, this approach to science — hypothesize, model,
test — is becoming obsolete. Consider physics: Newtonian models were
crude approximations of the truth (wrong at the atomic level, but still
useful). A hundred years ago, statistically based quantum mechanics
offered a better picture — but quantum mechanics is yet another model,
and as such it, too, is flawed, no doubt a caricature of a more complex
underlying reality. The reason physics has drifted into theoretical
speculation about n-dimensional grand unified models over the past few
decades is that we don't know how to run the experiments that would
falsify the hypotheses — the energies are too high, the accelerators too
expensive, and so on.
Now biology is heading in the same direction. The models we were
taught in school about "dominant" and "recessive" genes steering a strictly
Mendelian process have turned out to be an even greater simplification of
reality than Newton's laws. The discovery of gene-protein interactions
and other aspects of epigenetics has challenged the view of DNA as
destiny and even introduced evidence that environment can influence
inheritable traits, something once considered a genetic impossibility.
In short, the more we learn about biology, the further we find ourselves
from a model that can explain it.
There is now a better way. Petabytes allow us to say: "Correlation is
enough." We can stop looking for models. We can analyse the data
without hypotheses about what it might show. We can throw the numbers
into the biggest computing clusters the world has ever seen and let
statistical algorithms find patterns where science cannot.
This is a new way to approach scientific discovery. It can uncover
correlations in the data, but without a model we may not understand the causal
relationship, if any, behind them. What is certain is that with so much data, new
methods of science will have to be created, as Kelly (2008) states:
Many sciences such as astronomy, physics, genomics, linguistics, and
geology are generating extremely huge datasets and constant streams of
data in the petabyte level today. They'll be in the exabyte level in a
decade. Using old-fashioned "machine learning," computers can extract
patterns in this ocean of data that no human could ever possibly detect.
These patterns are correlations. They may or may not be causative, but we
can learn new things. Therefore they accomplish what science does,
although not in the traditional manner.
What Anderson (2008) is suggesting is that sometimes correlations
alone are sufficient. There is a good parallel in health care: much doctoring
works on the correlative approach. The doctor may never find the actual cause
of an ailment, or understand it if they did, but they can correctly predict the
course and treat the symptoms.
7.3. Diagnosing the IT System.
If this approach allows us to ask questions and receive perfectly good answers,
without having to construct a simplified model of the system, can such an approach be
used to analyse an ITIL incident database? Much as a doctor treats a symptom
without knowing the underlying cause, can a system administrator diagnose an IT
system in a similar way?
Yet, doctors don't diagnose blindly, but have built up a knowledge base of
relationships between symptoms and illnesses. Given symptom A, a doctor can say
with a good degree of confidence, based on historical analysis, the probability that
illness X, Y or Z is the underlying cause. Taking a broader approach and including
several symptoms in the analysis can raise or lower the probability of each
candidate illness. The doctor then begins by attempting to treat the most likely
illness that could cause the patient's symptoms.
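The diagnostic reasoning above can be sketched as a naive Bayes calculation. The illnesses, symptoms, priors and likelihoods below are invented purely for illustration and appear nowhere in the thesis data:

```python
# Hypothetical priors P(illness) and likelihoods P(symptom | illness);
# symptoms are assumed independent given the illness (the "naive" part).
priors = {"X": 0.5, "Y": 0.3, "Z": 0.2}
likelihoods = {
    "X": {"A": 0.9, "B": 0.1},
    "Y": {"A": 0.4, "B": 0.8},
    "Z": {"A": 0.1, "B": 0.2},
}

def posterior(symptoms):
    """Return P(illness | symptoms) for each candidate illness."""
    scores = {}
    for illness, prior in priors.items():
        p = prior
        for s in symptoms:
            p *= likelihoods[illness][s]
        scores[illness] = p
    total = sum(scores.values())
    return {i: p / total for i, p in scores.items()}

# One symptom points to illness X; adding a second shifts the answer to Y.
print(posterior(["A"]))
print(posterior(["A", "B"]))
```

With symptom A alone, X is the most likely cause; adding symptom B raises Y's posterior above X's, mirroring how a broader set of symptoms narrows the diagnosis.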
If a system administrator is to take a similar approach, then a similar
knowledge base of symptom to likely cause relationships must be developed. ITIL
attempts to achieve this through the administration of an incident log. Incident
records are created along with their work logs, and eventual fixes, in the hope that
similar incidents in future will mandate similar fixes.
Continuing with the medical metaphor, this seems rather simplistic. Doctors
tend to take a holistic view of a patient's well-being. You would question your
health care provider if they treated your cough while your leg remained broken. Yet
this is the approach that ITIL's Incident Management process suggests.
Secondly, many illnesses produce flu-like symptoms, so it would be remiss of
a doctor to assume that these represent flu and begin treatment without
investigating further. In medicine at least, similar symptoms may have disparate
underlying causes. If the same holds for IT systems, then ITIL's approach of
curing each incident separately, on the assumption that each symptom has a single
root cause, is equally simplistic.
7.4. On Categorisation
ITIL's Problem Management process recommends the use of Pareto charts for
incident analysis. This imposes a categorisation scheme in which each incident is
associated with only one category, usually related to its root cause. The source
data we analysed, for example, used a hierarchical categorisation scheme.
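Since each incident carries exactly one category under such a scheme, the Pareto breakdown reduces to a frequency count. A minimal sketch, with invented category names:

```python
from collections import Counter

# Invented single-category incident records.
incidents = ["Oracle", "Intranet", "Oracle", "Email", "Oracle",
             "Intranet", "Email", "Oracle", "Printing", "Oracle"]

# Rank categories by frequency and track the cumulative percentage,
# which is the information a Pareto chart plots.
counts = Counter(incidents).most_common()
total = sum(n for _, n in counts)

cumulative = 0
for category, n in counts:
    cumulative += n
    print(f"{category:10s} {n:3d}  {100 * cumulative / total:5.1f}%")
```

The "vital few" categories at the top of the list are the ones Problem Management would address first.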
Again, we can argue that this is simplistic. While many incidents may
well have only one root cause, it is naive to suggest that all do; some incidents
have multiple root causes.
A systems engineering view suggests that certain incidents are only realised
when a series of events, however unlikely, occurs together. The one-to-one
relationship between symptom and cause, forced by a fixed or hierarchical
categorisation scheme, cannot capture these subtleties.
To overcome this limitation, we propose a collaborative tagging system for
incidents. Golder & Huberman (2005) introduce tagging as:
Marking content with descriptive terms, also called keywords or tags, is a
common way of organising content for future navigation, filtering or
search. Though organising electronic content this way is not new, a
collaborative form of this process, which has been given the name
“tagging” by its proponents, is gaining popularity on the web.
Document repositories or digital libraries often allow documents in their
collections to be organised by assigned keywords. However, traditionally
such categorising or indexing is either performed by an authority, such as
a librarian, or else derived from the material provided by the authors of
the documents (Rowley, 1995). In contrast, collaborative tagging is the
practice of allowing anyone – especially consumers – to freely attach
keywords or tags to content. Collaborative tagging is most useful when
there is nobody in the “librarian” role or there is simply too much content
for a single authority to classify; both of these traits are true of the web,
where collaborative tagging has grown popular.
This contrasts with the fixed categorisation scheme already applied to our
source data, in which a user could only place an incident into one of the
predefined system categories. Allowing users to categorise incidents in a
free-form "folksonomy" presents its own difficulties: the issues of polysemy,
synonymy and basic-level variation need to be considered.
Golder & Huberman (2005) discuss these, saying:
A polysemous word is one that has many (“poly”) related senses
(“semy”). For example, a “window” may refer to a hole in the wall or to
the pane of glass that resides within it (Pustejovsky, 1995). In practice,
polysemy dilutes query results by returning related but potentially
inapplicable items.
Synonymy, or multiple words having the same or closely related
meanings, presents a greater problem for tagging systems because
inconsistency among the terms used in tagging can make it very difficult
for one to be sure that all the relevant items have been found. It is
difficult for a tagger to be consistent in the terms chosen for tags; for
example, items about television may be tagged either television, or TV.
This problem is compounded in a collaborative system, where all taggers
either need to widely agree on a convention, or else accept that they must
issue multiple or more complex queries to cover many possibilities.
Reflecting the cognitive aspect of hierarchy and categorisation, the “basic
level” problem is that related terms that describe an item vary along a
continuum of specificity ranging from very general to very specific; cat,
cheetah and animal are all reasonable ways to describe a particular entity.
The problem lies in the fact that different people may consider terms at
different levels of specificity to be most useful or appropriate for
describing the item in question.
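One partial mitigation for synonymy, and to some extent basic-level variation, is to fold raw tags into canonical forms via an alias table. A minimal sketch; the tags and aliases below are invented:

```python
# Hypothetical alias table mapping variant tags to a canonical form.
aliases = {
    "tv": "television",
    "telly": "television",
    "winxp": "windows-xp",
    "xp": "windows-xp",
}

def normalise(tag):
    """Lower-case a tag and fold known synonyms into one canonical term."""
    tag = tag.strip().lower()
    return aliases.get(tag, tag)

raw_tags = ["TV", "television", "XP", "WinXP", "email"]
print(sorted({normalise(t) for t in raw_tags}))
```

The alias table must itself be maintained, so this trades the librarian problem for a smaller one; it does nothing for polysemy, where a single surface form carries several senses.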
Kelly (2008) refers to these personality types as “lumpers and splitters”,
stating:
In every classification scheme, there are two camps. There are those
classifiers who tend to find similarities between things and prefer to lump
smaller groups into larger groups, and on the other hand those cataloguers
who tend to find differences and prefer to split larger groups into smaller
groups.
7.5. The Wisdom of Crowds
Given these peculiarities, and the differing opinions between users as to how
to classify something, one could be forgiven for thinking this style of categorisation is
a recipe for chaos. Speaking of their analysis of bookmarks on the Del.icio.us
website, Golder & Huberman (2005) state:
One might expect that individuals' varying tag collections and personal
preferences, compounded by an ever-increasing number of users, would
yield a chaotic pattern of tags. However, it turns out that the combined
tags of many users' bookmarks give rise to a stable pattern in which the
proportions of each tag are nearly fixed. Empirically, we found that,
usually after the first 100 or so bookmarks, each tag's frequency is a
nearly fixed proportion of the total frequency of all tags used.
This neatly illustrates the "wisdom of crowds" (Surowiecki, 2004).
While individuals may differ on how to categorise an item, the sum total of the
tags tends to stabilise around each tagged item, fully annotating its concept.
Allied with PageRank analysis, the more frequently applied general terms tend to
rank higher than the less frequently applied personal tags. The system can thus
accommodate both "lumpers and splitters", arriving at a consensus on how items
are categorised.
7.6. Analysis of Tags
The tags applied to incidents create a set of metadata describing the
incident data. We have shown, through the application of the PageRank algorithm
to the graph of incidents and tags, that both tags and incidents can receive a
rank of relative importance.
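As a rough illustration, PageRank can be computed by power iteration over an undirected bipartite graph of incidents and tags. The toy graph below is invented and is not the thesis data set:

```python
# Toy incident-tag edges; each incident is linked to the tags applied to it.
edges = [("inc1", "network"), ("inc2", "network"),
         ("inc2", "email"), ("inc3", "email"), ("inc3", "printer")]

# Build an undirected adjacency list.
adj = {}
for a, b in edges:
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)

nodes = list(adj)
d = 0.85                     # damping factor, as in Brin & Page (1998)
rank = {n: 1.0 / len(nodes) for n in nodes}

for _ in range(50):          # fixed iteration count; ample for this tiny graph
    rank = {
        n: (1 - d) / len(nodes)
           + d * sum(rank[m] / len(adj[m]) for m in adj[n])
        for n in nodes
    }

for n, r in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{n:8s} {r:.3f}")
```

Tags applied to more incidents ("network", "email") come out above the singleton tag ("printer"), matching the intuition that frequently applied tags rank higher.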
We go on to show that the PageRank values for tags correlate well
with the total resolution time for the incidents carrying those tags. This
represents a first-order analysis of metadata against source data, through the
comparison of PageRank to the base data.
Secondly, we show that further opportunities for analysis exist.
Here we show the flexibility of the tagging system, as incidents can be associated
with multiple categories. We also show briefly that the relationships between tags
can themselves be analysed. Further and deeper analysis could show which set of
tags describes the most incidents, or the most important incidents within the
system, giving administrators increased insight into where users perceive problems
to exist.
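The first-order check mentioned above, comparing each tag's PageRank with the summed resolution time of its incidents, amounts to a correlation calculation. A sketch with invented numbers:

```python
import math

# Invented per-tag figures: PageRank and total resolution hours.
pagerank = [0.21, 0.17, 0.12, 0.08, 0.05]
hours    = [340.0, 295.0, 180.0, 120.0, 60.0]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(pagerank, hours)
print(f"r = {r:.3f}")  # near 1.0 for this monotone toy data
```

A high r on real data would support using tag PageRank as a proxy for where resolution effort concentrates; it would not by itself establish causation.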
7.7. Further Research
Perhaps the best way to view this research is as a feasibility study of a
new approach to "correlative analysis" (Kelly, 2008) of incident data. We hope to
have shown that the approach has some merit.
To further the research, one could run a similar analysis on a larger
scale, perhaps as a case study, given a minimum set of requirements. If these can
be met, the experiment may prove valuable.
As previously discussed, this approach requires an abundance of metadata if
it is to find correlations. To achieve this, a candidate system would have to be:
1. sufficiently large,
2. accessed by many users,
3. annotated with metadata by everyone.
To be confident in the quality of analysis available from a system, the
aggregate metadata within it would have to reach a certain minimum level; below
this, the quality of correlative analytics could not be guaranteed. One would have
to ensure that the system is sufficiently large in terms of users, incidents and
tags for a sufficient aggregate amount of metadata to be created and maintained.
Golder & Huberman (2005) find great variety in the number of tags individual
users apply, stating:
As might be expected, users vary greatly in the frequency and nature of
their Delicious use. In our "people" dataset, there is only a weak
relationship between the age of the user's account (i.e. the time since they
created the account) and the number of days on which they created at least
one bookmark (n=229; R²=.52). That is, some users use Delicious very
frequently, and others less frequently.
More interestingly, there is not a strong relationship between the number
of bookmarks a user has created and the number of tags they used in those
bookmarks (n=229; R²=.33). The relationship is weak at the low end of
the scale, users with fewer than 30 bookmarks (n=39; R²=.33), and even
weaker at the upper end, users with more than 500 bookmarks (n=36;
R²=.14). Some users have comparatively large sets of tags, and others
have comparatively small sets.
PageRank rankings of tags and incidents could be generated automatically by
the system on a regular basis. Finally, a free-form analysis tool would have to be
produced, allowing data and metadata to be organised, summarised and analysed.
Using this, arbitrary data points could be chosen and the relationships between
them explored.
Such a system could support many arbitrary reports. Over time, one would
expect that some reports would prove more useful in managing the system than
others, thus providing increased insight into the operation of the system under
scrutiny and adding to the collected set of metadata about it.
8. References
Anderson, C. (2008) The End of Theory: The Data Deluge Makes the Scientific
Method Obsolete. Wired Magazine. Retrieved on August 7, 2008 from http://
www.wired.com/science/discoveries/magazine/1607/pb_theory/
Apostolov, A. (2006) Automatic fault analysis and user notification for predictive
maintenance. IEEE Conference Record, Cement Industry Technical
Conference, 2006. 9 – 14.
Austin, D. (2008) How Google Finds Your Needle in the Web's Haystack. American
Mathematical Society. Retrieved on August 1, 2008 from http://www.ams.org/
featurecolumn/archive/pagerank.html
Barash, G., Bartolini, C. & Wu, L. (2007) Measuring and Improving the
Performance of an IT Support Organization in Managing Service Incidents.
Proceedings of the 2nd International Workshop on Business-Driven IT
Management. 11 – 18.
BMC Software. (2006) Achieving Proactive Incident and Problem Management
Using BMC Performance Manager. Retrieved January 6, 2008 from
http://www.bmc.com/products/attachments/WP_Achieving_PIPM_Using_BM
C_PM_2202v3ww_SOLUTION_USA4_FY07_Q3.pdf.
Brin, S. & Page, L. (1998) The Anatomy of a Large-Scale Hypertextual Web Search
Engine. Retrieved on January 6, 2008 from
http://infolab.stanford.edu/~backrub/google.html.
Card, D. (1993) Defect-causal analysis drives down error rates. IEEE Software. Vol.
10, No. 4. 98 – 99.
Card, D. (1998) Learning from our mistakes with defect causal analysis. IEEE
Software. Vol. 15, No. 1. 56 – 63.
Dorner, W. (1997) Using Excel for Data Analysis. Quality Digest. Retrieved on July
07, 2008 from http://www.qualitydigest.com/oct97/html/excel.html
Gokhale, S., Crigler, J., Farr, W. & Wallace, D. (2005) System Availability Analysis
Considering Hardware/Software Failure Severities. Proceedings of the 29th
Annual IEEE/NASA Software Engineering Workshop. 47 – 56.
Golder, S. & Huberman, B. (2005) The Structure of Collaborative Tagging Systems.
Information Dynamics Lab, HP Labs. Retrieved on August 10, 2008 from
http://www.hpl.hp.com/research/idl/papers/tags/tags.pdf
Hanemann, A., Sailer, M. & Schmitz, D. (2004) Assured service quality by improved
fault management. Proceedings of the 2nd International conference on Service
Oriented Computing. 183 – 192.
Jantti, M. & Eerola, A. (2006) A Conceptual Model of IT Service Problem
Management. Proceedings of the 2006 International Conference on Service
Systems and Service Management. Vol. 1. 798 – 803.
Kajko-Mattsson, M. (2002) Corrective Maintenance Maturity Model: Problem
Management. Proceedings of the 2002 International Conference on Software
Maintenance. 486 – 490.
Kelly, K. (2008) The Google Way of Science. The Technium. Retrieved on August
7, 2008 from
http://www.kk.org/thetechnium/archives/2008/06/the_google_way.php
Kelly, K. (2008) Lumpers and Splitters. The Technium. Retrieved on August 10,
2008 from
http://www.kk.org/thetechnium/archives/2008/01/lumpers_and_spl.php
Leszak, M., Perry, D. & Stoll, D. (2000) A case study in root cause defect analysis.
Proceedings of the 2000 International Conference on Software Engineering.
428 – 437.
Microsoft Corp. (2007) Service Management Functions, Problem Management.
Retrieved on January 1, 2008 from
http://www.microsoft.com/technet/solutionaccelerators/cits/mo/smf/smfprbmg.
mspx
Mockus, A. (2006) Empirical estimates of software availability of deployed systems.
Proceedings of the 2006 ACM/IEEE International Symposium on Empirical
Software Engineering. 222 – 231.
OGC. (2000) Best Practice for Service Support. London: TSO.
Oppenheimer, D. & Patterson, D. (2002) Studying and using failure data from large
scale internet services. Proceedings of the 10th workshop on ACM SIGOPS
European Workshop. 255 – 258.
Pustejovsky, J. (1995) The Generative Lexicon. MIT Press.
Rowley, J. (1995) Organising Knowledge. 2nd Ed. Brookfield, VT: Gower.
Simonite, T. (2008) Google tool could search out hospital superbugs. New Scientist.
Retrieved on January 04, 2008 from
http://www.newscientist.com/channel/health/dn13142googletoolcould
searchouthospitalsuperbugs.html.
Surowiecki, J. (2004) The Wisdom of Crowds: Why the Many Are Smarter Than the
Few and How Collective Wisdom Shapes Business, Economies, Societies and
Nations. Doubleday.
Tague, N. R. (2004) Pareto Chart. Retrieved February 10, 2008 from
http://www.asq.org/learnaboutquality/causeanalysis
tools/overview/pareto.html
Talluru, L. & Deshmukh, A. (1995) Problem management in decision support
systems: a knowledge-based approach. Proceedings of the 1995 IEEE
International Conference on Intelligent Systems for the 21st Century. Vol. 3,
Systems, Man and Cybernetics. 1957 – 1962.
Tanaka, J. & Taylor, M. (1991) Object Categories and Expertise: Is the Basic Level in
the Eye of the Beholder? Cognitive Psychology. Vol. 23. 457 – 482.
Appendix: Source Data Definition
Field (Type): Description

Arrival Time (Time stamp).
Assigned (Time stamp): Date and time the issue is assigned to an IT resource for resolution.
Case Category Type (String): Description of type and subtype of the incident. E.g., "Operating Systems | Windows XP | Blue Screen".
Case ID (ID): Auto-generated incident ID number.
Case Type (Enumeration): Type of request. Can have values of Incident, Problem, Question or Request.
Category (Enumeration): The high-level category of the incident. Incidents are categorised by application name, such as Oracle, Intranet, etc.
Create Date (Time stamp).
Department (String): Department name of the incident creator.
Description (String): Description of the incident.
Group (String): Group within the IT department which deals with this type of incident.
Hours to resolve (Number): How long the incident took to resolve.
Item (String): The lowest-level categorisation of the incident.
Priority (Enumeration): Requester-assigned priority. Can be one of Urgent, High, Medium or Low.
Region (String): Geographic region (continent) where the incident occurred. Can be one of APAC, EMEA, LAM or NAM.
Request Impact (Enumeration): The impact of the request. Can have values of High, Medium or Low.
Resolved (Time stamp): Date and time the incident was resolved.
Site (String): Name of the site at which the incident occurred.
SLA Parent (String): Reference to the Service Level Agreement this incident is being measured under.
Source (String): How this incident entered the tracking system. Can have values of Email, NMP (Network Management Protocol), Phone, Requester, Instant Messenger, TopTen or Web.
Status (Enumeration): The current status of the incident. Can be one of New, Assigned, Pending, Work in progress, Resolved or Closed.
Assigned.Time (Time stamp): Time the incident is assigned.
Work In Progress.User (String): Who is currently working on the incident.
Work In Progress.Time (Duration): How long the incident spent in the Work In Progress status.
Pending.Time (Duration): How long the incident spent in the Pending status.
Resolved.Time (Duration): How long the incident spent in the Resolved status.
Closed.Time (Duration): Date and time the incident was closed.
Summary (String): Brief description of the incident.
Type (String): Mid-level categorisation of the incident.
Urgency (Enumeration): How urgent a resolution is needed. Can have values of Urgent, High, Medium or Low.
Work Log (String): A log of work on the incident.