Analysing Incidents with PageRank
Running head: USING PAGERANK TO ANALYSE INCIDENT DATA
Using PageRank to Analyse Incident Data.
by
Patrick Collins
A Research Project submitted in partial fulfilment of the requirements for the degree
of Master of Science in Software and Information Systems.
NUI Galway
Department of Information Technology
August, 2008
Head of Department: Prof. Gerard Lyons
Project Supervisor: Owen Molloy
Final Project/Thesis Submission
MSc in Software & Information Systems
Department of Information Technology
National University of Ireland, Galway
Student Name: Patrick Collins
Telephone: +353874189477
Email: [email protected]
Date of Submission: 29 August 2008.
Title of Submission: Using PageRank to Analyse Incident Data
Supervisor Name: Owen Molloy
Certification of Authorship:
I hereby certify that I am the author of this document and that any assistance I
received in its preparation is fully acknowledged and disclosed in the document. I have
also cited all sources from which I obtained data, ideas or words that are copied directly
or paraphrased in the document. Sources are properly credited according to accepted
standards for professional publications. I also certify that this paper was prepared by me
for the purpose of partial fulfilment of requirements for the Degree Programme.
Signed: Date: 29 August 2008
“The important thing in science is not so much to obtain new facts as to
discover new ways of thinking about them.”
Sir William Bragg
British physicist (1862 – 1942)
Acknowledgements
Eugene Maxwell & John Atherly of the IT team,
American Power Conversion,
for their assistance in finding source data.
Owen Molloy,
Thesis Supervisor,
for guidance along the way.
Noel Fegan, Ray Fallon, Paul Bohan and Martina Kiely,
for reviews and comments.
All past and present members of the Software Development team,
American Power Conversion, Galway,
for inspiration, assistance and camaraderie.
Martina Kiely,
for motivation and encouragement.
Table of Contents
Acknowledgements........................................................................................................4
1. Abstract....................................................................................................................10
2. Introduction..............................................................................................................11
2.1. Research Objectives.........................................................................................11
3. Review of Literature.................................................................................................12
3.1. Information Technology Infrastructure Library...............................................12
3.2. Incident Management.......................................................................................14
3.3. Problem Management.......................................................................................15
3.4. Problem Management in Practice.....................................................................17
3.4.1. Reactive Problem Management................................................................18
3.4.2. Proactive Problem Management...............................................................19
3.5. Incident Analysis..............................................................................................22
3.5.1. Model Based Approaches.........................................................................22
3.5.2. Rule Based Approaches............................................................................24
3.5.3. Codebook Approaches..............................................................................25
3.5.4. Case Based Approaches............................................................................26
3.5.5. Other Approaches.....................................................................................30
4. Methodology............................................................................................................32
4.1. Proposed approach............................................................................................32
4.2. Introduction to PageRank.................................................................................33
4.3. IncidentRank.....................................................................................................34
5. Methods....................................................................................................................37
5.1. Data...................................................................................................................37
5.2. Experiment 1: Pareto Analysis.........................................................................38
5.2.1. Data...........................................................................................................38
5.2.2. Apparatus..................................................................................................38
5.2.3. Procedure..................................................................................................38
5.2.4. Results.......................................................................................................39
5.2.4.1. Pareto of Category.............................................................................39
5.2.4.2. Pareto of Type...................................................................................40
5.2.4.3. Pareto of Item....................................................................................41
5.3. Experiment 2: PageRank Analysis...................................................................42
5.3.1. Data...........................................................................................................42
5.3.2. Apparatus..................................................................................................42
5.3.2.1. Database............................................................................................42
5.3.2.2. Tagging.............................................................................................43
5.3.2.3. PageRank Algorithm.........................................................................45
5.3.3. Procedure..................................................................................................48
5.3.3.1. Associating Tags with Incidents.......................................................48
5.3.3.2. Running the PageRank Algorithm....................................................49
5.3.3.3. PageRank Reports.............................................................................49
6. Results......................................................................................................................50
6.1. Experiment 1: Pareto Results...........................................................................50
6.2. Experiment 2: PageRank Results.....................................................................50
6.3. Comparison.......................................................................................................53
6.3.1. On PageRank and Tags.............................................................................53
6.3.1.1. Comparison of Incident Count to PageRank.....................................53
6.3.1.2. Comparison of Incident Count to Resolution Time..........................54
6.3.1.3. Comparison of PageRank to Resolution Time..................................55
6.3.1.4. Relationships between Tags..............................................................56
6.3.2. On PageRank and Incidents......................................................................58
6.3.2.1. Comparison of PageRank to Resolution Time..................................58
6.3.2.2. Comparison of PageRank to Urgency...............................................59
6.3.2.3. Comparison of PageRank to Priority................................................60
6.3.2.4. Comparison of PageRank to Impact.................................................61
6.3.2.5. Relationships between Priority, Urgency, Impact and Resolution
Time...............................................................................................................62
7. Discussion................................................................................................................66
7.1. Realisation of Project Aims..............................................................................66
7.2. Big Science.......................................................................................................66
7.3. Diagnosing the IT System................................................................................70
7.4. On Categorisation.............................................................................................71
7.5. The Wisdom of Crowds....................................................................................74
7.6. Analysis of Tags...............................................................................................74
7.7. Further Research...............................................................................................75
8. References................................................................................................................78
Appendix: Source Data Definition...............................................................................82
List of Tables
Table 1: Top 80% of Incidents by Category................................................................42
Table 2: Top 80% of Oracle Incidents by Type...........................................................43
Table 3: Top 80% of Oracle Operations by Item.........................................................44
Table 4: Top Ten Tags Ordered by PageRank.............................................................52
Table 5: Known Issue Tags..........................................................................................54
Table 6: Relationships between Priority, Urgency, Impact and Resolution Time.......65
Table 7: Deviation of Resolution Times by Priority....................................................66
List of Illustrations
Illustration 1: Pareto by Category................................................................................39
Illustration 2: Pareto of Oracle Types..........................................................................40
Illustration 3: Pareto of Oracle Operations Items.........................................................41
Illustration 4: Database Design....................................................................................43
Illustration 5: Tagging User Interface..........................................................................44
Illustration 6: Incident Tag Graph................................................................................45
Illustration 7: Tags Ordered by PageRank...................................................................51
Illustration 8: Comparison of Incident Count to PageRank.........................................54
Illustration 9: Correlation of Incident Count to Resolution Time................................55
Illustration 10: Correlation of PageRank to Resolution Time......................................56
Illustration 11: Example Venn of Tag Relationships...................................................57
Illustration 12: Comparison of PageRank to Resolution Time for Incidents...............59
Illustration 13: Comparison of PageRank to Urgency for Incidents............................60
Illustration 14: Comparison of PageRank to Priority for Incidents..............................61
Illustration 15: Comparison of PageRank to Impact for Incidents...............................62
1. Abstract
In this paper we discuss approaches to Incident and Problem Management
within the context of IT Service Management and its de facto standard, the
Information Technology Infrastructure Library (ITIL). We show how the Problem
Management process attempts to diagnose problem root causes by applying various
analysis techniques to historical incident data.
We propose a new categorisation mechanism. We break the data free from its
hierarchical categorisation scheme through the use of a free-form tagging system. By
allowing all system users to categorise incidents using their own terms, we show that
while individuals may differ, the aggregate metadata produced for each incident
stabilises.
Further, by applying PageRank analysis to the relationships between tags and
incidents, we hope to show useful and interesting correlations. While these may or
may not be indicative of a causal relationship, they are nonetheless new facts about
the system under scrutiny.
We conclude by showing that the approach has some merit, assuming a certain
set of minimum system requirements is met. If these requirements can be met, then
this approach can become another tool in the system administrator's arsenal of
system analysis techniques.
2. Introduction
In recent years, the IT Infrastructure Library (ITIL) has become the de facto
standard for IT service management within organisations of all sizes. It recognises
that businesses are increasingly dependent on information systems and software to
meet strategic and tactical business and end-user needs.
Within the many processes defined by ITIL, the Incident and Problem
Management processes are significant. They attempt to provide a framework for
businesses to maintain a cost-effective and high-quality IT service, benefiting both
internal departments and external end users and customers.
The Incident Management process attempts to return malfunctioning systems
to nominal operating parameters by collecting and analysing help desk incidents, or
trouble tickets. Problem Management provides analysis and trending to highlight
areas of recurrent incidents.
The effectiveness of Problem Management is directly influenced by how the
individual incidents are classified. By using a new classification mechanism, we hope
to show that the insight offered by Problem Management can also be influenced.
2.1. Research Objectives
Within the framework defined by ITIL, this thesis aims to:
● Show how incident classification influences problem management.
● Propose a new classification mechanism for incidents.
● Explore the analysis opportunities for the new classification
mechanism.
3. Review of Literature
3.1. Information Technology Infrastructure Library
The Information Technology Infrastructure Library (ITIL) was published by the
Office of Government Commerce in 2000. It is variously described as a set of
common, or best, practices for IT system operation. It introduces itself as follows
(OGC, 2000):
The ethos behind the development of the IT Infrastructure Library (ITIL)
is the recognition that organisations are increasingly dependent upon IT to
satisfy their corporate aims and meet their business needs. This growing
dependency leads to growing needs for quality IT services – quality that is
matched to business needs and user requirements as they emerge.
This is true no matter what type or size of organisation, be it national
government, a multinational conglomerate, a decentralised office with
either a local or centralised IT provision, an outsourced service provider,
or a single office environment with one person providing IT support. In
each case there is the requirement to provide an economical service that is
reliable, consistent and of the highest quality.
IT Service Management is concerned with delivering and supporting IT
services that are appropriate to the business requirements of the
organisation. ITIL provides a comprehensive, consistent and coherent set
of best practices for IT Service Management processes, promoting a
quality approach to achieving business effectiveness and efficiency in the
use of information systems. ITIL processes are intended to be
implemented so that they underpin but do not dictate the business
processes of an organisation. IT service providers will be striving to
improve the quality of the service, but at the same time they will be trying
to reduce the costs or, at a minimum, maintain costs at the current level.
ITIL goes on to define processes for the majority of IT service provision.
These extend to:
● Business Continuity Management
● Partnerships and Outsourcing
● Surviving Change
● Transformation of Business Practice through Radical Change
● Capacity Management
● Financial Management for IT Services
● Availability Management
● Service Level Management
● IT Service Continuity Management
● Customer Relationship Management
● Service Desk
● Incident Management
● Problem Management
● Configuration Management
● Change Management
● Release Management
● Network Service Management
● Operations Management
● Management of Local Processors
● Computer Installation and Acceptance
● Systems Management
For our purposes, we will concentrate on Incident and Problem management,
and to a certain extent, the Service Desk where it interfaces with these processes.
3.2. Incident Management
Any business which uses IT must learn to deal with the day-to-day minor
disruptions which occur when running an IT system. Hard drives fill, network links
fail, operators make configuration changes which sometimes misconfigure a system,
software (no matter how well tested) is assumed to contain defects, and users fail to
use systems correctly through poor usability or insufficient training.
Many businesses develop an Incident Management process for dealing with
these disruptions. The goal of the service desk is to resolve as many of these
incidents as possible before the quality of the IT service being delivered suffers.
ITIL defines an Incident management process and states its goal as being
(OGC, 2000):
The primary goal of Incident Management process is to restore normal
service operation as quickly as possible and minimise the adverse impact
on business operations, thus ensuring that the best possible levels of
service quality and availability are maintained. 'Normal service operation'
is defined here as service operation within Service Level Agreement
(SLA) limits.
ITIL recommends that as much detail as possible be maintained about each
incident, such as reporter, time, systems affected, relationships with similar incidents,
etc. This information is annotated over time as the incident is resolved, before finally
being archived in a knowledge base. This can then become the basis for faster
resolution of the same or similar incidents in future.
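To make the recommendation concrete, the following Python sketch models an incident record carrying the kind of detail ITIL recommends. The field names and structure here are our own illustration, not a schema taken from ITIL or from the source data used later in this thesis:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class Incident:
    """Illustrative incident record capturing the detail ITIL recommends."""
    incident_id: str
    reporter: str
    reported_at: datetime
    affected_systems: List[str]
    related_incidents: List[str] = field(default_factory=list)
    resolution_notes: List[str] = field(default_factory=list)
    resolved: bool = False

    def annotate(self, note: str) -> None:
        # Detail is annotated over time as the incident is worked on.
        self.resolution_notes.append(note)

inc = Incident("INC-0001", "pcollins", datetime(2008, 8, 1), ["oracle-db"])
inc.annotate("Restarted database listener; monitoring for recurrence.")
```

Once resolved, records like this can be archived in a knowledge base and searched when the same or similar incidents recur.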
3.3. Problem Management
While Incident Management defines a process for quickly returning a system
to normal operational levels, problem management is defined by ITIL as (OGC,
2000):
The goal of Problem Management is to minimise the adverse impact of
Incidents and Problems on the business that are caused by errors within
the IT Infrastructure, and to prevent recurrence of Incidents related to
these errors. In order to achieve this goal, Problem Management seeks to
get to the root cause of Incidents and then initiate actions to improve or
correct the situation.
The Problem Management process has both reactive and proactive
aspects. The reactive aspect is concerned with solving Problems in
response to one or more Incidents. Proactive Problem Management is
concerned with identifying and solving Problems and Known Errors
before Incidents occur in the first place.
ITIL takes a very IT-centric view of both Incident and Problem management.
Given the above definitions for both Incident and Problem management, one may
conclude that the solution to all IT system problems is a change in IT infrastructure.
This is not necessarily the case, as there are many reasons why a business may not
be in a position to make an IT infrastructure change: the system may have been
supplied by a third-party vendor, or there may be time constraints, cost constraints,
or a lack of skills within the organisation. In these cases, workarounds, process
changes, publishing of additional documentation and user training may be easier or
less costly to implement as a solution to IT service incidents.
Kajko-Mattsson (2002) discusses the role of problem management from a
more holistic perspective. She breaks problem management into two areas, Software
Problem Management and Continuous Software Management Process Improvement,
stating:
Software problem management process is the dominating process within
corrective maintenance. Its main role is to attend to the reported software
problems in software products. This is mainly achieved by collecting
information on software problems, identifying their underlying defects
and removing these defects. Within the scope of this role, problem
management process should additionally provide data relevant for
assessing product quality and reliability.
The other role of problem management is to provide a basis for
continuous process analysis, process improvement and defect prevention.
This can be achieved by studying the defects, and by identifying and
analysing the process states during which these defects were injected.
Identification and analysis of the process steps may then aid in diagnosing
the root causes of these defects. This should in turn result in appropriate
process improvement actions to prevent the defects from recurring.
Kajko-Mattsson goes further than ITIL to suggest tracking and improvement
of the problem management processes themselves. This supports the findings of
Oppenheimer & Patterson (2002), who show that operator error is a leading cause of
system failure, stating:
From a study of 62 user-visible failures in three large-scale Internet
services, we observe that front-ends are a more significant problem than is
commonly believed, that operator error and network problems are leading
contributors to user-visible failures, and that more thoroughly exposing
and handling component failures would reduce failure rates in at least one
service.
Kajko-Mattsson's approach also challenges ITIL's assumption that all IT
service issues can be solved via a change to the IT infrastructure. In the majority of
cases it holds true that IT changes solve IT issues. In a minority of cases, for
example where a business is running a software system which it cannot change, IT
issues could be solved by providing better documentation or giving the user
community additional training on system use.
3.4. Problem Management in Practice
Problem management, under the ITIL definition, takes the records of Incidents
submitted through the help or service desk, and attempts to apply some analysis on
them to uncover underlying root causes. These root causes are then addressed
through planned changes to the structure of the underlying IT infrastructure, or
through changes to the operational processes.
In order to facilitate this, ITIL requires a certain Incident management
protocol to allow for easier incident analysis. ITIL recommends (OGC, 2000), “an
effective automated registration of Incidents, with an effective classification, is
fundamental for the success of Problem Management”. ITIL also recognises some
risks to problem management, such as:
● Absence of a good Incident control process, and thus the absence
of detailed historical data on Incidents (necessary for the correct
identification of Problems).
● Failure to link Incident records with Problem/Error records means a failure to
gain many of the potential benefits. This is a key feature in
moving from reactive support to a more planned and proactive
support approach.
● Failure to set aside time to build and maintain the knowledge base
will restrict the delivery of benefits.
● An inability to determine accurately the business impact of
Incidents and Problems. Consequently the business-critical
Incidents and Problems are not given the correct priority.
3.4.1. Reactive Problem Management
Reactive problem management is largely taken care of by the Incident
Management process. The goal here is to (Microsoft, 2007):
● Identify and take ownership of problems affecting infrastructure
and service.
● Take steps to reduce the impact of incidents and problems.
● Identify the root cause of problems and initiate activity aimed at
establishing workarounds or permanent solutions to these
identified problems.
3.4.2. Proactive Problem Management
While reactive problem management, or incident management, attempts to
reduce the impact of system failures as quickly as possible, proactive problem
management takes a longer view. Using recorded problem and incident data, trend
analysis can be performed to predict future problems and enable prioritisation of
problem management activities (Microsoft, 2007).
When businesses invest in IT systems, they would like to be assured that the
new system will be operational when needed. The availability of a system is a
measure used to give an indication of the probability that a system will be available at
any given moment. This is usually a major consideration for IT system purchasers.
Gokhale, Crigler, Farr & Wallace (2005) discuss the relationship between
availability and reliability saying:
The reliability of a system may be defined as the probability of failure-free
operation for a specified period of time in a specified environment.
Reliability is a key metric for many life-critical systems that are required
to operate without failure for a given period of time. Many systems,
however, are capable of tolerating some failures and continue to operate
despite failures, perhaps in a degraded mode. Also, even though a failure
causes total loss of service, the underlying fault may be repaired in order
to restore the system back into operation. For such repairable systems as
well as for systems which are capable of operating in a degraded mode,
availability is a more relevant metric than reliability. The availability of a
system is defined as the ability of the system to be in a state to perform a
required function at a given instant of time or any instant of time within a
given time interval. A crucial difference between reliability and
availability is that reliability refers to failure-free operation during an
entire interval, while availability refers to failure-free operation at a given
instant of time. This time is usually the instant when the system is first
accessed to provide a required function or service.
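The steady-state availability described above is commonly computed from mean time between failures (MTBF) and mean time to repair (MTTR). A minimal sketch follows; the figures used are invented purely for illustration:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the long-run fraction of time a
    repairable system spends in an operational state."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A system failing every 500 hours on average, and repaired in 2 hours
# on average, is available roughly 99.6% of the time.
print(round(availability(500.0, 2.0), 4))  # 0.996
```

Note that this is the availability of a repairable system at steady state; it says nothing about reliability, i.e. the probability of surviving a whole interval without failure.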
Mockus (2006) discusses how the availability of a system can be estimated
given the data recorded by the help desk, stating:
Our primary contribution is to propose a method to assess empirically
software availability and reliability based on information from operational
customer support and inventory systems. In addition, the novelty of our
approach has several aspects. The precise information about the system
population, configuration, and age is linked to the outage information in
order to produce more accurate estimates of availability. The
methodology of data collection to estimate availability of software is
proposed. The experiences and findings applying the approach to a large
enterprise communication system are discussed. We ask several practical
and theoretical questions and evaluate them based on the obtained results.
In particular, we compare samples to obtain approximations of reliability
with more accurate, but harder to obtain estimates. We also evaluate if
the common reliability measure of mean time between failures (MTBF) is
appropriate for varying system run times by investigating the hazard
function.
Mockus' approach could be considered appropriate for a business analysing an
IT system which is to be deployed but has not been developed in house. It is
arguable that a Model-Based Approach (discussed in the next sections) would give
the best indication of a system's availability, notwithstanding the difficulties this
approach imposes.
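In the spirit of Mockus' empirical approach, failure timestamps recorded by a help desk can yield a simple MTBF estimate. The sketch below is only a plain average of inter-failure gaps, not Mockus' full methodology (which also accounts for system population, configuration and age), and the outage dates are invented:

```python
from datetime import datetime

def mean_time_between_failures(failure_times):
    """Estimate MTBF in hours from a list of recorded failure timestamps."""
    ts = sorted(failure_times)
    if len(ts) < 2:
        raise ValueError("need at least two failures to estimate MTBF")
    # Average the gaps between consecutive failures, converted to hours.
    gaps = [(b - a).total_seconds() / 3600.0 for a, b in zip(ts, ts[1:])]
    return sum(gaps) / len(gaps)

outages = [datetime(2008, 1, 1), datetime(2008, 1, 11), datetime(2008, 1, 31)]
print(mean_time_between_failures(outages))  # 360.0
```

As Mockus notes, a single MTBF figure assumes a constant hazard rate; examining the hazard function over varying run times tests whether that assumption holds.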
Regardless of how availability is estimated, a real value will, given time, be
arrived at by measuring the system as it is used. Businesses deploy Incident
Management and Problem Management processes to ensure that the availability of a
system is as high as possible and that disruption to customers and business processes
is minimised. Jantti & Eerola (2006) note:
The primary goal of problem management is to minimise the impact of
problems on the business and to identify the root cause of problems.
According to a recent IT service management survey, the problem
management process is one of the main development targets for many
organisations in the near future. Many organisations have realised the
value of problem management in preventing failures and problems.
However, IT organisations do not have a clear understanding of the basic
concepts of problem management process and the relationships between
the concepts. This is mainly due to the complex IT service management
standards that cause difficulties in the implementation of problem
management. A well-designed problem management model helps
organisations to prevent problems (proactive problem management), to
resolve reported problems effectively (reactive problem management),
and also takes into consideration cost, effort and quality aspects.
3.5. Incident Analysis
Given a set of IT service incidents collected over time by the Service Desk, or
other means, the problem management process mandates analysing these incidents in
an effort to uncover the root causes, and implement permanent solutions to them.
Hanemann, Sailer and Schmitz (2004) discuss various incident correlation approaches.
They classify these into four areas: Model-Based Reasoning (MBR), Rule-Based
Reasoning (RBR), a Codebook approach, and Case-Based Reasoning (CBR).
3.5.1. Model Based Approaches
Hanemann, Sailer and Schmitz (2004) note of model-based approaches to
incident correlation and problem discovery:
Model-based reasoning (MBR) represents a system by modelling each of
its components. A model can either represent a physical entity or a logical
entity (e.g. LAN, WAN, domain, service, business process, etc.). The
models for physical entities are called functional model, while the models
for all logical entities are called logical model. A description of each
model contains three categories of information: attributes, relations to
other models, and behaviour. The event correlation is a result of the
collaboration among models.
Gokhale, Crigler, Farr & Wallace (2005) show how a model-based approach is
used to produce a closed-form expression of system availability, stating:
We present a system availability model which considers failure severities
in conjunction with system structure. Based on the model, we obtain a
closed form expression which relates system availability to the failure and
repair parameters of the components. We then describe availability
analysis of a satellite system using the model based on the data collected
during the acceptance testing of the system.
They also note that not all failure states carry the same weight with regard to
system availability, saying (Gokhale, Crigler, Farr & Wallace, 2005):
In the literature, model-based analysis has regarded all the failures of all
the components to be equivalent. The consequence of each failure on the
services provided by the system is considered to be the same. In other
words, each failure is considered to be the same level of severity. As a
result, redundancy is used to tolerate some failures and provide degraded
mode of operation and repair/restoration is used to bring the system back
into a completely operational state. In many real-life systems, however,
all failures do not always have the same impact on system services. In
fact, failures are typically classified into multiple severity levels, where
failures belonging to the highest severity level cause a complete loss of
service, while failures belonging to levels below the highest level enable
the system to operate in a degraded mode. Thus, the system is capable of
tolerating low severity failures without employing any other means such
as redundancy. This makes it necessary to consider the influence of
failure severities on system availability in conjunction with the system
structure.
An advantage of a model based approach is that a model of the system is
developed, showing its various states and state transitions. This can be used to
analyse the system on an ongoing basis as a Markov model, and allows a
deterministic approach to system management. The disadvantage of this approach is
that creating and maintaining a system model becomes exponentially more difficult
as the system grows. As such, apart from several niche applications, such as the
satellite systems discussed by Gokhale, Crigler, Farr & Wallace (2005), model based
reasoning is generally not used by system administrators.
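For a single component, the Markov view described above can be made concrete with a minimal two-state (up/down) model. The following sketch is illustrative only: the failure and repair rates are assumed values, not figures from any cited study.

```java
// Minimal two-state Markov availability sketch: a component is either Up
// or Down, failing at rate lambda and being repaired at rate mu. The
// steady-state availability of such a chain is mu / (lambda + mu).
public class AvailabilityModel {

    /** Steady-state probability of the Up state for a two-state Markov chain. */
    static double steadyStateAvailability(double lambda, double mu) {
        return mu / (lambda + mu);
    }

    public static void main(String[] args) {
        // Assumed rates: one failure per 1000 hours, 10-hour average repair.
        double lambda = 1.0 / 1000.0;   // failure rate per hour
        double mu = 1.0 / 10.0;         // repair rate per hour
        System.out.printf("Availability: %.4f%n", steadyStateAvailability(lambda, mu));
    }
}
```

For realistic systems the chain has many states (one per combination of component states), which is exactly why the modelling effort grows so quickly.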
3.5.2. Rule Based Approaches
Hanemann, Sailer and Schmitz (2004) discuss rule based approaches, stating:
Rule-based reasoning (RBR) uses a set of rules for event correlation. The
rules have the form of “conclusion, if condition”. The condition uses
received events and information about the system, while the conclusion
contains actions which can either lead to system changes or use system
parameters to choose the next rule.
An advantage of a rule based approach is the possibility of automatic system
healing: given the system's state, the state of any incidents raised, and rules developed
over time, the system can begin to take automatic recovery steps.
An example of this is noted by BMC Software (2006): when a hard disk
partition approaches a critical high-water mark for disk space, an incident is
automatically logged to the incident management system. This in turn sends a
response to the system asking it to delete any temporary or unused files, thus
automatically freeing some disk space.
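The "conclusion, if condition" rule structure, applied to the disk-space example above, might be sketched as follows. The 90% threshold, the rule name, and the cleanup flag are illustrative assumptions, not details of BMC's product.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Sketch of rule-based reasoning: each rule pairs a condition on the
// observed system state with a conclusion (a recovery action).
public class RuleEngine {

    /** Observed system state; only the fields needed for this example. */
    static class SystemState {
        double diskUsage;          // fraction of the partition in use
        boolean cleanupTriggered;  // set when the recovery action fires
        SystemState(double diskUsage) { this.diskUsage = diskUsage; }
    }

    /** A rule in "conclusion, if condition" form. */
    record Rule(String name, Predicate<SystemState> condition, Consumer<SystemState> conclusion) {}

    /** Fire the conclusion of every rule whose condition holds. */
    static void evaluate(List<Rule> rules, SystemState state) {
        for (Rule r : rules) {
            if (r.condition().test(state)) {
                r.conclusion().accept(state);
            }
        }
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(new Rule(
                "free-disk-space",
                s -> s.diskUsage > 0.90,            // condition: past the high-water mark
                s -> s.cleanupTriggered = true));   // conclusion: delete temporary files
        SystemState state = new SystemState(0.95);
        evaluate(rules, state);
        System.out.println("Cleanup triggered: " + state.cleanupTriggered);
    }
}
```

The maintenance burden discussed below follows directly from this shape: every architectural change potentially invalidates some predicate or action in the rule set.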
An approach like this allows many of the most frequent and most easily fixed
incidents to be handled automatically, freeing system administration staff to deal
with the less frequent though more involved incidents.
Rule based reasoning does have its disadvantages. As with the model based
approaches, the system rules need to be constantly kept in line with the system
architecture. As problems are resolved, patches deployed, new system features
introduced, and other day-to-day changes made to the IT system, the set of system
rules needs to be revalidated each time. This imposes significant effort on system
administrators, who must ensure that applying the system rules does not lead the
system into a misconfigured or suboptimal state.
3.5.3. Codebook Approaches
Hanemann, Sailer and Schmitz (2004) define the codebook approach as:
The codebook approach has similarities to RBR, but takes a further step
and encodes the rules into a correlation matrix.
The approach starts using a dependency graph with two kinds of nodes for
the modelling. The first kind of node are the faults (problems/root
causes) which have to be detected, while the second kind of nodes are
observable events (symptoms/incidents) which are caused by the faults or
other events. The dependencies between the nodes are denoted as directed
edges. It is possible to choose weights for the edges, e.g., a weight for the
probability that fault/event A causes event B.
After a final input graph is chosen, the graph is transformed into a
correlation matrix where the columns contain the faults and the rows
contain the events. If there is a dependency in the graph, the weight of the
corresponding edge is put into the according matrix cell.
The codebook approach has the advantage that it uses long-term
experience with graphs and coding. This experience is used to minimize
the dependency graph and to select an optimal group of events with
respect to processing time and robustness against noise. A disadvantage
of the approach could be that similar to RBR frequent changes in the
environment make it necessary to frequently edit the input graph.
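The transformation from dependency graph to correlation matrix described above can be sketched as follows. The fault names, event names, and edge weights are hypothetical examples, not data from any cited system.

```java
// Sketch of the codebook transformation: weighted edges from faults to
// the events they cause are written into a matrix whose columns are the
// faults and whose rows are the events.
public class Codebook {

    static final String[] FAULTS = {"disk-failure", "link-down"};
    static final String[] EVENTS = {"db-timeout", "page-unreachable", "backup-missed"};

    /** Build the correlation matrix from {eventIndex, faultIndex, weight} edges. */
    static double[][] buildMatrix(double[][] edges) {
        double[][] m = new double[EVENTS.length][FAULTS.length];
        for (double[] e : edges) {
            m[(int) e[0]][(int) e[1]] = e[2];   // weight = P(fault causes event)
        }
        return m;
    }

    public static void main(String[] args) {
        double[][] edges = {
            {0, 0, 0.9},  // disk-failure causes db-timeout with probability 0.9
            {2, 0, 0.5},  // disk-failure causes backup-missed
            {1, 1, 0.8},  // link-down causes page-unreachable
        };
        double[][] m = buildMatrix(edges);
        System.out.println("P(db-timeout | disk-failure) = " + m[0][0]);
    }
}
```

Decoding then consists of matching an observed vector of events against the columns of this matrix to find the most likely fault.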
3.5.4. Case Based Approaches
Finally, Hanemann, Sailer and Schmitz (2004) discuss case based reasoning,
stating:
In contrast to other approaches case-based reasoning (CBR) systems do
not use any knowledge about the system structure. The knowledge base
saves cases with their values for system parameters and successful
recovery actions for these cases. The recovery actions are not performed
by the CBR system in the first place, but in most cases by a human
operator.
Of these approaches, ITIL recommends only case based reasoning for
analysing and solving incidents. ITIL's processes are structured around Service Desk
personnel and system operators creating, maintaining, and using a knowledge base to
diagnose incidents and problems, implement workarounds, and apply permanent fixes.
The CBR approach to incident management has some advantages, in that a
body of knowledge (knowledge base) is built up over time as service incidents are
logged and corrected. While this serves the needs of Incident Management, additional
processing is required for root cause analysis, as mandated by proactive Problem
Management.
Card (1993, 1998) discusses defect causal analysis. He offers the following
advice on the classification and analysis of problems (Card, 1998):
Classifying or grouping problems helps to identify clusters in which
systematic errors are likely to be found. You should select the
classification schemes to be used when you set up the Defect Causal
Analysis (DCA) process. Moreover, the meeting itself will go faster if
you classify the problems to be analysed according to a predefined
classification scheme. Ideally, each problem should be classified by the
programmer when implementing its fix. Alternatively, the moderator may
classify the problems prior to the group meeting. Three dimensions are
especially useful for classifying problems:
● When was the defect that caused the problem inserted into the
software?
● When was the problem detected?
● What type of mistake was made or defect introduced?
The first two classification dimensions correspond to activities or phases
in the software process. The last dimension reflects the nature of the work
performed and the opportunities for making mistakes. Some commonly
used error types include interface, computational, logical, initialisation
and data structure. Depending on the project's nature, you can add other
classes such as documentation and user interface.
You can produce tables or Pareto charts to help identify problem clusters.
A Pareto chart is a bar chart that shows the count of problems by type in
decreasing order of frequency.
Leszak, Perry & Stoll (2000) show how this type of analysis can be used
during a case study of the development cycle of a software application. It is
interesting to note that in their study, Leszak, Perry & Stoll (2000) state:
An important study decision was to allow for several root causes to be
specified during analysis of each modification request (MR). The
intuition is that there may well be several factors contributing to the
occurrence of a defect. Thus, in addition to phase, we have allowed
human, project and review root causes to be specified. These four
dimensional root cause classifications give indications as to what played a
role in a defect's occurrence.
Pareto and Ishikawa analyses are also suggested by ITIL (OGC, 2000) as means of
problem discovery, as they note:
Categorisation of Incidents and Problems and creative analysis may reveal
trends and lead to the identification of specific (potential) Problem areas
that need further investigation. For instance, analysis may indicate that
Incidents related to the usability of recently installed client-server systems
is the Problem area that has the most growth in terms of negative impact
on business.
Analysis – for example of events from System Management tools,
literature, conferences and feedback from User groups – can also reveal
possible Problems deserving further investigation. Organising workshops
with prominent Customers or conducting Customer surveys can also lead
to the identification of trends and (potential) Problem areas.
Analysis of Problem Management data may reveal:
● That Problems occurring on one platform may occur on another
platform – for example, a Problem concerning network software
on a midrange system may well be of significance on a mainframe
system.
● The existence of recurring Problems – for example, if three routers
are substituted serially, because of the same failure, it may indicate
that the router type concerned is not appropriate and should be
replaced by another type, or when a software application is
involved then complete redevelopment might be necessary which
would be classed as a major Change.
Pareto and Ishikawa approaches to defect analysis have one major advantage:
they are not specific to software defects, and it is likely that administrators and
management within businesses understand and are comfortable using them.
While Pareto charts and Ishikawa diagrams may be widely used, they are
labour intensive. ITIL recommends their use, but it also recommends that problem
management committees meet to discuss and analyse incidents, and to do the work of
prioritising efforts towards permanently fixing problems or the root causes of incidents.
Given that this approach requires much human effort and is not easily
automated, Barash, Bartolini & Wu (2007) discuss how best to organise problem
handling processes so that incidents and problems are resolved in the most efficient
way. CBR relies on human operators to change the IT system so as to effect a fix for
an incident or problem. For large IT systems, it is not uncommon for teams of
administrators to be spread across geographies and time zones.
Barash, Bartolini & Wu (2007) provide a system of metrics to analyse how
incidents are routed between groups of administrators, so as to help define better,
more efficient incident handling protocols. The ultimate goal of their model is to
reduce the time to fix for an incident, while also reducing the business disruption
caused by it.
Their approach is less about the incidents than the human operators who have
to deal with them. While other authors deal with correct recording, and classification
of incidents, and analysis of problems to uncover root causes, Barash, Bartolini & Wu
(2007) are more interested in understanding “the improvements brought about by
restructuring the support organisation by increasing or cutting staffing levels, moving
operators around support groups (possibly on retraining), and even implementing
different prioritisation policies for the technician when dealing with queues of
incidents”.
3.5.5. Other Approaches
Talluru & Deshmukh (1995) introduce a decision support system for the
purpose of aiding problem management, stating that “a decision support system (DSS) is
characterised as a computer based information processing system which allows the
decision maker to interact with the problem solving process”. Although they only
develop a prototype system, they report some success in their findings, stating:
Using this DSS, managers can solve recurring problems by referring to
previously solved problems. It aids advanced users by providing all the
semantics. It helps the decision maker to do background analysis and
expert opinion sampling. The decision maker can also communicate
effectively with the quantitative analysts and other intermediaries. As the
problem manager stores all the problem situations, solutions, and in
between transformations, experts can use this information to fine tune the
knowledge periodically. It can improve the productivity of the decision
maker by cutting down turnaround times. This facility can also be used to
train entry level managers.
Apostolov (2006) monitors a network of electrical devices using continuous
automatic monitoring. While his application is hardware based, the analogous
software technique is periodic service probing. In this scenario, a software agent
would periodically probe or test a system service to ensure it is both functionally
correct and meeting service levels. Any non-conformances would automatically be
recorded to the Incident Management process incident database. BMC's Performance
Manager uses a technique akin to this for proactive incident and problem
management (BMC Software, 2006).
4. Methodology
This section outlines the approach taken to answer the basic research questions
identified in Chapter 1.
4.1. Proposed approach.
As previously shown, several means of incident analysis exist, each with
advantages and disadvantages for any given system. Any new means of incident
analysis would have to be generally applicable if it were to achieve adoption. Given
the dominance of Pareto and Ishikawa as incident analysis tools, the results of any
new means of analysis would also have to be as good as or better than those
approaches. Any reduction in the time it takes to analyse historical incident data
would also be well received.
In the literature there is a tacit acceptance that underlying root causes are not
obvious, and express themselves only through incidents in which software services
fail to meet expected functionality or service levels. Much as a doctor diagnoses a
disease by correlating symptoms, so system administrators hope to uncover the
underlying causes of system failures through the analysis of incident reports.
The standard software disruption metric takes the number of occurrences of an
incident, together with their severity, into account. While this is useful in an Incident
Management process for deciding which incidents should be prioritised, the Problem
Management process would also like to take incident correlations into consideration.
The literature shows an understanding that a problem's root cause can express itself
by causing more than one type of incident record to be logged, as root causes affect
different systems and users differently.
If the ITIL advice is followed and, as new incidents are logged, they are
related to existing and historical incident records, a “web of incident citations”
emerges. Additional metadata can be captured here through the use of tagging,
whereby any user of the system can categorise an incident using free-form keywords.
Over time, as users categorise and tag incidents, a body of metadata grows around
each incident. This web of citations between incidents and tags can also be analysed.
Using the PageRank algorithm to analyse these webs, much as it analyses the
web of interconnected Internet pages, should reveal where incidents are clustering.
Using this information, a new metric based on PageRank can be produced, to be used
in both the Incident and Problem Management processes to direct system
administrators and managers in deploying their resources for maximum impact in
finding and resolving problem root causes.
4.2. Introduction to PageRank
Brin & Page (1998) developed the PageRank algorithm while studying at
Stanford University. They introduce it, writing:
Academic citation literature has been applied to the web, largely by
counting citations or backlinks to a given page. This gives some
approximation of a page's importance or quality. PageRank extends this
idea by not counting links from all pages equally, and by normalizing by
the number of links on a page. PageRank is defined as follows:
We assume page A has pages T1...Tn which point to it (i.e., are citations).
The parameter d is a damping factor which can be set between 0 and 1.
We usually set d to 0.85. Also C(A) is defined as the number of links
going out of page A. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
PageRank or PR(A) can be calculated using a simple iterative algorithm,
and corresponds to the principal eigenvector of the normalized link matrix
of the web. Also, a PageRank for 26 million web pages can be computed
in a few hours on a medium size workstation.
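The quoted formula can be realised directly as the simple iterative algorithm Brin & Page describe. The following is a minimal sketch, not the thesis's implementation: the example link graph is hypothetical, and dangling nodes (with no outgoing links) simply do not redistribute their rank here.

```java
// Minimal iterative PageRank over an adjacency list, following the quoted
// formula: PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti)) over pages Ti linking to A.
public class PageRank {

    /**
     * links[i] holds the indices of the pages that page i links TO.
     * Returns each page's rank after the given number of iterations.
     */
    static double[] compute(int[][] links, double d, int iterations) {
        int n = links.length;
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, 1.0);              // common starting value
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, 1.0 - d);    // the (1 - d) term
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) continue;  // dangling node: rank not redistributed here
                double share = pr[i] / links[i].length;   // PR(Ti) / C(Ti)
                for (int j : links[i]) next[j] += d * share;
            }
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        // Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
        int[][] links = {{1, 2}, {2}, {0}};
        double[] pr = compute(links, 0.85, 50);
        for (int i = 0; i < pr.length; i++) {
            System.out.printf("PR(%d) = %.4f%n", i, pr[i]);
        }
    }
}
```

In this graph page 2 receives links from both other pages and so ends with the highest rank, matching the intuition that rank accumulates at heavily cited nodes.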
4.3. IncidentRank
The benefit of the PageRank algorithm is that each web page in a set of web
pages has its importance or quality measured by the number of incoming links or
citations. It is this characteristic of the PageRank algorithm that the MRSA
researchers leveraged in order to find which interactions within a ward had the most
effect in spreading MRSA. Simonite (2008), writing on Shepherd's research, notes:
“Our new model is based very much on the way Google has achieved
number one status among search engines. When Google's spiders crawl
the web they build up a connectivity matrix of links between pages”.
Shepherd's idea is to build a similar matrix describing all interactions
between people and objects in a hospital ward, based on observing normal
daily activity. “Obviously nurses move among patients and that can
spread infection, but they also touch light switches and lots of other
surfaces too. If you observe a network of all those interactions you can
build a matrix of which nodes in the network are in contact with which
other nodes”.
Shepherd has started testing the technique using data gathered for another
study. "We sussed out in one ward that the chief node was a light switch,"
he says. "It could potentially distribute infection to the rest of the ward
very quickly."
This approach has many parallels with incident and problem management in
software systems. In a codebook approach we would model a graph of problems with
related incidents. Given the Pareto principal, one would expect that the majority of
incidents or infections logged come from a minority of root causes. Finding and
addressing the root causes is the goal of Problem Management. As previously
discussed, a variety of correlation techniques can be brought to bear on the data.
ITIL and other problem management frameworks recommend that incidents,
when analysed, should be linked either with known problems or with other incidents
to aid root cause analysis. It is our proposal that the PageRank algorithm be brought
to bear on these records.
Using a similar approach when analysing linked incident records should
reveal similar results in finding which incidents have the most 'knock-on' effects in a
software system. It is hoped that the use of the PageRank algorithm in this instance
would help narrow the focus of the problem management team to the issues which
cause, or could potentially cause, the most system disruption.
When the problem management team tries to decide which incidents and
problems to prioritise, an incident's assigned rank can be considered. A prioritisation
metric could be developed by combining the incident rank with a ranking of potential
business disruption. The product of these would give a good prioritisation score for
each incident, directing system administrators in focusing their efforts.
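As a sketch of such a combined metric: the 1-5 disruption rating scale below is an illustrative assumption, not part of the thesis data.

```java
// Sketch of the proposed prioritisation metric: the product of an
// incident's PageRank-style rank and a business-disruption rating.
public class Prioritisation {

    /** Higher scores mean the incident deserves attention sooner. */
    static double priority(double incidentRank, int disruptionRating) {
        return incidentRank * disruptionRating;
    }

    public static void main(String[] args) {
        // A moderately ranked incident with a severe (4 of 5) disruption rating.
        System.out.println(priority(1.4, 4));
    }
}
```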
By analysing how incidents are related, one would expect this approach to:
● Show which incidents have the greatest, or potentially greatest, system impact.
● Be as effective as or better than existing Pareto approaches.
● Allow automation of as much of the incident/problem analysis process as
possible.
● Require a minimum of operational process change. That is, we do not propose
a completely new way of working, but use the data already available from
existing procedures.
5. Methods
5.1. Data
This section briefly describes the source data used for all experiments in this
thesis.
The source data was kindly donated by Eugene Maxwell & John Atherly. It
consists of 19220 rows of incident data, with each row representing a separate
incident logged to an Incident Management system. The time period covered by this
data is approximately the first 3.5 months of 2008. A brief description of each
column and its type is given in the Appendix.
Since the source software system is designed for Pareto analysis, each row is
categorised into a hierarchy. The 'Category', 'Type', and 'Item' fields capture a high,
mid and low level categorisation of the incident.
These organise each incident into a tree style categorisation. For example, the
Category could specify 'Operating System', the Type, 'Windows XP', and the Item,
'Spyware'. Thus incidents are grouped into various sets, and related to each other.
Business intelligence style reports can roll up or drill down through the sets of
incidents by including and excluding various values for Category, Type and Item.
The 'Case Category Type' field provides a concatenation of all three sub fields for
convenience.
Other fields within the data support other reporting opportunities, such as the
Arrival Time, Assigned Time, Work In Progress Time, etc. As the incident is
progressed through the analysis and resolution work flow, the incident is transitioned
through several statuses. As the status of the incident changes, timestamps are
recorded, and the amount of time the incident spent in each phase can be deduced.
This is helpful in supporting the analysis of which individual incidents, and which
groups of incidents, take the most time to resolve.
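Deducing the time spent in each phase is then a matter of subtracting consecutive status timestamps. A minimal sketch follows; the phase names echo the columns described above, and the example timestamps are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Sketch of deducing how long an incident spent in each phase from its
// status-change timestamps.
public class PhaseDurations {

    /** Time spent in a phase, given when the incident entered and left it. */
    static Duration timeInPhase(LocalDateTime entered, LocalDateTime left) {
        return Duration.between(entered, left);
    }

    public static void main(String[] args) {
        LocalDateTime assigned = LocalDateTime.of(2008, 2, 1, 9, 0);
        LocalDateTime workInProgress = LocalDateTime.of(2008, 2, 1, 10, 30);
        LocalDateTime resolved = LocalDateTime.of(2008, 2, 1, 16, 0);
        System.out.println("Queued:  " + timeInPhase(assigned, workInProgress));
        System.out.println("Working: " + timeInPhase(workInProgress, resolved));
    }
}
```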
5.2. Experiment 1: Pareto Analysis
5.2.1. Data
The system from which the source data was procured already uses Pareto
analysis for high level management of incident lists. As such, the data is readily
amenable to Pareto analysis.
As mentioned previously, the 'Category', 'Type' and 'Item' fields define a
hierarchy of categories by which the data can be analysed. Within each Category
several Types exist, and within each Type, several Items.
5.2.2. Apparatus
This experiment was run using Microsoft ® Excel on a standard desktop PC.
5.2.3. Procedure
Using the Pivot Table feature of Microsoft ® Excel, one can produce a
frequency analysis of the list of incidents. This orders the incidents into their
respective categories and gives the number of incidents within each category.
The incidents were first analysed by the contents of the Category field. This is
the highest level category under which the incidents are classified. Based on the
result of this analysis, the category with the most incidents was analysed by the
contents of the Type field. Finally, the incident type with the most occurrences was
analysed in a similar fashion, by the contents of the Item field.
Based on these results, several Pareto charts could be produced. Dorner
(1997) shows how this can be achieved.
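The same frequency analysis can be expressed outside Excel. The following is a minimal sketch; the incident list is a hypothetical stand-in for the Category column of the source data.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the Pareto frequency analysis performed with pivot tables:
// count incidents per category, sort by descending frequency, and report
// the cumulative percentage of each category.
public class ParetoAnalysis {

    /** Returns category -> cumulative percentage, ordered by descending frequency. */
    static Map<String, Double> pareto(List<String> categories) {
        Map<String, Long> counts = categories.stream()
                .collect(Collectors.groupingBy(c -> c, Collectors.counting()));
        List<Map.Entry<String, Long>> sorted = counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toList());
        Map<String, Double> result = new LinkedHashMap<>();
        double total = categories.size(), running = 0;
        for (Map.Entry<String, Long> e : sorted) {
            running += e.getValue();
            result.put(e.getKey(), 100.0 * running / total);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> incidents = List.of("Oracle", "Oracle", "Oracle", "Notes", "Notes", "Hardware");
        pareto(incidents).forEach((cat, cum) ->
                System.out.printf("%-10s cumulative %.1f%%%n", cat, cum));
    }
}
```

Reading down the resulting cumulative column until it passes 80% identifies the categories that dominate the incident load.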
5.2.4. Results
The results of the Pareto analysis are given in the following sections.
5.2.4.1. Pareto of Category
The highest level of categorisation in the hierarchical scheme applied to the
source data is by the 'Category' field. Illustration 1 shows the Pareto chart of all
incidents by the contents of the Category field.
It can be seen that the majority of incidents recorded by the system are
grouped into a few top-level categories. The top 80% of incidents fall into the
categories shown in Table 1.
Illustration 1: Pareto by Category (bar chart of incident percentage and cumulative percentage by Category)
Rank Category Frequency % of Total Cumulative %
1 Oracle 3601 18.74% 18.74%
2 Notes 2474 12.87% 31.61%
3 Hardware 2220 11.55% 43.16%
4 Network 1764 9.18% 52.34%
5 Remote Access 1320 6.87% 59.21%
6 InTouch 1235 6.43% 65.64%
7 Telecom 1059 5.51% 71.15%
8 Application 1045 5.44% 76.58%
9 Software 988 5.14% 81.73%
Table 1: Top 80% of Incidents by Category
5.2.4.2. Pareto of Type
Knowing that Oracle is the highest Pareto ranked Category, we can drill
down to the Type level. Illustration 2 shows the Pareto of the Oracle Category by the
Type field. The top 80% of incidents by type are tabulated in Table 2.
Illustration 2: Pareto of Oracle Types (bar chart of incident percentage and cumulative percentage by Type)
Rank Type Frequency % of Total Cumulative %
1 Operations 1625 45.13% 45.13%
2 EM Alerts 957 26.58% 71.70%
3 Applications 903 25.08% 96.78%
Table 2: Top 80% of Oracle Incidents by Type
5.2.4.3. Pareto of Item
Finally, we run the same analysis on the Operations type, as it is the highest
ranked incident type within the Oracle category. Using the Item field we can further
analyse these incidents to find the highest ranked incident item. These are charted in
Illustration 3 and the top 80% of items tabulated in Table 3.
As can be seen from Illustration 3 and Table 3, the highest ranked item is
'Access Issues'. This categorises various incidents logged by users who were having
trouble accessing the Oracle system and its applications.
Illustration 3: Pareto of Oracle Operations Items (bar chart of incident percentage and cumulative percentage by Item).
Rank Item Frequency % of Total Cumulative %
1 Access Issues 636 38.71% 38.71%
2 Application Issues 314 19.11% 57.82%
3 Password Reset 208 12.66% 70.48%
4 Database Issues 166 10.10% 80.58%
Table 3: Top 80% of Oracle Operations by Item.
5.3. Experiment 2: PageRank Analysis
5.3.1. Data
As shown in section 5.2.4, some 636 incidents were flagged as access issues
for the Oracle system. These became the input data for experiment two.
5.3.2. Apparatus
5.3.2.1. Database
To prepare the data, the selected rows were imported from the source
Microsoft Excel spreadsheet into a Microsoft Access database, using the Import
functionality of Microsoft Access. A new table (IssueTbl) was created from the
imported data.
Once in database format, some additional tables could be built around this
data. Illustration 4 shows the database design. The IssueTbl table contains the source
data, imported from its original spreadsheet. The TagTbl contains the definitions of
tags which have been applied to incidents. The IssueTagTbl provides the bridging
table, allowing a many-to-many relationship between incidents and tags: each
incident can have multiple tags associated with it, while each tag can be applied to
multiple different incidents.
Using this structure, a graph of the interrelationships between incidents and
tags can be produced. It is this graph which is analysed using the PageRank algorithm
to rank the relative importance of each incident and tag within the system.
5.3.2.2. Tagging
In order to ease the process of adding tags to incidents, a simple user interface
was developed using the features of Microsoft Access. This is shown in Illustration 5
below. On the left, the Category, Description, Summary and Work Log are displayed.
Reading these gives the user a sense of the issue, the root cause, if recorded, and the
Illustration 4: Database Design (IssueTbl, TagTbl, and the IssueTagTbl bridging table)
steps taken in trying to resolve the issue. On the right, the set of tags which have been
applied to the incident is displayed.
When adding a tag, previous tags are looked up, and autocompletion is used
to facilitate ease and speed of tagging. If a previously undefined tag is added, the
system asks if the user wishes to add this tag to the system. The new tag is then added
to the TagTbl table, while the link is added to the IssueTagTbl table.
Over time as issues are tagged, a graph of the interrelationships between
incidents and tags can be produced. A simple example of this is shown in Illustration
6. This becomes the basis for the next step in analysis, the application of the
PageRank algorithm to the linked incident data.
Illustration 5: Tagging User Interface
As can be seen in Illustration 6, each link between a tag and an incident is
assumed to be bidirectional. This is important for the calculation of the associated
ranks, and is discussed further below.
5.3.2.3. PageRank Algorithm
Both the IssueTbl and the TagTbl contain a PageRank field. This field is
introduced to hold the ranking given to each tag and incident by the PageRank
algorithm.
The PageRank algorithm was implemented as a Java application which
connected to the Microsoft Access database, queried for incidents and tags, and ran
the process of generating the rank for each. Austin (2008) gives a good account of
how the PageRank algorithm can be implemented, which is briefly discussed here.
Illustration 6: Incident Tag Graph
[Example graph of three incident nodes linked to two tag nodes.]
To begin the process, a link matrix (L) is produced. If the number of incidents
in the system is Ni, and the number of tags in the system is Nt, then the link matrix is
a square matrix of size (Ni + Nt) × (Ni + Nt). Each row and column corresponds to
either an incident or a tag. When an incident is linked with a tag, the value 1 is
assigned to the cells where those incidents and tags intersect, that is, to L(i, j) and
L(j, i), where i and j represent the row and column numbers of the respective incident
and tag.
Based on the Link Matrix, a probability matrix (H) is produced. This encodes
the probability of moving from one node in the link graph to another, based on the
number of outgoing links from the current node. That is, if a node has three outgoing
links to other nodes, then it is assumed by the algorithm that any link could be chosen
with equal probability. The probability matrix encodes this by counting the number
of links from each node, and storing the appropriate probability.
A second probability matrix (A) is needed for those nodes which have no
explicit links to other nodes. The PageRank algorithm assumes that if a node has no
links defined, then it is linked to all other nodes with equal probability. The A matrix
encodes this by searching the link matrix for non-linked nodes and replacing their
columns with a probability of 1/(Ni + Nt).
The PageRank algorithm is based on stochastic matrices (Austin, 2008). A
property of stochastic matrices is that each column of the matrix sums to 1. In our
system, the stochastic matrix (S) is given by:
S = H + A.
Therefore our S matrix is made of columns representing the probability of
moving to another node on the graph based on explicit links, or implicitly by being
linked to all nodes.
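The construction described above can be sketched as follows. This is a minimal illustration, not the thesis implementation; the toy graph and all names are assumptions. For brevity, H and A are folded into one pass: columns with outgoing links are normalised by out-degree, and dangling columns receive the uniform probability 1/(Ni + Nt).

```java
// Sketch: build the column-stochastic matrix S = H + A from a link matrix L.
public class StochasticMatrixSketch {
    static double[][] buildS(int[][] links, int n) {
        double[][] s = new double[n][n];
        for (int j = 0; j < n; j++) {
            int outDegree = 0;
            for (int i = 0; i < n; i++) outDegree += links[i][j];
            for (int i = 0; i < n; i++) {
                // H: share probability among outgoing links; A: uniform for dangling nodes.
                s[i][j] = (outDegree == 0) ? 1.0 / n : (double) links[i][j] / outDegree;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        // Nodes 0-1: incidents, node 2: a tag, node 3: an untagged (dangling) incident.
        // Links are bidirectional, so L is symmetric.
        int n = 4;
        int[][] l = new int[n][n];
        l[0][2] = l[2][0] = 1;
        l[1][2] = l[2][1] = 1;
        double[][] s = buildS(l, n);
        // Each column of S sums to 1, the defining property of a stochastic matrix.
        for (int j = 0; j < n; j++) {
            double colSum = 0;
            for (int i = 0; i < n; i++) colSum += s[i][j];
            System.out.printf("column %d sums to %.1f%n", j, colSum);
        }
    }
}
```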
A damping factor (a) is introduced to model the movement of a 'random
surfer' on the graph. The algorithm assumes that a surfer will move from one node to
a linked node with probability a, but move to a random node with probability 1 − a.
Finally, the PageRank matrix (I) is a column vector of Ni + Nt rows. This is
initialised with one of the nodes given all the rank (represented by a value of 1). As
iterations of the algorithm run, this initial rank will be shared and spread across the
graph. Eventually the values of rank converge to a stable matrix. Several factors are
involved in this, as discussed by Austin (2008). One can continue to run the
algorithm until the values converge, or use a fixed number of iterations to arrive at a
reasonable approximation, as Austin (2008) states:
With the value of a chosen to be near 0.85, Brin and Page report that 50 to
100 iterations are required to obtain a sufficiently good approximation to
I.
Thus the Google matrix (G) defines each iteration as:
Ik+1 = GIk = aSIk + ((1 − a) / n)1
where Ik represents the kth iteration of the algorithm, n = Ni + Nt, and 1 is a column
vector of ones (the equality holds because the entries of Ik sum to 1).
Once all iterations have been complete, the I matrix holds the PageRank
values (a value between 0 and 1) for each incident and tag. The final step is to save
these values with their respective incidents and tags using the PageRank field in the
IssueTbl and TagTbl, as discussed previously. A report can then be generated
showing which tags and incidents are ranked highest, and how they relate to each
other.
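The iteration can be sketched as below, assuming the stochastic matrix S has already been built. The three-node graph (two incidents both linked to one tag) is hypothetical, and this is not the thesis code.

```java
// Sketch of the power iteration: I(k+1) = a*S*I(k) + (1 - a)/n,
// run for a fixed number of iterations, e.g. 100 with a = 0.85.
public class PageRankIterationSketch {
    static double[] pageRank(double[][] s, int iterations, double a) {
        int n = s.length;
        double[] rank = new double[n];
        rank[0] = 1.0;                 // all rank starts on one node
        for (int k = 0; k < iterations; k++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0;
                for (int j = 0; j < n; j++) sum += s[i][j] * rank[j];
                next[i] = a * sum + (1 - a) / n;   // random-surfer teleportation term
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical graph: incidents 0 and 1 both linked to tag 2 (bidirectionally).
        double[][] s = {
            {0.0, 0.0, 0.5},
            {0.0, 0.0, 0.5},
            {1.0, 1.0, 0.0}
        };
        double[] rank = pageRank(s, 100, 0.85);
        // The tag, linked from both incidents, receives the highest rank.
        System.out.printf("incidents %.3f %.3f, tag %.3f%n", rank[0], rank[1], rank[2]);
    }
}
```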
Austin (2008), speaking of Google's implementation of the PageRank
algorithm on the graph of websites gathered by the company, notes that "The calculation is
reported to take a few days to complete". Running the PageRank algorithm on our
source data took far less time, on the order of five minutes, as we are dealing with
considerably less data than Google.
5.3.3. Procedure
Once the apparatus was in place and sufficiently tested, the procedure for
experiment two followed these steps:
1. Associate tags with the source incident set.
2. Run the PageRank generation algorithm.
3. Generate reports showing the highest ranked incidents and tags.
These are discussed in more detail below.
5.3.3.1. Associating Tags with Incidents.
Using the user interface described in Illustration 5, tags can be added to
incidents. Tags are free-form keywords. They are created and used by all participants
in the incident's life cycle. They form a set of metadata associated with the incident,
and can be used as a form of categorisation for incidents.
Based on the description, summary and work log of each incident, a set of tags
was associated with each incident. These represent various aspects of the incident
including:
● the symptoms of the issue as reported by the end user,
● the approaches taken by the IT staff in attempting to resolve the issue,
● the root cause or causes,
● the change made that fixed this incident,
● any other relevant information.
5.3.3.2. Running the PageRank Algorithm
Running the PageRank algorithm involved creating an ODBC data source
within Microsoft Windows. The Java executable which encoded the algorithm was
then run. This connected via a JDBC-ODBC bridge, queried for incidents and
tags, and generated the PageRank numbers before saving them to the database.
5.3.3.3. PageRank Reports
PageRank reports were generated using Microsoft Access query and reporting
functionality. These queried both tags and incidents and ordered them in descending
order of PageRank value. These are presented in section 6.
6. Results
This section shows the results from both experiments.
6.1. Experiment 1: Pareto Results
The results of experiment 1 are discussed in section 5.2.4, and are
summarised here.
On application of Pareto analysis to the source dataset, it was found that the
top-level category of Oracle had the most incidents, with an incident count of 3,601,
accounting for 18.74% of all incidents. Within the Oracle category, the largest mid-level
subcategory was found to be Operations, with an incident count of 1,625, which
accounted for 45.13% of all Oracle incidents. Finally, within the Operations category,
the largest low-level subcategory was found to be Access Issues, with an incident
count of 636, accounting for 38.71% of all Operations issues.
6.2. Experiment 2: PageRank Results
A summary of the top-ranked tags is given in the following tables and charts.
Rank  Tag                          PageRank (x1000)  %       Cumulative %
1     ResetPassword                49.16             10.40%  10.40%
2     NewResponsibilities          37.82             8.00%   18.40%
3     UserRoles                    35.44             7.50%   25.90%
4     ApplicationError             27.69             5.86%   31.76%
5     OAR (Oracle Access Request)  24.63             5.21%   36.97%
6     IECache                      20.85             4.41%   41.38%
7     NewUser                      18.59             3.93%   45.31%
8     DesktopIssue                 15.69             3.32%   48.63%
9     LostResponsibilities         15.64             3.31%   51.94%
10    UserTraining                 15.31             3.24%   55.18%
Table 4: Top Ten Tags Ordered by PageRank
Some 57 new categories were introduced through the tagging of the source
data set. Despite this, the majority of incidents are contained within a minority of
categories. As Table 4 shows, the top ten tags account for over 55% of the total
allocated PageRank for tags.
On reading through the source data, it was clear that the majority of incidents
fell into a small set of high level categories. Those were:
1. Application/system errors.
2. Loss of access due to scheduled downtime.
3. Loss of access due to changing user circumstances.
4. Routine loss of access rights.
One might expect that application or system errors would account for the
majority of incidents logged. The tacit assumption of ITIL is that all incidents are
caused by some defect within the IT system. This does not seem to be borne out by
Illustration 7: Tags Ordered by PageRank
[Pareto-style bar chart of all tags in descending order of PageRank (x1000), from ResetPassword down to ClearTempFiles, with a cumulative percentage line on a secondary axis.]
the PageRank analysis. Tags associated with application errors are ranked at 4
(ApplicationError), 6 (IECache) and 8 (DesktopIssue).
The standard response from IT staff is to return the user's client to a known
good configuration by clearing various caches (IECache), and this usually fixes the
access issue. The ApplicationError tag was introduced as a catchall tag for those
incidents which did not provide sufficient information for other tags to be applied.
Losses of access due to known issues, such as system upgrades, scheduled
downtime, and report generation, were captured by the tags in Table 5.
Rank  Tag               PageRank (x1000)  %
11    UpgradeIssues     14.66             3.10%
19    Maintenance       8.48              1.79%
39    ReportInProgress  2.53              0.54%
Table 5: Known Issue Tags
These issues represent a minority of cases, with upgrade issues being the
highest ranked. Several incidents were logged when Oracle instances were upgraded
and IT staff worked to restore user roles and access rights. Outside of this once-off
event, access issues related to known issues are ranked quite low.
The final two categories refer to routine loss of access rights. These are due to
the individual business rules which govern the operation of the various Oracle
systems. The Human Resources department define access rights based on the user's
job role. Several of the top ten tags are associated with operation of this policy.
Those are:
● NewResponsibilities, users request new access rights through the
incident handling system.
● UserRoles, users request changes to their roles through the incident
handling system.
● OAR, the Oracle Access Rights system. Users are directed to request
changes in access rights through a separate system.
● NewUser, a new user has difficulties accessing Oracle.
● LostResponsibilities, typically a user has lost some rights due to
changing job role, or their access rights have reached their time limit and must
be renewed.
● UserTraining, due to the operational rules, IT staff can only address
these issues through training users in the operational policies.
Finally, we must note the top-ranked tag of ResetPassword, which was applied
to any incident resulting in a password being reset. This seems to be a standard IT
staff response to an access issue, and perhaps represents the overuse of a particular
incident resolution strategy.
6.3. Comparison
In this section we compare various aspects of Pareto and PageRank analysis.
6.3.1. On PageRank and Tags
We begin our comparison by looking at the PageRank values given to the tags
defined in the system.
6.3.1.1. Comparison of Incident Count to PageRank
As Dorner (1997) shows, Pareto charts are based on the number of incidents
within a particular category. By comparing the number of incidents associated with
each tag with that tag's PageRank value, we can assess if the PageRank analysis
differs from Pareto. A scatter graph of Incident Count versus PageRank is displayed
in Illustration 8.
As can be seen in Illustration 8, a roughly linear relationship exists between
incident count and PageRank value for tags. This is confirmed when the correlation
coefficient is computed, giving a value of ρ = 0.98.
With the correlation coefficient this close to 1, we can say that PageRank
analysis is quite similar to Pareto in this regard, though it must be said that Pareto
counts each incident in only one category, where PageRank can count the same
incident in multiple categories.
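A correlation of this kind can be computed with the standard Pearson formula; the thesis does not name its method, so Pearson is assumed here. In the sketch below the incident counts are invented, and only the PageRank values echo Table 4.

```java
// Sketch: Pearson correlation coefficient between per-tag incident counts
// and PageRank values.
public class CorrelationSketch {
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i]; sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i]; sumY2 += y[i] * y[i];
        }
        // r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2)(n*Syy - Sy^2))
        double num = n * sumXY - sumX * sumY;
        double den = Math.sqrt((n * sumX2 - sumX * sumX) * (n * sumY2 - sumY * sumY));
        return num / den;
    }

    public static void main(String[] args) {
        double[] counts = {40, 30, 28, 22, 20};                // incidents per tag (illustrative)
        double[] ranks  = {49.16, 37.82, 35.44, 27.69, 24.63}; // PageRank x1000, from Table 4
        System.out.printf("correlation = %.2f%n", pearson(counts, ranks));
    }
}
```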
6.3.1.2. Comparison of Incident Count to Resolution Time
For each tag we can find the associated incident count, and we can also find
the sum of the resolution time for those associated incidents. In Illustration 9 we
graph this relationship.
Illustration 8: Comparison of Incident Count to PageRank
[Scatter chart of incident count per tag (x-axis, 0 to 45) against PageRank (y-axis, 0 to 60).]
As previously mentioned, Pareto analysis associates each incident with only
one category. In this system it is obvious then that the categories with the most
incidents are likely to be the categories which have the highest total resolution time.
As can be seen from Illustration 9, the relationship is less well defined than in
the previous example, yet can still be said to be roughly linear. The correlation
coefficient of this data set is ρ = 0.92.
Although not as high as the previous example, it shows that a strong
relationship exists between the Tag categories and the resolution times. Again, those
categories with the most incidents are most likely to be the categories which show the
longest resolution times.
6.3.1.3. Comparison of PageRank to Resolution Time
Similarly, for each tag we can assess the relationship between the assigned
PageRank value and the resolution time for the associated incidents. There is no
Illustration 9: Correlation of Incident Count to Resolution Time
[Scatter chart of incident count per tag (x-axis, 0 to 45) against the sum of time to resolve (y-axis, 0 to 350).]
direct comparison to be made with Pareto here, as Pareto does not rank categories in
any way other than through incident counts.
A comparison of PageRank to resolution time is displayed in Illustration 10.
This shows increased scatter compared with the previous examples, but can again be
said to be roughly linear.
The correlation coefficient for this relationship is ρ = 0.90. Based on this value,
we can say that the PageRank values for tags are strongly correlated with the
resolution times for associated incidents. This shows that PageRank analysis can
have some value, as those tags with the highest ranks are likely to be those requiring
the most time to fix.
6.3.1.4. Relationships between Tags
An attribute of the tagging categorisation system is that incidents can be
simultaneously associated with multiple tags. This is in contrast with the single
Illustration 10: Correlation of PageRank to Resolution Time
[Scatter chart of resolution time (x-axis, 0 to 350) against PageRank (y-axis, 0 to 60).]
category an incident may be related to in a fixed or hierarchical categorisation
scheme.
This allows us to analyse incident counts using Venn diagrams, as shown in
Illustration 11. The numbers represent the incident counts for that area of the
diagram; that is, 37 incidents are tagged with ResetPassword alone, while 3
incidents are tagged with both ResetPassword and IECache.
While this is a simple example with three tags, more sophisticated Venn
diagrams, or other analysis mechanisms, can be applied to the relationships
between tags. This shows the flexibility of the tagging system in allowing multiple
analysis techniques to be brought to bear on the metadata which tagging
introduces on the source data. By analysing these relationships, system administrators
Illustration 11: Example Venn of Tag Relationships
[Three-circle Venn diagram showing incident counts for each region of overlap between three tags.]
can gain further insight into how users categorise incidents, and can begin to learn
how people react to the system when errors or other unexpected behaviours occur.
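Counting the regions of such a diagram amounts to grouping incidents by their exact tag sets. A sketch follows, with hypothetical incident-tag assignments standing in for the contents of IssueTagTbl.

```java
import java.util.*;

// Sketch: count incidents per exact tag combination, giving the numbers
// for each region of a Venn-style breakdown.
public class TagOverlapSketch {
    static Map<Set<String>, Integer> overlapCounts(Map<Integer, Set<String>> incidentTags) {
        // Group incidents by their exact tag set and count each combination.
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (Set<String> tags : incidentTags.values()) {
            counts.merge(tags, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> incidentTags = new HashMap<>();
        incidentTags.put(101, Set.of("ResetPassword"));
        incidentTags.put(102, Set.of("ResetPassword", "IECache"));
        incidentTags.put(103, Set.of("ResetPassword"));
        Map<Set<String>, Integer> counts = overlapCounts(incidentTags);
        System.out.println(counts.get(Set.of("ResetPassword")));            // -> 2 (tagged alone)
        System.out.println(counts.get(Set.of("ResetPassword", "IECache"))); // -> 1 (tagged with both)
    }
}
```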
6.3.2. On PageRank and Incidents
We continue our comparison by looking at the PageRank values given to
incidents.
6.3.2.1. Comparison of PageRank to Resolution Time.
As can be seen in section 6.3.1.3, the PageRank values assigned to tags were a
good indicator of the amount of time required to resolve the associated incidents.
We now consider if this also holds true for the PageRank values assigned to incidents.
As can be seen in Illustration 12, the values of PageRank and resolution time
for incidents are widely scattered, seemingly at random. Calculating the correlation
coefficient gives a value of ρ = 3.07×10⁻³. With a value so close to zero, we can
confidently say that no relationship exists between PageRank values for
incidents and their resolution times.
While PageRank values showed considerable correlation with the resolution
times for tags, the same cannot be said for incidents. It would appear that ranking
incidents by their PageRank values gives no indication of how long those incidents
will take to resolve.
6.3.2.2. Comparison of PageRank to Urgency.
When incidents are logged to the help desk, they are assigned an urgency
value of Low (1), Medium (2), High (3) or Urgent (4). We now analyse the
correlation between the PageRank values assigned to incidents and their associated
urgency values.
As can be seen from Illustration 13, the PageRank values are scattered widely.
Calculating the correlation coefficient shows it to be ρ = 7.26×10⁻³. With a value so
close to zero, we can confidently say that the PageRank value assigned to an incident
is not an indicator of the incident's urgency.
We can only speculate as to why this may be. Assuming that PageRank is a
true indicator of an incident's importance, it could be that there is no consistent
understanding of what the urgency field of an incident is used for. In this case
different people may mark similar incidents with vastly different urgency values.
Illustration 12: Comparison of PageRank to Resolution Time for Incidents
[Scatter chart of PageRank (x-axis, 0 to 7) against resolution time (y-axis, 0 to 25) for incidents.]
It could also be argued that the urgency values assigned to incidents are valid.
In that case, we would have to argue that the PageRank values assigned are not a good
indicator of the incident's true importance.
6.3.2.3. Comparison of PageRank to Priority.
In a similar way to the Urgency values previously discussed, Priority values
are also assigned to incidents by those who log them. An incident's priority field can
take the values of Low (1), Medium (2), High (3) or Urgent (4). We now analyse the
relationship between PageRank values and Priority values for incidents.
When plotted on a chart, as in Illustration 14, we can see that the values of
PageRank versus priority are quite scattered. Calculating the correlation coefficient
shows a value of ρ = 0.14. Being so close to zero, we can say that there is no
meaningful relationship between PageRank values and the assigned priority values.
Illustration 13: Comparison of PageRank to Urgency for Incidents
[Scatter chart of PageRank (x-axis, 0 to 7) against urgency (y-axis, 0 to 4.5) for incidents.]
Similar to the urgency values, we can assume that PageRank is a true rank of
relative importance, and then argue that the priority values are not assigned
consistently to incidents. This would account for the lack of a relationship between
the two values. We can also argue that the priority values are valid, and that the
PageRank values therefore do not represent a meaningful measure of an incident's
relative priority.
6.3.2.4. Comparison of PageRank to Impact.
Similar to Urgency and Priority, an incident is given an impact value when
created. This can have values of Low (1), Medium (2) or High (3). We now analyse
the relationship between PageRank and Impact values for incidents.
As can be seen from the chart in Illustration 15, the values of PageRank versus
impact are scattered. A correlation coefficient of ρ = 0.12 shows an extremely weak
relationship. Thus we can say that PageRank values for incidents are not a good
indication of an incident's impact.
Illustration 14: Comparison of PageRank to Priority for Incidents
[Scatter chart of PageRank (x-axis, 0 to 7) against priority (y-axis, 0 to 4.5) for incidents.]
Again, we can assume that the PageRank values are a true rank of an incident's
impact, and thus argue that the impact values are not consistently applied by the
human system users. We can also argue that the impact values are valid, and the
PageRank values are of no value as indicators of an incident's impact.
6.3.2.5. Relationships between Priority, Urgency, Impact and Resolution Time
In the previous sections we have left open the question of the quality of the
PageRank values versus the quality of the priority, urgency and impact data. By
analysing the relationships between these values, we can show whether system users
are consistent in applying similar values for these fields, and thus whether the
PageRank values for incidents are meaningless.
By calculating the correlation coefficients between these values we arrive at
the results in Table 6.
Illustration 15: Comparison of PageRank to Impact for Incidents
[Scatter chart of PageRank (x-axis, 0 to 7) against impact (y-axis, 0 to 3.5) for incidents.]
Relationship                     Correlation Coefficient (ρ)
Priority versus Urgency          0.02
Priority versus Impact           0.37
Priority versus Resolution Time  0.07
Urgency versus Impact            0.16
Urgency versus Resolution Time   0.16
Impact versus Resolution Time    0.10
Table 6: Relationships between Priority, Urgency, Impact and Resolution Time
Based on the values in Table 6, we can see that the strongest relationship is
that between priority and impact. With a correlation coefficient of ρ = 0.37, the best
we can say is that it is a weak relationship.
The other relationships between Priority, Urgency and Impact show values
close to zero. These values show clear evidence that the system users who create and
manipulate incidents do not provide consistent values for Priority, Urgency and
Impact. While this does not argue in favour of the idea that PageRank values for
incidents represent a meaningful measure, it does show that we cannot discount those
PageRank values as meaningless with respect to true measures of incident priority,
urgency or impact.
As an aside, it is also interesting to note the relationships between Resolution
Time and the values of Priority, Urgency and Impact. These also show very weak
relationships.
Within a correctly run ITIL incident management system, one would expect
that every effort would be made to resolve higher priority incidents quickly, and that
they would therefore show faster resolution times. This does not appear to be the case
in the system which produced the source data for this analysis.
Several factors may explain this discrepancy:
1. Assigned values of Priority, Impact and Urgency provide no value and
are ignored by IT staff.
2. IT staff may have so low a throughput of incidents that they
effectively resolve them on a first come, first served basis. That is, the help
desk workload is so light that it does not require prioritisation of incoming incidents.
3. Resolution Time is highly variable. That is, regardless of how
incidents are prioritised, or in what order they are resolved, resolution time is
so variable as to show no relationship with priority.
The first factor is unlikely. While the values of priority, urgency
and impact may not hold much intrinsic value, they are respected by IT staff, as service
level agreements tie IT staff to fixed resolution times associated with each priority
level. The second factor is also unlikely, as an analysis of the data shows an average
of 178 incidents being logged daily.
Finally, the third factor may have some merit. If we calculate the standard
deviation of resolution time within the priority groups, we arrive at the values in
Table 7.
Priority  Standard Deviation of Resolution Time
Low       4.32 hours
Medium    4.96 hours
High      4.87 hours
Urgent    4.81 hours
Table 7: Deviation of Resolution Times by Priority
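The per-priority spread can be computed by grouping resolution times by priority and taking the standard deviation within each group. A sketch follows, with invented sample data in place of the incident records.

```java
import java.util.*;

// Sketch: population standard deviation of resolution times, per priority group.
public class ResolutionSpreadSketch {
    static double stdDev(List<Double> values) {
        double mean = values.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double var = values.stream()
                           .mapToDouble(v -> (v - mean) * (v - mean))
                           .average().orElse(0);
        return Math.sqrt(var);
    }

    public static void main(String[] args) {
        // priority -> resolution times in hours (illustrative)
        Map<String, List<Double>> byPriority = Map.of(
            "Low",    List.of(1.0, 5.0, 12.0),
            "Urgent", List.of(0.5, 4.0, 11.0));
        byPriority.forEach((priority, times) ->
            System.out.printf("%s: %.2f hours%n", priority, stdDev(times)));
    }
}
```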
The values are both large and similar. This shows that two incidents arriving
at the same time can have resolution times which differ by over 4 hours, or half a
business day, regardless of the priority associated. This would account for the weak
correlation between resolution time and priority which we discussed earlier.
7. Discussion
7.1. Realisation of Project Aims
The aims of the project as stated in section 2.1 are:
1. Show how incident classification influences effective problem
management.
2. Propose a new classification mechanism for incidents.
3. Explore the analysis opportunities for the new classification
mechanism.
The new classification mechanism for incidents we propose is that of tagging
with PageRank analysis. We will argue that tagging provides additional flexibility in
categorisation beyond fixed or hierarchical categorisation schemes.
We have shown that PageRank analysis on a small data set is at least as good
as a corresponding Pareto analysis. With regard to the analysis opportunities for the
new classification mechanism, we will further argue that, as the data set grows, the
quality of correlations which can be produced from the data will allow much better
insight into the relationships which govern the operation of an IT system.
7.2. Big Science
As a race, we produce and consume ever-growing amounts of data each year.
As we move ever closer to the vision of ubiquitous computing, this can only rise, as
sensors and computing devices become smaller and cheaper. They will become
embedded in a vast array of locations, allowing us to gather ever more detailed
measurements of the physical world.
With so much data, Kevin Kelly (2008) suggests we need new methods of
science to analyse and make use of this data, stating:
There's a dawning sense that extremely large databases of information,
starting in the petabyte level, could change how we learn things. The
traditional way of doing science entails constructing a hypothesis to match
observed data or to solicit new data. Here's a bunch of observations; what
theory explains the data sufficiently so that we can predict the next
observation?
It may turn out that tremendously large volumes of data are sufficient to
skip the theory part in order to make a predicted observation. Google was
one of the first to notice this. For instance, take Google's spell checker.
When you misspell a word when googling, Google suggests the proper
spelling. How does it know this? How does it predict the correctly
spelled word? It is not because it has a theory of good spelling, or has
mastered spelling rules. In fact Google knows nothing about spelling
rules at all.
Instead Google operates a very large dataset of observations which show
that for any given spelling of a word, x number of people say "yes" when
asked if they meant to spell word "y." Google's spelling engine consists
entirely of these data points, rather than any notion of what correct
English spelling is. That is why the same system can correct spelling in
any language.
The traditional goal of science has been to build a better model of the physical
world. Taking all known facts about a particular system, a scientist creates a model
for that system. Using this model, they hypothesise about the behaviour of the model
under new circumstances. Experiments allow scientists to test the models under the
new circumstances, and thus validate or nullify their model.
With the growth of data, sensors and meta data, do we need to continue to
produce models? With enough data, can we not analyse the real, physical world,
without having to create a simplified abstraction of it? Anderson (2008) argues we
can, stating:
Faced with massive data, this approach to science — hypothesize, model,
test — is becoming obsolete. Consider physics: Newtonian models were
crude approximations of the truth (wrong at the atomic level, but still
useful). A hundred years ago, statistically based quantum mechanics
offered a better picture — but quantum mechanics is yet another model,
and as such it, too, is flawed, no doubt a caricature of a more complex
underlying reality. The reason physics has drifted into theoretical
speculation about n-dimensional grand unified models over the past few
decades is that we don't know how to run the experiments that would
falsify the hypotheses — the energies are too high, the accelerators too
expensive, and so on.
Now biology is heading in the same direction. The models we were
taught in school about "dominant" and "recessive" genes steering a strictly
Mendelian process have turned out to be an even greater simplification of
reality than Newton's laws. The discovery of gene-protein interactions
and other aspects of epigenetics has challenged the view of DNA as
destiny and even introduced evidence that environment can influence
inheritable traits, something once considered a genetic impossibility.
In short, the more we learn about biology, the further we find ourselves
from a model that can explain it.
There is now a better way. Petabytes allow us to say: "Correlation is
enough." We can stop looking for models. We can analyse the data
without hypotheses about what it might show. We can throw the numbers
into the biggest computing clusters the world has ever seen and let
statistical algorithms find patterns where science cannot.
This is a new way to approach scientific discovery. It can uncover
correlations in the data, but without a model we may not understand the causal
relationship, if any, behind them. What is certain is that with so much data, new
methods of science will have to be created, as Kelly (2008) states:
Many sciences such as astronomy, physics, genomics, linguistics, and
geology are generating extremely huge datasets and constant streams of
data in the petabyte level today. They'll be in the exabyte level in a
decade. Using old-fashioned "machine learning," computers can extract
patterns in this ocean of data that no human could ever possibly detect.
These patterns are correlations. They may or may not be causative, but we
can learn new things. Therefore they accomplish what science does,
although not in the traditional manner.
What Anderson (2008) is suggesting is that sometimes correlations
alone are sufficient. There is a good parallel in health care: much doctoring
works on the correlative approach. The doctor may never find the actual cause
of an ailment, or understand it if they did, but they can correctly predict the
course and treat the symptoms.
7.3. Diagnosing the IT System.
If this approach allows us to ask questions and receive perfectly good answers,
without having to construct a simplified model of the system, can such an approach be
used to analyse an ITIL incident database? Much as a doctor treats a symptom
without knowing the underlying cause, can a system administrator diagnose an IT
system in a similar way?
Yet, doctors don't diagnose blindly, but have built up a knowledge base of
relationships between symptoms and illnesses. Given symptom A, a doctor can say
with a good degree of confidence, based on historical analysis, the probability that
illness X, Y or Z is the underlying cause. Taking a broader approach and including
several symptoms in the analysis can raise or lower the probability of each
candidate illness. The doctor then begins by attempting to treat the most likely
illness that could cause the patient's symptoms.
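The diagnostic reasoning above can be sketched as a naive Bayes calculation. The illnesses, symptoms, priors and likelihoods below are invented purely for illustration and appear nowhere in the thesis data:

```python
# Hypothetical priors P(illness) and likelihoods P(symptom | illness);
# symptoms are assumed independent given the illness (the "naive" part).
priors = {"X": 0.5, "Y": 0.3, "Z": 0.2}
likelihoods = {
    "X": {"A": 0.9, "B": 0.1},
    "Y": {"A": 0.4, "B": 0.8},
    "Z": {"A": 0.1, "B": 0.2},
}

def posterior(symptoms):
    """Return P(illness | symptoms) for each candidate illness."""
    scores = {}
    for illness, prior in priors.items():
        p = prior
        for s in symptoms:
            p *= likelihoods[illness][s]
        scores[illness] = p
    total = sum(scores.values())
    return {i: p / total for i, p in scores.items()}

# One symptom points to illness X; adding a second shifts the answer to Y.
print(posterior(["A"]))
print(posterior(["A", "B"]))
```

With symptom A alone, X is the most likely cause; adding symptom B raises Y's posterior above X's, mirroring how a broader set of symptoms narrows the diagnosis.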
If a system administrator is to take a similar approach, then a similar
knowledge base of symptom to likely cause relationships must be developed. ITIL
attempts to achieve this through the administration of an incident log. Incident
records are created along with their work logs, and eventual fixes, in the hope that
similar incidents in future will mandate similar fixes.
Continuing with the medical metaphor, this seems rather simplistic. Doctors
tend to take a holistic view of a patient's well-being. You would question your
health care provider if they treated your cough while your leg remained broken. Yet
this is the approach that ITIL's Incident Management process suggests.
Secondly, many illnesses produce flu-like symptoms, so it would be remiss of
a doctor to assume that these represent flu and begin treatment without
investigating further. In medicine at least, similar symptoms may have disparate
underlying causes. If the same holds for IT systems, then ITIL's approach of
curing each incident separately, on the assumption that each symptom has a single
root cause, is equally simplistic.
7.4. On Categorisation
ITIL's Problem Management process recommends the use of Pareto charts for
incident analysis. This imposes a categorisation scheme in which each incident is
associated with only one category, usually related to its root cause. The source
data we analysed, for example, used a hierarchical categorisation scheme.
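Since each incident carries exactly one category under such a scheme, the Pareto breakdown reduces to a frequency count. A minimal sketch, with invented category names:

```python
from collections import Counter

# Invented single-category incident records.
incidents = ["Oracle", "Intranet", "Oracle", "Email", "Oracle",
             "Intranet", "Email", "Oracle", "Printing", "Oracle"]

# Rank categories by frequency and track the cumulative percentage,
# which is the information a Pareto chart plots.
counts = Counter(incidents).most_common()
total = sum(n for _, n in counts)

cumulative = 0
for category, n in counts:
    cumulative += n
    print(f"{category:10s} {n:3d}  {100 * cumulative / total:5.1f}%")
```

The "vital few" categories at the top of the list are the ones Problem Management would address first.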
Again, we can argue that this is simplistic. While many incidents may
well have only one root cause, it is naive to suggest that all do; some incidents
have multiple root causes.
A systems engineering view suggests that certain incidents are only realised
when a series of events, however unlikely, occurs together. The one-to-one
relationship between symptom and cause, forced by a fixed or hierarchical
categorisation scheme, cannot capture these subtleties.
To overcome this limitation, we propose a collaborative tagging system for
incidents. Golder & Huberman (2005) introduce tagging as:
Marking content with descriptive terms, also called keywords or tags, is a
common way of organising content for future navigation, filtering or
search. Though organising electronic content this way is not new, a
collaborative form of this process, which has been given the name
“tagging” by its proponents, is gaining popularity on the web.
Document repositories or digital libraries often allow documents in their
collections to be organised by assigned keywords. However, traditionally
such categorising or indexing is either performed by an authority, such as
a librarian, or else derived from the material provided by the authors of
the documents (Rowley, 1995). In contrast, collaborative tagging is the
practice of allowing anyone – especially consumers – to freely attach
keywords or tags to content. Collaborative tagging is most useful when
there is nobody in the “librarian” role or there is simply too much content
for a single authority to classify; both of these traits are true of the web,
where collaborative tagging has grown popular.
This contrasts with the fixed categorisation scheme already applied to our
source data, in which a user could only place an incident into one of the
predefined system categories. Allowing users to categorise incidents in a
free-form "folksonomy" presents its own difficulties: the issues of polysemy,
synonymy and basic-level variation need to be considered.
Golder & Huberman (2005) discuss these, saying:
A polysemous word is one that has many (“poly”) related senses
(“semy”). For example, a “window” may refer to a hole in the wall or to
the pane of glass that resides within it (Pustejovsky, 1995). In practice,
polysemy dilutes query results by returning related but potentially
inapplicable items.
Synonymy, or multiple words having the same or closely related
meanings, presents a greater problem for tagging systems because
inconsistency among the terms used in tagging can make it very difficult
for one to be sure that all the relevant items have been found. It is
difficult for a tagger to be consistent in the terms chosen for tags; for
example, items about television may be tagged either television, or TV.
This problem is compounded in a collaborative system, where all taggers
either need to widely agree on a convention, or else accept that they must
issue multiple or more complex queries to cover many possibilities.
Reflecting the cognitive aspect of hierarchy and categorisation, the “basic
level” problem is that related terms that describe an item vary along a
continuum of specificity ranging from very general to very specific; cat,
cheetah and animal are all reasonable ways to describe a particular entity.
The problem lies in the fact that different people may consider terms at
different levels of specificity to be most useful or appropriate for
describing the item in question.
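One partial mitigation for synonymy, and to some extent basic-level variation, is to fold raw tags into canonical forms via an alias table. A minimal sketch; the tags and aliases below are invented:

```python
# Hypothetical alias table mapping variant tags to a canonical form.
aliases = {
    "tv": "television",
    "telly": "television",
    "winxp": "windows-xp",
    "xp": "windows-xp",
}

def normalise(tag):
    """Lower-case a tag and fold known synonyms into one canonical term."""
    tag = tag.strip().lower()
    return aliases.get(tag, tag)

raw_tags = ["TV", "television", "XP", "WinXP", "email"]
print(sorted({normalise(t) for t in raw_tags}))
```

The alias table must itself be maintained, so this trades the librarian problem for a smaller one; it does nothing for polysemy, where a single surface form carries several senses.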
Kelly (2008) refers to these personality types as “lumpers and splitters”,
stating:
In every classification scheme, there are two camps. There are those
classifiers who tend to find similarities between things and prefer to lump
smaller groups into larger groups, and on the other hand those cataloguers
who tend to find differences and prefer to split larger groups into smaller
groups.
7.5. The Wisdom of Crowds
Given these peculiarities, and the differing opinions between users as to how
to classify something, one could be forgiven for thinking this style of categorisation is
a recipe for chaos. Speaking of their analysis of bookmarks on the Del.icio.us
website, Golder & Huberman (2005) state:
One might expect that individuals' varying tag collections and personal
preferences, compounded by an ever-increasing number of users, would
yield a chaotic pattern of tags. However, it turns out that the combined
tags of many users' bookmarks give rise to a stable pattern in which the
proportions of each tag are nearly fixed. Empirically, we found that,
usually after the first 100 or so bookmarks, each tag's frequency is a
nearly fixed proportion of the total frequency of all tags used.
This neatly illustrates the "wisdom of crowds" (Surowiecki, 2004).
While individuals may differ on how to categorise an item, the sum total of the
tags tends to stabilise around each tagged item, fully annotating its concept.
Allied with PageRank analysis, the more frequently applied general terms tend to
rank higher than the less frequently applied personal tags. The system can thus
accommodate both "lumpers and splitters", arriving at a consensus on how items
are categorised.
7.6. Analysis of Tags
The tags applied to incidents create a set of metadata describing the
incident data. We have shown, through the application of the PageRank algorithm
to the graph of incidents and tags, that both tags and incidents can receive a
rank of relative importance.
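As a rough illustration, PageRank can be computed by power iteration over an undirected bipartite graph of incidents and tags. The toy graph below is invented and is not the thesis data set:

```python
# Toy incident-tag edges; each incident is linked to the tags applied to it.
edges = [("inc1", "network"), ("inc2", "network"),
         ("inc2", "email"), ("inc3", "email"), ("inc3", "printer")]

# Build an undirected adjacency list.
adj = {}
for a, b in edges:
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)

nodes = list(adj)
d = 0.85                     # damping factor, as in Brin & Page (1998)
rank = {n: 1.0 / len(nodes) for n in nodes}

for _ in range(50):          # fixed iteration count; ample for this tiny graph
    rank = {
        n: (1 - d) / len(nodes)
           + d * sum(rank[m] / len(adj[m]) for m in adj[n])
        for n in nodes
    }

for n, r in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{n:8s} {r:.3f}")
```

Tags applied to more incidents ("network", "email") come out above the singleton tag ("printer"), matching the intuition that frequently applied tags rank higher.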
We go on to show that the PageRank values for tags correlate well
with the total resolution time for the incidents carrying those tags. This
represents a first-order analysis of metadata against source data, through the
comparison of PageRank to the base data.
Secondly, we show that further opportunities for analysis exist.
Here we show the flexibility of the tagging system, as incidents can be associated
with multiple categories. We also show briefly that the relationships between tags
can themselves be analysed. Further and deeper analysis could show which set of
tags describes the most incidents, or the most important incidents within the
system, giving administrators increased insight into where users perceive problems
to exist.
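The first-order check mentioned above, comparing each tag's PageRank with the summed resolution time of its incidents, amounts to a correlation calculation. A sketch with invented numbers:

```python
import math

# Invented per-tag figures: PageRank and total resolution hours.
pagerank = [0.21, 0.17, 0.12, 0.08, 0.05]
hours    = [340.0, 295.0, 180.0, 120.0, 60.0]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(pagerank, hours)
print(f"r = {r:.3f}")  # near 1.0 for this monotone toy data
```

A high r on real data would support using tag PageRank as a proxy for where resolution effort concentrates; it would not by itself establish causation.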
7.7. Further Research
Perhaps the best way to view this research is as a feasibility study of a
new approach to "correlative analysis" (Kelly, 2008) of incident data. We hope to
have shown that the approach has some merit.
To further the research, one could run a similar analysis on a larger
scale, perhaps as a case study, given a minimum set of requirements. If these can
be met, the experiment may prove valuable.
As previously discussed, this approach requires an abundance of metadata if
it is to find correlations. To achieve this, a candidate system would have to be:
1. sufficiently large,
2. accessed by many users,
3. annotated with metadata by everyone.
To be confident in the quality of analysis available from a system, the
aggregate metadata within it would have to reach a certain minimum level; below
this, the quality of correlative analytics could not be guaranteed. One would have
to ensure that the system is sufficiently large in terms of users, incidents and
tags for a sufficient aggregate amount of metadata to be created and maintained.
Golder & Huberman (2005) find great variety in the number of tags individual
users apply, stating:
As might be expected, users vary greatly in the frequency and nature of
their Delicious use. In our "people" dataset, there is only a weak
relationship between the age of the user's account (i.e. the time since they
created the account) and the number of days on which they created at least
one bookmark (n=229; R²=.52). That is, some users use Delicious very
frequently, and others less frequently.
More interestingly, there is not a strong relationship between the number
of bookmarks a user has created and the number of tags they used in those
bookmarks (n=229; R²=.33). The relationship is weak at the low end of
the scale, users with fewer than 30 bookmarks (n=39; R²=.33), and even
weaker at the upper end, users with more than 500 bookmarks (n=36;
R²=.14). Some users have comparatively large sets of tags, and others
have comparatively small sets.
PageRank rankings of tags and incidents could be generated automatically by
the system on a regular basis. Finally, a free-form analysis tool would have to be
produced, allowing data and metadata to be organised, summarised and analysed.
Using this, arbitrary data points could be chosen and the relationships between
them explored.
Such a system could support many arbitrary reports. Over time, one would
expect that some reports would prove more useful in managing the system than
others, thus providing increased insight into the operation of the system under
scrutiny and adding to the collected set of metadata about it.
8. References
Anderson, C. (2008) The End of Theory: The Data Deluge Makes the Scientific
Method Obsolete. Wired Magazine. Retrieved on August 7, 2008 from http://
www.wired.com/science/discoveries/magazine/1607/pb_theory/
Apostolov, A. (2006) Automatic fault analysis and user notification for predictive
maintenance. IEEE Conference Record, Cement Industry Technical
Conference, 2006. 9 – 14.
Austin, D. (2008) How Google Finds Your Needle in the Web's Haystack. American
Mathematical Society. Retrieved on August 1, 2008 from http://www.ams.org/
featurecolumn/archive/pagerank.html
Barash, G., Bartolini, C. & Wu, L. (2007) Measuring and Improving the
Performance of an IT Support Organization in Managing Service Incidents.
Proceedings of the 2nd International Workshop on Business-Driven IT
Management. 11 – 18.
BMC Software. (2006) Achieving Proactive Incident and Problem Management
Using BMC Performance Manager. Retrieved January 6, 2008 from
http://www.bmc.com/products/attachments/WP_Achieving_PIPM_Using_BM
C_PM_2202v3ww_SOLUTION_USA4_FY07_Q3.pdf.
Brin, S. & Page, L. (1998) The Anatomy of a Large-Scale Hypertextual Web Search
Engine. Retrieved on January 6, 2008 from
http://infolab.stanford.edu/~backrub/google.html.
Card, D. (1993) Defect-causal analysis drives down error rates. IEEE Software. Vol.
10, No. 4. 98 – 99.
Card, D. (1998) Learning from our mistakes with defect causal analysis. IEEE
Software. Vol. 15, No. 1. 56 – 63.
Dorner, W. (1997) Using Excel for Data Analysis. Quality Digest. Retrieved on July
07, 2008 from http://www.qualitydigest.com/oct97/html/excel.html
Gokhale, S., Crigler, J., Farr, W. & Wallace, D. (2005) System Availability Analysis
Considering Hardware/Software Failure Severities. Proceedings of the 29th
Annual IEEE/NASA Software Engineering Workshop. 47 – 56.
Golder, S. & Huberman, B. (2005) The Structure of Collaborative Tagging Systems.
Information Dynamics Lab, HP Labs. Retrieved on August 10, 2008 from
http://www.hpl.hp.com/research/idl/papers/tags/tags.pdf
Hanemann, A., Sailer, M. & Schmitz, D. (2004) Assured service quality by improved
fault management. Proceedings of the 2nd International conference on Service
Oriented Computing. 183 – 192.
Jantti, M. & Eerola, A. (2006) A Conceptual Model of IT Service Problem
Management. Proceedings of the 2006 International Conference on Service
Systems and Service Management. Vol. 1. 798 – 803.
Kajko-Mattsson, M. (2002) Corrective Maintenance Maturity Model: Problem
Management. Proceedings of the 2002 International Conference on Software
Maintenance. 486 – 490.
Kelly, K. (2008) The Google Way of Science. The Technium. Retrieved on August
7, 2008 from
http://www.kk.org/thetechnium/archives/2008/06/the_google_way.php
Kelly, K. (2008) Lumpers and Splitters. The Technium. Retrieved on August 10,
2008 from
http://www.kk.org/thetechnium/archives/2008/01/lumpers_and_spl.php
Leszak, M., Perry, D. & Stoll, D. (2000) A case study in root cause defect analysis.
Proceedings of the 2000 International Conference on Software Engineering.
428 – 437.
Microsoft Corp. (2007) Service Management Functions, Problem Management.
Retrieved on January 1, 2008 from
http://www.microsoft.com/technet/solutionaccelerators/cits/mo/smf/smfprbmg.
mspx
Mockus, A. (2006) Empirical estimates of software availability of deployed systems.
Proceedings of the 2006 ACM/IEEE International Symposium on Empirical
Software Engineering. 222 – 231.
OGC. (2000) Best Practice for Service Support. London: TSO.
Oppenheimer, D. & Patterson, D. (2002) Studying and using failure data from large
scale internet services. Proceedings of the 10th workshop on ACM SIGOPS
European Workshop. 255 – 258.
Pustejovsky, J. (1995) The Generative Lexicon. MIT Press.
Rowley, J. (1995) Organising Knowledge. 2nd Ed. Brookfield, VT: Gower.
Simonite, T. (2008) Google tool could search out hospital superbugs. New Scientist.
Retrieved on January 04, 2008 from
http://www.newscientist.com/channel/health/dn13142googletoolcould
searchouthospitalsuperbugs.html.
Surowiecki, J. (2004) The Wisdom of Crowds: Why the Many Are Smarter Than the
Few and How Collective Wisdom Shapes Business, Economies, Societies and
Nations. Doubleday.
Tague, N. R. (2004) Pareto Chart. Retrieved February 10, 2008 from
http://www.asq.org/learnaboutquality/causeanalysis
tools/overview/pareto.html
Talluru, L. & Deshmukh, A. (1995) Problem management in decision support
systems: a knowledge-based approach. Proceedings of the 1995 IEEE
International Conference on Intelligent Systems for the 21st Century. Vol. 3,
Systems, Man and Cybernetics. 1957 – 1962.
Tanaka, J. & Taylor, M. (1991) Object Categories and Expertise: Is the Basic Level in
the Eye of the Beholder? Cognitive Psychology. Vol. 23. 457 – 482.
Appendix: Source Data Definition
Field (Type): Description

Arrival Time (Time stamp).
Assigned (Time stamp): Date and time the issue is assigned to an IT resource for resolution.
Case Category Type (String): Description of type and subtype of the incident. E.g., "Operating Systems | Windows XP | Blue Screen".
Case ID (ID): Auto-generated incident ID number.
Case Type (Enumeration): Type of request. Can have values of Incident, Problem, Question or Request.
Category (Enumeration): The high-level category of the incident. Incidents are categorised by application name, such as Oracle, Intranet, etc.
Create Date (Time stamp).
Department (String): Department name of the incident creator.
Description (String): Description of the incident.
Group (String): Group within the IT department which deals with this type of incident.
Hours to resolve (Number): How long the incident took to resolve.
Item (String): The lowest-level categorisation of the incident.
Priority (Enumeration): Requester-assigned priority. Can be one of Urgent, High, Medium or Low.
Region (String): Geographic region (continent) where the incident occurred. Can be one of APAC, EMEA, LAM or NAM.
Request Impact (Enumeration): The impact of the request. Can have values of High, Medium or Low.
Resolved (Time stamp): Date and time the incident was resolved.
Site (String): Name of the site at which the incident occurred.
SLA Parent (String): Reference to the Service Level Agreement this incident is being measured under.
Source (String): How this incident entered the tracking system. Can have values of Email, NMP (Network Management Protocol), Phone, Requester, Instant Messenger, TopTen or Web.
Status (Enumeration): The current status of the incident. Can be one of New, Assigned, Pending, Work in progress, Resolved or Closed.
Assigned.Time (Time stamp): Time the incident is assigned.
Work In Progress.User (String): Who is currently working on the incident.
Work In Progress.Time (Duration): How long the incident spent in the Work In Progress status.
Pending.Time (Duration): How long the incident spent in the Pending status.
Resolved.Time (Duration): How long the incident spent in the Resolved status.
Closed.Time (Duration): Date and time the incident was closed.
Summary (String): Brief description of the incident.
Type (String): Mid-level categorisation of the incident.
Urgency (Enumeration): How urgent a resolution is needed. Can have values of Urgent, High, Medium or Low.
Work Log (String): A log of work on the incident.