Event Management and Monitoring Program Strategy Prepared by: Jim Gingras, Event Management and Monitoring Manager

Table of Contents

1. Event Management and Monitoring Strategy
1.1 Event Management and Monitoring Overview
1.2 Stakeholders
1.3 Event Management Program Processes
1.3.1 Event Management Process
1.3.2 Event Monitoring
1.3.3 Designing Manageable Applications
1.4 Event Management Metrics
1.5 Roadmap
1.5.1 Current State to Future State
Appendix A: A Business Value Proposition for Event Management

1. Event Management and Monitoring Strategy

The strategy for Event Management and Monitoring is to take advantage of the existing event and monitoring processes and tools and build on them to propel IT at CORPORATE to the next level of IT capabilities, demonstrating business value through management and monitoring of IT services. Appendix A shows an example of how Event Management demonstrates business value.

The strategy involves the creation of an Event Management program and the associated projects, which are to be executed over the next two years.

The remainder of this document describes the Event Management Program and the supporting activities required to ensure it is operating as designed. These include:

Define Event Management and Monitoring

Define the stakeholders for event management and monitoring

Define the high-level processes and associated activities in the Event Management Program

Define metrics for determining the status of the processes

Define a roadmap of the actionable and measurable projects/initiatives required to establish the event management program

Event Management Program Definition: The Event Management Program is responsible for the management of the Event Management projects and monitoring systems required to determine the status of the services IT provides.

Event Management and Monitoring Definition: Event Management and Monitoring is the process of managing IT system and user events to provide the appropriate control action while providing a near real-time view of the status of the IT services.

1.1 Event Management and Monitoring Overview

Event Management’s value to the business is indirect: it cannot generate income for the business by itself. The most relevant measurements to the business are:

Decreased Mean Time To Repair – decreased downtime when incidents/problems occur, because personnel with the appropriate skill level are notified sooner and with the correct information to resolve issues and, whenever possible, to head them off before they occur.

Increased Mean Time Between Failures – analyzing trended event information to identify upcoming outages and remediate them before they occur (predictive monitoring).

Service Level Agreements are met or exceeded – due to decreased downtime.

Decreased IT support cost – due to appropriate personnel being notified, better use of knowledge from events in the environment, and fewer personnel required to resolve incidents/problems.
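As a rough illustration of the first two measurements, the sketch below computes Mean Time To Repair and Mean Time Between Failures from a list of outage records. The `Outage` record, the function names and the sample timestamps are assumptions made for this example; they are not taken from CORPORATE's Service Desk data, and MTBF is taken here as uptime divided by the number of failures, which is one common convention.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """Hypothetical outage record: when service was lost and when it was restored."""
    start: datetime
    end: datetime


def mean_time_to_repair(outages):
    """Average downtime per outage."""
    total = sum((o.end - o.start for o in outages), timedelta())
    return total / len(outages)


def mean_time_between_failures(outages, period_start, period_end):
    """Average uptime per failure over the observation period."""
    downtime = sum((o.end - o.start for o in outages), timedelta())
    uptime = (period_end - period_start) - downtime
    return uptime / len(outages)


# Illustrative data only: two outages in a 30-day observation period.
period_start = datetime(2024, 1, 1)
period_end = datetime(2024, 1, 31)
outages = [
    Outage(datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 11, 0)),     # 2 hours
    Outage(datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 15, 0)),  # 1 hour
]

mttr = mean_time_to_repair(outages)
mtbf = mean_time_between_failures(outages, period_start, period_end)
print("MTTR:", mttr)
print("MTBF:", mtbf)
```

Tracking both values per reporting period gives a simple trend line: a falling MTTR and a rising MTBF are the outcomes the program is expected to deliver.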

Event Management is the vital hub on which all process and tool integration is developed. Event monitoring encompasses all the activities that are required to ensure a device or Configuration Item (CI) [1] is working correctly, regardless of whether it is generating events.

The foundational elements for event management are the system and user events that are created by CIs or monitoring tools. To enable monitoring of IT services, these events are mapped to all the related CIs of a specific IT service. Going forward, a service view will be available to all management and service/support personnel to show the status and configuration information in an easy-to-understand format.

1.2 Stakeholders

Position                        Name    Description
Event Management                        Event Management Process Activities
Incident Management                     Automated Incident Management for events
Problem Management                      Troubleshooting and enhancements for Known Errors
Availability Management                 Monitoring Requirements
Capacity Management                     Monitoring Requirements
Operations                              Monitoring Requirements
Steering Committee                      Program Management and Reporting
IT Instrumentation                      Monitoring Tools and Reporting
Infrastructure Hosting                  Monitoring Requirements, Monitoring Tools
Software Solutions and Support          Systems Administration
Architecture                            Instrumentation of Internal Applications and RJSF design and Service Model
Security                                Monitoring Requirements for Security
Service Management                      Monitoring Requirements
Product Management                      Service Model and Monitoring Requirements
Release Management                      Service Model
Configuration Management                Service Model
Service Level Mgmt.                     Service Model and Service Level Requirements
Software Solutions and Support          Internal Software Development, RJSF

Table 1: IT Event Management Stakeholders

[1] Configuration Items include services, applications, or components as per the CORPORATE service model in the CMDB.

The stakeholders for the Event Management Program are management and the process owners for the ITIL processes of availability, capacity, incident, problem, and event management. Additional stakeholders include the administrative groups who must manage the tools that are required to deliver the event management services and develop manageable applications. All stakeholders are required to agree on service views that provide accurate and relevant service status to service/support personnel in support of the business.

1.3 Event Management Program Processes

The Event Management program is responsible for the Event Management process and for the direction of the Event Monitoring environment. It also integrates with the ITIL Service Design processes of Availability, Capacity and Security Management for monitoring requirements and capabilities, and with the Incident and Problem Management processes as a source of inputs and outputs for automated remediation or notification activities based on significant events. Event Management plays a significant role in the Continual Service Improvement processes as a point of research, audit and verification.

Considerations for Event Management are also required as part of application development processes (e.g. RJSF), starting with application design and development. Additionally, the creation and management of service-based views enables the next generation of event monitoring for service status events.

1.3.1 Event Management Process

The Event Management process is the process that monitors all events that occur through the IT Infrastructure to allow for normal operation and also to detect and escalate exception conditions.

Figure 1: ITIL V3 Event Management Process

The figure shows that the event management process is responsible for detection, filtering, triggering, alerting, automated response and reviewing actions. The triggers and automated response will control the scope of the work required by the event management process. In other words, the more triggers and automated responses that are required, the more work must be accomplished to automate the response and increase the business value.

One of the keys to a successful Event Management program is to define which actions trigger the event management process and to manage the number and priority of those events. Triggers include:

Exceptions to any level of Configuration Item (CI) performance defined in design specifications, SLAs, OLAs and SOPs

Exceptions to an automated procedure or process – monitoring an automated workflow

Exceptions within an IT process that is being monitored (e.g. server build)

The completion of an automated task or job

A status change in a device or database record, depending on the granularity of the monitoring requirements

Access of an application or database by a user or automated procedure or job

A situation where a device, database, application, or service has reached a predefined performance threshold

For the current state, the most important aspect of the Event Management process is that all types of alerts will result in an incident being opened in the Service Desk.
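The last trigger in the list, a predefined performance threshold, is the simplest to illustrate. The sketch below is a minimal example under stated assumptions: the `Threshold` and `Event` structures, the metric name, the CI name and the threshold values are all hypothetical, and the two significance labels borrow the Exception/Warning wording used for the metrics in section 1.4.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    """Hypothetical threshold definition for one metric of a CI."""
    metric: str
    warning: float   # value at or above which a Warning event is raised
    critical: float  # value at or above which an Exception event is raised


@dataclass
class Event:
    ci: str
    metric: str
    value: float
    significance: str  # "Warning" or "Exception"


def check_threshold(ci: str, value: float, threshold: Threshold) -> Optional[Event]:
    """Return an event if the sample breaches the threshold, else None."""
    if value >= threshold.critical:
        return Event(ci, threshold.metric, value, "Exception")
    if value >= threshold.warning:
        return Event(ci, threshold.metric, value, "Warning")
    return None  # normal operation: no event is triggered


# Illustrative values only.
disk = Threshold(metric="disk_used_pct", warning=80.0, critical=95.0)
events = [check_threshold("srv-01", v, disk) for v in (42.0, 85.5, 97.0)]
for e in events:
    print(e)
```

Keeping the threshold values in data rather than in code is one way to manage the number and priority of triggered events without redeploying the monitor.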

1.3.2 Event Monitoring

Event Monitoring covers a broad spectrum of all the monitoring capabilities across the CORPORATE IT enterprise. The Event Management architecture deployed today uses a Manager of Managers (MOM) to gather events from all the IT Management Domains. The major IT Domains are Application, Database, End User, Facilities, Network, Security, Server Platform (which includes virtual), Storage, Telephony, and Workload.

Figure 2: Manager of Managers Architecture

Although all events are monitored, only significant events are managed because they are meaningful. This is accomplished through filtering at the IT Domain level to identify events that are recognized as affecting the status of Configuration Items (CI) (i.e. Service, Application, and Component), automation processes or other significant occurrences. The Manager of Managers then correlates the events from each of the IT Domains, determines the course of action and executes an automated response. For all significant events, an incident will automatically be opened, assigned and prioritized in the Service Desk.
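The filter-then-correlate flow described above can be sketched as follows. This is a minimal illustration, not the deployed architecture: the `Event` fields, the severity names, the rule of correlating by CI and the incident dictionary are all assumptions made for the example, and `open_incidents` merely stands in for a Service Desk integration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    domain: str    # IT Domain that reported the event, e.g. "Network"
    ci: str        # configuration item the event refers to
    severity: str  # "critical", "major", "minor" or "info"
    message: str


SIGNIFICANT = {"critical", "major", "minor"}


def domain_filter(events):
    """Domain-level filtering: forward only significant events."""
    return [e for e in events if e.severity in SIGNIFICANT]


def correlate(events):
    """Simplified correlation: group forwarded events by CI so that several
    domain events about the same CI lead to a single incident."""
    by_ci = defaultdict(list)
    for e in events:
        by_ci[e.ci].append(e)
    return by_ci


def open_incidents(correlated):
    """Stand-in for the Service Desk: one incident per CI, prioritized by
    the most severe event in the group (1 is the highest priority)."""
    rank = {"critical": 1, "major": 2, "minor": 3}
    incidents = []
    for ci, group in correlated.items():
        priority = min(rank[e.severity] for e in group)
        incidents.append({"ci": ci, "priority": priority, "event_count": len(group)})
    return incidents


# Illustrative events from four IT Domains.
raw = [
    Event("Network", "payroll-app", "critical", "link down"),
    Event("Server Platform", "payroll-app", "major", "service unreachable"),
    Event("Storage", "san-02", "info", "snapshot completed"),
    Event("Database", "hr-db", "minor", "slow query"),
]
incidents = open_incidents(correlate(domain_filter(raw)))
for incident in incidents:
    print(incident)
```

In this sketch the informational storage event is dropped at the domain level, and the two events about the same CI collapse into one incident, which is the effect the filtering and correlation layers are meant to achieve.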

A major portion of the Event Management Program includes creating interfaces that enable monitoring at the services level. The best approach is to start with a few significant services to demonstrate the business value of monitoring services.

1.3.3 Designing Manageable Applications

To optimize operational management of applications, the applications must be designed with operations in mind. This requires that monitoring requirements are identified during the application design phase of the application lifecycle and instrumented during the application development cycle. One of the key deliverables that enables this type of monitoring is the “health model”, which relates the status of individual components to the status of the overall application or service. For internally developed applications, CORPORATE has embraced the use of management packs as a means of ensuring the supportability of applications. This initiative is in line with Microsoft’s Design for Operations methodology.
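A health model can be illustrated with a very small rollup rule. The sketch below assumes a "worst state wins" policy, in which the application is reported as being as unhealthy as its least healthy component; that policy, the `Status` levels and the component names are assumptions for this example rather than anything prescribed by CORPORATE's management packs.

```python
from enum import IntEnum


class Status(IntEnum):
    """Ordered so that a higher value means a worse state."""
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def rollup(component_statuses):
    """Worst-state rollup: the application takes the status of its worst
    component. An application with no components reports HEALTHY."""
    return max(component_statuses.values(), default=Status.HEALTHY)


# Hypothetical components of one internally developed application.
components = {
    "web-frontend": Status.HEALTHY,
    "order-service": Status.WARNING,
    "database": Status.HEALTHY,
}
app_status = rollup(components)
print(app_status.name)
```

Other rollup policies are possible, for example requiring a percentage of redundant components to fail before the application degrades; the choice belongs in the health model defined at design time.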

1.4 Event Management Metrics

Once Event Management is in place, a baseline must be established for the current performance levels and value to the organization in terms of optimizing operations activities and Mean Time To Repair. The following metrics are recommended by ITIL v3:

1. Number of events per category – IT Domain, Service, Application
2. Number of events by significance – Exception (Critical or Major), Warning (Minor), or Informational (non-exception/warning application messages)
3. Number and percentage of events that required human intervention and whether this was performed – incidents are not opened
4. Number and percentage of events that resulted in incidents or changes
5. Number and percentage of events caused by existing problems or Known Errors
6. Number and percentage of replicated or duplicated events
7. Number and percentage of events indicating performance issues
8. Number and percentage of events indicating potential availability issues
9. Number and percentage of each type of event per platform or application
10. Number and ratio of events compared with the number of incidents

Further research must be done to determine how to derive these metrics and associated reports with the existing monitoring tools. Service Desk and the Manager of Managers are good places to begin this work. These metrics will enable the “tuning” of the event management system through the adjustment of the filters and correlation engine in the domain managers and Manager of Managers, respectively.
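Pending that research, the sketch below shows how a few of the listed metrics (1, 2, 4, 6 and 10) could be derived once event records are available in a common form. The record fields and sample values are assumptions for illustration; the actual fields depend on what the Service Desk and the Manager of Managers can export.

```python
from collections import Counter

# Hypothetical event records; field names are assumptions for this sketch.
events = [
    {"domain": "Network", "significance": "Exception", "incident": True, "duplicate": False},
    {"domain": "Network", "significance": "Exception", "incident": False, "duplicate": True},
    {"domain": "Server Platform", "significance": "Warning", "incident": True, "duplicate": False},
    {"domain": "Database", "significance": "Informational", "incident": False, "duplicate": False},
]


def percentage(part, whole):
    """Percentage rounded to one decimal place; 0.0 when there are no events."""
    return round(100.0 * part / whole, 1) if whole else 0.0


total = len(events)
per_domain = Counter(e["domain"] for e in events)              # metric 1
per_significance = Counter(e["significance"] for e in events)  # metric 2
incident_count = sum(e["incident"] for e in events)            # metric 4
duplicate_count = sum(e["duplicate"] for e in events)          # metric 6

report = {
    "events_per_domain": dict(per_domain),
    "events_per_significance": dict(per_significance),
    "incident_pct": percentage(incident_count, total),
    "duplicate_pct": percentage(duplicate_count, total),
    # metric 10: ratio of events to incidents
    "events_per_incident": total / incident_count if incident_count else None,
}
for name, value in report.items():
    print(name, value)
```

A high duplicate percentage or a high events-per-incident ratio would point to filters or correlation rules that need adjusting, which is the tuning loop described above.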

1.5 Roadmap

Figure 3: Event Management Program Roadmap

The high-level roadmap for the IT Event Management Program has eight projects:

1. Define the Event Management strategy and program, including deliverables:
   a. Event Management Strategy
   b. Event Management Process
   c. Event Handling Policies and Standards
      i. Notification/Escalation policies and standards
   d. Event Management projects/initiatives
   e. Event Management program roadmap
2. Establish the Event Management Program through:
   a. Ratification of the event management and monitoring processes and activities
      i. Ratification of event handling policies and standards
   b. Communicate and gather support for event management program activities in collaboration with stakeholders to agree on deliverables
   c. Establish a timeline for completing the work activities and deliverables
3. Integration with other ITIL management processes, including:
   a. Incident Management – for automation of incident management process activities where applicable
      i. Automatically manage incidents from user events (transactions)
      ii. Automatically manage incidents from system events
      iii. Automatically manage incidents from service events
   b. Availability Management – for availability monitoring requirements of service components
   c. Capacity Management – for the capacity monitoring requirements of the service components
   d. Problem Management – for event information in the Known Error Database and for verification and audit of the root cause of problems
4. Integration with application design and development – for internally developed applications through IT architecture and Software Engineering
   a. Adoption of management packs for monitoring applications
      i. Microsoft Management Packs for internally developed applications on the Windows platform
   b. Propagation of configuration and status information to service views based on the service and health models
5. Integration with third-party applications
   a. Adoption of management pack methodology for management of events from third-party applications
      i. Create deliverables that are platform dependent
      ii. Coordinate with instrumentation and system support for instrumentation lifecycle (design, develop, test, deploy)
6. Consolidation/Correlation of Domain level events
   a. Completion of integration of critical, major and minor events across all IT Domains to the Manager of Managers
   b. Implementation of correlation policies/rules to forward significant events for incidents and alerts
7. Integration with the IT Service/Support groups through the creation and management of service views and related configuration items
   a. Role-based service dashboards for user groups
8. Continuous process improvement
   a. Audit and verify quality and efficiency of existing event management and monitoring systems and adjust filters and correlation engines to streamline automation

1.5.1 Current State to Future State

Event Management and Monitoring has been in place for years at CORPORATE. It has matured to a level where events are triggering workload and other automation/remediation, as well as automated notification/escalation. In terms of maturity level, CORPORATE is between reactive and proactive. There are specific cases where we are at the predictive level (batch monitoring), but this is the exception. There are many management/monitoring tools in place across all the IT Domains. The two major tasks that must be accomplished in order for the Event Management program to be successful are:

Mature IT monitoring from a reactive/proactive level to a proactive/predictive maturity level through automation of event responses for all significant events.

Consolidate and correlate all the events into meaningful status information for CIs like applications, systems and IT services.

The biggest enabler going forward is the use of ITIL v3 as the framework for managing IT. This provides a common vernacular and helps establish accepted governance processes for event management and monitoring. Use of a framework combined with the use of the services construct to represent IT value to the business provides a new level of event management and monitoring for CORPORATE.

Appendix A: A Business Value Proposition for Event Management [2]

In simple terms, event management enables real-time monitoring of the infrastructure (i.e. listening for things that are wrong), and uses event correlation to filter, de-duplicate and combine events to detect more serious issues. Event Management is important because it will:

Improve time to resolve through cause identification

Improve real-time visibility

Enable proactive management of impact to the business (IT calls the business)

Improve Security Management

Studies show that fault detection and root-cause analysis are the most important systems management capabilities. Studies also show that the most time-consuming systems management tasks are diagnosis and troubleshooting. Event Management enables proactive responses to events and enables automatic tracking and resolution for most system events. The scenarios below show the difference when event management is implemented and when it is not [3].

[2] Taken from Data Network Event Management and ITIL, Cisco, Keith Sinclair.

[3] The scenarios below use a network device issue as the example. CORPORATE is monitoring all infrastructure domains at some level, as described in the Event Monitoring section of this document.

Figure 4: Scenario - Situation normal (without Event Management)

Figure 5: Scenario - Situation with Event Management

The bottom line is that Event Management allows IT to resolve issues before the users are affected. Armed with reports that show the effectiveness of Event Management, IT can show the business how effective it is and demonstrate real business value.
