MAJOR INCIDENT PROCESS Overview Version 2.2

HUIT Major Incident Process - Harvard University …huit.harvard.edu/files/huit/files/20141102_huit_major... · Web viewHUIT Major Incident Process Keywords ITSM, ITIL, Incident Managment


MAJOR INCIDENT PROCESS

Overview

Version 2.2

May 6, 2023

Matthew Wollman


Document Change Control

Version # | Date of Issue | Author(s) | Brief Description
0.1 | 8/3/2012 | Matthew Wollman | Start of Document
0.2 | 8/21/2012 | Matthew Wollman | Incorporated feedback from Courtney Harwood, Richard Ohlsten and Steve Martino
0.3 | 8/28/2012 | Matthew Wollman | Made major modifications to Responsibilities and Workflow. Added definitions for critical, core, and non-core service
0.4 | 9/10/2012 | Matthew Wollman | Incorporated feedback from Dennis Ravenelle. Drafted water mark; reordered Objectives and Policy by importance; further clarified definition; added additional Role responsibilities; expanded Process activities; made grammatical changes
1.0 | 9/24/2012 | Matthew Wollman | First release of document after core team approval. Removed P1 and P2 differences
1.1 | 11/27/2012 | Matthew Wollman | Separated Incident Commander and Incident Communications roles; added text about criteria for hierarchical escalations
2.0 | 2/13/2013 | Matthew Wollman & Janet Crystal | Combined Purpose and Scope, and objectives and policies. Reorganized roles and responsibilities by order of role involvement in process. Reorganized and reduced Process activities section to a high-level overview. Process activities will be detailed in separate documentation
2.1 | 8/15/2014 | Matthew Wollman | Change to RACI, Service Owner is Accountable for External Communications, Removed C-Cure to Critical Services
2.2 | 11/2/2014 | Matthew Wollman |


Table of Contents

Document Change Control
Purpose and Scope
Policies
Process Roles and Responsibilities
    Incident Commander
        Incident Commander Escalation
    Incident Communicator
    Service Desk
    SOC Operations
    Technical Resources (Infrastructure, Development, DevOps, etc.)
    Technical Line Manager
    Service Owner / Practice (or Product) Manager
Process Activities
    Major Incident Identification
    Initial Communication and Escalation
    Incident Coordination
    Conference Bridge
    External Communication
    Internal Communication
    Investigation
    Resolution
    Incident Documentation
Appendix A: Process Flowchart for a Major Incident
Appendix B: RACI Matrix
Appendix C: Critical Services
Appendix D: Major Incident Process Timeframes (Estimated)
Glossary


Purpose and Scope

The Harvard University Information Technology (HUIT) Major Incident process provides a unified system for resolving Major Incidents as quickly as possible through proper identification, predefined escalation paths, and prompt communication procedures across all HUIT services.

A Major Incident is the interruption or degradation of a core production service (any centralized HUIT-provided service that serves multiple customers and users) that disrupts its customers’ ability to carry out University teaching, learning, research, and/or administration.
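The definition above can be read as a conjunction of conditions. The following is a minimal sketch of that reading, not part of the HUIT process itself; the ServiceEvent record and its field names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceEvent:
    """Hypothetical record describing a reported service disruption."""
    is_core_service: bool            # centralized, serves multiple customers and users
    is_production: bool              # the production service is the one affected
    interrupted_or_degraded: bool    # the service is down or noticeably impaired
    disrupts_university_work: bool   # teaching, learning, research, or administration


def is_major_incident(event: ServiceEvent) -> bool:
    """Return True only when every condition in the definition holds."""
    return (
        event.is_core_service
        and event.is_production
        and event.interrupted_or_degraded
        and event.disrupts_university_work
    )
```

Under this reading, a degraded service provided to a single customer group would not qualify, because it is not a core service.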

The scope of this document is to provide an overview of the processes that apply to every Major Incident for all HUIT services and that all HUIT employees must follow. Once trained, all HUIT employees will be able to identify a Major Incident and to escalate it to the appropriate technical group for resolution.

Policies

1. HUIT’s focus is to alert the community to the occurrence of a Major Incident as quickly as possible. Early notification of a potential issue is more important than an accurate description of the problem.
2. HUIT will use standardized methods and procedures to enable an efficient and prompt response, analysis, documentation, ongoing coordination and ownership, communication, and reporting.
3. Escalation in a Major Incident will start with the Incident Commander and move to the HUIT employees most responsible for each service.
4. HUIT will communicate with affected end-users regularly throughout the lifecycle of a Major Incident.
5. HUIT will maintain a consistent and regular presence through open communications among HUIT staff and will provide consistent updates to the Service Desk, Service Owner, Incident Manager, and HUIT leadership.
6. HUIT will log and document all details of Major Incidents throughout the lifetime of each event.


Process Roles and Responsibilities

Incident Commander

The Incident Commander has the highest level of responsibility during a Major Incident and is accountable for its lifecycle through coordination, documentation, and communication. The roles of HUIT Incident Commander and HUIT Incident Communicator may be combined in one person for incidents that are of short duration or that are deemed less critical. For incidents of longer duration or those with greater impact, the responsibility of the Incident Commander can be escalated to a Manager or Director in HUIT.

The Incident Commander is responsible for the following activities:

• Facilitating and participating in a conference bridge
• Maintaining communication with Technical Resources and Service Owners for status updates and additional information
• Coordinating resources needed to troubleshoot, communicate, and/or make decisions to resolve a Major Incident
• Ensuring that internal and external communications about a Major Incident are completed in a timely manner
• Creating and completing a Major Incident Report

Incident Commander Escalation

If the scale of the event requires escalation to a HUIT Manager or Director, the responsibilities for the Incident Communicator role will remain with the original Incident Commander. The following conditions, whether individual or in combination, will guide the need for escalation of Incident Commander responsibilities to a higher level of HUIT management:

1. A Major Incident affects one of the Critical Services listed in Appendix C of this document.
2. A Major Incident affects over 1,000 users of one or more services.
3. A Major Incident is not or cannot be resolved within four hours.
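The three conditions above can be sketched as a simple check. This is an illustrative sketch only: the function name, parameter names, and reason strings are assumptions, and the process treats these conditions as guidance for a human decision rather than an automatic rule. The "cannot be resolved" case is modeled as a caller-supplied forecast flag.

```python
from datetime import timedelta

USER_THRESHOLD = 1_000                  # "over 1,000 users"
RESOLUTION_LIMIT = timedelta(hours=4)   # "within four hours"


def escalation_reasons(affects_critical_service: bool,
                       users_affected: int,
                       time_unresolved: timedelta,
                       expected_to_exceed_limit: bool = False) -> list[str]:
    """Return the conditions that suggest escalating Incident Commander
    responsibilities to a HUIT Manager or Director (empty list if none)."""
    reasons = []
    if affects_critical_service:
        reasons.append("critical service (Appendix C)")
    if users_affected > USER_THRESHOLD:
        reasons.append("more than 1,000 users affected")
    # Either the limit has already passed, or staff judge that it will be missed.
    if time_unresolved >= RESOLUTION_LIMIT or expected_to_exceed_limit:
        reasons.append("not resolvable within four hours")
    return reasons
```

Because the conditions apply individually or in combination, any non-empty result is enough to prompt a discussion about escalation.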


Incident Communicator

The Incident Communicator is responsible for the documentation and communication during a Major Incident, both internally to HUIT and externally to customers and end-users. The roles of HUIT Incident Commander and HUIT Incident Communicator may be combined in one person for incidents that are of short duration or that are deemed less critical. For incidents of longer duration or those with greater impact, the responsibility of the Incident Commander can be escalated to a Manager or Director in HUIT.

The Incident Communicator is responsible for the following activities:

• Participating in a conference bridge
• Communicating internally to HUIT staff and externally to the customers of the service, end-users, and other non-HUIT parties
• Maintaining a record of events throughout a Major Incident
• Notifying HUIT staff and any external parties of the resolution
• Updating the HUIT website, Twitter, Facebook, and email distribution lists with notifications of incidents, updates, and resolution

Service Desk

The Service Desk is responsible for the following activities:

• Identifying a Major Incident
• Escalating a Major Incident to the HUIT Incident Commander
• Logging Major Incident tickets for end-users
• Participating in a “servicedesk” Jabber chat room or the conference bridge
• Placing a generic Major Incident message on the ACD system

SOC Operations

The SOC Operations group is responsible for the following activities:

• Identifying a Major Incident
• Escalating a Major Incident to the HUIT Incident Commander
• Logging Major Incident tickets for end-users
• Participating in a conference bridge
• Notifying the Service Desk during business hours of any Major Incident


Technical Resources (Infrastructure, Development, DevOps, etc.)

Any HUIT Technical Resource who receives alerts or escalations, or who has a role in restoring HUIT services to normal operation, is responsible for the following activities:

• Identifying a Major Incident
• Escalating a Major Incident to the Technical Line Manager
• Troubleshooting and working to resolve the incident in accordance with internal procedures for handling Major Incidents
• Documenting incident details and steps taken to resolve the underlying problem
• Providing regular updates to the Line Manager and/or the Incident Commander on the status of an investigation and the resolution of the incident

Technical Resource Manager

Any HUIT manager who manages technical resources and their performance is responsible for the following activities:

• Identifying and escalating a Major Incident to the HUIT Incident Commander
• Identifying the scope of the problem and identifying additional services that may be affected by a Major Incident
• Notifying respective service areas and providing updates throughout the lifecycle of a Major Incident
• Participating in a conference bridge
• Facilitating communication among technical resources, HUIT Incident Commander and Service Owner
• Recording and tracking progress throughout the lifecycle of a Major Incident and providing updates to the Incident Commander and Service Owner
• Estimating the service recovery time
• Managing the activities of the Technical Resources

Service Owner / Practice (or Product) Manager

Any HUIT employee or their proxy who is responsible for the overall quality of a service and has the most comprehensive knowledge of its components is responsible for the following activities:

• Identifying a Major Incident
• Participating in a conference bridge
• Notifying the Service Desk during business hours of a Major Incident
• Identifying the business impact of a Major Incident
• Communicating externally to the customers of the service, end-users and other non-HUIT parties
• Maintaining a record of events throughout a Major Incident
• Confirming that resolution of a Major Incident is in place
• Notifying HUIT staff and any external parties of the resolution after confirmation


Process Activities

HUIT maintains detailed descriptions of the following activities in separate documents. They are listed below in this document for high-level reference and overview.

Major Incident Identification

• Major Incidents can be initiated by customers, reports from users, observations, monitoring, Event Management, and/or Change Management.

Initial Communication and Escalation

• As soon as HUIT staff have identified a suspected Major Incident, they must escalate it immediately to the HUIT Incident Commander.
• The Incident Commander will declare the event as a Major Incident and set its priority.
• The Incident Commander will escalate it to the appropriate technical groups and service owners.
• After declaration of a Major Incident, the Incident Communicator will email the Service Desk with appropriate information and place a service alert on the HUIT website.

Incident Coordination

• The Incident Commander will involve and consult with all necessary parties to resolve the incident as quickly as possible.
• The Incident Commander will facilitate conference bridges to ensure that information is disseminated in a timely manner, that time spent on the bridge is focused and that troubleshooting can continue.
• The Incident Commander will escalate the incident to additional resources, including hierarchical escalations as necessary.

Conference Bridge

• Once notified of a Major Incident, the Incident Commander will use a conference bridge that includes all affected groups to maintain communication between the technical resources and the service owner(s).
• The Incident Commander will determine the appropriate schedule for calling a conference bridge and its duration after the initial assessment.

External Communication

• Throughout the Incident, HUIT will use its website as the primary location for information updates.
• HUIT will distribute Incident notification(s) to external customers, add an outgoing message to the Service Desk ACD system (as necessary), and send a tweet whose content will also appear on HUIT's Facebook page and in Harvard’s Yammer community.


Internal Communication

• The Incident Communicator will create a Major Incident ticket to be available in Remedy.
• The Incident Communicator will send a notification containing internal details of the Major Incident.
• The Incident Communicator will notify the Operational Managing Directors, as necessary.

Investigation

• HUIT will investigate continuously throughout a Major Incident and coordinate updates with vendors, developers, and end-users.

Resolution

• Service Owners have final sign-off authority on the resolution of a Major Incident and ensure end-user notification.

Incident Documentation

• The Incident Communicator will document the initial assessment of the incident's root cause (if known), create a timeline, and establish the steps taken for investigation and resolution.
• The Service Owner(s) and Technical Line Manager(s) will forward any notes or timelines that they have maintained throughout the incident to the Incident Commander.


Appendix A: Process Flowchart for a Major Incident


Appendix B: RACI Matrix

A = Accountable, R = Responsible, C = Consulted, I = Informed


Roles (matrix columns, left to right): Incident Commander, Incident Communicator, Service Desk, SOC Operations, Technical Resource, Technical Line Manager, Service Owner

Entries for each activity are listed in column order; roles with no involvement in an activity have no entry.

Incident Identification: A  R  R  R  R  R
Initial Communications: A,R  R  R  C  C
Escalation: A,R  R  R  R  R  R
Incident Coordination: A,R  C  C
Conference Bridge: A,R  R  I  C  R
External Communication: R  R  I  I  C  A,R
Internal Communication: C,I  R  A  C,I
Investigation: I  I  I  R  C,I  A
Resolution: A,R  R  R  R
Incident Documentation: A,R  R  C  C  C  C  C
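A RACI row can be held as a simple list aligned with the role columns. The sketch below is illustrative only and uses the Incident Documentation row, which has an entry for every role (A,R R C C C C C); the helper names are assumptions. It also checks the common RACI convention that each activity has exactly one Accountable role.

```python
# Role columns of the matrix, in order.
ROLES = [
    "Incident Commander", "Incident Communicator", "Service Desk",
    "SOC Operations", "Technical Resource", "Technical Line Manager",
    "Service Owner",
]

# Incident Documentation row: one cell per role, in column order.
INCIDENT_DOCUMENTATION = ["A,R", "R", "C", "C", "C", "C", "C"]


def roles_with(code, row):
    """Return the roles whose cell in `row` contains the RACI letter `code`."""
    return [role for role, cell in zip(ROLES, row) if code in cell.split(",")]


def has_single_accountable(row):
    """Check the usual RACI convention of exactly one Accountable role."""
    return len(roles_with("A", row)) == 1
```

For example, roles_with("R", INCIDENT_DOCUMENTATION) lists the Incident Commander and the Incident Communicator, matching the Incident Documentation activity described earlier.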


Appendix C: Critical Services

1. Central Networking Services
   a. DNS, DHCP, Infoblox
   b. Core / Data Center Routers
   c. Core / Data Center Firewalls
   d. Load Balancer
2. Data Center
   a. Facilities / Power
   b. Shared Storage
   c. Virtualization
3. E-mail
4. PIN / LDAP
5. University website
6. College website
7. Phone System / Voicemail / i3
8. PeopleSoft
9. Oracle Financials
10. HarvIE
11. CAADS
12. iSites / Canvas


Appendix D: Major Incident Process Timeframes (Estimated)


T0: Major Incident Identified

• Major Incident Escalated to Incident Commander
• Service Desk Informed
• Service Owner Informed
• Technical Resources Informed
• Call Bridge Opened

T+30: Initial Communications

• Service Desk Updates ACD System
• Service Desk Logs Remedy Ticket
• Incident Commander Sends HUIT Alert
• Incident Commander Updates Website
• Service Owner or Incident Commander Sends External Notification

T+45: First Update

• Initial Diagnosis?
• Estimated Time to Resolution?
• Additional Communications Need to be Sent?
• Agree upon Update Times and Intervals (e.g., every 30 minutes)

T+Interval: Regular Updates

• Update on Progress?
• Additional Resources?
• Updated Communications?

Resolution

• Service Owner Confirms Service is Restored to Acceptable Levels
• Incident Commander Notifies HUIT Alert
• Incident Commander Updates Website
• Service Owner Sends External Communication
• Incident Commander Resolves Major Incident
• Incident Commander Begins Incident Report
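The estimated checkpoints above can be turned into clock times for a given incident. The following is a rough sketch under two assumptions that the appendix does not state outright: the T+30 and T+45 offsets are in minutes, and regular updates are counted from the first update. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def milestone_times(t0, update_interval_min=30, regular_updates=3):
    """Return (label, time) pairs for the estimated checkpoints.

    Assumes offsets are minutes after T0 and that regular updates
    follow the first update at the agreed interval.
    """
    first_update = t0 + timedelta(minutes=45)
    schedule = [
        ("T0: identified and escalated", t0),
        ("T+30: initial communications", t0 + timedelta(minutes=30)),
        ("T+45: first update", first_update),
    ]
    for n in range(1, regular_updates + 1):
        when = first_update + n * timedelta(minutes=update_interval_min)
        schedule.append((f"Regular update {n}", when))
    return schedule


# Example: an incident identified at 09:00 with updates every 30 minutes.
for label, when in milestone_times(datetime(2014, 11, 2, 9, 0), 30, 2):
    print(when.strftime("%H:%M"), label)
```

The timeframes are estimates, so a schedule like this is a planning aid for the conference bridge, not a commitment.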


Glossary

Core Service: Any centralized HUIT-provided service that serves multiple customer groups and end-users. See Non-Core Service.

Critical Service: Any service whose failure or degradation creates an immediate and large-scale impact. See Appendix C.

Incident Commander: The owner of a Major Incident, responsible for its lifecycle, including coordination, documentation, and communication.

Major Incident: A Major Incident occurs when a core production service is interrupted or degraded, resulting in a noticeable disruption of the customers’ ability to carry out University teaching, learning, research and administration.

Non-Core Service: Any HUIT service that is hosted or provided to one specific customer or group of users for a non-centralized purpose.

Service Owner: In the context of the Major Incident process, the service owner is a HUIT staff member who has a comprehensive view of the service including but not limited to customer and user relationships, a broad understanding of the components required to deliver that service, and the expectations for the quality set for that service.

Utility: The functionality offered by a service to meet a particular need. Utility can be summarized as ‘what a service does’, and can be used to determine whether a service is able to meet its required outcomes or is ‘fit for purpose’. The business value of an IT service is created by a combination of utility and warranty.

Warranty: Assurance that a product or service will meet agreed requirements. This may be a formal agreement such as a service level agreement or contract, or it may be implied through ad-hoc messages or agreements. Warranty refers to the ability of a service to be available when needed, to provide the required capacity, and to provide the required reliability in terms of continuity and security. Warranty can be summarized as 'how the service is delivered', and can be used to determine whether a service is 'fit for use'. The business value of an IT service is created by the combination of utility and warranty. See also service validation and testing.
