Service-Centric AIOps
How OpsRamp OpsQ drives IT operational efficiency and faster response at scale.
[Whitepaper]
Table of Contents
Introduction
Chapter 1: Intelligent Alerting And Thresholds
Chapter 2: Alert Correlation
Chapter 3: Auto-Remediation And Escalation
Conclusion
Introduction
Automation is the catalyst for efficiency in the future of IT operations. Automation driven by artificial intelligence will be a huge competitive advantage for digital operations teams drowning in data, alerts, and conflicting business priorities. The need to get more work done in less time (with fewer resources) creates a perfect storm for the adoption of artificial intelligence for IT operations, or AIOps.
AIOps is a game-changing technology that’s still in its infancy. As cloud-native services and hybrid infrastructures present new levels of complexity, AIOps has become a critical ingredient for scaling the IT operations function. The promise of AIOps is tremendous, from machine data analytics to faster and more efficient IT service management.
What’s more, AIOps that’s service-centric offers enterprise IT teams a truly unique solution for managing infrastructure complexity, maintaining digital services, and meeting the needs of the business. A service-centric approach to AIOps will help prevent digital disruptions by proactively monitoring system health, reducing alert storms and remediating issues quickly.
This paper will focus on three routine operational tasks that machine learning and related computational approaches can streamline: intelligent alerting, alert correlation, and auto-remediation and escalation. You’ll learn how OpsRamp OpsQ delivers cross-domain analysis by analyzing, normalizing and processing vast datasets across your IT environment for greater contextual visibility and proactive insights:
• Improved Efficiency: Advanced inference models to determine cause and effect in complex patterns and handle previously unmanageable complexity.
• Automated Incident Detection and Resolution: Machine learning-driven root cause(s) analysis and anomaly detection to speed up the incident resolution.
• Multi-Cloud Optimization: Intelligent capacity analysis and continuous optimization for dynamic cloud environments.
Chapter 1: Intelligent Alerting And Thresholds
Network Operations Center (NOC) teams face a steady stream of alerts from modern digital services, hosted on physical and virtual devices in enterprise datacenters or on IaaS and PaaS services in the public cloud. Alerts fire when monitored metrics for managed IT resources breach pre-defined thresholds; these metrics represent capacity utilization or performance data for those resources.
As part of the monitoring practice, IT operations teams perform the following tasks associated with alerting:
1. Correlation of ingested alerts to a common cause
2. Triage and prioritization of alerts for resolution
3. Integration with ITSM tools for improved problem management, including the creation of incident tickets for issues that need further root-cause analysis
Each step in this workflow consumes human time to execute. The average time spent across task flows, per ingested alert, grows disproportionately with the number of incoming alerts. With OpsRamp OpsQ, you can implement intelligent alerting and automate as much of each task as possible in order to reduce the human time spent per alert.
There are two primary types of metrics that trigger alerts:
• Performance Metrics: Response times to user requests, the latency of disk I/O operations.
• Capacity Metrics: Disk space remaining, memory utilization.
The goal of intelligent alerting is to identify capacity or performance degradation conditions before there is any user impact. There are two types of intelligent alerting use cases that you’ll frequently encounter in practice:
• Conditions with a known threshold: A majority of metrics have known thresholds that represent degraded capacity or performance conditions. For example, sustained CPU utilization > 90% on a server represents a constrained CPU condition.
• Conditions without a known threshold: For certain types of metrics, you cannot establish thresholds in practice because their “normal” operating range depends on the entire system. For example, response times to user requests on a website depend on the web, application and database components behind the site.
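The two known-threshold strategies, immediate breach detection and forecasted time-to-breach, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions (hourly samples, a simple least-squares trend); the function names are hypothetical, not OpsRamp's implementation.

```python
def breaches_threshold(samples, threshold, min_consecutive=3):
    """Alert when the last `min_consecutive` samples all exceed the
    threshold, filtering out momentary spikes (e.g. CPU > 90%)."""
    recent = samples[-min_consecutive:]
    return len(recent) == min_consecutive and all(v > threshold for v in recent)


def hours_to_breach(samples, threshold):
    """Fit a least-squares line to hourly samples and forecast how many
    hours remain until the metric crosses the threshold (e.g. disk usage).
    Returns None when the trend is flat or moving away from the threshold.
    Assumes at least two samples."""
    n = len(samples)
    x_mean = (n - 1) / 2                  # sample index doubles as hours elapsed
    y_mean = sum(samples) / n
    var = sum((x - x_mean) ** 2 for x in range(n))
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(range(n), samples))
    slope = cov / var
    if slope <= 0:
        return None
    return max((threshold - samples[-1]) / slope, 0.0)
```

With disk utilization samples of 70, 72, 74, 76 percent, the fitted slope is 2 points per hour, so a 90% threshold is forecast to be breached in 7 hours.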
Features
OpsRamp offers three features to address these two intelligent alerting use cases (see Figure 1 and Figure 2):
Conditions with a known threshold:
• Alert on breach of threshold: Generate an alert when the metric breaches the known threshold. Such alerts work best for metrics that can fluctuate rapidly, without a clear trend (like CPU or memory utilization).
• Alert on forecasted time to breach threshold: Generate an alert based on the forecasted time remaining to breach the known threshold. These alerts are suitable for metrics that rise (or fall) gradually with a clear trend (like disk space utilization).
Conditions with an unknown threshold:
• Alert on sudden change: Generate an alert when you detect a sudden change from recent behavior. The change can be in a positive or negative direction. This is suitable for metrics in which a sudden change in behavior represents a performance degradation (such as a sudden increase in response times).
Figure 1: Alert response workflow
Figure 2: Threshold options in OpsRamp
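The unknown-threshold case can be approximated with a rolling z-score against recent history. This sketch is an illustrative stand-in for sudden-change detection, not OpsQ's actual algorithm.

```python
from statistics import mean, stdev

def sudden_change(history, latest, z_threshold=3.0):
    """Flag `latest` when it deviates from recent history by more than
    `z_threshold` standard deviations, in either direction. Suited to
    metrics with no fixed threshold, such as response times. Assumes
    `history` has at least two samples."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu  # any deviation from a flat baseline is a change
    return abs(latest - mu) / sigma > z_threshold
```

For response times hovering around 100 ms, a jump to 250 ms triggers the alert while 101 ms does not.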
Chapter 2: Alert Correlation
Alert correlation rules let you classify alerts into primary and secondary so that you can establish a relationship between them. Users can then cluster alerts into higher-level pieces of information called inferences that help derive cause and effect.
We’ll analyze two use cases, Cascading Impact and Parallel Impact, which arise when the same underlying condition triggers multiple alerts.
Cascading Impact
Cascading Impact is the use case where one upstream disruption (and its eventual alert) can cascade downstream and cause subsequent dependency-based alerts across the system. Here, it’s critical to identify the root-cause, or upstream, alert.
Figures 3 and 4 show two examples of cascading impact.
In Figure 3, Switch 1 is down, setting off a cascade of alerts on downstream devices that depend on network connectivity to the switch. The only alert that represents the “signal” is the alert on Switch 1. The remaining alerts are “noise” relative to the underlying condition.
Figure 3: Cascading effect across a network
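The cascading-impact logic can be sketched as follows, assuming a hypothetical `upstream_of` dependency map (OpsQ derives these relationships from its discovered topology; the function and field names here are illustrative):

```python
def correlate_cascading(alerting, upstream_of):
    """Split a set of alerting devices into primary (root-cause) and
    secondary alerts. `upstream_of[d]` lists the devices d depends on;
    an alert is secondary when any device upstream of it (transitively)
    is also alerting."""
    alerting = set(alerting)

    def has_alerting_upstream(device, seen=()):
        for up in upstream_of.get(device, []):
            if up in seen:
                continue  # guard against cycles in the dependency map
            if up in alerting or has_alerting_upstream(up, seen + (device,)):
                return True
        return False

    primary = [d for d in alerting if not has_alerting_upstream(d)]
    secondary = alerting - set(primary)
    return primary, secondary
```

Applied to the Figure 3 scenario, where Server and Switch 2 depend on Switch 1 and VM1 depends on Server, the only primary alert is Switch 1; the rest are grouped beneath it.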
In Figure 4, Microservice-D goes down and impacts the availability of the other three microservices that rely on it. Here, the right context is critical to understand which microservice is causing the issue.
Figure 4: Cascading effect across an application component
Parallel Impact
Parallel Impact alerts usually begin from a single cause applied to multiple elements simultaneously. Like cascading impact alerts, the result is an alert flood triggered by a single root cause. Figure 5 shows an example of the parallel impact use case. A developer rolls out a code update to different servers in a cluster. The update has a bug that causes alerts on each of the servers. No single alert represents the “signal”; the alerts across these servers are “noise” relative to one another because they all carry the same information. The ops team must infer the right “signal” from multiple alerts for accurate remediation.
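A simple way to collapse such a flood is to group alerts whose key attributes match. This sketch uses exact matching on hypothetical `subject` and `metric` fields as a simplification of the approximate similarity described later; it is illustrative, not OpsRamp's matching logic.

```python
from collections import defaultdict

def cluster_similar_alerts(alerts, match_keys=("subject", "metric")):
    """Group alerts whose chosen attributes match exactly: a bad code
    rollout raises the same alert on many servers, so the shared subject
    and metric collapse into one cluster."""
    clusters = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(k) for k in match_keys)
        clusters[key].append(alert)
    # Only multi-alert groups are interesting as candidate inferences
    return [group for group in clusters.values() if len(group) > 1]
```

Three identical deploy-failure alerts from different servers become one cluster, while an unrelated disk alert stays out.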
Figure 5: Parallel impact of a code update across a server cluster
How Does OpsRamp Help?
OpsRamp OpsQ applies machine learning and other data-driven approaches to correlate and infer alerts arising from the same underlying condition. OpsQ continuously learns patterns and applies learned models against incoming alert streams to make sense of cascading and parallel impacts. It groups related alerts into inferences based on the learning models. IT teams can then manage these inferences instead of addressing individual alerts, reducing the “noise” that users need to sift through in everyday operations. Figure 6 shows OpsQ’s algorithmic alert correlation.
Figure 6: OpsQ’s algorithmic alert correlation
OpsQ automatically builds inference models based on machine learning and provides a framework for users to define their own models, which can encapsulate environment-specific patterns seen in practice. OpsRamp’s alert correlation features are summarized below:

Topology Discovery (Topology Explorer)
A key input for alert correlation is the topological relationships between IT elements (at the network and application layers). Knowing “what’s connected to what” and “what talks to what” is essential to understanding alert relationships. OpsRamp automatically discovers network and application topology. Users can view the topology map, which feeds our correlation algorithms. See Figures 7 and 8 for examples of network and application topology maps.

Inference Models
Figure 9 shows the three types of models that OpsRamp currently provides:
• Downstream Impact Model: This user-defined model suits cascading-effect use cases (shown in Figures 3 and 4). The model takes as input the upstream and downstream devices and alert types, along with the topological relationship through which alerts cascade.
• Alert Similarity Model: This user-defined model works well for parallel-effect use cases, as seen in Figure 5. The model takes as input the attributes of alerts that must “approximately” match for alerts to be identified as related.
• Statistical Co-Occurrence Model: This machine learning model suits both cascading and parallel-effect use cases. The model requires no user input and relies on the historical frequency of specific alert sequences; sequences that recur frequently show up as correlated alerts.
Inference models can also leverage a key benefit of the OpsRamp SaaS form factor: in the future, machine learning models will be able to learn from thousands of datasets across different managed environments.

Inference Lifecycle Management
As shown in Figure 6, correlated alerts are grouped into an Inference. Users can manage the lifecycle of an Inference instead of handling individual alerts, drastically reducing the time required for event correlation. You can acknowledge and suppress inferences and create tickets in a single action. Figure 10 shows an Inference.
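The idea behind a statistical co-occurrence model can be sketched as counting how often pairs of alert types fire close together in time. The window and counting scheme below are illustrative assumptions, not OpsQ internals.

```python
from collections import Counter

def learn_cooccurrence(alert_history, window_seconds=300):
    """Count how often pairs of alert types fire within the same time
    window. `alert_history` is a time-sorted list of (timestamp, type)
    tuples; high-count pairs become candidates for automatic correlation."""
    pair_counts = Counter()
    for i, (t_i, type_i) in enumerate(alert_history):
        for t_j, type_j in alert_history[i + 1:]:
            if t_j - t_i > window_seconds:
                break  # history is sorted, so later alerts are outside the window too
            if type_i != type_j:
                pair_counts[tuple(sorted((type_i, type_j)))] += 1
    return pair_counts
```

If an interface-down alert is repeatedly followed within minutes by a host-down alert, the pair accumulates a high count and the two alert types are flagged as correlated.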
Figure 7: OpsRamp’s auto-discovered network topology
Figure 8: OpsRamp’s auto-discovered application topology
Figure 9: OpsRamp’s Inference Models
Figure 10: An Inference in OpsRamp
Chapter 3: Auto-Remediation And Escalation
Escalation and remediation of alerts are common alert management tasks, often handled by humans. Traditionally, first responders have recognized root-cause alerts (through manual correlation) and escalated them to subject matter experts for remediation and closure.
With OpsRamp, you can automate remediation through a scripted sequence or policy. If an alert can’t be auto-remediated, OpsQ can escalate it to the right teams for immediate attention.
Both use cases involve automating the first response to an incoming alert. Automation can mimic what a human operator would do with the same alert.
Auto-Remediation
Each alert represents a problem condition that must eventually be fixed. If you can address an incident through a well-defined sequence of actions, you can automate the response. For example, if a server becomes unavailable when a key application process (e.g. Apache) stops running, restarting that process is a well-defined, automatable action that restores service.
OpsQ can invoke scripts on alert triggers and execute auto-remediation actions. Figure 11 shows the configuration of auto-remediation actions.
Figure 11: OpsRamp’s auto-remediation feature
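A remediation hook of the kind just described might look like the following sketch. The alert names, command, and mapping are hypothetical; real OpsQ remediation scripts are configured in the platform.

```python
import subprocess

# Hypothetical alert-to-command mapping; names and commands are
# illustrative, not OpsRamp defaults.
REMEDIATIONS = {
    "apache_process_down": ["systemctl", "restart", "apache2"],
}

def remediate(alert_subject, dry_run=True):
    """Look up the remediation for an alert and optionally run it.
    Returns the command, or None to signal that the alert needs
    escalation to a human instead."""
    command = REMEDIATIONS.get(alert_subject)
    if command is None:
        return None
    if not dry_run:
        subprocess.run(command, check=True)  # raise if the restart fails
    return command
```

Returning None for an unmapped alert is the handoff point to the escalation workflow described next.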
Figure 12: OpsRamp’s alert escalation policy
Escalation and Incident Routing
Alerts that cannot be auto-remediated need human attention and investigation. Escalation involves notifying the right users about an alert, creating an incident ticket, and routing the ticket to the right team. Notification and routing decisions depend on the device and alert in question.
For example, alerts on devices supporting business-critical IT services require notifying Level 1 support staff within five minutes of alert receipt. If the alert comes from a server and concerns a specific application, you will need to create an incident and route it to the relevant application team.
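The routing logic in that example can be sketched as an ordered list of match rules. The field names, team names, and policy structure are hypothetical illustrations, not OpsRamp's escalation schema.

```python
def route_alert(alert, policies):
    """Return the first matching escalation action for an alert. Each
    policy is a (predicate, action) pair evaluated in order, mirroring
    how an escalation policy matches on device and alert attributes."""
    for matches, action in policies:
        if matches(alert):
            return action
    return {"notify": "noc", "create_incident": False}  # catch-all default

policies = [
    # Business-critical services: page Level 1 within five minutes
    (lambda a: a.get("criticality") == "business-critical",
     {"notify": "level-1", "within_minutes": 5, "create_incident": True}),
    # Server alerts tied to an application: route to the app team
    (lambda a: a.get("source") == "server" and a.get("application"),
     {"notify": "app-team", "create_incident": True}),
]
```

Evaluating rules in order keeps the most urgent routing decisions first, with a NOC catch-all so no alert falls through unrouted.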
OpsQ can create alert escalation workflows and help program first-response actions for notification and auto-incident creation. Figure 12 shows the alert escalation policy for on-call management.
© 2019 OpsRamp, Inc. All Rights Reserved
Conclusion
As IT operational environments become more complex, with different platforms and management tools, the need for automation and artificial intelligence has only become more critical. Service-centric AIOps is the most efficient way to use artificial intelligence and machine learning to avoid business disruption and remediate incidents more quickly. OpsRamp OpsQ is an intelligent event management, alert correlation, and remediation solution that’s built for ever-growing amounts of event and performance data.
For more on OpsQ and service-centric AIOps, visit www.OpsRamp.com/solutions/service-centric-aiops.