Service-Centric AIOps
How OpsRamp OpsQ drives IT operational efficiency and faster response at scale.
[Whitepaper]
Table of Contents
Introduction
Chapter 1: Intelligent Alerting And Thresholds
Chapter 2: Alert Correlation
Chapter 3: Auto-Remediation And Escalation
Conclusion
Introduction
Automation is the catalyst for efficiency in the future of IT operations. Automation driven by artificial intelligence will be a huge competitive advantage for digital operations teams drowning in data, alerts, and conflicting business priorities. The need to get more work done in less time (with fewer resources) creates a perfect storm for the adoption of artificial intelligence for IT operations, or AIOps.
AIOps is a game-changing technology that’s still in its infancy. As cloud-native services and hybrid infrastructures present new levels of complexity, AIOps has become a critical ingredient for scaling the IT operations function. The promise of AIOps is tremendous, from machine data analytics to faster and more efficient IT service management.
What’s more, AIOps that’s service-centric offers enterprise IT teams a truly unique solution for managing infrastructure complexity, maintaining digital services, and meeting the needs of the business. A service-centric approach to AIOps will help prevent digital disruptions by proactively monitoring system health, reducing alert storms and remediating issues quickly.
This paper will focus on three routine operational tasks that machine learning and related computational approaches can streamline: intelligent alerting, alert correlation, and auto-remediation and escalation. You’ll learn how OpsRamp OpsQ delivers cross-domain analysis by analyzing, normalizing and processing vast datasets across your IT environment for greater contextual visibility and proactive insights:
• Improved Efficiency: Advanced inference models to determine cause and effect in complex patterns and handle previously unmanageable complexity.
• Automated Incident Detection and Resolution: Machine learning-driven root cause(s) analysis and anomaly detection to speed up the incident resolution.
• Multi-Cloud Optimization: Intelligent capacity analysis and continuous optimization for dynamic cloud environments.
Chapter 1: Intelligent Alerting And Thresholds
Network Operations Center (NOC) teams face a steady stream of alerts from modern digital services, hosted on physical and virtual devices in enterprise datacenters or on IaaS and PaaS services in the public cloud. Alerts fire when monitored metrics for managed IT resources breach pre-defined thresholds; these metrics represent capacity utilization or performance data for those resources.
As part of the monitoring practice, IT operations teams perform the following tasks associated with alerting:
1. Correlation of ingested alerts to a common cause
2. Triage and prioritization of alerts for resolution
3. Integration with ITSM tools for improved problem management, including the creation of incident tickets for issues that need further root-cause analysis
Each step in this workflow consumes human time to execute. The average time spent across task flows, per ingested alert, grows disproportionately with the number of incoming alerts. With OpsRamp OpsQ, you can implement intelligent alerting and automate as much of each task as possible in order to reduce the human time spent per alert.
There are two primary types of metrics that trigger alerts:
• Performance Metrics: Response times to user requests, the latency of disk I/O operations.
• Capacity Metrics: Disk space remaining, memory utilization.
The goal of intelligent alerting is to identify capacity or performance degradation conditions before there is any user impact. There are two types of intelligent alerting use cases that you’ll frequently encounter in practice:
• Conditions with a known threshold: A majority of metrics have known thresholds that represent degraded capacity or performance conditions. For example, sustained CPU utilization > 90% on a server represents a constrained CPU condition.
• Conditions without a known threshold: For certain types of metrics, you cannot establish thresholds in practice because their “normal” operating range depends on the entire system. For example, response times to user requests on a website depend on the web, application and database components behind the site.
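The two known-threshold strategies, immediate breach detection and forecasted time-to-breach, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions (hourly samples, a simple least-squares trend); the function names are hypothetical, not OpsRamp's implementation.

```python
def breaches_threshold(samples, threshold, min_consecutive=3):
    """Alert when the last `min_consecutive` samples all exceed the
    threshold, filtering out momentary spikes (e.g. CPU > 90%)."""
    recent = samples[-min_consecutive:]
    return len(recent) == min_consecutive and all(v > threshold for v in recent)


def hours_to_breach(samples, threshold):
    """Fit a least-squares line to hourly samples and forecast how many
    hours remain until the metric crosses the threshold (e.g. disk usage).
    Returns None when the trend is flat or moving away from the threshold.
    Assumes at least two samples."""
    n = len(samples)
    x_mean = (n - 1) / 2                  # sample index doubles as hours elapsed
    y_mean = sum(samples) / n
    var = sum((x - x_mean) ** 2 for x in range(n))
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(range(n), samples))
    slope = cov / var
    if slope <= 0:
        return None
    return max((threshold - samples[-1]) / slope, 0.0)
```

With disk utilization samples of 70, 72, 74, 76 percent, the fitted slope is 2 points per hour, so a 90% threshold is forecast to be breached in 7 hours.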
Features
OpsRamp offers three features to address these two intelligent alerting use cases (see Figure 1 and Figure 2):
Conditions with a known threshold:
• Alert on breach of threshold: Generate an alert when the metric breaches the known threshold. Such alerts work best for metrics that can fluctuate rapidly, without a clear trend (like CPU or memory utilization).
• Alert on forecasted time to breach threshold: Generate an alert based on the forecasted time remaining to breach the known threshold. These alerts are suitable for metrics that rise (or fall) gradually with a clear trend (like disk space utilization).
Conditions with an unknown threshold:
• Alert on sudden change: Generate an alert when you detect a sudden change from recent behavior. The change can be in a positive or negative direction. This is suitable for metrics in which a sudden change in behavior represents a performance degradation (such as a sudden increase in response times).
Figure 1: Alert response workflow
Figure 2: Threshold options in OpsRamp
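The unknown-threshold case can be approximated with a rolling z-score against recent history. This sketch is an illustrative stand-in for sudden-change detection, not OpsQ's actual algorithm.

```python
from statistics import mean, stdev

def sudden_change(history, latest, z_threshold=3.0):
    """Flag `latest` when it deviates from recent history by more than
    `z_threshold` standard deviations, in either direction. Suited to
    metrics with no fixed threshold, such as response times. Assumes
    `history` has at least two samples."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu  # any deviation from a flat baseline is a change
    return abs(latest - mu) / sigma > z_threshold
```

For response times hovering around 100 ms, a jump to 250 ms triggers the alert while 101 ms does not.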
Chapter 2: Alert Correlation
Alert correlation rules let you classify alerts into primary and secondary so that you can establish a relationship between them. Users can then cluster alerts into higher-level pieces of information called inferences that help derive cause and effect.
We’ll analyze two use cases, Cascading Impact and Parallel Impact, which arise when the same underlying condition triggers multiple alerts.
Cascading Impact
Cascading Impact is the use case where one upstream disruption (and its eventual alert) can cascade downstream and cause subsequent dependency-based alerts across the system. Here, it’s critical to identify the root-cause, or upstream, alert.
Figures 3 and 4 show two examples of cascading impact.
In Figure 3, Switch 1 is down, setting off a cascade of alerts on downstream devices that depend on network connectivity to the switch. The only alert that represents the “signal” is the alert on Switch 1. The remaining alerts are “noise” relative to the underlying condition.
Figure 3: Cascading effect across a network
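The cascading-impact logic can be sketched as follows, assuming a hypothetical `upstream_of` dependency map (OpsQ derives these relationships from its discovered topology; the function and field names here are illustrative):

```python
def correlate_cascading(alerting, upstream_of):
    """Split a set of alerting devices into primary (root-cause) and
    secondary alerts. `upstream_of[d]` lists the devices d depends on;
    an alert is secondary when any device upstream of it (transitively)
    is also alerting."""
    alerting = set(alerting)

    def has_alerting_upstream(device, seen=()):
        for up in upstream_of.get(device, []):
            if up in seen:
                continue  # guard against cycles in the dependency map
            if up in alerting or has_alerting_upstream(up, seen + (device,)):
                return True
        return False

    primary = [d for d in alerting if not has_alerting_upstream(d)]
    secondary = alerting - set(primary)
    return primary, secondary
```

Applied to the Figure 3 scenario, where Server and Switch 2 depend on Switch 1 and VM1 depends on Server, the only primary alert is Switch 1; the rest are grouped beneath it.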
In Figure 4, Microservice-D goes down and impacts the availability of the other three microservices that rely on it. Here, the right context is critical to understand which microservice is causing the issue.
Figure 4: Cascading effect across an application component
Parallel Impact
Parallel Impact alerts usually begin from a single cause applied to multiple elements simultaneously. Like cascading impact alerts, the result is an alert flood triggered by a single root cause. Figure 5 shows an example of the parallel impact use case. A developer rolls out a code update to different servers in a cluster. The update has a bug that causes alerts on each of the servers. No single alert represents the “signal”; the alerts across these servers are “noise” relative to one another because they all carry the same information. The ops team must infer the right “signal” from multiple alerts for accurate remediation.
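A simple way to collapse such a flood is to group alerts whose key attributes match. This sketch uses exact matching on hypothetical `subject` and `metric` fields as a simplification of the approximate similarity described later; it is illustrative, not OpsRamp's matching logic.

```python
from collections import defaultdict

def cluster_similar_alerts(alerts, match_keys=("subject", "metric")):
    """Group alerts whose chosen attributes match exactly: a bad code
    rollout raises the same alert on many servers, so the shared subject
    and metric collapse into one cluster."""
    clusters = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(k) for k in match_keys)
        clusters[key].append(alert)
    # Only multi-alert groups are interesting as candidate inferences
    return [group for group in clusters.values() if len(group) > 1]
```

Three identical deploy-failure alerts from different servers become one cluster, while an unrelated disk alert stays out.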
Figure 5: Parallel impact of a code update across a server cluster
How Does OpsRamp Help?
OpsRamp OpsQ applies machine learning and other data-driven approaches to correlate and infer alerts arising from the same underlying condition. OpsQ continuously learns patterns and applies learned models against incoming alert streams to make sense of cascading and parallel impacts. It groups related alerts into inferences based on the learning models. IT teams can then manage these inferences instead of addressing individual alerts, reducing the “noise” that users need to sift through in everyday operations. Figure 6 shows OpsQ’s algorithmic alert correlation.
Figure 6: OpsQ’s algorithmic alert correlation
OpsQ automatically builds inference models based on machine learning and provides a framework for users to define their own models, which can encapsulate environment-specific patterns seen in practice. OpsRamp’s alert correlation features are summarized below:

Topology Discovery (Topology Explorer)
A key input for alert correlation is the topological relationships between IT elements (at the network and application layers). Knowing “what’s connected to what” and “what talks to what” is essential to understanding alert relationships. OpsRamp automatically discovers network and application topology. Users can view the topology map, which feeds our correlation algorithms. See Figures 7 and 8 for examples of network and application topology maps.

Inference Models
Figure 9 shows the three types of models that OpsRamp currently provides:
• Downstream Impact Model: This user-defined model suits cascading-effect use cases (shown in Figures 3 and 4). The model takes as input the upstream and downstream devices and alert types, along with the topological relationship through which alerts cascade.
• Alert Similarity Model: This user-defined model works well for parallel-effect use cases, as seen in Figure 5. The model takes as input the attributes of alerts that must “approximately” match for alerts to be identified as related.
• Statistical Co-Occurrence Model: This machine learning model suits both cascading and parallel-effect use cases. The model requires no user input and relies on the historical frequency of specific alert sequences; sequences that recur frequently show up as correlated alerts.
Inference models can also leverage a key benefit of the OpsRamp SaaS form factor: in the future, machine learning models will be able to learn from thousands of datasets across different managed environments.

Inference Lifecycle Management
As shown in Figure 6, correlated alerts are grouped into an Inference. Users can manage the lifecycle of an Inference instead of handling individual alerts, drastically reducing the time required for event correlation. You can acknowledge and suppress inferences and create tickets in a single action. Figure 10 shows an Inference.
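The idea behind a statistical co-occurrence model can be sketched as counting how often pairs of alert types fire close together in time. The window and counting scheme below are illustrative assumptions, not OpsQ internals.

```python
from collections import Counter

def learn_cooccurrence(alert_history, window_seconds=300):
    """Count how often pairs of alert types fire within the same time
    window. `alert_history` is a time-sorted list of (timestamp, type)
    tuples; high-count pairs become candidates for automatic correlation."""
    pair_counts = Counter()
    for i, (t_i, type_i) in enumerate(alert_history):
        for t_j, type_j in alert_history[i + 1:]:
            if t_j - t_i > window_seconds:
                break  # history is sorted, so later alerts are outside the window too
            if type_i != type_j:
                pair_counts[tuple(sorted((type_i, type_j)))] += 1
    return pair_counts
```

If an interface-down alert is repeatedly followed within minutes by a host-down alert, the pair accumulates a high count and the two alert types are flagged as correlated.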
Figure 7: OpsRamp’s auto-discovered network topology
Figure 8: OpsRamp’s auto-discovered application topology
Figure 9: OpsRamp’s Inference Models
Figure 10: An Inference in OpsRamp
Chapter 3: Auto-Remediation And Escalation
Escalation and remediation of alerts are common alert management tasks, often handled by humans. Traditionally, first responders have recognized root-cause alerts (through manual correlation) and escalated them to subject matter experts for remediation and closure.
With OpsRamp, you can automate remediation through a scripted sequence or policy. If an alert can’t be auto-remediated, OpsQ can escalate it to the right teams for immediate attention.
Both use cases involve automating the first response to an incoming alert. Automation can mimic what a human operator would do with the same alert.
Auto-Remediation
Each alert represents a problem condition that must eventually be fixed. If you can address an incident through a well-defined sequence of actions, you can automate the response. For example, if a server becomes unavailable when a key application process (e.g. Apache) stops running, restarting that process is a well-defined, automatable action that restores service.
OpsQ can invoke scripts on alert triggers and execute auto-remediation actions. Figure 11 shows the configuration of auto-remediation actions.
Figure 11: OpsRamp’s auto-remediation feature
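A remediation hook of the kind just described might look like the following sketch. The alert names, command, and mapping are hypothetical; real OpsQ remediation scripts are configured in the platform.

```python
import subprocess

# Hypothetical alert-to-command mapping; names and commands are
# illustrative, not OpsRamp defaults.
REMEDIATIONS = {
    "apache_process_down": ["systemctl", "restart", "apache2"],
}

def remediate(alert_subject, dry_run=True):
    """Look up the remediation for an alert and optionally run it.
    Returns the command, or None to signal that the alert needs
    escalation to a human instead."""
    command = REMEDIATIONS.get(alert_subject)
    if command is None:
        return None
    if not dry_run:
        subprocess.run(command, check=True)  # raise if the restart fails
    return command
```

Returning None for an unmapped alert is the handoff point to the escalation workflow described next.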
Figure 12: OpsRamp’s alert escalation policy
Escalation and Incident Routing
Alerts that cannot be auto-remediated need human attention and investigation. Escalation involves notifying the right users about an alert, creating an incident ticket, and routing the ticket to the right team. Notification and routing decisions depend on the device and alert in question.
For example, alerts on devices supporting business-critical IT services require notifying Level 1 support staff within five minutes of alert receipt. If the alert comes from a server and concerns a specific application, you will need to create an incident and route it to the relevant application team.
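The routing logic in that example can be sketched as an ordered list of match rules. The field names, team names, and policy structure are hypothetical illustrations, not OpsRamp's escalation schema.

```python
def route_alert(alert, policies):
    """Return the first matching escalation action for an alert. Each
    policy is a (predicate, action) pair evaluated in order, mirroring
    how an escalation policy matches on device and alert attributes."""
    for matches, action in policies:
        if matches(alert):
            return action
    return {"notify": "noc", "create_incident": False}  # catch-all default

policies = [
    # Business-critical services: page Level 1 within five minutes
    (lambda a: a.get("criticality") == "business-critical",
     {"notify": "level-1", "within_minutes": 5, "create_incident": True}),
    # Server alerts tied to an application: route to the app team
    (lambda a: a.get("source") == "server" and a.get("application"),
     {"notify": "app-team", "create_incident": True}),
]
```

Evaluating rules in order keeps the most urgent routing decisions first, with a NOC catch-all so no alert falls through unrouted.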
OpsQ can create alert escalation workflows and help program first-response actions for notification and auto-incident creation. Figure 12 shows the alert escalation policy for on-call management.
© 2019 OpsRamp, Inc. All Rights Reserved
Conclusion
As IT operational environments become more complex, with different platforms and management tools, the need for automation and artificial intelligence has only become more critical. Service-centric AIOps is the most efficient way to use artificial intelligence and machine learning to avoid business disruption and remediate incidents more quickly. OpsRamp OpsQ is an intelligent event management, alert correlation, and remediation solution that’s built for ever-growing amounts of event and performance data.
For more on OpsQ and service-centric AIOps, visit www.OpsRamp.com/solutions/service-centric-aiops.