
Extending SPECTRUM Event Correlation


This document provides examples of how to extend SPECTRUM’s event correlation capabilities. Specifically, it demonstrates sample event rules and condition correlations. The reader should ideally be familiar with basic SPECTRUM event management topics, such as how to configure Events and Alarms through the Event Configuration Editor.

Contents

1.0 Introduction

2.0 Simple Event Configuration Updates

2.1 Alarm De-duplication Example

3.0 Event Rules

3.1 EventRate Example

3.2 EventCondition Example

3.3 Conditional Alarm Severity Example

3.4 Event Rule Troubleshooting

4.0 Condition Correlation

4.1 Caused By Example

4.2 Implied Cause Example

4.3 Correlation Domains as Root Cause Targets Example

4.4 Condition Correlation Troubleshooting

5.0 Supporting Documentation

1.0 Introduction

SPECTRUM’s event management system is highly customizable and provides a powerful method for configuring event correlation. The event management system notifies users of significant occurrences within the monitored environment, primarily through Events and Alarms. Out of the box, SPECTRUM provides event correlation for fault suppression (due to an outage), certain alarm de-duplication, and alarm reduction by suppressing child model alerts in the event of a failure (ports, process models, etc.). Often, however, it is necessary to enhance event correlation to meet specific customer needs and further reduce the number of alerts that operations staff have to deal with. There are a number of ways to update and enhance SPECTRUM Event Correlation capabilities. The examples covered in this document are as follows:

1. Simple Event Configuration updates

This includes specifying which events generate/clear alarms and event variables to discriminate

on when doing so (alarm de-duplication). In addition, event and alarm descriptions can be

modified and enriched.

2. Event Rules

Event rules allow for events to be correlated on individual models. Event Rules permit you to

specify a more complex decision-making system to indicate how an event is to be processed.

You can use event rules to define alarm conditions based on specific event patterns or content.

3. Condition Correlation

Condition correlation allows for multiple events to be correlated across groups of models.

Events (or the alarms associated with those events) can be identified as the “root cause” and

new root cause conditions can be inferred.

2.0 Simple Event Configuration Updates

SPECTRUM event and alarm behavior can easily be modified through the Event Configuration Editor. To launch the UI from SPECTRUM OneClick, select “Tools->Utilities->Event Configuration…” (this menu is also available by right-clicking).

The Event Configuration UI can quickly be used to change event messages, alarming behavior (alarm

generation, severity, clearing) and alarm descriptions.

2.1 Alarm De-duplication Example

As mentioned previously, the way events are correlated can result in alarm de-duplication. SPECTRUM

can be configured to generate a unique alarm for each event occurrence, create one alarm and append

each additional occurrence to that alarm, or use event variables to determine when new alarms should be

generated (event variable discriminators). This type of behavior is controlled in the Alarm Options section

of the Alarms tab. The “AUTHORIZATION FAILURE TRAP RECEIVED” alarm will be used as an example.

Once you identify what alarm or event you need to modify, the local Filter box located in the Navigation

panel of the Event Configuration UI can be used to quickly find the event in question.

Free text can be used to match a string in the event message or alarm type; however, the most reliable way to find the event you want is to use the event code, which is shown in the SPECTRUM Events window in OneClick.

Returning to the example, event 0x00010017 generates a Minor alarm. In the Alarm Options section, the “Generate a Unique Alarm for Each Event” option is unchecked. This means that when the event first occurs, a new Minor alarm is created. If another event of the same type occurs on the same model and the original alarm has NOT been cleared, the new event is appended to the existing alarm.

If the “Generate a Unique Alarm for Each Event” box is checked, a new alarm is created each time a new authentication failure event is received for the same device, regardless of whether previous alarms have been cleared. Toggling this option covers the two extreme cases: generate a new alarm for every event, or generate only one alarm (until it is cleared) and append all other events to it. In some situations, users may want to generate a new alarm only when certain information in the incoming event differs from earlier events. In this situation, “Event Variable Discriminators” can be used to generate new alarms only if the event data is different from that of previous events of the same type.

SPECTRUM Event Variables are used to store event data that can change from event to event. Most

commonly, event variables represent SNMP trap varbind variables (var data). When an SNMP trap is

mapped in SPECTRUM (via MIB Tools for example), all the varbinds defined in the SNMP MIB are stored in

the event as event variables. Looking at the Authentication Failure event, there is only one varbind and it

is the IP address of the source of the authentication failure.

Using Event Variable Discriminators, it is possible to refine alarm de-duplication. When an event variable ID is entered in the Event Variable Discriminators field, SPECTRUM generates a new alarm only if the value of that event variable differs from its value in the event that created the existing alarm. In our example, if 1 (the ID of the event variable that holds the source IP of the authentication failure) is entered in the Event Variable Discriminators field and multiple authentication failure events are received on the same device, new alarms are generated only when the source IP value is different. Events with the same source IP are appended to the existing alarm.

A comma can be used to separate multiple event variable discriminators.
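The de-duplication decision described above can be sketched in a few lines of Python. This is only an illustration of the logic, not SPECTRUM's implementation: the `process_event` function, the dictionary of open alarms, and the model name `router-1` are all hypothetical, and the sketch ignores alarm clearing.

```python
from typing import Dict, List, Sequence, Tuple

# Key: (model, event code, discriminator values). Hypothetical structure,
# used only to illustrate the decision described in the text.
AlarmKey = Tuple[str, int, Tuple[str, ...]]


def process_event(
    open_alarms: Dict[AlarmKey, List[dict]],
    model: str,
    event_code: int,
    variables: Dict[int, str],
    discriminators: Sequence[int],
) -> bool:
    """Return True if a new alarm is opened, False if the event is appended."""
    values = tuple(variables.get(var_id, "") for var_id in discriminators)
    key = (model, event_code, values)
    is_new = key not in open_alarms
    # Either start a new alarm or append the event to the existing one.
    open_alarms.setdefault(key, []).append(variables)
    return is_new


alarms: Dict[AlarmKey, List[dict]] = {}
results = [
    # Event variable 1 holds the source IP of the authentication failure.
    process_event(alarms, "router-1", 0x00010017, {1: "10.0.0.5"}, [1]),
    process_event(alarms, "router-1", 0x00010017, {1: "10.0.0.5"}, [1]),
    process_event(alarms, "router-1", 0x00010017, {1: "10.0.0.9"}, [1]),
]
print(results)  # [True, False, True]
```

With an empty discriminator list, every event of the same type on the same model maps to the same key, which corresponds to the default behavior of appending to a single alarm.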

Event variable discriminators can also be used when clearing SPECTRUM alarms. Events can not only generate alarms but also clear existing ones. Typically, to set up event-based alarm clearing, you simply specify the alarm cause code to be cleared by the event in question. Event Variable Discriminators can be added so that the event clears an alarm only if the discriminator values match.

3.0 Event Rules

Event Rules permit you to specify a more complex decision-making system to indicate how an event is to be processed. You can use event rules to define alarm conditions based on specific event patterns or content. Each event rule looks for a series of events to occur on a model in a certain pattern or time frame. If the events occur as the rule specifies, another event is generated for that model, and this new event can then be processed as desired. Common uses of event rules include generating an alarm only when an event occurs at a certain frequency within a specified time period, and changing the alarming behavior based on the varbind content of a trap.

SPECTRUM provides four customizable event rule types:

EventPair

In some cases, you expect events to happen in pairs, and if the second event does not occur, this may

indicate a problem in the computing infrastructure. An event pair rule creates an event based on this

scenario. If the first of two expected events is generated but the second event does not follow the first, a

new event is generated in response. You can specify the amount of time that can elapse before the new

event is generated in response. Note that other, unrelated events can be generated between the first

event and the second event; they do not affect execution of the rule.

An EventPair rule can be used to indicate that a device reload did not complete successfully and that the device can therefore no longer be reached. SPECTRUM would normally give you an event for the reload and, ultimately, an alert for the device no longer responding; however, the two incidents would not be related.

The graphic above depicts the EventPair rule explained previously. In the first scenario, the coldStart

standard trap is received during the 15-minute time frame; therefore, no alarm is generated. In the

second scenario, the necessary trap is not received; therefore, an alarm is generated.
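The EventPair behavior can be approximated with the following Python sketch. It is a simplified offline model that scans a time-sorted event log rather than running a live timer, and the function name and the event names `reload`, `coldStart` and `linkDown` are placeholders chosen for illustration.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, event name)


def unmatched_first_events(
    events: List[Event], first: str, second: str, timeout: float
) -> List[float]:
    """Return timestamps of `first` events not followed by `second` in time.

    Assumes `events` is sorted by timestamp. Unrelated events between the
    pair are ignored, as described for the EventPair rule.
    """
    unmatched = []
    for index, (ts, name) in enumerate(events):
        if name != first:
            continue
        followed = any(
            later_name == second and later_ts - ts <= timeout
            for later_ts, later_name in events[index + 1:]
        )
        if not followed:
            unmatched.append(ts)  # the rule would generate its output event
    return unmatched


FIFTEEN_MINUTES = 15 * 60
ok_log = [(0, "reload"), (120, "linkDown"), (300, "coldStart")]
bad_log = [(0, "reload"), (120, "linkDown")]
print(unmatched_first_events(ok_log, "reload", "coldStart", FIFTEEN_MINUTES))   # []
print(unmatched_first_events(bad_log, "reload", "coldStart", FIFTEEN_MINUTES))  # [0]
```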

EventRate

Some types of events can be tolerated and do not indicate a problem unless the frequency at which they

are generated reaches a specific threshold within a specific amount of time. An event rate rule creates an

event based on this scenario. When a number of events of the same type (that is, with the same event

code) are created within a given time period, a new event is created in response. Note that other,

unrelated events do not affect execution of the rule.

Event rate rules never terminate. Once the conditions of the rule are met and a new event is created in response, the rule remains active, but no additional event is created as long as the frequency of the evaluated events remains at or above the rate specified in the rule. If the frequency drops below the specified rate and then subsequently exceeds it again, another new event is generated in response.

An event rate rule can use either of the following methods to define the window of time in which the

events must occur:

Sliding Window: When the rule uses this type of time window, if the specified number of events (or more)

ever occurs within any window of the specified time period, the output event is created in response. This

type of time window is best suited for accurately detecting a short burst of events.

For example, the following illustration shows the sliding time windows that are active for a rule that

watches for five instances of a given event (e) within a specified time period.

When a sliding time window is used for a rule, if the rule generates a rule output event, all active time

windows are terminated, and a new time window automatically begins.
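A sliding window can be sketched in Python as a queue of recent timestamps. This is a minimal model under stated assumptions: the function name is hypothetical, timestamps are in seconds and sorted, and the sketch only models the window restart after an output event, not the additional suppression that applies while the rate stays at or above the threshold.

```python
from collections import deque
from typing import Deque, List


def sliding_window_hits(timestamps: List[float], count: int, period: float) -> List[float]:
    """Return the times at which `count` events fall within any `period`."""
    window: Deque[float] = deque()
    hits = []
    for ts in timestamps:
        window.append(ts)
        # Drop events that are too old to share a window with this one.
        while ts - window[0] > period:
            window.popleft()
        if len(window) >= count:
            hits.append(ts)
            window.clear()  # all active windows end; a new one begins
    return hits


# Four scattered events, then a burst of five within four seconds.
print(sliding_window_hits([0, 10, 20, 30, 100, 101, 102, 103, 104], 5, 60))  # [104]
```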

Sequential Window: When the rule uses this type of time window, non-overlapping time windows are

examined, one after another, to determine if the requisite number of events has occurred within the time

window. This type of time window is best suited for detecting a long, sustained train of events.

For example, the following illustration shows the sequential time windows that are opened and closed for a

rule that watches for five instances of a given event (e) within a specified time period.

If the current time window closes due to time period expiration, or if the rule creates an output event in

response, the next time window is not opened until a new event occurrence is detected.
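The sequential window behavior can be sketched as follows. The function name is hypothetical and timestamps are assumed to be sorted and expressed in seconds; a window opens on the first event seen after the previous window has closed, as the text describes.

```python
from typing import List, Optional


def sequential_window_hits(timestamps: List[float], count: int, period: float) -> List[float]:
    """Return the times at which `count` events occur inside one window."""
    hits = []
    window_start: Optional[float] = None
    seen = 0
    for ts in timestamps:
        # Open a new window if none is open or the current one has expired.
        if window_start is None or ts - window_start >= period:
            window_start = ts
            seen = 0
        seen += 1
        if seen >= count:
            hits.append(ts)
            window_start = None  # next window opens on the next event
    return hits


print(sequential_window_hits([0, 50, 55, 58, 59, 61, 62], 5, 60))  # [59]
print(sequential_window_hits([0, 40, 45, 50, 70, 75], 5, 60))      # []
```

In the second call, five events occur within 35 seconds (40 through 75), but they straddle a window boundary and are therefore not counted together. This illustrates why sequential windows suit sustained trains of events rather than short bursts.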

The EventRate rule could be used if a device is generating authenticationFailure traps at a sustained rate. One or two authenticationFailure traps may be acceptable and should not cause an alert, but if a large number of them occur in a short time period, an alert is desired.

In this example, when there are more than 20 authenticationFailure traps in a minute, an alarm is

generated.

EventSeries

An event series rule creates a new event when a given event is followed by one or more other events in

an ordered or unordered sequence. The combination of events that must occur can include any number

and type of event, and you can specify the amount of wait time that can elapse during which the sequence

of events must occur. Note that other, unrelated events can be generated during the wait time; they do

not affect execution of the rule.

The EventSeries rule could be used where a device receives a Service Performance Manager (SPM)

configuration failure event as well as high memory and high CPU use events. Because the device is

unable to handle the current load of response time testing, you’ll need an alarm.

The originating event, 0x456002, is generated when a device hosting an SPM Real Time Monitor (RTM)

test experiences a failure to run a test. If the device also experiences high CPU use and high memory

events (0x10f03 and 0x10f04) in a one-minute interval, event 0x10031 is generated. This resulting event

is disposed to a major alarm and can include specific messaging for the operator and troubleshooter about

linking the high memory and CPU use failures to SPM load levels.
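The unordered form of this EventSeries example can be sketched in Python. The event codes are the ones quoted above; the function name, the log format, and the unrelated code 0x10001 in the sample data are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Event = Tuple[float, int]  # (timestamp in seconds, event code)


def series_matched(events: List[Event], trigger: int, required: Set[int], wait: float) -> bool:
    """True if `trigger` is followed by every code in `required` within `wait`.

    Assumes `events` is sorted by timestamp. Order of the required events
    does not matter, and unrelated events are ignored.
    """
    for index, (ts, code) in enumerate(events):
        if code != trigger:
            continue
        seen = {c for t, c in events[index + 1:] if t - ts <= wait}
        if required <= seen:
            return True  # the rule would generate its output event (0x10031)
    return False


log = [(0, 0x456002), (10, 0x10F04), (20, 0x10001), (45, 0x10F03)]
print(series_matched(log, 0x456002, {0x10F03, 0x10F04}, 60))  # True
```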

EventCondition

This rule allows you to generate an event based on a conditional expression. A series of conditional

expressions can be listed with this rule and the first expression that is found to be TRUE will generate the

event specified with the condition. The conditional expressions can compare a variable binding value or a

SPECTRUM attribute value to a user-specified value using the standard comparison operators. You can

use this type of event rule to evaluate complex scenarios using:

- Comparison operators

- Regular expressions

- String comparisons

- Nested conditions

EventCondition rules can be used in a wide variety of situations. Perhaps the most common is using the rule to evaluate the content of a trap varbind and, based on that content, alter the event or alerting behavior.

3.1 Event Rule Example - EventRate

A common event rule is for authentication failures. When SPECTRUM receives an authenticationFailure trap, event 0x00010017 is generated. This event then generates a Minor alarm with PCause ID 0x1030a. As discussed in the EventRate example above, it may not be desirable for a single authenticationFailure trap to generate an alert; rather, an alert should be generated only when multiple traps occur in a short time span. We will use an event rule to accomplish this.

When working with event rules, the event generated by a positive rule match will usually be a new event. This isn’t a requirement, but it makes sense to have a new, more descriptive event. In our case, a new event will be generated when 20 authenticationFailure traps occur in 60 seconds. It’s easiest to copy the originating event (0x00010017) and make the appropriate changes. To do this, with the event highlighted, select the “Copy” button in the Navigation panel. In the resulting Copy Event window, enter a new code or accept the default, then select OK. Update the event text if desired.

Once the new event has been added, disable the alarm generation for 0x00010017. In the Alarms tab for

0x00010017, toggle the “Severity” pull down to “None”.

The next step is to create the event rule. To do this, move to the Event Rules tab and select the “Creates

a new event rule” button, then choose “EventRate…”.

In the Event Rate creation window, enter the necessary parameters.

Use the “Browse…” button to find the new event created previously. NOTE: You can create and copy events from the Browse window if needed. Once everything is entered, select OK. Now that the event rule is created, a single authentication failure will simply be logged. The rule above generates event 0xfff00000 when 20 authentication failures are received in 60 seconds. The last step is to ensure that 0xfff00000 results in the desired alarm.

Using the quick filter, enter the code of the new event generated by the rule. Select the Alarms tab and ensure the required severity is set.

In the “Cause Code” field, the “Browse…” button can be selected to change or update the resulting alarm description. After you select the “Browse…” button, the Select Alarm Cause Code window comes up. Here, you can choose an existing alarm description, or copy or create a new one. Selecting the “Copy” button allows you to copy the existing AUTHENTICATION FAILURE alarm description and modify it to describe the new condition.

Once the necessary changes have been made, select “OK”. The new alarm is now generated by event 0xfff00000.

Finally, to save your changes, select File->Save All.

3.2 Event Rule Example - EventCondition

All event rule types are built in largely the same way, with the exception of the EventCondition rule. The difference with EventCondition rules is that an expression may need to be constructed.

In this example, we will use an EventCondition rule to examine the contents of a trap varbind and determine whether the trap indicates that an alarm should be generated or cleared. In this case, we are looking at a trap from Wily CEM. One of its varbinds contains either “_OPEN” or “_CLOSED”, indicating whether the CEM incident is being opened or closed. The event looks like the following:

You may be asking yourself, “How do you know that the varbind will have _OPEN or _CLOSED?” This is information that you will have to determine beforehand. Often, when trying to solve integration problems or event correlation issues, the raw events need to be examined. From there, you can see patterns or consistent information that can be exploited in an event rule. Logically, what we want to happen is this:

1. If varbind 116 contains _OPEN, generate an alarm.

2. If varbind 116 contains _CLOSED, clear an alarm.

The process for creating the EventCondition rule starts in the same manner as before. Find the event,

then select the Event Rules tab and select “Creates a new event rule” button. Select “Event Condition…”

from the pull down:

In the Event Condition rule creation UI, select the “Add” button.

The Edit window is displayed, which allows you to build complex (or simple) expressions for evaluating a specific event condition. Again, in our example the first condition is that varbind 116 contains the string _OPEN. Using the Operands in the Condition window, the event variable (event attribute) in varbind 116 can be compared to a text string. In this case, REGEXP is used to specify that the condition evaluates to true if varbind 116 contains the string “_OPEN”.

Once the condition is defined, select the “Insert Criterion” button. Next, specify the event to be created

when the condition evaluates to true, and then select OK. Now, the next condition needs to be defined.

Back in the Event Condition List window, select the “Add” button again.

The new condition follows the same concept as the one above, with a different comparison value and resulting event.

When building conditions, there are many possibilities. Not only event variables but also model attributes and predefined values can be used.

Lastly, you can use the DEFAULT operator to define a condition that always evaluates to true. Event conditions are evaluated sequentially, so if the DEFAULT condition is placed last and no other condition evaluates to true, the DEFAULT condition will. This provides an opportunity to specify an event for the case in which none of the other conditions are met.
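The first-match evaluation with a trailing DEFAULT can be sketched in Python as an ordered list of (condition, output event) pairs. This is an illustration of the evaluation order only: the output event codes 0xFFF00001 through 0xFFF00003 and the sample varbind strings are hypothetical, and Python's `re` module stands in for the REGEXP operator.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Variables = Dict[int, str]
Condition = Tuple[Callable[[Variables], bool], int]

# Conditions are evaluated in order; the first one that is true wins.
RULE: List[Condition] = [
    (lambda v: re.search(r"_OPEN", v.get(116, "")) is not None, 0xFFF00001),
    (lambda v: re.search(r"_CLOSED", v.get(116, "")) is not None, 0xFFF00002),
    (lambda v: True, 0xFFF00003),  # DEFAULT: always true, so it goes last
]


def evaluate(rule: List[Condition], variables: Variables) -> Optional[int]:
    """Return the output event of the first condition that evaluates to true."""
    for condition, output_event in rule:
        if condition(variables):
            return output_event
    return None


print(hex(evaluate(RULE, {116: "INCIDENT_OPEN"})))    # 0xfff00001
print(hex(evaluate(RULE, {116: "INCIDENT_CLOSED"})))  # 0xfff00002
print(hex(evaluate(RULE, {116: "something else"})))   # 0xfff00003
```

If the DEFAULT entry were placed first, it would always match and the other conditions would never be evaluated, which is why it belongs at the end of the list.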

In the end, there are three conditions that make up the event condition rule.

Of course, the resulting events might need to be modified to produce the desired behavior (generate an alarm, trigger another event rule, etc.). Again, once any changes are made, select “File->Save All” to save them.

3.3 Conditional Alarm Severity Example

Traditionally, another prime example of using EventCondition rules was to evaluate a trap varbind containing severity information and map it to different SPECTRUM alarm severities. Typically, you would have an event condition rule that looked like this:

This rule has a set of conditions that evaluate event variable 117. If it contains the string “Moderate”, “Severe” or “Critical”, a different event is generated for each. Those three resulting events then simply generate the same alarm with three different SPECTRUM alarm severities (Moderate = SPECTRUM Minor, Severe = SPECTRUM Major, Critical = SPECTRUM Critical). For alarm severity mapping such as this, an event rule is no longer needed; Conditional Alarm Severity can be used instead.

To use Conditional Alarm Severity, simply select “Conditional” from the Severity pull-down of the Alarms tab.

Next, specify the event variable that holds the severity value to be mapped to the appropriate SPECTRUM alarm severity. Also specify an Alarm Cause Code.

To define the alarm severity mappings, select the “Configure…” button. Select the Add button to define a new severity mapping.

In the Add dialog, specify a name, and then begin adding the Value/Severity mappings. The Directory field can be used if a predefined mapping file (ASCII text) has already been created. Once completed, the SPECTRUM alarm severities are determined by the varbind mappings that have been identified. As you can see, this method is much quicker and simpler than the event condition rule used previously.
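Conceptually, a Value/Severity mapping is a lookup table keyed on the value of one event variable. The Python sketch below uses the mapping from the example above (event variable 117); the function name is hypothetical, exact string matching is an assumption, and returning `None` for an unmapped value is a choice made for this sketch, since the behavior for unmapped values is not described here.

```python
from typing import Dict, Optional

# Value found in the event variable -> SPECTRUM alarm severity.
SEVERITY_MAP: Dict[str, str] = {
    "Moderate": "Minor",
    "Severe": "Major",
    "Critical": "Critical",
}


def alarm_severity(
    variables: Dict[int, str],
    variable_id: int = 117,
    mapping: Dict[str, str] = SEVERITY_MAP,
) -> Optional[str]:
    """Look up the alarm severity for the value held in one event variable."""
    return mapping.get(variables.get(variable_id, ""))


print(alarm_severity({117: "Severe"}))   # Major
print(alarm_severity({117: "Unknown"}))  # None
```

One table replaces the three conditions and three intermediate events that the EventCondition approach required.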

3.4 Debugging Event Rules

When making event configuration changes, debugging can be enabled to troubleshoot potential issues. To enable debugging, set the “event_disp_error_file” parameter in the $SPECROOT/SS/.vnmrc file to a file name.

For example:

event_disp_error_file=eventerrors.out

Now, any time the “Update Event Configuration” button is used or the SpectroSERVER is restarted, any errors are written to the specified file ($SPECROOT/SS/eventerrors.out).

4.0 Condition Correlation

SPECTRUM Condition Correlation allows you to logically process multiple events on one or more models and correlate them into a single infrastructure condition. This results in a single root cause alarm with potentially suppressed symptomatic alarms. The Condition Correlation Editor (CCE) works by binding a SPECTRUM Event to a Condition. Unlike an Event, a Condition has persistence, meaning that it exists until it is cleared by its clear Event (for CCE to work effectively, your condition needs both set and clear event codes). Conditions can also be supplemented with parameter data, which can come from vardata in the set event or from model attribute data for the model where the set event occurs. This allows users to create “advanced” expressions to correlate alarms more accurately. Three basic scenarios are supported:

1. Correlate multiple alarms (events) as symptoms of one which is the common root cause.

2. Correlate multiple alarms (events) into a new alarm which represents the root cause.

3. Correlate alarms for monitored models to produce a new alarm on the domain for the managed

models based on some criteria.

Correlations are built using the following components:

Conditions – Conditions are the building blocks of a correlation; they exist on a resource, or model, in SPECTRUM. Simply put, a condition is defined by two events, a Set event and a Clear event. Conditions can also be enriched with parameters, which can be trap vardata, model attributes or user-defined values. Parameters can then be used for comparison and evaluation to determine accurate condition correlation.

Rules – Rules define relationships between two or more conditions when certain criteria are met. Rules look at conditions in terms of existence: a condition either “Exists”, “Does Not Exist” or “Counts” (exceeds a certain number of occurrences). Conditions are then related with the following expressions:

Implies - Condition(s) X IMPLIES Condition Y. When Condition(s) X (and any parameter

criteria) are met, Condition Y is generated (the Set events are created). Note, all

events/alarms associated with condition(s) will still be present.

Caused By – Condition(s) X are CAUSED BY Condition Y. When Condition(s) X (and any

parameter criteria) are met, and Condition Y is met, Condition Y is correlated as the Root

Cause condition. Alarms associated with condition X will be suppressed and show as

symptomatic of any alarm associated with condition Y (in the Impact section of an alarm).

Implied Cause – Combines both of the above. When Condition(s) X are met, Condition Y is

generated and becomes the root cause.

Policies – A set of one or more Rules.

Domain – A group of models to which a Policy, and as a result a set of Rules, can be applied. For a correlation to function properly, all models where the defined conditions (events!) occur must be in the Correlation Domain associated with the Policy. Correlation Domains are extremely beneficial, as they allow you to apply different correlations to different groups of models.

Condition Correlation is an extremely powerful tool and is much simpler than most people think. It is also

extremely flexible. Because of the use of correlation domains, correlations can be applied to specific sets

of models. In most environments, correlations may not apply to the entire infrastructure. Correlation

Domains allow for local correlation to occur if necessary. They also provide the ability to assert root cause

alarms on the correlation domain model itself.

Lastly, it is important to remember that condition correlation is ultimately correlating EVENTS, not alarms.

This is often a misconception that can lead to correlations that don’t work. That said, when events are

correlated, any alarms associated with those events also participate in the correlation. So the end result

is that alarms are suppressed and/or designated as “root cause”.
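The relationship between events, persistent conditions, and a “Caused By” rule can be sketched in Python. This is a conceptual model only, not how CCE is implemented: the `Condition` class, the `caused_by` function, and the model names are hypothetical, parameter criteria are omitted, and the event codes are the ones used in the Caused By example in section 4.1.

```python
from typing import Set, Tuple


class Condition:
    """A condition persists on a model from its set event to its clear event."""

    def __init__(self, name: str, set_event: int, clear_event: int) -> None:
        self.name = name
        self.set_event = set_event
        self.clear_event = clear_event
        self.active_models: Set[str] = set()

    def handle(self, model: str, event_code: int) -> None:
        if event_code == self.set_event:
            self.active_models.add(model)
        elif event_code == self.clear_event:
            self.active_models.discard(model)


def caused_by(symptom: Condition, root_cause: Condition, domain: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (root cause models, suppressed symptom models) within a domain."""
    roots = root_cause.active_models & domain
    if not roots:
        return set(), set()  # no root cause exists, so nothing is suppressed
    return roots, symptom.active_models & domain


contact_lost = Condition("Device Contact Lost", 0x10D35, 0x10D30)
ospf_loss = Condition("OSPF Neighbor Loss", 0x220031, 0x220024)
domain = {"rtr-a", "rtr-b", "rtr-c"}

contact_lost.handle("rtr-a", 0x10D35)
ospf_loss.handle("rtr-b", 0x220031)
ospf_loss.handle("rtr-c", 0x220031)
roots, symptoms = caused_by(ospf_loss, contact_lost, domain)
print(sorted(roots), sorted(symptoms))  # ['rtr-a'] ['rtr-b', 'rtr-c']
```

Note that the rule operates on conditions driven by events. Alarms take part in the correlation only because they are associated with those events, and a model outside the domain is never correlated.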

4.1 Caused By Rule Example

As stated above, the “Caused By” rule relationship can be used to correlate multiple conditions (events!) and designate one of them as the root cause. An example of this is correlating a device outage with OSPF neighbor loss events from its adjacent network neighbors. Without condition correlation, here is what happens:

1. SPECTRUM generates a critical alarm stating the downed device is no longer responding to

polls.

2. Each of the neighboring devices generates an alarm stating that OSPF neighbor state of that

device is now DOWN.

3. The end result is that five alarms are sent to the alarm console.

To effectively correlate these alarms (events), we need to use Condition Correlation Editor to define, and

then correlate the conditions that take place during this scenario. In this example, we have two different

conditions:

1. Device Contact Lost

2. OSPF Neighbor Loss

To open Condition Correlation Editor, select “Tools->Utilities->Condition Correlation Editor…” from the

OneClick console.

The Condition Correlation Editor UI is laid out according to the basic condition correlation components discussed above (Conditions, Rules, Policies, Domains):

Notice, there are a number of predefined Conditions (Author will be CA). The first step is to identify or

define the Conditions that will be needed for the correlation. To do this, it’s best to get the SPECTRUM

event codes that are involved in the scenario. In our example, we can simply look for what events

generate the alarms in question. It’s easy to do this using the Events tab in OneClick. You can quickly

identify the alarm events using the severity column, and then get the Event IDs from the Event Type

column.

For this example, the OSPF neighbor loss alarm is generated by event 0x220031 and the device contact loss alarm by event 0x10d35. As mentioned previously, conditions are defined with both set and clear events. The Events tab can also be used to obtain the clear events (as can the Event Configuration Editor). The clear events for the OSPF neighbor loss and device contact loss alarms are 0x220024 and 0x10d30, respectively. So thus far, we have:

1. Device Contact Lost – SET EVENT = 0x10d35, CLEAR EVENT = 0x10d30

2. OSPF Neighbor Loss – SET EVENT = 0x220031, CLEAR EVENT = 0x220024

The next step would be to identify any potential condition parameters. In this example, it would help to

identify the actual neighbor device in the OSPF event, as a single device might report that multiple OSPF

neighbors have been lost. Looking at the OSPF event, the neighbor IP is displayed as a trap varbind.

When adding parameters to conditions, the event variable ID is needed to reference an event variable, and the attribute ID is needed to reference a model attribute. In this case, the event variable ID can be obtained from the Event Configuration Editor by looking at the event message.

Now the relevant information looks like:

1. Device Contact Lost – SET EVENT = 0x10d35, CLEAR EVENT = 0x10d30

2. OSPF Neighbor Loss – SET EVENT = 0x220031, Neighbor IP address = Varbind 2, CLEAR EVENT

= 0x220024
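
The information gathered so far can be kept as a small worksheet before it is entered in CCE. The sketch below is only an illustration: the Condition class and the describe function are hypothetical names and are not part of SPECTRUM. The event codes and the varbind number come from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class Condition:
    """Illustrative worksheet record; not a SPECTRUM object or API."""
    name: str
    set_event: int       # event code that raises the condition
    clear_event: int     # event code that clears the condition
    parameters: dict = field(default_factory=dict)  # name -> (type, value)


def describe(cond):
    """Render a condition as one summary line with hex event codes."""
    return (f"{cond.name}: SET EVENT = {cond.set_event:#x}, "
            f"CLEAR EVENT = {cond.clear_event:#x}")


conditions = [
    Condition("Device Contact Lost", 0x10d35, 0x10d30),
    Condition("OSPF Neighbor Loss", 0x220031, 0x220024,
              parameters={"Neighbor IP address": ("Var Bind", 2)}),
]

for cond in conditions:
    print(describe(cond), cond.parameters)
```

Keeping the set event, clear event and parameters together in one record makes it easier to check each value against the Events tab before creating the Condition.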

In CCE, on the Conditions tab, select the “Create…” button. From the Create Correlation Condition

window, enter a name and the set and clear event codes.

Now select the “Create…” button in the Parameters section. Enter the parameter name (neighbor IP),

select Var Bind from the Parameter pull down and enter “2” for the value.

Finally select “Create” to finish the Condition. This creates a new condition for OSPF Neighbor Loss with a

parameter for the actual neighbor IP that gets passed to the event (varbind 2 – which was obtained from

the set event configuration).

The device contact lost event is actually already defined in an out of box Condition. You can see this by

entering the event code in the local filter in the Conditions tab (a good practice to check to see if the

Condition you are looking for is already defined).

Make sure that the network address is defined as a parameter for this condition. The process is similar to

what was done for the varbind. Select the “Edit…” button with the condition highlighted. Add a new

parameter for Network Address with a Parameter Type of Model Attribute and ID of 0x12d7f (this is just

the attribute ID).

Now that the conditions have been created/updated, the rule can be constructed. In this example, the

OSPF condition would be a symptom of the contact lost condition. From the Rules tab of CCE, select

“Create…”. In the “Symptom Condition(s):” list select the OSPF condition defined above (leave type as

“Exists”). Select the “Caused By” relationship from the pull down menu. For the “Root Cause Condition”,

select the ContactLost_Red condition (as determined above).

To leverage the Parameters that were defined above, select the “Show Advanced” button. This

component allows pre-defined parameters to be related to constants or to other defined parameters. In

this example, the neighbor IP address as shown in the OSPF neighbor loss event needs to match the IP of

the device that is no longer responding. Using the operators at the bottom of the Rule Criteria section,

choose the appropriate Parameter from each condition. For the OSPF scenario, the neighbor IP parameter

of the OSPF neighbor loss condition must match the IP address parameter of the contact lost condition.

Select the Insert Criterion button once that is completed.

Select “OK” to create the Rule.
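
The effect of this rule and its criterion can be sketched in a few lines. This is a conceptual model only, not how SPECTRUM implements correlation: the function name, the dictionary keys, the model names and the example addresses (taken from the 192.0.2.0/24 documentation range) are all assumptions made for illustration.

```python
def correlate_caused_by(ospf_losses, contact_losses):
    """Apply the criterion 'neighbor IP equals Network Address'.

    ospf_losses:    dicts with 'model' and 'neighbor_ip' (varbind 2)
    contact_losses: dicts with 'model' and 'network_address'
    Returns (root_causes, symptoms, unrelated).
    """
    down_addresses = {c["network_address"] for c in contact_losses}
    symptoms, unrelated = [], []
    for loss in ospf_losses:
        if loss["neighbor_ip"] in down_addresses:
            symptoms.append(loss)      # caused by a contact lost condition
        else:
            unrelated.append(loss)     # stays a stand-alone alarm
    return list(contact_losses), symptoms, unrelated


# Hypothetical example data.
contact = [{"model": "router-b", "network_address": "192.0.2.2"}]
ospf = [
    {"model": "router-a", "neighbor_ip": "192.0.2.2"},
    {"model": "router-a", "neighbor_ip": "192.0.2.9"},
]
roots, symptoms, unrelated = correlate_caused_by(ospf, contact)
print(len(roots), len(symptoms), len(unrelated))
```

Without the criterion, every OSPF neighbor loss in the domain would be treated as a symptom of any contact lost condition. The parameter comparison is what ties a neighbor loss to the specific device that stopped responding.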

Once you have created a rule, the next step is to associate the rule with a Policy. A Policy is simply a list of

rules. Policies allow multiple rules to easily be associated to a correlation domain. To create a new policy,

from the Policies tab, select “Create…”. In the Create Correlation Policy window, provide a Policy Name,

select the OSPF rule from the Available Rules list and move it to the Policy Rules list, then select Create.

With the OSPF Policy now in place, the last step is to associate the policy with a Correlation Domain.

Correlation Domains can be created in a couple of different ways. The first is through CCE. From the

Domains tab, select “Create…”. In the Create Correlation Domain window, provide a name, and select the

appropriate policy from the Available Policies list (in this case the OSPF policy) and move it to the Domain

Policies list.

The next step is to define which models will participate in the correlation domain. This is done by

selecting the Resources tab in the Create Correlation Domain window.

From here, use the “Add…” button to bring up a Locate Resources window. Use the available searches to

find the desired models, highlight them in the results list, then select the “Add Selected to Correlation

Domain” button.

Once the resources have been selected, select the “Create…” button.

Note that you can use Global Collections as resources for correlation domains (there is a pre-defined

search in the Locate Resources window). When a Global Collection is used, all of the models that

belong to the collection participate in the correlation domain. This is a great way to ensure that correlation

domains are updated automatically when new devices or models are added to the environment. Having a

dynamic global collection as a correlation domain resource will ensure that any new models that are added

to the collection automatically begin participating in the correlation domain.
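
The difference between a hand-picked resource list and a dynamic collection can be pictured as follows. This is a conceptual sketch only. How SPECTRUM evaluates Global Collections internally is not covered here, and the inventory, the search function and the model names are invented for illustration.

```python
def collection_members(inventory, search):
    """Re-evaluate the collection's search against the current inventory."""
    return {name for name, attrs in inventory.items() if search(attrs)}


def is_ups(attrs):
    """Hypothetical search criterion for the dynamic collection."""
    return attrs["type"] == "UPS"


inventory = {"ups-01": {"type": "UPS"}, "rtr-01": {"type": "Router"}}

static_resources = {"ups-01"}                    # models added by hand
before = collection_members(inventory, is_ups)   # dynamic membership

inventory["ups-02"] = {"type": "UPS"}            # a new device is modeled later
after = collection_members(inventory, is_ups)

print(sorted(static_resources), sorted(before), sorted(after))
```

The hand-picked set still contains only the original model, while the search-based membership picks up the new device without any change to the domain.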

Another way of easily adding models to correlation domains is by using the “Add To” option in OneClick.

This can be done by right-clicking a model or group of models and selecting

“Utilities->Add To->Correlation Domain…”.

This allows you to pick an existing correlation domain or create a new one by entering a new name.

We have successfully created a new SPECTRUM correlation by doing the following:

1. Defined and updated Conditions that are composed of the events involved in the fault scenario

and any parameters (event variables, model attributes) that can be used for deeper correlation.

2. Created a Rule that defines how the conditions are related to each other.

3. Created a Policy that is associated to the rule.

4. Defined a Domain that contains the devices/models that participate in the correlation and is

associated to the policy.

With the new correlation in place, when the device failure occurs, the contact lost and OSPF alarms are

correlated to a single root cause which is the only alarm presented to the alarm console. Also, the

“symptomatic” condition alerts are displayed in the Impact tab of the root cause alarm as Symptoms.
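
The way the four components hang together can be sketched as a small data model. The classes below are hypothetical and are not SPECTRUM objects. The rule, policy, domain and model names are placeholders, apart from ContactLost_Red, which is the out-of-box condition used above.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    name: str
    symptoms: list        # symptom condition names
    relationship: str     # "Caused By" or "Implied Cause"
    root_cause: str       # root cause condition name


@dataclass
class Policy:
    name: str
    rules: list = field(default_factory=list)


@dataclass
class Domain:
    name: str
    policies: list = field(default_factory=list)
    resources: set = field(default_factory=set)   # participating models


def rules_applying_to(model, domains):
    """Follow Domain -> Policy -> Rule for every domain containing the model."""
    found = set()
    for domain in domains:
        if model not in domain.resources:
            continue
        for policy in domain.policies:
            found.update(rule.name for rule in policy.rules)
    return sorted(found)


ospf_rule = Rule("OSPF Rule", ["OSPF Neighbor Loss"], "Caused By", "ContactLost_Red")
ospf_policy = Policy("OSPF Policy", [ospf_rule])
ospf_domain = Domain("OSPF Domain", [ospf_policy], {"router-a", "router-b"})

print(rules_applying_to("router-a", [ospf_domain]))
print(rules_applying_to("router-z", [ospf_domain]))
```

A model that is not a resource of the domain is never evaluated against the rule, no matter how the Conditions, Rule and Policy are defined.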

4.2 Implied Cause Rule Example

The next example will show how the Implied Cause correlation rule can be used. The Implied Cause rule

relationship is used to correlate a number of conditions (again, events), and “imply” a new root cause

condition that doesn’t currently exist. By contrast, in the previous example, with the Caused By rule

relationship we take a number of conditions and determine that one of those is the root cause and the

others are symptoms. In this example, we will take a number of existing conditions and correlate those to

a NEW condition. We can then determine where we want the new condition to be asserted (a model or

the correlation domain itself).

For this scenario a configuration change gets made to an apache configuration file that impacts a hosted

web site’s accessibility. When this happens, from a monitoring standpoint we receive a number of

different alarms in SPECTRUM:

1. A Wily Introscope alarm is generated stating that a 403 error is returned from the site.

2. An SPM test created for the site times out and an alarm is generated.

3. An error gets written to the apache log file, and an alarm is generated.

It’s been determined that these three alarms occur (at the same time) when the apache configuration file

for the site in question is incorrectly changed. What we would like to do is suppress these alarms, and

make them symptoms of a new root cause alarm stating that the apache configuration has changed.

Again, the first step in the process is determining which events are needed to define Conditions for each of

the alarms above. That is not covered at length here, as the process is described in section 4.1. In

this situation, there are no parameters that will be used for identifying the correlation (event variables

for example). However, when using the Implied Cause relationship, the new implied condition needs to be

associated to a participating model or the correlation domain itself. In our example we are going to assert

the new condition to the apache server model. To do this, we need to use the “Model” predefined

parameter in the condition that would occur on the model we want to associate the new condition with

(the apache server model).

So, I’ve defined Conditions for:

1. Wily Introscope alarm: CONDITION=“Introscope 403 Error”

2. SPM timeout: CONDITION=“SpmTestTimeOut”

3. Logfile match: CONDITION=“Minor Log Error”

For the conditions above:

1. The “Introscope 403 Error” occurs on a SPECTRUM event model (that represents the Wily

Introscope application component).

2. The “SpmTestTimeOut” occurs on the appropriate SPECTRUM SPM test model.

3. The “Minor Log Error” occurs on the SystemEDGE server model that represents the apache server.

Since the new, implied condition (and resulting alarm) needs to occur on the apache server model, the

Model parameter needs to be added to the “Minor Log Error” condition. In the Create Correlation

Parameter window (see section 4.1 for information on creating parameters), I can select Predefined as the

Parameter Type, and choose “Model”. This will automatically populate the model handle. Select

“Create” to add the parameter.

With the Implied Cause relationship, the new root cause condition also needs to be defined. This could be

any condition, but in this example a new alarm is desired. Using Event Configuration Editor a new event

has been created for the “apache configuration change”. This event results in a critical alarm, where a

new probable cause has been defined.

NOTE: This event was created from scratch. A new condition needs to be created using this event code.

The conditions now involved in this correlation look like:

1. Wily Introscope alarm: CONDITION=“Introscope 403 Error”

2. SPM timeout: CONDITION=“SpmTestTimeOut”

3. Logfile match: CONDITION=“Minor Log Error”

4. IMPLIED Apache Configuration change: CONDITION=“Server Config Error”

With the Conditions defined, the Rule can now be created. In the Rules tab, select “Create…”. In the

Create Correlation Rule window, supply a rule name and change the Relationship to “Implied Cause”.

Select the symptoms in the Symptom Condition(s) list (multi-select “Introscope 403 Error”, “Minor Log

Error” and “SpmTestTimeOut”). In the Root Cause Condition list, select the “Server Config Error”

condition.

The next step is to identify where to associate the root cause. In the Root Cause Target area, select the

“Condition” radio button. From the pull-down select “Minor Log Error”. In the Parameter pull down, select

“Model”. The Minor Log Error condition appears in the list because we defined the Model parameter for

that condition. Select the “Create” button to finish the rule.

To finish this correlation, a policy needs to be created with the Apache Configuration Change rule

associated to it. A correlation domain then needs to be defined and associated to the policy. Lastly, the

models need to be added to the correlation domain. In this example:

1. The SPECTRUM event model that the Wily “Introscope 403 Error” occurs on.

2. The SPECTRUM SPM test model that the “SpmTestTimeOut” occurs on.

3. The SystemEDGE server model that the “Minor Log Error” occurs on.

Once this is complete, the end result is that when the three alarms mentioned above occur (on the models

in the domain) there is now a new root cause alarm generated on the SystemEDGE host model that

represents the apache server. The symptomatic alarms are again suppressed and displayed as symptoms

of the apache configuration change alarm in the Impact tab.
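
The behavior of this Implied Cause rule can be sketched as follows. This is a conceptual model under stated assumptions and not SPECTRUM code: the function and the model names are hypothetical, while the condition names are the ones defined in this section.

```python
SYMPTOMS = {"Introscope 403 Error", "SpmTestTimeOut", "Minor Log Error"}
ROOT_CAUSE = "Server Config Error"
TARGET_CONDITION = "Minor Log Error"   # condition that carries the Model parameter


def imply_root_cause(active, domain_models):
    """active: (condition_name, model_name) pairs that are currently set.

    Returns (root_cause_name, target_model), or None if any symptom is missing.
    """
    in_domain = [(cond, model) for cond, model in active if model in domain_models]
    present = {cond for cond, _ in in_domain}
    if not SYMPTOMS <= present:
        return None
    # Root cause target: the model on which the target condition occurred.
    target = next(model for cond, model in in_domain if cond == TARGET_CONDITION)
    return (ROOT_CAUSE, target)


# Hypothetical model names for the three participating models.
domain = {"introscope-event-model", "spm-test-model", "apache-host"}
active = [
    ("Introscope 403 Error", "introscope-event-model"),
    ("SpmTestTimeOut", "spm-test-model"),
    ("Minor Log Error", "apache-host"),
]
print(imply_root_cause(active, domain))
print(imply_root_cause(active[:2], domain))
```

The new condition is asserted only when all three symptoms are present on models in the domain, and it lands on the model supplied by the “Minor Log Error” condition.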

4.3 Correlation Domains as Root Cause Targets Example

Another interesting and very useful example of the Implied Cause relationship correlation rule is using the

correlation domain as the root cause target. In the previous example, a correlation was built using the

implied cause relationship where the new, implied root cause was associated to a model participating in

the correlation. There may be circumstances where multiple events are correlated and the new implied

event/alarm needs to be associated to something arbitrary or something that is not represented in

SPECTRUM. This is a case where specifying the correlation domain as the root cause target can be

extremely effective.

For the example scenario, there are 100 UPS devices that exist in a remote site office (Boston). The

building does not have a backup generator. If the building power fails, all 100 UPS devices switch to

battery power. When this happens an alert is sent to SPECTRUM and an alarm is generated on ALL 100

UPS devices. Let’s assume that the event that is generated for this condition is the one shown below.

When this occurs, a Major alarm is generated. There is also an event that clears the alarm, so that can be

used as the “clear event” for the condition definition.

In CCE, the Condition has been created. In this case there are no parameters.

For the rule relationship, Implied Cause needs to be used. As a result, a root cause condition is needed.

Here a new event/alarm has been created for “Building Power Failure” (using Event Configuration Editor).

Those events are used to define the Building Power Failure condition.

Again, to create the correlation rule, from the Rules tab, select the “Create…” button. For this particular

correlation, the “Counts” condition type needs to be used on the UPS on Battery condition.

By using the “Counts” type, the correlation rule is looking for the number of concurrent instances of the

specific condition that exists on the assigned correlation domain. The advanced rule criteria can be used

to specify what the count value is. In this example, since there are 100 UPS devices in the building, a threshold of 75

is used. This way, if some of the UPS devices are offline or no longer

functioning properly, the correlation will not be missed. In the Symptom Condition(s) list, the “UPS on

Battery” is selected with the Type set to “Counts”. The relationship is “Implied Cause”, and the Root

Cause Condition is “Building Power Failure”. The Counts value must be specified by selecting the “Show

Advanced” button. In the rule criteria window, select the “UPS on Battery” condition for the Left Operand,

with the “Condition Count” Parameter (this is available once the Counts type is selected in the Symptom

Conditions list). Choose the GREATER THAN operator, then select the “By Value” checkbox in the

Right Operand section. The integer value can then be entered (75 in the example here).

The last step for defining the rule is to select the “Correlation Domain” in the “Root Cause Target” section.
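
The Counts evaluation can be sketched as below. This is a conceptual illustration, not SPECTRUM code. The function and the UPS model names are hypothetical, and “Boston” stands in for the correlation domain name. Note that with the GREATER THAN operator and a value of 75, at least 76 concurrent “UPS on Battery” conditions are needed before the rule evaluates to true.

```python
def evaluate_counts_rule(on_battery, domain_name, domain_models, threshold=75):
    """Imply 'Building Power Failure' on the domain when the count exceeds threshold.

    on_battery:    models with an active 'UPS on Battery' condition
    domain_models: models that are resources of the correlation domain
    """
    count = len(set(on_battery) & set(domain_models))
    if count > threshold:                  # GREATER THAN, so 76 or more
        return ("Building Power Failure", domain_name)
    return None


# Hypothetical names for the 100 UPS models in the building.
ups_models = [f"ups-{n:03d}" for n in range(1, 101)]

print(evaluate_counts_rule(ups_models[:75], "Boston", ups_models))
print(evaluate_counts_rule(ups_models[:76], "Boston", ups_models))
```

Because the root cause target is the correlation domain, the result carries the domain name and not the name of any single UPS model.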

To finish the correlation, a Policy needs to be created that is associated with the rule created above.

Lastly, create a new correlation domain, associate it to the Power Failure policy and add the UPS device

models as resources. It is important here to give the Domain the appropriate building name. Because the

root cause alarm is asserted on the correlation domain model, the domain represents the building, so name

it accordingly.

4.4 Condition Correlation Troubleshooting

Hopefully after some of the examples above, Condition Correlation Editor is not as complex as it might

have seemed previously. There are some very common mistakes that result in correlations not working.

1. When defining Conditions, make sure the event code is used, not the alarm pcause code. Often,

when events generate alarms, the codes are the same. This is NOT always the case. Always

use the Events tab to validate the correct event code is being used.

2. Verify that the models on which the correlation conditions (events) are occurring are in the

associated correlation domain. You can do this by looking at the correlation domain under

“Correlation Manager” in the OneClick Navigation panel.

You can also validate domain membership by looking at the events on the models that should be in

the domain. If they are correctly participating in a correlation domain you will see the following

events (0x10e08):

3. Verify that you have the following components associated correctly: Conditions->Rule->Policy-

>Domain.

4. Another common mistake occurs when condition rule criteria are written incorrectly. If steps 1 and

2 above have been verified and any rule using the conditions has advanced rule criteria defined,

remove the rule criteria and see if the correlation works. Since rule criteria are most often used to

add more granularity in identifying correlation scenarios, removing the criteria should still allow

the correlation to work. By doing this, it can be determined if the problem is with the

rule criteria, or something else.
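
The reasoning behind steps 2 and 4 can be sketched as a small check. This is a conceptual aid only and not a SPECTRUM tool: the function, the dictionary keys, the model names and the example addresses are assumptions made for illustration.

```python
def diagnose(symptom, root, domain_models, criteria):
    """Walk the checks in order: domain membership first, then rule criteria."""
    for cond in (symptom, root):
        if cond["model"] not in domain_models:
            return f"{cond['model']} is not in the correlation domain"
    if all(check(symptom, root) for check in criteria):
        return "correlation should work"
    # With the criteria removed the rule would match, so the criteria are suspect.
    return "review the rule criteria"


symptom = {"model": "router-a", "neighbor_ip": "192.0.2.2"}
root = {"model": "router-b", "network_address": "192.0.2.2"}

# Intended criterion: neighbor IP equals the Network Address attribute.
good = [lambda s, r: s["neighbor_ip"] == r["network_address"]]
# Mistaken criterion: compares the neighbor IP with the wrong parameter.
bad = [lambda s, r: s["neighbor_ip"] == r["model"]]

print(diagnose(symptom, root, {"router-a", "router-b"}, good))
print(diagnose(symptom, root, {"router-a", "router-b"}, bad))
print(diagnose(symptom, root, {"router-a"}, good))
```

An empty criteria list always passes, which mirrors the advice above: if the correlation works once the criteria are removed, the criteria are where the problem lies.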

5.0 Supporting Documentation

Modeling Your IT Infrastructure Guide (5167)

Event Configuration User Guide (5188)

Condition Correlation User Guide (5175)