Extending SPECTRUM Event Correlation
This document provides examples of how to extend SPECTRUM’s event correlation capabilities.
Specifically, sample event rules and condition correlations are demonstrated. It is best if the reader is
familiar with basic SPECTRUM event management topics, such as how to configure Events and Alarms
through the Event Configuration Editor.
Contents
1.0 Introduction
2.0 Simple Event Configuration Updates
2.1 Alarm De-duplication Example
3.0 Event Rules
3.1 EventRate Example
3.2 EventCondition Example
3.3 Conditional Alarm Severity Example
3.4 Event Rule Troubleshooting
4.0 Condition Correlation
4.1 Caused By Example
4.2 Implied Cause Example
4.3 Correlation Domains as Root Cause Targets Example
4.4 Condition Correlation Troubleshooting
5.0 Supporting Documentation
1.0 Introduction
SPECTRUM’s event management system is highly customizable and provides a powerful method for
configuring event correlation. The event management system is used to notify users of significant
occurrences within the monitored environment primarily through the use of Events and Alarms.
SPECTRUM provides out-of-box event correlation capabilities for fault suppression (due to an outage),
certain alarm de-duplication, and minimizing alarms by suppressing child model alerts in the event of a
failure (ports, process models, etc.). Often, however, it is necessary to enhance event correlation to
meet specific customer needs and further reduce the number of alerts that an operations staff has
to deal with. There are a number of ways that SPECTRUM event correlation capabilities can be updated
and enhanced. The examples covered in this document are as follows:
1. Simple Event Configuration updates
This includes specifying which events generate/clear alarms and event variables to discriminate
on when doing so (alarm de-duplication). In addition, event and alarm descriptions can be
modified and enriched.
2. Event Rules
Event rules allow for events to be correlated on individual models. Event Rules permit you to
specify a more complex decision-making system to indicate how an event is to be processed.
You can use event rules to define alarm conditions based on specific event patterns or content.
3. Condition Correlation
Condition correlation allows for multiple events to be correlated across groups of models.
Events (or the alarms associated with those events) can be identified as the “root cause” and
new root cause conditions can be inferred.
2.0 Simple Event Configuration Updates
SPECTRUM event and alarm behavior can easily be modified through the Event Configuration Editor.
To launch the UI, from SPECTRUM OneClick select “Tools->Utilities->Event
Configuration…” (this menu is also available by right-clicking).
The Event Configuration UI can quickly be used to change event messages, alarming behavior (alarm
generation, severity, clearing) and alarm descriptions.
2.1 Alarm De-duplication Example
As mentioned previously, the way events are correlated can result in alarm de-duplication. SPECTRUM
can be configured to generate a unique alarm for each event occurrence, create one alarm and append
each additional occurrence to that alarm, or use event variables to determine when new alarms should be
generated (event variable discriminators). This type of behavior is controlled in the Alarm Options section
of the Alarms tab. The “AUTHORIZATION FAILURE TRAP RECEIVED” alarm will be used as an example.
Once you identify what alarm or event you need to modify, the local Filter box located in the Navigation
panel of the Event Configuration UI can be used to quickly find the event in question.
Free text can be used to identify a string in the event message or alarm type; however, the best way to
find the event you want is to use the event code, which is shown in the SPECTRUM Events window in
OneClick.
Back to the example, looking at the 0x00010017 event you can see a Minor alarm is generated. In the
Alarm Options tab, the “Generate a Unique Alarm for Each Event” is unchecked. This means that when
the event first occurs, a new minor alarm will be created. If another event of the same type occurs on the
same model, and the original alarm has NOT been cleared, then the new event will be appended to the
existing alarm.
If the “Generate a Unique Alarm for Each Event” box is checked, each time a new authentication failure
event is received for the same device, a new alarm will be created regardless of whether previous alarms
have been cleared. Toggling the “Generate a Unique Alarm for Each Event” option covers the two extreme
cases: generate new alarms for all events, or generate only one alarm (until it’s cleared) and append all
other events to that alarm. In some situations users may want to generate new alarms only if certain
information in the incoming events differs. In this situation, “Event Variable Discriminators” can be
used to generate new alarms only if event data is different from that of previous events of the same type.
SPECTRUM Event Variables are used to store event data that can change from event to event. Most
commonly, event variables represent SNMP trap varbind variables (var data). When an SNMP trap is
mapped in SPECTRUM (via MIB Tools for example), all the varbinds defined in the SNMP MIB are stored in
the event as event variables. Looking at the Authentication Failure event, there is only one varbind and it
is the IP address of the source of the authentication failure.
Using Event Variable Discriminators, it is now possible to enhance the alarm de-duplication capabilities.
By entering the event variable ID in the Event Variable Discriminators field, SPECTRUM will only generate
a new alarm if the value of the specified event variable differs from that of the originating alarm event.
So, in our example, if 1 (the ID of the event variable holding the source IP for the authentication failure
event) is entered into the Event Variable Discriminators field, when multiple authentication failure events
are received on the same device, new alarms will only be generated if the source IP value is different.
Events with the same source IP will be appended to the existing alarm.
A comma can be used to separate multiple event variable discriminators.
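The de-duplication behavior described above can be sketched in a few lines of Python. This is only an
illustration of the logic, not SPECTRUM code: the Alarm class, the process_event function, the model name
and the IP addresses are all hypothetical. Only the event code 0x00010017 and event variable ID 1 come
from the example.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    model: str
    event_code: int
    key: tuple                       # values of the discriminator variables
    events: list = field(default_factory=list)


def process_event(open_alarms, model, event_code, variables, discriminators=()):
    """Append the event to a matching open alarm, or create a new alarm.

    With no discriminators the key is always (), so every event of the same
    type on the same model lands on one alarm until that alarm is cleared.
    """
    key = tuple(variables.get(var_id) for var_id in discriminators)
    for alarm in open_alarms:
        if (alarm.model, alarm.event_code, alarm.key) == (model, event_code, key):
            alarm.events.append(variables)
            return alarm
    alarm = Alarm(model, event_code, key, [variables])
    open_alarms.append(alarm)
    return alarm


# Three authentication failures on one device, two from the same source IP.
alarms = []
for source_ip in ["10.0.0.5", "10.0.0.5", "10.0.0.9"]:
    process_event(alarms, "router-1", 0x00010017, {1: source_ip}, discriminators=(1,))

print(len(alarms))  # 2: one alarm per distinct source IP
```

Passing several IDs in discriminators corresponds to a comma-separated list in the field: a new alarm
is created only when the combination of values has not been seen on an open alarm.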
Event variable discriminators can be used when clearing SPECTRUM alarms as well. Events can not only
“generate” alarms, but also clear existing alarms. Typically, to set up event-based alarm clearing,
you simply specify the alarm cause code that is to be cleared by the event in question. Event Variable
Discriminators can also be used here to clear alarms only if the Event Variable Discriminators match.
3.0 Event Rules
Event Rules permit you to specify a more complex decision-making system to indicate how an event is to
be processed. You can use event rules to define alarm conditions based on specific event patterns or
content. Each of the event rules looks for a series of events to occur on a model in a certain pattern or
time frame. If the events occur as the rule specifies, another event is generated for that model. This new
event can then be processed as desired. Some common uses of event rules are generating an alarm only
when an event occurs at a certain frequency in a specified time period, or changing the alarming behavior
based on the varbind content of a trap.
SPECTRUM provides four customizable event rule types:
EventPair
In some cases, you expect events to happen in pairs, and if the second event does not occur, this may
indicate a problem in the computing infrastructure. An event pair rule creates an event based on this
scenario. If the first of two expected events is generated but the second event does not follow the first, a
new event is generated in response. You can specify the amount of time that can elapse before the new
event is generated in response. Note that other, unrelated events can be generated between the first
event and the second event; they do not affect execution of the rule.
An EventPair rule can be used to indicate that a device reload did not complete successfully and therefore
the device can no longer be reached. SPECTRUM would normally give you an event for the reload and
ultimately an alert for the device no longer responding; however, the two incidents would not be related.
The graphic above depicts the EventPair rule explained previously. In the first scenario, the coldStart
standard trap is received during the 15-minute time frame; therefore, no alarm is generated. In the
second scenario, the necessary trap is not received; therefore, an alarm is generated.
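The two scenarios can be sketched as a small Python function. This is a minimal model of the rule's logic
under stated assumptions, not SPECTRUM's implementation: the function name, the string event codes
("reload", "coldStart", "linkDown") and the timestamps in seconds are all hypothetical, and the output
event is represented simply by the time at which the wait period expired.

```python
def event_pair_outputs(events, first_code, second_code, timeout, now):
    """Return the times at which the pair rule would emit its output event.

    events: list of (timestamp_seconds, event_code), sorted by time.
    now:    current time, used to notice a wait period that has already expired.
    """
    outputs = []
    deadline = None  # set while the rule is waiting for the second event
    for timestamp, code in events:
        if deadline is not None and timestamp > deadline:
            outputs.append(deadline)  # second event did not arrive in time
            deadline = None
        if code == first_code and deadline is None:
            deadline = timestamp + timeout
        elif code == second_code and deadline is not None:
            deadline = None  # pair completed, nothing to report
        # any other event code is unrelated and does not affect the rule
    if deadline is not None and now > deadline:
        outputs.append(deadline)
    return outputs


FIFTEEN_MINUTES = 15 * 60

# Scenario 1: coldStart arrives within 15 minutes, so no output event.
ok = [(0, "reload"), (300, "linkDown"), (600, "coldStart")]
# Scenario 2: coldStart never arrives, so the rule fires at the deadline.
failed = [(0, "reload"), (300, "linkDown")]

print(event_pair_outputs(ok, "reload", "coldStart", FIFTEEN_MINUTES, now=2000))      # []
print(event_pair_outputs(failed, "reload", "coldStart", FIFTEEN_MINUTES, now=2000))  # [900]
```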
EventRate
Some types of events can be tolerated and do not indicate a problem unless the frequency at which they
are generated reaches a specific threshold within a specific amount of time. An event rate rule creates an
event based on this scenario. When a number of events of the same type (that is, with the same event
code) are created within a given time period, a new event is created in response. Note that other,
unrelated events do not affect execution of the rule.
Event rate rules never terminate. Once the conditions of the rule are met and a new event is created in
response, the rule remains active, but no additional event is created as long as the frequency at which the
evaluated events occur remains at or above the rate specified in the rule. If the frequency drops below the
specified rate, and then subsequently exceeds that rate again, another new event is generated in
response.
An event rate rule can use either of the following methods to define the window of time in which the
events must occur:
Sliding Window: When the rule uses this type of time window, if the specified number of events (or more)
ever occurs within any window of the specified time period, the output event is created in response. This
type of time window is best suited for accurately detecting a short burst of events.
For example, the following illustration shows the sliding time windows that are active for a rule that
watches for five instances of a given event (e) within a specified time period.
When a sliding time window is used for a rule, if the rule generates a rule output event, all active time
windows are terminated, and a new time window automatically begins.
Sequential Window: When the rule uses this type of time window, non-overlapping time windows are
examined, one after another, to determine if the requisite number of events has occurred within the time
window. This type of time window is best suited for detecting a long, sustained train of events.
For example, the following illustration shows the sequential time windows that are opened and closed for a
rule that watches for five instances of a given event (e) within a specified time period.
If the current time window closes due to time period expiration, or if the rule creates an output event in
response, the next time window is not opened until a new event occurrence is detected.
The EventRate rule could be used if a device is generating authenticationFailure traps at a sustained rate.
It may be the case that one or two authenticationFailure traps are acceptable and should not cause an
alert, but if a large number of these occur in a short time period an alert is desired.
In this example, when there are more than 20 authenticationFailure traps in a minute, an alarm is
generated.
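The difference between the sliding and sequential time windows can be illustrated with a short Python
sketch. It is a simplified model under stated assumptions: the function names are hypothetical, timestamps
are in seconds, an event exactly one period old is treated as outside the window, and the sketch does not
model the suppression of further output events while the rate stays at or above the threshold.

```python
from collections import deque


def sliding_window_outputs(timestamps, count, period):
    """Fire when `count` events fall within any span of `period` seconds."""
    outputs = []
    recent = deque()
    for t in timestamps:
        recent.append(t)
        while recent[0] <= t - period:   # drop events older than the window
            recent.popleft()
        if len(recent) >= count:
            outputs.append(t)
            recent.clear()               # active windows end after an output
    return outputs


def sequential_window_outputs(timestamps, count, period):
    """Fire when `count` events fall within one non-overlapping window."""
    outputs = []
    window_start = None
    seen = 0
    for t in timestamps:
        if window_start is None or t >= window_start + period:
            window_start, seen = t, 0    # a new window opens on this event
        seen += 1
        if seen >= count:
            outputs.append(t)
            window_start = None          # next window waits for a new event
    return outputs


# Five events inside 60 seconds, but straddling a sequential window boundary.
burst = [0, 50, 55, 58, 62, 65]
print(sliding_window_outputs(burst, count=5, period=60))     # [65]
print(sequential_window_outputs(burst, count=5, period=60))  # []
```

The burst between 50 and 65 seconds is caught by the sliding window, while the sequential windows
(starting at 0 and at 62 seconds) each see fewer than five events, which is why the sliding window suits
short bursts and the sequential window suits sustained trains of events.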
EventSeries
An event series rule creates a new event when a given event is followed by one or more other events in
an ordered or unordered sequence. The combination of events that must occur can include any number
and type of event, and you can specify the amount of wait time that can elapse during which the sequence
of events must occur. Note that other, unrelated events can be generated during the wait time; they do
not affect execution of the rule.
The EventSeries rule could be used where a device receives a Service Performance Manager (SPM)
configuration failure event as well as high memory and high CPU use events. Because the device is
unable to handle the current load of response time testing, you’ll need an alarm.
The originating event, 0x456002, is generated when a device hosting an SPM Real Time Monitor (RTM)
test experiences a failure to run a test. If the device also experiences high CPU use and high memory
events (0x10f03 and 0x10f04) in a one-minute interval, event 0x10031 is generated. This resulting event
is disposed to a major alarm and can include specific messaging for the operator and troubleshooter about
linking the high memory and CPU use failures to SPM load levels.
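As a rough sketch of the unordered case, the following Python function checks whether an originating event
is followed by all required events within the wait time. The function name and the unrelated event code
0x999 are hypothetical; the other event codes are the ones used in the example above. It is an
illustration of the rule's logic, not SPECTRUM's implementation.

```python
def event_series_fires(events, trigger, required, wait):
    """Return True if `trigger` is followed by every code in `required`
    (in any order) within `wait` seconds.

    events: list of (timestamp_seconds, event_code), sorted by time.
    """
    start = None
    pending = set()
    for t, code in events:
        if start is not None and t > start + wait:
            start = None                 # wait time expired, rule resets
        if start is None:
            if code == trigger:
                start, pending = t, set(required)
        elif code in pending:
            pending.discard(code)
            if not pending:
                return True              # SPECTRUM would create 0x10031 here
        # unrelated events during the wait time are ignored
    return False


SPM_TEST_FAILURE = 0x456002
HIGH_CPU, HIGH_MEMORY = 0x10f03, 0x10f04

in_time = [(0, SPM_TEST_FAILURE), (10, HIGH_MEMORY), (20, 0x999), (45, HIGH_CPU)]
too_late = [(0, SPM_TEST_FAILURE), (10, HIGH_MEMORY), (75, HIGH_CPU)]

print(event_series_fires(in_time, SPM_TEST_FAILURE, {HIGH_CPU, HIGH_MEMORY}, wait=60))   # True
print(event_series_fires(too_late, SPM_TEST_FAILURE, {HIGH_CPU, HIGH_MEMORY}, wait=60))  # False
```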
EventCondition
This rule allows you to generate an event based on a conditional expression. A series of conditional
expressions can be listed with this rule and the first expression that is found to be TRUE will generate the
event specified with the condition. The conditional expressions can compare a variable binding value or a
SPECTRUM attribute value to a user-specified value using the standard comparison operators. You can
use this type of event rule to evaluate complex scenarios using:
- Comparison operators
- Regular expressions
- String comparisons
- Nested conditions
EventCondition rules can be used in a wide variety of situations. Perhaps the most common is
using the rule to evaluate the content of a trap varbind and, based on that, alter the event/alerting
behavior.
3.1 Event Rule Example - EventRate
A common event rule is for authentication failures. When SPECTRUM receives an authenticationFailure
trap, event 0x00010017 is generated. This event then generates a minor alarm with PCause ID 0x1030a.
As discussed in the EventRate example above, it may not be desirable for a single
authenticationFailure trap to generate an alert; rather, an alert should be generated only when multiple
traps occur in a short time span. We will use an EventRate rule to accomplish this.
When working with event rules, the event that gets generated as a result of a positive rule match
will most likely be a new event. This isn’t a requirement, but it makes sense to have a
new, more descriptive event. In our case, there will be a new event generated when 20
authenticationFailure traps occur in 60 seconds. It’s easiest just to copy the originating event
(0x00010017) and make the appropriate changes. To do this, with the event highlighted, select the
“Copy” button in the Navigation panel. In the resulting Copy Event window, enter a new code or just
take the default, then select OK. Update the event text if desired.
Once the new event has been added, disable the alarm generation for 0x00010017. In the Alarms tab for
0x00010017, toggle the “Severity” pull down to “None”.
The next step is to create the event rule. To do this, move to the Event Rules tab and select the “Creates
a new event rule” button, then choose “EventRate…”.
In the Event Rate creation window, enter the necessary parameters.
Use the “Browse…” button to find the new event created previously. NOTE: You can create and copy
events from the Browse window if needed. Once all is entered, select OK. Now that the event rule is
created, when a single authentication failure is received, the event will simply be logged. The rule above
generates event 0xfff00000 when 20 authentication failures are received in 60 seconds. The last step is
to ensure that 0xfff00000 results in the desired alarm.
Using the quick filter, enter the code of the new event that gets generated as a result of the rule. Select
the Alarms tab and ensure the required severity is set.
In the “Cause Code” field the “Browse…” button can be selected to change or update the resulting alarm
description. After selecting the “Browse…” button, the Select Alarm Cause Code window comes up. Here,
you can choose an existing alarm description, or copy/create a new one. Selecting the “Copy” button
allows you to modify a copy of the existing AUTHENTICATION FAILURE alarm description to describe the new
condition.
Once the necessary changes have been made, select “OK”. The new alarm is now generated by event
0xfff00000.
Finally, to save your changes, select File->Save All.
3.2 Event Rule Example - EventCondition
All event rule types are built in primarily the same way, with the exception of the EventCondition
rule. The difference with EventCondition rules is that an expression may need to be constructed.
In the example here, we will use an EventCondition rule to examine the contents of a trap varbind and
determine whether the trap indicates an alarm should be generated or cleared. In this case we are
looking at a trap from Wily CEM. One of the varbinds contains either “_OPEN” or “_CLOSED”, indicating
whether the CEM incident is being opened or closed. The event looks like the following:
You may be asking yourself, “How do you know that the varbind will have _OPEN or _CLOSED?” This is
information that you will have to determine beforehand. Often, when trying to solve integration
problems or event correlation issues, the raw events need to be examined. From there, you can see
patterns or consistent information that can be exploited in an event rule. So, logically, what we want to
happen is this:
1. If varbind 116 contains _OPEN, generate an alarm.
2. If varbind 116 contains _CLOSED, clear an alarm.
The process for creating the EventCondition rule starts in the same manner as before. Find the event,
then select the Event Rules tab and select “Creates a new event rule” button. Select “Event Condition…”
from the pull down:
In the Event Condition rule creation UI, select the “Add” button.
The Edit window will be displayed, which allows you to build complex (or simple) expressions for evaluating
a specific event condition. Again, in our example the first condition is that varbind 116 contains the string
_OPEN. Using the Operands in the Condition window, the event variable (event attribute) in varbind 116
can be compared to a text string. In this case REGEXP is used to specify that the condition evaluates to
true if varbind 116 contains the string “_OPEN”.
Once the condition is defined, select the “Insert Criterion” button. Next, specify the event to be created
when the condition evaluates to true, and then select OK. Now, the next condition needs to be defined.
Back in the Event Condition List window, select the “Add” button again.
The new condition follows the same concept as the one above, with a different value comparison
and resulting event.
When building conditions, there are many possibilities. Not only event variables can be used, but also
model attributes and predefined values.
Lastly, you can use the DEFAULT operator to always have a condition that evaluates to true. Event
conditions work sequentially, so by putting the DEFAULT condition last, if no other conditions evaluate to
true the DEFAULT condition will. This provides an opportunity to specify an event for the case in which
none of the other conditions is met.
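The first-match evaluation order can be sketched in Python as follows. This is an illustration only:
the CONDITIONS list, the evaluate function, the sample varbind values and the three resulting event codes
are hypothetical. Only varbind 116 and the strings _OPEN and _CLOSED come from the example.

```python
import re

# Each entry pairs a predicate over the event variables with the event code to
# generate. The resulting codes are hypothetical placeholders.
CONDITIONS = [
    (lambda v: re.search(r"_OPEN", str(v.get(116, ""))) is not None, 0xFFF00001),
    (lambda v: re.search(r"_CLOSED", str(v.get(116, ""))) is not None, 0xFFF00002),
    (lambda v: True, 0xFFF00003),  # DEFAULT: always true, so it must come last
]


def evaluate(variables, conditions=CONDITIONS):
    """Return the event code of the first condition that evaluates to true."""
    for predicate, event_code in conditions:
        if predicate(variables):
            return event_code
    return None  # only reachable if there is no DEFAULT condition


print(hex(evaluate({116: "INCIDENT_OPEN"})))    # 0xfff00001
print(hex(evaluate({116: "INCIDENT_CLOSED"})))  # 0xfff00002
print(hex(evaluate({116: "something else"})))   # 0xfff00003
```

Because evaluation stops at the first true condition, placing DEFAULT anywhere but last would hide every
condition listed after it.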
In the end, there are three conditions that make up the event condition rule.
Of course, the resulting events might need to be modified to produce the desired behavior (generate an
alarm, trigger another event rule, etc.). Again, once any changes are made, select “File->Save All” to
save them.
3.3 Conditional Alarm Severity Example
Traditionally, another prime example of using EventCondition rules was to evaluate a trap varbind that had
severity information and correlate that to different SPECTRUM alarms. Typically, you would have an event
condition rule that looked like this:
This rule has a set of conditions that evaluate event variable 117. If it contains strings “Moderate”,
“Severe” or “Critical” a different event then gets generated. Those three resulting events then simply
generate the same alarm, with three different SPECTRUM alarm severities (Moderate = SPECTRUM Minor,
Severe = SPECTRUM Major, Critical = SPECTRUM Critical). Now, for alarm severity mapping such as this,
an event rule is no longer needed. Conditional Alarm Severity can be used.
To use Conditional alarm severity, simply select “Conditional” from the Severity pull-down of the
Alarms tab.
Next, specify the event variable where the severity exists that needs to map to the appropriate SPECTRUM
alarm severity. Also specify an Alarm Cause Code.
To define the alarm severity mappings, select the “Configure…” button. By selecting the Add button, a new
severity mapping can be defined.
In the Add dialog, specify a name, and then begin adding the Value/Severity mappings. The Directory
field can be used if a predefined mapping file (ASCII text) has already been created. Once completed, the
SPECTRUM alarm severities will be determined by the varbind mappings that have been identified.
As you can see, this method is much quicker and simpler than the event condition rule used previously.
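Conceptually, a Value/Severity mapping is a lookup table keyed on the content of the chosen event variable.
The Python sketch below illustrates the idea using the three mappings from the earlier example. The names
SEVERITY_MAP and alarm_severity are hypothetical, and the sketch assumes an exact match on the variable's
value, which may differ from how SPECTRUM actually compares values.

```python
# Value -> SPECTRUM alarm severity, as in the example mappings above.
SEVERITY_MAP = {
    "Moderate": "Minor",
    "Severe": "Major",
    "Critical": "Critical",
}


def alarm_severity(variables, var_id=117, mapping=SEVERITY_MAP, default=None):
    """Look up the alarm severity for the value held in event variable `var_id`.

    Returns `default` when the variable is missing or its value is unmapped.
    """
    return mapping.get(variables.get(var_id), default)


print(alarm_severity({117: "Severe"}))    # Major
print(alarm_severity({117: "Moderate"}))  # Minor
print(alarm_severity({117: "Unknown"}))   # None
```

One table replaces the three conditions and three intermediate events that the event condition rule needed.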
3.4 Debugging Event Rules
When making event configuration changes, debugging can be enabled to troubleshoot potential issues.
To enable debugging, edit the $SPECROOT/SS/.vnmrc file parameter “event_disp_error_file” by specifying
a file name.
For example:
event_disp_error_file=eventerrors.out
Now anytime the “Update Event Configuration” button is used, or the SpectroSERVER is restarted, any
errors will be written to the file specified ($SPECROOT/SS/eventerrors.out).
4.0 Condition Correlation
SPECTRUM Condition Correlation allows you to logically process multiple events on one or multiple models,
and correlate them into a single infrastructure condition. This will result in a single root cause alarm with
potentially suppressed symptomatic alarms. The Condition Correlation Editor (CCE) works by binding a
SPECTRUM Event to a Condition. Unlike an Event, a Condition has persistence, meaning a Condition exists
until it is cleared by its clear Event (for CCE to work effectively, you’ll need your condition to have both set
and clear event codes). Conditions can also be supplemented with parameter data, which can come from
vardata in the set event or from model attribute data for the model where the set event occurs. This allows
users to create “advanced” expressions to more accurately correlate alarms. Three basic scenarios
are supported:
1. Correlate multiple alarms (events) as symptoms of one which is the common root cause.
2. Correlate multiple alarms (events) into a new alarm which represents the root cause.
3. Correlate alarms for monitored models to produce a new alarm on the domain for the managed
models based on some criteria.
Correlations are built using the following components:
Conditions – Conditions are the building blocks of a correlation; they exist on a resource, or model, in
SPECTRUM. Simply put, conditions are defined with events: a Set event and a Clear event.
Conditions can also be enriched with parameters, which can be trap vardata, model attributes or
user defined. Parameters can then be used for comparison and evaluation in determining accurate
condition correlation.
Rules – Rules define relationships between two or more conditions when certain criteria are met.
Rules look at conditions in terms of existence: a condition “Exists”, “Does Not Exist” or
“Counts” (exceeds a certain number of occurrences). Conditions are then related with the following
expressions:
Implies - Condition(s) X IMPLIES Condition Y. When Condition(s) X (and any parameter
criteria) are met, Condition Y is generated (the Set events are created). Note, all
events/alarms associated with condition(s) will still be present.
Caused By – Condition(s) X are CAUSED BY Condition Y. When Condition(s) X (and any
parameter criteria) are met, and Condition Y is met, Condition Y is correlated as the Root
Cause condition. Alarms associated with condition X will be suppressed and show as
symptomatic of any alarm associated with condition Y (in the Impact section of an alarm).
Implied Cause – Combines both of the above. When Condition(s) X are met, Condition Y is
generated and becomes the root cause.
Policies – A set of one or more Rules.
Domain – A group of models that a Policy, and as a result a set of Rules, can be applied to. In
order for a correlation to function properly, all models where the defined conditions (events!) occur
must be in the Correlation Domain associated with the Policy. Correlation Domains are extremely
beneficial, as they allow you to apply different correlations to different groups of models.
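The persistence of a Condition between its Set and Clear events can be sketched in Python as follows.
The ConditionTracker class, the condition name, the model name and the unrelated event code 0x10001 are
hypothetical; the set and clear event codes are the ones used for the device contact lost condition in
section 4.1. It is a conceptual model, not CCE's implementation.

```python
class ConditionTracker:
    """Tracks which conditions currently exist on which models."""

    def __init__(self, definitions):
        # definitions: {condition name: (set event code, clear event code)}
        self._by_set = {s: name for name, (s, _) in definitions.items()}
        self._by_clear = {c: name for name, (_, c) in definitions.items()}
        self.active = set()  # (condition name, model) pairs

    def handle(self, model, event_code):
        if event_code in self._by_set:
            self.active.add((self._by_set[event_code], model))
        elif event_code in self._by_clear:
            self.active.discard((self._by_clear[event_code], model))
        # any other event leaves existing conditions untouched


tracker = ConditionTracker({"Device Contact Lost": (0x10d35, 0x10d30)})

tracker.handle("router-1", 0x10d35)   # set event: the condition now exists
tracker.handle("router-1", 0x10001)   # unrelated event: the condition persists
print(("Device Contact Lost", "router-1") in tracker.active)  # True

tracker.handle("router-1", 0x10d30)   # clear event: the condition is removed
print(("Device Contact Lost", "router-1") in tracker.active)  # False
```

Rules then test this set of active conditions ("Exists", "Does Not Exist") rather than individual events,
which is why a condition without a clear event would never stop participating in correlations.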
Condition Correlation is an extremely powerful tool and is much simpler than most people think. It is also
extremely flexible. Because of the use of correlation domains, correlations can be applied to specific sets
of models. In most environments, correlations may not apply to the entire infrastructure. Correlation
Domains allow for local correlation to occur if necessary. They also provide the ability to assert root cause
alarms on the correlation domain model itself.
Lastly, it is important to remember that condition correlation is ultimately correlating EVENTS, not alarms.
This is often a misconception that can lead to correlations that don’t work. That said, when events are
correlated, any alarms associated with those events also participate in the correlation. So the end result
is that alarms are suppressed and/or designated as “root cause”.
4.1 Caused By Rule Example
As stated above, the “Caused By” rule relationship can be used to correlate multiple conditions (events!)
and designate one of those as the root cause. An example of this is correlating a device outage with OSPF
neighbor loss events from its adjacent network neighbors. Without condition correlation, here is what
happens:
1. SPECTRUM generates a critical alarm stating the downed device is no longer responding to
polls.
2. Each of the neighboring devices generates an alarm stating that OSPF neighbor state of that
device is now DOWN.
3. The end result is that five alarms are sent to the alarm console.
To effectively correlate these alarms (events), we need to use Condition Correlation Editor to define, and
then correlate the conditions that take place during this scenario. In this example, we have two different
conditions:
1. Device Contact Lost
2. OSPF Neighbor Loss
To open Condition Correlation Editor, select “Tools->Utilities->Condition Correlation Editor…” from the
OneClick console.
The Condition Correlation Editor UI layout is constructed very simply in terms of the basic condition
correlation components as discussed above (Conditions, Rules, Policies, Domains):
Notice, there are a number of predefined Conditions (Author will be CA). The first step is to identify or
define the Conditions that will be needed for the correlation. To do this, it’s best to get the SPECTRUM
event codes that are involved in the scenario. In our example, we can simply look for what events
generate the alarms in question. It’s easy to do this using the Events tab in OneClick. You can quickly
identify the alarm events using the severity column, and then get the Event IDs from the Event Type
column.
For this example, the OSPF neighbor loss alarm is generated from event 0x220031 and the device contact
loss alarm is from 0x10d35. As mentioned previously, conditions are defined with both set and clear
events. The Events tab can also be used to obtain the clear events (so can the Event Configuration
Editor!). The clear events for the OSPF neighbor loss and device contact loss alarms are 0x220024 and
0x10d30 respectively. So far, we have:
1. Device Contact Lost – SET EVENT = 0x10d35, CLEAR EVENT = 0x10d30
2. OSPF Neighbor Loss – SET EVENT = 0x220031, CLEAR EVENT = 0x220024
The next step would be to identify any potential condition parameters. In this example, it would help to
identify the actual neighbor device in the OSPF event, as a single device might report that multiple OSPF
neighbors have been lost. Looking at the OSPF event, the neighbor IP is displayed as a trap varbind.
When adding parameters to conditions, the event variable ID is needed to reference event variables,
and the attribute ID is needed to reference model attributes. In this case, the event variable ID
can be obtained from Event Configuration Editor by looking at the event message.
Now the relevant information looks like:
1. Device Contact Lost – SET EVENT = 0x10d35, CLEAR EVENT = 0x10d30
2. OSPF Neighbor Loss – SET EVENT = 0x220031, Neighbor IP address = Varbind 2, CLEAR EVENT
= 0x220024
In CCE, on the Conditions tab, select the “Create…” button. From the Create Correlation Condition
window, enter a name and the set and clear event codes.
Now select the “Create…” button in the Parameters section. Enter the parameter name (neighbor IP),
select Var Bind from the Parameter pull down and enter “2” for the value.
Finally select “Create” to finish the Condition. This creates a new condition for OSPF Neighbor Loss with a
parameter for the actual neighbor IP that gets passed to the event (varbind 2 – which was obtained from
the set event configuration).
The device contact lost event is actually already defined in an out of box Condition. You can see this by
entering the event code in the local filter in the Conditions tab (a good practice to check to see if the
Condition you are looking for is already defined).
Make sure that the network address is defined as a parameter for this condition. The process is similar to
what was done for the varbind. Select the “Edit…” button with the condition highlighted. Add a new
parameter for Network Address with a Parameter Type of Model Attribute and ID of 0x12d7f (this is just
the attribute ID).
Now that the conditions have been created/updated, the rule can be constructed. In this example, the
OSPF condition would be a symptom of the contact lost condition. From the Rules tab of CCE, select
“Create…”. In the “Symptom Condition(s):” list select the OSPF condition defined above (leave type as
“Exists”). Select the “Caused By” relationship from the pull down menu. For the “Root Cause Condition”,
select the ContactLost_Red condition (as determined above).
To leverage the Parameters that were defined above, select the “Show Advanced” button. This
component allows pre-defined parameters to be related to constants or to other defined parameters. In
this example, the neighbor IP address as shown in the OSPF neighbor loss event needs to match the IP of
the device that is no longer responding. Using the operators at the bottom of the Rule Criteria section,
choose the appropriate Parameter from each condition. For the OSPF scenario, the neighbor IP parameter
of the OSPF neighbor loss condition must match the IP address parameter of the contact lost condition.
Select the Insert Criterion button once that is completed.
Select “OK” to create the Rule.
Once you have created a rule, the next step is to associate the rule with a Policy. Policies are simply a list of
rules. They allow multiple rules to easily be associated to a correlation domain. To create a new policy,
from the Policies tab, select “Create…”. In the Create Correlation Policy window, provide a Policy Name,
select the OSPF rule from the Available Rules list and move it to the Policy Rules list, then select Create.
With the OSPF Policy now in place, the last step is to associate the policy with a Correlation Domain.
Correlation Domains can be created in a couple of different ways. The first is through CCE. From the
Domains tab, select “Create…”. In the Create Correlation Domain window, provide a name, and select the
appropriate policy from the Available Policies list (in this case the OSPF policy) and move it to the Domain
Policies list.
The next step is to define which models will participate in the correlation domain. This is done by
selecting the Resources tab in the Create Correlation Domain window.
From here, use the “Add…” button to bring up a Locate Resources window. Use the available searches to
find the desired models, highlight them in the results list, then select the “Add Selected to Correlation
Domain” button.
Once the resources have been selected, select the “Create…” button.
It should be noted that you can use Global Collections as resources for correlation domains (there is a pre-
defined search in the Locate Resources window). When a Global Collection is used, all of the models that
belong to the collection participate in the correlation domain. This is a great way to ensure that correlation
domains are updated automatically when new devices or models are added to the environment. Having a
dynamic global collection as a correlation domain resource will ensure that any new models that are added
to the collection automatically begin participating in the correlation domain.
Another way of easily adding models to correlation domains is by using the “Add To” option in OneClick.
This can be done by right-clicking a model or group of models and selecting
“Utilities->Add To->Correlation Domain…”.
This allows you to pick an existing correlation domain or create a new one by entering a new name.
We have successfully created a new SPECTRUM correlation by doing the following:
1. Defined and updated Conditions that are comprised of the events involved in the fault scenario
and any parameters (event variables, model attributes) that can be used for deeper correlation.
2. Created a Rule that defines how the conditions are related to each other.
3. Created a Policy that is associated to the rule.
4. Defined a Domain that contains the devices/models that participate in the correlation and is
associated to the policy.
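The relationship between these four pieces can be summarized in a small data model. The Python sketch below is a hypothetical illustration of how Rules, Policies and Domains reference each other; it does not reflect how SPECTRUM stores them internally, and every class, field and example name other than ContactLost_Red is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    name: str
    symptoms: list       # symptom condition names
    relationship: str    # "Caused By" or "Implied Cause"
    root_cause: str      # root cause condition name


@dataclass
class Policy:
    name: str
    rules: list = field(default_factory=list)


@dataclass
class Domain:
    name: str
    policies: list = field(default_factory=list)
    resources: set = field(default_factory=set)  # participating model names

    def rules_for(self, model):
        """Rules that apply to a model; empty if the model is not a resource."""
        if model not in self.resources:
            return []
        return [rule for policy in self.policies for rule in policy.rules]


# Hypothetical names for the OSPF example.
rule = Rule("OSPF rule", ["OspfNeighborLoss"], "Caused By", "ContactLost_Red")
policy = Policy("OSPF policy", [rule])
domain = Domain("OSPF domain", [policy], {"router-a", "router-b"})

print([r.name for r in domain.rules_for("router-a")])
print(domain.rules_for("router-z"))
```

The second lookup returns nothing because the model is not a resource of the domain, which is the same situation described later under common mistakes: a rule never applies to a model that is outside the associated correlation domain.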
With the new correlation in place, when the device failure occurs, the contact lost and OSPF alarms are
correlated to a single root cause which is the only alarm presented to the alarm console. Also, the
“symptomatic” condition alerts are displayed in the Impact tab of the root cause alarm as Symptoms.
4.2 Implied Cause Example
The next example will show how the Implied Cause correlation rule can be used. The Implied Cause rule
relationship is used to correlate a number of conditions (again, events), and “imply” a new root cause
condition that doesn’t currently exist. In the previous example, the Caused By rule
relationship took a number of conditions and determined that one of them was the root cause and the
others were symptoms. In this example, we will take a number of existing conditions and correlate those to
a NEW condition. We can then determine where we want the new condition to be asserted (a model or
the correlation domain itself).
For this scenario a configuration change gets made to an apache configuration file that impacts a hosted
web site’s accessibility. When this happens, from a monitoring standpoint we receive a number of
different alarms in SPECTRUM:
1. A Wily Introscope alarm is generated stating that a 403 error is returned from the site.
2. An SPM test created for the site times out and an alarm is generated.
3. An error gets written to the apache log file, and an alarm is generated.
It’s been determined that these three alarms occur (at the same time) when the apache configuration file
for the site in question is incorrectly changed. What we would like to do is suppress these alarms, and
make them symptoms of a new root cause alarm stating that the apache configuration has changed.
Again, the first step in the process is determining which events are needed to define Conditions for each of
the alarms above. That will not be covered at length, as the process is described in section 4.1. In
this situation, there are no parameters that will be used for identifying the correlation (event variables
for example). However, when using the Implied Cause relationship, the new implied condition needs to be
associated to a participating model or the correlation domain itself. In our example we are going to assert
the new condition to the apache server model. To do this, we need to use the “Model” predefined
parameter in the condition that would occur on the model we want to associate the new condition with
(the apache server model).
So, I’ve defined Conditions for:
1. Wily Introscope alarm: CONDITION=“Introscope 403 Error”
2. SPM timeout: CONDITION=“SpmTestTimeOut”
3. Logfile match: CONDITION=“Minor Log Error”
For the conditions above:
1. The “Introscope 403 Error” occurs on a SPECTRUM event model (that represents the Wily
Introscope application component).
2. The “SpmTestTimeOut” occurs on the appropriate SPECTRUM SPM test model.
3. The “Minor Log Error” occurs on the SystemEDGE server model that represents the apache server.
Since the new, implied condition (and resulting alarm) needs to occur on the apache server model, the
Model parameter needs to be added to the “Minor Log Error” condition. In the Create Correlation
Parameter window (see section 4.1 for information on creating parameters), I can select Predefined as the
Parameter Type, and choose “Model”. This will automatically populate the model handle. Select
“Create” to add the parameter.
With the Implied Cause relationship, the new root cause condition also needs to be defined. This could be
any condition, but in this example a new alarm is desired. Using Event Configuration Editor a new event
has been created for the “apache configuration change”. This event results in a critical alarm, where a
new probable cause has been defined.
NOTE: This event was created from scratch. A new condition needs to be created using this event code.
The conditions now involved in this correlation look like:
1. Wily Introscope alarm: CONDITION=“Introscope 403 Error”
2. SPM timeout: CONDITION=“SpmTestTimeOut”
3. Logfile match: CONDITION=“Minor Log Error”
4. IMPLIED Apache Configuration change: CONDITION=“Server Config Error”
With the Conditions defined, the Rule can now be created. In the Rules tab, select “Create…”. In the
Create Correlation Rule window, supply a rule name and change the Relationship to “Implied Cause”.
Select the symptoms in the Symptom Condition(s) list (multi-select “Introscope 403 Error”, “Minor Log
Error” and “SpmTestTimeOut”). In the Root Cause Condition list, select the “Server Config Error”
condition.
The next step is to identify where to associate the root cause. In the Root Cause Target area, select the
“Condition” radio button. From the pull-down select “Minor Log Error”. In the Parameter pull down, select
“Model”. The Minor Log Error condition appears in the list because we defined the Model parameter for
that condition. Select the “Create” button to finish the rule.
To finish this correlation, a policy needs to be created with the Apache Configuration Change rule
associated to it. A correlation domain then needs to be defined and associated to the policy. Lastly, the
models need to be added to the correlation domain. In this example, those models are:
1. The SPECTRUM event model that the Wily “Introscope 403 Error” occurs on.
2. The SPECTRUM SPM test model that the “SpmTestTimeOut” occurs on.
3. The SystemEDGE server model that the “Minor Log Error” occurs on.
Once this is complete, the end result is that when the three alarms mentioned above occur (on the models
in the domain) there is now a new root cause alarm generated on the SystemEDGE host model that
represents the apache server. The symptomatic alarms are again suppressed and displayed as symptoms
of the apache configuration change alarm in the Impact tab.
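The behavior of this Implied Cause rule can be sketched in a few lines of Python. This is a conceptual illustration only, not SPECTRUM code. It assumes that all three symptom conditions must be active at the same time (each symptom uses the “Exists” type), it tracks only one active instance per condition, and the model names are hypothetical.

```python
def implied_cause(active, symptoms, root_cause, target_condition):
    """Evaluate a simplified Implied Cause rule.

    active: dict mapping an active condition name to the model it occurred on.
    Returns (root cause condition, target model, suppressed symptoms),
    or None when at least one symptom condition is not active.
    """
    if not all(name in active for name in symptoms):
        return None
    # The root cause is asserted on the model carried by the "Model"
    # parameter of the chosen target condition.
    target_model = active[target_condition]
    return root_cause, target_model, sorted(symptoms)


SYMPTOMS = ["Introscope 403 Error", "SpmTestTimeOut", "Minor Log Error"]

# Hypothetical model names for the three participating models.
active = {
    "Introscope 403 Error": "wily-event-model",
    "SpmTestTimeOut": "spm-test-model",
    "Minor Log Error": "apache-sysedge-host",
}

result = implied_cause(active, SYMPTOMS, "Server Config Error", "Minor Log Error")
print(result)
```

Because “Minor Log Error” is the Root Cause Target condition, the new “Server Config Error” condition lands on the model where that condition occurred, which in this sketch is the SystemEDGE host.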
4.3 Correlation Domains as Root Cause Targets Example
Another interesting and very useful example of the Implied Cause relationship correlation rule is using the
correlation domain as the root cause target. In the previous example, a correlation was built using the
implied cause relationship where the new, implied root cause was associated to a model participating in
the correlation. There may be circumstances where multiple events are correlated and the new implied
event/alarm needs to be associated to something arbitrary or something that is not represented in
SPECTRUM. This is a case where specifying the correlation domain as the root cause target can be
extremely effective.
For the example scenario, there are 100 UPS devices that exist in a remote site office (Boston). The
building does not have a backup generator. If the building power fails, all 100 UPS devices switch to
battery power. When this happens an alert is sent to SPECTRUM and an alarm is generated on ALL 100
UPS devices. Let’s assume that the event that is generated for this condition is the one shown below.
When this occurs, a Major alarm is generated. There is also an event that clears the alarm, so that can be
used as the “clear event” for the condition definition.
In CCE, the Condition has been created. In this case there are no parameters.
For the rule relationship, Implied Cause needs to be used. As a result, a root cause condition is needed.
Here a new event/alarm has been created for “Building Power Failure” (using Event Configuration Editor).
Those events are used to define the Building Power Failure condition.
Again, to create the correlation rule, from the Rules tab, select the “Create…” button. For this particular
correlation, the “Counts” condition type needs to be used on the UPS on Battery condition.
By using the “Counts” type, the correlation rule looks for the number of concurrent instances of the
specific condition that exist on the assigned correlation domain. The advanced rule criteria can be used
to specify what the count value is. In this example, since there are 100 UPS devices in the building, a count
greater than 75 will be used as the threshold. This way, if some of the UPS devices are offline or no longer
functioning properly, the correlation will not be missed. In the Symptom Condition(s) list, the “UPS on
Battery” is selected with the Type set to “Counts”. The relationship is “Implied Cause”, and the Root
Cause Condition is “Building Power Failure”. The Counts value must be specified by selecting the “Show
Advanced” button. In the rule criteria window, select the “UPS on Battery” condition for the Left Operand,
with the “Condition Count” Parameter (this is available once the Counts type is selected in the Symptom
Conditions list). Select the GREATER THAN operator, then select the “By Value” checkbox in the
Right Operand section. The integer value can then be entered (75 in the example here).
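The Counts criterion amounts to a threshold comparison on the number of domain members that currently have the condition active. The Python sketch below illustrates that idea only and is not SPECTRUM code. It assumes GREATER THAN is a strict comparison, as the operator name suggests, and the UPS model names are hypothetical.

```python
def counts_rule_fires(active_models, domain_resources, threshold=75):
    """True when more than `threshold` domain members have the condition active."""
    # Only models that participate in the correlation domain are counted.
    count = len(set(active_models) & set(domain_resources))
    return count > threshold  # strict GREATER THAN: exactly 75 does not fire


# 100 hypothetical UPS models in the Boston domain.
ups_models = {f"ups-{n:03d}" for n in range(1, 101)}

on_battery_75 = {f"ups-{n:03d}" for n in range(1, 76)}  # 75 devices on battery
on_battery_76 = {f"ups-{n:03d}" for n in range(1, 77)}  # 76 devices on battery

print(counts_rule_fires(on_battery_75, ups_models))
print(counts_rule_fires(on_battery_76, ups_models))
```

Under that assumption, a value of 75 means that at least 76 UPS devices must be on battery before the “Building Power Failure” condition is implied, which is worth keeping in mind when choosing the value.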
The last step for defining the rule is to select the “Correlation Domain” in the “Root Cause Target” section.
To finish the correlation, a Policy needs to be created that is associated with the rule created above.
Lastly, create a new correlation domain, associate it to the Power Failure policy and add the UPS device
models as resources. It is important here to give the Domain the appropriate building name. Since the
root cause alarm will be asserted on the correlation domain model, the domain represents the building
and should be named accordingly.
4.4 Condition Correlation Troubleshooting
Hopefully after some of the examples above, Condition Correlation Editor is not as complex as it might
have seemed previously. There are some very common mistakes that result in correlations not working.
1. Make sure that when defining Conditions the event code is used, not the alarm pcause code. Often,
when events generate alarms, the codes are the same. This is NOT always the case. Always
use the Events tab to validate the correct event code is being used.
2. Verify that the models on which the correlation conditions (events) are occurring are in the
associated correlation domain. You can do this by looking at the correlation domain under
“Correlation Manager” in the OneClick Navigation panel.
You can also validate domain membership by looking at the events on the models that should be in
the domain. If they are correctly participating in a correlation domain you will see the following
events (0x10e08):
3. Verify that you have the following components associated correctly:
Conditions->Rule->Policy->Domain.
4. Another common mistake occurs when condition rule criteria are written incorrectly. If steps 1 and
2 above have been verified and any of the conditions being used have advanced rule criteria defined,
remove the rule criteria and see if the correlation works. Since rule criteria are most often used to
add more granularity in identifying correlation scenarios, removing the criteria should still allow
the correlation to work. By doing this, it can be determined whether the problem is with the
condition rule criteria or something else.
5.0 Supporting Documentation
Modeling Your IT Infrastructure Guide (5167)
Event Configuration User Guide (5188)
Condition Correlation User Guide (5175)