© 2010 IBM Corporation
IBM Tivoli Netcool/OMNIbus 7.3.1 Key Performance Indicators
Presented by Christina Grimes
Agenda
– Introduction
– Object Server Key Performance Indicators
– Probe and Gateway Key Performance Indicators
– Q&A
Introduction
Several key performance indicators (KPIs) can be monitored to check that OMNIbus is performing well.
When using the key performance indicators, first establish a baseline on the system when it is under normal load and operation.
The Key Performance indicators can be used to measure performance when changes are made to the environment, by comparing the baseline to the KPI measurements after the change.
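The baseline comparison described above can be sketched in a few lines. This is only an illustration: the KPI names and all numbers are hypothetical, not measurements from a real system.

```python
def percent_change(baseline, current):
    """Return the percent change of each KPI relative to its baseline value."""
    changes = {}
    for name, base_value in baseline.items():
        if name not in current or base_value == 0:
            continue  # skip KPIs that cannot be compared
        changes[name] = (current[name] - base_value) * 100.0 / base_value
    return changes

# Hypothetical KPI names and illustrative values only.
baseline = {"profiler_total_s": 20.0, "trigger_total_s": 10.0, "status_rows": 40000}
after_change = {"profiler_total_s": 25.0, "trigger_total_s": 10.0, "status_rows": 50000}

for kpi, change in percent_change(baseline, after_change).items():
    print(f"{kpi}: {change:+.1f}%")
```

A KPI that moves noticeably after a change, while the load is otherwise comparable, is a candidate for closer inspection.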
Object Server Key Performance Indicators
– Object Server Granularity
  • Profiler Report
  • Trigger Stats
– Number of rows in alerts.status, alerts.journal and alerts.details
– Number of inserts in the alerts.status table
– Number of Connections
– CPU usage of nco_objserv
– Memory Usage of nco_objserv
– Memstore usage
Object Server Granularity
The Object Server property Granularity specifies the update interval for IDUC (Insert, Delete, Update, Control) notifications to clients.
The default Granularity period is 60 seconds. Changing the default is not recommended.
The Object Server should service all client and trigger requests within the Granularity period.
Client and trigger requests may temporarily take more time than the Granularity period.
If client and trigger requests consistently take more time than the Granularity period, Object Server performance is degraded, resulting in:
– Client disconnections
– Delays or errors in logging in to the Object Server
– Delays in events being propagated between multiple Object Servers
– The Object Server eventually becoming unresponsive
Object Server Profiler Report
Profile statistics are also logged to a profile log file, $NCHOME/omnibus/log/ObjectServername_profiler_report.log#.
The profiler report shows a breakdown of the time spent for each client connection and the total time spent by client type, for each granularity period.
You can use the profile log file to analyze how the ObjectServer time is spent during each granularity period and calculate the percentage of time used. For example, if the granularity period is set to 60 seconds and the total time spent for all the connections during a particular period was 30 seconds, you can calculate that the ObjectServer spent 50% of its available time on running SQL commands from client connections.
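The percentage calculation in this example is simple enough to write down directly. The function name below is hypothetical, and the 60 second default mirrors the default Granularity period.

```python
def utilization_percent(busy_seconds, granularity_seconds=60.0):
    """Percentage of one granularity period spent on the measured work."""
    if granularity_seconds <= 0:
        raise ValueError("granularity_seconds must be positive")
    return busy_seconds * 100.0 / granularity_seconds

# 30 seconds of client time in a 60 second granularity period.
print(utilization_percent(30.0))
```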
Example Profiler Report
Mon Oct 12 17:39:46 2009: Individual user profiles:
Mon Oct 12 17:39:46 2009: 'Administrator' (uid = 0) time on adminhost: 0.000000s
Mon Oct 12 17:39:46 2009: 'isql' (uid = 0) time on omnihost1.ibm.com: 3.770000s
Mon Oct 12 17:39:46 2009: 'PROBE' (uid = 0) time on probehost.ibm.com: 5.010000s
Mon Oct 12 17:39:46 2009: 'e@c0B4D@c0142:11.0' (uid = 0) time on omnihost1.ibm.com: 10.010000s
Mon Oct 12 17:39:46 2009: 'c@xxxxx@xxxxx:11.0' (uid = 45) time on omnihost1.ibm.com: 0.000000s
Mon Oct 12 17:39:46 2009: 'e@c0B4D@c0142:11.0' (uid = 45) time on omnihost1.ibm.com: 9.870000s
Mon Oct 12 17:39:46 2009: 'c@xxxxx@xxxxx:11.0' (uid = 55) time on omnihost1.ibm.com: 0.000000s
Mon Oct 12 17:39:46 2009: 'e@c0B4D@c0142:11.0' (uid = 55) time on omnihost1.ibm.com: 6.020000s
Mon Oct 12 17:39:46 2009: 'GATEWAY' (uid = 0) time on omnihost1.ibm.com: 0.270000s
Mon Oct 12 17:39:46 2009: 'GATEWAY' (uid = 0) time on omnihost1.ibm.com: 0.000000s
Mon Oct 12 17:39:46 2009: 'PROBE' (uid = 0) time on omnihost1.ibm.com: 3.010000s
Mon Oct 12 17:39:46 2009: Grouped user profiles:
Mon Oct 12 17:39:46 2009: Execution time for all connections whose application name is 'PROBE': 8.020000s
Mon Oct 12 17:39:46 2009: Execution time for all connections whose application name is 'GATEWAY': 0.270000s
Mon Oct 12 17:39:46 2009: Execution time for all connections whose application name is 'c@xxxxx@xxxxx:11.0': 0.000000s
Mon Oct 12 17:39:46 2009: Execution time for all connections whose application name is 'e@c0B4D@c0142:11.0': 25.93000s
Mon Oct 12 17:39:46 2009: Execution time for all connections whose application name is 'isql': 3.77000s
Mon Oct 12 17:39:46 2009: Execution time for all connections whose application name is 'Administrator': 0.000000s
Mon Oct 12 17:39:46 2009: Total time in the report period (59.275782s): 29.980000s
Analyzing the Profiler Report
Goal: Profiler Report “Total time in the report period” + Trigger Statistics “Time for all triggers” < Object Server Granularity period of 60 seconds.
• The current Total time in the report period of 29.980000s indicates that ~50% of the Object Server Granularity period is spent servicing client connections.
• Identify the clients with the highest times in the report period:
Mon Oct 12 17:39:46 2009: 'e@c0B4D@c0142:11.0' (uid = 0) time on omnihost1.ibm.com: 10.010000s
Mon Oct 12 17:39:46 2009: 'e@c0B4D@c0142:11.0' (uid = 45) time on omnihost1.ibm.com: 9.870000s
• High probe times can indicate an event flood or a large number of status or details inserts.
• High event list or WebGUI client times can indicate inefficient filters.
• High Object Server gateway client times can indicate the transfer of many events or a resync.
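Finding the highest clients by hand becomes tedious on a busy system. A minimal sketch of ranking them with a script follows. It assumes the per-connection line layout shown in the example report, and the function and pattern names are hypothetical.

```python
import re

# Matches per-connection lines such as:
#   Mon Oct 12 17:39:46 2009: 'isql' (uid = 0) time on omnihost1.ibm.com: 3.770000s
CLIENT_LINE = re.compile(
    r"'(?P<app>[^']+)' \(uid = (?P<uid>\d+)\) time on (?P<host>\S+): (?P<secs>[\d.]+)s"
)

def rank_clients(report_text):
    """Return (app, uid, host, seconds) tuples, highest time first."""
    rows = [
        (m["app"], int(m["uid"]), m["host"], float(m["secs"]))
        for m in CLIENT_LINE.finditer(report_text)
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Lines taken from the example profiler report.
sample = """\
Mon Oct 12 17:39:46 2009: 'isql' (uid = 0) time on omnihost1.ibm.com: 3.770000s
Mon Oct 12 17:39:46 2009: 'PROBE' (uid = 0) time on probehost.ibm.com: 5.010000s
Mon Oct 12 17:39:46 2009: 'e@c0B4D@c0142:11.0' (uid = 0) time on omnihost1.ibm.com: 10.010000s
Mon Oct 12 17:39:46 2009: 'e@c0B4D@c0142:11.0' (uid = 45) time on omnihost1.ibm.com: 9.870000s
"""

for app, uid, host, secs in rank_clients(sample)[:2]:
    print(f"{secs:6.2f}s  {app} (uid {uid}) on {host}")
```

The "Grouped user profiles" lines do not contain a uid, so the pattern skips them and only individual connections are ranked.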
Object Server Trigger Statistics
Trigger statistics are also logged to the file $NCHOME/omnibus/log/ObjectServername_trigger_stats.log#.
The trigger statistics log file shows the amount of time that each trigger has used in the last profiling period.
You can use this log file for automation debugging, and to determine which triggers are slow due to slow-running SQL queries.
Example Trigger Statistics
Mon Oct 12 18:03:56 2009: Trigger Profile Report
….
Mon Oct 12 18:03:56 2009: Trigger Group 'primary_only'
Mon Oct 12 18:03:56 2009: Trigger time for 'generic_clear': 5.879707s
Mon Oct 12 18:03:56 2009: Trigger time for 'expire': 0.008233s
Mon Oct 12 18:03:56 2009: Trigger time for 'delete_clears': 0.007219s
Mon Oct 12 18:03:56 2009: Trigger time for 'enrich_and_correlate': 23.007219s
…
Mon Oct 12 18:03:56 2009: Trigger Group 'iduc_triggers'
Mon Oct 12 18:03:56 2009: Trigger time for 'disconnect_iduc_missed': 0.000000s
Mon Oct 12 18:03:56 2009: Trigger time for 'iduc_stats_update': 0.000949s
Mon Oct 12 18:03:56 2009: Trigger time for 'iduc_messages_tblclean': 0.000089s
Mon Oct 12 18:03:56 2009: Trigger time for 'deduplicate_iduc_stats': 0.000000s
Mon Oct 12 18:03:56 2009: Trigger time for 'iduc_stats_insert': 0.000000s
Mon Oct 12 18:03:56 2009: Trigger Group 'automatic_backup_system'
Mon Oct 12 18:03:56 2009: Trigger time for 'backup_succeeded': 0.000000s
Mon Oct 12 18:03:56 2009: Trigger time for 'backup_failed': 0.000000s
Mon Oct 12 18:03:56 2009: Trigger time for 'backup_state_integrity': 0.000000s
Mon Oct 12 18:03:56 2009: Trigger Group 'gateway_triggers'
Mon Oct 12 18:03:56 2009: Trigger time for 'resync_finished': 0.000000s
Mon Oct 12 18:03:56 2009: Time for all triggers in report period (60s): 29.789663s
Analyzing Trigger Statistics
Goal: Profiler Report “Total time in the report period” + Trigger Statistics “Time for all triggers” < Granularity period of 60 seconds.
• Current Profiler report Total time in the report period of 29.980000s + Trigger Statistics Time for all triggers 29.789663s indicates Object Server is at 100% utilization of 60 second Granularity period.
• Identify the highest triggers:
Mon Oct 12 18:03:56 2009: Trigger time for 'enrich_and_correlate': 23.007219s
Mon Oct 12 18:03:56 2009: Trigger time for 'generic_clear': 5.879707s
High generic_clear or deduplication trigger times can indicate high event throughput or a high number of resident events.
Ensure best practices are used in creating custom triggers:
http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/topic/com.ibm.netcool_OMNIbus.doc_7.3.1/omnibus/wip/admin/reference/omn_adm_per_bestpracticestriggers.html
Ensure trigger execution time is kept to a minimum, because no other writes can be performed in the Object Server while a trigger is executing.
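The goal check and the search for the highest triggers can be scripted in the same way. The following is a sketch that assumes the "Trigger time for" line layout shown in the example log, with hypothetical function names.

```python
import re

TRIGGER_LINE = re.compile(r"Trigger time for '(?P<name>[^']+)': (?P<secs>[\d.]+)s")

def slowest_triggers(stats_text, top_n=2):
    """Return the top_n (trigger name, seconds) pairs, slowest first."""
    times = [(m["name"], float(m["secs"])) for m in TRIGGER_LINE.finditer(stats_text)]
    return sorted(times, key=lambda item: item[1], reverse=True)[:top_n]

def combined_utilization(profiler_total_s, trigger_total_s, granularity_s=60.0):
    """Percentage of the granularity period used by clients plus triggers."""
    return (profiler_total_s + trigger_total_s) * 100.0 / granularity_s

# Lines taken from the example trigger statistics log.
sample = """\
Mon Oct 12 18:03:56 2009: Trigger time for 'generic_clear': 5.879707s
Mon Oct 12 18:03:56 2009: Trigger time for 'expire': 0.008233s
Mon Oct 12 18:03:56 2009: Trigger time for 'delete_clears': 0.007219s
Mon Oct 12 18:03:56 2009: Trigger time for 'enrich_and_correlate': 23.007219s
"""

for name, secs in slowest_triggers(sample):
    print(f"{secs:9.6f}s  {name}")

# Totals from the two example reports.
print(f"{combined_utilization(29.98, 29.789663):.1f}% of the granularity period")
```

A combined value that stays close to or above 100% over many periods matches the degraded state described earlier in the deck.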
Number of Rows in alerts.status, alerts.journal and alerts.details
• The alerts.details table should only be used when alerts.status cannot hold enough information for a specific alarm, or during rules file development.
• On production systems, it is suggested to keep the alerts.details table below 5,000 rows. A large number of rows in alerts.details can degrade ObjectServer performance.
• Details statements in probe rules files generate records in the alerts.details table. details($*) records each token as one row in alerts.details, so with details($*) enabled in a rules file, each event in the alerts.status table might have 10~50 rows in the alerts.details table.
• Details can be disabled by commenting out any details($*) statements in all probe rules files, restarting all probes, and clearing the current records in the details table (with "delete from alerts.details").
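To see how quickly details($*) can consume the suggested row budget, the figures above can be combined in a short calculation. The function name is hypothetical; the 5,000 row guideline and the 10 to 50 rows per event come from the text.

```python
DETAILS_ROW_GUIDELINE = 5000     # suggested ceiling for alerts.details
ROWS_PER_EVENT_RANGE = (10, 50)  # rows per event with details($*) enabled

def events_within_guideline(rows_per_event, limit=DETAILS_ROW_GUIDELINE):
    """Number of events whose details fit under the row guideline."""
    return limit // rows_per_event

for rows_per_event in ROWS_PER_EVENT_RANGE:
    events = events_within_guideline(rows_per_event)
    print(f"{rows_per_event} detail rows per event: guideline reached at about {events} events")
```

In other words, a few hundred events with full details are enough to reach the guideline.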
Manage volume of alerts.details
Throughput of events on the alerts.status table
Deduplication triggers prevent the number of rows from increasing, but still take processing time in the Object Server.
Object Server Connections
The maximum permitted number of connections is determined by the ObjectServer Connections property, with a default of 30.
The Connections property of the Object Server should always be increased in a production environment.
Only a finite number of connections can be made to the ObjectServer. When the maximum number of connections is reached, new connections are refused.
A refused connection might result in the temporary loss of access to data or loss of input to probes or gateways.
The maximum value of the Connections property is 1024.
An error is logged in the Object Server log when the number of connections is exceeded.
Connections can be monitored through the stats_triggers trigger group, which writes to the master.stats table.
Monitoring Row Count, Volume and Connections
Enable the stats_triggers trigger group
In the default ObjectServer configuration, the stats_triggers group uses triggers to gather several statistics and metrics. You can enable the trigger group in a production environment.
The triggers in the stats_triggers trigger group record the:
– current row counts of alerts.status, alerts.details and alerts.journal
– number of inserts to alerts.status, alerts.details and alerts.journal
– number of deduplications and new inserts to alerts.status
– number of client connections
Stats triggers
The data gathered by this trigger group and automation is written periodically to the master.stats table.
The default write interval is 300 seconds; this value is configurable in the statistics_gather trigger.
5 days of data is retained by default.
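Two quick calculations follow from these defaults. The sketch assumes one row is written to master.stats per write interval, and it treats the client count as an input that has already been read from master.stats. The function names and the sample client count are hypothetical.

```python
def retained_samples(retention_days=5, write_interval_s=300):
    """Approximate number of samples kept, assuming one row per write interval."""
    return retention_days * 24 * 60 * 60 // write_interval_s

def connections_in_use_percent(num_clients, connections_limit=30):
    """Share of the Connections property currently in use."""
    return num_clients * 100.0 / connections_limit

print(retained_samples())              # samples kept with the default settings
print(connections_in_use_percent(24))  # illustrative client count against the default of 30
```

Tracking the second value over time shows how close the system is to refusing new connections.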
CPU usage
Monitor the CPU usage of the nco_objserv process.
If the Object Server is under heavy load, this will be reflected in CPU usage.
Profiler report and trigger statistics logs will show the source of the heavy load.
Sizing considerations: https://www.ibm.com/developerworks/mydeveloperworks/wikis/home?lang=en#/wiki/Tivoli%20Netcool%20OMNIbus/page/OMNIbus%20Sizing%20Guide
Memory Usage
Monitor the memory usage of the nco_objserv process.
The memory usage of the process increases proportionally to increases in the number of rows in the alerts.status table, alerts.details table, and the alerts.journal table (or any additional tables you have defined), to increases in the number of connections, and increased usage by clients.
The memory usage should remain stable over time, and any increases should correspond to increases in the numbers of table rows or additional clients.
Sizing considerations
https://www.ibm.com/developerworks/mydeveloperworks/wikis/home?lang=en#/wiki/Tivoli%20Netcool%20OMNIbus/page/OMNIbus%20Sizing%20Guide
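One way to sample CPU and memory usage of nco_objserv is to parse the output of ps. The sketch below is an assumption about tooling, not part of OMNIbus: it expects a ps that accepts `-eo pid,pcpu,rss,comm` and reports RSS in KiB (as on Linux), and the sample output is made up for illustration.

```python
import subprocess

def read_ps():
    """Run ps and return its text output (assumes a ps that accepts -eo)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,rss,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def parse_ps(ps_output, process_name="nco_objserv"):
    """Extract (pid, cpu_percent, rss_kib) for processes matching process_name."""
    samples = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) == 4 and fields[3].strip().endswith(process_name):
            samples.append((int(fields[0]), float(fields[1]), int(fields[2])))
    return samples

# Illustrative ps output, not taken from a real system.
sample = """\
  PID %CPU    RSS COMMAND
 4121 12.5 812340 nco_objserv
 4388  0.3  20480 nco_p_syslog
"""

for pid, cpu, rss_kib in parse_ps(sample):
    print(f"pid {pid}: {cpu:.1f}% CPU, {rss_kib / 1024:.0f} MiB resident")
```

On a live host, `parse_ps(read_ps())` returns the current sample. Recording it at a fixed interval gives the trend to compare against row counts and client counts.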
Object Server Memstore Usage
To monitor the memstore, inspect the content of the catalog.memstores table.
For each row, compare the value of the UsedBytes column with the values of the SoftLimit column and the HardLimit column.
Memstores are containers maintained by the ObjectServer that hold ObjectServer data and tables in memory. Memstores have a finite size and, when full, do not permit any further data to be inserted. Consequently, you must ensure that the memstores do not become full.
Once a memstore soft limit is exceeded, an error is logged to the Object Server log file. Once the hard limit is exceeded, the Object Server shuts down.
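The comparison of UsedBytes against SoftLimit and HardLimit can be sketched as follows. The rows are assumed to have been fetched from catalog.memstores already. The StoreName key, the store names, and all byte values are illustrative assumptions.

```python
def memstore_status(used_bytes, soft_limit, hard_limit):
    """Classify one memstore row by comparing UsedBytes to its limits."""
    if used_bytes >= hard_limit:
        return "hard limit reached"
    if used_bytes > soft_limit:
        return "soft limit exceeded"
    return "ok"

# Hypothetical rows; only UsedBytes, SoftLimit and HardLimit are named in the text.
rows = [
    {"StoreName": "store_a", "UsedBytes": 300_000_000, "SoftLimit": 900_000_000, "HardLimit": 1_000_000_000},
    {"StoreName": "store_b", "UsedBytes": 950_000_000, "SoftLimit": 900_000_000, "HardLimit": 1_000_000_000},
]

for row in rows:
    status = memstore_status(row["UsedBytes"], row["SoftLimit"], row["HardLimit"])
    used_pct = row["UsedBytes"] * 100.0 / row["HardLimit"]
    print(f"{row['StoreName']}: {used_pct:.0f}% of hard limit, {status}")
```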
Summary: Monitoring Object Server KPI
Object Server Profiler report and trigger statistics logs
– Object Server Granularity
  • Profiler Report
  • Trigger Stats
Enable Stats triggers
– Number of rows in alerts.status, alerts.journal and alerts.details
– Number of inserts in the alerts.status table
– Number of Connections
System monitors
– CPU usage of nco_objserv
– Memory Usage of nco_objserv
Object Server catalog.memstores table
– Memstore usage
Probe and Gateway KPI
Probe Key Performance Indicators
– CPU Usage
– Memory Usage
– Number of events received
– Average Time spent processing rules
Gateway Key Performance Indicators
– CPU Usage
– Memory Usage
Probe Self Monitoring
Probe self monitoring can log the following information in the Probe log and/or at the Object Server:
– CPU time
– Rules File Processing Time
– Memory Used
– Number of Events Received
– Number of Events Discarded from Rules
– Number of Events Generated by Self Monitoring
http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/topic/com.ibm.netcool_OMNIbus.doc_7.3.1/omnibus/wip/probegtwy/concept/omn_prb_enablingroi.htm
Probe Self Monitoring example
Probe Log File Info MessageLevel
--- ProbeStats report at Heartbeat ---
Timestamp: 1294869063
Probe: glf
Host: hostname
PID: 29976
ProbeID: glf@hostname
NumberEventsProcessed: 8785
NumberEventsDiscarded: 0
NumberEventsGenerated: 0
DiscardPercentage: 0.000000
TotalRulesTime: 0.785492 sec
TotalCPUTime: 43.089477 sec
ProbeUpTime: -1294869003 sec
AverageEventsPerSecond: 8785
AverageRulesFileTime: 89 usec
--- ProbeStats report end ---
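The AverageRulesFileTime value can be cross-checked from two other fields in the report: TotalRulesTime divided by NumberEventsProcessed. The sketch below assumes the report is logged with one "Key: value" pair per line, and the function names are hypothetical.

```python
def parse_probe_stats(report_text):
    """Parse 'Key: value' lines of a ProbeStats report into a dict of strings."""
    stats = {}
    for line in report_text.splitlines():
        if line.startswith("---") or ":" not in line:
            continue  # skip the report delimiters and anything unparseable
        key, value = line.split(":", 1)
        stats[key.strip()] = value.strip()
    return stats

def average_rules_time_usec(stats):
    """Average rules file time per event, in microseconds."""
    total_rules_s = float(stats["TotalRulesTime"].split()[0])
    events = int(stats["NumberEventsProcessed"])
    return total_rules_s / events * 1_000_000 if events else 0.0

# Fields taken from the example ProbeStats report.
sample = """\
--- ProbeStats report at Heartbeat ---
NumberEventsProcessed: 8785
TotalRulesTime: 0.785492 sec
AverageRulesFileTime: 89 usec
--- ProbeStats report end ---
"""

stats = parse_probe_stats(sample)
print(f"{average_rules_time_usec(stats):.1f} usec per event")
```

For the example report the computed value truncates to the logged 89 usec, so the logged average is consistent with the two totals.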
Summary: Monitoring Probe and Gateway KPI
Probe Key Performance Indicators
– System Monitors and/or Probe Self Monitoring
  • CPU Usage
  • Memory Usage
– Probe Self Monitoring
  • Number of events received
  • Average Time spent processing rules
Gateway Key Performance Indicators
– System Monitors
  • CPU Usage
  • Memory Usage
Summary
Object Server Key Performance Indicators
– Object Server Granularity
  • Profiler Report
  • Trigger Stats
– Number of rows in alerts.status, alerts.journal and alerts.details
– Number of inserts in the alerts.status table
– Number of Connections
– CPU usage of nco_objserv
– Memory Usage of nco_objserv
– Memstore usage
Probe Key Performance Indicators
– Number of events received
– CPU Usage
– Memory Usage
– Average Time spent processing rules
Gateway Key Performance Indicators
– CPU Usage
– Memory Usage
OMNIbus Best Practice References
OMNIbus 7.3.1 Best Practices Guide:
https://www.ibm.com/developerworks/mydeveloperworks/groups/service/html/communityview?communityUuid=cdd16df5-7bb8-4ef1-bcb9-cefb1dd40581#fullpageWidgetId=W05de62601548_4e85_8940_81bb58657a85&file=b7d8ad85-7152-47e8-926a-802246204e8c
Sizing Guide:
https://www.ibm.com/developerworks/mydeveloperworks/wikis/home?lang=en#/wiki/Tivoli%20Netcool%20OMNIbus/page/OMNIbus%20Sizing%20Guide
Multitier Architecture:
http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/topic/com.ibm.netcool_OMNIbus.doc_7.3.1/omnibus/wip/install/concept/omn_esf_configuringdeploymultitieredarch.html
Best Practices for Performance Tuning:
http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/topic/com.ibm.netcool_OMNIbus.doc_7.3.1/omnibus/wip/admin/reference/omn_adm_per_bestpracticespertuning.html
SQL Query Guidelines:
http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/topic/com.ibm.netcool_OMNIbus.doc_7.3.1/omnibus/wip/admin/reference/omn_adm_per_sqlqueryguidelines.html
Best Practices for Creating Triggers:
http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/topic/com.ibm.netcool_OMNIbus.doc_7.3.1/omnibus/wip/admin/reference/omn_adm_per_bestpracticestriggers.html
Questions?