Application Monitoring

The Northwestern Mutual Life Insurance Company – Milwaukee, WI

Jeremy Kalsow

Why Application Monitoring

• Majority of all corporations

• Northwestern Mutual

• Total 1,000+ servers

• Team is 6 people

• Team uses 16 servers

• Average 50 applications per server

• Need a way to know status fast

What is it?

• The ability to monitor performance and availability

• Gather metrics

• Show trends

• Pretty pictures for management

Why?

• Trends predict future problems

• Solve application issues faster

• Uptime relates directly to profit for many companies

• View all applications, servers, databases and other items being monitored with a single dashboard.

Types of Monitoring

• Fault

• Performance

• Configuration

• Security

• Accounting

Fault

• Detects major errors

• Easy to implement

• Examples
  – Network loss
  – Database connectivity

• Very Important
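As a rough illustration of the two fault examples above (network loss and database connectivity), the sketch below tests whether a TCP port answers. It only shows that something is listening, not that the database is healthy. The function names and the idea of keeping checks in a dictionary are assumptions made for this example.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def fault_status(checks: dict[str, tuple[str, int]]) -> dict[str, str]:
    """Map each named check to "up" or "down"."""
    return {
        name: "up" if port_reachable(host, port) else "down"
        for name, (host, port) in checks.items()
    }
```

A caller might pass something like `{"database": ("db.example.internal", 1433)}`, where the host name and port are placeholders.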

Fault

Type of Monitoring             What to Monitor                              When to Monitor
Hardware: CPU utilization      CPU load                                     Load > 99% for x minutes
Hardware: Memory utilization   Memory load                                  Load > 99% for x minutes
Storage system                 Available space                              System out of space
Applications                   Application available; application working   Working or error
Application logs               Error log monitoring                         If an error occurred
Databases: Database online     Database is online                           Database is up/down
Network: Latency               Latency                                      Latency > acceptable range
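The "Load > 99% for x minutes" condition can be sketched as a check over recent samples. This is a minimal sketch that assumes one utilization sample per minute, oldest first. The function name and the default window of 5 minutes are assumptions, since the slides leave x open.

```python
def sustained_breach(samples, threshold=99.0, window=5):
    """Return True if the last `window` samples all exceed `threshold`.

    samples: utilization percentages, one per minute, oldest first.
    """
    if len(samples) < window:
        return False  # Not enough history yet to call it sustained.
    return all(value > threshold for value in samples[-window:])
```

Requiring every sample in the window to exceed the threshold avoids alerting on a single short spike.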

Performance

• Slow Performance

• Service Level Agreements

• Metrics

• Old and New Metrics

• Visual Display

Performance

Source: Polozoff (2003), http://www.ibm.com/developerworks/websphere/library/techarticles/0304_polozoff/polozoff.html

Configuration

• Configuration variables

• Connectivity

• Speed

• Performance

• Proactive

• Servers and Applications

Configuration

• Why would the configuration change?

• Hardware

• Storage

• Service packs

• Hot fixes

• Windows Updates
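One possible way to notice that a configuration changed (for example after a service pack or hot fix) is to compare a stored snapshot with the current one. The sketch below assumes the configuration can be read into a flat dictionary. The keys used in the usage note are hypothetical.

```python
import hashlib
import json

_MISSING = object()  # Marks a key that is absent from one snapshot.


def fingerprint(config: dict) -> str:
    """Return a stable SHA-256 digest of a configuration snapshot."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_keys(old: dict, new: dict) -> list:
    """List keys that were added, removed or given a different value."""
    keys = set(old) | set(new)
    return sorted(k for k in keys if old.get(k, _MISSING) != new.get(k, _MISSING))
```

Comparing fingerprints is a cheap first test; `changed_keys` is only needed when they differ, for example between `{"service_pack": "SP1"}` and `{"service_pack": "SP2"}`.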

Security

• Attempts to access the system

• Open ports

• Inventories

• Firewall

• Packets

• System events

• Blocked Exploits
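The open-ports and inventories items can be combined into a simple audit: compare the ports observed on a server with the ports the inventory says should be open. How the observed set is collected is left out here. The function name and the port numbers in the usage note are assumptions for illustration.

```python
def audit_open_ports(observed: set, allowed: set) -> dict:
    """Compare observed open ports with the approved inventory."""
    return {
        # Open but not approved: a possible security finding.
        "unexpected": sorted(observed - allowed),
        # Approved but not open: a service may be down.
        "missing": sorted(allowed - observed),
    }
```

For example, `audit_open_ports({22, 443, 8081}, {22, 443, 1433})` reports port 8081 as unexpected and port 1433 as missing.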

Accounting

• Monitors Usage

• Generally used for fees

• Profit/Loss

• Examples
  – Electric Company
  – Northwestern Mutual
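Usage-based fees work like an electric bill: measure consumption, then multiply by a rate. The sketch below assumes usage is recorded as (department, CPU hours) pairs. The department names and the rate are made up and do not describe how any real company bills.

```python
from collections import defaultdict


def usage_charges(records, rate_per_cpu_hour):
    """Total the CPU hours per department and convert them to a fee."""
    totals = defaultdict(float)
    for department, cpu_hours in records:
        totals[department] += cpu_hours
    return {
        department: round(hours * rate_per_cpu_hour, 2)
        for department, hours in totals.items()
    }
```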

Types of Monitoring Recap

• Fault

• Performance

• Configuration

• Security

• Accounting


• Historical data

• Baseline test

• Current test

• Performance disagreements
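A baseline test and a current test can settle a performance disagreement with numbers. The sketch below compares median response times. The 10% tolerance and the function names are assumptions, and a real comparison might use percentiles instead of the median.

```python
from statistics import median


def regression_ratio(baseline_ms, current_ms):
    """Ratio of current to baseline median response time (1.0 = unchanged)."""
    return median(current_ms) / median(baseline_ms)


def is_regression(baseline_ms, current_ms, tolerance=0.10):
    """True if the current median is slower than baseline by more than tolerance."""
    return regression_ratio(baseline_ms, current_ms) > 1.0 + tolerance
```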


• Allows for trends to be seen

• Modifications can be made

• Trends over multiple releases
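To show how a trend can point at a future problem, the sketch below fits a straight line to evenly spaced samples (for example disk usage per release) and estimates when it would reach a limit. A straight-line fit is an assumption chosen for simplicity; real growth is often not linear.

```python
def linear_trend(values):
    """Least-squares slope and intercept for samples at x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(values, limit):
    """Periods after the last sample until the fitted line reaches `limit`.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    last_x = len(values) - 1
    return max(0.0, (limit - intercept) / slope - last_x)
```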


• Monitoring is important

• Not enough time is given

• Implemented after discovery of an issue

• Monitoring only in areas of known problems

• Adding monitoring requires time and money

Challenges of Application Monitoring

• Various types of systems

• Shared

• Clustered

• Virtualized

• Production logging

Shared Systems

• 1 server / Multiple applications

• System resources are shared

• Tracking individual usage is difficult

• Many applications may be impacted

• Server without access (production)

Clustered Systems

• Applications on more than one server

• Avoid single point of failure

• May be hard to target the issue

Production Logging

• Generally Limited

• Most errors repeated in test

• Application downtime

• Use of company resources

Implement Application Monitoring

• Plan Early

• Monitor Proactively

• Create a Recovery Plan

• Create and use SLAs

Plan Early

• Planning stage

• Add monitoring during development

• Late additions cover known issues

Monitor Proactively

• Harder to implement

• Issues are dealt with before end user knows

Monitor Proactively

• Tools based approach

• Easy and relatively fast setup

• No code

• Multiple applications

Monitor Proactively

• Logging is directly in the code

• Less efficient

• More specific

• Developers have less time
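Logging placed directly in the code might look like the decorator below, which records how long a call took and logs any exception before re-raising it. The logger name `app.monitoring` and the decorator name `monitored` are assumptions for this sketch.

```python
import functools
import logging
import time

logger = logging.getLogger("app.monitoring")  # Hypothetical logger name.


def monitored(func):
    """Log the duration of each call and any exception it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            logger.exception("%s failed", func.__name__)
            raise  # Monitoring must not hide the error from the caller.
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("%s completed in %.1f ms", func.__name__, elapsed_ms)
        return result
    return wrapper
```

Applying `@monitored` to a function gives very specific data, but every function to be watched has to be touched by a developer, which is the cost listed above.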

Create a Recovery Plan

• Fast resolution

• Knowledge management

Recovery Plan Template

Service Level Agreements

• The percentage of time the services will be up (uptime)

• How many people can use the application at once without performance issues

• Performance metrics and benchmarks to be used with performance monitoring alerts

• The rules for notification announcements

• What statistics will be monitored, and when and where they will be available

• Acceptable response time
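The uptime figure in an SLA translates directly into a downtime budget. A small worked sketch, where the function names and the 30-day period are assumptions:

```python
def uptime_percent(total_minutes, downtime_minutes):
    """Percentage of the period during which the service was up."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def allowed_downtime_minutes(target_percent, total_minutes):
    """Downtime budget, in minutes, implied by an uptime target."""
    return total_minutes * (1 - target_percent / 100.0)


MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200 minutes
```

Under these assumptions, a 99.9% target over 30 days leaves a budget of roughly 43 minutes of downtime.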

Using the Statistics

• Visual display

• Alerts

• Tickets

Visual (Dashboard)

• Easily view statistics

• Comparison results

• Trend comparison

• Cross Platform

• Auto-generated management reports
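A dashboard usually boils many component states down to one headline status. The sketch below assumes three status levels ("ok", "warning", "critical") and reports the worst one; these level names are not taken from any particular tool.

```python
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}  # Assumed status levels.


def overall_status(component_status: dict) -> str:
    """Return the worst status across all components ("ok" if there are none)."""
    return max(component_status.values(), key=lambda s: SEVERITY[s], default="ok")


def summary_line(component_status: dict) -> str:
    """Count the components at each level, for a one-line report."""
    counts = {level: 0 for level in SEVERITY}
    for status in component_status.values():
        counts[status] += 1
    return ", ".join(f"{counts[level]} {level}" for level in SEVERITY)
```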

Dashboard

Alerts and Tickets

• Auto-generated alerts

• Tickets for queue system

• Vital information in each

Alerts and Tickets

• Most common: Email

• Text, popup, printout, recording and more

• Tickets: auto-generated

• Knowledge databases

• Common fixes and resolutions
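Since email is the most common alert, here is a sketch of building an auto-generated alert that carries the vital information. It only constructs the message. The sender address, field layout and function name are assumptions for illustration.

```python
from email.message import EmailMessage


def build_alert(app, server, problem, detected_at, recipient,
                sender="monitor@example.com"):
    """Build an alert email containing the details needed to act on it."""
    msg = EmailMessage()
    msg["Subject"] = f"[ALERT] {app} on {server}: {problem}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"Application: {app}\n"
        f"Server: {server}\n"
        f"Problem: {problem}\n"
        f"Detected: {detected_at}\n"
    )
    return msg
```

The message could then be handed to `smtplib.SMTP(...).send_message(msg)`, or the same fields could be posted to a ticket queue instead.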

Application Monitoring

• Maximize application uptime

• Higher end user satisfaction

• Higher Profit

References

• Polozoff, A. (2003, April 9). Proactive Application Monitoring. IBM - United States. Retrieved October 20, 2011, from http://www.ibm.com/developerworks/websphere/library/techarticles/0304_polozoff/polozoff.html

• Choice. (2009, December 20). Application Monitoring. Adminschoice - Unix Made Easy. Retrieved October 31, 2011, from http://adminschoice.com/application-monitoring

• Application Monitoring Software - uptime software. (n.d.). Server Monitoring Software - IT Systems Management, Capacity Planning, Application and Server Monitoring Tool by uptime software. Retrieved October 31, 2011, from http://www.uptimesoftware.com/application-monitoring.php

• Marko, K. (2005, December 30). Proactive Application Monitoring. Processor.com: Data Center IT Equipment at Processor, Routers, Storage, Rackmount Servers, Computer Room Cabling and Flooring. Retrieved October 29, 2011, from http://www.processor.com/editorial/article.asp?article=articles%2Fp2752%2F43p52%2F43p52.asp

• IT Service Level Agreement Templates. (n.d.). ContinuityPlanTemplates: Free Business Continuity Plan (BCP) Templates. Retrieved October 30, 2011, from http://www.continuityplantemplates.com/it-service-level-agreement-templates

Upcoming events with Dashboard

• Ability to display visualized graphs and other pertinent information

• Ability to click a failed component and have the system auto-generate a ticket

• Ability to alert others of the issue found

• Performance monitoring as well as fault monitoring
