How to improve your system monitoring

Monitoring Improvement Assessment Process

The Same Old Problem

Corporate LANs & VPNs

ISP Connection

DNS & Internet Services

Content Mgmt System

Social Network Widgets

Site Tracking & Analytics

Banner Ads & Revenue Generators

Multimedia & CDN Content

Home Wireless & Broadband

Mobile Broadband

Is It My Data Center?
• Configuration errors
• Application design issues
• Code defects
• Insufficient infrastructure
• Oversubscription issues
• Poor routing optimization
• Low cache hit rate

Is It a Service Provider Problem?
• Non-optimized mobile content
• Bad performance under load
• Blocking content delivery
• Incorrect geo-targeted content

Is It an ISP Problem?
• Peering problems
• ISP outages

Is It My Code or a Browser Problem?
• Missing content
• Poorly performing JavaScript
• Inconsistent CSS rendering
• Browser/device incompatibility
• Page size too big
• Conflicting HTML tag support
• Too many objects
• Content not optimized for device

The Cloud

Distributed

Database

Mainframe

Network

Middleware

Storage

Anatomy of an Outage


Load Balancer

Firewall

Web Servers

Message Queue

zOS CICS

WAS

Database

WAS Database

zOS MQ

DB2

1. 5:45-ish pm: CICS ABENDs start flooding the console, but not at a level high enough to ticket

2. 6:00-ish pm: MQ flows are interrupted and begin alerting in Flow Diagnostics

3. 6:04pm: Synthetic transactions fail, and at 6:14 the Ops Center confirms the issue and creates a P0 Incident

4. 6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem

5. 10:29pm: Support teams investigate MQ, rule it out, and ultimately decide to reset CICS to resolve the issue
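The elapsed times in this outage can be worked out directly from the timeline. The sketch below is only arithmetic on the clock times listed above; the "-ish" entries are approximate, and the helper names (`pm`, `hours_minutes`) are illustrative, not part of any tool.

```python
def pm(hour: int, minute: int) -> int:
    """Minutes since midnight for an afternoon or evening clock time."""
    return (hour % 12 + 12) * 60 + minute


# Times are taken from the timeline; the "-ish" entries are approximate.
first_abends = pm(5, 45)       # CICS ABENDs start flooding the console
synthetic_failure = pm(6, 4)   # synthetic transactions fail
incident_opened = pm(6, 14)    # Ops Center creates the P0 incident
reset_decision = pm(10, 29)    # teams decide to reset CICS


def hours_minutes(total_minutes: int) -> str:
    """Format a duration in minutes as 'Xh YYm'."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


symptom_to_incident = incident_opened - first_abends
incident_to_reset = reset_decision - incident_opened

print("First symptom to P0 incident:", hours_minutes(symptom_to_incident))
print("P0 incident to reset decision:", hours_minutes(incident_to_reset))
```

Roughly half an hour passed between the first symptom and the incident, and more than four hours between the incident and the decision to reset the component that showed the first symptom.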

Gaining Perspective Requires Balance

Packet Capture

Synthetic Transactions

Client Monitoring

Client Monitoring

Synthetic Transactions

Server Probe

1. Client to the Server
2. Server to the Client
3. “3rd Party” Vantage Point
4. Synthetic Transactions

Four Perspectives of User Experience

Why Multiple Perspectives?

Know Your Customer:
• What they do?
  § Customers care about completing tasks, NOT whether the homepage is available
• Where they do it from?
  § Your customers don’t live in the cloud; test from their perspective
• When they do it?
  § Test at peak and normal traffic levels to find all the problems
• What expectations do customers have?
  § Is 5 seconds fast enough or does it have to be quicker?
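One way to test a task rather than a homepage is a synthetic transaction that runs every step of the task and compares the total time with the customer's expectation. The following is a minimal sketch, assuming each step is supplied as a plain Python callable; the function name `run_synthetic_transaction`, the step names, and the 5-second default are illustrative only.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_synthetic_transaction(
    steps: List[Step], threshold_seconds: float = 5.0
) -> Dict[str, Any]:
    """Run each step in order, timing it; stop at the first failing step."""
    timings: Dict[str, float] = {}
    started = time.perf_counter()
    for name, action in steps:
        step_started = time.perf_counter()
        try:
            action()
        except Exception as exc:  # any failure means the task did not complete
            return {"ok": False, "failed_step": name,
                    "error": str(exc), "timings": timings}
        timings[name] = time.perf_counter() - step_started
    total = time.perf_counter() - started
    # The task only counts as healthy if it finished within the expectation.
    return {"ok": total <= threshold_seconds, "failed_step": None,
            "total_seconds": total, "timings": timings}


def failing_checkout() -> None:
    raise RuntimeError("simulated timeout")


# Hypothetical task: the lambdas stand in for real login/search calls.
good = run_synthetic_transaction([("login", lambda: None),
                                  ("search", lambda: None)])
bad = run_synthetic_transaction([("login", lambda: None),
                                 ("checkout", failing_checkout)])
print(good["ok"], bad["ok"], bad["failed_step"])
```

Running the same step list from several locations and at both peak and normal traffic levels would cover the "where" and "when" questions as well.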

Picking Better Monitors

Current Approach
1. Itemize the existing monitors
2. Brainstorm potential gaps to fill
3. Deploy new monitors

Proposed Approach
1. Identify the potential risks
2. Itemize the existing monitors
3. Determine which gaps exist
4. Fill the monitoring gaps
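The risk-first approach can be pictured as a set difference: list the risks, list what each existing monitor can detect, and whatever is left over is a gap. This is a minimal sketch under that assumption; the function name, risk names, and monitor names are hypothetical examples.

```python
from typing import Dict, List, Set


def find_monitoring_gaps(risks: List[str],
                         monitors: Dict[str, Set[str]]) -> List[str]:
    """Return the risks that no existing monitor is able to detect."""
    covered: Set[str] = set().union(*monitors.values())
    return [risk for risk in risks if risk not in covered]


# Step 1: identify the potential risks (hypothetical list).
risks = ["disk full", "MQ flow interruption", "CICS abends"]

# Step 2: itemize the existing monitors and what each one detects.
monitors = {
    "os_disk_monitor": {"disk full"},
    "flow_diagnostics": {"MQ flow interruption"},
}

# Step 3: determine which gaps exist.
gaps = find_monitoring_gaps(risks, monitors)
print(gaps)
```

Starting from the risk list keeps the exercise from being limited to whatever gaps happen to come up in a brainstorm.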

What Does Good Monitoring Look Like?


Load Balancer

Load Balancer

Firewall

Switch

Web Server Farm

Database

DataPower

Mainframe

Middleware

Load Balancer

Elements of Good Monitoring

1. System Availability
2. Operating System Performance
3. Hardware Monitoring
4. Service/Daemon and Process Availability
5. Error Logs
6. Application Resource KPIs
7. End-to-End Transactions
8. Point of Failure Transactions
9. Fail-Over Success
10. “Activity Monitors” and “Reverse Hockey Stick”
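The slides do not define the last element. The sketch below assumes that an "activity monitor" watches a throughput count and that a "reverse hockey stick" is a sudden drop in that count relative to its recent baseline. Both the interpretation and the parameter names (`baseline_window`, `drop_ratio`) are assumptions made for illustration.

```python
from statistics import mean
from typing import Sequence


def activity_dropped(counts: Sequence[float],
                     baseline_window: int = 5,
                     drop_ratio: float = 0.5) -> bool:
    """True if the latest count fell below drop_ratio times the recent average."""
    if len(counts) <= baseline_window:
        return False  # not enough history to form a baseline
    baseline = mean(counts[-baseline_window - 1:-1])
    if baseline == 0:
        return False  # nothing was flowing, so there is nothing to drop from
    return counts[-1] < baseline * drop_ratio


# Hypothetical transactions-per-minute samples.
steady = [100, 102, 98, 101, 99, 95]
collapsed = [100, 102, 98, 101, 99, 20]
print(activity_dropped(steady), activity_dropped(collapsed))
```

A check like this can fire even when every server still reports itself as available, because it looks at work completed rather than component status.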

What Matters Most?

Dr. Lee Goldman

Cook County Hospital, Chicago, IL

1. Is the patient feeling unstable angina?
2. Is there fluid in the patient’s lungs?
3. Is the patient’s systolic blood pressure below 100?

The Goldman Algorithm

[Chart: Prediction of Patients Expected to Have a Heart Attack Within 72 Hours. Scale: 0 to 100. Series: Traditional Techniques, Goldman Algorithm]

By paying attention to what really matters, Dr. Goldman improved the “false negatives” by 20 percentage points and eliminated the “false positives” altogether.
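The same two measures apply to alerting: a false negative is an incident that produced no alert, and a false positive is an alert with no incident behind it. The sketch below uses the standard rate definitions; the function name and the counts in the example are hypothetical and are not taken from the study.

```python
from typing import Tuple


def alert_error_rates(true_positives: int, false_positives: int,
                      true_negatives: int, false_negatives: int
                      ) -> Tuple[float, float]:
    """Return (false_negative_rate, false_positive_rate)."""
    incidents = true_positives + false_negatives
    quiet_periods = false_positives + true_negatives
    # Share of real incidents that monitoring missed.
    fnr = false_negatives / incidents if incidents else 0.0
    # Share of healthy periods in which monitoring raised an alert anyway.
    fpr = false_positives / quiet_periods if quiet_periods else 0.0
    return fnr, fpr


# Hypothetical month: 10 real incidents (2 missed), 90 healthy periods (5 false alarms).
fnr, fpr = alert_error_rates(true_positives=8, false_positives=5,
                             true_negatives=85, false_negatives=2)
print(f"false negative rate: {fnr:.0%}, false positive rate: {fpr:.1%}")
```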

The Goldman Algorithm

[Flowchart]

Start: Patient suspected of Acute Cardiac Ischemia

Step: Perform Electrocardiogram (EKG)

Decision (Yes / No): ECG Evidence of Acute Myocardial Infarction (MI)?
• ST-Segment Elevation ≥ 1 mm in ≥ 2 Contiguous Leads (New or Unknown Age), or
• Pathologic Q Waves in ≥ 2 Contiguous Leads (New or Unknown Age)

Decision (Yes / No): ECG Evidence of Acute Ischemia?
• ST-Segment Depression ≥ 1 mm in ≥ 2 Contiguous Leads (New or Unknown Age), or
• T-Wave Inversion in ≥ 2 Contiguous Leads (New or Unknown Age), or
• Left Bundle-Branch Block (New or Unknown Age)

Decision (appears on two branches): Urgent Factors Present?
• Rales Above Both Lung Bases
• Systolic Blood Pressure < 100 mm Hg
• Unstable Ischemic Heart Disease

Branch labels: 2 or 3 Factors, 0 or 1 Factors, 2 or 3 Factors, 1 Factor, 0 Factors

Risk levels: High Risk, Moderate Risk, Low Risk, Very Low Risk

Destinations: Coronary Care Unit, Inpatient Telemetry Unit, Observation Unit

Seven Deadly Sins

Although Companies Realize the Importance of an Effective Monitoring System, Most Fall Prey to Common Mistakes That Erode the Value

Nature of Mistake: Strategic, Tactical

Life Cycle Activity: Selection, Collection, Reporting, Usage

• Ignoring the possibilities: Lack of optimal utilization of available data

• One size fits all: Lack of audience segmentation

• “Metrics Toilet”: Lack of aggregation and screening of low-level metrics, resulting in cumbersome reports

• Waiting for the perfect tool: Lack of focus on process, leading to over-reliance on technology

• An arbitrary exercise: Lack of defined criteria for target setting

• Metrics that (don’t) matter: Lack of actionable metrics

• IT’s World View: Lack of user involvement in metrics selection and refinement

Source: Infrastructure Executive Council, 2003

Finding Metrics That Matter

§ Will the metric be used in a report? If so, which one? How is it used in the report?

§ Will the metric be used in a dashboard? If so, which one? How will it be used?

§ What action(s) will be taken if an alert is generated? Who are the actors? Will a ticket be generated? If so, what severity?

§ How often is this event likely to occur? What is the impact if the event occurs? What is the likelihood it can be detected by monitoring?

§ Will the metric help identify the source of a problem? Is it a coincident / symptomatic indicator?

§ Is the metric always associated with a single problem? Could this metric become a false indicator?

§ What is the impact if this goes undetected?

§ What is the lifespan for this metric? What is the potential for changes that may reduce the efficacy of the metric?

Evaluating the Effectiveness of a Metric
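Some of these questions can be recorded per metric so that unanswered ones stand out during a review. This is a minimal sketch covering only the report, dashboard, and alert questions; the `MetricCandidate` fields, the `open_questions` helper, and the example metrics are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetricCandidate:
    name: str
    report: Optional[str] = None       # which report uses it, if any
    dashboard: Optional[str] = None    # which dashboard uses it, if any
    generates_alert: bool = False
    alert_action: Optional[str] = None  # what is done when the alert fires
    actors: List[str] = field(default_factory=list)  # who takes the action


def open_questions(metric: MetricCandidate) -> List[str]:
    """List the checklist questions this metric cannot yet answer."""
    questions: List[str] = []
    if not (metric.report or metric.dashboard or metric.generates_alert):
        questions.append("No report, dashboard, or alert uses this metric.")
    if metric.generates_alert and not metric.alert_action:
        questions.append("An alert is generated but no action is defined.")
    if metric.generates_alert and not metric.actors:
        questions.append("An alert is generated but no actors are named.")
    return questions


orphan = MetricCandidate("cpu_idle_percent")
useful = MetricCandidate("queue_depth", generates_alert=True,
                         alert_action="restart channel",
                         actors=["middleware on-call"])
print(open_questions(orphan))
print(open_questions(useful))
```

A metric that comes back with open questions is a candidate for removal or refinement rather than for another alert.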

The Layered Approach

• Operating System Monitoring: The bulk of the monitoring performed measures the health of the operating system

• Application Resource Monitoring: This monitoring is developed specifically for the technologies used by the application to determine if they are functioning correctly

• Transaction Monitoring: Transaction monitoring is the key to good monitoring as it provides depth and the capability to determine customer impact

The overlap ensures sufficient fault detection

Monitoring Patterns

Layers of Pre-Defined Monitoring Patterns

• The OS template is deployed when the server is provisioned

• As a server is customized to fit its role, additional templates are deployed

• Templates are stacked on top of each other until no gaps remain

• This approach provides a high degree of standardization without sacrificing the ability to develop a custom solution
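Template stacking can be sketched as a union of monitor sets: the OS template is always present, and each role a server takes on adds its own template. The template names and monitor names below are hypothetical placeholders, not the contents of any real template library.

```python
from typing import Dict, Iterable, Set

# Hypothetical pre-defined templates; each is a set of monitor names.
TEMPLATES: Dict[str, Set[str]] = {
    "os": {"cpu", "memory", "disk", "ping"},
    "web_server": {"http_port", "httpd_process", "access_log_errors"},
    "database": {"db_listener", "tablespace_usage", "db_error_log"},
}


def monitors_for(roles: Iterable[str]) -> Set[str]:
    """Stack the OS template with one template per server role."""
    stacked: Set[str] = set(TEMPLATES["os"])  # deployed at provisioning time
    for role in roles:
        stacked |= TEMPLATES[role]            # added as the server is customized
    return stacked


web_only = monitors_for(["web_server"])
web_and_db = monitors_for(["web_server", "database"])
print(sorted(web_only))
print(len(web_and_db))
```

Custom monitors for a specific application can be added as one more template on top of the stack, which keeps the standard layers unchanged.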

Application-Technology Matrix

Maps services, applications and technologies, enabling:
• Monitoring investment prioritization
• Monitoring maturity
• Which templates need to be deployed when new hardware is acquired
• Whether a service has sufficient monitoring coverage based on its application components
• This approach allows for anticipating changes to a customer’s monitoring needs

Scores indicate:
0 – No Strategy
1 – Limited Monitoring
2 – Fully Integrated Strategy
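The matrix can be held as a nested mapping from service to technology to score. The sketch below assumes a service is only as well covered as its weakest component; that rule, the function names, and the example services and scores are all illustrative assumptions.

```python
from typing import Dict, List

# service -> technology -> score (0 = No Strategy, 1 = Limited, 2 = Fully Integrated)
Matrix = Dict[str, Dict[str, int]]


def service_score(matrix: Matrix, service: str) -> int:
    """Score a service by its weakest component (an assumed rule)."""
    return min(matrix[service].values())


def uncovered_components(matrix: Matrix, service: str,
                         minimum_score: int = 1) -> List[str]:
    """Technologies of a service that score below the required minimum."""
    return sorted(tech for tech, score in matrix[service].items()
                  if score < minimum_score)


# Hypothetical matrix.
matrix: Matrix = {
    "Online Banking": {"WAS": 2, "MQ": 1, "CICS": 0},
    "Reporting": {"DB2": 2, "WAS": 2},
}

for service in matrix:
    print(service, service_score(matrix, service),
          uncovered_components(matrix, service))
```

Sorting services by score gives a simple starting point for prioritizing monitoring investment.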

Integrate Your Processes

Presentation Framework

Asset Management & Topology Database

Aggregation and Analysis

Security Management

Availability Management

Configuration Management

Change Management

Performance Management

Enterprise Data Sources

Business Telemetry

Information

Configuration Discrepancies

Enrichment Data

Business Activity Data

Historical Data

“Enriched” Events

Change Activity

Topology Snapshots

Trend-Related Faults

Discovered Problems

Status Indications

Incidents

Audit Information and Suspicious Activity

Enrichment Data Business Activity Data

Automated Discovery

Processing Streams

Situational Awareness Engine

Adapted from http://www.slideshare.net/TimBassCEP/getting-started-in-cep-how-to-build-an-event-processing-application-presentation-717795

Real-Time Event Streams

Detected and Predicted Situations

Patterns from Historical Data

Causal Relationship from Past RCAs

Complex Event Processing

Event Pipeline

Event Queries

Time Window

Data Events

Control Event

Other Events

Event Filter

Scenarios

A

B

C

Feedback Loop
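An event filter combined with a time window is the simplest kind of event query in this pipeline. The sketch below counts events of one type inside a sliding window and reports when a threshold is reached; the class name `TimeWindowRule`, the event type strings, and the numbers are hypothetical and do not represent any particular CEP product.

```python
from collections import deque
from typing import Deque


class TimeWindowRule:
    """Fire when `threshold` matching events arrive within `window_seconds`."""

    def __init__(self, event_type: str, window_seconds: float, threshold: int):
        self.event_type = event_type
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._times: Deque[float] = deque()

    def observe(self, timestamp: float, event_type: str) -> bool:
        if event_type != self.event_type:
            return False  # event filter: ignore other event types
        self._times.append(timestamp)
        # Time window: discard matching events that are now too old.
        while timestamp - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) >= self.threshold


# Hypothetical scenario: three abend events within 60 seconds.
rule = TimeWindowRule("CICS_ABEND", window_seconds=60, threshold=3)
stream = [(0, "CICS_ABEND"), (10, "MQ_ALERT"), (20, "CICS_ABEND"),
          (50, "CICS_ABEND"), (200, "CICS_ABEND")]
results = [rule.observe(ts, kind) for ts, kind in stream]
print(results)
```

A rule of this kind can turn a flood of individually low-severity events into a single actionable one.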

Event Intelligence

Action Events

Automated Action

Notification and Escalation

Business Impact Analysis

Root Cause Analysis

Correlation and Event Suppression

Enrichment

Meta-Data Integration Bus

Distributed Collectors

Distributed Collectors

LOB Managed Monitoring System

Service Provider Monitoring System

Vendor Managed Monitoring System

Element Manager

Element Manager

Element Manager

Other Enterprise Data

Document Sharing

Service Desk

CMDB

Batch Scheduling

Knowledge Database

Online Run Book

PBX/Call Manager

Visualization Framework

Common Event Format

Topology And Relationship Database

Automated Action Tools

Distributed Collectors

Automated Provisioning System

Predictive Analysis

Automated Change Reconciliation

Security Management

Archive and Report

Business Telemetry Data

Service Center and Enterprise Notification Tool

Event Processing

The Management Eco-System

Capacity Management

Compute

Storage

Network

Facilities

Event Management / Manager of Managers

CMDB

Billing & Chargeback

Software Tracking

Server Monitoring

Storage Manager

Network Performance Manager

Data Center Infrastructure Manager

Capacity Management

Predictive Insights Capacity Analyzer

Automated Reporting Engine

Cloud Orchestrator

Interface for Capacity Planners

Interface for Business Users

Policies Manager

Data Warehouse

Improvement Lifecycle

Develop Recommended Best Practices
Deliverables
• Industry Recommendations
• ESM Best Practices
• Questions for the QA Session

Question & Answer Session
Deliverables
• Physical & Logical Diagrams
• Asset List (Hardware & Software)
• PBRA Recommendations for Monitoring
• Existing “Home Grown” Monitoring Identified
• Solution Discussion

Incident History Analysis & Monitor Discovery
Deliverables
• Ticket History Report
• Points of Failure List
• Monitor Inventory List
• Alert History Report
• Alert Logic Flow Chart
• Non-Standard Monitoring Audit

Gap Analysis and Monitoring Strategy Design
Deliverables
• Monitoring Strategy
• Deployment Plan
• Application/Technology Matrix
• Additional Questions

Plan Approval
Deliverables
• Solution Discussion
• Plan Approval Document

Deploy Monitoring
Deliverables
• Monitors
• Alerts
• Netcool Facts
• Readiness Test Results

Integration Required? (Y / N)

Event Integration Design
Deliverables
• Event Life Cycle Matrix
• Data Flow Diagram
• Integration Stories

Build Integration Solution
Deliverables
• Design Document Package
• Integration Rules
• As-Built Document
• Test Plan & Results
• Code Review Results
• Quality Inspection Checklist

Event Integration Test
Deliverables
• Acceptance Document

Closeout Meeting
Deliverables
• Acceptance Document

Legend
• SA: Systems Administrator
• SMC: Systems Monitoring Consultant
• Arch: Platform Architect
• SM: Service Manager
• PM: Project Manager

[Diagram: each deliverable is tagged with its responsible roles from the legend (SA, SMC, Arch, SM)]