andrew-white
Monitoring Improvement Assessment Process
The Same Old Problem
Corporate LANs & VPNs
ISP Connection
DNS & Internet Services
Content Mgmt System
Social Network Widgets
Site Tracking & Analytics
Banner Ads & Revenue Generators
Multimedia & CDN Content
Home Wireless & Broadband
Mobile Broadband
Is It My Data Center?
• Configuration errors
• Application design issues
• Code defects
• Insufficient infrastructure
• Oversubscription issues
• Poor routing optimization
• Low cache hit rate
Is It a Service Provider Problem?
• Non-optimized mobile content
• Bad performance under load
• Blocking content delivery
• Incorrect geo-targeted content
Is It an ISP Problem?
• Peering problems
• ISP outages
Is It My Code or a Browser Problem?
• Missing content
• Poorly performing JavaScript
• Inconsistent CSS rendering
• Browser/device incompatibility
• Page size too big
• Conflicting HTML tag support
• Too many objects
• Content not optimized for device
The Cloud
Distributed
Database
Mainframe
Network
Middleware
Storage
Anatomy of an Outage
Corporate LANs & VPNs
Load Balancer
Firewall
Web Servers
Message Queue
zOS CICS
WAS
Database
WAS Database
zOS MQ
DB2
1. 5:45-ish pm: CICS ABENDs start flooding the console, but not at a rate high enough to generate a ticket
2. 6:00-ish pm: MQ flows are interrupted and begin alerting in Flow Diagnostics
3. 6:04pm: Synthetic transactions fail; at 6:14pm the Ops Center confirms the issue and creates a P0 Incident
4. 6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem
5. 10:29pm: Support teams investigate MQ, rule it out, and ultimately decide to reset CICS to resolve the issue
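The timeline above implies some simple arithmetic: how long each stage took relative to the first symptom. The following is a minimal sketch, assuming the approximate times ("-ish") are taken at face value and that all events fall on the same evening. The event labels and the function name `minutes_since_first` are shorthand introduced here, not part of the original slides.

```python
from datetime import datetime

# Times from the outage timeline; "5:45-ish" and "6:00-ish" are treated as exact (assumption).
TIMELINE = [
    ("17:45", "CICS ABENDs start flooding the console"),
    ("18:00", "MQ flows interrupted"),
    ("18:04", "Synthetic transactions fail"),
    ("18:14", "Ops Center creates P0 incident"),
    ("18:54", "Flows diagnosed as a back-end problem"),
    ("22:29", "MQ ruled out; decision to reset CICS"),
]


def minutes_since_first(timeline):
    """Return (label, minutes elapsed since the first event) pairs."""
    start = datetime.strptime(timeline[0][0], "%H:%M")
    result = []
    for clock, label in timeline:
        elapsed = datetime.strptime(clock, "%H:%M") - start
        result.append((label, int(elapsed.total_seconds() // 60)))
    return result


if __name__ == "__main__":
    for label, minutes in minutes_since_first(TIMELINE):
        print(f"+{minutes:3d} min  {label}")
```

Under these assumptions, the first symptom precedes the P0 incident by 29 minutes, and the decision to reset CICS comes 284 minutes after it.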
Gaining Perspective Requires Balance
Packet Capture
Synthetic Transactions
Client Monitoring
Client Monitoring
Synthetic Transactions
Server Probe
1. Client to the Server
2. Server to the Client
3. “3rd Party” Vantage Point
4. Synthetic Transactions
Four Perspectives of User Experience
Why Multiple Perspectives?
Know Your Customer:
• What do they do?
§ Customers care about completing tasks, NOT whether the homepage is available
• Where do they do it from?
§ Your customers don’t live in the cloud; test from their perspective
• When do they do it?
§ Test at peak and normal traffic levels to find all the problems
• What expectations do customers have?
§ Is 5 seconds fast enough or does it have to be quicker?
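To make the expectation question concrete, a synthetic transaction can time a whole user task against an explicit target instead of pinging a homepage. The sketch below is illustrative only: `run_synthetic_check`, the `checkout` stub, and the injectable `clock` parameter are hypothetical names, and the 5-second default simply echoes the figure on the slide.

```python
import time
from typing import Any, Callable, Dict


def run_synthetic_check(task: Callable[[], bool],
                        max_seconds: float = 5.0,
                        clock: Callable[[], float] = time.monotonic) -> Dict[str, Any]:
    """Run one scripted user task and judge it against a response-time target."""
    started = clock()
    try:
        completed = bool(task())
    except Exception:
        completed = False  # a crashed task counts as a failed transaction
    elapsed = clock() - started
    return {
        "completed": completed,
        "elapsed_seconds": elapsed,
        "within_target": completed and elapsed <= max_seconds,
    }


def checkout() -> bool:
    """Hypothetical stand-in for a scripted login, search, and purchase flow."""
    time.sleep(0.01)
    return True


if __name__ == "__main__":
    print(run_synthetic_check(checkout))
```

Running the same check from several locations and at both peak and normal traffic levels would cover the "where" and "when" questions as well.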
Current Approach
1. Itemize the existing monitors
2. Brainstorm potential gaps to fill
3. Deploy new monitors
Proposed Approach
1. Identify the potential risks
2. Itemize the existing monitors
3. Determine which gaps exist
4. Fill the monitoring gaps
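The proposed approach amounts to a set comparison: start from the risks, list what the existing monitors detect, and take the difference. A minimal sketch follows, assuming each monitor can be described by the set of risks it detects. The risk names, monitor names, and `find_monitoring_gaps` are hypothetical.

```python
from typing import Dict, List, Set


def find_monitoring_gaps(risks: Set[str],
                         monitors: Dict[str, Set[str]]) -> List[str]:
    """Return the risks that no existing monitor claims to detect, sorted by name."""
    covered: Set[str] = set()
    for detected in monitors.values():
        covered |= detected
    return sorted(risks - covered)


if __name__ == "__main__":
    # Step 1: identify the potential risks (hypothetical examples).
    risks = {"cics_abend_storm", "mq_flow_interrupted", "db2_lock_timeout"}
    # Step 2: itemize the existing monitors and what each one detects.
    monitors = {
        "flow_diagnostics": {"mq_flow_interrupted"},
        "os_cpu_threshold": set(),
    }
    # Step 3: determine which gaps exist; step 4 would be to fill them.
    print(find_monitoring_gaps(risks, monitors))
```

Starting from risks rather than from the monitor inventory means a monitor that detects nothing of interest, such as `os_cpu_threshold` here, does not count as coverage.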
Picking Better Monitors
What Does Good Monitoring Look Like?
Corporate LANs & VPNs
Load Balancer
Load Balancer
Firewall
Switch
Web Server Farm
Database
DataPower
Mainframe
Middleware
Load Balancer
Elements of Good Monitoring
1. System Availability
2. Operating System Performance
3. Hardware Monitoring
4. Service/Daemon and Process Availability
5. Error Logs
6. Application Resource KPIs
7. End-to-End Transactions
8. Point of Failure Transactions
9. Fail-Over Success
10. “Activity Monitors” and “Reverse Hockey Stick”
What Matters Most?
Dr. Lee Goldman
Cook County Hospital, Chicago, IL
1. Is the patient experiencing unstable angina?
2. Is there fluid in the patient’s lungs?
3. Is the patient’s systolic blood pressure below 100?
The Goldman Algorithm
[Chart: Prediction of Patients Expected to Have a Heart Attack Within 72 Hours. Compares Traditional Techniques with the Goldman Algorithm on a 0 to 100 scale.]
By paying attention to what really matters, Dr. Goldman improved the “false negatives” by 20 percentage points and eliminated the “false positives” altogether.
The Goldman Algorithm
[Flowchart]
Patient suspected of Acute Cardiac Ischemia
Perform Electrocardiogram (EKG)
ECG Evidence of Acute Myocardial Infarction (MI)? (Yes / No)
• ST-Segment Elevation ≥ 1 mm in ≥ 2 Contiguous Leads (New or Unknown Age), or
• Pathologic Q Waves in ≥ 2 Contiguous Leads (New or Unknown Age)
ECG Evidence of Acute Ischemia? (Yes / No)
• ST-Segment Depression ≥ 1 mm in ≥ 2 Contiguous Leads (New or Unknown Age), or
• T-Wave Inversion in ≥ 2 Contiguous Leads (New or Unknown Age), or
• Left Bundle-Branch Block (New or Unknown Age)
Urgent Factors Present? (appears twice in the flowchart)
• Rales Above Both Lung Bases
• Systolic Blood Pressure < 100 mm Hg
• Unstable Ischemic Heart Disease
Branches by number of urgent factors: 2 or 3 Factors; 0 or 1 Factors; 2 or 3 Factors; 1 Factor; 0 Factors
Risk levels: High Risk, Moderate Risk, Low Risk, Very Low Risk
Destinations: Coronary Care Unit, Inpatient Telemetry Unit, Observation Unit
Seven Deadly Sins
Although Companies Realize the Importance of an Effective Monitoring System, Most Fall Prey to Common Mistakes That Erode the Value
[Figure: the seven mistakes plotted by Life Cycle Activity (Selection, Collection, Reporting, Usage) against Nature of Mistake (Strategic, Tactical)]
• Ignoring the possibilities: Lack of optimal utilization of available data
• One size fits all: Lack of audience segmentation
• “Metrics Toilet”: Lack of aggregation and screening of low-level metrics, resulting in cumbersome reports
• Waiting for the perfect tool: Lack of focus on process, leading to over-reliance on technology
• An arbitrary exercise: Lack of defined criteria for target setting
• Metrics that (don’t) matter: Lack of actionable metrics
• IT’s World View: Lack of user involvement in metrics selection and refinement
Source: Infrastructure Executive Council, 2003
Finding Metrics That Matter
§ Will the metric be used in a report? If so, which one? How is it used in the report?
§ Will the metric be used in a dashboard? If so, which one? How will it be used?
§ What action(s) will be taken if an alert is generated? Who are the actors? Will a ticket be generated? If so, what severity?
§ How often is this event likely to occur? What is the impact if the event occurs? What is the likelihood it can be detected by monitoring?
§ Will the metric help identify the source of a problem? Is it a coincident / symptomatic indicator?
§ Is the metric always associated with a single problem? Could this metric become a false indicator?
§ What is the impact if this goes undetected?
§ What is the lifespan for this metric? What is the potential for changes that may reduce the efficacy of the metric?
Evaluating the Effectiveness of a Metric
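The questions about frequency, impact, detectability, and resulting action can be turned into a rough ranking of candidate metrics. The sketch below is one possible reading, not a method given in the slides: the 1 to 5 scales, the multiplicative score, the rule that a metric with no defined action scores zero, and all candidate names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricCandidate:
    name: str
    likelihood: int     # how often the event is likely to occur: 1 (rare) to 5 (frequent)
    impact: int         # impact if the event occurs: 1 (minor) to 5 (severe)
    detectability: int  # chance monitoring detects it: 1 (unlikely) to 5 (near certain)
    actionable: bool    # is there a defined action and actor when it alerts?


def priority(candidate: MetricCandidate) -> int:
    """Higher score means more worth monitoring; non-actionable metrics score 0."""
    for value in (candidate.likelihood, candidate.impact, candidate.detectability):
        if not 1 <= value <= 5:
            raise ValueError(f"{candidate.name}: scores must be between 1 and 5")
    if not candidate.actionable:
        return 0
    return candidate.likelihood * candidate.impact * candidate.detectability


def rank(candidates: List[MetricCandidate]) -> List[str]:
    """Return candidate names ordered from highest to lowest priority."""
    return [c.name for c in sorted(candidates, key=priority, reverse=True)]


if __name__ == "__main__":
    candidates = [
        MetricCandidate("cpu_above_90_percent", 4, 2, 5, True),
        MetricCandidate("end_to_end_login_failure", 3, 5, 4, True),
        MetricCandidate("obscure_counter", 5, 5, 5, False),
    ]
    print(rank(candidates))
```

Zeroing out non-actionable metrics reflects the "Metrics that (don’t) matter" mistake: a metric nobody will act on does not earn a place in the alert stream, however often it fires.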
The Layered Approach
Operating System Monitoring: The bulk of the monitoring performed measures the health of the operating system
Application Resource Monitoring: This monitoring is developed specifically for the technologies used by the application to determine if they are functioning correctly
Transaction Monitoring: Transaction monitoring is the key to good monitoring, as it provides depth and the capability to determine customer impact
The overlap between layers ensures sufficient fault detection
Monitoring Patterns
Layers of Pre-Defined Monitoring Patterns
• The OS template is deployed when the server is provisioned
• As a server is customized to fit its role, additional templates are deployed
• Templates are stacked on top of each other until no gaps remain
• This approach provides a high degree of standardization without sacrificing the ability to develop a custom solution
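The stacking of templates described above can be sketched as a lookup: every server gets the OS template, and its role adds further templates on top. The template names, role names, and monitor names below are hypothetical, chosen only to illustrate the layering.

```python
from typing import Dict, List

# Hypothetical template catalog: template name -> monitors it deploys.
TEMPLATES: Dict[str, List[str]] = {
    "os_base": ["cpu", "memory", "disk", "system_availability"],
    "web_server": ["httpd_process", "http_error_log"],
    "mq": ["queue_depth", "channel_status"],
}

# Hypothetical mapping from server role to the templates layered on the OS base.
ROLE_TEMPLATES: Dict[str, List[str]] = {
    "web": ["web_server"],
    "messaging": ["mq"],
    "web+messaging": ["web_server", "mq"],
}


def monitors_for(role: str) -> List[str]:
    """Stack the OS template first, then the role templates, without duplicates."""
    stack = ["os_base"] + ROLE_TEMPLATES.get(role, [])
    monitors: List[str] = []
    for template in stack:
        for monitor in TEMPLATES[template]:
            if monitor not in monitors:
                monitors.append(monitor)
    return monitors


if __name__ == "__main__":
    print(monitors_for("web"))
    print(monitors_for("unknown-role"))  # falls back to the OS template only
```

A role that is not in the catalog still receives the OS template, which mirrors deploying it at provisioning time before the server is customized.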
Application-Technology Matrix
Maps services, applications and technologies, enabling:
• Monitoring investment prioritization
• Monitoring maturity
• Which templates need to be deployed when new hardware is acquired
• Whether a service has sufficient monitoring coverage based on its application components
• This approach allows for anticipating changes to a customer’s monitoring needs
Scores indicate:
0 – No Strategy
1 – Limited Monitoring
2 – Fully Integrated Strategy
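One way to use such a matrix is to roll the per-technology scores up to each service. The following sketch assumes the 0 to 2 scale from the slide; the score values, the service names, and the idea of expressing coverage as a fraction of the maximum score are illustrative assumptions.

```python
from typing import Dict, List

# Scores follow the slide: 0 = No Strategy, 1 = Limited Monitoring, 2 = Fully Integrated Strategy.
# The values assigned here are hypothetical.
TECH_SCORES: Dict[str, int] = {"WAS": 2, "MQ": 1, "CICS": 0, "DB2": 2}

# Hypothetical mapping of services to the technologies they depend on.
SERVICE_TECH: Dict[str, List[str]] = {
    "online_banking": ["WAS", "MQ", "CICS", "DB2"],
    "reporting": ["WAS", "DB2"],
}


def coverage(service: str) -> float:
    """Fraction of the maximum possible score (2 per technology) a service achieves."""
    techs = SERVICE_TECH[service]
    return sum(TECH_SCORES[t] for t in techs) / (2 * len(techs))


def weakest_technologies(service: str) -> List[str]:
    """Technologies with the lowest score for this service: candidates for investment."""
    techs = SERVICE_TECH[service]
    lowest = min(TECH_SCORES[t] for t in techs)
    return [t for t in techs if TECH_SCORES[t] == lowest]


if __name__ == "__main__":
    for service in SERVICE_TECH:
        print(service, coverage(service), weakest_technologies(service))
```

With these made-up scores, `online_banking` reaches 0.625 of full coverage and CICS is its weakest component, which is the kind of signal that would drive investment prioritization.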
Integrate Your Processes
Presentation Framework
Asset Management & Topology Database
Aggregation and Analysis
Security Management
Availability Management
Configuration Management
Change Management
Performance Management
Enterprise Data Sources
Business Telemetry
Information
Configuration Discrepancies
Enrichment Data
Business Activity Data
Historical Data
“Enriched” Events
Change Activity
Topology Snapshots
Trend-Related Faults
Discovered Problems
Status Indications
Incidents
Audit Information and Suspicious Activity
Enrichment Data
Business Activity Data
Automated Discovery
Processing Streams
Situational Awareness Engine
Adapted from http://www.slideshare.net/TimBassCEP/getting-started-in-cep-how-to-build-an-event-processing-application-presentation-717795
Real-Time Event Streams
Detected and Predicted Situations
Patterns from Historical Data
Causal Relationship from Past RCAs
Complex Event Processing
Event Pipeline
Event Queries
Time Window
Data Events
Control Event
Other Events
Event Filter
Scenarios
A
B
C
Feedback Loop
Event Intelligence
Action Events
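The pipeline labels above (event filter, time window, scenarios, action events) can be illustrated with a single scenario: raise an action event when enough matching events arrive within a time window. This is a minimal sketch of that one pattern, not a complex event processing engine. The class name `WindowedThreshold`, the event dictionary layout, and the `cics_abend` event type are assumptions.

```python
from collections import deque
from typing import Any, Deque, Dict, Optional


class WindowedThreshold:
    """Emit an action event when `threshold` matching events fall within `window_seconds`."""

    def __init__(self, event_type: str, threshold: int, window_seconds: float) -> None:
        self.event_type = event_type
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._times: Deque[float] = deque()

    def process(self, event: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        # Event filter: ignore anything this scenario does not watch.
        if event.get("type") != self.event_type:
            return None
        now = float(event["timestamp"])
        self._times.append(now)
        # Time window: drop events older than the window.
        while self._times and now - self._times[0] > self.window_seconds:
            self._times.popleft()
        if len(self._times) >= self.threshold:
            count = len(self._times)
            self._times.clear()  # reset so one burst yields one action event
            return {
                "type": "action",
                "reason": f"{count} {self.event_type} events within {self.window_seconds:.0f}s",
                "timestamp": now,
            }
        return None


if __name__ == "__main__":
    scenario = WindowedThreshold("cics_abend", threshold=3, window_seconds=60)
    stream = [("cics_abend", 0), ("mq_alert", 10), ("cics_abend", 20), ("cics_abend", 30)]
    for event_type, ts in stream:
        print(scenario.process({"type": event_type, "timestamp": ts}))
```

A rule of this shape addresses events that individually fall below a ticketing threshold but are significant as a burst.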
Automated Action
Notification and Escalation
Business Impact Analysis
Root Cause Analysis
Correlation and Event Suppression
Enrichment
Meta-Data Integration Bus
Distributed Collectors
Distributed Collectors
LOB Managed Monitoring System
Service Provider Monitoring System
Vendor Managed Monitoring System
Element Manager
Element Manager
Element Manager
Other Enterprise Data
Document Sharing
Service Desk
CMDB
Batch Scheduling
Knowledge Database
Online Run Book
PBX/Call Manager
Visualization Framework
Common Event Format
Topology And Relationship Database
Automated Action Tools
Distributed Collectors
Automated Provisioning System
Predictive Analysis
Automated Change Reconciliation
Security Management
Archive and Report
Business Telemetry Data
Service Center and Enterprise Notification Tool
Event Processing
The Management Eco-System
Capacity Management
Compute, Storage, Network, Facilities
Event Management / Manager of Managers
CMDB
Billing & Chargeback
Software Tracking
Server Monitoring
Storage Manager
Network Performance Manager
Data Center Infrastructure Manager
Capacity Management
Predictive Insights Capacity Analyzer
Automated Reporting Engine
Cloud Orchestrator
Interface for Capacity Planners
Interface for Business Users
Policies Manager
Data Warehouse
Improvement Lifecycle
1. Develop Recommended Best Practices
Deliverables:
• Industry Recommendations
• ESM Best Practices
• Questions for the QA Session
2. Question & Answer Session
Deliverables:
• Physical & Logical Diagrams
• Asset List (Hardware & Software)
• PBRA Recommendations for Monitoring
• Existing “Home Grown” Monitoring Identified
• Solution Discussion
3. Incident History Analysis & Monitor Discovery
Deliverables:
• Ticket History Report
• Points of Failure List
• Monitor Inventory List
• Alert History Report
• Alert Logic Flow Chart
• Non-Standard Monitoring Audit
4. Gap Analysis and Monitoring Strategy Design
Deliverables:
• Monitoring Strategy
• Deployment Plan
• Application/Technology Matrix
• Additional Questions
5. Plan Approval
Deliverables:
• Solution Discussion
• Plan Approval Document
6. Deploy Monitoring
Deliverables:
• Monitors
• Alerts
• Netcool Facts
• Readiness Test Results
7. Integration Required? (Y / N)
8. Event Integration Design
Deliverables:
• Event Life Cycle Matrix
• Data Flow Diagram
• Integration Stories
9. Build Integration Solution
Deliverables:
• Design Document Package
• Integration Rules
• As-Built Document
• Test Plan & Results
• Code Review Results
• Quality Inspection Checklist
10. Event Integration Test
Deliverables:
• Acceptance Document
11. Closeout Meeting
Deliverables:
• Acceptance Document
Legend
SA: Systems Administrator
SMC: Systems Monitoring Consultant
Arch: Platform Architect
SM: Service Manager
PM: Project Manager
[Diagram: each lifecycle step is tagged with the roles involved, using the abbreviations above]