TradeTech Architecture 2011 - Rodney Morrison, How to Achieve Success with Application Performance Monitoring Initiatives

How to Achieve Success With Application Performance Monitoring Initiatives

Rodney Morrison

VP, Products

SL Corporation

Agenda

Definition and Objectives of APM

Analyst Breakdown of the APM Solution Space

Challenges to Success

Coming Advancements in APM

An Investment Bank Use Case

What is APM?

APM refers to the discipline within systems management that focuses on monitoring and managing the performance and service availability of software applications.

The two main sources of information for proper application management are resources and user experience.

Factors Driving APM Initiatives

Complexity of applications increasing dramatically

Cost estimations of application downtimes are more frequently calculated and on the rise

High severity incidents on the rise

MTTR is often 30 minutes to 3 hours in even the most critical application areas

The Perfect Storm

Best practices for change control have created operational silos

Events geared more for operational management are passed to application support teams from all silos

Each technology stack requires expertise and training in administrative and monitoring tools

Result

High level of noisy, uncorrelated events makes proactive application management impossible

Lack of centralized access to data and performance history leads to lengthy triage for resolution

APM Breakdown – The Analyst’s Vision

End-user experience monitoring

Runtime architecture discovery

Transaction monitoring

Component deep-dive monitoring

Performance analytics

Usage of Management Tools Evolves With Complexity

The Missing Gap

Centralized console for all performance data

Summary views relevant to applications and services

Application-centric event filtering and correlation

Historical views of performance metrics for baselining and event analysis
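Application-centric event filtering and correlation can be pictured with the minimal Python sketch below. The event fields (`app`, `source`, `ts`), the sample application names, and the 60-second window are assumptions made for illustration only; they are not taken from the presentation or from any particular APM product.

```python
from collections import defaultdict


def correlate_events(events, window_seconds=60):
    """Group events by application, then merge events that arrive within
    window_seconds of the previous event for the same application."""
    by_app = defaultdict(list)
    for event in events:
        by_app[event["app"]].append(event)

    incidents = []
    for app, app_events in by_app.items():
        app_events.sort(key=lambda e: e["ts"])
        current = [app_events[0]]
        for event in app_events[1:]:
            if event["ts"] - current[-1]["ts"] <= window_seconds:
                current.append(event)  # close in time: same incident
            else:
                incidents.append({"app": app, "events": current})
                current = [event]      # gap too large: start a new incident
        incidents.append({"app": app, "events": current})
    return incidents


# Hypothetical events from different monitoring silos (timestamps in seconds).
sample_events = [
    {"app": "order-routing", "source": "jms", "ts": 100},
    {"app": "order-routing", "source": "jvm", "ts": 130},
    {"app": "pricing", "source": "web", "ts": 140},
    {"app": "order-routing", "source": "db", "ts": 900},
]

incidents = correlate_events(sample_events)
for incident in incidents:
    print(incident["app"], [e["source"] for e in incident["events"]])
```

In this sketch, four raw events collapse into three application-level incidents, which is the kind of noise reduction a centralized console is meant to provide.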

Advanced APM

Analytics engines for discovering performance patterns

Automation – Command and Control

Reduce incidents and time to repair with very specific application visibility requirements

Emphasis on root cause analysis

Tools detailing application performance for preventative care and capacity planning

To support line-of-business visibility into application availability, performance, the risk associated with that performance, and SLA monitoring

To standardize on a common delivery platform for application support to reduce operation costs

IT Challenges for APM Initiatives


Best Practices for APM Delivery and Maintenance

Create a team, Define, Measure, Analyze, Improve, Control

Create a team

Specifically tasked with APM

Skills: Basic knowledge of applications, underlying infrastructure, software components and architectures, and the ability to liaise with business and development


Define

Choose initial critical applications

Gather requirements

Gather all relevant and accessible performance metrics

Choose tool sets


Measure

Aggregate metrics and create baseline

Determine initial rule sets and thresholds

Configure summary views and drill down per defined role
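As a rough illustration of creating a baseline and deriving an initial threshold from it, the Python sketch below uses a mean plus or minus k standard deviations rule. The rule itself, the value k = 3, and the sample response times are assumptions chosen for the example; the presentation does not prescribe a specific method.

```python
from statistics import mean, stdev


def build_baseline(samples, k=3.0):
    """Summarize historical samples and derive alert bounds at k standard
    deviations from the mean (lower bound clamped at zero)."""
    avg = mean(samples)
    sd = stdev(samples)
    return {
        "mean": avg,
        "stdev": sd,
        "upper": avg + k * sd,
        "lower": max(avg - k * sd, 0.0),
    }


def breaches(baseline, value):
    """Return True when a new reading falls outside the baseline bounds."""
    return value > baseline["upper"] or value < baseline["lower"]


# Hypothetical response times in milliseconds for one application.
history_ms = [120, 118, 125, 130, 122, 119, 127, 121]
baseline = build_baseline(history_ms)

print(breaches(baseline, 124))  # a reading close to the historical mean
print(breaches(baseline, 400))  # a reading far above the historical range
```

A real deployment would likely keep separate baselines per metric and per time of day, since trading applications tend to behave differently at market open than overnight.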


Analyze

Iterate with key stakeholders

Verify that the information is correct, and pertinent to role activities

Practice scenarios to verify that discovery and drill down to analysis or root cause is optimal per role


Improve

Track MTTR improvements

Determine repair activities performed outside of the APM solution; can they be automated?

Are there other metrics or correlations that need to be included to capture outlier events?
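Tracking MTTR improvements comes down to comparing the average time from incident open to incident resolution across two periods. The Python sketch below shows one way to do that; the incident timestamps are hypothetical and exist only to make the arithmetic concrete.

```python
from datetime import datetime


def mttr_minutes(incidents):
    """Mean time to repair, in minutes, for (opened, resolved) pairs."""
    durations = [
        (resolved - opened).total_seconds() / 60
        for opened, resolved in incidents
    ]
    return sum(durations) / len(durations)


# Hypothetical incidents before and after the APM roll-out.
before = [
    (datetime(2011, 1, 10, 9, 0), datetime(2011, 1, 10, 11, 0)),
    (datetime(2011, 1, 17, 14, 0), datetime(2011, 1, 17, 15, 0)),
]
after = [
    (datetime(2011, 3, 7, 9, 0), datetime(2011, 3, 7, 9, 45)),
    (datetime(2011, 3, 21, 16, 0), datetime(2011, 3, 21, 16, 15)),
]

mttr_before = mttr_minutes(before)
mttr_after = mttr_minutes(after)
improvement = (mttr_before - mttr_after) / mttr_before

print(f"MTTR before: {mttr_before:.0f} min, after: {mttr_after:.0f} min")
print(f"Improvement: {improvement:.0%}")
```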


Control

Tweak thresholds to optimize alerts

Add any new important metrics or correlations

Add automated responses

Add new important applications
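One possible way to tweak a threshold is to choose it from recent data so that only a target fraction of readings would have raised an alert. The Python sketch below illustrates that idea; the function name `tune_threshold`, the 10% target, and the CPU utilization samples are all hypothetical, and other tuning strategies are equally valid.

```python
def tune_threshold(recent_values, target_alert_fraction=0.01):
    """Pick a threshold so that at most target_alert_fraction of the
    recent samples are strictly above it."""
    if not recent_values:
        raise ValueError("recent_values must not be empty")
    ordered = sorted(recent_values)
    # Number of samples allowed to exceed the threshold.
    allowed = int(len(ordered) * target_alert_fraction)
    allowed = min(allowed, len(ordered) - 1)
    return ordered[len(ordered) - allowed - 1]


# Hypothetical CPU utilization samples (percent) with two genuine spikes.
cpu_samples = [41, 43, 40, 45, 44, 42, 47, 46, 43, 44,
               90, 42, 41, 45, 44, 43, 46, 42, 95, 44]

threshold = tune_threshold(cpu_samples, target_alert_fraction=0.10)
alerts = [value for value in cpu_samples if value > threshold]

print("threshold:", threshold)
print("alerts:", alerts)
```

With these samples, the tuned threshold sits just above normal load, so only the two spikes alert and routine fluctuation stays quiet.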

Case Study: One of the World’s Largest Financial Institutions

Real World Best Practices

Designate main support team and determine roll-out plan

Choose application group or initial critical applications for implementation

Determine user roles: exec/application support/tech support

Determine application architectures

Determine available monitoring systems and sources of performance data

Work with technology support teams and determine agent or agentless technology

Determine dashboard templates to create consistent company-wide standards

Move to next app group or let app support teams build out



Alert Summary Composite Objects of Underlying Data

Alerting in Context of Process Flow

Technology Summaries - Web Server Farms

Detail JMS Server

WebLogic Server Metrics

Data Grid Utilization

Detail – JMS Destinations

JVM Metrics

ROI

Total benefit – Reduction in outages = reduction in…

• Idle labor

• Loss of business

• Penalties for unmet SLAs/JIT agreements

• Loss of discounts for early payments

• Penalties for late payments

• Cost of idle equipment/telecommunications

• Cost of facilities

• Cost of recovery

• Cost of damage to perceived customer service

• Cost of damage to brand perception

– Improved productivity

– Reduced complexity – faster training of new employees

– Capacity planning

$35.7MM in Year 1

Conclusion

The enterprise IT environment is becoming exponentially more complex

Multiple disparate unconnected monitoring solutions are only adding to the complexity and management costs.

End-to-end APM solutions are not only viable but have already been proven to deliver great benefits.

Questions?