
Redefine Triage by Learning the Golden Nuggets of APM From Noted "APM Best Practices" Author Michael Sydor


Increase your APM proficiency. Learn how you can identify and harness KPIs to make sense of your APM "big data." Find out how these techniques will help you prepare for your upgrade to the new features and functionality in the latest APM release. For more information on DevOps solutions from CA Technologies, please visit: http://bit.ly/1wbjjqX


CA Opscenter

Michael Sydor

OCX14S #CAWorld

CA Technologies, Service Assurance


2 © 2014 CA. ALL RIGHTS RESERVED.


Michael Sydor

CA Technologies

Sr. Engineering Services Architect


Agenda

1. WHY SO MANY METRICS WITH APM?

2. WHAT WE ARE LEARNING WITH ADVANCED BEHAVIORAL ANALYTICS (ABA)

3. HOW TO FIND KPIS

4. HOW TO GENERATE A CUSTOM ABA CONFIGURATION


Typical APM Cluster

Dozens to hundreds of applications

– 2800 JVMs/CLRs

Up to 5M metrics, every 15 seconds

Large applications span multiple data centers

– 2-8 APM clusters, typical

– 30-70 EM Collectors for a nationwide portal application

12M to 28M metrics, every 15 seconds

… certainly sounds like big data!!!
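To put the slide's figures in perspective, the short sketch below converts "metrics every 15 seconds" into values reported per day. Only the metric counts and the 15-second interval come from the slide; the function name and labels are illustrative assumptions.

```python
# Back-of-the-envelope volume for the figures quoted on the slide.
SECONDS_PER_DAY = 24 * 60 * 60
REPORTING_INTERVAL_S = 15  # reporting interval stated on the slide


def data_points_per_day(metrics_per_interval: int) -> int:
    """Number of metric values reported in one day at a 15-second interval."""
    intervals_per_day = SECONDS_PER_DAY // REPORTING_INTERVAL_S  # 5760 intervals
    return metrics_per_interval * intervals_per_day


if __name__ == "__main__":
    for label, count in [
        ("single cluster, 5M metrics", 5_000_000),
        ("multi-cluster, 12M metrics", 12_000_000),
        ("multi-cluster, 28M metrics", 28_000_000),
    ]:
        print(f"{label}: {data_points_per_day(count):,} values per day")
```

A single 5M-metric cluster works out to 28.8 billion values per day, which is why the volume alone "sounds like" big data.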


The Experts …

"[T]he problem ... [w]ith this bulk acquisition of data on everybody [is that the NSA has] inundated their analysts with data. Unless they do a very focused attack, they're buried in information and that's why they can't succeed."

– Bill Binney, former high-ranking National Security Agency (NSA) official, mathematician and codebreaker

"The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning .... [W]e may construe them in self-serving ways that are detached from their objective reality."

– Nate Silver


What is Big Data?

APM information is “big” … but it is not “big data” without enrichment.

[Diagram: 5M metrics that you don’t fully understand, OR 5M metrics that you don’t fully understand combined with enrichment sources: trouble management, version control, time of ____ constraints, air traffic advisories, weather forecast, AP news updates, marketing campaigns. Diagram labels: Enrichment, Correlation, Trends, Insights, Anomalies.]


Challenges for Big Data

Data variety – Different sources give different perspectives. Does your data have a significant perspective?

Validation – Is the data source meaningful/predictive?

Consistency – Are the values trustworthy?

Data structure and nomenclature – Mapping, Transformation

Temporal impedance mismatch

– APM: Real-time with 15 second reporting interval

– Trouble management: +15-30 minutes later

– Stock ticker: +15-30 minutes later

– Air traffic advisories: +30-60 minutes later

– Version control: days to weeks in advance

– Marketing campaign assessment: 2-4 weeks later
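One way to reason about the temporal mismatch is to shift each late-arriving record back by its reporting lag and see which 15-second APM intervals it could relate to. The sketch below is an assumption-laden illustration: the lag ranges are copied from the slide, while the source keys, function names, and the idea of a simple "candidate window" are hypothetical.

```python
from datetime import datetime, timedelta

APM_INTERVAL = timedelta(seconds=15)  # APM reporting interval from the slide

# Reporting lags (minimum, maximum) taken from the ranges on the slide.
# The dictionary keys are hypothetical names for those sources.
REPORTING_LAG = {
    "trouble_management": (timedelta(minutes=15), timedelta(minutes=30)),
    "air_traffic_advisories": (timedelta(minutes=30), timedelta(minutes=60)),
}


def apm_window(source: str, reported_at: datetime) -> tuple[datetime, datetime]:
    """Return the APM time window in which the underlying event likely occurred."""
    min_lag, max_lag = REPORTING_LAG[source]
    return reported_at - max_lag, reported_at - min_lag


def interval_count(start: datetime, end: datetime) -> int:
    """Number of 15-second APM reporting intervals covered by the window."""
    return int((end - start) / APM_INTERVAL)


if __name__ == "__main__":
    ticket_opened = datetime(2014, 11, 10, 10, 0, 0)  # made-up timestamp
    start, end = apm_window("trouble_management", ticket_opened)
    print(start.time(), end.time(), interval_count(start, end))
```

A trouble ticket therefore maps to dozens of candidate APM intervals rather than one, which is part of what makes enrichment hard.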


KPI Management Maturity

SGCM: Stalls, GC settings, Concurrency, Memory management trends

APC: Availability, Performance, Capacity

EKB: Errors, Key resource performance, Business transaction survey

[Chart: Value (vertical axis) against KPI maturity (horizontal axis), with stages labeled (Platform), (Application), (Transaction)]


What We are Learning with APM Advanced Behavioral Analytics


Advanced Behavioral Analytics Logical Architecture

[Diagram: APM Cluster (5M metrics); 100k metrics (via RegEx); Anomaly engine; Anomalies; Alerts]

Why only 100k metrics? Why not 5M?


RegEx = Regular Expression

analytics.metricfeed.process.3 = Custom Metric Host (Virtual)\\|Custom Metric Process (Virtual)\\|Custom Business Application Agent (Virtual)

analytics.metricfeed.metric.3 = By Business Service\\|[^|]+\\|[^|]+\\|[^|]+:.+


RegEx is hard … but easy to validate.
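A minimal sketch of such a validation, using the metric expression from the previous slide. It assumes the properties file is read Java-style, so the doubled backslash (`\\|`) reaches the regex engine as `\|` (a literal pipe), and it assumes the whole metric path must match. The sample metric paths are invented for illustration and are not taken from a real agent.

```python
import re

# In the properties file the backslash is doubled (\\|); the regex engine
# then sees \|, which matches the literal pipe between metric path segments.
METRIC_PATTERN = re.compile(r"By Business Service\|[^|]+\|[^|]+\|[^|]+:.+")

# Hypothetical metric paths, invented for this check.
SAMPLES = {
    "By Business Service|Trading|Place Order|Browser:Average Response Time (ms)": True,
    "By Business Service|Trading|Place Order": False,  # too few segments, no metric name
    "Frontends|Apps|Trading:Responses Per Interval": False,  # different top-level node
}


def matches(path: str) -> bool:
    """True when the entire metric path satisfies the expression."""
    return METRIC_PATTERN.fullmatch(path) is not None


if __name__ == "__main__":
    for path, expected in SAMPLES.items():
        status = "ok" if matches(path) == expected else "UNEXPECTED"
        print(f"{status:10} match={matches(path)!s:5} {path}")
```

Checking a handful of paths that should match, plus a few that should not, catches most mistakes before the expression goes into the configuration.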


[Chart: Metricfeed.3; series metricfeed.3; horizontal axis 1 to 12, vertical axis 0 to 200]


Suspects Identified via Baseline Technique

[Chart: Suspects via baseline techniques, average RT only; horizontal axis 1 to 6, vertical axis 0 to 18]


Metric Count TypeView


What is an Application?

Front-ends

– Browser? Webservice? Messaging?

Back-ends

– Databases, Webservices, Messaging, Mainframes, Trading_Partners

Muck-in-the-middle

– Software quality, stability and scalability

We want to identify KPIs for each of these elements:

– Helps us build a useful dashboard for operations

– Helps expose what the resources are really doing

– Helps us define acceptance criteria, to act proactively

– Helps us to triage really effectively


How to Find KPIs


Capacity KPIs – “Tree Rings”


Performance KPIs

High-volume + significant response time


Create a simple alert and threshold (ConnectionStatus).


Create a simple alert, find restart and threshold (MetricCount).

“UP” – but not actually doing anything!!!
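The two availability checks can be sketched as a small classifier. This is an illustration under assumptions, not the product's alerting logic: the `AgentSample` structure, the state names, the threshold, and the restart heuristic (a sharp drop in metric count between consecutive intervals) are all hypothetical stand-ins for the ConnectionStatus and MetricCount metrics named on the slides.

```python
from dataclasses import dataclass


@dataclass
class AgentSample:
    """One reporting interval for an agent (hypothetical structure)."""
    connected: bool    # stands in for the ConnectionStatus metric
    metric_count: int  # stands in for the MetricCount metric


def availability_state(sample: AgentSample, min_metric_count: int) -> str:
    """Classify an agent as DOWN, UP_BUT_IDLE, or OK."""
    if not sample.connected:
        return "DOWN"
    if sample.metric_count < min_metric_count:
        # Connected, but reporting too few live metrics to be doing real work.
        return "UP_BUT_IDLE"
    return "OK"


def find_restarts(metric_counts: list[int], drop_ratio: float = 0.5) -> list[int]:
    """Indices where the count falls below drop_ratio times the previous value."""
    return [
        i
        for i in range(1, len(metric_counts))
        if metric_counts[i] < drop_ratio * metric_counts[i - 1]
    ]


if __name__ == "__main__":
    print(availability_state(AgentSample(True, 40), min_metric_count=500))
    print(find_restarts([1200, 1210, 150, 600, 1190]))
```

The `UP_BUT_IDLE` case is the one a ConnectionStatus alert alone would miss.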


Understanding Your Environment

Identify the KPIs.

– Availability

Agent ConnectionStatus

Number of live metrics (MetricCount)

– Performance

High-volume components with significant response time

– NOT “Top 10 Response Time”

– Capacity

Highest-volume components

Don’t wait for production.

– Make it part of your pre-production review.

– Manage the application lifecycle by trending KPIs.
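The performance and capacity selection rules above can be sketched as two filters over per-component statistics. The `Component` structure, field names, thresholds, and sample data are assumptions made for illustration; how volume and response time are actually exported from APM is not covered in the slides.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Per-component statistics (hypothetical structure and field names)."""
    name: str
    responses_per_interval: float  # invocation volume
    avg_response_time_ms: float


def performance_candidates(components, min_volume, min_response_ms):
    """High-volume components that also have significant response time."""
    picked = [
        c for c in components
        if c.responses_per_interval >= min_volume
        and c.avg_response_time_ms >= min_response_ms
    ]
    return sorted(picked, key=lambda c: c.responses_per_interval, reverse=True)


def capacity_candidates(components, top_n):
    """The highest-volume components, regardless of response time."""
    ranked = sorted(components, key=lambda c: c.responses_per_interval, reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    sample = [  # invented numbers
        Component("login servlet", 4000, 350),
        Component("nightly report", 2, 9000),  # slowest, but rarely invoked
        Component("health ping", 9000, 3),     # busiest, but trivially fast
        Component("order servlet", 2500, 800),
    ]
    print([c.name for c in performance_candidates(sample, 1000, 100)])
    print([c.name for c in capacity_candidates(sample, 2)])
```

In this made-up data a "Top 10 Response Time" list would be led by the nightly report, which is slow but almost never called, so it is excluded from the performance KPIs.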


KPI Evolution

Good: Stalls; GC settings; Concurrency; Memory management (graph)

– Platform: coarse information ... but not really APM

Better (additional): Availability – connected status; Availability – metric count; Suspect performance; Suspect capacity

Best (additional): Errors; Key resource performance; Business transaction survey

– Application, transactions, resources: The APM Advantage


How to Generate a Custom ABA Configuration


Details are on the community site as blog updates.

Search on each of the following keywords:

– “average”, “responses”, “errors”, “Stalls”, “Stalled”

Copy each result to a text file (notepad is best).

Feed the files to ./build_config.py.

Copy the resulting Regular Expressions to your Analytics.properties file.

– 96:: hot property – changes detected in about a minute

– 95:: recycle MOM
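The slides do not show what build_config.py does internally, so the following is only a guess at the general idea, not the author's script: take copied metric paths, keep those containing the search keywords, escape the regex special characters, and emit property lines with backslashes doubled for the properties file. The function names, the one-regex-per-path approach, and the sample path are hypothetical; only the `analytics.metricfeed.metric.N` key format and the keywords come from the slides, and the matching `process` entries are not generated here.

```python
# Characters escaped so a metric path is treated as a literal by the regex engine.
REGEX_SPECIALS = set("\\|()[]{}.*+?^$")

# Search keywords listed on the slide, lowercased for comparison.
KEYWORDS = ("average", "responses", "errors", "stalls", "stalled")


def escape_literal(text: str) -> str:
    """Backslash-escape regex special characters in a metric path."""
    return "".join("\\" + ch if ch in REGEX_SPECIALS else ch for ch in text)


def to_property_lines(metric_paths, start_index=1):
    """Turn matching metric paths into analytics.metricfeed.metric.N lines."""
    lines = []
    index = start_index
    for raw in metric_paths:
        path = raw.strip()
        if not any(keyword in path.lower() for keyword in KEYWORDS):
            continue  # not one of the KPI-style metrics searched for
        regex = escape_literal(path)
        # Properties files need each backslash doubled.
        lines.append(
            f"analytics.metricfeed.metric.{index} = " + regex.replace("\\", "\\\\")
        )
        index += 1
    return lines


if __name__ == "__main__":
    copied = [  # invented example paths
        "Frontends|Apps|Trading:Average Response Time (ms)",
        "GC Heap:Bytes In Use",
    ]
    print("\n".join(to_property_lines(copied)))
```

Whatever tool produces the expressions, each one can be checked against a few known metric paths before it is copied into Analytics.properties.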


SORT on this column


<CTRL><A><CTRL><C>

<CTRL><V>


Example Execution


Resources

Community site

– Cookbook: APM HealthCheck

– Understanding Which Metrics Matter (KPI discussion)

– Cookbook: Application Audit

More details on the baseline techniques and process

– Blog entries

Redefine Triage by Learning the Golden Nuggets of APM

What are KPIs and how can I get some quick?!

Big Data - What does it mean for APM

Why Does ABA Find Anomalies When There Is Nothing Wrong In Production?

APM Best Practices – Realizing Application Performance Management

– Available on Amazon.com and Apress.com

Baselines, Test Plans, App Audits, Triage, Firefighting

Organizational Models, Service Catalogs


For More Information

To learn more about DevOps, please visit:

http://bit.ly/1wbjjqX



For Informational Purposes Only

Terms of this Presentation

This presentation provided at CA World 2014 is intended for information purposes only and does not form any type of warranty. Some of the specific slides with customer references relate to customer's specific use and experience of CA products and solutions so actual results may vary.

© 2014 CA. All rights reserved. All trademarks referenced herein belong to their respective companies.