Anomaly Detection and Root Cause Analysis in Distributed Application Transactions

Yuchen Zhao @


Software is Eating the World


It’s critical to make sure the software is running properly.


How? Through monitoring!


Monitoring shouldn’t be very hard… right?


Well, it can become a bit more complex...


Or… really complex...


Keeping applications running is hard.


Challenge 1: Enterprise applications are complex.


Challenge 2: Data is heterogeneous.

Its volume is massive and growing.


Challenge 3: Too many signals.

Finding anomalies & root causes is non-trivial.


Our solution: Relevant Fields

Machine Learning + Engineering


Q1: How to get & organize data?


Collect data in the form of Business Transactions


Q2: Can you give a real use case?


A hypothetical travel booking site with data collected as Business Transactions (BTs)


An unexpected incident:


Step 1: filtering


Step 2: find relevant fields


The relevancy score

“airline:AA” related transactions:

● 2% occurrence normally among all travel bookings

● 82% of the current slow transactions are from “AA”.

● 41 times more significant than normal.
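
The three figures above are consistent with a simple frequency ratio. As a worked check (the symbols $p_{\text{query}}$ and $p_{\text{baseline}}$ are labels introduced here; the slide does not state the formula explicitly):

```latex
\[
\frac{p_{\text{query}}}{p_{\text{baseline}}} = \frac{0.82}{0.02} = 41
\]
```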


What’s the root cause?


Step 3: take actions!


Q3: What’s the design of the system?


Architecture Overview


Data Collection


Smart Code Instrumentation

watch every line of code, self-learning, automatic


Stream Processing & Storage


Relevant Fields Processing


Baseline & Query Sets

all transactions (baseline)

error/slow transactions (query)


Q4: How to score the field?


all transactions (baseline)

error/slow transactions (query)
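
One way to picture scoring against these two sets is the sketch below. It assumes the score is the share of a field value in the query set divided by its share in the baseline, which matches the travel-booking numbers earlier in the talk but is not confirmed as the exact production formula. The record layout, the `SLOW_MS` threshold, and the function names are all hypothetical.

```python
from collections import Counter

def pair_counts(transactions):
    """Count how many transactions carry each (field, value) pair."""
    counts = Counter()
    for fields in transactions:
        counts.update(fields.items())
    return counts

def relevancy_scores(baseline, query):
    """Ratio of a pair's share in the query set to its share in the baseline."""
    base_counts = pair_counts(baseline)
    query_counts = pair_counts(query)
    scores = {}
    for pair, q_n in query_counts.items():
        b_n = base_counts[pair]
        if b_n == 0:
            continue  # never seen in the baseline, so the ratio is undefined
        # (q_n / len(query)) / (b_n / len(baseline)), rearranged to one division
        scores[pair] = (q_n * len(baseline)) / (b_n * len(query))
    return scores

# Hypothetical records: (fields, latency in ms), sized to match the 2% and 82% figures.
records = (
    [({"airline": "AA"}, 9000)] * 41
    + [({"airline": "AA"}, 200)] * 9
    + [({"airline": "UA"}, 9000)] * 9
    + [({"airline": "UA"}, 200)] * 2441
)
SLOW_MS = 1000  # assumed threshold for a "slow" transaction

baseline = [fields for fields, _ in records]                # all transactions
query = [fields for fields, ms in records if ms > SLOW_MS]  # slow transactions only
scores = relevancy_scores(baseline, query)
top = max(scores, key=scores.get)
print(top, scores[top])
```

With this made-up data, "airline:AA" appears in 50 of 2500 baseline transactions and 41 of 50 slow ones, so it ranks first with a score of 41.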


Optimization: Dynamic Baseline


Infer baseline context from query automatically

[Diagram labels: query transactions; transactions of Entity 1, Entity 2, ..., Entity n]


The baseline entity is learned automatically along two dimensions:

● physical (applications, tiers, nodes, etc)

● temporal
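
As a rough illustration of the two dimensions, the sketch below narrows the baseline to transactions that share the query's physical entity and fall in a time window leading up to the query. It uses a fixed rule rather than any learning method, so it is a simplification of what the slides describe. The `Txn` fields, the `infer_baseline` name, and the `lookback_s` parameter are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Txn:
    application: str
    tier: str
    timestamp: float  # seconds; any consistent clock works
    slow: bool

def infer_baseline(all_txns, query_txns, lookback_s=3600.0):
    """Keep transactions that share the query's entities and time range."""
    if not query_txns:
        return []
    # Physical dimension: the (application, tier) pairs seen in the query.
    entities = {(t.application, t.tier) for t in query_txns}
    # Temporal dimension: from lookback_s before the first query transaction
    # up to the last query transaction.
    start = min(t.timestamp for t in query_txns) - lookback_s
    end = max(t.timestamp for t in query_txns)
    return [
        t for t in all_txns
        if (t.application, t.tier) in entities and start <= t.timestamp <= end
    ]

all_txns = [
    Txn("booking", "web", 1000.0, False),
    Txn("booking", "web", 3900.0, False),
    Txn("booking", "web", 4000.0, True),
    Txn("search", "web", 3950.0, False),   # different application
    Txn("booking", "web", 100.0, False),   # outside the lookback window
]
query = [t for t in all_txns if t.slow]
baseline = infer_baseline(all_txns, query)
print(len(baseline))
```

Narrowing the baseline this way keeps unrelated applications or stale traffic from diluting the comparison with the query set.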


Score Normalization

Normalize the score using a function derived from the sigmoid function.
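
The exact normalization function is not reproduced in this transcript. The sketch below shows one common sigmoid-derived form, a rescaled logistic function, purely as an assumption of what such a normalization could look like. The function name and the steepness parameter `k` are hypothetical.

```python
import math

def normalize_score(raw, k=0.1):
    """Squash a nonnegative raw score toward the 0-to-1 range.

    Uses 2 * sigmoid(k * raw) - 1, so a raw score of 0 maps to 0 and
    large raw scores approach 1. This form is an assumption, not the
    function used in the talk.
    """
    if raw < 0:
        raise ValueError("raw score must be nonnegative")
    sigmoid = 1.0 / (1.0 + math.exp(-k * raw))
    return 2.0 * sigmoid - 1.0

for raw in (0, 1, 41):
    print(raw, normalize_score(raw))
```

A bounded score like this makes field values comparable on one scale even when their raw ratios differ by orders of magnitude.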


Score Example


For more details, please check out our demo paper in ICDM 2015:

Discovering Anomalies and Root Causes in Applications via Relevant Fields Analysis, in Proceedings of the 15th IEEE International Conference on Data Mining


Ongoing work...

Support rich data types

● time series

● text

● graphs

● ...


We’re selling!


We’re hiring too!

Contact Mara or me [email protected]


Thank you!