Copyright © 2012, Oracle and/or its affiliates. All rights reserved.

Continuous Performance Monitoring of a Distributed Application [CON4730]

JavaOne 2013. Copyright © 2013, Oracle and/or its affiliates. All rights reserved.

Continuous Performance Monitoring of a Distributed Application

Ashish Srivastava, Principal Member of Technical Staff
Diana Yuryeva, Senior Member of Technical Staff

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

Session Goal

Arrive at a solution for extreme performance monitoring

§ Components:
– Design patterns
– Tools
§ Qualities:
– Continuous
– Light-weight
– Recordable

Session Agenda

§ Use Case

§ Software Patterns

§ Tools

§ Pitfalls and Advice

§ Q&A

About Us

§ Oracle Billing and Revenue Management Elastic Charging Engine
– 100% real-time charging application
– Java
– Distributed grid
    – Oracle Coherence
    – Oracle NoSQL
– Focus on extreme performance

Operating Conditions

§ Low latency expectations

§ Heavy system load

§ Distributed environment

§ Multi-level software stack

Monitoring Requirements

Functional

§ Detailed insight about performance
– Latency
– Throughput
§ View over time
§ Reporting
§ Bottleneck detection
§ View of system as cohesive unit

Monitoring Requirements

Non-Functional

§ Minimal impact on processing
§ Ease of use
§ Separation of concerns

Approach

How do I address these requirements?

§ Off-the-shelf software not sufficient
§ Custom development needed
– Incorporate monitoring into system
– Collect, analyze and present metrics

Collecting Metrics

Problem overview

§ Goal
– Incorporate metrics collection into general processing
§ Approach
– Enhance domain model with monitoring-related data structures

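Collecting Metrics

Model of sample system

[Diagram: ECE Client, Network Mediation, and three nodes (Node1, Node2, Node3) exchanging a request and a response. Hops between them are labeled A, B, C and A', B', C'.]

Steps shown in the diagram:
― Debatch the requests
― Data lookups
― Apply Tariff
― Save Session
― Prepare Response
― ..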

Collecting Metrics

Solution

[Diagram: the sample system (ECE Client, Network Mediation, Node1, Node2, Node3, request and response, hops A, B, C and A', B', C') with the same processing steps: debatch the requests, data lookups, apply tariff, save session, prepare response.]

[Diagram: an Envelope passes between the Client Node and the Processing Node. Labels shown: Routing Context, Payload, Tracking Context, Chronicler (TimePoints), Stat Reporter (harvest).]
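Read together, the labels on this slide suggest that each request travels in an envelope that carries monitoring data next to the payload, that a chronicler records time points as the request is processed, and that a stat reporter harvests them afterwards. The Java sketch below illustrates that reading only. The names Envelope, Chronicler and TimePoint come from the slide; every method, field and the StatReporter shape are hypothetical, and the routing context and serialization are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** A labeled timestamp taken while a request is processed (hypothetical shape). */
final class TimePoint {
    final String label;
    final long nanos;

    TimePoint(String label, long nanos) {
        this.label = label;
        this.nanos = nanos;
    }
}

/** Records time points for a single request; carried inside the envelope. */
final class Chronicler {
    private final List<TimePoint> points = new ArrayList<>();

    /** Marks a point using the local clock. nanoTime values are only comparable within one JVM. */
    void mark(String label) {
        mark(label, System.nanoTime());
    }

    void mark(String label, long nanos) {
        points.add(new TimePoint(label, nanos));
    }

    List<TimePoint> timePoints() {
        return Collections.unmodifiableList(points);
    }

    /** Nanoseconds between the first and last recorded point, or 0 with fewer than two points. */
    long elapsedNanos() {
        if (points.size() < 2) {
            return 0L;
        }
        return points.get(points.size() - 1).nanos - points.get(0).nanos;
    }
}

/** Wraps the business payload together with its monitoring data. */
final class Envelope<T> {
    final T payload;
    final Chronicler chronicler = new Chronicler();

    Envelope(T payload) {
        this.payload = payload;
    }
}

/** Aggregates what the chroniclers recorded (assumed meaning of "harvest"). */
final class StatReporter {
    private long count;
    private long totalNanos;

    synchronized void harvest(Envelope<?> envelope) {
        count++;
        totalNanos += envelope.chronicler.elapsedNanos();
    }

    synchronized long count() {
        return count;
    }

    synchronized double averageLatencyMillis() {
        return count == 0 ? 0.0 : (totalNanos / 1_000_000.0) / count;
    }
}
```

Keeping the chronicler inside the envelope means the business code only marks points; aggregation stays in the reporter, which matches the separation-of-concerns requirement stated earlier.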

Collecting Metrics

Result

##### Elapsed time = 3600 seconds
##### Avg throughput = 20000 ops/sec
##### Avg latency = 50 ms

Granular Reporting

Problem overview

§ Goal
– I need more granular reporting of performance over time
§ Approach
– Enhance reporting of collected metrics

Granular Reporting

Solution – data structure

[Diagram labels: Chronicler added, Chronicler removed]

― A moving reporting window
― 100% reporting
― Sampled reporting
― Stats exposed over JMX
― Fixed data set for a window
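A reporting window with a fixed data set can be sketched as a handful of counters that are reset when the window closes, so memory use does not depend on traffic. The example below is an illustration under assumptions, not the presenters' implementation: it uses a tumbling (non-overlapping) window rather than whatever "moving" means on the slide, it leaves out JMX exposure and sampling, and the names ReportingWindow and WindowSnapshot are hypothetical.

```java
/** Immutable summary of one closed window. */
final class WindowSnapshot {
    final long count;
    final long minNanos;
    final long maxNanos;
    final double avgNanos;
    final double opsPerSec;

    WindowSnapshot(long count, long minNanos, long maxNanos, long sumNanos, long windowMillis) {
        this.count = count;
        this.minNanos = minNanos;
        this.maxNanos = maxNanos;
        this.avgNanos = count == 0 ? 0.0 : (double) sumNanos / count;
        this.opsPerSec = count * 1000.0 / windowMillis;
    }
}

/** Keeps count, sum, min and max for the current window only. */
final class ReportingWindow {
    private final long windowMillis;
    private long windowStartMillis;
    private long count;
    private long sumNanos;
    private long minNanos;
    private long maxNanos;

    ReportingWindow(long windowMillis, long startMillis) {
        if (windowMillis <= 0) {
            throw new IllegalArgumentException("windowMillis must be positive");
        }
        this.windowMillis = windowMillis;
        reset(startMillis);
    }

    /**
     * Records one latency. If the current window has expired, it is closed first
     * and its snapshot is returned; otherwise the method returns null.
     */
    synchronized WindowSnapshot record(long nowMillis, long latencyNanos) {
        WindowSnapshot closed = null;
        if (nowMillis - windowStartMillis >= windowMillis) {
            closed = new WindowSnapshot(count, minNanos, maxNanos, sumNanos, windowMillis);
            reset(nowMillis);
        }
        count++;
        sumNanos += latencyNanos;
        minNanos = Math.min(minNanos, latencyNanos);
        maxNanos = Math.max(maxNanos, latencyNanos);
        return closed;
    }

    private void reset(long startMillis) {
        windowStartMillis = startMillis;
        count = 0;
        sumNanos = 0;
        minNanos = Long.MAX_VALUE;
        maxNanos = Long.MIN_VALUE;
    }
}
```

A caller would pass each returned snapshot to whatever publishes the statistics, for example a log line or a management bean.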

Granular Reporting

Solution – class diagram

Granular Reporting

Result

§ I can see min/max/avg latency and throughput over time
§ My throughput reporting is quite good: I can see whether I had stable or erratic throughput

Latency Percentile Report

Problem overview

§ Goal
– Latencies are still not detailed enough. I need to know more than the average/min/max latencies
– Need to guarantee that 99.999% of the requests take less than 55ms
§ Approach
– Introduce range bucketing to count latencies

Latency Percentile Report

Solution

§ Pre-defined buckets of latency percentiles
§ Data set does not grow. Each bucket is updated
§ Multiple percentile breakdown
– End-to-end
– Server side processing
– Per batch reporting
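Range bucketing can be illustrated with a fixed array of counters: each request increments one bucket, and a percentile is found by walking the buckets until enough requests have been seen. The sketch below is one possible shape, not the session's code. The 1 ms bucket width, the overflow bucket and the name LatencyBuckets are assumptions made for the example.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Counts latencies in fixed 1 ms buckets; the last bucket absorbs anything slower. */
final class LatencyBuckets {
    private final AtomicLongArray counts;

    LatencyBuckets(int maxMillis) {
        if (maxMillis < 1) {
            throw new IllegalArgumentException("maxMillis must be at least 1");
        }
        // Buckets 0..maxMillis; the memory footprint is fixed at construction time.
        this.counts = new AtomicLongArray(maxMillis + 1);
    }

    /** Increments the bucket for this latency, clamping to the valid range. */
    void record(long latencyMillis) {
        int index = (int) Math.max(0, Math.min(latencyMillis, counts.length() - 1));
        counts.incrementAndGet(index);
    }

    long total() {
        long total = 0;
        for (int i = 0; i < counts.length(); i++) {
            total += counts.get(i);
        }
        return total;
    }

    /**
     * Returns the smallest bucket (in ms) at or below which at least the given
     * percentage of recorded requests fall, or -1 if nothing was recorded.
     */
    long percentileMillis(double percentile) {
        long total = total();
        if (total == 0) {
            return -1;
        }
        long needed = Math.max(1, (long) Math.ceil(total * percentile / 100.0));
        long seen = 0;
        for (int i = 0; i < counts.length(); i++) {
            seen += counts.get(i);
            if (seen >= needed) {
                return i;
            }
        }
        return counts.length() - 1;
    }
}
```

Because only counters are updated, recording stays cheap under load, and separate instances could back the end-to-end, server-side and per-batch breakdowns.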

Latency Percentile Report

Result – printed report

2013-03-21 08:44:29.112 PDT INFO ##### Latency statistics based on percentiles:
Percentile: 0.1, Latency: 1ms, Total Count: 1173148
Percentile: 1.0, Latency: 2ms, Total Count: 100909763
Percentile: 10.0, Latency: 2ms, Total Count: 100909763
Percentile: 95.0, Latency: 26ms, Total Count: 685176664
Percentile: 99.0, Latency: 50ms, Total Count: 713029967
Percentile: 99.5, Latency: 58ms, Total Count: 716355711
Percentile: 99.9, Latency: 78ms, Total Count: 719217619
Percentile: 99.99, Latency: 104ms, Total Count: 719836971
Percentile: 99.999, Latency: 128ms, Total Count: 719897850
Percentile: 100.0, Latency: 169ms, Total Count: 719904814

Latency Percentile Report

Result – heat map

Method Breakdown

Problem overview

§ Goal
– I want to measure the impact of a new method under varying load
– End-to-end latency always ON
– Minimum performance impact
§ Approach
– Method annotations
– Aspect

Method Breakdown

Solution

public enum LabelEnum {
    APPLY_TARIFF,
    ...
    DEBATCH
}

public class ClassToBeTracked {
    @Track(pointLabel = LabelEnum.APPLY_TARIFF)
    private <ReturnObject> method(<Parameters>) {
        ...
    }
}

<pointcut name="scope" expression="within(ClassName)"/>
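The annotation-driven idea can be tried out without a weaver. The sketch below does not reproduce the pointcut-based aspect from the slide: it swaps in a JDK dynamic proxy, which only intercepts interface methods and therefore cannot track private methods as the slide's example does. LabelEnum and @Track(pointLabel = ...) follow the slide; Charging, SimpleCharging and TrackingHandler are hypothetical names.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.EnumMap;
import java.util.Map;

enum LabelEnum { DEBATCH, APPLY_TARIFF }

/** Marks a method whose elapsed time should be recorded under the given label. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Track {
    LabelEnum pointLabel();
}

interface Charging {
    @Track(pointLabel = LabelEnum.APPLY_TARIFF)
    long applyTariff(long units);

    long untracked(long units);
}

final class SimpleCharging implements Charging {
    @Override public long applyTariff(long units) { return units * 2; }
    @Override public long untracked(long units) { return units; }
}

/** Times calls to @Track methods and accumulates nanoseconds per label. */
final class TrackingHandler implements InvocationHandler {
    private final Object target;
    private final Map<LabelEnum, Long> nanosByLabel = new EnumMap<>(LabelEnum.class);

    TrackingHandler(Object target) {
        this.target = target;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        Track track = method.getAnnotation(Track.class);
        if (track == null) {
            return call(method, args); // untracked methods pay no timing cost
        }
        long start = System.nanoTime();
        try {
            return call(method, args);
        } finally {
            long elapsed = System.nanoTime() - start;
            synchronized (nanosByLabel) {
                nanosByLabel.merge(track.pointLabel(), elapsed, Long::sum);
            }
        }
    }

    private Object call(Method method, Object[] args) throws Throwable {
        try {
            return method.invoke(target, args);
        } catch (InvocationTargetException e) {
            throw e.getCause(); // rethrow the target's own exception
        }
    }

    /** Copy of the accumulated nanoseconds per label. */
    Map<LabelEnum, Long> snapshot() {
        synchronized (nanosByLabel) {
            return new EnumMap<>(nanosByLabel);
        }
    }

    static <T> T wrap(Class<T> iface, TrackingHandler handler) {
        Object proxy = Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] {iface}, handler);
        return iface.cast(proxy);
    }
}
```

A caller would create the proxy with TrackingHandler.wrap(Charging.class, new TrackingHandler(new SimpleCharging())) and read snapshot() when building a breakdown report.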

Method Breakdown

Result – detailed breakdown report

2013-07-15 16:29:24.953 PDT Chronicler Breakdown:
DEBATCH -> 64149 nanoseconds
LOOKUP_DATA -> 1056748 nanoseconds
APPLY_TARIFF -> 99994 nanoseconds
SAVE_SESSION -> 12989 nanoseconds
PREPARE_RESPONSE -> 15998 nanoseconds

Storing and Presenting Metrics

Problem overview

§ Goal
– I collect detailed performance metrics, but I need to report them too
– I need a tool which stores these metrics and presents them in a unified view
§ Approach
– Create monitoring dashboard
– Technologies: JRDS, RRD and in-house development

Storing and Presenting Metrics

Result – monitoring dashboard

Configuration:
– Topology: 24 servers
– Throughput: 20000 ops/sec
– Duration: 10 hrs

Storing and Presenting Metrics

Solution qualities

§ Graphical
§ Supports various metrics
– Application-specific
– Machine-specific
– JVM-specific
§ Consolidated view
– All graphs on one page
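As an illustration of a JVM-specific metric, garbage collection counts and times can be read in-process through the standard java.lang.management API. This is a sketch of one possible data source, not a description of how the dashboard in the talk gathers its data; the class name GcMetrics is hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Reads cumulative garbage collection figures for the current JVM. */
final class GcMetrics {

    /** Total number of collections since JVM start; collectors reporting -1 (undefined) are skipped. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds; undefined values are skipped. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("GC count=" + totalCollections()
                + ", GC time=" + totalCollectionMillis() + " ms");
    }
}
```

Sampling these two values at a fixed interval and plotting the differences next to latency graphs makes GC pauses visible alongside application metrics.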

Storing and Presenting Metrics

Solution qualities

§ Easy to use
– Collects and saves data automatically
§ Easy to share
– Includes configuration for future reference
– Send links to web pages
– Print page as PDF

Storing and Presenting Metrics

Solution qualities

§ Stores data without losing precision
§ Supports drilling down
§ Light-weight
§ Customizable

Pitfalls and Advice

Take into consideration

§ Distributed system monitoring != Single JVM monitoring
– Consolidated view is critical
§ Consistency of tools across team is important
– Same language across development, QE and Performance teams saves hours
§ Solution should enable you to be agile
– Run monitoring on laptop AND realistic setup

Pitfalls and Advice

Some things to pay attention to

§ These hide problems
– Averaging
– Sampling
§ GC has a big impact, so include it in your metrics
§ Watch out for processes sharing the same host
§ Always run long-duration tests
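A small made-up data set shows why averaging hides problems: with 99 requests at 10 ms and one at 1000 ms, the average is 19.9 ms, which looks healthy even though one caller waited a full second. The numbers and the class name AveragingDemo are invented for illustration and are not measurements from the talk.

```java
import java.util.Arrays;

/** Compares the average with the maximum for a latency set containing one outlier. */
final class AveragingDemo {

    /** 99 requests at 10 ms and a single request at 1000 ms (made-up values). */
    static long[] sample() {
        long[] latenciesMillis = new long[100];
        Arrays.fill(latenciesMillis, 10L);
        latenciesMillis[99] = 1000L;
        return latenciesMillis;
    }

    static double average(long[] latenciesMillis) {
        return Arrays.stream(latenciesMillis).average().orElse(0.0);
    }

    static long max(long[] latenciesMillis) {
        return Arrays.stream(latenciesMillis).max().orElse(0L);
    }

    public static void main(String[] args) {
        long[] latencies = sample();
        System.out.println("avg=" + average(latencies) + " ms, max=" + max(latencies) + " ms");
    }
}
```

Reporting the maximum or high percentiles next to the average keeps such outliers visible.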

Session Summary

Let's see how we addressed original requirements

§ Detailed insight about performance
– Latency
– Throughput
§ View over time
§ Reporting
§ Bottleneck detection
§ View of system as cohesive unit

Links

§ ECE
– http://www.oracle.com/us/products/applications/communications/elastic-charging-engine
§ JRDS
– http://www.jrds.fr