Top Java Performance Problems and Metrics To Check in Your Pipeline

And other Tips & Tricks to make you a “Performance Expert”

More @ http://blog.dynatrace.com
Tools @ http://bit.ly/dtpersonal

Andreas Grabner - @grabnerandi

Deep Dive Into Top Performance Mistakes

Why Performance?

Confidential, Dynatrace, LLC

700 deployments / YEAR

10 + deployments / DAY

50 – 60 deployments / DAY

Every 11.6 SECONDS

Not only fast delivered but also delivering fast!

-1000ms +2%

Response Time Conversions

-1000ms +10%

+100ms -1%

#1: Which Geo has which “User Experience”?

#2: Who are these users?

Daily Deployments + Mkt Push

Increase # of unhappy users!

Drop in Conversion Rate

Overall increase of Users!

Satisfied Users Click more Content

Tolerating Users click less content

Frustrated Users mainly click on Support

Update of Dependency Injection Library impacts Memory & CPU

App with Regular Load supported by 10 Containers

Twice the Load but 48 (=4.8x!) Containers! App doesn’t scale!!

Does it really scale?

How to analyze perf?

Time: Wall Clock, CPU, I/O, Wait/Sync, Susp, Page Load

Throughput: # of Requests per Time Interval

Resources: CPU Cycles, Memory, I/O, Log Messages, ...

Pools and Queues: Sizes, Utilization, Acquisition Time, # Publishers vs # Subscribers, Process Time

Interactions: # SQLs, # Messages, # Services, # Images, # CSS

Errors: Exceptions, HTTPs, TCP Packet Loss

AND MANY MORE
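To make the difference between Wall Clock and CPU time concrete, here is a minimal sketch using the JDK's `ThreadMXBean`. The class name `WallVsCpu` and the helper `sleepQuietly` are made up for this illustration; a sleeping (waiting) thread should show high wall-clock time but very little CPU time.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

class WallVsCpu {
    // Returns {wallNanos, cpuNanos} for running the task on the current thread.
    // cpuNanos is -1 when the JVM does not support per-thread CPU time.
    static long[] measure(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = threads.isCurrentThreadCpuTimeSupported();
        long cpuStart = cpuSupported ? threads.getCurrentThreadCpuTime() : -1;
        long wallStart = System.nanoTime();
        task.run();
        long wall = System.nanoTime() - wallStart;
        long cpu = cpuSupported ? threads.getCurrentThreadCpuTime() - cpuStart : -1;
        return new long[] {wall, cpu};
    }

    // Sleeping stands in for Wait/Sync or I/O time: wall clock advances, CPU does not.
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        long[] result = measure(() -> sleepQuietly(100));
        System.out.println("wall ms: " + result[0] / 1_000_000);
        System.out.println("cpu  ms: " + (result[1] < 0 ? "n/a" : String.valueOf(result[1] / 1_000_000)));
    }
}
```

A large gap between the two numbers usually points at waiting (I/O, locks, pools) rather than at computation.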

0.02ms

0.01ms

https://dynatrace.github.io/ufo/

“In Your Face” Data!

Where do your Stories come from?

Share Your PurePath - http://bit.ly/sharepurepath

[Architecture diagram. Labels: 3rd parties, Akamai, Cloudfront, Synthetic, Apache, IIS, Node.js, nginx, Java, .NET, PHP, IBM WMQ, ESBs, MongoDB, HBase, Cassandra, CICS, IMS, Oracle, MSSQL, MySQL, DB2, Mobile, Collector, Plugins, Dynatrace Server, Hosts, Session Storage, Splunk, Elasticsearch, Solr, Rich Client, Web Interface, Web]

Dev/Arch

Arch Validation

Method Level Hotspots

Every SQL + Bind

+ Exceptions, Logs, Memory Allocation, Threads, Actual Code ...

Export & Share

20% / 80%

Frontend Performance

We are getting FATer!

Mobile landing page of Super Bowl ad

434 Resources in total on that page: 230 JPEGs, 75 PNGs, 50 GIFs, …

Total size of ~ 20MB

FIFA.com during the World Cup

Source: http://apmblog.compuware.com/2014/05/21/is-the-fifa-world-cup-website-ready-for-the-tournament/

8MB of background image for STPCon (WordPress)

Time of Deployment

Availability dropped to 0%

Availability And Response Time

Tip for handling Spike Load: GO LEAN!!

Response time improved 4x

1h before SuperBowl KickOff

1h after Game ended

Make F12 or Browser Agent your friend!

Key Metrics

• # of Resources
• Size of Resources
• Total Size of Content
• HTTP 3xx, 4xx, 5xx
• # of Domains
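As a rough sketch of how these frontend metrics could be computed in a pipeline step, the example below summarizes a list of downloaded resources. The `PageResource` record and the `FrontendMetrics` class are hypothetical names for this illustration, and the input would have to come from your own source (for example a browser export). It uses Java records, so it assumes Java 16 or later.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical description of one resource loaded by a page (not a Dynatrace API).
record PageResource(String url, int httpStatus, long bytes) {}

class FrontendMetrics {
    static int resourceCount(List<PageResource> resources) {
        return resources.size();
    }

    static long totalBytes(List<PageResource> resources) {
        return resources.stream().mapToLong(PageResource::bytes).sum();
    }

    // Number of distinct host names the page pulls content from.
    static long domainCount(List<PageResource> resources) {
        return resources.stream()
                .map(r -> URI.create(r.url()).getHost())
                .distinct()
                .count();
    }

    // Groups responses into status classes such as "2xx", "3xx", "4xx", "5xx".
    static Map<String, Long> statusClasses(List<PageResource> resources) {
        return resources.stream().collect(Collectors.groupingBy(
                r -> (r.httpStatus() / 100) + "xx", TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<PageResource> page = List.of(
                new PageResource("https://example.com/index.html", 200, 5_000),
                new PageResource("https://cdn.example.com/hero.jpg", 200, 20_000),
                new PageResource("https://cdn.example.com/missing.png", 404, 0));
        System.out.println("# of Resources: " + resourceCount(page));
        System.out.println("Total Size:     " + totalBytes(page) + " bytes");
        System.out.println("# of Domains:   " + domainCount(page));
        System.out.println("HTTP classes:   " + statusClasses(page));
    }
}
```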

Backend Performance

The Usual Suspects

• Symptoms
  • HTML takes between 60 and 120s to render
  • High GC Time

• Developer Assumptions
  • Bad GC Tuning
  • Probably bad Database Performance as rendering was simple

• Result: 2 Years of Finger pointing between Dev and DBA

Project: Online Room Reservation System

Developers built their own monitoring:

    void roomreservationReport(int officeId) {
        long startTime = System.currentTimeMillis();
        Object data = loadDataForOffice(officeId);
        // Wall-clock time of the data load only (report generation is not included)
        long dataLoadTime = System.currentTimeMillis() - startTime;
        generateReport(data, officeId);
    }

Result: Avg. Data Load Time: 45s!

DB Tool says: Avg. SQL Query: <1ms!

#1: Loading too much data

24889! Calls to the Database API!

High Memory Usage to keep all data in Memory, resulting in high GC activity

#2: On individual connections: 12444! individual connections

Classical N+1 Query Problem

Individual SQL really <1ms
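The pattern behind these numbers can be sketched without a real database. In the example below, `CountingRoomDb` is a hypothetical in-memory stand-in that only counts round trips; the SQL in the comments is illustrative and not taken from the project. Each single query is cheap, yet the one-by-one variant issues 1+N of them while the batched variant issues 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a database; it only counts round trips.
class CountingRoomDb {
    private final Map<Integer, List<String>> bookingsByRoom;
    int queryCount = 0;

    CountingRoomDb(Map<Integer, List<String>> bookingsByRoom) {
        this.bookingsByRoom = bookingsByRoom;
    }

    // Simulates: SELECT id FROM rooms WHERE office_id = ?
    List<Integer> findRoomIds() {
        queryCount++;
        return new ArrayList<>(bookingsByRoom.keySet());
    }

    // Simulates: SELECT ... FROM bookings WHERE room_id = ?
    List<String> findBookings(int roomId) {
        queryCount++;
        return bookingsByRoom.getOrDefault(roomId, List.of());
    }

    // Simulates: SELECT ... FROM bookings WHERE room_id IN (...)
    List<String> findBookingsForRooms(List<Integer> roomIds) {
        queryCount++;
        List<String> all = new ArrayList<>();
        for (int roomId : roomIds) {
            all.addAll(bookingsByRoom.getOrDefault(roomId, List.of()));
        }
        return all;
    }
}

class NPlusOneDemo {
    // 1 query for the rooms plus N queries, one per room.
    static int loadOneByOne(CountingRoomDb db) {
        int bookings = 0;
        for (int roomId : db.findRoomIds()) {
            bookings += db.findBookings(roomId).size();
        }
        return bookings;
    }

    // 1 query for the rooms plus 1 query for all their bookings.
    static int loadBatched(CountingRoomDb db) {
        return db.findBookingsForRooms(db.findRoomIds()).size();
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> data = Map.of(
                1, List.of("09:00"), 2, List.of("10:00", "11:00"), 3, List.of("13:00"));
        CountingRoomDb slow = new CountingRoomDb(data);
        CountingRoomDb fast = new CountingRoomDb(data);
        loadOneByOne(slow);
        loadBatched(fast);
        System.out.println("one-by-one queries: " + slow.queryCount);
        System.out.println("batched queries:    " + fast.queryCount);
    }
}
```

Counting calls per use case, as this fake does, is the same idea as the "# of SQL Calls" metric: the count exposes the problem even when every single statement is fast.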

#3: Putting all data in temp Hashtable

Lots of time spent in Hashtable.get

Called from their Entity Objects

Lessons Learned – Don’t Assume …

• … you know what the code you inherited is doing!!
• … you are not making mistakes like this

• Explore the Right Tools
  • Built-In Database Analysis Tools
  • “Logging” options of Frameworks such as Hibernate, …
  • JMX, Perf Counters, … of your Application Servers
  • Performance Tracing Tools: Dynatrace, Ruxit, New Relic, AppDynamics, Your Profiler of Choice …
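As one example of the JMX data the JVM already exposes, the sketch below reads accumulated GC time and GC count through the standard `GarbageCollectorMXBean` interface. The class name `GcTimeProbe` is made up for this illustration; sampling these values before and after a use case is one simple way to check a "High GC Time" suspicion rather than assume it.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcTimeProbe {
    // Sum of accumulated collection time in ms across all collectors.
    // A collector may report -1 (undefined); such values are skipped.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime();
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    // Sum of collection counts across all collectors, skipping undefined values.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long timeBefore = totalGcTimeMillis();
        long countBefore = totalGcCount();

        // Allocate short-lived garbage to give the collector something to do.
        long checksum = 0;
        for (int i = 0; i < 200_000; i++) {
            byte[] chunk = new byte[1024];
            checksum += chunk.length;
        }

        System.out.println("allocated bytes: " + checksum);
        System.out.println("GC runs during use case: " + (totalGcCount() - countBefore));
        System.out.println("GC ms during use case:   " + (totalGcTimeMillis() - timeBefore));
    }
}
```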

Key Metrics

• # of SQL Calls
• # of same SQL Execs (1+N)
• # of Connections
• Rows/Data Transferred

Logging

WE CAN LOG THIS!!

Or we just throw a lot of Exceptions

LOG

Log Hotspots in Frameworks!

callAppenders is a clear CPU and I/O Hotspot

Excessive logging through Spring Framework

Debug Log and outdated log4j library

#1: Top Problem: log4j.callAppenders -> 71% Sync Time

#2: Most of logging done from fillDetail method

#3: Doing “DEBUG” log output: Is this necessary?
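One common way to keep debug output from costing anything when it is switched off is to guard the call or pass the message lazily. The sketch below uses `java.util.logging` from the JDK so it runs without extra libraries; the case in the slides used log4j, so treat `Level.FINE` here as the stand-in for DEBUG. The logger name and the `buildDetail` helper are made up for this illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

class DebugLogGuard {
    static final Logger LOG = Logger.getLogger("reservation.demo"); // hypothetical logger name
    static final AtomicInteger detailBuilds = new AtomicInteger();

    // Stands in for an expensive message build, such as dumping a large entity.
    static String buildDetail(int id) {
        detailBuilds.incrementAndGet();
        return "detail for " + id;
    }

    // The message is built even when FINE is disabled.
    static void unguarded(int id) {
        LOG.fine("Loaded " + buildDetail(id));
    }

    // The message is only built when FINE is enabled.
    static void guarded(int id) {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("Loaded " + buildDetail(id));
        }
    }

    // The Supplier is only evaluated when FINE is enabled.
    static void lazy(int id) {
        LOG.fine(() -> "Loaded " + buildDetail(id));
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO); // debug-level output is off, as it should be in production
        unguarded(1);
        guarded(2);
        lazy(3);
        System.out.println("expensive message builds: " + detailBuilds.get());
    }
}
```

The guard removes the message-building cost only. Whether DEBUG should be enabled at all in production is still the first question to ask.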

Overhead caused by Exceptions

fillInStackTrace is Top 2 in CPU Hotspots

All these Exceptions that never show up in a log file are consuming all CPU
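Two options that are often considered for this kind of hotspot are sketched below: returning an `Optional` for expected misses instead of throwing, and, where an exception type must stay, constructing it without a writable stack trace so that `fillInStackTrace` is not invoked. All class and method names are hypothetical; the four-argument `RuntimeException` constructor is part of the JDK.

```java
import java.util.Map;
import java.util.Optional;

class ExceptionCostDemo {
    // Hypothetical custom exception that skips stack-trace capture.
    static class StacklessLookupException extends RuntimeException {
        StacklessLookupException(String message) {
            // enableSuppression = false, writableStackTrace = false
            super(message, null, false, false);
        }
    }

    static final Map<String, Integer> ROOM_CAPACITY = Map.of("A", 4, "B", 8);

    // Exception as control flow: every miss pays for filling in a stack trace.
    static int capacityOrThrow(String room) {
        Integer capacity = ROOM_CAPACITY.get(room);
        if (capacity == null) {
            throw new IllegalArgumentException("unknown room " + room);
        }
        return capacity;
    }

    // Expected misses are returned as a value; nothing is thrown.
    static Optional<Integer> capacity(String room) {
        return Optional.ofNullable(ROOM_CAPACITY.get(room));
    }

    public static void main(String[] args) {
        System.out.println("Optional miss: " + capacity("Z"));
        System.out.println("regular exception frames:   "
                + new IllegalArgumentException("x").getStackTrace().length);
        System.out.println("stackless exception frames: "
                + new StacklessLookupException("x").getStackTrace().length);
    }
}
```

Stackless exceptions trade diagnosability for CPU, so they are a judgment call; avoiding the throw altogether is usually the cleaner fix.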

Too Many Exceptions vs Log Messages

2-5 Log Messages per 5 Min

Looking at the important (SEVERE, FATAL, …) log messages written

Up to 20000 Custom Exceptions

That’s about 4000x the number of Exceptions per Log Message

Key Metrics

• # of Log Entries
• Size of Logs per Use Case

Pools & Queues

Proper Sizing!!

Wrong Pool Sizes Configured

Do we have enough DB CONNECTIONS per pool?

Threading Issues

Threading Issues (Analysis)

Tip: I like the Thread Column as it tells me where we spawn off async threads and where the “main threads” might be waiting

Sync / Wait: 1.63s in Object.wait

Means that this thread is put on hold, waiting on the next Connection to become available!
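Acquisition time can also be measured from inside the application. The sketch below is a hypothetical fixed-size pool (`MeasuredPool`, not a real connection pool library) that records how long callers wait for a resource and how often they time out, which are the numbers to watch when deciding whether a pool is sized correctly.

```java
import java.util.Collection;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical fixed-size pool that records how long callers wait for a resource.
class MeasuredPool<T> {
    private final BlockingQueue<T> idle;
    final AtomicLong totalWaitNanos = new AtomicLong();
    final AtomicLong timeouts = new AtomicLong();

    MeasuredPool(Collection<T> resources) {
        // The collection must not be empty: the queue capacity has to be at least 1.
        idle = new ArrayBlockingQueue<>(resources.size(), false, resources);
    }

    // Returns a resource, or null if none became available within the timeout.
    T acquire(long timeout, TimeUnit unit) {
        long start = System.nanoTime();
        T resource = null;
        try {
            resource = idle.poll(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        totalWaitNanos.addAndGet(System.nanoTime() - start);
        if (resource == null) {
            timeouts.incrementAndGet();
        }
        return resource;
    }

    void release(T resource) {
        idle.offer(resource);
    }
}

class MeasuredPoolDemo {
    public static void main(String[] args) {
        MeasuredPool<String> pool = new MeasuredPool<>(List.of("conn-1"));
        String first = pool.acquire(10, TimeUnit.MILLISECONDS);
        String second = pool.acquire(10, TimeUnit.MILLISECONDS); // pool is exhausted
        System.out.println("first: " + first + ", second: " + second);
        System.out.println("timeouts: " + pool.timeouts.get());
        System.out.println("total wait ms: " + pool.totalWaitNanos.get() / 1_000_000);
        pool.release(first);
    }
}
```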

Key Metrics

• Pool and Queue Sizes
• Time in Sync & Wait

(Micro)Services

Architectural Mistakes with “Migrating” to (Micro)Services

Example #2: Online Sports Club Search Service

[Chart: Response Time and Users over time (20xx, 2014, 2015, 2016+)]

1) Started as a small project
2) Slowly growing user base
3) Expanding to new markets: 1st performance degradation!
4) Adding more markets: performance becomes a business impact
5) Potentially start losing users

Early 2015: Monolithic App

Can’t scale vertically endlessly!

2.68s Load Time

94.09% CPU Bound

Proposal: Service approach!

Front End to Cloud

Scale Backend in Containers!

7:00 a.m.: Low Load and Service running on minimum redundancy

12:00 p.m.: Scaled up service during peak load with failover of problematic node

7:00 p.m.: Scaled down again to lower load and move to different geo location

Testing the Backend Service alone scales well …

Go live – 7:00 a.m.

Go live – 12:00 p.m.

What Went Wrong?

26.7s Load Time

5kB Payload

33! Service Calls

99kB - 3kB for each call!

171! Total SQL Count

Architecture Violation: Direct access to DB from frontend service

Single search query end-to-end

The fixed end-to-end use case

“Re-architect” vs. “Migrate” to Service-Orientation

2.5s (vs 26.7s) Load Time, 5kB Payload

1! (vs 33!) Service Call

5kB (vs 99kB) Payload!

3! (vs 171!) Total SQL Count

You measure it! from Dev (to) Ops

Use Case Tests and Monitors | Service & App Metrics                | Ops
Build  Use Case       Stat  | # API Calls  # SQL  Payload  CPU    | # ServInst  Usage  RT
17     testNewsAlert  OK    | 1            5      2kb      70ms   | 1           0.5%   7.2s
17     testSearch     OK    | 1            3      5kb      120ms  | 1           63%    5.2s
25     testNewsAlert  OK    | 1            4      1kb      60ms   |
25     testSearch     OK    | 34           171    104kb    550ms  |
26     testNewsAlert  OK    | 1            4      1kb      60ms   | 1           0.6%   4.2s
26     testSearch     OK    | 2            3      10kb     150ms  | 5           75%    2.5s
35     testNewsAlert  -     | -            -      -        -      | -           -      -
35     testSearch     OK    | 2            3      10kb     150ms  | 8           80%    2.0s

Metrics from and for Dev(to)Ops
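A pipeline can act on these metrics with a simple comparison against a baseline build. The sketch below is one possible shape for such a gate, not the way any particular tool implements it: `UseCaseMetrics`, `MetricGate`, and the allowed growth factor of 2.0 are assumptions for illustration. The numbers in `main` are the testSearch rows of Build 17 and Build 25 (1 vs 34 API calls, 3 vs 171 SQL statements). It uses Java records, so it assumes Java 16 or later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-use-case metrics captured by a test run; fields mirror the table columns.
record UseCaseMetrics(String useCase, int apiCalls, int sqlCount, int payloadKb, int cpuMs) {}

class MetricGate {
    // Lists every metric that grew by more than maxFactor compared with the baseline build.
    static List<String> regressions(UseCaseMetrics baseline, UseCaseMetrics current, double maxFactor) {
        List<String> findings = new ArrayList<>();
        check(findings, "# API Calls", baseline.apiCalls(), current.apiCalls(), maxFactor);
        check(findings, "# SQL", baseline.sqlCount(), current.sqlCount(), maxFactor);
        check(findings, "Payload (kb)", baseline.payloadKb(), current.payloadKb(), maxFactor);
        check(findings, "CPU (ms)", baseline.cpuMs(), current.cpuMs(), maxFactor);
        return findings;
    }

    private static void check(List<String> findings, String name, int before, int after, double maxFactor) {
        if (after > before * maxFactor) {
            findings.add(name + ": " + before + " -> " + after);
        }
    }

    public static void main(String[] args) {
        UseCaseMetrics build17 = new UseCaseMetrics("testSearch", 1, 3, 5, 120);
        UseCaseMetrics build25 = new UseCaseMetrics("testSearch", 34, 171, 104, 550);
        List<String> findings = regressions(build17, build25, 2.0);
        findings.forEach(System.out::println);
        if (!findings.isEmpty()) {
            System.out.println("Build should be stopped: functional tests are OK, metrics are not.");
        }
    }
}
```

The point of the gate is that a build can pass every functional test and still be stopped because its architectural metrics regressed.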

Re-architecture into “Services” + Performance Fixes

Scenario: Monolithic App with 2 Key Features

Key Metrics

• # of Service Calls
• Payload of Service Calls
• # of Involved Threads
• 1+N Service Call Pattern!

Tips & Tricks

And more Metrics of course

Tip: Layer Breakdown over Time

With increasing load: Which LAYER doesn’t SCALE?

Tip: Exceptions and Log Messages

How are # of EXCEPTIONS evolving over time?

How many SEVERE LOG messages do we write in relation to Exceptions?

Tip: Failed Transactions

Are more TRANSACTIONS FAILING (HTTP 5xx, 4xx, …) under heavier load?

Tip: Database Activity

Do we see an increase in AVG # of SQL Executions over Time?

Do TOTAL # of SQL Executions increase with load? Shouldn’t it flatten due to CACHES?
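Why a cache should flatten the curve can be shown with a small sketch: the number of simulated SQL executions follows the number of distinct keys, not the number of requests. `CachedLookup` and its `queryDatabase` method are hypothetical; the "database" only counts calls.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class CachedLookup {
    final AtomicInteger sqlExecutions = new AtomicInteger();
    private final Map<Integer, String> cache = new ConcurrentHashMap<>();

    // Stands in for a real SQL call; it only counts executions.
    private String queryDatabase(int clubId) {
        sqlExecutions.incrementAndGet();
        return "club-" + clubId;
    }

    // The database is hit only on the first request for each id.
    String findClub(int clubId) {
        return cache.computeIfAbsent(clubId, this::queryDatabase);
    }

    public static void main(String[] args) {
        CachedLookup lookup = new CachedLookup();
        int requests = 1000;
        for (int i = 0; i < requests; i++) {
            lookup.findClub(i % 10); // 10 distinct ids, requested over and over
        }
        System.out.println("requests:       " + requests);
        System.out.println("SQL executions: " + lookup.sqlExecutions.get());
    }
}
```

If total SQL executions keep growing linearly with load in a real system, the cache is either missing, too small, or bypassed.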

Tip: Database History Dashboard

How many SQL Statements are PREPARED?

What’s the overall Execution Time of different SQL Types (SELECT, INSERT, DELETE, …)?

For more Key Metrics

http://blog.dynatrace.com
http://blog.ruxit.com

Questions and/or Demo

Slides: slideshare.net/grabnerandi
Get Tools: bit.ly/dtpersonal
YouTube Tutorials: bit.ly/dttutorials
Contact Me: agrabner@dynatrace.com
Follow Me: @grabnerandi
Read More: blog.dynatrace.com

Andreas Grabner
Dynatrace Developer Advocate
@grabnerandi
http://blog.dynatrace.com