Upload
andreas-grabner
View
1.312
Download
0
Embed Size (px)
Citation preview
And other Tips & Tricks to make you a “Performance Expert”More @ http://blog.dynatrace.com – Tools @ http://bit.ly/dtpersonal
Andreas Grabner - @grabnerandi
Deep Dive Into Top Performance Mistakes
Why Performanc
e?Confidential, Dynatrace, LLC
700 deployments / YEAR
10 + deployments / DAY
50 – 60 deployments / DAY
Every 11.6 SECONDS
Not only fast delivered but also delivering fast!
-1000ms +2%
Response Time Conversions
-1000ms +10%
+100ms -1%
#1: Which Geo has which “User Experience”?
#2: Who are these users?
Daily Deployments + Mkt Push
Increase # of unhappy users!
Drop in Conversion Rate
Overall increase of Users!
Satisfied Users Click more Content
Tolerating Users click less content
Frustrated Users mainly click on Support
Update of Dependency Injection Library impacts Memory & CPU
App with Regular Load supported by
10 ContainersTwice the Load but 48 (=4.8x!) Containers! App doesn’t scale!!
Does it really scale?
How to analyze
perf?Confidential, Dynatrace, LLC
Time: Wall Clock, CPU, I/O, Wait/Sync, Susp, Page Load
Throughput: # of Requests per Timeinterval
Resources: CPU Cycles, Memory, I/O, Log Messages, ...
Pools and Queues: Sizes, Utilization, Acquisition Time, # Publishers vs # Subscribers, Process Time
Interactions: # SQLs, # Messages, # Services, # Images, # CSS
Errors: Exceptions, HTTPs, TCP Packet Loss
AND MANY MORE
0.02ms
0.01ms
https://dynatrace.github.io/ufo/
“In Your Face” Data!
Where do your Stories come
from?
Share Your PurePath - http://bit.ly/sharepurepath
3rd parties
Akamai
Cloudfront
Synthetic
Apache
IIS
Node.js
nginx
Java
.NET
PHP
IBM
WMQ
ESBsMongoDB
Hbase
Cassandra
CICs
IMS
ORACLE
MSSQL
MySQL
DB2
Mobile
Collector
Plugins
Dynatrace Server
Hosts
Session Storage
Splunk
Elasticsearch
Solr
Rich Client
Web Interface
Web
Dev/Arch
Arch Validation
Method Level Hotspots
Every SQL + Bind
+ Exceptions, Logs, Memory Allocation, Threads, Actual Code ...
Export & Share
Share Your PurePath - http://bit.ly/sharepurepath
20%80%
Frontend PerformanceWe are getting FATer!
Mobile landing page of Super Bowl ad
434 Resources in total on that page:230 JPEGs, 75 PNGs, 50 GIFs, …
Total size of ~ 20MB
Fifa.com during Worldcup
Source: http://apmblog.compuware.com/2014/05/21/is-the-fifa-world-cup-website-ready-for-the-tournament/
8MB of background image for STPCon (Word Press)
Time of D
eployment
Availability dropped to 0%
Availability And Response Time
Tip for handling Spike Load: GO LEAN!!
Response time improved 4x
1h before SuperBowl KickOff
1h after Game ended
Make F12 or Browser Agent your friend!
Key Metrics# of ResourcesSize of ResourcesTotal Size of ContentHTTP 3xx, 4xx, 5xx# of Domains
Backend PerformanceThe Usual Suspects
• Symptoms• HTML takes between 60 and 120s to render• High GC Time
• Developer Assumptions• Bad GC Tuning• Probably bad Database Performance as rendering was simple
• Result: 2 Years of Finger pointing between Dev and DBA
Project: Online Room Reservation System
Developers built own monitoringvoid roomreservationReport(int officeId){ long startTime = System.currentTimeMillis(); Object data = loadDataForOffice(officeId); long dataLoadTime = System.currentTimeMillis() - startTime; generateReport(data, officeId);}
Result:Avg. Data Load Time: 45s!
DB Tool says:Avg. SQL Query: <1ms!
#1: Loading too much data
24889! Calls to the Database API!
High Memory Usage results in GC resulting to high GC to keep all
data in Memory
#2: On individual connections 12444! individual
connections
Classical N+1 Query Problem
Individual SQL really <1ms
#3: Putting all data in temp Hashtable
Lots of time spent in Hashtable.get
Called from their Entity Objects
• … you know what code is doing you inherited!!• … you are not making mistakes like this
• Explore the Right Tools• Built-In Database Analysis Tools• “Logging” options of Frameworks such as Hibernate, …• JMX, Perf Counters, … of your Application Servers• Performance Tracing Tools: Dynatrace, Ruxit, NewRelic,
AppDynamics, Your Profiler of Choice …
Lessons Learned – Don’t Assume …
Key Metrics# of SQL Calls# of same SQL Execs (1+N)# of ConnectionsRows/Data Transferred
LoggingWE CAN LOG THIS!!
Or we just throw a lot ofExceptions
LOG
Log Hotspots in Frameworks!callAppenders clear CPU and I/O Hotspot
Excessive logging through Spring Framework
Debug Log and outdated log4j library#1: Top Problem: log4j.callAppenders
-> 71% Sync Time
#2: Most of logging done from fillDetail method
#3: Doing “DEBUG” log output: Is this necessary?
Overhead caused by ExceptionsfillInStackTrace is Top 2 in CPU Hotspots
All these Exceptions that never show up in a log file are consuming all CPU
Too Many Exceptions vs Log Messages
2-5 Log Messages per 5 MinLooking at the important
(SEVERE, FATAL, …) log messages written
Up to 20000 Custom ExceptionsThat’s about 4000x the number of Exceptions per Log Message
Key Metrics
# of Log EntriesSize of Logs per Use Case
Pools & QueuesProper Sizing!!
Wrong Pool Sizes Configured
Do we have enough DB CONNECTIONS per pool?
Threading Issues
Threading Issues (Analysis) Tip: I like the Thread Column as it tells me where we spawn off async threads and
where the “main threads” might be waiting
Sync / Wait1.63s in Object.wait
Means that this thread is put to hold
Waiting on the next Connection to become
available!
Key Metrics
Pool and Queue SizesTime in Sync & Wait
(Micro)ServicesArchitectural Mistakes with „Migrating“ to (Micro)Services
Example #2: Online Sports Club Search Service
2015201420xx
Response Time
2016+
1) Started as a small project
2) Slowly growing user base
3) Expanding to new markets –
1st performance degradation!
4) Adding more markets – performance becomes
a business impact Users
4) Potentially start loosing users
Early 2015: Monolithic App
Can‘t scale vertically endlessly!
2.68s Load Time
94.09% CPU Bound
Proposal: Service approach!
Front Endto Cloud
Scale Backendin Containers!
7:00 a.m.Low Load and Service runningon minimum redundancy
12:00 p.m.Scaled up service during peak loadwith failover of problematic node
7:00 p.m.Scaled down again to lower loadand move to different geo location
Testing the Backend Service alone scales well …
Go live – 7:00 a.m.
Go live – 12:00 p.m.
What Went Wrong?
26.7s Load Time5kB Payload
33! Service Calls
99kB - 3kB for each call!
171! Total SQL Count
Architecture ViolationDirect access to DB from frontend service
Single search query end-to-end
The fixed end-to-end use case“Re-architect” vs. “Migrate” to Service-Orientation
2.5s (vs 26.7) 5kB Payload
1! (vs 33!) Service Call
5kB (vs 99) Payload!
3! (vs 177) Total SQL Count
You measure it! from Dev (to) Ops
Build 17 testNewsAlert OK
testSearch OK
Build # Use Case Stat # API Calls # SQL Payload CPU
1 5 2kb 70ms
1 3 5kb 120ms
Use Case Tests and Monitors Service & App Metrics
Build 26 testNewsAlert OK
testSearch OK
Build 25 testNewsAlert OK
testSearch OK
1 4 1kb 60ms
34 171 104kb 550ms
Ops#ServInst Usage RT
1 0.5% 7.2s
1 63% 5.2s
1 4 1kb 60ms
2 3 10kb 150ms
1 0.6% 4.2s
5 75% 2.5s
Build 35 testNewsAlert -
testSearch OK
- - - -
2 3 10kb 150ms
- - -
8 80% 2.0s
Metrics from and for Dev(to)Ops
Re-architecture into „Services“ + Performance Fixes
Scenario: Monolithic App with 2 Key Features
Key Metrics# of Service CallsPayload of Service Calls# of Involved Threads1+N Service Call Pattern!
Tips & TricksAnd more Metrics of course
Tip: Layer Breakdown over Time
With increasing load: Which LAYER doesn’t SCALE?
Tip: Exceptions and Log Messages
How are # of EXCEPTIONS evolving over time?
How many SEVERE LOG messages to we write in relation to Exceptions?
Tip: Failed Transactions
Are more TRANSACTIONS FAILING (HTTP 5xx, 4xx, …)
under heavier load?
Tip: Database Activity
Do we see increased in AVG # of SQL Executions over Time?
Do TOTAL # of SQL Executions increase with load? Shouldn’t
it flatten due to CACHES?
Tip: Database History Dashboard
How many SQL Statements are PREPARED?
What’s the overall Execution Time of different SQL Types (SELECT, INSERT, DELETE, …)
For more Key Metricshttp://blog.dynatrace.com
http://blog.ruxit.com
Questions and/or DemoSlides: slideshare.net/grabnerandiGet Tools: bit.ly/dtpersonalYouTube Tutorials: bit.ly/dttutorialsContact Me: [email protected] Me: @grabnerandiRead More: blog.dynatrace.com
Andreas GrabnerDynatrace Developer Advocate@grabnerandihttp://blog.dynatrace.com