I know why your Java is slow

@aragozin #Devoxx #WhyJavaSlow

I Know Why Your Java is Slow

Alexey Ragozin



After months of hard work

with tons of new features

well tested and polished

your application is live!

Everything runs smoothly until …


Users start complaining about performance

application has become unusable

stakeholders are getting nervous

You have to fix it!

ASAP


[Diagram: Browser, Server, Application, Database]

[Diagram: the same picture annotated with potential bottlenecks: caching, JavaScript, HTTP, network, SQL performance, CPU, memory / swapping, disk IO, virtualization]


What exactly does "slow" mean? Clarify your KPIs

Business transaction ≠ Page

Business transaction ≠ SQL transaction

Page ≠ HTTP request

HTTP request ≠ SQL transaction

Your system is a black box

Do you think you know how it works?

You do not!

Profiling produces three types of data

Lies – incorrectly interpreted data

Darn lies – incorrectly measured data

Statistics – data which will help to fix your system


What types of bottlenecks can we find in Java?

CPU bound

Single core bottleneck

CPU starvation

Thread contention

Memory and GC related

Frequent young GC

Abnormally long GC pauses

Frequent full GC


Single core bottleneck

A single thread consumes 100% of one CPU core

CPU starvation

Many threads compete for physical cores

A thread neither sleeps nor waits, yet its CPU usage is far below 100%

Thread contention

A thread spends considerable time in synchronization (acquiring locks, waiting)

Frequent young GC

Consumes a sizable share of the CPU budget

Caused by intensive memory allocation in application code
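Thread contention can be observed from inside the JVM. The sketch below is only an illustration: the class `ContentionProbe`, its lock and the worker thread are made up for the demo. It uses the standard `ThreadMXBean` to read how many times a thread had to block on a monitor.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical demo class, not part of any tool mentioned in the slides.
class ContentionProbe {

    private static final Object LOCK = new Object();

    /** Makes a worker thread block on a monitor once and returns its blocked count. */
    static long measureBlockedCount() throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        CountDownLatch done = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            synchronized (LOCK) {
                // contended monitor entry, nothing to do inside
            }
            try {
                done.await(); // stay alive so that the thread can still be inspected
            } catch (InterruptedException ignored) {
                // demo only
            }
        }, "contended-worker");

        synchronized (LOCK) {
            worker.start();
            // Keep the lock until the worker is really stuck on it.
            while (worker.getState() != Thread.State.BLOCKED) {
                Thread.sleep(1);
            }
        }
        // Wait until the worker got through the monitor and parked on the latch.
        while (worker.getState() != Thread.State.WAITING) {
            Thread.sleep(1);
        }
        ThreadInfo info = mx.getThreadInfo(worker.getId());
        long blocked = info.getBlockedCount();
        done.countDown();
        worker.join();
        return blocked;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("worker blocked on a monitor " + measureBlockedCount() + " time(s)");
    }
}
```

The blocked count is always available. Blocked time needs `setThreadContentionMonitoringEnabled(true)` and is not supported on every JVM, which is why the sketch sticks to the counter.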


Thread CPU usage

Number of CPU cycles spent on the thread (translated into time or percentage)

Calculated by the OS (user + kernel)

A single thread can consume at most 100%

Java thread in RUNNABLE state

Not BLOCKED, WAITING, SLEEPING or PARKED

A thread in a blocking socket read is RUNNABLE

Java RUNNABLE

may not be runnable from the OS perspective

may be runnable but not on CPU

may actually be running on CPU (counted toward thread CPU usage)
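The difference between the Java thread state and real CPU consumption can be checked with the standard `java.lang.management` API. A minimal sketch follows; the class name `ThreadCpuSnapshot` and the `spin` helper are illustrative only.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical demo class: prints Java state and CPU time per thread.
class ThreadCpuSnapshot {

    /** Burns CPU on the calling thread for roughly the given number of milliseconds. */
    static long spin(long millis) {
        long deadline = System.nanoTime() + millis * 1_000_000L;
        long counter = 0;
        while (System.nanoTime() < deadline) {
            counter++;
        }
        return counter;
    }

    /** CPU time (user + kernel) of the current thread in nanoseconds, -1 if unsupported. */
    static long currentThreadCpuNanos() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        return mx.isCurrentThreadCpuTimeSupported() ? mx.getCurrentThreadCpuTime() : -1;
    }

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long before = currentThreadCpuNanos();
        spin(200);
        long after = currentThreadCpuNanos();
        System.out.println("CPU used by main thread: " + (after - before) / 1_000_000 + " ms");

        // Java-level state and accumulated CPU time of every live thread.
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);
            if (info == null) {
                continue; // thread has terminated in the meantime
            }
            long cpu = mx.isThreadCpuTimeSupported() ? mx.getThreadCpuTime(id) : -1;
            System.out.printf("[%06d] %-13s cpu=%dms %s%n",
                    id, info.getThreadState(), cpu / 1_000_000, info.getThreadName());
        }
    }
}
```

A thread that is reported as RUNNABLE but whose CPU time barely grows between two snapshots is most likely waiting in native code (for example a socket read) or is not getting scheduled.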


Bird's-eye view of a Java process

using JVisualVM



using JProfiler



using SJK ttop

> java -jar sjk.jar ttop -p 12345 -n 20 -o CPU
2016-11-07T00:09:50.091+0400 Process summary
  process cpu=268.35%
  application cpu=244.42% (user=222.50% sys=21.93%)
  other: cpu=23.93%
  GC cpu=6.28% (young=1.62%, old=4.66%)
  heap allocation rate 1218mb/s
  safe point rate: 1.5 (events/s) avg. safe point pause: 43.24ms
  safe point sync time: 0.03% processing time: 6.39% (wallclock time)
[000056] user=17.45% sys= 0.62% alloc=  106mb/s - hz._hzInstance_2_dev.generic-operation.thread-0
[000094] user=17.76% sys= 0.31% alloc=  113mb/s - hz._hzInstance_3_dev.generic-operation.thread-1
[000093] user=16.83% sys= 0.00% alloc=  111mb/s - hz._hzInstance_3_dev.generic-operation.thread-0
[000020] user=16.06% sys= 0.15% alloc=  108mb/s - hz._hzInstance_1_dev.generic-operation.thread-0
[000021] user=15.44% sys= 0.15% alloc=  110mb/s - hz._hzInstance_1_dev.generic-operation.thread-1
[000057] user=14.36% sys= 0.00% alloc=  110mb/s - hz._hzInstance_2_dev.generic-operation.thread-1
[000105] user=13.59% sys= 0.00% alloc=   72mb/s - hz._hzInstance_3_dev.cached.thread-1
[000079] user=13.43% sys= 0.15% alloc=   67mb/s - hz._hzInstance_2_dev.cached.thread-3
[000042] user=10.96% sys= 0.62% alloc=   65mb/s - hz._hzInstance_1_dev.cached.thread-2
[000174] user=10.65% sys= 0.31% alloc=   66mb/s - hz._hzInstance_3_dev.cached.thread-7
[000123] user= 8.96% sys= 0.00% alloc=   55mb/s - hz._hzInstance_4_dev.response
[000129] user= 7.72% sys= 0.31% alloc=   21mb/s - hz._hzInstance_4_dev.generic-operation.thread-0
[000168] user= 6.95% sys= 0.62% alloc=   33mb/s - hz._hzInstance_1_dev.cached.thread-6
[000178] user= 7.57% sys= 0.00% alloc=   32mb/s - hz._hzInstance_2_dev.cached.thread-9
[000166] user= 6.48% sys= 0.46% alloc=   33mb/s - hz._hzInstance_1_dev.cached.thread-5
[000130] user= 6.02% sys= 0.62% alloc=   20mb/s - hz._hzInstance_4_dev.generic-operation.thread-1
[000181] user= 5.56% sys= 0.00% alloc=   34mb/s - hz._hzInstance_2_dev.cached.thread-12
[000014] user= 2.32% sys= 0.15% alloc= 7345kb/s - hz._hzInstance_1_dev.response

https://github.com/aragozin/jvm-tools






Tracking CPU hogs using sampling

using JProfiler


Tracking CPU hogs using sampling

flame graph with SJK



JEE world example – JBoss + Seam + Hibernate

Command

sjk ssa -f tracedump.std --categorize -tf **.CoyoteAdapter.service -nc
    JDBC=**.jdbc
    Hibernate=org.hibernate
    "Facelets compile=com.sun.faces.facelets.compiler.Compiler.compile"
    "Seam bijection=org.jboss.seam.**.aroundInvoke/!**.proceed"
    JSF.execute=com.sun.faces.lifecycle.LifecycleImpl.execute
    JSF.render=com.sun.faces.lifecycle.LifecycleImpl.render
    Other=**

Report

Category         | Samples | Share
Total samples    | 2732050 | 100.00%
JDBC             |  405439 |  14.84%
Hibernate        |  802932 |  29.39%
Facelets compile |  395784 |  14.49%
Seam bijection   |  385491 |  14.11%
JSF.execute      |  290355 |  10.63%
JSF.render       |  297868 |  10.90%
Other            |  154181 |   5.64%

[Chart (Excel): share of time per category on a 0% to 100% axis; series: JDBC, Hibernate, Facelets compile, Seam bijection, JSF.execute, JSF.render, Other]
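The idea behind such a report can be sketched in a few lines of Java. This is a simplified illustration, not SJK's implementation: the class `TraceCategorizer` is hypothetical, rules are plain frame prefixes instead of SJK's pattern syntax, and the first matching rule wins.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class: buckets sampled stack traces by the first matching rule.
class TraceCategorizer {

    /**
     * @param traces sampled stack traces, each a list of "class.method" frames
     * @param rules  category name -> frame prefix, checked in insertion order
     * @return number of traces per category, unmatched traces go to "Other"
     */
    static Map<String, Integer> categorize(List<List<String>> traces,
                                           LinkedHashMap<String, String> rules) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String category : rules.keySet()) {
            counts.put(category, 0);
        }
        counts.put("Other", 0);

        for (List<String> trace : traces) {
            String matched = "Other";
            for (Map.Entry<String, String> rule : rules.entrySet()) {
                String prefix = rule.getValue();
                if (trace.stream().anyMatch(frame -> frame.startsWith(prefix))) {
                    matched = rule.getKey();
                    break; // first rule wins, so order rules from specific to general
                }
            }
            counts.merge(matched, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up samples, for illustration only.
        List<List<String>> traces = Arrays.asList(
                Arrays.asList("oracle.jdbc.Driver.execute", "org.hibernate.Session.flush"),
                Arrays.asList("org.hibernate.Session.flush", "com.example.Service.save"),
                Arrays.asList("com.example.Service.render"));
        LinkedHashMap<String, String> rules = new LinkedHashMap<>();
        rules.put("JDBC", "oracle.jdbc.");
        rules.put("Hibernate", "org.hibernate.");

        Map<String, Integer> counts = categorize(traces, rules);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.printf("%-10s %d %.2f%%%n",
                    e.getKey(), e.getValue(), 100.0 * e.getValue() / traces.size());
        }
    }
}
```

Rule order matters here: a trace that passes through both Hibernate and JDBC frames is counted as JDBC because that rule comes first.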


Tracking CPU hot spots using thread sampling

PRO

Low overhead

No upfront configuration

Can identify CPU hot spots, IO hot spots, contention

Data are comparable across runs

CON

Limited precision

No real method execution time measured

Safe point bias

Breakdown is limited to static program structure – methods
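To make the trade-offs concrete, here is a toy sampler built on `Thread.getAllStackTraces()`. It is a sketch only: `TinySampler` and `hotLoop` are invented names, and real profilers collect and aggregate far more data. It shares the limits listed above: it sees only what is on the stack at sample time, and the samples are taken at safe points.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class: counts the top frame of RUNNABLE threads over several thread dumps.
class TinySampler {

    static volatile boolean stop;

    /** A deliberately hot method for the sampler to find. */
    static long hotLoop() {
        long x = 0;
        while (!stop) {
            x++;
        }
        return x;
    }

    /** Takes `samples` thread dumps, `intervalMillis` apart, and counts top frames. */
    static Map<String, Integer> sample(int samples, long intervalMillis) throws InterruptedException {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                Thread t = e.getKey();
                StackTraceElement[] stack = e.getValue();
                if (t == Thread.currentThread()
                        || t.getState() != Thread.State.RUNNABLE
                        || stack.length == 0) {
                    continue; // skip the sampler itself and threads that are not running
                }
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                hits.merge(top, 1, Integer::sum);
            }
            Thread.sleep(intervalMillis);
        }
        return hits;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread busy = new Thread(TinySampler::hotLoop, "busy");
        busy.setDaemon(true);
        busy.start();
        Map<String, Integer> hits = sample(50, 10);
        stop = true;
        hits.forEach((frame, count) -> System.out.println(count + "\t" + frame));
    }
}
```

Note that RUNNABLE threads sitting in native IO also show up in the counts, which is how sampling can reveal IO hot spots as well as CPU hot spots.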


Instrumentation profiling

Modifies byte code at runtime

Modifies behavior of the program

Affects JIT compilation

Probes can access method arguments

When should/can instrumentation be used?

You have narrowed down the problem area

You need more context to find the root cause


BTrace

Open Source

Byte code instrumentation

Probes are coded in Java

CLI and API

BTrace script

    // Profiler instance shared by both probes
    @Property
    Profiler prof = Profiling.newProfiler();

    // Fires on entry to Component.inject / disinject / outject
    @OnMethod(clazz = "org.jboss.seam.Component",
              method = "/(inject|disinject|outject)/")
    void entryByMethod2(@ProbeClassName String className,
                        @ProbeMethodName String methodName, @Self Object component) {
        if (component != null) {
            // Read the private "name" field to tell Seam components apart
            Field nameField = field(classOf(component), "name", true);
            if (nameField != null) {
                String name = (String)get(nameField, component);
                Profiling.recordEntry(prof, concat("org.jboss.seam.Component.",
                        concat(methodName, concat(":", name))));
            }
        }
    }

    // Fires on return from the same methods and records the call duration
    @OnMethod(clazz = "org.jboss.seam.Component",
              method = "/(inject|disinject|outject)/",
              location = @Location(value = Kind.RETURN))
    void exitByMthd2(@ProbeClassName String className,
                     @ProbeMethodName String methodName, @Self Object component,
                     @Duration long duration) {
        if (component != null) {
            Field nameField = field(classOf(component), "name", true);
            if (nameField != null) {
                String name = (String)get(nameField, component);
                Profiling.recordExit(prof, concat("org.jboss.seam.Component.",
                        concat(methodName, concat(":", name))), duration);
            }
        }
    }

https://github.com/jbachorik/btrace2



Is garbage collection the one to blame?

Enable GC logs to see the whole picture

-XX:+PrintGCDetails -XX:+PrintReferenceGC

Common problems

JVM does not have enough memory

Intensive allocation of short-lived objects by the application

Reference problems

Large object resurrection by finalizers

Too many references
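GC logs give the whole picture. For a quick first check, the same totals are available at runtime through the standard `GarbageCollectorMXBean`. A minimal sketch, with `GcSummary` as an illustrative class name:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical demo class: prints collection counts and accumulated GC time per collector.
class GcSummary {

    /** Total number of collections over all collectors (undefined values are skipped). */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count; // -1 means the collector does not report a count
            }
        }
        return total;
    }

    /** Approximate accumulated collection time over all collectors, in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%-25s count=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC time is %.2f%% of %dms uptime%n",
                uptime > 0 ? 100.0 * totalGcMillis() / uptime : 0.0, uptime);
    }
}
```

Collector names differ between GC algorithms, and for concurrent collectors the reported time is not the same as pause time, so treat the percentage as a rough indicator only.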


-XX:InitialTenuringThreshold=8 Initial value for tenuring threshold (number of collections before object will be promoted to old space)

-XX:+UseTLAB Use thread local allocation blocks in Eden

-XX:MaxTenuringThreshold=15 Max value for tenuring threshold

-XX:PretenureSizeThreshold=2m Max object size allowed to be allocated in young space (large objects will be allocated directly in old space). Thread local allocation bypasses this check, so if TLAB is large enough, an object exceeding the size threshold may still be allocated in young space.

-XX:+AlwaysTenure Promote all objects surviving young collection immediately to tenured space (equivalent of -XX:MaxTenuringThreshold=0)

-XX:+NeverTenure Objects from young space will never get promoted to tenured space unless survivor space is not enough to keep them

-XX:+ResizeTLAB Let JVM resize TLABs per thread

-XX:TLABSize=1m Initial size of thread’s TLAB

-XX:MinTLABSize=64k Min size of TLAB

-XX:+UseCMSInitiatingOccupancyOnly Use predefined occupancy as the only criterion for starting a CMS collection (disable adaptive behaviour)

-XX:CMSInitiatingOccupancyFraction=70 Percentage of CMS generation occupancy at which to start a CMS cycle. A negative value means that CMSTriggerRatio is used.

-XX:CMSBootstrapOccupancy=50 Percentage of CMS generation occupancy at which to initiate CMS collection for bootstrapping collection stats.

-XX:CMSTriggerRatio=70 Percentage of MinHeapFreeRatio in CMS generation that is allocated before a CMS collection cycle commences.

-XX:CMSWaitDuration=30000 Once CMS collection is triggered, it will wait for the next young collection to perform initial mark right after. This parameter specifies how long CMS can wait for a young collection.

-XX:+CMSScavengeBeforeRemark Force young collection before remark phase

-XX:CMSScheduleRemarkEdenSizeThreshold=<size> If Eden used is below this value, don't try to schedule remark

-XX:CMSScheduleRemarkEdenPenetration=20 Eden occupancy % at which to try and schedule remark pause

-XX:CMSScheduleRemarkSamplingRatio=4 Start sampling Eden top at least before young generation occupancy reaches 1/<ratio> of the size at which we plan to schedule remark

-XX:+CMSIncrementalMode Enable incremental CMS mode. Incremental mode was meant for servers with a small number of CPUs, but may be used on multicore servers to benefit from a more conservative initiation strategy.

-XX:+CMSClassUnloadingEnabled If not enabled, CMS will not clean permanent space. You may need to enable it for containers such as JEE or OSGi.

-XX:ConcGCThreads=2 Number of parallel threads used for concurrent phase.

-XX:ParallelGCThreads=16 Number of parallel threads used for stop-the-world phases.

-XX:+DisableExplicitGC JVM will ignore application calls to System.gc()

-XX:+ExplicitGCInvokesConcurrent Let System.gc() trigger concurrent collection instead of full GC

-XX:+ExplicitGCInvokesConcurrentAndUnloadsClasses Same as above but also triggers permanent space collection.

-XX:PrintCMSStatistics=1 Print additional CMS statistics. Very verbose if n=2.

-XX:+PrintCMSInitiationStatistics Print CMS initiation details

-XX:+CMSDumpAtPromotionFailure Dump useful information about the state of the CMS old generation upon a promotion failure

-XX:+CMSPrintChunksInDump (with option above) Add more detailed information about the free chunks

-XX:+CMSPrintObjectsInDump (with option above) Add more detailed information about the allocated objects

HotSpot JVM options cheat sheet, by Alexey Ragozin – http://blog.ragozin.info

Young space tenuring

Thread local allocation

Parallel processing

-XX:CMSOldPLABMin=16 -XX:CMSOldPLABMax=1024 Min and max size of CMS gen PLAB caches per worker per block size

CMS initiating options

CMS Stop-the-World pauses tuning

Misc CMS options

CMS Diagnostic options

- Options for “deterministic” CMS, they disable some heuristics and require careful validation

Concurrent Mark Sweep (CMS)

All concrete numbers in JVM options in this card are for illustrative purposes only!

-XX:+ParallelRefProcEnabled Enable parallel processing of references during GC pause

-XX:SoftRefLRUPolicyMSPerMB=1000 Factor for calculating soft reference TTL based on free heap size

-XX:OnOutOfMemoryError=… Command to be executed in case of out of memory.

E.g. “kill -9 %p” on Unix or “taskkill /F /PID %p” on Windows.

-XX:G1HeapRegionSize=32m Size of heap region

-XX:MaxGCPauseMillis=500 Target GC pause duration. G1 is not deterministic, so there are no guarantees that GC pauses will satisfy this limit.

-XX:G1ReservePercent=10 Percentage of heap to keep free. Reserved memory is used as last resort to avoid promotion failure.

-XX:G1ConfidencePercent=50 Confidence level for MMU/pause prediction

-XX:G1HeapWastePercent=10 If garbage level is below threshold, G1 will not attempt to reclaim memory further

-XX:G1MixedGCCountTarget=8 Target number of mixed collections after a marking cycle

-XX:InitiatingHeapOccupancyPercent=45 Percentage of (entire) heap occupancy to trigger concurrent GC

Garbage First (G1)

-XX:+CMSParallelRemarkEnabled Whether parallel remark is enabled (enabled by default)

-XX:+CMSParallelSurvivorRemarkEnabled Whether parallel remark of survivor space is enabled, effective only with option above (enabled by default)

-XX:+CMSConcurrentMTEnabled Use multiple threads for concurrent phases.

CMS Concurrency options

-XX:+CMSParallelInitialMarkEnabled Whether parallel initial mark is enabled (enabled by default)

-XX:CMSTriggerInterval=60000 Periodically triggers CMS collection. Useful for deterministic object finalization.

GC options cheat sheet download at

http://blog.ragozin.info/2016/10/hotspot-jvm-garbage-collection-options.html

Young collector                | Old collector                           | JVM flags
Serial (DefNew)                | Serial Mark Sweep Compact               | -XX:+UseSerialGC
Parallel scavenge (PSYoungGen) | Serial Mark Sweep Compact (PSOldGen)    | -XX:+UseParallelGC
Parallel scavenge (PSYoungGen) | Parallel Mark Sweep Compact (ParOldGen) | -XX:+UseParallelOldGC
Parallel (ParNew)              | Serial Mark Sweep Compact               | -XX:+UseParNewGC
Serial (DefNew)                | Concurrent Mark Sweep                   | -XX:-UseParNewGC (1) -XX:+UseConcMarkSweepGC
Parallel (ParNew)              | Concurrent Mark Sweep                   | -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
Garbage First (G1)             | Garbage First (G1)                      | -XX:+UseG1GC

(1) Notice the minus before UseParNewGC, which explicitly disables parallel mode

-verbose:gc or -XX:+PrintGC Print basic GC info

-XX:+PrintGCDetails Print more detailed GC info

-XX:+PrintGCTimeStamps Print timestamps for each GC event (seconds count from start of JVM)

-XX:+PrintGCDateStamps Print date stamps at garbage collection events: 2011-09-08T14:20:29.557+0400: [GC...

-Xloggc:<file> Redirects GC output to a file instead of console

-XX:+PrintTLAB Print TLAB allocation statistics

-XX:+PrintReferenceGC Print times for special (weak, JNI, etc) reference processing during STW pause

-XX:+PrintJNIGCStalls Reports if GC is waiting for native code to unpin object in memory

-XX:+PrintClassHistogramAfterFullGC Prints class histogram after full GC

-XX:+PrintClassHistogramBeforeFullGC Prints class histogram before full GC

-XX:+UseGCLogFileRotation Enable GC log rotation

-XX:GCLogFileSize=512m Size threshold for GC log file

-XX:NumberOfGCLogFiles=5 Number of GC log files

-XX:+PrintGCCause Add cause of GC in log

-XX:+PrintHeapAtGC Print heap details on GC

-XX:+PrintAdaptiveSizePolicy Print young space sizing decisions

-XX:+PrintHeapAtSIGBREAK Print heap details on signal

-XX:+PrintPromotionFailure Print additional information for promotion failure

-XX:+PrintPLAB Print survivor PLAB details

-XX:+PrintOldPLAB Print old space PLAB details

-XX:+PrintGCTaskTimeStamps Print timestamps for individual GC worker thread tasks (very verbose)

-XX:+PrintGCApplicationStoppedTime Print summary after each JVM safepoint (including non-GC)

-XX:+PrintGCApplicationConcurrentTime Print time for each concurrent phase of GC

GC Log rotation

More logging options

-XX:+PrintTenuringDistribution Print detailed demography of young space after each collection

GC log detail options

-Xms256m or -XX:InitialHeapSize=256m Initial size of JVM heap (young + old)

-Xmx2g or -XX:MaxHeapSize=2g Max size of JVM heap (young + old)

-XX:NewSize=64m -XX:MaxNewSize=64m Absolute (initial and max) size of young space (Eden + 2 Survivors)

-XX:NewRatio=3 Alternative way to specify size of young space. Sets ratio of young vs old space (e.g. -XX:NewRatio=2 means that young space will be 2 times smaller than old space, i.e. 1/3 of heap size).

-XX:SurvivorRatio=15 Sets size of single survivor space relative to Eden space size (e.g. -XX:NewSize=64m -XX:SurvivorRatio=6 means that each Survivor space will be 8m and Eden will be 48m).

-XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=1g Initial and max size of JVM’s metaspace

-Xss256k (size in bytes) or -XX:ThreadStackSize=256 (size in Kbytes) Thread stack size

-XX:MaxDirectMemorySize=2g Maximum amount of memory available for NIO off-heap byte buffers

- Highly recommended option

Memory sizing options
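The NewRatio and SurvivorRatio examples in the card follow from simple arithmetic, reproduced below. The class `YoungSpaceMath` is illustrative only, and a real JVM additionally rounds sizes to its internal alignment, so actual values can differ slightly.

```java
// Hypothetical helper: reproduces the sizing arithmetic from the card's examples.
class YoungSpaceMath {

    /** Young space takes 1 part, old space takes NewRatio parts of the heap. */
    static long youngSizeFromNewRatio(long heapBytes, int newRatio) {
        return heapBytes / (newRatio + 1);
    }

    /** Young space = Eden (SurvivorRatio parts) + 2 survivor spaces (1 part each). */
    static long survivorSize(long youngBytes, int survivorRatio) {
        return youngBytes / (survivorRatio + 2);
    }

    /** Eden is what remains of young space after both survivor spaces. */
    static long edenSize(long youngBytes, int survivorRatio) {
        return youngBytes - 2 * survivorSize(youngBytes, survivorRatio);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // -XX:NewSize=64m -XX:SurvivorRatio=6
        System.out.println("survivor = " + survivorSize(64 * mb, 6) / mb + "m");
        System.out.println("eden     = " + edenSize(64 * mb, 6) / mb + "m");
        // -Xmx3g -XX:NewRatio=2
        System.out.println("young    = " + youngSizeFromNewRatio(3072 * mb, 2) / mb + "m");
    }
}
```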


Available combinations of garbage collection algorithms in HotSpot JVM


Java Process Memory

[Diagram: JVM memory consists of the Java heap (sized by -Xms/-Xmx; young gen with Eden, Survivor 0 and Survivor 1, plus old gen) and non-heap areas (thread stacks, metaspace, compressed class space, code cache, NIO direct buffers, other JVM memory). Non-JVM memory (native libraries) is shown next to it.]

-XX:CompressedClassSpaceSize=1g Memory reserved for compressed class space (64bit only)

-XX:InitialCodeCacheSize=256m -XX:ReservedCodeCacheSize=512m Initial size and max size of code cache area













Visualize GC Dynamics

https://github.com/chewiebug/GCViewer

with GC Viewer




Visualize GC Dynamics

with Mission Control


Tracing object allocation hot spots

Traditional approach – instrumentation

Instrumenting each allocation site

Intrusive and expensive

Flight recorder approach – out of TLAB allocation tracing

Low overhead

No byte code transformation

Biased sampling
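Before tracing individual allocation sites, it can help to confirm how much a suspect code path allocates. HotSpot exposes a per-thread allocation counter through `com.sun.management.ThreadMXBean`. The sketch below assumes a HotSpot-based JVM and uses an invented class name, `AllocationMeter`. It measures volume only and does not identify allocation sites.

```java
import java.lang.management.ManagementFactory;

// Hypothetical demo class: measures bytes allocated by the current thread around a code block.
class AllocationMeter {

    /** Bytes allocated so far by the current thread, or -1 if the JVM does not expose the counter. */
    static long allocatedBytes() {
        java.lang.management.ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (mx instanceof com.sun.management.ThreadMXBean) {
            com.sun.management.ThreadMXBean hotspot = (com.sun.management.ThreadMXBean) mx;
            if (hotspot.isThreadAllocatedMemorySupported() && hotspot.isThreadAllocatedMemoryEnabled()) {
                return hotspot.getThreadAllocatedBytes(Thread.currentThread().getId());
            }
        }
        return -1;
    }

    /** Allocates `count` arrays of `size` bytes; returning them keeps the JIT from removing the allocation. */
    static byte[][] churn(int count, int size) {
        byte[][] blocks = new byte[count][];
        for (int i = 0; i < count; i++) {
            blocks[i] = new byte[size];
        }
        return blocks;
    }

    public static void main(String[] args) {
        long before = allocatedBytes();
        byte[][] blocks = churn(10, 1 << 20);
        long after = allocatedBytes();
        if (before < 0) {
            System.out.println("per-thread allocation counter is not available on this JVM");
        } else {
            System.out.println("allocated about " + (after - before) / (1 << 20)
                    + "mb in " + blocks.length + " blocks");
        }
    }
}
```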


Tracking allocation hotspots

with Mission Control


Conclusion

A bottleneck could be anywhere in the system

Move from top to bottom

Use the right tools at each level

Focus on application KPIs

There is no silver bullet tool


KEEP CALM AND MAKE YOUR JAVA FAST

Alexey Ragozin


http://blog.ragozin.info
