Java at Scale: Performance & GC - Azul Systems, Inc.

Hank Shiffman

Product Marketing Manager

Java at Scale:

Performance & GC

© 2013 Azul Systems

Where is Java Working?

• On the server ─ Enterprise applications: business rules

─ Monolithic & distributed computing

• On the client

─ Fat client computing

─ Thin client, browser-based

• Embedded

─ Android apps


What is Java’s Appeal?

• Portable ─ Write once, run anywhere (after testing everywhere)

• Productive ─ No bad features: no multiple inheritance, operator overloading

─ Do the Right Thing philosophy (vs. C++ Do the Efficient Thing)

─ Memory management reduces opportunities for error

• Efficient ─ Interpreter → JIT compilation → Dynamic recompilation

• Generic ─ Scala, Clojure, JRuby & more use Java runtime

─ Byte code is the new target architecture (ANDF)

• Scalable ─ Small to large platforms


Parkinson’s Law Applied to Software

• Hardware grows with Moore’s Law ─ Transistor counts double roughly every 18 months

─ Memory size grows around 100x every 10 years

• Application sizes grow with hardware ─ 1980: 100 KB data on ¼ – ½ MB server

─ 1990: 10 MB data on 16 – 32 MB server

─ 2000: 1 GB data on 2 – 4 GB server

─ 2010: 100 GB data on 256 GB server

─ (In-memory data size. Bigger data is cached or distributed.)


Big Memory Servers are the Standard

• Retail prices, major web server store (US $, Oct 2012)

─ 24 vCore, 128 GB server: $5K

─ 24 vCore, 256 GB server: $8K

─ 64 vCore, 1 TB server: $36K

• Cheap (< $1/GB/month), and roughly linear to ~1 TB

• 10s to 100s of GB/sec of memory bandwidth


Has Java Kept Up? How Scalable is it?

• How big is your Java heap? ˃ .5 GB

˃ 1 GB

˃ 2 GB

˃ 4 GB

˃ 10 GB

˃ 20 GB

˃ 50 GB

˃ 100 GB

• Hardly anyone runs over 4 GB


Large Heaps are a Rarity

• Survey of heap sizes for Plumbr memory leak detector

─ Source: http://plumbr.eu/blog/most-popular-memory-configurations


Why So Few Big JVMs on Big Servers?

• Java performance gets worse with heap size

ehCache: 10 GB cache, 29 GB heap, 48 GB 16 core Ubuntu server

─ Pause frequency varies with application activity

─ Pause duration varies with amount to scan/copy


Think in Terms of Service Levels

• What are requirements (percentiles & worst case)?

─ Need to think beyond averages & standard deviations

─ GC pauses don’t fit a bell curve
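To make the point about averages concrete, here is a minimal sketch in Java. The latency samples are invented for illustration (98 responses at 10 ms plus two assumed pauses), and the class and method names are hypothetical. It uses the nearest-rank definition of a percentile.

```java
import java.util.Arrays;

// Hypothetical data: 98 fast responses and two long GC-style pauses.
class ServiceLevelSketch {

    // Nearest-rank percentile: the smallest sample that has at least
    // p percent of all samples at or below it. Input must be sorted.
    static long percentile(long[] sortedMillis, double p) {
        int rank = (int) Math.ceil(p * sortedMillis.length / 100.0);
        return sortedMillis[Math.max(rank, 1) - 1];
    }

    static double mean(long[] millis) {
        return Arrays.stream(millis).average().orElse(0.0);
    }

    static long[] sampleData() {
        long[] samples = new long[100];
        Arrays.fill(samples, 10L);   // typical response: 10 ms
        samples[98] = 2_000L;        // assumed pause
        samples[99] = 4_000L;        // assumed worst-case pause
        return samples;
    }

    public static void main(String[] args) {
        long[] samples = sampleData();
        Arrays.sort(samples);
        System.out.println("mean   = " + mean(samples) + " ms");
        System.out.println("median = " + percentile(samples, 50) + " ms");
        System.out.println("99%    = " + percentile(samples, 99) + " ms");
        System.out.println("max    = " + percentile(samples, 100) + " ms");
    }
}
```

With this data the mean sits far above the median and far below the worst case, so it describes no request that actually happened. That is why a service level is better stated as percentiles plus a maximum.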


A Classic Look at Application Response

• Key assumption: response time is a function of load

─ source: IBM CICS server documentation, “understanding response times”


Java Response Has a Different Look

• Pauses may track with load, but not in as obvious a way

─ source: ZOHO QEngine White Paper: performance testing report analysis



A Few Realities About GC

• First the good: ─ GC is very efficient, much better than

─ Dead objects cost nothing to collect

─ GC will find all the dead objects without help, even cyclic graphs

• Now the bad: ─ GC really does stop for ~1 second per GB of live objects

─ You can change when it happens, not if*

─ You can still have memory leaks

─ Hold on to objects so GC can’t release them

─ No pauses in a 20 minute test doesn’t mean they’re gone

─ “You can pay me now, or you can pay me later.”

* We’ll talk about that later…
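A leak in a garbage-collected language usually looks like the sketch below. The `HISTORY` list and `handleRequest` method are hypothetical names for illustration. The static collection acts as a GC root, so every payload added to it stays reachable and can never be reclaimed.

```java
import java.util.ArrayList;
import java.util.List;

class LeakSketch {
    // Hypothetical registry. A static field is a GC root, so everything
    // reachable from it counts as live on every collection cycle.
    static final List<byte[]> HISTORY = new ArrayList<>();

    static void handleRequest(int payloadBytes) {
        byte[] payload = new byte[payloadBytes];
        HISTORY.add(payload);   // the leak: added here, never removed
    }

    // Bytes that the collector is forced to keep alive.
    static long retainedBytes() {
        long total = 0;
        for (byte[] b : HISTORY) {
            total += b.length;
        }
        return total;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1_000; i++) {
            handleRequest(1024);
        }
        System.gc();   // only a request; either way the payloads stay reachable
        System.out.println("still retained: " + retainedBytes() + " bytes");
    }
}
```

The collector is doing its job correctly here. The program still holds a reference, so the objects count as live.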


How Does a Garbage Collector Work?

• Three phases to GC: ─ Identify the live objects

─ Start with stack & statics, flag everything we reach

─ Reclaim resources held by dead objects

─ Anything we didn’t flag in the 1st phase

─ Periodically relocate live objects (defrag)

─ Move objects together, correct references (remap)

• Sample implementations: ─ Mark/sweep/compact for old generation

─ Three separate passes, minimal extra heap

─ Copying collector for new generation

─ Move as we flag, do it all in one pass

─ Requires 2x heap
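The first two phases can be sketched on a toy object graph. This is an illustration only, not how any production JVM implements marking. The `Node` class and the list standing in for the heap are assumptions, and the relocation phase is left out. The unreachable pair `x` and `y` reference each other, which shows why cyclic garbage needs no special handling.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

class MarkSweepSketch {

    // A toy heap object: a name, outgoing references, and a mark flag.
    static final class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();
        boolean marked;
        Node(String name) { this.name = name; }
    }

    // Phase 1: start from the roots (stack & statics), flag everything reachable.
    static void mark(List<Node> roots) {
        Deque<Node> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Node n = pending.pop();
            if (n.marked) continue;   // already visited, also stops cycles
            n.marked = true;
            pending.addAll(n.refs);
        }
    }

    // Phase 2: reclaim anything not flagged, and clear flags for the next cycle.
    static int sweep(List<Node> heap) {
        int reclaimed = 0;
        for (Iterator<Node> it = heap.iterator(); it.hasNext(); ) {
            Node n = it.next();
            if (n.marked) {
                n.marked = false;
            } else {
                it.remove();
                reclaimed++;
            }
        }
        return reclaimed;
    }

    // Builds five objects, runs one cycle, returns how many were reclaimed.
    static int demo() {
        Node a = new Node("a"), b = new Node("b"), c = new Node("c");
        Node x = new Node("x"), y = new Node("y");
        a.refs.add(b);
        b.refs.add(c);
        x.refs.add(y);
        y.refs.add(x);   // unreachable cycle
        List<Node> heap = new ArrayList<>(List.of(a, b, c, x, y));
        mark(List.of(a));   // "a" is the only root
        return sweep(heap);
    }

    public static void main(String[] args) {
        System.out.println("reclaimed " + demo() + " objects");
    }
}
```

Note that the mark phase touches only live objects. The cost of the sweep depends on how the heap is organized, and a copying collector avoids a separate sweep by moving each object as it flags it.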


Generational GC

Basic assumption: most objects die young

• Use copying collector on new objects ─ Scan small % of heap, need small space for copy area

─ Reclaim the most space for the least effort

─ Move objects that live long enough to old generation(s)

• Collect old gen as it fills up ─ Much less frequent, likely higher cost, lower benefit

• Requires a Remembered Set (e.g. via Card Marking) ─ Track references from outside into new gen

─ Use as roots for new gen collector scan

• Don’t absolutely need 2x memory for new gen GC ─ Can overflow into old gen space
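Card marking can be sketched as a byte array with one entry per fixed-size chunk of the old generation. The 512-byte card size, the class name, and the use of integer offsets in place of real addresses are all assumptions made for illustration. A real JVM emits the write barrier inline in compiled code.

```java
import java.util.ArrayList;
import java.util.List;

class CardTableSketch {
    static final int CARD_SHIFT = 9;   // 512-byte cards (illustrative choice)
    final byte[] cards;

    CardTableSketch(int oldGenBytes) {
        cards = new byte[(oldGenBytes >> CARD_SHIFT) + 1];
    }

    // Write barrier: runs on every reference store into an old-gen object
    // and marks the card that covers the updated location as dirty.
    void onReferenceStore(int oldGenOffset) {
        cards[oldGenOffset >> CARD_SHIFT] = 1;
    }

    // New-gen collection: only dirty cards are scanned for references
    // into the new generation. Cards are cleaned once they are handed out.
    List<Integer> dirtyCardsAndClear() {
        List<Integer> dirty = new ArrayList<>();
        for (int i = 0; i < cards.length; i++) {
            if (cards[i] != 0) {
                dirty.add(i);
                cards[i] = 0;
            }
        }
        return dirty;
    }

    public static void main(String[] args) {
        CardTableSketch table = new CardTableSketch(1 << 20);   // 1 MB old gen
        table.onReferenceStore(100);      // card 0
        table.onReferenceStore(300);      // also card 0
        table.onReferenceStore(70_000);   // card 136
        System.out.println("cards to scan: " + table.dirtyCardsAndClear()
                + " of " + table.cards.length);
    }
}
```

The trade is a small cost on every reference store in exchange for a new gen collection that scans a handful of cards instead of the whole old generation.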


GC Terminology

• Concurrent vs. Parallel ─ A concurrent collector does GC while the application runs

─ A parallel collector uses multiple CPU cores to perform GC

─ A collector may be neither, one, or both

• Concurrent vs. Stop-The-World ─ A STW collector pauses the application during part of GC

─ A STW collector is not concurrent; it may be parallel

• Incremental ─ An incremental collector does its work in discrete chunks

─ Probably STW, with big gaps between increments


GC Terminology 2

• Precise vs. Conservative ─ A conservative collector doesn’t know every object reference or doesn’t know if some values are references or not

─ Can’t relocate objects if it can’t tell a ref from a value

─ A precise collector knows & can process every reference

─ Required to move objects

─ Compiler provides semantic information for the collector

─ Java relies on precise collection

• Safepoints ─ Places in execution (point or range) where collector can identify every reference in a thread’s execution stack

─ We bring a thread to a safepoint and keep it there during GC

─ Might mean pausing the thread, might not (e.g. JNI)

─ Safepoints need to be reached frequently

─ Global safepoints apply to all threads (STW)


Typical GC Combinations

• New generation ─ Usually a copying collector

─ Usually monolithic, stop-the-world

• Old generation ─ Usually Mark/Sweep/Compact

─ May be stop-the-world, or concurrent, or mostly concurrent, or incremental stop-the-world, or mostly incremental stop-the-world

• Mostly means not always ─ Fall back to monolithic stop-the-world (i.e. big pauses)


The Good Little Architect – A Moral Tale

A good architect must be able to impose her architectural choices on her projects

• Once upon a time, Azul met an app with 18 sec pauses ─ App had 10s of millions of object finalizations every GC cycle

─ Back then, reference processing was a stop-the-world event

• Every class in the project had a finalizer ─ All the finalizers did was null every reference field

─ In theory, saves the GC from following pointers

─ Right for C++ reference counting, oh so wrong for Java

• Two morals: ─ Know the cost of your actions (learn the underlying system)

─ Just because it doesn’t cost now doesn’t mean it won’t later


Oracle HotSpot GC Options

• Parallel GC ─ New Gen: monolithic STW copying

─ Old Gen: monolithic STW mark/sweep/compact

• Concurrent Mark Sweep (CMS) ─ New Gen: monolithic STW copying

─ Old Gen: mostly concurrent non-compacting

─ Mostly concurrent marking (multipass)

─ Concurrent sweeping

─ No compaction: free list, no object movement

─ Fallback is monolithic STW mark/sweep/compact


Oracle HotSpot GC Options 2

• Garbage First (G1GC) ─ New Gen: monolithic STW copying

─ Old Gen:

─ Mostly concurrent marker

─ STW to catch up on mutations, reference processing

─ Track inter-region relationships in remembered sets

─ STW mostly incremental compactor

─ Compact regions that can be done in limited time

─ Delay compaction of popular objects & regions

─ Goal: “avoid, as much as possible, having a full GC”

─ Fallback is monolithic STW mark/sweep/compact

─ Required for compacting popular objects & regions
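On HotSpot these collectors are selected with the command-line flags `-XX:+UseParallelGC`, `-XX:+UseConcMarkSweepGC` and `-XX:+UseG1GC` (CMS was removed in later JDK releases). To check which collectors a running JVM is actually using, the standard `java.lang.management` API can be queried, as in the sketch below. The class name is hypothetical, and the collector names reported vary by JVM vendor and version.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class CollectorReport {

    // One line per collector the JVM exposes, typically one for the
    // new generation and one for the old generation.
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName() + ": "
                    + gc.getCollectionCount() + " collections, "
                    + gc.getCollectionTime() + " ms total");
        }
        return lines;
    }

    public static void main(String[] args) {
        describeCollectors().forEach(System.out::println);
    }
}
```

Running the same program under each flag is a quick way to confirm that a tuning change took effect before reading anything into the pause numbers.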


Where Do Pauses Matter?

• Interactive apps like ecommerce ─ Add many seconds to a transaction & maybe lose a customer

─ Batch apps care about start-to-finish time, not transactions

• Big data apps ─ Travel site wants to keep hotel inventory in memory

─ Search app wants to keep entire index in memory

• Efficiency & management ─ More work from fewer JVM instances

• Low latency apps ─ Financial apps process data as it arrives

─ Small number of msecs down to < 1 msec

─ Requires low latency OS & significant tuning


Characterizing GC Pauses

• Frequency relates to activity ─ Object creation rate

─ Object mutation rate

• Severity relates to memory size ─ The more we examine & copy, the longer it takes

─ New gen is usually not the problem (yet)

• Not how much GC overhead, but where it happens


Limits to GC Overhead

• Worst case: no empty memory = 100% GC ─ GC runs hard all the time, reclaiming nothing

• Best case: infinite empty memory = 0% GC ─ Just keep creating objects, never collecting

• In between, GC follows 1/x curve as memory grows

[Figure: GC CPU overhead (0% to 100%) plotted against heap size, starting at the live set size]
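The shape of that curve can be reproduced with a deliberately simplified model, which is an assumption and not something stated on the slide. Each GC cycle does work proportional to the live set, and a cycle is needed each time the empty space (heap size minus live set) fills up. GC work per allocated byte is then proportional to live / (heap - live).

```java
class GcOverheadModel {

    // Assumed model: cost per cycle ~ live set; cycles needed ~ 1 / empty space.
    // Returns GC work per unit of allocation, in arbitrary relative units.
    static double relativeCost(double liveGb, double heapGb) {
        double emptyGb = heapGb - liveGb;
        if (emptyGb <= 0) {
            return Double.POSITIVE_INFINITY;   // no empty memory: GC all the time
        }
        return liveGb / emptyGb;
    }

    public static void main(String[] args) {
        double liveGb = 10.0;   // hypothetical live set
        double[] heapsGb = {10, 11, 15, 20, 40, 110};
        for (double heapGb : heapsGb) {
            System.out.printf("heap %5.0f GB -> relative GC cost %.2f%n",
                    heapGb, relativeCost(liveGb, heapGb));
        }
    }
}
```

Under this model, doubling the empty memory halves the GC work. That is the efficiency argument for large heaps, provided the pauses that come with them can be dealt with.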


How to Measure Pauses

• Identify the magnitude of the problem ─ jHiccup: free software from Azul’s CTO (jhiccup.com)

─ Does minimal work & records time to complete

─ Long delays indicate JVM wasn’t letting apps run

─ Run against your application

─ Results should map well to GC logs

─ Results will not include app inefficiencies

─ Run against idle JVM

─ Identify pauses from OS, VM, power management

• Don’t fix problems until you know where they lie
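The measurement idea can be sketched in a few lines. This is not jHiccup's actual implementation, only a minimal illustration of the approach described above, with hypothetical names. A thread asks to sleep for a fixed interval and records how much longer than that it actually took to get control back.

```java
import java.util.concurrent.TimeUnit;

class HiccupSketch {

    // Repeatedly sleeps for intervalMillis and returns the worst overshoot
    // observed, in milliseconds. The thread does almost no work, so a large
    // overshoot means something (GC, OS, hypervisor) kept it from running.
    static long worstHiccupMillis(long intervalMillis, int iterations) {
        long intervalNanos = TimeUnit.MILLISECONDS.toNanos(intervalMillis);
        long worstNanos = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();   // preserve interrupt status
                break;
            }
            long overshoot = System.nanoTime() - start - intervalNanos;
            worstNanos = Math.max(worstNanos, overshoot);
        }
        return TimeUnit.NANOSECONDS.toMillis(worstNanos);
    }

    public static void main(String[] args) {
        System.out.println("worst hiccup: " + worstHiccupMillis(1, 2_000) + " ms");
    }
}
```

Run inside the application's JVM, a thread like this sees the same stalls the application threads see. Run in an otherwise idle JVM, it gives a baseline for stalls caused by the platform rather than by the collector.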


What To Do About Pauses

• Apply creative language (the Marketing solution)

─ “Guarantee a worst case of X msec, 99% of the time”

─ “Mostly concurrent, mostly incremental”

─ i.e. “Will at times exhibit long monolithic STW pauses”

─ “Fairly consistent”

─ i.e. “Will sometimes show results well outside this range”

─ “Typical pauses in the tens of milliseconds”

─ i.e. “Some pauses are a lot longer than that”


What To Do About Pauses

• Tune like crazy ─ Adjust GC parameters until behavior’s acceptable

─ A stopgap, not a solution

• Keep the heap small ─ Multiple small instances instead of fewer bigger ones

─ Move data out of heap (e.g. external cache)

─ Pool your objects (e.g. threads, DB connections)

• Commit ritual murder ─ Big heap, kill & restart instance before old gen GC

─ Yes, people really do this

• Change your GC ─ Move from one that rarely stalls to one that never stalls


Making JVM Pauseless: The Hard Parts

• Robust concurrent marking ─ References keep changing

─ Multipass marking is sensitive to mutation rate

─ Weak, Soft, Final references hard to deal with

• Concurrent compaction ─ Moving the objects isn’t the problem

─ It’s fixing all the references to the moved objects

─ How do you handle an app looking at a stale reference?

─ If you can’t, remapping is a monolithic STW operation

• New gen collection at scale ─ New gen is generally monolithic STW

─ Pauses are small because heaps are tiny

─ A 100 GB heap means new gen GC has a lot of work


Azul’s Zing JVM

• High performance production JVM ─ 64-bit Linux on X86

─ Red Hat, SuSE, Ubuntu, CentOS

─ Maximum heap size: 512 GB

─ Elastic memory to prevent out-of-memory failures

─ Overdraft protection for your JVM

• Always-on performance & execution monitoring ─ System level

─ JVM level

─ Application level


Azul’s C4 Collector

• Concurrent guaranteed-single-pass marker ─ Unaffected by mutation rate

─ Concurrent reference processing (weak, soft, final)

• Concurrent compactor ─ Moves objects without pausing your application

─ Remaps references without pausing your application

─ Can relocate entire generation (new/old) in every GC cycle

• Concurrent, compacting old generation

• Concurrent, compacting new generation

• No stop-the-world fallback. Ever.


Remember This Slide?

• Java performance gets worse with heap size

ehCache: 10 GB cache, 29 GB heap, 48 GB 16 core Ubuntu server

─ Pause frequency varies with application activity

─ Pause duration varies with amount to scan/copy


Think in Terms of Service Levels

• What are requirements (percentiles & worst case)?

─ Need to think beyond averages & standard deviations

─ GC pauses don’t fit a bell curve


In-Memory Computing with Lucene

• Wikipedia English language index in memory ─ 132 GB data in 240 GB heap

─ Ref: blog.MikeMcCandless.com


Always-on Performance Monitoring

• System level activity: CPU, memory, network


Always-on Performance Monitoring

• JVM activity: CPU & memory


Real Time Execution Analysis

Technical papers

Free trials of Zing VM

Free licenses to OSS committers

www.azulsystems.com


Parallel GC


Concurrent Mark/Sweep


G1GC


Zing C4
