Advanced JVM Tuning, JavaOne 2013.
@TwitterAds | Confidential
Advanced JVM Tuning
David Keenan, Language Runtimes and Performance
JavaOne 2013, CON4540
Thursday, September 26, 13
Performance Tuning Overview
Top-Down Analysis
- Commonly used when you have the ability to change code at the highest level of the software stack.
1. Monitor the target application under load
  - System-level diagnostics
  - JVM-level diagnostics
2. Profile the application under load
3. Identify bottlenecks, analyze, and optimize
  - Make code more efficient
  - Reduce allocation rates
4. Repeat
Bottom-Up Analysis
- Commonly used when you do not have the ability to change code at the highest level of the software stack.
- JVM and OS performance optimization is a common use case.
1. Monitor CPU-level statistics against the target application under load
  - Use hardware counters (cache misses, path length, etc.)
  - Profile the hardware and map to instructions, OS/JVM, and Scala/Java code
  - Use tools when available; otherwise visually inspect the assembly code
2. Manipulate static and runtime compilers to address code issues
  - Missed optimizations
  - Example: autobox elision
3. Manipulate the javac / scala compiler
4. Manipulate core platform libraries
5. Identify issues at a higher level of the application stack
6. Repeat
Performance Triangle
- The three corners are Latency, Throughput, and Memory Footprint.
- Tuning goals: Reduce Latency, Increase Throughput, Smaller Memory Footprint.
Performance Metrics
Choosing the Right Metrics
Identify Metrics
- What’s important to your users?
- What influences your bottom line?
- What are you willing to trade off?
Define Success
- If it’s not broken, don’t fix it.
- Perfect is the enemy of done.
We want it all!
- High throughput
- Fast response times
- Small footprint
But …
- There’s no free lunch.
Choose your metrics wisely
- Target metrics that impact your customers first
Use Statistics!
- High variability can render some metrics useless
Throughput Metrics
Transactions per Second (TPS)
- # of Transactions / Time
- Aka pages/sec, queries/sec, hits/sec
- Good measure of top-end performance
Average Response Time
- Inverse of TPS
- Time / # of Transactions
- Sometimes a rolling average
CPU Utilization
- Measure of computation efficiency
- Good for capacity planning, not for development regression testing (new features can increase work)
Latency Metrics
Maximum response time
- Worst case
99% response time
- Drops a few outliers
90% response time
- May drop too many outliers and give a false sense of security
Critical Injection Rate
- critical-jOPS in SPECjbb2013
- Achievable throughput under a response time SLA
Do not use Average Response Time as a latency metric
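
The latency metrics above can be compared on a single set of samples. The following is a minimal sketch, assuming the nearest-rank definition of a percentile; the `percentile` helper and the `response_ms` values are made up for illustration and are not from the talk.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical response times in milliseconds: mostly fast, with two slow outliers.
response_ms = [10] * 97 + [12, 250, 900]

summary = {
    "mean": sum(response_ms) / len(response_ms),
    "p90": percentile(response_ms, 90),
    "p99": percentile(response_ms, 99),
    "max": max(response_ms),
}
print(summary)
```

In this made-up data set the 90% figure sits at the fast common case while the 99% and maximum figures expose the outliers, which is the trade-off the slide describes.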
Memory Footprint Metrics
Heap size after Full GC (Live Data Size)
- See Calculating Live Data Size
Native process size
- # ps aux PID
Static footprint
- Size of application binary
- Size of .jar
- Why does it matter?
  - Download/deployment speed
  - Update/refresh speed
JVM Tuning Basics
Calculating Live Data Size
Track size of Old Generation after Full GCs
[GC 435426K->392697K(657920K), 0.1411660 secs]
[Full GC 392697K->390333K(927232K), 0.5547680 secs]
[GC 625853K->592369K(1000960K), 0.1852460 secs]
[GC 831473K->800585K(1068032K), 0.1707610 secs]
[Full GC 800585K->798499K(1456640K), 1.9056030 secs]
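
As a rough illustration of tracking occupancy after each Full GC, the sketch below pulls the post-collection figure out of log entries in the format shown on this slide. The function name is hypothetical, and the regular expression is an assumption that only fits this simple log format, not every HotSpot log layout.

```python
import re

# Matches entries such as "[Full GC 392697K->390333K(927232K), 0.5547680 secs]".
FULL_GC = re.compile(r"\[Full GC (\d+)K->(\d+)K\((\d+)K\)")

def live_data_sizes_kb(log_text):
    """Return heap occupancy in KB after each Full GC found in the log text."""
    return [int(m.group(2)) for m in FULL_GC.finditer(log_text)]

LOG = (
    "[GC 435426K->392697K(657920K), 0.1411660 secs]\n"
    "[Full GC 392697K->390333K(927232K), 0.5547680 secs]\n"
    "[GC 625853K->592369K(1000960K), 0.1852460 secs]\n"
    "[GC 831473K->800585K(1068032K), 0.1707610 secs]\n"
    "[Full GC 800585K->798499K(1456640K), 1.9056030 secs]\n"
)

for size_kb in live_data_sizes_kb(LOG):
    print(f"live data after Full GC: {size_kb / 1024:.0f} MB")
```

The young-generation `[GC ...]` entries are skipped on purpose, since only the occupancy left after a Full GC approximates the live data size.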
Track size of Old Generation after Young GCs if no Full GC events occur
2013-09-10T05:39:03.489+0000: [GC[ParNew: 11766264K->18476K(13212096K), 0.0326070 secs] 12330878K->583306K(16357824K), 0.0327090 secs] [Times: user=0.48 sys=0.01, real=0.03 secs]
2013-09-10T05:42:54.666+0000: [GC[ParNew: 11762604K->20088K(13212096K), 0.0270110 secs] 12327434K->585068K(16357824K), 0.0271140 secs] [Times: user=0.39 sys=0.00, real=0.02 secs]
2013-09-10T05:46:41.623+0000: [GC[ParNew: 11764216K->21013K(13212096K), 0.0267490 secs] 12329196K->586133K(16357824K), 0.0268490 secs] [Times: user=0.40 sys=0.00, real=0.03 secs]
Young and Old Generation Sizing
Size of Old Generation
- Good starting point: 2X size of live data at steady state.
- If object promotion rate causes frequent CMS cycles, increase the size of the old generation.
- If live data size is 5GB, starting point should be ~10GB.
  - Old Generation size alone.
- Set -Xms and -Xmx to the same value
  - Nobody really needs extra Full GC pauses
Size of Young Generation
- Young gen = Old gen is a good starting point.
- Young generation size should increase with allocation rate
  - Sometimes 2-3x larger than Old Gen
- Young GC times are dominated by copying of live objects to Survivor spaces, not by the size of the overall Young Generation
- Size so that most objects die in the Young Generation
  - Higher allocation rates -> larger Young Generation
Example Enterprise Application
- Significant application state
  - In-memory cache size: 3.5GB
  - Overall live data size: 4GB
- High allocation rate of transient data
  - Most objects die in a large young generation
- Suggested initial heap size
  - -Xms16g -Xmx16g -Xmn8g
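
The starting-point rules from these sizing slides (old generation at 2X live data size, young generation equal to the old generation) can be written down as a small calculation. This is only a sketch: the function name and its parameters are hypothetical, and the result is a first guess to refine against GC logs, not a recommendation from the talk beyond those two rules.

```python
def starting_heap_flags(live_data_mb, old_multiplier=2, young_to_old_ratio=1.0):
    """Derive starting -Xms/-Xmx/-Xmn values (in MB) from the live data size."""
    if live_data_mb <= 0:
        raise ValueError("live data size must be positive")
    old_mb = int(round(live_data_mb * old_multiplier))   # old gen = 2X live data
    young_mb = int(round(old_mb * young_to_old_ratio))   # young gen = old gen
    heap_mb = old_mb + young_mb
    # -Xms and -Xmx are set to the same value, as the slides recommend.
    return f"-Xms{heap_mb}m -Xmx{heap_mb}m -Xmn{young_mb}m"

# The example application has an overall live data size of 4GB.
print(starting_heap_flags(4 * 1024))
```

For the 4GB example this yields a 16384 MB heap with an 8192 MB young generation, which matches the slide's -Xms16g -Xmx16g -Xmn8g.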
Choosing a Garbage Collector
Throughput?
- -XX:+UseParallelOldGC
Low server response times?
- CMS
  - Older technology
  - Can be highly tuned, but tuning can be brittle
  - -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
- G1
  - Current development focus
  - Young GC times slower than CMS
  - -XX:+UseG1GC
GC Logging Flags
Recommended GC Logging Flags
- -XX:+PrintGCDateStamps
- -XX:+PrintGCDetails
- -XX:+PrintGCTimeStamps
- -Xloggc:/tmp/file
Other Helpful Flags
- -XX:+PrintHeapAtGC
- -XX:+PrintTenuringDistribution
- -XX:+PrintGCApplicationStoppedTime
- -XX:+PrintReferenceGC
Calculating Allocation Rate
2013-09-10T05:39:03.489+0000: [GC[ParNew: 11766264K->18476K(13212096K), 0.0326070 secs] 12330878K->583306K(16357824K), 0.0327090 secs] [Times: user=0.48 sys=0.01, real=0.03 secs]
2013-09-10T05:42:54.666+0000: [GC[ParNew: 11762604K->20088K(13212096K), 0.0270110 secs] 12327434K->585068K(16357824K), 0.0271140 secs] [Times: user=0.39 sys=0.00, real=0.02 secs]
2013-09-10T05:46:41.623+0000: [GC[ParNew: 11764216K->21013K(13212096K), 0.0267490 secs] 12329196K->586133K(16357824K), 0.0268490 secs] [Times: user=0.40 sys=0.00, real=0.03 secs]
((YGen before GC) - (YGen after previous GC)) / ΔTime
(11764216K - 20088K) / (5:46:41.623+0000 - 5:42:54.666+0000)
11.2GB / 227 sec = ~50.5 MB/sec
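
The allocation-rate arithmetic can be scripted against ParNew log entries. The sketch below is an illustration under assumptions: the function name is hypothetical, the regular expression only fits the ParNew format shown here, and the sample lines are the slide's entries with the trailing `[Times: ...]` section left off. It measures what was allocated between two collections as the young generation occupancy before a GC minus the occupancy left after the previous GC.

```python
import re
from datetime import datetime

# Captures the timestamp and the young generation occupancy before and after each ParNew GC.
PARNEW = re.compile(r"(\S+): \[GC\[ParNew: (\d+)K->(\d+)K\(\d+K\)")

def allocation_rates_mb_per_sec(log_text):
    """Allocation rate (MB/sec) between each pair of consecutive young collections."""
    events = [
        (datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f%z"),
         int(m.group(2)), int(m.group(3)))
        for m in PARNEW.finditer(log_text)
    ]
    rates = []
    for (t_prev, _, after_prev), (t_cur, before_cur, _) in zip(events, events[1:]):
        allocated_kb = before_cur - after_prev  # allocated since the previous GC
        seconds = (t_cur - t_prev).total_seconds()
        rates.append(allocated_kb / 1024 / seconds)
    return rates

LOG = """\
2013-09-10T05:39:03.489+0000: [GC[ParNew: 11766264K->18476K(13212096K), 0.0326070 secs] 12330878K->583306K(16357824K), 0.0327090 secs]
2013-09-10T05:42:54.666+0000: [GC[ParNew: 11762604K->20088K(13212096K), 0.0270110 secs] 12327434K->585068K(16357824K), 0.0271140 secs]
2013-09-10T05:46:41.623+0000: [GC[ParNew: 11764216K->21013K(13212096K), 0.0267490 secs] 12329196K->586133K(16357824K), 0.0268490 secs]
"""

rates = allocation_rates_mb_per_sec(LOG)
print([round(r, 1) for r in rates])
```

Averaging over several intervals, rather than reading a single pair of entries, helps smooth out bursts in allocation.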
Calculating Promotion Rate
2013-09-10T05:39:03.489+0000: [GC[ParNew: 11766264K->18476K(13212096K), 0.0326070 secs] 12330878K->583306K(16357824K), 0.0327090 secs] [Times: user=0.48 sys=0.01, real=0.03 secs]
2013-09-10T05:42:54.666+0000: [GC[ParNew: 11762604K->20088K(13212096K), 0.0270110 secs] 12327434K->585068K(16357824K), 0.0271140 secs] [Times: user=0.39 sys=0.00, real=0.02 secs]
2013-09-10T05:46:41.623+0000: [GC[ParNew: 11764216K->21013K(13212096K), 0.0267490 secs] 12329196K->586133K(16357824K), 0.0268490 secs] [Times: user=0.40 sys=0.00, real=0.03 secs]
ΔOld Generation Size / ΔTime
(586133K - 583306K) / (5:46:41.623+0000 - 5:39:03.489+0000)
2827KB / 458 sec = ~6.172 KB/sec
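
The promotion-rate calculation can be sketched the same way. The slide uses the total heap occupancy after each young GC as a stand-in for old generation size; that figure still includes whatever survived into the survivor space. The hypothetical `subtract_survivors` option below removes the post-GC young generation occupancy to isolate the old generation; it is an assumption added for illustration, not something the talk prescribes. The sample lines omit the trailing `[Times: ...]` section.

```python
import re
from datetime import datetime

# Captures timestamp, young gen occupancy after GC, and total heap occupancy after GC.
YOUNG_GC = re.compile(
    r"(\S+): \[GC\[ParNew: \d+K->(\d+)K\(\d+K\), [\d.]+ secs\] \d+K->(\d+)K\(\d+K\)"
)

def promotion_rate_kb_per_sec(log_text, subtract_survivors=False):
    """Growth in post-GC occupancy per second between the first and last young GC."""
    events = []
    for m in YOUNG_GC.finditer(log_text):
        when = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f%z")
        young_after, heap_after = int(m.group(2)), int(m.group(3))
        occupancy = heap_after - young_after if subtract_survivors else heap_after
        events.append((when, occupancy))
    if len(events) < 2:
        raise ValueError("need at least two young GC entries")
    (t_first, first), (t_last, last) = events[0], events[-1]
    return (last - first) / (t_last - t_first).total_seconds()

LOG = """\
2013-09-10T05:39:03.489+0000: [GC[ParNew: 11766264K->18476K(13212096K), 0.0326070 secs] 12330878K->583306K(16357824K), 0.0327090 secs]
2013-09-10T05:42:54.666+0000: [GC[ParNew: 11762604K->20088K(13212096K), 0.0270110 secs] 12327434K->585068K(16357824K), 0.0271140 secs]
2013-09-10T05:46:41.623+0000: [GC[ParNew: 11764216K->21013K(13212096K), 0.0267490 secs] 12329196K->586133K(16357824K), 0.0268490 secs]
"""

slide_method = promotion_rate_kb_per_sec(LOG)
old_gen_only = promotion_rate_kb_per_sec(LOG, subtract_survivors=True)
print(f"{slide_method:.2f} KB/sec (heap after GC), {old_gen_only:.2f} KB/sec (old gen only)")
```

Either way the rate is tiny compared with the allocation rate, which is consistent with nearly all objects dying in the young generation.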
Tuning for Latency
Using CMS
Enable CMS
- -XX:+UseConcMarkSweepGC
Good to have
- -XX:+CMSScavengeBeforeRemark
- -XX:+ParallelRefProcEnabled
- -XX:CMSInitiatingOccupancyFraction=70
Start with Basic Tuning Guidelines
- -XX:PermSize=256m -XX:MaxPermSize=256m
- Old Gen Size is 2X Live Data Size
- Young Gen Size = Old Gen Size
General rules of thumb
- Increase young gen size to handle higher allocation rates.
- Increase young gen size if the promotion rate is high
  - May suffer from premature promotion, i.e. promotions from too-frequent young GCs.
  - A larger young gen decreases GC frequency and gives more time for objects to die.
- Increase Old Gen size if the promotion rate is still high, to avoid allocation and concurrent mode failures
CMS Tuned for Latency
-Xmx18g -Xms18g -XX:PermSize=256m \
-XX:MaxPermSize=256M -XX:+CMSScavengeBeforeRemark \
-XX:-OmitStackTraceInFastThrow -XX:+UseParNewGC \
-XX:+UseConcMarkSweepGC \
-XX:CMSInitiatingOccupancyFraction=70 \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:SurvivorRatio=6 -XX:NewSize=8g \
-XX:MaxNewSize=8g -verbose:gc \
-XX:+PrintGCApplicationStoppedTime \
-XX:+PrintGCDateStamps -XX:+PrintGCDetails \
-XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC \
-XX:+PrintTenuringDistribution
Note: Increased Young Gen Size, Survivor Ratio Tuning
Using G1GC
Enable G1
- -XX:+UseG1GC -XX:MaxGCPauseMillis=100
- Start with just overall heap size and target pause time.
- Increase Young Generation size for high allocation rates
- Tune to keep remembered set processing low
G1 Tuning to Consider
- -XX:InitiatingHeapOccupancyPercent=90
- -XX:G1MixedGCLiveThresholdPercent: The occupancy threshold of live objects in the old region to be included in the mixed collection.
- -XX:G1HeapWastePercent: The threshold of garbage that you can tolerate in the heap.
- -XX:G1MixedGCCountTarget: The target number of mixed garbage collections within which the regions with at most G1MixedGCLiveThresholdPercent live data should be collected.
- -XX:G1OldCSetRegionThresholdPercent: A limit on the max number of old regions that can be collected during a mixed collection.
Reference: Monica Beckwith’s InfoQ article, “G1: One Garbage Collector To Rule Them All”
http://www.infoq.com/articles/G1-One-Garbage-Collector-To-Rule-Them-All
G1GC Tuned for Latency
-XX:+TieredCompilation -XX:InitialCodeCacheSize=256m \
-XX:ReservedCodeCacheSize=256m -Xmx18g -Xms18g \
-XX:PermSize=256m -XX:MaxPermSize=256M -XX:+UseG1GC \
-XX:MaxGCPauseMillis=200 \
-XX:InitiatingHeapOccupancyPercent=90 \
-XX:+PrintGCApplicationStoppedTime \
-XX:+PrintGCDateStamps -XX:+PrintGCDetails \
-XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC \
-XX:+PrintTenuringDistribution
Note: MaxGCPauseMillis is the biggest tuning knob. Don’t start with CMS tuning!
Tuning for Throughput
Using ParallelOldGC
Enable ParallelOldGC
- -XX:+UseParallelOldGC
Old Gen needs to be 2-4X live data size (LDS)
Young generation should be ¾ of the heap
Often used when tuning for throughput
- -XX:+AggressiveOpts
- -XX:+TieredCompilation
Disabling adaptive sizing and tuning survivor spaces directly
- -XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=7 \
  -XX:TargetSurvivorRatio=90
ParallelOldGC tuned for Throughput:
-showversion -server -XX:-UseBiasedLocking \
-XX:LargePageSizeInBytes=2m -XX:+AlwaysPreTouch \
-XX:+UseLargePages -XX:+PrintGCDetails \
-XX:+PrintGCTimeStamps \
-Xms29g -Xmx29g -Xmn27g -XX:+UseParallelOldGC \
-XX:ParallelGCThreads=24 -XX:SurvivorRatio=16 \
-XX:TargetSurvivorRatio=90 -XX:-UseAdaptiveSizePolicy \
-XX:+AggressiveOpts -XX:InitialCodeCacheSize=160m \
-XX:ReservedCodeCacheSize=160m -XX:+TieredCompilation
Using G1GC
Enable G1
- -XX:+UseG1GC
Old Gen needs to be 2X live data size (LDS)
Young generation should be ¾ of the heap
Often used when tuning for throughput
- -XX:+AggressiveOpts
- -XX:+TieredCompilation
G1GC tuned for throughput:
-showversion -server -XX:-UseBiasedLocking \
-XX:LargePageSizeInBytes=2m -XX:+AlwaysPreTouch \
-XX:+UseLargePages -XX:+PrintGCDetails \
-XX:+PrintGCTimeStamps \
-Xms28g -Xmx28g -Xmn21g -XX:+UseG1GC \
-XX:+AggressiveOpts \
-XX:InitialCodeCacheSize=160m -XX:ReservedCodeCacheSize=160m \
-XX:+TieredCompilation
Using CMS
Enable CMS, and tune for throughput
- -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
- Configure heap to avoid promotion
- Application design should separate stateful and stateless components to allow targeted tuning.
Young generation should be ¾ of the heap
- Young generation should be sized to ensure nearly all objects die young.
- Very large heaps, very large old generation
- Use memory to avoid the need for Full GC.
Tuning survivor spaces manually, etc.
- -XX:SurvivorRatio=7 -XX:+CMSScavengeBeforeRemark \
  -XX:+ParallelRefProcEnabled
CMS Tuned for Throughput
-Xmx18g -Xms18g -XX:PermSize=256m \
-XX:MaxPermSize=256M -XX:+CMSScavengeBeforeRemark \
-XX:-OmitStackTraceInFastThrow -XX:+AggressiveOpts \
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
-XX:CMSInitiatingOccupancyFraction=90 \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:SurvivorRatio=6 -XX:NewSize=16g \
-XX:MaxNewSize=16g -verbose:gc \
-XX:+PrintGCApplicationStoppedTime \
-XX:+PrintGCDateStamps -XX:+PrintGCDetails \
-XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC \
-XX:+PrintTenuringDistribution \
-XX:InitialCodeCacheSize=160m -XX:ReservedCodeCacheSize=160m \
-XX:+TieredCompilation
Tuning for Footprint
Using ParallelOldGC
Enable ParallelOldGC
- -XX:+UseParallelOldGC
Old Gen needs to be 2X live data size (LDS)
Young generation should start at 1/2 the Old Generation size.
Strategy is to reduce young and old generation sizes independently until the maximum acceptable end-user response time is reached.
Definitely not low-pause. Trading higher response times for lower footprint and lower throughput.
ParallelOldGC tuned for Footprint
-showversion -server -XX:LargePageSizeInBytes=2m \
-XX:+UseLargePages -XX:+PrintGCDetails \
-XX:+PrintGCTimeStamps \
-Xms8g -Xmx8g -Xmn4g -XX:+UseParallelOldGC \
-XX:-UseAdaptiveSizePolicy -XX:+AggressiveOpts \
-XX:PermSize=256m -XX:MaxPermSize=256M
Using G1GC
Enable G1
- -XX:+UseG1GC
Heap should be 3x live data size (LDS)
- Do not tune the size of the young generation
- Allow G1 to adapt the size
- Tune only after observing the minimum size G1 settles on
Increase the pause target to decrease GC overhead
- -XX:MaxGCPauseMillis=400
Strategy is to reduce young and old generation sizes independently until the maximum acceptable end-user response time is reached.
G1 Tuned for Footprint
-showversion -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
-Xms12g -Xmx12g -XX:+UseG1GC -XX:InitialCodeCacheSize=160m \
-XX:ReservedCodeCacheSize=160m
Using CMS
Enable CMS, and tune for footprint
- -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
Old Gen needs to be 2X live data size (LDS)
Young generation should start at 1/2 the Old Generation size.
- Young generation should be sized so “enough” objects die in the young generation to reduce the pressure on CMS
- Promotion rate needs to be low enough so CMS concurrent threads don’t lose the race (concurrent mode failures)
Strategy is to reduce young and old generation sizes independently until the maximum acceptable end-user response time is reached.
- Young Generation first, then Old Gen.
Example of a highly tuned CMS deployment for footprint:
-Xmx12g -Xms12g -Xmn4g -XX:PermSize=256m \
-XX:MaxPermSize=256M -XX:+CMSScavengeBeforeRemark \
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
-XX:CMSInitiatingOccupancyFraction=60 \
-XX:SurvivorRatio=6 -verbose:gc \
-XX:+PrintGCApplicationStoppedTime \
-XX:+PrintGCDateStamps -XX:+PrintGCDetails \
-XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC \
-XX:+PrintTenuringDistribution
Note: Increased Young Gen Size, Survivor Ratio Tuning
Common Performance Issues
In Enterprise Software
Size of Permanent Generation
- Perm. Gen. only collects and resizes at Full GC.
Heap before GC invocations=40019 (full 36522):
 par new generation total 15354176K, used 14K [0x00000003b9c00000, 0x0000000779c00000, 0x0000000779c00000)
  eden space 14979712K, 0% used [0x00000003b9c00000, 0x00000003b9c039a8, 0x000000074c0a0000)
  from space 374464K, 0% used [0x000000074c0a0000, 0x000000074c0a0000, 0x0000000762e50000)
  to space 374464K, 0% used [0x0000000762e50000, 0x0000000762e50000, 0x0000000779c00000)
 concurrent mark-sweep generation total 2097152K, used 588343K [0x0000000779c00000, 0x00000007f9c00000, 0x00000007f9c00000)
 concurrent-mark-sweep perm gen total 102400K, used 102399K [0x00000007f9c00000, 0x0000000800000000, 0x0000000800000000)
2013-09-05T17:21:39.530+0000: [Full GC[CMS: 588343K->588343K(2097152K), 1.6166150 secs] 588357K->588343K(17451328K), [CMS Perm : 102399K->102399K(102400K)], 1.6167040 secs] [Times: user=1.57 sys=0.00, real=1.61 secs]
Heap after GC invocations=40020 (full 36523):
 par new generation total 15354176K, used 0K [0x00000003b9c00000, 0x0000000779c00000, 0x0000000779c00000)
  eden space 14979712K, 0% used [0x00000003b9c00000, 0x00000003b9c00000, 0x000000074c0a0000)
  from space 374464K, 0% used [0x000000074c0a0000, 0x000000074c0a0000, 0x0000000762e50000)
  to space 374464K, 0% used [0x0000000762e50000, 0x0000000762e50000, 0x0000000779c00000)
 concurrent mark-sweep generation total 2097152K, used 588343K [0x0000000779c00000, 0x00000007f9c00000, 0x00000007f9c00000)
 concurrent-mark-sweep perm gen total 102400K, used 102399K [0x00000007f9c00000, 0x0000000800000000, 0x0000000800000000)
}
Recommendation: -XX:PermSize=256m -XX:MaxPermSize=256m
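
A full permanent generation like the one in this log can be spotted mechanically. The sketch below is an illustration only: the function name is hypothetical and the regular expression is an assumption that fits the `[CMS Perm : ...]` field as printed above. It reports how full the perm gen remains after each Full GC.

```python
import re

# Matches the perm gen field of a CMS Full GC entry: used before -> used after (capacity).
PERM = re.compile(r"\[CMS Perm : (\d+)K->(\d+)K\((\d+)K\)\]")

def perm_gen_fill_ratios(log_text):
    """Fraction of perm gen capacity still in use after each Full GC."""
    return [int(m.group(2)) / int(m.group(3)) for m in PERM.finditer(log_text)]

LINE = (
    "2013-09-05T17:21:39.530+0000: [Full GC[CMS: 588343K->588343K(2097152K), 1.6166150 secs] "
    "588357K->588343K(17451328K), [CMS Perm : 102399K->102399K(102400K)], 1.6167040 secs]"
)

ratios = perm_gen_fill_ratios(LINE)
for ratio in ratios:
    status = "perm gen nearly full" if ratio > 0.95 else "ok"
    print(f"{ratio:.1%} used after Full GC: {status}")
```

The 0.95 threshold is an arbitrary choice for the example. In the log above the Full GC reclaims nothing from a perm gen that is at capacity, and the invocation counters show that most collections were Full GCs.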
Size of Code Cache
- Default size is 64 MB, 96 MB if running TieredCompilation
- Enterprise applications have lots of code
Aggressively Tune to Avoid Issue
- Tuning without TieredCompilation
  - -XX:InitialCodeCacheSize=128m -XX:ReservedCodeCacheSize=128m
- Tuning with TieredCompilation
  - -XX:InitialCodeCacheSize=256m -XX:ReservedCodeCacheSize=256m
OpenJDK Development at Twitter
What’s up with Twitter and JDK Development?
Twitter runs Java + Scala on the HotSpot JVM
- Most highly optimized managed runtime
- Open source :-)
- Massive performance gains moving technologies
Own and Optimize our Platform
- Build out diagnostic tools
- Build, test, and deploy OpenJDK
- Optimize HotSpot runtime compilers for Scala, etc.
- Tailored GC for Twitter’s needs
  - Extremely low latency requirements (< 10ms)
@TwitterJDK
Contribute Back to the Community
- Working closely with Oracle Java Development
- Collaborating with other OpenJDK contributors
- Posting tools to GitHub and OpenJDK repositories
Interesting, isn’t it?
- We’re just ramping up now.
- Follow us soon: @TwitterJDK (new idea)
- Follow me at: @dagskeenan
- #jointheflock
#ThankYou