Virtualizing and Tuning Large Scale Java Platforms
Emad Benjamin, VMware
VAPP4536
#VAPP4536
About the Speaker
I have been with VMware for the last 8 years, working on Java and vSphere
20 years of experience as a Software Engineer/Architect, with the last 15 years focused on Java development
Open source contributions
Prior work with Cisco, Oracle, and Banking/Trading Systems
Authored the following books:
• Virtualizing and Tuning Large Scale Java Platforms
• Enterprise Java Applications Architecture on VMware
Disclaimer
This session may contain product features that are currently under development.
This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.
Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Technical feasibility and market demand will affect final delivery.
Pricing and packaging for any new technologies or features discussed or presented have not been determined.
Agenda
Overview
Design and Sizing Java Platforms
Performance
Best Practices and Tuning
Customer Success Stories
Questions
Java Platforms Overview
Conventional Java Platforms
Java platforms are multitier and multi-org.
[Diagram: the platform spans a Load Balancer Tier, Web Server Tier, Java App Tier, and DB Server Tier, mapped to the organizational key stakeholder departments: IT Operations Network Team (load balancers), IT Operations Server Team (web servers), IT Apps – Java Dev Team (Java applications), and IT Ops & Apps Dev Team (DB servers)]
Middleware Platform Architecture on vSphere
[Diagram: load balancers, web servers, Java application servers, and DB servers run as VMs (application services) on VMware vSphere, a shared, always-on infrastructure. Shared infrastructure services provide capacity on demand, high availability, and dynamic scaling, yielding high-uptime, scalable, and dynamic enterprise Java applications]
Java Platforms Design and Sizing
Design and Sizing of Java Platforms on vSphere
Step 1 – Establish Load Profile
From production logs/monitoring reports, measure:
• Concurrent Users
• Requests Per Second
• Peak Response Time
• Average Response Time
Establish your response time SLA.
Step 2 – Establish Benchmark
Iterate through benchmark tests until you are satisfied with the load profile metrics and your intended SLA. After each benchmark iteration you may have to adjust the application configuration, and adjust the vSphere environment to scale out/up, in order to arrive at your desired number of VMs and the vCPU and RAM configuration of each.
Step 3 – Size Production Environment
The size of the production environment will have been established in Step 2; either roll out the environment from Step 2, or build a new one based on the numbers established there.
Step 2 – Establish Benchmark
Establish the building block VM (vertical scalability): run a scale-up test to establish how many JVMs fit on one VM, and how large the VM should be in terms of vCPU and memory.
Determine how many VMs (horizontal scalability): run a scale-out test, adding building block VMs until you meet your response time SLAs without exceeding 70%-80% CPU saturation. Establish your horizontal scalability factor before bottlenecks appear in your application.
After each scale-out iteration, check whether the SLA is met. If yes, the test is complete. If not, investigate the bottlenecked layer (network, storage, application configuration, or vSphere):
• If the scale-out bottlenecked layer is removed, iterate the scale-out test.
• If it is a building block app/VM configuration problem, adjust and iterate.
Design and Sizing HotSpot JVMs on vSphere
[Diagram: VM Memory = Guest OS Memory + JVM Memory. JVM Memory comprises the heap (initial heap -Xms, up to JVM max heap -Xmx), the Perm Gen (-XX:MaxPermSize), Java stacks (-Xss per thread), and other memory: direct native memory ("off-the-heap") and non-direct memory]
Design and Sizing of HotSpot JVMs on vSphere
Guest OS Memory: approx 1GB (depends on OS/other processes)
Perm Size is an area additional to the -Xmx (Max Heap) value and is not GC-ed because it contains class-level information.
"Other mem" is additional memory required for NIO buffers, the JIT code cache, classloaders, socket buffers (receive/send), JNI, and GC internal info.
JVM Memory = JVM Max Heap (-Xmx value) + JVM Perm Size (-XX:MaxPermSize) + NumberOfConcurrentThreads * (-Xss) + "other mem"
VM Memory = Guest OS Memory + JVM Memory
If you have multiple JVMs (N JVMs) on a VM, then:
• VM Memory = Guest OS Memory + N * JVM Memory
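As a back-of-the-envelope check, the formula can be expressed in a few lines of Java (a minimal sketch; the class and variable names are illustrative, with values taken from the sizing example that follows):

public class JvmMemorySizing {
    public static void main(String[] args) {
        int maxHeapMB  = 4096;   // -Xmx
        int permSizeMB = 256;    // -XX:MaxPermSize
        double stackMB = 0.25;   // -Xss of 256k per thread
        int threads    = 100;    // number of concurrent threads
        int otherMemMB = 217;    // NIO buffers, JIT code cache, JNI, GC internals
        int guestOsMB  = 500;    // guest OS and other processes

        double jvmMemoryMB = maxHeapMB + permSizeMB + threads * stackMB + otherMemMB;
        double vmMemoryMB  = guestOsMB + jvmMemoryMB;
        // Prints roughly the ~4.5GB JVM memory and ~5GB VM memory
        // figures used in the sizing example below
        System.out.printf("JVM memory: %.0fm, VM memory: %.0fm%n", jvmMemoryMB, vmMemoryMB);
    }
}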
Sizing Example
• JVM Max Heap (-Xmx): 4096m; Initial Heap (-Xms): 4096m
• Perm Gen (-XX:MaxPermSize): 256m
• Java stacks: -Xss 256k * 100 threads
• Other mem: approx 217m
• JVM Memory: approx 4588m
• Guest OS Memory: approx 500m used by OS
• VM Memory: approx 5088m; set the memory reservation to 5088m
Larger JVMs for In-Memory Data Management Systems
• JVM Max Heap (-Xmx): 30g; Initial Heap (-Xms): 30g
• Perm Gen (-XX:MaxPermSize): 0.5g
• Java stacks: -Xss 1M * 500 threads
• Other mem: approx 1g
• JVM Memory for SQLFire: approx 32g
• Guest OS Memory: approx 0.5-1g used by OS
• VM Memory for SQLFire: approx 34g; set the memory reservation to 34g
NUMA Local Memory with Overhead Adjustment
NUMA local memory = (Physical RAM on vSphere host * (1 - 1% RAM overhead per VM * number of VMs on vSphere host) - vSphere RAM overhead) / number of sockets on vSphere host
Middleware ESXi Cluster
Example: each vSphere host has 96GB RAM and 2 sockets with 8 pCPU per socket, running middleware component VMs of 8 vCPU and 47GB RAM each:
• Memory available for all VMs = 96 * 0.98 - 1GB => approx 94GB
• Per-NUMA-node memory => 94 / 2 => 47GB
The locator/heartbeat process for the middleware runs separately; DO NOT vMotion it.
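The same arithmetic as a minimal Java sketch (illustrative names; assumes approx 1% RAM overhead per VM and a fixed 1GB vSphere overhead, as in the example above):

public class NumaLocalMemory {
    public static void main(String[] args) {
        double physicalRamGB     = 96;    // RAM on the vSphere host
        int    vmsOnHost         = 2;     // one VM per NUMA node
        double overheadPerVm     = 0.01;  // approx 1% RAM overhead per VM
        double vsphereOverheadGB = 1;     // fixed hypervisor overhead
        int    sockets           = 2;

        double availableGB   = physicalRamGB * (1 - vmsOnHost * overheadPerVm) - vsphereOverheadGB;
        double perNumaNodeGB = availableGB / sockets;
        // Prints approximately 47 for this host configuration
        System.out.printf("Per NUMA node: ~%.0fGB%n", perNumaNodeGB);
    }
}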
With 96GB RAM on the server, each NUMA node has 94/2 => 47GB. The ESXi scheduler keeps 8-vCPU VMs with less than 47GB RAM each local to one NUMA node. If a VM is sized greater than 47GB or 8 CPUs, NUMA interleaving occurs and can cause an approx 30% drop in memory throughput performance.
On a server with 128GB RAM, the ESXi scheduler keeps 2-vCPU VMs with less than 20GB RAM each local to one NUMA node, while a 4-vCPU VM with 40GB RAM is split by ESXi into 2 NUMA clients (available in ESX 4.1 and later).
Java Platform Categories – Category 1
Category 1: 100s to 1000s of JVMs. Smaller JVMs: < 4GB heap, 4.5GB Java process, and 5GB for the VM.
vSphere hosts with < 96GB RAM are more suitable, as by the time you stack the many JVM instances you are likely to reach the CPU boundary before you can consume all of the RAM. For example, if instead you chose a vSphere host with 256GB RAM, then 256/4.5GB => 57 JVMs, which would clearly hit the CPU boundary first.
Multiple JVMs per VM. Use resource pools to manage different LOBs (e.g., Resource Pool 1: Gold LOB 1; Resource Pool 2: Silver LOB 2).
Use 4-socket servers to get more cores.
Most Common Sizing and Configuration Question
Starting from 2-vCPU VMs each running 2GB JVMs, how should you add capacity?
• Option 1 – Scale out VM and JVM (best): add more 2-vCPU/2GB building block VMs (JVM-1, JVM-2, JVM-3, JVM-4; 2GB and 2 vCPU each).
• Option 2 – Scale up JVM heap size (2nd best): keep the 2-vCPU VMs but grow each heap from 2GB to 4GB (JVM-1A, JVM-2A).
• Option 3 – Scale up VM and JVM (3rd best): grow a single VM's vCPUs and stack multiple JVMs (JVM-1, JVM-2) on it.
What Else to Consider When Sizing?
Mixed workloads, e.g. a job scheduler vs. a web app (whether scaled vertically within JVMs or horizontally across JVM-1 through JVM-4), require different GC tuning:
• Job schedulers care about throughput
• Web apps care about minimizing latency and response time
You can't have both reduced response time and increased throughput without compromise, so separate the concerns for optimal tuning; see the illustrative launch lines below.
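For example, the two concerns could be launched with different collectors (illustrative launch lines using standard HotSpot flags; the main classes are hypothetical):

java -Xms4g -Xmx4g -XX:+UseParallelGC -XX:ParallelGCThreads=2 JobSchedulerMain
java -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC WebAppMain

The first favors throughput with a parallel stop-the-world collector; the second favors response time with concurrent Old Gen collection.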
Java Platform Categories – Category 2
Category 2: a dozen or so very large JVMs.
• Fewer JVMs, < 20; very large JVMs, 32GB to 128GB. Examples are in-memory databases like SQLFire and GemFire.
• Always deploy 1 VM per NUMA node and size it to fit perfectly; 1 JVM per VM.
• Choose 2-socket vSphere hosts to get larger NUMA nodes, and install ample memory, 128GB to 512GB.
• Apply latency-sensitive best practices: disable interrupt coalescing on the pNIC and vNIC.
• Use a dedicated vSphere cluster.
Java Platform Categories – Category 3
Category 3: Category 1 JVMs (managed via resource pools, e.g. Resource Pool 1: Gold LOB 1; Resource Pool 2: Silver LOB 2) accessing data from Category 2 in-memory data management systems.
Java Platforms Performance
Performance Perspective
See the Performance of Enterprise Java Applications on VMware vSphere 4.1 and SpringSource tc Server white paper at http://www.vmware.com/resources/techresources/10158.
[Chart: response time (R/T) and %CPU vs. load, with the 80% CPU threshold marked]
SQLFire vs. Traditional RDBMS
• SQLFire scaled 4x compared to the RDBMS
• Response times of SQLFire are 5x to 30x faster than the RDBMS
• Response times on SQLFire are more stable and constant with increased load; RDBMS response times increase with increased load
Load Testing SpringTrader Using Client-Server Topology
[Diagram: the SpringTrader application tier runs 4 Application Services plus SpringTrader Integration Services (integration patterns); the SpringTrader data tier runs SQLFire Member 1, SQLFire Member 2, and redundant locators]
vFabric Reference Architecture Scalability Test
[Chart: maximum passing users and scaling vs. number of Application Services VMs (1 to 4); the axes show scaling from 1 App Services VM (0.00 to 4.00) and number of users (0 to 12,000)]
With this topology: 10,400 user sessions.
10k Users Load Test Response Time
[Chart: operation 90th-percentile response time in seconds vs. number of users (0 to 12,000) with four Application Services VMs, covering HomePage, Register, Login, DashboardTab, PortfolioTab, TradeTab, GetHoldingsPage, GetOrdersPage, SellOrder, GetQuote, BuyOrder, Logout, and MarketSummary]
At 10,400 user sessions: approx. 0.25 seconds response time.
Java Platforms Best Practices and Tuning
Most Common VM Size for Java Workloads
• 2-vCPU VM with 1 JVM, for tier-1 production workloads
• Maintain this ratio as you scale out or scale up, i.e. 1 JVM : 2 vCPU
• Scale-out is preferred over scale-up, but both can work
• You can diverge from this ratio for less critical workloads
Building block: 2-vCPU VM, 1 JVM (-Xmx 4096m), approx 5GB RAM reservation; see the illustrative launch line below.
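A launch line for this building block might look like the following (illustrative; the main class is hypothetical, and the flag values match the earlier sizing example):

java -Xms4096m -Xmx4096m -XX:MaxPermSize=256m -Xss256k MyAppMain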
However for Large JVMs + CMS
• For large JVMs: start with a 4+ vCPU VM with 1 JVM (8-128GB heap), for tier-1 in-memory data management system type production workloads.
• Likely increase the JVM size, instead of launching a second JVM instance.
• A 4+ vCPU VM allows ParallelGCThreads to be allocated 50% of the vCPUs available to the JVM, i.e. 2+ GC threads.
• The ability to increase ParallelGCThreads is critical to YoungGen scalability for large JVMs.
• ParallelGCThreads should be allocated 50% of the vCPUs available to the JVM and not more; you want to leave the other vCPUs available for other transactions.
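For example, applying the 50% rule above: on an 8-vCPU VM set -XX:ParallelGCThreads=4, and on a 4-vCPU VM set -XX:ParallelGCThreads=2.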
Which GC?
ESXi doesn't care which GC you select, because of the degree of independence of Java from the OS, and of the OS from the hypervisor.
GC Policy Types
Serial GC
• Mark, sweep, and compact algorithm
• Both minor and full GC are stop-the-world
• Stop-the-world GC means the application is stopped while GC is executing
• Not a very scalable algorithm
• Suited for smaller (<200MB) JVMs like client machines
Throughput GC (Parallel GC)
• Similar to Serial GC, but uses multiple worker threads in parallel to increase throughput
• Both Young and Old Generation collection are multithreaded, but still stop-the-world
• The number of threads is allocated by -XX:ParallelGCThreads=<nThreads>
• NOT concurrent, meaning when the GC worker threads run, they pause your application threads. If this is a problem, move to CMS, where GC threads are concurrent.
Concurrent GC (CMS)
• Concurrent mark and sweep, no compaction
• Concurrent implies that when GC is running it doesn't pause your application threads; this is the key difference from throughput/parallel GC
• Suited for applications that care more about response time than throughput
• CMS does use more heap when compared to throughput/parallel GC
• CMS works on the Old Gen concurrently, but the Young Generation is collected using ParNewGC, a version of the throughput collector
• Has multiple phases: initial mark (short pause), concurrent mark (no pause), pre-cleaning (no pause), re-mark (short pause), concurrent sweeping (no pause)
G1
• Only in Java 7 and mostly experimental; equivalent to CMS + compacting
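For reference, the standard HotSpot flags that select each policy (for JVMs of this era) are:
• Serial GC: -XX:+UseSerialGC
• Throughput GC: -XX:+UseParallelGC (add -XX:+UseParallelOldGC for multithreaded Old Gen collection)
• Concurrent GC (CMS): -XX:+UseConcMarkSweepGC, with -XX:+UseParNewGC for the Young Gen
• G1: -XX:+UseG1GC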
Tuning GC – Art Meets Science!
You either tune for throughput or for latency, one at the cost of the other. Tuning decisions:
• Reduce latency (web workloads): improved R/T, reduced latency impact, slightly reduced throughput
• Increase throughput (job workloads): improved throughput, longer R/T, increased latency impact
Parallel Young Gen and CMS Old Gen
[Diagram: application threads run alongside minor GC threads and concurrent mark-and-sweep GC threads]
• Young Generation minor GC: parallel GC in the YoungGen (-Xmn, including survivor spaces S0 and S1) using -XX:+UseParNewGC and -XX:ParallelGCThreads
• Old Generation major GC: concurrent collection in the OldGen (Xmx minus Xmn) using -XX:+UseConcMarkSweepGC
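For example, with -Xmx30g and -Xmn10g, the YoungGen (Eden plus the S0/S1 survivor spaces) occupies 10g and the OldGen occupies the remaining 20g.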
High Level GC Tuning Recipe
• Step A – Young Gen tuning: measure minor GC duration and frequency; adjust -Xmn (Young Gen size) and/or ParallelGCThreads
• Step B – Old Gen tuning: measure major GC duration and frequency; adjust heap space (-Xmx)
• Step C – Survivor spaces tuning: adjust -Xmn and/or the survivor spaces
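One way to capture these measurements is HotSpot GC logging (standard flags of this era; the main class is hypothetical), then read the minor/major GC durations and frequencies from the log:

java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log MyAppMain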
CMS Collector Example
java -Xms30g -Xmx30g -Xmn10g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+ScavengeBeforeFullGC -XX:TargetSurvivorRatio=80 -XX:SurvivorRatio=8
-XX:+UseBiasedLocking -XX:MaxTenuringThreshold=15 -XX:ParallelGCThreads=4
-XX:+UseCompressedOops -XX:+OptimizeStringConcat -XX:+UseCompressedStrings
-XX:+UseStringCache
This JVM configuration scales up and down effectively:
• -Xmx = -Xms, and -Xmn is 33% of -Xmx
• -XX:ParallelGCThreads = a minimum of 2, but less than 50% of the vCPUs available to the JVM. NOTE: ideally use it on VMs of 4 vCPUs and up; if used on 2-vCPU VMs, drop the -XX:ParallelGCThreads option and let Java select it
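Per the note above, a 2-vCPU variant (an illustrative sketch with a smaller, assumed heap) would simply omit -XX:ParallelGCThreads:

java -Xms4g -Xmx4g -Xmn1365m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly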
IBM JVM – GC Choice
-Xgcpolicy:optthruput (default)
• Performs the mark and sweep operations during garbage collection while the application is paused, to maximize application throughput. Mostly not suitable for multi-CPU machines.
• Suited for: apps that demand high throughput but are not very sensitive to the occasional long garbage collection pause.
-Xgcpolicy:optavgpause
• Performs the mark and sweep concurrently while the application is running, to minimize pause times; this provides the best application response times. There is still a stop-the-world GC, but the pause is significantly shorter. After GC, the app threads help out and sweep objects (concurrent sweep).
• Suited for: apps sensitive to long latencies, e.g. transaction-based systems where response times are expected to be stable.
-Xgcpolicy:gencon
• Treats short-lived and long-lived objects differently to provide a combination of lower pause times and high application throughput. Before the heap fills up, each app thread helps out and marks objects (concurrent mark).
• Suited for: latency-sensitive apps where objects in a transaction don't survive beyond the transaction commit.
Middleware on VMware – Best Practices
• Enterprise Java Applications on VMware Best Practices Guide: http://www.vmware.com/resources/techresources/1087
• Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs: http://www.vmware.com/resources/techresources/10220
• vFabric SQLFire Best Practices Guide: http://www.vmware.com/resources/techresources/10327
• vFabric Reference Architecture: http://tinyurl.com/cjkvftt
Middleware on VMware – Best Practices Summary
• Follow the design and sizing examples we discussed thus far
• Set appropriate memory reservations
• Leave HT enabled; size based on vCPU = 1.25 pCPU if needed
• RHEL 6 and SLES 11 SP1 have a tickless kernel that does not rely on a high-frequency interrupt-based timer, and is therefore much friendlier to virtualized latency-sensitive workloads
• Do not overcommit memory
• Locator/heartbeat processes should not be vMotion® migrated; otherwise it could lead to network split-brain problems
• vMotion over 10Gbps when doing scheduled maintenance
• Use affinity and anti-affinity rules to avoid redundant copies on the same VMware ESX®/ESXi host
Middleware on VMware – Best Practices
• Disable NIC interrupt coalescing on the physical and virtual NIC; this is extremely helpful in reducing latency for latency-sensitive virtual machines.
• Disable virtual interrupt coalescing for VMXNET3. It can lead to some performance penalties for other virtual machines on the ESXi host, as well as higher CPU utilization to deal with the higher rate of interrupts from the physical NIC.
• This implies it is best to use a dedicated ESXi cluster for middleware platforms: all hosts are configured the same way for latency sensitivity, and this ensures non-middleware workloads, such as other enterprise applications, are not negatively impacted. This is applicable to Category 2 workloads.
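As a sketch of the coalescing settings: the .vmx option below is the documented way to disable virtual interrupt coalescing for VMXNET3, while the physical-NIC setting is driver-specific (shown here for an ixgbe-based pNIC, an assumption about the hardware):

ethernet0.coalescingScheme = "disabled"
esxcli system module parameters set -m ixgbe -p "InterruptThrottleRate=0"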
Middleware on VMware – Benefits
• Flexibility to change compute resources, VM sizes, and add more hosts
• Ability to apply hardware and OS patches while minimizing downtime
• Create a more manageable system through reduced middleware sprawl
• Ability to tune the entire stack within one platform
• Ability to monitor the entire stack within one platform
• Ability to handle seasonal workloads: commit resources when they are needed, and remove them when they are not
Customer Success Stories
NewEdge
• Virtualized GemFire workload
• Multiple geographic active-active datacenters
• Multiple terabytes of data kept in memory
• 1000s of transactions per second
• Multiple vSphere clusters, each with 4 vSphere hosts and 8 large 98GB+ JVMs
http://www.vmware.com/files/pdf/customers/VMware-Newedge-12Q4-EN-Case-Study.pdf
Cardinal Health Virtualization Journey
2005 – 2008: Consolidation (theme: Centralized IT Shared Service)
• < 40% virtual; < 2,000 VMs; < 2,355 physical
• Data center optimization: 30 DCs to 2 DCs; transition to blades
• < 10% utilization; < 10:1 VM/physical
• Low criticality systems; 8x5 applications
2009 – 2011: Internal cloud (theme: Capital Intensive, High Response)
• > 58% virtual; > 3,852 VMs; < 3,049 physical
• Power remediation; P2Vs on refresh; HW commoditization
• 15% utilization; 30:1 VM/physical
• Business critical systems: SAP ~ 382, WebSphere ~ 290, Unix to Linux ~ 655
2012 – 2015: Cloud resources (theme: Variable Cost, Subscription Services)
• > 90% virtual; > 8,000 VMs; < 800 physical
• Optimizing DCs; internal disaster recovery; metered service offerings (SaaS, PaaS, IaaS)
• Shrinking HW footprint; > 50% utilization; > 60:1 VM/physical
• Heavy lifting systems: database servers
Why Virtualize WebSphere on VMware
DC strategy alignment:
• Pooled resources capacity (~15% utilization)
• Elasticity for changing workloads
• Unix to Linux
• Disaster recovery
Simplification and manageability:
• High availability for thousands of instances, instead of thousands of high availability solutions
• Network & system management in the DMZ
Five year cost savings ~ $6 million:
• Hardware savings ~ $660K
• WAS licensing ~ $862K
• Unix to Linux ~ $3.7M
• DMZ ports ~ >$1M
Thank You, and Are There Any Questions?
Emad Benjamin
You can get the book here: https://www.createspace.com/3632131
Second Book
Emad Benjamin
Preview chapter available at the VMworld bookstore.
You can get the book here:
• Safari: http://tinyurl.com/lj8dtjr
• Later on Amazon: http://tinyurl.com/kez9trj
Other VMware Activities Related to This Session
• HOL: HOL-SDC-1304 – vSphere Performance Optimization
• Group Discussions: VAPP1010-GD – Java with Emad Benjamin
VAPP4536
THANK YOU
Virtualizing and Tuning Large Scale Java Platforms
Emad Benjamin, VMware
VAPP4536
#VAPP4536