Virtualizing and Tuning Large Scale Java Platforms
Emad Benjamin, VMware
VAPP4536
#VAPP4536
About the Speaker
I have been with VMware for the last 8 years, working on Java and vSphere
20 years of experience as a Software Engineer/Architect, with the last 15 years focused on Java development
Open source contributions
Prior work with Cisco, Oracle, and Banking/Trading Systems
Authored the following books:
• Virtualizing and Tuning Large Scale Java Platforms
• Enterprise Java Applications Architecture on VMware
Disclaimer
This session may contain product features that are currently under development.
This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.
Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Technical feasibility and market demand will affect final delivery.
Pricing and packaging for any new technologies or features discussed or presented have not been determined.
Agenda
Overview
Design and Sizing Java Platforms
Performance
Best Practices and Tuning
Customer Success Stories
Questions
Java Platforms Overview
Conventional Java Platforms
Java platforms are multitier and multi-org.
[Diagram: the platform spans a Load Balancer Tier, Web Server Tier, Java App Tier, and DB Server Tier, mapped to the organizational key stakeholder departments: IT Operations Network Team (load balancers), IT Operations Server Team (web servers), IT Apps – Java Dev Team (Java applications), and IT Ops & Apps Dev Team (DB servers)]
Middleware Platform Architecture on vSphere
[Diagram: load balancers, web servers, Java application servers, and DB servers run as VMs (application services) on VMware vSphere, a shared, always-on infrastructure. Shared infrastructure services provide capacity on demand, high availability, and dynamic scaling, yielding high-uptime, scalable, and dynamic enterprise Java applications]
Java Platforms Design and Sizing
Design and Sizing of Java Platforms on vSphere
Step 1 – Establish Load Profile
From production logs/monitoring reports, measure:
• Concurrent Users
• Requests Per Second
• Peak Response Time
• Average Response Time
Establish your response time SLA.
Step 2 – Establish Benchmark
Iterate through benchmark tests until you are satisfied with the load profile metrics and your intended SLA. After each benchmark iteration you may have to adjust the application configuration, and adjust the vSphere environment to scale out/up, in order to arrive at your desired number of VMs and the vCPU and RAM configuration of each.
Step 3 – Size Production Environment
The size of the production environment will have been established in Step 2; either roll out the environment from Step 2, or build a new one based on the numbers established there.
Step 2 – Establish Benchmark
Establish the building block VM (vertical scalability): run a scale-up test to establish how many JVMs fit on one VM, and how large the VM should be in terms of vCPU and memory.
Determine how many VMs (horizontal scalability): run a scale-out test, adding building block VMs until you meet your response time SLAs without exceeding 70%-80% CPU saturation. Establish your horizontal scalability factor before bottlenecks appear in your application.
After each scale-out iteration, check whether the SLA is met. If yes, the test is complete. If not, investigate the bottlenecked layer (network, storage, application configuration, or vSphere):
• If the scale-out bottlenecked layer is removed, iterate the scale-out test.
• If it is a building block app/VM configuration problem, adjust and iterate.
Design and Sizing HotSpot JVMs on vSphere
[Diagram: VM Memory = Guest OS Memory + JVM Memory. JVM Memory comprises the heap (initial heap -Xms, up to JVM max heap -Xmx), the Perm Gen (-XX:MaxPermSize), Java stacks (-Xss per thread), and other memory: direct native memory ("off-the-heap") and non-direct memory]
Design and Sizing of HotSpot JVMs on vSphere
Guest OS Memory: approx 1GB (depends on OS/other processes)
Perm Size is an area additional to the -Xmx (Max Heap) value and is not GC-ed because it contains class-level information.
"Other mem" is additional memory required for NIO buffers, the JIT code cache, classloaders, socket buffers (receive/send), JNI, and GC internal info.
JVM Memory = JVM Max Heap (-Xmx value) + JVM Perm Size (-XX:MaxPermSize) + NumberOfConcurrentThreads * (-Xss) + "other mem"
VM Memory = Guest OS Memory + JVM Memory
If you have multiple JVMs (N JVMs) on a VM, then:
• VM Memory = Guest OS Memory + N * JVM Memory
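As a back-of-the-envelope check, the formula can be expressed in a few lines of Java (a minimal sketch; the class and variable names are illustrative, with values taken from the sizing example that follows):

public class JvmMemorySizing {
    public static void main(String[] args) {
        int maxHeapMB  = 4096;   // -Xmx
        int permSizeMB = 256;    // -XX:MaxPermSize
        double stackMB = 0.25;   // -Xss of 256k per thread
        int threads    = 100;    // number of concurrent threads
        int otherMemMB = 217;    // NIO buffers, JIT code cache, JNI, GC internals
        int guestOsMB  = 500;    // guest OS and other processes

        double jvmMemoryMB = maxHeapMB + permSizeMB + threads * stackMB + otherMemMB;
        double vmMemoryMB  = guestOsMB + jvmMemoryMB;
        // Prints roughly the ~4.5GB JVM memory and ~5GB VM memory
        // figures used in the sizing example below
        System.out.printf("JVM memory: %.0fm, VM memory: %.0fm%n", jvmMemoryMB, vmMemoryMB);
    }
}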
Sizing Example
• JVM Max Heap (-Xmx): 4096m; Initial Heap (-Xms): 4096m
• Perm Gen (-XX:MaxPermSize): 256m
• Java stacks: -Xss 256k * 100 threads
• Other mem: approx 217m
• JVM Memory: approx 4588m
• Guest OS Memory: approx 500m used by OS
• VM Memory: approx 5088m; set the memory reservation to 5088m
Larger JVMs for In-Memory Data Management Systems
• JVM Max Heap (-Xmx): 30g; Initial Heap (-Xms): 30g
• Perm Gen (-XX:MaxPermSize): 0.5g
• Java stacks: -Xss 1M * 500 threads
• Other mem: approx 1g
• JVM Memory for SQLFire: approx 32g
• Guest OS Memory: approx 0.5-1g used by OS
• VM Memory for SQLFire: approx 34g; set the memory reservation to 34g
NUMA Local Memory with Overhead Adjustment
NUMA local memory = (Physical RAM on vSphere host * (1 - 1% RAM overhead per VM * number of VMs on vSphere host) - vSphere RAM overhead) / number of sockets on vSphere host
Middleware ESXi Cluster
Example: each vSphere host has 96GB RAM and 2 sockets with 8 pCPU per socket, running middleware component VMs of 8 vCPU and 47GB RAM each:
• Memory available for all VMs = 96 * 0.98 - 1GB => approx 94GB
• Per-NUMA-node memory => 94 / 2 => 47GB
The locator/heartbeat process for the middleware runs separately; DO NOT vMotion it.
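The same arithmetic as a minimal Java sketch (illustrative names; assumes approx 1% RAM overhead per VM and a fixed 1GB vSphere overhead, as in the example above):

public class NumaLocalMemory {
    public static void main(String[] args) {
        double physicalRamGB     = 96;    // RAM on the vSphere host
        int    vmsOnHost         = 2;     // one VM per NUMA node
        double overheadPerVm     = 0.01;  // approx 1% RAM overhead per VM
        double vsphereOverheadGB = 1;     // fixed hypervisor overhead
        int    sockets           = 2;

        double availableGB   = physicalRamGB * (1 - vmsOnHost * overheadPerVm) - vsphereOverheadGB;
        double perNumaNodeGB = availableGB / sockets;
        // Prints approximately 47 for this host configuration
        System.out.printf("Per NUMA node: ~%.0fGB%n", perNumaNodeGB);
    }
}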
With 96GB RAM on the server, each NUMA node has 94/2 => 47GB. The ESXi scheduler keeps 8-vCPU VMs with less than 47GB RAM each local to one NUMA node. If a VM is sized greater than 47GB or 8 CPUs, NUMA interleaving occurs and can cause an approx 30% drop in memory throughput performance.
On a server with 128GB RAM, the ESXi scheduler keeps 2-vCPU VMs with less than 20GB RAM each local to one NUMA node, while a 4-vCPU VM with 40GB RAM is split by ESXi into 2 NUMA clients (available in ESX 4.1 and later).
Java Platform Categories – Category 1
Category 1: 100s to 1000s of JVMs. Smaller JVMs: < 4GB heap, 4.5GB Java process, and 5GB for the VM.
vSphere hosts with < 96GB RAM are more suitable, as by the time you stack the many JVM instances you are likely to reach the CPU boundary before you can consume all of the RAM. For example, if instead you chose a vSphere host with 256GB RAM, then 256/4.5GB => 57 JVMs, which would clearly hit the CPU boundary first.
Multiple JVMs per VM. Use resource pools to manage different LOBs (e.g., Resource Pool 1: Gold LOB 1; Resource Pool 2: Silver LOB 2).
Use 4-socket servers to get more cores.
Most Common Sizing and Configuration Question
Starting from 2-vCPU VMs each running 2GB JVMs, how should you add capacity?
• Option 1 – Scale out VM and JVM (best): add more 2-vCPU/2GB building block VMs (JVM-1, JVM-2, JVM-3, JVM-4; 2GB and 2 vCPU each).
• Option 2 – Scale up JVM heap size (2nd best): keep the 2-vCPU VMs but grow each heap from 2GB to 4GB (JVM-1A, JVM-2A).
• Option 3 – Scale up VM and JVM (3rd best): grow a single VM's vCPUs and stack multiple JVMs (JVM-1, JVM-2) on it.
What Else to Consider When Sizing?
Mixed workloads, e.g. a job scheduler vs. a web app (whether scaled vertically within JVMs or horizontally across JVM-1 through JVM-4), require different GC tuning:
• Job schedulers care about throughput
• Web apps care about minimizing latency and response time
You can't have both reduced response time and increased throughput without compromise, so separate the concerns for optimal tuning; see the illustrative launch lines below.
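For example, the two concerns could be launched with different collectors (illustrative launch lines using standard HotSpot flags; the main classes are hypothetical):

java -Xms4g -Xmx4g -XX:+UseParallelGC -XX:ParallelGCThreads=2 JobSchedulerMain
java -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC WebAppMain

The first favors throughput with a parallel stop-the-world collector; the second favors response time with concurrent Old Gen collection.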
Java Platform Categories – Category 2
Category 2: a dozen or so very large JVMs.
• Fewer JVMs, < 20; very large JVMs, 32GB to 128GB. Examples are in-memory databases like SQLFire and GemFire.
• Always deploy 1 VM per NUMA node and size it to fit perfectly; 1 JVM per VM.
• Choose 2-socket vSphere hosts to get larger NUMA nodes, and install ample memory, 128GB to 512GB.
• Apply latency-sensitive best practices: disable interrupt coalescing on the pNIC and vNIC.
• Use a dedicated vSphere cluster.
Java Platform Categories – Category 3
Category 3: Category 1 JVMs (managed via resource pools, e.g. Resource Pool 1: Gold LOB 1; Resource Pool 2: Silver LOB 2) accessing data from Category 2 in-memory data management systems.
Java Platforms Performance
Performance Perspective
See the Performance of Enterprise Java Applications on VMware vSphere 4.1 and SpringSource tc Server white paper at http://www.vmware.com/resources/techresources/10158.
[Chart: response time (R/T) and %CPU vs. load, with the 80% CPU threshold marked]
SQLFire vs. Traditional RDBMS
• SQLFire scaled 4x compared to the RDBMS
• Response times of SQLFire are 5x to 30x faster than the RDBMS
• Response times on SQLFire are more stable and constant with increased load; RDBMS response times increase with increased load
Load Testing SpringTrader Using Client-Server Topology
[Diagram: the SpringTrader application tier runs 4 Application Services plus SpringTrader Integration Services (integration patterns); the SpringTrader data tier runs SQLFire Member 1, SQLFire Member 2, and redundant locators]
vFabric Reference Architecture Scalability Test
[Chart: maximum passing users and scaling vs. number of Application Services VMs (1 to 4); the axes show scaling from 1 App Services VM (0.00 to 4.00) and number of users (0 to 12,000)]
With this topology: 10,400 user sessions.
10k Users Load Test Response Time
[Chart: operation 90th-percentile response time in seconds vs. number of users (0 to 12,000) with four Application Services VMs, covering HomePage, Register, Login, DashboardTab, PortfolioTab, TradeTab, GetHoldingsPage, GetOrdersPage, SellOrder, GetQuote, BuyOrder, Logout, and MarketSummary]
At 10,400 user sessions: approx. 0.25 seconds response time.
Java Platforms Best Practices and Tuning
Most Common VM Size for Java Workloads
• 2-vCPU VM with 1 JVM, for tier-1 production workloads
• Maintain this ratio as you scale out or scale up, i.e. 1 JVM : 2 vCPU
• Scale-out is preferred over scale-up, but both can work
• You can diverge from this ratio for less critical workloads
Building block: 2-vCPU VM, 1 JVM (-Xmx 4096m), approx 5GB RAM reservation; see the illustrative launch line below.
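A launch line for this building block might look like the following (illustrative; the main class is hypothetical, and the flag values match the earlier sizing example):

java -Xms4096m -Xmx4096m -XX:MaxPermSize=256m -Xss256k MyAppMain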
However for Large JVMs + CMS
• For large JVMs: start with a 4+ vCPU VM with 1 JVM (8-128GB heap), for tier-1 in-memory data management system type production workloads.
• Likely increase the JVM size, instead of launching a second JVM instance.
• A 4+ vCPU VM allows ParallelGCThreads to be allocated 50% of the vCPUs available to the JVM, i.e. 2+ GC threads.
• The ability to increase ParallelGCThreads is critical to YoungGen scalability for large JVMs.
• ParallelGCThreads should be allocated 50% of the vCPUs available to the JVM and not more; you want to leave the other vCPUs available for other transactions.
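For example, applying the 50% rule above: on an 8-vCPU VM set -XX:ParallelGCThreads=4, and on a 4-vCPU VM set -XX:ParallelGCThreads=2.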
Which GC?
ESXi doesn't care which GC you select, because of the degree of independence of Java from the OS, and of the OS from the hypervisor.
GC Policy Types
Serial GC
• Mark, sweep, and compact algorithm
• Both minor and full GC are stop-the-world
• Stop-the-world GC means the application is stopped while GC is executing
• Not a very scalable algorithm
• Suited for smaller (<200MB) JVMs like client machines
Throughput GC (Parallel GC)
• Similar to Serial GC, but uses multiple worker threads in parallel to increase throughput
• Both Young and Old Generation collection are multithreaded, but still stop-the-world
• The number of threads is allocated by -XX:ParallelGCThreads=<nThreads>
• NOT concurrent, meaning when the GC worker threads run, they pause your application threads. If this is a problem, move to CMS, where GC threads are concurrent.
Concurrent GC (CMS)
• Concurrent mark and sweep, no compaction
• Concurrent implies that when GC is running it doesn't pause your application threads; this is the key difference from throughput/parallel GC
• Suited for applications that care more about response time than throughput
• CMS does use more heap when compared to throughput/parallel GC
• CMS works on the Old Gen concurrently, but the Young Generation is collected using ParNewGC, a version of the throughput collector
• Has multiple phases: initial mark (short pause), concurrent mark (no pause), pre-cleaning (no pause), re-mark (short pause), concurrent sweeping (no pause)
G1
• Only in Java 7 and mostly experimental; equivalent to CMS + compacting
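For reference, the standard HotSpot flags that select each policy (for JVMs of this era) are:
• Serial GC: -XX:+UseSerialGC
• Throughput GC: -XX:+UseParallelGC (add -XX:+UseParallelOldGC for multithreaded Old Gen collection)
• Concurrent GC (CMS): -XX:+UseConcMarkSweepGC, with -XX:+UseParNewGC for the Young Gen
• G1: -XX:+UseG1GC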
Tuning GC – Art Meets Science!
You either tune for throughput or for latency, one at the cost of the other. Tuning decisions:
• Reduce latency (web workloads): improved R/T, reduced latency impact, slightly reduced throughput
• Increase throughput (job workloads): improved throughput, longer R/T, increased latency impact
Parallel Young Gen and CMS Old Gen
[Diagram: application threads run alongside minor GC threads and concurrent mark-and-sweep GC threads]
• Young Generation minor GC: parallel GC in the YoungGen (-Xmn, including survivor spaces S0 and S1) using -XX:+UseParNewGC and -XX:ParallelGCThreads
• Old Generation major GC: concurrent collection in the OldGen (Xmx minus Xmn) using -XX:+UseConcMarkSweepGC
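For example, with -Xmx30g and -Xmn10g, the YoungGen (Eden plus the S0/S1 survivor spaces) occupies 10g and the OldGen occupies the remaining 20g.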
High Level GC Tuning Recipe
• Step A – Young Gen tuning: measure minor GC duration and frequency; adjust -Xmn (Young Gen size) and/or ParallelGCThreads
• Step B – Old Gen tuning: measure major GC duration and frequency; adjust heap space (-Xmx)
• Step C – Survivor spaces tuning: adjust -Xmn and/or the survivor spaces
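One way to capture these measurements is HotSpot GC logging (standard flags of this era; the main class is hypothetical), then read the minor/major GC durations and frequencies from the log:

java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log MyAppMain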
CMS Collector Example
java -Xms30g -Xmx30g -Xmn10g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+ScavengeBeforeFullGC -XX:TargetSurvivorRatio=80 -XX:SurvivorRatio=8
-XX:+UseBiasedLocking -XX:MaxTenuringThreshold=15 -XX:ParallelGCThreads=4
-XX:+UseCompressedOops -XX:+OptimizeStringConcat -XX:+UseCompressedStrings
-XX:+UseStringCache
This JVM configuration scales up and down effectively:
• -Xmx = -Xms, and -Xmn is 33% of -Xmx
• -XX:ParallelGCThreads = a minimum of 2, but less than 50% of the vCPUs available to the JVM. NOTE: ideally use it on VMs of 4 vCPUs and up; if used on 2-vCPU VMs, drop the -XX:ParallelGCThreads option and let Java select it
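Per the note above, a 2-vCPU variant (an illustrative sketch with a smaller, assumed heap) would simply omit -XX:ParallelGCThreads:

java -Xms4g -Xmx4g -Xmn1365m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly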
IBM JVM – GC Choice
-Xgcpolicy:optthruput (default)
• Performs the mark and sweep operations during garbage collection while the application is paused, to maximize application throughput. Mostly not suitable for multi-CPU machines.
• Suited for: apps that demand high throughput but are not very sensitive to the occasional long garbage collection pause.
-Xgcpolicy:optavgpause
• Performs the mark and sweep concurrently while the application is running, to minimize pause times; this provides the best application response times. There is still a stop-the-world GC, but the pause is significantly shorter. After GC, the app threads help out and sweep objects (concurrent sweep).
• Suited for: apps sensitive to long latencies, e.g. transaction-based systems where response times are expected to be stable.
-Xgcpolicy:gencon
• Treats short-lived and long-lived objects differently to provide a combination of lower pause times and high application throughput. Before the heap fills up, each app thread helps out and marks objects (concurrent mark).
• Suited for: latency-sensitive apps where objects in a transaction don't survive beyond the transaction commit.
Middleware on VMware – Best Practices
• Enterprise Java Applications on VMware Best Practices Guide: http://www.vmware.com/resources/techresources/1087
• Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs: http://www.vmware.com/resources/techresources/10220
• vFabric SQLFire Best Practices Guide: http://www.vmware.com/resources/techresources/10327
• vFabric Reference Architecture: http://tinyurl.com/cjkvftt
Middleware on VMware – Best Practices Summary
• Follow the design and sizing examples we discussed thus far
• Set appropriate memory reservations
• Leave HT enabled; size based on vCPU = 1.25 pCPU if needed
• RHEL 6 and SLES 11 SP1 have a tickless kernel that does not rely on a high-frequency interrupt-based timer, and is therefore much friendlier to virtualized latency-sensitive workloads
• Do not overcommit memory
• Locator/heartbeat processes should not be vMotion® migrated; otherwise it could lead to network split-brain problems
• vMotion over 10Gbps when doing scheduled maintenance
• Use affinity and anti-affinity rules to avoid redundant copies on the same VMware ESX®/ESXi host
Middleware on VMware – Best Practices
• Disable NIC interrupt coalescing on the physical and virtual NIC; this is extremely helpful in reducing latency for latency-sensitive virtual machines.
• Disable virtual interrupt coalescing for VMXNET3. It can lead to some performance penalties for other virtual machines on the ESXi host, as well as higher CPU utilization to deal with the higher rate of interrupts from the physical NIC.
• This implies it is best to use a dedicated ESXi cluster for middleware platforms: all hosts are configured the same way for latency sensitivity, and this ensures non-middleware workloads, such as other enterprise applications, are not negatively impacted. This is applicable to Category 2 workloads.
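As a sketch of the coalescing settings: the .vmx option below is the documented way to disable virtual interrupt coalescing for VMXNET3, while the physical-NIC setting is driver-specific (shown here for an ixgbe-based pNIC, an assumption about the hardware):

ethernet0.coalescingScheme = "disabled"
esxcli system module parameters set -m ixgbe -p "InterruptThrottleRate=0"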
Middleware on VMware – Benefits
• Flexibility to change compute resources, VM sizes, and add more hosts
• Ability to apply hardware and OS patches while minimizing downtime
• Create a more manageable system through reduced middleware sprawl
• Ability to tune the entire stack within one platform
• Ability to monitor the entire stack within one platform
• Ability to handle seasonal workloads: commit resources when they are needed, and remove them when they are not
Customer Success Stories
NewEdge
• Virtualized GemFire workload
• Multiple geographic active-active datacenters
• Multiple terabytes of data kept in memory
• 1000s of transactions per second
• Multiple vSphere clusters, each with 4 vSphere hosts and 8 large 98GB+ JVMs
http://www.vmware.com/files/pdf/customers/VMware-Newedge-12Q4-EN-Case-Study.pdf
Cardinal Health Virtualization Journey
2005 – 2008: Consolidation (theme: Centralized IT Shared Service)
• < 40% virtual; < 2,000 VMs; < 2,355 physical
• Data center optimization: 30 DCs to 2 DCs; transition to blades
• < 10% utilization; < 10:1 VM/physical
• Low criticality systems; 8x5 applications
2009 – 2011: Internal cloud (theme: Capital Intensive, High Response)
• > 58% virtual; > 3,852 VMs; < 3,049 physical
• Power remediation; P2Vs on refresh; HW commoditization
• 15% utilization; 30:1 VM/physical
• Business critical systems: SAP ~ 382, WebSphere ~ 290, Unix to Linux ~ 655
2012 – 2015: Cloud resources (theme: Variable Cost, Subscription Services)
• > 90% virtual; > 8,000 VMs; < 800 physical
• Optimizing DCs; internal disaster recovery; metered service offerings (SaaS, PaaS, IaaS)
• Shrinking HW footprint; > 50% utilization; > 60:1 VM/physical
• Heavy lifting systems: database servers
Why Virtualize WebSphere on VMware
DC strategy alignment:
• Pooled resources capacity (~15% utilization)
• Elasticity for changing workloads
• Unix to Linux
• Disaster recovery
Simplification and manageability:
• High availability for thousands of instances, instead of thousands of high availability solutions
• Network & system management in the DMZ
Five year cost savings ~ $6 million:
• Hardware savings ~ $660K
• WAS licensing ~ $862K
• Unix to Linux ~ $3.7M
• DMZ ports ~ >$1M
Thank You, and Are There Any Questions?
Emad Benjamin
You can get the book here: https://www.createspace.com/3632131
Second Book
Emad Benjamin
Preview chapter available at the VMworld bookstore.
You can get the book here:
• Safari: http://tinyurl.com/lj8dtjr
• Later on Amazon: http://tinyurl.com/kez9trj
Other VMware Activities Related to This Session
• HOL: HOL-SDC-1304 – vSphere Performance Optimization
• Group Discussions: VAPP1010-GD – Java with Emad Benjamin
VAPP4536
THANK YOU
Virtualizing and Tuning Large Scale Java Platforms
Emad Benjamin, VMware
VAPP4536
#VAPP4536