Extreme Performance Series - Understanding Applications that Require Extra TLC for Better Performance on vSphere - Deep Dive
VAPP2305
Vishnu Mohan, VMware, Inc Reza Taheri, VMware, Inc
CONFIDENTIAL 2
Disclaimer
• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.
VMware Customer Survey – BCA Workloads Virtualized
• Microsoft Exchange, Microsoft SQL, and Microsoft SharePoint are the top 3 virtualized applications; Oracle and SAP workloads are gaining ground as well.
Percentage of deployed instances virtualized:

Workload               Jan 2010   Jun 2011   Mar 2012   Jun 2013
Microsoft Exchange       38%        41%        47%        60%
Microsoft SharePoint     53%        56%        57%        58%
Microsoft SQL            43%        47%        52%        59%
Oracle Middleware        25%        34%        41%        51%
Oracle DB                25%        28%        35%        49%
SAP                      18%        28%        40%        53%

Source: VMware customer survey, Jan 2010, Jun 2011, Mar 2012, Jun 2013. Question: total number of instances of that workload deployed in your organization and the percentage of those instances that are virtualized.
Our Performance History – Application Performance Requirements

95% of applications require: <10,000 IOPS; <2.4 Mb/s network; <4 GB memory at peak; 1 to 2 CPUs.

Per-VM capability by release:

Release               IOPS         Network     Memory     vCPUs
ESX 1                 <5,000       <0.5 Gb/s   2 GB       1
ESX 2                 7,000        0.9 Gb/s    3.6 GB     2
ESX 3.0/3.5           100,000      9 Gb/s      64 GB      4
VMware vSphere 4      300,000      30 Gb/s     256 GB     8
VMware vSphere 5.5    1,000,000    >40 Gb/s    1,000 GB   64
TPC-VMS (Database Consolidation) - Virtual
• HP ProLiant DL385 G8, three consolidated OLTP instances
• Virtual VMStpsE throughput scores per instance: 457.55, 468.11, 470.31
TPC-E - Physical
• HP ProLiant DL385 G8, single OLTP instance
• Native tpsE throughput score: 1416.37
Outline of This Talk
• We will present 5 performance-sensitive application classes
  – I/O-intensive applications
    • Bandwidth
    • Latency
  – Applications that incur high memory address translation costs
  – Applications that demonstrate a high rate of sleep-wakeup cycles
  – Applications that bang on the timer
  – Latency-sensitive applications
• We will show how vSphere handles the demands of these applications
  – We summarize what is required to achieve good performance for these classes
    • Best practices
    • Tuning recommendations
    • Nothing replaces good engineering judgment!
  – We also list remaining challenges for each class
Scope of This Presentation
• Data collected from multiple sources
  – Customers
  – In-house performance stress tests
• Work in progress
• Focus is on the compute side
  – vSAN, vNAS, NSX pose their own challenges
• Focusing on application classes that require extra TLC for better performance when virtualized
  – Not looking for vSphere bugs!
  – The same issues are present on any hypervisor
• A qualitative analysis
What This Talk Is NOT About
• Let's not talk about:
  – Creating many VMs and putting their VMDKs on the same physical disk
  – Not upgrading to vSphere 5.5, virtualHW.version=10, VMXNET3, PVSCSI, etc.
  – Not following best practices
  – Not following good engineering sense!
  – vSphere bugs that have already been reported
• Having said that, there are gray areas in our definitions
  – If an application achieves good performance by setting a tunable, is it still a fundamentally challenging application for virtualization?
Herculean IO
• More than 1 million IOPS from 1 VM
  – Hypervisor: vSphere 5.1
  – Server: HP DL380 Gen8
  – CPU: 2 x Intel Xeon E5-2690, HT disabled
  – Memory: 256 GB
  – HBAs: 5 x QLE2562
  – Storage: 2 x Violin Memory 6616 flash arrays
  – VM: Windows Server 2008 R2, 8 vCPUs and 48 GB
  – Iometer config: 4K I/O size with 16 workers
• http://blogs.vmware.com/performance/2012/08/1millioniops-on-1vm.html
• http://www.vcritical.com/2012/08/who-needs-a-million-iops-for-a-single-vm
Network Throughput • Demonstrated 80 Gbps
• Reference: http://blogs.vmware.com/performance/2013/09/line-rate-performance-with-80gbe-and-vsphere-5-5-2.html
1M Disk IOPS
• A storage (or networking) I/O has to go through both the guest and the host stacks
• The cost of this scales nearly linearly with throughput
  – We measure 44 microseconds of CPU processing cost per storage I/O
    • At 1M IOPS, that's 44 cores' worth of CPU usage!
  – In practice, coalescing and economies of scale cut this cost down drastically
• 1M IOPS blog at http://blogs.vmware.com/performance/2012/08/1millioniops-on-1vm.html – 8 vCPU Windows guest maxed out – VMKernel also used around 8 cores worth of processing power – Careful benchmark, scheduler, and system-level tuning in both the guest and the ESX host
• Specific tuning is very benchmark and configuration dependent
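The per-I/O cost arithmetic above can be sanity-checked with a small helper; a minimal sketch, where the 44 µs/IO figure comes from the slide and the 8-way coalescing factor is a purely illustrative assumption:

```python
def cores_for_iops(iops: float, us_per_io: float, coalesce_factor: float = 1.0) -> float:
    """CPU cores fully consumed by I/O processing.

    iops            -- I/O operations per second
    us_per_io       -- CPU microseconds charged per I/O
    coalesce_factor -- average I/Os handled per interrupt/batch (>= 1, hypothetical)
    """
    # CPU-seconds consumed per wall-clock second == number of cores kept busy
    return iops * (us_per_io / 1e6) / coalesce_factor

# Uncoalesced: 1M IOPS at 44 us each really would need 44 cores.
print(cores_for_iops(1_000_000, 44))       # 44.0
# With (hypothetical) 8-way coalescing the bill drops to 5.5 cores.
print(cores_for_iops(1_000_000, 44, 8))    # 5.5
```

This is why coalescing at both issue and completion time matters so much at high I/O rates.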
1M Network Packets/Second
• Can handle over 1M packets/sec in vSphere 5.5
• Working on pushing this higher in the next release; stretch goals:
  – 4M packets/sec
  – Running a 40Gb NIC at line rate
• On the Rx side, VMKernel uses ~4 cores' worth of processing
  – Tx side is the same or better
• It can be achieved, but there are CPU costs
Dealing with the CPU Costs of High I/O Rates • Achieving high storage IOPS or high networking packet rates is challenging
– Application and guest tuning tricks are known to customers or benchmarkers who aim for drag racing results
– When virtualized, the hypervisor needs the same level of TLC as the application and the guest OS
• There are additional CPU costs in the VMKernel
  – vSphere reduces this by:
    • Aggressive guest interrupt coalescing
    • Aggressive coalescing of requests at issue time
  – Further tuning:
    • There is always pass-through
    • Use RDMs if you need to lower CPU utilization by a few percent and are willing to sacrifice the benefits of VMFS
Problem Statement
• SSD access times are heading toward the tens of microseconds
  – SSD (SATA, SAS, PCIe): tens of µs latency, 10K to 100K IOPS
  – Memory Channel Storage (Flash DIMM): 3 to 5 µs latency, ~100K IOPS
  – NVMe (Flash): low tens of µs latency, 100K to millions of IOPS
  – NVMe (PCM): 80 ns read, tens to hundreds of µs write
• vSphere per-I/O processing costs are also in the tens of microseconds
Example of a Customer's Observation
• Virtualizing an app incurred a 40% penalty
  – NAS on NetApp
  – It turns out the NetApp device was caching data in SSD
  – So the vSphere overhead was of the same magnitude as the disk access itself
Low Latency Storage IO • 1M IOPS, <2ms latency, 8KB block size, 32 OIO’s
Reference: www.vmware.com/files/pdf/1M-iops-perf-vsphere5.pdf
Problem Statement
• Storage I/O has higher latency in virtual
  – True for all hypervisors
• Usually not a noticeable problem for data access
  – Long (5+ ms) latency because:
    • Random I/O
    • Many threads banging on the same spindle
• Not OK for the redo log
  – Short (<1 ms) latency:
    • Sequential I/O
    • Single-threaded
    • Write-only
  – Among the top 5 wait events in Oracle AWR
  – Cannot cover with multiprogramming
    • More threads means more waiting threads
• So look to trade more CPU cycles to achieve minimum latency for the database logs
Trade Off CPU Cycles for Latency
• vSphere aims for lower CPU usage by batching I/O operations
  – Asynchronous request passing from vHBA to VMKernel
• On the issuing path: initiate I/O immediately
  – For the LSI Logic vHBA:
    • scsiNNN.async = "FALSE"
    • Or scsiNNN.reqCallThreshold = 1 (default value is 8)
    • Or change the value for all LSI Logic vHBAs on the ESX host: esxcfg-advcfg -s x /Disk/ReqCallThreshold
  – For PVSCSI, starting with vSphere 2015:
    • scsiNNN.reqCallThreshold = 1
    • As well as dynamic, self-tuning!
Trade Off CPU Cycles for Latency, continued
• vSphere aims for lower CPU usage by batching I/O operations
  – Virtual interrupt coalescing
  – vSphere automatically suspends interrupt coalescing for low-IOPS workloads
• On the completion path: avoid virtual interrupt coalescing
  – Put the log device on a vHBA by itself
    • Low IOPS means no waiting to coalesce interrupts
  – Or explicitly disable virtual interrupt coalescing
    • For PVSCSI: scsiNNN.intrCoalescing = "FALSE"
    • For other vHBAs: scsiNNN.ic = "FALSE"
Bottom Line
• An issue on all hypervisors
• More noticeable as storage access times shrink faster than processor cycle times
• But only an issue if application performance is directly proportional to disk access time
  – Multiprogramming usually covers this
• Even with all the workarounds, log latency can be longer in virtual
  – Running Oracle TPC-C on an 8-way VM, virtual matches the native log device latency of 0.4 ms
  – On 32-way, redo log latency is 0.66 ms native vs. 0.85 ms virtual
  – Log latency has a bigger impact than data latency
    • Because the log is typically much faster (even with spinning drives)
  – DBMSs are sensitive to log latency
  – Trade off CPU cycles for minimum log latency
• There is always pass-through
  – Networking has faced this challenge all along, so more progress can be found there
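The latency numbers above translate directly into a per-thread commit ceiling whenever a commit must wait for one synchronous redo-log write; a hypothetical back-of-the-envelope helper (using the 0.4 ms and 0.85 ms figures from this slide):

```python
def max_commits_per_sec(log_latency_ms: float, threads: int = 1) -> float:
    """Upper bound on commit throughput when each commit blocks on one
    synchronous redo-log write of the given latency."""
    return threads * 1000.0 / log_latency_ms

# Native log latency on the 8-way run (0.4 ms) vs. virtual on 32-way (0.85 ms):
print(max_commits_per_sec(0.40))   # 2500.0 commits/s per committing thread
print(max_commits_per_sec(0.85))   # ~1176 commits/s per committing thread
```

This is why even a fraction-of-a-millisecond increase in log latency shows up in DBMS throughput while the same increase in data-access latency usually does not.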
Virtual Memory in a Native OS
• Applications see contiguous virtual address space, not physical memory
• OS defines VA -> PA mapping – Usually at 4 KB granularity – Mappings are stored in page tables
• HW memory management unit (MMU) – Page table walker – TLB (Translation Look-aside Buffer)
[Figure: two processes' 0-4 GB virtual address spaces (VA) mapped onto physical memory (PA); the TLB fill hardware walks the VA→PA page tables pointed to by %cr3 and caches VA→PA translations in the TLB]
Virtualizing Virtual Memory
• To run multiple VMs on a single system, another level of memory virtualization is required – Guest OS still controls virtual to physical mapping: VA -> PA – Guest OS has no direct access to machine memory (to enforce isolation)
• VMM maps guest physical memory to actual machine memory: PA -> MA
[Figure: two VMs, each running processes whose virtual memory (VA) is mapped by the guest OS to physical memory (PA), which the VMM in turn maps to machine memory (MA)]
Virtualizing Virtual Memory
• VMM builds “shadow page tables” to accelerate the mappings – Shadow directly maps VA -> MA – Can avoid doing two levels of translation on every access – TLB caches VA->MA mapping – Leverage hardware walker for TLB fills (walking shadows) – When guest changes VA -> PA, the VMM updates shadow page tables
[Figure: shadow page tables map VA directly to MA for each process, alongside the guest's VA→PA mapping and the VMM's PA→MA mapping]
2nd Generation Hardware Assist Nested/Extended Page Tables
[Figure: on a TLB miss, the TLB fill hardware walks both the guest page tables (guest PT pointer, VA→PA mapping) and the nested page tables (nested PT pointer, PA→MA mapping), caching the combined VA→MA translation in the TLB]
Analysis of EPT/NPT • MMU composes VA->PA and PA->MA mappings on the fly at TLB fill time • Benefits
– Significant reduction in “exit frequency” • Page faults require no exits • Context switches require no exits
– No shadow page table memory overhead – Better scalability for vSMP
• Aligns with multi-core: performance through parallelism
• Costs
  – More expensive TLB misses: O(n²) cost for the page table walk, where n is the depth of the page table tree
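The quadratic cost can be made concrete by counting memory references in a two-dimensional walk. The helper below is an illustrative sketch (not vSphere code), using the standard accounting: each guest page-table entry sits at a guest-physical address, so reading it costs a full nested walk plus the read itself, and the final guest-physical data address needs one more nested walk:

```python
def nested_walk_refs(g: int, n: int) -> int:
    """Memory references to service one TLB miss under nested paging.

    g -- depth of the guest page table tree
    n -- depth of the nested (EPT/NPT) page table tree

    Each of the g guest entries costs an n-level nested walk plus one
    read; the final guest-physical address costs one more nested walk.
    """
    return g * (n + 1) + n

# 4-level guest tables under 4-level EPT: 24 references per TLB miss,
# versus only 4 references for a native 4-level walk.
print(nested_walk_refs(4, 4))  # 24
```

With g = n the count is n² + 2n, which is the O(n²) behavior cited above, and why TLB-miss-heavy workloads feel EPT/NPT more than others.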
Virtualized Address Translation • H/W Assist is much cheaper than Software MMU for:
– Administration of Page Tables – Applications with a lot of process creation/exits
• EPT doubles performance for compiling Apache web server
• All in all, H/W assist for TLB miss processing has been a great performance boost across the board
• Even for an OLTP workload with static processes and high TLB miss rate – EPT is 3% better than S/W MMU – We spend ~7-10% of the time in EPT
• That’s the time the CPU spends in two-dimensional page table walks
But There Are Rare Corner Cases: • Customer Java Application
– 62GB heap and very poor locality of memory access – Native DTLB miss rate: 18M/sec/core at peak, 13M/sec/core on average
• Cycles spent in DTLB miss processing peaks at 19%, average 14% – Exceptionally high cycle counts!
• Customer observed a 30-40% drop in performance for this app when virtualized – Collected TLB stats (miss rates; cycles in EPT; etc.) to investigate:
• vmkperf on vSphere • perf on Linux
– Average of 16% of overall time in EPT processing • Plus indirect costs!
– Peaking at 27% for several minutes
• Switch to S/W MMU – Performance gap dropped from 30-40% to 17%
Bottom Line • A scant few applications have this high a TLB miss rate • But for applications that do, there is a noticeable performance drop when virtualized
– One of the rare use cases for vSphere S/W MMU • Better processors and larger pages will be the ultimate solution
• If you suspect TLB miss issues, check for it with “perf stat” or vmkperf
• For all but the rare application, the Hardware MMU is faster
Sleep-Wakeup Problem Statement • Ping-Pong architecture • Typically, a Producer-Consumer relationship
– Synchronization mechanism relies on sleeping when a thread finds the mutex locked, and waking up when the mutex is unlocked
– When the waiting thread sleeps, the native/guest OS puts the core into an idle state
• On native, the OS uses the MONITOR and MWAIT instructions – Put the core into a low-power sleep state and wait on the resched flag – On lock release, the native OS writes to the resched flag, and the core sleeping with MWAIT wakes up
• On virtual, we do not (always) emulate the MWAIT instruction – Guest OS has to use a Rescheduling Interrupt (RES) to wake up the sleeping core – RES interrupts are delivered using Inter-Processor Interrupts (IPIs), which are expensive
• Same phenomenon can be observed on KVM
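A minimal, purely illustrative Python sketch of the ping-pong pattern described above: two threads hand a token back and forth through a mutex/condition pair, sleeping and waking on every hand-off. Natively each wakeup is a cheap write to a flag a core is MWAITing on; virtualized, each one can become a RES interrupt delivered via an IPI:

```python
import threading
import time

ROUNDS = 10_000                 # number of hand-offs per thread (arbitrary)
cond = threading.Condition()
turn = "producer"

def player(me: str, other: str) -> None:
    """Do nothing but pass the token: sleep when it's not our turn,
    wake the peer when we hand off. This is the pathological case."""
    global turn
    for _ in range(ROUNDS):
        with cond:
            while turn != me:   # sleep: natively the core idles/MWAITs here
                cond.wait()
            turn = other
            cond.notify()       # wakeup: resched-flag write, or a RES IPI in a VM

t1 = threading.Thread(target=player, args=("producer", "consumer"))
t2 = threading.Thread(target=player, args=("consumer", "producer"))
start = time.perf_counter()
t1.start(); t2.start(); t1.join(); t2.join()
elapsed = time.perf_counter() - start
print(f"{2 * ROUNDS / elapsed:,.0f} sleep/wakeup cycles per second")
```

When a workload spends most of its time in a loop like this, the per-wakeup cost difference between MWAIT and an emulated RES IPI dominates overall performance, which is exactly the customer scenario on the next slide.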
Sleep-Wakeup Problem at a Customer
• Running vSphere 5.5 on Ivy Bridge processors with RHEL 6.5
• Started off at 1/6 of native performance
• Initial triage pointed to two known issues: – Supervisor Mode Execution Protection (SMEP) on IvyBridge can cause a performance problem
• KB 2077307 • Fixed in 5.5 Update 1 or 5.1 Update 1 • 3-4X improvement in a pathologically bad case
– Flexpriority • KB 2040519 • Can change from default if H/W fix (microcode update) is applied • 4X improvement in a pathologically bad case
Sleep-Wakeup Solutions and Workarounds
• Apply known patches and fixes
– SMEP • Fixed in 5.5 Update 1 or 5.1 Update 1
– Flexpriority • Enable explicitly after applying hardware microcode patch • But still ~70% of native
• Avoid the Resched Interrupts/IPIs altogether!
  – Boot the guest with idle=poll
• Performance almost on par with native • Expensive in terms of power
– Use taskset to run all threads on the same vCPU • Bound by processing power of 1 vCPU
Bottom Line • Need to apply known bug fixes/updates/etc. • But still noticeable performance issue:
– If there is a Very High rate of Sleep/Wakeup – If the producer/consumer threads do nothing but communicate, and there is no other activity in the VM – If there are no changes made to guest OS/Application to avoid using RES Interrupts
Problem Statement • Timers are a known virtualization issue
– Correctness – Cost
• Older ESX releases were poll-driven; vSphere is now interrupt-driven
• Accessing the timer has CPU overhead – vSphere 5.5 has made great improvements
• Most applications should not have an issue anymore
Timer Access Overhead
• Guest OS
  – No issues in newer Linux releases
  – Windows might still have a problem because Windows is tick-driven
• If the guest timer has fallen behind, vSphere tries to gradually adjust the clock to catch up. During this time, we exit on TSC reads, which is expensive.
• Not an issue if in steady state and the VM is not getting descheduled
• vMotion
  – If the VM migrates to a processor of sufficiently different frequency, we will have to exit on every TSC read to scale the frequency
  – When frequencies are close, we adjust frequencies only occasionally, and only during this period do we exit
  – AMD avoids this by using frequency scaling
    • But vSphere does not take advantage of that yet
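A rough way to observe timer-access cost from inside a guest (or on native) is to time a tight loop of clock reads; an illustrative sketch, with an arbitrary loop count, measured through Python so it includes interpreter overhead on top of the raw clocksource cost:

```python
import time

N = 200_000                       # number of clock reads (arbitrary)
start = time.perf_counter_ns()
for _ in range(N):
    time.monotonic_ns()           # one timer read per iteration
elapsed = time.perf_counter_ns() - start
print(f"~{elapsed / N:.0f} ns per clock read (includes Python overhead)")
```

On a healthy TSC-backed clocksource the underlying read is tens of nanoseconds; if the hypervisor is intercepting TSC reads (clock catch-up, or frequency scaling after a vMotion to a different-speed host), the same loop slows down dramatically, which makes this a quick triage check.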
Performance of Latency Sensitivity Feature Reference: http://www.vmware.com/files/pdf/techpaper/latency-sensitive-perf-vsphere55.pdf
Latency-Sensitive Applications
• Same issues on all hypervisors
• Bands:
  – Band 1: <10 µs; not a good candidate for virtualization
  – Band 2: tens of microseconds
    • Good support for both latency and jitter since 5.1
  – Band 3: 100s of µs to milliseconds
    • Surprisingly, might not be a good candidate because customers may not be willing to set Latency Sensitivity to High
• Requires following best practices and the instructions in the white paper
Summary
• Although inherent performance problems are rare, some apps require TLC:
  – Apply best practices
  – Apply workarounds
  – Consider changes to application design
• We continually work on vSphere to improve performance when dealing with corner cases
Acknowledgements • Seongbeom Kim • Lenin Singaravelu
• Jin Heo • Jinpyo Kim
• Tian Luo
• Adrian Marinescu • Cedric Krumbein
• Jianzhe Tai
• Rishi Mehta • Jeff Buell
• Lance Berc • Joshua Schnee
• Nikhil Bathia
• Bhavesh Davda • Josh Simon