© 2010 VMware Inc. All rights reserved
Advanced performance troubleshooting using esxtop/resxtop
Krishna Raj Raja
Staff Engineer, Performance Group
2
Disclaimer
• This session may contain product features that are currently under development.
• This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.
“THESE FEATURES ARE REPRESENTATIVE OF FEATURE AREAS UNDER DEVELOPMENT. FEATURE COMMITMENTS ARE SUBJECT TO CHANGE, AND MUST NOT BE INCLUDED IN CONTRACTS, PURCHASE ORDERS, OR SALES AGREEMENTS OF ANY KIND. TECHNICAL FEASIBILITY AND MARKET DEMAND WILL AFFECT FINAL DELIVERY.”
3
esxtop resources
esxtop manual:
• http://www.vmware.com/pdf/vsphere4/r41/vsp_41_resource_mgmt.pdf
VMware Community documents:
• http://communities.vmware.com/docs/DOC-9279 - ESX 4.0
• http://communities.vmware.com/docs/DOC-11812 - ESX 4.1
esxtop for advanced users:
• VMworld 2008 - http://vmworld.com/docs/DOC-2356
• VMworld 2009 - http://vmworld.com/docs/DOC-3838
4
Ten things that you need to know about esxtop
5
esxtop counters
1. esxtop does not create performance metrics
• esxtop derives performance metrics from raw counters exported in the VMkernel System Info nodes (VSI nodes)
• esxtop can show new counters on an older ESX system if the raw counters are present in the VMkernel
6
esxtop counters
2. Counter values
• Many raw counters have static values that do not change with time; esxtop displays them as is
• Many counters increment monotonically; esxtop reports the delta over the refresh interval for these, for instance CMDS/sec and packets transmitted/sec
• %USED and %RUN: CPU occupancy delta between successive snapshots
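The delta reporting described above can be sketched in a few lines. This is only an illustration of the arithmetic, not esxtop's actual code: the function names, the microsecond unit for CPU time and all snapshot values are assumptions made for the example.

```python
def rate_per_second(prev_count, curr_count, interval_s):
    """Turn two snapshots of a monotonically increasing counter into a rate."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (curr_count - prev_count) / interval_s


def cpu_occupancy_pct(prev_used_us, curr_used_us, interval_s):
    """CPU time consumed between two snapshots, as a percentage of the interval."""
    interval_us = interval_s * 1_000_000
    return 100.0 * (curr_used_us - prev_used_us) / interval_us


# Hypothetical snapshots taken 5 seconds apart.
cmds_rate = rate_per_second(prev_count=120_000, curr_count=121_500, interval_s=5)
used_pct = cpu_occupancy_pct(prev_used_us=40_000_000, curr_used_us=42_500_000, interval_s=5)
print(cmds_rate)  # 300.0
print(used_pct)   # 50.0
```

A static counter, by contrast, would simply be printed as read, with no subtraction.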
7
Refresh interval
3. Graphs will look different depending on the refresh interval
• Many counter values depend on the refresh interval
• A larger refresh interval smooths out spikes and troughs
[Figure: the same data graphed with a 2 second refresh interval and with a 10 second refresh interval]
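To see why a larger interval hides spikes, the sketch below averages made-up 2 second samples into 10 second samples. The helper name `rebucket` and the sample values are hypothetical; the point is only that averaging over a longer window flattens a short burst.

```python
def rebucket(samples, factor):
    """Average consecutive groups of `factor` samples into one coarser sample."""
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, len(samples) - factor + 1, factor)
    ]


# Hypothetical %USED values sampled every 2 seconds (20 seconds in total).
two_second = [10, 10, 90, 10, 10, 10, 10, 10, 10, 10]
ten_second = rebucket(two_second, factor=5)

print(max(two_second))  # 90
print(ten_second)       # [26.0, 10.0]
```

The 90% spike that is obvious at 2 seconds shows up only as a 26% average at 10 seconds.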
8
esxtop counters
4. Counter normalization
• By default, counters are shown for the group
• In group view, counter values are cumulative
• In expanded view, counters are normalized per entity
[Figure: group view shows cumulative stats; pressing the ‘e’ key expands a group; in the expanded view the vcpu world consumes the CPU]
9
esxtop counters
5. %USED can exceed 100
• Turbo boost can increase the processor clock speed
• Asynchronous work can be happening on a different core on behalf of the VM
Example: VM on an NFS datastore running an I/O-intensive workload
10
esxtop batch mode
6. Batch mode (-b)
• Produces a Windows perfmon-compatible CSV file
• CSV file compatibility requires a fixed number of columns on every row, so statistics for VM/world instances that appear after batch mode has started are not collected
• Only counters specified in the configuration file are collected; the (-a) option collects all counters
• Counters are named slightly differently than in interactive mode
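Because the batch output is plain CSV, it can also be post-processed with a short script. The sketch below assumes perfmon-style column names of the form `\\host\Object(instance)\Counter`. The host name, VM name, timestamps and values are made up for illustration, and `read_counters` is a hypothetical helper, not part of esxtop.

```python
import csv
import io

# Illustrative header and rows only; the layout is an assumption about
# what a perfmon-compatible batch file looks like.
HEADER = [
    "(PDH-CSV 4.0) (UTC)(0)",
    r"\\host1\Group Cpu(1234:vm-a)\% Used",
    r"\\host1\Group Cpu(1234:vm-a)\% Ready",
]
ROWS = [
    ["11/10/2010 13:17:27", "42.50", "3.10"],
    ["11/10/2010 13:17:29", "47.50", "2.90"],
]


def read_counters(csv_file, name_filter):
    """Return {column name: [float values]} for columns containing name_filter."""
    reader = csv.reader(csv_file)
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if name_filter in name]
    series = {header[i]: [] for i in wanted}
    for row in reader:
        for i in wanted:
            series[header[i]].append(float(row[i]))
    return series


# Build an in-memory file so the example runs without a real capture.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(HEADER)
writer.writerows(ROWS)
buf.seek(0)

used = read_counters(buf, r"\% Used")
for name, values in used.items():
    print(name, sum(values) / len(values))
```

With a real capture, the in-memory buffer would be replaced by `open("capture.csv", newline="")`, where the file name is whatever the batch output was redirected to.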
11
esxtop batch mode – importing data into perfmon
12
esxtop batch mode – viewing data in perfmon
13
esxtop batch mode – trimming data
[Figures: trimming data; saving data after trim]
14
esxplot
• http://labs.vmware.com/flings/esxplot
15
I/O Latencies
7. I/O latencies
• I/O latencies are measured per SCSI command, so they are not affected by the refresh interval
• Reported latencies are averages over all the SCSI commands issued within the refresh interval window
• Reported average latencies can differ between screens (adapter, LUN, VM), since each screen accounts for a different group of I/Os
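The last point is easy to see with a toy data set. The sketch below groups the same per-command latencies by adapter, by LUN and by VM. All names and numbers are invented, and `average_latency` is a hypothetical helper; it only illustrates why the three screens can report different averages.

```python
from collections import defaultdict

# Hypothetical SCSI commands completed in one refresh window:
# (adapter, LUN, VM, latency in ms).
commands = [
    ("vmhba1", "lun0", "vm-a", 2.0),
    ("vmhba1", "lun0", "vm-b", 4.0),
    ("vmhba1", "lun1", "vm-a", 30.0),
    ("vmhba2", "lun2", "vm-b", 6.0),
]


def average_latency(commands, key_index):
    """Average per-command latency, grouped by one column of the tuple."""
    totals = defaultdict(lambda: [0.0, 0])
    for cmd in commands:
        bucket = totals[cmd[key_index]]
        bucket[0] += cmd[3]  # accumulate latency
        bucket[1] += 1       # count commands
    return {key: total / count for key, (total, count) in totals.items()}


by_adapter = average_latency(commands, 0)
by_lun = average_latency(commands, 1)
by_vm = average_latency(commands, 2)
print(by_adapter)  # {'vmhba1': 12.0, 'vmhba2': 6.0}
print(by_lun)      # {'lun0': 3.0, 'lun1': 30.0, 'lun2': 6.0}
print(by_vm)       # {'vm-a': 16.0, 'vm-b': 5.0}
```

The single slow command on lun1 dominates the LUN view, but is diluted in the adapter and VM views.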
16
resxtop – remote esxtop
8. You can use resxtop to connect to different ESX hosts
• Newer versions of resxtop will connect to older ESX hosts
9. You don’t need root access to view esxtop counters
• resxtop can authenticate using vCenter credentials
17
esxtop CPU usage
10. esxtop can consume a non-trivial amount of CPU
• This happens when you have a very large inventory (VMs, LUNs, virtual disks, virtual NICs etc.)
• You can limit the amount of data collected by limiting the fields (columns) and entities (rows); you can also reduce CPU consumption by locking entities with the (-l) option
CPU consumption on a host with 512 VMs
CPU consumption with esxtop -l
CPU usage when using resxtop
18
Performance Troubleshooting Using esxtop
19
esxtop screens
Screens
• c: cpu (default)
• m: memory
• n: network
• d: disk adapter
• u: disk device (added in ESX 3.5)
• v: disk VM (added in ESX 3.5)
• i: Interrupts (new in ESX 4.0)
• p: power management (new in ESX 4.1)
[Diagram: VMs running on the VMkernel, with the esxtop screens mapped to VMkernel components: CPU scheduler (c, i, p), memory scheduler (m), virtual switch (n), vSCSI (d, u, v)]
20
Troubleshooting CPU Problems
21
CPU Constrained
SMP VM with high CPU utilization
Both the virtual CPUs are CPU constrained
22
CPU Contention
4 CPUs, all at 100%
3 SMP VMs
VMs don’t get to run all the time
%ready accumulates
23
CPU Limit
Max Limited
CPU limit: AMAX = -1 means unlimited
24
Mis-configured SMP VM
vCPU 1 not used by the VM
Incorrect (UP) kernel/HAL inside the guest, or the application inside the guest is single-threaded
25
Power management – CPU frequency scaling
C states: C0 – busy, C1 – halted, C2 – deep halt
P states: P0 – Highest clock frequency, P11 – Lowest clock frequency
26
VM Power Usage
Experimental feature, not enabled by default.
VMkernel advanced setting: Power.ChargeVMs
27
CPU clock frequency scaling
%USED: CPU usage with reference to base clock frequency
%UTIL: CPU utilization with reference to current clock frequency
%RUN: CPU scheduled time
VM is running all the time but uses only 75% of the clock frequency
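The relationship between the two reference frequencies can be approximated as a simple rescaling. The sketch below is a simplification under stated assumptions: it ignores hyperthreading and system-time accounting, the function name `used_pct` is hypothetical, and the clock frequencies are made up.

```python
def used_pct(util_pct, current_mhz, base_mhz):
    """Approximate %USED from %UTIL by rescaling to the base clock frequency."""
    return util_pct * current_mhz / base_mhz


# Hypothetical core: base clock 2400 MHz, scaled down to 1800 MHz by power management.
print(used_pct(util_pct=100.0, current_mhz=1800, base_mhz=2400))  # 75.0

# With a turbo clock above the base clock, the same formula exceeds 100.
print(used_pct(util_pct=100.0, current_mhz=2700, base_mhz=2400))  # 112.5
```

The first case matches the slide's example of a VM that runs all the time but uses only 75% of the clock frequency.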
28
Hyperthreading
Two VMs running on different cores
Two VMs sharing the same core
The %LAT_C counter shows the time de-scheduled due to core sharing
29
Timer interrupt rate
• Linux Guests
30
Timer interrupt rate
• Windows Guests – Multimedia timer
31
New metrics in CPU screen
%LAT_C : % of time the VM was not scheduled due to a CPU resource issue
%LAT_M : % of time the VM was not scheduled due to a memory resource issue
%DMD : Moving average of CPU utilization over the last one minute
EMIN : Minimum CPU resources in MHz that the VM is guaranteed to get when there is CPU contention
32
Troubleshooting Memory Problems
33
esxtop memory screen (m)
Possible states: high, soft, hard and low
PMEM – Total physical memory
VMKMEM – Memory managed by the VMkernel
COSMEM – Memory used by the Service Console
34
Not able to power-on a new VM
• Memory reservation
820 MB reservation requested
Overhead memory needs to be reserved
4G memory reservation
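The power-on failure above comes down to an admission check: the VM's reservation plus its overhead memory has to fit into what is still unreserved. The sketch below is a minimal illustration of that check, not the VMkernel's actual logic. The function name and all figures are hypothetical and are not read from the slide's screenshot.

```python
def can_power_on(unreserved_mb, vm_reservation_mb, vm_overhead_mb):
    """True if the VM's reservation plus its overhead fits in unreserved memory."""
    return vm_reservation_mb + vm_overhead_mb <= unreserved_mb


# Hypothetical numbers: a 512 MB reservation plus 308 MB of overhead.
print(can_power_on(unreserved_mb=700, vm_reservation_mb=512, vm_overhead_mb=308))   # False
print(can_power_on(unreserved_mb=1024, vm_reservation_mb=512, vm_overhead_mb=308))  # True
```

The first call fails even though the reservation alone would fit, which is why the overhead has to be counted.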
35
Granted Memory
Granted Memory = Memory touched by the guest
Windows and FreeBSD guests touch (zero) all their memory during boot
Linux guests touch memory when they first use it
36
Ballooning versus Swapping
MCTL: N – balloon driver not active, tools probably not installed
Memory hog VMs
Swapped in the past but not actively swapping now
Swap target is higher for the VM without the balloon driver
VM with the balloon driver swaps less
37
Memory Compression Stats
COWH : Copy-on-write page hints – amount of memory in MB that is potentially shareable
CACHESZ: Compression Cache size
CACHEUSD: Compression Cache currently used
ZIP/s, UNZIP/s: Memory compression/decompression rate
38
Wide NUMA - CPU
2 NUMA nodes with ~6G each
NUMA home node not assigned
6-vcpu VM – cannot fit into a NUMA node of size 4 CPUs
4G, can fit into a single node
39
NUMA affinity not set
NUMA machine with 2 nodes
CPU affinity set to wrong NUMA node
All the memory in remote node
NHN: NUMA Home Node
NLMEM: Memory in local node
NRMEM: Memory in remote node
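NLMEM and NRMEM together give a quick measure of memory locality. The sketch below computes the local share as a percentage. The function name and the megabyte values are hypothetical, chosen to mirror the "all the memory in remote node" case.

```python
def numa_locality_pct(nlmem_mb, nrmem_mb):
    """Share of the VM's memory that sits on its home node, in percent."""
    total = nlmem_mb + nrmem_mb
    if total == 0:
        return 0.0
    return 100.0 * nlmem_mb / total


# Hypothetical VM pinned to the wrong node: all memory is remote.
print(numa_locality_pct(nlmem_mb=0, nrmem_mb=4096))     # 0.0
# Hypothetical VM with three quarters of its memory local.
print(numa_locality_pct(nlmem_mb=3072, nrmem_mb=1024))  # 75.0
```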
40
Wide NUMA - Memory
2 NUMA nodes with ~6G each
NUMA home node not assigned
VM cannot fit into a single NUMA node
41
Troubleshooting Network Problems
42
vSwitch active uplink
TEAM-PNIC : The uplink that the virtual switch port is currently using
43
Dropped packets at vSwitch
Packet drops usually happen when the traffic has no flow control (UDP/multicast/broadcast packets)
44
Multicast/Broadcast stats
PKTTXMUL/s – Multicast packets transmitted per second
PKTRXMUL/s – Multicast packets received per second
PKTTXBRD/s – Broadcast packets transmitted per second
PKTRXBRD/s – Broadcast packets received per second
45
NFS stats
DAVG and KAVG are not available for network-backed storage
GAVG – gives the end-to-end latency
46
Troubleshooting Disk Problems
47
Disk I/O latency
Host bus adapters (HBAs) – includes SCSI, iSCSI, RAID, and FC-HBA adapters
Latency stats from the device, the kernel and the guest
DAVG/cmd – Average latency (ms) from the device (LUN)
KAVG/cmd – Average latency (ms) in the VMkernel
GAVG/cmd – Average latency (ms) in the guest
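The guest-observed latency is, to a first approximation, the sum of the device and kernel components, so the two parts can be checked separately. In the sketch below the function name and the alert thresholds (25 ms for DAVG, 2 ms for KAVG) are assumptions chosen for illustration, not values taken from this deck.

```python
def classify_latency(davg_ms, kavg_ms, davg_limit_ms=25.0, kavg_limit_ms=2.0):
    """Return (estimated GAVG, list of findings) for one adapter or LUN row."""
    gavg_ms = davg_ms + kavg_ms  # guest latency is roughly device plus kernel
    findings = []
    if davg_ms > davg_limit_ms:
        findings.append("device latency high: look at the array or the fabric")
    if kavg_ms > kavg_limit_ms:
        findings.append("kernel latency high: look for queuing in the host")
    return gavg_ms, findings


# Hypothetical rows: one slow device, one host-side queuing problem.
print(classify_latency(davg_ms=40.0, kavg_ms=0.5))
print(classify_latency(davg_ms=5.0, kavg_ms=12.0))
```

Splitting the check this way points at where to look next: the storage side for high DAVG, the host (for example queue depth) for high KAVG.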
48
Problem with the disk subsystem
Bad throughput
Good throughput
Device Latency is high - cache disabled
Low device Latency
49
Insufficient Queue depth
Non-zero KAVG
Queuing at the HBA
50
FC bottleneck
‘v’ – VM view
‘u’ – device view
‘d’ – adapter view
51
vStorage API for Array Integration (VAAI) stats
CLONE_RD, CLONE_WR: Number of Clone read/write requests
CLONE_F: Number of Failed clone operations
MBC_RD/s, MBC_WR/s – Clone read/write MBs/sec
ATS – Number of ATS commands
ATSF – Number of failed ATS commands
ZERO – Number of Zero requests
ZEROF – Number of failed zero requests
MBZERO/s – Megabytes Zeroed per second
52
VAAI - virtual disk creation example
• vStorage API for Array Integration (VAAI)
53
SCSI reservation conflicts
54
Other diagnostic tools
55
Other diagnostic tools (1 of 2)
• sched-stats and schedtrace
  • vm-support -s/-S flag captures sched-stats
  • vm-support -c flag captures scheduler trace; it takes a lot of disk space
• memstats
  • Provides detailed memory usage stats with resource pool hierarchy
• ft-stats
  • FT virtual machine stats
  • Collected with the vm-support -s/-S flag
56
Other diagnostic tools (2 of 2)
• swatchStats
  • Stopwatch stats for VMFS, SCSI events
• vscsiStats
  • Virtual machine SCSI disk I/O stats
  • Provides histogram information for latency, I/O size, inter-arrival time and outstanding I/Os
57
vscsiStats
Virtual SCSI disk handle IDs – unique across virtual machines
World group leader id
Virtual Machine Name
# vscsiStats -l
58
vscsiStats – latency histogram
# vscsiStats -p latency -w 118739 -i 8205
Latency in microseconds
I/O distribution count
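A latency histogram of this kind is just a count of I/Os per latency bucket. The sketch below shows the bucketing on made-up data. The bucket limits are an assumption for illustration and may not match what vscsiStats prints on a given build, and `latency_histogram` is a hypothetical helper.

```python
from bisect import bisect_left

# Assumed bucket upper limits in microseconds; the last bucket is open-ended.
LIMITS_US = [1, 10, 100, 500, 1000, 5000, 15000, 30000, 50000, 100000]


def latency_histogram(latencies_us, limits=LIMITS_US):
    """Count I/Os per bucket; bucket i holds values <= limits[i], the last holds the rest."""
    counts = [0] * (len(limits) + 1)
    for value in latencies_us:
        counts[bisect_left(limits, value)] += 1
    return counts


# Hypothetical per-I/O latencies in microseconds.
sample = [80, 120, 450, 700, 900, 4000, 20000, 250000]
counts = latency_histogram(sample)
for limit, count in zip(LIMITS_US + ["more"], counts):
    print(limit, count)
```

Unlike an average, the histogram keeps the one very slow I/O visible in the open-ended bucket.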
59
vscsiStats – iolength histogram
# vscsiStats -p iolength -w 118739 -i 8205
I/O block size
Distribution Count