
© 2010 VMware Inc. All rights reserved

Advanced performance troubleshooting using esxtop/resxtop

Krishna Raj Raja

Staff Engineer, Performance Group


Disclaimer

▪ This session may contain product features that are currently under development.

▪ This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.

▪ Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

▪ Technical feasibility and market demand will affect final delivery.

▪ Pricing and packaging for any new technologies or features discussed or presented have not been determined.

“THESE FEATURES ARE REPRESENTATIVE OF FEATURE AREAS UNDER DEVELOPMENT. FEATURE COMMITMENTS ARE SUBJECT TO CHANGE, AND MUST NOT BE INCLUDED IN CONTRACTS, PURCHASE ORDERS, OR SALES AGREEMENTS OF ANY KIND. TECHNICAL FEASIBILITY AND MARKET DEMAND WILL AFFECT FINAL.”


esxtop resources

esxtop manual:

▪ http://www.vmware.com/pdf/vsphere4/r41/vsp_41_resource_mgmt.pdf

VMware Community documents:

▪ http://communities.vmware.com/docs/DOC-9279 - ESX 4.0

▪ http://communities.vmware.com/docs/DOC-11812 - ESX 4.1

esxtop for advanced users:

▪ VMworld 2008 - http://vmworld.com/docs/DOC-2356

▪ VMworld 2009 - http://vmworld.com/docs/DOC-3838


Ten things that you need to know about esxtop


esxtop counters

1. esxtop does not create performance metrics

• esxtop derives performance metrics from raw counters exported in the VMkernel System Info nodes (VSI nodes)

• esxtop can show new counters on older ESX systems if the raw counters are present in the VMkernel


esxtop counters

2. Counter values

• Many raw counters have static values that do not change with time; esxtop displays them as is

• Many counters increment monotonically; esxtop reports the delta over the given refresh interval, for instance CMDS/sec and packets transmitted/sec

• %USED and %RUN are the CPU occupancy delta between successive snapshots
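
The delta-based reporting described above can be illustrated with a minimal sketch. The `Snapshot` type, the function name, and the counter values are hypothetical and are not taken from esxtop; the sketch only shows how a per-second rate might be derived from two readings of a monotonically increasing counter.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One reading of a raw, monotonically increasing counter (hypothetical type)."""
    timestamp_s: float  # when the snapshot was taken, in seconds
    value: int          # raw counter value at that time


def rate_per_second(prev: Snapshot, curr: Snapshot) -> float:
    """Return the counter's average rate between two snapshots."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("snapshots must be in increasing time order")
    return (curr.value - prev.value) / elapsed


# Hypothetical raw values: 12,000 commands issued during a 5-second refresh interval.
before = Snapshot(timestamp_s=100.0, value=1_000_000)
after = Snapshot(timestamp_s=105.0, value=1_012_000)
print(rate_per_second(before, after))  # 2400.0
```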


Refresh interval

3. Graphs will look different depending on the refresh interval

• Many counter values are dependent on the refresh interval

• A larger refresh interval smooths out spikes and troughs

[Figures: the same graph with a 2 second refresh interval and with a 10 second refresh interval]
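
To see why a larger interval flattens spikes, here is a small sketch that averages the same per-second samples over two window sizes. The sample values and the function name are made up for illustration; this is not how esxtop is implemented internally.

```python
def average_over_intervals(samples, interval):
    """Average consecutive per-second samples in windows of `interval` seconds."""
    windows = [samples[i:i + interval] for i in range(0, len(samples), interval)]
    return [sum(window) / len(window) for window in windows]


# Hypothetical per-second CPU usage: mostly idle with one short spike.
per_second = [10, 10, 10, 10, 90, 90, 10, 10, 10, 10]

print(average_over_intervals(per_second, 2))   # [10.0, 10.0, 90.0, 10.0, 10.0]
print(average_over_intervals(per_second, 10))  # [26.0]
```

With a 2-second window the spike is still visible at 90; with a 10-second window it is diluted into a single value of 26.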


esxtop counters

4. Counter normalization

• By default counters are shown for the group

• In group view, counter values are cumulative

• In expanded view, counters are normalized per entity

Screenshot annotations: cumulative stats for the group; the vcpu world consumes the CPU; pressing the ‘e’ key expands a group
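
As a rough illustration of the group versus expanded view, the sketch below sums hypothetical per-world %USED values into a group total. The world names and numbers are invented for the example.

```python
# Hypothetical per-world %USED values for one VM group, as an expanded view might list them.
worlds = {"vmx": 1.0, "mks": 0.5, "vcpu-0": 95.0, "vcpu-1": 92.5}

# Group view: the counter is cumulative across all worlds in the group.
group_used = sum(worlds.values())

# Per-vCPU average, useful when comparing an SMP VM against a single-vCPU VM.
vcpu_values = [used for name, used in worlds.items() if name.startswith("vcpu-")]
per_vcpu_used = sum(vcpu_values) / len(vcpu_values)

print(group_used)     # 189.0
print(per_vcpu_used)  # 93.75
```

A group value well above 100 is therefore expected for a busy SMP VM and does not by itself indicate a problem.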


esxtop counters

5. %USED can exceed 100

• Turbo boost can increase the processor clock speed

• Asynchronous work can be happening on a different core on behalf of the VM

Example: a VM on an NFS datastore running an I/O-intensive workload


esxtop batch mode

6. Batch mode (-b)

• Produces a Windows perfmon-compatible CSV file

• CSV file compatibility requires a fixed number of columns on every row. For this reason, statistics for VMs/world instances that appear after batch mode starts are not collected

• Only counters that are specified in the configuration file are collected; the -a option collects all counters

• Counters are named slightly differently
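
Because the batch output is a perfmon-style CSV, it can also be post-processed with a script. The sketch below assumes the column headers follow perfmon's counter path convention, `\\host\Object(instance)\Counter`; the host name, counter names, and values in `SAMPLE` are hypothetical, so check them against a real capture before relying on the pattern.

```python
import csv
import io
import re

# Perfmon-style counter path: \\host\Object(instance)\Counter. The instance part is optional.
COUNTER_PATH = re.compile(
    r"^\\\\(?P<host>[^\\]+)\\(?P<object>[^(\\]+)(?:\((?P<instance>.*)\))?\\(?P<counter>.+)$"
)

# Hypothetical two-line excerpt of a batch-mode capture (header row plus one sample row).
SAMPLE = r'''"(PDH-CSV 4.0) (UTC)(0)","\\esx01\Group Cpu(1:idle)\% Used","\\esx01\Memory\Free MBytes"
"01/01/2010 00:00:00","12.50","2048"
'''


def parse_header(column):
    """Split a counter path into (host, object, instance, counter), or return None."""
    match = COUNTER_PATH.match(column)
    if match is None:
        return None  # for example, the timestamp column
    return match.group("host", "object", "instance", "counter")


rows = list(csv.reader(io.StringIO(SAMPLE)))
header, first_sample = rows[0], rows[1]
for name, value in zip(header[1:], first_sample[1:]):
    print(parse_header(name), float(value))
```

Since every row has the same number of columns, the header only needs to be parsed once and can then be applied to every sample row.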


esxtop batch mode – importing data into perfmon


esxtop batch mode – viewing data in perfmon


esxtop batch mode – trimming data

Trimming data

Saving data after trim


esxplot

▪ http://labs.vmware.com/flings/esxplot


I/O Latencies

7. IO latencies

• IO latencies are measured per SCSI command, so they are not affected by the refresh interval

• Reported latencies are average values for all the SCSI commands issued within the refresh interval window

• Reported average latencies can be different on different screens (adapter, LUN, VM), since each screen accounts for different group of I/Os
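
The last point can be made concrete with a small worked example. The command counts and latencies are hypothetical; the sketch simply shows that an average taken over one group of I/Os (a single LUN) can differ sharply from the average over a larger group (an adapter serving both LUNs).

```python
def average_latency(groups):
    """Command-weighted average latency in ms over (command_count, avg_latency_ms) pairs."""
    total_commands = sum(count for count, _ in groups)
    if total_commands == 0:
        return 0.0
    return sum(count * latency for count, latency in groups) / total_commands


# Hypothetical per-LUN figures for one refresh interval: (commands, average latency in ms).
lun_a = (900, 2.0)   # busy, fast LUN
lun_b = (100, 20.0)  # lightly loaded, slow LUN

print(average_latency([lun_a]))         # 2.0  (LUN screen, LUN A)
print(average_latency([lun_b]))         # 20.0 (LUN screen, LUN B)
print(average_latency([lun_a, lun_b]))  # 3.8  (adapter screen, both LUNs)
```

The adapter-level figure of 3.8 ms hides the slow LUN, which is why it helps to check the screen that matches the entity under investigation.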


resxtop – remote esxtop

8. You can use resxtop to connect to different ESX hosts

• Newer versions of resxtop will connect to older ESX hosts

9. You don’t need root access to view esxtop counters

• resxtop can authenticate using vCenter credentials


esxtop CPU usage

10. esxtop can consume a non-trivial amount of CPU

• This happens when you have a very large inventory (VMs, LUNs, virtual disks, virtual NICs, etc.)

• You can limit the amount of data collected by limiting the fields (columns) and entities (rows). You can also reduce CPU consumption by locking entities with the -l option

CPU consumption on a host with 512 VMs

CPU consumption with esxtop -l

CPU usage when using resxtop


Performance Troubleshooting Using esxtop


esxtop screens

Screens

• c: cpu (default)

• m: memory

• n: network

• d: disk adapter

• u: disk device (added in ESX 3.5)

• v: disk VM (added in ESX 3.5)

• i: Interrupts (new in ESX 4.0)

• p: power management (new in ESX 4.1)

[Diagram: VMs running on top of the VMkernel, with each VMkernel component labeled by the screens that report on it]

• CPU Scheduler: c, i, p

• Memory Scheduler: m

• Virtual Switch: n

• vSCSI: d, u, v


Troubleshooting CPU Problems


CPU Constrained

Screenshot annotations: SMP VM; high CPU utilization on both the virtual CPUs; the VM is CPU constrained


CPU Contention

4 CPUs, all at 100%

3 SMP VMs

VMs don’t get to run all the time

%ready accumulates


CPU Limit

Max Limited

CPU Limit: AMAX = -1 means unlimited


Mis-configured SMP VM

vCPU 1 not used by the VM

Incorrect (UP) kernel/HAL inside the guest, or the application inside the guest is single threaded


Power management – CPU frequency scaling

C states: C0 – busy, C1 – halted, C2 – deep halt

P states: P0 – Highest clock frequency, P11 – Lowest clock frequency


VM Power Usage

Experimental feature, not enabled by default.

VMkernel advanced setting: Power.ChargeVMs


CPU clock frequency scaling

%USED: CPU usage with reference to base clock frequency

%UTIL: CPU utilization with reference to current clock frequency

%RUN: CPU scheduled time

VM is running all the time but uses only 75% of the clock frequency
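
A simplified way to relate these counters is sketched below. It assumes that %USED is just the scheduled time scaled by the ratio of current to base clock frequency; the function name and the clock frequencies are hypothetical, and real esxtop accounting involves further adjustments that this sketch ignores.

```python
def used_percent(run_percent: float, current_mhz: float, base_mhz: float) -> float:
    """Scale scheduled time by the clock ratio (simplified, illustrative model only)."""
    return run_percent * current_mhz / base_mhz


# Hypothetical CPU with a 2400 MHz base clock, currently scaled down to 1800 MHz.
# The VM is scheduled all the time (%RUN = 100) but gets only 75% of the base clock.
print(used_percent(100.0, 1800.0, 2400.0))  # 75.0
```

Under the same model, a current frequency above the base clock yields a value above 100.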


Hyperthreading

Two VMs running on different cores

Two VMs sharing the same core

%LAT_C counter shows the time de-scheduled due to core sharing


Timer interrupt rate

▪ Linux Guests


Timer interrupt rate

▪ Windows Guests – Multimedia timer


New metrics in CPU screen

%LAT_C : Percentage of time the VM was not scheduled due to a CPU resource issue

%LAT_M : Percentage of time the VM was not scheduled due to a memory resource issue

%DMD : Moving average of CPU utilization over the last one minute

EMIN : Minimum CPU resources in MHz that the VM is guaranteed to get when there is CPU contention


Troubleshooting Memory Problems


esxtop memory screen (m)

Possible states: high, soft, hard and low

PMEM – Total physical memory

VMKMEM – Memory managed by the VMkernel

COSMEM – Memory used by the Service Console


Not able to power-on a new VM

▪ Memory reservation

820 MB reservation requested

Overhead memory needs to be reserved

4G memory reservation


Granted Memory

Granted Memory = Memory touched by the guest

Windows and FreeBSD guests touch (zero) all their memory during boot

Linux guests touch memory when they first use it


Ballooning versus Swapping

MCTL: N - Balloon driver not active, tools probably not installed

Memory Hog VMs

Swapped in the past but not actively swapping now

The swap target is higher for the VM without the balloon driver

The VM with the balloon driver swaps less


Memory Compression Stats

COWH : Copy-on-write page hints – amount of memory in MB that is potentially shareable

CACHESZ: Compression Cache size

CACHEUSD: Compression Cache currently used

ZIP/s, UNZIP/s: Memory compression/decompression rate


Wide NUMA - CPU

2 NUMA nodes with ~6G each

NUMA home node not assigned

6-vcpu VM – cannot fit into a NUMA node size of 4 CPUs

4G, can fit into a single node


NUMA affinity not set

NUMA machine with 2 nodes

CPU affinity set to wrong NUMA node

All the memory in remote node

NHN: NUMA Home Node

NLMEM: Memory in local node

NRMEM: Memory in remote node
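
The local and remote memory counters above can be combined into a single locality figure. The sketch below is illustrative only: the function name and the sample values are hypothetical, and the inputs are assumed to be in MB.

```python
def local_memory_percent(nlmem_mb: float, nrmem_mb: float) -> float:
    """Percentage of a VM's memory that resides on its NUMA home node."""
    total = nlmem_mb + nrmem_mb
    if total == 0:
        return 0.0
    return 100.0 * nlmem_mb / total


print(local_memory_percent(4096, 0))     # 100.0, all memory is local
print(local_memory_percent(3072, 1024))  # 75.0
print(local_memory_percent(0, 4096))     # 0.0, all memory is in the remote node
```

The last case corresponds to the situation on this slide, where CPU affinity points at the wrong node and every memory access is remote.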


Wide NUMA - Memory

2 NUMA nodes with ~6G each

NUMA home node not assigned

VM cannot be fit into a single NUMA node


Troubleshooting Network Problems


vSwitch active uplink

TEAM-PNIC : The uplink that the virtual switch port is currently using


Dropped packets at vSwitch

Packet drops usually happen when the traffic has no flow control (UDP/Multicast/Broadcast packets)


Multicast/Broadcast stats

PKTTXMUL/s – Multicast packets transmitted per second

PKTRXMUL/s – Multicast packets received per second

PKTTXBRD/s – Broadcast packets transmitted per second

PKTRXBRD/s – Broadcast packets received per second


NFS stats

DAVG and KAVG are not available for network-backed storage

GAVG gives the end-to-end latency


Troubleshooting Disk Problems


Disk I/O latency

Host bus adapters (HBAs) - includes SCSI, iSCSI, RAID, and FC-HBA adapters

Latency stats from the Device, Kernel and the Guest

DAVG/cmd - Average latency (ms) from the Device (LUN)

KAVG/cmd - Average latency (ms) in the VMkernel

GAVG/cmd - Average latency (ms) in the Guest
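
The three latency counters are related: the guest-observed latency is the device latency plus the time spent in the VMkernel (GAVG = DAVG + KAVG). The sketch below uses that relationship to show which component dominates; the function names and sample values are hypothetical.

```python
def guest_latency_ms(davg_ms: float, kavg_ms: float) -> float:
    """Guest-observed latency as the sum of device and kernel latency."""
    return davg_ms + kavg_ms


def dominant_source(davg_ms: float, kavg_ms: float) -> str:
    """Name the component contributing most of the latency."""
    return "device" if davg_ms >= kavg_ms else "kernel"


# Hypothetical samples in ms: a slow array, then queuing inside the VMkernel.
print(guest_latency_ms(18.0, 0.5), dominant_source(18.0, 0.5))  # 18.5 device
print(guest_latency_ms(2.0, 6.0), dominant_source(2.0, 6.0))    # 8.0 kernel
```

High device latency points toward the storage array or fabric, while latency accumulating in the kernel points toward queuing on the host.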


Problem with the disk subsystem

Bad throughput

Good throughput

Device Latency is high - cache disabled

Low device Latency


Insufficient Queue depth

Non-zero KAVG

Queuing at the HBA


FC bottleneck

‘v’ – VM view

‘u’ – device view

‘d’ – adapter view


vStorage API for Array Integration (VAAI) stats

CLONE_RD, CLONE_WR: Number of Clone read/write requests

CLONE_F: Number of Failed clone operations

MBC_RD/s, MBC_WR/s – Clone read/write MBs/sec

ATS – Number of ATS commands

ATSF – Number of failed ATS commands

ZERO – Number of Zero requests

ZEROF – Number of failed zero requests

MBZERO/s – Megabytes Zeroed per second


VAAI - virtual disk creation example

▪ vStorage API for Array Integration (VAAI)


SCSI reservation conflicts


Other diagnostic tools


Other diagnostic tools (1 of 2)

▪ sched-stats and schedtrace

• vm-support -s/-S flag captures sched-stats

• vm-support -c flag captures scheduler trace; this takes a lot of disk space

▪ memstats

• Provides detailed memory usage stats with resource pool hierarchy

▪ ft-stats

• FT Virtual Machine stats

• Collected with vm-support -s/-S flag


Other diagnostic tools (2 of 2)

▪ swatchStats

• Stopwatch stats for VMFS, SCSI events

▪ vscsiStats

• Virtual machine SCSI disk I/O stats

• Provides histogram information for latency, IO size, inter-arrival time and outstanding I/Os


vscsiStats

Virtual SCSI disk handle IDs - unique across virtual machines

World group leader id

Virtual Machine Name

# vscsiStats -l


vscsiStats – latency histogram

# vscsiStats -p latency -w 118739 -i 8205

Latency in microseconds

I/O distribution count


vscsiStats – iolength histogram

# vscsiStats -p iolength -w 118739 -i 8205

I/O block size

Distribution Count
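
Both histograms above report a count of I/Os per bucket. The sketch below shows one way such a distribution could be built from individual measurements. The bucket limits and latency values are hypothetical and are not the limits vscsiStats uses.

```python
from bisect import bisect_left


def histogram(values, upper_bounds):
    """Count values per bucket; bucket i holds values <= upper_bounds[i].

    A final overflow bucket holds values above the last bound.
    `upper_bounds` must be sorted in increasing order.
    """
    counts = [0] * (len(upper_bounds) + 1)
    for value in values:
        counts[bisect_left(upper_bounds, value)] += 1
    return counts


# Hypothetical bucket limits and per-command latencies, both in microseconds.
bounds_us = [100, 500, 1000, 5000]
latencies_us = [80, 100, 120, 450, 700, 900, 4000, 12000]

for bound, count in zip(bounds_us + ["> 5000"], histogram(latencies_us, bounds_us)):
    print(bound, count)
```

Unlike an average, the distribution keeps outliers visible: the single 12,000 microsecond command lands in its own overflow bucket.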