dan-brinkmann
DESCRIPTION
From the Lewan
vSphere Performance Monitoring and Troubleshooting
Overview
What?
CPU, Memory, Disk, Network
How?
Use available tools and a systematic methodology
Why?
Need to build confidence in virtualizing critical and high demand applications
Top Issues:
Storage "performance capacity" oversubscription
Memory oversubscription
SMP overuse
Firmware & driver issues
What tools do we have at our disposal?
Top tools for information collection:
vCenter - Performance charts and alarms
Guest OS* - Task Manager/Resource Monitor and PerfMon
ESX Host - esxtop and vscsiStats
vSphere PowerCLI
*Guest-based monitoring is subject to inaccuracy, since the guest OS cannot account for time when it is not scheduled on a physical CPU
Prepare vCenter Settings
Prepare custom vCenter alerts:
Host Console Swap In Rate: 512 KBps Warning, 1024 KBps Alert
Host Console Swap Out Rate: 512 KBps Warning, 1024 KBps Alert
VM CPU Ready: 1000 ms Warning, 2000 ms Alert
VM Disk Latency: 20 ms Warning, 50 ms Alert
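The warning/alert pairs above can be read as a simple two-level classification. The following Python sketch illustrates that logic outside vCenter; the dictionary keys are hypothetical labels chosen for this example (not vCenter alarm names), and it assumes a value equal to a threshold already triggers that level.

```python
# Thresholds copied from the alert list above. The keys are hypothetical
# labels for this sketch, not names used by vCenter.
THRESHOLDS = {
    "host_swap_in_kbps": (512, 1024),
    "host_swap_out_kbps": (512, 1024),
    "vm_cpu_ready_ms": (1000, 2000),
    "vm_disk_latency_ms": (20, 50),
}


def classify(metric: str, value: float) -> str:
    """Return "ok", "warning", or "alert" for one sampled value.

    Assumption: reaching a threshold (>=) is enough to trigger that level.
    """
    warning, alert = THRESHOLDS[metric]
    if value >= alert:
        return "alert"
    if value >= warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    samples = [("vm_cpu_ready_ms", 1500), ("vm_disk_latency_ms", 12)]
    for metric, value in samples:
        print(metric, value, classify(metric, value))
```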
Prepare esxtop
ESXTOP realtime monitoring:
esxtop (run command from SSH or tech-support mode)
s 2 (refresh view every 2 seconds)
V (View VMs only)
h (for quick in-tool command reference)
Batch Mode for a 5 minute capture of all stats:
esxtop -b -a -d 2 -n 150 > esxtop_capture.csv
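The batch capture length follows from the two numeric flags: -d is the delay between samples in seconds and -n is the number of samples, so 2 x 150 gives roughly 300 seconds. A small Python sketch of that arithmetic, useful when planning a longer or coarser capture (the function names are made up for this example):

```python
def capture_seconds(delay_s: int, iterations: int) -> int:
    """Approximate capture length: -d seconds between samples times -n samples."""
    return delay_s * iterations


def iterations_for(duration_s: int, delay_s: int) -> int:
    """Number of samples (-n) needed to cover duration_s at a given -d, rounded up."""
    return -(-duration_s // delay_s)  # ceiling division on integers


if __name__ == "__main__":
    # The command above: -d 2 -n 150
    print(capture_seconds(2, 150))
    # Samples needed for a 15 minute capture at -d 5
    print(iterations_for(15 * 60, 5))
```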
Prepare PowerCLI
Run PowerCLI:
Tip: Run as Administrator
Set-ExecutionPolicy RemoteSigned
Connect-VIServer -Server <host> -Protocol https -User <user> -Password <pass>
<host> can be the IP address or name of an ESX host or vCenter server
Get-VM
Get-Stat -common -realtime
Where do we get started?
Network Overview
[Diagram: network path components: Virtual Machine, vNIC, vSwitch or dvSwitch, pNIC, Core Switch]
Network
Troubleshooting Guidance:
1. Physical Issues - A bad cable, a failing switch port or NIC, or an incompatible/flawed firmware or device driver (use VMXNET3 whenever possible)
2. Configuration Issues - Inconsistent configuration of vSwitches, Port Groups, or upstream VLAN trunks
3. Capacity Issues - Too many VMs on a single NIC; inadequate switch backplane or uplink capacity; sharing “unmanaged” network infrastructure for storage and data
4. Thresholds – Bandwidth saturation, dropped packets
Network – What can we see?
[Diagram: coverage of Systems Management Tools and VMware Monitoring Tools across Core Switch, pNIC, Virtual Machine, and vNIC; dvSwitch (only)]
Network
vCenter Metrics:
Receive packets dropped
Transmit packets dropped
Network
ESXTOP Metrics:
Display  Metric  Threshold  Explanation
NETWORK  %DRPTX  1          Dropped packets transmitted, hardware overworked. Possible cause: very high network utilization.
NETWORK  %DRPRX  1          Dropped packets received, hardware overworked. Possible cause: very high network utilization.
Network
ESXTOP Commands:
esxtop
s 2 (refresh view every 2 seconds)
n (network view)
f (add or remove fields)
Network
ESXTOP Example:
Network
PowerCLI Commands:
Get-Stat -net -realtime
Get-Stat -Entity <Host> -stat net.droppedRx.summation
Get-Stat -Entity <Host> -stat net.droppedTx.summation
Network – What can’t we see?
[Diagram: Network Monitoring Tools coverage across Core Switch, pNIC, Virtual Machine, and vNIC; Cisco 1000v only]
Network
Possible resources for external monitoring:
Native Telnet/SSH/HTTP-based interface counters and stats
Third-party SNMP, NetFlow and ICMP tools
CPU Overview
[Diagram: Physical CPU, Virtual Machine, vCPU]
CPU
Troubleshooting Guidance:
1. Physical Issues - Rare and always catastrophic (i.e., obvious)
2. Configuration Issues - Too many / too few vCPUs per VM; SMP/HAL mismatch; incorrect CPU affinity settings
3. Capacity Issues - CPU saturation at the guest or host level; CPU starvation due to high IO or other system level ops
4. Thresholds – Waiting for CPU cycles (due to co-scheduling, swapping, high IO)
CPU – What can we see?
[Diagram: coverage of Systems Management Tools and VMware Monitoring Tools across Physical CPU, Virtual Machine, and vCPU]
CPU
vCenter Metrics:
Host/Guest Saturation
Stacked Graph (per VM)
Usage
CPU
vCenter Metrics:
Guest
Ready (realtime value in ms per 20-second sample; value/200 = n%)
Swap Wait
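The Ready counter in vCenter is a summation in milliseconds per sample, so it has to be scaled by the sample interval to become a percentage. A minimal Python sketch of the conversion, assuming the 20-second interval of realtime charts (historical rollups use longer intervals, which can be passed in instead):

```python
REALTIME_INTERVAL_S = 20  # assumption: realtime charts sample every 20 seconds


def ready_percent(ready_ms: float, interval_s: float = REALTIME_INTERVAL_S) -> float:
    """Convert a CPU ready summation value (ms) into a percentage of the interval."""
    return ready_ms * 100.0 / (interval_s * 1000.0)


if __name__ == "__main__":
    for ms in (1000, 2000):
        print(ms, "ms ->", ready_percent(ms), "%")
```

Under that assumption, 1000 ms corresponds to 5% and 2000 ms to 10% ready time. For a multi-vCPU VM the VM-level counter aggregates all vCPUs, so the per-vCPU figure is lower.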
CPU
ESXTOP Metrics:
Display  Metric  Threshold  Explanation
CPU      %RDY    10         Overprovisioning of vCPUs, excessive usage of vSMP, or a limit has been set (check %MLMTD).
CPU      %CSTP   3          Excessive usage of vSMP. Decrease the number of vCPUs for this particular VM. This should lead to increased scheduling opportunities.
CPU      %SYS    20         The percentage of time spent by system services on behalf of the world. Most likely caused by a high-IO VM. Check other metrics and the VM for a possible root cause.
CPU      %MLMTD  0          The percentage of time the vCPU was ready to run but deliberately wasn’t scheduled because that would violate the “CPU limit” settings. If larger than 0, the world is being throttled due to the limit on CPU.
CPU      %SWPWT  5          VM waiting on swapped pages to be read from disk. Possible cause: memory overcommitment.
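A batch capture can be screened against the %RDY threshold without opening a spreadsheet. The sketch below is one possible approach in Python; it assumes the capture is a perfmon-style CSV whose column headers end in "% Ready", and the host and VM names in the sample are invented for illustration. Column naming should be checked against a real capture before relying on it.

```python
import csv
import io

READY_THRESHOLD = 10.0  # %RDY threshold from the table above

# Hypothetical two-sample capture; header layout is an assumption.
SAMPLE = "\n".join([
    r'"(PDH-CSV 4.0) (UTC)(0)","\\esx01\Group Cpu(1001:vm-a)\% Ready","\\esx01\Group Cpu(1002:vm-b)\% Ready"',
    '"04/01/2011 10:00:00","2.5","14.2"',
    '"04/01/2011 10:00:02","3.1","11.8"',
])


def columns_over_ready(csv_text: str, threshold: float = READY_THRESHOLD) -> dict:
    """Map each '% Ready' column to its peak value when the peak exceeds threshold."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    ready_cols = [i for i, name in enumerate(header) if name.endswith("% Ready")]
    peaks = {}
    for row in reader:
        for i in ready_cols:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip blank or malformed cells
            peaks[header[i]] = max(peaks.get(header[i], 0.0), value)
    return {name: peak for name, peak in peaks.items() if peak > threshold}


if __name__ == "__main__":
    for name, peak in columns_over_ready(SAMPLE).items():
        print(name, peak)
```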
CPU
ESXTOP Commands:
esxtop
s 2 (refresh view every 2 seconds)
V (view VMs only)
c (CPU view)
e GID (expand/contract a VM world)
CPU
ESXTOP Example: Excessive vCPUs
CPU
ESXTOP Example: Now with fewer vCPUs
CPU
ESXTOP Example: SMP impacting multiple VMs
CPU
PowerCLI Example
Get-Stat -cpu
Get-Stat -Entity <VM> -stat cpu.ready.summation -realtime
Very cool script code at:
http://www.peetersonline.nl/index.php/vmware/examine-vmware-cpu-ready-times-with-powershell/
CPU – Not much else to see…
[Diagram: Offline Diagnostics & Systems Management Tools coverage across Physical CPU, Virtual Machine, and vCPU]
CPU
Possible resources for external monitoring:
Vendor-specific systems management tools, MS System Center, etc.
Memory Overview
[Diagram: Physical RAM, Virtual Machine, vRAM]
Memory
Troubleshooting Guidance:
1. Physical Issues - Rare and usually catastrophic
2. Configuration Issues - Memory overcommit; incorrect configuration of shares, reservations or limits
3. Capacity Issues - Physical memory exhaustion
4. Thresholds – Active memory swapping
Memory – What can we see?
[Diagram: coverage of Systems Management Tools and VMware Monitoring Tools across Physical RAM, Virtual Machine, and vRAM]
Memory
vCenter Metrics
Swap in rate
Swap out rate
Swap used
Memory
ESXTOP Metrics:
Display  Metric    Threshold  Explanation
MEM      MCTLSZ    1          If larger than 0, host is forcing VMs to inflate the balloon driver to reclaim memory, as the host is overcommitted.
MEM      SWCUR     1          If larger than 0, host has swapped memory pages in the past. Possible cause: overcommitment.
MEM      SWR/s     1          If larger than 0, host is actively reading from swap (vswp). Possible cause: excessive memory overcommitment.
MEM      SWW/s     1          If larger than 0, host is actively writing to swap (vswp). Possible cause: excessive memory overcommitment.
MEM      CACHEUSD  0          If larger than 0, host has compressed memory. Possible cause: memory overcommitment.
MEM      ZIP/s     0          If larger than 0, host is actively compressing memory. Possible cause: memory overcommitment.
MEM      UNZIP/s   0          If larger than 0, host is accessing compressed memory. Possible cause: host was previously overcommitted on memory.
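Every row in the memory table uses the same "larger than 0" rule, which makes it easy to turn one esxtop sample into a list of findings. A minimal Python sketch, assuming the sample has already been read into a dictionary keyed by the esxtop field names (the input shape is an assumption of this example):

```python
# (field, finding) pairs paraphrased from the table above.
CHECKS = [
    ("MCTLSZ", "ballooning: host is reclaiming memory through the balloon driver"),
    ("SWCUR", "host has swapped memory pages in the past"),
    ("SWR/s", "host is actively reading from swap"),
    ("SWW/s", "host is actively writing to swap"),
    ("CACHEUSD", "host has compressed memory"),
    ("ZIP/s", "host is actively compressing memory"),
    ("UNZIP/s", "host is accessing compressed memory"),
]


def memory_findings(sample: dict) -> list:
    """Return the findings whose counter is larger than 0; missing fields count as 0."""
    return [text for field, text in CHECKS if sample.get(field, 0) > 0]


if __name__ == "__main__":
    sample = {"MCTLSZ": 512.0, "SWCUR": 0.0, "SWR/s": 0.0, "SWW/s": 1.2}
    for finding in memory_findings(sample):
        print(finding)
```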
Memory
ESXTOP Commands:
esxtop
s 2 (refresh view every 2 seconds)
V (view VMs only)
m (memory view)
f (add or remove fields)
Memory
ESXTOP Example: m - Heavy swapping and ballooning
Memory
PowerCLI Commands:
Get-Stat -mem
Get-Stat -Entity <VM> -stat mem.swapoutRate.average -realtime
Get-Stat -Entity <VM> -stat mem.swapinRate.average -realtime
Get-Stat -Entity <VM> -stat mem.vmmemctl.average -realtime
Get-Stat -Entity <Host> -stat mem.swapused.average -realtime
Memory – The occasional DIMM failure…
[Diagram: Offline Diagnostics & Systems Management Tools coverage across Physical RAM, Virtual Machine, and vRAM]
Memory
Possible external monitoring options:
Vendor specific systems management tools, MS System Center, etc.
Don’t forget vCenter ‘Hardware Status’ reporting
Storage Overview
[Diagram: storage path components: Virtual Machine, SCSI Controller, vmdk, RDM, Datastore, HBA, Switch, Controller, LUN, Disk]
Storage
Troubleshooting Guidance:
1. Physical Issues - A bad cable, a failing switch port or HBA/NIC, or an incompatible/flawed firmware or device driver (use LSI Logic Parallel/SAS as appropriate)
2. Configuration Issues - Inconsistent or incorrect configuration of LUN masking, zoning, or multi-pathing; inappropriate resource provisioning; queue depth not aligned with storage type
3. Capacity Issues - Too many VMs or VMDKs on a LUN; too much IO load for an array or RAID group
4. Thresholds – Latency and queuing
Storage – What can we see?
[Diagram: coverage of Systems Management Tools and VMware Monitoring Tools across Virtual Machine, SCSI Controller, vmdk, RDM, Datastore, HBA, Switch, Controller, LUN, and Disk]
Storage
vCenter Metrics:
Datastore
Read latency
Write latency
Storage
ESXTOP Metrics:
Display  Metric    Threshold  Explanation
DISK     GAVG      20         Look at “DAVG” and “KAVG”, as the sum of both is GAVG.
DISK     DAVG      20         Disk latency most likely to be caused by the array.
DISK     KAVG      2          Disk latency caused by the VMkernel; high KAVG usually means queuing. Check “QUED”.
DISK     QUED      1          Queue maxed out. Possibly queue depth set too low. Check with array vendor for optimal queue depth value.
DISK     ABRTS/s   1          Aborts issued by guest (VM) because storage is not responding. Can be caused when paths failed.
DISK     RESETS/s  1          The number of commands reset per second.
DISK     CONS/s    20         SCSI Reservation Conflicts per second. Can be caused by too many VMDKs on a datastore.
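Because GAVG is the sum of DAVG and KAVG, comparing the two components shows whether latency originates at the array or in the VMkernel. The Python sketch below applies the thresholds from the table to one sample; treating a threshold as exceeded only when the value is strictly greater is an assumption of this example.

```python
GAVG_MS, DAVG_MS, KAVG_MS = 20.0, 20.0, 2.0  # thresholds from the table above


def latency_source(davg_ms: float, kavg_ms: float) -> dict:
    """Sum DAVG and KAVG into GAVG and name the layer(s) over threshold."""
    gavg_ms = davg_ms + kavg_ms
    suspects = []
    if davg_ms > DAVG_MS:
        suspects.append("array/device")
    if kavg_ms > KAVG_MS:
        suspects.append("VMkernel queuing (check QUED)")
    return {
        "gavg_ms": gavg_ms,
        "over_threshold": gavg_ms > GAVG_MS,
        "suspects": suspects,
    }


if __name__ == "__main__":
    print(latency_source(25.0, 0.5))  # latency dominated by the device
    print(latency_source(5.0, 4.0))   # modest total, but kernel queuing
```

Note that a sample can stay under the GAVG threshold while KAVG alone is already high, which is why the components are checked separately.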
Storage
[Diagram: latency measurement points in the I/O stack. Guest: Application, Filesystem, I/O Drivers. VMkernel: Virtual SCSI, Filesystem, Device Queue. Points labeled A, G, D, K, S, R.]
A = Application Latency
R = Physical Disk “Disk Secs/Transfer”
G = Guest Latency
K = ESX Kernel
D = Device Latency
Storage
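ESXTOP Commands (HBA/LUN):
esxtop
s 2 (refresh view every 2 seconds)
V (view VMs only)
d (disk adapter view)
f (add or remove fields)
e vmhba# (expand/contract an adapter)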
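Storage
ESXTOP Commands (LUN/Datastore):
esxtop
s 2 (refresh view every 2 seconds)
V (view VMs only)
u (disk device view)
L 38 (set the device name column width to 38 characters)
f (add or remove fields)
e <devname> (expand/contract a device)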
Storage
ESXTOP Commands (VM/VMDK):
esxtop
s 2 (refresh view every 2 seconds)
V (view VMs only)
v (disk VM view)
f (add or remove fields)
e GID (expand/contract a VM world)
Storage
ESXTOP Examples: d - Multipathing / Expand adapter to view targets
Storage
ESXTOP Examples: u - Queuing, Disk or Kernel?
Storage
ESXTOP Examples: v - Identify the IO consumer
Storage
vscsiStats Command:
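[root@host ~]# cd /usr/lib/vmware/bin
./vscsiStats -l (list running VMs and their world IDs)
./vscsiStats -s -w <worldid> (start collection for a VM)
./vscsiStats -w <worldid> -p all -c > /path/vscsistats.csv (print all histograms in CSV format)
./vscsiStats -x (stop collection)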
Storage
vscsiStats Example:
http://dunnsept.wordpress.com/2010/03/11/new-vscsistats-excel-macro/
Storage
vscsiStats histograms:
IO lengths of commands
IO lengths of Read commands
IO lengths of Write commands
Distance (in LBNs) between successive commands
Distance (in LBNs) between successive Read commands
Distance (in LBNs) between successive Write commands
Distance (in LBNs) between each command from the closest of previous 16
Latency of IOs in Microseconds (us)
Latency of Read IOs in Microseconds (us)
Latency of Write IOs in Microseconds (us)
Number of outstanding IOs when a new IO is issued
Number of outstanding Read IOs when a new Read IO is issued
Number of outstanding Write IOs when a new Write IO is issued
Latency of IO interarrival time in Microseconds (us)
Latency of IO interarrival time for Reads in Microseconds (us)
Latency of IO interarrival time for Writes in Microseconds (us)
Storage
PowerCLI Commands:
Get-Stat -disk
Get-Stat -stat disk.totalLatency.average -realtime
Get-Stat -stat disk.deviceLatency.average -realtime
Get-Stat -stat disk.kernelLatency.average -realtime
Storage – What can’t we see?
[Diagram: Storage Monitoring Tools coverage across Virtual Machine, SCSI Controller, vmdk, RDM, Datastore, HBA, Switch, Controller, LUN, and Disk]
Storage – More of what we can’t see
[Diagram: Network/Fabric Monitoring Tools coverage across Virtual Machine, SCSI Controller, vmdk, RDM, Datastore, HBA, Switch, Controller, LUN, and Disk]
Storage
Possible external monitoring solutions:
Vendor specific SAN and fabric/network tools, native Telnet/SSH/HTTP-based tools for most networks, third-party SNMP-based tools
Working with PowerCLI
PowerCLI Tips:
For a complete list of stat objects:
Get-StatType -Entity <Host/VM>
Pipe the outputs to a file:
Get-Stat -stat <stat> -realtime | ft -autosize > c:\temp\<filename>.csv
Import the file into a spreadsheet using fixed-width parsing (the table output is fixed-width text despite the .csv extension)
Build pretty graphs
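The fixed-width import step can also be scripted. The Python sketch below mirrors what a spreadsheet's fixed-width import does: the column spans are supplied by hand, and the three-column sample table is hypothetical, built in code so the alignment is exact. It is only an illustration; real output will have different columns and widths, and a file written by PowerShell's `>` redirect may need to be opened with `encoding="utf-16"`.

```python
import csv
import io

# Hypothetical (start, end) character spans for a three-column table; in
# practice they would be read off the header row of the captured file.
SPANS = [(0, 19), (20, 40), (41, 46)]


def fixed_width_to_csv(text: str, spans=SPANS) -> str:
    """Convert fixed-width table text to CSV, skipping blank and dashed lines."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or set(stripped) <= {"-", " "}:
            continue  # blank line or the dashed separator under the header
        writer.writerow([line[start:end].strip() for start, end in spans])
    return out.getvalue()


if __name__ == "__main__":
    row = "{:<19} {:<20} {:>5}"
    sample = "\n".join([
        row.format("MetricId", "Timestamp", "Value"),
        row.format("--------", "---------", "-----"),
        row.format("cpu.ready.summation", "4/1/2011 10:00:00 AM", 812),
        row.format("cpu.ready.summation", "4/1/2011 10:00:20 AM", 1430),
    ])
    print(fixed_width_to_csv(sample))
```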
Way More Information
ESXTOP / vscsiStats / PowerCLI:
http://www.yellow-bricks.com/esxtop/ Special thanks to Duncan Epping!
http://communities.vmware.com/docs/DOC-3930
http://communities.vmware.com/docs/DOC-9279
http://communities.vmware.com/docs/DOC-10095
http://www.vmware.com/support/developer/PowerCLI/PowerCLI41/html/Get-Stat.html
http://www.lucd.info/2009/12/30/powercli-vsphere-statistics-part-1-the-basics/
http://simongreaves.co.uk/blog/esxtop-guide
http://dunnsept.wordpress.com/2010/03/11/new-vscsistats-excel-macro/
Easy button?
What is the problem with these tools?
Limited alerting mechanisms, no collection automation or historical data for comparison, and no correlation of events!
vCenter Operations Standard / Enterprise