© 2017 VMware Inc. All rights reserved.
Toronto VMUG Q1 Meeting
Steve Sykes, Staff Engineer
Global Support, Premier Services
January 31, 2017
Troubleshooting Storage Performance
Agenda
• The ESXi Storage Stack
• Troubleshooting Performance
• Recommended Practices, Tools and Tips
• Steve’s 4-dimensional framework for latency
• Sample case # 1: Latency
• Sample case # 2: Unresponsive guests
• Community and other useful resources
VMware ESXi Architecture
Physical Hardware
ESXi
Virtual Machine
Guest OS
Monitor (BT, HW, PV)
Memory
Allocator
NIC Drivers
Virtual Switch
I/O Drivers
File System
Scheduler
Virtual NIC Virtual SCSI
TCP/IP
File
System
I/O Drivers
Disk I/O Latencies
Application / Guest OS
ESX Storage
Stack
VMM
Driver
KAVG
DAVG
GAVG
QAVG
Fabric
vSCSI
HBA
Time spent in the ESX storage stack is minimal, for all practical purposes
KAVG ~= QAVG
In a well-configured system, QAVG should be zero
* KAVG = GAVG – DAVG
Array SP
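The starred identity above can be checked mechanically. A minimal Python sketch, with illustrative values rather than numbers from a real host:

```python
def kavg(gavg_ms, davg_ms):
    """KAVG = GAVG - DAVG: time spent in the ESXi storage stack."""
    return round(gavg_ms - davg_ms, 3)

# Healthy host: nearly all guest-observed latency is device round-trip time.
print(kavg(10.2, 10.0))   # 0.2 ms, effectively zero kernel time

# Queuing problem: kernel time dominates; investigate (threshold: 1 ms).
print(kavg(25.0, 10.0))   # 15.0 ms of KAVG, well above 1 ms
```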
Disk I/O Queues
GQLEN – Guest Queue
AQLEN – Adapter Queue
WQLEN – World Queue
DQLEN – Device / LUN Queue
SQLEN – Array SP Queue
DQLEN
WQLEN
SQLEN
GQLEN
DQLEN can change dynamically when SIOC is enabled
AQLEN is reported in esxtop
Application / Guest OS
ESX Storage
Stack
VMM
Driver
Fabric
vSCSI
HBA
Array SP
Use the Right Tool
• esxtop
2 sec data points, VERY granular, not scalable across hosts
• vRealize Operations
5 min data points, very scalable, best starting view
• vCenter Performance Charts
20 sec data points, okay real-time data, poor history; recommend vROps
• VSAN Observer
Most detailed tool to troubleshoot VSAN related performance
• 3rd Party
Ensure you know what the counters mean and their sample rate
vRealize Operations
• vRealize Operations
– Manage storage performance at scale
– Integrate with your storage OEM
• VMware Virtual SAN with a management pack for storage monitoring:
– Virtual SAN object and component limits
– Disks / disk groups
– Virtual SAN datastore
VSAN Observer
• VSAN Observer is the engineering performance tool.
– Latency
– IOPS
– Congestion
– Outstanding IO
– Bandwidth
• Do not use esxtop for VSAN performance analysis
Storage: Key Indicators
• Kernel Latency Average (KAVG)
This counter tracks the latency of I/O passing through the VMkernel
Investigation Threshold: 1ms
• Device Latency Average (DAVG)
This is the latency seen at the device driver level. It includes the round-trip time between the HBA and the storage.
Investigation Threshold: 10-15ms, lower is better, some spikes okay
• Guest Latency Average (GAVG)
This is the latency seen at the guest level. It is effectively DAVG + KAVG. It is the key metric for network-attached storage.
Investigation Threshold: 10-15ms, lower is better, some spikes okay
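A minimal sketch of those investigation thresholds as code. The cutoffs (1 ms for KAVG, 15 ms for DAVG/GAVG) come from the slide's guidance, not a hard VMware limit, and the function name is my own:

```python
# Investigation thresholds in milliseconds, per the key-indicator guidance.
THRESHOLDS_MS = {"KAVG": 1.0, "DAVG": 15.0, "GAVG": 15.0}

def flag_latencies(sample):
    """Return the counters in `sample` that exceed their investigation threshold."""
    return [name for name, value in sample.items()
            if value > THRESHOLDS_MS.get(name, float("inf"))]

# Device latency is high but kernel time is fine: look downstream of the host.
print(flag_latencies({"KAVG": 0.1, "DAVG": 22.0, "GAVG": 22.1}))
```

Occasional spikes above the threshold can be okay; it is sustained breaches that warrant investigation.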
esxtop
– For live troubleshooting and root cause analysis; finer granularity (2 second)
– Lots of metrics reported
CPU scheduler: c, i, p
Memory scheduler: m
Virtual switch: n
vSCSI: d, u, v
• c: cpu (default)
• m: memory
• n: network
• p: power management
• i: interrupts
• d: disk adapter
• u: disk device
• v: disk VM
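For anything longer than a live session, esxtop also has a batch mode (`esxtop -b -d 2 -n 30 > perf.csv`) whose CSV you can post-process. A hedged sketch of pulling the latency counters out of such a capture; the exact counter names ("Average Device MilliSec/Command", etc.) are assumptions about the header format, so check the header row of your own capture:

```python
import csv, io

# Tiny inline stand-in for an esxtop batch capture (header names assumed).
SAMPLE = (
    '"(PDH-CSV 4.0)","\\\\esx1\\Physical Disk(vmhba1)\\Average Device MilliSec/Command",'
    '"\\\\esx1\\Physical Disk(vmhba1)\\Average Kernel MilliSec/Command"\n'
    '"01/31/2017 11:19:05","12.5","0.2"\n'
)

def latency_columns(text):
    """Map each latency counter name to its list of sampled values."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    return {name: [float(row[i]) for row in data]
            for i, name in enumerate(header)
            if "MilliSec/Command" in name}

for name, values in latency_columns(SAMPLE).items():
    print(name.rsplit("\\", 1)[-1], values)
```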
esxtop Screens
esxtop disk adapter screen (d)
Host bus adapters (HBAs) - includes SCSI,
iSCSI, RAID, and FC-HBA adapters
Latency stats from the Device,
Kernel and the Guest
DAVG/cmd - Average latency (ms) from the Device (LUN)
KAVG/cmd - Average latency (ms) in the VMKernel
GAVG/cmd - Average latency (ms) in the Guest
Guest Level Issues
Questions to ask
– Is in-guest / application latency > GAVG (vCenter) latency?
– Are latency and IOPS low, but performance still “bad”?
Guest-level queue
– Guest application tuning / threads / outstanding I/Os
– For very high IOP levels, use multiple vSCSI controllers / disks
Guest-level drivers
– PVSCSI: investigate interrupt coalescing
Alignment
Filesystem optimizations: fragmentation, sync/async, …
Application / Guest OS
ESXi Level Issues
Device Queue Overflow
World Queue Limiting
High %SYS/Chargeback or VMWAIT
– Blocked Waiting on I/O
– Blocked Waiting on Swapping
High Failed Disk IOPs
SIOC Kicked in – Latency Threshold
VM IOP Limit Set
Questions to ask
– Is KAVG > 1 ms?
– Is the device queue full?
– Is ESX host CPU > 85%?
– Is VM %SYS > 35%?
– Is VMWAIT > 5%?
ESX Storage
Stack
VMM
Driver
vSCSI
Array Level Issues
Engage your storage partner to assist in diagnosis
Questions to ask
– Is DAVG > 20 ms?
– What is the array health and utilization?
– What is the array reporting for service times?
Fabric
Array SP
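The "questions to ask" on the guest, ESXi, and array slides can be encoded as a first-pass triage. A hedged sketch; the thresholds come straight from the slides, the function and labels are my own, and a real investigation needs the surrounding context too:

```python
def triage(kavg_ms, davg_ms, host_cpu_pct, vm_sys_pct, vmwait_pct):
    """Map the slide thresholds to a suspected layer; a rough first pass only."""
    suspects = []
    if kavg_ms > 1.0 or vm_sys_pct > 35 or vmwait_pct > 5 or host_cpu_pct > 85:
        suspects.append("ESXi level (queues, %SYS, VMWAIT, host CPU)")
    if davg_ms > 20.0:
        suspects.append("Array / fabric level (engage your storage partner)")
    # Nothing above threshold: look inside the guest itself.
    return suspects or ["Guest level (app tuning, vSCSI layout, alignment)"]

print(triage(kavg_ms=0.1, davg_ms=35.0, host_cpu_pct=40, vm_sys_pct=5, vmwait_pct=1))
```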
Device Queue Full
– KAVG is non-zero: a queuing issue
– LUN queue depth is 32
– 32 I/Os in flight and 32 queued
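A tiny model of that picture: with DQLEN = 32 and 64 commands outstanding, 32 are in flight at the device and 32 wait in the kernel queue, which is exactly where KAVG accumulates. The numbers are illustrative:

```python
DQLEN = 32  # LUN queue depth from the slide

def split_outstanding(outstanding):
    """Split outstanding commands into (in flight at device, queued in kernel)."""
    in_flight = min(outstanding, DQLEN)
    queued = outstanding - in_flight
    return in_flight, queued

print(split_outstanding(64))   # queue overflow: expect non-zero KAVG
print(split_outstanding(20))   # below DQLEN: no queuing, KAVG ~ 0
```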
Disk I/O Queuing – World Queue
World ID
World queue length – modifiable via Disk.SchedNumReqOutstanding
Background: CPU State Times
Elapsed Time = RUN + RDY + CSTP + WAIT
– WAIT includes IDLE, VMWAIT, and SWPWT (blocked on I/O or swapping)
– RDY includes MLMTD (ready, but descheduled by a CPU limit)
Guest I/O chargeback: %SYS time
CPU frequency scaling:
– Turbo Boost: USED > (RUN – SYS)
– Power management: USED < (RUN – SYS)
Identifying storage connectivity issues
I/O activity to NFS
datastore
System time charged
for NFS activity
NFS Connectivity Issue (1 of 2)
Identifying storage connectivity issues
VM blocked,
connectivity lost to
NFS datastore
No I/O activity on the
NFS datastore
VM is not using
CPU
NFS Connectivity Issue (2 of 2)
Performance Impact of Swapping
Some swapping activity
Time spent in blocked
state due to swapping
Storage – Recommendations
• Use Multiple vSCSI Adapters
Allows for more queues and I/Os in flight
• Use the PVSCSI Adapter
More efficient I/Os per cycle
• Don’t Use RDMs
Unless needed for shared-disk clustering; no longer a performance advantage
• Leverage Your Storage OEM’s Integration Guide
They provide necessary guidance on items like multipathing; 80% of issues are solved here
“VMDK on VMFS” or “RDM”
• There really is no difference in performance between VMDK on VMFS and RDM
• https://blogs.vmware.com/vsphere/2013/01/vsphere-5-1-vmdk-versus-rdm.html
• Use RDMs ONLY when you require shared-disk clustering (or native SAN tools)
“Thick” vs “Thin” (MB/s I/O throughput)
• Thin (fully inflated and zeroed) disk performance = thick eager-zeroed disk performance
• The performance impact is due to zeroing, not the allocation of new blocks
• To get maximum performance from the start, use thick eager-zeroed disks (think business-critical apps)
• With lazy zeroing, maximum performance still arrives eventually, but blocks must be zeroed on first write before you reach it
http://www.vmware.com/pdf/vsp_4_thinprov_perf.pdf
Iometer
is an I/O subsystem measurement and characterization tool for single and clustered systems (Windows and Linux)
Free (Open Source)
Single or Multi-server capable
Multi-threaded
Metrics Collected
• Total I/Os per Sec.
• Throughput (MB/s)
• CPU Utilization
• Latency (avg. & max)
I/O Analyzer
is a virtual appliance that provides a simple, standardized way of measuring storage performance.
http://labs.vmware.com/flings/io-analyzer
Readily deployable virtual appliance
Easy configuration and launch of I/O
tests on one or more hosts
I/O trace replay as an additional
workload generator
Ability to upload I/O traces for
automatic extraction of vital metrics
Graphical visualization
Storage Profiling Tips and Tricks
Common IO profiles (database, web, etc.): http://blogs.msdn.com/b/tvoellm/archive/2009/05/07/useful-io-profiles-for-simulating-various-workloads.aspx
Make Sure to Check / Try:
– Load balancing / multi-pathing
– Queue depth & outstanding I/Os
– pvSCSI Device Driver
Look out for:
– I/O contention
– Disk Shares
– SIOC & SDRS
– IOP Limits
vscsiStats – DEEP Storage Diagnostics
vscsiStats characterizes IO for each virtual disk
• Allows us to separate each type of workload into its own container and observe trends
Histograms are only collected if enabled; no overhead otherwise
Metrics
I/O Size
Seek Distance
Outstanding I/Os
I/O Interarrival Times
Latency
Steve’s 4-dimensional framework for Latency discussions
• Magnitude
How high are the spikes when they happen?
• Frequency
How often / what times / days do the spikes occur?
• Duration
How long do they last when they occur?
• Spread
How many hosts / datastores are involved?
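The four dimensions make a natural per-incident record. A sketch; the field names mirror the framework above, while the types and example values are my own choices:

```python
from dataclasses import dataclass

@dataclass
class LatencySpike:
    magnitude_ms: float   # how high are the spikes?
    per_day: int          # frequency: how often do they occur?
    duration_s: float     # how long do they last?
    hosts: int            # spread: how many hosts are involved?
    datastores: int       # spread: how many datastores are involved?

# Illustrative incident: half-second spikes, ~12 a day, confined to one datastore.
incident = LatencySpike(magnitude_ms=520, per_day=12, duration_s=39, hosts=3, datastores=1)
print(incident)
```

Filling one of these in per incident keeps everyone on the same page about the symptoms before the finger-pointing starts.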
Magnitude – It Matters
Image Credit: http://housepetscomic.wikia.com/wiki/File:Order_Of_Magnitude.png
Magnitude – How High?
• Magnitude - minor
– Spikes of 30, 40, 50 (or even 100) milliseconds could be IOPS exceeding the underlying capacity of the hardware
– This level of magnitude will possibly cause small queues to develop
– Depending on the duration, this might cause applications to feel pain, but not intolerable – a “dull ache” periodically
• Magnitude - intermediate
– When the spikes get up towards 500 milliseconds or greater, not likely an IOPS issue
– This is approximately 50x as long as we would normally expect for each command to complete (i.e. 50 x 10 ms = 500 ms)
– Queues will most certainly develop, and the queues may get sufficiently long that the workload perceives it as an outage
• Magnitude - major
– If single SCSI commands take more than 1000 milliseconds (1 second) to execute, there are serious issues indeed
– Queues will almost certainly get sufficiently long that workloads will perceive storage is unavailable
– In both the intermediate and major cases here, duration and spread must be considered
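The three bands above can be expressed as a hedged helper. The boundaries (100 ms, 500 ms, 1000 ms) are taken from the slide's examples, not a VMware specification:

```python
def magnitude_band(spike_ms):
    """Classify a latency spike into the slide's minor / intermediate / major bands."""
    if spike_ms >= 1000:
        return "major"          # workloads may perceive storage as unavailable
    if spike_ms >= 500:
        return "intermediate"   # ~50x a normal 10 ms service time; queues build
    return "minor"              # possible IOPS overload; a periodic "dull ache"

print([magnitude_band(ms) for ms in (40, 520, 1500)])  # ['minor', 'intermediate', 'major']
```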
Frequency – Variations in patterns
Image Credit: https://www.sfu.ca/~truax/Frequency_Modulation.html
Frequency – How often / What days / times?
• Frequency - occasional
– Once in a while, no set pattern in terms of day of week, time of day
– Also seemingly “random” with regard to datastores, hosts
– Not consistent over time, appears to come and go
• Frequency – some patterns
– Sometimes we see certain time frames of the day; i.e. middle of the night
– These time slots are usually reserved for maintenance type activity
– A “flood” of activity much greater than the environment was engineered for can be the cause
• Frequency – all over the place
– In this scenario, we see events logged all through the day / night, and multiple days of week, weeks of month
– Workloads may perceive storage is unavailable (because of excessive queuing)
– As with magnitude, duration and spread must be considered together with frequency
Duration – How Long … ?
Image Credit: http://www.jqueryscript.net/images/Easy-Time-Duration-Picker-Plugin-with-jQuery-jQuery-UI.jpg
Duration – How Long Do the Spikes last?
• Duration – a “blip”
– Durations of a few milliseconds, or even up to a second, are not necessarily material (unless the Magnitude is high)
– Generally don’t last long enough to cause queuing
– Consider, however, the frequency – if they are all consecutive, even short duration “blips” can add up to longer periods
• Duration – moderate in length
– Generally these are greater than 1 second, but likely under a minute or two
– Cause may be some sort of queue development and clearing
– Here, other factors such as frequency are relevant – if they happen too frequently, the effect can be much worse
• Duration - elongated
– Depending on the Magnitude, if spikes go on and on for many seconds, the effect can be cumulative
– If the spike lasts for minutes, and the Magnitude is sufficiently high, workloads may perceive “outages”
– Again, most important to consider this factor together with Magnitude and Frequency
Spread – Hosts / Datastores
Image Credit: https://www.wired.com/wp-content/uploads/2015/04/epi-rail-web1-1024x1024.jpg
Spread – Confined vs Widespread?
• Spread – confined
– If the issue is on a single host (or a small subset of the total hosts), that suggests a line of inquiry
– Often it can be limited to a single cluster
– Same inference for single datastore, or small subset of datastores
• Spread – intermediate
– Multiple hosts, multiple clusters, multiple datacenter objects
– More than one array type, and/or significant representation of datastores
– This suggests more of a fabric issue, especially if multiple arrays involved
• Spread – widespread / universal
– Almost all hosts involved, many clusters and/or datacenters
– Most datastores involved also
– This may be a combination of fabric and array issues
Latency - Symptoms
Host logs: /var/run/log/vobd.log
Magnitude: over half a second. Duration: 39 seconds.
Latency - Evidence
Parsed from logs / imported to Excel for analysis
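The parsing step itself can be scripted rather than done by hand. A hedged sketch that extracts latency-deterioration events from vobd.log; the message text below matches the common "performance has deteriorated" observation, but treat the exact wording as an assumption and adapt the regex to your own logs (the device name and values here are made up):

```python
import re

# Regex modeled on the ESXi "performance has deteriorated" message format.
PATTERN = re.compile(
    r"(?P<ts>\S+Z).*?Device (?P<dev>naa\.\w+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<old>\d+) microseconds? "
    r"to (?P<new>\d+) microseconds?"
)

# Illustrative log line, not from a real host.
line = ("2017-01-31T11:19:05.105Z: [scsiCorrelator] Device naa.4a80 performance "
        "has deteriorated. I/O latency increased from average value of 1343 "
        "microseconds to 640900 microseconds.")

m = PATTERN.search(line)
print(m.group("ts"), m.group("dev"), "latency now", int(m.group("new")) / 1000.0, "ms")
```

Run over the whole file, this yields the timestamp / device / magnitude tuples behind the Excel chart on this slide.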
But where do we look?
DAVG spans the whole round trip: host HBA → driver → firmware → wire → switch → wire → array front end → LUN → media … and return!
Strategy / Approach
• Collect data from the hosts
– Based on vm-support log extracts
– Objective: Understand the Magnitude, Frequency, Duration and Spread
– Get everyone on the same page regarding the symptoms
• Share the data collaboratively
– Both storage and fabric support teams
– Get the correlating data from the array stats
– Compare with the hosts’ experience
• Does the array data agree with the host experience?
– If so, then array support / vendor can investigate / make changes
– If not, then issue must be in the fabric, so different direction for the investigation
– After any changes are made, collect fresh logs and perform comparative analysis
Configuration issues – many possibilities
• Round Robin PSP
– IOPS=1000 vs IOPS=1
• Fabric
– Switches (hardware / field upgradeable code such as firmware)
– Cable plant (defective or inferior quality cables / connectors)
– Zoning issues
• HBA issues
– Drivers / firmware / hardware issues
– Queue depth settings
• Array issues
– Front end processor issues
– Defective media
– De-duplication and other overhead activity
– High % of cache misses
Sample Case # 2 – Unresponsive guests
… and yet, can ping the hosts, no apparent network issues
Can this be a storage issue?
Let’s look in the logs
Host logs: /var/run/log/vobd.log
Between 11:19:05.105Z and 11:21:46.217Z, no I/O scheduled for datastore UUID 4a80
https://kb.vmware.com/kb/2136081 - "Understanding lost access to volume messages in ESXi 5.5/6.x"
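The outage window is just datetime arithmetic on the two timestamps from the log excerpt above, which is handy when triaging a batch of these events:

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"
lost = datetime.strptime("11:19:05.105", FMT)      # access to volume lost
restored = datetime.strptime("11:21:46.217", FMT)  # access restored
outage = (restored - lost).total_seconds()
print(f"no I/O scheduled for {outage:.3f} s")      # 161.112 s, ~2m41s
```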
And speaking of logs …
• Many people find this painful
– But it is not meant to be
– https://kb.vmware.com/kb/1008524 - "Collecting diagnostic information for VMware products"
– The above KB has links to every product (or should do – please report if not)
– If you have trouble collecting logs, then that’s a reason for an SR all by itself
• Which logs?
– For storage, almost always get vm-supports from ALL hosts in any cluster of interest
– If LUNs are presented to multiple clusters, then ALL hosts in EACH cluster
– Generally vCenter and vSphere client logs can be omitted
– But … it is the responsibility of the investigator (TSE) to prescribe which logs are needed
• Uploading
– https://kb.vmware.com/kb/2069559 - "Uploading diagnostic information for VMware through the Secure FTP portal"
– Make sure to use Binary transmission mode
– Make sure to change into the SR directory (after creating as necessary – directory name is SR #) before transferring files
Bottom Line Principles
• Optimally configured / engineered environments
– Should exhibit few, if any, latency alerts and/or VMFS heartbeat “timedout” events
– Log analysis can be done anytime, if problems are suspected
– Also, can be done when problems are NOT suspected – provides useful baseline info
– Better to cite evidence, than to throw darts
• Collaboration is key
– The root cause(s) of these issues are usually external to vSphere, BUT …
– ESXi log analysis can help direct the investigation, AND …
– Storage and fabric support teams are needed in addition to vSphere admins, AND …
– Vendors need to get engaged also
• It’s in everyone’s interest that things are smooth and stable
– Often, if these issues are chronic, word starts spreading that virtualized apps “can’t keep up” with physicals
– In most cases, that is no longer true
– And even if it is in some cases – we want to fix that!
Community Resources
VMware’s Performance Technology Pages (Whitepapers Here)
– http://www.vmware.com/technical-resources/performance/resources.html
VMware’s Tech-Marketing Performance Blog
– http://blogs.vmware.com/vsphere/performance/
VMware’s Perf-Eng Blog (VROOM!)
– http://blogs.vmware.com/performance
Performance Community Forum
– http://communities.vmware.com/community/vmtn/general/performance
VMware Performance Links – Master List
– https://communities.vmware.com/docs/DOC-25253
Virtualizing Business Critical Applications
– http://www.vmware.com/solutions/business-critical-apps/
Resources
VMware’s Performance – Technical Whitepapers
http://www.vmware.com/resources/techresources/cat/91,96
Resources
Performance Best Practices
http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.5.pdf
http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/vmware-perfbest-practices-vsphere6-0-white-paper.pdf
Troubleshooting Performance Related Problems in vSphere Environments
http://communities.vmware.com/docs/DOC-23094 (vSphere 5.x with vCOps)
Resources
Virtualizing Microsoft Business Critical Applications on VMware vSphere
by: Matt Liebowitz, Alexander Fontana
vSphere High Performance Cookbook
by: Prasenjit Sarkar
Troubleshooting Storage Performance
By: Mike Preston
VMware vSphere Performance: Designing CPU, Memory, Storage, and Networking for Performance-Intensive Workloads
By: Matt Liebowitz, Christopher Kusek, Rynardt Spies
Virtualizing SQL Server with VMware: Doing IT Right
By: Jeff Szastak, Michael Corey, Michael Webster
Virtualizing Oracle Databases on vSphere
By: Don Sullivan, Kannan Mani
VMware vRealize Operations Performance and Capacity Management
By: Iwan ‘e1’ Rahabok
Resources
VMware Hands-On-Labs
http://labs.hol.vmware.com/
HOL-SDC-1404:
vSphere Performance Optimization – This has always been one of the most popular labs, with content for both the beginner and the advanced vSphere administrator. You can learn the basics of vSphere performance or delve into esxtop or vNUMA.
http://labs.hol.vmware.com/HOL/#lab/1474
Thank You
Steve SykesStaff Engineer, Global Support, Premier [email protected]