USING PERFORMANCE MONITOR TO TROUBLESHOOT SAN PERFORMANCE

© 2009, LeftHand Networks. All rights reserved.

Introduction

This document is intended for storage administrators who are concerned about the performance of their LeftHand storage system. Its goal is to give the reader the knowledge and tools necessary to determine whether the SAN is performing as expected and, if so, whether there is room for more performance.

The performance of the SAN can dramatically affect the performance of the applications that depend on that storage. A poorly performing SAN can result in user complaints, application jobs not completing in time, and can impact disaster recovery plans and regulatory compliance. By either identifying or eliminating the SAN as the cause of a performance issue, resolution can occur that much faster.


Contents

Performance Troubleshooting Flowchart
Best Practices for Gathering Statistics
Understanding Expected SAN Performance
FAQ
Examples
Glossary
Contacting Support


Performance Trends and Relationships

When analyzing SAN¹ performance, it is important to keep in mind several trends and relationships between the statistics being monitored, as changing one aspect of a workload can have a measurable impact on another. As a result, performance expectations can change, and knowing how these variables affect one another can help to set realistic expectations of SAN performance.

¹ The terms "SAN" and "cluster" are used throughout this document. "SAN" is used when discussing topics that apply to all storage; "cluster" is used when discussing topics specific to an HP LeftHand product.

I/O Size

The size of the I/O to and from the SAN affects the measurable performance statistics of the SAN. Specifically, the smaller the I/O size, the more I/Os per second (IOPS) the SAN can process; the corollary is a decrease in throughput (as measured in MB/s). Conversely, as I/O size increases, IOPS decrease but throughput increases. Once an I/O exceeds a certain size, latency also increases, because the time required to transport each I/O grows to the point where the disk itself is no longer the major influence on latency. The following chart shows the relationship between I/O size, IOPS, and throughput.

Chart 1: Sample relationship between I/O Size, Throughput, and I/Os per Second
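As an illustrative sketch (not taken from this paper), the relationship shown in Chart 1 reduces to throughput = IOPS x I/O size. The hypothetical helper below makes that arithmetic explicit; the sample IOPS figures are placeholders, not measured LeftHand values.

```python
# Illustrative sketch: throughput (MB/s) = IOPS x I/O size.
# The sample IOPS figures are hypothetical, not measured LeftHand values.

def throughput_mbs(iops: float, io_size_kb: float) -> float:
    """Convert an IOPS figure at a given I/O size into MB/s."""
    return iops * io_size_kb / 1024.0

if __name__ == "__main__":
    # As I/O size grows, IOPS typically falls but MB/s rises.
    for io_size_kb, iops in [(4, 9000), (64, 3000), (256, 900)]:
        print(f"{io_size_kb:>4} KB @ {iops:>5} IOPS -> {throughput_mbs(iops, io_size_kb):6.1f} MB/s")
```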

Queue Depth

Queue depth is the amount of outstanding I/O waiting to be processed by the SAN. In other words, it is the count of how many pieces of data are stacked up waiting to be written to or read from the SAN. If the queue depth is low (as defined below), there are few (or no) I/Os waiting on the SAN. Latency (or response time) is minimal because each I/O gets processed immediately, but IOPS are reduced because the SAN is waiting on I/O from the application. If queue depth is high, there are outstanding I/Os waiting to be serviced by the SAN. This increases IOPS, but adds latency because each I/O waits to be serviced instead of being serviced immediately. The SAN performs optimally when there are enough I/Os outstanding to keep the SAN busy, but not so many that each I/O has to wait longer than desired to be serviced.

The optimal queue depth for a cluster is one to two times the number of physical disks in the SAN for SAS drives, and one times the number of physical drives for SATA. The difference comes from the way SAS and SATA handle the queuing of I/O². With the Virtualization SAN, the optimal queue depth for any volume would be in the 40 to 48 range (24 physical disks * 2 = 48).

² For more information, see http://www.serialstoragewire.net/Articles/2006_03/sysinsights16.html

                        Disk Count
Disk Type               12     24     48     96
SAS (2x disk count)     24     48     96     192
SATA (1x disk count)    12     24     48     96

Table 1: Optimal queue depth by drive type and quantity
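The rule of thumb behind Table 1 (roughly one times the spindle count for SATA, one to two times for SAS) can be expressed as a small sketch; the function below is illustrative only and simply reproduces that arithmetic.

```python
# Illustrative sketch of the Table 1 rule of thumb:
# optimal queue depth ~= 1x physical disks for SATA, 1-2x for SAS.

def optimal_queue_depth(disk_count: int, disk_type: str) -> range:
    """Return the recommended queue-depth range for a cluster."""
    disk_type = disk_type.upper()
    if disk_type == "SAS":
        return range(disk_count, 2 * disk_count + 1)   # 1x to 2x
    if disk_type == "SATA":
        return range(disk_count, disk_count + 1)       # ~1x
    raise ValueError(f"unknown disk type: {disk_type}")

# Example: a 24-disk SAS cluster targets roughly 24-48 outstanding I/Os.
qd = optimal_queue_depth(24, "SAS")
print(min(qd), "-", max(qd))
```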

Latency

Latency is the amount of time the system waits for an I/O to complete, by either getting the data back from the SAN during reads or getting acknowledgements back for writes. Latency and queue depth are correlated: as the SAN gets busier, it takes longer to service an individual I/O, as shown in the chart below. As mentioned above, increasing queue depth can improve overall SAN performance up to a certain point, until the SAN becomes saturated. This saturation point depends on many variables, such as platform and disk type, the workload on the SAN, and so on. Saturating the SAN leads to high latency, as illustrated in Chart 2.

Latency matters because if an application is waiting on storage for data, then the users are waiting on the application. Applications where response times are critical, such as email and OLTP (Online Transaction Processing) systems, require lower latencies than applications such as backup. As an example, imagine waiting 10 seconds for an email to open versus 10 seconds for a backup job to get data from the SAN. The nature of the application determines the tolerance for latency on the SAN.

Chart 2: Sample Relationship of IOPS and Latency to Queue Depth

SAS vs. SATA

Drives based on SAS technology offer higher rotational speeds, lower latency, and the ability to queue commands more effectively than SATA, which makes SAS optimal for workloads such as virtualization, email servers, OLTP, and other random, database-type applications. Higher density drives, such as SATA, offer more gigabytes per dollar and are especially good for workloads such as archive repositories, file shares, and staging areas for backing up data to tape. Performance-sensitive applications such as email and OLTP are best suited to higher performance, lower latency SAS drives, and their performance may not be satisfactory if they are run on lower performing, higher capacity SATA drives.

Disk type   Rotational Speed   Average disk latency   Max IOPS   Cost/GB   Reliability   Ideal for
SAS         15,000 RPM         3.4 ms                 290        High      High          Exchange, SQL databases, random workloads, virtualization
SATA        7,200 RPM          11 ms                  100        Low       Low           Low-performing applications, file shares, DR sites

Table 2: Approximate performance attributes of SAS and SATA drives


Analyzing SAN Performance

Many variables affect the performance of a SAN and the metrics by which that performance is measured. The preceding section discusses some of these variables, and it is important to recognize that they can change the expected performance of the SAN. Keep the preceding trends in mind while reading the following sections.

Make sure there are no hardware faults

One of the most common causes of performance problems is a hardware issue. Common hardware issues that can be detected from the hardware diagnostics panel include situations in which the cache on the RAID controller has been disabled, or problems with the drives such as complete or pending failures. To run hardware diagnostics, expand the node(s) in question, select Hardware, right-click, and choose "Run Diagnostic Tests". The results appear on the right side of the CMC; if any test returns a result of "Fail", contact HP LeftHand support.

Image 1: Example of a hardware fault that can impact cluster performance

Look for discrepancies between nodes

Discrepancies between performance counters on different nodes can be an indication of a problem with a specific storage node or its connectivity to the network rather than a limitation of the performance of the SAN as a whole. Key performance metrics to monitor and compare are:

• Collect performance data for each of the storage nodes, including IOPS Total, CPU Utilization, and Network Bytes Total for each of the network cards. If one or more of the storage nodes look like they are underperforming relative to the other storage nodes in the cluster, it is an indicator that something is not optimized on that storage node. Some simple things to check are the network setup and connectivity, and the health of the hardware (run hardware diagnostics via the CMC).

• Disk latencies that are higher on one or more nodes compared to the other storage nodes in the cluster can indicate that the storage node(s) is not in an optimal state. Possible causes include a disk rebuild or disabled cache on the controller.

• All nodes can host connections from the server to the volume. If too much traffic is coming through one node, the network connection to that node may be the bottleneck. Look at the Network Bytes Total for that node. If it is equal to or near the maximum limit for its network connections, performance may benefit from reconnecting to the volumes through another node. Maximum throughput for two GigE ports is ~224 MB/s; a 10Gb network card has a maximum throughput of ~1120 MB/s. A check like this can be scripted, as shown in the sketch after this list.
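A minimal sketch of the comparison described above, assuming the per-node counters have already been exported from the CMC Performance Monitor (the node names and counter values are hypothetical): it flags a node whose IOPS lag the cluster average and a node whose Network Bytes Total is approaching its link limit.

```python
# Illustrative sketch: flag storage nodes that deviate from their peers.
# Counter values are hypothetical; export real ones from the CMC Performance Monitor.

GIGE_PAIR_LIMIT_MBS = 224.0   # approx. maximum for two bonded GigE ports (per the text)

nodes = {
    "nsm-1": {"iops": 2100, "net_mbs": 118.0},
    "nsm-2": {"iops": 2050, "net_mbs": 121.0},
    "nsm-3": {"iops":  950, "net_mbs": 210.0},   # lagging IOPS, NIC nearly saturated
}

avg_iops = sum(n["iops"] for n in nodes.values()) / len(nodes)

for name, stats in nodes.items():
    if stats["iops"] < 0.7 * avg_iops:
        print(f"{name}: IOPS well below cluster average - check network setup and run hardware diagnostics")
    if stats["net_mbs"] > 0.9 * GIGE_PAIR_LIMIT_MBS:
        print(f"{name}: network throughput near its limit - consider reconnecting volumes via another node")
```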

Are any volumes or storage nodes restriping or resynching?

There are several conditions that can cause data to be moved around on the cluster, transparently to users. These include volume(s) restriping as a result of changing volume properties, such as the replication level. Storage nodes may also have to resynchronize if a storage node has been offline due to maintenance, power loss, etc. When the storage node comes back online, it resynchronizes with the other nodes in the cluster, and this resynchronization can impact performance. To determine whether volumes are restriping, select "Volumes and Snapshots" for the cluster and look at the Status column.

The restripe rate can be changed by selecting the management group, right-clicking, and selecting Edit Management Group. Adjust the Bandwidth Priority to maximize the rebuild rate if desired, or minimize it if you would prefer the SAN to give priority to application I/O.


Image 2: Example of a volume during resynchronization

Image 3: Example of a volume during a restripe

Is the host initiator saturated?

For the volume or host initiator, check total throughput. For Windows servers, the maximum throughput to the cluster is approximately 112 MB/s for a Gigabit Ethernet initiator, or approximately 1120 MB/s for a 10Gb initiator, after protocol overhead. For other operating systems that support bonded (teamed) initiators for iSCSI traffic, that number increases, depending on the bonding method used. To determine whether the initiator or network is saturated, check the throughput for the initiator in question via the Performance Monitor in the CMC, or by using a tool such as Windows Perfmon to monitor the initiator.
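As a rough sketch of the arithmetic above (the figures are the approximations used in this paper, not vendor-measured limits), the usable iSCSI throughput of an initiator can be estimated from its link speed and compared with what the Performance Monitor or Perfmon reports:

```python
# Illustrative sketch: estimate whether a host initiator's link is saturated.
# ~112 MB/s per 1 Gb/s of link speed is the approximation used in this paper.

def usable_mbs(link_gbps: float, links: int = 1) -> float:
    """Rough usable iSCSI throughput after protocol overhead."""
    return 112.0 * link_gbps * links

def is_saturated(observed_mbs: float, link_gbps: float, links: int = 1, threshold: float = 0.9) -> bool:
    return observed_mbs >= threshold * usable_mbs(link_gbps, links)

# Example: a backup server pushing 105 MB/s over a single GigE initiator is effectively saturated.
print(is_saturated(observed_mbs=105, link_gbps=1))    # True
print(is_saturated(observed_mbs=300, link_gbps=10))   # False
```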

Has the workload changed?

If more applications or workloads have been added to the cluster, the cluster might simply have reached its maximum operating capacity. Table 4 gives estimates of the maximum performance capacities of common cluster configurations for typical applications. If your workload exceeds what is listed in Table 4, you may have outgrown your current cluster's performance capabilities, and the cluster will need to grow by adding more nodes.

Is latency high?

As discussed above, latency is the amount of time the application waits for an I/O to complete. High storage latencies are often the cause of user complaints about slow email, poor application response time, and so on. Table 5 below lists some common applications and their optimal and acceptable latencies. High latencies can often be reduced either by adding disks to the cluster or by moving high-I/O volumes to a dedicated cluster. To check latencies, add the latency counters from the "Performance Monitor" screen in the CMC. More information on this process can be found in the SAN User Manual for SAN/iQ 8.0.

Is queue depth high?

Queue depth is the amount of outstanding I/O waiting to be processed by the SAN. If the queue depth is low, the SAN has performance available to be used by applications. For example, if the cluster queue depth is one, only one disk is receiving an I/O request at a time and the SAN will perform similarly to a single hard drive. Ideally, the queue depth should be equal to the number of physical disks in the cluster for SATA, and about one to two times the number of physical disks for SAS. Queue depth higher than the recommended maximum can lead to higher latencies, which can impact application response time. To check queue depth, add the queue depth counters from the "Performance Monitor" screen in the CMC. More information on this process can be found in the SAN User Manual for SAN/iQ 8.0.


Performance Troubleshooting Flowchart

The flowchart below walks through the checks described in the preceding sections when the SAN is not performing as expected: determine whether performance is suspect across the entire cluster, run hardware diagnostics on each NSM (contact LeftHand Support if any fail), check whether any volumes are restriping or storage nodes are resynchronizing (wait for these to complete), check whether the application workload has changed or more data is being processed, check whether the host/initiator/network is saturated, and examine queue depth and latency. If queue depth and latency are high, the SAN might have reached its maximum performance; increase the performance of the SAN (cluster) by adding more nodes. If queue depth is low, the SAN is under-utilized and application tuning may increase performance.

Chart 3: Performance Troubleshooting Flowchart


Best Practices for Gathering Statistics

In most SANs, performance is not constant over any given period of time, such as a day or week. There may be spikes in the morning, such as when workers arrive and hit email servers harder than during the rest of the day. Heavy reporting or batch processing may take place on the weekend, increasing the load on servers and storage to peaks higher than during the rest of the week. It is important that statistics are captured for the times with peak workloads, whether caused by user activity or by system maintenance (such as backups).

It is also important that the correct statistics be captured for the application being monitored. If the application is a database, IOPS and volume latency will be more important than throughput. If the application is streaming to a server (such as a backup to or from the SAN), throughput is the better statistic to measure.

Additionally, if the concern is around a specific server or volume, statistics for that item should be of primary concern. The following guidelines will help to ensure that the proper statistics are collected³.

Item being monitored – Performance statistics to monitor:

SAN as a whole – Cluster statistics: IOPS Total, Throughput Total, Average I/O Size, Queue Depth Total, I/O Latency Total.

Specific server – Server initiator IQN: IOPS Total, Throughput Total, Average I/O Size, Queue Depth Total, I/O Latency Total.

Specific volume – For the volume(s) in question: IOPS Total, Throughput Total, Average I/O Size, Queue Depth Total, I/O Latency Total.

Specific NSM – Storage node statistics: CPU Utilization, Network Bytes Total: Motherboard: Port1, Network Bytes Total: Motherboard: Port2, IOPS Total, Throughput Total, Queue Depth Total, and I/O Latency Total.

Table 3: Relevant Performance Monitor statistics for a specific object

³ See the SAN User Manual for SAN/iQ 8.0 for more information on how to collect performance data (http://www.lefthandnetworks.com/document.aspx?oid=a0e00000000013yAAA).

Understanding Expected SAN Performance

Workload

Workloads can typically be defined by four categories: I/O size, reads vs. writes, sequential vs. random, and queue depth. A typical application usually consists of a mix of reads and writes, and of sequential and random access. For example, a typical Microsoft Exchange 2007 server has a workload that is 8k in I/O size, 70% read, and 20% sequential. The type of workload affects the results of any performance measurement. For example, a cluster of three NSM 2120 storage nodes with hardware RAID 5 and Network RAID level 2 will see less than half the throughput on 64k sequential writes compared to 64k sequential reads (assuming proper queue depth). This is because when a write is sent to the SAN, it gets written in two places as a result of Network RAID level 2. Writes to a hardware RAID 5 volume are also slower than reads from a RAID 5 volume because of the parity overhead of RAID 5 writes. When analyzing the performance of your LeftHand SAN, take the workload into consideration.
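A workload can be summarized by the four attributes just listed; the small structure below is one illustrative way to record such a profile. The Exchange 2007 values mirror the example in the text, while the queue depth value is hypothetical.

```python
# Illustrative sketch: describing a workload by its four defining attributes.
from dataclasses import dataclass

@dataclass
class Workload:
    io_size_kb: int        # I/O size
    read_pct: int          # reads vs. writes
    sequential_pct: int    # sequential vs. random
    queue_depth: int       # outstanding I/O

# The Exchange 2007 profile cited in the text; the queue depth of 36 is hypothetical.
exchange_2007 = Workload(io_size_kb=8, read_pct=70, sequential_pct=20, queue_depth=36)
print(exchange_2007)
```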

Queue Depth and Latency as an Indicator of Cluster Performance

As mentioned above, to maximize the efficiency of the cluster a balance between queue depth and latency must be found.

As a general rule of thumb, for operations such as databases and email servers where the disks become the bottleneck, the queue depth of the workload on the cluster should be no larger than the number of physical disk drives in the SAN for SATA drives, or one to two times that number for SAS drives. For example, a three-node cluster of 2120 storage nodes with 36 SAS drives would have an optimal queue depth of around 60.

If your queue depth is higher than recommended and your latency is high, that is a key indicator that your cluster is saturated. The solution is to add more nodes to the cluster. If your latency is high but your queue depth is low, it indicates some load on the cluster that is not related to the application workload, such as a volume being restriped across a newly added storage node. Applications that are sensitive to response time, such as OLTP and email, require lower latencies. Operations such as backups, batch processing, or report generation can tolerate higher latencies from the cluster because there are no users actively waiting for a reply from the system.
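The rule of thumb above can be reduced to a simple decision sketch (the thresholds are illustrative placeholders, not LeftHand-published values): high queue depth with high latency suggests a saturated cluster, while high latency with low queue depth points at background work such as a restripe.

```python
# Illustrative sketch of the queue-depth/latency rule of thumb described above.
# The recommended queue depth should come from Table 1; the 100 ms latency threshold is a placeholder.

def diagnose(queue_depth: float, latency_ms: float, recommended_qd: int, high_latency_ms: float = 100.0) -> str:
    if latency_ms >= high_latency_ms and queue_depth > recommended_qd:
        return "Cluster likely saturated - consider adding nodes"
    if latency_ms >= high_latency_ms and queue_depth <= recommended_qd:
        return "Check for background work (restripe/resync) or hardware faults"
    return "Cluster operating within expected limits"

# Example: a 36-disk SAS cluster (recommended queue depth ~36-72) running hot.
print(diagnose(queue_depth=120, latency_ms=180, recommended_qd=72))
```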


Application                        Virtualization SAN    Multi-Site SAN     6 Total Nodes      8 Total Nodes
Exchange (IOPS / users)            3,400 / 4,800         6,800 / 9,600      10,200 / 14,400    13,600 / 19,200
SQL (8k random reads)              5,200 IOPS            10,400 IOPS        15,600 IOPS        20,800 IOPS
Virtual servers (maximum guests)   136                   272                408                544
File share (maximum throughput)    220 MB/s              440 MB/s           660 MB/s           880 MB/s

Table 4: Expected performance for NSM 2120-G2 SAS with Network RAID 2 in Hardware RAID 5

The following table lists preferred sustained response times for specific application types, based on customer experience. Keep in mind that while these numbers are based on known best practices, your individual requirements and tolerance for latency can vary greatly from these numbers. If your cluster is performing to your expectations, no action or concern is required. Also keep in mind that latency on the cluster can spike from time to time. Brief spikes are not cause for concern; sustained periods of high latency for latency-sensitive applications indicate that more nodes should be added to the cluster.

Application                                 Typical optimal response time   Typical maximum acceptable response time
Email (Exchange, Lotus Notes, etc.)         20 ms                           <100 ms
Database/OLTP (SQL Server, Oracle, etc.)    20 ms                           <100 ms
File share                                  Up to 100 ms                    <200 ms
Batch processing/reporting                  Up to 100 ms                    Varies widely
Backup                                      Varies widely                   Varies widely

Table 5: Optimal disk latencies for applications
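A sketch of how the guidance in Table 5 might be applied to sustained latency measurements; the threshold dictionary simply restates Table 5, and the sample latency value is hypothetical.

```python
# Illustrative sketch: compare sustained volume latency against the Table 5 guidance.
ACCEPTABLE_LATENCY_MS = {     # maximum acceptable sustained response times from Table 5
    "email": 100,
    "database/oltp": 100,
    "file share": 200,
}

def latency_ok(app: str, sustained_latency_ms: float) -> bool:
    limit = ACCEPTABLE_LATENCY_MS.get(app.lower())
    return limit is None or sustained_latency_ms < limit

# Example: 140 ms of sustained latency is too high for email, but acceptable for a file share.
print(latency_ok("email", 140))       # False
print(latency_ok("file share", 140))  # True
```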


Investigating background tasks

The cluster runs background operations that can reduce the performance available to hosts and applications. Examples of background operations include volume restripes as storage nodes are added to the cluster, or the resynchronization of storage nodes that have been offline and brought back online. These operations are evidenced by the IOPS of the cluster being higher than the total IOPS of the volumes on the cluster, or by cluster IOPS appearing when no volumes are connected to the cluster. These are normal background tasks and should not be cause for alarm; however, because they consume system resources, they should factor into your SAN performance sizing.
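The comparison described above (cluster IOPS exceeding the sum of per-volume IOPS) is easy to automate once the counters have been exported; the sketch below is illustrative, and the counter values are hypothetical.

```python
# Illustrative sketch: infer background activity from exported Performance Monitor counters.
# Cluster IOPS noticeably above the sum of volume IOPS suggests a restripe, resync, or initialization.

def background_iops(cluster_iops: float, volume_iops: dict[str, float], tolerance: float = 0.05) -> float:
    """Return the IOPS not attributable to any volume (0 if within tolerance)."""
    host_iops = sum(volume_iops.values())
    extra = cluster_iops - host_iops
    return extra if extra > tolerance * cluster_iops else 0.0

# Hypothetical sample: ~1,500 IOPS of background work on top of host traffic.
extra = background_iops(cluster_iops=6500, volume_iops={"vol-mail": 3200, "vol-sql": 1800})
if extra:
    print(f"~{extra:.0f} IOPS of background activity - check volume status and RAID status in the CMC")
```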

Hardware RAID rebuild tasks can also impact performance, such as when a physical hard drive is replaced in the cluster. This operation is not visible via the Performance Monitor, but can be seen in the Centralized Management Console in the "Storage" section for that individual storage node.

Background task            How to identify                                               Action
SAN/node initialization    High storage node IOPS                                        Wait for completion
Volume restripe            Volume Status on Volume Details tab in CMC                    Wait for completion
Volume resynchronization   Volume Status on Volume Details tab in CMC                    Wait for completion
Node restripe              Volume Status on Volume Details tab in CMC for all volumes    Wait for completion
Node resynchronization     Volume Status on Volume Details tab in CMC for all volumes    Wait for completion
RAID rebuild               RAID Status on Disk Setup tab of Storage in CMC               Wait for completion

Table 6: Identification and actions for SAN/iQ background tasks


FAQ

I just built a new cluster and have not created or attached any volumes. Why do I see activity on the storage nodes?

When storage nodes are added to a management group, or a new management group is created, a background initialization process runs on the storage nodes in the management group. This is a one-time occurrence, but it explains the activity on the storage nodes. This activity occurs as quickly as an individual storage node can write to its disks, and will appear on the Performance Monitor as up to ~200 MB/s of local throughput on each storage node, depending on the type of drive and the hardware RAID level. It shows up as local I/O and is not measured over the network. After these tasks complete, there will still be some background activity, because SAN/iQ performs preventive maintenance on the cluster, checking for and repairing bad blocks on the physical disks. This is expected and should not be a cause for concern.

I just deleted a volume/snapshot. Why do I see activity on the storage nodes?

A similar process occurs when a volume or snapshot is deleted: SAN/iQ writes zeros over the deleted volume or snapshot as a security measure and as a validation of the integrity of the disk. After these tasks complete, there will still be some background activity, because SAN/iQ performs preventive maintenance on the cluster, checking for and repairing bad blocks on the physical disks. This is expected and should not be a cause for concern.

I notice that CPU usage of one storage node is higher than the rest. Is something wrong with my storage node?

The storage node hosting the Virtual IP address or running the coordinating manager might have higher CPU usage during certain operations. This is expected and not a cause for concern. If you are not noticing any unexpected performance changes in your applications, there is likely nothing wrong with the hardware. The higher CPU usage is probably related to day-to-day SAN operations, such as Remote Snapshots or managing the iSCSI sessions to the cluster. Unless this continues for a prolonged amount of time, there is nothing to worry about in this case.

CPU/IOPS/Throughput spike occasionally. Why is this?

There are several possible causes for this, including bursty I/O from the application or I/O served directly to or from cache on the storage node. Regardless of the cause, the important thing to understand is that temporary drops or spikes in I/O to and from the cluster are considered a normal part of cluster operation and should not be cause for concern.

I notice that CPU usage of storage nodes occasionally is at 100%. Is something wrong with my cluster?

If you are not noticing any unexpected performance changes in your applications, there is likely nothing wrong with the cluster. The higher CPU usage is probably related to day-to-day cluster operations, such as snapshot deletes. Unless this continues for a prolonged amount of time, there is nothing to worry about in this case.


CPU/IOPS/Throughput drop to zero occasionally. Why is this?

There are several reasons this can happen, including the disk controller flushing I/O to disk, or the application temporarily stopping its I/O requests to the cluster. Regardless of the cause, the important thing to understand is that temporary drops or spikes in I/O to and from the cluster are considered a normal part of cluster operation and should not be cause for concern.

Examples

Example 1: Backups take unusually long to complete

Step 1: Define the problem.

The application administrator complains that backups are taking twice as long to complete as they did a week ago.

Note: The problem might also appear as slower throughput from the cluster to the backup server.

Step 2: Identify other possible performance issues.

None of the other system administrators have performance concerns.

Step 3: Identify any changes in the application workload.

The application administrator says that the workload is relatively unchanged. No new data is being backed up, and the tape devices appear to be functioning properly.

Note: If the application data being backed up had doubled in size, that alone would have caused the backup to take twice as long (twice as much data at the same speed means twice as much time to back it up). This is why it is important to have a clear understanding of the problem.

Step 4: Check to see if the initiator is saturated.

Using the Performance Monitor, the administrator sees that throughput from the SAN to the backup server averages 9 MB/s, with a maximum of 11 MB/s. The administrator investigates the initiator on the backup server further and finds that the switch port the initiator is plugged into has auto-negotiated to 100 Mb/s. Changing the port to 1000 Mb/s speeds up the backup job, which again completes in its usual time frame.

Example 2: Users complain about slow email response time

Step 1: Define the problem.

With the SAS Starter SAN, users complain that opening Outlook can take up to a minute, and opening any single email can take up to 10 seconds.

Step 2: Identify other possible performance issues.

The web administrator mentions that he has also had complaints of long load times for some pages, and the DBA mentions that reports run during the day are starting to take longer to complete. The backup administrator has no performance issues with the backups he runs at midnight.

Step 3: Run hardware diagnostics on every node in the cluster.

Because the performance issues seem to be occurring against the SAN as a whole, run hardware diagnostics on each node to identify any node that might have hardware faults, a possible cause of performance degradation. In this example, all tests on all nodes pass.

Note: If any of the nodes shows a hardware fault (a diagnostic test returns a "Fail" result), collect the performance counters outlined in Table 3 for all of the nodes. Look for discrepancies between nodes that would indicate a performance impact from the hardware fault.

Step 4: Identify any changes in the application workload.

None of the application administrators can identify any major changes to their applications or to the size of their data.

Step 5: Check to see if the initiator is saturated.

None of the initiators for the servers experiencing performance problems are near their maximum throughput. To be safe, the network configuration is checked and all initiators are set to 1000 Mb/s full-duplex. Flow control is also enabled.

Step 6: Check whether queue depth on the cluster is high.

The Performance Monitor is set to collect data for 48 hours with the following cluster counters: IOPS Total, Throughput Total, Average I/O Size, Queue Depth Total, and I/O Latency Total. Queue depth averages 10, but during the day queue depth is much higher, often in the 40-50 range. This correlates with the times of the user complaints.

Step 7: Check whether latency on the SAN is high.

Using the performance counters collected in Step 6, latency averages 12 ms, but again, during the day latency is much higher, with many spikes over 100 ms. At night, latency is very low, except during the time that the backup operations run.

Based on the information gathered through the Performance Monitor, all signs point to the SAN being overloaded during the day, when latency and queue depth are high. At night, latency and queue depth rise only while the backup jobs run. Because the backups are not latency sensitive, this is not a problem. Because email and the database operations during the day are latency sensitive, latency must be decreased. The proper solution is to add more nodes (and thus more performance) to the cluster.
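A sketch of the kind of post-processing used in Steps 6 and 7, assuming the 48 hours of samples have been exported as (timestamp, queue depth, latency) tuples; the sample data and the 8:00-18:00 "business hours" window are illustrative placeholders.

```python
# Illustrative sketch: split exported samples into business hours vs. off hours
# and compare average queue depth and latency, as done in Steps 6 and 7.
from datetime import datetime

samples = [  # (timestamp, queue depth, latency in ms) - hypothetical values
    (datetime(2009, 5, 4, 10, 0), 45, 130.0),
    (datetime(2009, 5, 4, 14, 0), 48, 150.0),
    (datetime(2009, 5, 4, 23, 0),  6,   9.0),
    (datetime(2009, 5, 5,  2, 0), 30,  60.0),   # backup window
]

def averages(rows):
    qd = sum(r[1] for r in rows) / len(rows)
    lat = sum(r[2] for r in rows) / len(rows)
    return qd, lat

business = [r for r in samples if 8 <= r[0].hour < 18]
off_hours = [r for r in samples if not (8 <= r[0].hour < 18)]

print("business hours: qd=%.0f latency=%.0f ms" % averages(business))
print("off hours:      qd=%.0f latency=%.0f ms" % averages(off_hours))
```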

Glossary

Bond – The combining of two or more physical network interfaces into a single logical interface for performance and fault tolerance. Also referred to as a "team".

CMC – The Centralized Management Console, the application that serves as a single pane of glass for monitoring and managing all HP LeftHand storage nodes.

Disk – A single physical hard drive.

Cluster – A grouping of storage nodes that creates the storage pool from which you create volumes. This can also be referred to as the SAN.

Initiator – An initiator functions as an iSCSI client. An initiator typically serves the same purpose to a computer as a SCSI bus adapter would, except that instead of physically cabling SCSI devices (such as hard drives and tape changers), an iSCSI initiator sends SCSI commands over an IP network.

I/O – A read from or write to a drive. I/O is usually measured in IOPS – I/Os per second.

Latency – A measurement of the read/write performance of a disk drive. Latency is the time it takes for a particular sector to pass under the read/write head of a disk after the head is positioned over the appropriate disk track.

NIC – A computer hardware component designed to allow computers to communicate over a computer network. Also called a network card, network adapter, network interface controller, network interface card, or LAN adapter.

Perfmon – The Windows Performance Monitor console, which can be used to monitor and collect performance statistics for a number of Microsoft objects.

Queue Depth – The amount of outstanding I/O waiting for processing by the SAN.

RAID – RAID (originally redundant array of inexpensive disks, now redundant array of independent disks) refers to a data storage scheme using multiple hard drives to share or replicate data among the drives. The benefit of RAID is higher performance and/or availability than stand-alone drives. The RAID types supported by HP LeftHand are:

RAID 0 – Data striped across the disk set (select platforms only)

RAID 10 – Striped sets of mirrored (RAID 1) disk pairs

RAID 5 – Data blocks are distributed across all disks in a RAID set; redundant information is stored as parity distributed across the disks

RAID 50 – Striped sets of RAID 5 disks

Restripe – Striped data is stored across all disks in the cluster. You might change the configuration of a volume, for example by changing the replication level, adding a storage node, or removing a storage node. Because of the change, the pages in the volume must be reorganized across the new configuration. This reorganization is known as a restripe.

Resynchronization – When a storage node goes down, writes continue to a second storage node. When the original storage node comes back up, it needs to recoup the exact data captured by the second storage node.

Rotational Speed – The speed at which the platters in a physical hard drive spin. Higher rotational speeds lead to higher performance because of reduced seek time and the ability to pull data off the drive faster.

SAS – Short for Serial Attached SCSI, an evolution of parallel SCSI into a point-to-point serial peripheral interface in which controllers are linked directly to disk drives. SAS is a performance improvement over traditional SCSI; its full-duplex signal transmission supports 3.0 Gb/s. In addition, SAS drives can be hot-plugged.

SATA – Serial ATA (Serial Advanced Technology Attachment) is a standard for connecting hard drives to computer systems. As its name implies, SATA is based on serial signaling technology, unlike IDE (Integrated Drive Electronics) hard drives, which use parallel signaling.

Seek Time – The time a program or device takes to locate a particular piece of data.

Snapshot – A point-in-time image of a volume for use with backup and other applications. The data on a snapshot does not change, and a live volume can revert back to a snapshot to bring the data back to a previous state.

Spindle – Another name for a disk, a single physical hard drive.

Team – Another name for a bond, the combining of two or more physical network interfaces into a single logical interface for performance and fault tolerance.

Volume – A logical entity that is made up of storage on one or more storage nodes. It can be used as raw data storage, or it can be formatted with a file system and used by a host or file server.


Contacting Support

Contact LeftHand Networks Support if you have any questions on the above information:

North America:

Basic Contract Customers

1.866.LEFT-NET (1.866.533.8638)

303.217.9010

http://support.lefthandnetworks.com

Premium Contract Customers

1.888.GO-SANIQ (1.888.467.2647)

303.625.2647

http://support.lefthandnetworks.com

EMEA:

All Customers

00.800.5338.4263 (Int'l Toll-Free number)

+1.303.625.2647 (US number)

http://support.lefthandnetworks.com