
Troubleshooting Storage Area Network


Troubleshooting Storage Area Network (SAN) issues

SUMMARY

Troubleshooting Storage Area Network (SAN) issues requires a clear understanding of all the variables that can be involved in the issue. A key first step is determining whether the issue is actually caused by the SAN. In many cases the issue, while involving a SAN, is approached no differently than troubleshooting a standard system with direct-attached storage. Sometimes a problem that occurred once on the SAN, and was resolved, leaves something wrong in the operating system. For example, an error in zoning may have occurred in the past but has since been corrected. This may cause a SAN volume that was working fine to "disappear" from the operating system and then later return, but show up as "unallocated". In that case, whatever the problem was on the SAN damaged the Master Boot Record, the Boot Sector, or the Dynamic Disk Database on the affected volume (Windows 2000 and later). The solution to this specific type of issue no longer involves SAN troubleshooting but standard troubleshooting and repair methods.

If the issue turns out to be SAN related, or it is not clear where the issue resides, a methodical approach that includes all interested parties, including the hardware vendor(s), will result in the shortest path to resolution.

Terminology used in this article:

Arbitrated Loop (AL):  An Arbitrated Loop is an interconnect topology where each node's transmitter is logically connected to the next node's receiver to form a logical loop.  Devices that attach to an Arbitrated Loop receive a loop address known as an Arbitrated Loop Physical Address (AL_PA). When devices come online in a loop, they receive an AL_PA through a process known as Loop Initialization Primitive (LIP).  Whenever there is a LIP, all I/O on the loop is stopped during the LIP process. The arbitrated loop topology is referred to as a "blocking" topology, because arbitrated loop is a shared medium.  Devices on an arbitrated loop must arbitrate for access to the loop to exchange I/O.  Only one exchange of I/O is permitted at any one time on an arbitrated loop, resulting in reduced overall throughput as more devices are added to a loop.

Application Specific Integrated Circuit (ASIC):  An integrated circuit designed for a specific purpose such as logic in a Fibre Channel switch.

Device Class:  To make device installation easier, devices that are set up and configured in the same way are grouped into a device setup class. For example, SCSI media changer devices are grouped into the Medium Changer device setup class. The device setup class defines the class installer and class co-installers that are involved in installing the device. (Source: Device Installation: Windows DDK: Device Setup Classes)

Fabric:  In the simplest terms, fabric can refer to a Fibre Channel switch.  Fabric is a term used to describe the infrastructure that enables SAN functionality.

Fibre Channel Switched Fabric (FC-SW):  FC-SW is a term used to describe a topology where a node is connected to a port in a Fibre Channel switch for access to the SAN.  When two nodes form a connection in an FC-SW topology, the connection is full line speed, full-duplex, and non-shared.  As more devices are added to an FC-SW topology, the aggregate bandwidth potential increases.  The FC-SW topology is termed "non-blocking".

Full Port driver:  There are two kinds of storage components. A storage driver may be either a Full port driver or a SCSIPort/Miniport combination (in Windows Server 2003, Microsoft introduces a third concept of a Storport/Miniport combination). Different storage vendors provide different drivers. However, the type of driver affects the level of root cause analysis that Microsoft can provide for storage-related issues. A SCSIPort/miniport or a StorPort/miniport combination includes two components, the port driver that Microsoft provides and the miniport driver that the vendor provides. Because the port driver controls much of the storage operation, Microsoft can provide complete troubleshooting and debugging for this. With a full port model, the vendor provides the complete stack. This means that Microsoft has no means to troubleshoot I/Os after they are passed to the driver.  (Source: Microsoft Support for Server Clusters with 3rd Party System Components)

Gigabit Interface Converter (GBIC):  A removable transceiver that converts between the electrical signals used inside the device and the signals carried on the cable (for example, light on fiber optic cable).  A GBIC can be removed and replaced while the SAN is running and causes no disruption, assuming that the GBIC is functioning within design specifications.  Different types of GBICs allow different types of cable to be used with a single switch (for example, copper and fiber optic).

HCL:  The Hardware Compatibility List is a list of devices and applications that have been tested on various Windows operating systems by the Windows Hardware Quality Labs for compatibility with the respective operating systems.  As of June 2002, the original HCL has been discontinued and replaced with the Windows Catalog.

Host Bus Adapter (HBA):  An HBA is an input/output controller that provides an interface between a computer system's input/output bus and its storage devices or storage network.  The term HBA has been used for both SCSI and Fibre Channel controllers.  In a Windows system an HBA will appear to the operating system as a SCSI controller, and occasionally as a Network Adapter.

Initiator:  A SCSI term meaning the device initiating an input/output exchange with a target.  Initiators generally refer to host systems such as servers in a SAN context.

Logical Unit Number (LUN):  A LUN is an address that is a subdivision of a SCSI target identifier (ID).  SCSI-2 specifications provide 8 logical units per SCSI target ID numbered 0-7.


LUN Masking:  A method of access control where a hardware device presenting itself as a SCSI LUN does not reveal its address or WWN to another device, except for those devices explicitly permitted access.  This is a method of access control that works fine except for the fact that the access control list is held by the device configured to use LUN masking. If the device ever goes down so does the access control mechanism.  LUN masking per device is not preferred because it introduces single points of failure.

Miniport driver:  Relatively small, simple drivers or files that contain additional instructions that a specific hardware device requires to interface with the universal driver for a class of devices. (Source: Windows 2000 Professional Resource Kit: glossary).

Multipath:  Multipathing is a high availability function that provides multiple paths from the host to the external storage device. Although multipath I/O (MPIO) is not a feature of the operating system, the MPIO Driver Development Kit (DDK) enables storage vendors to create interoperable multipathing solutions. Up to 32 paths are supported. Load balancing is an additional benefit that improves performance.  (source: http://www.microsoft.com/windowsserver2003/evaluation/overview/technologies/storage.mspx)

Small Computer System Interface (SCSI):  A standard high-speed parallel interface defined by the American National Standards Institute (ANSI).  A SCSI interface is used for connecting microcomputers to peripheral devices, such as hard disks and printers, and to other computers and local area networks (LANs). (source: Windows Server 2003 Help and Support Center)

Storage Area Network (SAN):  A SAN is a network that is dedicated to the movement of data between computer systems such as file, Web, or database servers and storage devices such as disk systems and tape libraries. A SAN is optimized for high-speed, highly reliable transportation of data.  Similar to a Local Area Network (LAN) there are standards and services in place that provide for address assignment, name services, security services, error checking, and more.  A SAN is highly scalable and able to address roughly 16,000,000 nodes.

Target:  A SCSI term meaning the device receiving an input/output exchange with an initiating device.  Targets generally refer to storage devices in a SAN context.

Windows Cluster:  A Server cluster (Windows Cluster) is a collection of independent servers that together provide a single, highly available operating system for hosting applications.(Source: Server Cluster Frequently Asked Questions)

Zoning:  Zoning is a method of allowing or preventing access between devices in a SAN.  Zones are a collection of fabric or loop nodes, and when zoning is hardware enforced, nodes in a zone cannot access nodes that are outside the zone.  There can be more than one zone in a particular SAN.  Zone enforcement can be performed either by hardware or by software.  If the zone is hardware enforced, the zoning term is "hard-zoned".  This means that access to nodes is enforced at the ASIC in the switch.  If the zone is software enforced, the zoning term is "soft-zoned".  This means that access to nodes is controlled by the name server service running in the fabric operating system.  Soft zoning is a masking mechanism: it does not prevent a device from accessing a node outside its zone if the device explicitly knows the target node's address.

Troubleshooting SAN issues involves several phases:

Information Gathering 

Information Analysis

Information Analysis Conclusions

Change recommendation 

Change implementation 

Monitoring 

Follow-up

Information Gathering

SAN Information Template

Because of the possible complexity of SAN issues, information must be collected about the hosts, SAN interconnects, storage devices, and environment. The following can be used as a template to collect and record the information that is required to troubleshoot SAN issues:

============ Start SAN information template ============

====================================
SAN ISSUE OVERVIEW
====================================

What type of topology: Switched Fabric (FC-SW) or Arbitrated Loop (FC-AL)?

___________________________________________________

Is this a new or pre-existing SAN?

___________________________________________________

If the SAN is pre-existing, were there any devices added to the SAN recently?

___________________________________________________

Any changes made to the SAN at or near the time the issue was first noticed?

___________________________________________________

Does the problem affect all hosts or targets, or only some?

___________________________________________________

======================================
SAN COMPONENT INFORMATION
======================================

SAN Hardware Vendor or Vendors:

___________________________________________________

HBAs: – Make, Model, and Firmware version(s):

___________________________________________________

HBA driver version and is the HBA driver a fullport or miniport driver:

___________________________________________________

Switch(es) or Hub(s) – Make, Model, Firmware revision:

___________________________________________________

Storage Cabinet – Make, Model, and controller firmware:
(Note: EMC may refer to firmware as microcode, HP refers to firmware as array controller software)

___________________________________________________

Is there any multiple-path (multipath) software involved?  If so, name and version (ex. SecurePath, PowerPath, etc):

___________________________________________________

If using multiple-path software how is it configured? (fault-tolerant or load-balanced):

___________________________________________________

Are all the SAN components on the Microsoft *HCL?

___________________________________________________


======================================
HOW THE SAN IS CONFIGURED
======================================

Is the SAN vendor engaged and if so, have they evaluated the SAN yet?

___________________________________________________

If zoning is in use, what is the method of device presentation or isolation?

Hard Zones (ASIC):

Soft Zones (Name Server):

Controller level LUN Masking:

(HBA) Multipath Software:

Other:

How many other devices are on this SAN?

___________________________________________________

Are there non-Windows hosts on this SAN?  If so, what types of hosts and what operating system are they running?

___________________________________________________

Is there a Windows Server Cluster attached to this SAN?

___________________________________________________

If there are one or more Server Clusters on this SAN, is it or are they listed on the “Cluster/Multi-Cluster Device” listing on the Microsoft *HCL?

___________________________________________________

If the components are listed on the *HCL, record the details of the components here

___________________________________________________

Is there a diagram available that outlines the SAN environment?

___________________________________________________

============================================
SAN TEMPLATE NOTES
* HCL = Microsoft Hardware Compatibility List

============= End SAN information template =============


Operating System Information

The next set of information that is required is about the Windows operating system.  A data collection utility named MPS_REPORTS will easily collect a wide range of information from the operating system in the form of text files, log files, event viewer logs, etc. To download MPS_REPORTS, visit the following Microsoft Web site:

Microsoft Product Support's Customer Configuration Capture Tools

(Note: The version named "Setup/Perf" gathers the most relevant information for SAN issues)

These are the most relevant reports gathered automatically by MPS_REPORTS:

Event logs in both .EVT and .TXT format

Setupapi.log

PnPEnum.exe report (DDK utility for extracting PnP information from Windows 2000 and later)

MountedDevices registry key (used along with other reports to determine if volumes are online or offline)

Disk Manager Diagnostics report (DmDiag)

Master Boot Record and Boot Sector report (FTDMPNT)

List of installed hotfixes

WinMSD report (MSINFO32) (which lists services, system information, and so on)

PSTAT report that shows currently loaded modules

Drivers reports which show detailed driver information

The following are additional possible sources of troubleshooting information:

Host Bus Adapter (HBA) logs if available

Multipath logging if available

System management logs if available

Switch (fabric) error logs

Storage error logs

Performance Monitor

Information Analysis

When the information has been gathered, the first logical places to start looking from an operating system perspective are the system event logs.  The following are samples and analysis of operating system information gathered during the troubleshooting process.

Event logs in both .EVT and .TXT format


Event logs can be gathered from a system either manually, by exporting them from the Event Viewer utility, or automatically with the MPS_REPORTS utility.  If MPS_REPORTS is used to gather the event logs, the logs are saved both in the native .EVT format and as raw text dumped to a text file.  Event logs saved in .EVT format retain the data section, which is very important, but may not contain the text of the message as it appears on the destination computer.  The text versions of the event logs are therefore useful because they contain the text of each event message as it appears on the destination computer.

The following are system events that are symptoms of I/O stalling or blocking between a host and storage device or devices:

Event ID: 3
Source: LDM
Description: [computername] A Dynamic Volume (\Device\HarddiskDmVolumes\(diskgroup)\Volume(n)) has failed.

Event ID: 9
Source: [scsi miniport driver]
Description: The device, \Device\ScsiPort1, did not respond within the timeout period.

Event ID: 15
Source: [scsi miniport driver]
Description: The device, \Device\ScsiPort1, is not ready for access yet.

Event ID: 26
Source: Application Popup
Description: Application popup: Windows - Delayed Write Failed : Windows was unable to save all the data for the file \Device\HarddiskDmVolumes\PhysicalDmVolumes\BlockVolume{n}\... The data has been lost.

Event ID: 50
Event Source: Disk
Description: {Lost Delayed-Write Data} The system was attempting to transfer file data from buffers to \Device\Harddisk\Volume{n}. The write operation failed, and only some of the data may have been written to the file.

Event ID: 51
Event Source: Disk
Description: An error was detected on device \Device\Harddisk{n}\DR{n} during a paging operation.

Event ID: 29
Event Source: dmio
Description: dmio: Harddisk2 read error at block {n}: status 0xc000009d.

Event ID: 41
Event Source: FTDISK
Description: The file system structure on the disk is corrupt and unusable. Please run the chkdsk utility on the device \Device\Harddisk{n}\Ft{n} with label "{label}".


With most of these timeout errors, the data section of the System Event Log will have much more information about the error.  See the following Microsoft Knowledge Base articles for more information about decoding these types of operating system event log entries:

182335 INFO: Format of Event Log Data Created by ScsiPortLogError

244780 Information About Event ID 51

816004 Description of the Event ID 50 Error Message

The following are system events that are symptoms of general hardware problems either with the host or the storage device:

Event ID: 2
Event Source: dmboot
Description: [computername] dmboot: Failed to start volume Volume7 (no mountpoint)

Event ID: 7
Event Source: Disk
Description: The device, \Device\Harddisk{n}\DR{n}, has a bad block

Event ID: 11
Event Source: [scsi miniport driver]
Description: The driver detected a controller error on Device\ScsiPort1

Event ID: 29
Event Source: dmio
Event Type: Information
Description: dmio: Harddisk9 read error at block {########}: status 0xC000009A

Event ID: 30
Event Source: dmio
Description: dmio: Harddisk1 write error at block {########}: status 0xc000009d

Event ID: 37
Event Source: dmio
Description: dmio: Disk Harddisk1 block {########} (mountpoint D:): Uncorrectable write error

Event ID: 55
Source: NTFS
Description: The file system structure on disk is corrupt and unusable. Please run the chkdsk utility on the volume "Drive_letter:"

Note:
status code 0xC000009A = STATUS_INSUFFICIENT_RESOURCES
status code 0xC000009C = STATUS_DEVICE_DATA_ERROR
status code 0xC000009D = STATUS_DEVICE_NOT_CONNECTED
(To obtain a list of Windows NT status codes, see NTSTATUS.H in the Windows Software Developers Kit (SDK).)
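As an illustration of the note above, the following is a minimal Python sketch that pulls the status code out of a dmio event description and maps it to its symbolic name. The table holds only the three codes listed in this article; the names `NTSTATUS_NAMES` and `decode_status` are hypothetical, and the sketch assumes the description uses the "status 0x........" wording shown in the samples.

```python
import re

# Only the three codes listed in this article; extend from NTSTATUS.H as needed.
NTSTATUS_NAMES = {
    0xC000009A: "STATUS_INSUFFICIENT_RESOURCES",
    0xC000009C: "STATUS_DEVICE_DATA_ERROR",
    0xC000009D: "STATUS_DEVICE_NOT_CONNECTED",
}

# Assumes the "status 0x........" wording used in the dmio samples (any letter case).
STATUS_RE = re.compile(r"status\s+(0x[0-9a-fA-F]{8})")


def decode_status(description):
    """Return (code, name) for the first status code in the text, or None."""
    match = STATUS_RE.search(description)
    if match is None:
        return None
    code = int(match.group(1), 16)
    return code, NTSTATUS_NAMES.get(code, "UNKNOWN")


if __name__ == "__main__":
    sample = "dmio: Harddisk2 read error at block 12: status 0xc000009d."
    print(decode_status(sample))
```

Codes that are not in the table are reported as "UNKNOWN" rather than guessed, so the table should be extended from NTSTATUS.H as new codes are encountered.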


The following event log entries indicate zoning or path configuration issues:

Event ID: 1
Source: SDDMAN
Description: [computername] Device \Device\Harddisk1\DR0 path [n] offline.

Event ID: 2
Source: SDDMAN
Description: [computername] Device \Device\Harddisk1\DR0 path [n] online.

Event ID: 31
Source: dmio
Description: [computername] dmio: Harddisk9 write error at block 6 due to disk removal

The following event log entries indicate a change in the underlying hardware, such as a volume expansion at the hardware level:

Event Type: Information
Event Source: dmio
Event Category: None
Event ID: 31
Description: dmio: Harddisk0 write error at block {########} due to disk removal

Event Type: Warning
Event Source: LDM
Event Category: None
Event ID: 1000
Description: Cannot grow LUN Harddisk{n}: Config copy write failed

Event Type: Warning
Event Source: LDM
Event Category: None
Event ID: 1000
Description: Disk group CompernameDg{n}: Errors in some configuration copies: Disk Harddisk0, copy 1: Block 0: Disk read failure

Event Type: Information
Event Source: dmio
Event Category: None
Event ID: 34
Description: dmio: Harddisk0 is re-online by PnP
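The event groupings above can be captured in a small lookup table for a first triage pass over an exported event log. The following Python sketch is one way to do that; the category labels and the `classify` function are hypothetical names, and the table contains only the source and event ID pairs listed in this article. Because the SCSI miniport source name varies by vendor, the caller supplies the miniport source names; none are assumed here.

```python
# (source, event ID) -> categories, transcribed from the event lists in this article.
EVENT_CATEGORIES = {
    ("ldm", 3): ["io_stall"],
    ("application popup", 26): ["io_stall"],
    ("disk", 50): ["io_stall"],
    ("disk", 51): ["io_stall"],
    ("ftdisk", 41): ["io_stall"],
    ("dmio", 29): ["io_stall", "hardware"],
    ("dmboot", 2): ["hardware"],
    ("disk", 7): ["hardware"],
    ("dmio", 30): ["hardware"],
    ("dmio", 37): ["hardware"],
    ("ntfs", 55): ["hardware"],
    ("sddman", 1): ["path_or_zoning"],
    ("sddman", 2): ["path_or_zoning"],
    ("dmio", 31): ["path_or_zoning", "hardware_change"],
    ("ldm", 1000): ["hardware_change"],
    ("dmio", 34): ["hardware_change"],
}

# Events logged under the vendor's SCSI miniport driver name.
MINIPORT_EVENTS = {9: "io_stall", 15: "io_stall", 11: "hardware"}


def classify(source, event_id, miniport_sources=()):
    """Return the list of categories for an event; empty if the article does not list it."""
    key = source.strip().lower()
    categories = EVENT_CATEGORIES.get((key, event_id))
    if categories is not None:
        return list(categories)
    if key in {name.lower() for name in miniport_sources}:
        category = MINIPORT_EVENTS.get(event_id)
        if category is not None:
            return [category]
    return []


if __name__ == "__main__":
    print(classify("dmio", 31))
    print(classify("Disk", 51))
```

Some events, such as dmio event 29 and dmio event 31, appear in more than one list in this article, so the function returns every matching category and leaves the final judgment to the person analyzing the log.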

SETUPAPI.LOG (Windows 2000 and later only)


On Windows 2000 systems additional information may be found in the SETUPAPI.LOG file in the %SystemRoot% folder (generally \WINNT).  The following are some examples of logging generated when a disk device is presented to a Windows 2000 host and the host has not seen this device before, or if it has, something about the disk device has changed, and the change is large enough that the Windows 2000 host cannot map the device to a device that has been presented to the system in the past.

[2002/06/20 22:53:27 284.42 Driver Install]
Searching for hardware ID(s): scsi\diskibm_____2105f20_________2.63,scsi\diskibm_____2105f20_________,scsi\diskibm_____,scsi\ibm_____2105f20_________2,ibm_____2105f20_________2,gendisk
Searching for compatible ID(s): scsi\disk,scsi\raw
Enumerating files C:\WINNT\inf\*.inf
Found GenDisk in C:\WINNT\inf\disk.inf; Device: Disk drive; Driver: Disk drive; Provider: Microsoft; Mfg: (Standard disk drives); Section: disk_install
Decorated section name: disk_install.NT
Device install function: DIF_SELECTBESTCOMPATDRV.
Selected driver installs from section disk_install in c:\winnt\inf\disk.inf.
Changed class GUID of device to {4D36E967-E325-11CE-BFC1-08002BE10318}.
Set selected driver.
Selected best compatible driver.
Device install function: DIF_INSTALLDEVICEFILES.
Doing copy-only install of SCSI\DISK&VEN_IBM&PROD_2105F20&REV_2.63\5&1DDEDBBB&0&063.
Device install function: DIF_REGISTER_COINSTALLERS.
Co-Installers Registered.
Device install function: DIF_INSTALLINTERFACES.
Installing section disk_install.NT.Interfaces from c:\winnt\inf\disk.inf.
Interfaces installed.
Device install function: DIF_INSTALLDEVICE.
Doing full install of SCSI\DISK&VEN_IBM&PROD_2105F20&REV_2.63\5&1DDEDBBB&0&063.
Device required reboot: Device has problem: 12.
Device install finished successfully (SCSI\DISK&VEN_IBM&PROD_2105F20&REV_2.63\5&1DDEDBBB&0&063).

In the previous example, an IBM storage device exposed a storage target to a Windows 2000 host and the host had not been exposed to this device previously.  Device discovery events at or near the time of the SAN issue being investigated must be noted.  Another good use of SETUPAPI.LOG in Windows 2000 is to note the history of driver changes to the operating system.  The most interesting events to look for are events that occurred near the time of the SAN issue being investigated.
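To find the entries that occurred near the time of the issue, the log can be filtered by the timestamp in each section header. The following Python sketch assumes the Windows 2000 layout shown in the example above: a bracketed header of the form [date time id title] followed by one message per line. The function names and the one-hour default window are illustrative choices, not part of any Microsoft tool.

```python
import re
from datetime import datetime, timedelta

# Header layout assumed from the sample: [2002/06/20 22:53:27 284.42 Driver Install]
HEADER_RE = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \S+ ([^\]]+)\]")


def parse_sections(lines):
    """Split log lines into (timestamp, title, body_lines) tuples."""
    sections = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        match = HEADER_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
            sections.append((stamp, match.group(2), []))
        elif sections and line.strip():
            # Non-empty line after a header: part of the current section.
            sections[-1][2].append(line)
    return sections


def sections_near(lines, incident, window=timedelta(hours=1)):
    """Return the sections whose timestamp is within `window` of the incident time."""
    return [s for s in parse_sections(lines) if abs(s[0] - incident) <= window]


if __name__ == "__main__":
    log = [
        "[2002/06/20 22:53:27 284.42 Driver Install]",
        "Searching for compatible ID(s): scsi\\disk,scsi\\raw",
        "Device required reboot: Device has problem: 12.",
    ]
    for stamp, title, body in sections_near(log, datetime(2002, 6, 20, 23, 0, 0)):
        print(stamp, title, len(body), "lines")
```

In practice the lines would come from reading the SETUPAPI.LOG file collected from the host, and the incident time from the first event log entry that shows the symptom.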

For more information about the Setupapi.log file, see the following white paper available on Microsoft.com: Troubleshooting by Using the Setupapi.log File

PnPEnum.exe Report:


PnPEnum.exe is a utility from the Windows 2000 Device Driver Kit and available in the MPS_REPORTS reporting utility.  PnPEnum.exe retrieves and displays the PnP device properties of devices.  Of particular interest would be storage device entries in the PnPEnum report.  When a device is presented to the OS and is registered through PnP, an entry for this device is recorded in the "ENUM" registry key.  The key item to look for here is disk or volume devices that do not have a corresponding Physical Device Object.  This means that this device has at some time been presented to this operating system but in the current session is not available.

From Q277222: "Windows 2000 stores information about LUNs and volumes that have been installed and configured in a computer in the SYSTEM hive of the registry. When a device (including a drive) is removed from a system, Windows 2000 retains the registry entries in case the device returns to the system; this is part of Plug and Play. This issue may occur after an array or set of drives is reconfigured, if they are detected as new devices and therefore create duplicate entries".

Windows 2000 stores DeviceClass entries in the following registry location:

HKLM\SYSTEM\ControlSet00n\Control\DeviceClasses\{53f5630d-b6bf-11d0-94f2-00a0c91efb8b}

This registry key is the VolumeClassGuid key.

PnPEnum.exe reports not only on currently attached devices, but also on devices known to the system that may not be online at the current time.  This can be helpful when diagnosing issues where the same device is being presented to the operating system multiple times.  Here is an example of output from the PnPEnum report for a Hitachi disk device that is not currently present as far as the operating system is concerned:

Device Description: Disk drive
Hardware Id:
  SCSI\DiskHITACHI_OPEN-L*9________0116
  SCSI\DiskHITACHI_OPEN-L*9________
  SCSI\DiskHITACHI_
  SCSI\HITACHI_OPEN-L*9________0
  HITACHI_OPEN-L*9________0
  GenDisk
Compatible Ids:
  SCSI\Disk
  SCSI\RAW
Service: disk
Class Guid: {4D36E967-E325-11CE-BFC1-08002BE10318}
Driver: {4D36E967-E325-11CE-BFC1-08002BE10318}\0008
Manufacturer: (Standard disk drives)
Friendly Name: HITACHI OPEN-L*9 SCSI Disk Device
Location Info: Bus Number 4, Target ID 0, LUN 0
Device Instance Id: SCSI\DISK&VEN_HITACHI&PROD_OPEN-L*9&REV_0116\4&2620F1AC&0&400

The following is an entry for a very similar disk device with the difference being that this device has a corresponding Physical Device Object meaning that this device is currently present and accessible by the operating system:

Device Description: Disk drive
Hardware Id:
  SCSI\DiskHITACHI_OPEN-L*9________0116
  SCSI\DiskHITACHI_OPEN-L*9________
  SCSI\DiskHITACHI_
  SCSI\HITACHI_OPEN-L*9________0
  HITACHI_OPEN-L*9________0
  GenDisk
Compatible Ids:
  SCSI\Disk
  SCSI\RAW
Service: disk
Class Guid: {4D36E967-E325-11CE-BFC1-08002BE10318}
Driver: {4D36E967-E325-11CE-BFC1-08002BE10318}\0025
Manufacturer: (Standard disk drives)
Friendly Name: HITACHI OPEN-L*9 SCSI Disk Device
Location Info: Bus Number 4, Target ID 0, LUN 0
Physical Device Object Name: \Device\Scsi\AFC9XXX1Port3Path4Target0Lun0
Device Instance Id: SCSI\DISK&VEN_HITACHI&PROD_OPEN-L*9&REV_0116\4&1F2B3425&2&400

What is helpful about this report is that in cases where seemingly multiple devices are presented to the OS, this report can help determine what is different about the devices.  Here is an example of a device being presented to Windows through multiple paths as seen through the PnPEnum.exe report:

HITACHI OPEN-L*9 SCSI Disk Device
Bus Number 0, Target ID 0, LUN 0
\Device\Scsi\AFC9XXX1Port3Path0Target0Lun0
SCSI\DISK&VEN_HITACHI&PROD_OPEN-L*9&REV_0116\4&1F2B3425&2&000

HITACHI OPEN-L*9 SCSI Disk Device
Bus Number 4, Target ID 0, LUN 0
\Device\Scsi\AFC9XXX1Port3Path4Target0Lun0
SCSI\DISK&VEN_HITACHI&PROD_OPEN-L*9&REV_0116\4&1F2B3425&2&400

HITACHI OPEN-L*9 SCSI Disk Device
Bus Number 6, Target ID 0, LUN 0
\Device\Scsi\AFC9XXX1Port3Path6Target0Lun0
SCSI\DISK&VEN_HITACHI&PROD_OPEN-L*9&REV_0116\4&1F2B3425&2&600

Notice there are only very small differences in these entries:

Bus Number 0, Target ID 0, LUN 0
Bus Number 4, Target ID 0, LUN 0
Bus Number 6, Target ID 0, LUN 0

\Device\Scsi\AFC9XXX1Port3Path0Target0Lun0
\Device\Scsi\AFC9XXX1Port3Path4Target0Lun0
\Device\Scsi\AFC9XXX1Port3Path6Target0Lun0

SCSI\DISK&VEN_HITACHI&PROD_OPEN-L*9&REV_0116\4&1F2B3425&2&000
SCSI\DISK&VEN_HITACHI&PROD_OPEN-L*9&REV_0116\4&1F2B3425&2&400
SCSI\DISK&VEN_HITACHI&PROD_OPEN-L*9&REV_0116\4&1F2B3425&2&600

Based on this report, the same device is being presented to the operating system through multiple paths. 
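When a PnPEnum report is long, the comparison above can be done mechanically. The following Python sketch groups Device Instance Ids that differ only in their final "&"-separated field, which is the pattern seen in the three entries above. This grouping rule is a heuristic inferred from this one example, not a documented property of instance IDs, so any group it reports is only a candidate for multiple paths and should be cross-checked against the other reports. The function name is hypothetical.

```python
from collections import defaultdict


def group_candidate_paths(instance_ids):
    """Group instance IDs that share everything except the last '&' field.

    Heuristic only: returns {shared_prefix: [last_fields]} for prefixes
    that occur more than once.
    """
    groups = defaultdict(list)
    for instance_id in instance_ids:
        prefix, _, last_field = instance_id.rpartition("&")
        groups[prefix].append(last_field)
    return {prefix: fields for prefix, fields in groups.items() if len(fields) > 1}


if __name__ == "__main__":
    ids = [
        r"SCSI\DISK&VEN_HITACHI&PROD_OPEN-L*9&REV_0116\4&1F2B3425&2&000",
        r"SCSI\DISK&VEN_HITACHI&PROD_OPEN-L*9&REV_0116\4&1F2B3425&2&400",
        r"SCSI\DISK&VEN_HITACHI&PROD_OPEN-L*9&REV_0116\4&1F2B3425&2&600",
    ]
    for prefix, fields in group_candidate_paths(ids).items():
        print(prefix, "->", fields)
```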

There are some other reports that can be used to cross-check this information depending on the type of disk device you are checking; for example DmDiag, the Mount Manager database, and others.

MountedDevices registry key:

The MountedDevices registry key is the database for the Windows Mount Manager.  The following is the location and description of the MountedDevices registry key:

HKLM\SYSTEM\MountedDevices

For more information about DeviceClasses and MountedDevices, see the following Knowledge Base article:

Q234048 How Windows 2000 Assigns, Reserves, and Stores Drive Letters

Name: \DosDevices\E:
Type: REG_BINARY
Data:
00000000 44 4d 49 4f 3a 49 44 3a - 56 51 9b 61 07 5f b7 49 DMIO:ID:VQ.a._·I
         ad 80 3f b5 b9 90 ae 3b -                         .?µ¹.®;

Name: \??\Volume{9d5dec1a-daeb-11d6-b028-000802808d9a}
Type: REG_BINARY
Data:
00000000 44 4d 49 4f 3a 49 44 3a - 14 ec 38 a5 69 bb 67 4a DMIO:ID:.ì8¥i»gJ
         bd 11 82 e6 01 1b 1e de -                         ½..æ...Þ

Name: \??\Volume{a373fb9f-ed0b-11d6-a37d-000802808d9a}
Type: REG_BINARY
Data:
00000000 44 4d 49 4f 3a 49 44 3a - 56 51 9b 61 07 5f b7 49 DMIO:ID:VQ.a._·I
         ad 80 3f b5 b9 90 ae 3b -                         .?µ¹.®;


The information in this example shows that there is one current DOS Device, with the drive letter E:.  The data section includes the string "DMIO", which indicates that the device is managed by DMIO.SYS and is therefore on a Dynamic Disk.  In this example, the data for DOS Device E: matches the data of only one volume GUID entry, \??\Volume{a373fb9f-ed0b-11d6-a37d-000802808d9a}.  The other entry, \??\Volume{9d5dec1a-daeb-11d6-b028-000802808d9a}, is not currently accessible to Windows.

Troubleshooting with the MountedDevices key consists of matching DOS Devices with entries that start with "\??\".  Entries that do not have an associated DOS Device must be noted for additional inspection.
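The matching step can be sketched in Python as follows. The sketch takes the MountedDevices values as a dictionary of value name to binary data and reports the "\??\" entries whose data does not match any "\DosDevices\" entry. The sample data is transcribed from the example above; the function names are hypothetical, and reading the values from a live registry (for example with Python's `winreg` module on Windows) is left out so that the sketch runs anywhere.

```python
def is_dynamic(data):
    """True if the value data carries the DMIO:ID: prefix seen on Dynamic Disk volumes."""
    return data.startswith(b"DMIO:ID:")


def unmatched_volumes(mounted):
    """Return the \\??\\ entries whose data matches no \\DosDevices\\ entry."""
    dos_data = {
        data for name, data in mounted.items() if name.startswith("\\DosDevices\\")
    }
    return sorted(
        name
        for name, data in mounted.items()
        if name.startswith("\\??\\") and data not in dos_data
    )


if __name__ == "__main__":
    # Values transcribed from the example in this article.
    online = bytes.fromhex(
        "44 4d 49 4f 3a 49 44 3a 56 51 9b 61 07 5f b7 49 ad 80 3f b5 b9 90 ae 3b"
    )
    offline = bytes.fromhex(
        "44 4d 49 4f 3a 49 44 3a 14 ec 38 a5 69 bb 67 4a bd 11 82 e6 01 1b 1e de"
    )
    mounted = {
        r"\DosDevices\E:": online,
        r"\??\Volume{9d5dec1a-daeb-11d6-b028-000802808d9a}": offline,
        r"\??\Volume{a373fb9f-ed0b-11d6-a37d-000802808d9a}": online,
    }
    print(unmatched_volumes(mounted))
    print(all(is_dynamic(data) for data in mounted.values()))
```

An entry reported by this check is not necessarily a problem; it is only a volume that has no drive letter in the current configuration and therefore deserves a closer look.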

Disk Management Diagnostics (DmDiag) Report:

Source: http://www.microsoft.com/windows2000/techinfo/reskit/tools/existing/dmdiag-o.asp

DmDiag is a Disk Manager Diagnostics utility that was first introduced with Windows 2000 and Dynamic Disks.  This utility is used to query the Dynamic Disk database and the output can be redirected to a text file.

DmDiag displays the following information for a computer:

Computer name and operating system version

Physical disk to disk type

Mount points

LDM file versions

Drive letter usage, GetLogicalDrives(), GetDriveType()

\Device Symbolic links

ldmsize

Kernel list

Disk partition information

Referring to the previous section on MountedDevices key, because the disk is Dynamic, the DMDIAG report will contain information about the disk that is currently online and the Dynamic volume that is currently offline.

In the previous example the system being examined shows an E: drive and two Dynamic disks listed as "Foreign".  The output of DMDIAG matches what is shown in Disk Management:

---------- \Device\Harddisk0 ----------
\Device\Harddisk0\DP(1)0x4000-0x23d8000+2 (Device)
\Device\Harddisk0\DP(2)0x23dc000-0x8783f4000+3 (Device)
\Device\Harddisk0\DR0 (Device)
\Device\Harddisk0\Partition0 (SymbolicLink) -> \Device\Harddisk0\DR0
\Device\Harddisk0\Partition1 (SymbolicLink) -> \Device\HarddiskVolume1
\Device\Harddisk0\Partition2 (SymbolicLink) -> \Device\HarddiskVolume2

---------- \Device\Harddisk1 ----------
\Device\Harddisk1\DP(1)0x7e00-0x4c613b9800+4 (Device)
\Device\Harddisk1\DR1 (Device)
\Device\Harddisk1\Partition0 (SymbolicLink) -> \Device\Harddisk1\DR1
\Device\Harddisk1\Partition1 (SymbolicLink) -> \Device\HarddiskDmVolumes\SandollarDg0\Volume1

---------- \Device\Harddisk2 ----------
\Device\Harddisk2\DP(1)0x7e00-0x4c613b9800+6 (Device)
\Device\Harddisk2\DR5 (Device)
\Device\Harddisk2\Partition0 (SymbolicLink) -> \Device\Harddisk2\DR5

---------- \Device\Harddisk3 ----------
\Device\Harddisk3\DP(1)0x7e00-0x4c613b9800+8 (Device)
\Device\Harddisk3\DR7 (Device)
\Device\Harddisk3\Partition0 (SymbolicLink) -> \Device\Harddisk3\DR7

The first thing to check is the Disk Signature for Harddisks 1, 2, and 3 in the DMDIAG report.  Scan the report for the section named "Partition Table Info Disk n" that corresponds to each disk.  Here are the disk signature lines from those sections:

---------- Partition Table Info Disk 1 ----------
6ef83e34 Signature

---------- Partition Table Info Disk 2 ----------
6ef83e34 Signature

---------- Partition Table Info Disk 3 ----------
6ef83e34 Signature

Based on the information retrieved from PnPEnum, the MountedDevices registry key, and the DMDIAG report, the conclusion in this case is that the Windows host has more than one path to the same device.  At this point troubleshooting can narrow to the multiple-path configuration.  Does the host have more than one HBA?  If so, does each HBA have a path to the storage?  If so, how are the multiple paths managed?  Possible causes include multiple-path software issues, zoning issues at the switch, and the redundant path configuration at the storage unit.  If the host is using multiple-path software, one troubleshooting step may be to disable the multiple-path software, turn off the host, remove the physical path from the redundant HBA, and then restart the host, in that order.
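The signature comparison can be sketched in Python. This is a rough illustration that assumes the DMDIAG report has been redirected to text and that each "Partition Table Info Disk n" header is followed directly by a line of the form "<8 hex digits> Signature". The regular expression and function name are assumptions made for this sketch, not part of DmDiag.

```python
import re
from collections import defaultdict

# Assumed layout: section header, then "<8 hex digits> Signature".
SECTION = re.compile(
    r"Partition Table Info Disk (\d+)\s*-+\s*([0-9a-fA-F]{8})\s+Signature"
)

def disks_sharing_signature(report_text):
    """Map each signature seen on more than one disk to the disk numbers that carry it."""
    by_signature = defaultdict(list)
    for disk, signature in SECTION.findall(report_text):
        by_signature[signature.lower()].append(int(disk))
    return {sig: disks for sig, disks in by_signature.items() if len(disks) > 1}

report = """
---------- Partition Table Info Disk 1 ----------
6ef83e34 Signature
---------- Partition Table Info Disk 2 ----------
6ef83e34 Signature
---------- Partition Table Info Disk 3 ----------
6ef83e34 Signature
"""

print(disks_sharing_signature(report))
```

A signature shared by several disk numbers is a hint that the host sees the same device over more than one path; it is a starting point for the questions above, not proof of the cause.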

Make sure that Microsoft and the customer understand the potential issues involved when there is more than one possible path of a common storage device.  Take great care when recommending any change in the host configuration when there are potentially two or more paths to a common storage device.  If you are not sure if a problem may occur from a configuration change, do not recommend a change to the Windows host, but rather present the information discovered to this point to the concerned parties.  Armed with this information, the customer, the vendor, and Microsoft may come to a conclusion as to the possible cause of the problem and a recommended course of action to address the problem.

Master Boot Record and Boot Sector report (FTDMPNT)

A command-line utility named FTDMPNT (also known as Sector Inspector) queries the physical drives that the system being examined has access to.  Critical sections of the disk are output to the console window.  This report is important because it shows the exact state of the Master Boot Record and the Boot Sectors for each volume.  Disks that come up as unknown or unallocated frequently have corruption in one of these critical startup disk sectors.  The following is example output from the Sector Inspector report:

===============================================================================
PhysicalDrive 2 (Size 8620.8 MB)
===============================================================================
Cylinders - 1099
Heads - 255
Sectors/Track - 63
Bytes/Sector - 512
===============================================================================
Logical Block 0 (0x00000000) Master Boot Record
0000 33 c0 8e d0 bc 00 7c fb-50 07 50 1f fc be 1b 7c ......|.P.P....|
0010 bf 1b 06 50 57 b9 e5 01-f3 a4 cb bd be 07 b1 04 ...PW...........
0020 38 6e 00 7c 09 75 13 83-c5 10 e2 f4 cd 18 8b f5 .n.|.u..........
0030 83 c6 10 49 74 19 38 2c-74 f6 a0 b5 07 b4 07 8b ...It...t.......
0040 f0 ac 3c 00 74 fc bb 07-00 b4 0e cd 10 eb f2 88 ....t...........
0050 4e 10 e8 46 00 73 2a fe-46 10 80 7e 04 0b 74 0b N..F.s..F..~..t.
0060 80 7e 04 0c 74 05 a0 b6-07 75 d2 80 46 02 06 83 .~..t....u..F...
0070 46 08 06 83 56 0a 00 e8-21 00 73 05 a0 b6 07 eb F...V.....s.....
0080 bc 81 3e fe 7d 55 aa 74-0b 80 7e 10 00 74 c8 a0 ....}U.t..~..t..
0090 b7 07 eb a9 8b fc 1e 57-8b f5 cb bf 05 00 8a 56 .......W.......V
00a0 00 b4 08 cd 13 72 23 8a-c1 24 3f 98 8a de 8a fc .....r..........
00b0 43 f7 e3 8b d1 86 d6 b1-06 d2 ee 42 f7 e2 39 56 C..........B...V
00c0 0a 77 23 72 05 39 46 08-73 1c b8 01 02 bb 00 7c .w.r..F.s......|
00d0 8b 4e 02 8b 56 00 cd 13-73 51 4f 74 4e 32 e4 8a .N..V...sQOtN...
00e0 56 00 cd 13 eb e4 8a 56-00 60 bb aa 55 b4 41 cd V......V.`..U.A.
00f0 13 72 36 81 fb 55 aa 75-30 f6 c1 01 74 2b 61 60 .r...U.u....t.a`
0100 6a 00 6a 00 ff 76 0a ff-76 08 6a 00 68 00 7c 6a j.j..v..v.j.h.|j
0110 01 6a 10 b4 42 8b f4 cd-13 61 61 73 0e 4f 74 0b .j..B....aas.Ot.
0120 32 e4 8a 56 00 cd 13 eb-d6 61 f9 c3 49 6e 76 61 ...V.....a..Inva
0130 6c 69 64 20 70 61 72 74-69 74 69 6f 6e 20 74 61 lid.partition.ta
0140 62 6c 65 00 45 72 72 6f-72 20 6c 6f 61 64 69 6e ble.Error.loadin
0150 67 20 6f 70 65 72 61 74-69 6e 67 20 73 79 73 74 g.operating.syst
0160 65 6d 00 4d 69 73 73 69-6e 67 20 6f 70 65 72 61 em.Missing.opera
0170 74 69 6e 67 20 73 79 73-74 65 6d 00 00 00 00 00 ting.system.....
0180 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
0190 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
01a0 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
01b0 00 00 00 00 00 2c 44 63-fc cd 6b a2 00 00 80 01 ......Dc..k.....
01c0 01 00 07 fe ff ff 3f 00-00 00 8b 27 0d 01 00 00 ................
01d0 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
01e0 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
01f0 00 00 00 00 00 00 00 00-00 00 00 00 00 00 55 aa ..............U.

PARTITION TABLE
-----------------------------------------------------------------------
|B|FS TYPE|    START     |     END      |            |            |
|F| (hex) |   C   H   S  |   C   H   S  |  RELATIVE  |   TOTAL    |
-----------------------------------------------------------------------
|1| 0x07  |   0   1   1  |1023 254  63  |         63 |   17639307 |
| | 0x00  |   0   0   0  |   0   0   0  |          0 |          0 |
| | 0x00  |   0   0   0  |   0   0   0  |          0 |          0 |
| | 0x00  |   0   0   0  |   0   0   0  |          0 |          0 |
-----------------------------------------------------------------------
Disk Signature 0xa26bcdfc

Logical Block 63 (0x0000003f) Primary Partition Start
-----------------------------------------------------------------------
NTFS BIOS Parameter Block Information
-----------------------------------------------------------------------
BytesPerSector      : 512
Sectors Per Cluster : 8
ReservedSectors     : 0
Fats                : 0
RootEntries         : 0
Small Sectors       : 0
Media Type          : 248 (0xf8)
SectorsPerFat       : 0
SectorsPerTrack     : 63
Heads               : 255
Hidden Sectors      : 63
LargeSectors        : 0
ClustersPerFRS      : 246
Clust/IndxAllocBuf  : 1
NumberSectors       : 17639306 ( 8612 MB )
MftStartLcn         : 4
Mft2StartLcn        : 1102456
SerialNumber        : 10110857296842789924 (0x8C50FB2450FB1424)
Checksum            : 10286278697749053440
Backup Boot Sector  : 17639369

0000 eb 52 90 4e 54 46 53 20-20 20 20 00 02 08 00 00 .R.NTFS.........
0010 00 00 00 00 00 f8 00 00-3f 00 ff 00 3f 00 00 00 ................
0020 00 00 00 00 80 00 80 00-8a 27 0d 01 00 00 00 00 ................
0030 04 00 00 00 00 00 00 00-78 d2 10 00 00 00 00 00 ........x.......
0040 f6 00 00 00 01 00 00 00-24 14 fb 50 24 fb 50 8c ...........P..P.
0050 00 00 00 00 fa 33 c0 8e-d0 bc 00 7c fb b8 c0 07 ...........|....
0060 8e d8 e8 16 00 b8 00 0d-8e c0 33 db c6 06 0e 00 ................
0070 10 e8 53 00 68 00 0d 68-6a 02 cb 8a 16 24 00 b4 ..S.h..hj.......
0080 08 cd 13 73 05 b9 ff ff-8a f1 66 0f b6 c6 40 66 ...s......f....f
0090 0f b6 d1 80 e2 3f f7 e2-86 cd c0 ed 06 41 66 0f .............Af.
00a0 b7 c9 66 f7 e1 66 a3 20-00 c3 b4 41 bb aa 55 8a ..f..f.....A..U.
00b0 16 24 00 cd 13 72 0f 81-fb 55 aa 75 09 f6 c1 01 .....r...U.u....
00c0 74 04 fe 06 14 00 c3 66-60 1e 06 66 a1 10 00 66 t......f`..f...f
00d0 03 06 1c 00 66 3b 06 20-00 0f 82 3a 00 1e 66 6a ....f.........fj
00e0 00 66 50 06 53 66 68 10-00 01 00 80 3e 14 00 00 .fP.Sfh.........
00f0 0f 85 0c 00 e8 b3 ff 80-3e 14 00 00 0f 84 61 00 ..............a.
0100 b4 42 8a 16 24 00 16 1f-8b f4 cd 13 66 58 5b 07 .B..........fX[.
0110 66 58 66 58 1f eb 2d 66-33 d2 66 0f b7 0e 18 00 fXfX...f..f.....
0120 66 f7 f1 fe c2 8a ca 66-8b d0 66 c1 ea 10 f7 36 f......f..f.....
0130 1a 00 86 d6 8a 16 24 00-8a e8 c0 e4 06 0a cc b8 ................
0140 01 02 cd 13 0f 82 19 00-8c c0 05 20 00 8e c0 66 ...............f
0150 ff 06 10 00 ff 0e 0e 00-0f 85 6f ff 07 1f 66 61 ..........o...fa
0160 c3 a0 f8 01 e8 09 00 a0-fb 01 e8 03 00 fb eb fe ................
0170 b4 01 8b f0 ac 3c 00 74-09 b4 0e bb 07 00 cd 10 .......t........
0180 eb f2 c3 0d 0a 41 20 64-69 73 6b 20 72 65 61 64 .....A.disk.read
0190 20 65 72 72 6f 72 20 6f-63 63 75 72 72 65 64 00 .error.occurred.
01a0 0d 0a 4e 54 4c 44 52 20-69 73 20 6d 69 73 73 69 ..NTLDR.is.missi
01b0 6e 67 00 0d 0a 4e 54 4c-44 52 20 69 73 20 63 6f ng...NTLDR.is.co
01c0 6d 70 72 65 73 73 65 64-00 0d 0a 50 72 65 73 73 mpressed...Press
01d0 20 43 74 72 6c 2b 41 6c-74 2b 44 65 6c 20 74 6f .Ctrl.Alt.Del.to
01e0 20 72 65 73 74 61 72 74-0d 0a 00 00 00 00 00 00 .restart........
01f0 00 00 00 00 00 00 00 00-83 a0 b3 c9 00 00 55 aa ..............U.
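The hex dumps above can be cross-checked against the decoded values that the report prints. The following Python sketch decodes the disk signature and first partition entry from the last 80 bytes of the Master Boot Record dump, and a few NTFS BIOS Parameter Block fields from the first 80 bytes of the boot sector dump. The field offsets follow the standard MBR and NTFS boot sector layouts. The function names are chosen for this illustration and are not part of Sector Inspector.

```python
import struct

def from_hex_rows(*rows):
    """Turn dump rows such as "fc cd 6b a2-00 00" into raw bytes."""
    return bytes.fromhex(" ".join(rows).replace("-", " "))

def decode_chs(raw):
    """Unpack a 3-byte CHS field into (cylinder, head, sector)."""
    head, sector_and_high_cyl, low_cyl = raw
    cylinder = ((sector_and_high_cyl & 0xC0) << 2) | low_cyl
    return cylinder, head, sector_and_high_cyl & 0x3F

def parse_mbr(sector):
    """Return (disk signature, non-empty partition entries) from a 512-byte MBR."""
    if sector[0x1FE:0x200] != b"\x55\xaa":
        raise ValueError("sector does not end with the 55 aa marker")
    (signature,) = struct.unpack_from("<I", sector, 0x1B8)
    partitions = []
    for index in range(4):
        boot, start, fs_type, end, relative, total = struct.unpack_from(
            "<B3sB3sII", sector, 0x1BE + 16 * index)
        if fs_type != 0:
            partitions.append({
                "bootable": boot == 0x80, "type": fs_type,
                "start_chs": decode_chs(start), "end_chs": decode_chs(end),
                "relative": relative, "total": total,
            })
    return signature, partitions

def parse_ntfs_bpb(boot_sector):
    """Decode a few NTFS BIOS Parameter Block fields from the start of a boot sector."""
    bytes_per_sector, sectors_per_cluster = struct.unpack_from("<HB", boot_sector, 0x0B)
    (hidden,) = struct.unpack_from("<I", boot_sector, 0x1C)
    number_sectors, mft_lcn, mft2_lcn = struct.unpack_from("<QQQ", boot_sector, 0x28)
    (serial,) = struct.unpack_from("<Q", boot_sector, 0x48)
    return {
        "oem_id": boot_sector[3:11].decode("ascii"),
        "bytes_per_sector": bytes_per_sector,
        "sectors_per_cluster": sectors_per_cluster,
        "hidden_sectors": hidden,
        "number_sectors": number_sectors,
        "mft_start_lcn": mft_lcn,
        "mft2_start_lcn": mft2_lcn,
        "serial_number": serial,
    }

# Rows 01b0-01f0 of the MBR dump; the boot code before them is zero-filled here.
mbr = bytes(0x1B0) + from_hex_rows(
    "00 00 00 00 00 2c 44 63-fc cd 6b a2 00 00 80 01",
    "01 00 07 fe ff ff 3f 00-00 00 8b 27 0d 01 00 00",
    "00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00",
    "00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00",
    "00 00 00 00 00 00 00 00-00 00 00 00 00 00 55 aa",
)
# Rows 0000-0040 of the boot sector dump.
boot = from_hex_rows(
    "eb 52 90 4e 54 46 53 20-20 20 20 00 02 08 00 00",
    "00 00 00 00 00 f8 00 00-3f 00 ff 00 3f 00 00 00",
    "00 00 00 00 80 00 80 00-8a 27 0d 01 00 00 00 00",
    "04 00 00 00 00 00 00 00-78 d2 10 00 00 00 00 00",
    "f6 00 00 00 01 00 00 00-24 14 fb 50 24 fb 50 8c",
)

signature, partitions = parse_mbr(mbr)
bpb = parse_ntfs_bpb(boot)
print(hex(signature), partitions)
print(bpb)
```

When a disk shows up as unknown or unallocated, a decode of this kind makes it easier to see which field no longer agrees with the expected layout, for example a missing 55 aa marker or a partition entry whose total no longer matches the boot sector.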

For more information about the Master Boot Record and Boot Sector, see the following articles on Microsoft.com:

Chapter 1 - Disk Concepts and Troubleshooting
Master Boot Record on Basic Disks
Master Boot Record on Dynamic Disks
Boot Sectors on MBR Disks

List of installed hotfixes:

Make sure that you know what hotfixes are installed and for what reason.  While you troubleshoot, it is worthwhile to check the installed hotfixes versus recent fixes that are relevant to the current issue being investigated.  For example, an HBA that uses a miniport driver will forward SCSI operations to the SCSIPORT.SYS inbox port driver.  If there are errors in the event log relating to other drivers such as DISK.SYS, check the version currently installed and then check the Microsoft Knowledge Base to see if there are any updates to this driver that may be relevant to the current issue.

There are several ways to obtain a list of installed hotfixes.  One is to check the Add/Remove Programs list in Control Panel.  Another is to run MPS_REPORTS, whose output includes a file named %ComputerName%_hotfixes.txt.

System Information Report (MSINFO32.exe and WinMSD.exe):

For Windows NT 4.0, the system information report is produced by the Windows NT Diagnostics utility (WinMSD).  For Windows 2000 and later, it is produced by the System Information utility (MSINFO32).  Both utilities show a lot of information about the system being investigated, and both can save reports to text files that can be sent to Microsoft support professionals for analysis.  Basic information common to both utilities includes the following:

Hardware configuration
Operating system version and service pack
List of services and their respective states at the time the report was run

The System Information report for Windows 2000 and later is more flexible and reports more information including the following:

Hardware conflicts
Forced hardware
Detailed hardware driver information
Running tasks
Loaded modules
Startup programs

With Windows Server 2003, the System Information report also gives version information for drivers.

Process and Thread Status (PSTAT) Report:

The PSTAT utility is available from a variety of locations including the Resource Kit, the SDK, and the Support Tools included with Windows XP and later.  The following is sample output from the PSTAT utility:

Pstat version 0.3: memory: 1048096 kb uptime: 38 14:18:02.860

PageFile: \??\F:\pagefile.sys
Current Size: 1572864 kb Total Used: 15416 kb Peak Used 16308 kb

Memory:1048096K Avail: 477528K TotalWs: 549904K InRam Kernel: 1608K P:66952K
Commit: 213884K/ 123180K Limit:2522136K Peak: 226688K Pool N:26912K P:67092K

  User Time  Kernel Time     Ws    Faults Commit Pri  Hnd Thd Pid Name
                         352056 110740827                         File Cache
0:00:00.000 12:08:01.406     16         1      0   0    0   2   0 Idle Process
0:00:00.000  6:00:31.359    260     88263     24   8 3299  59   8 System
0:00:00.031  0:00:01.046    364       656   1100  11   43   6 212 smss.exe
0:02:18.515  0:08:04.296   1888   1098657   1336  13  589  13 240 csrss.exe
0:00:12.125  0:00:25.125   8448     39968   6652  13  369  16 236 WINLOGON.EXE
2:17:57.140  5:47:02.953  17544  19418425   9600   9  823  37 292 services.exe
0:04:51.765  0:09:59.140   6644    130650   3204  13  417  19 304 lsass.exe

 pid:140 pri:11 Hnd: 281 Pf: 807 Ws: 356K smss.exe
 tid pri Ctx Swtch StrtAddr    User Time  Kernel Time State
 13c  13       816 4858983E  0:00:00.000  0:00:00.250 Wait:UserRequest
 144  13       797 4858818D  0:00:00.000  0:00:00.093 Wait:LpcReceive
 148  12       828 4858818D  0:00:00.031  0:00:00.125 Wait:LpcReceive
 120  13         2 48582F1F  0:00:00.000  0:00:00.000 Wait:LpcReceive
 14c  13        78 48582CB4  0:00:00.000  0:00:00.000 Wait:LpcReceive
 150  13         2 77F9ADA1  0:00:00.000  0:00:00.000 Wait:LpcReceive

ModuleName   Load Addr    Code   Data   Paged  Linkdate
-------------------------------------------------------------------------
ntoskrnl.exe  80400000  449216  99264  735168  Wed Jul 17 21:39:51 2002
hal.dll       80062000   31808   8128   22048  Wed Mar 20 08:35:10 2002
BOOTVID.dll   EB410000    5664   2464       0  Wed Nov 03 19:24:33 1999
ACPI.sys      BFFD8000   92192   9024   43520  Fri Nov 16 11:45:08 2001

end of PSTAT.exe sample

PSTAT gives a lot of information in one report.  For troubleshooting operating system issues, the drivers section at the end of the PSTAT report is one of the first places to look for driver and module information.  Of special interest are the storage-related drivers such as SCSIPORT.SYS, DISK.SYS, HBA miniport drivers, HBA full-port drivers, multipath software drivers, and other filter drivers.  The initial check is a consistency check to see if the relevant drivers are fairly current.  If a driver has a linkdate that is more than a year or two older than the current date, this is worth noting.  The drivers of interest can be cross-referenced in other diagnostic reports for version and timestamp information.  Using the version and timestamp, check the Microsoft Knowledge Base for known issues, or check the driver vendor's Web site to see if the running drivers are at the recommended revisions.  The decision to update drivers must be part of a managed change process for consistency and control.

Too many changes at the same time may do more harm than good, and even if the issue is resolved, the exact change that resolved it may never be discovered.
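The linkdate consistency check can be sketched in Python. This is a minimal illustration that assumes each module line ends with a linkdate in the form shown in the sample (English day and month names) and that a two-year threshold is used. The function name, the threshold, and the reference date passed in are assumptions made for this sketch.

```python
from datetime import datetime, timedelta

def stale_modules(module_lines, as_of, max_age_days=730):
    """Return (module name, linkdate) pairs older than max_age_days as of the given date."""
    stale = []
    for line in module_lines:
        fields = line.split()
        if len(fields) < 10:
            continue  # not a module row (header, separator, or blank line)
        # The last five fields form the linkdate, e.g. "Wed Jul 17 21:39:51 2002".
        linkdate = datetime.strptime(" ".join(fields[-5:]), "%a %b %d %H:%M:%S %Y")
        if as_of - linkdate > timedelta(days=max_age_days):
            stale.append((fields[0], linkdate.date().isoformat()))
    return stale

module_lines = [
    "ntoskrnl.exe 80400000 449216 99264 735168 Wed Jul 17 21:39:51 2002",
    "hal.dll 80062000 31808 8128 22048 Wed Mar 20 08:35:10 2002",
    "BOOTVID.dll EB410000 5664 2464 0 Wed Nov 03 19:24:33 1999",
    "ACPI.sys BFFD8000 92192 9024 43520 Fri Nov 16 11:45:08 2001",
]

# Hypothetical reference date standing in for "the current date".
print(stale_modules(module_lines, as_of=datetime(2003, 1, 1)))
```

A module flagged here is only a candidate for a closer look; an old linkdate does not by itself mean that the driver is at fault.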

For troubleshooting performance issues on a SAN, the PSTAT report is also a good place to start.  From this report a quick assessment can be made of operating system resource utilization.  Of particular interest are memory usage, paged and non-paged pool usage, working set size, page faults, handles, and threads.  The beginning of the PSTAT report shows totals for these objects, and the next section shows the resource utilization statistics per process.  If one process stands out with abnormally high utilization of a particular resource, such as handles, look it up in the section that follows the per-process summary, which lists each process with a detailed report of resource usage per thread.
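Ranking the per-process summary by one resource can be sketched in Python. This illustration assumes the summary rows have the nine numeric columns shown in the sample (User Time, Kernel Time, Ws, Faults, Commit, Pri, Hnd, Thd, Pid) followed by the process name. The regular expression and function name are assumptions made for this sketch.

```python
import re

# One summary row: two time columns, seven integer columns, then the name.
ROW = re.compile(
    r"^\s*(?P<user>[\d:.]+)\s+(?P<kernel>[\d:.]+)\s+(?P<ws>\d+)\s+(?P<faults>\d+)\s+"
    r"(?P<commit>\d+)\s+(?P<pri>\d+)\s+(?P<hnd>\d+)\s+(?P<thd>\d+)\s+(?P<pid>\d+)\s+"
    r"(?P<name>.+?)\s*$"
)

def rank_processes(summary_text, column="hnd"):
    """Return (value, process name) pairs sorted from highest to lowest for one column."""
    rows = []
    for line in summary_text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append((int(match.group(column)), match.group("name")))
    return sorted(rows, reverse=True)

summary = """
User Time  Kernel Time Ws Faults Commit Pri  Hnd Thd Pid Name
     352056 110740827           File Cache
0:00:00.000 12:08:01.406 16 1 0 0 0 2 0 Idle Process
0:00:00.000 6:00:31.359 260 88263 24 8 3299 59 8 System
0:00:00.031 0:00:01.046 364 656 1100 11 43 6 212 smss.exe
0:02:18.515 0:08:04.296 1888 1098657 1336 13 589 13 240 csrss.exe
0:00:12.125 0:00:25.125 8448 39968 6652 13 369 16 236  WINLOGON.EXE
2:17:57.140 5:47:02.953 17544 19418425 9600 9 823 37 292 services.exe
0:04:51.765 0:09:59.140 6644 130650 3204 13 417 19 304 lsass.exe
"""

print(rank_processes(summary, "hnd")[:3])
print(rank_processes(summary, "faults")[:3])
```

The ranking only points at the process to examine; the per-thread section of the report is still where the detail lives.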

Detailed Drivers report:

A utility named CHECKSYM.EXE (included with MPS_REPORTS) can be used to gather very detailed information about operating system drivers.  Drivers are generally installed to a folder named %SystemRoot%\System32\drivers.  The drivers report will show the numeric version of the driver, timestamp, and other information.  The numeric version of installed drivers can be used to verify whether installed drivers comply with OEM or vendor specifications.  If MPS_REPORTS is used to gather this information, the output will contain a file named %ComputerName%_drivers.txt.

The following is a sample output from the Drivers report:

Module[132] [C:\WINDOWS\SYSTEM32\DRIVERS\SCSIPORT.SYS]
  Company Name: Microsoft Corporation
  File Description: SCSI Port Driver
  Product Version: (5.2:3790.0)
  File Version: (5.2:3790.0)
  File Size (bytes): 131584
  File Date: Tue Mar 25 05:00:00 2003
  Module TimeDateStamp = 0x3e800cd5 - Tue Mar 25 00:01:25 2003
  Module Checksum = 0x00024808
  Module SizeOfImage = 0x00026000
  Module Pointer to PDB = [scsiport.pdb]
  Module PDB Guid = {3B93BFB2-CE9F-40FA-B15F-7D0E45A0439E}
  Module PDB Age = 0x2
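When timestamps from different reports are compared, it helps to decode the raw Module TimeDateStamp value instead of relying on the printed date. The value is a count of seconds since January 1, 1970 UTC, as stored in the image header. The Python sketch below extracts and decodes it; the function name and the regular expression are assumptions made for this illustration.

```python
import re
from datetime import datetime, timezone

def module_timestamp_utc(report_block):
    """Extract the hex Module TimeDateStamp from a drivers-report block and decode it as UTC."""
    match = re.search(r"Module TimeDateStamp = 0x([0-9a-fA-F]+)", report_block)
    if match is None:
        raise ValueError("no Module TimeDateStamp line found")
    return datetime.fromtimestamp(int(match.group(1), 16), tz=timezone.utc)

block = "Module TimeDateStamp = 0x3e800cd5 - Tue Mar 25 00:01:25 2003"
print(module_timestamp_utc(block).isoformat())
```

For the sample above the decoded value is 08:01:25 UTC on March 25, 2003, which suggests that the text printed in the report was rendered in a local time zone eight hours behind UTC. Comparing the raw hex value avoids that ambiguity when reports come from hosts in different time zones.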

Host Bus Adapter (HBA) logging:

Most HBA events will be recorded in the system event log.  Contact the vendor to make sure there is no other logging location for HBA events.  HBA vendors may have a support section on their Web site that will help decode system events recorded by their drivers.  Also, some HBA drivers can be configured to log additional information in the System event log by changing settings through a configuration utility or by editing registry entries for the driver.

Multipath Software logging:

Most multipath software events will be recorded in the system event log.  Contact the vendor to make sure there is no other logging location for multipath software events.  If there is, the vendor of the multipath software should be able to provide assistance analyzing these log files.

System Management logs:

Many systems are running some form of system management software.  This software is used to monitor the health of the server and record events such as errors, failures, changes, and others.  System management logs may contain additional information about the system that is not recorded in the Windows event logs.  Here is a sample list of management suites:

Dell OpenManage
EMC ControlCenter Management Package for Symmetrix
HDS Storage Area Management Suite
Hewlett-Packard Insight Manager
IBM Director
Unisys Server Sentinel

For more information, contact the respective vendor.

Switch (fabric) error logs:

The SAN fabric devices may have error logging that can be consulted for additional information.  The customer or the hardware vendor can read these logs and decode the information.  The error logs are useful to Microsoft because specific errors that can be correlated with an event on a Windows host help in root cause analysis.  However, finding and reviewing switch logs is outside the scope of Microsoft support.

Storage error logs:

Storage devices on the SAN may have error logging.  Collection and analysis of storage error logging would be the responsibility of the customer or hardware vendor.  This information can potentially be useful to Microsoft support for root cause analysis.

Performance Monitor:

The Performance Monitor utility can be used to measure system performance both in real time and over time.  For Windows 2000 and later, the Performance Monitor utility has been renamed to System Monitor.  Ideally the Performance Monitor utility would be run on another system that can access the target system over a standard network.  Performance Monitor can be configured to collect performance counters at time intervals defined at run time.  For issues that occur infrequently, the sampling interval must be lengthened so that the collection log does not grow too large and become unmanageable.  If the issues occur on a daily basis, the sampling interval must be shortened enough to capture potentially transient information that may be relevant.
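The trade-off between capture duration and sampling interval is simple arithmetic, sketched here in Python. The sketch assumes that the log is kept manageable by capping the total number of samples; the function name and the sample budget of 10,000 are placeholders chosen for illustration, not values recommended by Microsoft.

```python
import math

def sampling_interval_seconds(capture_hours, max_samples, minimum=1):
    """Shortest whole-second interval that keeps a capture within max_samples samples."""
    return max(minimum, math.ceil(capture_hours * 3600 / max_samples))

# Hypothetical budget of 10,000 samples per counter log.
print(sampling_interval_seconds(24, 10_000))      # one-day capture
print(sampling_interval_seconds(7 * 24, 10_000))  # one-week capture
```

With the same budget, a one-week capture needs an interval roughly seven times longer than a one-day capture, which is why infrequent issues call for wider sampling intervals.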

See the following Knowledge Base articles for information about using Performance Monitor:

248345 How to Create a Log Using System Monitor in Windows 2000
811237 HOW TO: Capture Performance Data from a Remote Windows 2000 Computer

Performance problems with SAN attached systems can be analyzed using the same tools and methods as servers that are not SAN attached.  The main difference is that the SAN administrator(s) or hardware vendors must be engaged.  After all standard troubleshooting methods and tools have been employed, you will have to examine the path between the initiator and target in addition to all the hardware in the path.

Some possible causes of SAN performance issues are:

Failing or intermittently failing hardware such as HBA, GBIC, cable, switch port, switch Application Specific Integrated Circuit (ASIC), storage controller, or other component in between the initiator and target.

Incorrect storage configuration.  For example, Microsoft Exchange Server must have log files and databases on separate disks because of the different nature of input/output for log file and database reads and writes.  The following are several articles related to Microsoft Exchange performance issues:

XADM: Client Latencies Occur When Exchange 2000 Converts Mail from MAPI to MIME Format
XADM: White Paper - Troubleshooting Exchange 2000 Performance
XADM: Hard-Disk Write Performance Is Slower with Exchange 2000 Than with Previous Versions of Exchange

Events occurring on the SAN that are transparent to the Windows host.  For example, on an arbitrated loop, devices that come and go will cause the whole loop to reinitialize, and may cause performance degradation.  In a switched fabric environment, devices that log on to and log off the fabric will trigger RSCNs that may sometimes be disruptive to the Windows host.  If fabric events are suspected, the SAN administrator(s) and hardware vendors must monitor the SAN for disruptive events such as nodes with high error counts, timeouts, and others.

Information Analysis Conclusions:

Based on the information gathered, some conclusions may begin to emerge.  For example:

Symptom: An HBA driver was updated on a Server Cluster node, including firmware, and now the host occasionally has timeout issues communicating with the storage devices.

Conclusion: In this case the change recommendation would be to bring all members of this Cluster to the same revision of driver and firmware levels for devices that communicate on the SAN.  This issue is well documented; see Microsoft Knowledge Base article Q311081 for more information.

Analysis: Event ID 9 or event ID 11 generally indicates a hardware timeout.

Conclusion: In an arbitrated loop environment, when a device comes online or goes offline, a process named Loop Initialization occurs.  The initialization process occurs very quickly, but during this process all I/O on the loop is suspended.  The result may be a timeout that a Windows host reports as event ID 9, 11, 26, 51, or others.  The loop initialization process is by design and is the main reason that Windows Clustering is not supported on an Arbitrated Loop.  It is also the reason that Windows hosts are not supported in a boot from SAN configuration on an arbitrated loop.

Conclusion: Some timeout issues have been traced back to problems with multipath software.  If the hosts have redundant paths to their storage, the change recommendation will be to test with a single path to see if the same issues occur.

Conclusion: Windows status codes in various log files show that the storage device seems to be "coming and going".  Although many things can cause devices to "come and go", the best place to start troubleshooting this type of issue is halfway in between the initiator and the target.  This level of troubleshooting would be outside the realm of Microsoft Product Support Services and in the realm of the SAN administrator(s) or hardware vendor(s).  Potential causes of these types of issues can be bad cables, bad Gigabit Interface Converter (GBIC), bad or failing HBAs, or a bad port anywhere in the chain.  The hardware vendor will have additional resources available to help with these types of issues such as switch logs, port logs, and if necessary, Fibre Channel analyzers to perform a trace of traffic in between the initiator and target.

For related information about this topic, see the following Knowledge Base article:

Q317162 Supported Fibre Channel Configurations 

Analysis: A Windows 2000 server starts the "Write Disk Signature" wizard every time the Computer Management Console is opened.  When the wizard is canceled there are disk devices present that are unknown to this system.

Conclusion:  This issue is documented in Knowledge Base article Q293778 and can be a result of multi-path software either failing or incorrectly configured.  If there is multi-path software involved for redundancy purposes, the vendor must be involved to verify the correct configuration, correct version of software, and perform any diagnostics that may be available for that software. 

Analysis: A set of disk storage devices on the SAN were configured as a Dynamic spanned volume.  At some point the host lost contact with the devices.  The host was restarted to resolve the situation and now there is only a subset of the members available.  Troubleshooting by the SAN administrators has revealed a switch failure in the SAN, although there should have been a redundant path to the storage devices.

Conclusion: Disk devices that are presented to a host that are unknown or that "come and go" indicate a potential zoning issue on the SAN.  Zoning is a method to allow or prevent access between hosts and storage devices.  With correct zoning only specified storage devices will be visible to specified hosts.  If there are changes in zoning or changes in the path or paths to storage devices, the storage devices may appear at random to the hosts, or the hosts may be able to scan storage devices that were not meant to be scanned by those hosts.  If a path or zoning issue is suspected the SAN administrator(s) must be given the information for additional analysis.  Microsoft may be able to provide general zoning guidelines, but would not be able to recommend any specific zoning changes.

Note: Failure of the LUN masking mechanisms may appear to the user as loss of connectivity, inappropriate failover of devices, or data corruption.  Corruption can result either from transfers interrupted by a different host or from devices being exposed to a system that should not see them, which then tries to take ownership.  For related information about this topic, see the following Microsoft Knowledge Base article:

Q294173 Removing the HBA Cable on a Server Cluster 

Analysis: A Windows host configured to boot from SAN has various issues relating to the paging file, general I/O timeout errors, and occasional bugchecks.

Conclusion: See Q305547 Support for Booting from a Storage Area Network (SAN), for a complete rundown on this topic and troubleshooting guidelines for boot from SAN issues.

Change Recommendation

Based on the analysis of collected information, any changes recommended by one party must be first disclosed to and agreed upon by all parties involved.  Troubleshooting SAN issues should always be a collaborative effort between the customer, hardware vendor(s), and software vendor(s).

Recommendation based on HCL status:

Based on collected information, the first recommendation may be to bring all Windows hosts into HCL compliance if they are not already in compliance.  There are very specific rules for support, especially where Server Clustering is involved.  The following Microsoft Knowledge Base articles give detailed information about the topic of HCL compliance:

Q327831 Support for a Single Server Cluster Attached to Multiple SANs

Q304415 Support for Multiple Clusters Attached to the Same SAN Device

Q254321 INF: Clustered SQL Server Do's, Don'ts, and Basic Warnings

Based on the collected information and existing support policies, the following changes may be recommended:

1. All hosts in a common zone must have the same firmware and driver revisions for their HBAs.  These firmware and driver revisions must be in accordance with the HCL configuration that this solution or device was tested and approved for.

2. All Windows systems in a specific zone must generally be at the same service pack and hotfix revision.  Differences in service pack and hotfix versions can possibly cause unknown behavior.

3. All SAN switches must be at a common firmware revision.  This task would be handled by the SAN administrator(s) or hardware vendors.

4. Each Server Cluster must be in its own private zone if it is not already.

5. Boot from SAN Windows hosts must not be on an arbitrated loop.

6. Multipath software must be at the vendor's latest recommended revision.  If the system is already at the latest revision, and the system is experiencing timeout issues or devices that "come and go", Microsoft's change recommendation would be to set the multipath software to disabled, shut down the server, remove all but one path to the storage, then restart the server.  The multipath software must not be changed in any way as long as multiple paths exist to the storage.  The configuration of the software and HBAs must not be changed as long as the host can perform I/O with the storage device.  That is why the recommendation is to disable the software, shut down, remove all but one path, and then restart the host, in that order.
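The consistency checks in items 1 and 2 can be sketched in Python once the revision information has been collected from each host, for example from the drivers and hotfix reports described earlier. The host names, component names, and revision strings below are hypothetical, and the function name is an assumption made for this illustration.

```python
from collections import defaultdict

def revision_mismatches(hosts):
    """Return components whose revision differs between hosts, with the hosts at each revision."""
    seen = defaultdict(lambda: defaultdict(list))
    for host, components in hosts.items():
        for component, revision in components.items():
            seen[component][revision].append(host)
    return {
        component: {rev: sorted(names) for rev, names in revisions.items()}
        for component, revisions in seen.items()
        if len(revisions) > 1
    }

# Hypothetical inventory for three hosts in one zone.
zone_hosts = {
    "NODE1": {"hba_driver": "5.2.1", "hba_firmware": "3.0", "service_pack": "SP3"},
    "NODE2": {"hba_driver": "5.2.1", "hba_firmware": "3.1", "service_pack": "SP3"},
    "NODE3": {"hba_driver": "5.2.1", "hba_firmware": "3.0", "service_pack": "SP3"},
}

print(revision_mismatches(zone_hosts))
```

The sketch only reports differences. Which revision is the correct one still has to come from the HCL entry and the vendor for the tested configuration.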

Recommendation based on insufficient logging and event information:

Sometimes, the information available in the server event logs and other logs is not adequate to determine root cause or make a change recommendation.  In this case Microsoft's change recommendation would be to enable a more verbose mode of logging in the HBA, the multiple-path software, or both.  Verbose HBA logging will quickly fill the event logs, possibly flushing out other important events, so increasing the size of the event logs during additional information gathering is recommended.

The following Web pages have additional information about verbose logging methods:

Emulex: Windows Event Log
Qlogic: Extended Error logging
Qlogic: QLA2xxx NT Miniport Driver Event Logging

For more information about adjusting Windows Event logging, see the following Knowledge Base article:

Q320121 HOW TO: Configure the Size and Behavior of Event Viewer Logs in Windows

Zoning change recommendations:

PSS never recommends zoning changes.  Sometimes, Microsoft Consulting Services may recommend changes, but in most cases zoning changes must be performed by the customer or SAN hardware vendor.  If there is an issue on the SAN that appears to be zoning related, the information can be gathered and made available to concerned parties to help in their analysis of the issue.

SAN fabric logging recommendations:

SAN infrastructure may be as simple as a point-to-point connection between a host and storage, or as complex as a full-mesh fabric.  Depending on the complexity of the SAN infrastructure, the fabric may be (and frequently is) managed by a SAN operating system, also known as the fabric OS.  In many cases there are various methods of managing and monitoring events that occur in the SAN infrastructure.  Many types of SAN devices such as switches maintain internal event facilities.  For example, by default Brocade switches will maintain the last 64 events that occurred on that switch.  The problem is that if the switch is rebooted, or there are more than 64 events, items of interest may be discarded because of lack of buffer space.

SAN switches or fabric management software may include the capability to log events to permanent storage.  If this capability exists, one potential change recommendation would be to configure SAN devices to log events to permanent storage for later analysis.  If you can configure SAN devices this way, it is important that the time be synchronized on the SAN devices with respect to the time on the hosts or nodes.  In this way an event reported by a host can more easily be matched to an event reported by a SAN device.  Having a time drift on the SAN or the host makes the troubleshooting process more difficult.
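The effect of time drift on event matching can be sketched in Python. This illustration assumes that the offset between the switch clock and the host clock is known, or has been measured, and that events are considered related when they fall within a chosen window after correction. The event texts, the two-minute offset, and the 30-second window are hypothetical.

```python
from datetime import datetime, timedelta

def correlate(host_events, switch_events, switch_clock_offset, window=timedelta(seconds=30)):
    """Pair host and switch events that fall within the window after removing clock drift.

    switch_clock_offset is the amount by which the switch clock runs ahead of the host clock.
    """
    matches = []
    for host_time, host_text in host_events:
        for switch_time, switch_text in switch_events:
            corrected = switch_time - switch_clock_offset
            if abs(corrected - host_time) <= window:
                matches.append((host_text, switch_text))
    return matches

# Hypothetical events; the switch clock is assumed to run two minutes fast.
host_events = [(datetime(2003, 5, 1, 10, 15, 2), "host: event ID 9 (timeout)")]
switch_events = [
    (datetime(2003, 5, 1, 10, 17, 5), "switch: port 4 offline"),
    (datetime(2003, 5, 1, 11, 0, 0), "switch: configuration saved"),
]

print(correlate(host_events, switch_events, timedelta(minutes=2)))
print(correlate(host_events, switch_events, timedelta(0)))
```

With the offset applied, the port event lines up with the host timeout; without it, the same two events appear to be two minutes apart and the match is missed. Synchronizing the clocks removes the need for this correction altogether.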

Event notification recommendations:

There are methods available to configure Windows Servers or SAN management software to notify appropriate personnel when specific events occur.  A potential change recommendation might be to configure the Windows hosts or the SAN management software to notify specified personnel when specified events occur, so that critical information can be gathered and analyzed as soon as possible.  The following Microsoft Knowledge Base articles give more information about event management:

Q260527 Generating Notifications for an MSCS Resource Problem
Q320121 HOW TO: Configure the Size and Behavior of Event Viewer Logs in Windows
Q318763 HOW TO: Use the Event Log Management Script Tool (Eventlog.pl) to Manage
Q243625 How to Configure Administrative Alerts in Windows 2000

Third-party event management information:

Brocade: Fabric Watch Information
EMC: Navisphere Information
HP: SANworks storage management appliance Information
McData: SANavigator™ Management Software
QLogic: SANPoint™ Control for QLogic

Change Implementation

Changes in a SAN environment must be made only after carefully evaluating how the intended changes will affect the rest of the environment.  Generally, changes are performed during a planned maintenance window and after backing up any data that may be affected.  Take into account any unintended effects and plan accordingly.

For example, a change in firmware revision on a storage controller may trigger a Windows Plug and Play event on the next startup of the system.  This event may show up as nothing more than an entry in Setupapi.log, or it may be as drastic as drive letters or paths to targets changing.  Before implementing changes to a device attached to a SAN, it is a good idea to take a "snapshot" of the system(s) attached to the SAN.  The MPS_REPORTS utility can be run ahead of time and the output saved to a known location.  After the update, if there is a problem with one of the affected servers, MPS_REPORTS can be run again and the output compared with the earlier snapshot to determine exactly what has changed.  A good utility for comparing the output files is WINDIFF.EXE.  WINDIFF.EXE is a Resource Kit utility for Windows NT 4.0 and is included on the Windows CD in the Support Tools package for Windows 2000 and later.  For more information about WINDIFF.EXE, see the following Knowledge Base article:

159214 How to Detect and Compare File Differences
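Where WINDIFF.EXE is not at hand, the same before-and-after comparison can be sketched with Python's standard `difflib` module.  This is an illustrative stand-in rather than a replacement for the tools named above; the function `snapshot_changes` and the sample snapshot text are assumptions and do not reflect the actual MPS_REPORTS output format.

```python
import difflib


def snapshot_changes(before_text, after_text):
    """Return only the lines that were removed ("- ") or added ("+ ")."""
    diff = difflib.ndiff(before_text.splitlines(), after_text.splitlines())
    # ndiff also emits unchanged lines and "? " hint lines; drop both.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical snapshot contents, invented for illustration.
before = "Disk 0  C:\nDisk 1  E:\nDisk 2  F:"
after = "Disk 0  C:\nDisk 1  G:\nDisk 2  F:"

for change in snapshot_changes(before, after):
    print(change)
```

In practice the two strings would be read from the saved snapshot file and the file produced after the update, and an empty result would indicate that nothing in that file changed.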

In the SAN fabric, a change in firmware will also cause a device to log out of the fabric during the update, then log back in after the update process is complete.  The logout and login of the device will trigger a Registered State Change Notification (RSCN) to be sent to devices configured to be notified of state changes on the fabric.  Depending on the configuration of the device, an RSCN may be completely transparent to systems attached to the SAN, or in the worst case will cause the systems to re-query the Name Server for changes.

The RSCN process should not be disruptive, but it is something that should be anticipated and investigated should problems occur as the result of a firmware update.

Monitoring:

After implementing the recommended changes, the next troubleshooting step is to monitor the host(s), node(s), or SAN for changes relative to previously noted events.  After a period of time it may be a good idea to gather a new set of information to compare with the information gathered previously.  This may be as simple as running MPS_REPORTS again on the Windows host.
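One way to compare a new set of information with the earlier one is to count how often each kind of event appears in both.  The sketch below assumes the events have already been reduced to simple labels; the function `event_count_changes` and the labels are hypothetical.

```python
from collections import Counter


def event_count_changes(baseline, current):
    """Return {event: change in count} for events whose frequency changed."""
    before = Counter(baseline)
    after = Counter(current)
    return {
        event: after[event] - before[event]
        for event in sorted(set(before) | set(after))
        if after[event] != before[event]
    }


# Hypothetical event labels gathered before and after a change.
baseline = ["path failover"] * 5 + ["device timeout"] * 2
current = ["path failover"] + ["link reset"]

print(event_count_changes(baseline, current))
# prints: {'device timeout': -2, 'link reset': 1, 'path failover': -4}
```

A negative value suggests an event became less frequent after the change, while a new label with a positive value is worth investigating.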

Based on the results of monitoring and the changes previously implemented, this may be the last step in troubleshooting.  Otherwise, the troubleshooting effort returns to the information gathering and analysis phases.

Follow-up:

The follow-up phase of SAN troubleshooting involves documenting findings, changes, and conclusions.  Depending on the outcome of the troubleshooting, you may be able to create an article documenting the issue and resolution.  Otherwise the information must be maintained for historical purposes.  The information presented in a follow-up must be as brief as possible, yet still preserve the technical aspects of the issue and resolution.  Detailed information might be included with the follow-up in the form of attachments, diagrams, actual logs, and other materials.

Good follow-up and closure to a SAN troubleshooting issue can be very helpful to interested parties and others involved in troubleshooting SAN issues.  Support professionals may not have to spend as much time on subsequent SAN issues when similar issues and their resolutions are already documented.  Additionally, some aspects of the issue and resolution may be made available to the vendors involved in the issue for their support staff.  The collaborative nature of SAN issues requires a sharing of knowledge for continued success in supporting SAN issues.

 
