SAN Troubleshooting, Rene Burema, Brocade Communications, March 2008


SAN Troubleshooting

Rene Burema, Brocade Communications

March, 2008


SAN Troubleshooting Basics 2March 2008 ® 2008 Brocade Communications Systems, Inc. All rights reserved.

Product Knowledge is Valuable

• Problem determination requires you to be able to identify

– Products, associated port numbers, and LED status

– Switch and port status

– License requirements

– Related compatibility information

• Available resources include

– Brocade FOS Documentation

– Brocade Connect and/or Brocade Partner Sites

– Training materials including Products, FRUs and LEDs (Web-based training module associated with this course)

– Brocade switch provider information including compatibility matrices


Common SAN Problems

Many common SAN problems are related to the following (in alphabetical order):

• Configuration - Port, device, switch is not correctly configured

– Problems accessing a switch or connecting switches or end devices can be related to configuration problems

• Firmware Download - FTP configuration and release.plist confusion

• Licensing - Customers do not have the license to do what they are attempting

– Problems connecting switches can be related to licensing problems

• Marginal Links - Bad or marginal cables/GBICs/SFPs

– Problems related to performance or problems that occur when connecting switches or end-devices can be related to marginal links

• Zoning - Zoning is not configured correctly

– Problems that occur when end-devices are not able to access each other can be related to zoning


What does the switch status tell you?


What can port status LEDs tell you?


Adding/Replacing a Switch in a Fabric and Resolving Fabric Segmentations


When to Add or Replace a Switch

• Faulty hardware

– Components on a switch that are not FRUs

– Motherboard, including FC ports

– Damaged chassis

• Upgrading to new hardware

– 2 Gbit/sec to 4 Gbit/sec

– Port density

– Increased availability

– New features: FCR, FCIP, iSCSI

– Replacing EOL hardware

• Growing your fabric

– Increased port density per switch

– Increased number of switches

• Whenever your switch provider recommends


Adding or Replacing a Switch

• Any switch added to an existing fabric must be configured properly

• LAN configuration information

• Fabric configuration information

• Your configuration plan should include a checklist that answers the following questions:

– Special port configurations required?

– Are the correct license keys installed?

– What versions of firmware are running in the fabric?

– Will you be using any additional capabilities, e.g. ACLs, ADs, FCIP?


Adding or Replacing a Switch (cont.)

• Clear previous configuration from the switch

– Zoning: cfgdisable; cfgclear; cfgsave

– Switch configuration: configdefault

• Gather all required information for new or replacement switch using a switch connection checklist

• Configure new or replacement switch to join an existing fabric


Methods for Configuring

• Use appropriate Fabric OS commands or Web Tools to configure the new or replacement switch

• Use the configdownload command to copy a previously saved backup file to a new or replacement switch, or to restore a configuration to an existing switch

• Fabric Manager baseline utility can copy the configuration of another switch or a previously saved configuration file


Merging Two Fabrics

• Successful merge will create a single fabric with four switches


Fabric Segmentation

• Fabric segmentation is generally caused by one of the following conditions:

1. Licensing problems: Switches segment due to value line license limitations

2. Zoning conflicts: The zoning configuration in both fabrics cannot be merged

3. Admin Domain (AD) conflict: The AD configuration and/or AD zoning configurations cannot be merged

4. Fabric parameters conflict: fabric.ops parameters do not match

5. Port parameters conflict: ISL port settings are not compatible. FCIP tunnel settings must match.

6. Domain ID overlap: Two or more switches have the same domain ID

7. Access Control List (ACL): If configuration is strict, all switches must comply

• In addition, all switches in a fabric with user-defined ADs 1-254, ACLs, and/or a zoning database size greater than 256K must support the Reliable Commit Service (RCS) protocol


Identify Fabric Segmentations

Primary sources for identifying fabric segmentations

• switchshow output

– E_Port state will identify the state of all E_Ports; possible segmentation errors are: Domain Overlap, Zone Conflict, or Op Mode Incompatible

• Switch error logs

– errshow and errdump will capture fabric segmentation events

• fabstatsshow output

– Lists all the criteria that are exchanged during the ELP process and flags any parameter that is mismatched between the two switches

• Fabric Manager

– Fabric merge check will identify a fabric segmentation cause


switchshow Output

RSL1_ST01_B20:admin> switchshow

switchName: RSL1_ST01_B20

switchState: Online

switchMode: Native

switchRole: Principal

switchDomain: 1

switchId: fffc01

switchWwn: 10:00:00:05:1e:02:12:2c

zoning: ON (lab1)

Area Port Media Speed State

==============================

0 0 id N4 Online F-Port 10:00:00:00:c9:53:c6:c5

1 1 id N2 Online E-Port segmented, (domain overlap) (Trunk master)
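The check on the switchshow output can be scripted once the output is captured as text. The following is a minimal sketch, assuming each port line has the five columns shown above (area, port, media, speed, state) followed by free-form text; the regular expression and the function name `segmented_ports` are illustrative assumptions, not part of Fabric OS.

```python
import re

# Two port lines modeled on the switchshow output above.
SWITCHSHOW = """\
0 0 id N4 Online F-Port 10:00:00:00:c9:53:c6:c5
1 1 id N2 Online E-Port segmented, (domain overlap) (Trunk master)
"""

# Area, port, media, speed, state, then the free-form remainder of the line.
PORT_LINE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s*(.*)$")


def segmented_ports(text):
    """Return (port, reason) pairs for every port reported as segmented."""
    found = []
    for line in text.splitlines():
        match = PORT_LINE.match(line)
        if not match:
            continue  # header, separator, or switch-level line
        port, rest = int(match.group(2)), match.group(6)
        if "segmented" in rest:
            # The first parenthesized phrase carries the segmentation reason.
            reason = re.search(r"\(([^)]*)\)", rest)
            found.append((port, reason.group(1) if reason else "unknown"))
    return found


print(segmented_ports(SWITCHSHOW))
```

The column layout can differ between Fabric OS releases and platforms, so the pattern would need checking against real output before use.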


Error Logs Capture Segmentation Events

RSL1_ST01_B20:admin> errshow -r

Fabric OS: v5.1.0c

2006/08/15-11:52:12, [FABR-1001], 204,, WARNING, RSL1_ST01_B20, port 1, domain IDs overlap

2006/08/15-11:45:57, [FABR-1001], 203,, WARNING, RSL1_ST01_B20, port 1, incompatible VC translation link init, ensure it is set to 1 (2)

2006/08/15-11:37:54, [FABR-1001], 202,, WARNING, RSL1_ST01_B20, port 1, Zone conflict

RSL1_ST10_B41:admin> errshow -r

Fabric OS: v5.2.0a

2007/01/31-12:50:27, [FABR-1001], 4,, WARNING, rsl1_st10_b41_1, port 8, ELP rejected by the other switch
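When the error log is long, the FABR-1001 entries can be filtered out of saved errshow text. This is a minimal sketch, assuming each entry sits on one line in the comma-separated layout shown above; the pattern and the name `segmentation_events` are illustrative assumptions.

```python
import re

# Entries copied from the errshow examples above, one per line.
LOG = """\
2006/08/15-11:52:12, [FABR-1001], 204,, WARNING,RSL1_ST01_B20, port 1, domain IDs overlap
2006/08/15-11:37:54, [FABR-1001], 202,, WARNING,RSL1_ST01_B20, port 1, Zone conflict
2007/01/31-12:50:27, [FABR-1001], 4,, WARNING,rsl1_st10_b41_1, port 8, ELP rejected by the other switch
"""

ENTRY = re.compile(
    r"^(?P<time>\S+),\s*\[(?P<msgid>[A-Z]+-\d+)\],\s*(?P<seq>\d+),,\s*"
    r"(?P<severity>\w+),\s*(?P<switch>[^,]+),\s*port (?P<port>\d+),\s*(?P<text>.*)$"
)


def segmentation_events(log_text):
    """Return (port, message) for every FABR-1001 entry in the log text."""
    events = []
    for line in log_text.splitlines():
        match = ENTRY.match(line.strip())
        if match and match.group("msgid") == "FABR-1001":
            events.append((int(match.group("port")), match.group("text")))
    return events


for port, message in segmentation_events(LOG):
    print(f"port {port}: {message}")
```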


fabstatsshow Output

RSL1_ST01_B20:admin> fabstatsshow

Description Count

-----------------------------------------

domain ID forcibly changed: 0

E_Port offline transitions: 7 (Last on port 14)

Reconfigurations: 6

Segmentations due to:

Loopback: 0

Incompatibility: 8 < Identifies mismatch

Overlap: 0

Zoning: 0

E_Port Segment: 0

What parameters would you compare next? fabric.ops
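The same reading of fabstatsshow can be automated: pick out the non-zero counters under "Segmentations due to" and map each to the next comparison. This is a sketch under the assumption that the output is available as text in the layout above; the `FOLLOW_UP` hints are taken from the other slides in this deck, and the function name is hypothetical.

```python
# Text modeled on the fabstatsshow output above.
FABSTATS = """\
E_Port offline transitions: 7 (Last on port 14)
Reconfigurations: 6
Segmentations due to:
Loopback: 0
Incompatibility: 8
Overlap: 0
Zoning: 0
E_Port Segment: 0
"""

# Next step per counter, as suggested elsewhere in these slides.
FOLLOW_UP = {
    "Incompatibility": "compare fabric.ops parameters (configshow)",
    "Overlap": "compare domain IDs (fabricshow)",
    "Zoning": "compare zone databases (cfgshow)",
}


def nonzero_segmentation_counters(text):
    """Return {reason: count} for segmentation counters greater than zero."""
    counters = {}
    in_section = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Segmentations due to"):
            in_section = True
            continue
        if in_section and ":" in line:
            name, _, value = line.partition(":")
            fields = value.split()
            if fields and fields[0].isdigit() and int(fields[0]) > 0:
                counters[name.strip()] = int(fields[0])
    return counters


for reason, count in nonzero_segmentation_counters(FABSTATS).items():
    print(reason, count, "->", FOLLOW_UP.get(reason, "no hint in these slides"))
```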


Licensing Conflicts

• Switches can be purchased with value line licenses

– A value line 2 license enables the switch to exist in a two domain fabric

– A value line 4 license enables the switch to exist in a four domain fabric

• Prior to Fabric OS v3.1.2/4.2, value line licensed switches segmented when the fabric exceeded the allowable number of domains

• After Fabric OS v3.1.2/4.2, value line licensed switches in fabrics that exceed the allowable number of domains have a grace period

– The switch is allowed to join the fabric but Web Tools access is disabled after 45 days

– The following messages continuously display at the CLI even with quietmode on:

0x102b9f00 (tFcph): Jan 31 18:44:15 CRITICAL FABRIC-SIZE_EXCEEDED, 1, Critical fabric size (3) exceeds supported configuration (2). Switch status marginal. Contact Technical Support.

0x102b9f00 (tFcph): Jan 31 18:44:15 CRITICAL FABRIC-WEBTOOL_LIFE, 1, Webtool will be disabled in 44 days 23 hours and 50 minutes


Identify Zoning Conflicts

• There are three general types of zoning conflicts:

Type 1. Configuration mismatch: the enabled zone configurations are different

– Fabric A: cfgcreate "cfg4", "Red_Zone"

– Fabric B: cfgcreate "cfg4", "Red_Zone; Blue_Zone"

sw4100:admin> cfgshow

Defined configuration:

<truncated output>

Effective configuration:

cfg: cfg4

zone: Red_Zone; 1,4; 1,5

sw4900:admin> cfgshow

Defined configuration:

<truncated output>

Effective configuration:

cfg: cfg4

zone: Red_Zone; 1,4; 1,5

zone: Blue_Zone; 2,8; 2,11


Identify Zoning Conflicts (cont.)

Type 2. Type mismatch: The name of a zone object (alias, zone, or cfg) in one fabric is used for a different type of zone object in the other fabric

– Fabric A: alicreate "Device1", "1,1"

– Fabric B: zonecreate "Device1", "1,1; 2,3"

sw4100:admin> cfgshow

Defined configuration:

<truncated output>

alias: Device1 1,1

<truncated output>

Effective configuration:

No effective configuration

sw4900:admin> cfgshow

Defined configuration:

<truncated output>

zone: Device1 1,1; 2,3

<truncated output>

Effective configuration:

No effective configuration


Identify Zoning Conflicts (cont.)

Type 3. Content mismatch: The definition of a zone object in one fabric is different from that of a zone object with the same name in the other fabric (including the order of the zone members)

– Fabric A: zonecreate "Green_Zone", "1,1; 2,3"

– Fabric B: zonecreate "Green_Zone", "2,3; 1,1"

sw4100:admin> cfgshow

Defined configuration:

<truncated output>

zone: Green_Zone 1,1; 2,3

<truncated output>

Effective configuration:

No effective configuration

sw4900:admin> cfgshow

Defined configuration:

<truncated output>

zone: Green_Zone 2,3; 1,1

<truncated output>

Effective configuration:

No effective configuration
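The three conflict types can be expressed as a comparison of two zone databases. The sketch below is an illustration, not how Fabric OS performs the merge check: it assumes each database is a dictionary mapping an object name to its kind ("alias", "zone", or "cfg") and an ordered member list, and it treats a differing cfg object as a Type 1 configuration mismatch. All names in it are hypothetical.

```python
def zoning_conflicts(fabric_a, fabric_b):
    """Classify name collisions between two zone databases.

    Each database maps name -> (kind, ordered list of members).
    """
    conflicts = []
    for name in sorted(set(fabric_a) & set(fabric_b)):
        kind_a, members_a = fabric_a[name]
        kind_b, members_b = fabric_b[name]
        if kind_a != kind_b:
            conflicts.append((name, "type mismatch"))
        elif list(members_a) != list(members_b):
            # Order matters: the same members in a different order still conflict.
            label = "configuration mismatch" if kind_a == "cfg" else "content mismatch"
            conflicts.append((name, label))
    return conflicts


# The three examples from the slides, combined into one pair of databases.
FABRIC_A = {
    "cfg4": ("cfg", ["Red_Zone"]),
    "Device1": ("alias", ["1,1"]),
    "Green_Zone": ("zone", ["1,1", "2,3"]),
}
FABRIC_B = {
    "cfg4": ("cfg", ["Red_Zone", "Blue_Zone"]),
    "Device1": ("zone", ["1,1", "2,3"]),
    "Green_Zone": ("zone", ["2,3", "1,1"]),
}

for name, problem in zoning_conflicts(FABRIC_A, FABRIC_B):
    print(f"{name}: {problem}")
```

Objects that exist in only one fabric are ignored here, since they do not collide with anything in the other database.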


Identify Zoning Conflicts

• Begin by running the switchshow and errshow commands

– Segmentations caused by zoning conflicts are noted as such

sw4100:admin> errshow -r

Fabric OS: v5.1.0c

2006/08/15-11:37:54, [FABR-1001], 202,, WARNING, sw4100, port 1, Zone conflict

• To identify the cause of a zoning conflict, perform the following actions on both fabrics:

– Display the current zone configuration (cfgshow)

– Review the zone configurations for configuration, type, and content mismatches

– Verify that the Advanced Zoning license is installed (licenseshow)


Identify Zoning Conflicts (cont.)

• Use the Fabric Manager 5.2 Fabric Merge check to analyze zoning conflicts, and the offline zoning management tool to correct them

– Copy the existing zoning configuration from an installed switch, and push it to the new switch.

• defzone - check this setting before you connect

sw4100:admin> defzone --show

Default Zone Access Mode

committed - No Access

transaction - No Transaction


Resolve Zoning Conflicts

• Use Web Tools or zone editing commands to resolve the mismatches (ali*, cfg*, zone*, defzone*)

• To prevent zone conflicts, clear the zoning database on the new/replacement switch: cfgdisable, cfgclear, cfgsave

– Set defzone parameters to match existing fabric

• Use Fabric Manager 5.2+ offline zoning capabilities


Incompatible Switch Parameters

• Incompatible switch parameters are reported as incompatibility

• To verify the flow control settings without disrupting the fabric, run the configshow command in both fabrics and look at the fabric.ops parameters:

– R_A_TOV – fabric.ops.R_A_TOV

– E_D_TOV – fabric.ops.E_D_TOV

– Data field size – fabric.ops.dataFieldSize

– Disable device probing – fabric.ops.mode.fcpprobedisable

– Suppress class F traffic – fabric.ops.mode.noClassF

– Per-frame route priority – fabric.ops.UseCsCtl

– BB credit – fabric.ops.BBcredit

– Interop mode – switch.interopMode

– PID format – fabric.ops.mode.pidFormat

– Long distance – fabric.ops.mode.longDistance

• You can also review these values by uploading the switch configuration file with the configupload command or Fabric Manager baseline
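Comparing the two configshow captures by eye is error-prone, so a small diff helps. This is a minimal sketch, assuming the captured text uses one `key:value` pair per line; the sample values below are invented for illustration, and the function names are hypothetical.

```python
def parse_fabric_params(text):
    """Collect fabric.ops.* and switch.interopMode settings from key:value lines."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and (key.startswith("fabric.ops.") or key == "switch.interopMode"):
            params[key] = value.strip()
    return params


def fabric_param_mismatches(config_a, config_b):
    """Return {key: (value_a, value_b)} for every setting that differs."""
    a, b = parse_fabric_params(config_a), parse_fabric_params(config_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Invented sample values; only the parameter names come from the slide.
FABRIC_A_CFG = """\
fabric.ops.R_A_TOV:10000
fabric.ops.E_D_TOV:2000
fabric.ops.mode.pidFormat:1
"""
FABRIC_B_CFG = """\
fabric.ops.R_A_TOV:10000
fabric.ops.E_D_TOV:2000
fabric.ops.mode.pidFormat:2
"""

print(fabric_param_mismatches(FABRIC_A_CFG, FABRIC_B_CFG))
```

A parameter that is missing on one side shows up with `None` for that side, which is worth investigating in its own right.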


Incompatible Switch Parameters (cont.)

• To change these values at the command line (disruptively):

• First, disable the switch (switchdisable)

• Next, use the Fabric parameters menu in the configure command

sw4100:admin> switchdisable; configure

Configure...

Fabric parameters (yes, y, no, n): [no] yes

Domain:(1..239) [1]

R_A_TOV: (4000..120000) [10000]

E_D_TOV: (1000..5000) [2000]

WAN_TOV: (0..30000) [0]

MAX_HOPS: (7..19) [7]

Data field size: (256..2112) [2112]

Sequence Level Switching: (0..1) [0]

Disable Device Probing: (0..1) [0]

Switch PID Format: (1..2) [2] 1

Per-frame Route Priority: (0..1) [0]

BB credit: (1..16) [16]

• Finally, re-enable the switch (switchenable)


Incompatible Port Parameters

• Port-level parameters will cause a segmentation if not set to the same values:

– Basic connections: Port speed, type, licensed, and enabled

– Long-distance connections: Long distance mode, VC Link Init, ISL R_RDY mode, and FCIP tunnel configurations

• Verify the current settings by running the portcfgshow command

rsl1_st10_b41_1:admin> portcfgshow 8

Area Number: 8

Speed Level: AUTO

Trunk Port ON

Long Distance LS

VC Link Init ON

Desired Distance 40 Km

Locked L_Port OFF

Locked G_Port OFF

Disabled E_Port OFF

ISL R_RDY Mode OFF

RSCN Suppressed OFF

Persistent Disable OFF

NPIV capability ON

Mirror Port OFF


Incompatible Port Parameters (cont.)

• Fabric OS v5.2 Extended Fabrics long-distance modes were revised:

– Modes L0, LE, LD, and LS are supported and can be configured on any FC port

– Modes L0.5, L1, and L2 are supported, but cannot be configured

• When upgrading from Fabric OS v5.1 to v5.2, what happens to ports set to mode L0.5, L1, or L2?

– The long-distance mode is still displayed in command line output (switchshow, etc.), but modes L0.5, L1, and L2 cannot be configured

– To change the distance on these ports, use mode LD or LS

• When connecting a Fabric OS v5.2 switch to a pre-Fabric OS v5.2 switch, both ports on the link must have the same mode

– Result: Use mode LS or LD


Incompatible Port Parameters (cont.)

• Change these settings with the following commands:

– Port speed: portcfgspeed

– Reset to defaults: portcfgdefault

– Port type (L_Port only): portcfglport

– Port type (E_Port or F_Port only): portcfggport

– Port type (E_Port disabled): portcfgeport

– Port disable/enabled: portdisable, portenable

– Port persistently disabled/enabled: portcfgpersistentdisable, portcfgpersistentenable

– Long-distance mode, VC link initialization: portcfglongdistance

– ISL R_RDY mode: portcfgislmode

• Verify settings are the same by invoking portcfgshow on both switches and comparing output
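The portcfgshow comparison can be done mechanically once both outputs are saved. This sketch assumes the single-port layout shown on the earlier slide, where each line starts with a known label followed by its value; the label list is copied from that slide, the remote values are invented, and the function names are hypothetical.

```python
# Labels copied from the portcfgshow output shown earlier.
LABELS = [
    "Speed Level", "Trunk Port", "Long Distance", "VC Link Init",
    "Desired Distance", "Locked L_Port", "Locked G_Port",
    "Disabled E_Port", "ISL R_RDY Mode",
]


def parse_portcfg(text):
    """Map each known label to the value that follows it on its line."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        for label in LABELS:
            if line.startswith(label):
                settings[label] = line[len(label):].lstrip(": ").strip()
    return settings


def port_mismatches(local_text, remote_text):
    """Return {label: (local, remote)} for settings that differ across the link."""
    local, remote = parse_portcfg(local_text), parse_portcfg(remote_text)
    return {
        label: (local.get(label), remote.get(label))
        for label in LABELS
        if local.get(label) != remote.get(label)
    }


LOCAL = "Speed Level: AUTO\nLong Distance LS\nVC Link Init ON\nISL R_RDY Mode OFF\n"
REMOTE = "Speed Level: AUTO\nLong Distance LD\nVC Link Init ON\nISL R_RDY Mode OFF\n"  # invented

print(port_mismatches(LOCAL, REMOTE))
```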


Domain ID Conflicts


Domain ID Conflicts (cont.)

• Duplicate domain IDs are reported as Domain Overlap or Overlap. To resolve domain ID conflicts, follow these steps:

– In each fabric, display the assigned domain IDs with the fabricshow or switchshow command

– Review the command output, and determine which switches must have their domain ID changed

– Disable the switch (switchdisable), run the configure command to change the domain ID manually, then enable the switch (switchenable)

– The switch will now join the fabric with the unique domain ID you assigned

• Option: set Insistent domain ID (required for FICON)
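The review step amounts to a set intersection over the domain IDs of the two fabrics. The sketch below assumes the IDs and switch names have already been read from fabricshow into dictionaries; the switch names are hypothetical, and the 1..239 range is the one shown in the configure dialog.

```python
def overlapping_domains(fabric_a, fabric_b):
    """Return {domain_id: (switch_in_a, switch_in_b)} for IDs used in both fabrics."""
    return {
        domain: (fabric_a[domain], fabric_b[domain])
        for domain in sorted(set(fabric_a) & set(fabric_b))
    }


def free_domain_ids(fabric_a, fabric_b):
    """List domain IDs in the valid range 1..239 that neither fabric uses."""
    used = set(fabric_a) | set(fabric_b)
    return [domain for domain in range(1, 240) if domain not in used]


# Hypothetical fabrics: domain ID -> switch name.
FABRIC_A = {1: "sw4100", 2: "sw4900"}
FABRIC_B = {1: "sw200", 3: "sw3850"}

print(overlapping_domains(FABRIC_A, FABRIC_B))
print(free_domain_ids(FABRIC_A, FABRIC_B)[:3])
```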


End Device Troubleshooting


Run supportsave Before and After

• Run supportsave as soon as you experience a problem in your SAN

– Critical data will be captured if supportsave is run right away

– Run supportsave prior to all problem determination steps

– If you are unable to resolve the problem, run supportsave again

• If you have to escalate the problem, send the escalation team both supportsave files


End Device Troubleshooting

End device troubleshooting requires answering the following questions:

• Is there light from the host or device? A powered-off or failed device may not provide light. Without light there will never be a login.

• Does the switch port speed configuration match the attached device speed configuration? Devices and switch ports typically autonegotiate. Verify that the switch port is not locked to a speed the device cannot handle.

• Are the transmission characters synchronized with the switch port?

• How far has the login process progressed? Did the device log in properly as a loop and/or fabric device?

• Are the FOS v5.2 ACLs, specifically Device Connection Control (DCC) policies, preventing the device from receiving a response to a login?


End Device Troubleshooting (cont.)

• With the maturation of Fibre Channel, most devices log in as point-to-point via a Fabric Login (FLOGI). Has this occurred?

– Even if the device logs in as loop, it should still proceed to the FLOGI stage to get a Public Loop Address (24-bit address)

• If the end device logs in as loop or Fabric, it will be assigned a 24-bit address

– Until then, it has no source ID (SID) with which to initiate communication in the fabric
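A 24-bit Fibre Channel address is three bytes: domain, area, and port (the AL_PA for loop devices). The following sketch splits an address written as six hex digits into those fields; the function name and the sample addresses are illustrative assumptions.

```python
def split_fc_address(pid):
    """Split a 24-bit FC address (hex string) into (domain, area, port) bytes."""
    value = int(pid, 16)
    if not 0 <= value <= 0xFFFFFF:
        raise ValueError(f"not a 24-bit address: {pid!r}")
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


# Sample addresses chosen for illustration.
print(split_fc_address("010400"))  # domain 1, area 4, port byte 0
print(split_fc_address("0208e8"))  # domain 2, area 8, port byte 0xe8
```

On the switch used in the earlier switchshow example (domain 1), a device on area 4 would therefore be expected to have an address starting with 0104.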


End-to-End Device Connectivity

Use LLFD to Divide and Conquer


End-to-End Device Connectivity (cont.)

Link, Login, Fabric, Devices

Link – Physical and logical connection of device to switch

• Transmission of light/signal

• Negotiation of speed

• Synchronization of characters and words

– Loop/Fabric initialization primitives

Login – Device to switch connectivity

• FLOGI to Fabric Port (FFFFFE)

• Security Policy Check

– Device Connection Control policy (DCC_POLICY) Access Control List (ACL)

– Switch responses:

  • Accept: Assign fabric unique 24-bit address

  • No response: Do not assign fabric address

• Port Login (PLOGI) to Name Server (FFFFFC)


End-to-End Device Connectivity (cont.)

Link, Login, Fabric, Devices

Fabric

• Name Server Registration (FFFFFC)

– Device registers to local Name Server

– Name Server is distributed within the fabric

– If user-defined Virtual Fabric Admin Domains (ADs) are enabled, the Name Server will only show devices within the current AD

  • AD255 is the Physical Fabric view

  • AD0-AD254 will have a filtered view of the Name Server

• Device attribute data may be registered:

– Device Model and Vendor

– Firmware and Driver revisions

– Host name

• SCR and RSCN to Fabric Controller (FFFFFD)

– Initiators register using State Change Registration (SCR)

– Initiators are notified of Name Server changes through Registered State Change Notifications (RSCNs)


End-to-End Device Connectivity (cont.)

Link, Login, Fabric, Devices

Devices

• Initiator queries Name Server for available devices

– Response contains devices within the effective zone configuration

– FC devices are Type 8 (FCP)

– Devices must successfully be logged into the fabric to exist within the Name Server

– Initiators are zoned with targets

• Initiator PLOGI to each target device, based upon Name Server query results

• Process Login (PRLI) from initiator to target(s)

– Provides the end-to-end connectivity for device communication

• Issue Report LUNs and Inquiry to each available device
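The Link, Login, Fabric, Devices sequence lends itself to a divide-and-conquer checklist: walk the steps in order and stop at the first one that has not been confirmed. The sketch below is one possible encoding of the steps listed on these slides; the step names and the `first_unverified_step` function are illustrative choices, not a Brocade tool.

```python
# Steps in the order a device works through them, grouped by LLFD layer.
LLFD_STEPS = [
    ("Link", "light/signal"),
    ("Link", "speed negotiation"),
    ("Link", "character and word synchronization"),
    ("Login", "FLOGI to Fabric Port (FFFFFE)"),
    ("Login", "PLOGI to Name Server (FFFFFC)"),
    ("Fabric", "Name Server registration"),
    ("Fabric", "SCR to Fabric Controller (FFFFFD)"),
    ("Devices", "Name Server query"),
    ("Devices", "PLOGI to target"),
    ("Devices", "PRLI to target"),
    ("Devices", "Report LUNs and Inquiry"),
]


def first_unverified_step(verified):
    """Return the first (layer, step) not in `verified`, or None if all passed."""
    for layer, step in LLFD_STEPS:
        if step not in verified:
            return layer, step
    return None


# Example: the link is up and the device has completed FLOGI.
verified = {
    "light/signal",
    "speed negotiation",
    "character and word synchronization",
    "FLOGI to Fabric Port (FFFFFE)",
}
print(first_unverified_step(verified))
```

Because each layer depends on the one before it, the first unverified step is where troubleshooting effort should go next.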


Troubleshooting End-to-End Device Connectivity

Start at the switch

• The switch contains a wealth of information concerning the condition of the fabric:

– Devices that are logged into the fabric

– Devices registered within the Name Server

– Which devices are within the same zone

Don’t forget about LUN Masking and Persistent Binding

• Storage array may implement LUN Masking

– Initiator WWN (Port or Node) presented to array properly?

– Correct LUNs made available to initiator by array?

• HBAs may use Persistent Binding to specify LUN WWN or 24-bit PID to OS device mapping

– Target LUN WWN (Port or Node) or PID specified correctly in host file(s)?

– May require entry for new or replaced target LUNs


Troubleshooting End-to-End Device Connectivity (cont.)

• If previous steps have been verified, there should be end-to-end device connectivity and communication

• If there is no communication between end devices, use CLI commands to determine where the problem exists. Verify connectivity through the SAN first.

• If everything looks correct from switch CLI commands, use storage and host specific message logs and commands to isolate problems to the end point (initiator or target)


Troubleshooting Starts with switchshow

• The first command to enter when you start troubleshooting is switchshow. It shows whether:

– Switch is online

– SFP is installed in each port

– Port licensing, e.g. Ports-On-Demand (POD)

– End devices are online

• For remote devices, there are several commands to choose from, but start with nscamshow

– Tells if remote devices are seen within the fabric

  • Name Server (ns*) commands are filtered by ADs in FOS v5.2+

  • If ADs are implemented, select AD255 (Physical Fabric View):

rsl1_st15_b20_1:admin> ad --select 255

• Next get a view of the fabric configuration with cfgshow

• …or just get a supportsave

– Super command script file. It gets all these commands and more!


Light/Signal

• Fibre Channel Layer 0 connectivity

– The actual light transmitted and received over FC cabling

– Use switchshow command to verify light/signal is being transmitted from a device. Use portflagsshow to see if LED is seen.

– Additionally use sfpshow to verify SFP is not faulty


Light/Signal (cont.)

Successful light (still no speed/synchronization) output examples

• Use output of switchshow, portshow, and portflagsshow to verify light is being received:


Link – Speed Negotiation

• Speed Negotiation

– Device and switch use special transmission characters to agree upon a transfer speed of 4 Gbit/sec, 2 Gbit/sec, or 1 Gbit/sec

– Speed negotiation starts with the highest possible speed and negotiates down until a speed is agreed upon or the lowest possible speed is attempted without success

• CLI output information associated with the port when speed negotiation is successful:

– switchshow: port speed will display the speed and State will display Online

– portshow: port speed will display configured or negotiated speed

– portflagsshow: Physical column output field will display No_Sync or In_Sync
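The negotiate-down behaviour can be modeled in a few lines. This is a deliberately simplified sketch of the outcome only, not the transmission-character exchange itself: it assumes each side is described by the set of speeds (in Gbit/sec) it is able or configured to run, and the function name is hypothetical.

```python
def negotiate_speed(switch_speeds, device_speeds):
    """Try speeds from highest to lowest; return the first both sides support.

    Returns None when even the lowest speed is attempted without success.
    """
    for speed in sorted(switch_speeds, reverse=True):
        if speed in device_speeds:
            return speed
    return None


# Auto-negotiating 4 Gbit/sec switch port with a 2 Gbit/sec HBA.
print(negotiate_speed({1, 2, 4}, {1, 2}))

# Switch port locked to 4 Gbit/sec with the same HBA: no common speed.
print(negotiate_speed({4}, {1, 2}))
```

The second call illustrates why a port locked to a speed the device cannot handle never reaches synchronization.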


Link – Speed Negotiation (cont.)

Unsuccessful Speed Negotiation

• switchshow <truncated output>

1 1 id 2G No_Sync

• portshow 1 | grep portSpeed

portSpeed: 2Gbps

• portflagsshow <truncated output>

1 Offline No_Sync PRESENT

Ensure port is set to default values:

• portcfgdefault 1

Or manually set port to auto negotiate speed:

• portcfgspeed 1 0


Physical Connectivity

• Physical connectivity between a device and a switch port includes light/signal, speed, and link negotiation processes

• After speed negotiation the connecting points have to synchronize

• Devices can get into a condition defined as marginal when they go into and out of sync

• Commands that help identify this issue include

– porterrshow

– The errshow output may also have relevant output

• Fabric Watch can greatly augment the event reporting found in the error log (RASLog)


Physical Connectivity (cont.)

porterrshow

• The porterrshow command is very helpful for getting a picture of all ports and their associated error and link related counters

• Using this information, you can quickly isolate problems down to a specific port

• A marginal link is defined as a degraded physical connection; it is not optimally passing data

– The porterrshow, portstatsshow, and portshow output display counters that help monitor marginal ports

– Symptoms include poor performance and occasional loss of connectivity

• A delta of the counters can help you isolate a problem to a port and/or the connected HBA or Storage device

– Note that you can clear the port counters using portstatsclear on a per-port/port-group basis (granularity is dependent on FOS version)

– The link counters cannot be cleared without a reboot
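Because the link counters cannot be cleared without a reboot, a delta between two samples is often the only way to see which counters are still rising. The sketch below assumes the counters have already been extracted from two porterrshow runs into dictionaries keyed by port; the sample numbers are invented, and the function name is hypothetical.

```python
def counter_deltas(before, after):
    """Return {port: {counter: increase}} for counters that rose between samples."""
    deltas = {}
    for port, counters in after.items():
        earlier = before.get(port, {})
        rising = {
            name: value - earlier.get(name, 0)
            for name, value in counters.items()
            if value - earlier.get(name, 0) > 0
        }
        if rising:
            deltas[port] = rising
    return deltas


# Invented samples taken some minutes apart.
BEFORE = {
    0: {"enc_in": 0, "crc_err": 0, "enc_out": 12},
    1: {"enc_in": 3, "crc_err": 1, "enc_out": 40},
}
AFTER = {
    0: {"enc_in": 0, "crc_err": 0, "enc_out": 12},
    1: {"enc_in": 57, "crc_err": 9, "enc_out": 40},
}

print(counter_deltas(BEFORE, AFTER))
```

Here port 0 has old errors that are no longer increasing, while port 1 is actively accumulating them and is the port to investigate.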


Physical Connectivity (cont.)

Use the porterrshow command for initial investigation of marginal links

portstatsclear can be used to clear the error statistics shown to the left of the dotted line. The other counters are cleared on a reboot/fastboot.


Physical Connectivity (cont.)

Granularity on ports with high error counters:

• porterrshow

– Less granularity

– Good for quickly identifying port(s) of interest

• portstatsshow

– Good for monitoring exact values of counters


Error Counters

Certain port counters can point to physical link layer issues:

• enc_in: This counter increments when 8b/10b encoding errors are detected within a frame. enc_in errors are always detected on the ingress port.

• crc_err: Indicates corruption within the frame. Always seen on ingress port but will be passed by the switch unaltered through the fabric (like a trail of bread crumbs).

• enc_in and/or crc_err = Possible bad media (SFP, cable, patch panel)


Error Counters (cont.)

• enc_out: 8b/10b encoding errors NOT associated with frames (IDLE, R_RDY, and various other primitives). This counter increments during speed negotiation prior to login. Locking a port to a speed supported by the end device can be used to isolate issues.

– Possible bad media (SFP, cable, patch panel)

– Can cause a performance problem due to buffer recovery

• disc_c3: Class 3 frame has been discarded because it is not routable to a destination address

– Corrupted or not-online Destination ID (DID)

– Timeout exceeded (Condor ASIC hold time exceeded)

– Counter may increment when FC nodes and/or switches rapidly transition between online and offline; look at fabriclog -s output (described in the Logical Connectivity slide later)
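The counter descriptions above can be kept as a small lookup so that rising counters translate directly into things to check. This is a sketch only: the hint text paraphrases these slides, and the `hints_for` function and its input format are assumptions.

```python
# Hints paraphrased from the Error Counters slides.
HINTS = {
    "enc_in": "encoding errors inside frames on the ingress port: "
              "check SFP, cable, patch panel",
    "crc_err": "frame corruption seen on ingress and passed on unaltered: "
               "check SFP, cable, patch panel",
    "enc_out": "encoding errors outside frames: check SFP, cable, patch panel; "
               "also increments during speed negotiation",
    "disc_c3": "Class 3 frame discarded: check the destination ID, hold-time "
               "timeouts, and fabriclog -s for online/offline cycles",
}


def hints_for(deltas):
    """Return (counter, hint) pairs for every known counter that is rising."""
    return [(name, HINTS[name]) for name in HINTS if deltas.get(name, 0) > 0]


# Invented counter increases for one port.
for name, hint in hints_for({"enc_in": 54, "crc_err": 0, "disc_c3": 2}):
    print(f"{name}: {hint}")
```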


Link Counters

These are point-to-point errors; they do not propagate through the fabric

• Link failures - error conditions that cause a port to drop out of an active state

– Requires the reconnecting device to FLOGI back into fabric (No speed negotiation required, since the device does not lose synchronization)

• Loss of sync - occurs when bit and word synchronization on link is lost

• Loss of signal - occurs when light or an electrical signal is lost on a link

– Require connected device to renegotiate speed and FLOGI back into fabric

• If you experience device connectivity and/or performance issues and rising link counters, look for

– bad cables/SFPs/patch-panel connections

– repeating cycles of online/offline states in fabriclog -s output

Device Initialization into Fabric

Device Initialization - Port Configuration

Device initialization could be affected by port configuration

• portcfgshow – display port status

Port Configuration (cont.)

• switchshow – display login status; F/L/E or G:

1 1 id N1 Online G-Port

• portcfglport – Lock port to L-Port to force Loop Initialization prior to FLOGI

portcfglport <port> <0|1>

• portcfggport – Lock to G-Port if HBA/storage has difficulties negotiating initial Loop Initialization

portcfggport <port> <0|1>

• portcfg mirrorport – A port configured as a mirror port will prevent HBA/storage login

portcfg mirrorport <[slot/]port#> --enable

– Disable mirror port configured to connect a device

portcfg mirrorport <[slot/]port#> --disable

Login Services

Three different levels of login:

• Fabric Login (FLOGI) is used by an N_Port or NL_Port (Nx_Ports) to establish service parameters with the switch

– The following information is implicitly captured and put into the Name Server during this process: type; COS; PID; PortName (port WWN); and NodeName (node WWN)

• N_Port Login (PLOGI) is used by one Nx_Port to establish service parameters with another N_Port or NL_Port

• Process Login (PRLI) is used by an upper-level process in one port to establish image pairs and service parameters with the corresponding upper-level process in the other port

– For example, it can be used to establish the environment between related SCSI processes on an originating Nx_Port and a responding Nx_Port

Fabric Login (FLOGI)

• When devices first connect, their address is 000000 (unless they are loop devices, in which case their address is 0000pp)

• FLOGI is required before any frame can be sent through the fabric

• FLOGI is sent to well-known address FFFFFE (Fabric F_Port)

Commands to Check FLOGI Status

• switchshow – A successful login displays an F_Port (including its WWN) or L_Port

• portshow – A successful login displays fabric viewpoint of device

– portFlags - a bit map and English translation of the port's login process

– portState - Online

– portPhys - In_Sync, receiving light and synchronized

– portId - 24-bit Fabric Address, port identifier (PID) of device

– portScn - F_Port; from the fabric's point of view, all end devices that successfully logged in are F_Ports

– port WWN(s) of connected device(s) - an F_Port will have one WWN; an FL_Port can have multiple WWNs

– Distance and Speed Configuration of the port

• portflagsshow – Lists the translation of all port login state flags; same as portshow portFlags output
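The 24-bit port identifier (PID) reported as portId can be split into its three bytes. The sketch below assumes the usual Fibre Channel layout of domain, area, and port (or AL_PA) bytes, which the slide does not spell out. The function name is hypothetical.

```python
def split_pid(pid):
    """Split a 24-bit PID into (domain, area, port) bytes.

    Assumes the conventional layout: bits 23-16 domain, 15-8 area, 7-0 port/AL_PA.
    """
    if not 0 <= pid <= 0xFFFFFF:
        raise ValueError("PID must fit in 24 bits")
    return (pid >> 16) & 0xFF, (pid >> 8) & 0xFF, pid & 0xFF


if __name__ == "__main__":
    # PIDs taken from the fcping and nszonemember examples in this deck.
    for pid in (0x0A0100, 0x1400E8):
        domain, area, port = split_pid(pid)
        print(f"{pid:06x}: domain={domain} area={area} port=0x{port:02x}")
```

Reading the domain byte first is often enough to tell which switch a device is attached to.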

portshow

portstatsshow – BB Credit

portcamshow

• Hardware enforced – SID/DID zone tables are kept in ASIC

– portcamshow <port>

• Out of CAM Entries – Changes to Session-Based zoning

– Resource issue - not an actual error condition

• portzoneshow – undocumented/unsupported command

– Displays type of zoning (Hard, Session based) for each port

Logical Connectivity - fabriclog -s

• fabriclog -s supersedes the fabstateshow command

– Use it to check for port Online/Offline transitions:

– Port 1 transitioned from Offline to Online multiple times

– Check physical connectivity for bad cable, SFPs, patch-panel, etc.
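Spotting a port that repeatedly cycles between Offline and Online can be automated once the transitions have been extracted from the log. The sketch below deliberately does not parse real fabriclog -s output, since its exact layout is not reproduced in this deck. It assumes the events are already available as (port, state) pairs in time order, and all names are hypothetical.

```python
from collections import Counter


def count_online_transitions(events):
    """Count Offline -> Online transitions per port.

    events: iterable of (port, state) tuples in time order,
    where state is "Online" or "Offline".
    """
    last_state = {}
    transitions = Counter()
    for port, state in events:
        if last_state.get(port) == "Offline" and state == "Online":
            transitions[port] += 1
        last_state[port] = state
    return transitions


def flapping_ports(events, threshold=2):
    """Return ports with at least `threshold` Offline -> Online transitions."""
    counts = count_online_transitions(events)
    return sorted(port for port, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # Made-up event list for illustration.
    sample = [(1, "Offline"), (1, "Online"), (1, "Offline"), (1, "Online"), (2, "Online")]
    print(flapping_ports(sample))
```

The threshold of two is an arbitrary illustrative default; a suitable value depends on the time window being examined.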

Fabric – Name Server

Successful port login and registration to Name Server

– A port login (PLOGI) to the Name Server can be confirmed by looking at the Name Server information

– Verify using the nsshow command

– Unsuccessful port login means no information within the Name Server

Fabric – Name Server (cont.)

• Check for successful port login with the nsshow -t option, which shows whether the device is an Initiator or a Target

State Change Notification Services

• State Change Notification (SCN) - used for internal state change notifications, not external

– This is the switch logging that the port is online or is an Fx_port

– This is not sent from the switch to the Nx_ports!

• State Change Registration (SCR) – Nx_Port request to receive notification when something in the fabric changes

– FC Devices that choose to receive RSCNs must register for this service

• Devices send a State Change Registration (SCR) to FFFFFD

• Registration indicates that the device wants to be notified of changes

– Devices register after PLOGI to Name Server

• Registered State Change Notification (RSCN) - issued by the Fabric Controller Service or an Nx_Port to devices that registered (issued an SCR requesting this notification) – only sent to devices within an affected zone
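The order of logins and registrations described so far can be summarized as a sequence of requests and their destinations. This is a minimal sketch for an initiator, not a protocol implementation. FFFFFE and FFFFFD come from the slides. FFFFFC as the Name Server address is an assumption based on the standard Fibre Channel well-known addresses, and the function name is hypothetical.

```python
FABRIC_F_PORT = 0xFFFFFE      # from the slides: FLOGI destination
NAME_SERVER = 0xFFFFFC        # assumption: standard directory-service address
FABRIC_CONTROLLER = 0xFFFFFD  # from the slides: SCR destination


def initiator_login_sequence(target_pid):
    """Return ordered (request, destination) pairs for a hypothetical initiator."""
    return [
        ("FLOGI", FABRIC_F_PORT),    # obtain a fabric address
        ("PLOGI", NAME_SERVER),      # log in to the Name Server
        ("SCR", FABRIC_CONTROLLER),  # register for RSCNs (optional)
        ("PLOGI", target_pid),       # port login to the target
        ("PRLI", target_pid),        # process login (e.g. SCSI)
    ]


if __name__ == "__main__":
    for request, destination in initiator_login_sequence(0x1400E8):
        print(f"{request:5s} -> {destination:06x}")
```

When a host cannot see its storage, checking which of these steps was the last to succeed narrows the search considerably.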

Fabric Controller Services

• The Fabric Controller (FFFFFD) Service alerts devices that changes have occurred in the fabric by sending a Registered State Change Notification (RSCN) if:

– Device registered to receive RSCN using an SCR

– A new device has been added (within the same zone)

– An existing device has been removed (within the same zone)

– A zone has been changed

– A switch name or IP address changed

– The fabric reconfigured

• Registration is optional

– SCSI initiators normally register

– SCSI targets do not register

Changes Within the Fabric

• “Properly” written device drivers will do the following in response to an RSCN:

– Query the Name Server for changes related to devices they are (or were) currently logged into

– Initiate a port login for any new devices the Name Server has notified them of within their Virtual Fabric zoning configuration

• Sometimes it isn’t a device driver issue. Applications can fail if their I/O is not satisfied quickly. (“Quickly” is a relative term.)

– If necessary, FOS gives the ability to suppress RSCNs per port:

Device Identification Commands

• Use switchshow, nsshow, nscamshow, nsallshow, and nodefind to identify devices in the fabric

• nsallshow lists all 24-bit PID addresses within the current fabric (Name Server view of current AD)

• nodefind lists Name Server information for:

– Specified Alias

– Specified WWN

– Specified PID address

Devices - End-to-End Connectivity

• End-to-end device connectivity communication could be blocked on the switch by:

– Zoning

– AD configuration

– Commands to check include: fcping, cfgshow, and ad --show

• End-to-end device connectivity flow

– Nx_Port to Nx_Port communication

– Initiator to target (similar to SCSI model)

– PLOGI/PRLI from Nx_Port to Nx_Port

• Name Server Query

– Initiators learn about “devices of interest”, based upon FC4 layer type (5 or 8): where 8 = FCP/SCSI, 5 = IP over FC

End-to-End Device Connectivity (cont.)

Use the fcping command to check for end device connectivity and zoning

• Response when device is not online:

rsl1_st15_b41_1:admin> fcping 0x1400e8 0x0a0100

fcping: Error destination port invalid

• Response when devices are online, but one does not respond to the fcping ELS ECHO frame:

rsl1_st15_b20_1:admin> fcping 0x0a0000 0x1400e2

Source: 0xa0000

Destination: 0x1400e2

Zone Check: Not Zoned

Pinging 0xa0000 with 12 bytes of data:

received reply from 0xa0000: 12 bytes time:650 usec

<truncated output>

5 frames sent, 5 frames received, 0 frames rejected, 0 frames timeout

Round-trip min/avg/max = 567/618/674 usec

Pinging 0x1400e2 with 12 bytes of data:

Request timed out

<truncated output>

5 frames sent, 0 frames received, 0 frames rejected, 5 frames timeout

Round-trip min/avg/max = 0/0/0 usec
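The summary lines in the fcping output above lend themselves to a simple check. The following sketch parses a summary line of the form shown in the example and reports whether every frame was answered. It assumes the line layout matches the sample exactly, and the function names are hypothetical.

```python
import re

# Pattern based on the summary line shown in the fcping example above.
SUMMARY_RE = re.compile(
    r"(\d+) frames sent, (\d+) frames received, "
    r"(\d+) frames rejected, (\d+) frames timeout"
)


def parse_fcping_summary(line):
    """Return the four frame counts as a dict, or None if the line does not match."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    sent, received, rejected, timeout = map(int, match.groups())
    return {"sent": sent, "received": received, "rejected": rejected, "timeout": timeout}


def responded(summary):
    """True when at least one frame was sent and all of them were answered."""
    return summary["sent"] > 0 and summary["received"] == summary["sent"]


if __name__ == "__main__":
    line = "5 frames sent, 0 frames received, 0 frames rejected, 5 frames timeout"
    summary = parse_fcping_summary(line)
    print(summary, "responded" if responded(summary) else "no response")
```

A device that times out here may still be healthy; as the slide notes, some devices simply do not respond to the ELS ECHO frame.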

Device to Device Login

• Don’t forget, devices do not only log into the fabric. Initiators will initiate PLOGIs and PRLIs to other end devices after:

– Each device is Online in the switch database

– Each device has registered with the Name Server

– Devices are zoned together and within the same Virtual Fabric Administrative Domain (AD)

• The mechanism for devices to log in to each other through PLOGI is the same as used for device to switch login

• The switch acts as a “middle-man”

– Passing PLOGI/PRLI requests and ACCEPT responses

or

– Discarding such requests if the devices are not zoned together or in the same AD

Port Configuration – End-to-End

Check port configuration for end-to-end device connectivity

• Use nszonemember as a final step to verify that:

– End devices have logged into Name Server, are Online, and are zoned together within the same AD

rsl1_st15_b20_1:admin> nszonemember 0x0a0100

1 local zoned members:

Type Pid COS PortName NodeName SCR

N 0a0100; 2,3;10:00:00:00:c9:22:1f:23;20:00:00:00:c9:22:1f:23; 3

FC4s: FCP

NodeSymb: [30] "Emulex LP8000 FV3.90A7 DV6.02h"

Fabric Port Name: 20:01:00:05:1e:02:0c:77

Permanent Port Name: 10:00:00:00:c9:22:1f:23

Device type: Physical Initiator

Port Index: 1

Share Area: No

Device Shared in Other AD: No

…output continued on next slide

Port Configuration – End-to-End (cont.)

Check port configuration for end-to-end device connectivity

(nszonemember 0x0a0100 output continued…)

1 remote zoned members:

Type Pid COS PortName NodeName

NL 1400e8; 3;21:00:00:04:cf:92:6a:58;20:00:00:04:cf:92:6a:58;

FC4s: FCP

PortSymb: [28] "SEAGATE ST318452FC 0004"

Fabric Port Name: 20:00:00:05:1e:02:aa:7b

Permanent Port Name: 21:00:00:04:cf:92:6a:58

Device type: Physical Target

Port Index: 0

Share Area: No

Device Shared in Other AD: No

• Verifies end-to-end zoning within the fabric

When to use an Analyzer?

• When all devices are logged into the fabric, zoning is configured properly, and hosts do not see their targets

• When there are I/O disruptions that cannot be isolated with RASLog (errdump) or porterrshow/portstatsshow

• When a problem exists within the payload of a transfer

• To monitor the health of a system for error statistics and performance problems (the switch also has relevant built-in diagnostic capabilities)

• To diagnose protocol problems

– A complete look at the FC header and payload

– Capture end-to-end protocol information (including ULPs)

• To troubleshoot extended Fabric communication

– An FC analyzer can be installed between the switch and the gateway at each end

• Is the transmission the same as the reception?

• Can bit – char – word sync be established?

Port Mirroring - Configuration

• Decide location of mirror port; on same ASIC as SID or DID port

• Log in to the physical fabric using an Admin role account

• Follow these steps to use port mirroring to capture an FC analyzer trace:

1. Configure the port as a mirror port by invoking the following command:

portcfg mirrorport <[slot/]port#> --enable

• Verify the configuration, invoke portcfgshow <[slot/]port#> and switchshow

2. Connect an FC Analyzer to the mirror port and verify that it comes online

3. Configure port mirroring connection between the SID & DID through the mirror port

portmirror --add <mirrorportnumber> <SourceID> <DestID>

• The mirror port must be online

• Verify mirror connection, invoke portmirror --show

4. Start FC Analyzer capture, reproduce problem, stop capture and review output

5. Remove the port mirror connection with the portmirror --delete command:

portmirror --delete <mirrorportnumber> <SourceID> <DestID>

6. Remove the mirror port configuration (to allow other connections to this port):

portcfg mirrorport <[slot/]port#> --disable
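The six steps above pair each setup command with a matching teardown command, which is easy to get wrong under pressure. As a rough sketch, the helper below only builds the command strings from the procedure for a given mirror port, source ID, and destination ID. It does not connect to a switch, and the function name and sample values are hypothetical.

```python
def mirror_commands(mirror_port, source_id, dest_id):
    """Return (setup, teardown) lists of CLI command strings from the procedure above."""
    setup = [
        f"portcfg mirrorport {mirror_port} --enable",
        f"portcfgshow {mirror_port}",  # verify the configuration
        "switchshow",
        f"portmirror --add {mirror_port} {source_id} {dest_id}",
        "portmirror --show",           # verify the mirror connection
    ]
    teardown = [
        f"portmirror --delete {mirror_port} {source_id} {dest_id}",
        f"portcfg mirrorport {mirror_port} --disable",
    ]
    return setup, teardown


if __name__ == "__main__":
    # Illustrative values only.
    setup, teardown = mirror_commands("2/5", "0x0a0100", "0x1400e8")
    print("\n".join(setup + ["# capture trace here"] + teardown))
```

Generating both lists together makes it harder to leave a port configured as a mirror port, which would otherwise keep blocking device logins on it.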

Gathering Switch Support Data for Problem Determination and Escalation

Switch Support Data - Overview

• Up to this point, we have gathered details about a switch by running one CLI command at a time

• For long-term support of a switch, we need to begin gathering switch support data

– Larger, file-oriented data that provides a broader view of the switch

– Configuration of parameters

– State of FRUs and ports, both currently and in the past

• There are several different types of switch support data that can be collected from a Brocade switch, router, or Director:

– Switch error logs (RASLogs)

– Audit logs

– FFDC files

– Panic dump and core files

– Trace dump files

RASLog - Overview

• Starting in Fabric OS v4.4, the System Message Log began to be called the Reliability, Availability, and Serviceability Log (RASLog)

• RASLog error messages are defined in one of two groups

– External messages – CRITICAL, ERROR, WARNING, and INFO can be viewed by admin-level users

– Internal messages - DEBUG and PANIC cannot be viewed by admin-level users

• There is one RASLog stored in persistent memory

– Up to 1024 external messages stored in a non-volatile circular buffer

– In blade-based switches, each CP maintains a separate RASLog

• In Fabric OS v5.1+, certain security- and zoning-related commands cause an AUDIT flag to be added to error messages

RASLog - Standard Message Format

• Fabric OS v4.4+ error messages follow a standard format:

– Start Delimiter (customizable): Start

– Date (including year) and Time: 2006/03/08-11:59:32

– Message Module and Numeric Instance: ZONE-3006

– Sequence Number: 9

– Audit Flag: AUDIT or FFDC (added in Fabric OS v5.1)

– Severity Level (one of four levels): INFO

– Switch Name: NDA-ST01-B48

– Error description: User: admin, Role: admin, Event: cfgdisable, Status: success, Info: Current zone configuration disabled.

– End Delimiter (customizable): End

Start 2006/03/08-11:59:32, [ZONE-3006], 9, AUDIT, INFO, NDA-ST01-B48, User: admin, Role: admin, Event: cfgdisable, Status: success, Info: Current zone configuration disabled. End
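Because the message format is fixed, a forwarded RASLog line can be broken into its fields mechanically. The sketch below parses a line laid out like the sample above. It assumes the default Start and End delimiters and comma-plus-space separators exactly as shown, and the function name is hypothetical.

```python
from datetime import datetime


def parse_raslog(line, start="Start", end="End"):
    """Split one RASLog message into its fields (layout as in the sample above)."""
    body = line.strip()
    if body.startswith(start):
        body = body[len(start):]
    if body.endswith(end):
        body = body[:-len(end)]
    # Six separators delimit the seven fields; the description may contain commas.
    parts = body.strip().split(", ", 6)
    if len(parts) != 7:
        raise ValueError("unexpected RASLog layout")
    timestamp, msg_id, seq, flag, severity, switch, description = parts
    return {
        "time": datetime.strptime(timestamp, "%Y/%m/%d-%H:%M:%S"),
        "msg_id": msg_id.strip("[]"),
        "sequence": int(seq),
        "flag": flag,
        "severity": severity,
        "switch": switch,
        "description": description,
    }


if __name__ == "__main__":
    sample = (
        "Start 2006/03/08-11:59:32, [ZONE-3006], 9, AUDIT, INFO, NDA-ST01-B48, "
        "User: admin, Role: admin, Event: cfgdisable, Status: success, "
        "Info: Current zone configuration disabled. End"
    )
    print(parse_raslog(sample))
```

Limiting the split to six separators keeps the free-text description intact even though it contains commas of its own.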

RASLog - Management

• Use the following commands to view the RASLog associated with external messages:

– Display all external messages in the error log with no line breaks – errdump (default display order: least-recent to most-recent)

– Display all external messages in the error log with line breaks - errshow (default display order: least-recent to most-recent)

– Use errdump -r or errshow -r to display error messages in reverse order: most-recent to least-recent

– Clear all internal and external messages from the error log with the Admin-level errclear command

• Forward RASLog and Console log entries to a syslogd daemon on a host computer (syslogdipadd)

– Especially important on dual-CP systems as host computer logs maintain a single, sequentially ordered, merged file for both CPs

Audit Log - Overview

• The RASLog was designed to capture abnormal, error-related messages – not high-frequency AUDIT events

• In Fabric OS v5.1 and earlier, error messages and AUDIT events are sent to the RASLog

• In Fabric OS v5.2+, error messages go to the RASLog, and all AUDIT events go only to a new Audit Log

Audit Log – Overview (cont.)

• The new Audit Log is designed for post-event audits and problem determination

– Captured per Virtual Fabric AD

– Configurable (off by default)

• For a given event it captures

– Who (user), when (timestamp), what (SAN component), and which AD

– Event type

– Other event-specific information (description)

– Format consistent with DMTF standard

• AUDIT messages are always sent to the console, and can be configured to go to syslog servers

Audit Log - Details

• Fabric OS v5.2+ continues to audit all Fabric OS v5.1 AUDIT messages

– Secure Fabric OS configuration

– Security related: SSL, RADIUS, Zone, and password strengthening configuration

• Fabric OS v5.2+ can also be configured to audit these tasks:

– configdownload (not configupload)

– firmwaredownload start, complete, and error messages encountered during download

– User initiated security events related to ACLs

– Fabric events related to command execution in other ADs (ad --exec)

• In an AD-aware fabric, Audit Log configuration is done from AD255

• Commands involved in configuring the Audit Log include:

– auditcfg to enable auditing and define what gets audited (filters)

– syslogdipadd to specify IP address of syslog server configured to receive audit messages

FFDC - Overview

• To minimize requests for problem recreation from certain Brocade-defined events, Fabric OS captures First Failure Data Capture (FFDC) data

– Goal: Allow Brocade engineers to gain insight into problems that are transient, difficult-to-recreate, or difficult-to-solve

– Triggered by error MSG_IDs that are selected by Brocade engineering

– Messages are written to the console and the error log with an FFDC flag

• Automatically collects “supportshow-like” information (based on CLI commands) as readable text when the selected event occurs

– A single FFDC event may create one or more FFDC files

– Up to 4 MB for all FFDC files combined (if max size is reached, a RASLog message is generated, and periodic console messages are sent)

• FFDC files are stored on the switch, and transferred by supportsave (automatically deletes files) or savecore (does not automatically delete files)

FFDC - Configuring

• Enable and disable the FFDC functionality with the supportffdc command

– Enabled by default - disable only if directed to do so by next-level support

switch:admin> supportffdc

--enable <Enable FFDC>

--disable <Disable FFDC>

--show <Show FFDC state>

FFDC - Capturing

• The supportsave command uploads the FFDC data via FTP, and deletes it from the switch

– File name indicates the triggering event, and date/time stamp (example: FSSM-1005-2006-08-12-114707.ffdc)

• The savecore command also uploads the FFDC data via FTP (same filename), but does not delete it from the switch

switch:admin> savecore

following 1 directories contains core files:

[ ]0: /core_files/ffdc_data

Welcome to core files management utility.

Menu

1(or R): Remove all core files

2(or F): FTP all core files

3(or r): Remove marked files

4(or f): FTP marked files

5(or m): Mark Files for action

6(or u): Un Mark Files for action

9(or e): Exit

Your choice:
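Once FFDC files have been uploaded, their names alone identify the triggering message and the capture time. The sketch below parses a name shaped like the FSSM-1005-2006-08-12-114707.ffdc example given earlier on this slide. Reading the trailing six digits as HHMMSS is an assumption, and the function name is hypothetical.

```python
import re
from datetime import datetime

# Layout inferred from the single example file name on the slide.
FFDC_NAME_RE = re.compile(r"^([A-Z0-9]+-\d+)-(\d{4}-\d{2}-\d{2}-\d{6})\.ffdc$")


def parse_ffdc_name(filename):
    """Return (message_id, timestamp) for an FFDC file name, or None if it does not match."""
    match = FFDC_NAME_RE.match(filename)
    if match is None:
        return None
    stamp = datetime.strptime(match.group(2), "%Y-%m-%d-%H%M%S")
    return match.group(1), stamp


if __name__ == "__main__":
    print(parse_ffdc_name("FSSM-1005-2006-08-12-114707.ffdc"))
```

The extracted message ID can then be looked up in the RASLog to find the entry that carried the FFDC flag.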

Panic Dump and Core Files - Overview

• Fabric OS creates panic dump and core files when there are problems in the Fabric OS kernel

– Generated when an important Fabric OS daemon no longer responds or terminates unexpectedly

– Captures a snapshot of the current state of the switch at the time of the crash – no historical information retained

– Panic dumps are text files; core file contents are encrypted

• In a dual-CP Director, each CP can create these files, so always check both CPs

Panic Dump and Core Files (cont.)

• To display panic dump files at the command line, enter the pdshow command

switch:admin> pdshow

Could not find any valid pd file!

• To upload (FTP) or delete (remove) panic dump and core files, use the savecore command

switch:admin> savecore -l

/core_files/panic/core.873

/core_files/zoned/core.1234

/core_files/zoned/core.5678

/mnt/core_files/nsd/core.873

/mnt/core_files/panic/core.873

switch:admin> savecore -h 192.168.204.188 -u jsmith -d core_files_here -p password -f /core_files/zoned/,/mnt/core_files/nsd/

/core_files/zoned//core.1234: 1.12 kB 382.60 B/s

/core_files/zoned//core.5678: 1.12 kB 381.95 B/s

/mnt/core_files/nsd//core.873: 1.12 kB 382.53 B/s

Files transferred successfully!

Trace Dump - Overview

• The trace functionality is a proactive troubleshooting tool

– Included in Fabric OS v4.4+ to aid Fabric OS debugging

– Always running, maintaining a historic record of the current and past state of the switch – cannot be disabled

– No impact on user data performance

• The results from the trace operation are stored in a trace dump file

– Triggered by a panic; timeout; CRITICAL-level event; or a manual trigger

– Binary file, retained in persistent memory

– Can be uploaded automatically or manually via FTP

Trace Dump - Implementation

• Initiate or remove a trace dump file, or display trace dump status with the tracedump command

– tracedump -n: Create a trace dump manually

– tracedump -r: Remove (delete) a trace dump from the switch

• Use the traceftp command to manage the uploading (but not deleting) of trace dumps:

– traceftp -n: Manually upload trace dumps via FTP

– traceftp -e: Enable automatic FTP upload of trace dumps

– traceftp -d: Disable automatic FTP upload of trace dumps

– With traceftp -e, specify the FTP server to which trace dumps are uploaded with the supportftp command; otherwise trace dump files will not be automatically uploaded

• Web Tools supports some of the traceftp command functionality

Capturing Switch Support Data - Overview

• There are several tools that you can use to capture switch support data:

– supportshow

– supportsave

– Fabric Manager

– SAN Health

Capturing Switch Support Data - supportshow

• supportshow is a script that executes groups of pre-selected Fabric OS and Linux commands, and displays their output at the CLI

• To simplify troubleshooting for the future, use the supportshow output to establish a switch baseline

– Documents the switch configuration under good conditions

– Future troubleshooting can start by comparing the current supportshow output with the baseline

• supportshow takes ADs into consideration:

– Command is relevant only in AD0 (no user-defined ADs) or AD255 (with user-defined ADs)

– AD must include the switch on which the command is run

– Example supportshow response in non-AD0/AD255 context:

Operation not allowed in AD1-AD254 context
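Comparing current supportshow output against a saved baseline, as recommended on this slide, is a plain text diff. The following is a minimal sketch using Python's standard difflib module. It assumes both outputs have already been saved as text, and the function name and sample text are hypothetical.

```python
import difflib


def diff_against_baseline(baseline_text, current_text):
    """Return unified-diff lines between two saved supportshow outputs."""
    return list(
        difflib.unified_diff(
            baseline_text.splitlines(),
            current_text.splitlines(),
            fromfile="baseline",
            tofile="current",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Tiny made-up excerpts standing in for real captured output.
    baseline = "port 1: Online\nport 2: Online\n"
    current = "port 1: Online\nport 2: Offline\n"
    print("\n".join(diff_against_baseline(baseline, current)))
```

In practice, time stamps and counters change on every run, so some filtering of volatile lines before diffing is likely to be needed.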

Capturing Switch Support Data - supportsave

• To aid the capture of supportshow information, Fabric OS v4.4 introduced supportsave

– Uploads supportshow in a text file whose name indicates the switch name (Director), CP slot (S0, S5), time stamp (200605200014), and SUPPORTSHOW

– Also uploads FFDC files, as well as other information

switch:admin> supportsave -h 192.168.1.1 -u anonymous -d tmp

This command will collect RASLOG, TRACE, and supportShow (active CP only) information for the local CP and then transfer them to a FTP server. The operation can take several minutes. OK to proceed? (yes, y, no, n): [no] y

...

Saving support information for module SUPPORTSHOW...

...rtSave_files/Director-S5-200605200014-SUPPORTSHOW: 1.11 MB 346.39 kB/s

• supportsave needs to be run on both the Active and Standby CPs
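Since supportsave is run on both CPs, the uploaded file names are the quickest way to tell which CP and which run a file came from. The sketch below splits a base name shaped like the Director-S5-200605200014-SUPPORTSHOW example above. Reading the time stamp as YYYYMMDDHHMM is an assumption, and the function name is hypothetical.

```python
from datetime import datetime


def parse_supportsave_name(name):
    """Split '<switch>-<CP slot>-<timestamp>-<module>' into its parts.

    Splitting from the right lets the switch name itself contain hyphens.
    Raises ValueError if the name does not have at least four parts.
    """
    switch, slot, stamp, module = name.rsplit("-", 3)
    return {
        "switch": switch,
        "cp_slot": slot,
        "time": datetime.strptime(stamp, "%Y%m%d%H%M"),
        "module": module,
    }


if __name__ == "__main__":
    print(parse_supportsave_name("Director-S5-200605200014-SUPPORTSHOW"))
```

Grouping files by the cp_slot field gives a quick check that data from both the Active and Standby CPs was actually collected.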

Capturing Switch Support Data – SAN Health

• Another tool that automates the documentation of a SAN is Brocade SAN Health

• SAN Health is a free utility that helps you create:

– Comprehensive Documentation

– Historical Performance Graphs

– Detailed Topology Diagrams

– Best Practice Recommendations

• SAN Health can be run against:

– Brocade systems running any version of Fabric OS or XPath OS

– McData systems running EOS 4.x+

Gathering Switch Support Data - Troubleshooting

• Before troubleshooting a Brocade switch, router, or Director, gather all the basic information that you can:

– Document the current state of the switch with supportsave: RASLogs, numerous command outputs (supportshow)

– Identify user actions taken in the past: Audit logs (if available)

• Validate the current state of the switch by reviewing supportshow:

– Verify switch access settings (e.g. ipaddrshow)

– Check FRU status (e.g. fanshow)

– Validate firmware revisions (e.g. firmwareshow)

– Check port status, port errors (e.g. porterrshow)

• Identify faults on the switch by checking the RASLog (errdump) for error-related messages

• As needed, compare time stamps between the RASLog and the Audit Log to determine whether user actions were a problem source
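Comparing time stamps between the two logs amounts to asking which user actions happened shortly before each error. The sketch below assumes the time stamps have already been parsed into datetime objects. The five-minute window is an arbitrary illustrative choice, and the function name and sample events are hypothetical.

```python
from datetime import datetime, timedelta


def actions_before_errors(error_times, audit_events, window=timedelta(minutes=5)):
    """Map each error time to the audit events that occurred within `window` before it.

    audit_events: iterable of (timestamp, description) tuples.
    """
    audit_events = list(audit_events)
    result = {}
    for err in error_times:
        result[err] = [
            (when, what)
            for when, what in audit_events
            if timedelta(0) <= err - when <= window
        ]
    return result


if __name__ == "__main__":
    # Made-up events for illustration.
    errors = [datetime(2006, 3, 8, 12, 0, 0)]
    audits = [
        (datetime(2006, 3, 8, 11, 59, 32), "cfgdisable"),
        (datetime(2006, 3, 8, 9, 0, 0), "login"),
    ]
    for err, actions in actions_before_errors(errors, audits).items():
        print(err, actions)
```

A match only shows that an action preceded an error closely in time; whether it caused the error still needs to be confirmed from the message contents.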

Gathering Switch Support Data – Escalating to Next-Level Support

• If you are escalating an issue to next-level support, gather all the basic and Brocade information from the switch by running supportsave:

– RASLogs

– supportshow

– FFDC files

– Trace dumps

– Core files and panic dumps

– AP blade details

• In addition, describe the problem in as much detail as possible:

– Affected devices/ports/switches

– SAN topology drawing

– Previous course of action (timeline, commands run)

– Details on recent changes to the fabric (additions/removal/configs)

• If available, also capture the Audit logs, so that past user actions can be identified

Fin