01 Hardware and Loop


Data ONTAP 8.1 Cluster-Mode Troubleshooting
Hardware and Loop Troubleshooting
NetApp Confidential - Internal Use Only
© 2011 NetApp. All rights reserved.

Module Objectives
By the end of this module, you should be able to:

- Know where to find logs and events regarding hardware and loop issues
- Know the differences and similarities of hardware and loop troubleshooting between 7-Mode and Cluster-Mode
- Understand how to isolate and remediate offline and inconsistent aggregates

Hardware

A Case Study, Part 1
A customer called in with a system panic during boot after a head swap.
The details:
- Secure site customer with no ASUPs
- The head swap was from a FAS3170 to a FAS3160
- The CF card was swapped with the heads
- set-defaults was run (which environment variables should be set after running set-defaults? Details in the student notes)
- The head swap procedure was followed properly and was done by PS
- Both heads in the cluster had the same issue
- The panic string was:
  page fault (supervisor read, page not present) on VA 0xfffffe000599fc98 cs:rip = 0x8:0xffffffff9fcd3067 rflags = 0x10282 in SK process vdb_apply_work24 on release NetApp Release 8.0.2P5

How would we troubleshoot this?

Student notes: Environment variables that should be set for Cluster-Mode after running set-defaults:
- bootarg.bsdportname
- bootarg.init.boot_clustered true
- partner-sysid

Always capture printenv from the loader before running set-defaults.
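As a sketch, setting these at the LOADER prompt looks like the following. The partner-sysid and port values here are placeholders, not from this case:

LOADER> printenv                                    (capture this output first)
LOADER> set-defaults
LOADER> setenv bootarg.init.boot_clustered true
LOADER> setenv partner-sysid 0151732819             (placeholder sysid)
LOADER> setenv bootarg.bsdportname e0M              (placeholder port)
LOADER> saveenv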

A Case Study, Part 1 (Cont.)
Troubleshooting this issue went as follows:
- Verified the panic string; a bug search turned up several hits, but mostly dealing with wafliron
- Verified that wafliron had *not* ever been run on the system
- Theorized that the 64-bit aggregates may have been contributing to the panic, so those aggregates were offlined in maintenance mode
- The system booted fine after this
- Tried to online the aggregates one by one to find which aggregate was the issue
- One aggregate would not online
- Tried wafliron on the aggregate that would not online
- The system panicked immediately due to the mount of the aggregate
- Ultimately, we found that the aggregate size exceeded the maximum allowed by the FAS3160 platform (but was supported on the FAS3170)
- The only resolutions were either to destroy the aggregate or to put the FAS3170 heads back in place
See bug 568928 for details.

A Case Study, Part 2
A PSE called in due to a number of "RPC timed out" messages.

The details:
- New install
- 7-Mode
- Cannot upgrade
- The error seen was the following:
  ERROR: command failed: Cannot perform update/install when mroot is not available
- The console was spammed with "RPC timed out"
- Disk connectivity was fine
- Nodes boot up and we can log in; no panics

How would you troubleshoot this?

A Case Study, Part 2 (Cont.)
Troubleshooting this issue went as follows:
- Since the error seen during install was that mroot was not available, we went into the systemshell to check mount output and saw that /mroot was not mounted
- 8.0.x 7-Mode does use /mroot (but not RDB)
- Tried to manually mount /mroot; this worked, and the install was able to complete (how could we manually mount /mroot in Cluster-Mode? See the sketch after this list)
- Upon reboot, /mroot was unmounted again
- A bug search turned up these:
  http://burtweb-prd.eng.netapp.com/burt/burt-bin/start?burt-id=457685
  http://burtweb-prd.eng.netapp.com/burt/burt-bin/start?burt-id=514011
- The root cause turned out to be that the system had too many NICs
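As a sketch, checking whether /mroot is mounted from the systemshell in Cluster-Mode looks like this (diag privilege assumed; exact syntax may vary by release):

::> set -privilege diag
::*> systemshell -node local
% mount | grep mroot
% df /mroot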

Supported Platforms in Cluster-Mode
Each version of Cluster-Mode is supported on specific platforms. Below is a list of the supported platforms for 8.0.x and 8.1 Cluster-Mode.

Storage systems may panic if an incompatible version of Data ONTAP is loaded. For example, a FAS2040 uses an ELF version of Data ONTAP. A FAS3050 can use the same version, but if Data ONTAP 8.1 is loaded on a FAS3050, the system will not boot.

8.0 Supported Platforms     8.1 Supported Platforms
62XX                        62XX
60XX                        60XX
32XX                        32XX
31XX                        31XX
FAS2040                     FAS2040
FAS3040/3070                FAS2240
FAS3050 (PVR only)

Supported Configurations in Cluster-Mode
Supported Cluster-Mode hardware configurations are found in the same location as supported 7-Mode hardware configurations: the system configuration guides on the NOW site.

https://now.netapp.com/NOW/knowledge/docs/hardware/NetApp/syscfg/

If an unsupported hardware configuration is running, the system may behave unexpectedly, making troubleshooting much more difficult.

Always refer to the system configuration guide as a first step when dealing with system panics, to ensure the issue isn't something simple and to save time in troubleshooting.

Hardware Considerations in Cluster-Mode
The only considerations to be made regarding hardware in Cluster-Mode are the following:

- Is the hardware configuration supported?
- How does hardware work in 7-Mode?

Hardware troubleshooting is exactly the same in Cluster-Mode as it is in 7-Mode:
- Hardware fails the same way (disks, motherboards, HBAs, and so on)
- Errors report the same; we just get them in different locations

Where to Find Hardware Error Messages
In Cluster-Mode, we can find hardware errors in the following locations:
- event log show output in the cluster shell
- EMS logs in /mroot/etc/log
- EMS logs in AutoSupport

Note that in 7G, we would find hardware issues in the /etc/messages file in ASUP; in Cluster-Mode, the messages file won't give us much in terms of hardware information.

In 8.1 Cluster-Mode (or 7-Mode), what is included in an ASUP depends on why the ASUP was triggered. What sends an ASUP can be checked with the autosupport trigger command. See the AutoSupport section in the Mhost module for details.

event log show Example
::*> event log show -node * -messagename disk.init.failureBytes -instance

                 Node: cluster1-03
            Sequence#: 7946
                 Time: 1/30/2012 13:25:42
             Severity: ERROR
         EMS Severity: SVC_ERROR
               Source: isp2400_intrd
         Message Name: disk.init.failureBytes
                Event: disk.init.failureBytes: Disk 0d.36 failed due to failure byte setting.
Kernel Generation Number: 1327947933
Kernel Sequence Number: 14
 EMS Event XML Length: 195
Number of Times Suppressed Since Last Time Logged: 0

Would this message trigger an ASUP? How could you find out?

::*> autosupport trigger show -node * -autosupport-message disk.init.failureBytes
There are no entries matching your query.

No entries are returned, so this event does not trigger an ASUP.
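To see which events do trigger an AutoSupport, the same command can be run without the message filter (a sketch; the node name is a placeholder):

::*> autosupport trigger show -node node1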

How to Find a Failed Disk in Cluster-Mode
To find a failed disk in AutoSupport:
- EMS logs
- sysconfig -a and sysconfig -r output

To find a failed disk in the CLI:
::> event log show -node * -messagename disk*
::> disk show -state broken

To filter out unwanted messages, use ! in the query. These can be strung together:
::> event log show -node * -messagename raid*,!raid.aggr.log.CP.count,!raid.rg.media*,!raid.spares*

Classic 7G commands can also be leveraged by using node run:
::> node run -node nodename aggr status -f
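These can be combined with the storage disk show field list shown later in this module; for example, a sketch of narrowing down a broken disk's location from the cluster shell:

::> storage disk show -state broken -fields shelf,bay,serial-number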

For more info, see: https://kb.netapp.com/support/index?page=content&id=1012891

How to Find Unowned Disks in Cluster-Mode
In Cluster-Mode, finding unowned disks changes a bit.

In disk show, there is a field called container-type:

::> disk show -disk filer1:0a.16 -fields container-type
disk             container-type
---------------- --------------
filer1:0a.16     aggregate

To find unowned disks, we can filter by container-type unassigned:

::> disk show -disk * -container-type unassigned
                 Usable                 Container
Disk             Size       Shelf Bay   Type        Position   Aggregate Owner
---------------- ---------- ----- ---   ----------- ---------- --------- --------
filer1:0a.43     -              2  11   unassigned  present    -         -
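Once found, an unassigned disk can be given an owner from the cluster shell (a sketch; the disk and owner names here are placeholders):

::> storage disk assign -disk filer1:0a.43 -owner filer1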

We can also use the classic 7G command from node shell:

::> node run local disk show -n

AutoSupport Example
(Screenshot examples of EMS in AutoSupport.)

Loop Troubleshooting

Storage in Cluster-Mode vs. 7-Mode

(Diagram: classic 7G cabling.) In 7-Mode, we have an HA pair with attached shelves.

Storage in Cluster-Mode vs. 7-Mode (Cont.)

In Cluster-Mode, we have HA pairs with attached shelves, connected via a cluster network.

(Diagram: HA pairs joined by the cluster network.)

Storage in Cluster-Mode vs. 7-Mode (Cont.)
In 7-Mode, the storage architecture looked like this:
DISK -> SHELF MODULE -> HBA -> SCSI -> RAID -> WAFL

Storage in Cluster-Mode vs. 7-Mode (Cont.)
In Cluster-Mode, the storage architecture looks like this:
DISK -> SHELF MODULE -> HBA -> SCSI -> RAID -> WAFL -> Cluster-Mode shell

Loop Troubleshooting in Cluster-Mode
In Cluster-Mode, hardware is the same. So, by definition, loop troubleshooting is also the same:
- Shelves are connected the same way
- The same divide-and-conquer troubleshooting is used to isolate loop issues
- Disks and modules are the same
- Node-level commands are the same
- EMS events are the same
- Maintenance mode still exists
- Same bugs, same issues with multi-disk panics
- Software disk ownership is the same

The same loop troubleshooting KB applies:
https://kb.netapp.com/support/index?page=content&actp=LIST&id=S:1012725
The same cabling KB applies:
https://kb.netapp.com/support/index?page=content&id=2013406

Like hardware, what *is* different is how we view loop configuration and issues in Cluster-Mode, via ASUP and the CLI.

Running Classic 7G Commands
In Cluster-Mode, we have the ability to run classic 7G commands. This is useful for loop troubleshooting since it's what we're used to. However, going forward, these commands will eventually be deprecated and ported to the cluster shell.

To run a classic 7G-like shell:

::> node run [nodename]

This will allow us to run the common basic command sets we're used to:
- sysconfig
- fcadmin
- disk
- storage
- sasadmin
- And many more! (See the example in the student notes.)

cluster::> node run local
Type 'exit' or 'Ctrl-D' to return to the CLI
filer> ?
?                fcstat           ping             sp
acpadmin         file             pktt             stats
aggr             flexcache        priv             storage
arp              fsecurity        qtree            sysconfig
backup           halt             quota            sysstat
bmc              help             rdfile           timezone
cdpd             hostname         reallocate       traceroute
cf               ic               revert_to        traceroute6
clone            ifconfig         rlm              ups
date             ifgrp            route            uptime
dcb              ifstat           rshstat          version
df               license          sasadmin         vlan
disk             logger           sasstat          vol
disk_fw_update   man              savecore         wcc
download         maxfiles         shelfchk         wrfile
echo             mt               sis              ypcat
ems              netstat          snap             ypgroup
environment      options          snapmirror       ypmatch
fcadmin          partner          software         ypwhich
fcp              passwd           source

filer> priv set diag
Warning: These diagnostic commands are for use by NetApp personnel only.

filer*> ?
?                help              passwd        sp
acorn            hostname          perf          spinhi_stats
acpadmin         ic                ping          spinnp_replay
aggr             icbulk            pktt          spinnp_replay_stats
arp              ifconfig          printflag     stack
availtime        ifgrp             priv          statit
backup           ifinfo            prof          stats
bmc              ifstat            ps            storage
bootargs         inodepath         qtree         sync
bootfs           iomem             quota         sysconfig
cdpd             iswt              raid_config   sysstat
cf               kma_stats         rastrace      tape_qual
clone            label             rdfile        time
cna              labelmaint        reallocate    timezone
date             led_off           registry      traceroute
dbg              led_off_all       result        traceroute6
dcb              led_on            revert_to     treecompare
dd               led_on_all        rlm           ttcp
df               led_on_off        rm            ups
disk             led_test          route         uptime
disk_fw_update   led_test_one      rshstat       vdom
download         license           rtag          version
dumpblock        log               sasadmin      vif
dumpstack        log_fio           sasstat       vlan
echo             logger            sata          vm_stat
ems              ls                savecore      vol
environ          mailbox           scsi          vol_db
environment      man               sesdiag       waffinity_stats
exit             maxfiles          setflag       wafl
fcadmin          mbstat            setral        wafl_cmd_restrictions
fcmon            mem_scrub_stats   shelfchk      wafl_steal_stats
fcp              mem_stats         showfh        wafl_susp
fcstat           mkfile            signal        wafltop
file             mt                sis           wcc
filersio         mv                sldiag        wrfile
flexcache        netmpstat         slist         xttcp
fsecurity        netstat           smf           ypcat
gdb              options           snap          ypgroup
getXXbyYY        panic             snapmirror    ypmatch
halt             parityck          software      ypwhich
hammer           partner           source
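A node-scoped command can also be run one-off from the cluster shell without entering the node shell; a sketch (the node name is a placeholder, and exact quoting may vary by release):

::> node run -node node1 -command sysconfig -a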

Unavailable 7G Commands
Some commands have been pulled out of the node shell to prevent users from breaking things on the cluster side.

For example, since Cluster-Mode has a VLDB that keeps a record of the location of objects like aggregates, we don't allow aggr destroy from the node shell, to prevent unintended data loss and inconsistencies in the VLDB.

*> aggr destroy
aggr: no such command "destroy"
for more information type "aggr help".

Unavailable 7G Commands (Cont.)
However, there is a hidden option, nodescope.reenabledcmds, to re-enable commands we may need to use in special circumstances.

If we need to destroy an aggregate from WAFL because we can't delete it from the cluster shell (for example, foreign aggregates), we can re-enable aggr destroy:

filer*> options nodescope.reenabledcmds "aggr=destroy"

filer*> aggr destroy
aggr destroy: No aggregate name supplied.
usage:
aggr destroy { <aggrname> | <plexname> } [-f]
- destroy aggregate <aggrname> or traditional volume, or offline plex <plexname>. The aggregate, traditional volume, or plex must be taken offline before it can be destroyed.

Always remember to clear the value after setting and using it:

filer*> options nodescope.reenabledcmds ""

Analog Commands in Cluster-Mode
In Cluster-Mode, there are analogous cluster shell commands for classic 7G commands.

For example, if we want to find the serial number of a node in 7G, we use sysconfig. In Cluster-Mode, we can use the following command from any node in the cluster:

::> node show node1

           Node: node1
          Owner:
       Location:
          Model: FAS3140
  Serial Number: 70005193
      Asset Tag: -
         Uptime: 6 days 19:29
NVRAM System ID: 151732818
      System ID: 0151732818
         Vendor: NetApp
         Health: true
    Eligibility: true

Analog Commands in Cluster-Mode (Cont.)
Additionally, we are able to run some 7G commands from the cluster shell without having to specify a node in the command, such as df:

::> df
Filesystem              kbytes     used      avail      capacity  Mounted on  Vserver
/vol/vol0/              703550980  14797412  688753532  2%        ---         node1
/vol/vol0/.snapshot     37028996   605804    36423192   2%        ---         node1
/vol/vol0/              527566316  13849612  513716680  3%        ---         node2
/vol/vol0/.snapshot     27766648   504740    27261908   2%        ---         node2
/vol/myroot/            83886080   550972    83335108   1%        ---         myvserver
/vol/myroot/.snapshot   20971520   0         20971520   0%        ---         myvserver

For a more complete list of command translations, see the following:
http://wikid.netapp.com/w/User:Parisi/cmode/cheatsheet

There is also an Excel doc called "Rosetta Stone" in the class share:
\\10.61.77.170\cmode_ts

storage disk show
In Cluster-Mode, for loop troubleshooting, the single most important command to remember is storage disk show. Beneath storage disk show is a wealth of information useful for troubleshooting. Some fields of interest:

- State
- Container type
- Owner
- Aggregate
- Position
- Model
- Initiator
- Primary/secondary port

The complete list can be found in the student notes.

Student notes, in diag level:

::*> storage disk show -
-broken  -instance  -longop  -maintenance  -physical  -port  -raid  -sanown  -spare  -stat
-disk  -uid  -aggregate  -array-name  -average-latency  -bay  -bps  -capacity-sectors
-checksum-compatibility  -container-type  -copy-destination  -copy-percent
-disk-io-kbps-total  -disk-iops-total  -diskpathnames  -errors  -firmware-revision
-grown-defect-list-count  -home  -home-id  -host-adapter  -initiator  -initiator-iops
-initiator-io-kbps  -initiator-lun-in-use-count  -initiator-side-switch-port
-is-dynamically-qualified  -lun  -lun-iops  -lun-io-kbps  -lun-path-use-state
-media-scrub-count  -media-scrub-last-done  -model  -nodelist  -outage-reason  -owner
-owner-id  -path-error-count  -path-iops  -path-io-kbps  -path-link-errors
-path-lun-in-use-count  -path-quality  -physical-size-mb  -physical-size
-physical-size-512b  -plex  -port-speed  -position  -power-on-hours  -prefailed
-primary-port  -raid-group  -reconstruction-percent  -replacing  -reserver-id  -rpm
-secondary-name  -secondary-port  -sectors-read  -sectors-written  -serial-number
-shelf  -shm-time-interval  -state  -target-iops  -target-io-kbps
-target-lun-in-use-count  -target-port-access-state  -target-side-switch-port
-target-wwpn  -tpgn  -type  -usable-size-mb  -usable-size  -vendor  -zeroed
-zeroing-percent  -fields
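For example, a compact view for loop troubleshooting can be pulled with -fields; a sketch using field names from the list above:

::*> storage disk show -fields state,container-type,owner,aggregate,position,model,primary-port,secondary-port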

System Health Alerts
In 8.1 Cluster-Mode, we also have system health alerts.

This feature leverages schmd in the M-host; it checks overall system health and also checks for cabling mistakes with SAS shelves.

::> system health alert show -alert-id DualControllerHa_Alert -fields alert-id
node   monitor        alert-id               alerting-resource
------ -------------- ---------------------- -----------------------
node1  system-connect DualControllerHa_Alert 50:05:0c:c1:02:00:84:7d

              Node: node1
          Resource: Shelf ID 23
          Severity: Major
    Probable Cause: Disk shelf 23 is not connected to both controllers of the HA pair (node11, node12).
   Possible Effect: Access to disk shelf 23 will be lost with a single controller failure.
Corrective Actions: 1. Halt all controllers that are connected to disk shelf 23.
                    2. Connect disk shelf 23 to both HA controllers following the rules in the Universal SAS and ACP Cabling Guide.
                    3. Reboot the halted controllers.
                    4. Contact support personnel if the alert persists.

System Health Alerts (Cont.)
Other system health monitor bugs:

Burt 538229 - health monitor sends error ASUPs about healthy stacks
The bug is what the title says: the health monitor will send errors when there aren't problems.

Burt 569689 - SAS cabling guidelines need to be modified
The SAS cabling guide currently lists only one possible way to cable SAS shelves (A/C to square, B/D to circle) when there are a number of ways to correctly cable for MPHA. This problem affects Data ONTAP 8.1 7-Mode as well.

System Health Alerts (Cont.)
One issue with system health alerts is that the schmd process runs on one node and reports on all other nodes, which can be confusing. Bug 538975 is filed for this behavior.

Offline and Inconsistent Aggregates

WAFL in Cluster-Mode
Recall the storage architecture in Cluster-Mode:
DISK -> SHELF MODULE -> HBA -> SCSI -> RAID -> WAFL -> Cluster-Mode shell
Notice that WAFL is still there.

WAFL in Cluster-Mode (Cont.)
Since WAFL is the same in Cluster-Mode, WAFL recovery is also the same.

When WAFL goes inconsistent (bugs, power outages, storage failures), the remediation is the same: wafliron.

The same wafliron KB applies:
https://kb.netapp.com/support/index?page=content&id=3011877

One exception: wafl_check should not be used in Cluster-Mode and is removed from the boot menu in Data ONTAP 8.1 and later. In lieu of wafl_check, we have wafliron with optional commit available.

WAFL in Cluster-Mode (Cont.)
Example of wafliron syntax in Cluster-Mode:

::> set diag
::*> aggregate wafliron
commit  member  reject  review  show  start  stop

::*> aggregate wafliron start -
-aggregate  -include-mirrors  -optional-commit  -previous-cp

To find an aggregate that is inconsistent:

::*> aggregate show -state inconsistent
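Putting it together, a sketch of the optional-commit workflow; the aggregate name and parameter values here are illustrative, and exact syntax may vary by release:

::*> aggregate wafliron start -aggregate aggr1 -optional-commit true
::*> aggregate wafliron show
::*> aggregate wafliron commit -aggregate aggr1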

For more info on aggregate recovery and wafliron, enroll in the Down Filer Troubleshooting class.

Inconsistent Root Aggregates
In 7-Mode, having an inconsistent root volume meant that you had to run wafl_check on the root aggregate.

In Cluster-Mode, since we have the concept of an immortal cluster, and since the root volume contains only replicated databases and system logs, we can instead create a new root volume from the boot menu, boot up, and continue to operate. The RDB will sync with the new root volume and the cluster can continue to run as normal.

For root recovery procedures, see:

http://wikid.netapp.com/w/Mhost/RR_Configuration_Recovery_Guide

Offline Aggregates
If an aggregate is marked as offline but not inconsistent, do the following:

- Check for underlying hardware issues
- Check the aggregate's status in the D-blade
- Investigate the root cause for why the aggregate went offline:
  - EMS
  - Command history logs
- Resolve any outstanding issues
- Online the aggregate (see the query and command below)
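To find aggregates that are offline, mirroring the inconsistent-state query shown earlier (a sketch):

::> aggregate show -state offline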

To online the aggregate in the cluster shell:

::> aggregate online -aggregate aggr

Review
What log file contains the most useful information regarding hardware issues?

- /mroot/etc/log/ems.log
- The EMS log in ASUP

Review
What should be consulted prior to working on any case regarding boot issues, such as filer panics during boot?

- The system configuration guide

Review
How do you find a broken disk?

- EMS logs
- ::> storage disk show -state broken

Review
What command can be used to get into a 7G-like command-line shell?

- node run [nodename]

Review
True or false: hardware and loop troubleshooting in Cluster-Mode is the same as in 7G.

- True: same principles, different commands

Module Summary
You should now be able to:
- Know where to find logs and events regarding hardware and loop issues
- Know the differences and similarities of hardware and loop troubleshooting between 7-Mode and Cluster-Mode
- Understand how to isolate and remediate offline and inconsistent aggregates