Hdisk Replacement

How-To: Stop And Shop Disk Troubleshooting/Replacement Procedures

Here is a tip on what I typically do when a ticket regarding disk issues comes into the Remedy queue.

The first thing I look at is the description and work log in the ticket, to get a basic idea of what the customer is experiencing before logging on to the system.

You may see errors like the following in the description:

errlog|checklvi|ac|checklvinfchecklvinfo - check of logical volume configuration failed! |||systemnone

and

analyze|scsi_err1|ac|threshold for scsi_err1 errors has been exceeded during the last hour. please investigate.||

After logging in to the server / store in question, I check the error report (errpt) and list the specific device (lspv hdiskXX).

S0010000@/ >errpt |more
aa8ab241 1007035808 t o operator operator notification
aa8ab241 1007025808 t o operator operator notification
03913b94 1007023308 u h lvdd hardware disk block relocation achieved
e86653c3 1007023308 p h lvdd i/o error detected by lvm
425bdd47 1007023308 p h hdisk5 disk operation error

S0010000@/ >lspv hdisk5
PHYSICAL VOLUME:    hdisk5                   VOLUME GROUP:     raidvg
PV IDENTIFIER:      0006eb62d30a146a VG IDENTIFIER     0006eb620000d70000000114d32b9c34
PV STATE:           missing                  (this is what you are looking for)
STALE PARTITIONS:   0                        ALLOCATABLE:      yes
PP SIZE:            128 megabyte(s)          LOGICAL VOLUMES:  23
TOTAL PPs:          546 (69888 megabytes)    VG DESCRIPTORS:   1
FREE PPs:           351 (44928 megabytes)    HOT SPARE:        no
USED PPs:           195 (24960 megabytes)    MAX REQUEST:      256 kilobytes
FREE DISTRIBUTION:  22..02..109..109..109
USED DISTRIBUTION:  88..107..00..00..00

Below are a couple of common examples you may see. The first set of errors affects several SCSI devices. The second set, for the most part, pinpoints the cause of the errors that the system is reporting.

Typically you will see 3 types of error conditions in the type (t) column:

U = undefined
T = temp
P = permanent

Example 1:

identifier timestamp t c resource_name description
aa8ab241 1102025708 t o operator operator notification
49a83216 1102022408 t h hdisk4 disk operation error
49a83216 1102022408 t h hdisk6 disk operation error
49a83216 1102022408 t h hdisk7 disk operation error
49a83216 1102022408 t h hdisk5 disk operation error
0ba49c99 1102022408 t h scsi2 scsi bus error
5537ac5f 1102022308 p h rmt0 tape drive failure
aa8ab241 1102015708 t o operator operator notification
aa8ab241 1102015708 t o operator operator notification
aa8ab241 1102015708 t o operator operator notification
aa8ab241 1102005808 t o operator operator notification
aa8ab241 1101235808 t o operator operator notification
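Before reading a long report line by line, it can help to tally the type column. The following is only a minimal sketch, assuming the summary layout shown above (identifier, timestamp, type, class, resource, description). The name errpt_tally is a hypothetical helper, not part of AIX, and it reads the report from standard input so it can be tried on saved output.

```shell
# errpt_tally: hypothetical helper, not an AIX command.
# Reads errpt summary lines on stdin and prints one line per
# error type (column 3): the type, the number of entries, and
# the distinct resources (column 5) that reported it.
errpt_tally() {
    awk '
        # Skip the header line and anything too short to be a report row.
        NF < 6 || tolower($1) == "identifier" { next }
        {
            t = toupper($3)              # normalise the type letter
            count[t]++
            key = t SUBSEP $5
            if (!(key in seen)) {        # remember each resource once per type
                seen[key] = 1
                res[t] = (res[t] == "" ? $5 : res[t] "," $5)
            }
        }
        END {
            # Print permanent errors first, then temp, then undefined.
            n = split("P T U", order, " ")
            for (i = 1; i <= n; i++) {
                t = order[i]
                if (t in count) printf "%s %d %s\n", t, count[t], res[t]
            }
        }
    '
}
```

On a live system this would be run as `errpt | errpt_tally`. A T line listing several hdisks plus a scsi adapter points toward the cabling case in Example 1, while a P line naming a single hdisk points toward a bad disk.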

The first set has quite a few temp errors, with a perm error on the tape device. The tape reports you have should tell you how many drive failures there have been. Overall, I would have all the SCSI connections checked just to make sure everything is OK. My entry in the work log would look like this:

“Please dispatch fujitsu to go onsite to check all scsi connections on system.”

------------------------------------------------------------------------------------------

Example 2:

aa8ab241 1007035808 t o operator operator notification
aa8ab241 1007025808 t o operator operator notification
03913b94 1007023308 u h lvdd hardware disk block relocation achieved
e86653c3 1007023308 p h lvdd i/o error detected by lvm
425bdd47 1007023308 p h hdisk5 disk operation error

For example 2’s set of errors I would also run lspv hdisk5 to check the state of the device (hdisk5 in this case).

S0010000@/ >lspv hdisk5
PHYSICAL VOLUME:    hdisk5                   VOLUME GROUP:     raidvg
PV IDENTIFIER:      0006eb62d30a146a VG IDENTIFIER     0006eb620000d70000000114d32b9c34
PV STATE:           missing                  (this is what you are looking for)
STALE PARTITIONS:   0                        ALLOCATABLE:      yes
PP SIZE:            128 megabyte(s)          LOGICAL VOLUMES:  23
TOTAL PPs:          546 (69888 megabytes)    VG DESCRIPTORS:   1
FREE PPs:           351 (44928 megabytes)    HOT SPARE:        no
USED PPs:           195 (24960 megabytes)    MAX REQUEST:      256 kilobytes
FREE DISTRIBUTION:  22..02..109..109..109
USED DISTRIBUTION:  88..107..00..00..00

In every case where I have had a bad disk, the PV STATE was missing, never active. You may or may not also see stale partitions with a bad hdisk.
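The PV STATE check can be pulled out of the lspv output instead of read by eye. This is a rough sketch under the assumption that the field appears as "PV STATE:" followed by the value, as in the listing above. The name pv_state is a hypothetical helper, not an AIX command.

```shell
# pv_state: hypothetical helper, not an AIX command.
# Reads "lspv hdiskN" output on stdin and prints the word that
# follows "PV STATE:" (for example "active" or "missing").
# Returns a non-zero status if the field is not present.
pv_state() {
    awk '
        $1 == "PV" && $2 == "STATE:" { print $3; found = 1; exit }
        END { if (!found) exit 1 }
    '
}
```

Typical use would be something like `lspv hdisk5 | pv_state`, which for the listing above would print missing.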

In this particular case I would go ahead and remove hdisk5 using the deletedisk script:

#deletedisk hdisk5

The script then breaks the mirror between hdisk5 and its primary mirror, hdisk1. Upon completion you will see an ASCII “SUCCESS” printed on the screen, and when you run lspv you will no longer see the device.
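The "no longer see the device" check can also be scripted. The sketch below assumes that plain lspv prints one disk per line with the disk name in the first column. The name disk_listed is a hypothetical helper, not an AIX command and not part of the deletedisk script.

```shell
# disk_listed: hypothetical helper, not an AIX command.
# Reads plain "lspv" output on stdin and succeeds (status 0)
# only if the disk named in $1 is still listed. The match is on
# the whole first column, so hdisk5 does not match hdisk50.
disk_listed() {
    awk -v d="$1" '$1 == d { hit = 1 } END { exit(hit ? 0 : 1) }'
}
```

After a deletedisk run, `lspv | disk_listed hdisk5` returning a non-zero status would agree with the device having been removed.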

You can then send the ticket back to the helpdesk stating in the work log:

“Please dispatch fujitsu to go onsite to replace hdisk#” (replace # with the actual hdisk id).

Just to reiterate what to run for each instance:

Temp errors on devices and disks in the error report, run:

#errpt |more

Locate the type of error. If the error is temp and does not affect several devices, send back to help desk with the following in the work log: “temp errors no need for dispatch”

If the temp errors affect several devices like this:

#errpt |more
49a83216 1102022408 t h hdisk4 disk operation error
49a83216 1102022408 t h hdisk6 disk operation error
49a83216 1102022408 t h hdisk7 disk operation error
49a83216 1102022408 t h hdisk5 disk operation error
0ba49c99 1102022408 t h scsi2 scsi bus error

Send back to help desk with the following in the work log: “Please dispatch fujitsu to go onsite to check all scsi connections on system.”

Perm errors on a “hdisk”:

#errpt |more

Check the device:

#lspv hdisk# (replace # with the device id)
#deletedisk hdisk# (replace the # with the device id)

Send back to help desk with the following in the work log:

“Please dispatch fujitsu to go onsite to replace hdisk#” (replace # with the actual hdisk id).
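The three cases above can be sketched as one small decision helper. This is an illustration only, not a replacement for reading the report. The name triage_errpt is hypothetical, and "several devices" is assumed here to mean two or more distinct hardware-class (h) resources with temp errors, which is a threshold the procedure itself does not state.

```shell
# triage_errpt: hypothetical helper, not an AIX command.
# Reads errpt summary lines on stdin and prints a suggestion
# that follows the procedure above.
# Assumption: "several devices" = 2 or more distinct resources
# of class h with temp (T) errors. Operator notifications
# (class o) are ignored.
triage_errpt() {
    awk '
        NF < 6 || tolower($1) == "identifier" { next }
        { t = toupper($3); c = toupper($4); r = $5 }
        # Permanent error on a disk: remember each hdisk once.
        t == "P" && r ~ /^hdisk[0-9]+$/ {
            if (!(r in perm)) { perm[r] = 1; plist = plist " " r }
            next
        }
        # Temp hardware error: count distinct resources.
        t == "T" && c == "H" {
            if (!(r in temp)) { temp[r] = 1; ntemp++ }
        }
        END {
            if (plist != "") {
                n = split(plist, d, " ")
                for (i = 1; i <= n; i++)
                    printf "perm error on %s: run lspv %s, then deletedisk %s and dispatch for replacement\n", d[i], d[i], d[i]
            } else if (ntemp >= 2) {
                print "Please dispatch fujitsu to go onsite to check all scsi connections on system."
            } else if (ntemp == 1) {
                print "temp errors no need for dispatch"
            } else {
                print "no temp or perm hardware errors found"
            }
        }
    '
}
```

It would be run as `errpt | triage_errpt`. The perm branch only points at the next manual steps. Whether to run deletedisk still depends on the PV STATE shown by lspv.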

NOTE!!!

The help desk will run cfgmgr and diags / device verification on the hdisk that was replaced.

The only time we will need to check anything is when the help desk sends the ticket back to us because of a checklvinfo problem with the replaced hdisk.

*Certifying Disks:*

1. Run *diag* from the command prompt.
2. Press Enter to get past the initial screen.
3. Select *Task Selection* from the menu.
4. Cursor down to and select *SSA Service Aids.*
5. Select *Certify Disk.*
6. Select the pdisk from the menu.
7. Select *Certify.*
8. Screen will show percentage complete and whether it completed