Clearing the Faults and Understanding LED Status on Sun SPARC Enterprise Mx000 Servers

Prepared By

MOHAIDEEN ABDUL KADER. F

Enterprise Services

VARIOUS SCENARIOS FOR CLEARING FAULTS

M4000 / M5000

CLEARING A FAULT ON A PSU :

XSCF> showstatus

* PSU#1 Status:Faulted;

service# clearfault /PSU#1

Testing the hardware...

XSCF> showstatus

No failures found in System Initialization.
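
The showstatus check above can also be done from a script. The following is a minimal Python sketch, assuming the showstatus output has been captured as text (for example from a logged session). The function name parse_showstatus is hypothetical, and the parsing relies only on the line shapes shown in this document.

```python
import re

# Matches lines such as "* PSU#1 Status:Faulted;" or "MBU_A Status:Normal;".
# A leading "*" is how showstatus flags a unit that needs attention.
_STATUS_LINE = re.compile(r"^\s*(\*)?\s*(\S+)\s+Status:(\w+);")

def parse_showstatus(text):
    """Return (name, status, flagged) tuples for each status line in text."""
    entries = []
    for line in text.splitlines():
        match = _STATUS_LINE.match(line)
        if match:
            star, name, status = match.groups()
            entries.append((name, status, star is not None))
    return entries

sample = """XSCF> showstatus
* PSU#1 Status:Faulted;
"""
flagged_units = [name for name, _, flagged in parse_showstatus(sample) if flagged]
print(flagged_units)
```

An empty result corresponds to the "No failures found in System Initialization." case.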

CLEARING A FAULT ON A DIMM :

XSCF> showstatus

MBU_A Status:Normal;

MEMB#0 Status:Normal;

* MEM#0A Status:Faulted;

service# clearfault /MBU_A/MEMB#0/MEM#0A

clearfault: Fault cannot be cleared for this FRU.

FRU will be marked to clear fault on next circuit breaker off and on. Continue? [y|n]: yes

Fault will be cleared after circuit breaker off and on

XSCF> showstatus

MBU_A Status:Normal;

MEMB#0 Status:Normal;

* MEM#0A Status:Faulted;

CLEARING A FAULT ON A CPUM :

XSCF> showstatus

MBU_A Status:Normal;

* CPUM#0-CHIP#0 Status:Faulted;

* CPUM#0-CHIP#1 Status:Faulted;

service# clearfault /MBU_A/CPUM#0

clearfault: Fault cannot be cleared for this FRU.

FRU will be marked to clear fault on next circuit breaker off and on. Continue? [y|n]: y

Fault will be cleared after circuit breaker off and on

XSCF> showstatus

MBU_A Status:Normal;

* CPUM#0-CHIP#0 Status:Faulted;

* CPUM#0-CHIP#1 Status:Faulted;
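
The M4000 / M5000 examples above show two different clearfault outcomes: the PSU is retested at once, while the DIMM and CPUM faults are only marked for clearing at the next circuit breaker off and on. The following is a small Python sketch that tells these cases apart in captured output. The function name classify_clearfault and its return labels are hypothetical, and it matches only the message strings quoted in this document.

```python
def classify_clearfault(output):
    """Classify captured clearfault output by the messages shown above."""
    if "Fault will be cleared after circuit breaker off and on" in output:
        # The FRU was only marked; showstatus keeps reporting the fault
        # until the circuit breakers have been switched off and on.
        return "deferred"
    if "Fault cannot be cleared for this FRU" in output:
        # The prompt was shown but not confirmed. Assumption: nothing
        # was marked in that case.
        return "not cleared"
    if "Testing the hardware" in output:
        # The hardware was retested; confirm the result with showstatus.
        return "retested"
    return "unknown"

psu_output = "Testing the hardware..."
dimm_output = """clearfault: Fault cannot be cleared for this FRU.
FRU will be marked to clear fault on next circuit breaker off and on. Continue? [y|n]: yes
Fault will be cleared after circuit breaker off and on"""
print(classify_clearfault(psu_output), classify_clearfault(dimm_output))
```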

M8000 / M9000

CLEARING A FAULT ON A PSU :

XSCF> showstatus

* PSU#0 Status:Faulted;

service# clearfault /PSU#0

Testing the hardware...

XSCF> showstatus

No failures found in System Initialization.

CLEARING A FAULT ON THE OPNL :

XSCF> showstatus

* OPNL#0 Status:Faulted;

service# clearfault /OPNL

clearfault: Fault cannot be cleared for this FRU.

FRU will be marked to clear fault on next circuit breaker off and on. Continue? [y|n]: y

Fault will be cleared after circuit breaker off and on

XSCF> showstatus

* OPNL#0 Status:Faulted;

CLEARING A FAULT ON AN IOU NOT PART OF A RUNNING DOMAIN :

XSCF> showstatus

* IOU#1 Status:Faulted;

XSCF> showboards -v -a

XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
---- - -------- ----------- ---- ---- ---- ------- -------- ----
00-0 * 00(00)   Assigned    y    n    n    Unknown Normal   n
01-0 * 00(01)   Assigned    y    n    n    Unknown Faulted  n
02-0   SP       Unavailable y    n    n    Unknown Normal   n
03-0   SP       Unavailable y    n    n    Unknown Normal   n

service# clearfault /IOU#1

Testing the hardware. This may take up to six minutes

XSCF> showstatus

No failures found in System Initialization.

CLEARING A FAULT ON A CMU NOT PART OF A RUNNING DOMAIN :

service# clearfault /CMU#2/CPUM#2

Testing the hardware. This may take up to six minutes

XSCF> showstatus

No failures found in System Initialization.

CLEARING A FAULT ON A CMU WHICH IS PART OF A RUNNING DOMAIN :

XSCF> showstatus

CMU#3 Status:Normal;

* CPUM#0-CHIP#0 Status:Faulted;

* OPNL#0 Status:Faulted;

XSCF> showboards -v -a

XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
---- - -------- ----------- ---- ---- ---- ------- -------- ----
00-0   00(00)   Assigned    y    y    y    Passed  Normal   n
01-0   00(01)   Assigned    y    y    y    Passed  Normal   n
03-0   00(03)   Assigned    y    y    y    Passed  Degraded n

service# clearfault /CMU#3/CPUM#0

FRU cannot be detached

clearfault: Fault cannot be cleared for this FRU.

FRU will be marked to clear fault on next circuit breaker off and on. Continue? [y|n]: n

We can use DR to detach the XSB and clear the fault.

XSCF> deleteboard -c unassign 03-0

XSB#03-0 will be unassigned from domain immediately. Continue?[y|n] :y

Start unconfiguring XSB from domain.

Unconfigured XSB from domain.

XSB power off sequence started. [1200sec] 0...end

Operation has completed.

XSCF> showboards -v -a

XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
---- - -------- ----------- ---- ---- ---- ------- -------- ----
00-0   00(00)   Assigned    y    y    y    Passed  Normal   n
01-0   00(01)   Assigned    y    y    y    Passed  Normal   n
03-0   SP       Available   y    n    n    Passed  Degraded n

service# clearfault /CMU#3/CPUM#0

Testing the hardware. This may take up to six minutes

XSCF> showboards -v -a

XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
---- - -------- ----------- ---- ---- ---- ------- -------- ----
00-0   00(00)   Assigned    y    y    y    Passed  Normal   n
01-0   00(01)   Assigned    y    y    y    Passed  Normal   n
03-0   00(03)   Assigned    y    y    y    Passed  Normal   n

XSCF> showstatus

No failures found in System Initialization.
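
The showboards output used in this scenario can be scanned for XSBs whose Fault column is not Normal. The following is a minimal Python sketch, assuming the output has been captured as text. The function name parse_showboards is hypothetical, and the column handling is based only on the rows shown in this document (single-word values, optional "*" in the R column).

```python
import re

def parse_showboards(text):
    """Return one dict per XSB row found in captured showboards output."""
    boards = []
    for line in text.splitlines():
        tokens = line.split()
        # Data rows start with an XSB number such as "03-0".
        if len(tokens) < 9 or not re.fullmatch(r"\d{2}-\d", tokens[0]):
            continue  # header, separator, prompt, or blank line
        # Count from the right so the optional "*" in the R column
        # does not shift the remaining fields.
        did, assignment, pwr, conn, conf, test, fault, cod = tokens[-8:]
        boards.append({"xsb": tokens[0], "did": did,
                       "assignment": assignment, "fault": fault})
    return boards

sample = """XSB R DID(LSB) Assignment Pwr Conn Conf Test Fault COD
00-0 00(00) Assigned y y y Passed Normal n
01-0 00(01) Assigned y y y Passed Normal n
03-0 00(03) Assigned y y y Passed Degraded n
"""
suspect_xsbs = [b["xsb"] for b in parse_showboards(sample) if b["fault"] != "Normal"]
print(suspect_xsbs)
```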

CLEARING A FAULT ON A CMU WHICH IS PART OF A RUNNING DOMAIN BUT DR CANNOT BE USED :

XSCF> showstatus

CMU#3 Status:Normal;

* MEM#00A Status:Faulted;

service> clearfault /CMU#3/MEM#00A

FRU cannot be detached

clearfault: Fault cannot be cleared for this FRU.

FRU will be marked to clear fault on next circuit breaker off and on. Continue? [y|n]: n

XSCF> showboards -v -a

XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
---- - -------- ----------- ---- ---- ---- ------- -------- ----
00-2   00(00)   Assigned    y    y    y    Passed  Normal   n
03-0   00(12)   Assigned    y    y    y    Passed  Degraded n

Because DR cannot be used in this case, the domain must be powered off before using clearfault.

XSCF> showdomainstatus -d 0

DID Domain Status

00 Powered Off

service> clearfault /CMU#3/MEM#00A

Testing the hardware. This may take up to six minutes

XSCF> showstatus

No failures found in System Initialization.

XSCF> showboards -v -d 0

XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
---- - -------- ----------- ---- ---- ---- ------- -------- ----
00-2 * 00(00)   Assigned    y    n    n    Passed  Normal   n
03-0 * 00(12)   Assigned    y    n    n    Passed  Normal   n
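
The three M8000 / M9000 CMU scenarios above differ only in what has to happen before clearfault can retest the hardware. The following Python sketch summarises that choice as an ordered list of steps. The function name plan_cmu_clearfault and its parameters are hypothetical. The command used to power off the domain is not shown in this document, so that step is left as a plain description rather than a command, and returning the XSB to the domain after DR is not covered.

```python
def plan_cmu_clearfault(fru, xsb, domain_id, in_running_domain, dr_usable):
    """Return the ordered steps for the CMU scenarios described above."""
    if not in_running_domain:
        # The board is not in use, so clearfault can retest it directly.
        return [f"clearfault {fru}"]
    if dr_usable:
        # Detach the XSB with DR first, then clear the fault.
        return [f"deleteboard -c unassign {xsb}", f"clearfault {fru}"]
    # DR cannot be used: the domain must be powered off beforehand.
    return [
        f"power off domain {domain_id} (command not shown in this document)",
        f"showdomainstatus -d {domain_id}",
        f"clearfault {fru}",
    ]

for step in plan_cmu_clearfault("/CMU#3/CPUM#0", "03-0", 0, True, True):
    print(step)
```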

UNDERSTANDING THE LED STATUS ON SUN SPARC MX000 SERIES SERVERS:

Each Mx000 system has an Operator Panel (OPNL) with 3 LEDs :

The Power LED,

The XSCF Standby LED,

The Check LED.

When turned ON, the Check LED, aka the System Check LED, indicates a fault on the system.

Most of the FRUs on the SPARC Enterprise servers have a FRU check LED which reports that the unit contains an error.

However, some FRUs like DIMMs or CPUMs do not have LEDs.

For Sun SPARC Enterprise servers running an XCP version later than 1050, the check LEDs are set and reset as follows :

The FRU check LED is set if the FRU is the sole FRU in a suspect list, including sub-FRUs (CPUM, DIMM ...) and non-FRUs (DDC, SSM ...).

The system check LED is set if any FRU is considered a primary suspect (CFF / UFF) or a secondary suspect, that is, whenever 'showstatus' reports any FRU as faulted or degraded. This includes IO Box FRUs reported as suspect.

Note that the check LED for the PSUs on the M8000/M9000 may not behave as expected: it may not be set when the PSU is the primary suspect.
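
The two rules above can be written down as a small model. The following is a simplified Python sketch, not XSCF firmware logic: the function names are hypothetical, a suspect list is represented as a plain list of unit names, and the PSU exception on the M8000/M9000 is not modelled.

```python
def fru_check_led_on(fru, suspect_list):
    """FRU check LED: set when the FRU is the sole entry in a suspect list."""
    return list(suspect_list) == [fru]

def system_check_led_on(showstatus_entries):
    """System check LED: set when any unit is reported Faulted or Degraded.

    showstatus_entries is a list of (name, status) pairs.
    """
    return any(status in ("Faulted", "Degraded")
               for _, status in showstatus_entries)

print(fru_check_led_on("PSU#1", ["PSU#1"]))
print(system_check_led_on([("MBU_A", "Normal"), ("MEM#0A", "Faulted")]))
```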

Check LED behaviour after clearfault, clearstatus, replacefru :

replacefru :

o The FRU's check LED is :

ON until the maintenance,

blinking during the maintenance,

OFF as soon as the replacefru has completed successfully.

o The System check LED is OFF :

As soon as the replacefru has completed successfully,

And there is no other suspect component left in the system.

clearfault :

The FRU's check LED is turned off as soon as the clearfault command has succeeded in clearing the fault for the FRU.

The System's check LED is turned off as soon as the fault status for the last remaining suspect component is cleared.

This implies that, in the cases where the fault is only cleared after a circuit breaker off and on, the LED turns off after that power cycle.

clearstatus / clearfru :

The FRU and System check LEDs remain ON until the next power cycle.

Faults on IOBox :

Faults detected on IOBox are stored in the CMEM and in the FRUID of the IOBox (Status_CurrentR).

This information is reported in the showstatus output on the XSCF.

Example :

XSCF> showstatus

IOU#4 Status:Normal;

* PCI#5 Status:Degraded;

IOX@X156 Status:Normal;

* IOB1 Status:Faulted;

* PS0 Status:Degraded;

* PS1 Status:Degraded;

When a fault is reported on the IOBox or its components, the Service LED on the IOBox or PSU is lit.

When an IOBox FRU is discovered, the frud reads the Status_CurrentR. If it contains fault information, that information is added to CMEM and the Service LED is turned on.

This can be checked via the ioxadm command :

XSCF> ioxadm env -v

Location        Sensor   Min  Min Alarm  Value  Max Alarm  Max  Units
[...]
IOX@X156/IOB1   SERVICE  -    -          On     -          -    LED

Even if a fault is reported on the IOBox and the Service LED is lit, the OPNL System Check LED is not lit.

The clearfault command can be used to clear the fault status for primary and secondary suspects on the IOBox and its components, as for any other component in the platform chassis (CMU, DIMM, IOU etc ...), for XCP > 1050.

Example :

XSCF> showstatus

IOU#4 Status:Normal;

* PCI#5 Status:Degraded;

IOX@X156 Status:Normal;

* IOB1 Status:Faulted;

* PS0 Status:Degraded;

* PS1 Status:Degraded;

service> clearfault IOU#4-PCI#5

service> clearfault IOX@X156/IOB1

service> clearfault IOX@X156/PS0

service> clearfault IOX@X156/PS1

XSCF> showstatus

No failures found in System Initialization.

Clearing the LINK to the IOBox :

Example :

service> clearfault IOX@X1CK/IOB0/LINK

As soon as no more fault status is reported in the showstatus output, all the Service LEDs are cleared.

Note : There is no condition requiring a power cycle of the IOBox to clear a fault status (similar to clearfault -l).

HIERARCHICAL FAULT CLEARING :

In certain cases, the faulted resources appear to be hierarchical.

In the following example, after clearing the fault on CMU#0, we need to clear the fault on the subordinates.

XSCF> showstatus

* CMU#0 Status:Faulted;

* CPUM#0-CHIP#0 Status:Faulted;

* MEM#03A Status:Faulted;

service# clearfault CMU#0

XSCF> showstatus

CMU#0 Status:Normal;

* CPUM#0-CHIP#0 Status:Faulted;

* MEM#03A Status:Faulted;

CMU#0 remains in the output, although not marked faulted, until the subordinates are cleared:

service# clearfault CMU#0/CPUM#0

XSCF> showstatus

CMU#0 Status:Normal;

* MEM#03A Status:Faulted;

service# clearfault CMU#0/MEM#03A

XSCF> showstatus

No failures found in System Initialization.
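
The order used in the hierarchical example above, parent first and subordinates afterwards, can be derived from the FRU paths themselves. The following is a minimal Python sketch, assuming each FRU is given as a slash-separated path as passed to clearfault in this document. The function name clear_order is hypothetical.

```python
def clear_order(fru_paths):
    """Order FRU paths so that a parent comes before its subordinates."""
    # Fewer "/" separators means higher in the hierarchy. sorted() is
    # stable, so units at the same depth keep their input order.
    return sorted(fru_paths, key=lambda path: path.strip("/").count("/"))

faulted = ["CMU#0/CPUM#0", "CMU#0/MEM#03A", "CMU#0"]
for path in clear_order(faulted):
    print("clearfault", path)
```

After each clearfault, showstatus can be run again, as in the example, to confirm which subordinates are still reported.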