Health Check on DGX-1

Info: Running nvsm

• NVIDIA® System Management (NVSM) is a software framework for monitoring NVIDIA DGX™ nodes in a data center. It includes active health monitoring, system alerts, and log generation.

• It provides notification of fluctuations in system health, faults, and potential failures.

• Run nvsm show health after any software or hardware update or replacement.

• The "dump health" command produces a health report file suitable for attaching to support tickets.

$ sudo nvsm dump health
Writing output to /tmp/nvsm-health-dgx-1-20180907085048.tar.xz
Done
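Each dump gets a timestamped file name, so it can help to pick out the newest one before attaching it to a ticket. A minimal sketch, assuming dumps land in /tmp and follow the nvsm-health-*.tar.xz naming seen in the output above; the function name newest_health_dump is hypothetical and not part of nvsm:

```python
import glob
import os


def newest_health_dump(directory="/tmp", pattern="nvsm-health-*.tar.xz"):
    """Return the most recently modified health dump in `directory`, or None.

    The directory and file-name pattern are assumptions taken from the
    sample output; adjust them if your system writes dumps elsewhere.
    """
    candidates = glob.glob(os.path.join(directory, pattern))
    if not candidates:
        return None
    # Pick the file with the latest modification time.
    return max(candidates, key=os.path.getmtime)


if __name__ == "__main__":
    print(newest_health_dump())
```

The function returns None when no dump is present, which is a reasonable cue to run sudo nvsm dump health first.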

“sudo nvsm show health”

Info: nvsm show health

• Base OS Version
• BIOS Version
• BMC Revision
• GPU Status
• NVLINK Status
• CPU Status
• Memory Status
• Networking Status
• RAID Status
• Drive Status
• Disk errors

List Of Checks By nvsm show health

Info: Interpreting ‘unhealthy’

• If nvsm show health reports ‘Unhealthy’, review every check that was run and note the ones that did not pass (marked Unhealthy).

• Provide this information and the log file to NVIDIA Enterprise Services to continue troubleshooting the system.

lab@psg-xpl-evt-23:~$ sudo nvsm show health

[sudo] password for lab:

Info

----

Timestamp: Tue Aug 7 17:00:18 2018 -0700

Version: 18.06-3

Checks

------

DGX BaseOS Version [4.0.0]...........................................

BIOS Version [0.010].................................................

DGX Serial Number [To be filled by O.E.M.]...........................

Verify installed DIMM memory sticks..................................

Healthy

BMC Firmware Revision [0.70].........................................

Check BMC sensor thresholds..........................................

Healthy

Number of logical CPU cores [80].....................................

Unhealthy

Observed 80 logical CPU cores when 96 cores were expected

.

.

.

Health Summary

--------------

203 out of 205 checks are Healthy

2 out of 205 checks are Unhealthy

Overall system status is Unhealthy

Problem detected.

Please visit the ESP portal: https://nvid.nvidia.com/dashboard/

And create a support ticket with the log file attached.
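Looking for the checks marked Unhealthy can be automated on a saved copy of the output. The following is a sketch only: it assumes each check line ends in a run of dots, that the status appears either on the same line or on the next line (both layouts occur in this deck), and that any lines after an Unhealthy status are detail lines. The names CHECK_RE and unhealthy_checks are hypothetical.

```python
import re

# A check line is a name followed by a run of dots and an optional status.
CHECK_RE = re.compile(r"^(?P<name>.+?)\.{3,}\s*(?P<status>Healthy|Unhealthy)?\s*$")

SAMPLE = """\
Verify installed DIMM memory sticks..................................
Healthy
Number of logical CPU cores [80].....................................
Unhealthy
Observed 80 logical CPU cores when 96 cores were expected
Health Summary
"""


def unhealthy_checks(report):
    """Return {check name: [detail lines]} for checks marked Unhealthy."""
    results = {}
    current = None   # most recent check name seen
    failing = None   # check currently collecting detail lines
    for raw in report.splitlines():
        line = raw.strip()
        # Skip blank lines and elision markers such as "." or "...".
        if not line or set(line) == {"."}:
            continue
        match = CHECK_RE.match(line)
        if match:
            current = match.group("name").strip()
            failing = None
            if match.group("status") == "Unhealthy":
                failing = current
                results[failing] = []
        elif line in ("Healthy", "Unhealthy"):
            # Status printed on its own line after the check name.
            failing = None
            if line == "Unhealthy" and current is not None:
                failing = current
                results[failing] = []
        elif line in ("Health Summary", "Summary"):
            break
        elif failing is not None:
            results[failing].append(line)
    return results


if __name__ == "__main__":
    for name, details in unhealthy_checks(SAMPLE).items():
        print(name, details)
```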

Interpreting nvsm show health Output

Health Summary (end of nvsm show health output)

● Healthy output:

Summary
-------
94 out of 94 checks are Healthy
Overall system status is Healthy

● Unhealthy output:

Summary
-------
9 out of 11 checks are Healthy
2 out of 11 checks are Unhealthy
Overall system status is Unhealthy

Problem detected.
Please visit the ESP portal: https://nvid.nvidia.com/enterpriselogin
And create a support ticket with the log file attached.
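The summary lines follow a regular pattern, so a script can turn them into counts and an overall verdict. A small sketch, assuming the wording shown in the two summaries above; parse_health_summary is a hypothetical helper, not an nvsm feature:

```python
import re

SAMPLE = """\
9 out of 11 checks are Healthy
2 out of 11 checks are Unhealthy
Overall system status is Unhealthy
"""


def parse_health_summary(text):
    """Return (counts, overall) parsed from a health summary.

    counts maps "Healthy"/"Unhealthy" to the number of checks and
    "total" to the total number of checks; overall is the final verdict
    string, or None if the verdict line is absent.
    """
    counts = {}
    pattern = r"(\d+) out of (\d+) checks are (Healthy|Unhealthy)"
    for number, total, state in re.findall(pattern, text):
        counts[state] = int(number)
        counts["total"] = int(total)
    verdict = re.search(r"Overall system status is (Healthy|Unhealthy)", text)
    overall = verdict.group(1) if verdict else None
    return counts, overall


if __name__ == "__main__":
    print(parse_health_summary(SAMPLE))
```

A monitoring wrapper could, for instance, exit non-zero whenever the overall value is not "Healthy".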

Scenario: Missing GPU

● Missing GPU might be caused by:

○ Failed GPU tray upgrade

○ Hardware failure

$ sudo nvsm show health
Info
----
Timestamp: Thu May 16 21:02:15 2019 -0700
Version: 19.01.8

Checks
------
Quick health check of GPU using DCGM................................. Healthy
DGX BaseOS Version [4.0.6]...........................................
Verify installed DIMM memory sticks.................................. Healthy
...
Verify Ethernet controllers.......................................... Healthy
Verify installed GPU's............................................... Unhealthy
Checking output of 'lspci' for expected GPU's
Missing GPU at PCI address "07:00.0"
Verify installed InfiniBand controllers.............................. Healthy
Verify PCIe switches................................................. Healthy
...

Scenario: Missing PCIe Switch

● Missing PCIe switch might be caused by:

○ Hardware failure

$ sudo nvsm show health
...
Checks
------
BIOS Revision [5.11].................................................
...
Verify Ethernet controllers.......................................... Healthy
Verify installed GPU's............................................... Unhealthy
Checking output of 'lspci' for expected GPU's
Missing GPU at PCI address "06:00.0"
Missing GPU at PCI address "07:00.0"
Missing GPU at PCI address "0a:00.0"
Missing GPU at PCI address "0b:00.0"
Verify installed InfiniBand controllers.............................. Unhealthy
Checking output of 'lspci' for expected InfiniBand controllers
Missing InfiniBand controller at PCI address "0c:00.0"
Missing InfiniBand controller at PCI address "05:00.0"
Verify PCIe switches................................................. Unhealthy
Checking output of 'lspci' for expected PCIe switches
Missing PCIe switch at PCI address "03:00.0"
Missing PCIe switch at PCI address "08:00.0"
...
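In this scenario GPUs and InfiniBand controllers are reported missing together with the PCIe switches, so grouping the "Missing ... at PCI address" lines by device type makes the pattern easier to see. A hedged sketch, assuming the detail-line wording shown above; MISSING_RE and missing_devices are hypothetical names:

```python
import re
from collections import defaultdict

# Matches detail lines such as: Missing GPU at PCI address "06:00.0"
MISSING_RE = re.compile(
    r'Missing (?P<kind>.+?) at PCI address "(?P<addr>[0-9a-fA-F:.]+)"'
)

SAMPLE = """\
Missing GPU at PCI address "06:00.0"
Missing GPU at PCI address "07:00.0"
Missing InfiniBand controller at PCI address "0c:00.0"
Missing PCIe switch at PCI address "03:00.0"
"""


def missing_devices(report):
    """Group missing PCI addresses by device type."""
    found = defaultdict(list)
    for match in MISSING_RE.finditer(report):
        found[match.group("kind")].append(match.group("addr"))
    return dict(found)


if __name__ == "__main__":
    for kind, addresses in missing_devices(SAMPLE).items():
        print(kind, addresses)
```

If "PCIe switch" appears among the keys alongside other device types, the switch is a sensible first thing to raise with support, since this scenario shows several device types disappearing at once.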

Scenario: Missing DIMM

● Missing DIMM might be caused by:

○ Improper DIMM installation

○ Unseated during transport

○ DIMM failure

○ DIMM slot failure

$ sudo nvsm show health
Info
----
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5

Checks
------
BIOS Version [S2W_3A08].................................................
DGX Serial Number [YSY72800016]......................................
Verify installed DIMM memory sticks.................................. Unhealthy
Checking output of 'dmidecode' for expected DIMM's
"Memory Device (DIMM_F1)" is missing
"Memory Device (DIMM_G0)" is missing
BMC Firmware Revision [3.27]..................................... Healthy
...

Scenario: Unsupported DIMM

● Unsupported DIMM might be caused by:

○ Unknown DIMM vendor/part number installed

○ Unexpected DIMM size installed

$ sudo nvsm show health
Info
----
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5

Checks
------
BIOS Version [5.11].................................................
DGX Serial Number [YSY72800016]......................................
Verify installed DIMM memory sticks.................................. Unhealthy
Checking output of 'dmidecode' for expected DIMM's
"Memory Device (DIMM_D1) -> size" has value "8192 MB" when "32 GB" was expected
Number of logical CPU cores [80]..................................... Healthy
...
Verify DIMM vendors.................................................. Unhealthy
Comparison: Unknown DIMM vendor "G-Skill"
...
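The size mismatch above mixes units (MB observed, GB expected), which is easy to misread. The sketch below extracts such lines and normalizes both values to MB. It assumes the 'has value "..." when "..." was expected' wording shown above and that sizes are written as an integer followed by MB or GB; all names are hypothetical.

```python
import re

# Matches: "<field>" has value "<observed>" when "<expected>" was expected
MISMATCH_RE = re.compile(
    r'"(?P<field>[^"]+)" has value "(?P<observed>[^"]+)" '
    r'when "(?P<expected>[^"]+)"\s*was expected'
)

# Assumed units; 1 GB is treated as 1024 MB.
UNIT_TO_MB = {"MB": 1, "GB": 1024}

SAMPLE = (
    '"Memory Device (DIMM_D1) -> size" has value "8192 MB" '
    'when "32 GB" was expected'
)


def size_in_mb(text):
    """Convert a string such as "32 GB" to an integer number of MB."""
    number, unit = text.split()
    return int(number) * UNIT_TO_MB[unit]


def size_mismatches(report):
    """Return (field, observed MB, expected MB) for each mismatch line."""
    return [
        (
            match.group("field"),
            size_in_mb(match.group("observed")),
            size_in_mb(match.group("expected")),
        )
        for match in MISMATCH_RE.finditer(report)
    ]


if __name__ == "__main__":
    for field, observed, expected in size_mismatches(SAMPLE):
        print(field, observed, expected)
```

For the sample line this shows an 8192 MB module where a 32768 MB module was expected.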

Scenario: Missing SSD

● Missing SSD might be caused by:

○ Unsupported SSD size installed (e.g. larger SSD installed by customer)

○ Unseated during transport

○ SSD hardware failure

○ SSD not installed

$ sudo nvsm show health
Info
----
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5

Checks
------
BIOS Revision [5.11].................................................
DGX Serial Number [YSY72800016]......................................
Verify installed DIMM memory sticks.................................. Healthy
Number of logical CPU cores [80]..................................... Healthy
...
Verify installed MegaRAID disks...................................... Unhealthy
Checking output of 'smartctl' for expected disks
Found 3 disk(s) with capacity "1.92 TB" when 4 disk(s) were expected
No disks of capacity "480 GB" were found
Verify DIMM vendors.................................................. Healthy
...

Scenario: Unsupported System SKU

● nvsm show health is only supported on DGX-1 and DGX-2 hardware at this time

● Support for nvsm show health on DGX Station is coming soon

$ sudo nvsm show health
ERROR: Unknown product name "DGX Station"
ERROR: nvhealth could not determine system SKU

Please ensure that nvsm show health is running on a supported NVIDIA system.

If this problem persists, please visit the ESP portal: https://nvid.nvidia.com/enterpriselogin

And create a support ticket with the output attached.

● Override the system SKU with the --system-sku flag

$ sudo nvsm show health --system-sku dgx-1-p100
Info
----
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5

Checks
------
BIOS Revision [5.11].................................................
DGX Serial Number [YSY72800016]......................................
Verify installed DIMM memory sticks.................................. Healthy
...

Challenge: DGX-1 Health Check

1. Run nvsm show health on your assigned DGX-1

2. Where can you find the output TAR log from the nvsm dump health run?

3. Is anything wrong with your system?

_Team Challenge_

• Output is saved in /tmp/*.tar.xz

• sudo tar -xf <file> (extracts the archive)

• sudo ls ./<file>

dgxuser@psg-dgx1-02:~$ sudo nvsm show health

[sudo] password for dgxuser:

Info

----

Timestamp: Wed Apr 4 10:03:07 2018 -0700

Version: 18.03-2

(snip)

dgxuser@psg-dgx1-02:~$ sudo nvsm dump health

Writing output to /tmp/*.tar.xz

dgxuser@psg-dgx1-02:~$

dgxuser@psg-dgx1-02:/tmp~$ sudo tar -xf nvsm-health-dgx1-18-04-20190516213243.tar.xz

dgxuser@psg-dgx1-02:/tmp~$ sudo ls ./nvsm-health-dgx1-18-04-20190516213243.tar.xz

Output is reproduced

Extract the tar log file to make it readable

Solution: DGX-1 Health Check
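To see what a dump contains without extracting it, the archive can be listed with Python's standard tarfile module. A minimal sketch, assuming the dump is an xz-compressed tar file as its .tar.xz suffix suggests; list_dump_members is a hypothetical helper, and the archive's internal layout is not documented in this deck:

```python
import sys
import tarfile


def list_dump_members(path):
    """Return the member names stored in an xz-compressed tar archive."""
    with tarfile.open(path, "r:xz") as archive:
        return archive.getnames()


if __name__ == "__main__":
    # Usage: python list_dump.py /tmp/<dump file>.tar.xz
    if len(sys.argv) > 1:
        for name in list_dump_members(sys.argv[1]):
            print(name)
```

Dumps written by sudo may be readable only by root, so the script might need to be run with elevated privileges.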

Info: nvsm

• Software framework for monitoring NVIDIA DGX™ nodes in a data center.

• Documentation: https://docs.nvidia.com/dgx/nvsm-user-guide/index.html

Challenge: Using NVSM

● Use nvsm to check fan(s) status

_Team Challenge_

Example: Using NVSM

$ sudo nvsm show fans

/chassis/localhost/thermal/fans/FAN10_F

Properties:

Status_State = Enabled

Status_Health = OK

Name = FAN10_F

MemberId = 19

ReadingUnits = RPM

LowerThresholdNonCritical = 5046.000

Reading = 9802 RPM

LowerThresholdCritical = 3596.000

...
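The fan listing is a set of "Key = Value" properties, so a saved copy can be checked against its thresholds programmatically. The sketch below assumes the property names and formats shown in the example (Reading as "<number> RPM", thresholds as plain numbers); parse_properties and fan_below_critical are hypothetical helpers.

```python
SAMPLE = """\
Status_State = Enabled
Status_Health = OK
Name = FAN10_F
ReadingUnits = RPM
LowerThresholdNonCritical = 5046.000
Reading = 9802 RPM
LowerThresholdCritical = 3596.000
"""


def parse_properties(block):
    """Parse "Key = Value" lines into a dict of strings."""
    props = {}
    for line in block.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


def fan_below_critical(props):
    """Return True if the fan reading is under the lower critical threshold."""
    reading = float(props["Reading"].split()[0])  # drop the "RPM" suffix
    critical = float(props["LowerThresholdCritical"])
    return reading < critical


if __name__ == "__main__":
    fan = parse_properties(SAMPLE)
    print(fan["Name"], fan["Status_Health"], fan_below_critical(fan))
```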
