Health Check on DGX-1
Info: Running nvsm
• NVIDIA® System Management (NVSM) is a software framework for monitoring NVIDIA DGX™ nodes in a data center. It includes active health monitoring, system alerts, and log generation.
• It provides notification of fluctuations in system health, faults, and potential failures.
• It is recommended to run nvsm show health after a software or hardware update or replacement.
• The "dump health" command produces a health report file suitable for attaching to support tickets.
$ sudo nvsm dump health
Writing output to /tmp/nvsm-health-dgx-1-20180907085048.tar.xz
Done
“sudo nvsm show health”
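After a "dump health" run, the report still has to be located before it can be attached to a ticket. The following is a minimal sketch that assumes reports land in /tmp and follow the nvsm-health-*.tar.xz naming seen in the sample output above; the helper name latest_health_report is hypothetical and not part of NVSM.

```python
from pathlib import Path
from typing import Optional


def latest_health_report(directory: str = "/tmp") -> Optional[Path]:
    """Return the most recently modified nvsm-health-*.tar.xz file, or None.

    Assumption: the file name pattern matches the sample output
    (/tmp/nvsm-health-<host>-<timestamp>.tar.xz).
    """
    candidates = list(Path(directory).glob("nvsm-health-*.tar.xz"))
    if not candidates:
        return None
    # Pick the newest file by modification time.
    return max(candidates, key=lambda path: path.stat().st_mtime)
```

Calling latest_health_report() with no arguments searches /tmp; pass another directory if the report was written elsewhere.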
Info: nvsm show health
• Base OS Version
• BIOS Version
• BMC Revision
• GPU Status
• NVLINK Status
• CPU Status
• Memory Status
• Networking Status
• Raid Status
• Drive Status
• Disk errors
List Of Checks By nvsm show health
Info: Interpreting ‘unhealthy’
• If nvsm show health reports 'Unhealthy', review all the checks that were run and identify those that did not pass (marked Unhealthy).
• Provide this information and the log file to NVIDIA Enterprise Services to continue troubleshooting the system.
lab@psg-xpl-evt-23:~$ sudo nvsm show health
[sudo] password for lab:
Info
----
Timestamp: Tue Aug 7 17:00:18 2018 -0700
Version: 18.06-3
Checks
------
DGX BaseOS Version [4.0.0]...........................................
BIOS Version [0.010].................................................
DGX Serial Number [To be filled by O.E.M.]...........................
Verify installed DIMM memory sticks..................................
Healthy
BMC Firmware Revision [0.70].........................................
Check BMC sensor thresholds..........................................
Healthy
Number of logical CPU cores [80].....................................
Unhealthy
Observed 80 logical CPU cores when 96 cores were expected
.
.
.
Health Summary
--------------
203 out of 205 checks are Healthy
2 out of 205 checks are Unhealthy
Overall system status is Unhealthy
Problem detected.
Please visit the ESP portal: https://nvid.nvidia.com/dashboard/
And create a support ticket with the log file attached.
Interpreting nvsm show health Output
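Picking the failed checks out of a report with roughly 200 entries is easier with a small filter. The sketch below assumes the layout shown in the sample output: a check name followed by a run of dots, with the status either on the same line or on the next one, and indented detail lines after an Unhealthy status. The function names are hypothetical, and treating a check without a status as informational is an assumption.

```python
import re
from typing import List, Tuple

# A check line is a name followed by a run of dots and, optionally, a status.
CHECK_LINE = re.compile(r"^(?P<name>.+?)\.{5,}\s*(?P<status>Healthy|Unhealthy)?$")
STATUSES = ("Healthy", "Unhealthy")


def parse_checks(output: str) -> List[Tuple[str, str, List[str]]]:
    """Return (check name, status, detail lines) for each check in the output."""
    checks: List[list] = []
    for raw in output.splitlines():
        line = raw.strip()
        if line in ("Health Summary", "Summary"):
            break  # the summary section is not a list of checks
        if not line or set(line) <= {"."}:
            continue  # skip blank lines and elision markers such as "..."
        match = CHECK_LINE.match(line)
        if match:
            checks.append([match.group("name").strip(), match.group("status") or "", []])
        elif checks and not checks[-1][1] and line in STATUSES:
            checks[-1][1] = line  # status printed on the line after the check
        elif checks and checks[-1][1] == "Unhealthy":
            checks[-1][2].append(line)  # explanatory detail for a failed check
    return [(name, status, details) for name, status, details in checks]


def unhealthy_checks(output: str) -> List[Tuple[str, str, List[str]]]:
    """Keep only the checks marked Unhealthy."""
    return [check for check in parse_checks(output) if check[1] == "Unhealthy"]
```

The output passed in would be the captured text of sudo nvsm show health; the detail lines returned for each failed check are the ones worth quoting in a support ticket.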
Health Summary (end of nvsm show health output)
● Healthy output:
Summary
-------
94 out of 94 checks are Healthy
Overall system status is Healthy
● Unhealthy output:
Summary
-------
9 out of 11 checks are Healthy
2 out of 11 checks are Unhealthy
Overall system status is Unhealthy
Problem detected.
Please visit the ESP portal: https://nvid.nvidia.com/enterpriselogin
And create a support ticket with the log file attached.
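For scripted monitoring, the summary at the end of the output can be reduced to a few numbers. The sketch below assumes the summary wording shown in the samples ("N out of M checks are Healthy", "Overall system status is ..."); parse_summary is a hypothetical helper, not an NVSM interface.

```python
import re
from typing import Dict, Optional, Union

COUNT = re.compile(r"(\d+) out of (\d+) checks are (Healthy|Unhealthy)")
OVERALL = re.compile(r"Overall system status is (Healthy|Unhealthy)")


def parse_summary(output: str) -> Dict[str, Optional[Union[int, str]]]:
    """Extract check counts and the overall status from the summary text."""
    summary: Dict[str, Optional[Union[int, str]]] = {
        "total": None, "Healthy": 0, "Unhealthy": 0, "overall": None,
    }
    for count, total, status in COUNT.findall(output):
        summary[status] = int(count)
        summary["total"] = int(total)
    overall = OVERALL.search(output)
    if overall:
        summary["overall"] = overall.group(1)
    return summary
```

A wrapper script could, for example, exit non-zero whenever the "overall" value is not "Healthy".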
Scenario: Missing GPU
● Missing GPU might be caused by:
○ Failed GPU tray upgrade
○ Hardware failure
$ sudo nvsm show health
Info
----
Timestamp: Thu May 16 21:02:15 2019 -0700
Version: 19.01.8
Checks
------
Quick health check of GPU using DCGM................................. Healthy
DGX BaseOS Version [4.0.6]...........................................
Verify installed DIMM memory sticks.................................. Healthy
...
Verify Ethernet controllers.......................................... Healthy
Verify installed GPU's............................................... Unhealthy
Checking output of 'lspci' for expected GPU's
Missing GPU at PCI address "07:00.0"
Verify installed InfiniBand controllers.............................. Healthy
Verify PCIe switches................................................. Healthy
...
Scenario: Missing PCIe Switch
● Missing PCIe switch might be caused by:
○ Hardware failure
$ sudo nvsm show health
...
Checks
------
BIOS Revision [5.11].................................................
...
Verify Ethernet controllers.......................................... Healthy
Verify installed GPU's............................................... Unhealthy
Checking output of 'lspci' for expected GPU's
Missing GPU at PCI address "06:00.0"
Missing GPU at PCI address "07:00.0"
Missing GPU at PCI address "0a:00.0"
Missing GPU at PCI address "0b:00.0"
Verify installed InfiniBand controllers.............................. Unhealthy
Checking output of 'lspci' for expected InfiniBand controllers
Missing InfiniBand controller at PCI address "0c:00.0"
Missing InfiniBand controller at PCI address "05:00.0"
Verify PCIe switches................................................. Unhealthy
Checking output of 'lspci' for expected PCIe switches
Missing PCIe switch at PCI address "03:00.0"
Missing PCIe switch at PCI address "08:00.0"
...
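The two scenarios above differ mainly in which device types go missing together: a single GPU on its own, versus GPUs and InfiniBand controllers disappearing along with PCIe switches. A minimal sketch for grouping the "Missing ... at PCI address" lines by device type follows. The function names are hypothetical, and flagging the switch first is a heuristic drawn from the scenario above, not documented NVSM behaviour.

```python
import re
from collections import defaultdict
from typing import Dict, List

MISSING = re.compile(
    r'Missing (GPU|InfiniBand controller|PCIe switch) at PCI address "([0-9a-fA-F:.]+)"'
)


def missing_devices(output: str) -> Dict[str, List[str]]:
    """Group the PCI addresses of missing devices by device type."""
    found: Dict[str, List[str]] = defaultdict(list)
    for kind, address in MISSING.findall(output):
        found[kind].append(address)
    return dict(found)


def switch_suspected(missing: Dict[str, List[str]]) -> bool:
    """Heuristic (assumption): if a PCIe switch is missing, inspect it first,
    because other devices may have disappeared along with it."""
    return bool(missing.get("PCIe switch"))
```

With the Missing GPU output, missing_devices reports one GPU address and switch_suspected returns False; with the Missing PCIe Switch output, it reports four GPUs, two InfiniBand controllers, and two switches, and switch_suspected returns True.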
Scenario: Missing DIMM
● Missing DIMM might be caused by:
○ Improper DIMM installation
○ Unseated during transport
○ DIMM failure
○ DIMM slot failure
$ sudo nvsm show health
Info
----
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5
Checks
------
BIOS Version [S2W_3A08].................................................
DGX Serial Number [YSY72800016]......................................
Verify installed DIMM memory sticks.................................. Unhealthy
Checking output of 'dmidecode' for expected DIMM's
"Memory Device (DIMM_F1)" is missing
"Memory Device (DIMM_G0)" is missing
BMC Firmware Revision [3.27]..................................... Healthy
...
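To tell a technician which slots to reseat, the slot labels can be pulled from the "is missing" lines. This sketch assumes the message format in the sample output above; missing_dimm_slots is a hypothetical helper.

```python
import re
from typing import List

MISSING_DIMM = re.compile(r'"Memory Device \((DIMM_\w+)\)" is missing')


def missing_dimm_slots(output: str) -> List[str]:
    """Return the slot labels of DIMMs reported as missing, in output order."""
    return MISSING_DIMM.findall(output)
```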
Scenario: Unsupported DIMM
● Unsupported DIMM might be caused by:
○ Unknown DIMM vendor/part number installed
○ Unexpected DIMM size installed
$ sudo nvsm show health
Info
----
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5
Checks
------
BIOS Version [5.11].................................................
DGX Serial Number [YSY72800016]......................................
Verify installed DIMM memory sticks.................................. Unhealthy
Checking output of 'dmidecode' for expected DIMM's
"Memory Device (DIMM_D1) -> size" has value "8192 MB" when "32 GB" was expected
Number of logical CPU cores [80]..................................... Healthy
...
Verify DIMM vendors.................................................. Unhealthy
Comparison: Unknown DIMM vendor "G-Skill"
...
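The size mismatch message mixes units (8192 MB observed versus 32 GB expected), so comparing the two values needs a conversion: 8192 MB is 8 GB, a quarter of the expected capacity. The sketch below assumes the message format in the sample output and binary units (1 GB = 1024 MB); dimm_size_mismatches is a hypothetical helper.

```python
import re
from typing import List, Tuple

SIZE_MISMATCH = re.compile(
    r'"Memory Device \((DIMM_\w+)\) -> size" has value "(\d+) (MB|GB)" '
    r'when "(\d+) (MB|GB)"\s*was expected'
)

# Assumption: sizes use binary units, so 1 GB = 1024 MB.
MB_PER_UNIT = {"MB": 1, "GB": 1024}


def dimm_size_mismatches(output: str) -> List[Tuple[str, int, int]]:
    """Return (slot, observed MB, expected MB) for each size mismatch."""
    results = []
    for slot, seen, seen_unit, want, want_unit in SIZE_MISMATCH.findall(output):
        observed_mb = int(seen) * MB_PER_UNIT[seen_unit]
        expected_mb = int(want) * MB_PER_UNIT[want_unit]
        results.append((slot, observed_mb, expected_mb))
    return results
```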
Scenario: Missing SSD
● Missing SSD might be caused by:
○ Unsupported SSD size installed (e.g. larger SSD installed by customer)
○ Unseated during transport
○ SSD hardware failure
○ SSD not installed
$ sudo nvsm show health
Info
----
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5
Checks
------
BIOS Revision [5.11].................................................
DGX Serial Number [YSY72800016]......................................
Verify installed DIMM memory sticks.................................. Healthy
Number of logical CPU cores [80]..................................... Healthy
...
Verify installed MegaRAID disks...................................... Unhealthy
Checking output of 'smartctl' for expected disks
Found 3 disk(s) with capacity "1.92 TB" when 4 disk(s) were expected
No disks of capacity "480 GB" were found
Verify DIMM vendors.................................................. Healthy
...
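The disk check reports shortfalls per capacity class, which can be summarized as "how many drives of each size are missing". A minimal sketch, assuming the two message formats in the sample output above; disk_shortfalls is a hypothetical helper, and None marks a capacity class for which the output gives no expected count.

```python
import re
from typing import Dict, Optional

FOUND = re.compile(
    r'Found (\d+) disk\(s\) with capacity "([^"]+)" when (\d+) disk\(s\) were expected'
)
NONE_FOUND = re.compile(r'No disks of capacity "([^"]+)" were found')


def disk_shortfalls(output: str) -> Dict[str, Optional[int]]:
    """Map each capacity to the number of missing disks (None if unknown)."""
    shortfalls: Dict[str, Optional[int]] = {}
    for found, capacity, expected in FOUND.findall(output):
        shortfalls[capacity] = int(expected) - int(found)
    for capacity in NONE_FOUND.findall(output):
        # The message does not say how many disks were expected.
        shortfalls[capacity] = None
    return shortfalls
```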
Scenario: Unsupported System SKU
● nvsm show health is supported only on DGX-1 and DGX-2 hardware at this time
● Support for nvsm show health on DGX Station is coming soon
$ sudo nvsm show health
ERROR: Unknown product name "DGX Station"
ERROR: nvhealth could not determine system SKU
Please ensure that nvsm show health is running on a supported NVIDIA system.
If this problem persists, please visit the ESP portal: https://nvid.nvidia.com/enterpriselogin
And create a support ticket with the output attached.
● Override the system SKU with the --system-sku flag
$ sudo nvsm show health --system-sku dgx-1-p100
Info
----
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5
Checks
------
BIOS Revision [5.11].................................................
DGX Serial Number [YSY72800016]......................................
Verify installed DIMM memory sticks.................................. Healthy
...
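A wrapper script could detect the SKU error and build the override command shown above. This is only a sketch: health_command and sku_not_detected are hypothetical helpers, the error text is taken from the sample output, and the SKU value must be one that the installed NVSM version accepts (dgx-1-p100 is the only value shown here).

```python
from typing import List, Optional


def health_command(system_sku: Optional[str] = None) -> List[str]:
    """Build the argument list for 'nvsm show health', optionally with a SKU override."""
    command = ["sudo", "nvsm", "show", "health"]
    if system_sku:
        command += ["--system-sku", system_sku]
    return command


def sku_not_detected(output: str) -> bool:
    """True if the output contains the SKU detection error from the sample above."""
    return "could not determine system SKU" in output
```

The returned list can be passed to subprocess.run(command, capture_output=True, text=True); if sku_not_detected is true for the captured output, the call can be retried with an explicit SKU.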
Challenge: DGX-1 Health Check
1. Run nvsm show health on your assigned DGX-1
2. Where can you find the output TAR log from the nvsm dump health run?
3. Is anything wrong with your system?
_Team Challenge_
Solution: DGX-1 Health Check
• Output is saved in /tmp/*.tar.xz
• sudo tar -xf <file>
• sudo ls ./<file>
dgxuser@psg-dgx1-02:~$ sudo nvsm show health
[sudo] password for dgxuser:
Info
----
Timestamp: Wed Apr 4 10:03:07 2018 -0700
Version: 18.03-2
(snip)
dgxuser@psg-dgx1-02:~$ sudo nvsm dump health
Writing output to /tmp/*.tar.xz
dgxuser@psg-dgx1-02:~$
dgxuser@psg-dgx1-02:/tmp~$ sudo tar -xf nvsm-health-dgx1-18-04-20190516213243.tar.xz
dgxuser@psg-dgx1-02:/tmp~$ sudo ls ./nvsm-health-dgx1-18-04-20190516213243.tar.xz
Output is reproduced
Extract the tar log file to make it readable
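The contents of the report can also be inspected without extracting it. The sketch below uses Python's standard tarfile module, which reads xz-compressed archives; list_report_members is a hypothetical helper, and reading a report under /tmp may require elevated privileges if the file is owned by root.

```python
import tarfile
from typing import List


def list_report_members(archive_path: str) -> List[str]:
    """Return the names of the files stored in an xz-compressed tar archive."""
    with tarfile.open(archive_path, mode="r:xz") as archive:
        return archive.getnames()
```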
Info: nvsm
• Software framework for monitoring NVIDIA DGX™ nodes in a data center.
• Documentation: https://docs.nvidia.com/dgx/nvsm-user-guide/index.html
Challenge: Using NVSM
● Use nvsm to check fan(s) status
_Team Challenge_
Example: Using NVSM
$ sudo nvsm show fans
/chassis/localhost/thermal/fans/FAN10_F
Properties:
Status_State = Enabled
Status_Health = OK
Name = FAN10_F
MemberId = 19
ReadingUnits = RPM
LowerThresholdNonCritical = 5046.000
Reading = 9802 RPM
LowerThresholdCritical = 3596.000
...
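The fan properties above carry both the current reading and the lower thresholds, so a fan's state can be derived by comparing them. The sketch below assumes the "Name = value" layout shown in the sample; the helper names and the labels "critical", "warning", and "ok" are hypothetical and are not NVSM terminology.

```python
import re
from typing import Dict

PROPERTY = re.compile(r"^\s*(\w+)\s*=\s*(.+?)\s*$")


def parse_properties(block: str) -> Dict[str, str]:
    """Collect 'Name = value' lines from one target's Properties section."""
    properties: Dict[str, str] = {}
    for line in block.splitlines():
        match = PROPERTY.match(line)
        if match:
            properties[match.group(1)] = match.group(2)
    return properties


def fan_state(properties: Dict[str, str]) -> str:
    """Compare the reading against the lower thresholds from the same block."""
    reading = float(properties["Reading"].split()[0])  # "9802 RPM" -> 9802.0
    if reading < float(properties["LowerThresholdCritical"]):
        return "critical"
    if reading < float(properties["LowerThresholdNonCritical"]):
        return "warning"
    return "ok"
```

For the FAN10_F sample, the reading of 9802 RPM is above both lower thresholds (5046 and 3596), so fan_state returns "ok", which agrees with the reported Status_Health = OK.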