System Dump

  • 8/8/2019 System Dump

    1/5

    How do I analyze a system dump?

    faq52-1546

    Posted: 7 Feb 02 (Edited 14 Apr 02)

    1. System Dump

    1.1 Overview

    During a system crash (888 flashing on the LCD), the dump process is invoked. It stores the entire kernel segment that resides in real memory (the kernel segment is segment 0) on disk for later debugging. In effect, it creates a snapshot core image of the machine at the moment of the crash and saves it to disk.

    The dump also includes memory-resident user data (such as u-blocks).

    A successful dump is indicated by 0c0 on the LCD.

    1.2 Dump Status Codes

    The following dump progress indicators, or dump status codes, are part of a Type 102 message.

    Note: When a lowercase c is listed, it is displayed in the lower half of the character position. Some systems produce 4-digit codes; the two leftmost positions can contain blanks or zeros. Use the two rightmost digits.

    0c0 The dump completed successfully.

    0c1 The dump failed due to an I/O error.

    0c2 A dump, requested by the user, is started.

    0c3 The dump is inhibited.

    0c4 The dump device is not large enough.

    0c5 The dump did not start, or the dump crashed.

    0c6 Dumping to a secondary dump device.

    0c7 Reserved.

    0c8 The dump function is disabled.

    0c9 A dump is in progress.

    0cc Unknown dump failure
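
    If you collect LED codes from several machines, the table above can be wrapped in a small shell function. This is a minimal sketch: dump_status is a hypothetical helper name (not an AIX command), and it simply applies the rule from the note above by keeping the two rightmost characters of a 3- or 4-digit code.

```sh
# Minimal sketch: translate an LED dump status code (section 1.2) into text.
# dump_status is a hypothetical helper, not an AIX command.
dump_status() {
    # The LED shows a lowercase c; accept an uppercase C as well.
    code=$(printf '%s\n' "$1" | tr 'C' 'c')
    # 3- or 4-digit codes: keep only the two rightmost characters,
    # so "0c0", "00c0" and " 0c0" all become "c0".
    key=$(printf '%s\n' "$code" | sed 's/.*\(..\)$/\1/')
    case "$key" in
        c0) echo "The dump completed successfully." ;;
        c1) echo "The dump failed due to an I/O error." ;;
        c2) echo "A dump, requested by the user, is started." ;;
        c3) echo "The dump is inhibited." ;;
        c4) echo "The dump device is not large enough." ;;
        c5) echo "The dump did not start, or the dump crashed." ;;
        c6) echo "Dumping to a secondary dump device." ;;
        c7) echo "Reserved." ;;
        c8) echo "The dump function is disabled." ;;
        c9) echo "A dump is in progress." ;;
        cc) echo "Unknown dump failure." ;;
        *)  echo "Not a dump status code: $1" >&2; return 1 ;;
    esac
}
```

    For example, dump_status 00c4 prints "The dump device is not large enough."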

    1.3 Crash Codes

    Note: Some systems may produce 4-digit codes. If the leftmost digit of a 4-digit code is 0, use the three rightmost digits.

    The following crash codes are part of a Type 102 message.

    000 Unexpected system interrupt.

    200 Machine check because of a memory bus error.

    201 Machine check because of a memory timeout.

    202 Machine check because of a memory card failure.

    203 Machine check because of an out-of-range address.

    204 Machine check because of an attempt to write to ROS.

    205 Machine check because of an uncorrectable address parity.

    206 Machine check because of an uncorrectable ECC error.

    207 Machine check because of an unidentified error.

    208 Machine check because of an uncorrectable L2 ECC error.

    300 Data storage interrupt from the processor.

    32x Data storage interrupt because of an I/O exception from IOCC.

    38x Data storage interrupt because of an I/O exception from SLA.

    400 Instruction storage interrupt.

    500 External interrupt because of a scrub memory bus error.

    501 External interrupt because of an unidentified error.

    51x External interrupt because of a DMA memory bus error.

    52x External interrupt because of an IOCC channel check.

    53x External interrupt from an IOCC bus timeout; x represents the IOCC number.

    54x External interrupt because of an IOCC keyboard check.

    558 There is not enough memory to continue the IPL.

    700 Program interrupt.

    800 Floating point is not available.
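
    The crash code table can be handled the same way. In this sketch (crash_code is a hypothetical helper, not an AIX command), a 4-digit code with a leading 0 is cut down to its three rightmost digits, and the table's x entries such as 32x become shell patterns like 32?, which match any final digit.

```sh
# Minimal sketch: look up a crash code from section 1.3.
# crash_code is a hypothetical helper, not an AIX command.
crash_code() {
    code=$1
    # A 4-digit code whose leftmost digit is 0 is read from its
    # three rightmost digits ("0300" -> "300").
    case "$code" in
        0???) code=${code#0} ;;
    esac
    # Table entries written as "32x" become the patterns "32?" (any last digit).
    case "$code" in
        000) echo "Unexpected system interrupt." ;;
        200) echo "Machine check because of a memory bus error." ;;
        201) echo "Machine check because of a memory timeout." ;;
        202) echo "Machine check because of a memory card failure." ;;
        203) echo "Machine check because of an out-of-range address." ;;
        204) echo "Machine check because of an attempt to write to ROS." ;;
        205) echo "Machine check because of an uncorrectable address parity." ;;
        206) echo "Machine check because of an uncorrectable ECC error." ;;
        207) echo "Machine check because of an unidentified error." ;;
        208) echo "Machine check because of an uncorrectable L2 ECC error." ;;
        300) echo "Data storage interrupt from the processor." ;;
        32?) echo "Data storage interrupt because of an I/O exception from IOCC." ;;
        38?) echo "Data storage interrupt because of an I/O exception from SLA." ;;
        400) echo "Instruction storage interrupt." ;;
        500) echo "External interrupt because of a scrub memory bus error." ;;
        501) echo "External interrupt because of an unidentified error." ;;
        51?) echo "External interrupt because of a DMA memory bus error." ;;
        52?) echo "External interrupt because of an IOCC channel check." ;;
        53?) echo "External interrupt from an IOCC bus timeout (IOCC number ${code#53})." ;;
        54?) echo "External interrupt because of an IOCC keyboard check." ;;
        558) echo "There is not enough memory to continue the IPL." ;;
        700) echo "Program interrupt." ;;
        800) echo "Floating point is not available." ;;
        *)   echo "Unknown crash code: $1" >&2; return 1 ;;
    esac
}
```

    For example, crash_code 0300 prints "Data storage interrupt from the processor."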

    1.4 Enabling system dump

    You can check the current dump settings with "smitty dump" or from the command line:

    # sysdumpdev -l
    primary /dev/hd6
    secondary /dev/sysdumpnull
    copy directory /var/adm/ras
    forced copy flag TRUE
    always allow dump TRUE
    dump compression OFF

    This means that the system dump is enabled (always allow dump = TRUE) and that the dump will be copied to /var/adm/ras.

    If always allow dump = FALSE, the dump will not be generated.

    To change this, set the following option in smit ("smitty dump"):

    Always ALLOW System Dump = true
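
    To check the flag from a script rather than through smit, the sysdumpdev -l listing can be filtered with awk. A minimal sketch, assuming the layout shown above (the label followed by the value as the last word on the line); always_allow_dump is a hypothetical helper name.

```sh
# Minimal sketch: print the "always allow dump" value from sysdumpdev -l
# output read on stdin; exits 1 if that line is missing.
# always_allow_dump is a hypothetical helper name.
always_allow_dump() {
    awk '/^[ \t]*always allow dump/ { print $NF; found = 1 }
         END { if (!found) exit 1 }'
}
```

    On the AIX host it would be run as: sysdumpdev -l | always_allow_dump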

    The /var filesystem must have enough space to hold a couple of system dumps. Its size can be increased by, say, 100 MB as follows (chfs takes the size in 512-byte blocks):

    # chfs -a size=+200000 /var
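
    The arithmetic behind the +200000 figure, as a small sketch (mb_to_blocks and blocks_to_bytes are hypothetical helpers; 1 MB is taken as 1024 * 1024 bytes):

```sh
# Minimal sketch of the chfs size arithmetic: chfs -a size= counts
# 512-byte blocks. Helper names are hypothetical; 1 MB = 1024 * 1024 bytes.
mb_to_blocks() {
    echo $(( $1 * 1024 * 1024 / 512 ))
}

blocks_to_bytes() {
    echo $(( $1 * 512 ))
}
```

    So 200000 blocks is 102,400,000 bytes, a little under 100 MB; an exact 100 MB would be mb_to_blocks 100, that is 204800 blocks.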

    1.5 Analyzing system dump

    If a customer reports that the system froze with 888 on the display, check errpt for an entry like this:

    C0AA5338 0614145601 U S SYSDUMP SYSTEM DUMP

    This means that a system dump was recorded on 14 June at 14:56.
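
    The TIMESTAMP column of errpt is in MMDDhhmmYY form, so 0614145601 reads as 14 June, 14:56, 2001. The sketch below pulls the timestamps of SYSDUMP entries out of errpt output and rewrites them in a readable form. errpt_time and sysdump_times are hypothetical helpers, and the century (20YY) is an assumption, since errpt shows only two year digits.

```sh
# Minimal sketch: decode an errpt timestamp (MMDDhhmmYY) as YYYY-MM-DD hh:mm.
# errpt_time is a hypothetical helper; the century is assumed to be 20.
errpt_time() {
    printf '%s\n' "$1" | awk 'length($0) == 10 {
            printf "20%s-%s-%s %s:%s\n", substr($0, 9, 2), substr($0, 1, 2),
                substr($0, 3, 2), substr($0, 5, 2), substr($0, 7, 2)
            ok = 1
        }
        END { if (!ok) exit 1 }'
}

# Print the TIMESTAMP field of every SYSDUMP entry in errpt output on stdin
# (columns: IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION).
sysdump_times() {
    awk '$5 == "SYSDUMP" { print $2 }'
}
```

    For example, errpt | sysdump_times lists the raw timestamps, and errpt_time 0614145601 prints 2001-06-14 14:56.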

    Run the following command to verify the status of the last system dump:

    # sysdumpdev -L
    0453-039
    Device name: /dev/hd6
    Major device number: 10
    Minor device number: 2
    Size: 63952384 bytes
    Date/Time: Thu Jun 14 14:43:11 CST 2001
    Dump status: 0
    dump completed successfully
    Dump copy filename: /var/adm/ras/vmcore.0
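
    The two fields worth checking in that listing are Dump status (0 here, reported as "dump completed successfully") and Dump copy filename. A sketch that extracts them from sysdumpdev -L output on stdin, assuming the "Label: value" layout shown above. dump_summary is a hypothetical helper, and its exit codes (1 if no status line is found, 2 for a non-zero status) are just a convention chosen for this example.

```sh
# Minimal sketch: summarize sysdumpdev -L output read on stdin.
# Prints status=<n> and copy=<file>; exits 1 if no "Dump status:" line is
# found and 2 if the status is non-zero. dump_summary is a hypothetical helper.
dump_summary() {
    awk -F': *' '
        /^[ \t]*Dump status:/        { status = $2 }
        /^[ \t]*Dump copy filename:/ { copy = $2 }
        END {
            if (status == "") exit 1
            print "status=" status
            if (copy != "") print "copy=" copy
            if (status + 0 != 0) exit 2
        }'
}
```

    On the AIX host: sysdumpdev -L | dump_summary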

    Run the crash command to get a basic idea of the possible reasons for the system dump.

    The crash subcommands trace -k, thread -r and status 0 give a hint about the origin of the problem:

    # cd /var/adm/ras
    # crash vmcore.0
    Using /unix as the default namelist file.
    > trace -k
    STACK TRACE:
    0x2ff3b400 (excpt=edffff54:40000000:00001004:edffff54:00000106) (intpri=0)
    IAR: .remove_e_list+38 (00032888): tweqi r7,0x0
    LR: .e_block_thread+40c (00034424)
    2ff3b010: .e_sleep_thread+4c (0003497c)
    2ff3b060: .[nspdd]+4144 (016ba4e4)
    2ff3b100: .[nspdd]+2de4 (016b9184)

    > q        ; quits the crash command

    =================================================================

    In this case, trace -k shows a problem in nspdd, which is part of the TSP driver.

    thread -r and status 0 both point to the application process pltDc as responsible for the crash (it was the last process that ran).
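
    When a trace -k listing is long, it helps to pull out the names that crash prints in brackets, such as .[nspdd]+4144 above. A minimal sketch, assuming a bracketed name marks an address that crash could only place inside a loaded module (here nspdd) rather than at a named symbol. trace_modules is a hypothetical helper, and it reads a saved copy of the trace output on stdin.

```sh
# Minimal sketch: list the bracketed names in crash "trace -k" output read
# on stdin, e.g. ".[nspdd]+4144" -> "nspdd". trace_modules is a hypothetical
# helper; it assumes each bracketed name is the module crash placed the address in.
trace_modules() {
    # Print only lines containing ".[name]+", reduced to "name", then de-duplicate.
    sed -n 's/.*\.\[\([^]]*\)]+.*/\1/p' | sort -u
}
```

    For the trace above it prints a single line: nspdd.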

    The dump file (vmcore.0) can be copied to a CD and sent to IBM for further analysis.