Dump Diagnosis - IBM

Revision History

Revision  Date       Description
0                    Initial Publication
1         4/25/13    Infinite loop detection section updated to change references to R14 to R15
2         4/25/2021

Problem Diagnosis

Suppose you have an application that is dumping. You might not know the application. How can you use the z/TPF debugger and associated tools to determine what and where the problem is? In this discussion, we will explore how to investigate dumps and problems by type. The techniques described for a given type of dump can often be applied to other types of dumps or debugging scenarios.

The following techniques apply to production (optimized) or test (non-optimized) level code.

CTL-3, CTL-4, SERRC, SNAPC, and other dumps

Many dump types fall into this category, and investigating each type requires specific knowledge of the application, the system, and so on. The tools and techniques presented here can be used generically on many dump types.

TPF Dump Viewer

The Dump Viewer is not a traditional textual z/TPF dump. The dump viewer is a debugger-like interface for viewing the contents of a dump. The dump viewer data is collected at the same time as a traditional z/TPF dump but is written to a file and is specific to the kind of data that would be viewed while using the debugger.

There are limitations to the dump viewer:

Certain system locks must not be held in order to capture the dump.

The dump must be an application dump (system dumps (CPS0) are not captured).

The dump is a static view; that is, you cannot step, set breakpoints, change anything, and so on.

The dump viewer has some distinct advantages over reading traditional dumps:

The dump viewer is especially useful for C/C++ code with the ability to use the variables view to see all C/C++ variables at a glance. You can click through the stack frames and see C/C++ variables on previous stack frames.

The dump viewer provides the ability to apply XML maps in the memory views of given data areas to make it easier to read the data in the memory of the application.

Most debugger views will work as normal such as the SW00SR view, DETAC view, DECB view, TPF malloc view and so on, which could be difficult or impossible to view in a traditional z/TPF dump.

These dumps are portable for viewing from z/TPF system to z/TPF system because the program attribute table (PAT) entries, database definition (DBDEFs), and so on are completely copied to the dump file.

Complete the following steps to view a dump:

1) In order to capture a dump for viewing in the dump viewer, dump capturing must be activated. Enter zaser dumpon dbug to activate dump capturing.

2) Start the application that will dump. Note the message that states the dump file has been written out.

Figure : The z/TPF console shows the DDMP0008I message, which indicates the dump was captured.

3) Enter the ZDDMP command to manage the dumps online.

4) The dump viewer subsystem in the TPF Toolkit can be used to see details or start a dump viewer session. Start the TPF Toolkit, switch to the Remote System Explorer Perspective, and switch to the Remote Systems view. Expand your TPF connection, expand the TPF Dump Viewer Subsystem, and right click on a dump. You can view details about the dump (similar to the ZDDMP command), delete the dump, or debug the dump. Select Debug to debug the dump.

Figure : Right clicking a dump in the TPF Dump Viewer Subsystem

The debugger interface starts, providing a debugger-like view of the dump. Notice that you can see source code, the stack frame, variables, ECB view, SW00SR view, and so on just as you would in a debugger session. However, you cannot modify any values, set breakpoints, or issue any execute actions.

Figure : Dump Viewer session

A couple of final notes on the dump viewer: Enter zaser dumpoff nodbug to turn dump capturing off. Also, a friendlier table view of dumps is available by right clicking the filter under the TPF Dump Viewer subsystem and choosing show in table. The same information is also shown in the Properties view.

Figure : TPF Dump Viewer in a table view.

Stack corruption

Many dumps (OPR-3, OPR-4, and so on) and incorrect application behavior are the result of stack corruption. The following features are available in the debugger and dump viewer except where noted.

Note that you can click through the stack frames in the debug view and see the variable values and source locations at those previous stack frames. You may notice that a character array containing a valid string appears to spill over into other variables in your stack. This can be an indication that your application is mishandling that string variable (sprintf, loops, and so on).

You can see more details about a given stack frame by opening the properties view (from the windows menu select show view->general->properties) and clicking on a stack frame:

Figure : Properties pane showing details about a stack frame

You can also see the contents of the stack frame. Right click on a stack frame and choose map memory element to open an XML map of the stack frame in the memory view.

Figure : Map Memory Element

Figure : Stack frame content

A couple of things to take notice of in the stack frame:

Register 14 (R14) is the typical return address register in the z/TPF system. However, if R14 points into CPS0, it is likely a C/C++ cross module call and the return address is found in CRET.

A bad back chain pointer (BCH) often indicates that the application is overwriting the stack by way of a memcpy, MVC, and so on.

This tip works frequently. Look at the stack contents rendered in EBCDIC or ASCII for a text string. Try doing a grep for that string in your application code. sprintf and similar functions are often the cause of stack corruption and this approach has been used to solve many of these types of dumps.

Another approach is to examine the contents of the entry control block (ECB) trace for function and macro parameters and return values that point into the stack address range as they may be the cause of the stack corruption.

The ECB trace can tell you what the ECB was doing recently. It will show you the macros and functions called as well as parameters passed in and values returned. The ECB trace is available while viewing dumps through the debug console command ECBTrace. A variety of other debug console commands are available.

Figure : ECB Trace

If you are using the debugger and know that a particular stack address will become corrupted (such as the back chain pointer or a variable such as i), you can use the storage change breakpoint support to stop the debugger when the change occurs. Right click in the breakpoints view and choose add breakpoint->watch.

Figure : Creating a watch breakpoint

Enter an address such as 0x1234, a pointer expression, or the address of a variable such as &port. You can set a watch breakpoint to stop only when the storage location equals a particular value or occurs in a particular address range.

Figure : Creating a watch breakpoint based on a variable.

The debugger will stop at the next source line or instruction after the source line or instruction that modified the storage.

Figure : Debugger showing the application stopped after the variable changed.

You can also set watch breakpoints on the contents of a register. One way stack corruption can exhibit itself is:

The application calls a function, module, and so on.

The application corrupts the stack frame of the caller.

The application returns to the caller.

The registers are restored from the corrupted stack frame, causing the contents of the registers to become values that will cause a dump and so on.

The debugger can help diagnose this situation when a register will be set to a predictable value. For example, suppose an application will set R7 to the value 1234.

Figure : Creating a watch breakpoint on a register.

Run the application until the watch breakpoint hits. It may be that the debugger will show the application stopped in a strange location. This occurs because the application is stopped in linkage code which may not translate to a nice source line. Right click in the source file and choose switch view->show disassembly.

Figure : Switching to disassembly view.

As before, the application is stopped at the instruction after R7 became the desired value.

Figure : Showing application stopped after the instruction that caused the desired watch breakpoint to hit.

Debugger Run to System Error

For many application dumps, the debugger can detect the dumping condition and show you where the error is occurring. To use this feature:

1) Start the debugger on the ECB that will dump.

2) Clear or disable all breakpoints.

3) Set the application running by clicking the Resume button.

4) When the dump occurs, a pop-up box will appear with three choices:

Examine will be the most commonly used choice. It allows you to open views and even change the state of the application and retry the system error (testing a fix, avoiding the dump, and so on). This is called system error retry.

Step into will cause the application to step by one source line. This is only useful if the application is not going to exit (for example, in the case of SERRC_RETURN).

Run will cause the application to go into run mode. It will run to the next breakpoint set in the application. This is only useful if the application is not going to exit (for example, in the case of SERRC_RETURN).

Figure : System Error events in the debugger.

If you choose to examine the state of the application, you can avoid the system error in a couple of different ways.

Use Jump to location to jump over (bypass) a line causing an error.

Figure : Using jump to location

Modify the registers, variables, or memory that is causing the error. If you are debugging assembler code, simply execute the instruction again. If you are debugging C/C++ code, use jump to location on that source line and execute the line again (this ensures all instructions backing that source line are executed correctly).

Debugger Register by System Error

Register by system error allows you to start the debugger at the location of the system error without having to run to the location of the dump. This is very useful if you don’t know the code path of your application but know a dump will come from the application when an action is performed. For example, suppose you do not know the entry point of your application but know that one or more ECBs will be created (for example, CREMC), one of which will dump. With register by system error, the debugger will start in the dumping ECB at the location of the dump.

You can specify module and part with wildcards to limit what dumps will start the debugger session. There is no performance impact when using this type of registration.

There are limitations to this feature. The registered dump must occur in the application and not in the system code (CPS0). OPR dumps, SERRC dumps, SNAPC dumps, and other dumps can be specified.

Do the following to use register by system error:

1) Create a debugger registration entry for the dump number (do not include the dump prefix). Here I have registered an OPR-4 for any program and any ECB started from any terminal.

Figure : Register by system error for an OPR-4

2) Right click and register the debugger.

3) Start your application to cause the dump.

4) The debugger will start and show you the location of the dump. This is a live debugger session similar to running to the location of the dump discussed above. The first ECB claims the registration entry, and any other ECB on the system can execute the dumping code without interference from the debugger. The application will behave as if the dump has not occurred yet. Technically, the dump has occurred and dump processing was intercepted. If you step, run, and so on, the dump will occur. You can use the system error retry features discussed above.

Figure : Debugger started by register by system error

CTL-75 heap corruption

A CTL-75 dump indicates that heap (malloc) corruption has occurred. In a z/TPF system, this dump is typically issued when the 0xFFFFFFFF FFFFFFFF fence field immediately following an allocated malloc block is corrupted. However, the corruption is detected only when the malloc block is freed, not when the fence is overwritten. CTL-75 dumps occur in the control program and, as a result, you cannot run the debugger to the dumping location or use register by system error for these dumps. The TPF Malloc view can be used to locate corruption of malloc blocks.

TPF malloc view

The malloc view is made up of four panes that can be individually hidden by the buttons in the upper right corner of the view. The inuse and free panes show the malloc blocks that are in use or free, respectively (malloc blocks that are free can be reused such that they are removed from this view). The changed pane shows the changes in malloc since the last refresh. The malloc view can refresh automatically on each step or be set to refresh only when the refresh button is pressed. The selected block pane shows additional information about a malloc block that is selected in one of the other panes, such as the program that performed the malloc or free. The columns in the view can be rearranged, sorted, and hidden.

If the corruption column is shown in the Malloc view, the corruption detection will be performed. Malloc entries that are corrupted will appear in the changed pane at all times.

You can also do actions like go to address to view the malloc block in the memory view.

Figure : Debugger showing the TPF malloc view, the corrupted fence field and the instruction that corrupted the fence.

Note that using corruption detection in the TPF Malloc view might impact debugger performance.

perform heap check on stop

While the TPF malloc view is a great way to learn about your malloc blocks and effectively shows you what corruption has occurred, it cannot indicate when that corruption occurred. The perform heap check on stop feature tells the debugger to detect any heap corruption whenever the execution of the application is stopped. When heap corruption is detected, a pop-up window is displayed indicating that corruption has been detected.

However, you must step or run the application such that the application is periodically stopping. For example, step over function calls to see which function calls cause the heap corruption, or step into each line in the code to see which line causes the heap corruption. You cannot set the application running to completion and see a pop-up window when the corruption occurs. Another option is to use the animated step into feature to have the debugger automatically step through your application for you. However, you must watch for the pop-up box and turn off the animated step into feature manually.

To turn on the perform heap check on stop feature, right click in the stack frame and choose perform heap check on stop.

Figure : Debugger turning on perform heap check on stop.

When corruption is detected, as in this case where a step into each line occurred, a pop up appears like this:

Figure : Debugger showing the pop up stating malloc is corrupted.

You can then use the TPF malloc view to learn more about this malloc entry. Note that using perform heap check on stop may impact debugger performance.

heapcheck mode

CTL-75 dumps occur in the control program and, as a result, you cannot run the debugger to the dumping location or use register by system error for these dumps. However, if your heap corruption is writing past the fence (a typical case), you can use the heapcheck system feature in conjunction with the debugger to quickly locate the problem code. Heapcheck mode causes every malloc to use at least one 4 K frame, the malloc area with the fence is located at the end of the 4 K frame, and the next 4 K frame is invalidated. When the application writes past the fence while corrupting the malloc buffer, the application will start to write over the invalid frame and an OPR-4 will occur. The application must write beyond the fence because overwriting the fence is not enough to cause the OPR-4. As a result, you can debug the application, clear the breakpoints, and run to the OPR-4. Alternatively, you can register the OPR-4 in the system error registration. Use the following commands to view and change the heapcheck state on the system.

zstrc disp - displays the current setting for heapcheck mode.

zstrc alter heapcheck - turns on heapcheck mode.

zstrc alter noheapcheck - turns off heapcheck mode.

CTL-10, infinite loop detection, infinite loop investigation

Debugger infinite loop detection

CTL-10 dumps occur in the control program and, as a result, the debugger cannot stop the application at the location of the error. The debugger attempts to do infinite loop detection. However, the application must periodically stop in order for the debugger to perform its detection. This is because the debugger attempts to allow the application to run as fast as possible to provide the optimal debugging experience. As a result, the infinite loop detection cannot occur without you setting breakpoints or stepping of some sort.

The debugger attempts to make you aware of dumps when the application dumps. Unfortunately, it cannot do this in all cases (especially when a dump occurs from the control program). However, CTL-10 is a good example of when the debugger is able to make you aware of the application dumping at exit. (For this case, I started my infinite looping application, cleared my breakpoints, and chose resume to run the application to completion. The following pop-up then appeared.)

Figure : Debugger showing a CTL-10 dump after exiting.

As previously described, breakpoints need to be created that encompass the application range where the infinite loop is occurring. In this case, the application is only in a single module. However, this technique can be used across several modules if needed. Enter ZDMAP to determine as low an address as possible and as high an address as possible. Entering ZDMAP A-XXXX, where XXXX is the address in R15, might be a good way to narrow in on a module to create breakpoints around. Notice that the value in R15 in the figure above falls into the range of QDB0 in the figure below.

Figure : ZDMAP on module causing the CTL-10

This gives us an address range of: 409A1AC50 to 409A1AC50 + AAE4 (409A25734). Now start the debugger on your application and use these two addresses to create address breakpoints.

Figure : Creating address breakpoints in the debugger.

Figure : Entering the address for an address breakpoint in the debugger.

Figure : Debugger with two address breakpoints at either end of the module causing our CTL-10.

Infinite loop detection is controlled by a time out that you can set. The default setting is 30 seconds. You can use the TPFTimeout debug console command to shorten the time you will need to wait. In the following screen shot, it is set to 3 seconds.

Figure : Setting the TPFTimeout in the debug console in the debugger.

Next push the resume button and wait the specified number of seconds. A pop up will appear indicating that a possible CTL-10 has been found. The debugger will show you the current stopping location for you to investigate. You can continue to debug as normal or press the resume button to run to the next possible infinite loop detection point.

Figure : Debugger showing a time out occurring.

ECB Monitor

Suppose you have a long running ECB on the system. You can enter ZDECB on the console to learn about where that ECB is and enter ZDECB TR to view the current ECB Trace for that ECB. The ECB monitor feature was designed to allow you to learn more information about a long running ECB. For this example, an application is set to run that will likely exit rather quickly. Entering ZDECB 0, you can see the application has been running for 2 minutes and 46 seconds.

0F384000*HPN 1 QDB0JW QDB0 2 46 1052 1 0 355401 DLAYC

Enter zdecb TR 0F384000 to see what it has been doing recently in the ECB trace:

TR GROUP LOADMOD LOADSET OBJECT NAME PSW IS OBJ DSP FUNCTION CALL OR MACRO TIMESTAMP
IBM_DEFT CTAL LOADSET-BASE OBJECT-cdlayc
_ 1 52 call dlayc(void) C5CE06CD E1676C80
  from QDB0JW OBJECT-qdb0er OFFSET-BDC
1 6C return(void) errno=00000000 C5CE06CD E1673740
  from dlayc
_ 1 52 call dlayc(void) C5CE06CD E1672600
  from QDB0JW OBJECT-qdb0er OFFSET-BDC
1 6C return(void) errno=00000000 C5CE06CD E166F440
  from dlayc
_ 1 52 call dlayc(void) C5CE06CD E166E280

From this display, it is clear that the application is in a loop calling the dlayc macro and does not appear likely to exit. This is a perfect candidate for using the ECB monitor to better understand why the application is looping. The ECB monitor subsystem shows a list of all ECBs running on the system. Clicking on an ECB will show additional information for that ECB in the Properties pane. To get a better view, right click on the filter and choose show in table.

Figure : ECB Monitor list of ECBs

Then right click on the entry in the tree or the table and choose snapshot:

Figure : ECB Monitor snapshot action in table

Figure : ECB Monitor snapshot action in tree

If this window appears, ensure the IP address is the IP address of your workstation.

Figure : ECB Monitor setting the workstation IP address correctly

The ECB monitor session is very similar to a dump viewer session. The snapshot action causes the system to be paused, the selected ECB goes through the dump viewer dump collection processing and the system is returned to running state. Here is the result of the snapshot. Notice the application is stopped just after the DLAYC macro call in the CTAL dlayc() library function.

Figure : ECB Monitor Session

Click back through the stack to examine your application and use all the features available during a dump viewer session:

Figure : ECB Monitor examining previous stack frames

You can take additional snapshots and compare the state of the two by switching back and forth between the two viewing sessions. Simply take another snapshot and a second session will open (notice that there are two sessions).

Figure : Two ECB Monitor sessions open

You can switch sessions just like in the debugger by clicking a stack frame in the other session.

Figure : Working in two ECB Monitor sessions

A significant difference between dump viewer sessions and ECB monitor sessions is that when you click the terminate button, the ECB monitor session data will be deleted (a dump viewer file will persist). ECB snapshots can be saved by using the ECBSNAPSHOT IMPORT debug console command.

Figure : Saving an ECB Snapshot

This file will now be viewable through the dump viewer and will persist as if it is a dump. The dump viewer files are shown here in a table. Notice that the dump type is ISNPSHT, which indicates this file is a saved ECB monitor session. To view it, open it just like you would a dump viewer session.

Figure : ECB Snapshot saved and available in the list of dumps.

Memory Leaks

The z/TPF debugger provides a few features to help identify memory leaks in the application. However, they do require that you do some investigating because the debugger cannot determine when a malloc block is no longer used. The ECBHEAP debug console command allows you to gather information regarding the use of heap by the application.

The ECBHEAP STATS debug console command shows how much memory is in use and what types of memory are in use. In the figure that follows, notice that there is no 64 bit memory that is in use.

Figure : ECBHEAP STATS example in the debugger

The ECBHEAP CNTS [SORTCNT] debug console command provides the counts of all malloc entries based on size. It can sort based on size or based on the number of malloc entries of a given size.

Figure : ECBHEAP CNTS example in the debugger

One way to use this feature is to step over a function, perform some action, and so on, and then look at the ECBHEAP counts to see what has changed. Make note of what memory sizes are not getting freed. Use the TPF malloc view to choose an entry of a given size and use the selected block pane to learn what code is allocating malloc blocks of that size.

Another thing to look at is which part of the application is using the largest blocks of memory. Use the malloc view to examine the malloc blocks further (for example, sort the malloc view data by size and look at the largest blocks).

Figure : Using the ECBHEAP and TPF Malloc view together.

The information provided by the ECBHEAP command is more useful than it might appear in helping to know where to begin looking for malloc leaks and has been used many times in the z/TPF lab to do so.

Determining Code Path

You can also use trace log, the code coverage tool, or the debugger to learn about the code path that leads to your dump. For example, register the code coverage tool, trace log, or other tools and run your application. You will then have the information to understand the path your application took to get to the dumping location in your application. See the Determining Code Path article for more information.

Conclusion

The debugger provides a set of tools useful in problem diagnosis, including the dump viewer, ECB monitor, TPF Malloc view, and so on. In addition to all of the debugger techniques I have shown, you can also read traditional textual TPF dumps for clues on how to use the debugger to locate the problem in your code.
