SCSI Timeout


  • 8/7/2019 SCSI Timeout

    Backup Exec for NetWare, Backup Server Hangs, and SCSI Device Time-outs

    SCSI Device Time-outs

    A SCSI bus time-out occurs when a command is issued to a SCSI device, which then disconnects from the SCSI bus to allow other devices to use the bus while it does its job, but then fails to reconnect at the end of the operation. To detect these problems, Backup Exec, along with the SCSI and ASPI specifications, defines time limits for each possible tape drive operation. Time-out values are calculated so that the device should be able to complete the given task, including retry operations. If the tape drive takes longer than expected, a SCSI time-out error is generated. It appears as a Red Box and provides error information, including sense data reporting all zeros. No sense data has actually been returned to the controller; the controller reports what it has been given, which is no data, and that translates to all zeros. Sense data is usually generated by the tape drive to give some indication of what the problem is, but in the case of a time-out the tape drive is not there to give sense data.
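The all-zero sense data described above can be recognized mechanically. The following is a minimal sketch only: the function name and the representation of sense data as a byte buffer are assumptions made for illustration, not part of Backup Exec or of any ASPI interface.

```python
def looks_like_timeout(sense_data: bytes) -> bool:
    """Return True when a sense buffer carries no information at all.

    An empty or all-zero buffer suggests that the drive never
    reconnected to report a condition, which is the time-out
    signature described in the text.
    """
    return not any(sense_data)


# Hypothetical 18-byte buffer of zeros: nothing was returned by the drive.
print(looks_like_timeout(bytes(18)))
# Hypothetical buffer with non-zero bytes: the drive reported something.
print(looks_like_timeout(bytes([0x70, 0x00, 0x03] + [0] * 15)))
```

A buffer with any non-zero byte means the drive was present to report a condition, so the error should be investigated as something other than a time-out.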

    Backup Server Hangs

    The first sign of a time-out is that the tape drive ceases activity (the lights go off). Because some operations can take several hours on some tape drives, some time-out values are very long. This can lead to the perception that the backup job is hung. Some of the time limits are measured in days. A backup which has appeared hung for 2 or 3 hours, or even longer, with no tape drive activity, has simply not yet reached its time-out limit. If this is truly a time-out issue, an error message will eventually appear indicating that a SCSI bus time-out has occurred, but this could conceivably take several days. Attempts to abort a backup job in this hung state will cause the Backup Exec Job Manager to indicate that it is aborting, but there will be no physical indication of an abort.

    To close the channel and abort the job, commands are sent to the tape device(s) indicating intent to close connections. The backup software then waits for an acknowledgment. If the ASPI manager, SCSI controller driver, SCSI controller itself, SCSI bus, or tape drive is not participating in the conversation, the attempt to close will not complete. The abort message is waiting to be serviced, but unless the tape drive can successfully reconnect to the bus, it will never get the command and begin the abort process. The controller will continue to wait for the tape drive to reconnect or for the time-out value to expire, at which time it will cause an abort due to the time-out. The abort command sent through the program will be discarded. Novell will not allow the backup engine to unload without closing this connection, as this would leave the operating system in a questionable and unreliable state, which is not in the best interest of the backup server.

    The immediate priority in this case becomes stopping the hung backup job. Choose as good a time as possible to bring the server down normally, using the down and exit commands. To prevent the job from restarting when the server is rebooted, remark out the BESTART command in the AUTOEXEC.NCF prior to issuing the down command. During the process of downing the server, NetWare will display messages about open files and connections, and may prompt for confirmation to clear these connections. NetWare should be allowed to clear these connections. Downing the server in accordance with Novell's guidelines should pose no threat to the file system or to the backup software. Once the server is restarted without the backup engine running, the job(s) can be put on hold using any one of the client modules. BESTART can then be run without a backup job restarting. Control of the system, and decisions regarding how and when to troubleshoot and make adjustments, are returned to the system administrator. It is important to note that this step may relieve some uncomfortable side effects but is not a resolution to the underlying problem. The remaining time-out troubleshooting recommendations should still be examined.
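Remarking out the BESTART command can be illustrated with a short sketch. It assumes that the AUTOEXEC.NCF contents are available as text, that the launch line begins with BESTART (optionally with a volume or directory prefix), and that a REM prefix disables the line. The function name and the sample file contents are hypothetical.

```python
import re


def remark_out_bestart(ncf_text: str) -> str:
    """Prefix any line that launches BESTART with REM.

    Lines that are already remarked out, and all other lines,
    are returned unchanged.
    """
    out = []
    for line in ncf_text.splitlines():
        words = line.split()
        # The command is the first word; drop any volume or directory prefix.
        command = re.split(r"[\\/:]", words[0])[-1].upper() if words else ""
        if command in ("BESTART", "BESTART.NCF"):
            line = "REM " + line
        out.append(line)
    return "\n".join(out)


# Hypothetical AUTOEXEC.NCF fragment.
sample = "MOUNT ALL\nBESTART"
print(remark_out_bestart(sample))
```

The same edit is easily made by hand in an editor; the sketch only makes explicit that nothing but the BESTART line changes, so the command can be restored later by removing the REM prefix.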

    Note: Some conditions may require cycling power to both the tape drive and the server in order to clear them completely.

    One other cause of a backup job appearing to be hung must be mentioned: loss of connection to a remote server. This is a separate issue not addressed in depth in this document. However, loss of connection to a remote server should be eliminated as a possible cause of failure before beginning to troubleshoot the issue as a SCSI time-out. This can be accomplished by first making sure all servers have their own separate backup job scheduled. Then, if the problem still exists, try arranging the backup jobs so that the local backup server backs itself up at the approximate time at which the failure occurs. If the error disappears, or if it still happens during the backup of that remote server, then there was or is a connection problem with that server; check the online technical support material that discusses loss of connection to a remote server, or contact technical support. However, if the hang now occurs during the local server's backup, the failure can be assumed to be a SCSI time-out, and the troubleshooting in this document provides a stepwise approach toward its resolution.
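The elimination test above reduces to a small decision rule. The sketch below is only an illustration of that reasoning; the function name, the argument values, and the returned labels are assumptions.

```python
from typing import Optional


def triage(hung_during: Optional[str]) -> str:
    """Classify the result of rescheduling the local backup.

    hung_during is "local" if the hang occurred while the backup server
    was backing itself up, "remote" if it occurred during a remote
    server's backup, or None if the hang did not recur.
    """
    if hung_during == "local":
        return "scsi-timeout"
    if hung_during is None or hung_during == "remote":
        return "remote-connection"
    raise ValueError("hung_during must be 'local', 'remote', or None")


print(triage("local"))
print(triage("remote"))
print(triage(None))
```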

    Recommendations

    Recommendations fall into four categories: Backup Exec's Configuration for communications with the controller, Operating System Configuration, SCSI ID and Controller Configuration, and Hardware Issues.

    Of these, the first three may improve communication timing for the drive, but later changes, such as a change in the load order of NLMs or adding other devices to the bus, may cause the problem to return.

    The heart of this problem is a hardware issue. The drive could be beginning to develop heat sensitivity problems. There may be bus related problems having to do with traffic, termination, or cabling that make the tape drive's tasks impossible to complete. There may be firmware issues at the tape drive and/or controller level. Even a new drive can be found to be functioning out of tolerance, completing the task too slowly, or getting lost in the process. Because the underlying issue is hardware related, altering the Backup Exec Configuration, Operating System Configuration, or the SCSI ID and Controller Configuration will have a limited impact. There is only one setting to be changed in the backup software to help relieve this problem, and that is still an attempt to improve the communication timing to the hardware. Nevertheless, these changes may make the difference and are the least expensive and easiest to implement. The recommended changes are presented here as a stepwise outline for quick reference. For readers who desire a more technical explanation, there is a detailed explanation in the attached document, which may be read with the same Acrobat reader used to view the online manuals. Some of the recommendations made in the section addressing hardware are directed at and designed for the most common implementation of the SCSI bus, the Single Ended bus, and must be adjusted if the implementation is using a Differential SCSI bus.

    Before making any changes, the current configuration should be documented. To do this, the server should be in its normal state (no hung jobs) and the backup software should be running. At the server console prompt, type load BEDIAG. This will create a file, BEDIAG.FAX, which can be printed from a workstation. This file contains information about the server configuration files and the SCSI devices in the system and may be of use during the implementation of these recommendations.

    (begin table)

    Backup Exec Configuration
    1. Edit the BESTART.NCF and make the following changes:
       a. For version 7.0x, add a /s switch to the end of the LOAD AD_ASPI line.
       b. For version 7.11, add a -s switch to the end of the LOAD BKUPEXEC line.

    Operating System Configuration
    1. CONFIG.SYS, if present, should only contain lines for files and buffers.
    2. AUTOEXEC.BAT should contain only Echo, Prompt, Path, CD and SERVER.EXE.
    3. The order of the STARTUP.NCF file should be:
       a. Set Statements
       b. Controller Drivers - the tape drive's controller driver loading first unless memory recognition problems postpone loading it. See the Memory Configuration document attached as a separate PDF file if memory configuration is an issue.
       c. Name Spaces
       d. Novell Patches

    SCSI ID and Controller Configuration
    1. Set the tape drive SCSI ID to a favorable setting, starting at ID 2.
    2. Set INITIATE SYNC NEGOTIATION for the ID of the tape drive to off.
    3. Set INITIATE WIDE NEGOTIATION to off for controllers with a wide bus.
    4. Set MAXIMUM SYNC TRANSFER RATE to the slowest possible setting.
       **** Note **** If these 4 steps helped, read that section in the attached PDF document.
    5. DISCONNECT must be enabled for all tape drive SCSI IDs.
    6. Set SCSI PARITY CHECKING to off. ***** WARNING ***** The Parity Checking adjustment is for diagnostic use only and should never be left disabled during production backups!!! Read the detailed explanation!!! Seagate Software can not be held liable for data corruption due to parity checking being disabled.

    Hardware Issues
    1. Check with the controller and tape drive vendors for firmware and driver updates.
    2. Use good quality cabling that follows the SCSI specification.
    3. For external cables, use heavy duty shielded cables.
    4. Under no circumstances should an external SCSI cable be detached from any connection while power is still applied to any device on the bus.
    5. For Single Ended internal cables, be sure there is at least 1 foot of cable between devices.
    6. Also, for Single Ended buses, be sure the overall cable length from termination to termination does not exceed the maximum bus length of 3 meters (roughly 9 to 10 feet).
    7. Use active termination on both ends of the SCSI bus.
    8. All external devices should be set to supply termination power.
    9. Put the tape device on a separate controller by itself or change the existing controller.
    10. Replace the tape drive.

    (end table)

    Backup Exec Configuration

    Backup Exec normally tries to work with the controller in Asynchronous mode to achieve greater throughput, but occasionally the controller does not perform up to expectations in this mode. Adjusting Backup Exec to use Synchronous mode with the controller will in many cases relieve the problem. The speed benefits of Asynchronous communications are forfeited: throughput will drop slightly and the backup will take slightly longer. However, the controller will operate more reliably in this mode and there will be less chance of a time-out.

    Operating System Configuration

    Drivers loaded from the DOS partition interfere with NetWare drivers and will cause conflicts. A CONFIG.SYS file is not required to start a NetWare server. If a CONFIG.SYS file is present, its only contents should be the statements for files and buffers. All other statements should be remarked out by placing the abbreviation REM before the statement. Another option is to rename the file to CONFIG.OLD so that it will not be executed during system restart. Either method retains the original content should the file be needed at a later date.

    The AUTOEXEC.BAT file should contain no other active statements. Echo, Prompt, or Path statements, the line(s) to change directory to the location of SERVER.EXE, and the SERVER.EXE line itself may remain. All other statements should be preceded with REM.
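The remark-out rule for the DOS startup files can be sketched as a small filter. This is an illustration under stated assumptions: the function name, the way the keyword is extracted from each line, and the sample CONFIG.SYS lines are all hypothetical.

```python
import re


def remark_out(lines, allowed):
    """Prefix with REM every active line whose keyword is not allowed.

    Blank lines and lines that are already remarked out are kept as-is.
    """
    out = []
    for line in lines:
        stripped = line.strip()
        # Split on whitespace or '=' so that FILES=40 yields FILES.
        keyword = re.split(r"[\s=]", stripped, maxsplit=1)[0].upper()
        if stripped and keyword != "REM" and keyword not in allowed:
            line = "REM " + line
        out.append(line)
    return out


# Hypothetical CONFIG.SYS contents.
config_sys = ["FILES=40", "BUFFERS=30", "DEVICE=C:\\DOS\\HIMEM.SYS"]
for line in remark_out(config_sys, {"FILES", "BUFFERS"}):
    print(line)
```

For AUTOEXEC.BAT, the same function could be called with an allowed set covering the Echo, Prompt, Path, CD, and SERVER statements.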

    The STARTUP.NCF is an important configuration file for NetWare. The order in which things happen may play an important part in resolving time-out issues. The order of the STARTUP.NCF file should be:

    Set Statements - set up the environment before any drivers or NLMs are loaded.

    Controller Drivers - the driver for the tape drive's controller should load first, unless for memory recognition purposes the drivers are loaded in the AUTOEXEC.NCF. This puts the drivers low in the memory stack. To determine if all of the memory is recognized by NetWare before the drivers are loaded, refer to the Memory Configuration section at the end of this document.

    Name Spaces - examples are MAC.NAM and OS2.NAM.

    Novell Patches - the most recent Novell patch sets should be used. By default the patch sets add 4 lines at the top of the STARTUP.NCF. The result is that nearly 30 NLMs get loaded prior to the set statements and the controller drivers, pushing these further up the memory stack and making it slower to find them.
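The recommended STARTUP.NCF ordering can be expressed as a simple check. The sketch assumes each line of the file has already been classified by hand into one of four category labels; the labels and the function name are hypothetical, and no attempt is made to parse a real STARTUP.NCF.

```python
# Categories in the order recommended above (labels are illustrative).
RECOMMENDED_ORDER = ["set", "controller_driver", "name_space", "patch"]


def is_recommended_order(categories):
    """Return True if the categories appear in the recommended sequence.

    Several consecutive entries of the same category are allowed.
    """
    ranks = [RECOMMENDED_ORDER.index(c) for c in categories]
    return ranks == sorted(ranks)


# Patches loaded at the top of the file (the default described above).
print(is_recommended_order(["patch", "set", "controller_driver", "name_space"]))
# The recommended arrangement.
print(is_recommended_order(["set", "set", "controller_driver", "name_space", "patch"]))
```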

    One special case: Compaq controllers have an ASPI manager (CPQSASPI.NLM) that should be loaded as early in the AUTOEXEC.NCF as possible. The load cpqsaspi statement should be right after the IPX Internal Net statement.

    At this point, all of the Operating System Configuration changes have been implemented. If problems persist, the next set of instructions may provide relief.

    SCSI ID and Controller Configuration

    Set the SCSI ID of the tape drive to the most favorable setting. Most controllers reserve ID 7 for the controller, with IDs 0 and 1 reserved for bootable devices (such as hard drives). Of the remaining IDs (2, 3, 4, 5, and 6), ID 2 is best for the tape drive: the ASPI manager will see the tape drive as early as possible without the tape drive occupying one of the IDs reserved for bootable devices. A few controllers, such as some microchannel cards, are a little different. Their ASPI search, and the IDs they reserve for bootable devices, start at the other end of the ID range, so the first ID available for a tape drive would start at 4 and work down to 0. With either kind of controller, IDs 2, 3, or 4 should work. Whether ID 2 or ID 4 is better depends upon the controller.
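The ID preference described above can be sketched as a small selection function. It is illustrative only: the function name and the search_from_high_end flag are assumptions, and the candidate lists simply encode the two search orders named in the text for a narrow bus.

```python
def pick_tape_id(ids_in_use, search_from_high_end=False):
    """Return the preferred free SCSI ID for a tape drive, or None.

    Typical controllers: prefer ID 2, then 3 through 6.
    Controllers that search from the other end: start at 4 and work down to 0.
    """
    candidates = [4, 3, 2, 1, 0] if search_from_high_end else [2, 3, 4, 5, 6]
    for scsi_id in candidates:
        if scsi_id not in ids_in_use:
            return scsi_id
    return None


# Controller at ID 7, bootable hard drives at IDs 0 and 1.
print(pick_tape_id({0, 1, 7}))
# Same bus with another device already at ID 2.
print(pick_tape_id({0, 1, 2, 7}))
```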

    There are some controller settings which may be changed either through the BIOS or through a configuration utility (such as those used on EISA based systems). For the ID on which the tape drive is located, check (and try changing) the following settings.

    Set INITIATE SYNC NEGOTIATION to off. This tells the controller not to initiate negotiation for synchronous transfers between it and the device on that ID. Asynchronous transfers are slower but will often work better when the SCSI bus is not reliable in synchronous modes. This is particularly effective for older SCSI 1 devices, which may not support SYNC negotiation. If turning Sync Negotiation off resolves the problem with a newer device, the device may not be functioning correctly in SCSI 2 mode.

    Set INITIATE WIDE NEGOTIATION to off for controllers which have a wide bus. This negotiation attempts to determine if data will be transferred in 8-bit or 16-bit pieces. Some 8-bit SCSI devices may have difficulty handling this negotiation, causing erratic behavior and hangs. When this setting is off, only 8-bit transfers are possible.

    MAXIMUM SYNC TRANSFER RATE should be set to the slowest setting. On most controllers this is 5 megabytes per second. Some newer controllers use a wide bus, so their slowest setting is 10 megabytes per second. A wide bus moves 2 bytes of data with each bus cycle. When a device designed for a narrow bus is placed on a wide bus, the device is limited to half of the wide limits. Older SCSI 1 devices, or devices not functioning correctly, may not support Fast SCSI data transfer rates, causing erratic behavior or hangs. The Maximum Sync Transfer Rate only comes into effect for Synchronous transfers. Either the controller or the SCSI device can initiate SYNC NEGOTIATION. If SYNC NEGOTIATION is set to off, the controller will not initiate negotiation. If the tape device does not request negotiation either, then neither device will initiate negotiation and the Maximum Sync Transfer Rate setting has no effect. If negotiation is requested by the tape drive, the controller will negotiate no faster than the speed specified by the Maximum Sync Transfer Rate setting.

    If turning off SYNC NEGOTIATION resolved the time-out problem, then adjusting the Maximum Sync Transfer Rate while SYNC NEGOTIATION is on may enable the SCSI device and the controller to work together in Synchronous mode at a reliable speed. If adjusting the Maximum Sync Transfer Rate with SYNC NEGOTIATION on resolves the time-out, one of the following is true:

      The SCSI controller does not work reliably at the faster speed.
      The tape drive does not work reliably at the faster speed.
      The SCSI controller sends faster than the tape drive requested during negotiation.
      The tape drive sends faster than the controller requested during negotiation.
      The cabling, SCSI terminators, and load factors of all devices on the bus limit the transfer rate to a slower speed.

    DISCONNECT must be enabled for all tape drive SCSI IDs so the controller can disconnect from the tape drive to service hard drive access requests.

    Set SCSI PARITY CHECKING to off. ***** WARNING ***** This adjustment is for diagnostic use only and should never be left disabled during production backups!!! Unless this installation ALWAYS uses the VERIFY AFTER setting, if parity is disabled, undetected data transfer corruption is possible and those backups would be unreliable for restore purposes.

    *** NOTE *** Some controllers allow SCSI Parity Checking to be set on a per ID basis. Some controllers must set SCSI Parity Checking for the whole controller card. If the controller allows this setting to be changed on a per ID basis, turn parity checking off for the ID on which the tape drive resides. If parity checking must be set for the whole controller, set it to off only if there are no hard drives on the controller card. If there are hard drives present, and this must be set controller wide, leave parity checking set to on.

    This adjustment is used to determine if the controller and tape drive are having difficulty recovering from a parity error. Disabling parity checking will avoid the problem, but the condition causing the parity error will go unchecked. If this setting is used for diagnostic evaluation, be sure to return parity checking to the enabled state. Seagate Software can not be held liable for data corruption due to parity checking being disabled.
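The interaction between SYNC NEGOTIATION and the Maximum Sync Transfer Rate described earlier in this section can be summarized in a few lines. This is a simplified model for illustration, not controller firmware: the function and parameter names are assumptions, and rates are expressed in megabytes per second.

```python
def effective_sync_rate(controller_initiates, device_requests,
                        max_sync_rate, device_rate):
    """Return the negotiated synchronous rate, or None for asynchronous.

    If neither side starts negotiation, transfers stay asynchronous and
    the Maximum Sync Transfer Rate setting has no effect. Otherwise the
    controller negotiates no faster than its configured maximum.
    """
    if not controller_initiates and not device_requests:
        return None
    return min(max_sync_rate, device_rate)


# SYNC NEGOTIATION off and the tape drive does not ask: asynchronous.
print(effective_sync_rate(False, False, 5, 10))
# SYNC NEGOTIATION off, but the tape drive requests 10: capped at 5.
print(effective_sync_rate(False, True, 5, 10))
```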

    At this point, all of the SCSI ID and Controller Configuration changes have been implemented. If problems persist, the next group of recommendations should be investigated one at a time.

    Hardware Issues

    The firmware of the controller or the tape drive and the version of the controller driver may affect a time-out problem. Since the firmware of both components actually interprets the commands being used, its performance and accuracy can be a big factor. The hardware vendor should be contacted to determine if a firmware upgrade is available for the specific device. In addition, obtain the latest controller driver from the controller vendor.

    Check the cabling to be sure it is good quality and follows the SCSI specification. For external cables, heavy duty shielded cables are best, and the cable should be in good condition, free of abrasions or kinks. Under no circumstances should an external SCSI cable be detached from any connection while power is still applied to any device on the bus. If power is on to any of the devices on the bus when the cable is disconnected, it can cause an electrical arc on the SCSI termination line which can degrade the electronic components on the tape drive, the controller, or any of the other devices on the bus.

    For ALL Single Ended internal cables, measure to be sure there is at least 1 foot of cable between devices (per SCSI specifications). Also make sure the overall cable length from termination to termination does not exceed the maximum bus length of 3 meters (roughly 9 to 10 feet). This includes any bus cabling in any external housing and any bus cabling which may be included in the card itself. This last value can range from 3 inches in standard add-on SCSI adapters up to 18 inches of unseen bus length for a controller that is integrated on the motherboard. The best way to determine this value is to contact the hardware manufacturer.
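The two cable measurements above lend themselves to a quick arithmetic check. The sketch below is a hypothetical helper, not a vendor tool: the function name, the parameters, and the sample lengths are illustrative assumptions, with all lengths in meters.

```python
FOOT_IN_METERS = 0.3048
MAX_SINGLE_ENDED_BUS_M = 3.0       # termination to termination
MIN_SPACING_M = 1 * FOOT_IN_METERS  # between adjacent devices


def check_single_ended_bus(segment_lengths_m, hidden_length_m=0.0):
    """Return a list of problems found on a Single Ended bus.

    segment_lengths_m: cable length between each pair of adjacent devices.
    hidden_length_m: unseen bus length inside the card or external housings.
    """
    problems = []
    for i, length in enumerate(segment_lengths_m):
        if length < MIN_SPACING_M:
            problems.append(f"segment {i} is shorter than 1 foot")
    total = sum(segment_lengths_m) + hidden_length_m
    if total > MAX_SINGLE_ENDED_BUS_M:
        problems.append(f"total bus length {total:.2f} m exceeds 3 m")
    return problems


# Three segments plus 3 inches (0.0762 m) of bus on the adapter itself.
print(check_single_ended_bus([0.5, 0.5, 1.0], hidden_length_m=0.0762))
```

An empty list means both limits are met for the lengths supplied; the hidden length still has to come from the hardware manufacturer.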

    Next, determine the type of terminators being used. Passive terminators are prone to problems. They are essentially voltage dividers developed for SCSI I applications, and slight fluctuations in voltage on the bus will cause undervoltage on the bus, resulting in improper termination. If the bus is not properly terminated, then when a SCSI transfer occurs and the data reaches the end of the bus, where it should be taken to ground, it instead reflects back down the bus, where it can garble the reception of the next data transfer. Although this is more likely to occur during synchronous transfers, it can happen with asynchronous transfers as well. These collisions slow down the devices on the bus and delay response times, overloading the bus and possibly causing it to fail.

    Active terminators were designed for the SCSI II bus and are either built in to the end device, built in to the internal cable, or are a cable attachment on either internal or external cables. On external devices they may be on the second cable port and have either raised lettering that indicates they are active or an LED that is on when the line is terminated. Because of their electrical properties, they do a much better job of ensuring the termination line on the bus has the right voltage to keep the bus terminated, and they also serve to eliminate some noise on the bus.

    Note: External devices such as external tape drives or loaders (devices with their own power supply) should be set to supply termination power. The controller should handle termination power for all internal devices.

    The number and types of devices on the bus also affect bus performance. If another device fails to sense the bus before transmitting, or if the bus is extremely busy, the result can be SCSI bus contention, an issue of available bandwidth. Mirroring or duplexing drives on a bus adds a certain amount of traffic to the bus due to the mirroring operations happening in the background. Even if mirroring is not being done, when other devices are on the bus, any device can start an arbitration phase each time the SCSI bus is free. Arbitration takes a fixed amount of time and has a guaranteed winner based on SCSI ID priority: ID 7 is highest, 0 is lowest. Any device set to a higher ID than the tape drive therefore wins arbitration against it, and the hard drives (normally at ID 0 or 1) still compete for the bus whenever it is free. A tape device sharing a bus with mirrored drives, or on a crowded bus, may have problems including SCSI time-outs due to traffic or priority. Isolating the tape drive from such problems by putting it on a separate controller will minimize this possibility.
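The arbitration rule stated above (ID 7 is highest, 0 is lowest) can be shown in a few lines. This sketch assumes a narrow bus with IDs 0 through 7 only, and the function name is hypothetical.

```python
def arbitration_winner(contending_ids):
    """Return the SCSI ID that wins arbitration on a narrow (8-bit) bus.

    The highest ID among the devices contending for the free bus wins.
    """
    return max(contending_ids)


# Tape drive at ID 2 contending with the controller at ID 7.
print(arbitration_winner([2, 7]))
# Tape drive at ID 2 contending with a device at ID 5.
print(arbitration_winner([2, 5]))
```

Winning or losing arbitration is only part of the picture: every device that gains the bus occupies it for the duration of its transfer, which is why a busy bus can delay the tape drive regardless of its ID.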

    Example: The tape drive receives a command to write data to tape. If the tape drive is not ready to take data immediately, it disconnects. If there is a high amount of SCSI bus activity for other devices, it may not be able to reconnect to get the data when needed. This may cause a data under-run condition, causing the tape drive to reposition the tape (slowing down transfers).

    The final source of the problem may be the tape drive. While the tape drive may be able to pass a battery of tests thrown at it by a testing utility, a particular timing or sequence of commands may occur as part of the backup but not during the utility's basic testing. Other possibilities include a temperature sensitive condition which shows up when one of the devices on the bus reaches a certain temperature. That temperature is reached after an undetermined time of steady streaming of information to the tape. There is also the possibility that the tape drive has a basic firmware incompatibility or a problem with the electrical characteristics of the other components on the bus. Replacing the tape drive becomes the final solution.