
Emergency Handling for Recovery of SAP System Landscapes

Best Practice for Solution Management

Version Date: May 2008 The newest version of this Best Practice can always be

obtained through the SAP Solution Manager

Table of contents

1 Introduction ....................................................................................................................................... 3

1.1 Goal of Document ...................................................................................................................... 3

1.2 What Is a Disaster? .................................................................................................................... 4

1.3 Course of a Recovery ................................................................................................................ 4

1.4 Organization ............................................................................................................................... 5

1.5 Dos and Don’ts in Case of a Disaster ........................................................................ 5

2 Flowchart for Emergency Handling .................................................................................................. 6

3 From Incident to Disaster (Steps 1 to 5) .......................................................................................... 7

4 Error Categorization (Step 6) ........................................................................................................... 8

4.1 Technical Failure ........................................................................................................................ 8

4.2 Logical Error ............................................................................................................................... 9

4.3 Cross-System Inconsistency ...................................................................................................... 9

5 Activating Alternate Procedures (Steps 7 to 8) .............................................................................. 10

5.1 Switchover after Technical Failures ......................................................................................... 10

5.2 Workarounds after Logical Errors and Data Inconsistencies ................................................... 11

6 Preparations for Recovery (Step 9) ................................................................................................ 11

7 Executing Recovery (Step 10) ....................................................................................................... 13

7.1 Overview of Recovery Phases ................................................................................................. 13

7.2 Technical Recovery (Recovery-Phase 1) ................................................................................ 14

7.3 Data Repair (Recovery-Phase 2) ............................................................................................. 17

7.4 Business Recovery (Recovery-Phase 3) ................................................................................. 21

7.5 Data Re-entry (Recovery-Phase 4) .......................................................................................... 24

8 Returning to Normal Operation (Steps 11 to 16) ........................................................................... 25

9 Examples ........................................................................................................................................ 27

9.1 Example 1: Media Failure ........................................................................................................ 27

9.2 Example 2: Media Failure and Database Recovery Failure .................................................... 28


© 2008 SAP AG - 2 -

9.3 Example 3: Lost Data ............................................................................................................... 30

9.3.1 Example 3a: All Data Can be Recovered ........................................................................ 30

9.3.2 Example 3b: Remaining Data Loss is Only Locally Relevant .......................................... 31

9.3.3 Example 3c: Remaining Data Loss Causes Cross-system Inconsistencies ................... 32

9.4 Example 4: Database Block Corruptions ................................................................................. 33

Appendix ................................................................................................................................................ 36

A - Flowcharts for Printout ................................................................................................................. 36


1 Introduction

1.1 Goal of Document

Disruptions of core business functions pose a serious threat to the success of a company. When business operations are disrupted, a standardized procedure helps to return to regular operations in a timely manner. Meanwhile, activating technical switchover solutions or business-level workarounds can provide an interim solution that keeps operations running (at least at some minimum level) while regular functionality is being restored.

Using flowcharts, this document outlines a general procedure to be followed in case of serious business disruptions. Starting with the escalation of an incident to a disaster, it depicts the main phases and steps of the recovery procedure, allowing the classification of incidents and providing details on the recovery options available in the different phases.

The purpose of this document is twofold:

1. Support the handling of an acute emergency

2. Provide input for Business Continuity Planning

Emergency Situation

Following this document, SAP support employees and customer support organizations will be able to follow a structured approach to support the recovery of a customer's system environment within a reasonable timeframe. This document provides information for each phase of a recovery procedure. Examples of typical error situations and recovery approaches illustrate the course of the recovery process.

The procedure outlined here can also serve as the basis for an action plan to be set up for coordinating an emergency situation.

For a customer, this document is helpful if no disaster recovery plan is available; if one exists, this document can supplement the customer-specific plan and support the emergency handling process.

As such, this document is intended for:

A disaster recovery team executing a recovery

Support employees assisting customers in a business-down situation

Duty managers / escalation managers accompanying a recovery

Business Continuity Planning

Business continuity planning has the task of preparing a company for a disaster situation by creating detailed recovery plans for different contingencies. The general flow of an emergency procedure described here can be adopted in a customer-specific recovery plan. The error categorization and recovery options listed in this document can provide input for the detailed recovery instructions to be defined.

As such, this document is intended for:

The business continuity project manager

Members of a business continuity project

Business Continuity Planning is also part of another Best Practice provided by SAP: “Business Continuity Management for SAP System Landscapes”, which covers the project steps of establishing a business continuity concept; see http://service.sap.com/solutionmanagerbp.

The recovery procedures created by a business continuity project are intended to guide emergency handling in a business-down situation.


1.2 What Is a Disaster?

Within the scope of this document, a disaster is any event that seriously disrupts business operation beyond the acceptable outage time. An incident that cannot be resolved within a predefined time limit needs to be escalated to a disaster that requires further recovery procedures. This perception of the term 'disaster' is often used in the context of Business Continuity Management and goes beyond the more restrictive notion of a 'physical disaster' like a fire, flooding or explosion.

Business disruptions or „disasters‟ can be caused by technical failure or logical failure.

Technical failures of a system component usually affect all business processes that use the affected component(s). They can range from crashes of individual hardware components, through database block corruptions, to fires or flooding of an entire computer center.

Logical failures, on the other hand, often affect only one or a few business processes while the systems are still up. They range from partial data loss or data corruption inside a single system, to inconsistencies in data exchanged between multiple systems of an environment.

1.3 Course of a Recovery

Disaster recovery handling, as described in this document, starts with the escalation to a disaster of an incident that is seriously disrupting operations.

It is important to identify the type of error causing the disruption as early as possible, since the required recovery phases and applicable activities mainly depend on the error type.

When a disaster is declared, the following main phases of recovery may be applied in this order:

A. Activate possible alternatives to stay in business. This can be a technical switchover to a standby system, or the activation of alternate business processing using workarounds or emergency plans.

Which options are possible or applicable depends on the solutions in place and the actual type of error. Activating a workaround will be easier and faster, if the workaround is already documented in a recovery plan.

B. Prepare systems for the recovery

C. If a system or component is down, system recovery or technical recovery has, as a first step, to reestablish technical availability of the system by fixing any technical error causing the disruption. This can be done, for example, by replacing a defective hardware component, by activating a standby system or by restoring a database from a backup.

D. If all components are up (or were recovered in the previous step), logical errors inside each system have to be removed to restore integrity of each system in itself. This requires in-depth application knowledge and is a prerequisite for the next step.

E. If data consistency between systems of the environment was affected, fixing this again requires in-depth application knowledge and time.

F. If data was lost and could not be recovered so far, the next effort should aim at reentering such data into the systems.

G. Having finished all recovery phases, the systems and business functionality should be checked as a prerequisite for resuming regular operations.

More details on the different phases can be found in section 2 and following.
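The main phases A to G above can be sketched as an ordered checklist. This is an illustrative sketch only: the phase names and their order follow the list above, while the data structure and function are assumptions, not part of the Best Practice.

```python
# Illustrative sketch: the recovery phases A-G from the list above as an
# ordered checklist. Phase names follow the text; the runner is an assumption.

RECOVERY_PHASES = [
    ("A", "Activate alternatives (switchover or business workarounds)"),
    ("B", "Prepare systems for the recovery"),
    ("C", "Technical recovery: reestablish technical availability"),
    ("D", "Data repair: remove logical errors inside each system"),
    ("E", "Business recovery: restore cross-system data consistency"),
    ("F", "Data re-entry: re-enter data that could not be recovered"),
    ("G", "Check systems and business functionality before resuming operation"),
]

def next_phase(completed):
    """Return the next phase to execute, given the IDs of completed phases."""
    for phase_id, description in RECOVERY_PHASES:
        if phase_id not in completed:
            return phase_id, description
    return None  # all phases done: regular operations can be resumed
```

Such a checklist also makes it easy to track during an emergency which phases have already been completed.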


1.4 Organization

The organization that executes the recovery plans consists of:

The support desk (incident management) staff that report a possible disaster case

The business continuity manager

The recovery team with representatives from business (key business user/business process champion) and IT (application management/SAP technical operations/business process operations)

A crisis team of senior managers from business (business process champion) and IT (application management) that need to be consulted for critical decisions like the activation of a disaster recovery plan

Key users familiar with emergency workarounds for critical business functionality

A standardized method is recommended to support this complex business continuity approach. It involves all parties that can contribute to the resolution of a disaster.

The focus of the approach should be the analysis and resolution of top issues, which have the highest impact on the operations of productive solutions in case of a disaster.

The benefits of a consolidated approach are:

Standardized approach that has proven to be most effective and efficient

Fast access to all experts needed

Close collaboration and communication with all involved parties

High transparency on current issue status

Continuous reporting on the progress, up to management level

1.5 Dos and Don’ts in Case of a Disaster

This section lists some general pitfalls during a recovery.

Don't: Apply a point-in-time recovery of a single system.
Instead: Repair the inconsistencies / logical errors inside the single system.

Don't: Apply a point-in-time recovery for the whole system landscape.
Instead: Identify and correct the inconsistencies / missing data.

Don't: Chase temporary differences.
Instead: Make sure that it’s a real inconsistency, not a temporary difference.


2 Flowchart for Emergency Handling

This section describes the general flow of activities for handling a situation that impacts the continuity of business operations. The following figure provides a flowchart that can also serve as the basis for a specific action plan in an emergency situation.

Note: If available, a customer's business continuity plan and corresponding recovery plans need to be factored in when performing a recovery.

Figure 1: Flowchart 1: “Emergency Handling”

The different steps of the emergency handling process will be discussed in more detail in the following sections:

Step   Section   Title
1-5    3         From Incident to Disaster
6      4         Error Categorization
7-8    5         Activating Alternate Procedures
9      6         Preparations for Recovery
10     7         Executing Recovery
11-16  8         Returning to Normal Operation


3 From Incident to Disaster (Steps 1 to 5)

Incident

A business disruption is usually detected by end users who trigger an incident at the support organization. Incident management (support desk), as the primary addressee for any type of business disruption, analyzes and tries to resolve the error by answering the following questions:

What has happened?

Which processes are affected?

How many users are affected?

Is the error reproducible?

Is the error business critical?

Involved Organization: Support desk

Escalation

The situation has to be escalated to business continuity management:

If error resolution is not successful within a given time frame defined in the SLAs or

As soon as it becomes clear that the error is of high criticality and complexity and has a serious impact on business operations

Business continuity management is responsible for further handling such major incidents. Before escalating to a disaster, further analysis has to answer the following questions:

What is the impact on business operations?

Is the incident endangering the business of the customer (production down)?

What is the root cause of the problem?

What is the estimated time required for recovery?

Should I invoke the disaster recovery plan (escalate)?

If a serious business disruption (that prevents critical core business functions from operating) is determined, a disaster situation will be declared and the business continuity plan will be invoked.

Involved Organization: Support desk, BC Manager, Senior Management
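The escalation rules of this section can be summarized as a small decision helper. This is an illustrative sketch only: the two escalation conditions and the disaster criterion come from the text, while the functions and their parameters are assumptions.

```python
# Illustrative sketch of the escalation rules above; functions and parameter
# names are assumptions, the conditions themselves are from the text.

def should_escalate(minutes_unresolved, sla_limit_minutes, serious_business_impact):
    """Escalate to business continuity management if error resolution exceeds
    the SLA time frame, or as soon as serious business impact is evident."""
    return minutes_unresolved > sla_limit_minutes or serious_business_impact

def is_disaster(core_business_functions_blocked):
    """A disaster is declared if a serious business disruption prevents
    critical core business functions from operating."""
    return core_business_functions_blocked
```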

Disaster

In a disaster situation, the following questions need to be addressed by the disaster recovery team:

Who do I have to call first?

Is the incident of an isolated nature or will the consequences deteriorate over time (for example data inconsistencies that may spread if work continues in the system)?

Is it possible to maintain partial functionality in the system or must the system be taken out of operation completely?

Which workarounds are available; which are possible? See section 5.

What needs to be done before starting recovery actions? See section 6.

Which recovery options are available; which are possible? See section 7.

Who else will be required in the recovery team (BC team)?

Involved Organization: BC Team


4 Error Categorization (Step 6)

The type of error (error category) determines not only the entry point for recovery execution (see step 10), but also the options for activating alternate procedures (business continuity solutions, see steps 7-8). Therefore, the first requirement now will be to determine the type of error leading to the disruption.

We can distinguish

Technical failures

Logical errors inside a system

Cross-system inconsistencies affecting data exchanged between systems of a landscape

Involved Organization: BC Team

4.1 Technical Failure

A technical failure is usually caused by a hardware or system software fault. The system is usually unavailable and thus, all users and business processes relying on that system are affected.

Technical errors can be of the following nature:

System / subsystem crash

Infrastructure failure (network, power, telephony, …)

Failure of service provider

Physical disaster (fire, flooding, …)

Error causes

Technical failures can, for example, result from:

Hardware failure (memory, CPU, controller, …)

Storage media or storage system failure

Software bug (firmware, operating system, filesystem, database, …)

Database block corruptions

Block corruptions

Database block corruptions are a special kind of technical error. The content of storage blocks used by the database is corrupted, so the data stored in these blocks can no longer be used. The impact of block corruptions can range from SQL statements or transactions failing when accessing a corrupt block, up to crashes of the complete database instance. If system data of the database management system was corrupted, the database instance may no longer be able to start.

Block corruptions can have multiple causes: hardware failures like a defective disk controller, memory errors, or low-level software bugs corrupting the data.

Although block corruptions are sometimes regarded as logical errors (since the data stored in the blocks is logically corrupt), we follow the categorization as a technical failure because the phases that are applicable to recover from block corruptions start at the hardware / technology level (see section 7.2). Fixing a corruption on the technical layer may sometimes result in missing or incorrect data (a logical error).
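How a corrupt block is detected can be illustrated with a simplified checksum scan. Real database block formats and validation checks are product-specific, so the block size, the checksum scheme and the function below are purely hypothetical:

```python
# Illustrative sketch of block-level corruption detection; block size and
# checksum scheme are hypothetical, real DBMS block formats differ.
import hashlib

BLOCK_SIZE = 8192  # hypothetical block size

def find_corrupt_blocks(data, expected_checksums):
    """Compare each block's checksum with its expected value and return the
    indices of mismatching (i.e. corrupt) blocks."""
    corrupt = []
    for i, expected in enumerate(expected_checksums):
        block = data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        if hashlib.md5(block).hexdigest() != expected:
            corrupt.append(i)
    return corrupt
```

A database instance performing such a check at read time would see exactly the spectrum of symptoms described above: a single failing access for one corrupt block, or a crash if critical system data does not validate.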


4.2 Logical Error

A logical error usually affects only parts of a system or its data and thus only a few business processes and a limited number of users. Since all data is consistent from a database and “SAP Basis” viewpoint, the systems are up and 'only' some business functionality is disrupted or faulty.

Logical errors can be of the following nature:

Some business data is lost, ranging from complete tables to single table rows or single fields of table rows. If data is lost, the application context will be corrupted since related data still exists in other tables

Business data is falsified, ranging from single table rows to the contents of specific table fields

Reports or other software processes are inoperable

Error causes

Logical errors can result from software error or human error (business user or administrator error) like:

Data deletion or table drops on SQL, database administration or SAP level

Transport induced error (wrong destination, wrong transport buffer, …)

Faulty customizing

Introduction and execution of bad code

Incorrect usage of application component, incorrect data entry

Incorrect data transfer / incorrect data processing through interfaces

4.3 Cross-System Inconsistency

In a system landscape where business processes use and modify data in various systems, data consistency is vital for correct business operation. A business object that is exchanged between two systems and should thus be available in both systems is inconsistent (between the two systems), if:

The object does not exist in any one of the systems

The two instances of the same object have different values in both systems

A special type of inconsistency in this respect is an inconsistency between an IT system and the real world.

Difference or inconsistency

When talking about inconsistencies, it is important to differentiate between Differences and Inconsistencies. A Difference is a mismatch between data that will always occur in connected running systems (due to the processing times of asynchronous update tasks, IDocs, BDocs and other interfaces, or different scheduling frequencies between systems). An Inconsistency is a mismatch that does not disappear when all system activities are processed successfully. Before attempting a correction, it is therefore necessary to investigate whether an Inconsistency or a temporary Difference is being observed.
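This distinction suggests a simple re-check procedure: compare the business object in both systems, wait until in-flight asynchronous updates have been processed, and compare again; only a persistent mismatch counts as an Inconsistency. The sketch below assumes hypothetical fetch callables for the two systems:

```python
# Illustrative sketch: distinguishing a temporary Difference from a real
# Inconsistency by re-checking after a delay. fetch_a/fetch_b are hypothetical
# callables reading the same business object from each system.
import time

def is_real_inconsistency(fetch_a, fetch_b, key, recheck_delay_seconds=0):
    """Return True only if the mismatch persists after a re-check, i.e. it is
    an Inconsistency rather than a temporary Difference caused by in-flight
    asynchronous updates."""
    if fetch_a(key) == fetch_b(key):
        return False                      # no mismatch at all
    time.sleep(recheck_delay_seconds)     # allow queued updates to be processed
    return fetch_a(key) != fetch_b(key)   # persistent mismatch -> inconsistency
```

In practice the delay would have to exceed the longest queue processing and scheduling interval between the systems involved.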

Error causes

Inconsistencies between two (or more) systems can be caused by:

Software errors

o No clear leading system

o Program bugs

o Non-transactional interfaces, for example, synchronous communication used for data manipulations

o Incorrect error handling


User errors / Manual intervention

o Incorrect data entry

o Incorrect error handling

o Deletion of Queues

o Direct access to data

Messages in error states

Simplified Commit-Protocol (as used between APO and liveCache)

Data loss in one of the systems

o Incomplete Recovery of a system

o Technical disaster recovery (data replication) method that does not adhere to data consistency

Tolerated data loss, for example, with asynchronous replication

Missing consistency technology

System failure or failover

o Non-transactional interfaces with non-SAP components may be affected by data loss
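The categorization of sections 4.1 to 4.3 can be condensed into a simple decision helper. The three error categories are from the text; the function and its symptom flags are illustrative assumptions.

```python
# Illustrative decision helper summarizing sections 4.1-4.3; the three error
# categories are from the text, the function and its flags are assumptions.

def categorize_error(system_available, data_wrong_in_single_system,
                     data_differs_between_systems):
    """Map observed symptoms to one of the three error categories."""
    if not system_available:
        return "technical failure"           # component down, all its users affected
    if data_differs_between_systems:
        return "cross-system inconsistency"  # exchanged data mismatches across systems
    if data_wrong_in_single_system:
        return "logical error"               # systems up, some business data faulty
    return "no categorized error"
```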

5 Activating Alternate Procedures (Steps 7 to 8)

Depending on the type of error, different possibilities may be available to continue operations during recovery:

Technical failure: A technical continuity solution may allow you to switch over operations to alternate hardware.

Logical errors or inconsistencies: Workaround processes may be available to continue the most critical business functions.

5.1 Switchover after Technical Failures

Involved Organization: BC Team, IT

Server-side failure

Failover on the server side using a cluster solution for database or SAP central services can be done without limitations.

Storage-side failure

If data is replicated to a second facility, a switchover to the alternate storage system must always be considered carefully. The decision must be made by the disaster recovery team after weighing the benefits against the impact. Since a switchover may involve some data loss, business recovery may subsequently be needed to remove cross-system data inconsistencies.

The amount of data loss during switchover depends on the implemented replication technology.

A standby database may incur a relatively high data loss if the most recent logfiles from production cannot be applied.

Asynchronous replication generally incurs data loss according to the allowed replication lag.


Even with synchronous replication, some data loss may occur if the primary location continued to operate while the replication was already interrupted (“rolling disaster”).
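The data-loss considerations above can be summarized as a rough upper bound per replication mode. This is an illustrative sketch only: the qualitative rules come from the text, while the function, its parameters and the returned unit (seconds of committed work) are assumptions.

```python
# Illustrative sketch: rough upper bound on data loss when switching over,
# per replication mode. Rules follow the text; the code is an assumption.

def potential_data_loss_seconds(mode, replication_lag_seconds=0,
                                rolling_disaster=False):
    if mode == "synchronous":
        # normally no loss, unless the primary kept operating while
        # replication was already interrupted ("rolling disaster")
        return replication_lag_seconds if rolling_disaster else 0
    if mode == "asynchronous":
        # loss up to the allowed replication lag
        return replication_lag_seconds
    if mode == "standby_database":
        # loss corresponds to the most recent logfiles that cannot be applied
        return replication_lag_seconds
    raise ValueError("unknown replication mode: " + mode)
```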

Block Corruptions

If a standby database is available and complete recovery of this standby database can be done using the most recent logs from the production system, switchover to the standby database can be a very quick solution to enable continuity of business operation, because block corruptions caused by a technical failure usually do not transfer into a standby database.

If complete recovery is not possible on the standby database, more detailed analysis should be conducted into other possibilities of resolving the corrupted blocks (see 7.2), because a switchover would result in cross-system inconsistencies whose resolution might be somewhat more complex.

Note: Switching to a standby database will not solve the problem if the corruption was also transferred to the standby database, for example, if the block corruption was caused by bugs in the database software.

5.2 Workarounds after Logical Errors and Data Inconsistencies

Involved Organization: BC Team, Key users

Alternate Processing

If a business process becomes unusable due to logical errors or inconsistent data, it can only be re-activated after the error has been sufficiently resolved. In the meantime, it might be possible to “stay in business” using a workaround procedure.

Ideally, such workarounds are already documented in a business continuity plan and can be activated according to this plan. If this is not the case, it should be analyzed whether any such workarounds are possible and applicable to continue operations on a reduced scale.

The following types of workarounds may be considered:

Manual, paper-based processing

Operation based on the remaining systems of a system landscape

Working with reduced functionality

A combination of the above

Since a workaround always implies some limitations and usually requires some more or less expensive post-processing when normal operations are reestablished, the activation of a workaround should be under the control of the disaster recovery team.

6 Preparations for Recovery (Step 9)

Before starting the actual recovery process, some preparations may be required to avoid unintended side-effects. Depending on the actual situation, the affected system may need to be shut down or isolated from other systems of the landscape before error resolution can continue. Isolation from other systems may be required, for example, to prevent the exchange of messages before data inconsistencies have been resolved.

Consider the following preparations before starting with the recovery:

Notify users

Stop user access to production system(s)

Isolate affected system

Salvage possibly helpful information


Ensure you are able to revert the system to the point before you start the recovery

Involved Organization: BC Team, IT

Notify Users

Business users need to know about the disruption and must be given guidelines on how to proceed. If a workaround will be activated, the users must be instructed to use it.

Stop User Access

While recovery actions are performed, normal users must not be allowed to work with the affected system or the affected business processes. This is mainly important during the resolution of logical errors or data inconsistencies. It can be achieved by disabling user logon to the system or by locking the affected transactions. After a technical system recovery, user logon should be prevented until it has been verified that the recovery is really complete and that no further recovery on the logical level is required.

Possible actions:

Lock users

Lock affected transactions

Lock system

Isolate Affected System

As long as the state of the affected system is not completely clear, the system should be isolated from its environment. Any automatic actions should be disabled; message exchange with other systems of the environment should be avoided, especially when expecting cross-system inconsistencies after data loss or incomplete database recovery.

Possible actions:

Disable communication from other systems

o In other systems: disable connections, deregister outbound queues/destinations, disable automatic data requests

o In affected system: lock RFC user for incoming messages from other systems

Disable communications to other systems

o In other systems: lock RFC user for incoming messages from affected system

o In affected system: disable connections, deregister outbound queues/destinations, disable automatic data requests

Disable transports

Disable print-outs

Salvage Possibly Helpful Information

Prevent data that may be helpful for recovery or analysis on the application level from being deleted (for example, by automatic reorganization jobs). This mainly comprises information on messages that were exchanged between the affected system and other systems of the landscape.

Do not delete the contents of message queues unless you are sure that this information is available in the target system (also see section 7.4)

Unschedule message reorganization in XI

Unschedule BDoc reorganization in CRM

Avoid deletion of ALE data


Enable Reverting to the State Before Recovery

If the recovery is not successful, or even enlarges the damage, it should be possible to revert to the state before the recovery started. This can be ensured by taking a backup before starting the recovery, or by noting the exact point in time when the recovery was started so that a database restore and log recovery can bring the system back to that point. If available, other technologies such as savepoints or storage-based snapshots can provide even better solutions for this demand.
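The decision between these options can be sketched as follows. This is a minimal illustration only; the function and return values are assumptions, not SAP tooling:

```python
from datetime import datetime, timezone

def choose_revert_strategy(snapshot_available, fresh_backup_feasible):
    """Pick a way to return to the pre-recovery state (illustrative sketch)."""
    if snapshot_available:
        # Savepoints or storage-based snapshots revert fastest.
        return "snapshot"
    if fresh_backup_feasible:
        # A backup taken right before recovery guarantees a known-good state.
        return "backup-before-recovery"
    # Otherwise, record the exact start time so that a database restore plus
    # log recovery can bring the system back to this point.
    return "note-start-time:" + datetime.now(timezone.utc).isoformat()
```

Whichever branch applies, the chosen revert path should be documented before the first recovery action is executed.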

7 Executing Recovery (Step 10)

At this stage, the actual recovery from the failure is performed, using previously documented recovery procedures from a business continuity plan, if available. The available options depend on the type of error and lead into the flowchart in figure 2.

7.1 Overview of Recovery Phases

The recovery procedure can be divided into four different phases.

1. Technical Recovery

Fix technical errors to get the system up and running

2. Data repair

Fix logical errors inside affected system

3. Business recovery

Fix cross-system inconsistencies

4. Data re-entry

Re-enter data that might have been lost

The following flowchart depicts the sequence of recovery actions, coming from step 10 of flowchart 1 (figure 1). The entry point into this flowchart is determined by the error type (see section 4). Which phases actually need to be executed depends on the type of error and the resulting outcome from a previous phase. In section 9, you can find examples showing different paths for traversing the flowchart.
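The dependency between the error category (section 4) and the phases to traverse can be modeled in a simplified sketch; the category names and flags are illustrative assumptions:

```python
# Entry points of the recovery flowchart: the error category determines the
# first phase; later phases are only needed if data loss or cross-system
# inconsistencies remain after the earlier phases.

ENTRY_PHASE = {
    "technical failure": 1,           # Technical Recovery
    "logical error": 2,               # Data Repair
    "cross-system inconsistency": 3,  # Business Recovery
}

def recovery_phases(error_type, data_loss=False, cross_system=False):
    """Return the ordered list of recovery phases to traverse."""
    phases = [ENTRY_PHASE[error_type]]
    if cross_system and phases[-1] < 3:
        phases.append(3)  # remove inconsistencies between systems
    if data_loss:
        phases.append(4)  # re-enter data that could not be recovered
    return phases
```

For example, a technical failure with a complete recovery traverses phase 1 only, while an incomplete recovery adds business recovery and data re-entry.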


Figure 2: Flowchart 1.1: “Recovery Phases”

7.2 Technical Recovery (Recovery-Phase 1)

Goal: Repair defective hardware or software components to get the database and SAP system up and running

Involved Organization: BC Team, IT

Strategy for Technical Recovery

Systems or components may not be operational due to hardware failure, corrupted filesystems, lost system files, lost or corrupted database files, misconfiguration, corrupted software, or software bugs, where software can be any kind of low-level system software, operating system, database, or application software. These failures can be the consequence of defective hardware but may have various other causes.

Coming from phase 1 of flowchart 1.1 (figure 2), the following flowchart depicts the main steps to get the systems back online.


Figure 3: Flowchart 1.1.1: “Technical Recovery”

Options for Technical Recovery

Although error and root cause analysis may not be easy, the measures to fix technical failures are quite straightforward and can require any of the following hardware-, system- or database-related activities:

Fix hardware

o Exchange defective hardware components

o Switch to an alternate data center (in case of physical disasters)

Fix software and filesystems

o Check and repair filesystems

o Restore filesystems (see below)

o Restore or reinstall affected software, which can be system software, drivers, operating system, database software, application software, and so on

o Install error-free software patches

o Fix configuration errors

Fix database

o Restore database or database files (see below)

o Resolve database block corruptions (see below)

Next step: If the failure or the recovery procedure did not cause any loss or falsification of database or application data, recovery can finish at this stage and proceed with step 11 returning to normal operation. Otherwise, further applicable branches of the flowchart need to be traversed.


Filesystem Restore

A restore of storage volumes containing non-database-managed filesystems, with any kind of software components or any other kind of data, is always incomplete: any changes made after the last backup are lost. Unlike databases, filesystems have no concept of logfiles that would allow changes made after the backup to be reapplied.

Since an SAP system does not store application data in filesystems (with very few exceptions for special kinds of data such as logfiles, TREX indexes, or external content in CM), no impact on application data consistency is to be expected after a filesystem restore. The loss of information may thus affect software, configuration files, transport files, or these special kinds of data. Subsequently, analysis should be conducted to find out what exactly was lost, and then any activities that allow reconstructing the previous state should be repeated (repeating the installation of software patches, repeating configuration changes, repeating the export of transports). Strictly speaking, this is already an activity that falls into section 7.5 (Data Re-entry), but for simplicity we leave it here.
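The analysis step can be illustrated with a sketch that compares the restored (backup-time) state against the expected state derived from change records; the path names and checksum scheme are invented for illustration:

```python
def lost_after_restore(backup_manifest, expected_state):
    """Files whose expected state differs from the restored backup.

    backup_manifest: path -> checksum at backup time (i.e. the restored state)
    expected_state:  path -> checksum the file should have, e.g. derived from
                     records of patches, configuration changes or transports
    """
    # Anything added or modified after the backup is missing after the restore
    # and marks an activity that needs to be repeated.
    return sorted(path for path, checksum in expected_state.items()
                  if backup_manifest.get(path) != checksum)
```

The resulting list then drives the repeat activities (patch installation, configuration changes, transport exports).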

Database Restore and Recovery

A database restore from a backup may be required in case of:

Media (disk) failures. Since nowadays all productive installations implement some form of RAID protection, the failure of a single disk no longer requires a restore. Only if more than one disk of a single RAID group fails at the same time will a restore of the data residing on that RAID group be inevitable. Due to the striping that is implemented for performance reasons, a restore may affect multiple tablespaces residing on that RAID group, or even the complete database.

Block corruptions (see below)

Deletion of data files or misconfiguration of raw devices, for example, due to an administrator fault

A restore always consists of three phases:

1. Restore of a database backup. Depending on the backup strategy, the restore can be done from different sources (tape, virtual tape, local disk, remote disk, standby database), yielding a different restore performance.

2. Restore of database logfiles from the backup medium

3. Application of database logfiles to the restored database (log recovery)

Log recovery is a very important step to roll forward the database and to apply changes that were done after the backup. During log recovery, all archived logfiles should be applied, followed by the current online logs that were being written when the failure occurred. The goal should always be to perform a complete recovery including the latest committed transaction. Only a complete recovery avoids data loss, which is important to maintain data consistency in a system landscape. Only by applying all available logfiles (archive and online logs), can a complete recovery be achieved.
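The roll-forward principle can be sketched as follows. Logfiles are modeled as contiguous change-number ranges, which is a strong simplification of real database log handling:

```python
def roll_forward(restored_point, archived_logs, online_logs):
    """Apply logs in sequence; each log covers a contiguous change range (lo, hi).

    Returns the highest change number reached. Only when the online logs
    (being written at crash time) are applied last is the recovery complete.
    """
    point = restored_point
    for lo, hi in sorted(archived_logs) + sorted(online_logs):
        if lo > point + 1:
            break  # gap in the log sequence: cannot roll forward past it
        point = max(point, hi)
    return point
```

A gap anywhere in the sequence stops the roll-forward at that point, which is exactly the incomplete-recovery situation described below.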

Note: Regardless of the status of messages that were being exchanged between the systems when the failure occurred, a complete recovery maintains data consistency in an SAP system landscape because of the transactional concept used for message exchange. With a complete recovery, all committed message states are restored exactly as they were at the time of the failure. The asynchronous messaging protocols used for data exchange in SAP environments (tRFC, qRFC, EO or EOIO messaging) then ensure that message exchange can be continued without losing or duplicating messages.

An incomplete recovery (point-in-time recovery) of the database needs to be avoided since this introduces the need to remove cross-system inconsistencies that are caused by the data loss in the affected system (see below).

Example: For an example of a complete recovery after a technical failure, see section 9.1.

Next step: Following a complete database recovery, recovery can finish at this stage and proceed with step 11 returning to normal operation.


Incomplete Database Recovery / Data Loss

In very rare situations, an incomplete database recovery (point-in-time recovery) may be inevitable following a database restore. Complete database recovery is generally not possible if:

A required logfile is corrupt and no error-free copy of the logfile is available

The tapes storing the logfiles are destroyed

The most current online logs cannot be accessed and applied to the database and there is no other replica of these online logs available
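The conditions above amount to a simple ordering check, sketched here with logs modeled by their sequence numbers (an illustration, not a database tool):

```python
def last_recoverable_sequence(required, available):
    """Last log sequence number up to which a recovery can proceed.

    Logs must be applied strictly in order, so the first required sequence
    without an error-free copy (corrupt file, destroyed tape, lost online
    log) ends the roll-forward; everything after it is lost.
    """
    last = None
    for seq in sorted(required):
        if seq not in available:
            break
        last = seq
    return last
```

If the returned sequence is lower than the last one written before the failure, only an incomplete (point-in-time) recovery is possible.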

After an incomplete recovery, the database itself is in a consistent, error-free state and the affected system can be started (unless other errors are still present). However, the data loss has an impact on cross-system data consistency in a system landscape and as a next step, the business recovery phase needs to address these issues.

Example: For an example of an incomplete recovery, see section 9.2.

Next step: Following an incomplete database recovery, data consistency between systems needs to be re-established and subsequently, the lost data needs to be re-entered in the system. Therefore, recovery phases 3 and 4 are required.

Block corruptions

If you recognize block corruptions on your database, always check your hardware because block corruptions are mostly caused by layers below the database management system. To determine the real extent of the damage, also check your entire database.

The actions that need to be taken as a consequence of block corruptions depend on the affected database areas. The analysis should thus clearly identify the objects that the corrupted blocks belong to.

On Oracle, SAP note 365481 describes different options to proceed depending on the type of object. Additional options are available as follows. Please note that some options require expert knowledge!

Restore and recover single corrupt blocks from an error-free backup with Oracle RMAN. This is possible even if RMAN was not used to perform the backup. This recovery is possible online.

Restore and recover database from an error-free backup

Additional options of SAP Support

o Rebuild from redundant data (from other table, from the indexes, from other system, and so on)

o Workaround with partial data loss (transformation of technical error to logical error)

Example: For an example of handling database block corruptions, see section 9.4.

Next step: Depending on the result and method used to recover from block corruptions, recovery may finish at this stage or may require traversing additional recovery phases 2, 3 or 4.

7.3 Data Repair (Recovery-Phase 2)

Goal: Remove logical errors or inconsistent data inside a single system

Involved Organization: BC Team, Business, IT


Do not Perform Point-in-Time Recovery

Up to now, a commonly accepted method to remove logical errors from a system has been to restore and recover the database to a point before the error occurred (database point-in-time recovery). The data loss that came along with this procedure was accepted in favor of the ease of re-establishing the logical correctness of the system.

But nowadays, with business processes spanning federated system landscapes, this method is no longer appropriate in most cases!

While removing the logical errors through point-in-time recovery, a new problem is introduced: cross-system data inconsistencies (see error category 3 described in section 4.3). So instead of having to repair logical errors inside a single system, cross-system inconsistencies have to be dealt with, which, in most cases, is an even more challenging task (see "Business Recovery" in section 7.4) than repairing the logical error directly inside the affected system.

The following table provides a comparison of "data repair" versus "point-in-time recovery plus business recovery":

Required knowledge

o Data repair of logical errors inside a single system: experts from the affected application

o Database point-in-time recovery followed by repair of cross-system inconsistencies: database administrators, plus experts from all application areas that exchange data with the affected system

Outage

o Data repair: outage of the affected business processes

o Point-in-time recovery: outage of the complete system and of many cross-system processes

Duration of outage

o Data repair: time required to fix the error

o Point-in-time recovery: time required for the database restore and recovery, plus time required to fix cross-system inconsistencies

In situations where a point-in-time recovery was the easiest solution in the past, more investment into logical error resolution is advisable with federated system landscapes (see "Options for Data Repair").

Strategy for Dealing with Logical Errors

To avoid point-in-time recovery of the production system, logical errors should be carefully analyzed and possible options to repair the errors should be evaluated. If the effort turns out to be very high, this should be compared against the effort and implications imposed by a point-in-time recovery (including all follow-up activities).

If a point-in-time recovery is nonetheless identified as the lesser evil, data consistency between systems needs to be re-established subsequently, and the lost data needs to be re-entered in the system. Thus recovery phases 3 and 4 will be required.
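The effort comparison can be expressed as a simple sketch; the cost inputs are hypothetical estimates that the recovery team would supply:

```python
def choose_repair_strategy(repair_effort_h, restore_recovery_h,
                           business_recovery_h, reentry_h):
    """Weigh direct data repair against a point-in-time recovery.

    A point-in-time recovery must be costed including its follow-up phases:
    business recovery (phase 3) and data re-entry (phase 4).
    """
    pit_total = restore_recovery_h + business_recovery_h + reentry_h
    return "data repair" if repair_effort_h <= pit_total else "point-in-time recovery"
```

Even a seemingly expensive direct repair often wins once the follow-up costs of a point-in-time recovery are counted in.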

The following flowchart depicts the steps for recovering from logical errors, coming from phase 2 of flowchart 1.1 (figure 2).


Figure 4: Flowchart 1.1.2: “Data Repair”

Options for Data Repair

The following general approaches are available to fix logical errors and will be described in more detail later:

Reverse Engineering

Recovery of lost data

Check tools

Doing nothing

Reverse Engineering

A typical method to resolve logical errors is reverse engineering, which means reverting the error step by step with the help of the experts (application, development, and so on). Reverse engineering can be supported by an analysis system that allows you to track back to the state when the error was not yet in the system.

Such an analysis system could be provided by different means:

Perform a point-in-time recovery onto alternate hardware to the state before the error occurred (not a restore onto production)

A standby database that lags sufficiently behind production can be rolled forward to the point just before the error occurred

Use the state of a (recently copied) test system to compare with the corrupted production system
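Reconstructing data from such an analysis system can be sketched as a key-wise comparison; the table names, keys, and row layout below are invented for illustration:

```python
def rows_to_restore(production, analysis, affected_keys):
    """Rows still present in the analysis system but lost in production.

    production / analysis map a primary key to the row contents; the keys
    known to be affected by the error restrict the comparison. Restored
    rows should still be reviewed by application experts before they are
    written back to production.
    """
    return {key: row for key, row in analysis.items()
            if key in affected_keys and key not in production}
```

The same comparison logic applies whether the analysis system is a restored copy, a delayed standby database, or a recently copied test system.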

As described in SAP note 434645, there are various possibilities that may be applicable to repair logical errors, for instance if:

Data was corrupted by a malicious report

o Develop a report to fix the data


o Provide an analysis system and reconstruct original data from there

An index is corrupted

o Recreate index

Wrong transports were imported into the system

o Create and apply correcting transports

o In case wrong table data was transported, reconstruct former table contents following the options above (see table deletion)

o Reconstruct former ABAP sources

Recovery of Lost Data

If data was accidentally deleted (by deletion of a database table, drop of a table, deletion of table rows or attributes by a malicious report or human error), there may be several options of getting this data back without restoring the production system itself (as also listed in SAP note 434645).

Provide an analysis system (as described above) and reconstruct original data or database table from there

Reconstruct original data or database table from a standby database that is rolled forward to the point before the error occurred

Oracle: Flashback table to SCN (if Undo-information is still available)

Reconstruct table from redundant data in other tables

Reconstruct table from redundant data in other systems

Do without the data (for example, performance data of table MONI)

Example: For an example of recovering a deleted table from an analysis system, see section 9.3.

Check Tools for Specific Applications

SAP offers several tools or reports to check and repair data consistency of business data used by different applications. Checks are, for example, available for:

Documents in SD and LE

Inconsistencies in MM

Inconsistencies between MM and FI

Processes involving WM

Processes involving PP

Processes involving PS

For more information see the Best Practice “Data Consistency Monitoring within SAP Logistics” that will be available at http://service.sap.com/solutionmanagerbp.

Doing Nothing

In some special situations, logical errors may not require any further action, for example if the affected data is not vital for business operations.

If non-critical data (like logfiles or monitoring data) was deleted there is no need for recovery

If non-critical data was corrupted, it could just be deleted

Further Steps

When finishing the data repair phase, the data of the affected system should be correct from a business process point of view.


However, in some situations, repairing a logical error may not have been able to recover all lost or corrupted information completely. If data repair came along with some data loss, further analysis will have to show:

If the lost data has an impact on data consistency between the systems – in this case, data repair has to be followed by a phase of business recovery

How important this lost data is and how the lost data can be re-entered into the system – by a subsequent phase of data re-entry

Next step: Depending on the outcome and success of data repair, further recovery may require traversing the additional recovery phases 3 or 4.

7.4 Business Recovery (Recovery-Phase 3)

Goal: Remove inconsistencies between the systems of a system landscape

Strategy for Dealing with Cross-System Inconsistencies

The task of this phase is to deal with inconsistencies that occur between systems of the system landscape. Inconsistencies between the systems and the real world need to be handled in this phase as well.

When dealing with cross-system inconsistencies, we need to know:

Which systems are affected

Which business processes are affected

Which data objects are affected

What is the impact on each business process

The tasks for removing such data inconsistencies include:

Identifying inconsistencies by comparing possibly affected objects between possibly affected systems

Filtering out temporary differences that do not constitute real inconsistencies

Determining a strategy to fix the identified inconsistencies
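The identify-and-filter steps can be sketched as follows. Objects are modeled as key/version pairs, and objects with pending messages in transit are treated as temporary differences; all names are illustrative:

```python
def cross_system_inconsistencies(system_a, system_b, in_transit):
    """Object keys that really differ between two systems.

    system_a / system_b map an object key to its current version; keys with
    pending (in-transit) messages are excluded, because their differences
    are expected to disappear once the messages are processed.
    """
    keys = set(system_a) | set(system_b)
    return sorted(key for key in keys
                  if system_a.get(key) != system_b.get(key)
                  and key not in in_transit)
```

Only the keys returned by such a comparison need a fixing strategy; re-checking after the pending messages are processed avoids fixing differences that would have resolved themselves.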

Involved Organization: BC Team, Business

The following flowchart depicts the steps during business recovery, coming from phase 3 of flowchart 1.1 (figure 2).


Figure 5: Flowchart 1.1.3: “Business Recovery”

Options for Business Recovery

The following general approaches are available to remove inconsistencies between systems and will be described in more detail later:

Application- or object-level options; addressing inconsistencies by comparing and fixing business objects

Initial load; addressing inconsistencies by retransferring inconsistent data from a leading system

Message-based approaches; addressing inconsistencies by repeating the message transfer

The most suitable approach must be determined for each business object and may thus be different for each type of inconsistency.

Dealing with Pending Messages

Handling of non-processed messages (pending messages) contained in the message queues of the involved systems plays an important role during business recovery because:

They influence the differentiation of real inconsistencies from temporary differences,

On the one hand they may contain data that can be salvaged, but

On the other hand they may contain data that leads to duplicates or logical errors if processed.

Depending on the state of each of the systems involved in business recovery, it must be determined how each of these message queues has to be handled. The options are:

Delete pending messages because the related data objects will be handled completely by the compare and resynchronization process

Delete pending messages because they contain data that is already available in the other systems


For example, after an incomplete recovery, the outbound queues of the recovered system may be deleted because that data is already available in the connected systems (unless these queues were stopped and the messages were thus not processed).

Process pending messages because they contain important information that can be rebuilt that way

For example, after an incomplete recovery:

o The inbound queues of the recovered system may contain valuable data that may need to be processed to recover that data.

o The outbound messages in all other systems should be processed because they contain data that is not yet available in the recovered system. To preserve the correct order of these messages, however, it might be required to postpone their processing until all data objects have been compared and fixed.
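The decision options above can be condensed into a sketch; this is purely illustrative and does not replace the queue-by-queue assessment described in the text:

```python
def pending_queue_action(data_already_in_target, covered_by_resync):
    """Illustrative decision for one pending message queue after recovery."""
    if covered_by_resync:
        # The compare-and-resynchronize process will handle these objects.
        return "delete"
    if data_already_in_target:
        # Replaying the messages would create duplicates.
        return "delete"
    # The messages carry data not yet available in the target system;
    # processing may need to be postponed to preserve the correct order.
    return "process"
```

In practice, the two inputs must be determined separately for each queue and direction, based on the recovery outcome of each involved system.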

Application- / Object-level Options

SAP offers a number of tools or reports to compare data objects between different systems. Many of these tools also allow fixing (deleting or re-transferring) inconsistent objects. The following tools are currently available from SAP:

CRM: Data Integrity Manager (DIMa) to check and correct business objects between CRM and ERP as well as CRM and CDB (consolidated database for mobile applications)

CRM: data exchange toolbox to check and correct one order documents (SAP note 718322)

SCM: Tools to check the internal and external consistency of business objects between APO and liveCache, and between APO and ERP, respectively

For more information see the documentation for each of these tools and the overview that can be found in the Best Practice “Data Consistency Monitoring within SAP Logistics” that will be available at http://service.sap.com/solutionmanagerbp.

If business objects are affected by inconsistencies where no SAP tools are available, the following options may be evaluated:

Compare objects with customer-developed tools

Check for the availability of not officially released SAP "developer tools"

Compare and fix objects manually

Identify possibly affected objects by:

o Evaluating creation or change date

o Comparing mapping tables in both systems

o Analyzing logfiles providing hints whether data was exchanged in the period in question (for example logfiles written by the Communication Station about data exchanged with CRM Mobile Clients)

o Analyzing information about exchanged documents (for example BDoc message store in CRM)

Example: For an example of executing application-level business recovery as a consequence of data loss in a system of a federated landscape, see section 9.2.

Initial Load

By reloading inconsistent objects from a leading system, inconsistencies can be removed in one go. Initial load may thus be an option for specific business objects, for example, master data if this does not result in the loss of important additional attributes being maintained in the target system.

Message-based Approaches

Apart from comparing and correcting inconsistent business objects with object-specific methods, some cases may allow you to identify and correct inconsistencies by analyzing the message transfer that took place between the systems. A prerequisite is that the messages that were transferred during the period in question are still available in the systems or system logs. The idea is to fix inconsistencies that are caused by lost data by re-sending previously processed messages. Note, however, that the maintenance of number ranges and newly assigned numbers may represent an issue with this approach.

Whether information on messages that were exchanged is still available depends on the type of communication being used between the systems in question:

ALE (IDOC) The “ALE Recovery Tool” (Transactions BDRL, BDRC) allows you to analyze and resend messages. For more information see: http://help.sap.com/saphelp_erp2005vp/helpdata/en/26/29d829213e11d2a5710060087832f8/frameset.htm

RFC For RFC communication, an analysis of past message transfer is not possible since no logs are kept in the systems.

BDoc For BDocs between CRM and ERP, information on the data exchange can be retrieved from the BDoc Message Store (transactions SMW01 and SMW02). For data transfer between CRM and mobile clients, the Mobile Client Log could be evaluated.

Communication via SAP XI SAP XI keeps track of all messages from and to sending systems. In case of EO or EOIO messaging, information on the messages can also be found in the sending and receiving systems. For RFCs that were routed via XI, information will only be available in XI but not in the sending or receiving systems. Currently there is no tool available to support the analysis or re-sending of messages.

File interfaces Data that was uploaded via file interfaces could be recreated by repeating the file upload, if the file is still available. Thus, data consistency to external applications could be reestablished.

Further Steps / Remaining Data Loss

If business recovery became necessary due to some data loss in a system of the landscape, business recovery using the above options might have been able to bring back some of the lost data by transferring it from another system that holds a second copy of the data. But usually, not all lost data can be recovered that way, for example:

Data objects that were not exchanged with other systems

Special attributes of data objects that do not exist in the other systems where the object was replicated from

Next step: If business recovery was not able to completely recover all lost data, further recovery will proceed with phase 4, trying to re-enter lost data into the system.

7.5 Data Re-entry (Recovery-Phase 4)

Goal: Get back lost data in a single system that could not be recovered by previous phases

Description

An incomplete database recovery or the resolution of database block corruptions in phase 1 might have caused, respectively, the loss of data for a complete period of time or a very isolated, partial loss of objects.

Data repair for logical errors in phase 2 might also have caused some loss of data objects or attributes.


Business recovery in phase 3 might have been able to get back some lost data from other systems of the environment, but in most cases not to full scale.

At this stage, the data available in the systems should be consistent within and between the systems. So what is left now is the task of re-entering any information into the system(s) that is still lost after (or due to) the previous recovery phases.

Involved Organization: BC Team, Business, Key users

Procedure

In general, the knowledge about which data is lost should be quite exact at this point in time, or at least the period that is affected by the data loss can be narrowed down very well. Any such data should now be re-entered into the system as comprehensively as possible.

There are two options to approach the data re-entry phase:

Key users get access to the system and re-enter lost data before the system is returned to regular operation

Data re-entry is postponed and done by some key users or by the normal business users after handover to production.

The best time and method to re-enter lost data depend on the nature of the affected data, and both approaches may be taken, dependent on the different business objects.

The following options may apply to get back lost information:

Users enter the data from written notes or from memory

Data is re-entered from an external input stream (batch input, file input, upload tools, transports, and so on)

Data is recovered from a copy of the old, corrupt production system

Data is recovered from a (recently copied) test system

Next step: Recovery has finished, and checks should verify that the recovery was indeed successful and that the system is ready to return to productive operation.

8 Returning to Normal Operation (Steps 11 to 16)

Checks

When recovery and error resolution have finished, checks are needed to verify that the system has really reached an error-free state that allows returning to normal operations.

Checks should verify:

Functional operability of business processes

Correctness and consistency of business data

The approval of whether the system is ready for production will be based on the results of these checks. The decision should be taken by the business continuity manager, together with application management.

If recovery quality was not sufficient to return to production, recovery must be continued until a satisfactory state is reached.

In specific situations, it may be possible to hand over the system "partially", which means that some business processes might be excluded for a limited time. This could be the case, for example, if the system in general has reached a sound state that allows continuing business operations, with the exception of some business processes that need further fixing. A prerequisite for a partial handover is a clear separation of the released and the non-released processes and their data.


Involved Organization: BC Team, Key users, Senior Management

Handover to Production

After handover to production, regular users are allowed to log on to the system and work with their usual functionality. All established workarounds will be called off.

Completing Data Integrity

After handover to production, some follow-up activities may still be required, for example to:

Re-enter lost data into the system that was purposefully not yet covered during recovery phase 4 (section 7.5).

Allow business users to identify and re-enter lost data that was not identified by the recovery team. To enable users to check on their data, they need to be informed in detail about the critical period of recovery.

Integrate data, which was created while using the workaround processes, back into the regular system. Depending on the nature of the workaround or alternate process, such data may be available on paper or in other systems.

Involved Organization: BC Manager, Users

Leave Disaster Status

When users signal that all data has been recovered to the best of their knowledge, and when the remains of workarounds have been successfully integrated back into the productive system, this disaster case can be closed.

Lessons Learned

Having left the disaster status, follow-up activities should further investigate the root cause of the disaster with the goal of avoiding similar situations in the future. To learn for future emergencies, the complete disaster handling process should be reviewed to identify possible areas of improvement in the business continuity plan.

Involved Organization: BC Manager


9 Examples

9.1 Example 1: Media Failure

Error Scenario:

The SAP system runs on an Oracle database located on a RAID-protected storage system.

Two disks in the same RAID group fail. Multiple Oracle tablespaces are located on this RAID group. A backup containing the lost files, together with the complete change logs (online and offline redo logs), is available for restore and recovery.

Figure 6: Recovery Flow for Example 1

Recovery Phase 1:

Execute restore and complete DB recovery

The Oracle database can only be mounted, because datafiles are missing. By querying the view v$recover_file while the database is in mount state, you can find out which datafiles need to be restored from a backup.

The latest backup taken before the crash that contains the missing datafiles is identified in the directory /oracle/<SID>/sapbackup. The missing datafiles are restored with SAP's tool brrestore.

The backup logfile contains information about the redo log that was in use when the backup was taken. The database view v$log shows which redo log file is the current one. All redo log files that are no longer available on disk are restored from tape with SAP's tool brrestore.

After making all files available on disk, a recovery of the database is started with 'recover database;' in the Oracle tool sqlplus.


Details of this procedure can be found in note 4161.
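The file identification described above can be sketched as follows. All file names and the simulated view contents are purely illustrative assumptions; in a real recovery, this information comes from v$recover_file, v$log, and the backup logs:

```python
# Illustrative sketch: combine the datafiles reported by v$recover_file
# with the redo logs missing on disk to build the brrestore work list.
# All file names below are hypothetical examples.

def build_restore_list(files_needing_recovery, required_redologs, redologs_on_disk):
    """Return all files that must be restored from backup before
    'recover database' can be started."""
    missing_logs = [f for f in required_redologs if f not in redologs_on_disk]
    return list(files_needing_recovery) + missing_logs

restore_list = build_restore_list(
    files_needing_recovery=["/oracle/C11/sapdata1/btabd_1/btabd.data1"],  # from v$recover_file
    required_redologs=["log_101.dbf", "log_102.dbf", "log_103.dbf"],      # from backup log / v$log
    redologs_on_disk=["log_103.dbf"],                                     # still present on disk
)
print(restore_list)
```

Once everything in this list has been restored, the database recovery itself can be started as described above.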

As a rough estimate, the restore time for datafiles and redo logs is approximately the same as the backup runtime. During log recovery, approximately 50-500 MB of redo log volume can be recovered per minute. The actual rate depends on the hardware and can be estimated much more precisely after applying the first redo logs.
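The rough runtime estimate above can be expressed as a small calculation; the backup runtime and redo volume used here are example assumptions only:

```python
# Illustrative estimate of the total recovery time, based on the rules
# of thumb above: restore time is roughly the backup runtime, and log
# recovery proceeds at an observed 50-500 MB of redo per minute.

def estimate_recovery_minutes(backup_runtime_min, redo_volume_mb, redo_rate_mb_per_min):
    restore = backup_runtime_min                      # restore ~ backup runtime
    log_recovery = redo_volume_mb / redo_rate_mb_per_min
    return restore + log_recovery

# Example assumptions: 90-minute backup, 6000 MB of redo to apply.
worst = estimate_recovery_minutes(90, 6000, 50)       # pessimistic throughput
best = estimate_recovery_minutes(90, 6000, 500)       # optimistic throughput
print(f"estimated recovery time: {best:.0f}-{worst:.0f} minutes")
```

After the first few redo logs have been applied, the observed throughput can replace the assumed rate to narrow the estimate.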

Recovery Phase 2:

There are no logical errors in the application data of this system and recovery phase 2 is not needed.

Recovery Phase 3:

There is no data loss. Since the database was recovered completely (including the latest committed transaction), all messages that were being exchanged between the systems when the crash occurred reflect exactly the state at that point in time. All messages are restartable and can be processed as before. Recovery phase 3 is not needed.

Recovery Phase 4:

No data was lost; recovery phase 4 is not needed.

9.2 Example 2: Media Failure and Database Recovery Failure

Error Scenario:

This example assumes the same error scenario as example 1. However, in this case, a complete recovery is not possible. During database log recovery, it is recognized that one logfile needed for recovery is defective and cannot be applied to the database. Therefore, the recovery must be aborted. Analysis of the timestamps of the logfiles shows that the recovered state of the database lies two hours before the time of the database crash. This means that two hours of business data are lost and cannot be recovered by technical means.


Figure 7: Recovery Flow for Example 2

Recovery Phase 1:

Database recovery ended with an incomplete recovery of the database. The database and the SAP system can be started, but 2 hours of data is lost. Nonetheless, the system is in a consistent state (as it was 2 hours before the media failure).

Recovery Phase 2:

There are no logical errors in the application data of this system and recovery phase 2 is not needed.

Recovery Phase 3:

The data loss has an impact on data consistency between the systems of the landscape. Therefore, business recovery is required.

Example business scenario: The ERP system is the leading system for material master records. The major part of material master is created or changed in ERP and subsequently loaded to the CRM system. However, inside CRM the users also create competitive materials to track similar products of competitors for statistics about lost sales opportunities.

The CRM system underwent an incomplete recovery and is now in a state that is two hours older than that of the ERP system, implying that some material updates are now missing.

As part of the recovery, most material masters can be recreated by a repeated load from ERP to CRM. For instance, you could use report RSSCD100 to evaluate all change documents for ERP materials, creating a list of materials for a corrective load. First, check whether recently changed materials are still in their messaging phase (that is, in the ERP outbound queue, the CRM inbound queue, or a CRM inbound BDoc in validation error) to identify temporary differences. With the remaining list of materials, you can create a so-called request download definition in CRM Middleware to extract the materials from ERP and load them into CRM again.
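The selection logic described above amounts to simple set arithmetic; the material numbers and queue contents used here are hypothetical:

```python
# Illustrative sketch of the corrective-load selection: start from all
# materials changed during the critical period (e.g. evaluated via
# report RSSCD100) and subtract those still in their messaging phase,
# since those differences are only temporary and resolve on their own.

def corrective_load_candidates(changed_materials, in_flight_materials):
    """Materials that need a request download from ERP to CRM."""
    return sorted(set(changed_materials) - set(in_flight_materials))

changed = ["MAT-100", "MAT-101", "MAT-102", "MAT-103"]   # changed during the lost 2 hours
in_flight = ["MAT-102"]                                  # still in outbound/inbound queues
print(corrective_load_candidates(changed, in_flight))
```

The resulting list would feed the request download definition in CRM Middleware.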


Recovery Phase 4:

Some data is still lost and now needs to be re-entered manually.

The competitive materials were not available in the ERP system and therefore could not be recreated automatically. You would have to manually recreate the lost competitive materials.

Checks:

Afterwards, it is recommended to run a full comparison of material masters between ERP and CRM with the CRM DIMa tool (Data Integrity Manager).

9.3 Example 3: Lost Data

Error Scenario:

This example presents the handling of a logical error. We assume that, due to a user fault, the data of a complete application table was deleted.

Imagine that the CRM table CRMM_TERRITORY was dropped by a user error. As a result, the complete CRM Territory Management application becomes unusable.

Recovery Phase 1:

This phase is not applicable. Since the error is categorized as “logical error”, recovery starts with phase 2.

To demonstrate that logical errors can require very different measures during the course of recovery, the example is now separated into cases 3a, 3b and 3c.

9.3.1 Example 3a: All Data Can be Recovered

Figure 8: Recovery Flow for Example 3a


Recovery Phase 2:

Recovery is done using the following procedure:

Figure 9: Handling of Logical Errors Using an Analysis System

Steps:

1. Block user access

2. Unload / export new data (if applicable)

3. Restore database to an “analysis system”

4. Recover analysis system close to the error

5. Unload / export data from analysis system

6. Insert data into production (repair)

7. Merge rescued data with repaired data

Attention: The details depend on many factors (such as the error type and the affected objects), so careful analysis and application knowledge are required.
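Steps 6 and 7 above (inserting the repaired data and merging it with the new data exported in step 2) can be sketched as follows; the record structure, keys, and the "newer export wins" merge rule are illustrative assumptions that must be validated by application experts for the concrete error:

```python
# Illustrative merge of repaired data (recovered from the analysis
# system) with the new data exported from production before the repair
# (step 2). Where both sources contain the same key, the newer
# production export wins, since it reflects work done after the error.

def merge_rescued_and_repaired(repaired_rows, rescued_rows):
    merged = dict(repaired_rows)       # rows recovered from the analysis system
    merged.update(rescued_rows)        # newer rows exported from production
    return merged

repaired = {"T-001": "North region", "T-002": "South region"}
rescued = {"T-002": "South region (renamed)", "T-003": "West region"}
print(merge_rescued_and_repaired(repaired, rescued))
```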

Result of recovery phase 2 in example 3a: The recovery of the data in table CRMM_TERRITORY from the analysis system is completely successful; no data was lost. Thus, recovery phases 3 and 4 are not required.

9.3.2 Example 3b: Remaining Data Loss is Only Locally Relevant

Recovery Phase 2:

Recovery of the table data could not bring back the table completely. Further analysis of the data loss shows that it does not impact objects or attributes that are exchanged with other systems. Thus, recovery phase 3 is not required, but phase 4 needs to be applied to re-enter the lost information.

As in the example above, the CRM table CRMM_TERRITORY was dropped by a user error and the Territory Management application becomes unusable on the CRM system. However, a complete recovery failed and only a part of the table was restored to its original state.


Figure 10: Recovery Flow for Example 3b

Recovery Phase 3:

Not required for example 3b.

Recovery Phase 4:

Now, the remaining missing table entries of CRMM_TERRITORY need to be recreated manually. The complete list of key fields (territory GUIDs) is still available in related tables, for example in the territory structure table CRMM_TERRSTRUCT or in the territory validity table CRMM_TERRITORY_V. With good knowledge of the data model, it is possible to reconstruct the structure of the missing entries of the main table.
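Identifying which entries are missing from the main table amounts to a set difference between the key sets; the GUID values here are hypothetical:

```python
# Illustrative sketch: related tables (e.g. CRMM_TERRSTRUCT or
# CRMM_TERRITORY_V) still contain the full set of territory GUIDs, so
# the entries missing from the partially restored main table are the
# difference between the two key sets.

def missing_territory_guids(guids_in_related_tables, guids_in_main_table):
    return sorted(set(guids_in_related_tables) - set(guids_in_main_table))

related = {"GUID-A", "GUID-B", "GUID-C", "GUID-D"}  # keys found in related tables
restored = {"GUID-A", "GUID-C"}                      # survived the partial recovery
print(missing_territory_guids(related, restored))
```

The resulting key list defines exactly which entries have to be recreated manually.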

9.3.3 Example 3c: Remaining Data Loss Causes Cross-system Inconsistencies

Recovery Phase 2:

Recovery of table data could not bring back the table completely. Further analysis of the data loss shows that the lost objects are relevant for data exchange with other systems. Thus, recovery phase 3 (business recovery) is required to reestablish data consistency between the systems.

We take up the example above again, with the loss of the CRM table CRMM_TERRITORY. As in example 3b, a complete recovery failed and only a part of the table was restored to its original state. However, the situation is now more complex: in the new scenario, the Territory Management application runs not only on the CRM server, but also on CRM Mobile Clients (a laptop application). We thus need to reestablish data consistency between the CRM server and the CRM Mobile Clients.


Figure 11: Recovery Flow for Example 3c

Recovery Phase 3:

As a first important step, we need to identify the affected systems. In our example, the connected CRM Mobile Clients get a periodic update of the territory structure by a regular background run of program CRM_TERRMAN_DOWNLOAD. This updates the CDB (consolidated database for the mobile scenario) and sends delta messages to the Mobile Clients.

As an emergency step, it is advisable to deactivate this delta update job to prevent uncontrolled distribution of incomplete table entries to the Mobile Clients.

On the other hand, the CDB can serve as a source for reconstructing missing entries of the CRM table. It contains all entries up to the last delta update in a comparable data model (table SMOTERR and others). Therefore, the keys and attributes of the territories can be collected there and serve as a basis for the manual recreation of the missing territories. For a larger number of missing records, it is also feasible to develop a small ABAP program for this task.
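The reconstruction idea can be sketched as follows. The field names are hypothetical; the real mapping between the CDB data model (SMOTERR and others) and the CRM tables must be worked out by application experts:

```python
# Illustrative sketch of collecting templates for missing CRM territory
# records from the CDB copy. GUIDs not found in the CDB correspond to
# territories created after the last delta update; they have no
# reference data and must be handled in recovery phase 4.

def rebuild_from_cdb(missing_guids, cdb_rows):
    """Return the CDB rows available for the missing GUIDs, to be used
    as templates for manual (or ABAP-supported) recreation."""
    by_guid = {row["guid"]: row for row in cdb_rows}
    return [by_guid[g] for g in missing_guids if g in by_guid]

cdb = [
    {"guid": "GUID-B", "description": "East region"},
    {"guid": "GUID-D", "description": "Overseas"},
]
# GUID-E is missing from the CDB: it was created after the last delta update.
print(rebuild_from_cdb(["GUID-B", "GUID-D", "GUID-E"], cdb))
```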

Recovery Phase 4:

As a last step, we need to identify which territories could not be recreated. Records newly created after the last delta update run have to be recreated manually, without any reference system.

9.4 Example 4: Database Block Corruptions

Error Scenario:

During normal system operation, corrupt table blocks in an Oracle database are discovered. A full consistency check on all database files is triggered (as described in SAP note 23345), returning no further corruptions. No regular consistency check was done in the past, so it cannot be guaranteed that there is a backup available containing an older, uncorrupted version of the blocks. A consistency check now performed on the oldest backup available shows the same corruptions. Restore and recovery of single corrupt datablocks or datafiles is therefore not an option.

The customer's hardware partner and the hardware partner's SAP Competence Center are involved to check the hardware.

Figure 12: Recovery Flow for Example 4

Recovery Phase 1:

Trying to access data in non-corrupt blocks fails as soon as a corrupt block has been read. To restore full access, at least to the non-corrupt data, a new version of the table needs to be created that contains all data from the non-corrupt blocks but none of the corrupt blocks. In general, this can be achieved by reading 'around' the corrupt blocks and copying everything from the corrupt table to a "clean" table. After renaming, the "clean" table finally becomes the original table and contains the readable data from the corrupt table minus the data from the corrupt blocks. The new table is now corruption-free, but some data was lost.
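The clean-copy idea can be sketched as follows. The block layout is simulated here; real tooling works on the database level, and the keys of rows in corrupt blocks are in reality gathered indirectly (for example, from indexes), not read from the blocks themselves:

```python
# Illustrative sketch of a "clean copy": read around corrupt blocks,
# copying only readable rows into a new table, while collecting the
# keys of the lost rows for the later repair phases.

def clean_copy(blocks):
    """blocks: list of (is_corrupt, rows); each row is (key, data)."""
    clean_rows, lost_keys = [], []
    for is_corrupt, rows in blocks:
        if is_corrupt:
            # In reality these keys come from secondary sources such as
            # indexes, since the block itself is unreadable.
            lost_keys.extend(key for key, _ in rows)
        else:
            clean_rows.extend(rows)
    return clean_rows, lost_keys

blocks = [
    (False, [("K1", "row 1"), ("K2", "row 2")]),
    (True,  [("K3", "row 3")]),            # corrupt block: data lost
    (False, [("K4", "row 4")]),
]
rows, lost = clean_copy(blocks)
print(len(rows), lost)
```

The list of lost keys is exactly the information the application experts need for the later recovery phases.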

This example shows that solving a technical issue can imply the need for corrections in further recovery phases.

Block corruptions differ from other technical failures in that the actions in the HW- and DB-related phase may depend on the possible actions in a later recovery phase. Therefore, it is necessary to involve application experts already in the first phase to decide how each corrupt table should be dealt with. Depending on the specific table, options other than the above procedure of copying the table and losing the corrupt rows may be possible:

- If all columns of the table are in at least one index, retrieve the data from the indexes without reading the table blocks

- If the table contains redundant data from another table, create a new, empty version of the table and refill it with application reports (for example, for tables BSIS, BSAS and so on) or on the fly, during normal system operation (for example, for ABAP load or ABAP dynpro tables)

- Recreate the table empty if it contains data that is less important and will not cause any harm by being deleted. Candidates for this category are tables containing log information, statistics, and so on (for example, MONI).

In the above scenario, we assumed that no special handling of the table was possible. As much data as possible needs to be copied to the clean table, and as much information as possible has to be gathered about the lost rows (for example, their primary keys), so that in the later recovery phases the logical inconsistencies can be repaired efficiently by application experts.

While the copy of the corrupt table is being prepared, the system is still in use. Only the business processes that access exactly the corrupted data in the corrupt table cannot be executed.

Copying the data to the clean table requires system downtime, because concurrent updates to the non-corrupt areas of the table must be avoided during the copy. The duration of the downtime can be estimated by a test copy (the copied data is deleted afterwards) while the system is still in use. The duration of the copy largely depends on the hardware and on unforeseen problems caused by the corruption. In the worst case, the copy gets stuck and Oracle support has to be involved. Therefore, SAP strongly recommends performing such a test copy. After the test run, an appropriate downtime is planned as soon as possible.

For copying the data, creating the indexes on the clean table, and gathering the information about the lost rows, the SAP-internal tool Clean Copy for Oracle (note 796399) is used.

Once the physical recovery of the table is done, the application experts can continue with the next recovery phase. They receive the information on how many table rows were lost, together with a list of the key values of the lost rows. They may then decide that further downtime is required during the next recovery phases, or that it is possible to go live with some restrictions after finishing phase 1 and to execute the following recovery phases while the system is already released for production.

Recovery Phases 2, 3 and 4:

Depending on the achievable quality of the error resolution in phase 1, it may be necessary to proceed with further recovery phases. In comparison to example 3, cases similar to 3a, 3b and 3c might be considered. This possible flow of error handling is indicated as 'possible' in figure 12. Since the handling of these errors is generally done in the same way as described in example 3, it is not repeated here.


Appendix

A - Flowcharts for Printout

This appendix repeats the flowcharts provided in this document for print-out.


Flowchart 1: “Emergency Handling”

Flowchart 1.1: “Recovery Phases”


Flowchart 1.1.1: “Technical Recovery”

Flowchart 1.1.2: “Data Repair”


Flowchart 1.1.3: “Business Recovery”


© Copyright 2008 SAP AG. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP AG. The information contained herein may be changed without prior notice.

Some software products marketed by SAP AG and its distributors contain proprietary software components of other software vendors.

Microsoft®, WINDOWS®, NT®, EXCEL®, Word®, PowerPoint® and SQL Server® are registered trademarks of Microsoft Corporation.

IBM®, DB2®, OS/2®, DB2/6000®, Parallel Sysplex®, MVS/ESA®, RS/6000®, AIX®, S/390®, AS/400®, OS/390®, and OS/400® are registered trademarks of IBM Corporation.

ORACLE® is a registered trademark of ORACLE Corporation.

INFORMIX®-OnLine for SAP and Informix® Dynamic Server™ are registered trademarks of Informix Software Incorporated.

UNIX®, X/Open®, OSF/1®, and Motif® are registered trademarks of the Open Group.

HTML, DHTML, XML, XHTML are trademarks or registered trademarks of W3C®, World Wide Web Consortium, Massachusetts Institute of Technology.

JAVA® is a registered trademark of Sun Microsystems, Inc. JAVASCRIPT® is a registered trademark of Sun Microsystems, Inc., used under license for technology invented and implemented by Netscape.

SAP, SAP Logo, R/2, RIVA, R/3, ABAP, SAP ArchiveLink, SAP Business Workflow, WebFlow, SAP EarlyWatch, BAPI, SAPPHIRE, Management Cockpit, mySAP.com Logo and mySAP.com are trademarks or registered trademarks of SAP AG in Germany and in several other countries all over the world. All other products mentioned are trademarks or registered trademarks of their respective companies.

Disclaimer: SAP AG assumes no responsibility for errors or omissions in these materials. These materials are provided "as is" without a warranty of any kind, either express or implied, including but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.

SAP shall not be liable for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials. SAP does not warrant the accuracy or completeness of the information, text, graphics, links or other items contained within these materials. SAP has no control over the information that you may access through the use of hot links contained in these materials and does not endorse your use of third party Web pages nor provide any warranty whatsoever relating to third party Web pages.