
VERS @ DOI Implementation

Department of Infrastructure

VERS RKS DISASTER RECOVERY MANUAL

PILOT SYSTEM

August 2001

Page 2: DRP Manual

Document History

Version   Description   Author        Date           Pages
0.1       Draft         Evolution 6   10 July 2001
1.0       Draft         Evolution 6   15 Aug 2001
1.1       Draft         Evolution 6   27 Aug 2001

Client approval and Sign Off

Authorisation

Evolution 6 <Clientname>

Project Manager <Position>

Signature: Signature:

Date: Date:

Department of Infrastructure

All rights reserved. No part of this publication may be reprinted, reproduced, stored in a retrieval system or transmitted, in any form or by any means, without the prior permission in writing from the Department of Infrastructure.

Department of Infrastructure Date 14/06/10 Page 2 VERS RKS DRP Manual Draft 1.1


Table of Contents

1. OVERVIEW

2. INTRODUCTION

3. REQUIREMENTS DEFINED BY SCHEDULE 8 & PROV

4. FURTHER CONSIDERATIONS

5. RECOVERY TIMEFRAME

6. DATA INTEGRITY AND RELIABILITY

6.1 DATABASE SERVER / SERVICE CRASH OR STOPPED
6.2 DATABASE SERVER DISK ARRAY FAILURE
6.3 DOCUMENT REPOSITORY DISK ARRAY FAILURE
6.4 SERVER / ARRAY FAILURE DURING VEO CREATION PROCESS
6.5 WORKSTATION CRASH DURING VEO REGISTRATION OR UPDATE
6.6 ACCIDENTAL DESTRUCTION OF VEOS BY STAFF
6.7 SAN COMMUNICATIONS FAILURE
6.8 SERVER ROOM DISASTER

7. ROLES & RESPONSIBILITIES

8. GLOSSARY OF TERMS

9. REFERENCES


1. Overview

Planning for the business continuity of an organization in the aftermath of a disaster is a complex task. Preparation for, response to, and recovery from a disaster affecting the administrative functions of the organization requires the cooperative efforts of many support departments in partnership with the functional areas supporting the "business" of DOI.

This document proposes disaster recovery plans to address various types of possible disaster scenarios. The plans reflect the analysis and determination of appropriate responses as agreed in discussions with representatives from Corporate IT and other departments.

This document is intended to provide a framework, with some possible solutions, for the backup and disaster recovery plans of the VERS@DOI project. As with all disaster recovery situations, not every variation can be documented. Before attempting a VEO (VERS Encapsulated Object) recovery, please familiarize yourself with the VEO processing procedures described in the VERS RKS Operations Manual.

Why Disaster Recovery?

Planning for the business continuity of DOI in the aftermath of a disaster is a complex task. Preparation for, response to, and recovery from a disaster affecting the administrative functions of the organization requires the cooperative efforts of many divisions in partnership with the functional areas supporting the "business" of DOI.

The objectives of a disaster recovery plan for information services are to make sufficient preparations, and to establish a sufficient set of agreed upon procedures, for responding to a disaster or emergency, in order to minimize the effect upon the operation of the business.

Need for a Disaster Recovery Plan

Three areas need to be reviewed: legal responsibility, financial loss and business service interruptions.

Legal Responsibility: Management has a legal responsibility to protect its corporate resources and information.

Financial Loss: Because of the efficiency, accuracy, speed and control that information services provide, organizations have become more dependent on them in normal business operations. If the information services break down, the organization could suffer a great financial loss, or even be destroyed, if proper disaster planning has not been done.

Business Service Interruption: This can be very damaging to future relationships with customers and can also affect the public image of the organization. The cost of not taking precautions could be far greater than the cost of modest preparation for disaster recovery.


2. Introduction

The Documentation for the VERS RKS Pilot System has been broken down into two categories:

• System Documentation – where the DOI IT Division is intended to be the primary user of the documentation.

• User Documentation – where the business users of the application are the intended audience (covering Contributors & Browsers, Business Area Experts and Records Administrators)

USER DOCUMENTATION

Includes the following documents:

1. VERS RKS Records Creation and Discovery Guide - Pilot System

2. VERS RKS Business Area Experts Guide - Pilot System

3. VERS RKS Records Administration Manual (TRIM) – Pilot System

4. VERS RKS Configuration and Management Manual - Pilot System

SYSTEM DOCUMENTATION

Includes the following documents:

1. VERS RKS Operations Manual - Pilot System

2. VERS RKS Migration Manual - Pilot System

3. VERS RKS Disaster Recovery Manual - Pilot System (this document)

4. VERS RKS Program Documentation – Pilot System

5. VERS RKS Product Release Notes (includes both COTS and Custom developed software)


3. Requirements Defined by Schedule 8 & PROV

At a minimum, the requirements outlined in Schedule 8 must be adhered to, unless otherwise formally documented and accepted by DOI and Solution 6. The requirements that fall under the scope of this document are:

• The system must be available 99% of the time during Prime¹ time and 95% of the time during non-Prime time

• Records must not be lost in the event of media, system, component or communication failures or in the event of other disasters (i.e. flood, fire etc.)

• Disaster Recovery planning and procedures

• Suggested functionality of the VERS@DOI maintenance tool

• The PROV standard requires that a second backup be taken, by a separate device and onto different media. For example, one backup made to tape and another to CD-R

Not covered in the scope of this document :

• Details of any tools or software used to prove the integrity of the recovered data
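To give a feel for what the availability targets above allow, the following minimal sketch converts them into a weekly downtime budget. It takes the Prime time window from the footnote definition (Monday to Friday, 8:00am to 6:00pm) and assumes a plain seven-day week with no public holidays; the function and constant names are illustrative only and are not part of the VERS RKS system.

```python
# Illustrative only: weekly downtime budget implied by the availability targets.
# Assumption: a plain 7-day week with no public holidays.

PRIME_HOURS_PER_WEEK = 5 * 10                              # Mon-Fri, 8:00am to 6:00pm
NON_PRIME_HOURS_PER_WEEK = 7 * 24 - PRIME_HOURS_PER_WEEK   # every other hour of the week


def allowed_downtime_hours(window_hours: float, availability: float) -> float:
    """Return the hours of downtime permitted in a window at a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return window_hours * (1.0 - availability)


prime_budget = allowed_downtime_hours(PRIME_HOURS_PER_WEEK, 0.99)
non_prime_budget = allowed_downtime_hours(NON_PRIME_HOURS_PER_WEEK, 0.95)

print(f"Prime time budget: {prime_budget * 60:.0f} minutes per week")
print(f"Non-Prime time budget: {non_prime_budget:.1f} hours per week")
```

Under these assumptions the Prime time budget is about half an hour per week, which is far shorter than any restore from tape and is one reason the preventive measures in section 6 matter.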


4. Further Considerations

Standard DOI backup cycles will be employed, with the exception of the creation of yearly tapes. For more information on the backup cycles, please refer to the Backup section of the VERS RKS Operations Manual.

Once a document has been added to a record within the VERS RKS system, it loses its volatility: the only information that will change in future is the record metadata. New metadata will be created as an “onion layer” on top of the existing metadata for the record, without any change to the document.

An audit of the recovery process should be performed annually to evaluate backup procedures and controls, ensuring that the DRP is rehearsed and updated.


5. Recovery Timeframe

Although every effort will be made to restore the system with the least amount of effort and time, there will be certain delays because validations must be performed on the VERS RKS system with regard to the disposal or expunging of records.

Certain considerations need to be taken into account when estimating the recovery timeframe up to the point at which operation actually recommences. The activities below must be performed after any recovery procedure.

All records that have been expunged or disposed of are recorded in a separate table. After the restoration of the database, the IT Backup Regime, along with the Records Manager, must make sure that all records that appear in that table are deleted from the recovered VERS RKS system.

Please refer to the VERS RKS Configuration and Management Manual for more information on using Disposal Schedules.
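The check described above, finding records that reappear after a restore even though they had been expunged, can be sketched as a simple join. The example below uses an in-memory SQLite database purely for illustration. The table names `records` and `expunged_records`, their columns, and the function name are hypothetical, not the TRIM schema, and the sketch assumes the list of expunged records is still available after the restore. It only reports candidates for review and deletes nothing.

```python
import sqlite3


def find_resurrected(conn):
    """Return URIs present in the restored data that were previously expunged."""
    rows = conn.execute(
        "SELECT r.uri FROM records AS r "
        "JOIN expunged_records AS e ON e.uri = r.uri "
        "ORDER BY r.uri"
    ).fetchall()
    return [uri for (uri,) in rows]


# Hypothetical stand-in data: record 2 was expunged after the backup was taken,
# so the restore has brought it back.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (uri INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE expunged_records (uri INTEGER PRIMARY KEY);
    INSERT INTO records VALUES (1, 'kept'), (2, 'expunged before the failure'), (3, 'kept');
    INSERT INTO expunged_records VALUES (2), (9);
""")

resurrected = find_resurrected(conn)
print("Records to review with the Records Manager:", resurrected)
```

Reporting first and deleting afterwards keeps the Records Manager in the loop before anything is removed from the recovered system.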

After every restore process, the maintenance tool must be executed to ensure that the information in the TRIM system and in the Fulcrum database is in sync and that no information has been lost between them.

The maintenance tool can be used in conjunction with application logs to detect and remedy database or VEO inconsistencies. Please refer to section 19.6 Maintenance Program in the VERS Operations Manual for more information on using the maintenance tool.

Shakeout testing should be performed to validate the processing of all the modules. Please refer to section 16.3 Shakeout Testing in the VERS Operations Manual for more information on conducting shakeout tests.


6. Data Integrity and Reliability

Over and above all forms of prevention discussed in this section, some form of server clustering (e.g. Microsoft Wolf Pack, NCR LifeKeeper, Compaq TruServer or ServerNet II, etc.) would greatly improve system recovery speed and could also improve overall data integrity.

As with all systems, database and VEO data are most vulnerable to corruption and loss between nightly backups. In order to reduce the risks, the following steps have been taken:

• Incremental log dumps of the database allow hourly snapshots of the database.

• The primary VEO storage disk sub-system is RAID-5 (VEO repository).

• The Work Group servers will cache all VEO activity for three days to a secondary RAID-5 disk sub-system.

The above precautions ensure that in most cases data integrity can be maintained. For VEOs to be lost, the parity protection of both arrays would have to be exceeded: two drives would have to fail in the VEO repository and two in the Work Group server (a minimum of four drives).
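The four-drive figure follows from the fact that a RAID-5 array survives the loss of any single drive, so each of the two copies is only lost once a second drive in its array fails. A minimal sketch of that reasoning, with illustrative function names:

```python
RAID5_TOLERATED_FAILURES = 1  # a RAID-5 array survives the loss of any single drive


def min_drive_failures_for_loss(independent_copies: int) -> int:
    """Minimum simultaneous drive failures needed to lose data that is held
    on `independent_copies` separate RAID-5 arrays."""
    if independent_copies < 1:
        raise ValueError("at least one copy is required")
    return independent_copies * (RAID5_TOLERATED_FAILURES + 1)


def veos_lost(failed_in_repository: int, failed_in_cache: int) -> bool:
    """True only when both the VEO repository and the Work Group cache arrays
    have lost more drives than RAID-5 can tolerate."""
    return (failed_in_repository > RAID5_TOLERATED_FAILURES
            and failed_in_cache > RAID5_TOLERATED_FAILURES)


print(min_drive_failures_for_loss(2))   # repository plus Work Group cache
print(veos_lost(2, 1))                  # cache array still intact
print(veos_lost(2, 2))                  # both arrays beyond tolerance
```

The sketch treats the Work Group cache as a full second copy; as noted in the list above, the cache only holds three days of VEO activity, so older VEOs depend on the repository array and the tape backups.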

Between backup cycles, possible causes of corruption and loss include:

• Database server / service crash or stopped (a stoppage could prevent users logging on or users modifying existing data but does not affect the integrity of live data)

• Database server disk array failure

• Document repository disk array failure

• Server / Array failure during VEO creation process

• Workstation crash during VEO registration or update

• Accidental destruction of VEOs by staff

• SAN / Server Communications Failure

• Server Room Disaster

6.1 Database Server / Service Crash or Stopped

Database activity and SAN activity should be monitored to determine the maximum frequency of log dumps per day that does not greatly affect performance. Log dumps are usually very quick to run and can be used to restore a lost database to the point of the last log dump.

Due to the importance of VEO metadata, we recommend a minimum frequency of one dump every sixty minutes.

6.1.1 Detection

The main symptoms will be the inability to connect to the system, or to add, view or alter VEOs.

6.1.2 Recovery

The recovery procedure depends on the type of problem being faced and could require anything from a server / service restart to a new server.

Below are the steps to recover from a corrupt database:

Inform all users to exit from the VERS RKS system and verify the process²;

Shut down TRIM Workgroup and Master services³;

Shut down Fulcrum Indexer, Encapsulator and PDF Generator services⁴;

Back up the corrupted database⁵;

Back up the SQL log files;

Execute standard hardware maintenance⁶;

Restore last SQL backup from tape;

Apply SQL logs at SQL Server level;

Reconcile the Fulcrum server with the maintenance tools;

Run TRIM store check;

Perform shakeout testing.
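The restore and log-application steps above amount to choosing the last full backup taken before the failure and then applying, in order, every log dump taken after it. The sketch below illustrates that selection and the resulting recovery point. It is not the SQL Server restore procedure itself; the function name and all timestamps are hypothetical, and it ignores any extra recovery that the log files saved in the "Back up the SQL log files" step might allow.

```python
from datetime import datetime


def plan_restore(full_backups, log_dumps, failure_time):
    """Pick the last full backup before the failure and the log dumps to apply after it."""
    usable_backups = [t for t in full_backups if t <= failure_time]
    if not usable_backups:
        raise ValueError("no full backup predates the failure")
    base = max(usable_backups)
    # Only dumps taken after the chosen backup and before the failure are useful.
    logs = sorted(t for t in log_dumps if base < t <= failure_time)
    recovery_point = logs[-1] if logs else base
    return base, logs, recovery_point


# Hypothetical schedule: nightly full backups and hourly log dumps during the day.
full_backups = [datetime(2001, 8, 26, 23, 0), datetime(2001, 8, 27, 23, 0)]
log_dumps = [datetime(2001, 8, 28, hour, 0) for hour in range(8, 18)]
failure = datetime(2001, 8, 28, 14, 35)

base, logs, recovery_point = plan_restore(full_backups, log_dumps, failure)
print("Restore full backup taken at:", base)
print("Log dumps to apply:", len(logs))
print("Database recovered to:", recovery_point)
print("Work at risk:", failure - recovery_point)
```

With dumps every sixty minutes, the work at risk can never exceed one hour of metadata changes, which is the reasoning behind the recommended minimum dump frequency.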

6.2 Database Server Disk Array Failure

Each disk array uses fault-tolerant RAID-5 technology. Each disk sub-system can tolerate one faulty disk at any given time and still function with all data retained. If a second disk fails before the first has been replaced and rebuilt, a restore from backup will be required.

In order to minimise risk and dependency on a single disk array, each database device could be mirrored to a separate array. Disk mirroring is a supported feature of SQL Server and will not require any specialized hardware or software. Each log dump during the day could also be copied to another array.

An alternative to mirroring the database devices is to install a disk-mirroring module in the SAN.


6.2.1 Detection

The method of detection will vary depending on the type of crash and server monitoring procedures & software employed. Detection is the responsibility of DOI.

6.2.2 Recovery

Please refer to the recovery procedure in section 6.1.2 for a database server / service crash.

Note that in the case of a multiple-disk array failure where an array has to be restored, it takes around 6 to 8 hours before the restore operation can commence. The actual data restore can then take 24 to 48 hours, depending on the size of the database and the complexity of the restoration process.
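Taken together, the two ranges quoted above give a rough outage window for this scenario. A minimal sketch of the arithmetic follows; the function name is illustrative, and the result excludes the post-restore activities listed in section 5 (maintenance tool, TRIM store check and shakeout testing).

```python
def total_outage_hours(preparation_hours, restore_hours):
    """Add two (minimum, maximum) ranges, given in hours."""
    return (preparation_hours[0] + restore_hours[0],
            preparation_hours[1] + restore_hours[1])


# Ranges taken from the paragraph above.
best_case, worst_case = total_outage_hours((6, 8), (24, 48))
print(f"Estimated outage: {best_case} to {worst_case} hours")
```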

6.3 Document Repository Disk Array Failure

The document repository, whilst currently not mirrored, has redundant data in the TRIM workgroup server cache.

An alternative to copying the VEOs is to install a disk-duplexing module in the SAN.

6.3.1 Detection

The method of detection will vary depending on the type of crash and server monitoring procedures and software employed. Detection is the responsibility of DOI.

6.3.2 Recovery

In the event of multiple disk failure and complete corruption:

Load the last good backup⁵.

In the event of a crash during an encapsulation session, the encapsulation log will need to be read to determine which VEOs require encapsulation. In order to set the VEOs for encapsulation, run the SQL Query Analyzer tool⁷.

Execute the following SQL script against the TRIM database⁷:

-- Queue every record flagged is_veo_ready for encapsulation (Action 'I')
INSERT INTO VEO_TRIGGERS (URI, [Date], Action, Stage)
SELECT uri, getDate(), 'I', 'F'
FROM SysTest1.dbo.TSRECELEC
WHERE RIGHT(CAST(reSID AS Char(14)), 3) = 'TXT'
  AND uri IN (SELECT evObjectUri
              FROM SysTest1.dbo.tsexfieldv
              WHERE evFieldVal = 'Y'
                AND evFieldUri IN (SELECT uri
                                   FROM SysTest1.dbo.tsexfield
                                   WHERE exfieldname = 'is_veo_ready'))

-- Remove duplicate triggers, keeping the earliest entry per URI, Action and Stage
DELETE FROM VEO_TRIGGERS
WHERE Sequence NOT IN (SELECT min(Sequence)
                       FROM VEO_TRIGGERS
                       GROUP BY URI, Action, Stage)


In the event of a crash during a re-encapsulation session, the encapsulation log will need to be read to determine which VEOs require re-encapsulation. In order to set the VEOs for re-encapsulation, run the SQL Query Analyzer tool⁷.

Execute the following SQL script against the TRIM database⁷:

-- Queue every record flagged is_veo_ready for re-encapsulation (Action 'U')
INSERT INTO VEO_TRIGGERS (URI, [Date], Action, Stage)
SELECT uri, getDate(), 'U', 'F'
FROM SysTest1.dbo.TSRECELEC
WHERE RIGHT(CAST(reSID AS Char(14)), 3) = 'TXT'
  AND uri IN (SELECT evObjectUri
              FROM SysTest1.dbo.tsexfieldv
              WHERE evFieldVal = 'Y'
                AND evFieldUri IN (SELECT uri
                                   FROM SysTest1.dbo.tsexfield
                                   WHERE exfieldname = 'is_veo_ready'))

-- Remove duplicate triggers, keeping the earliest entry per URI, Action and Stage
DELETE FROM VEO_TRIGGERS
WHERE Sequence NOT IN (SELECT min(Sequence)
                       FROM VEO_TRIGGERS
                       GROUP BY URI, Action, Stage)
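The de-duplication step that closes both scripts above can be tried safely away from the production database. The sketch below reproduces the same DELETE statement against an in-memory SQLite table. The column types, the auto-incrementing Sequence column and the sample rows are assumptions made for illustration; the real VEO_TRIGGERS definition is described in the VERS RKS Operations Manual.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed table shape: Sequence is taken to be an auto-incrementing key.
conn.executescript("""
    CREATE TABLE VEO_TRIGGERS (
        Sequence INTEGER PRIMARY KEY AUTOINCREMENT,
        URI INTEGER,
        Date TEXT,
        Action TEXT,
        Stage TEXT
    );
    INSERT INTO VEO_TRIGGERS (URI, Date, Action, Stage) VALUES
        (101, '2001-08-27', 'I', 'F'),
        (101, '2001-08-28', 'I', 'F'),
        (102, '2001-08-28', 'I', 'F'),
        (101, '2001-08-28', 'U', 'F');
""")

# Same statement as in the manual: keep the earliest trigger for each
# (URI, Action, Stage) combination and delete the later duplicates.
conn.execute("""
    DELETE FROM VEO_TRIGGERS
    WHERE Sequence NOT IN (SELECT min(Sequence)
                           FROM VEO_TRIGGERS
                           GROUP BY URI, Action, Stage)
""")

remaining = conn.execute(
    "SELECT Sequence, URI, Action FROM VEO_TRIGGERS ORDER BY Sequence"
).fetchall()
print(remaining)
```

In this sample only the second trigger for URI 101 with Action 'I' is removed. The 'U' trigger for the same URI survives because Action is part of the grouping key.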

6.4 Server / Array Failure During VEO Creation Process

The maintenance tool can be used in conjunction with application logs to detect and remedy database or VEO inconsistencies.

6.4.1 Detection

Detection is the responsibility of DOI.


6.4.2 Recovery

Recovery depends on the failed array and the state of the VEO. Using the Encapsulator, PDF Generator and Fulcrum Indexer logs and the contents of the VEO_TRIGGERS table, ascertain the current state of the VEO. Please refer to the Database Triggers section of the VERS RKS Operations Manual to understand the various stages of processing.

If a record has been created but not indexed, VEO processing will begin automatically when the RKS is back on-line.

If a record has been turned into a VEO but its entry still exists in the VEO_TRIGGERS table, the table entry must be manually removed. VEO creation information can be found in the RKS software logs. Please refer to the VERS RKS Operations Manual for the log file location and an explanation of the VEO_TRIGGERS table values⁷.

DELETE FROM VEO_TRIGGERS where URI = <value of the URI>

Recovery from specific points of failure can be managed, once the failure point has been determined, by using a point-in-time restore⁵.

6.5 Workstation Crash During VEO Registration or Update

Recovery depends on the time of failure. A failure can result from either the record creation application or Windows crashing.

6.5.1 Detection

Application and operating system generated error messages.

6.5.2 Recovery

The method of recovery depends on the stage the VEO was at when the workstation crashed.

If a crash occurs before “Finish”, “Create Record” or “Update Record” then the record must be recreated. No other action is required⁸.

If a crash occurs after “Finish”, “Create Record” or “Update Record” and before a “Successful Creation” message box appears, the record may be damaged or incomplete⁸.


Attempt to search for the record in RKS and check whether it contains all documents, relationships and other values⁸.

6.6 Accidental Destruction of VEOs by Staff

General users do not have direct access to any of the VEOs. When users attempt to view a VEO or its contents, the TRIM workgroup service account fetches all documents on behalf of the users. In order to delete a VEO, one will need to be a TRIM administrator or server administrator, or know the workgroup service account password.

6.6.1 Detection

TRIM provides the ability to reconcile the contents of the VEO repository with the TRIM database to determine if all VEOs exist and are the correct size.
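The idea behind such a reconciliation, checking that every VEO known to the database exists in the repository with the expected size, can be illustrated as follows. This is not how TRIM implements its store check. The function name, the `.veo` file names and the sizes are hypothetical, and a temporary directory stands in for the VEO repository.

```python
import os
import tempfile


def reconcile(expected_sizes, repository_dir):
    """Compare expected VEO sizes (file name -> bytes) with the files on disk."""
    missing, wrong_size = [], []
    for name, size in sorted(expected_sizes.items()):
        path = os.path.join(repository_dir, name)
        if not os.path.isfile(path):
            missing.append(name)
        elif os.path.getsize(path) != size:
            wrong_size.append(name)
    return missing, wrong_size


# Hypothetical repository: c.veo has been deleted and b.veo is truncated.
with tempfile.TemporaryDirectory() as repo:
    for name, payload in (("a.veo", b"12345"), ("b.veo", b"12")):
        with open(os.path.join(repo, name), "wb") as handle:
            handle.write(payload)
    expected = {"a.veo": 5, "b.veo": 4, "c.veo": 7}
    missing, wrong_size = reconcile(expected, repo)

print("Missing:", missing)
print("Wrong size:", wrong_size)
```

A size comparison only detects deletion and truncation. It would not catch a file whose contents were altered without changing its length.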

6.6.2 Recovery

Any missing VEOs will be recovered from tape or the workgroup server cache⁹.

6.7 SAN Communications Failure¹⁰

6.7.1 Detection

The method of detection will vary depending on the type of failure and server monitoring procedures and software employed. Detection is the responsibility of DOI.

6.7.2 Recovery

Recovery would vary depending on the type of failure and activity on the servers at the time of the failure.

6.8 Server Room Disaster

The intent here is to understand the impact of a disaster in the server room which will result in the whole system being inoperable at the site.

6.8.1 Detection

Detection depends on the type of disaster.


6.8.2 Recovery

Recovery procedures would be as per normal for any server file system recovery if the server room is deemed inoperable.

This assumes that the VERS RKS training environment, which is quite similar to the production system with its three identically set up servers, is the backup system at the secondary data site.

Backups of the database and the information should be taken to the secondary site once the server room is deemed inoperable.

The following steps need to be carried out at the secondary data site to restore the information from the backup media.

Execute standard hardware maintenance⁶;

Restore last SQL backup from tape⁵;

Apply SQL logs at the SQL Server level⁵;

Restore last information backup from tape to a file system location (in the absence of a SAN)¹¹;

Reconcile the Fulcrum server with the maintenance tools⁴;

Run TRIM store check³;

Perform shakeout testing⁴.


7. Roles & Responsibilities

The following personnel are required to be present during the pre- and post-recovery processes.

Responsibilities of Roles

IT System Administrator

Responsible for the verification and operational maintenance of the VERS RKS system at the server level

• Shutdown of the TRIM Workgroup and Master services

• Shutdown of the Fulcrum Indexer, PDF Generator and Encapsulator services

• Reconcile Fulcrum and TRIM database with the execution of the maintenance tool

• Perform TRIM store check

• Perform shakeout testing

• Execute SQL query to determine missing records at the SQL Server database level with assistance from the SQL Server DBA

• Execute SQL query to remove unwanted record information at the SQL Server database level with assistance from the SQL Server DBA

• Identifying and recovering missing VEOs from backup or workgroup server cache with assistance from the SQL Server DBA

Records Manager / System Administrator

Responsible for the verification and operational maintenance of the VERS RKS system at the business level

• Notify all users of the DR procedures, advising them to log off and verify the process

• Identifying records to be recreated

• Verification and maintenance of the records at the TRIM level

• Identification and removal of information from the VERS RKS system, after the database restoration process, for records that should have been expunged or purged, with the assistance of the SQL Server DBA


SQL Server DBA

Responsible for the operational maintenance, backup and restoration of the SQL Server database. Will also liaise with the various roles in the maintenance of the VERS RKS system

• Daily full backup of the SQL database

• Hourly backup of the SQL log dump

• Backup of the corrupted database

• Backup of the SQL log files

• Restore last SQL backup from tape

• Application of SQL logs at SQL Server level

• Daily full backup of File System information with assistance from the IT System Administrator and NT Administrator

NT Administrator

Responsible for the maintenance of the VERS RKS system hardware, communications, security and network operation

• Execution of standard hardware maintenance

• Maintenance of the Server hardware environment including SAN, communication, network, etc


8. Glossary of Terms

Term                     Description
API                      Applications Programming Interface
ASP                      Active Server Page
CDSE                     Common Desktop Standard Environment
COTS                     Commercial Off The Shelf
DCOM                     Distributed Component Object Model
DOI                      Department of Infrastructure
EDMS                     Electronic Document Management System
EIP                      Enterprise Information Portal
Encapsulator             Process that converts objects into a VEO
Fulcrum Index            Hummingbird Search/Indexing Software
GUI                      Graphic User Interface
HTML                     Hyper Text Markup Language
IE5                      Internet Explorer version 5
JAWS                     System Accessibility Software for the visually impaired
MAGIC                    System Accessibility Software for the visually impaired
PDF                      Portable Document Format
PROV                     Public Record Office Victoria
RKS                      Record Keeping System
Trigger                  Control entry in table that drives events within the RKS
TRIM Workgroup Server    Core service which receives, parses and responds to any request between the clients and the backend servers of the RKS system
TRIM                     Tower Records and Information Management Software
VB                       Visual Basic
VEO                      VERS Encapsulated Object – XML Object wrapped with Digital Signature designed to ensure long term preservation and recoverability of electronic documents
VERS                     Victorian Electronic Records Strategy
WOVG                     Whole of Victorian Government
URL                      Uniform Resource Locator, e.g. a Web address
XML                      eXtensible Markup Language


9. References

Further information and documents referred to in this document:

Contract for the Provision of a VERS Compliant Record Keeping System

VERS RKS Analysis and Design Report – Pilot System

VERS RKS Strategy Report – Pilot System

VERS Test Strategy

VERS Test Cases

VERS Pilot Testing – Data Population

DOI Technology Architecture

Technology Infrastructure Stds P382C

WOVG App Standards (Lotus notes & domino)

Commonwealth Human Rights & Equal Opportunity Commission Guidelines for Web Accessibility

Electronic Transactions (Victoria) Act 2000

AS4390

PROS 00/02


1 Prime time is defined as Monday to Friday 8:00am to 6:00pm, except for recognised Victorian-wide Public Holidays. The Solution 6 Group can only be held responsible for RKS system failures and not for failures of the underlying infrastructure or hardware.

2 Please refer to the VERS Configuration & Management Manual for more information on this section. Responsibility of the Records Manager.

3 Please refer to the TRIM User & Administration Guide for more information on this section. Responsibility of the IT System Administrator.

4 Please refer to the VERS Operations Manual for more information on this section. Responsibility of the IT System Administrator.

5 Responsibility of the SQL Server DBA.

6 Standard hardware maintenance will be based on DOI’s regular practice of maintenance done prior to the recovery process.

7 Responsibility of the IT System Administrator with assistance from the SQL DBA.

8 Responsibility of the User and the Records Manager.

9 Responsibility of the IT System Administrator with assistance from the Records Manager and SQL DBA.

10 Responsibility of the NT Administrator.

11 Responsibility of the SQL DBA with assistance from the NT Administrator and IT System Administrator.