Lesson 20. Fault Tolerance and Disaster Recovery


Objectives

At the end of this presentation, you will be able to:

• Identify the purpose and characteristics of fault tolerance.

• Explain how redundancy is used in servers and networks to eliminate single points of failure.

• Identify several techniques used in servers and network systems to increase fault tolerance.

• Define: Fault tolerance, redundancy, RAID, mirror server, and cluster.

• Plan for disaster recovery.

• Develop a disaster recovery plan.

• Implement a disaster recovery plan.

• Document and regularly test the disaster recovery plan.

• Explain standard backup procedures and backup media storage practices.

• Identify types of backups and restoration schemes.

• Confirm and use off-site storage of backups.

Network+ Domain covered:

• 3.11

• 3.12

Fault Tolerance

• The ability of a network or a computer to go on working in spite of one or more component failures.

• Achieved by eliminating “single points of failure.”

• Achieved primarily through redundancy.

Redundancy in the Server

• Eliminates the most common “single points of failure.”

• Uses multiple components in parallel so that if one component fails another takes over.
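The benefit of parallel redundancy can be estimated with a short calculation. The sketch below assumes that component failures are independent and that any one working component can carry the full load; the 0.99 availability figure is illustrative only and does not come from the slides.

```python
def parallel_availability(component_availability: float, count: int) -> float:
    """Availability of `count` redundant components in parallel.

    Assumes independent failures: the group is down only when
    every component is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if count < 1:
        raise ValueError("count must be at least 1")
    return 1.0 - (1.0 - component_availability) ** count


# Illustrative numbers: one power supply versus a redundant pair.
single = parallel_availability(0.99, 1)
pair = parallel_availability(0.99, 2)
print(f"single: {single:.4f}  redundant pair: {pair:.4f}")
```

Under these assumptions the second component cuts expected downtime by a factor of 100, which is why redundancy targets the parts that fail most often.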

Hardware Failure

• Disk Drives: 50%

• Power Supply: 28%

• Fan: 8%

• CPU: 5%

• Memory: 4%

• Controller: 4%

• Motherboard: 1%

Source: Intel Corp.

Redundant Array of Inexpensive Drives (RAID)

• RAID is a way of coaxing two or more inexpensive, slow, unreliable drives to perform in concert so that they act like a more expensive, faster, reliable drive.

A disk system with RAID capability:

• Protects its data and provides on-line, immediate access to its data, despite a single disk failure.

• Provides for the on-line reconstruction of the contents of a failed disk to a replacement disk.

RAID Advisory Board (RAB)

Various RAID implementations exist. They are identified as Levels.

• The basic implementations are called level 0 through level 6.

• A higher level is not necessarily better than a lower level.

RAID can be implemented in:

• Software
  o Slower
  o Less expensive

• Hardware
  o Faster
  o More expensive

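RAID Level 1 - data is written to two separate drives.

Figure: blocks 0 1 2 3 are written identically to both drives.

Provides access to data despite a disk failure.

Figure: with one drive failed, blocks 0 1 2 3 are still read from the surviving drive.

Provides for reconstruction of the contents of the failed disk.

Figure: blocks 0 1 2 3 are copied from the surviving drive to the replacement drive.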
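The three RAID Level 1 behaviors (mirrored writes, reads despite a failure, and reconstruction) can be sketched in a few lines. This is a toy in-memory model, not a real RAID implementation; the class and method names are hypothetical.

```python
class Mirror:
    """Toy RAID Level 1 set: every block is written to both drives."""

    def __init__(self, block_count: int):
        self.drives = [[None] * block_count, [None] * block_count]
        self.failed = [False, False]

    def write(self, block: int, data: str) -> None:
        # Write the same data to every drive that is still working.
        for index, drive in enumerate(self.drives):
            if not self.failed[index]:
                drive[block] = data

    def read(self, block: int) -> str:
        # Any working drive can satisfy the read.
        for index, drive in enumerate(self.drives):
            if not self.failed[index]:
                return drive[block]
        raise IOError("both drives have failed")

    def fail(self, index: int) -> None:
        # Simulate a dead drive: its contents are lost.
        self.failed[index] = True
        self.drives[index] = [None] * len(self.drives[index])

    def rebuild(self, index: int) -> None:
        # Copy the survivor's contents onto the replacement drive.
        survivor = 1 - index
        if self.failed[survivor]:
            raise IOError("no surviving drive to rebuild from")
        self.drives[index] = list(self.drives[survivor])
        self.failed[index] = False


mirror = Mirror(4)
for block in range(4):
    mirror.write(block, f"block{block}")
mirror.fail(0)                  # one disk fails
data_after_failure = mirror.read(2)   # data is still available
mirror.rebuild(0)               # contents reconstructed on the replacement
```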

Redundant Hard Drives in a Server

Figure: server chassis with five hard drives. Courtesy Intel Corp.

Redundant Power Supplies

Figure: redundant power supplies with a spare power supply. Courtesy Intel Corp.

Hot-Swap Fans

Figure: hot-swap fan with caution label. Courtesy Intel Corp.

Dual Processor Slots

Figure: two CPU sockets. Courtesy Intel Corp.

Redundant NICs

Active NIC

Spare NIC

Backup Power

• Standby Power Supply (SPS)

• Uninterruptible Power Supply (UPS)

Standby Power Supply (SPS)

• An “off-line” device that functions only when normal power fails.

• A sensor detects AC power failure and switches over to standby power.

• Standby power is provided by a battery and a power inverter.

Standby Power Supply (Normal)

Figure: block diagram showing the charger and battery pack.

Standby Power Supply (AC Power Fails)

Figure: block diagram showing the charger, inverter, and battery pack.

Uninterruptible Power Supply (UPS)

• An “on-line” device that constantly provides power.

• In the event of an AC power failure there is no switchover to standby power, because the UPS is constantly “on-line.”

• It “conditions” the AC input, isolating the computer equipment from all variations in AC power.
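The difference between the off-line SPS and the on-line UPS comes down to what feeds the load in each state. The sketch below is a simplified model of the two power paths described above; the `PowerPath` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PowerPath:
    load_fed_by: str   # what the computer equipment draws power from
    battery: str       # what the battery is doing


def sps_path(ac_present: bool) -> PowerPath:
    """Standby supply: the load runs on the AC line until it fails."""
    if ac_present:
        return PowerPath(load_fed_by="AC line", battery="charging")
    # A sensor detects the failure and switches to battery plus inverter.
    return PowerPath(load_fed_by="inverter", battery="discharging")


def ups_path(ac_present: bool) -> PowerPath:
    """On-line supply: the load always runs from the inverter."""
    battery = "charging" if ac_present else "discharging"
    return PowerPath(load_fed_by="inverter", battery=battery)


for ac in (True, False):
    print(f"AC present={ac}: SPS={sps_path(ac)}  UPS={ups_path(ac)}")
```

Because the UPS load path is the same in both states, there is nothing to switch when AC power fails; the SPS load path changes, so there is a brief switchover.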

The UPS conditions the AC line against:

• Power outages – Total loss of AC power.

• Surges – Temporary voltage rises.

• Sags – Temporary voltage drops.

• Noise – High frequency voltage spikes, both up and down.

Uninterruptible Power Supply (Normal)

Figure: block diagram showing the charger, inverter, and battery pack.

Uninterruptible Power Supply (AC Power Fails)

Figure: block diagram showing the charger, inverter, and battery pack.

Increase Fault Tolerance

• RAID

• Multiple power supplies

• Multiple fans

• Multiple CPUs

• Redundant PCI cards

• Backup power sources

Disaster Recovery

Types of Disasters

• Fires

• Floods

• Wind and water damage

• Accidents

• Power outages

• Civil unrest

• Malicious attacks

Disaster Recovery

• The ability to return to an acceptable level of operation after a disaster.

• Requires a well thought-out disaster recovery plan.

• A comprehensive implementation of the plan.

• Frequent testing and updating of the plan.

7 Steps to Disaster Tolerance

• Initiate the project

• Form a project team

• Complete a needs analysis

• Develop a plan that encompasses both protection and recovery

• Implement the plan

• Test the plan

• Constantly update the plan

What’s in the Protection Plan?

• Procedures and policies describing how the facility, its functions, and data are to be protected.

• List of new protective equipment, software, and services needed along with a budget, procurement schedule and installation schedule.

• A step-by-step procedure and timetable for upgrading the data center from its present state to a protected state.

What’s in the Recovery Plan?

• Procedures and policies describing how and under what conditions the recovery plan should be activated.

• Basic protective and recovery information on each major piece of equipment.

• Names and telephone numbers of key corporate officials and the emergency management team members.

• Address of off-site backup facilities, with name and number of contact person.

• Location of backup tapes and disks.

What’s in the Recovery Plan? (continued)

• Names and phone numbers of key hardware, software and services vendors.

• Model numbers, serial numbers, as well as warranty and service agreement information on major pieces of equipment.

• Insurance policy numbers and information.

• Documentation of the equipment, software, configuration and wiring infrastructure of the data center.

24 X 7 X 365

• 24 hours per day

• 7 days per week

• 365 days per year

Backing up the Main System

• Hot Site Backup

• Warm Site Backup

• Cold Site Backup

Hot Site Backup

• A duplicate and running complement of computer hardware and software ready to take over immediately should the main system become unavailable for any reason.

• Data on the main system is backed up to the duplicate system in real time.

• If the main system fails the duplicate can take over operation without any downtime.

Warm Site Backup

• A duplicate complement of computer hardware and software ready to take over in a reasonable length of time should the main system become unavailable.

• Data is not backed up to the duplicate system in real time, but can be restored from backup tapes or other media.

Cold Site Backup

• An off-site location that can be used in case the main site is inoperable.

• Ready to go, but with no equipment installed.

• Least expensive to maintain.

• The recovery time is quite long compared to hot-site or even warm-site backup.

Implement the Plan

• Buying and installing the equipment, software, and services necessary to bring the data center up to a protected state.

• Training in the discipline of new policies and procedures.

• The plan may be so extensive and so expensive that it must be phased in over time.

• Every day that the plan is delayed, the company is at risk.

Test the Plan

• The only way to ensure the plan works.

• Simulate a disaster.

• Test the plan regularly and thoroughly.

Constantly Update the Plan

• New equipment

• New software

• New people

• New tasks

Safeguard the Disaster Recovery Plan

• Make duplicate copies.

• Make certain that duplicate copies are updated when the master copy is updated.

• Make sure key people know where the document can be found.

Need for Frequent and Regular Backup

• The most effective way to prevent data loss.

• Protects all but the most recent data.

• Protects against hardware failure, equipment theft, hackers, viruses and vandals.

• Storing backups in a different location protects against fire, flood and other natural disasters.

Backup Considerations

• What data should be backed up?

• How often should the data be backed up?

• What type of backup media should be used?

• What type of backup scheme should be used?

What data should be backed up?

• Backups require time and media.

• Backups must not exceed the capacity of the backup device.

• Ask yourself: “Can I afford to lose this?”

• It is better to err on the side of caution.

What data need not be routinely backed up?

• Operating systems

• Application software

• Historical data that does not change

How often should the data be backed up?

• Trade-off of risk versus benefit.

• Ask yourself: “How much data can I afford to lose?”

• Daily backups are the most common.

• But circumstances may dictate anything from continuous real-time backups to weekly backups.

What type of backup media should be used?

• Magnetic Disks

• Optical Disks

• Magnetic Tape

• Internet Backup

Types of Backup

• Full

• Incremental

• Differential

Full Backup

• The backup of all files on the drive.

• Takes the longest time to record because every file is copied.

• Takes the shortest time to restore because everything is on a single tape.

Full Backup

Figure: one full backup tape for each day, Monday through Sunday.

Restoration From a Full Backup Requires Only One Tape

Figure: the Wednesday tape alone.

The Full Backup is:

• A straightforward method of ensuring good backups and quick, easy restorations.

• The starting point for the Incremental and the Differential Backups.

Incremental Backup

• Records only those files that have changed since the last Incremental or Full Backup.

• Takes the shortest time to record.

• Generally takes the longest time to restore.

• Generally requires several tapes to restore.

Incremental Backup

Figure: Sunday (Full Backup), followed by Incremental Backups Monday through Saturday.

Restoration From an Incremental Backup May Require Several Tapes

Figure: Sunday (Full Backup) plus the Monday, Tuesday, and Wednesday Incremental Backups.

Differential Backup

• Records only those files that have changed since the last Full backup.

• Takes less time to record than a Full backup.

• Takes less time to restore than an Incremental backup.

• The restore process requires only two tapes.
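The three backup types differ only in the cutoff used to decide which files are recorded. The sketch below selects files by modification time; this is an assumption made for illustration, since real backup software may track changes in other ways (for example with an archive flag). The function name, file names, and dates are hypothetical.

```python
from datetime import datetime


def files_to_record(files, backup_type, last_full, last_backup):
    """Return the names of the files a backup of the given type records.

    files:        dict mapping file name -> last modification time
    last_full:    time of the most recent Full backup
    last_backup:  time of the most recent backup of any type
    """
    if backup_type == "full":
        return sorted(files)               # every file is copied
    if backup_type == "incremental":
        cutoff = last_backup               # changed since last Incremental or Full
    elif backup_type == "differential":
        cutoff = last_full                 # changed since last Full
    else:
        raise ValueError(f"unknown backup type: {backup_type}")
    return sorted(name for name, modified in files.items() if modified > cutoff)


# Hypothetical example: Full backup on the 7th, last backup on the evening of the 9th.
last_full = datetime(2024, 1, 7, 23, 0)
last_backup = datetime(2024, 1, 9, 23, 0)
files = {
    "archive.dat": datetime(2024, 1, 5, 9, 0),    # unchanged since the Full backup
    "orders.db": datetime(2024, 1, 8, 14, 0),     # changed the day after the Full backup
    "payroll.xls": datetime(2024, 1, 10, 11, 0),  # changed since the last backup
}

for kind in ("full", "incremental", "differential"):
    print(kind, files_to_record(files, kind, last_full, last_backup))
```

The Differential list grows each day until the next Full backup, while each Incremental list holds only one day of changes.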

Differential Backup

Figure: Sunday (Full Backup), followed by Differential Backups Monday through Saturday.

Restoration From a Differential Backup Requires Only Two Tapes

Figure: Sunday (Full Backup) plus the Wednesday Differential Backup.
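The restoration examples for the three schemes can be summarized as a small function that lists the tapes needed to restore the system as of a given day. It assumes the weekly schedule used in the diagrams, with the Full backup on Sunday; the function name is hypothetical.

```python
# Sunday is the Full backup day in the Incremental and Differential schedules.
WEEK = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]


def tapes_to_restore(scheme: str, day: str) -> list[str]:
    """List the tapes needed to restore the system as of `day`."""
    position = WEEK.index(day)  # raises ValueError for an unknown day
    if scheme == "full":
        return [day]                       # each day's tape is complete
    if scheme not in ("incremental", "differential"):
        raise ValueError(f"unknown scheme: {scheme}")
    if position == 0:
        return ["Sunday"]                  # the Full backup itself
    if scheme == "incremental":
        return WEEK[: position + 1]        # Full plus every Incremental since
    return ["Sunday", day]                 # Full plus the latest Differential


for scheme in ("full", "incremental", "differential"):
    print(scheme, tapes_to_restore(scheme, "Wednesday"))
```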

Grandfather - Father - Son (GFS) Tape Rotation Scheme

• Son – The daily backup tapes.

• Father – The full backup tape for the week.

• Grandfather – The full backup tape for the month.

Reuse Tapes

• Son – After one week

• Father – After five weeks

• Grandfather – Save indefinitely at an off-site location.
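A GFS rotation can be expressed as a rule that maps each backup date to a role and a reuse date. The sketch below assumes the weekly Full backup falls on Sunday and that the last Sunday of the month is the Grandfather tape; both are assumptions for illustration, and the function names are hypothetical.

```python
from datetime import date, timedelta

# Reuse periods from the rotation scheme; None means keep indefinitely.
REUSE_AFTER = {
    "son": timedelta(weeks=1),
    "father": timedelta(weeks=5),
    "grandfather": None,
}


def gfs_role(day: date) -> str:
    """Classify a backup date as son, father, or grandfather."""
    if day.weekday() != 6:                 # Monday is 0, Sunday is 6
        return "son"                       # daily backup
    next_sunday = day + timedelta(days=7)
    if next_sunday.month != day.month:     # assumed: last Sunday of the month
        return "grandfather"
    return "father"                        # weekly Full backup


def reusable_on(day: date):
    """Date the tape written on `day` may be reused, or None if never."""
    keep = REUSE_AFTER[gfs_role(day)]
    return None if keep is None else day + keep


for sample in (date(2024, 1, 8), date(2024, 1, 7), date(2024, 1, 28)):
    print(sample, gfs_role(sample), reusable_on(sample))
```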

Verify that the Backup Works

• Can you restore from the backup tape?

• Can you still restore from the backup tape if the original tape drive is destroyed?
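One way to answer the first question is to restore the backup to a scratch location and compare checksums against the original files. The sketch below is a minimal example of that comparison using SHA-256; the function names and the temporary directories in the demonstration are hypothetical, and running the restore on a different drive than the one that wrote the tape addresses the second question.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a file in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_failing_restore(original_dir, restored_dir) -> list[str]:
    """Return relative paths that are missing or different after a restore."""
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        restored = restored_dir / relative
        if not restored.is_file() or sha256_of(source) != sha256_of(restored):
            problems.append(str(relative))
    return problems


# Demonstration with throwaway directories standing in for real data.
with tempfile.TemporaryDirectory() as scratch:
    original = Path(scratch, "original")
    restored = Path(scratch, "restored")
    original.mkdir()
    restored.mkdir()
    (original / "a.txt").write_text("payroll")
    (restored / "a.txt").write_text("payroll")
    (original / "b.txt").write_text("orders")
    (restored / "b.txt").write_text("orders (damaged)")
    problems = files_failing_restore(original, restored)

print(problems)
```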

Problems with Tapes

• Problem 1: Tape drive heads are dirty.
  o Solution: Clean the tape heads.

• Problem 2: The tapes become worn with time and use.
  o Solution: Replace the worn tape with a new tape.

Backup Software

• A utility designed to make routine backups as effortless and as effective as possible.

• Suppliers of backup software
  o NOS vendor
  o Third parties

