Lesson 20. Fault Tolerance and Disaster Recovery

Objectives

At the end of this presentation, you will be able to:

• Identify the purpose and characteristics of fault tolerance.

• Explain how redundancy is used in servers and networks to eliminate single points of failure.

• Identify several techniques used in servers and network systems to increase fault tolerance.

• Define: Fault tolerance, redundancy, RAID, mirror server, and cluster.

• Plan for disaster recovery.

• Develop a disaster recovery plan.

• Implement a disaster recovery plan.

• Document and regularly test the disaster recovery plan.

• Explain standard backup procedures and backup media storage practices.

• Identify types of backups and restoration schemes.

• Confirm and use off-site storage of backups.

Network+ domains covered:

• 3.11

• 3.12

Fault Tolerance

• The ability of a network or a computer to go on working in spite of one or more component failures.

• Achieved by eliminating “single points of failure.”

• Achieved primarily through redundancy.

Redundancy in the Server

• Eliminates the most common “single points of failure.”

• Uses multiple components in parallel so that if one component fails another takes over.

Hardware Failure

• Disk drives: 50%

• Power supply: 28%

• Fan: 8%

• CPU: 5%

• Memory: 4%

• Controller: 4%

• Motherboard: 1%

Source: Intel

[Photo courtesy of Intel Corp.]
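The failure percentages listed above can be tallied to see how much of the risk is addressed by making particular components redundant. The sketch below is only arithmetic on those figures; the dictionary and the function name `covered_share` are illustrative and not part of the lesson.

```python
# Share of server hardware failures by component, in percent (figures from the list above).
failure_share = {
    "Disk drives": 50,
    "Power supply": 28,
    "Fan": 8,
    "CPU": 5,
    "Memory": 4,
    "Controller": 4,
    "Motherboard": 1,
}


def covered_share(redundant_components):
    """Sum the failure share of the components that have a redundant spare."""
    return sum(failure_share[name] for name in redundant_components)


total = sum(failure_share.values())
top_three = covered_share(["Disk drives", "Power supply", "Fan"])
print(f"All components: {total}%")
print(f"Drives, power supplies, and fans: {top_three}%")
```

By this tally, redundant drives, power supplies, and fans address the components responsible for most of the listed failures.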

Redundant Array of Inexpensive Disks (RAID)

• RAID is a way of coaxing two or more inexpensive, slow, unreliable drives to perform in concert so that they act like a more expensive, faster, reliable drive.

A disk system with RAID capability:

• Protects its data and provides on-line, immediate access to its data, despite a single disk failure.

• Provides for the on-line reconstruction of the contents of a failed disk to a replacement disk.

Source: RAID Advisory Board (RAB)

Various RAID implementations exist. They are identified as Levels.

• The basic implementations are called level 0 through level 6.

• A higher level is not necessarily better than a lower level.

RAID can be implemented in:

• Software
  o Slower
  o Less expensive

• Hardware
  o Faster
  o More expensive

RAID Level 1 - data is written to two separate drives.

[Diagram: blocks 0, 1, 2, and 3 are written identically to both drives.]

• Provides access to data despite a disk failure.

• Provides for reconstruction of the contents of the failed disk.
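The RAID Level 1 behavior described above (write to two drives, keep reading after one fails, rebuild onto a replacement) can be sketched as a toy model. This is a minimal illustration, not how a real RAID controller is implemented; the class `MirroredPair` and its methods are hypothetical names.

```python
class MirroredPair:
    """Toy model of RAID Level 1: every block is written to two drives."""

    def __init__(self, num_blocks):
        # Each drive is a list of blocks; None stands for a failed drive.
        self.drives = [[None] * num_blocks, [None] * num_blocks]

    def write(self, block, data):
        # Write the same data to every drive that is still working.
        for drive in self.drives:
            if drive is not None:
                drive[block] = data

    def read(self, block):
        # Read from the first working drive.
        for drive in self.drives:
            if drive is not None:
                return drive[block]
        raise OSError("both drives have failed")

    def fail(self, index):
        # Simulate the loss of one drive.
        self.drives[index] = None

    def rebuild(self, index):
        # Copy the surviving drive's contents onto a replacement drive.
        survivor = self.drives[1 - index]
        if survivor is None:
            raise OSError("no surviving drive to copy from")
        self.drives[index] = list(survivor)


pair = MirroredPair(4)
for block in range(4):
    pair.write(block, f"data{block}")

pair.fail(0)                                   # one disk fails
after_failure = [pair.read(b) for b in range(4)]  # data is still available
pair.rebuild(0)                                # replacement disk is reconstructed
print(after_failure)
```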

Redundant Hard Drives in a Server

[Photo: server chassis with five hard drives. Courtesy of Intel Corp.]

Redundant Power Supplies

[Photo: server with a spare power supply. Courtesy of Intel Corp.]

Hot-Swap Fans

[Photo labels: Caution; Hot-Swap Fans. Courtesy of Intel Corp.]

Dual Processor Slots

[Photo labels: two CPU sockets. Courtesy of Intel Corp.]

Redundant NICs

[Diagram labels: Active NIC; Spare NIC.]

Backup Power

• Standby Power Supply (SPS)

• Uninterruptible Power Supply (UPS)

Standby Power Supply (SPS)

• An “off-line” device that functions only when normal power fails.

• A sensor detects AC power failure and switches over to standby power.

• Standby power is provided by a battery and a power inverter.

[Diagram: Standby Power Supply (Normal). Labels: Charger; Battery Pack.]

[Diagram: Standby Power Supply (AC Power Fails). Labels: Charger; Inverter; Battery Pack.]

Uninterruptible Power Supply (UPS)

• An “on-line” device that constantly provides power.

• In the event of an AC power failure there is no switchover to standby power, because the UPS is constantly “on-line.”

• It “conditions” the AC input, isolating the computer equipment from all variations in AC power.

The UPS conditions the AC line against:

• Power outages – Total loss of AC power.

• Surges – Temporary voltage rises.

• Sags – Temporary voltage drops.

• Noise – High frequency voltage spikes, both up and down.

[Diagram: Uninterruptible Power Supply (Normal). Labels: Charger; Inverter; Battery Pack.]

[Diagram: Uninterruptible Power Supply (AC Power Fails). Labels: Charger; Inverter; Battery Pack.]
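The difference between the off-line SPS and the on-line UPS described above comes down to whether the load ever changes its power source. The sketch below is a deliberately simplified model under that assumption; the function names and the sample timeline are hypothetical, and real devices involve details (such as transfer timing) that the lesson does not cover.

```python
def sps_source(ac_ok):
    """Standby (off-line) supply: the load runs from the AC line until a
    sensor detects a failure, then switches to the battery and inverter."""
    return "AC line" if ac_ok else "battery + inverter"


def ups_source(ac_ok):
    """On-line supply: in this simplified model the load is always fed
    through the inverter, so an AC failure causes no switchover."""
    return "inverter"


def count_switchovers(source_fn, ac_timeline):
    """Count how many times the load's power source changes over a timeline."""
    sources = [source_fn(ac_ok) for ac_ok in ac_timeline]
    return sum(1 for before, after in zip(sources, sources[1:]) if before != after)


# Hypothetical timeline: AC is present, fails for two intervals, then returns.
ac_timeline = [True, True, False, False, True]
sps_switches = count_switchovers(sps_source, ac_timeline)
ups_switches = count_switchovers(ups_source, ac_timeline)
print("SPS switchovers:", sps_switches)
print("UPS switchovers:", ups_switches)
```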

Increase Fault Tolerance

• RAID

• Multiple power supplies

• Multiple fans

• Multiple CPUs

• Redundant PCI cards

• Backup power sources

Disaster Recovery

Types of Disasters

• Fires

• Floods

• Wind and water damage

• Accidents

• Power outages

• Civil unrest

• Malicious attacks

Disaster Recovery

• The ability to return to an acceptable level of operation after a disaster.

• Requires a well-thought-out disaster recovery plan.

• Requires a comprehensive implementation of the plan.

• Requires frequent testing and updating of the plan.

7 Steps to Disaster Tolerance

• Initiate the project

• Form a project team

• Complete a needs analysis

• Develop a plan that encompasses both protection and recovery

• Implement the plan

• Test the plan

• Constantly update the plan

What’s in the Protection Plan?

• Procedures and policies describing how the facility, its functions, and data are to be protected.

• List of new protective equipment, software, and services needed along with a budget, procurement schedule and installation schedule.

• A step-by-step procedure and timetable for upgrading the data center from its present state to a protected state.

What’s in the Recovery Plan?

• Procedures and policies describing how and under what conditions the recovery plan should be activated.

• Basic protective and recovery information on each major piece of equipment.

• Names and telephone numbers of key corporate officials and the emergency management team members.

• Address of off-site backup facilities, with name and number of contact person.

• Location of backup tapes and disks.

What’s in the Recovery Plan? (continued)

• Names and phone numbers of key hardware, software and services vendors.

• Model numbers, serial numbers, as well as warranty and service agreement information on major pieces of equipment.

• Insurance policy numbers and information.

• Documentation of the equipment, software, configuration and wiring infrastructure of the data center.

24 X 7 X 365

• 24 hours per day

• 7 days per week

• 365 days per year

Backing up the Main System

• Hot Site Backup

• Warm Site Backup

• Cold Site Backup

Hot Site Backup

• A duplicate and running complement of computer hardware and software ready to take over immediately should the main system become unavailable for any reason.

• Data on the main system is backed up to the duplicate system in real time.

• If the main system fails the duplicate can take over operation without any downtime.

Warm Site Backup

• A duplicate complement of computer hardware and software ready to take over in a reasonable length of time should the main system become unavailable.

• Data is not backed up to the duplicate system in real time, but can be restored from backup tapes or other media.

Cold Site Backup

• An off-site location that can be used in case the main site is inoperable.

• Ready to go, but with no equipment installed.

• Least expensive to maintain.

• The recovery time is quite long compared to hot-site or even warm-site backup.

Implement the Plan

• Buying and installing the equipment, software, and services necessary to bring the data center up to a protected state.

• Training in the discipline of new policies and procedures.

• The plan may be so extensive and so expensive that it must be phased in over time.

• Every day that the plan is delayed, the company is at risk.

Test the Plan

• The only way to ensure the plan works.

• Simulate a disaster.

• Test the plan regularly and thoroughly.

Constantly Update the Plan

• New equipment

• New software

• New people

• New tasks

Safeguard the Disaster Recovery Plan

• Make duplicate copies.

• Make certain that duplicate copies are updated when the master copy is updated.

• Make sure key people know where the document can be found.

Need for Frequent and Regular Backup

• The most effective way to prevent data loss.

• Protects all but the most recent data.

• Protects against hardware failure, equipment theft, hackers, viruses and vandals.

• Storing backups in a different location protects against fire, flood and other natural disasters.

Backup Considerations

• What data should be backed up?

• How often should the data be backed up?

• What type of backup media should be used?

• What type of backup scheme should be used?

What data should be backed up?

• Backups require time and media.

• Backups must not exceed the capacity of the backup device.

• Ask yourself: “Can I afford to lose this?”

• It is better to err on the side of caution.

What data need not be routinely backed up?

• Operating systems

• Application software

• Historical data that does not change

How often should the data be backed up?

• Trade-off of risk versus benefit.

• Ask yourself: “How much data can I afford to lose?”

• Daily backups are the most common.

• But circumstances may dictate anything from continuous real-time backups to weekly backups.

What type of backup media should be used?

• Magnetic Disks

• Optical Disks

• Magnetic Tape

• Internet Backup

Types of Backup

• Full

• Incremental

• Differential

Full Backup

• The backup of all files on the drive.

• Takes the longest time to record because every file is copied.

• Takes the shortest time to restore because everything is on a single tape.

[Diagram: Full Backup. One tape for each day, Monday through Sunday.]

Restoration From a Full Backup Requires Only One Tape

[Diagram: the Wednesday tape.]

The Full Backup is:

• A straightforward method of ensuring good backups and quick, easy restorations.

• The starting point for the Incremental and the Differential Backups.

Incremental Backup

• Records only those files that have changed since the last Incremental or Full Backup.

• Takes the shortest time to record.

• Generally takes the longest time to restore.

• Generally requires several tapes to restore.

[Diagram: Incremental Backup. A full backup on Sunday, followed by incremental backups Monday through Saturday.]

Restoration From an Incremental Backup May Require Several Tapes

[Diagram: the Sunday full backup tape plus the Monday, Tuesday, and Wednesday incremental tapes.]

Differential Backup

• Records only those files that have changed since the last Full backup.

• Takes less time to record than a Full backup.

• Takes less time to restore than an Incremental backup.

• The restore process requires only two tapes.
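The three backup types differ only in which files they record. The sketch below illustrates that rule using file modification times as a stand-in for "has changed"; that choice, the function name `files_to_back_up`, and the sample file names and day numbers are assumptions for illustration, and real backup software may track changes differently.

```python
def files_to_back_up(files, kind, last_full, last_backup):
    """Select the files a backup of the given kind would record.

    files       -- dict mapping file name to its last-modified time
    last_full   -- time of the most recent full backup
    last_backup -- time of the most recent backup of any kind
    """
    if kind == "full":
        return sorted(files)               # every file is copied
    if kind == "incremental":
        cutoff = last_backup               # changed since last full or incremental
    elif kind == "differential":
        cutoff = last_full                 # changed since last full backup
    else:
        raise ValueError(f"unknown backup type: {kind}")
    return sorted(name for name, modified in files.items() if modified > cutoff)


# Hypothetical example: times are day numbers, with the full backup on day 0
# and the most recent incremental backup on day 2.
files = {"logo.png": -5, "payroll.xls": 1, "report.doc": 3}
full = files_to_back_up(files, "full", last_full=0, last_backup=2)
incremental = files_to_back_up(files, "incremental", last_full=0, last_backup=2)
differential = files_to_back_up(files, "differential", last_full=0, last_backup=2)
print(full, incremental, differential, sep="\n")
```

The differential set grows each day until the next full backup, which is why it takes longer to record than an incremental backup but needs fewer tapes to restore.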

[Diagram: Differential Backup. A full backup on Sunday, followed by differential backups Monday through Saturday.]

Restoration From a Differential Backup Requires Only Two Tapes

[Diagram: the Sunday full backup tape plus the Wednesday differential tape.]
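The restoration examples for the three schemes can be summarized as a small lookup of which tapes are needed. This is a sketch that assumes the weekly cycle shown in the diagrams (a full backup on Sunday for the incremental and differential schemes, and a full backup every day for the full scheme); the function name `tapes_needed` is hypothetical.

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]


def tapes_needed(scheme, last_backup_day):
    """List the tapes to restore, in order, up to the backup made on last_backup_day."""
    if scheme not in ("full", "incremental", "differential"):
        raise ValueError(f"unknown scheme: {scheme}")
    index = DAYS.index(last_backup_day)
    if scheme == "full" or index == 0:
        # A full backup stands on its own.
        return [last_backup_day]
    if scheme == "incremental":
        # The Sunday full backup plus every incremental tape since then.
        return DAYS[: index + 1]
    # Differential: the Sunday full backup plus the latest differential tape.
    return ["Sunday", last_backup_day]


for scheme in ("full", "incremental", "differential"):
    print(scheme, tapes_needed(scheme, "Wednesday"))
```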

Grandfather - Father - Son (GFS) Tape Rotation Scheme

• Son – The daily backup tapes.

• Father – The full backup tape for the week.

• Grandfather – The full backup tape for the month.

Reuse Tapes

• Son – After one week

• Father – After five weeks

• Grandfather – Save indefinitely at an off-site location.
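The rotation and reuse rules above can be sketched as a function that labels the tape written on a given date. The lesson does not say which day the weekly or monthly full backup runs, so this sketch assumes the weekly full backup is on Sunday and the last Sunday of a month is the monthly full backup; both assumptions, and the function names, are illustrative only.

```python
from datetime import date, timedelta


def gfs_role(day):
    """Classify the tape written on `day` as son, father, or grandfather.

    Assumptions (not from the lesson): the weekly full backup runs on Sunday,
    and the last Sunday of a month is the monthly full backup.
    """
    if day.weekday() != 6:          # Monday is 0, Sunday is 6
        return "son"
    next_sunday = day + timedelta(days=7)
    if next_sunday.month != day.month:
        return "grandfather"
    return "father"


def reuse_date(day):
    """Return the date the tape may be reused, or None if it is kept indefinitely."""
    role = gfs_role(day)
    if role == "son":
        return day + timedelta(weeks=1)
    if role == "father":
        return day + timedelta(weeks=5)
    return None                     # grandfather: stored off-site indefinitely


for sample in (date(2024, 1, 3), date(2024, 1, 7), date(2024, 1, 28)):
    print(sample, gfs_role(sample), reuse_date(sample))
```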

Verify that the Backup Works

• Can you restore from the backup tape?

• Can you still restore from the backup tape if the original tape drive is destroyed?
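One way to answer the questions above is to restore the backup to a scratch location, ideally using a different drive or machine, and compare the result with the original files. The sketch below compares SHA-256 checksums; it assumes both copies are reachable as ordinary directories, and the function names are hypothetical rather than part of any backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original_dir, restored_dir):
    """Return the relative paths that are missing or differ in the restored copy."""
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        restored = restored_dir / relative
        if not restored.is_file() or sha256_of(source) != sha256_of(restored):
            problems.append(str(relative))
    return problems
```

An empty list means every original file was found in the restored copy with identical contents.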

Problems with Tapes

• Problem 1: Tape drive heads are dirty.
  o Solution: Clean the tape heads.

• Problem 2: The tapes become worn with time and use.
  o Solution: Replace the worn tape with a new tape.

Backup Software

• A utility designed to make routine backups as effortless and as effective as possible.

• Suppliers of backup software
  o NOS vendor
  o Third parties

• Identify the purpose and characteristics of fault tolerance.

• Explain how redundancy is used in servers and networks to eliminate single points of failure.

• Identify several techniques used in servers and network systems to increase fault tolerance.

• Define: Fault tolerance, redundancy, RAID, mirror server, and cluster.

• Plan for disaster recovery.

• Develop a disaster recovery plan.

• Implement a disaster recovery plan.

• Document and regularly test the disaster recovery plan.

• Explain standard backup procedures and backup media storage practices.

• Identify types of backups and restoration schemes.

• Confirm and use off-site storage of backups.