Tiered Disaster Recovery

DR SLAs That Work

Bill Peldzus
Vice President, Data Center and Business Continuity/Disaster Recovery Services
GlassHouse Technologies

It’s all about Risk Avoidance

Audience Response

Today, at your company, DR is …

1. A top priority, with full management support and budget, dedicated staff and detailed testing.
2. A “best effort” endeavor, with ad hoc testing, some management support and a limited budget.
3. We ship some of our backup tapes offsite and hope for the best.
4. What’s DR?

Some Recent DR Events

Unfortunately, it’s now a question of when, not if, some type of disaster will hit

• Terrorism
  • 9/11 and ongoing threats
• Internal mistakes
  • Newbie DBA meant backup, not delete!
• Employee sabotage
  • Former sys admin planted a logic bomb in financial sector
• Blackouts
  • Rolling blackouts have been the norm the past few summers
• Pandemics
  • Avian flu threat
• Geographic
  • Hurricanes, blizzards, tornados, earthquakes…

Agenda

• Today, we will be discussing…
• Business Continuance definition
• RTOs, RPOs and why they are important
• Classifying applications and systems into tiers
  • Determining how many classifications and service levels you should consider
• Documenting your tiers and service level agreements
• Getting buy-in from management and business units
• Data Protection Options to meet SLAs
• Best practices and experiences in real BC/DR testing

Business Continuity

• Business continuance is the umbrella for many concepts including:
  • Fault tolerance
  • High availability
  • Backup/restore
  • Redundancy
  • DR

• For this presentation, the focus is on DR and SLAs, with a business continuity theme

Key Recovery Metrics: RTO/RPO

RTO - Recovery Time Objective

Maximum amount of time to recover from a disruption and be up and running again

“How long before we are operational again?”

RPO - Recovery Point Objective

The maximum amount of data loss (measured in time) acceptable in the event of a disruption

“How much data can we lose?”
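To make the two metrics concrete, here is a minimal sketch that measures one incident against them. The function names and the incident timestamps are hypothetical and are not from the presentation; the sketch only encodes the two definitions above (data loss is measured back to the last good copy, downtime is measured forward to restored service).

```python
from datetime import datetime, timedelta

def recovery_metrics(last_good_copy, disruption, service_restored):
    """Return (data loss, downtime) for a single incident."""
    data_loss = disruption - last_good_copy    # compare against the RPO
    downtime = service_restored - disruption   # compare against the RTO
    return data_loss, downtime

def meets_objectives(data_loss, downtime, rpo, rto):
    """True only if both the RPO and the RTO were met."""
    return data_loss <= rpo and downtime <= rto

# Hypothetical incident, for illustration only.
last_backup = datetime(2024, 1, 10, 2, 0)
outage = datetime(2024, 1, 10, 14, 30)
restored = datetime(2024, 1, 11, 9, 0)

loss, down = recovery_metrics(last_backup, outage, restored)
print("data loss:", loss, "downtime:", down)
print("meets 24h/24h:", meets_objectives(loss, down, timedelta(hours=24), timedelta(hours=24)))
print("meets 4h/4h:", meets_objectives(loss, down, timedelta(hours=4), timedelta(hours=4)))
```

The same incident passes a 24-hour objective and fails a 4-hour one, which is why the objectives have to be agreed before the technology is chosen.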

RTO/RPO: Recovery Types

• Operational
  • Daily recovery within the primary site
  • Recover from
    • Accidental deletion of a file
    • Corrupted database, virus, etc.
• Disaster
  • Recovery from a catastrophic event to a remote location
  • Recover from
    • Geographic
    • Terrorist
    • Environmental

Defining RTO/RPO Classes

Class 1: RTO & RPO 4 hours
  Approach: Replication; hot standby (dedicated); in-sourced or outsourced facilities
  Relative cost: $$$$$

Class 2: RTO = 24 hours, RPO = 24 hours
  Approach: Replication and/or off-site DR backups; in-house or outsourced
  Relative cost: $$$$

Class 3: RTO = 72 hours, RPO = 72 hours
  Approach: Available hardware; standard recovery from tape; outsourced/contracted usually more cost-effective (hot site or mobile)
  Relative cost: $$$

Class 4: RTO = 5 days, RPO = 1 week
  Approach: Quick-ship program most cost-effective; location TBD; standard recovery from tape
  Relative cost: $$

Recovery Tiers

Picking the Cost-Effective Solution to Recovery…
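As a rough illustration of picking the cost-effective class, the sketch below maps an application's tolerance for downtime and data loss to the least expensive class that still satisfies it. The class values are the ones listed on the slide (Class 1 is read as 4 hours for both metrics, which is an assumption); the function name `cheapest_class` is hypothetical.

```python
from datetime import timedelta

# (class number, RTO, RPO), most expensive class first.
RECOVERY_CLASSES = [
    (1, timedelta(hours=4), timedelta(hours=4)),
    (2, timedelta(hours=24), timedelta(hours=24)),
    (3, timedelta(hours=72), timedelta(hours=72)),
    (4, timedelta(days=5), timedelta(weeks=1)),
]

def cheapest_class(required_rto, required_rpo):
    """Return the highest-numbered (least costly) class that meets both
    requirements, or None if even Class 1 is not fast enough."""
    for number, rto, rpo in reversed(RECOVERY_CLASSES):
        if rto <= required_rto and rpo <= required_rpo:
            return number
    return None

# An application that tolerates 48 hours of downtime and 24 hours of data loss.
print(cheapest_class(timedelta(hours=48), timedelta(hours=24)))
```

An application that returns `None` needs something beyond the four classes shown, which is one argument for the more granular matrix on the next slide.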

DR SLAs: A More Complex Approach

[Figure: recovery class matrix. Recovery classes 1 to 6 with maximum downtime of near 0, 4 hrs, 24 hrs, 72 hrs, 720 hrs and best efforts. Data loss options per class are drawn from 0 mins, 15 mins, 24 hrs and 48 hrs. Other labels: System recovery speed, Instant, Not defined, Risk Management, Capability, Complete, Negligible.]

RTO/RPO: Consider a “Data” Metric for IT

• Understand the difference between external, service-level RTO/RPOs and what your IT teams are mapping to

• Just restoring the data and having it available does not mean the application is up and running

• DBAs, application owners and others need to “do their thing” to ensure the application is up, running correctly with the right data and available to the end users
  • That takes time

• RTO-Data and RPO-Data help SLAs meet reality

Example: Tape-based DR Recovery

[Timeline figure. Labels: Disaster Declared; Typical SLA (4 hrs); Equipment Provisioning (8-24 hrs); Tapes Retrieved (24-48 hrs); Data Recovery Begins; RTO Data = 3 days (data is recovered on working systems); Application Team Recovers Business Applications; RTO = 6 days]

Tape Retrieval:
- Identify Tapes
- Request Retrieval
- Transport Tapes
- Inventory Tapes

Equipment Provisioning:
- Identify Required Configs
- Procure hardware
- Provision

Data Recovery:
- Recover Tape Backup System
- Load Tapes
- Perform Restore

Recovery of Business Capability:
- Recover Databases
- Ensure Service Consistency
- Acceptance testing

RTO = 6 days
RTO (data) = 3 days
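The gap between RTO (data) and the full RTO is simple addition over the recovery phases. The sketch below shows the arithmetic. The per-phase hours are illustrative assumptions chosen so the totals match the 3-day and 6-day figures on the slide; they are not measurements from the presentation, and the phases are assumed to run strictly one after another.

```python
# (phase name, duration in hours). Durations are assumed for illustration.
PHASES = [
    ("Tape retrieval", 36),
    ("Equipment provisioning", 16),
    ("Data recovery", 20),
    ("Recovery of business capability", 72),
]

def cumulative_hours(phases):
    """Return [(phase name, elapsed hours when that phase completes)]."""
    elapsed = 0
    timeline = []
    for name, hours in phases:
        elapsed += hours
        timeline.append((name, elapsed))
    return timeline

timeline = cumulative_hours(PHASES)
rto_data_days = timeline[2][1] / 24   # data is back on working systems
rto_days = timeline[3][1] / 24        # application is back with end users

for name, elapsed in timeline:
    print(f"{name}: done at {elapsed} h")
print("RTO (data) in days:", rto_data_days)
print("RTO in days:", rto_days)
```

Whatever the real durations are, the point carries over: an SLA that only counts the restore step understates the time until the business is actually running.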

Key Concept: SLAs versus Reality

So when do the apps start getting restored?

Example OR/DR Storage SLA Matrix

Getting Buy-In from Management

• Where’s your BIA (business impact analysis)?
  • Looks at the Business, not IT
  • Directly correlates applications to RTOs and RPOs (recovery classes)
• Compare the investment in DR today to the cost of being unable to recover
  • Lost Customers
  • Corporate Reputation
  • Hit on your Brand

DR Data Protection Options

How to Meet Those SLAs from a Technology Perspective

Operational Recovery: The Old World

[Diagram residue. Labels: Backup, Disk, Tape, Standard Disk, Intelligent Disk, NAS, SAN, Traditional filer, Grid-based NAS, VTL, Standalone, Integrated, Internal media server, VTC]

Operational Recovery: The New World

Some new technologies

• CDP (Continuous data protection)
  • Back up files or blocks every single time they change
  • Also Near-CDP with Snapshots and replication
  • Emerging in Replication & DR
    • More on this later
• Data de-duplication backup
  • Eliminate redundant blocks wherever we can
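To illustrate what “eliminate redundant blocks” means, here is a minimal sketch of fixed-size block de-duplication. It is a toy under stated assumptions, not how any particular backup product works: the 4096-byte block size, the use of SHA-256 as the block fingerprint, and the names `dedup_store` and `restore` are all choices made for this example.

```python
import hashlib

def dedup_store(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep each unique block once."""
    store = {}    # fingerprint -> block contents
    recipe = []   # ordered fingerprints needed to rebuild the data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).hexdigest()
        store.setdefault(fingerprint, block)   # stored only the first time
        recipe.append(fingerprint)
    return store, recipe

def restore(store, recipe):
    """Rebuild the original data from the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)

# Four blocks of input, three of which are identical.
data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
store, recipe = dedup_store(data)
print("blocks referenced:", len(recipe), "blocks stored:", len(store))
print("restore matches:", restore(store, recipe) == data)
```

Real products typically add variable-size chunking and handle fingerprint collisions; both are left out here to keep the idea visible.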

Snapshots & Mirrors

• Snapshots
  • Usually rely on primary storage/software
  • Known as copy-on-write; instantaneous
  • Great for protecting against logical corruption
  • Cannot protect against hardware failure
• Mirrors
  • Usually rely on primary storage/software
  • Full copy of data; must disassociate for protection
  • Takes more space than Snapshots
  • Cannot protect against hardware failure
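The copy-on-write behavior behind snapshots can be sketched in a few lines. The `Volume` and `Snapshot` classes below are hypothetical teaching objects, not a vendor API. Taking the snapshot copies nothing, which is why it is instantaneous; an old block is preserved only at the moment it is about to be overwritten.

```python
class Snapshot:
    def __init__(self, volume):
        self.volume = volume
        self.saved = {}   # block index -> contents at snapshot time

    def read(self, index):
        # Changed blocks come from the saved copies; unchanged blocks are
        # still read from the primary volume itself.
        return self.saved.get(index, self.volume.blocks[index])


class Volume:
    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self):
        snap = Snapshot(self)   # no data is copied here
        self.snapshots.append(snap)
        return snap

    def write(self, index, data):
        # Copy-on-write: keep the old block for every snapshot that has not
        # already saved it, then overwrite the live block.
        for snap in self.snapshots:
            snap.saved.setdefault(index, self.blocks[index])
        self.blocks[index] = data


vol = Volume(["a", "b", "c"])
snap = vol.snapshot()
vol.write(1, "CORRUPT")   # logical corruption after the snapshot

print("live volume:", vol.blocks)
print("snapshot view:", [snap.read(i) for i in range(3)])
```

The snapshot still returns the pre-corruption data, but unchanged blocks are read from the primary volume, so losing that hardware loses the snapshot too. That matches both bullets above: good against logical corruption, no protection against hardware failure.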

Data Replication: The Core Foundation of DR

• A majority of the clients I’ve worked with use some type of data replication for DR due to more aggressive RTOs/RPOs for mission-critical applications

• Let’s explore replication in more detail

Replication: Where Do You Start?

[Slide graphic listing the options: Array-based, Host-based, Network-based, Fabric-based, In-band, Out-of-band, Synchronous, Asynchronous, Mirrors, Point-in-time copies]

Sync. versus Async. Replication

• Synchronous
  • Each write is performed twice (once at each site) before the acknowledgement is sent back to the primary server
  • This double write and the wait for dual confirmation introduce latency into application response time
  • The impact on application response time depends on distance
  • Primarily used either within an array, within a single site or between two sites in close proximity
• Asynchronous
  • The write is acknowledged to the primary host as soon as it is received by the source storage
  • After the acknowledgement, the write I/O is replicated to the target site with minimal performance impact on the host at the source site
  • Compared to synchronous mode, asynchronous mode does not ensure that the data at the source and target sites are identical
  • Primarily used when there are long distances between sites
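A back-of-the-envelope calculation shows why distance drives the choice. The sketch below assumes light travels through fiber at roughly 200 km per millisecond and counts propagation delay only; equipment, protocol overhead and the remote array's own write time are ignored, so real figures will be higher. The function names and the 0.5 ms local write time are assumptions for illustration.

```python
FIBER_KM_PER_MS = 200.0   # approximate speed of light in optical fiber

def round_trip_ms(distance_km):
    """Propagation delay to the remote site and back."""
    return 2 * distance_km / FIBER_KM_PER_MS

def write_latency_ms(local_write_ms, distance_km, synchronous):
    """Latency the application sees for one write."""
    if synchronous:
        # The host waits for the remote acknowledgement as well.
        return local_write_ms + round_trip_ms(distance_km)
    # Asynchronous: acknowledged once the source storage has the write.
    return local_write_ms

for km in (10, 100, 2000):
    sync = write_latency_ms(0.5, km, synchronous=True)
    async_ = write_latency_ms(0.5, km, synchronous=False)
    print(f"{km} km: sync {sync} ms, async {async_} ms")
```

At metro distances the added delay is small next to the write itself; at 2,000 km every synchronous write pays tens of milliseconds, which is the usual reason long-haul links run asynchronously.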

“Rolling” Disasters

• An issue in real-time replication schemes
  • What happens if the disaster event corrupts my primary data?
    • Via replication, it would then corrupt my remote, replicated data as well
• Requires a protected, consistent and segregated copy of the production data at the DR site
• Number of copies is specific to the RPO
• Hopefully, these copies are not needed
• If no corruption, use the replicated copy!
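The link between the number of protected copies and the RPO can be sketched as a selection problem: after corruption, recover from the newest point-in-time copy taken before the corruption began. The function name and the six-hour copy schedule below are hypothetical.

```python
from datetime import datetime

def latest_clean_copy(copy_times, corruption_time):
    """Return the newest copy taken before the corruption, or None."""
    clean = [t for t in copy_times if t < corruption_time]
    return max(clean) if clean else None

# Hypothetical schedule: a protected copy every six hours at the DR site.
copies = [datetime(2024, 3, 1, hour) for hour in (0, 6, 12, 18)]
corruption = datetime(2024, 3, 1, 13, 15)

chosen = latest_clean_copy(copies, corruption)
print("recover from:", chosen)
print("data lost:", corruption - chosen)
```

With copies six hours apart, the worst case is just under six hours of lost data, so a tighter RPO means more frequent copies and more of them to keep.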

Ensuring Application Recovery: Consistency Groups

• Where does it reside?
  • Host
    • Allows for heterogeneous storage
    • Database-only solutions as well
  • Fabric/Appliance
    • Also allows for heterogeneous storage
  • Array
    • Storage array and software provided by single vendor
    • e.g., EMC SRDF, HDS TrueCopy, IBM PPRC-XD
    • There is an array-based data mover that supports heterogeneous storage

Note:
• A NAS appliance can be considered “host + array”
• A NAS “head” can be considered “host”

Replication Options

Host-based Items Of Note

• Consumes host resources

• Can affect production application performance

• Issues with consistency groups

• OS dependent

• Windows, Solaris, HP-UX, Linux, AIX

• May require additional host software

• More complexity in setting up mirrors and snapshots in the remote site for rolling disaster protection

• As your replication suite of applications grows, so does the management complexity

Fabric/Appliance-based

• Has also been coined “virtualization”
  • But not server virtualization; we’ll cover that later

• Intercepts or directs I/O and makes the back-end transparent

• Can be used for replication but also for other activities, such as data pooling, consolidation and migration

• Three basic approaches

• In-band (fabric)

• In-band (array)

• Out-of-band

• Each vendor has its own unique virtualization and replication approach -- therefore the examples are generic

• Lots of new products announced recently in this space

• Some overlapping terminology as well

Three Approaches

[Diagram: host zone and storage zone layouts for the three approaches: in-band fabric, in-band array and out-of-band fabric]

Fabric/Appliance-based Items of Note

• Introduces additional resources between the servers and the storage in the production environment

• May require specialized drivers on the production hosts

• Issues with consistency groups similar to host-based

• New product offerings -- smaller install base today

• Scaling could be an issue (I/O spread over busses)

• Requires high-availability architectures

• Approaches vary greatly

• Terminology and pros/cons confusing

• I sometimes can argue for or against all three approaches!

Array-based Replication

[Diagram: frame-based, block-level replication for DR. Labels: Workstations, LAN, SAN, WAN, CE, Production, Production Replication, Development, Test; volumes A, B, C, D with copies A¹, B¹, D¹, A², B²]

Array-based Items Of Note

• Requires same/similar storage arrays at each site
  • Can you replicate your old array to the newest model?
• Requires specialized software from the storage array vendor
• May require special configurations on the array
  • May limit array options (e.g., RAID, disk sizes)
• Usually requires additional cache
• Snapshots vs. mirrors for protection from rolling disasters can affect total cost

Other Replication Notes

• Not everything in your most mission-critical application “package” needs to be replicated
  • Look into update opportunities
• “Reverse Replication” is not just turning a switch
  • In fact, it often requires a total re-copy and/or re-synchronization effort
  • And can take days, not hours

The “Virtual Server” DR Plan

• Tiered model offers different recovery times for different network services
• ‘Business Critical’ applications recovered first
• Lower tiered applications use different recovery methods
• Allows for ‘expensive’ DR options to be used on only critical systems
• Less critical systems can be recovered faster than on physical machines and with lower ongoing cost
• Cost of recovery for an application can be weighed against the needs of the business

BC/DR Testing

• Efficient and effective DR testing is probably one of the most overlooked (or cut) line items in today’s IT budgets

• You need both process and technology to be successful

Validate Your BC/DR Plan

• The non-technical can be a show-stopper
• Who can actually declare a disaster?
  • Especially when primary site is still up
• Who is the primary contact?
  • Person or role?
• New methods of communications
  • Corporate email
  • VoIP
  • Cellular
• Travel issues
• Staff availability
• How/what to test after initial recovery

Testing the Plan!

• Do I have the data?
  • Application consistency groups
  • Point in time, complete and usable
    • One of the primary issues uncovered
  • What applications are interdependent on other applications?
  • Matched to the RPO
• The network is key
  • TCP/IP addresses, DNS, DHCP, routing end users, etc.
• The database is key, too
  • DBAs have granular recovery options as well, including database rollbacks via recovery and archive logs
• Don’t forget tracking that all-important RTO

Testing the Plan! (continued)

• What about rolling disasters?
  • When replicating, what if the recovery data is corrupted?
• Consider using “labs”
  • Either yours or outsourced
  • But don’t eliminate testing at your actual recovery site as well
• Challenge your recovery methodology
  • Remove a key player or key team to validate documentation
• Does your production change control ensure DR is also considered and included?

Other Testing Considerations

• When to test
  • Who’s minding the shop?
• How to test
  • Protecting the real, production data
• Justifying the cost
  • Directly relates to RTOs and RPOs
  • Compare the cost to test to your total investment in DR today plus the cost of the inability to recover due to lack of testing
  • There are some viable options to reduce the cost of testing
  • Consider regulatory requirements and associated penalties
• Where’s your plan?
  • It better not be in your primary data center!

Summary: Tiered DR

• Business drivers are first and foremost
• Technology is second
• Testing your plan is arguably more important than just having a plan
  • False sense of security
• Benchmark SLAs against reality
  • Provide Recovery Classes to your Application Owners
• Get outside help from an independent expert with real experience
• Stay flexible: your plan and tactics will change over time

Thank you! Questions?

Bill Peldzus