How Many Nines? Understanding RPO and RTO Metrics for BC/DR

Mike Robinson Sr. Solution Marketing Manager [email protected] September 7, 2014

© 2014 NetIQ Corporation and its affiliates. All Rights Reserved. 2

Solutions Track 5: How Many Nines? Understanding RPO and RTO Metrics for BC/DR

Virtualization and the cloud offer scale, flexibility and resilience for backup and disaster recovery, which IT can leverage to shrink costs and consolidate infrastructure while maximizing application uptime. How does an understanding of RPOs (recovery point objectives) and RTOs (recovery time objectives) affect IT's ability to set SLAs? How much downtime and data loss separate a “two nines” (99 percent) availability target from a “five nines” (99.999 percent) one? Explore the differences between RPOs and RTOs, determine how to apply them to server workloads, and learn guidelines for selecting the right DR technologies for the right workloads.

Mike Robinson is a senior product marketing manager at NetIQ Corporation.


Agenda

•  The Importance of Disaster Recovery

•  3 Phases of Disaster Recovery

•  Key Terminology

•  Availability Tiers

•  The Disaster Recovery Dichotomy

•  Virtualized Disaster Recovery

•  Matching Technology to Requirements

•  Next Steps

The Importance of Disaster Recovery


Why Disaster Recovery Matters

•  $41.3 Billion: Total economic damage from disasters in 2009 (1)

•  $10.8 Billion: Economic impact felt in the U.S. from disasters in 2009 (1)

•  93%: Businesses that went bankrupt within 1 year after being unable to use their datacenter for 10 consecutive days (2)

•  90%: Proportion of companies that close within 2 years after losing data during a disaster (3)

Sources: (1) Forrester, September 2010: Business Continuity and Disaster Recovery are top IT Priorities; (2) National Archives & Records Administration; (3) 2003 London Chamber of Commerce and Industry Paper


Disaster Recovery Pressure: IT as Competitive Advantage

[Figure: pressures on IT disaster recovery. Labels: Availability; Cost of Downtime; Uptime Expectations; Number of Critical Systems; Employee and Customer Expectations; Backup Windows; Tolerance for Downtime; Failover Windows; Response Time]


More Applications Classified as Critical

“What percentage of your applications and data fall into the following tiers?”

Mission-critical 34%

Business-critical 35%

Noncritical 31%

Source: Forrester/Disaster Recovery Journal November 2010 Global Disaster Recovery Preparedness Online Survey


Disaster Recovery – Top Priority

“Audience polling … shows that of all the data management options, re-architecting backup and recovery was viewed as the top priority….”

Source: Gartner (August, 2011)


Key Terminology


What is a Server Workload?

Server

Data

Applications

Operating System

A workload is the operating system, applications, middleware and data that reside on a physical server or virtual host.


The 3 Phases of Disaster Recovery

Disaster Recovery means:

1.  Backing up (replicating) entire server workloads (the contents of a server, including the operating system, applications and data),

2.  Recovering workloads during an outage, and

3.  Restoring workloads to their original production locations after the outage.


Key Disaster Recovery Concepts

RPO: Recovery Point Objective

–  A measure of maximum acceptable data loss in terms of time (minutes, hours, days).

–  An RPO of 4 hours means that the most recent backup has to be no more than 4 hours old at the time of an outage.
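The 4-hour RPO example can be sketched as a simple age check on the most recent backup. This is a minimal illustration, not part of the original slides: the function name `meets_rpo` and the sample timestamps are assumptions.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, outage_start: datetime, rpo: timedelta) -> bool:
    """Return True if the newest backup is no older than the RPO when the outage starts."""
    if last_backup > outage_start:
        raise ValueError("last_backup must not be later than outage_start")
    return outage_start - last_backup <= rpo


# Hypothetical values: a 4-hour RPO and an outage at noon.
rpo = timedelta(hours=4)
outage = datetime(2014, 9, 7, 12, 0)

print(meets_rpo(datetime(2014, 9, 7, 9, 0), outage, rpo))  # backup is 3 hours old
print(meets_rpo(datetime(2014, 9, 7, 7, 0), outage, rpo))  # backup is 5 hours old
```

The age of the backup at the moment of the outage is the amount of data, measured in time, that would be lost.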


Key Disaster Recovery Concepts

RTO: Recovery Time Objective

–  The target maximum allowable time to recover from an outage.

–  An RTO of 4 hours means systems have to be back up and operational no more than 4 hours after an outage.


Key Disaster Recovery Concepts: RPO and RTO

[Figure: timeline in 3-hour steps, running from 9 pm through 3 pm two calendar days later (42 hours). It marks the tape backup window, the point where the outage begins, servers repaired/replaced, the restore, and service restored. The span between the tape backup and the outage is labeled lost data; the span between the outage and service restored is labeled downtime (recovery time).]

RPO = 24 hours –  Actual recovery point = 21 hours

RTO = 12 hours –  Actual recovery time = 15 hours
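The arithmetic behind these figures can be sketched as follows. The exact event times are not readable from the slide, so the three timestamps below are assumptions chosen only so that the gaps match the stated 21-hour recovery point and 15-hour recovery time.

```python
from datetime import datetime, timedelta

# Assumed timestamps (not from the slide), picked to reproduce its 21 h and 15 h figures.
last_backup_done = datetime(2014, 9, 8, 0, 0)    # 12 am
outage_begins = datetime(2014, 9, 8, 21, 0)      # 9 pm the same day
service_restored = datetime(2014, 9, 9, 12, 0)   # 12 pm the next day

# Objectives as stated on the slide.
rpo = timedelta(hours=24)
rto = timedelta(hours=12)


def hours(delta: timedelta) -> float:
    """Express a timedelta in hours."""
    return delta.total_seconds() / 3600


# Lost data is everything written between the last backup and the outage.
actual_recovery_point = outage_begins - last_backup_done
# Downtime runs from the start of the outage until service is restored.
actual_recovery_time = service_restored - outage_begins

rpo_met = actual_recovery_point <= rpo
rto_met = actual_recovery_time <= rto

print(f"Recovery point: {hours(actual_recovery_point):.0f} h, RPO met: {rpo_met}")
print(f"Recovery time:  {hours(actual_recovery_time):.0f} h, RTO met: {rto_met}")
```

Under these assumptions the RPO is met with 3 hours to spare, while the RTO is missed by 3 hours.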


Key Disaster Recovery Concepts

Availability tiers: 99.9%, Five 9’s, etc.

–  Groupings of server workloads by uptime requirements or SLAs

–  Different availability requirements have different costs and use different technologies


Disaster Recovery Solutions

Availability %            Downtime per Year    Downtime per Month

90      “one nine”        36.5 days            72 hours

95                        18.25 days           36 hours

99      “two nines”       3.65 days            7.2 hours

99.9    “three nines”     8.76 hours           43.2 minutes

99.99   “four nines”      52.56 minutes        4.32 minutes

99.999  “five nines”      5.26 minutes         25.9 seconds

Typical RTO/RPO, from the lowest to the highest availability tiers: 12 – 24 hours; 15 – 60 minutes; <5 minutes
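The downtime figures follow directly from the availability percentage. The sketch below assumes a 365-day year (8,760 hours) and a 30-day month (720 hours); the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24   # assumed 365-day year
HOURS_PER_MONTH = 30 * 24   # assumed 30-day month


def allowed_downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Hours of downtime permitted in a period at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return (100 - availability_percent) / 100 * period_hours


for pct in (90, 95, 99, 99.9, 99.99, 99.999):
    per_year = allowed_downtime_hours(pct, HOURS_PER_YEAR)
    per_month = allowed_downtime_hours(pct, HOURS_PER_MONTH)
    print(f"{pct:>7}%  {per_year:10.4f} h/year  {per_month:9.4f} h/month")
```

With a 30-day month, three nines allows 0.72 hours (43.2 minutes) per month; dividing the yearly 8.76 hours by 12 instead gives 43.8 minutes, so the month length used should be stated alongside any SLA.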


The Disaster Recovery Dichotomy


Disaster Recovery Budgets Rising Slowly

“Approximately what percentage of your combined IT operating and capital budget will go to business continuity and disaster recovery?”

•  Q2 2010 (N = 566): 5.27%

•  Q2 2011 (N = 476): 5.38%

•  Q2 2012 (N = 471): 6.15%

Base: IT decision-makers from US organizations with more than 500 employees

Source: Forrsights Budgets And Priorities Tracker Survey, Q2 2010, Q2 2011, Q2 2012


Disaster Recovery SLAs

“Audience polling shows that 87% of the enterprises surveyed have RTO for their most mission-critical applications/services as four hours or less….”

[Figure: RTOs of Mission-Critical Applications (n=93)]

Source: Gartner (August, 2011)


Mission-Critical Recovery Objectives

“For mission-critical … systems in your organization, what are/were the recovery objectives today and three years ago?”

[Figure: chart of recovery point objectives (RPO) and recovery time objectives (RTO), each shown for “Three years ago” and “Today”. Legend: Less than 15 minutes; Less than 1 hour; Less than 4 hours; Less than 24 hours; Less than 72 hours. Data labels in extraction order: 25%, 39%, 14%, 31%, 24%, 20%, 27%, 27%, 25%, 14%, 25%, 14%, 12%, 6%, 14%, 6%]

Base: 51 disaster recovery decision-makers at enterprises with more than 500 employees

Source: A commissioned study conducted by Forrester Consulting on behalf of NetIQ, December 2012


Business-Critical App Recovery Objectives

“For mission-critical and business-critical systems in your organization, what are/were the recovery objectives today and three years ago?”

[Figure: chart for business-critical systems of recovery point objectives (RPO) and recovery time objectives (RTO), each shown for “Three years ago” and “Today”. Legend: Less than 15 minutes; Less than 1 hour; Less than 4 hours; Less than 24 hours; Less than 72 hours. Data labels in extraction order: 25%, 31%, 18%, 27%, 18%, 29%, 25%, 37%, 37%, 16%, 37%, 14%, 8%, 4%, 6%, 4%]

Base: 51 disaster recovery decision-makers at enterprises with more than 500 employees

Source: A commissioned study conducted by Forrester Consulting on behalf of NetIQ, December 2012


Most Organizations Want to Improve Recovery Objectives

“What are your plans for improving your current recovery objectives?”

•  41%: We have plans to improve within the next 6 months

•  33%: We have plans to improve in the next 6 to 12 months

•  16%: We would like to, but we have no plans currently

•  10%: We are happy with the current objectives

Base: 51 disaster recovery decision-makers at enterprises with more than 500 employees

Source: A commissioned study conducted by Forrester Consulting on behalf of NetIQ, December 2012


The Need for Better Protection

•  More workloads are considered to be business critical and need better protection

–  “Historically, the proportion of an organization's applications that it deems mission-critical has been between 10% and 20% … of the audience, 60% has more than 20% of their applications/services categorized at the highest level of criticality” Gartner (August 2011)

•  IT is under pressure to stretch their budgets to accommodate the business needs

–  “Best practice points to spending more money on the 20% that is mission-critical and less on the 80% that isn't, to reduce or eliminate the impact of an outage on the business” Gartner (August 2011)

•  Traditional disaster recovery approaches are either too expensive or too inadequate in terms of protection…


Mirroring is Used for Critical Apps

“How do you copy data between your primary and recovery site(s)?” (Select all that apply)

Method                                            Mission critical   Business critical   Non-critical

Synchronous replication                           35%                22%                 5%

Asynchronous replication                          41%                28%                 8%

Periodic point in time copies                     22%                29%                 24%

Remote backup over the wide area network          17%                18%                 17%

Backup locally to tape and transport our tapes    39%                46%                 59%

Base: 136 disaster recovery decision-makers at enterprises with more than 500 employees

Source: Forrester/Disaster Recovery Journal November 2010 Global Disaster Recovery Preparedness Online Survey


Synchronous Replication Cost Issues

•  Building & Running a Secondary Datacenter

–  Secondary location outside of “disaster” zone

–  Out of state, country, even continent

•  Additional Equipment & Software Licenses

–  Duplicate hardware and software licenses

–  Significant expense for rarely or under-utilized servers

•  Networking

–  Bandwidth costs between sites can be significant

–  Varies with amount of data and RPO tolerances

•  Staff

–  Training, testing, additional specialized staff, etc.
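How bandwidth needs vary with the amount of data and the RPO can be shown with a back-of-envelope calculation. The model here is an assumption, not something the slides specify: a batch of changed data must be shipped to the recovery site within one replication window, the window is set by the RPO, and protocol overhead and compression are ignored. The function name and sample figures are illustrative.

```python
def required_mbps(changed_gb: float, window_minutes: float) -> float:
    """Sustained bandwidth (megabits per second) needed to ship changed_gb within the window.

    Uses decimal units: 1 GB = 1e9 bytes, 1 Mbit = 1e6 bits.
    """
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    bits_to_send = changed_gb * 1e9 * 8
    return bits_to_send / (window_minutes * 60) / 1e6


# Hypothetical workload: a 6 GB burst of changes that must reach the recovery site.
for window in (240, 60, 15):
    print(f"{window:>3}-minute window: {required_mbps(6, window):.1f} Mbps")
```

Shrinking the window from 60 minutes to 15 minutes quadruples the link speed needed for the same burst, which is one reason tighter RPOs cost more.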


Mirroring is Used for Critical Apps; Tape is for Non-Critical Applications

“How do you copy data between your primary and recovery site(s)?” (Select all that apply)

Method                                            Mission critical   Business critical   Non-critical

Synchronous replication                           35%                22%                 5%

Asynchronous replication                          41%                28%                 8%

Periodic point in time copies                     22%                29%                 24%

Remote backup over the wide area network          17%                18%                 17%

Backup locally to tape and transport our tapes    39%                46%                 59%

Base: 136 disaster recovery decision-makers at enterprises with more than 500 employees

Source: Forrester/Disaster Recovery Journal November 2010 Global Disaster Recovery Preparedness Online Survey


Virtualized Disaster Recovery


Disaster Recovery Solutions

Availability %            Downtime per Year    Downtime per Month

90      “one nine”        36.5 days            72 hours

95                        18.25 days           36 hours

99      “two nines”       3.65 days            7.2 hours

99.9    “three nines”     8.76 hours           43.2 minutes

99.99   “four nines”      52.56 minutes        4.32 minutes

99.999  “five nines”      5.26 minutes         25.9 seconds

Typical RTO/RPO, from the lowest to the highest availability tiers: 12 – 24 hours; 15 – 60 minutes; <5 minutes

[The slide also labels Cost and Solution rows; their contents were graphical and are not reproduced here.]


Disaster Recovery Using Tape

Tape headaches

•  Physical media needs to be managed

•  Sequential access to data

Data only view of the world

•  Data is backed up but system is not

•  How do you effectively recover the entire server?

Slow restore

•  Data only view makes restore a long process

•  Rebuild server, install OS, install application, recover data

Slow testing

•  Same painful restore process

•  Laborious testing leads to no testing

Recovery Risk – How well protected are you?


High Availability in the Physical World

Wide Area Network


Disaster Recovery with Virtualization

Wide Area Network


Benefits of Virtualized DR

•  Heterogeneous workload protection

–  Protect Windows and Linux workloads, both physical and virtual, with the same system

•  Rapid failover and failback

–  Warm standby VMs provide very fast recovery

–  Restore to bare metal, repaired server or virtual platform

–  Use hypervisor-based snapshots for point-in-time recovery

•  Safe sandbox testing

–  Virtual workloads can be tested at any time without affecting the production environment

•  Simplified licensing

–  No additional OS or app licenses required on recovery servers


Taking Disaster Recovery to the Cloud


Cloud-Based DR Delivery

Do-It-Yourself: Configure & manage your own solution using public cloud resources

DR-as-a-Service: Prepackaged pay-as-you-go recovery services to the cloud with specified RPO & RTO SLAs

Cloud-to-Cloud DR: Failover from one cloud environment to another


Traditional Disaster Recovery

[Figure: options plotted by RPO/RTO (best to worst) against cost ($ to $$$$)]

•  Offsite Replication: Expensive; requires a secondary site, redundant hardware (which is idle / under-utilized most of the time)

•  Local Replication: Only good for individual server failure. No protection against site failures.

•  Vaulting (tape, imaging): Recovery can take days or weeks. Difficult to test.


Storage as a Service

Advantages

•  Fixed per-gigabyte cost

•  Off-site cloud-based storage

•  Scale up or down on demand

•  Service provider handles hardware maintenance, backups

Disadvantages

•  Data only, not workloads

•  Static storage can’t run server workloads

•  If a local outage occurs, data needs to be copied to recovery environment first


Recovery as a Service: Storage as a Service + IaaS

Advantages

•  Fixed per-gigabyte cost

•  Off-site cloud-based storage

•  Scale up or down on demand

•  Service provider handles hardware maintenance, backups

More Advantages

•  Protect whole workloads, not just data

•  Replicate to the cloud, recover and run in the cloud

•  Live restore back to repaired data center


RaaS vs. Traditional Disaster Recovery

[Figure: RaaS, Offsite Protection, Local Protection and Vaulting plotted by RPO/RTO (best to worst) against cost ($ to $$$$)]


Private/Hybrid RaaS

•  Dedicated backup hardware at service provider premise

•  Scale by adding hardware

•  Hardware owned, managed, maintained by customer or service provider

•  Replicate workloads directly to offsite facility

•  Run recovery workloads in dedicated environment


Public RaaS

•  Shared backup hardware at service provider premise

•  Scale using service provider’s resource pool

•  Hardware owned, managed, maintained by service provider

•  Replicate workloads directly to offsite facility

•  Run recovery workloads in shared environment


Next Steps…

•  Determine your tolerance for downtime & data loss

•  Establish DR metrics

–  Categorize server workloads into tiers by RPO, RTO

•  Match organizational needs to DR technologies

–  You can and will use multiple technologies

–  Balance budget with needs
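Categorizing workloads into tiers by RPO and RTO can be sketched as below. Everything specific here is an assumption for illustration: the workload names and objectives are hypothetical, the tier boundaries (5 minutes and 60 minutes) are borrowed from the "Typical RTO/RPO" ranges shown earlier, and tiering by the stricter of the two objectives is one possible policy among several.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    rpo_minutes: float  # maximum acceptable data loss
    rto_minutes: float  # maximum acceptable time to recover


def tier_for(workload: Workload) -> int:
    """Assign a tier from the stricter of the two objectives (assumed policy)."""
    strictest = min(workload.rpo_minutes, workload.rto_minutes)
    if strictest <= 5:
        return 1  # near-continuous protection
    if strictest <= 60:
        return 2  # recovery within the hour
    return 3      # recovery within hours or days


# Hypothetical inventory.
inventory = [
    Workload("order-database", rpo_minutes=1, rto_minutes=5),
    Workload("intranet-portal", rpo_minutes=30, rto_minutes=240),
    Workload("file-archive", rpo_minutes=1440, rto_minutes=1440),
]

tiers = {w.name: tier_for(w) for w in inventory}
for name, tier in tiers.items():
    print(f"{name}: tier {tier}")
```

Once each workload has a tier, a DR technology and budget can be chosen per tier rather than applying the most expensive option everywhere.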


Mike Robinson Sr. Product Marketing Manager [email protected]


+1 713.548.1700 (Worldwide) 888.323.6768 (Toll-free) [email protected] NetIQ.com

Worldwide Headquarters 515 Post Oak Blvd., Suite 1200 Houston, TX 77027 USA http://community.netiq.com
