© 2014 VMware Inc. All rights reserved.

Implementing a Holistic BC/DR Strategy with VMware

James O’Mahony, Technical Support Engineer

Klaus Kremser, Manager Systems Engineering

What’s on the agenda?

• Defining the problem

• Definitions

• VMware technologies that provide BC and DR

– vSphere HA and App HA

– vSphere FT

– vSphere Data Protection / Advanced

– vCenter Availability

– vCenter Infrastructure Navigator (VIN)

– vCloud Hybrid Service Disaster Recovery

– vSphere Replication

– vCenter Site Recovery Manager (SRM)

• Find out more

IT Business Continuity

Is It a Real Problem?

What’s the Difference?

• Disaster Avoidance

• Disaster Recovery

• Planned vs. Unplanned

Disaster Recovery vs. Business Continuity

Example: Tuesday, August 23, 2011 at 1:51 PM EDT - Magnitude 5.8 earthquake near Mineral, Virginia

Disaster recovery required?

No

Interruption to business?

YES!

Fault Tolerance vs. High Availability

• Fault tolerance

– Ability to recover from component loss

– Example: Hard drive failure

• High availability

Uptime percentage in one year | Downtime in one year
99% | 3.65 days
99.9% | 8.76 hours
99.99% | 52 minutes
99.999% (“five nines”) | 5 minutes
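The downtime figures in the availability table follow from simple arithmetic. A minimal sketch, assuming a 365-day year; the function name is illustrative and not part of any VMware tool:

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime, in hours, that an uptime percentage allows."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    for pct in (99, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% uptime: {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why the table drops from days to hours to minutes.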

RTO, RPO, and MTD

• Recovery Time Objective (RTO)

– How long it should take to recover

• Recovery Point Objective (RPO)

– Amount of data loss that can be incurred

• Maximum Tolerable Downtime (MTD)

– Downtime that can occur before significant loss is incurred

– Examples: Financial, reputation
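To make RTO and RPO concrete, the sketch below checks one hypothetical incident against a pair of objectives. The class, function and timestamps are invented for illustration; only the definitions of RTO and RPO come from the slide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    rto: timedelta  # how long recovery should take
    rpo: timedelta  # how much data loss can be incurred, expressed as time


def evaluate_recovery(last_good_copy: datetime, failure: datetime,
                      service_restored: datetime, objectives: Objectives) -> dict:
    """Compare one incident (hypothetical timestamps) against RTO and RPO."""
    data_loss_window = failure - last_good_copy  # changes made after the last copy are lost
    downtime = service_restored - failure        # time the service was unavailable
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }


if __name__ == "__main__":
    result = evaluate_recovery(
        last_good_copy=datetime(2014, 1, 1, 13, 40),
        failure=datetime(2014, 1, 1, 13, 51),
        service_restored=datetime(2014, 1, 1, 14, 30),
        objectives=Objectives(rto=timedelta(hours=1), rpo=timedelta(minutes=15)),
    )
    print(result)
```

MTD is not modeled here; it would be a third, business-defined threshold that the downtime must never exceed.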

Making an Application Service Highly Available

• vSphere HA

• NEW: vSphere App HA

vSphere App HA (new)

• Policy-based

• Protect off-the-shelf apps

[Diagram: a vSphere HA cluster of vSphere hosts with the vFabric Hyperic virtual appliance, the vSphere App HA virtual appliance, Hyperic agents running in VMs, and vCenter Server; VMware vFabric™ tc Server is also labeled]

vSphere HA – Keep In Mind…

• RTO – measured in minutes (not seconds)

• Requires shared storage

• Best practices

– Use admission control – percentage policy

– Test post-failure performance with host maintenance mode

– Isolation response – leave powered on

– Network and storage redundancy
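For the percentage-based admission control policy, a common starting point is to reserve enough capacity to absorb the loss of a chosen number of hosts. A rough sketch, assuming all hosts in the cluster are the same size; the function name is hypothetical and the result is a starting value, not a VMware-prescribed setting:

```python
def failover_reserve_percent(host_count: int, host_failures_to_tolerate: int = 1) -> int:
    """Percentage of cluster CPU and memory to reserve, rounded up.

    Assumes equally sized hosts, so each host holds 1/host_count of the capacity.
    """
    if host_count < 2:
        raise ValueError("an HA cluster needs at least two hosts")
    if not 1 <= host_failures_to_tolerate < host_count:
        raise ValueError("host_failures_to_tolerate must be between 1 and host_count - 1")
    # Integer ceiling division avoids floating-point rounding surprises.
    return (100 * host_failures_to_tolerate + host_count - 1) // host_count


if __name__ == "__main__":
    for hosts in (3, 4, 8):
        print(hosts, "hosts:", failover_reserve_percent(hosts), "% reserved")
```

Clusters with unequal hosts need the reservation sized against the largest host instead, which this sketch does not attempt.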

vSphere Fault Tolerance (FT)

• Zero recovery time, zero data loss

– Host hardware failure only

– Does not protect against OS and application failure

• Works fine with HA, App HA

• Why not FT?

– Resource requirements – does workload really need it?

– VM has multiple CPUs

– No VM snapshots – backups require agent

Data Protection (Backup and Restore)

• Agents? No Agents? – Both!

– No agents for majority of workloads – keep it simple

– Agents for certain apps

• vSphere Data Protection (VDP) Advanced

– Backup and recovery for VMware, from VMware

– Based on proven, mature EMC Avamar™

– Agent-less VM backup and restore

– Agents for granular tier-1 application protection

vSphere Data Protection New

VDP Advanced – Keep In Mind…

• Engineered for SMB environments

• Uses VADP – VM snapshots, CBT

• Utilizes Windows VSS in VMware Tools

• Works fine with HA, not with FT

• RDM – virtual yes, physical no

• Is it DR?

– Maybe – depends on RTO, RPO

– Needs replication offsite, right?

VDP Advanced – Keep In Mind…

• Best Practices

– Prepopulate DNS, always use FQDN

– Manage VM snapshots

– Avoid deploying to slow storage

– Do not power-off, always shut down gracefully

– Do not schedule backups during maintenance window

vCenter Availability

• Run vCenter Server application in a VM

• Run vCenter Server database in a VM

• Run both in same VM?

• Protect with vSphere HA

– vCenter and DB VM restart priority set to High

– Enable guest OS and App monitoring

• App HA can protect SQL Server database

vCenter Availability

• Back up vCenter Server VM and database

– Image-level backup for vCenter Server VM

– App-level backup using agent for database backup

• Why not FT for vCenter Server?

– vCenter Server requires minimum of 2 vCPUs

– FT does not protect against application failure

vCenter Infrastructure Navigator (VIN)

vCloud Hybrid Service - Disaster Recovery

Simple and secure asynchronous replication and failover for vSphere

• Warm standby capacity on vCHS

• Self-service protection, failover and failback workflows per VM

• 15 min¹ to 24 hr. recovery point objective (RPO)

• Initial data seeding by shipping a disk

• Includes:

– 2x 7-day DR tests per year

– 30 days of recovered VM run time

¹ Dependent on available bandwidth

[Diagram: Site A (Primary), with servers running VMware vSphere and VMware vCenter Server, replicates via vSphere Replication to vCHS, Site B (Recovery), in the US East and US West regions]

Disaster Recovery – New Core Class of Service

vCloud Hybrid Service Standard Service Tiers

• Dedicated Cloud Instance

• Virtual Private Cloud Instance

• DR-VDC Instance (new instance type as DR service tier)

DR-VDC Instance

• Minimum size: 10 GHz vCPU, 20 GB vRAM

• Starts at 1 TB

• 10 Mbps allocated

• 2 public IPs

• 2 tests*

• Term lengths: 1m, 12m, 24m, 36m subscriptions

vSphere Provides The Best Foundation For Disaster Recovery in the Cloud

Encapsulation: Simple Application Protection

• Entire system – including application, OS, and data – is stored as virtual machine files

• Entire system can be protected with data protection tools

Hardware-Independence: Flexible Infrastructure

• Eliminate the need for SAN or array-based replication

• Enable consistent recovery throughout data center lifecycle changes

Hybrid Aware: Seamless Integration with vCHS

• Reduced costs by leveraging the cloud for DR

• Scale your protection capacity to meet variable demand


Fully integrated with vSphere Web Client

Consistent management and operational best practices…

• Single interface and common management

• Designed to integrate with vCloud Hybrid Service

• Doesn’t require “console hopping”


Disaster Recovery System Requirements

Primary Data Center

• VMware vSphere 5.1 or above

– vSphere Essentials Plus

– vSphere Standard

– vSphere Enterprise

– vSphere Enterprise Plus

• VMware vCenter 5.1 or above

– Includes vSphere Web Client

• vSphere Replication Appliance 5.6

– 1:1 mapping with vCenter*

• Public internet connectivity

vCloud Hybrid Service

• DR subscription (DR Virtual Data Center instance)

Disaster Recovery and Site Recovery Manager

Disaster Recovery as a complementary DR solution to traditional SRM deployments

[Decision flowchart with Yes/No branches. Nodes: Seeking DR solution?; SRM in scope?; Pass; vCloud Hybrid Service - DR; Internal/DIY hosted solution; On premise; Co-existence. Annotations: (Default), (Partner service contract)]

RaaS Alternative

• True multi-tenancy and multi-site

• Storage-agnostic support

• Support for different vSphere versions

• Shared cloud infrastructure

• Simplified management

– UI embedded in vSphere (v5.1+)

– Protect VMs with a couple of clicks

– Failover and testing through API

– Installable in current environment

• Administration via vCHS console and API*

[Diagram: VMware vSphere customers connecting to vCHS US-East, vCHS US-West and vCHS EUR-UK]

VMware – Multiple Levels of Protection

[Diagram built up over three slides: a SQL VM at Site A protected by vSphere HA/FT; then with VDPA added at Site A; then with VR/SRM protecting it to a SQL VM at Site B]


SRM and vSphere Replication 5.5

vSphere Replication – Standalone

• Native tool built into the platform

• Per-VM hypervisor replication, managed in VC

• Selectable RPO from 15 min up to 24 hours

• Selectable destination datastore (disk-type agnostic)

Replication Across Sites

[Diagram: each site has a vCenter Server, a VR appliance, and ESXi hosts (each with NFC and VRA) attached to storage; VMDK1 is replicated from the storage at one site to the storage at the other]

Four Steps for Full Recovery

1. Right-click, select “Recover”

2. Select a target folder

3. Select a target resource

4. Click Finish

The wizard will validate your choices as you go.

New Feature – Retain Historical Replicas

• Retention of multiple points in time (MPIT) allows reversion to earlier known good states

• After recovery, use the snapshot manager to revert to earlier points

[Diagram: vSphere host with VR Agent]

MPIT Presented as VM Snapshots after Failover

Use the snapshot manager to revert to earlier points, an interface all administrators have been comfortable with for many years.

vSphere Replication – Interoperability

Fault Tolerance – doesn’t work with VR

• FT conflicts at the vSCSI disk filter level.

VDP

• Mostly no problem!

HA, vMotion, DRS

Storage vMotion and Storage DRS

• Now supported!

vSphere Replication – Best Practices

• RPO

– Only what is necessary!

– Just because you can…

• RTO

– Don’t set one! No testing, no automation, manual process.

• VSS – Only if necessary!

• What about bandwidth?

– Very hard to determine. Do a local loopback first.

• RDMs?

– Don’t use them. If you must, use virtual compatible.
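Bandwidth is hard to determine, but a back-of-the-envelope lower bound helps frame the RPO discussion: the data changed during one RPO window has to cross the link in no more than one RPO window. The sketch below is an assumption-laden estimate, not VMware's sizing method. It ignores compression, protocol overhead and bursts, and the change rate is a number you must measure yourself.

```python
MIN_RPO_MINUTES = 15        # lowest selectable RPO per the slides
MAX_RPO_MINUTES = 24 * 60   # highest selectable RPO per the slides


def required_mbps(changed_gb_per_window: float, rpo_minutes: float) -> float:
    """Lower-bound link speed (megabits/s) to ship one RPO window of changes in time.

    changed_gb_per_window is a measured or guessed figure (decimal gigabytes).
    """
    if not MIN_RPO_MINUTES <= rpo_minutes <= MAX_RPO_MINUTES:
        raise ValueError("RPO must be between 15 minutes and 24 hours")
    if changed_gb_per_window < 0:
        raise ValueError("changed_gb_per_window must not be negative")
    bits_to_send = changed_gb_per_window * 1000**3 * 8
    return bits_to_send / (rpo_minutes * 60) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 5 GB of changed blocks per window.
    for rpo in (15, 60, 240):
        print(f"RPO {rpo} min: at least {required_mbps(5, rpo):.1f} Mbps")
```

The same amount of change spread over a longer RPO needs proportionally less bandwidth, which is one reason to set the RPO only as low as is necessary.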


SRM and vSphere Replication 5.5

SRM

What is it?

• A Disaster Recovery engine

• A tool that uses externally replicated data (VR or array based) to shorten the RTO of a business continuity plan (BCP)

• A product that allows for DR to be tested, automated, planned, repeatable and customizable

What is it not?

• A replication engine

• A disaster avoidance stretched cluster

Key Components of SRM

• Replication

• vCenter Server: one vCenter Server (Windows or VCVA) per site, same versions

• SRM Server: one SRM Server per site, same versions

• vSphere hosts: recommend same versions per site (pre vSphere 5.x only if using array replication)

• vSphere Essentials Plus and higher editions supported

SRM Replication Options

• SRM can utilize BOTH array based AND vSphere Replication

• SRM will “see” existing standalone vSphere Replication protected VMs

• SRM can install vSphere Replication from scratch if needed

[Diagram: a multi-tier app (Web, App, DB) protected by vSphere Replication, and a multi-tier app (Web, App, DB) on LUN 1 and LUN 2 protected by storage-based replication]

Recovery Workflows

Failover Automation

• User-defined recovery plan

• Minimize errors

Non-disruptive Failover Testing

• Isolated test environment

• Increase confidence in DR process

Planned Migration

• Zero data loss

• Operational migration

Failback Automation

• Re-protect VMs, migrate back

SRM Interoperability

• Works with VR and ABR

• Backups, VADP or other, are fine

• HA is no problem at all

• vMotion and DRS are fine

• Storage vMotion and Storage DRS – sort of…

– Replication dependent

• FT is “yellow”

– Array replicated only, and the FT status is not recovered

• Web vs vSphere Client

SRM – A Few Best Practices

Not exhaustive

Plenty of support material available on blogs, vmware.com, tech sites

Big ones:

• Storage layout

• Test network configuration

• Test often!

• Size vCenter correctly

Biggest one:

• Do a Business Impact Analysis

– RPO, RTO, cost of downtime, interdependencies, criticality of applications, priorities, units of failover, overlooked externalities, executive buy-in, …

Protection Groups (PGs)

• More PGs = more granular testing/failover

– DR testing is easier – fewer resource requirements

– Fail-over only what is needed

– More configuration/complexity

• Fewer protection groups = less complex

– Fewer LUNs, PGs, recovery plans

– Less flexibility

• Find a good balance between flexibility and simplicity

[Diagram: a spectrum from fewer LUNs/PGs (less complexity, less flexibility) to more LUNs/PGs (more complexity, more flexibility); the right combination of complexity and flexibility varies by customer]

Majority of outages are partial (not entire data center); design accordingly.

Test Network

– Use VLAN or isolated network for test environment

• Default “Auto” setting does not allow VM communication between hosts

– Different vSwitch can be specified in SRM for test versus run

• Specified in Recovery Plan

Typical failover

[Diagram: a primary site and a secondary site, each with VirtualCenter, Site Recovery Manager and storage, linked by array replication or vSphere Replication]

Array Based Replication with SRM

[Diagram: “Protected” site and “Recovery” site, each with a vSphere Client with the SRM plug-in, a vCenter Server, an SRM Server with an SRA, ESX hosts, and VMFS datastores on storage running replication software; replication runs between the storage at the two sites]

SRA Commands

“Configure arrays”, done during the SRM array configuration

• 1. discoverArrays

• 2. discoverDevices

Test failover (test the DR solution at a point in time using LUN snapshots)

• 3. testFailoverStart

• 4. testFailoverStop

Failover (Planned Migration or Disaster recovery)

• 5. failover


SRA Commands Continued…

Failback (SRM 5.x onwards)

• 6. reverseReplication

• 7. queryReplicationSetting

Synchronization calls

• 8. syncOnce

• 9. querySyncStatus
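The grouping of SRA commands by workflow can be captured in a small lookup table. This is only an illustration of the slide content: the dictionary keys and the helper function are invented here, and the sketch does not show how SRM actually invokes an SRA.

```python
# Command names are taken from the slides; the workflow keys are hypothetical labels.
SRA_COMMANDS = {
    "configure_arrays": ["discoverArrays", "discoverDevices"],
    "test_failover": ["testFailoverStart", "testFailoverStop"],
    "failover": ["failover"],  # planned migration or disaster recovery
    "failback": ["reverseReplication", "queryReplicationSetting"],  # SRM 5.x onwards
    "synchronization": ["syncOnce", "querySyncStatus"],
}


def commands_for(workflow: str) -> list:
    """Return the SRA commands the slides list for a workflow, in slide order."""
    try:
        return list(SRA_COMMANDS[workflow])
    except KeyError:
        known = ", ".join(sorted(SRA_COMMANDS))
        raise ValueError(f"unknown workflow {workflow!r}; expected one of: {known}") from None


if __name__ == "__main__":
    for name in SRA_COMMANDS:
        print(f"{name}: {' -> '.join(commands_for(name))}")
```

When reading SRM logs, a table like this helps map a failing command name back to the workflow that issued it.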


vSphere Replication with SRM

[Diagram: “Protected” site and “Recovery” site, each with a vSphere Client with the SRM plug-in, a vCenter Server, an SRM Server, a VR Server, ESX hosts, and VMFS datastores on storage; HBR runs on the ESX hosts at the protected site]

vSphere Replication failover workflow

[Three diagrams, one per step: Test Failover; Full Failover (with a Synchronize step); Re-protect. Each shows a protected site (servers, virtual machines, VMDKs, VirtualCenter and Site Recovery Manager) linked by vSphere Replication to a recovery site (vSphere Replication Appliance, replicated VMDKs, servers, virtual machines, VirtualCenter and Site Recovery Manager). In the Re-protect diagram the vSphere Replication Appliance also appears at the original protected site.]

Pros and Cons of the replication technologies

ABR = Array Based Replication

Pros

• Mature

• Can be synchronous as well as asynchronous

• Datastore consistency groups

• Supports SDRS; however, all LUNs involved in the SDRS cluster must be within the same consistency group

Cons

• Coarse granularity (per LUN)

• Requires compatible hardware at both sites

• Dedicated physical resources

• Managed outside of vCenter

• Licensed asset

VR = vSphere Replication

Pros

• Fine granularity (per VM)

• Any-to-any storage

• Integrated into vSphere

• Uses existing network infrastructure to replicate the data

• Supports SDRS

• MPIT feature added to allow for failover to an older point in time

• Is available as a standalone appliance outside of SRM

Cons

• Lack of multi-VM consistency groups (CGs)

• Does not support low RPOs (< 15 mins)

• VSS is not compatible if another solution, like a backup, also uses VSS quiescing

• Dependent on the network bandwidth between sites


SRM Advanced Settings


- Every environment is different so one setting does not fit all

- These settings are “per site” so without consistency, failover and failback will behave differently


SRM Logs

SRM Log Files location

C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs

Alternatively, generate the logs from within SRM (this can also gather the VR logs) by right-clicking the selected site and selecting “Export System Logs”.
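If the export option is not available, the log directory can be bundled by hand. A minimal sketch using only the Python standard library: the function name and archive name are hypothetical, it is not a VMware tool, and it is not a substitute for “Export System Logs”.

```python
import zipfile
from pathlib import Path
from typing import List

# Default location quoted on the slide (Windows-based SRM Server).
DEFAULT_SRM_LOG_DIR = Path(r"C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs")


def bundle_logs(log_dir: Path, archive_path: Path) -> List[str]:
    """Zip every file under log_dir and return the names stored in the archive."""
    if not log_dir.is_dir():
        raise FileNotFoundError(f"log directory not found: {log_dir}")
    # List files before creating the archive so the archive never includes itself.
    files = sorted(
        p for p in log_dir.rglob("*")
        if p.is_file() and p.resolve() != archive_path.resolve()
    )
    names = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            arcname = path.relative_to(log_dir).as_posix()
            archive.write(path, arcname)
            names.append(arcname)
    return names


if __name__ == "__main__":
    if DEFAULT_SRM_LOG_DIR.is_dir():
        stored = bundle_logs(DEFAULT_SRM_LOG_DIR, Path("srm-logs.zip"))
        print(f"stored {len(stored)} files in srm-logs.zip")
    else:
        print("SRM log directory not found on this machine")
```

Run it at both sites, since the logs from both SRM Servers are needed for troubleshooting.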


Takeaways


- SRM may need to be customized based on your environment

- If an environmental change is made, run test failovers to confirm the change has not caused an unforeseen issue

- The test failover workflow exists to test without a production outage, and its purpose is to highlight any issue that could cause a full failover to fail. Accordingly, it is important to make test failovers a part of scheduled maintenance

- In the event VMware needs to be engaged to troubleshoot an issue, ensure that the SRM logs are generated at both sites, and also include the logs for a subset of the source and DR ESXi servers, plus the vCenter logs at both sites (the failover report is also helpful)

- If vSphere Replication is also in use, it is important to provide the logs from the appliances as well (to match with the ESXi logs from the server hosting the production VM)

Additional Resources

Thank You