Data Center Business Continuance and Disaster Recovery

Maciej Bocian, Sales Manager

© 2009 Cisco Systems, Inc. All rights reserved. Cisco Confidential

Data Center and Virtualization, Central Europe

CCIE#7785

Business Continuance Drivers

• Cost of application downtime, lost data and productivity

• Regulatory mandates (Homeland Defense, Basel II, HIPAA, GLB, SEC)
– Firms must recover business operations the same business day a disruption occurs
– “Out-of-region” data center, 200+ km away
– Mandates backup data centers on separate grids

[Figure: hurricanes, the Northeast Blackout, the NYC Blizzard of 2003]

Business Continuance Is More Critical than Ever

75% of IT decision-makers have altered Disaster Recovery/Business Continuance programs as a result of September 11

Following a disaster, 43% of directly affected businesses do not reopen and 29% fail within 24 months as a result

Only 15% of Global 2000 enterprises have a full-fledged business continuity plan

Disasters: fire, storm, floods, earthquakes, chemical accidents, nuclear accidents, wars

Sources: Disaster Recovery Journal, Gartner Group

Agenda

Introduction to Data Center - The Evolution

Data Center Disaster Recovery
– Objectives
– Failure Scenarios
– Design Options

Components of Disaster Recovery
– Site Selection - Front End GSLB
– Server High Availability - Clustering
– Data Replication and Synchronization - SAN Extension

Sample Design


The Evolution of Data Centers


Data Center Evolution

[Figure: timeline from 1960 to 2010 plotted against business agility. Compute evolution: Mainframes, Client/Server, Internet Computing, Thin Client: HTTP, Optimization. Network evolution: Terminal, TCP/IP, Content Networking, Data Center Networking. Networked data center phase: 1. Consolidation, 2. Integration, 3. Distributed, 4. High Availability (Data Center Consolidation, Data Center Distributed, Data Center Continuous Availability).]

What Is Involved in a Data Center

Application solution: Linux/HP, Solaris/SunFire, WebLogic, J2EE custom app, etc.

Database solution: Linux/HP, Solaris/SunFire, Oracle 10G RAC, etc.

Network infrastructure solution: Cisco GSRs, Cisco Catalyst 6500, Cisco Catalyst 4000

Layer 4–7 services solution: CSM, SSLM, CSS, CE, GSS

Network security solution: PIX®, FWSM, IDSM, VPNSM, CSA

Storage solution: MDS9000

Management and instrumentation solution: Terminal servers, NAM, CiscoWorks LMS/VMS, HSE

What Is a Distributed Data Center

[Figure: a primary data center (APP A, APP B) and a secondary data center (APP A, APP C), each with FC-attached storage, linked by data replication]

Why Distributed Data Centers

Provide disaster recovery and business continuance

Avoid single, concentrated data depositary

High availability of applications and data access

Load balancing together with performance scalability

Better response and optimal content routing: proximity to clients


Front-end IP Access Layer

“Content Routing”: site selection

[Figure: primary data center (APP A, APP B) and secondary data center (APP A, APP C)]

Application and Database Layer

“Content Switching”: load balancing

“Server Clustering”: high availability

[Figure: primary data center and secondary data center, each with FC-attached storage]

Backend SAN Extension

“Storage” & “Optical”: data mirroring and replication

[Figure: primary data center (APP A, APP B) and secondary data center (APP A, APP C), each with FC-attached storage]

Data Center Disaster Recovery




Disaster Recovery

Recovery of data and resumption of service: ensuring the business can recover and continue after a failure or disaster

Ability of a business to adapt, change and continue when confronted with various outside impacts

Mitigating the impact of a disaster


What It Means for Business

Business Resilience: continued operation of business during a failure

Business Continuance: restoration of business after a failure

Disaster Recovery: protecting data through offsite data replication and backup

Zero downtime is the ultimate goal

Disaster Recovery Planning

• Business Impact Analysis (BIA): determines the impacts of various disasters on specific business functions and company assets

• Risk Analysis: identifies important functions and assets that are critical to the company’s operations

• Disaster Recovery Plan (DRP): restores operability of the target systems, applications, or computing facility at the secondary data center after the disaster


Disaster Recovery Objectives

Recovery Point Objective (RPO)
– The point in time (prior to the outage) to which systems and data must be restored
– Tolerable loss of data in the event of a disaster or failure
– The impact of data loss and the cost associated with the loss

Recovery Time Objective (RTO)
– The period of time after an outage within which the systems and data must be restored to the predetermined RPO
– The maximum tolerable outage time

Recovery Access Objective (RAO)
– Time required to reconnect users to the recovered application, regardless of where it is recovered
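As a rough illustration of how RPO and RTO relate to a timeline, the sketch below derives the achieved recovery point and recovery time from three timestamps and compares them with targets. The function names, the dataclass and the sample numbers are illustrative assumptions, not part of the original deck.

```python
from dataclasses import dataclass


@dataclass
class RecoveryResult:
    rpo_seconds: float  # data-loss window actually experienced
    rto_seconds: float  # outage duration actually experienced


def measure_recovery(last_copy_at: float, disaster_at: float,
                     restored_at: float) -> RecoveryResult:
    """Derive achieved recovery point and recovery time from timestamps in seconds."""
    if not last_copy_at <= disaster_at <= restored_at:
        raise ValueError("expected last_copy_at <= disaster_at <= restored_at")
    return RecoveryResult(
        rpo_seconds=disaster_at - last_copy_at,
        rto_seconds=restored_at - disaster_at,
    )


def meets_objectives(result: RecoveryResult, rpo_target: float,
                     rto_target: float) -> bool:
    """True when both the data-loss window and the outage fit the targets."""
    return result.rpo_seconds <= rpo_target and result.rto_seconds <= rto_target


# Hypothetical case: last copy at midnight, disaster 6 hours later,
# service restored 4 hours after the disaster.
nightly = measure_recovery(last_copy_at=0, disaster_at=6 * 3600, restored_at=10 * 3600)
print(nightly)
print(meets_objectives(nightly, rpo_target=3600, rto_target=8 * 3600))
```

In this hypothetical case the outage fits an 8-hour RTO, but six hours of lost data exceeds a 1-hour RPO, which is the kind of gap that pushes a design from periodic copies toward replication.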


Recovery Point/Time vs. Cost

[Figure: timeline with the recovery point (critical data is recovered) at time t0, the disaster striking at time t1, and systems recovered and operational at time t2. Recovery point and recovery time are each shown on a scale running from seconds to days or weeks, with cost increasing toward the shorter end.]

Recovery point, smallest to largest: Synchronous Replication, Asynchronous Replication, Periodic Replication, Tape Backup

Recovery time, smallest to largest: Extended Cluster, Manual Migration, Tape Restore

Smaller RPO/RTO: higher $$$, replication, hot standby

Larger RPO/RTO: lower $$$, tape backup/restore, cold standby



Failure Scenarios

Disaster could mean many types of failure

Network Failure

Device Failure

Storage Failure

Site Failure


Network Failures

ISP failure
– Dual ISP connections
– Multiple ISP

Connection failure within the network
– EtherChannel
– Multiple route paths

[Figure: Internet access through Service Provider A and Service Provider B]


Device Failures

Routers, switches, FWs
– HSRP
– VRRP

Hosts
– HA cluster

[Figure: Internet access through Service Provider A and Service Provider B]


Storage Failures

Disk arrays
– RAID

Disk controllers


Site Failures

Partial site failure
– Application maintenance
– Application migration
– Application scheduled DR exercise

Complete site failure
– Disaster




Cold Standby

One or more data centers with appropriately configured space, equipped with pre-qualified environmental, electrical, and communication conditioning

Hardware and software installation, network access, and data restoration all need manual intervention

Least expensive to implement and maintain

Substantial delay from standby to full operation


Disaster Recovery – Active/Standby

[Figure: primary data center (APP A, APP B) and a cold standby data center (APP A, APP B)]

Warm Standby

A data center that is partially equipped with hardware and communications interfaces capable of providing backup operating support

Latest backups from the production data center must be delivered

Network access needs to be activated

Provides better RTO and RPO than Cold Standby

[Figure: primary data center (APP A, APP B) and warm standby secondary data center (APP A, APP B), each with FC-attached storage, connected by an IP/optical network]

Hot Standby

A data center that is environmentally ready and has sufficient hardware and software to provide data processing service with little or no downtime

Hot Backup offers Disaster Recovery with little or no human intervention

Application data is replicated from the primary site

A hot backup site provides very good RTO and RPO

[Figure: two data centers connected by an IP/optical network]

Disaster Recovery – Active/Active

What Does Active/Active Mean?


Multiple Tiers of Application

Presentation Tier

Application Tier

Storage Tier


Active/Active Data Centers

Active/Active Web Hosting

Active/Active Application Processing

Active/Standby or Active/Active Database Processing

[Figure: two data centers reached from the internal network and from the Internet through Service Provider A and Service Provider B]

Disaster Recovery Components




Site Selection Mechanisms

Site selection mechanisms depend on the technology or mix of technologies adopted for request routing:
1. HTTP Redirect
2. DNS Based
3. L3 Routing with Route Health Injection (RHI)

Health of servers and/or applications needs to be taken into account

Optionally, other metrics (like load) can be measured and utilized for a better selection


HTTP Redirection – The Idea

Leveraging the HTTP redirect function: HTTP return code 302

Proper site selection made after the initial DNS request has been resolved, via redirection

Mainly a method of providing site persistence while providing local server farm failure recovery

Can be used with the “Location Cookie” feature of the CSS to provide redirection after wrong site selection
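A minimal sketch of the redirect decision, assuming a simple "first healthy site wins" policy. The site names are the ones shown on the traffic-flow slide; the health dictionary and the `redirect_response` helper are hypothetical and only model the HTTP status and `Location` header, not a real server or the CSS feature set.

```python
# Candidate sites in order of preference (names taken from the traffic-flow slide).
SITES = ["http://www1.cisco.com", "http://www2.cisco.com"]


def redirect_response(path, health):
    """Return (status, headers) for a request to the generic site name.

    health maps a site URL to True/False, as reported by some external
    health check (assumed here, not specified in the deck).
    """
    for site in SITES:
        if health.get(site, False):
            # 302 tells the browser to re-issue the request at the chosen site.
            return 302, {"Location": site + path}
    # No site is usable: answer with Service Unavailable instead of redirecting.
    return 503, {}


print(redirect_response("/", {"http://www1.cisco.com": True,
                              "http://www2.cisco.com": True}))
print(redirect_response("/", {"http://www1.cisco.com": False,
                              "http://www2.cisco.com": True}))
```

Because the client keeps using the site-specific name after the redirect, it stays on the selected site, which is the persistence property the slide describes.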


HTTP Redirection – Traffic Flow

[Figure: a client request to http://www.cisco.com/ is redirected to http://www1.cisco.com/ or http://www2.cisco.com/]

Advantages of the HTTP Redirection Approach

Can be implemented without any other GSLB devices or mechanisms

Inherent persistence to the selected location

Can be used in conjunction with other methods to provide more sophisticated site selection


Limitations of the HTTP Redirection Approach

It is protocol specific: relies on HTTP

Requires redirection to additional fully qualified names, and therefore additional DNS records

Users may bookmark a specific location, losing automatic failover

HTTPS redirect requires the full SSL handshake to be completed first


DNS-Based Site Selection – The Idea

The client D-proxy (local name server) performs iterative queries

The device which acts as “site selector” is the authoritative name server for the domain(s) distributed in multiple locations

The “site selector” sends keepalives to servers or server load balancers in the local and remote locations

The “site selector” selects a site for the name resolution, according to the pre-defined answers and site load balance method

The user traffic is sent to the selected location
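The selection logic above can be sketched as a small class that keeps keepalive state per virtual IP and answers each query from the pre-defined answers that are still alive. Round robin is assumed as the site load-balance method, and the class name, host name mapping and documentation-range addresses are illustrative only.

```python
from itertools import count


class SiteSelector:
    """Toy authoritative 'site selector': health-aware, round-robin answers."""

    def __init__(self, answers):
        # answers: hostname -> list of candidate VIPs, one per data center.
        self.answers = answers
        self.alive = {vip: True for vips in answers.values() for vip in vips}
        self._turn = count()

    def record_keepalive(self, vip, ok):
        """Store the result of the latest keepalive sent to a VIP."""
        self.alive[vip] = ok

    def resolve(self, hostname):
        """Return one A-record answer, or None if no site is available."""
        candidates = [v for v in self.answers.get(hostname, []) if self.alive[v]]
        if not candidates:
            return None
        return candidates[next(self._turn) % len(candidates)]


selector = SiteSelector({"www.cisco.com": ["192.0.2.10", "198.51.100.10"]})
first = selector.resolve("www.cisco.com")
second = selector.resolve("www.cisco.com")
selector.record_keepalive("192.0.2.10", False)  # data center 1 stops answering
third = selector.resolve("www.cisco.com")
print(first, second, third)
```

A real selector would also set a short TTL on its answers, since caching in the D-proxy and in the client application limits how quickly a failed site stops receiving traffic.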


DNS-Based Site Selection – Traffic Flow

[Figure: numbered steps 1 to 10. The client asks its DNS proxy to resolve http://www.cisco.com/ (UDP:53). The DNS proxy queries the root name server / authoritative name server for .com, then the authoritative name server for cisco.com, then the authoritative name server for www.cisco.com, and returns the answer to the client, which then connects to Data Center 1 or Data Center 2 (TCP:80).]

Advantages of the DNS Approach

Protocol independent: works with any application that uses name resolution

Minimal configuration changes in the current IP and DNS infrastructure (DNS authoritative server)

Implementation can be different for specific host names

A-records can be changed on the fly

Can take load or data center size into account

Can provide proximity


Limitations of the DNS-Based Approach

Visibility limited to the D-proxy (not the client)

Cannot guarantee 100% session persistence

DNS caching in the D-proxy

DNS caching in the client application

Order of multiple A-record answers can be altered by D-proxies


Route Health Injection – The Idea

Server and application health monitoring provided by local Server Load Balancers

SLB can advertise or withdraw VIP addresses to upstream routing devices depending on the availability of the local server farm

Same VIP addresses can be advertised from multiple data centers: IP Anycast

Relying on L3 routing protocols for route propagation and content request routing

Disaster Recovery provided by network convergence
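The advertise/withdraw behaviour can be modelled in a few lines. This is a sketch only: the shared dictionary stands in for the routing protocol, the costs are arbitrary numbers, and the class and function names are hypothetical rather than any Cisco API.

```python
def best_route(advertisements):
    """Pick the location advertising the VIP at the lowest cost, or None."""
    if not advertisements:
        return None
    return min(advertisements, key=advertisements.get)


class LoadBalancer:
    """Toy SLB that injects or withdraws a host route for one VIP."""

    def __init__(self, location, cost, routing_table):
        self.location = location
        self.cost = cost
        self.routing_table = routing_table  # location -> advertised cost

    def report_farm_health(self, healthy):
        if healthy:
            # Server farm is up: advertise the VIP with this site's cost.
            self.routing_table[self.location] = self.cost
        else:
            # Server farm is down: withdraw the route so traffic converges elsewhere.
            self.routing_table.pop(self.location, None)


table = {}
backup = LoadBalancer("Location A", cost=1000, routing_table=table)
preferred = LoadBalancer("Location B", cost=10, routing_table=table)
backup.report_farm_health(True)
preferred.report_farm_health(True)
print(best_route(table))            # preferred site wins while it is healthy
preferred.report_farm_health(False)
print(best_route(table))            # traffic converges on the backup site
```

The cost difference is what makes one location preferred and the other a backup; recovery time is then governed by how quickly the routing protocol converges after the withdrawal.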

Route Health Injection – Implementation

[Figure: Client A and Client B reach two locations through Routers 10 to 13. Location B is the preferred location for VIP x.y.w.z and is advertised with low cost; Location A is the backup location for VIP x.y.w.z and is advertised with very high cost.]


Advantages of the RHI Approach

Supports legacy applications and does not rely on a DNS infrastructure

Very good re-convergence time, especially in intranets where L3 protocols can be fine-tuned appropriately

Protocol-independent: works with any application

Robust protocols and proven features


Limitations of the RHI Approach

Relies on host routes (32 bits), which cannot be propagated all over the Internet (more on this later)

Requires tight integration between the application-aware devices and the L3 routers

Inability to intelligently load balance among the data centers




Cluster Overview

A cluster is two or more servers configured to appear as one

Two types of clustering: Load balancing (LB) and High Availability (HA)

Clustering provides benefits for availability, reliability, scalability, and manageability

LB clustering: multiple copies of the same application against the same data set, usually read only

HA clustering: multiple copies of a long-running application that requires access to a common data depository, usually read and write

[Figure: Web Servers, Application Servers, Database Servers]


HA Cluster Connections

Public Network (typically Ethernet) for client/application requests

Servers with same hardware, OS, and application software

Private Network (typically Ethernet) for interconnection between nodes. Could be direct connect, or optionally going through the public network

Storage Disk (typically Fiber): shared storage array, NAS or SAN


Typical HA Cluster Components

Application software that is clustered to provide High Availability. Example: Microsoft Exchange, SQL, Oracle database, File and Print Services

Operating System that runs on the server hardware. Example: Microsoft Windows 2000 or 2003, Linux (and the other flavors of UNIX), IBM VMS or z/OS (for mainframe)

Cluster Software that provides the HA clustering service for the application. Example: Microsoft MSCS, EMC AutoStart (Legato), Veritas Cluster Server, HP TruCluster and OpenVMS

Optionally, Cluster Enabler, software that synchronizes the cluster software with the storage disk array software


Basic HA Cluster Design

Active/Standby:
– Active node takes client requests and writes to the data
– Standby takes over when detecting failure on active
– Two-node or multi-node

Active/Active:
– Database requests load balanced to both nodes
– Lock mechanism ensures data integrity
– Most scalable design


File System Approaches for HA Clusters

Shared Everything
– Equal access to all storage
– Each node mounts all storage resources
– Provides a single layout reference system for all nodes
– Changes updated in the layout reference

Shared Nothing
– Traditional file system with peer-peer communication
– Each node mounts only its “semi-private” storage
– Data stored on the peer system’s storage is accessed via the peer-peer communication
– Failed node’s storage needs to be mounted by the peer


Geo-clusters

Geo-cluster: a cluster that spans multiple data centers

[Figure: node1 in the local datacenter and node2 in the remote datacenter, connected over a WAN; disk replication between sites is synchronous or asynchronous, with a write cost of 2 x RTT]

Considerations for HA Clusters

Split Brain: cluster partitioning when nodes cannot communicate with each other but are equally capable of forming a cluster and mounting disks

Extended L2 required in most implementations for:
– Public Network, since the client only knows about the Virtual IP address
– Private Network, used for heartbeats

Storage:
– Directly Attached Disk (DAS) cannot be used
– Shared Disk needs to be visible to both nodes
– Needs to interface with cluster software for disk failover, zoning, LUN masking when there is a node failure


Split-Brain

Split-brain happens when all of the network communication links between two or more cluster nodes fail

Both nodes could potentially go active and concurrently access the disk, thus corrupting data

[Figure: node1 and node2 both writing to the shared disk, resulting in data corruption]

Resolution for Split Brain: Quorum

A quorum device serves as a tie breaker to arbitrate which system has access to resources

The quorum ensures that even if there is no communication between the nodes, only one node can continue to access the disk

Only the node that owns the quorum (or the majority of quorum votes) can bring resources online

Any resource can be used as the arbitrator to break the tie

[Figure: node1 and node2 sharing a quorum disk and application data]
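A minimal sketch of the majority rule, assuming each node and the quorum device carry one vote each. The vote counts and the function name are illustrative; real cluster software adds reservation and fencing mechanisms that are not modelled here.

```python
def may_bring_resources_online(votes_held, total_votes):
    """A partition may go active only with a strict majority of all votes."""
    return 2 * votes_held > total_votes


# Two-node cluster plus one quorum device: three votes in total.
TOTAL_VOTES = 3

# The heartbeat links fail. node1 wins the race for the quorum device,
# so it holds its own vote plus the quorum vote; node2 holds only its own.
node1_votes = 2
node2_votes = 1

print(may_bring_resources_online(node1_votes, TOTAL_VOTES))  # node1 stays active
print(may_bring_resources_online(node2_votes, TOTAL_VOTES))  # node2 must not mount the disk
```

With an odd total, two disjoint partitions can never both hold a strict majority, which is why only one side keeps access to the application data.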

Extended Layer 2 Network

In most implementations, a common L2 network is needed for the heartbeat between the nodes, as well as public client access

Extending VLANs on a geographical basis is not considered best practice because of the impact of broadcasts, multicast, flooding and Spanning-Tree integration issues

[Figure: node1 in the local datacenter and node2 in the remote datacenter, joined across the WAN by a public Layer 2 network and a private Layer 2 network; disk replication is synchronous or asynchronous]

Resolution: L3 Routed Solution

In certain cases an L3 routed solution is possible

Microsoft MSCS
– Requires that the 2 nodes be on the same subnet
– The communication between the 2 nodes is UDP unicast
– Local Area Mobility (LAM) allows the placement of the nodes on 2 different subnets

Veritas VCS
– Allows having nodes with IP addresses in different subnets
– The Virtual Address needs to change when moving from node1 to node2
– DNS can be used to provide name to multiple IP mapping

[Figure: node1 on subnet 11.20.5.x and node2 on subnet 172.28.210.x, connected by an extended SAN; disk replication is synchronous or asynchronous]

Storage Disk Zoning

Which storage disk array should node 2 be zoned to before and after a failure on node 1?

To complete the failover you need to change the zoning configuration

Software is needed to synchronize the Cluster Software with the Disk Array’s software, i.e. Cluster Enabler

[Figure: node1 (active) and node2 (standby) attached to an extended SAN with arrays sym1320 (RW) and sym1291 (RD)]

Resolution: Cluster Enabler

The Cluster Enabler (CE) provides the interface between the Clustering Software and the Disk Array’s software

When the Clustering Software detects a failure and wants to fail the node, the Cluster Enabler instructs the Disk Array to perform a failover

Cluster Enabler also allows node1 to be zoned to sym1320 and node2 to be zoned to sym1291

The Cluster Enabler running on each node typically communicates with the Cluster Enabler software running on the remote node with Local Multicast messages

[Figure: node1 (active) and node2 (standby) attached to an extended SAN with arrays sym1320 (RW) and sym1291 (WD)]



Terminology

Storage subsystem
– Just a bunch of disks (JBOD)
– Redundant array of independent disks (RAID)

Storage I/O devices
– Host Bus Adapter (HBA)
– Small Computer System Interface (SCSI)

Storage protocols
– SCSI
– iSCSI
– FC (FCIP)


Terminology (Cont’d)

Direct Attached Storage (DAS)
– Storage is “local” behind the server
– No storage sharing possible
– Costly to scale; complex to manage

Network Attached Storage (NAS)
– Storage is accessed at a file level over an IP network
– Storage can be shared between servers

Storage Area Networks (SAN)
– Storage is accessed at a block level
– Separation of storage from the server
– High performance interconnect providing high I/O throughput


Storage for Applications

Presentation Tier
– Unrelated small data files commonly stored on internal disks
– Manual distribution

Application Processing Tier
– Transitional, unrelated data
– Small files residing on file systems
– May use RAID to spread data over multiple disks

Storage Tier
– Large, permanent data files or raw data
– Large batch updates, most likely real time
– Log and data on separate volumes


Backup and Replication

Offsite tape vaulting
– Backup tapes stored at offsite location

Electronic vaulting
– Transmission of backup data to offsite location

Remote disk replication
– Continuous copying of data to offsite location
– Transparent to host

Other methods of replication
– Host-based mirroring
– Network-based replication


Replication: Modes of Operation

Synchronous
– All data written to cache of local and remote arrays before I/O is complete and acknowledged to host

Asynchronous
– Write acknowledged after write to local array cache; changes (writes) are replicated to remote array asynchronously

Semi-synchronous
– Write acknowledged with a single subsequent WRITE command pending from remote array


Synchronous vs. Asynchronous Trade-Off

Synchronous
– Impact to application performance
– Distance limited (are both sites within the same threat radius?)
– No data loss

Asynchronous
– No application performance impact
– Unlimited distance (second site outside threat radius)
– Exposure to possible data loss

Enterprises must evaluate the trade-offs
– Maximum tolerable distance ascertained by assessing each application
– Cost of data loss
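To make the distance limit concrete, the sketch below estimates the latency that synchronous replication adds to each write. It assumes roughly 5 microseconds of one-way propagation delay per kilometre of fibre and two round trips per write, following the "2 x RTT" note on the geo-cluster slide; both figures are assumptions, and equipment latency is ignored.

```python
FIBER_DELAY_US_PER_KM = 5.0  # assumption: ~5 us one-way propagation per km of fibre
ROUND_TRIPS_PER_WRITE = 2    # assumption taken from the "2 x RTT" note in the deck


def added_write_latency_ms(distance_km):
    """Extra latency per synchronous write caused by distance alone."""
    rtt_us = 2 * distance_km * FIBER_DELAY_US_PER_KM
    return ROUND_TRIPS_PER_WRITE * rtt_us / 1000.0


for km in (2, 50, 400):
    print(km, "km ->", added_write_latency_ms(km), "ms per write")
```

Under these assumptions a metro distance of 50 km adds about 1 ms to every write and a regional distance of 400 km about 8 ms, which is why the tolerable distance has to be assessed per application.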


Data Replication with DB Example

Control Files identify the other files making up the database and record the content and state of the db
• DB name
• creation date
• backup performed
• redo log time period
• datafile state

Datafiles are only updated periodically
• Tablespaces
• Indexes
• Data Dictionary

Redo Log Files record db changes resulting from transactions
• Database changes
• Used to play back changes that may not have been written to datafile when failure occurred
• Typically archived as they fill to local and DR site destinations

Data Replication with DB Example (Cont’d)

Failure or disaster occurs at time t1
• Media Failure (e.g. disk)
• Human Error (datafile deletion)
• Database Corruption

[Figure: timeline from t0 to t1. Hot backup of datafiles and control files taken at time t0, followed by archived redo logs and online redo logs up to t1.]

Database restored to state at time of failure (time t1) by:

1. Restoring Control Files & Datafiles from last Hot Backup (time t0)

2. Sequentially replaying changes from subsequent Redo Logs (archived and online) – changes made between time t0 and t1
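The two-step restore can be sketched with a dictionary standing in for the datafiles and a list of timestamped changes standing in for the redo logs. This is a toy model under those assumptions, not how any particular database implements recovery; all names and values are hypothetical.

```python
def restore_database(backup, redo_log, failure_time):
    """Rebuild database state at failure_time from a backup plus redo entries.

    backup       : dict holding the state captured by the hot backup at t0
    redo_log     : list of (time, key, value) changes in commit order after t0
    failure_time : t1, the last moment whose changes must be recovered
    """
    state = dict(backup)            # step 1: restore the backup copy
    for when, key, value in redo_log:
        if when > failure_time:     # ignore anything logged after the failure
            break
        state[key] = value          # step 2: replay changes sequentially
    return state


backup_t0 = {"balance": 100}
redo = [(1, "balance", 120), (2, "owner", "ann"), (5, "balance", 90)]
print(restore_database(backup_t0, redo, failure_time=3))
```

Replaying in commit order matters: the result is only correct if every redo entry between t0 and t1 survives, which is why the redo logs are the data most often replicated synchronously.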


Data Replication with DB Example (Cont’d)

[Figure: primary site and secondary site linked by a SAN extension transport. Redo logs (cyclic): copy of every committed transaction, synchronously replicated for zero loss. Archive logs: replicated/copied. Database: point-in-time copy taken when the DB is quiescent (database copy at time t0), replicated/copied, with earlier DB backups kept.]

Mixture of sync and async replication technologies commonly used

Usually only redo logs sync replicated to remote site

Archive logs created from redo log and copied when redo log switches

Point in time (PiT) copies of datafiles and control files copied periodically (e.g. nightly)

Data Center Interconnection Options

[Figure: two data centers, each reached from the Internet and each containing stateful firewalls, intrusion detection, server load balancing, content caching, a high-density multilayer LAN switch, front-end and back-end application servers, a high-density multilayer SAN director and enterprise-class storage arrays. The sites are interconnected over SONET/SDH, DWDM/CWDM and IP/Metro Ethernet.]


Data Center Transport Options

Increasing distance: Data Center, Campus, Metro, Regional, National

Optical
– Dark Fiber: Sync, limited by optics (power budget)
– CWDM: Sync (2Gbps), limited by optics (power budget)
– DWDM: Sync (2Gbps lambda), limited by BB_Credits
– SONET/SDH: Sync (1Gbps+ subrate), Async

IP
– MDS9000 FCIP: Sync (Metro Eth), Async (1Gbps+)
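The "limited by BB_Credits" note can be illustrated with a back-of-the-envelope estimate: to keep a Fibre Channel link busy, the sender needs enough buffer-to-buffer credits to cover the frames in flight during one round trip. The sketch assumes full-size 2148-byte frames, about 5 microseconds of one-way delay per kilometre, and a data rate of 212.5 MB/s for 2 Gbps Fibre Channel; these inputs and the function name are assumptions made for the example.

```python
import math


def bb_credits_needed(distance_km, throughput_bytes_per_s,
                      frame_bytes=2148, fiber_delay_s_per_km=5e-6):
    """Estimate credits required to keep the link full over a given distance."""
    round_trip_s = 2 * distance_km * fiber_delay_s_per_km
    frame_time_s = frame_bytes / throughput_bytes_per_s  # time to serialize one frame
    # One credit is consumed per frame and returned one round trip later.
    return math.ceil(round_trip_s / frame_time_s)


print(bb_credits_needed(100, 212.5e6))   # 2 Gbps FC over 100 km
print(bb_credits_needed(100, 106.25e6))  # 1 Gbps FC over 100 km
```

Under these assumptions a 2 Gbps link needs roughly one credit per kilometre, so the credit pool of the switch port, not the optics, becomes the limit at longer distances. Smaller frames would require proportionally more credits.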


Data Center Replication with SAN Extension

Extend the normal reach of a Fibre Channel fabric
– Replication
– Remote host to target array
– Shared data clusters

[Figure: shared data cluster or remote host access to storage, and replication, across a SAN extension network between FC fabrics]


SAN Design for Data Replication

Servers with two Fibre Channel connections to storage arrays for high availability

Use of multipath software is required in dual fabric host design

SAN extension fabrics typically separate from host access fabrics

Replication fabric requirements generally specified by array vendor

[Figure: Site A and Site B, each with server access fabrics and replication fabrics, joined by a DC interconnect network]

Data Center Disaster Recovery Sample Design


Disaster Impact Radius

Disasters are characterized by their impact
– Local, metro, regional, global
– Fire, flood, earthquake, attack

Is the backup site within the threat radius?

[Figure: concentric impact radii around the primary data center: Local 1–2 km, Metro < 50 km, Regional < 400 km, Global, with the secondary data center and the DR site placed at increasing distance]

Active/Standby Architecture - Today

[Figure: High Availability Site 1 (CA), High Availability Site 2 (CA) and a Disaster Recovery Site (NC), with Hosts 1, 2 and 3, MDS 9509 switches, an MDS 9509 gateway per site, and Storage 1, Storage 2 and Storage 3 (bunker). Labels: HA cluster(s), electronic journaling, synchronous CWDM replication, synchronous FCIP replication, asynchronous FCIP replication, dual OC12.]

Frame Based Replication

[Figure: production cluster in Data Center 1 and D/R in Data Center 2, connected by MDS switches over dual OC12. EMC DMX arrays hold PROD, Redo and Arch volumes replicated with SRDF and SRDF/A (R1, R2 and BCV/R1 devices); TimeFinder BCVs provide point-in-time (PiT) copies. Caption: Triple Threat.]

Active/Active Architecture - Tomorrow

[Figure: request path from the user through a service locator group to two data centers. GSS performs site (DC) selection according to pre-configured conditions, using the FQDN. ACE decrypts and routes the request, and ACE probes track application health. ACNS (Content Engine) caches pages. Requests are directed to the primary application, or to the backup application. DC1: clustered backend with X active and Y standby, active data X and standby data Y. DC2: clustered backend with Y active and X standby, active data Y and standby data X. The presentation layer is mirrored and data is replicated asynchronously between the sites.]

SANTap and Continuous Data Protection

SANTap
• Appliance based storage replication
• Reliable copy of WRITE operations
• SCSI-FCIP communication

Continuous Data Protection
• Automatic and Continuous Backups
• Time Addressable Storage (TAS)
• Any Point-in-Time Recovery
• Application based or Network based

[Figure: production servers attached to an MDS SAN with SANTap, a CDP appliance, and primary and secondary storage]


Fabric Based Replication with CDP

[Figure: production cluster in Data Center 1 and D/R in Data Center 2, connected by MDS switches over dual OC12. SANTap feeds a replication/CDP appliance at each site, backed by TAS/SATA storage holding any-point-in-time (APiT) copies. EMC DMX arrays hold PROD, Redo, Arch and BCV volumes replicated with SRDF/A.]

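End-End Data Center Resilience

[Figure: corporate DNS with GSS-1 and GSS-2 in front of ACE-1, ACE-2 and ACE-3, one per data center. DC-1 and DC-2 form the primary location and DC-3 the secondary location. Each site has Web/APP server farms, DB servers and FC-attached storage; the sites are connected by an IP/optical network and by CWDM/DWDM links.]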

Summary - Design Details

Data centers 1 and 2 are in the primary location, close enough to each other to provide DC HA for active/active access

Data Center 3 (DR) is farther than the tolerable disaster radius away from primary DCs 1 and 2

Web/App server farms are load balanced geographically

DB servers are within a geo-HA cluster and running in an L3 design

Synchronous data replication is done between data centers within the primary location

Asynchronous data replication is done between the primary and secondary storage systems

