
"z/OS Multi-Site Business Continuity"
September, 2012

Robert F. Kern
E-mail: [email protected]

© Copyright IBM Corporation, 2012

Notices

Copyright © 2012 by International Business Machines Corporation.

No part of this document may be reproduced or transmitted in any form without written permission from IBM Corporation.

The information provided in this document is distributed "AS IS" without any warranty, either express or implied. IBM EXPRESSLY DISCLAIMS any warranties of merchantability, fitness for a particular purpose OR INFRINGEMENT. IBM shall have no responsibility to update this information. IBM products are warranted according to the terms and conditions of the agreements (e.g., IBM Customer Agreement, Statement of Limited Warranty, International Program License Agreement, etc.) under which they are provided. IBM is not responsible for the performance or interoperability of any non-IBM products discussed herein.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents or copyrights. Inquiries regarding patent or copyright licenses should be made, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
USA

Trademarks

The following trademarks may appear in this paper: AIX, AS/400, DS8000, Enterprise Storage Server, Enterprise Storage Server Specialist, ESCON, FICON, FlashCopy, Geographically Dispersed Parallel Sysplex, HyperSwap, IBM, iSeries, OS/390, RMF, System/390, S/390, Tivoli, TotalStorage, z/OS, and zSeries are trademarks of International Business Machines Corporation or Tivoli Systems Inc. Other company, product, and service names may be trademarks or registered trademarks of their respective companies.


Abstract

Clients look for ways to reduce their TCO, simplify operations, and provide better service to their customers. A trend in the area of Business Continuity today is that more and more clients are looking to develop multi-site Continuous Operations and D/R strategies, with the idea of regularly switching the site at which production runs. The concept of toggling between sites, or doing site flip/flops, is gaining more scrutiny. Most clients today who exploit toggling between sites do so with the full GDPS/PPRC HyperSwap functionality deployed with a Multi-Site Workload. This configuration provides the ability to switch sites in real time, with minimal interruption to the business. Another emerging trend is for clients with out of region data centers to start examining how they might also best accomplish the same business objective of switching sites, while minimizing the impact to their business during the site switch operation.


Introduction

This paper explores the various GDPS configuration deployments that clients have implemented to provide high availability/continuous operations locally and/or out of region disaster recovery protection. It also explores the trend towards trying to reduce D/R testing costs by moving toward a 'regular site switch' or 'site toggle' model. To do this, the paper will examine each of these aspects:

- 2 sites within metro/sysplex distance
  - active/active (multi-site workload) with HyperSwap and Parallel Sysplex exploitation - non-disruptive flip/flop
  - active/standby (single site workload) with HyperSwap and Parallel Sysplex exploitation - non-disruptive flip/flop possible with appropriate configuration and temporary performance impact. Applications that do not exploit sysplex incur an outage during the site move. Disruptive site switches are typically automated to minimize the outage duration.
- 2 sites beyond metro/sysplex distance or using asynchronous data replication
  - Disruptive switch, but automated to minimize outage duration
  - Active/Standby – DB2 & IMS application Disaster/Recovery at distance. Two separate Sysplexes at distance with application level Active/Standby across the two Sysplexes, utilizing application specific software based data replication technology.
- 3-site configurations & benefits
- Future vision

The traditional two site model provides for Site 1 as the "primary production" site and Site 2 as the "backup or remote recovery" site. The regular site toggle model is a peer to peer relationship model where production can run at either site, and switching sites for "business reasons" on a regular basis becomes the business norm. An active/active model that enables a site switch with minimal performance impact can be realized by clients through the following:

- sysplex enabled applications
- deployment of a multi-site workload under GDPS/PPRC with HyperSwap
- duplication of all site resources across the two sites

As distances between sites increase, data replication must switch from synchronous to asynchronous techniques to avoid application performance impacts. In addition, parallel sysplex distances are typically determined by the acceptable CF link performance for the various applications as well as the maximum STP timer distance (200 km maximum). With these types of configurations a site switch is possible, but an automated sysplex wide IPL is required. End to end automation like GDPS can minimize the outage time to perform the site switch.

This paper will discuss trends and directions in this arena for z/OS.

High Availability/Continuous Operations & Out of Region Disaster Protection

IT Infrastructure Availability can be broken down into three pieces: High Availability, Continuous Operations, and Disaster/Recovery. Each brings unique requirements when clients address Business Continuity. Through an understanding of the client's business requirements in this arena, IBM can help tailor the right solution at the right cost point for any IT infrastructure.

Chart: Business Continuity - Aspects of Availability.
- High Availability: fault-tolerant, failure-resistant infrastructure supporting continuous application processing
- Continuous Operations: non-disruptive backups and system maintenance coupled with continuous availability of applications
- Disaster Recovery: protection against unplanned outages such as disasters through reliable, predictable recovery - protection of critical business data, predictable and reliable recovery, operations that continue after a disaster, and costs that are predictable and manageable

GDPS Solutions Overview


GDPS (Geographically Dispersed Parallel Sysplex) originally shipped in 1998 and introduced the concept of multi-site IT infrastructure resource management for the Sysplex. GDPS automation extends base sysplex and Parallel Sysplex management into an end to end "server, workload, and data, with a coordinated network switch" resource management solution, within the same site or across multiple sites, providing continuous operations for clients. To accomplish this, GDPS automation interfaces with many different System z hardware & software interfaces to reduce the need for skilled personnel to perform various operations during a site switch. Some of these interfaces include:

- System z Hardware Management Console (HMC), to manage System z hardware reconfigurations dynamically (e.g., CBU, expanding LPARs, system IPLs, etc.)
- Sysplex & STP Timer interfaces
- CF Duplexing interfaces
- DS8000 data replication functions – FlashCopy, z/OS Global Mirror (XRC), Metro Mirror (PPRC), and Global Mirror
- Various z/OS system interfaces
- z/OS integration with various DS8000 synergy items

GDPS is storage vendor independent, as all major storage vendors on the System z platform can participate in solutions using their implementation of the IBM DS8000 disk storage subsystem data replication architecture of Metro Mirror, FlashCopy and zGM (XRC). New features and functions are developed with the IBM Systems Storage team on the DS8000. IBM sells the host to storage subsystem "architecture" to the other storage vendors. Those vendors then implement the feature/function on their disk subsystems based on the host to disk storage subsystem architected interfaces. So, the disk storage subsystem internal processing for a feature or function may differ from one vendor to another. Depending on the specific feature/function, there generally is some time during which the feature/function is only available on the DS8000. One should consult with each storage vendor to understand specific feature/function support for any DS8000 storage subsystem enhancement.

In addition, GDPS automation inter-operates with all major system automation packages available for System z.

Relative to Business Resiliency/Business Continuity, IBM's flagship product is GDPS. GDPS comes in a variety of different flavors/solutions. The following two charts illustrate the various solutions.


GDPS provides an entry level solution called GDPS HyperSwap Manager, focused on providing the HyperSwap availability solution for z/OS on the same data center floor or across two local data centers up to 200 km apart with Parallel Sysplex.

GDPS/PPRC HyperSwap is the full function version of HyperSwap Manager, to which clients can easily upgrade. The full function GDPS/PPRC HyperSwap supports zVM and zLinux data along with z/OS data. In addition to masking disk subsystem failures, the full function version exploits Parallel Sysplex to mask CEC failures, persistent sessions to coordinate a network switch, CF Duplexing to manage CF structure failures, and VTS PtP to mask tape subsystem failures. Finally, if the failures evolve into a disaster scenario, GDPS provides a complete end to end site failover/fallback capability for both planned and unplanned site switches. One mouse click, and the server, data, workload, and a coordinated network site switch are performed via automation. All data is recovered, the Sysplex is IPLed, and the databases are restarted, followed by the applications. Skilled personnel are no longer required to get the Sysplex up and running in the event of a disaster.

GDPS/GM (System z & Open Systems data) and GDPS/XRC (z/OS & zLinux only) provide site failover/failback (FO/FB), typically "out of region," exploiting IBM's Global Mirror and zGM (XRC) data replication technologies.

GDPS/MzGM and GDPS/MGM provide a combination of high availability/continuous operations locally, coupled with out of region D/R protection. All GDPS solutions are fully automated, proven, auditable, and, in the case of PPRC and zGM (XRC), storage vendor independent!


The various GDPS solutions also support zVM and zLinux data through a feature called x/DR.

The GDPS System z umbrella also includes the ability for GDPS automation to inter-operate with System p, x, i (Linux), Windows, HP, and Sun through the GDPS/DCM (Distributed Cluster Manager) automation "inter-operability code" feature, which works in conjunction with Tivoli System Automation Application Manager (SA AppMan) and/or the Symantec Veritas Cluster Server solutions. With GDPS and the x/DR and/or DCM features, a single mouse click can yield a coordinated site failover/fallback of all of the customer's systems (e.g., System z (z/OS, zLinux, zVM) coordinated with, say, System p AIX systems). The disk replication functions can be managed separately with the GDPS and DCM automation or together, depending on the client's requirements for cross platform data consistency.

GDPS is built upon the IBM DS8000 storage based data replication architecture for FlashCopy, Metro Mirror, z/OS Global Mirror, and Global Mirror. As new features and functions are implemented in the DS8000, GDPS automation is modified to exploit those features and functions. In addition, GDPS supports various DS8000 base box features used in conjunction with the various advanced functions.

IBM DS8000 Metro Mirror and Global Mirror support a function known as 'Open LUN Support', such that through an ECKD device address, GDPS automation is able to manage the Metro Mirror and/or Global Mirror functions for distributed system LUNs. This is also true for Metro Global Mirror configurations. With Open LUN Support, GDPS can provide a single restart point across the platforms. More systems and data replication alternatives will continue to be provided in the future, based on client requirements. This is especially important for clients that have multi-platform applications, where transactions are, for example, initially received by a Windows system, then routed to, say, an AIX system, and then to the "backend" z/OS system. Each system may save data; as a result, to recover the "application," multiple platforms must be recovered to the same point in time. GDPS inter-operability with Tivoli AppMan and/or Symantec Veritas Cluster Server can provide such a solution for clients.


Open LUN Support is also important for clients with applications like SAP, where the user interfaces typically run on non-System z platforms and the backend database runs on z/OS. In some cases clients have moved the application parts that were running on non-System z platforms to zLinux, but many clients resist introducing the risk of any change to critical production applications that have been running for some time. Open LUN Support can provide a data consistency solution for multi-platform applications. All data is recovered to a single point in time, enabling each platform's database to perform a database Restart operation instead of a database Recover operation when a site switch occurs. The database restart process manages all "in flight" and "in doubt" transactions, which in turn permits the application parts spread across the different platforms to resume processing forward from the restarted point in time. GDPS automation, when combined with the DCM automation feature, can inter-operate across the enterprise to provide a complete business solution for clients in the area of IT business continuity. This critical business function is made possible by the DS8000 Open LUN Support.
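As a rough illustration of why a common restart point matters, here is a minimal sketch (plain Python, a hypothetical model; not GDPS, DS8000, or any product behavior) of the dependent-write problem: one transaction writes first on a distributed front end and then on z/OS, and if the two mirrors are not frozen at the same consistency point, the recovery-site copies can contain the second write without the first, forcing a database Recover instead of a Restart.

# Illustrative sketch only: hypothetical model, not GDPS or DS8000 behavior.
# A transaction writes step A on AIX and then step B on z/OS. The recovery-site
# copies must never contain B without A, or the databases cannot simply Restart.
class MirroredVolume:
    def __init__(self, name):
        self.name = name
        self.primary = []      # writes at the production site
        self.secondary = []    # writes replicated to the recovery site
        self.frozen = False

    def write(self, record):
        self.primary.append(record)
        if not self.frozen:            # replication stops once the pair is frozen
            self.secondary.append(record)

def freeze(volumes):
    """Freeze a set of mirrored volumes together, as one consistency group."""
    for v in volumes:
        v.frozen = True

def run(freeze_together: bool) -> str:
    aix = MirroredVolume("AIX_SAP_data")
    zos = MirroredVolume("zOS_DB2_data")
    if freeze_together:
        freeze([aix, zos])             # one common consistency point for all platforms
    else:
        freeze([aix])                  # staggered: AIX pair frozen first...
    aix.write("txn42: step A (front end)")
    zos.write("txn42: step B (backend DB2)")   # ...while z/OS keeps replicating
    if not freeze_together:
        freeze([zos])
    # Dependent-write check: step B must never appear at the recovery site without step A.
    has_a = any("step A" in r for r in aix.secondary)
    has_b = any("step B" in r for r in zos.secondary)
    return "consistent - Restart possible" if (has_a or not has_b) else "B without A - must Recover"

print(run(freeze_together=True))    # consistent - Restart possible
print(run(freeze_together=False))   # B without A - must Recover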


Two Local Data Centers - 2 sites within metro/sysplex distance

The full GDPS/PPRC HyperSwap implementation can be configured as an active/active "multi-site workload" or an active/standby "single-site workload," providing real time planned and unplanned site switches through deployment of the following features/functions (a conceptual sketch of a planned switch follows the list):

- Parallel Sysplex – permits the movement of a workload from one processor at Site 1 to an alternate CEC in Site 2.
- Sysplex enabled applications (required for multi-site workloads).
- HyperSwap – permits disk access to switch from the Metro Mirror primary volume(s) to the target volume(s) and reverses the mirror, without an IPL of the Parallel Sysplex.
- VTS Peer to Peer Tape configuration – permits real time tape mirroring across multiple physical tape libraries without interrupting operations.
- Multiple Sysplex Timers – permit timer switches in real time.
- CF Duplexing – permits the switching of data structure access in real time.
- Persistent sessions – enable real time network switches.
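To make the HyperSwap item above more concrete, here is a minimal sketch (plain Python with hypothetical object names; not the actual GDPS/PPRC interfaces or z/OS I/O services) of the idea behind a planned HyperSwap: I/O is briefly quiesced, the binding from the logical device to the physical volume is swapped from the Metro Mirror primaries to the secondaries, the mirror direction is reversed, and I/O resumes without an IPL.

# Conceptual sketch only: hypothetical objects, not GDPS/PPRC or z/OS internals.
# Applications keep addressing the same logical device; only the binding
# "logical device -> physical volume" is swapped and the mirror is reversed.
class MetroMirrorPair:
    def __init__(self, site1_volume, site2_volume):
        self.primary = site1_volume        # where application I/O lands today
        self.secondary = site2_volume      # synchronous copy at the other site

    def reverse(self):
        """Reverse the replication direction so the old primary becomes the target."""
        self.primary, self.secondary = self.secondary, self.primary

class LogicalDevice:
    """Stands in for the device the application addresses (a UCB-like binding)."""
    def __init__(self, pair: MetroMirrorPair):
        self.pair = pair

    def write(self, data):
        # Synchronous mirroring: the write completes on both copies.
        self.pair.primary.append(data)
        self.pair.secondary.append(data)

def planned_hyperswap(devices):
    # 1. Briefly quiesce I/O (not shown), 2. swap/reverse every pair as a group,
    # 3. resume I/O -- the applications never see a different device address.
    for dev in devices:
        dev.pair.reverse()

site1_vol, site2_vol = [], []
dev = LogicalDevice(MetroMirrorPair(site1_vol, site2_vol))
dev.write("update 1")              # lands on Site 1, mirrored to Site 2
planned_hyperswap([dev])           # production disk access now targets Site 2
dev.write("update 2")              # lands on Site 2, mirrored back to Site 1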

Some customer applications have affinities (e.g., all transactions of a given type must be routed to a specific system, one transaction passes information on to the next transaction, etc.). A sysplex enabled application requires that all affinities be removed, so that a transaction can be routed to, and execute on, any clone of the application on any system in the sysplex. When this is done, the application can be run in an active/active, multi-site workload configuration. Transactions can be distributed to run on any system within the Sysplex, independent of their physical location.

Through GDPS automation, more and more clients perform both planned and unplanned site switches on a regular basis. Planned site switches are used to minimize the production risks associated with site or equipment maintenance. Once a lights out data center opens its doors for maintenance operations, the possibility exists for production impacts. These can be minimized by switching production to the alternate site in real time with a multi-site workload configuration. Providing the ability for a client to exploit this type of operational functionality has spurred clients to think of new approaches and new business exploitations of the technology.


Chart: GDPS/PPRC - a Continuous Availability and/or Disaster Recovery Solution at Metropolitan Distance (Site 1 and Site 2 connected through the network). GDPS/PPRC:
- Manages the multi-site Parallel Sysplex, processors, CBU, CFs, and couple data sets
- Manages disk remote copy (System z & Open LUN)
- Manages tape remote copy (PtP VTS)
- Exploits the HyperSwap & FlashCopy functions
- Automates planned and unplanned actions (z/OS, CF, disk, tape, site) for planned and unplanned exception conditions
- Improves availability of heterogeneous System z business operations

The above chart shows a high-level view of the GDPS/PPRC topology. The physical topology of a GDPS/PPRC consists of a base or Parallel Sysplex cluster spread across two sites (known as Site 1 and Site 2), with one or more z/OS systems at each site, separated by up to 200 kilometers (km). The multi-site sysplex must be configured with redundant hardware (e.g., a Coupling Facility and a Sysplex Timer in each site), and the cross site connections (typically dedicated or 'dark' fibre) must be redundant. All critical data is mirrored from the primary site (Site 1 in this chart) to the secondary site (Site 2). All shared CF structures are located on the primary site coupling facilities. Therefore, when transactions are executed on the processors at the remote site, disk I/O and shared CF structure access go through links from the secondary site to the primary site, and the disk I/O and CF structure updates are then mirrored in a synchronous manner back to the remote site. This adds additional overhead to the application's disk I/O as well as to any access to shared CF structures. Before a customer elects to deploy a multi-site configuration, the customer must first ensure that the applications are sysplex enabled, after which careful consideration must be given to the system & application performance impacts of these two accesses when a transaction is executed at the remote site. In many cases the application performance impact will limit the effective distance that an active/active configuration can actually sustain.

For disk I/O, the rule of thumb for the performance impact of Metro Mirror is:

1. Disk subsystem overhead of Metro Mirror at zero distance, plus
2. speed of light through dedicated "dark" fibre for a single protocol exchange (a linear function of 1 ms/100 km, or 0.1 ms/10 km), times
3. the number of protocol exchanges implemented in the specific Metro Mirror disk-to-disk implementation (for IBM DS8000 Metro Mirror, a single protocol exchange is accomplished through a feature called pre-deposit write), plus
4. other device overheads that may be on the fibre path (e.g., switches, DWDMs, compression and/or encryption devices, channel extenders, etc.).

For CF signal latency, the rule of thumb is:

- Signal latency impact (round trip) = 10 us/km x fibre distance (km) x number of protocol exchanges
- Example: assume two sites separated by 10 km and a processor in Site 1 accessing disk in Site 2; signal latency impact = 10 us/km x 10 km x 1 (FICON has one protocol exchange), or a 100 us impact
- Terminology:
  ► Kilometer (km) – one km equals 5/8 mile
  ► Millisecond (ms) – 10^-3 seconds
  ► Microsecond (us) – 10^-6 seconds

For most clients, the impact of CF signal latency beyond 40-50 km (25-30 miles) yields too great an application impact. Because of this, GDPS/PPRC multi-site implementations typically tend to be at campus or metro distances.
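As a quick illustration, the following sketch (plain Python; the zero-distance overhead value used in the example is a placeholder for illustration, not a published figure) turns the two rules of thumb above into a small calculator.

# Back-of-the-envelope calculator for the two rules of thumb above.
# Signal propagation in fibre is taken as 10 microseconds per km round trip
# (i.e., 1 ms per 100 km), per the rule of thumb; other overheads are placeholders.
US_PER_KM_ROUND_TRIP = 10.0   # microseconds per km, round trip, per protocol exchange

def metro_mirror_write_impact_us(distance_km: float,
                                 protocol_exchanges: int = 1,
                                 zero_distance_overhead_us: float = 0.0,
                                 path_device_overhead_us: float = 0.0) -> float:
    """Added write latency (us) = zero-distance MM overhead
    + (10 us/km * distance * protocol exchanges) + other path device overheads.
    DS8000 Metro Mirror uses a single protocol exchange (pre-deposit write)."""
    distance_term = US_PER_KM_ROUND_TRIP * distance_km * protocol_exchanges
    return zero_distance_overhead_us + distance_term + path_device_overhead_us

def signal_latency_us(distance_km: float, protocol_exchanges: int = 1) -> float:
    """Round-trip signal latency impact = 10 us/km * distance * protocol exchanges."""
    return US_PER_KM_ROUND_TRIP * distance_km * protocol_exchanges

# The example above: 10 km, one FICON protocol exchange -> 100 us impact.
print(signal_latency_us(10, protocol_exchanges=1))        # 100.0

# Metro Mirror write at 50 km with a single pre-deposit-write exchange,
# assuming (for illustration only) 200 us of zero-distance overhead.
print(metro_mirror_write_impact_us(50, 1, zero_distance_overhead_us=200.0))  # 700.0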

If customer applications are not sysplex enabled, and/or the application performance impact of a multi-site configuration is too great, then the choice for these clients becomes GDPS/PPRC w/HyperSwap in a single-site (active/standby) configuration. In this configuration, all hardware can be duplicated across the two sites. The secondary site processor will typically run the GDPS control system, referred to as the K-sys. Both a planned and an unplanned site switch will involve the re-IPL of all systems in the Sysplex at the recovered site, after automation has recovered and switched all dependent resources.

GDPS/PPRC prerequisites include NetView and System Automation for z/OS. GDPS automation also interacts with any existing automation products. With a multi-site Parallel Sysplex, this provides a Continuous Availability/Continuous Operations and a Disaster Recovery solution. In addition, GDPS provides a set of panels for standard actions as well as the ability to customize scripts for an installation.

GDPS/PPRC multi-site sysplex: At least one system in Site 2 is in the Site 1 production Sysplex. All production can run in Site 1 with the GDPS "K-sys" running in Site 2, or production can run in either or both of Site 1 and Site 2. Sysplex Timers and CFs are in both sites. Two fiber trunks (for availability) are recommended to connect the sites. For unplanned reconfigurations, system failures, and processor failures, systems can be restarted in place or at the other site, depending upon how they are defined.


GDPS/PPRC single-site sysplex: All production images run at the primary site. The GDPS "K-sys" typically runs at Site 2, and all resources are typically available at both sites. Sysplex Timers and CFs are in both sites. Two fiber trunks (for availability) are recommended to connect the sites.

The following outlines the typical resources available at each site for GDPS/PPRC w/HyperSwap:

- Base Sysplex or Parallel Sysplex environment
- Manages unplanned reconfigurations
  - z/OS, CF, disk, tape, & coordinates network connections
  - Designed to maintain data consistency and integrity across all volumes
  - Fast, automated site failover
  - No or limited data loss
- Single point of control for:
  - Standard actions: Stop, Remove, IPL system(s)
  - Parallel Sysplex configuration management: couple data set (CDS) and Coupling Facility (CF) management
  - User defined scripts (e.g., Planned Site Switch)
  - PPRC configuration management

2 Sites Beyond Metro/Sysplex Distance

GDPS solutions beyond metro/Sysplex distance include GDPS/XRC and GDPS/GM. Clients select either the XRC or GM data replication technique based on their specific requirements. XRC provides the lowest possible RPO and supports only z/OS and zLinux data. Global Mirror provides a tunable RPO (3-5 seconds to 18 hours) and supports all System z and distributed systems data.

With asynchronous data replication solutions, a site switch will require an automated Sysplex wide IPL. Asynchronous data replication can support a "Planned Site Switch" with no loss of data, but to do this the applications must be shut down. Storage based data replication technology today supports planned site failover/failback scenarios such that only changed data need be copied back to resync the sites. This capability is available today with the various flavors of GDPS 2-site and 3-site solutions. But in each case a Sysplex wide IPL, a data replication disk/tape switch, and a client end user network switch must be done in a coordinated manner. In this way the Sysplex is restarted, as well as all databases and application workloads, at the remote site. When the various databases are restarted, "in flight" and "in doubt" transactions are resolved, and any and all coupling facility structures are rebuilt.
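The coordinated sequence just described can be summarized as an ordered checklist. The sketch below is a conceptual outline in Python (hypothetical step names and driver; not GDPS script syntax or any product interface) of what a planned out-of-region switch has to orchestrate.

# Conceptual outline of a planned site switch over asynchronous replication.
# Step names are illustrative only; this is not GDPS script syntax or a product API.
PLANNED_SWITCH_STEPS = [
    "Quiesce applications and databases at the current production site",
    "Drain asynchronous replication so no data is lost (planned switch only)",
    "Switch disk and tape replication roles to the remote site",
    "IPL the Sysplex (automated, Sysplex wide) at the remote site",
    "Restart databases: resolve 'in flight' and 'in doubt' transactions, rebuild CF structures",
    "Restart application workloads",
    "Switch the end user network to the remote site",
]

def run_planned_switch(execute_step):
    """Run each step in order; stop on the first failure so operators can intervene."""
    for number, step in enumerate(PLANNED_SWITCH_STEPS, start=1):
        print(f"Step {number}: {step}")
        if not execute_step(step):
            raise RuntimeError(f"Site switch halted at step {number}: {step}")

# Example: a dry run where every step "succeeds".
run_planned_switch(lambda step: True)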

If a "planned outage" can be tolerated by the client, then switching sites on a regular basis can help to minimize the costs involved with D/R testing. Planned site switches verify that all the resources required to run the application are available in both sites. This can then be fully tested to ensure that enough capacity (processor, storage, network, etc.) is available at both sites for any and all combinations of the workload. In addition, the client is testing the complete production application(s) end to end. Often, traditional D/R tests only verify that the 'system platform' can be IPLed and, based on the time available, execute some minimal subset of the production workload. The best D/R test can be executed by a site switch that in fact leaves production running in each site for a reasonably long period of time (e.g., 3-6 months). During this time, the application typically goes through various periods of the business cycle, including end of day, end of week, and end of quarter processing. Through careful planning, one can eventually verify that all application processing can be executed independent of site.

This approach fits into some business models better than others. In some countries a physical site utility check is required once a year, which requires a full electrical shutdown. A site switch to the other production site may therefore be easier in this environment, as the outage is limited to the time to perform the site switch and have the application(s) back up and running, rather than also including the time to verify all utilities at the original production site.

The simple approach to ensure that a client can easily switch sites and run all applications with similar performance, scalability, and capacity growth is to duplicate all hardware and software resources across both sites. If a client has currently deployed a 3-site GDPS configuration with GDPS/PPRC HyperSwap locally at the production site, one would also want to deploy the same configuration at the target sister production site. This would typically be called a 4-site configuration.

The emerging thought is that money currently spent on disaster recovery testing could be decreased if one could provide, on a regular basis, the ability to switch back and forth across sites in an automated fashion. When implemented, planned site switches provide this function. That means that D/R testing need only verify that the unique automation required to perform a site switch in an unplanned scenario also works. Customers today minimize the differences between planned and unplanned site switch scenarios by deploying the "test the way we recover and recover the way we test" model. Typically, several clients' D/R testing is done at the remote site while maintaining full D/R protection. This is done by making a PiT FlashCopy of the data and performing all D/R testing against that copy of the data. When a disaster occurs, as part of the recovery process, a FlashCopy of the data is created and used for the D/R recovery process. This minimizes the unique actions between a planned and an unplanned site failover scenario.


In both the planned and unplanned site switch scenarios, GDPS automation can minimize the duration of the outage, that is, the RTO. GDPS automation can also help to minimize the risk of performing a site switch, as the automation is proven, repeatable, and minimizes human errors. The Recovery Time Objective is a measure of the time from when a planned or unplanned site switch is identified until all applications are up and running at the remote site. A key benefit of GDPS automation is that, once implemented, the RTO is a known, proven, repeatable quantity.

GDPS/Active/Standby - Application by Application Availability

If all of a client's application data is within a single database (DB2 and/or IMS), clients can implement high availability across two sites on an application by application basis, rather than managing high availability/disaster protection on a platform basis.

GDPS/Active/Standby automation enables automated 'application level' site switches that typically provide an RTO on the order of seconds to minutes. Clients use DB2 to DB2 software data replication with IBM Tivoli InfoSphere Replication Server for z/OS and/or IMS to IMS software data replication with IBM Tivoli InfoSphere Classic Replication for z/OS. In this case the DB2/IMS log entries are replicated between sites by DB2/IMS. An active z/OS image with a copy of the DB2/IMS database is running at the remote site, and all DB2/IMS updates are applied when received. In the event of a disaster or a planned site switch for this application, the end user network is switched to route active transactions to the remote site for processing, with minimal data loss. The routing of transactions is managed by the IBM Workload Distributor software.
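To illustrate the log-capture/apply pattern behind this kind of software replication, here is a minimal sketch (generic Python; not the interfaces of the replication products named above) of an active site capturing committed log entries and a standby site applying them as they arrive, with the RPO being the captured-but-not-yet-applied work.

# Minimal sketch of software (log based) replication between an active and a
# standby copy of a database. Generic illustration only, not a product interface.
from collections import deque

class ActiveDatabase:
    def __init__(self):
        self.data = {}
        self.log = deque()            # committed changes waiting to be shipped

    def commit(self, key, value):
        self.data[key] = value
        self.log.append((key, value)) # capture: committed log entry to replicate

class StandbyDatabase:
    def __init__(self):
        self.data = {}

    def apply(self, entry):
        key, value = entry            # apply: replay the change at the remote site
        self.data[key] = value

def replicate(active: ActiveDatabase, standby: StandbyDatabase, batch: int = 100) -> int:
    """Ship and apply up to `batch` captured log entries; what remains is the RPO."""
    shipped = 0
    while active.log and shipped < batch:
        standby.apply(active.log.popleft())
        shipped += 1
    return len(active.log)            # entries not yet applied (potential data loss)

active, standby = ActiveDatabase(), StandbyDatabase()
active.commit("acct:42", 100)
active.commit("acct:42", 175)
pending = replicate(active, standby)
assert standby.data["acct:42"] == 175 and pending == 0
# On a site switch, the network routes new transactions to the standby copy,
# which is behind by at most the pending (unapplied) log entries.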

This approach typically also requires the client to implement a strict change control process across all systems, to ensure that the various system components are always updated in step and the z/OS images stay in sync.


3-Site Configurations

Several clients with an out of region D/R implementation, or with high availability locally, have moved to a 3-site configuration by implementing either GDPS/MzGM w/HyperSwap or GDPS/MGM w/HyperSwap. These configurations provide 'local' high availability/continuous operations environments, supporting local real time planned site switch scenarios as well as site failover/failback functionality for a local site disaster with an RPO of zero. Some clients implement their second 'local' site on the same data center floor, or across a fire wall on the same data center floor. A few customers have implemented just HyperSwap locally, to prevent a disk subsystem failure from causing a Sysplex wide outage. In all cases, the implementation focus was on increasing the availability of IT to the business locally or adding out of region D/R protection. One key cost component in developing a multi-site solution is the duplication of the client's end user network. Depending on the complexity and cost associated with replicating the end user network, several clients prefer to implement a '3-site' solution across only two physical sites.

At this time, IBM has deployed some 80+ GDPS/MzGM w/HyperSwap or GDPS/MGM w/HyperSwap multi-site configurations. The following figures outline these implementations.


Figure: GDPS/MzGM w/HyperSwap & Incremental Resync. Metro Mirror replicates the A-disk in Site 1 to the B-disk in Site 2 (with production systems, K-systems, CFs, and ETR or STP spanning the two sites), while z/OS Global Mirror (via the SDM) replicates A to the C-disk at the recovery site, where FlashCopy is recommended.
- Data replication A->B & A->C
- Incremental resynch B->C if Site 1 or the A-disk fails
- Maintains disaster recovery position
- Improved RTO
- Optional: CFs / production systems in Site 2

The standard GDPS/MzGM HyperSwap with Incremental Resync configuration enables Metro Mirror data replication from A -> B with HyperSwap and z/OS Global Mirror data replication from A -> C. On an A->B HyperSwap event, Incremental Resynchronization for GDPS MzGM enables the z/OS Global Mirror session to be reestablished from A->C to B->C. GDPS manages the z/OS Global Mirror sessions so that only changed tracks need to be sent to the recovery site, instead of requiring a full-volume copy to reestablish the disaster recovery copy. This can greatly reduce the time required (in some cases from hours down to minutes) to reconnect to the remote site, reducing the risk of not being protected.
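A minimal sketch of the "only changed tracks" idea follows (generic Python, using a simple change-recording set as a stand-in for the bitmap-style change recording that incremental resynchronization relies on; it is not the DS8000 or GDPS implementation).

# Sketch of incremental resynchronization: after a swap, only the tracks that
# changed since the last consistent point are copied to the recovery site,
# instead of every track on the volume. Generic illustration, not DS8000/GDPS code.
class Volume:
    def __init__(self, tracks: int):
        self.tracks = [b""] * tracks
        self.changed = set()          # stand-in for a change-recording bitmap

    def write(self, track: int, data: bytes):
        self.tracks[track] = data
        self.changed.add(track)       # remember which tracks differ from the remote copy

def full_copy(source: Volume, target: Volume) -> int:
    """Baseline: copy every track (what a full-volume re-establish would cost)."""
    target.tracks = list(source.tracks)
    source.changed.clear()
    return len(source.tracks)

def incremental_resync(source: Volume, target: Volume) -> int:
    """Copy only the tracks recorded as changed since the last consistent point."""
    for track in source.changed:
        target.tracks[track] = source.tracks[track]
    sent = len(source.changed)
    source.changed.clear()
    return sent

b_disk, c_disk = Volume(tracks=1_000_000), Volume(tracks=1_000_000)
full_copy(b_disk, c_disk)                 # initial establish: 1,000,000 tracks

b_disk.write(7, b"update after the HyperSwap")
b_disk.write(4242, b"another update")
print(incremental_resync(b_disk, c_disk)) # 2 tracks sent instead of 1,000,000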


Figure: GDPS/MGM w/HyperSwap. Metro Mirror replicates the A-disk in Site 1 to the B-disk in Site 2, and Global Mirror replicates B to the C-disk at the recovery site, where FlashCopy is recommended; ETR or STP spans the sites.
- GM K-sys runs in a production LPAR
  ► HyperSwap protection
- Reduced resource requirement

The standard GDPS/MGM w/HyperSwap configuration provides data replication from A->B->C. The GDPS/GM K-sys (Kg) can run in a GDPS/PPRC production system, reducing the number of z/OS images required for an MGM configuration:

- Incremental resync A->C if Site 2 or the B-disk fails
- Requires A->C bandwidth
- GDPS/GM K-sys runs in a production system
- HyperSwap protection for the GDPS/GM K-sys
- Reduced resource requirement
- Maintains the disaster recovery position following a resync
- Improved RPO

The Kg system lives in P2, which is a production system. It runs GDPS/PPRC in one NetView; in another NetView it runs the GDPS/GM K-sys function. The P2 disk is PPRCed and protected by HyperSwap, including any disk related to the "Kg system function". P2 can live in either Site 1 or Site 2 and has the Kg system as its parasite; when you move P2, the Kg system function moves with it.


3-site configurations provide additional options, as well as considerations, when performing site switches:

1. If the two local sites are physically separated for both high availability and local D/R protection, when a remote site switch occurs is it still a requirement to have two local sites physically split at that location as well? The alternative would be to have two logical sites within the same physical site, perhaps separated by a physical fire wall. In the site toggle model this consideration may be very different than if the remote site is only used in the event of a disaster. In the disaster site scenario, when a disaster occurs, high availability may be added to that site after the business is back up and running again. The site toggle model views all sites as 'production ready' sites, whereas the disaster/recovery site model views the remote site as only being used in the event of a disaster. Both models are valid, and the choice really varies based on the client's business requirements.

2. A complete understanding of the various fallback scenarios, and of the additional copies of the disk required to support each of these scenarios, should be developed for both the GDPS/MzGM and the GDPS/MGM options.

3. As mentioned above, end user network connectivity to each data center can definitely influence the costs associated with the ultimate solution.

A recognized customer requirement in this area is to provide the exact same functionality at the target site (high availability plus disaster recovery protection) on both a planned and, when possible, an unplanned site switch: that is, the ability to use asynchronous data replication back to the original production site as well as providing local HyperSwap functionality. With this functionality, both sites provide equal functionality to the business and enable a peer site configuration.

Distributed Systems

As mentioned earlier in this paper, with the GDPS/DCM capability, GDPS automation can inter-operate with either Tivoli AppMan or Veritas Cluster Server to provide end to end automated management of various distributed platforms in 2-site or 3-site configurations. Cross system data consistency can also be provided via the DS8000 Open LUN Support. With this function, GDPS can provide a common restart point across all z/OS and distributed systems data. Today, high availability of data is provided through distributed systems software mirroring, typically called LVM mirrors. Data availability for disaster recovery can be provided through hardware and software based data replication functions. Functionality in this arena will continue to evolve as clients develop more and more cross platform applications.

Future Vision

The evolution from a single server into an enterprise wide Business Continuity solution has followed a clear path: single servers became clustered servers, and clustered servers then spanned physical sites. This was then extended to end to end multi-site heterogeneous clusters, followed by integrated end to end multi-site clusters. The emerging trend for z/OS is next toward multiple application level Active/Active sites at distance, coupled with the traditional platform based high availability and disaster/recovery solutions.


Conclusion

The requirements for real time high availability, continuous operations, and disaster recovery for z/OS, as well as for distributed systems, continue to push IBM to provide 24x7 computing environments with superior business resilience functionality.

New Smarter Planet applications typically deal with real time data that needs to be captured, stored, and analyzed in real time on a 24x7 basis. These applications and volumes of data also introduce new requirements in scalability, as well as challenges in total cost of ownership. The management of IT operations across a single site or multiple sites, locally or at distance, presents the opportunity to optimize all compute resources to maximize their utilization, as well as to enable them to meet the business requirements of end user clients today and tomorrow. The ultimate goal is to enable applications and their platforms to be virtualized and run across physical data centers located around the world. The z/OS platform, coupled with GDPS automation, has become the leading edge of general purpose solutions toward this end.


Author

Bob Kern - IBM Advanced Technical Support, Americas ([email protected]). Mr. Kern is an IBM Master Inventor & Executive IT Architect. He has 36 years of experience in large system design and development and holds numerous patents in storage related topics. For the last 28 years, Bob has specialized in disk device support and is a recognized expert in continuous availability, disaster recovery, and real time disk mirroring. He created the DFSMS/MVS subcomponents for Asynchronous Operations Manager and the System Data Mover. Bob was named a Master Inventor by the IBM Systems & Technology Group in 2003 and is one of the inventors of Concurrent Copy, PPRC, XRC, GDPS, and zCDP solutions. He continues to focus in the disk storage architecture area on HW/SW solutions for Continuous Availability and Data Replication. He is a member of the GDPS core architecture team and the GDPS Customer Design Council, with a focus on storage related topics.