The Importance of Being Consistent: DB2 for z/OS and Copy Services for IBM System z
Florence Dubois
IBM DB2 for z/OS Development
Session Code: F09
Thursday 5th May, 2011 8:30AM-9:30AM | Platform: DB2 for z/OS
This presentation will tell you everything you need to know about the Copy Services for
IBM System z (DASD replication functions) and what is required to ensure data
consistency.
2
Agenda
• Introduction
• IBM Remote Copy Services
• Metro Mirror
• z/OS Global Mirror
• Global Copy
• Global Mirror
• DB2 Restart Recovery
• Tune for fast restart
• Optimise GRECP/LPL recovery
• FlashCopy
• FlashCopy and DB2
• FlashCopy and Remote Copy Services
• Conclusion
Objectives:
Introduce and compare Metro Mirror (PPRC), z/OS Global Mirror (XRC), Global Copy
and Global Mirror
Address the most common myths and misconceptions about these solutions
Discuss important concepts and functions such as Rolling Disaster, Consistency Groups,
HyperSwap and FREEZE policies (GO|STOP)
Provide hints and tips on how to tune for fast DB2 restart and how to optimise
GRECP/LPL recovery
Look at how DB2 uses FlashCopy, and discuss all the gotchas of combining FlashCopy
and Remote Copy Services
3
Introduction
• Everything should start with the business objectives
• ‘Quality of Service’ requirements for applications
• Availability
• High availability? Continuous operations? Continuous availability?
• Restart quickly? Mask failures?
• Performance
• In case of a Disaster
• Recovery Time Objective (RTO)
• How long can your business afford to wait for IT services to be resumed following a disaster?
• Recovery Point Objective (RPO)
• What is the acceptable time difference between the data in your production system and the
data at the recovery site (consistent copy)?
• In other words, how much data is your company willing to lose/recreate following a disaster?
• Need to understand the real business requirements and expectations
• These should drive the infrastructure, not the other way round
4
Introduction …
• Dependent writes
• The start of a write operation is dependent upon the completion of a previous write
to a disk in either the same storage subsystem or a different storage subsystem
• For example, typical sequence of write operations for a database update transaction:
1. An application makes an update and the data page is updated in the buffer pool
2. The application commits and the log record is written to the log device on storage subsystem 1
3. The update to the table space is externalized to storage subsystem 2
4. A log record is written to mark that the table space update has completed successfully
• Consistency
• Preserve the order of dependent writes
• For databases, consistent data provides the capability to perform a database restart
rather than a database recovery
• Restart can be measured in minutes while recovery could be hours or even days
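The dependent-write sequence above can be sketched as a toy simulation (plain Python, nothing DB2-specific; the device and record names are invented for illustration): if all writes reach the remote site in order, the copy is restartable; if the data-page write is lost while a later log write survives, the copy needs recovery, not restart.

```python
# Toy model of the dependent-write problem: a log record must reach disk
# before the "update complete" record that depends on it. If replication
# drops or reorders these writes, the remote copy can claim a completed
# update whose data never arrived.

def replay(writes):
    """Apply a sequence of (device, record) writes to an empty 'remote site'."""
    state = {}
    for device, record in writes:
        state.setdefault(device, []).append(record)
    return state

def is_consistent(state):
    # Consistent if the log never claims a data write that is missing
    log = state.get("log", [])
    data = state.get("data", [])
    if "update-complete" in log and "page-update" not in data:
        return False
    return True

# Dependent-write order preserved: log commit, write data, log completion
ordered = [("log", "commit"), ("data", "page-update"), ("log", "update-complete")]

# A rolling disaster drops the data write but a later log write survives
reordered = [("log", "commit"), ("log", "update-complete")]

print(is_consistent(replay(ordered)))    # True  -> restartable
print(is_consistent(replay(reordered)))  # False -> needs recovery, not restart
```

The point of the toy model is only the ordering rule: any replication scheme that preserves the order of dependent writes across all devices yields a restartable copy.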
5
Metro Mirror
• a.k.a. PPRC (Peer-to-Peer Remote Copy)
• Disk-subsystem-based synchronous replication
• Limited distance
• Over-extended distance can impact the performance of production running on the primary site
[Diagram: Metro Mirror synchronous write sequence, Host I/O to primary (P) mirrored to secondary (S)]
(1) Write to primary volume (disk subsystem cache and NVS)
(2) Metro Mirror sends the write I/O to secondary disk subsystem
(3) Secondary disk subsystem signals write complete when the updated data is in its cache and NVS
(4) Primary disk subsystem returns Device End (DE) status to the application
6
Metro Mirror …
• Misconception #1: Synchronous replication always guarantees data
consistency of the remote copy
• Answer:
• Not by itself…
• Metro Mirror operates at the device level (like any other DASD replication function)
• Volume pairs are always consistent
• But in a rolling disaster, cross-device (or cross-box) consistency is not guaranteed
• An external management method is required to maintain consistency
7
Metro Mirror …
• Traditional example of multiple disk subsystems
• In a real disaster (fire, explosion, earthquake), you cannot expect your complex
to fail all at the same moment. Failures will be intermittent and gradual, and the
disaster will play out over seconds or even minutes. This is known as the Rolling Disaster.
[Diagram: two primary disk subsystems (P1-P4) mirroring to secondary disk subsystems (S1-S4)]
• A network failure causes inability to mirror data to the secondary site
• Devices suspend on one disk subsystem, but writes are allowed to continue (*)
• Updates continue to be sent from the other disk subsystem
• As a result, devices on the secondary disk subsystems are not consistent
(*) Volume pairs are defined with CRIT(NO) - Recommended
8
Metro Mirror …
• Example of suspend event for a single volume
[Diagram: volume pairs P1-S1 and P2-S2]
1) A temporary communications problem, e.g. a network or SAN event, causes the P1-S1 pair to suspend
2) During this time no I/O occurs to P2, so it remains duplex
3) Subsequent I/O to P2 will be mirrored to S2
4) S1 and S2 are now not consistent; if the problem was the first indication of a primary-site disaster, we are not recoverable
9
Metro Mirror …
• Consistency Group function combined with external automation
[Diagram: volume pairs P1-S1 and P2-S2 managed as a Consistency Group]
1) A network failure causes inability to mirror data to the secondary site
2) The volume pair defined with CGROUP(Y) that first detects the error goes into an 'extended long busy' state and message IEA494I is issued
3) Automation detects the alert and issues the CGROUP FREEZE command(*) to all LSS pairs
4) CGROUP FREEZE(*) deletes the Metro Mirror paths, puts all primary devices in long busy and suspends them
5) Automation issues the CGROUP RUN command(*) to all LSS pairs, releasing the long busy. Secondary devices remain suspended at a single point in time
(*) or equivalent
Without automation, it can take a very long time to suspend all devices with
intermittent impact on applications over this whole time
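The automation logic described above can be sketched as follows. This is a hypothetical outline, not a GDPS implementation: the command strings are simplified (real CGROUP FREEZE/RUN syntax takes LSS and device parameters), and `issue_command` stands in for whatever console or API interface the automation product actually uses.

```python
# Sketch of FREEZE automation: on the first IEA494I (extended long busy)
# alert, FREEZE every LSS pair to create a single consistent point across
# all secondaries, then apply the site policy:
#   GO   -> release production (possible data loss if this was a disaster)
#   STOP -> keep production stopped (zero data loss, availability impact)

def handle_suspend_alert(lss_pairs, policy, issue_command):
    issued = []
    def cmd(text):
        issued.append(text)
        issue_command(text)
    # Step 1: freeze all pairs so every secondary stops at the same point
    for pair in lss_pairs:
        cmd(f"CGROUP FREEZE {pair}")
    # Step 2: apply the GO|STOP policy
    if policy == "GO":
        for pair in lss_pairs:
            cmd(f"CGROUP RUN {pair}")       # release long busy, work continues
    elif policy == "STOP":
        cmd("RESET PRODUCTION SYSTEMS")     # zero data loss, systems stopped
    return issued

log = handle_suspend_alert(["LSS00-LSS10", "LSS01-LSS11"], "GO", lambda c: None)
```

The key design point is that the FREEZE of all pairs happens unconditionally and first; the GO|STOP decision only governs what happens to production afterwards.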
10
Metro Mirror …
• Misconception #1: Synchronous replication always guarantees data
consistency of the remote copy
• Answer:
• In case of a rolling disaster, Metro Mirror alone does not guarantee data
consistency at the remote site
• Need to exploit the Consistency Group function AND external automation to
guarantee data consistency at the remote site
• Ensure that if there is a suspension of a Metro Mirror device pair, the whole environment
is suspended and all the secondary devices are consistent with each other
• Supported for both planned and unplanned situations
11
Metro Mirror …
• Misconception #2: Synchronous replication guarantees zero data loss in a
disaster (RPO=0)
• Answer
• Not by itself…
• The only way to ensure zero data loss is to immediately stop all I/O to the primary
disks when a suspend happens
• e.g. if you lose connectivity between the primary and secondary devices
• FREEZE and STOP policy in GDPS/PPRC
• GDPS will reset the production systems while I/O is suspended
• Choosing to have zero data loss really means that
• You have automation in place that will stop all I/O activity in the appropriate circumstances
• You accept a possible impact on continuous availability at the primary site
• Systems could be stopped for a reason other than a real disaster (e.g. broken remote copy link
rather than a fire in the computer room)
12
Metro Mirror …
• Misconception #3: Metro Mirror eliminates DASD subsystem as single point
of failure (SPOF)
• Answer:
• Not by itself...
• Needs to be complemented by a non-disruptive failover (HyperSwap) capability, e.g.
• GDPS/PPRC HyperSwap Manager
• Basic HyperSwap in TPC-R
[Diagram: HyperSwap sequence between primary (P) and secondary (S), applications keep running]
1) Failure event detected
2) Mirroring suspended and I/O quiesced to ensure data consistency
3) Secondary devices made available using the failover command
4) UCBs swapped on all systems in the sysplex and I/O resumed
13
z/OS Global Mirror
• a.k.a. XRC (eXtended Remote Copy)
• Combination of software and hardware functions for asynchronous
replication WITH consistency
• Involves a System Data Mover (SDM) on z/OS in conjunction with disk
subsystem microcode
[Diagram: z/OS Global Mirror data flow between primary (P), the System Data Mover (SDM) with its journal (J), and secondary (S)]
(1) Write to primary volume
(2) Primary disk subsystem posts I/O complete
Every so often (several times a second):
(3) Offload data from primary disk subsystem to SDM
(4) CG is formed and written from the SDM’s buffers to the SDM’s journal data sets
(5) Immediately after the CG has been hardened on the journal data sets, the records are written to their
corresponding secondary volumes
14
z/OS Global Mirror …
• Use of Time-stamped Writes and Consistency Groups to ensure data
consistency
• All records being written to z/OS Global Mirror primary volumes are time stamped
• Consistency Groups are created by the SDM
• A Consistency Group contains records that the SDM has determined can be safely
written to the secondary site without risk of out-of-sequence updates
• Order of update is preserved across multiple disk subsystems in the same XRC session
• Recovery Point Objective
• Amount of time that secondary volumes lag behind the primary depends mainly on
• Performance of the SDM (MIPS, storage, I/O configuration)
• Amount of bandwidth
• Use of device blocking or write pacing
• Pause (blocking) or slow down (pacing) I/O write activity for devices with very high update rates
• Objective: maintain a guaranteed maximum RPO
• Should not be used on the DB2 active logs (increased risk of system slowdowns)
15
Global Copy
• a.k.a. PPRC-XD (Peer-to-Peer Remote Copy Extended Distance)
• Disk-subsystem-based asynchronous replication WITHOUT consistency
[Diagram: Global Copy asynchronous write sequence, Host I/O to primary (P), later transferred to secondary (S)]
(1) Write to primary volume
(2) Primary disk subsystem posts I/O complete
At some later time:
(3) The primary disk subsystem initiates an I/O to the secondary disk subsystem to transfer the data (only
changed sectors are sent if the data is still in cache)
(4) Secondary indicates to the primary that the write is complete - primary resets indication of modified track
16
Global Copy …
• Misconception #4: Global Copy provides a remote copy that would be usable
in a disaster
• Answer:
• Not by itself…
• Global Copy does NOT guarantee that the arriving writes at the local site are
applied to the remote site in the same sequence
• Secondary copy is a ‘fuzzy’ copy that is just not consistent
• Global Copy is primarily intended for migrating data between sites or between
disk subsystems
• To create a consistent point-in-time copy, you need to pause all updates to the
primaries and allow the updates to drain to the secondaries
• E.g., Use the -SET LOG SUSPEND command for DB2 data
17
Global Mirror
• Combines Global Copy and FlashCopy Consistency Groups
• Disk-subsystem-based asynchronous replication WITH consistency
[Diagram: Global Mirror data flow with out-of-sync bitmaps: A-disks (Global Copy primary), B-disks (Global Copy secondary), C-disks (FlashCopy consistency group copy)]
(1) Write to primary volume
(2) Primary disk subsystem posts I/O complete
At some later time:
(3) Global Copy is used to transmit data asynchronously between primary and secondary
At predefined time interval:
(4) Create point-in-time copy CG on A-disks – Write I/Os queued for short period of time (usually < 1 ms)
(5) Drain remaining CG data to B-disk
(6) FlashCopy used to save CG to C-disks
(7) Primary disk subsystem notified to continue with the Global Copy process
18
Global Mirror …
• Recovery Point Objective
• Amount of time that the FlashCopy target volumes lag behind the primary
depends mainly on
• Bandwidth and links between primary and secondary disk subsystems
• Distance/latency between primary and secondary
• Hotspots on secondary in write intensive environments
• No pacing mechanism
• Designed to protect production performance at the expense of the mirror currency
• RPO can increase significantly if production write rates exceed the available resources
19
Remote Copy Services Summary
Metro Mirror
• Type: Synchronous (hardware)
• Used for: HA and DR
• Supported devices: CKD (System z) and FB (Open)
• Supported disk subsystems: Any enterprise disk subsystem (ESS, DS6000, DS8000, HDS/EMC)
• RPO: 0 if FREEZE and STOP (no data loss, but no more production running); > 0 if FREEZE and RUN (data loss if real disaster)
• Distance: Metro distances, with impact to the response time of the primary devices (10 microsec/km)

z/OS Global Mirror
• Type: Asynchronous with consistency (hardware and software, SDM on z/OS only)
• Used for: DR
• Supported devices: CKD only (z/OS, z/Linux, z/VM)
• Supported disk subsystems: Any enterprise disk subsystem (ESS, DS8000, HDS/EMC)
• RPO: 1-5 seconds or better with sufficient bandwidth and resources; max RPO can be guaranteed with pacing
• Distance: Virtually unlimited distances, with minimal impact to the response time of the primary devices

Global Copy
• Type: Asynchronous without consistency (hardware)
• Used for: Data migration
• Supported devices: CKD (System z) and FB (Open)
• Supported disk subsystems: IBM enterprise disk subsystems (ESS, DS6000, DS8000)
• RPO: Depends on user-managed procedures and consistency creation interval
• Distance: Virtually unlimited distances, with minimal impact to the response time of the primary devices

Global Mirror
• Type: Asynchronous with consistency (hardware)
• Used for: DR
• Supported devices: CKD (System z) and FB (Open)
• Supported disk subsystems: IBM enterprise disk subsystems (ESS, DS6000, DS8000)
• RPO: 3-5 seconds or better with sufficient bandwidth and resources; no pacing
• Distance: Virtually unlimited distances, with minimal impact to the response time of the primary devices
20
CA and DR Topologies and the GDPS Family
• Continuous Availability of Data within a Data Center: GDPS/HyperSwap Mgr
• Single data center, applications remain active
• Continuous access to data in the event of a storage subsystem outage
• Continuous Availability / Disaster Recovery within a Metropolitan Region: GDPS/PPRC, GDPS/HyperSwap Mgr
• Two data centers, systems remain active
• Multi-site workloads can withstand site and/or storage failures
• Disaster Recovery at Extended Distance: GDPS/GM, GDPS/XRC
• Two data centers
• Rapid systems disaster recovery with 'seconds' of data loss
• Disaster recovery for out-of-region interruptions
• Continuous Availability Regionally and Disaster Recovery at Extended Distance: GDPS/MGM, GDPS/MzGM
• Three data centers (A, B, C)
• High availability for site disasters
• Disaster recovery for regional disasters
21
DB2 Restart Recovery
• Key ingredient for disk-based DR solutions
• Normal DB2 restart
• Re-establishes DB2 data consistency through restart recovery mechanisms
• Directly impacts the ability to meet the RTO
• Must NOT be attempted on a mirrored copy that is not consistent
• Guaranteed inconsistent data which will have to be fixed up
• No way to estimate the damage
• After the restart, it is too late if the damage is extensive
• Damage may be detected weeks or months after the event
• Data sharing: Do not forget to delete all CF structures owned by the group
• Otherwise, guaranteed logical data corruption
22
DB2 Restart Recovery …
• Tune for fast restart
• Take frequent system checkpoints (2-5 minutes)
• Long-running URs
• Aggressively monitor for long running URs
• Start conservatively when enabling this tracking and adjust the values downwards
progressively
• Initial recommendations
• URLGWTH = 10 (K log records)
• URCHKTH = 5 (system checkpoints)
• Automatically capture warning messages DSNJ031I (URLGWTH) and DSNR035I
(URCHKTH) and/or post process IFCID 0313 records (if Statistics Class 3 is on)
• Need management ownership and process for getting ‘rogue’ applications fixed up in a timely
manner
• Use DB2 Consistent restart (Postponed Abort)
• Limit backout of long-running URs
• LBACKOUT=AUTO
• BACKODUR=5 (interval is 500K log records if time-based checkpoint frequency)
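A simple way to act on the DSNJ031I/DSNR035I warnings above is to post-process the console log and flag repeat offenders. The sketch below is illustrative only: the message IDs come from the text, but the message layout shown here is simplified, not the exact z/OS format.

```python
# Illustrative post-processing of console messages to spot 'rogue'
# long-running URs. DSNJ031I corresponds to URLGWTH warnings and
# DSNR035I to URCHKTH warnings; the line format is simplified.
import re

def rogue_urs(console_log, threshold=3):
    """Count UR warnings per correlation ID; flag those at or above threshold."""
    counts = {}
    for line in console_log.splitlines():
        m = re.search(r"(DSNJ031I|DSNR035I)\s+.*CORRELATION-ID=(\S+)", line)
        if m:
            corr = m.group(2)
            counts[corr] = counts.get(corr, 0) + 1
    return {corr: n for corr, n in counts.items() if n >= threshold}

sample = "\n".join(
    ["10:01 DSNJ031I WARNING CORRELATION-ID=BATCH01"] * 3
    + ["10:05 DSNR035I WARNING CORRELATION-ID=CICS02"]
)
print(rogue_urs(sample))  # {'BATCH01': 3}
```

The output feeds the management process mentioned above: applications that repeatedly trip the thresholds are the ones to get fixed up before they hurt restart time.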
23
DB2 Restart Recovery …
• Optimise GRECP/LPL Recovery
• Frequent castout
• Low CLASST (0-5)
• Low GBPOOLT (5-25)
• Low GBPCHKPT (2-4)
• Use CLOSE YES as design default for tablespaces and indexes
• Set ZPARM LOGAPSTG (Fast Log Apply) to maximum buffer size of 100MB
• Develop an optimised procedure to perform the GRECP/LPL recovery
• Determine objects in GRECP/LPL status that require recovery action
• Start with DB2 Catalog and Directory objects first
• Generate optimal set of jobs to drive GRECP/LPL recovery
• Limit number of objects per -STA DB command to 20-30 objects
• Limit number of -STA DB command per member to 10 (based on 100MB of FLA storage)
• Spread -STA DB commands across all available members
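The batching rules above can be sketched as a small command generator. This is a hypothetical helper, not a DB2-supplied tool: the database, space and member names are invented, and it assumes all objects belong to one database for simplicity.

```python
# Sketch of the GRECP/LPL batching rules: at most ~30 objects per
# -STA DB command, at most 10 commands per member (sized to 100MB of
# fast log apply storage), spread round-robin across members.

def build_sta_db_commands(dbname, spaces, members, per_cmd=30, cmds_per_member=10):
    # Split the object list into chunks of at most per_cmd objects
    chunks = [spaces[i:i + per_cmd] for i in range(0, len(spaces), per_cmd)]
    if len(chunks) > len(members) * cmds_per_member:
        raise ValueError("too many objects for one pass; run in waves")
    plan = {m: [] for m in members}
    for i, chunk in enumerate(chunks):
        cmd = f"-STA DB({dbname}) SPACENAM({','.join(chunk)})"
        plan[members[i % len(members)]].append(cmd)  # round-robin across members
    return plan

# 75 spaces across 2 members -> 3 commands of 30/30/15 objects
spaces = [f"TS{n:04d}" for n in range(75)]
plan = build_sta_db_commands("PRODDB", spaces, ["DB2A", "DB2B"])
```

In a real procedure, the catalog/directory objects would be started first (as noted above) before generating the plan for application objects.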
24
DB2 Restart Recovery …
• Optimise GRECP/LPL Recovery …
• DB2 9 can automatically initiate GRECP/LPL recovery at the end of both normal
restart and disaster restart when a GBP failure condition is detected
• Why would I still need a procedure for GRECP/LPL recovery?
• If for any reason the GRECP recovery fails
• Objects that could not be recovered will remain in GRECP status and will need to be
recovered by issuing -START DATABASE commands
• If the GBP failure occurs after DB2 restart is complete
• GRECP/LPL recovery will have to be initiated by the user
• Highly optimised user-written procedures can still outperform the automated
GRECP/LPL recovery on DB2 9
• May want to turn off automated GRECP/LPL recovery if aggressive RTO
• DB2 -ALTER GBPOOL …. AUTOREC(NO)
25
FlashCopy
[Diagram: FlashCopy source (S) and target (T) volumes with the track bitmap]
PIT copy technology on the disk subsystem
When a FlashCopy is issued the copy is available immediately
A bitmap tracks the relationship between source and target tracks
Read and write activity is possible on
both the source and target devices
Writes to the source may cause a copy on write if
the track has not been copied to the target
Reads of tracks on the target that have not been
copied from the source will be redirected to the
source
An optional background copy process
will copy all tracks from the source to the
target which will end the relationship
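The bitmap mechanism just described can be modelled in a few lines. This is a toy illustration of the concept, not disk subsystem microcode: one bitmap entry per track, copy-on-write for source updates, and read redirection for uncopied target tracks.

```python
# Toy model of the FlashCopy track bitmap. Each track starts "not yet
# copied"; reads of an uncopied target track are redirected to the
# source, and a write to an uncopied source track first copies the old
# data to the target (copy on write), preserving the point-in-time image.

class FlashCopyPair:
    def __init__(self, source_tracks):
        self.source = list(source_tracks)
        self.target = [None] * len(source_tracks)
        self.copied = [False] * len(source_tracks)  # the bitmap

    def read_target(self, i):
        # Uncopied target tracks are satisfied from the source
        return self.target[i] if self.copied[i] else self.source[i]

    def write_source(self, i, data):
        if not self.copied[i]:
            self.target[i] = self.source[i]  # copy the old data first
            self.copied[i] = True
        self.source[i] = data

pair = FlashCopyPair(["A", "B", "C"])
pair.write_source(0, "A2")      # triggers copy-on-write of track 0
print(pair.read_target(0))      # 'A' -> point-in-time image preserved
print(pair.read_target(1))      # 'B' -> redirected to the source
```

The optional background copy simply walks the bitmap copying the remaining tracks, after which every target read is local and the relationship can end.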
Several options available for FlashCopy including
• Incremental FlashCopy
• Consistent FlashCopy
• Multiple FlashCopy relationships
• Dataset level FlashCopy
• Space Efficient FlashCopy (OA30816)
• Remote Pair FlashCopy
26
FlashCopy and DB2
• Many applications of the PIT copy created by FlashCopy, e.g.
• Cloning of environments
• Protection during resynchronisation of replication solutions
• System-level PIT backup
• When using FlashCopy to take backups outside of DB2’s control
• Use the DB2 command -SET LOG SUSPEND to temporarily ‘freeze’ all DB2
update activity
• Ensures the PIT copy is a valid base for recovery
• Externalises any unwritten log buffers to active log datasets
• But does not guarantee that the latest version of the data is externalised to DASD
• Need to go through DB2 restart recovery to re-establish DB2 data consistency
• For Data Sharing, you must issue the command on each data sharing member and
receive DSNJ372I before you begin the FlashCopy
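The data sharing procedure above can be outlined as follows. This is a hypothetical orchestration sketch: `issue_command` and `wait_for_message` stand in for a real console automation interface, and the DB2 commands and DSNJ372I message are quoted from the text above.

```python
# Sketch of a PIT backup for a data sharing group: suspend logging on
# every member, confirm DSNJ372I from each before copying, trigger the
# FlashCopy, then always resume update activity.

def pit_backup(members, issue_command, wait_for_message, flashcopy):
    try:
        for m in members:
            issue_command(m, "-SET LOG SUSPEND")
        for m in members:
            # Do not start the copy until every member confirms suspension
            wait_for_message(m, "DSNJ372I")
        flashcopy()
    finally:
        # Always resume, even if the copy fails, to limit the outage
        for m in members:
            issue_command(m, "-SET LOG RESUME")

actions = []
pit_backup(["DB2A", "DB2B"],
           issue_command=lambda m, c: actions.append((m, c)),
           wait_for_message=lambda m, msg: actions.append((m, msg)),
           flashcopy=lambda: actions.append("FLASH"))
```

The ordering is the whole point: every member must be suspended and confirmed before the copy starts, or the PIT copy is not a valid base for restart.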
27
FlashCopy and DB2 …
• IBM DB2 for z/OS utilities make more and more use of FlashCopy
• V8
• BACKUP SYSTEM
• RESTORE SYSTEM
• Dataset FC support for CHECK INDEX SHRLEVEL CHANGE
• V9
• Incremental FC for BACKUP SYSTEM
• Dataset FC for RECOVER with system-level backup (SLB) as input
• Dataset FC for CHECK DATA SHRLEVEL CHANGE and CHECK LOB SHRLEVEL CHANGE
• V10
• Dataset FC for COPY
• Dataset FC for inline copy in REORG TABLESPACE, REORG INDEX, REBUILD INDEX, LOAD
• FC image copies with consistency and no application outage (SHRLEVEL CHANGE)
• FCIC accepted as input to RECOVER, COPYTOCOPY, DSN1COPY, DSN1COMP, DSN1PRNT
28
FlashCopy and Remote Copy Services
• FlashCopy and Global Mirror or z/OS Global Mirror (XRC)
• Restriction: A FlashCopy target cannot be established on a device that is part of a
Global Mirror or z/OS Global Mirror volume pair
• What does this mean for the DB2 utilities?
• BACKUP SYSTEM
• The copy pool backup needs to be defined outside of Global Mirror or z/OS Global Mirror
• Object-level RECOVER from SLB
• Standard I/O (slower) is always used when restoring data from a system-level backup
• Consider specifying FASTREPLICATION(DATASETRECOVERY(NONE)) in DFSMShsm
• Eliminates the overhead of always failing the FC before dropping to standard I/O
• RESTORE SYSTEM
• Cannot use FlashCopy to restore the entire DB2 system from a copy pool backup
• But can use a system level backup on tape
• To use FlashCopy to restore the entire DB2 system to a PIT, need to disable mirroring before
running the RESTORE SYSTEM utility
29
FlashCopy and Remote Copy Services
• FlashCopy and Global Mirror or z/OS Global Mirror (XRC) …
• What does it mean for the DB2 utilities? …
• CHECK INDEX|DATA|LOB SHRLEVEL CHANGE
• FlashCopy cannot be used if the temporary shadow dataset is allocated on a mirrored device
• Standard DSS copy will be used
• While the shadow copy is being created, the object is in Read Only mode
• ZPARM UTIL_TEMP_STORCLAS can be used to specify an explicit storage class
• Allow the CHECK utility to use a pool of volumes that are not mirrored
• APAR PM19034 (V9/V10)
• New ZPARM CHECK_FASTREPLICATION (PREFERRED|REQUIRED)
• PREFERRED (default V9) >> Standard I/O will be used if FlashCopy cannot be used
• REQUIRED (default V10) >> CHECK will fail if FlashCopy cannot be used
30
FlashCopy and Remote Copy Services
• FlashCopy and Global Mirror or z/OS Global Mirror (XRC) …
• What does it mean for the DB2 utilities? …
• FlashCopy image copies
• FlashCopy cannot be used if the image copy dataset is allocated on a mirrored device
• Image copy is still taken, but will always use standard I/O (slower)
• Can use SMS to allocate the image copy dataset on a pool of volumes that are not mirrored
• Up to 4 additional sequential format image copies can be created at the same time
• Protection against DASD failure
• Possibility of remote image copy for DR
• RECOVER using FlashCopy image copies
• Standard I/O is always used when recovering from a FlashCopy image copy (slower)
• APAR PM26762
• New ZPARM REC_FASTREPLICATION (NONE|PREFERRED|REQUIRED)
• Consider specifying REC_FASTREPLICATION = NONE >> Eliminates the overhead of
always failing the FC before dropping to standard I/O
31
FlashCopy and Remote Copy Services …
• FlashCopy and Metro Mirror (prior to IBM Remote Pair FlashCopy)
• By default, same restrictions as with Global Mirror and z/OS Global Mirror
• Option FCTOPPRCPrimary on DFSMSdss COPY command was introduced to
allow FlashCopy to Metro Mirror primary volumes
• Support for DFSMShsm in APAR OA23849
• Warning: The Metro Mirror pair will go into Duplex Pending state
• Not acceptable in most cases
• Secondary site may not be recoverable and HyperSwap would fail
[Diagram: P1-S1 pair remains in Metro Mirror Duplex; after a FlashCopy onto P2, the P2-S2 pair goes Duplex Pending. The FlashCopy target data will only be available once the Metro Mirror pair is back in duplex]
32
FlashCopy and Remote Copy Services …
• FlashCopy and Metro Mirror (with IBM Remote Pair FlashCopy, aka Preserve Mirror FlashCopy)
• A FlashCopy command issued at the primary site is mirrored at the secondary site
• The FlashCopy source and target volumes must be Metro Mirror primary devices
• Must have microcode and APARs in order to use new function
[Diagram: Remote Pair FlashCopy with both P1-S1 and P2-S2 pairs remaining in Metro Mirror Duplex]
1) FlashCopy is issued from one Metro Mirror device to another
2) The primary disk subsystem checks whether it is possible to do the FlashCopy on the remote disk subsystem
3) The FlashCopy is executed on the target disk subsystem and the response is sent back to the host
33
FlashCopy and Remote Copy Services …
• Preserve Mirror FlashCopy
• z/OS DFSMSdss – New optional sub-keywords to FCTOPPRCPRIMARY
• FCTOPPRCPrimary(PresMirNone)
• If the target volume is a Metro Mirror primary device, the pair will go into a ‘duplex pending’
state as a result of a FlashCopy operation
• This is the default if sub-keyword is not specified
• FCTOPPRCPrimary(PresMirPref)
• If the target volume is a Metro Mirror primary device, the pair may go into a ‘duplex pending’
state as a result of a FlashCopy operation
• FCTOPPRCPrimary(PresMirReq)
• If the target volume is a Metro Mirror primary device, the pair must not go into a ‘duplex
pending’ state as a result of a FlashCopy operation
• This is the option GDPS customers should always use
• As before, if you don’t specify FCTOPPRCPrimary
• A PPRC primary volume is not eligible to become a FlashCopy target volume
• This is the default for DFSMSdss COPY command
• Same restrictions as with Global Mirror and z/OS Global Mirror
34
FlashCopy and Remote Copy Services …
• Preserve Mirror FlashCopy and DB2 utilities
• DB2 utilities using DFSMShsm (BACKUP SYSTEM, RESTORE SYSTEM, RECOVER from SLB)
• Preserve mirror attribute is set at the copy pool level via the SMS definition panel
• FRBACKUP to PPRC Primary Vols allowed (NO|PN|PP|PR)
• FRRECOV to PPRC Primary Vols allowed (NO|PN|PP|PR)
• Default (NO): FCTOPPRCPrimary will not be passed to DFSMSdss
• PPRC primary volumes cannot be used as FlashCopy target volumes
• Other DB2 utilities using FlashCopy (CHECK DATA, CHECK INDEX, CHECK LOB, COPY, REORG TABLESPACE, REORG INDEX, REBUILD INDEX, LOAD, RECOVER)
• Preserve mirror attribute is set via ZPARM
• APAR PM26762 (V9/V10): FLASHCOPY_PPRC (blank|NONE|PREFERRED|REQUIRED)
• blank (default V9) >> FCTOPPRCPrimary will not be passed to DFSMSdss
• REQUIRED (default V10) >> If the target volume is a Metro Mirror primary device and preserve
mirror cannot be used, the utility will use standard I/O
• Unless …
CHECK_FASTREPLICATION = REQUIRED >> CHECK will fail
REC_FASTREPLICATION = REQUIRED >> RECOVER will fail
35
Conclusion
• Need clear and consistent objectives for Continuous Availability (CA) and
Disaster Recovery (DR)
• In line with the business requirements and expectations
• Clearly differentiate CA and DR to ensure clarity of the objectives for functionalities
• Examples
• Running with a multi-site workload is generally done to provide faster restart in case of site failures
(DR) but can compromise the exploitation of CA capabilities
• RPO=0 (no data loss) can only be achieved with a FREEZE/STOP policy which can impact the
availability of production running on the primary site (‘false positive’)
• The more aggressive the SLAs, the more investments are required (escalating)
• Hardware (e.g. extra DASD)
• Automated, optimised procedures
• Testing and practice
36
Conclusion …
• Data consistency is of paramount importance
• Any sign of inconsistency found in your testing should be driven to root cause
• Broken pages, data/index mismatches, etc.
• A DB2 cold start, or any form of conditional restart, will lead to data corruption and loss of data
• Practice, practice, practice
• Test in anger, not in simulation
• Continually validate recovery procedures to maintain readiness
• Verify that RPO/RTO objectives are being met
• Do not throw away your ‘standard’ DB2 log-based recovery procedures
• Even though it should be a very rare occurrence, it is not wise to start planning for mass recovery when the failure actually occurs, e.g.
• Plan A for Disaster Recovery has failed
• Local recovery on the primary site following wide-spread logical corruption
37
Looking for More Information?
• Redbooks
• Disaster Recovery with DB2 UDB for z/OS, SG24-6370
• GDPS Family - An Introduction to Concepts and Capabilities, SG24-6374
• DS8000 Copy Services for IBM System z, SG24-6787
• DS8000 Remote Pair FlashCopy (Preserve Mirror), REDP-4504
• Manuals
• z/OS DFSMS Advanced Copy Services, SC35-0428
• z/OS DFSMSdfp Storage Administration, SC26-7402
• z/OS DFSMShsm Storage Administration, SC35-0421
38
Florence Dubois, IBM DB2 for z/OS Development
Session F09
The Importance of Being Consistent: DB2 for z/OS and Copy Services for IBM System z
Florence Dubois is an IBM Certified Senior IT Specialist and a member of the
DB2 for z/OS Development SWAT team. In this role, she consults for worldwide
customers on a variety of technical topics, including implementation and
migration, design for high performance and availability, performance and tuning,
system health checks, and disaster recovery. Florence presents regularly at
conferences and has co-authored several IBM Redbooks publications.