
Page 1:

Session code:

Db2 for z/OS Best Practices for Continuous Availability

John Campbell, IBM Db2 for z/OS Development

A04

Monday 3rd June, 2019 16:30-17:30 Platform: Db2 for z/OS

IDUG 2019 NA Tech Conference in Charlotte, NC

Page 2:

Agenda

• What can go wrong
• Reduce planned & unplanned downtime
• Manage to Service Level Agreement
• Preventative service planning
• Fast Db2 for z/OS crash restart
• Automated Db2 for z/OS restart after failure
• CF structure duplexing
• Db2 for z/OS mass data recovery
• DASD as a single point of failure
• Application considerations for data sharing

2

Page 3:

What can go wrong in continuous availability?

• What can go wrong, go wrong, go wrong …
  • Almost everything
  • Many individual problems and mistakes can be tolerated; it is the compounding of two, three or four mistakes that leads to outages
  • Without some help at every level, and a concern for high availability, there is real potential for problems
  • The key to success is a service level agreement that can be used to justify costs when needed, and the commitment of everyone
    • “Continuous availability is a religion and everyone needs to understand that …”
• One of the key tradeoffs for continuous availability is data integrity
  • The first priority for Db2 for z/OS is data integrity, and there are situations where an outage results from the need to protect data integrity
• With Db2 for z/OS, customers can practically achieve availability of 3 nines or roughly three hours of downtime per year with excellent planning and implementation

3

Page 4:

What can go wrong in continuous availability?

• Users
• Applications
• Operations
• Database Administration
• Systems Administration
• Software bugs

• Data Integrity
• Performance
• Locking
• Continuous Operations

4

24 x 365.25 x 100% x free

Page 5:

Availability Issues - Reduce planned & unplanned downtime

• z/OS Parallel Sysplex and Db2 for z/OS Data Sharing is the “gold standard”
  • Planned downtime
    • Allows hardware and software changes to be non-disruptive to applications
  • Unplanned downtime
    • Eliminates a single point of failure for a CEC, a z/OS LPAR, or a Db2 for z/OS member
• Problem: customers make a big investment in z/OS Parallel Sysplex and Db2 for z/OS data sharing, but still cannot achieve ‘true’ continuous availability
  • Additional ingredients are essential
    • Active inter-system read-write data sharing
    • Multiple instances of critical application workloads running across multiple z/OS LPARs across multiple CECs
    • Fine-grained, dynamic transaction routing
      • Use the aggregate capacity of multiple images to satisfy peak demands
      • Improve application availability, throughput and scalability
      • Provide automatic re-route around failure
    • CF structure duplexing
    • Automated restart of failed components

5

Page 6:

Availability Issues - Reduce planned & unplanned downtime …

• Fast, non-disruptive database operations (even without Db2 for z/OS data sharing)
  • Online image COPY & REORG (see the example after this list)
  • Concurrent COPY – allows you to create a non-fuzzy copy while applications update the data
  • FlashCopy consistent image COPY
  • Fast Db2 crash restart with postponed abort
  • Online system parameter change
  • Online LOAD RESUME
  • z/OS HyperSwap
  • ...
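As a minimal sketch of one of these operations (the database and table space names DBPROD.TSACCT are placeholders, and mapping-table handling for SHRLEVEL CHANGE varies by Db2 release), an online REORG lets applications continue reading and updating the data while it is reorganised:

    REORG TABLESPACE DBPROD.TSACCT
          SHRLEVEL CHANGE
          MAXRO 30
          DRAIN ALL

MAXRO bounds the final log-apply iteration and DRAIN ALL controls how readers and writers are drained at the switch phase; both would be tuned to the site's availability criteria.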

6

Page 7:

Availability Issues - Manage to Service Level Agreement

• The key to managing is a service level agreement and the resources needed to meet that agreement
  • Set availability criteria balancing needs versus costs
    • Should reflect affordable business requirements and should match infrastructure capability
  • Use the criteria to determine and prioritise the needed processes and procedures
  • Practice operation and recovery
    • The only real way to determine whether criteria are being met is to test, and the additional benefit is practice
    • Doing something right the very first time is very unusual
    • If not meeting criteria, then optimise and tune
  • Education for skills and time to do the job
• Stay reasonably current with Db2 for z/OS releases and preventative service
  • A number of Db2 for z/OS fixes and improvements in new versions are helpful for higher availability and better serviceability
  • Whilst you do not want to be too current on service, being far behind can result in outages too
• Isolation and controlled replication
  • Isolation for high-availability or high-impact workloads can help improve availability

7

Page 8:

Availability Issues – Preventative service planning

[Chart: number of APARs plotted against months, showing the delta between where the customer is and where the customer wants to be]

• Applying preventive maintenance can and will avoid outages
  • Up to 20% of multi-system outages could have been avoided by regularly installing 'critical' PTFs (e.g., HIPER and PE fixes) to progressively 'close the gap' on missing HIPERs and PE fixes
• Executing a preventive maintenance process requires an understanding of the trade-offs
  • Position on the release adoption 'bell curve'
  • Problems encountered vs. problems avoided
  • Potential for PTF in Error (PE)

8

Page 9:

Availability Issues – Preventative service planning …

• Achieving the highest availability depends on having an adaptive preventive maintenance process that is adjusted based on
  • Attitude to risk in a changing environment and in exploiting new Db2 for z/OS releases and functions
    • Moving up the adoption curve of a Db2 for z/OS release and/or new function with Continuous Delivery should drive the regular apply of more frequent drops of preventative service using a rolling calendar
  • Experience over the previous 12-18 months
    • Too many PEs should drive a less aggressive apply of preventative service
    • Too many problems, and repeat problems where the fixing PTF was readily available, should indicate that more frequent drops of preventative service should be applied
  • Db2 for z/OS product and service plans

9

Page 10:

Consolidated Service Test (CST)

• Goals
  • Enhance the way service is tested and delivered for z/OS by providing a single coordinated service recommendation
  • Provide cross-product testing for participating products
    • The list of products tested is continually expanding
    • Performed in addition to existing testing programs; does not replace any current testing performed by the products
  • Standardise the maintenance recommendation on the z/OS platform
• Results
  • Quarterly report available on the CST website
    • http://www.ibm.com/systems/z/os/zos/support/servicetest
  • Monthly addendum with updates on tested HIPERs and PE fixes
• After service has passed CST testing
  • It is marked with the RSU (Recommended Service Upgrade) RSUyymm SOURCEID notation

10

Page 11:

CST/RSU vs. PUT calendar

H&PE = HIPER/Security/Integrity/Pervasive + PE resolution PTFs (and associated requisites and supersedes)

CST/RSU recommendations:
• CST4Q18 / RSU1812 – all service through end Sep 2018 not already marked RSU, plus H&PE through end Nov 2018 (Base: Sep 2018, H&PE: Nov 2018)
• January 2019 – RSU1901: H&PE through end Dec 2018 (Base: Sep 2018, H&PE: Dec 2018)
• February 2019 – RSU1902: H&PE through end Jan 2019 (Base: Sep 2018, H&PE: Jan 2019)
• March/April 2019 – CST1Q19 / RSU1903: all service through end Dec 2018 not already marked RSU, plus H&PE through end Feb 2019 (Base: Dec 2018, H&PE: Feb 2019)

Corresponding PUT levels:
• PUT1812 – Base: Dec 2018, H&PE: Dec 2018
• PUT1901 – Base: Jan 2019, H&PE: Jan 2019
• PUT1902 – Base: Feb 2019, H&PE: Feb 2019
• PUT1903 – Base: Mar 2019, H&PE: Mar 2019

11

Page 12:

Enhanced HOLDDATA

• http://service.software.ibm.com/holdata/390holddata.html
• Key element of the CST/RSU best-practices process
  • Simplifies service management
  • Can be used to identify missing PE fixes and HIPER PTFs
• SMP/E REPORT ERRSYSMODS (see the sketch after this list)
  • Produces a summary report
    • Includes the fixing PTF number when the PTF is available
    • Includes HIPER reason flags
      • IPL, DAL (data loss), FUL (major function loss), PRF (performance), PRV (pervasive)
  • Identifies whether any fixing PTFs are in RECEIVE status (available for installation) and whether the chain of PTFs to fix the error has any outstanding PEs
• Enhanced HOLDDATA is updated daily
  • A single set of HOLDDATA is cumulative and complete
  • Up to 3 years of history is available
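A minimal SMPCNTL sketch of the report described above; the target zone name DB2ATGT and the separate RECEIVE of HOLDDATA are assumptions about the local SMP/E setup:

    SET BOUNDARY(GLOBAL).
    RECEIVE HOLDDATA.
    REPORT ERRSYSMODS ZONES(DB2ATGT).

The report output lists HIPER and PE error holds that apply to the zone and, where available, the fixing PTFs to be ordered and installed.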

12

Page 13:

Fix Category HOLDDATA (FIXCAT)

• https://www.ibm.com/systems/z/os/zos/features/smpe/fix-category.html
• Fix categories can be used to identify a group of fixes that are required to support a particular HW device, or to provide a particular software function
  • Supplied in the form of SMP/E FIXCAT HOLDDATA statements
    • Each FIXCAT HOLDDATA statement associates an APAR with one or more fix categories
• Current list of FIXCAT HOLDs for Db2 for z/OS (see the table below and the sketch that follows it)

• IBM.DB2.AnalyticsAccelerator.VxRy – Fixes for IDAA VxRy and for other software to support IDAA VxRy
• IBM.DB2.ExtendedRBA – Fixes for the Db2 extended RBA function
• IBM.DB2.Parallelism – Fixes for the Db2 parallelism function
• IBM.DB2.Storageleak – Fixes for Db2 storage leak problems
• IBM.DB2.Storageoverlay – Fixes for Db2 storage overlay problems
• IBM.DB2.SQL-Incorrout – Fixes for Db2 SQL incorrect output problems
• IBM.Coexistence.DB2.SYSPLEXDataSharing – Fixes that enable Db2 releases to coexist when in data sharing mode
• IBM.Function.SYSPLEXDataSharing – Fixes that enable Db2 data sharing or fixes required for data sharing
• IBM.Migrate-Fallback.DB2.Vx (x=9,10,11,12) – Fixes that allow the prior Db2 version to migrate to or fall back from Db2 Vx

13
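A minimal sketch of checking fix categories with SMP/E; the target zone name DB2ATGT is a placeholder and the two categories are chosen from the table above as examples:

    SET BOUNDARY(GLOBAL).
    REPORT MISSINGFIX ZONES(DB2ATGT)
           FIXCAT(IBM.Migrate-Fallback.DB2.V12,
                  IBM.Coexistence.DB2.SYSPLEXDataSharing).

REPORT MISSINGFIX lists, per fix category, the APARs for which the zone does not yet have the resolving PTFs applied.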

Page 14:

Availability Issues – Preventative service planning …

• Maintenance recommendations for Db2 12 for z/OS and Continuous Delivery
  • Apply preventative maintenance every 3 months based on a rolling calendar
    • Use RSU instead of PUT to be less aggressive in applying non-HIPER maintenance
  • Sample strategy based on two 'major' and two 'minor' maintenance drops per year
    • Refresh of the base every 6 months ('major')
      • Each base upgrade should be based on the latest quarterly RSU
      • Ensure that RSU-only service is installed by adding the SOURCEID(RSU*) option in the supplied APPLY and ACCEPT jobs (see the sketch after this list)
    • In addition, apply two mini packages covering HIPERs and PEs in between ('minor')
  • Continuous program to review Enhanced HOLDDATA on a weekly basis
    • When rolling out a maintenance package to production and after post-production cutover
    • For a vicious HIPER problem with no operational bypass/workaround, expedite the fix into production after 2 weeks in test
    • Others can be deferred until the rollout of the next maintenance drop, because
      • An operational bypass/workaround is applied in production
      • It is only a “low grade” problem in production
      • It is not applicable
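A minimal SMPCNTL sketch of an RSU-only apply; the target zone name DB2ATGT is a placeholder, and CHECK makes this a trial run (remove CHECK, after reviewing the reports, for the real apply):

    SET BOUNDARY(DB2ATGT).
    APPLY SOURCEID(RSU*)
          GROUPEXTEND
          BYPASS(HOLDSYSTEM)
          CHECK.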

14

Page 15:

Availability Issues – Preventative service planning …

• Recommendations …
  • Develop processes/procedures and technical changes to implement ‘rolling’ maintenance outside of heavily constrained change windows
    • Separate SDSNLOAD per Db2 for z/OS member
    • Separate ICF user catalog alias per Db2 for z/OS member
  • Benefits of ‘rolling’ maintenance
    • One Db2 for z/OS member stopped at a time
    • Db2 data is continuously available via the N-1 members
    • Fall-back to the prior level is fully supported if necessary
    • Difficult if your applications have affinities!

15

Page 16:

Availability Issues - Fast Db2 for z/OS crash restart

• Take frequent system checkpoints (commands sketched after this list)
  • Every 2-5 minutes generally works well
• Application processes should commit frequently!
• Use Db2 consistent restart (postponed abort) to limit backout of long-running URs
  • Controlled via ZPARM
    • LBACKOUT=AUTO|LIGHTAUTO
      • LIGHTAUTO is an option to allow Restart Light with postponed abort
    • BACKODUR=5 (the multiplier is applied to an interval of 500K log records when a time-based checkpoint frequency is in use)
• Postponed abort is not a ‘get-out-of-jail-free’ card
  • Some retained locks persist through restart while postponed backout processing is active
    • Retained locks held on page sets for which backout work has not been completed
    • Retained locks held on tables, pages, rows, or LOBs of those table spaces or partitions
  • If held on shared critical resources, these retained locks can prevent applications from running properly
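As a sketch of setting and verifying the checkpoint interval online (the command prefix -DB1A is a placeholder for the member being tuned; the equivalent value can also be set via the subsystem parameters at installation time):

    -DB1A SET LOG CHKTIME(3)
    -DB1A DISPLAY LOG

The first command requests a system checkpoint roughly every 3 minutes; the second displays the current checkpoint frequency and log status so the change can be confirmed.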

16

Page 17:

Availability Issues - Fast Db2 for z/OS crash restart …

• Track and eliminate long-running URs
  • Long-running URs can have a big impact on overall system performance and availability
    • Elongated Db2 for z/OS restart and recovery times
    • Reduced availability due to retained locks held for a long time (data sharing)
    • Potentially long rollback times in the case of application abends
    • Lock contention due to extended lock duration >> timeouts for other applications
    • Ineffective lock avoidance (data sharing)
  • A single long-running UR running anywhere in the data sharing group and accessing at least one GBP-dependent data set will stop the global CLSN value from moving forward
    • Can reduce the effectiveness of lock avoidance and space reuse for all other applications running against ANY GBP-dependent data set
  • Problems getting Db2 for z/OS utilities executed
• Db2 provides two system parameters to help identify ‘rogue’ applications
  • URCHKTH for long-running URs
  • URLGWTH for heavy updaters that do not commit

17

Page 18:

Availability Issues - Fast Db2 for z/OS crash restart …

• Track and eliminate long-running URs …
  • Aggressively monitor long-running URs
    • Start conservatively and adjust the ZPARM values downwards progressively
    • Initial recommendations (a sketch of the settings follows this list)
      • URLGWTH = 10 (K log records)
      • URCHKTH = 5 (system checkpoints) – based on a 3-minute checkpoint interval
    • Automatically capture the warning messages
      • DSNJ031I (URLGWTH)
      • DSNR035I (URCHKTH)
      • and/or post-process IFCID 0313 records (if Statistics Class 3 is on)
  • Get badly-behaved applications upgraded so that they commit more frequently
    • Needs management ownership and a process
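As an illustration only, the two thresholds are set in the DSN6SYSP macro of the subsystem parameter module; the fragment below shows just those two keywords with the starting values above (all other parameters, the surrounding DSNTIJUZ job, and the continuation formatting are omitted and site-specific):

    DSN6SYSP URCHKTH=5,URLGWTH=10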

18

Page 19:

Availability Issues - Automated Db2 for z/OS restart after failure

• Use automation to detect failures quickly and restart the Db2 for z/OS subsystem immediately
  • Objective: release retained locks as soon as possible
  • Eliminate manual intervention
  • No management decision involved
  • Automatic Restart Manager (ARM) or automated operators (e.g., the Tivoli System Automation family) can be used
• Two failure scenarios to distinguish
  • Db2 or IRLM failure: restart Db2 for z/OS in place using a normal Db2 warm restart
  • LPAR or CEC failure: restart Db2 for z/OS on an alternative LPAR using Db2 Restart Light

19

** Very basic ARM Policy **
RESTART_GROUP(DB2)
  ELEMENT(DSNDB10DB1A)
    TERMTYPE(ALLTERM)
    RESTART_METHOD(ELEMTERM,PERSIST)
    RESTART_METHOD(SYSTERM,STC,'-DB1A STA DB2,LIGHT(NOINDOUBTS)')

Page 20:

Availability Issues - Automated Db2 restart after failure …

• “Restart Light“ = -START DB2 LIGHT(NO|YES|NOINDOUBTS|CASTOUT)
• The design point of Restart Light is ONLY cross-system restart following an LPAR/CEC failure
  • Small memory footprint
    • Originally used to avoid ECSA/CSA virtual storage shortage on the alternate LPAR by forcing IRLM PC=YES
      • No longer needed as of V8, since IRLM PC=YES is always forced
    • Avoids real memory shortage on the alternate LPAR by severely constraining pool sizes
  • Simplified management
    • Does not allow new workload to come in
    • Can shut down the Db2 for z/OS member automatically
• BUT … Restart Light is likely to be slower than a normal Db2 crash restart
  • With ZPARM LBACKOUT=AUTO (the Db2 default), Restart Light does not honour postponed abort
• Restart Light is NOT intended for restart in place on the ‘home’ LPAR or for DR crash restart

20

Page 21:

Availability Issues - CF structure duplexing – LOCK1 and SCA

• LOCK1 or SCA can be dynamically rebuilt on failure into an alternate CF
• Problem: LOCK1 or SCA can be dynamically rebuilt, but only if ALL the Db2 for z/OS members survive the failure
  • OK if the LOCK1 and SCA structures are allocated in a failure-isolated CF
  • But without a failure-isolated CF to contain the LOCK1 and SCA structures, loss of the ‘wrong’ CF will result in a group-wide outage
• Solution: system-managed duplexing for the LOCK1 and SCA structures

21

Page 22:

Availability Issues - CF structure duplexing – LOCK1 and SCA …

• System-managed synchronous CF lock structure duplexing – how it works today, prior to Db2 12 for z/OS

22

[Diagram: with synchronous duplexing, the lock request comes in to Db2/IRLM/XES (1), XES drives the request to both the primary and secondary LOCK1 structures in parallel (2), the two coupling facilities exchange signals to coordinate the duplexed operation (3), both structures return their responses (4), and the reconciled response is returned to the requester (5)]

Page 23:

Availability Issues - CF Structure Duplexing – LOCK1 and SCA

• System-managed asynchronous CF lock structure duplexing – how it now works with Db2 12

23

[Diagram: with asynchronous duplexing, the lock request comes in to Db2/IRLM/XES (1), the request is driven only to the primary LOCK1 structure (2), the primary responds (3), and the response is returned to the requester (4); the primary then forwards the updates to the secondary LOCK1 structure asynchronously (5), where they are applied in order (6), with a query mechanism used to verify that the secondary has kept up]

Page 24:

Availability Issues - CF structure duplexing

• Understand the trade-offs between system-managed CF structure duplexing and the use of a failure-isolated CF with very fast structure rebuild into an alternate CF
  • Option 1 (preferred): LOCK1 and SCA are in the same CF and failure-isolated from the Db2 members
  • Option 2: use system-managed CF structure duplexing for LOCK1 and SCA (a CFRM policy sketch follows this list)
    • Performance overhead for all LOCK1 requests
• Enable Db2-managed duplexing for the GBPs
  • Small performance overhead
    • Only changed pages are duplexed
    • The async request to the secondary GBP structure overlaps with the request to the primary GBP structure
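A minimal CFRM policy sketch for these duplexing choices; the structure names (data sharing group DSNDB0A), sizes and CF names are placeholders, and DUPLEX(ENABLED) on the GBP structure is what allows Db2 to drive its own (user-managed) GBP duplexing:

    STRUCTURE NAME(DSNDB0A_LOCK1)
              INITSIZE(65536) SIZE(131072)
              DUPLEX(ENABLED)
              PREFLIST(CF01,CF02)

    STRUCTURE NAME(DSNDB0A_GBP1)
              INITSIZE(262144) SIZE(524288)
              DUPLEX(ENABLED)
              PREFLIST(CF01,CF02)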

24

Page 25:

Availability Issues - CF structure duplexing …

• Parallel Sysplex configurations
  • A: ‘Traditional’ configuration
    • LOCK1 and SCA in one CF
    • Duplexed GBPs spread across both CFs
      • Primary and secondary GBPs balanced based on load
  • B: One Integrated CF (ICF), one external CF
    • LOCK1 and SCA in the external CF
    • Duplexed GBPs spread across both CFs
      • Primary GBP in the ICF has advantages for the ‘local’ Db2 for z/OS
  • C: Two-ICF configuration
    • LOCK1 and SCA duplexed; allocated in both CFs
      • ‘System-managed’ duplexing
      • Performance implication for all LOCK1 requests
    • Duplexed GBPs spread across both CFs

[Diagrams: A – DB2A and DB2B with external CF01 and CF02; B – DB2A and DB2B with external CF01 and ICF02; C – DB2A and DB2B with ICF01 and ICF02]

25

Page 26:

Availability Issues - CF structure duplexing …

• Parallel Sysplex configurations …
  • D: Two-ICF configuration – similar to C but…
    • LOCK1 and SCA duplexed if allocated in ICF01
      • ‘System-managed’ duplexing
      • Performance implication for LOCK1 requests
    • LOCK1 and SCA do not have to be duplexed if allocated in ICF02
      • Be aware of planned maintenance when ICF02 needs to come down and LOCK1 and SCA will need to live in ICF01 for a small window (or if Db2A/Db2B ever moves to the CEC where ICF02 is)
    • Duplexed GBPs spread across both CFs

[Diagram: D – DB2A and DB2B with ICF01 and ICF02]

26

Page 27:

Availability Issues - Db2 for z/OS mass data recovery

• Based on a daily image copy cycle, plan to keep 48 hours of recovery log on DASD through a combination of active log pairs and archive log COPY1
  • Option #1 (preferred):
    • Maintain a minimum of 6 hours in the active log pairs (for availability)
    • Keep archive log COPY1 on DASD for 48 hours before it is migrated to VSM with DFSMShsm
    • Develop a procedure to check that both objectives are maintained most of the time
  • Option #2:
    • Supersize the number/size of active log pairs to hold a minimum of 48 hours of recovery log data at all times
    • Still recommend writing archive log COPY1 to DASD and migrating it with DFSMShsm soon afterwards to allow automatic recall should the archive log data sets ever be needed
  • Note: more use of data compression has the potential to reduce logging data volume
• Increase the retention period for image copy backups and archive log data sets to at least 28 days
  • Do not forget to update the MODIFY RECOVERY jobs accordingly (see the sketch after this list)
• Develop a process to save away archive logs and image copies at the first sign of data corruption
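A minimal sketch of a MODIFY RECOVERY control statement aligned with the 28-day retention above; the database and table space names are placeholders:

    MODIFY RECOVERY TABLESPACE DBPROD.TSACCT
           DELETE AGE(28)

This removes SYSCOPY and SYSLGRNX entries older than 28 days, so the retention of the underlying image copy and archive log data sets must be kept at least as long.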

27

Page 28:

Availability Issues - Db2 for z/OS mass data recovery …

• Always take dual full image copy backups for REORG utilities and other LOG NO events (see the sketch after this list)
• Develop a check for unrecoverable objects and run it DAILY
  • Check that in the last 24 hours there are at least 2 valid backups for each object, including after LOG NO events
  • When incremental image copies are used, check that there are at least 2 full image copies that are still valid (2 full cycles)
• If taking a daily system-level full volume backup, make sure it is guaranteed to be I/O-consistent so that it has the potential to be used for data recovery
  • e.g., using FlashCopy consistency groups or integrating it with GDPS
  • Invest in maintaining an isolated environment where the backup can be restored
  • The backup can then be used as the basis for
    • Creating a ‘forensic’ environment to investigate data corruption or malicious damage
    • Recovery of last resort, especially when multiple data stores are impacted
    • A safe environment to practice Db2 for z/OS release migration, BIND impact analysis, etc.
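A minimal sketch of a dual-copy image copy; the object name and the four copy ddnames are placeholders, and FLASHCOPY CONSISTENT assumes data set-level FlashCopy support with SHRLEVEL CHANGE:

    COPY TABLESPACE DBPROD.TSACCT
         COPYDDN(LCOPY1,LCOPY2)
         RECOVERYDDN(RCOPY1,RCOPY2)
         SHRLEVEL CHANGE
         FLASHCOPY CONSISTENT

REORG TABLESPACE accepts the same COPYDDN/RECOVERYDDN keywords, which is how the dual inline copies for LOG NO REORGs are taken.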

28

Page 29:

Availability Issues - Db2 for z/OS mass data recovery …

• Develop automation to generate RECOVER jobs in an optimised fashion (a sketch follows this list)
  • ‘Basic’ starting point, which should be refined after testing to demonstrate achievement
    • A list of 20-30 objects per RECOVER job, taking into account any stacking of image copy data sets
    • A maximum of 51 jobs per Db2 member for optimal fast log apply (may have to be reduced if there is too much contention on VSM or DASD)
    • Concurrent REBUILD of indexes (level of parallelism limited by available sort space)
• Test to measure achievement, study the time distribution and look for opportunities to optimise to meet the recovery time objective, e.g.
  • Adjust the recover order and reduce/eliminate ‘dead’ times in the scheduling
  • Keep image copies for small objects (<100MB) on DASD to relieve pressure on DFSMShsm
  • Implement a (more) aggressive data archiving/purge policy, separating active/history data into separate tables
  • Selective use of index COPY/RECOVER instead of REBUILD INDEX for large NPIs
  • Take more frequent backups for some objects, including incremental image copies
  • Selective use of FlashCopy image copies kept on DASD for large objects where the restore phase dominates the recovery elapsed time
• Practice periodically to maintain readiness and ensure recovery objectives are still met

29
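A minimal sketch of one generated recovery job; the object names are placeholders, a real job would list 20-30 objects, and the PARALLEL value and sort device would be tuned to the environment:

    RECOVER TABLESPACE DBPROD.TSACCT01
            TABLESPACE DBPROD.TSACCT02
            TABLESPACE DBPROD.TSACCT03
            PARALLEL(4)

    REBUILD INDEX(ALL) TABLESPACE DBPROD.TSACCT01
            SORTDEVT SYSDA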

Page 30:

Availability Issues - Db2 for z/OS mass data recovery …

• Any form of application-level point-in-time recovery is predicated on having a ‘rock-solid’ understanding of inter-application relationships and the respective data dependencies (including IMS data and other data stores)
  • Must be clearly documented
  • Application sets should be prioritised to drive the order of the recovery

30

Page 31:

Availability Issues - Db2 for z/OS mass data recovery …

• Data consistency checking
  • CHECK utilities are critical diagnosis tools in the event of data corruption
    • Identify objects that need repairing/recovering and assess the extent of the damage
  • Data set-level FlashCopy capability on the DASD storage subsystem is required to run the CHECK utilities non-disruptively (see the sketch after this list)
    • Must set ZPARM CHECK_FASTREPLICATION = REQUIRED
    • ZPARM CHECK_FASTREPLICATION = PREFERRED might be an existing availability exposure
      • CHECK with SHRLEVEL CHANGE would be allowed to run but would not be able to use FlashCopy to create the shadow objects; objects could be left in RO status for many minutes whilst they are being copied with standard I/O
  • As an immediate defensive measure, set ZPARM CHECK_FASTREPLICATION = REQUIRED
  • Once data set-level FlashCopy is available on the DASD storage subsystem, set ZPARM FLASHCOPY_PPRC = REQUIRED to use remote-pair FlashCopy and avoid going out of full duplex on the Metro Mirror pairs
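A minimal sketch of the non-disruptive form of the utility; the object name is a placeholder, and SHRLEVEL CHANGE is what makes the utility work against FlashCopy shadow data sets rather than the live objects:

    CHECK DATA TABLESPACE DBPROD.TSACCT
          SHRLEVEL CHANGE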

31

Page 32:

Availability Issues – DASD As Single Point of Failure

• Objectives
  • Keep data available to applications during DASD storage subsystem maintenance or failure
  • DASD failure is transparent to Db2 for z/OS
• Myth: Metro Mirror by itself eliminates the DASD storage subsystem as a single point of failure
  • It must be complemented by an automated unplanned HyperSwap capability
    • Need connectivity between the host systems and the secondary DASD subsystems
    • All Metro Mirror volume pairs are in full duplex
    • AND you need a non-disruptive failover z/OS HyperSwap capability, e.g.,
      • GDPS/PPRC HyperSwap Manager (GDPS/HM) or GDPS/PPRC
      • z/OS Basic HyperSwap in Copy Services Manager (CSM) for IBM Z systems - Basic or Full

32

Page 33:

Availability Issues – Application considerations for data sharing

• Good application design techniques are usually good Db2 for z/OS data sharing application design techniques
• Minimise locking
  • Minimise lock contention, which is a critical performance factor
  • Minimise the performance overhead of data sharing
  • Reduce the scope of data left locked against the surviving Db2 for z/OS members when a Db2 for z/OS member fails
• Commit frequently
  • Reduces the recovery log scan, limiting application rollback and allowing faster Db2 for z/OS restart
  • Reduces lock contention
  • Improves the effectiveness of global lock avoidance and space reuse
• Adopt a consistent policy for acquiring data in the same table/row order
  • Helps reduce the possibility of deadlocks
• Do not hard-code values tied to a particular Db2 for z/OS member
  • e.g., a Db2 for z/OS subsystem ID in batch JCL

33

Page 34:

Availability Issues – Application considerations for data sharing …

• Remove all transaction or system affinities
  • Inter-transaction affinity
    • Do not assume that subsequent executions of a transaction will run on the same system/image as the first transaction
    • Do not assume that the order of arrival will be the order of execution for transactions (work units)
  • System/image affinity
    • Logical, by LPAR or Db2 for z/OS member, e.g., to reduce MLC software cost
    • Physical, e.g., a hardware resource (crypto device, special printer)
• Avoid single points of control and serialisation blocking retained locks
  • Single-row tables with update
  • Lock escalation
  • SQL LOCK TABLE
  • SQL DELETE without a WHERE clause
• Clone/replicate business-critical applications across Db2 for z/OS members to provide redundancy
  • Across different z/OS LPARs and across different CECs

34

Page 35:

Availability Issues – Application considerations for data sharing …

• Use a light-weight locking protocol (isolation level) and exploit lock avoidance (a BIND/cursor sketch follows this list)
  • Benefits of lock avoidance
    • Increased concurrency by reducing lock contention
    • Decreased lock and unlock activity and the associated CPU resource consumption
    • Decreased number of CF requests and the associated CPU overhead
    • Minimised impact of retained locks
  • Use ISOLATION(CS) CURRENTDATA(NO), or use ISOLATION(UR)
  • Define cursors with their intended use (e.g., FOR UPDATE OF | FOR READ ONLY)
  • Commit frequently to improve the effectiveness of global lock avoidance
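A minimal sketch of the bind options and a read-only cursor; the collection, package, table and column names and the host variable are placeholders:

    REBIND PACKAGE(COLLACCT.PGMACCT) ISOLATION(CS) CURRENTDATA(NO)

    DECLARE C1 CURSOR FOR
      SELECT ACCT_ID, BALANCE
        FROM ACCOUNTS
       WHERE BRANCH_ID = :branch
      FOR READ ONLY

Declaring the cursor FOR READ ONLY tells Db2 the ambiguous cursor will not be used for positioned updates, which is what allows lock avoidance to take effect.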

35

Page 36:

Availability Issues – Application considerations for data sharing …

• Enforce the use of optimistic locking techniques
  • There is some exposure even when using ISOLATION(CS) regardless of the CURRENTDATA option, but the use of CURRENTDATA(NO) will increase the exposure
  • There is an exposure when updating or deleting a row having read the row with a non-cursor SELECT
• What should the application programmer do?
  • Use additional WHERE predicates on searched UPDATEs and DELETEs to enforce data currency, i.e., ensure the data has not changed since the cursor first selected it
  • Option #1 – ‘Over-weighted’ WHERE clause
    • Include all columns that logically determined whether the update was necessary as WHERE predicates, instead of updating based solely on the key column values
  • Option #2 – Use of a timestamp or a version number (see the SQL sketch after this list)
    • Add a timestamp or version number column to the table to record the last update
      • A ROW CHANGE TIMESTAMP column is an ideal choice, as the value is maintained by Db2; otherwise the application has to maintain the timestamp or version number as part of every insert and update
    • Select the timestamp or version number column with the cursor
    • Use it as an additional WHERE predicate on the searched UPDATE or DELETE statement to ensure that the application is updating the same row
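A minimal SQL sketch of option #2 using a row change timestamp; the table and column names and the host variables are placeholders:

    ALTER TABLE ACCOUNTS
      ADD COLUMN ROW_CHG_TS TIMESTAMP NOT NULL
          GENERATED ALWAYS FOR EACH ROW ON UPDATE AS ROW CHANGE TIMESTAMP;

    SELECT ACCT_ID, BALANCE, ROW CHANGE TIMESTAMP FOR ACCOUNTS
      INTO :id, :bal, :old_ts
      FROM ACCOUNTS
     WHERE ACCT_ID = :id;

    UPDATE ACCOUNTS
       SET BALANCE = :new_bal
     WHERE ACCT_ID = :id
       AND ROW CHANGE TIMESTAMP FOR ACCOUNTS = :old_ts;

If the UPDATE affects zero rows, another process changed the row in the meantime and the application should re-read it and retry rather than overwrite the newer data.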

36

Page 37:

Availability Issues – Application considerations for data sharing …

• Reduce lock contention
  • Use LOCKSIZE PAGE as the design default
  • Only use LOCKSIZE ROW where appropriate and beneficial
    • Consider LOCKSIZE PAGE MAXROWS n as an alternative to LOCKSIZE ROW (DDL sketch below)
  • Commit frequently
  • Avoid use of SQL LOCK TABLE
  • Avoid any dependence on lock escalation
  • Issue CLOSE CURSOR as soon as possible
  • Close open cursors defined WITH HOLD ahead of commit
  • Implement partitioning to potentially reduce inter-Db2-member lock contention
  • Access data rows and tables in a consistent order
    • Helps reduce the possibility of deadlocks
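A minimal DDL sketch of the MAXROWS alternative; the table space name is a placeholder, and altering MAXROWS typically requires a subsequent REORG before it takes effect:

    ALTER TABLESPACE DBPROD.TSACCT
          LOCKSIZE PAGE
          MAXROWS 1;

With one row per page, page locks behave like row locks for concurrency purposes while avoiding the extra locking and data sharing overhead of LOCKSIZE ROW, at the cost of more space.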

37

Page 38:

Availability Issues – Application considerations for data sharing …

• Design for ‘parallel batch’
  • Avoid serial processing, in full or in part
  • Saturate the available z/OS LPAR, CEC and sysplex capacity
  • Determine how many parallel streams are needed to meet the elapsed time requirement
    • Determine how many partitions are required
  • Sort input records to drive dynamic sequential prefetch
    • Avoid random I/O by exploiting data row clustering
• Intermediate commit/restart function in batch should be mandatory (a sketch of such a control table follows this list)
  • Flexible criteria based on CPU consumption (number of calls) and/or elapsed time
  • Criteria externalised in a Db2 for z/OS table and easy to change
  • New criteria should be picked up ‘on the fly’ by the application
  • Intelligent commit point processing based on prevailing operating conditions
  • A definite requirement to allow batch to run into the ‘online day’
  • Evaluate and acquire a robust, reliable product
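The deck does not prescribe a layout for the externalised commit criteria; purely as an illustration, a control table might look like the hypothetical one below, with every name and column invented for the sketch:

    CREATE TABLE BATCH_COMMIT_CTL
      (JOB_NAME        VARCHAR(8)  NOT NULL,
       COMMIT_CALLS    INTEGER     NOT NULL,
       COMMIT_SECONDS  INTEGER     NOT NULL,
       ONLINE_DAY      CHAR(1)     NOT NULL WITH DEFAULT 'N');

    CREATE UNIQUE INDEX XBATCH_COMMIT_CTL
      ON BATCH_COMMIT_CTL (JOB_NAME);

The batch program re-reads its row periodically so that operations staff can tighten or relax the commit frequency on the fly, for example when batch must run into the online day.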

38

Page 39:

Never waste a good outage …

… to loosen the purse strings

Summary

39

Page 40:

Summary

• What can go wrong
• Reduce planned & unplanned downtime
• Manage to Service Level Agreement
• Preventative service planning
• Fast Db2 for z/OS crash restart
• Automated Db2 for z/OS restart after failure
• CF structure duplexing
• Db2 for z/OS mass data recovery
• DASD as a single point of failure
• Application considerations for data sharing

40

Page 41:

Session code:

Please fill out your session evaluation before leaving!

A04

John Campbell, IBM Db2 for z/OS Development, [email protected]

41
