Sept. 2009 Bill Minor IBM The DB2 Detective – The Case of the Diligent DBA


Highlights

No matter the extent of preparation, it seems that unexpected outages pop up when least expected. As a result, unplanned downtime can cause significant impact to business operations. When misfortune strikes, there are always two predominant questions, "What happened?" and "When will the system be available again?" Mysteries can be resolved by good DB2 Detective work - identifying and circumventing actions that can lead to problems. When unexpected problems do occur, it is vital to be able to act decisively and swiftly.

The goal of this session is to introduce database inspection tools and provide insights that can be used to help monitor the integrity and maintain the consistency of your data. I will also provide guidance for salvage, repair, recovery, and mitigation operations that are vital when crisis strikes and time is of the essence.


Agenda

• Help with defining a problem

• Help with collecting and understanding information/diagnostics related to a problem

• Tools to help in the event of an outage

• Tools to help avoid outages in the first place

• Features to help avoid, control, and reduce the impact of an unexpected problem (Technology Preview or TP)

• RAS (Reliability, Availability and Serviceability)


In the Beginning ….

• ‘There is an outage’ !!!!!! Where do we start?

…… Take a deep breath … (DB2 911)

• Keep it simple to begin; fundamentals are essential

• Order of Investigation/ ‘Case’:

    1. Problem Description
    2. Problem Determination
    3. Problem Resolution


‘Tools’ in the Detective Kit

• The problem description

• db2diag.log

• db2trc

• db2pd

• db2dart and inspect

• db2top

• db2exfmt

• admin functions

• db2cos, db2support, db2fodc

• Resiliency (mitigation/toleration/avoidance)


An ‘Outage’

• “Outage” is subjective terminology:
    • Data availability: slow access, no access
    • Unexpected crashes

• An outage can be planned (i.e. maintenance) or unplanned (unexpected errors, terminations, slow or unavailable access to data, …).

• ‘Path to DBA diligence’: handling and avoiding unplanned outages via investigation, problem definition, monitoring, prevention, education, … (hopefully something from this presentation) …

• Notes:
    • There is always a trade-off between how much time can be spent investigating a problem and making the system available again to users
    • Other factors: risk mitigation; establishing cause; development of a workaround (in the absence of a ‘fix’ you obviously don’t want to hit this again)


What Information Needs to be Collected?

• Each problem has its own set of one or more signatures

• Problems (even those with the same source – so ‘same problem’) can manifest themselves in different ways
    • Systems and environments are different: hardware, software, different users, different data (scope and volume), …

• There is a set of fundamental or initial information/diagnostics that needs to be collected for every problem, no matter what

• Keep in mind that additional information or diagnostics could be required depending on the initial analysis
    • Goal: To minimize the amount of effort required in collecting diagnostics


The Problem Description

• VERY VERY VERY IMPORTANT (may seem obvious)
• Avoid taking the wrong path
• Invest in:
    • Establishing “What happened?”
    • Getting the correct information and diagnostics

REQUIRED FUNDAMENTALS:

• What is the current state/situation?
• Describe what happened and what needs to happen
• Include recent problem or change history
• Client/Server configuration
• What changed? No really, what changed? (Usually something has, but it is just not apparent how, why or when)
• Include specific actions known to be performed at the time of the incident
• Product version and release number (FP? Special build?)
• SQL error codes
• Scope: System Wide? Instance Wide? Database? Application(s)?


Performance Issues – What? and When?

• Identification of such an issue requires a baseline – a before and after comparison

• Typically the issue is ‘things are slow’ (really meaning ‘slower’)

• Slow query

• After running RUNSTATS

• After applying a fixpack

• After migrating to a new release of DB2

• After making a database configuration change

• Performance of triggers, stored procedures, UDFs with SQL, crashes/traps during query execution

• Differentiate between overall system performance and specific actions

• Date/time correlation? Other maintenance activities?

• Different systems? (db2pd -osinfo)


Problem ‘Scope’

• Scope and cause are required in order to develop and execute an action plan; the action plan may be ‘straightforward’, i.e. restore the database

• A problem manifests itself in one of the following outage scenarios:
    • Instance trap – db2start required
    • Database trap
    • Application failure


Diagnostics to Collect at the Outset

• DB2 Build Level
• Operating System type and level
• DB2 registry variable settings
• MPP? SMP? Single-processor? Database partitioned? Which partition?
• Syntax of command/statement associated with issue
• DB2 error received
• db2diag.log file with sufficient history
• Notification log
• DB2 Trap files (Call stacks), DB2 dump files
• History File
• db2 trace
• Messages file
• Core file, system error logs (Unix: errpt; Windows: event logs)
• Log files and log file control structures
• …

Fundamental


db2diag.log file

• Where is it? $HOME/sqllib/db2dump/
• What is it? ‘Supplementary’ diagnostic logging facility
• How to read? Each entry:
    • Time ordered entries
    • Database specific; Database node/partition (i.e. DPF specific)
    • Process/thread ID; Application specific identifiers/handles
    • Function name where entry was generated
    • Messaging and perhaps some data

2009-03-03-10.26.14.523824-300 I37414837E502 LEVEL: Error

PID : 8083 TID : 54 PROC : db2sysc 20

INSTANCE: svtdbm3 NODE : 020 DB : MASUN1

APPHDL : 0-138 APPID: *N0.svtdbm3.090303150827

AUTHID : SVTDBM3

EDUID : 54 EDUNAME: db2agntp (MASUN1) 20

FUNCTION: DB2 UDB, buffer pool services, sqlbDMSCreatePool, probe:800

MESSAGE : ZRC=0x850F000C=-2062614516=SQLO_DISK "Disk full."

DIA8312C Disk was full.

• Can be parsed with diag tool and/or admin function (examples in Backup Slides)
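A rough illustration of how the fields described above can be pulled out of one entry. This is a sketch, not a replacement for the diag tool: it uses only the sample entry from this slide, the helper name parse_entry is made up for the example, and it assumes the ZRC value is a 32-bit quantity (which is what makes 0x850F000C print as -2062614516).

```python
import re

# Sample entry copied from the slide (whitespace normalized).
ENTRY = """\
2009-03-03-10.26.14.523824-300 I37414837E502 LEVEL: Error
PID : 8083 TID : 54 PROC : db2sysc 20
INSTANCE: svtdbm3 NODE : 020 DB : MASUN1
APPHDL : 0-138 APPID: *N0.svtdbm3.090303150827
AUTHID : SVTDBM3
EDUID : 54 EDUNAME: db2agntp (MASUN1) 20
FUNCTION: DB2 UDB, buffer pool services, sqlbDMSCreatePool, probe:800
MESSAGE : ZRC=0x850F000C=-2062614516=SQLO_DISK "Disk full."
DIA8312C Disk was full.
"""


def parse_entry(text):
    """Extract a few header fields from a single db2diag.log entry."""
    fields = {"timestamp": text.split()[0]}
    for key in ("LEVEL", "PID", "NODE", "DB"):
        match = re.search(rf"\b{key}\s*:\s*(\S+)", text)
        if match:
            fields[key] = match.group(1)
    # FUNCTION: product, component, function name, probe point
    match = re.search(r"FUNCTION:\s*(.+)", text)
    if match:
        parts = [p.strip() for p in match.group(1).split(",")]
        fields["function"] = parts[-2]
        fields["probe"] = int(parts[-1].split(":")[1])
    # The same ZRC is shown in hex and as a signed 32-bit decimal.
    match = re.search(r"ZRC=0x([0-9A-Fa-f]{8})", text)
    if match:
        value = int(match.group(1), 16)
        fields["zrc_signed"] = value - 2**32 if value >= 2**31 else value
    return fields


entry = parse_entry(ENTRY)
print(entry["LEVEL"], entry["DB"], entry["function"], entry["probe"], entry["zrc_signed"])
```

The function name and probe number are usually the most useful search keys when comparing entries across incidents.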


db2trc – Trace facility

• Useful for problems/issues that can be recreated

• Generates a flow and format of the code paths taken by DB2 while the trace is active
    • Traces all DB2 activity in an instance on a physical node (database partition). If there are many logical db2 nodes, all activity is traced

• Trace to memory buffer or file (slower)
    • Trace to file when the in-memory trace would wrap

• Always include corresponding db2diag.log file with a db2 trace

Examples:

db2trc on -l 8M … trace to memory

db2trc on -f mytrace.trc … trace to file

db2trc inf … (information returned when trace is active)

db2trc flw mytrace.trc mytrace.flw …

db2trc fmt mytrace.trc mytrace.fmt … readable versions of trace data
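The examples above leave out the steps between turning a memory trace on and formatting it. The sketch below lists one plausible end-to-end order for a trace to memory. It only builds the command lines and does not run them. The dmp and off steps are not on the slide and are added here as an assumption about the usual workflow; the helper name and file names are illustrative.

```python
def memory_trace_commands(dump_file="mytrace.trc", buffer_size="8M"):
    """Ordered db2trc command lines for a trace-to-memory session."""
    stem = dump_file.rsplit(".", 1)[0]
    return [
        ["db2trc", "on", "-l", buffer_size],          # start tracing to memory
        # ... recreate the problem here ...
        ["db2trc", "dmp", dump_file],                 # assumption: dump buffer to file
        ["db2trc", "off"],                            # assumption: stop tracing
        ["db2trc", "flw", dump_file, stem + ".flw"],  # flow (call sequence) view
        ["db2trc", "fmt", dump_file, stem + ".fmt"],  # formatted view
    ]


for command in memory_trace_commands():
    print(" ".join(command))
```

Keeping the sequence in one place makes it easier to send the matching db2diag.log along with the .flw and .fmt files.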


db2dart (Database Analysis and Reporting Tool)

• Primarily to examine databases for architectural correctness
• Check validity of meta-data structures, page headers, row headers, …
• Runs against data as it exists on disk (primarily an offline tool)
• Critical for investigating problems associated with data ‘corruption’
• Granularity: database /db, tablespace /ts, table /t

EXAMPLES:

db2dart SAMPLE /db

db2dart SAMPLE /ts /tsi 4

db2dart SAMPLE /t /tsi 2 /oi 5

• Marking problematic objects as invalid; recovering or extracting data

• Consistency: Index key data checking

• HWM – tablespace space management


Inaccessible Table – Salvaging Data

• Example: db2dart SAMPLE /ddel

…
Table ID or name, tablespace ID, first page, num of pages:
(suffic page number with 'p' for pool relative)
13 2 0 100

7 of 7 columns in the table will be dumped.
Column numbers and datatypes of the columns dumped:
    0 SMALLINT
    1 VARCHAR() -VARIABLE LENGTH CHARACTER STRING
    2 SMALLINT
    3 CHAR() -FIXED LENGTH CHARACTER STRING
    4 SMALLINT
    5 DECIMAL
    6 DECIMAL
Default filename for output data file is TS2T13.DEL,
do you wish to change filename used? y/n
n

Filename used for output data file is TS2T13.DEL. If existing file, data will be appended to it.


Marking Indexes As Invalid – db2dart /MI

Example:

db2dart testdb /mi /tsi 2 /oi 4

• Time of index rebuild dictated by setting of INDEXREC database cfg variable

• Usually one of RESTART (default) or ACCESS

• RESTART – database restarted

• ACCESS – when index is first accessed

• Messages written to db2diag.log file indicating index rebuild


Checking Database Backup Images

• Database backup images represent one of your ultimate recovery options

• Hence, checking their integrity is prudent

• db2ckbkp
    • Check the integrity of a database backup image
    • Can extract pages from the backup image to a file (-e option)
        • This may be part of a problem causal analysis or repair procedure: what ‘state’ is the data in the backup image?

• Tivoli Storage Manager (TSM): db2adutl verify


History File

• Records information specific to the following database actions:
    • BACKUP, ROLLFORWARD, DROPPED TABLE, LOAD, CREATE TABLESPACE, RENAME TABLESPACE, REORG, ALTER TABLESPACE, ARCHIVE LOG

• Dropped table recovery
    • CREATE or ALTER TABLESPACE option DROPPED TABLE RECOVERY
    • Allows recovery of dropped table data using table space-level restore and rollforward operations
    • Faster than database recovery, and the database remains available to users


LIST HISTORY – Dropped Table Recovery

1 Identify the table which was dropped

 Op Obj Timestamp+Sequence Type Dev Earliest Log Current Log  Backup ID
 -- --- ------------------ ---- --- ------------ ------------ --------------
  D  T  20090223125627                                        000000000000693f00020013
 ----------------------------------------------------------------------------
  "BMINOR "."STAFF3" resides in 1 tablespace(s):

  00001 USERSPACE1
 ----------------------------------------------------------------------------
    Comment: DROP TABLE
 Start Time: 20090223125627
   End Time: 20090223125627
     Status: A
 ----------------------------------------------------------------------------
  EID: 9

  DDL: CREATE TABLE "BMINOR "."STAFF3" ( "ID" SMALLINT NOT NULL , "NAME" VARCHAR(9) , "DEPT" SMALLINT , "JOB" CHAR(5) , "YEARS" SMALLINT , "SALARY" DECIMAL(7,2) , "COMM" DECIMAL(7,2) ) IN "USERSPACE1" ;
 ----------------------------------------------------------------------------


Dropped Table Recovery (continued)

2 Restore a database- or table space-level backup image taken before the table was dropped

3 Create an export directory to which files containing the table data are to be written

4 Roll forward to a point in time after the table was dropped, using the RECOVER DROPPED TABLE option on the ROLLFORWARD DATABASE command (alternatively, roll forward to the end of the logs, so that updates to other tables in the table space or database are not lost)

5 Re-create the table using the CREATE TABLE statement from the recovery history file.

6 Import the table data that was exported during the rollforward operation into the table
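The link between step 1 and step 4 is the dropped table ID, which LIST HISTORY shows in the Backup ID column. The sketch below takes that value from the history line shown earlier and assembles the rollforward and import commands for steps 4 and 6. It is a sketch under assumptions: the export directory /tmp/dropped is made up, the NODE0000/data subdirectory is my understanding of where the exported file lands, and the exact command syntax should be checked against the Command Reference for your release.

```python
# Fields of the history entry from step 1: Op, Obj, timestamp, Backup ID column.
HISTORY_LINE = "D  T  20090223125627  000000000000693f00020013"


def dropped_table_id(line):
    """Return the dropped table ID from a 'D T' LIST HISTORY line."""
    op, obj, _timestamp, table_id = line.split()
    if (op, obj) != ("D", "T"):
        raise ValueError("not a dropped-table history entry")
    return table_id


def recovery_commands(db, tablespace, table, table_id, export_dir):
    """CLP command text for steps 4 and 6 (restore and CREATE TABLE not shown)."""
    rollforward = (
        f"ROLLFORWARD DATABASE {db} TO END OF LOGS AND STOP "
        f"TABLESPACE ({tablespace}) "
        f"RECOVER DROPPED TABLE {table_id} TO {export_dir}"
    )
    # Assumption: exported rows are written under <export_dir>/NODE0000/data
    load_back = f"IMPORT FROM {export_dir}/NODE0000/data OF DEL INSERT INTO {table}"
    return [rollforward, load_back]


table_id = dropped_table_id(HISTORY_LINE)
for command in recovery_commands("SAMPLE", "USERSPACE1", "BMINOR.STAFF3",
                                 table_id, "/tmp/dropped"):
    print(command)
```

Rolling forward to the end of the logs, as here, follows the alternative noted in step 4 so that updates to other tables are not lost.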


Performance – Explain Plans

• Optimizer access plans change … could be legitimate, maybe not

• Compare plans, stats, configuration: before vs after

• Monitor

Example:

db2 -tvf $HOME/sqllib/misc/EXPLAIN.DDL

db2 explain all for "select * from bminor.staff where id=10"

db2exfmt -d sample -n % -e % -s % -w -l -# 0 -o explainPlan.out -g TIC


db2exfmt Output - Example

Access Plan:
-----------
        Total Cost:     11.5946
        Query Degree:   1

              Rows
             RETURN
             (   1)
              Cost
               I/O
               |
              6.28
             FETCH
             (   2)
             11.5946
                1
           /---+---\
        6.28        157
       IXSCAN   TABLE: BMINOR
       (   3)       STAFF
       3.2496        Q1
          0
          |
         157
   INDEX: BMINOR
      I1_STAFF
         Q1

Objects Used in Access Plan:
---------------------------

        Schema: BMINOR
        Name:   I1_STAFF
        Type:   Index
                Time of creation:               2009-03-02-15.52.41.836445
                Last statistics update:
                Number of columns:              1
                Number of rows:                 157
                Width of rows:                  -1
                Number of buffer pool pages:    1
                Distinct row values:            No
                Tablespace name:                USERSPACE1
                Tablespace overhead:            7.500000
                Tablespace transfer rate:       0.060000
                Source for statistics:          Single Node
                Prefetch page count:            32
                Container extent page count:    32
                …

        Schema: BMINOR
        Name:   STAFF
        Type:   Table
                Time of creation:               2009-03-02-15.50.29.727941
                Last statistics update:
                Number of columns:              7
                Number of rows:                 157
                Width of rows:                  40
                Number of buffer pool pages:    1
                Number of data partitions:      1
                Distinct row values:            No
                Tablespace name:                USERSPACE1
                Tablespace overhead:            7.500000
                Tablespace transfer rate:       0.060000


db2pd – Monitor and Troubleshoot DB2

• Indispensable for monitoring and troubleshooting

• Non-intrusively capable of obtaining various statistics, snapshot information, internal meta-data, … from an instance

• Relevant options (some of my favorites):
    • -osinfo, -mempools, -memblocks, -edus (more to come), -stack, -applications, -transactions, -bufferpools, -logs, -pages, -tablespaces, -tcbstats index, -apinfo, -fvp, …


Achieving Understanding via SQL Routines

• DB2 has a very rich and diverse set of SQL routines that can be used to probe and monitor one’s system

• For example:
    • PD_GET_DIAG_HIST
    • PD_GET_LOG_MSGS
    • PDLOGMSGS_LAST24HOURS
    • DB_HISTORY
    • LOG_UTILIZATION
    • LONG_RUNNING_SQL
    • LOCKS_HELD
    • LOCKWAITS
    • CONTAINER_UTILIZATION
    • TBSP_UTILIZATION
    • ADMIN_GET_TAB_INFO, ADMIN_GET_INDEX_INFO
    • APPL_PERFORMANCE
    • BP_HITRATIO, BP_READ_IO, BP_WRITE_IO
    • TOP_DYNAMIC_SQL
    • MON_GET_BUFFERPOOL, MON_GET_TABLE, MON_GET_TABLESPACE, …


db2top

• What is it?
    • DB2 dynamic monitoring tool. But so are snapshots?
    • Snapshot data is typically cumulative (point in time)
    • db2top provides delta values in real time
    • db2top integrates multiple types of snapshot data together
    • Diagnose performance and resource problems, optimize
    • Provides a unified, single-system view of a multi-partition or single-partition database, giving a dynamic real-time view of a running DB2 system

• Availability: V8 FP17, V9.1 FP6, V9.5 FP2

• developerworks whitepaper:

http://www.ibm.com/developerworks/data/library/techarticle/dm-0812wang/
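The difference between cumulative snapshot counters and the delta values db2top displays can be shown with a little arithmetic. The sketch below is not how db2top is implemented; the counter names and numbers are invented for illustration.

```python
def deltas(previous, current, interval_seconds):
    """Per-second change between two cumulative counter snapshots."""
    return {
        name: (current[name] - previous[name]) / interval_seconds
        for name in current
        if name in previous
    }


# Two cumulative snapshots taken 2 seconds apart (made-up values).
snap_t0 = {"rows_read": 1_000_000, "commits": 5_000}
snap_t1 = {"rows_read": 1_000_600, "commits": 5_004}

print(deltas(snap_t0, snap_t1, interval_seconds=2))
# {'rows_read': 300.0, 'commits': 2.0}
```

The cumulative totals say little about what is happening right now; the rates (300 rows read and 2 commits per second) are what a real-time view is after.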


When Unexpected Problems Occur – db2cos :DB2 Call Out Script

• Invoked automatically, by default, when the database manager cannot continue processing due to a panic, trap, segmentation violation or exception

• Each default db2cos script will invoke db2pd commands to collect information in an unlatched manner

• Order of events when trap/panic occurs:
    • Trap file is created
    • Signal handler is called
    • db2cos script is called
    • An entry is logged in the administration notification log
    • An entry is logged in the db2diag log file
    • Information is generated by db2pd (operating system, the Version and Service Level installed, the database manager and database configuration, as well as information about the state of the agents, memory pools, memory sets, memory blocks, applications, utilities, transactions, …)


db2support – Diagnostic Data Collection

• Collects environment data about either a client or server machine and places the files containing system data into a compressed file archive

• Collects db and dbm cfg info, reg var settings, db2level, db2diag.log, … all into one file called db2support.zip

• Collects diagnostics generated by db2cos* scripts

• To assist with Optimizer (plan) problems: db2supp_opt.zip within db2support.zip
    • This file contains a host of optimizer related info: query text, exfmt output (explain of query), db2look and db2batch output, catalog table info, …

db2support . -d <dbname> -st "select * from bminor.staff"


db2pdcfg

• Basically a tool which can set or change flags which influence the database system behavior for problem determination purposes
    • Control behaviors of cos and fodc (more to come on this one)

• Neat option: db2pdcfg -flushbp -db <dbname>
    • Allow faster failover from primary to secondary in HADR setups
    • Monitor and tune bufferpools for number of dirty pages (use in conjunction with -pages and -bufferpools)
    • Sync in-memory tablespace meta-data structures with on-disk versions
    • Execute before using db2dart /DHWM against active database


First Occurrence Data Capture (FODC)

• A facility (V9.5) to capture diagnostic information when a problem first occurs
    • Automatic mode: invoked by DB2 engine when outage or error condition is detected
        • Traps, panics, corruptions
    • Manual mode: user initiated – db2fodc
        • Hangs, performance issues

• Current setting: db2pdcfg -fodc status
    • db2pdcfg can be used to change defaults

• Information collected and placed in subdirectory under DIAGPATH
    • FODC_<outagetype>_<timestamp>

• Run db2support after db2fodc to collect all the fodc information and package into one zip file that can be sent to Support
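Because the FODC subdirectory name encodes both the outage type and the time of capture, it can be picked apart when sorting through a crowded DIAGPATH. A small sketch, assuming the timestamp layout seen in the directory name from the db2fodc example in this deck (FODC_Hang_2009-03-03-11.33.19.657318); other releases may format it differently, and the helper name is made up.

```python
import re
from datetime import datetime

# FODC_<outagetype>_<timestamp>, timestamp as YYYY-MM-DD-hh.mm.ss.ffffff
FODC_DIR = re.compile(
    r"^FODC_(?P<symptom>[A-Za-z]+)_"
    r"(?P<stamp>\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6})$"
)


def parse_fodc_dir(name):
    """Return (outage type, capture time), or None if the name does not match."""
    match = FODC_DIR.match(name)
    if match is None:
        return None
    when = datetime.strptime(match.group("stamp"), "%Y-%m-%d-%H.%M.%S.%f")
    return match.group("symptom"), when


print(parse_fodc_dir("FODC_Hang_2009-03-03-11.33.19.657318"))
```

Sorting the parsed times gives the order in which captures happened, which helps identify the first occurrence.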


db2fodc Example

Example: db2fodc -hang -db SAMPLE

2009-03-03-11.33.19.657457-300 E1683E654 LEVEL: Warning
PID : 619 TID : 47350926685104 PROC : db2fodc
INSTANCE: bminor NODE : 000
FUNCTION: DB2 UDB, RAS/PD component, pdDb2FODCMain, probe:30
MESSAGE : ADM14003W FODC has been invoked by the user from db2fodc tool for symptom "hang" and diagnostic information has been recorded in directory "/home/bminor/sqllib/db2dump/FODC_Hang_2009-03-03-11.33.19.657318". Please look in this directory for detailed evidence about what happened and contact IBM support if necessary to diagnose the problem.

File Listing for ../sqllib/db2dump/FODC_Hang_2009-03-03-11.33.19.657318

6116.1.000.stack.txt   6116.17.000.stack.txt  6116.27.000.stack.txt  6116.31.000.stack.txt  DB2CONFIG
6116.11.000.stack.txt  6116.18.000.stack.txt  6116.28.000.stack.txt  6116.32.000.apm.bin
6116.12.000.stack.txt  6116.19.000.stack.txt  6116.29.000.apm.bin    6116.32.000.dump.bin
6116.13.000.stack.txt  6116.20.000.stack.txt  6116.29.000.dump.bin   6116.32.000.stack.txt  DB2PD
6116.14.000.stack.txt  6116.21.000.stack.txt  6116.29.000.stack.txt  6116.33.000.apm.bin    DB2SNAPS
6116.15.000.dump.bin   6116.22.000.stack.txt  6116.30.000.apm.bin    6116.33.000.dump.bin   DB2TRACE
6116.15.000.stack.txt  6116.23.000.stack.txt  6116.30.000.dump.bin   6116.33.000.stack.txt  OSCONFIG
6116.16.000.apm.bin    6116.24.000.stack.txt  6116.30.000.stack.txt  6116.34.000.apm.bin    OSSNAPS
6116.16.000.dump.bin   6116.25.000.stack.txt  6116.31.000.apm.bin    6116.34.000.dump.bin   OSTRACE
6116.16.000.stack.txt  6116.26.000.stack.txt  6116.31.000.dump.bin   6116.34.000.stack.txt
db2eventlog.000        db2fodc_hang.log


Resiliency

• Avoiding and dealing with ‘errors’ that can result in unplanned outages


Online Data Consistency Checking (V9.5)

• Index to Data consistency problems can be categorized as:

Orphaned index entry: An index entry points to a deleted or non-existent record slot

Invalid index entry: An index entry points to a valid slot on a data page, but index key data does not match record data

Missing key: A data row is not indexed in at least one of the indexes on the table

Duplicate Record IDs (RIDs): A data row is indexed more than once

These are classified as logical corruptions.
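The four categories can be made concrete with a toy model: rows keyed by RID, and an index that is just a list of (key, RID) pairs. This is only an illustration of the definitions above, not how DB2 performs the check; all names and data are invented.

```python
from collections import Counter


def check_index(rows, index_entries):
    """Classify index-to-data inconsistencies in a toy table.

    rows: dict mapping RID -> key value stored in the data row
    index_entries: list of (key, RID) pairs held by the index
    """
    findings = []
    rid_counts = Counter(rid for _, rid in index_entries)
    for key, rid in index_entries:
        if rid not in rows:
            findings.append(("orphaned index entry", rid))
        elif rows[rid] != key:
            findings.append(("invalid index entry", rid))
    for rid in rows:
        if rid_counts[rid] == 0:
            findings.append(("missing key", rid))
        elif rid_counts[rid] > 1:
            findings.append(("duplicate RID", rid))
    return findings


rows = {1: "ADAMS", 2: "BAKER", 3: "CLARK", 4: "DAVIS"}
index_entries = [
    ("ADAMS", 1),
    ("BAKKER", 2),   # key does not match the row
    ("DAVIS", 4),
    ("DAVIS", 4),    # same row indexed twice
    ("EVANS", 9),    # points at a row that does not exist
]                    # row 3 has no index entry at all

for finding in check_index(rows, index_entries):
    print(finding)
```

Each finding involves pages that are individually well formed, which is why these problems count as logical rather than physical corruption.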


Key-Data Checking via DB2 INSPECT (V9.5)

• New INDEXDATA cross-object checking clause to the INSPECT utility

Example:

db2 inspect check table name STAFF indexdata results keep insp.out

NOTE:

The INDEXDATA option only examines the logical inconsistency between index and data. Therefore, it is recommended that you first run INDEX and DATA checking separately, to rule out any physical corruption, before running INDEXDATA checking.


Threaded Engine Model for Unix in V9.5

• DB2 has always had a threaded engine model on Windows platforms

• In V9.5, DB2 Unix engine has now evolved from process-based to thread-based model

• All threads within a process can share the memory allocated by, or within, that process

• Several advantages: basically performance and memory management

• From a DBA perspective:
    • Better administration and tuning of memory through administration
    • Resiliency improvements (upcoming slides)


DB2 Processes vs. Threads: ps -ef

• V9.1:

bminor 6107356 4780132 0 13:00:53 - 0:00 db2sysc
bminor 6148108 6717616 0 13:01:03 - 0:00 db2agent (SAMPLE)
bminor 6180952 1458238 0 13:08:22 - 0:00 db2pclnr
bminor 6193316 1458238 0 13:08:22 - 0:00 db2pclnr
bminor 6660150 6107356 0 13:00:53 - 0:00 db2licc
bminor 6717616 6107356 0 13:00:53 - 0:00 db2ipccm
bminor  884882 1458238 0 13:08:22 - 0:00 db2loggw (SAMPLE)
bminor 1237024 1458238 0 13:08:22 - 0:00 db2lfr (SAMPLE)
bminor 1458238 6107356 0 13:00:53 - 0:00 db2gds
bminor 1556708 6717616 0 13:01:07 - 0:14 db2stmm (SAMPLE)
bminor 1757202 1458238 0 13:08:22 - 0:00 db2dlock (SAMPLE)
bminor 1908750 1458238 0 13:08:22 - 0:00 db2pclnr
bminor 2195682       1 0 13:08:21 - 0:00 /home4/bminor/sqllib/bin/db2bp

• V9.5:

bminor 10364 10362 1 13:15 pts/19 00:00:00 db2sysc
bminor 10377 10362 1 13:15 pts/19 00:00:00 db2acd,0,0,0,1,0,0,1,0,8a18
bminor 10385     1 0 13:15 pts/19 00:00:00 /home/bminor/sqllib/bin/db2bp


Running Threads: db2pd -edus

Database Partition 0 -- Active -- Up 2 days 17:59:14

List of all EDUs for database partition 0

db2sysc PID: 1187
db2wdog PID: 1185
db2acd PID: 1201

EDU ID  TID             Kernel TID  EDU Name                      USR       SYS
================================================================================
65      47726691608896  20490       db2agntdp (SAMPLE )           0.000000  0.000000
64      47726695803200  999         db2agent (SAMPLE)             0.240000  0.110000
63      47726708386112  998         db2agent (SAMPLE)             0.720000  0.160000
61      47726712580416  308         db2evmli (DB2DETAILDEADLOCK)  0.000000  0.000000
60      47726716774720  307         db2wlmd (SAMPLE)              0.000000  0.000000
59      47726720969024  306         db2pfchr (SAMPLE)             0.000000  0.000000
57      47726737746240  304         db2pfchr (SAMPLE)             0.000000  0.000000
55      47726733551936  302         db2pclnr (SAMPLE)             0.000000  0.000000
54      47726729357632  301         db2pclnr (SAMPLE)             0.000000  0.000000
53      47726746134848  300         db2dlock (SAMPLE)             0.000000  0.000000
52      47726762912064  32767       db2lfr (SAMPLE)               0.000000  0.000000
51      47726741940544  32766       db2loggw (SAMPLE)             0.020000  0.030000
50      47726758717760  32765       db2loggr (SAMPLE)             0.000000  0.030000
49      47726699997504  1554        db2taskd (SAMPLE)             0.000000  0.000000
25      47726754523456  1365        db2stmm (SAMPLE)              8.020000  0.460000
16      47726767106368  1265        db2agent (SAMPLE)             1.420000  0.810000
15      47726771300672  1199        db2resync                     0.000000  0.000000
14      47726775494976  1196        db2ipccm                      0.080000  0.030000
13      47726779689280  1195        db2licc                       0.000000  0.000000
12      47726783883584  1194        db2thcln                      0.000000  0.000000
11      47726788077888  1192        db2alarm                      0.020000  0.020000
1       47726528031040  1191        db2sysc                       5.420000  5.620000


Memory Protection via Storage Keys (V9.5)

• DB2 is leveraging technology which is specific to the POWER6 platform and AIX (TL5300-06 or greater)
    • Hardware feature called Storage Keys
    • Allows for memory protection functionality

• DB2 V9.5 has the capability to detect erroneous access to critical regions of its own memory, primarily bufferpool memory

• This is pro-active protection:
    • When such access is detected it is stopped before a corruption can result
    • Offending UDF or Stored Procedure receives an error code indicating that an incorrect memory access was performed

• Cannot protect against all types of bugs but it does provide an added layer of protection


Enablement of Memory Protection via Storage Keys (V9.5)

• Registry variable: DB2_MEMORY_PROTECT = [ NO | YES ]

• Error DBI1301E on attempted set if:
    • Hardware, OS does not support storage keys
    • Large page support (i.e. DB2_LARGE_PAGE_MEM) is enabled


Higher Availability with Storage Key Support

• Thread Suspension: An additional level of protection that will increase availability of systems when abnormal termination of a DB2 Kernel Thread occurs

DB2_THREAD_SUSPENSION = [ OFF | ON ]

• What is it? It allows you to control whether a DB2 instance sustains a trap by suspending a faulty DB2 kernel thread (a thread that has tried illegally to access memory protected with storage protection keys).


Error Toleration (TP)

• Goal: Reduce unplanned outages

• Error toleration via:
    • Trap resilience
    • Physical read error toleration
    • Logical read error toleration

• Registry variable:
    • DB2RESILIENCE=ON/TRUE or not defined (i.e. default behavior) activates error toleration support for the above
    • DB2RESILIENCE=OFF/FALSE disables these capabilities


Extended Trap Resilience (TP)

• Registry Variable:

DB2RESILIENCE= [ ON | OFF ]

• When a DB2 server detects a trap the following actions will be taken by DB2:
    • First Occurrence Data Capture (FODC) will be activated
    • Execution stack is dumped
    • If Storage Key support is active then process thread suspension decisions
    • If Storage Key suspension decisions allow for thread suspension, suspend thread
    • If Storage Key support is not active or fails decision points to suspend thread, then proceed with non Storage Key survival decisions (i.e. utilize this new capability)
    • DB2 will determine if the current environment is safe to allow recovery


Messaging:

2008-11-17-09.56.27.542572-300 I75439E563 LEVEL: Severe
PID : 28040 TID : 47643040409920 PROC : db2sysc
INSTANCE: marvin NODE : 000 DB : TESTDB
APPHDL : 0-21 APPID: *LOCAL.marvin.081117145537
AUTHID : MARVIN
EDUID : 29 EDUNAME: db2agent (TESTDB)
FUNCTION: DB2 UDB, RAS/PD component, pdResilienceIsSafeToRollback, probe:800
DATA #1 : String, 37 bytes
Trap Sustainability Criteria Checking
DATA #2 : Hex integer, 8 bytes
0x00000000000A0400
DATA #3 : Boolean, 1 bytes
true

DATA #1  A generic description of the data following
DATA #2  The hex dump of the trap sustainment criteria
DATA #3  The decision on whether to attempt to sustain the trap (true/false)


Physical Read Error Toleration (TP)

• Physical read errors are detected when retrieving DB2 pages from the storage medium (for example, a disk drive).

• This error could be caused by issues with the contents of the data on disk or by problems accessing the physical device on which the data resides. Currently, when DB2 detects either of these errors the DB2 instance is shut down.

• This solution will provide enhanced toleration of physical read errors such that if such an error is encountered, rather than force the DB2 instance offline, the SQL statement or operation being executed will fail, a new SQLCODE will be returned to the application when appropriate, logging of appropriate diagnostics will occur, and the DB2 instance will remain active.


Toleration of Physical Read Errors

• When a DB2 server detects a physical read error there will be a maximum of two more attempts to read the page from disk and validate the DB2 page header. If at the end of the multiple attempts an error is still determined to exist then:
    • If DB2RESILIENCE=ON
        • SQLCODE -1655 will be returned to the application
        • The SQL statement or DB2 operation fails
        • Administration message ADM6006E will be placed in the DB2 Administration log at notify level URGENT
        • Administration message ADM6006E will be placed in the DB2 diagnostic log at level URGENT
        • First Occurrence Data Capture (FODC) will be activated when the error is detected and will be turned off when the sqlcode is returned to the application. This will allow collection of diagnostic data generated due to this failure detection for further analysis by DB2 Support.
        • The database instance remains available
        • Application connection remains available for continued operations
    • If DB2RESILIENCE=OFF then panic the instance (that is, revert to old behavior)

• This algorithm will aid in reducing application errors which are a result of transient file system I/O errors.
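The retry-then-decide flow described above can be sketched in a few lines. This models only the decision logic from this slide (one read plus at most two more attempts, then either an error to the application or an instance panic). The exception classes and callables are invented for the sketch and do not correspond to DB2 internals.

```python
class InstancePanic(Exception):
    """Stands in for the old behavior: the instance is brought down."""


class ReadError(Exception):
    """Stands in for the error returned to the application."""
    sqlcode = -1655


MAX_EXTRA_ATTEMPTS = 2  # "a maximum of two more attempts"


def read_page(read_once, header_is_valid, resilience_on=True):
    """Read a page, retrying transient failures before giving up."""
    for _attempt in range(1 + MAX_EXTRA_ATTEMPTS):
        try:
            page = read_once()
        except OSError:
            continue  # device-level failure: try again
        if header_is_valid(page):
            return page
    if resilience_on:
        # Statement fails; instance and connection stay available.
        raise ReadError("page could not be read (SQLCODE -1655)")
    raise InstancePanic("physical read error")


# A transient error that clears on the third attempt is invisible to the caller.
attempts = []

def flaky_read():
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError("transient I/O error")
    return b"page-bytes"

print(read_page(flaky_read, header_is_valid=lambda page: True))
```

With the toleration switched off, the same persistent failure raises InstancePanic instead, which mirrors the "revert to old behavior" bullet.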


Logical Read Error Toleration (TP)

• Inconsistency detected by DB2 when processing data contained within a DB2 page

• Instead of hard failure (database or instance shutdown), isolate the issue and allow overall database use to continue
    • Problematic page is removed from the bufferpool, relevant diagnostics are generated, notification is returned (messages to admin and db2diag.log and new sqlcode to application)

• Leveraged within the engine where it makes sense


Messaging

SQL1656E An error occurred while processing data. The operation could not be completed, but the database remains accessible. Contact IBM Software Support.

Explanation: The failure to complete the operation is due to a detected DB2 page inconsistency. The SQL statement has failed or the operation has been aborted. The database remains accessible.

User response: Contact IBM Software Support for instructions on what diagnostic data to collect to assist in resolving this issue. The application can retry the operation, although it may continue to fail.

sqlcode: -1656
sqlstate: 58004

ADM6007C DB2 detected an error while processing page page-number from table space tbspace-id for object object-id of object type object-type.

Explanation: DB2 was not able to complete the operation, but the database remains accessible. This may indicate the presence of a serious problem that could result in a future outage.

User response: Please contact IBM Software Support and they will guide you through the proper corrective actions.


Resources To Help You

• Developerworks (Best Practices Papers, …)

• IBM Database Magazine

• IDUG Web Site and Solutions Journal

• DB2 LUW Chat with the Lab

http://www-01.ibm.com/software/data/db2/9/labchats.html

• IBM DB2 Support home:

www.ibm.com/software/data/db2/support/db2_9

• DB2 Fixes:

http://www-01.ibm.com/software/data/db2/support/db2_9/download.html

• Conferences, RDUGs, Customer Lab Advocate, Early Adoption Programs, …


Bill Minor, IBM

[email protected]

The DB2 Detective – The Case of the Diligent DBA