SQL Server Disaster Recovery 101

SQL Server Disaster Recovery 101Dean [email protected]

SQL Server UGZagreb, 16.4.2014.

Transaction log• Every data modification is first written to

tlog• Before the transaction commits • Delayed durability in 2014!

• Data pages are updated in memory • CHECKPOINT writes dirty pages to disk

• Both committed and uncommitted • Frequency depends on recovery interval

• When TX rolls back, state of the data prior to update is read from tlog

• Space in transaction log is being freed and reused • When exactly this happens, depends on recovery model used

Transaction log and recovery • If SQL Server service shuts down, recovery will

happen on the next startup • These can happen: • Dirty pages from uncommitted tx not written to disk • Dirty pages from committed tx written to disk • Dirty pages from uncommitted tx written to disk

• Have to be rolled back (undo) • Dirty pages from committed tx not written to disk

• Have to be rolled forward (redo) • Without the tlog, transactional consistency of the

database cannot be guaranteed

Recovery models • Three recovery models in SQL Server

• SIMPLE recovery model • FULL • BULK-LOGGED

• Model database is in FULL recovery model • If you want newly created databases be in SIMPLE, change

recovery model for model database

SIMPLE recovery model• Full and differential backups only • Transaction log is not backed up • Space in tlog is freed when the data is

committed and checkpointed to disk • Eliminates the need for managing tlog

space • Restore only to the most recent backup • Consider not using SIMPLE model on

production system

FULL recovery model • Full, differential, and log backups • Log backups are required • Space in tlog is freed when the data is

committed and checkpointed to disk and the log is backed up

• Provides full recoverability • Up to the point of failure • Point in time restore

BULK-LOGGED recovery model • Same as the FULL model, but:

• Bulk operation are only minimally logged • In general, only allocation and deallocation of extents is

logged • Increased performance and lower log

space consumption • Use FULL, but switch to BULK-LOGGED

for bulk operations • Bulk insert and bcp • SELECT INTO • ALTER INDEX … REBUILD • etc

Backup and restore • Full, differential, and transaction log

backup • Log backup in full and bulk-logged model

• Backup strategy • How much data I am permitted to lose? • How much time I am permitted to be down?

• Use differential backups to minimize the restore time

• Check your backups • If you can’t restore from backup..

• Use WITH CHECKSUM option • Backup chain

• Use COPY_ONLY

Full Backups• Creates starting point for all backups• Contains all used pages in db• Can run concurrently to transactions

• ~2% processor overhead• Completely non-blocking backup

• Transaction log• Database growth/shrink

9

Full Backups - Operation• Checkpoint• Mark log• Dump all used pages• Mark log• Copy portion of log between marks

10

Transaction LogMark log

Data changesMark log

Differential Backups• Contains only pages changed since

last full backup• Differential bitmap

• NOT an incremental backup

11

0001001110001001000101100

Bitmap

Transaction Log• Backup the inactive portion of log

• Committed transaction• Clears backed up portion on

completion• Only allowed in full recovery or

bulk_logged mode

12

Partial Backups• Partial database

• Primary filegroup• Read/write filegroups

• Partial differential• Primary filegroup differential• Read/write filegroups differential

13

2014 backup and restore• Backup on-premise database to cloudBACKUP DATABASE AdventureWorks2012 TO URL = 'https://mystorageaccount.blob.core.windows.net/mycontainer/AdventureWorks2012.bak' WITH CREDENTIAL = 'mycredential'

• Supports:• DATABASE, LOG, DIFFERENTIAL, COPY_ONLY, PARTIAL,

COMPRESSION, CHECKSUM• Restore from cloud backup

• SQL Server Managed Backup to Windows Azure• smart_admin

• Backup encryption

I/O errors • Three types

• 823 (a hard I/O error) • Page can’t be read from disk

• 824 (a soft I/O error) • Page was read, but is torn

(inconsistent) • 825 (a read-retry error)

• Page was read, but after retries • Logged in msdb..suspect_pages • Input into single-page restore operations

PAGE_VERIFY • Page is composed of 16 512-byte

sectors • It is possible to be not written entirely • Allows fast detection of problems • Three options

• NONE • TORN_PAGE_DETECTION • Writes two control bits in each sector

• CHECKSUM • Calculates and writes a checksum of data on the page • Written last, checked first

• Failures cause 824 error

DBCC CHECKDB • Checks the logical and physical integrity

of all the objects in the specified database • DBCC CHECKDB db_name WITH ALL_ERRORMSGS, NO_INFOMSGS

• How often? It depends. • DBCC CHECKDB could take a long time to

run • Depends on

• Size of database • Load • Number and type of corruptions

• Run DBCC CHECKDB on restored database

DBCC CHECKDB, contd• Did CHECKDB fail?

• If it stops before completing successfully, something bad has happened that is preventing CHECKDB from running

• This means there is no choice but to restore from a backup as CHECKDB cannot be forced to run (and hence repair)

• Examples of fatal (to CHECKDB) errors • 7984 – 7988: corruption in critical system tables • 8967: invalid states within CHECKDB itself • 8930: corrupt metadata in the database such that

CHECKDB could not run

• Are the corruptions only in non-clustered indexes? • If recommended repair level is REPAIR_REBUILD, then YES! • Otherwise, check all the index IDs in the errors – if they’re

all greater than 1, then YES! • If YES, you *could* just rebuild the

corrupt indexes • Depends on the error, and the size of the index • But, what caused the corruption? • If you just rebuild the indexes, the corruption will probably

happen again (especially if caused by the IO subsystem) • Make sure you do root-cause analysis and take

preventative measures

• Was there an un-repairable error found? • 8909, 8938, 8939 (page header corruption) errors where the type is

‘PFS’ • 8970 error: invalid data for the column type • 8992 error: CHECKCATALOG (metadata mismatch) error • Plus a few more obscure ones

• E.g. an 8904 error (extent is allocated to two objects). This is usually repairable except in the case where the extent is marked as mixed and dedicated, and has pages allocated to multiple objects. • The repair is too complicated and/or destructive so is not attempted.

• None of these can be automatically repaired

Restore or repair? • REPAIR_ALLOW_DATA_LOSS is

destructive • Lot of errors cannot be fixed

automatically • Restore is better if you have a good

backup and proper backup strategy

Documents

SQL Server Disaster Recovery 101