Recovery In Parallel Database Systems - Springer978-3-322-93860-2/1.pdfsystem or transmitted, ... I express the most sincere thanks to Telenor R&D for initiating, ... It then continues

Svein-Olaf Hvasshovd

Recovery In Parallel Database Systems

Database Systems edited by Thea Haerder and Andreas Reuter

In this special book series the interested student or scholar will find leading textbooks and monographs focusing on one of the most prospective areas in the fields of computer science.

Though the classical databose systems have been ranking among the most crucial EDP-opplications for several years, today's development is characterized by new technical concepts of highest relevance for practical use.

The series offers sound information on both the basics and the applicational potential. Among others, the books in this series are treating subjects like open database systems, knowledge based or object oriented database systems, multimedia and CAx databases.

In German:

Hochleistungs-Transaktionssysteme by Erhard Rahm

Datenbank-Integration von Ingenieuranwendungen by Christoph Huebel and Bernd Sutler

Datenbanken in verteilten Systemen by Winfried lamersdarf

Das 8enchmark-Handbuch by Jim Gray

In English:

Recovery in Parallel Database Systems by SveinDlaf Hvasshovd

Vieweg

Svei n-Olaf Hvasshovd

Recovery in Parallel Database Systems

II Vleweg

Verlag Vieweg, P.O. Box 58 29, D-65048 Wiesbaden

All rights reserved © Friedr. Vieweg & Sohn VerJagsgesellschaft mbH, Braunschweig/Wiesbaden, 1996

Vieweg is a subsidiary company of the Bertelsmann Professional Information.

No part of the publication may be reproduced, stored in a retrieval system or transmitted, mechanical, photocopying or otherwise, without prior permission of the copyright holder.

Printed on acid-free paper

ISBN 978-3-528-05411-3 ISBN 978-3-322-93860-2 (eBook) DOI 10.1007/978-3-322-93860-2

Nowhere has the philosophical argument been evaded, but where it runs out into too thin a thread the Author has preferred to cut short, andfall back upon the corresponding results of experience; for in the same way as many plants only bear fruit when they do not shoot too high, so in the practical arts the theoretical leaves andfiowers must not be made to sprout too far, but kept near to experience, which is their proper soil.

Carl von Clausewitz

Preface

The relational DBMS technology is a success in the commercial marketplace with respect to business data processing and related applications. This success is a result of cost effective application development combined with high data consistency. The success has led to the use of relational DBMS technology in other application environments requesting its traditional virtues, while at the same time adding new requirements. For example, the integration of relational DBMS technology with telecommunications has led to such combined requirements as transaction rates of more than 10.000 TPC-B like transactions per second, transaction service availability of certain transaction classes of 99,999%, and transaction response time under 10 milliseconds. The current trends in hardware price/performance and parallel technology make relational DBMSs which support very high transaction rates, very high availability, and soft real-time transaction response a cost effective possibility.

This book describes a recovery method supporting the traditional high DBMS data consistency while offering very high transaction rates, continuous availability, and soft real-time transaction response. It also presents high availability approaches used by commercial DBMS products.

Acknowledgements

The book is based on my doctoral (Dr.Ing.) thesis. Due to an extreme work load over a long period it has taken some time to complete the book transformation. My Dr.lng. research and the following book transformation period clearly demonstrated to me the value of true and trustful friendship. I want to thank all good friends for their kind support and encouragement and their ability to bear with me at all times. It is a great pleasure to express deep gratitude for the help I have received in writing the thesis.

First and foremost I wish to acknowledge Prof. Dr. Andreas Reuter and Prof. Dr. Theo Haerder for giving me the opportunity to publish the result of my research. I also wish to acknowledge Prof. Kjell Bratbergsengen for his Dr.Ing. advisory.

A special thanks goes to my colleagues Dr. 0ystein Torbj!<'lrnsen and Oddvar Risnes for their constructive criticism and helpful motivation. I have been fortunate to work closely with them on the HypRa project and also on the continuation TeleServe and ClustRa projects.

I also wish to thank in particular my colleagues at Telenor R&D and my former colleagues at the Database Technology Group at SINTEF DELAB for a good inspiring working environment. A special thanks to Dr. Svein Erik Bratsberg, Ole-Hj. Kristensen, Dr. Stein Dale, and Steinar Haug for their unending encouragements. I also want to thank Cindy Kandolf for her excellent proofreading.

It would be remiss not to acknowledge the support and sponsorship for my research. First of all, I express my gratitude to my employer Telenor R&D, for the support and sponsorship given during the production of this book. Secondly, I want to express my gratitude to my former employer SINTEF DELAB which was willing to carry the burden of financing me at the darkest hours and

VI

continuously sponsoring me throughout the Dr.Ing. research period. Secondly, I want to express my gratitude to my current employer TeleServe Transaction Technology AS for financing the final stage of this work and for commercialising the results from this research. I also wish to thank the Royal Norwegian Council for Scientific and Industrial Research (NFR) for granting me a two year Dr.Ing. scholarship. The generous support given by the previous director of RUNIT, Dr. Steinar K vitsand, will never be forgotten.

I express the most sincere thanks to Telenor R&D for initiating, funding, and managing the HypRa project and for deciding to continue the research program through the TeleServe and ClustRa projects. I am happy to say that these projects provided the motivating applications, the research environment, and the application harness for my own research, without which the presented results would hardly have been obtainable.

I like to thank IBM Norge AS, Informix Software AS, IT-Technology AS, Oracle Norge AS, Sybase Norge AS, and Tandem Computers AS for all their help with supplying me with product documentation.

Plan of the Book

This book is divided into six parts. The first part introduces motivations and requirements for continuously available database management systems (DBMSs). It also presents the general basic requirements and design goals for DBMS recovery methods.

The second part of the book gives a systematic presentation and analysis of design approaches for DBMS recovery systems developed over more than fifteen years, primarily focusing on their effect on availability and transaction service response times. It starts with the older block oriented recovery approach, which represented the state of the art of the late 1970s. Recovery methods based on this approach are still in use in successful existing DBMS products. It then continues with the more modem record oriented recovery approach, which represented the state of art in the first half of the 1980s. This approach also provides the basis for the recovery methods used in successful DBMSs. It then continues with the compensation oriented recovery approach, which is the most modem of the recovery approaches implemented in commercial systems. This approach represented the state of the art in the late 1980s. The table oriented approach is the last of the presented recovery approaches. It represents an extension of the previous approaches and is only partially implemented in current commercial systems.

The third part presents the design goals and the recovery approach for the HypRa DBMS. The basic hardware and software architectures of the HypRa DBMS is outlined. A more detailed presentation of the HypRa architecture is given in [HST91aj. The HypRa recovery approach is to a large extent based on the principles from the table oriented recovery approach. To meet the response requirement from soft real-time transaction services, it supports main memory logging. To support very high availability, multiple replicas of relations are maintained online. If a primary replica fails, a hot stand-by takes over as primary without interrupting the transaction service availability. Fast automatic online recovery and repair are provided to reestablish the initial fault-tolerance level.

The fourth part of the book presents other high availability DBMS approaches than the HypRa approach. An overview of central design strategies for high availability approaches is given. Two commercial DBMS product solutions based on the shared disc and seven commercial DBMS product solutions based on the shared-nothing hardware architectures are presented. The product presentations are based on public product documentation. Some of the conclusions are deduced from information pieces obtained from multiple sources which unfortunately increases the probability of errors. The responsibility for such errors is mine. I hope they are few in number.

VII

The fifth part of the book presents the conclusions to be drawn from this research. Directions for further work are also indicated.

The sixth part of the book introduces basic concepts relating to databases, transactions, and concurrency control in a DBMS.

VIII

Abstract

This book presents a recovery method for a distributed DBMS based on a multi-computer, with a shared-nothing hardware architecture and a high capacity and low latency inter-node communication network. The recovery method is intended to support: very high transaction service availability, 99,999% availability for simple transactions; very high simple transaction load, 10.000 TPC-B equivalent transactions per second; and soft real-time transaction response, 10 millisec. response time for single tuple transactions.

The main recovery approaches developed over the last two decades are systematically presented and analysed, in particular with respect to how well they fulfill the requirements for availability and soft real-time response. The availability aspect includes both normal operations and fast recovery. An obvious conclusion can be drawn; availability has improved markedly over these years, in particular for those systems that have been around for a while and have shown the ability to adapt to users' needs. Most elements needed in a recovery approach to provide very high availability are present in the newest main approach. The element that is particularly lacking is the ability to perform online non-blocking repair, i.e. to reestablish automatically, online, and without blocking, the initial fault-tolerance level after a recovery. The analysis of soft real-time response is mostly focused on commit processing. All the main recovery approaches analysed are still bound by the response times of disc accesses with its dire effect on transaction response times. The trend, though, is to gradually accept main memory logging.

The HypRa architecture, which supports multiple replicas of relations, i.e. one primary and one or more hot stand-by replicas is introduced. The relations are horizontally fragmented and allocated over the available nodes in the multi-computer, so that two replicas of a fragment are located to nodes with independent failure modes. Log and locks for a fragment replica are clustered to the node in which the fragment is located. The HypRa recovery approach uses main memory logging on multiple nodes with independent failure modes to avoid disc access delays during commit processing and thereby support the soft real-time response requirements. An update channel maintains the write-ahead logging properties between primary and hot stand-by logs. Operations to hot stand-by replicas are performed as redo operations based on the log. Updates to hot stand-by replicas are deferred but guaranteed to be successfully executed after the commit of a transaction. State-identifiers are connected to tuples to support non-blocking fuzzy replication of tuples.

The HypRa fault-tolerance is based on HW failure detection, and SW online failure masking, online recovery, and online non-blocking repair. The HypRa recovery approach is founded on: i) very fast take-over to preserve availability across a primary fragment failure; ii) fast recovery to recover and catch-up failed fragments to the current primary fragments; and iii) fast repair that replicates lost and unavailable replicas after a failure in order to reestablish the original fault-tolerance level. Hot stand-by fragments maintain locks to minimise unavailability during a take-over. The fast recovery utilises main memory aggregation and CPU and 110 parallelism to speed up recovery and catch-up work. Repair is performed online and is non-blocking to avoid threatening soft real-time transaction service availability, and is completed within a given time window to meet the requirements to fault -tolerance.

IX

In order to support very high transaction service availability, novel methods for online and nonblocking DB maintenance are needed. The book introduces such novel methods for fragment replication, index production, redistribution, and transaction recompilation. Online and nonblocking methods for SW and HW maintenance are also presented, to complete the spectrum of online and non-blocking maintenance operations.

x

Contents

I Introduction

1 A Continuously Available, High Transaction Capacity, Soft Real-Time DBMS Server 1.1 Motivation........ ............... . 1.2 Requirements for Continuously Available DBMS Servers 1.3 Fault-Tolerance Concepts .

2 DBMS Recovery Requirements 2.1 The DBMS Service ... 2.2 The DBMS Reliability . . . 2.3 The DBMS Failure Model . 2.4 Node and Fragment Crash Recovery, Masking and Repair. 2.5 The DBMS Server Architecture . . . . . . . 2.6 The DBMS Recovery Methods Design Goals . . . . . . .

II Analysis of Centralised DBMS Transaction and Node Crash Recovery Approaches

3 Introduction

4 The Block Oriented Approach 4.1 The Block Oriented DBMS Server Architecture 4.2 The Block Oriented Recovery Classification Scheme . . . . . . . . . . .

1

3 3 4

5

9 9 9

10

11

13 14

19

21

23 23 25

4.3 The Impact of Volatile DB Buffer on Transaction and Node Recoverability 25 4.4 The Log Strategy. . . . . . . . . . 26 4.5 The DB Block Propagation Strategy 30 4.6 The DB Buffer Strategy . . . . . . 31 4.7 The Checkpoint Strategy. . . . . . 33 4.8 The Combined Block Oriented Classification Scheme 40 4.9 Conclusion . . . . . . . . .

5 The Record Oriented Approach 5.1 The Record Oriented DBMS Server Architecture 5.2 The Record Oriented Classification Scheme 5.3 The Record Logging Strategy 5.4 The State-Identifier Strategy . . . . . . . . 5.5 The Record Access Strategy ....... . 5.6 The Transaction Failure Recovery Strategy 5.7 The Node Crash Recovery Strategy 5.8 The Node Crash Recovery Methods 5.9 Conclusion ............ .

XI

41

45 45 47 48 52 53 55 57 62 64

6 The Compensation Oriented Approach 69 6.1 The Compensation Oriented DBMS Server Architecture 69 6.2 The Compensation Oriented Classification Scheme 70 6.3 The Record Logging Strategy .... 71 6.4 The Compensation Recovery Strategy . . . 72 6.5 The Record Access Strategy . . . . . . . . 80 6.6 The Transaction Failure Recovery Strategy 82 6.7 The Node Crash Recovery Strategy 84 6.8 The Node Crash Recovery Methods 88 6.9 Conclusion . . . . . . . . 91

7 The Table Oriented Approach 97 7.1 The Table Oriented DBMS Server Architecture 97 7.2 The Table Oriented Classification Scheme 99 7.3 The Logging Strategy ... 100 7.4 The Checkpoint Strategy . . . . . . . . 103 7.5 The Tuple Access Strategy . . . . . . . 105 7.6 The Non-Sequential Recovery Strategy III 7.7 The Compensation Recovery Strategy . 116 7.8 The Transaction Failure Recovery Strategy 116 7.9 The Node Crash Recovery Strategy 119 7.10 Conclusion . . . . . . . . . . . . . . . . . 123

III The HypRa Recovery Approach 131

8 Introduction 133 134 8.1 The HypRa DBMS Server Architecture

9 The Hardware and Basic Software Design 9.1 The Hardware Architecture . 9.2 The External Communication .... . 9.3 Replaceable Hardware Units ..... . 9.4 Depends-On Relations Between Hardware Servers 9.5 Disaster-Tolerant Hardware Design 9.6 The Communication Server ..... .

137 137 138 140 140 142 142

10 The HypRa DBMS Server Software Design 147 10.1 The Distributed HypRa DBMS Architecture. 147 10.2 The Distributed Transaction Model .... 147 10.3 Concurrency Control in the HypRa Server. 149 10.4 The Nested Transaction Model. 149 10.5 Buffer Management . . . . . . . . . . . . 150 10.6 Data Distribution. . . . . . . . . . . . . . 151 10.7 Dynamic Reconfiguration of Data as a Mechanism for Repairing Faults . 153 10.8 Gradual Reduction of Fragment Availability. . . . . . . . . . . . 154

11 The Location and Replication Independent Tuple Recovery Strategy 159 11.1 The Tuple State-Identifier Policy. . . . . . . . . . . . . . . . . . 160 11.2 The Primary and Hot Stand-By Tuple Replica Synchronisation Policy. 162 11.3 The Primary Key Update Policies . . . 164 11.4 The Multi-Level State-Identifier Policy 165 11.5 Conclusion . . . . . . . . . . . . 166

XII

12 The Log Distribution Strategy 169 12.1 The Log Distribution Policy . . . . . . . . . . . . . . . . . . . . . . . . . 169 12.2 The Tuple Log Record Identification Policies . . . . . . . . . . . . . . . . 172 12.3 The Tuple Log Record Identification Policy Used at Hot Stand-By Replicas. 173 12.4 The Log Subfragmentation Policy . 175 12.5 The Fragment Concatenation Policy . . . 176 12.6 The Node Internal Log. . . . . . . . . . 177 12.7 The Tuple Identification Concept Policies 178 12.8 Conclusion . . . . . . . . . . . . . . . . 179

13 The Neighbour Write-Ahead Logging Strategy 181 13.1 The Stable Storage Policies ................. 182 13.2 The Neighbour Write Ahead Log Policy . . . . . . . . . . . 183 13.3 The Neighbour Write-Ahead Log Garbage Collection Rules . 184 13.4 The Update Channel Protocol . . . . . . . 186 13.5 The Deferred Execution Policy ...... 187 13.6 The nWAL Distributed Transaction Model. 190 13.7 The nWAL Transaction Model. . . . . . 191 13.8 The nWAL Checkpointing Policy . . . . . 193 13.9 The Main Memory Log Overflow Policy. . 195 13.10Dynamic Replication to Preserve DB Consistency. 196 13.l1Block Split and Merge. . . . . . . . . . . 198 13. 12The Global Undo Space Reservation Policy 198 13.13Conclusion . . . . . . . . . . . . . . . . . 199

14 The Recovery Strategies 205 14.1 The Take-Over Strategy . . . . . . . . . . 205 14.2 The Transaction Failure Recovery Strategy 208 14.3 The Fast Node Crash Recovery Strategy. 212 14.4 Conclusion. . . . . . . . . . . . . . . . . 225

IV Other High Availability DBMS Approaches 229

15 High Availability Approaches 231 15.1 The Symmetric Synchronisation Strategy 231 15.2 The Asymmetric Synchronisation Strategy. 233 15.3 The I-Safe Consistency Policy. 236 15.4 Evaluation Criteria. . . . . . . . . . . . . 238

16 Shared Disc Approaches 239 16.1 The Tandem Non-Stop System. 239 16.2 The IBM Extended Recovery Facility 241

17 Shared-Nothing Approaches 245 17.1 The Tandem Remote Duplicate Database Facility 245 17.2 The E-Net Remote Recovery Data and Log Apply Facility 246 17.3 Oracle7 Symmetric Replication Facility 248 17.4 The Sybase Replication Server. . . . . . . 250 17.5 The CA-OpenIngres/Replicator .. . . . . 251 17.6 The INFORMIX-OnLine Dynamic Server . 252 17.7 The mM Remote Site Recovery. 253 17.8 Telecom Switch Internal DBMSs 255 17.9 Academic Research Work . . . . 255

XIII

V Conclusion and Future Work

18 Conclusion 18.1 Lessons Learned from Centralised DBMS Recovery Approaches 18.2 Main Contributions from the HypRa Recovery Approach . 18.3 Maintaining a Continuously Available Transaction Server.

19 Future Work 19.1 Follow Up Threads. 19.2 Commercialising H ypRa .

Bibliography

Index

VI Appendices

A Basic Concepts A.l Datamodel, Database, and Database Management System Concepts . A.2 Distributed Database Concepts. . A.3 The DBMS Mapping Architecture A.4 The Table Datamodel .... A.5 Table Datamodel Operations . A.6 The Record Datamodel . A.7 The Block Datamodel .... A.8 The Transaction Model. . . . A.9 Concurrency Control Concepts.

XIV

259

261 261 262 263

265 265 266

267

281

291

293 293 293 294 296 297 298 299 299 300

List of Figures

1.1 The depends-on relation . . . . . . . 7

2.1 Failure model of the servers at a node 12 2.2 Take-over recovery to mask a failed fragment 13 2.3 Fragment take-over followed by fragment recovery and catch-up 14 2.4 Fragment take-over recovery followed by non-blocking fragment repair. 15 2.5 Duplexed hardware module . . . . . . . . . 16

4.1 The block oriented DBMS server architecture 24 4.2 The DB propagation strategies and policies implementing the strategies . 30 4.3 Classification of recovery algorithms based on the DB buffer strategy . . 31 4.4 Checkpoint limitation on redo and undo node crash recovery work . . . 33 4.5 Initiation and termination of transactions related to checkpoints and node crash 37 4.6 The basic fuzzy checkpoint policy . 38 4.7 The combined classification scheme . . . . . . 41

5.1 The record oriented DBMS server architecture. 46 5.2 The recovery server depends-on relations 50 5.3 The relocation marker policy. . . . . . . . . . 54 5.4 The active transaction table .......... 57 5.5 Case documentation of DB inconsistency after reverse redo work order 59 5.6 Work order restrictions. . . . . . . . . . . . . . . . . . . . . . . . . 59 5.7 Node crash recovery methods combined with the no-steal DB buffer policy 62 5.8 Node crash recovery methods combined with the force DB buffer policy . 63 5.9 Node crash recovery methods combined with the steaUno-force DB buffer policy. 64

6.1 The compensation oriented DBMS server architecture. . . . . . . . . . . . . .. 70 6.2 Case which illustrates inconsistent undo of delta record operation when using state

based logging of delta record operations . . . . . . . . . . . . . . . . . . . . .. 72 6.3 The log- and transaction-backchain CLR policy combinations. . . . . . . . . .. 74 6.4 Case combining block state-identifiers, the resetting state policy, and fine granu-

larity locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 77 6.5 Case which illustrates an inconsistent DB resulting from combining the compen-

sation state policy with a non-compensation logging policy . . . . . . 78 6.6 Transaction failure recovery and multiple node crash recovery policies . . . . .. 79 6.7 Transaction rollback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 83 6.8 Inconsistent reverse redo work combined with the fine granularity, semantically

rich locking model . . . . . . . . . . . . . . . . . . . . . . . . . 84 6.9 Work order restrictions in the context of the compensation strategy . . . . . . .. 85 6.10 Inconsistency by using the undo-redo sequentialisation policy . . . . . . . . . .. 86 6.11 Classification of node crash recovery methods combined with the no-steal and

no-force DB buffer policy . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 88 6.12 Classification of node crash recovery methods combined with the force and steal

DB buffer policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 89

xv

6.13 Classification of node crash recovery methods combined with the steal and no-force DB buffer policy . . . . . . . . . . . . . . . 91

7.1 The table oriented DBMS server architecture 98 7.2 The BPKlBPK and BPKffPK combined identification polices. 101 7.3 The link- and upper/lower border policies . . . . . . . . . . . 106 7.4 Case log-aggregation data structure when using the transaction consistent block

split/merge policy ............... . . . . . . . . . . . . . . . . . . 117 7.5 Case log-aggregation data structure when using the action consistent block split/merge

policy ....................................... 118 7.6 Synchronisation of redo and undo work with the transaction consistent block

split/merge policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 7.7 Synchronisation of redo and undo work with the action consistent block split/merge

policy ................. 120

8.1 The HypRa DBMS server architecture. 135

9.1 The shared-nothing, the shared discs, and the shared main memory hardware architectures . . . . . . . . . . . . . . . . . . . . . . . 138

9.2 The HypRa DBMS server internal hardware architecture ...... . . . . . . . 139 9.3 The HypRa DBMS client-server architecture ................... 140 9.4 Replaceable units, depends-on relations, and server groups for hardware HypRa

DBMS servers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 9.5 The depends-on relations between basic software and hardware servers 142

10.1 The HypRa DBMS server group .... 10.2 The DBMS server depends-on relations 10.3 The distributed transaction model . .

148 148 149

10.4 Horizontal fragmentation of a relation . 150 10.5 Single fault -tolerant fragments . . . . . 151 10.6 Mirrored allocation offragment replicas 152 10.7 Refragmentation of fragment replicas . 153 10.8 Production and allocation of subfragment replicas 155 10.9 Subfragment replica allocation after subfragmentation . 156

11.1 Case illustrating inconsistent DB state resulting from use of block state-identifiers with production of a fuzzy subfragmented fragment replica 161

11.2 Execution oftuple operations to hot stand-by tuple replicas 163

12.1 The log distribution policy . . . . . . . . . . . . . . . . . 171 12.2 Primary and hot stand-by tuple log fragment replicas .. . 174 12.3 LSN synchronisation between a primary and a hot stand-by log 175 12.4 Subfragment concatenation . 177

13.1 The nWAL transaction model

14.1 The non-sequential transaction failure recovery policy. 14.2 Fast node crash recovery redo and undo work 14.3 Node crash recovery ............ .

15.1 The gateway external synchronisation policy 15.2 The trigger based external synchronisation policy 15.3 The locking based synchronisation policy 15.4 The log based synchronisation policy ..

XVI

192

211 221 223

233 234 235 236

A.I DB, data items, DBMS, and database operations. . . . . . . . . . . . . . 294 A.2 The distributed database concept. . . . . . . . . . . . . . . . . . . . . . 295 A.3 The centralised and the distributed DBMS reference mapping architecture 295 A.4 The table schema, the table, and the tuple concepts ............ 296

XVII

List of Tables

9.1 Communication capacity as a function of crashed nodes.

A.l Compatibility for the semantically rich locking model . A.2 Compatibility for the traditional locking model . . . .

XIX

143

301 301

Documents

Recovery In Parallel Database Systems - Springer978-3-322-93860-2/1.pdfsystem or transmitted, ... I express the most sincere thanks to Telenor R&D for initiating, ... It then continues