
High-Availability Database Systems: Evaluation of Existing Open Source Solutions

Tuure Laurinolli
Master’s Thesis Espoo, November 19, 2012
Supervisor: Professor Heikki Saikkonen Instructor: Timo Lättilä, M.Sc. (Tech.)
Aalto University School of Science Degree Programme of Computer Science and Engineering
ABSTRACT OF MASTER’S THESIS
Author: Tuure Laurinolli
Date: November 19, 2012 Pages: 90
Professorship: Software Systems Code: T-106
Supervisor: Professor Heikki Saikkonen
In recent years the number of open-source database systems offering high-availability functionality has exploded. The functionality offered ranges from simple one-to-one asynchronous replication to self-managing clustering that both partitions and replicates data automatically.
In the thesis I evaluated database systems for use as the basis for high availability of a command and control system that should remain available to operators even upon loss of a whole datacenter. In the first phase of evaluation I eliminated systems that appeared to be unsuitable based on documentation. In the second phase I tested both throughput and fault tolerance characteristics of the remaining systems in a simulated WAN environment.
In the first phase I reviewed 24 database systems, of which I selected six, split in two categories based on consistency characteristics, for further evaluation. Experimental evaluation showed that two of these six did not actually fulfil my requirements. Of the remaining four systems, MongoDB proved troublesome in my fault tolerance tests, although the issues seemed resolvable, and Galera's slight issues were due to its configuration mechanism. This left one system in each category: Zookeeper and Cassandra, neither of which exhibited any problems in my tests.
Keywords: database, distributed system, consistency, latency, causality
Language: English
ABSTRACT OF MASTER'S THESIS (IN FINNISH)

Date: November 19, 2012 Pages: 90

Professorship: Software Technology Code: T-106

Supervisor: Professor Heikki Saikkonen

Instructor: Timo Lättilä, M.Sc. (Tech.)

In this thesis I evaluated the suitability of database systems as the basis for high-availability functionality in a command and control system that must remain available even when an entire datacenter fails. In the first phase of the evaluation I eliminated, based on documentation, the systems that were clearly unsuitable. In the second phase I tested both the fault tolerance and the throughput of the systems in a simulated high-latency network.

In the first phase I reviewed 24 database systems, of which I selected six for closer evaluation. I divided these six into two categories based on their consistency characteristics. In the experiments I found that two of the six did not fulfil the requirements I had set. Of the remaining four systems, MongoDB caused problems in my fault tolerance tests, although the problems appeared to be fixable, and Galera's minor problems were due to its configuration mechanism. This left Zookeeper from the first category and Cassandra from the second, in neither of which my tests found fault tolerance problems.

Keywords: database, distributed system, consistency, latency, causality

Language: English
Acknowledgements
I would like to thank Portalify Ltd for offering me an interesting thesis project and ample time to work on it. At Portalify I'd especially like to thank M.Sc. Timo Lättilä, my instructor, for putting me on the right track from the start. Outside Portalify, I would like to thank Professor Heikki Saikkonen for taking the time to supervise my thesis.
I also want to thank my friends and family for providing me with support and, perhaps even more importantly, welcome distractions. Aalto on Waves was downright disruptive, and learning to fly at Polyteknikkojen Ilmailukerho took its time too. However, constant support from old friends was the most important. Thank you, Juha and #kumikanaultimate!
Helsinki, November 19, 2012
Abbreviations and Acronyms
2PC     Two-phase Commit
ACID    Atomicity, Consistency, Isolation, Durability
API     Application Programming Interface
ARP     Address Resolution Protocol
CAS     Compare And Set
FMEA    Failure Modes and Effects Analysis
FMECA   Failure Modes, Effects and Criticality Analysis
FTA     Fault Tree Analysis
HAPS    High Availability Power System
HTTP    Hypertext Transfer Protocol
JSON    JavaScript Object Notation
LAN     Local Area Network
MII     Media Independent Interface
NAT     Network Address Translation
PRA     Probabilistic Risk Assessment
REST    Representational State Transfer
RPC     Remote Procedure Call
RTT     Round-Trip Time
SDS     Short Data Service
SLA     Service Level Agreement
SSD     Solid State Drive
SQL     Structured Query Language
TAP     Linux network tap
TCP     Transmission Control Protocol
TETRA   Terrestrial Trunked Radio
VM      Virtual Machine
WAN     Wide Area Network
XA      X/Open Extended Architecture
Contents
Abbreviations and Acronyms  4

1 Introduction  8
  1.1 High-Availability Command and Control System  8
  1.2 Open-Source Database Systems  9
  1.3 Evaluation of Selected Databases  9
  1.4 Structure of the Thesis  10

2 High Availability and Fault Tolerance  11
  2.1 Terminology  11
  2.2 Overcoming Faults  14
  2.3 Analysis techniques  16

3 System Architecture  24
  3.1 Background  24
  3.2 Network Communications Architecture  26
  3.3 Software Architecture  28
  3.4 FMEA Analysis of System  33
  3.5 FTA Analysis of System  37
  3.6 Software Reliability Considerations  39
  3.7 Conclusions on Analyses  40

4 Evaluated Database Systems  41
  4.1 Database Requirements  41
  4.2 Rejected Databases  42
  4.3 Databases Selected for Limited Evaluation  48
  4.4 Databases Selected for Full-Scale Evaluation  50

5 Experiment Methodology  54
  5.1 Test System  54
  5.2 Test Programs  56
  5.3 Fault Simulation  64
  5.4 Test Runs  65

6 Experiment Results  66
  6.1 Throughput Results  66
  6.2 Fault Simulation Results  75

7 Comparison of Evaluated Systems  84
  7.1 Full-Scale Evaluation  84
  7.2 Limited Evaluation  85

8 Conclusions  86

B Remaining fault test results  95
Chapter 1

Introduction
In this thesis I present my research related to adoption of an existing open- source database system as the basis for high availability in a command and control system being developed by Portalify Ltd.
1.1 High-Availability Command and Control System
The command and control system is designed to support the operations of rescue personnel by automatically tracking the status and location of field units so that dispatching operators always have a correct and up-to-date view of available units. It tracks the locations of TETRA handsets and vehicle radios, and handles status messages sent by field personnel in response to events such as receiving dispatch orders. The system also allows operators to dispatch a unit on a mission, and automatically sends the necessary information to the unit.
The system should scale to installations that span large geographical ar- eas, with dispatching operators located in multiple, geographically diverse control rooms, and thousands of controlled units spread over the geograph- ical area. Typically operators in one control room would be responsible for controlling units in a specific area, but it should be possible for another control room to take over the area in case the original control room cannot handle its tasks because it has for example lost electrical power.
In this thesis I concentrate on hardware fault tolerance of the command and control system and also the database system, since studying software faults of large, existing software systems appears to be an unsolved problem. However, I touch on higher-level approaches that could be used to enhance software fault tolerance of a complex system in practice in Chapter 3.
I introduce terminology and analysis methods related to availability and
fault tolerance in Chapter 2. In Chapter 3 I present more elaborate requirements for the system, a system architecture based on those requirements and a fault-tolerance analysis of the architecture model based on the analysis methods introduced in Chapter 2.
1.2 Open-Source Database Systems
The system described above must be able to share data between operators working on different workstations, located in different control rooms, distributed across a country. A database system for storing the data and controlling access to it is required. Because of the fault tolerance requirements presented in Chapter 3, the database system must be geographically distributed.
The main functional requirement for the database is that it must provide an atomic update primitive, preferably with causal consistency and read-committed visibility semantics. The main non-functional requirements are quick, automatic handling of software, network and hardware faults and adequate throughput when clustered over a high-latency network. Even fairly low throughput is acceptable, since the command and control system is expected to generate only a modest query and update load.
I limit the evaluation to open-source database systems both because of the apparently high cost of commercial high-availability database systems, such as Oracle, and also because it is not possible to inspect how commercial, closed-source systems actually work. The transparency of open-source systems is not only beneficial for research purposes; it is also an operational benefit in that it is actually possible to find and fix problems in the system without having to rely on the database vendor for support. Already during the writing of this thesis I reported issues to several projects and fixed problems in database interfaces to be able to run my tests.
I present the requirements placed on the database system and introduce a wide variety of open-source high-availability database systems in Chapter 4.
1.3 Evaluation of Selected Databases
Since the main objective is to find a distributed database system that is fault tolerant, I use a virtualized test environment that is capable of injecting faults and latency into a distributed system. The test environment uses VirtualBox (https://www.virtualbox.org/), Netem [20] and virtualized Debian (http://www.debian.org/) systems
capable of running all the tested database systems. I test fault tolerance characteristics of the selected database systems in
this environment by injecting process, network and hardware faults and measuring the effects on clients connected to different nodes of the database cluster. In addition to fault tolerance, I test the update throughput of the database systems in various high-latency configurations with varying numbers of clients.
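As an illustration of the kind of latency and fault injection this involves, the sketch below drives Netem through the standard tc command from Python. It is a minimal example rather than the actual test harness; the interface name and the delay, jitter and loss values are placeholder assumptions.

    import subprocess

    def set_wan_conditions(interface="eth0", delay_ms=100, jitter_ms=20, loss_pct=1):
        """Attach a netem qdisc that simulates a lossy, high-latency WAN link."""
        subprocess.check_call([
            "tc", "qdisc", "replace", "dev", interface, "root", "netem",
            "delay", f"{delay_ms}ms", f"{jitter_ms}ms", "loss", f"{loss_pct}%"])

    def clear_wan_conditions(interface="eth0"):
        """Remove the netem qdisc, restoring normal link behaviour."""
        subprocess.check_call(["tc", "qdisc", "del", "dev", interface, "root"])

    if __name__ == "__main__":
        set_wan_conditions()    # must be run as root inside the virtual machine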
I elaborate on the test environment in Chapter 5 and present the test results in Chapter 6, as well as a comparison of the evaluated systems based on the test results and features in Chapter 7.
1.4 Structure of the Thesis
In Chapter 1 I introduce the product from which the criteria for evaluating the databases are derived. In Chapter 2 I introduce high-availability and fault tolerance terminology and fault tolerance analysis procedures used in other fields. In Chapter 3 I describe the system architecture, how fault tolerance can be achieved with it and the requirements it places on the database. In Chapter 4 I describe the various databases that I considered when selecting systems for evaluation, and explain how the evaluated databases were selected. In Chapter 5 I present the test methodology used in obtaining data for the evaluation of the databases. In Chapter 6 I present test results for several databases using the test methods from Chapter 5. In Chapter 7 I compare the evaluated databases based on the results presented in Chapter 6. In Chapter 8 I present conclusions about the suitability of different databases for use as a basis for sharing state in a high-availability command and control system.
Chapter 2
High Availability and Fault Tolerance
A high-availability database system is a database system with which an operator can achieve high availability. The meaning of availability and how it relates to fault tolerance is discussed below.
2.1 Terminology
The terms related to availability, and the meaning of availability itself, need to be carefully defined in order to be useful. In systems that operate continuously, availability is often defined as the probability that the system is operating correctly at any given time [27]. This definition is problematic when applied to query-oriented computer systems such as databases, for which continuous availability is not easily defined, since the availability of the system is only measurable when a query is performed.
2.1.1 Availability
It’s somewhat difficult to measure availability even when queries are per- formed. What is the availability of the system if query A executes success- fully, but during its execution, query B begins executing and fails because of spurious network error? Whether or not this is possible depends on the design of the network protocol that the database cluster uses in its internal communications, but it is certainly imaginable that some of the evaluated systems could allow this kind of behavior.
Even without going as far as proposing simultaneity of failure and success, it is usually not enough for a query to eventually complete for it to
be considered successful. Instead, there is usually an external requirement limiting execution time of database queries when the database is part of a larger system. In addition, due to concurrency control paradigms employed in certain databases, some queries are actually expected to fail. This happens when multiple clients attempt to concurrently update an entity in a system with optimistic concurrency control. Instead of one of two clients remaining blocked on a lock waiting for the other to complete its update, at least one of the updates must fail at or before commit time.
The interface from the rest of the command and control system to the database system is designed so that failure of an individual query is not disastrous. Nor is unavailability of the database system for a few seconds, for example upon failure of the underlying hardware, a problem for the rest of the system. The database system should thus only be considered unavailable when queries take a disproportionately long time to execute or when they fail because of an error in the database system rather than a transient error resulting from the concurrency control protocol.
2.1.2 Reliability
While availability is usually defined in terms of the probability that the system is operating correctly at a point in time during continuous operation, reliability is defined as the probability that the system keeps operating correctly without failures for a defined period of time [27]. For the envisaged system, reliability is not a good metric, since the system does not have a well-defined lifetime over which reliability would be meaningful to measure.
For example, if the system had a second of downtime every 10 minutes, its availability would be 0.998 but its reliability over any 10-minute period would be 0. For the expected use case with short queries this might be entirely acceptable. However, for a batch system performing video processing tasks that each take 30 minutes, the reliability figure above would be absolutely disastrous, since no task could ever finish.
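The numbers above follow directly from the definitions; a minimal sketch of the arithmetic:

    # One second of downtime in every 10-minute (600-second) period.
    downtime_s, period_s = 1.0, 600.0
    availability = (period_s - downtime_s) / period_s   # 599/600, roughly 0.998
    # Reliability over any 10-minute window is 0, because every such window
    # contains a failure, so a 30-minute batch task could never finish.
    reliability_10min = 0.0
    print(round(availability, 3), reliability_10min)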
2.1.3 Faults and Errors
According to Storey, “A fault is a defect within the system” and can take many forms such as hardware failures and software design mistakes. An error on the other hand is “a deviation from the required operation of the system or subsystem” [27]. Storey further classifies faults into random and systematic faults. Random faults include hardware faults due to wear and tear, cosmic rays and other random events. Systematic faults are faults due to design mistakes.
A fault may cause an error but the operation of the system may also mask the fault. For example, a software design mistake will not result in an error if the part of the software that contains the mistake is never executed. Hardware errors may similarly be masked. If a switch is never used, a fault in its installation cannot produce an error, and a fault in a computer hard disk may stay dormant for the lifetime of the system if the faulty sector is never accessed. In fact, a modern PC CPU contains hundreds of design faults [15], yet millions of the devices are in use every day without any apparent errors arising from these faults.
Storey also defines data integrity as "the ability of the system to prevent damage to its own database and to detect, and possibly correct, errors that do occur" [27]. Database terminology for data integrity is usually more nuanced, using terms such as atomicity, consistency, isolation and durability to describe characteristics of transactions in the database. According to Wikipedia [30], the terms originate from Haerder and Reuter [19]. A database system that ensures data integrity as defined by Haerder and Reuter is also fail-safe as defined by Storey in the sense that no committed transactions are lost upon error; errors thus do not endanger system state, they just prevent accessing or changing it.
2.1.4 Maintainability
Another concept of interest defined by Storey is maintainability. He defines maintainability as "the ability of a system to be maintained" and maintenance as "the action taken to retain a system in, or return a system to, its designed operating condition" [27]. In computer systems, common maintenance tasks often include ensuring that sufficient resources, such as disk space, are available in the system, possibly backing up the system state to external media, and applying configuration changes and software updates.
With databases, two intertwined properties often come up in connection with backups. Most preferably the backup should be atomic, that is, reflect the state of the database at a single point in time. The backup should also have no effect on the normal operations of the database system, that is, answering queries and performing updates. Some database systems achieve the first property but fall short of the second, specifically because achieving the first requires blocking all write operations so that the backup can complete without interference from updates. Others choose to achieve the second property but fail on the first one, yet it is usually possible to achieve both if filesystem-level snapshots are available or if the database uses a multiversion concurrency control scheme. In the first case restoring a backup made by copying the
filesystem snapshot is equivalent to restarting the database after a power failure. In the latter case the database explicitly keeps track of the lifetime of items so that during a backup old, deleted versions of items are simply kept around until the backup completes, and new items are not included in the backup.
Another maintainability issue with a complex system is management of the system configuration over time. In addition to the issues that arise in managing the configuration of a centralized software system, distributed systems have additional complexity related to ensuring that the whole system has a compatible configuration. For example, some distributed systems require that all nodes have mostly identical but subtly different configurations because each node's configuration must specify the addresses of all the other nodes but not the node itself.
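A minimal sketch of generating such slightly different per-node configurations; the node names and addresses are invented for illustration:

    # Every node's configuration lists the addresses of its peers but not its own.
    NODES = {"dc1": "10.0.1.10", "dc2": "10.0.2.10", "dc3": "10.0.3.10"}

    def node_config(name):
        peers = sorted(addr for peer, addr in NODES.items() if peer != name)
        return {"node": name, "peers": peers}

    for name in NODES:
        print(node_config(name))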
On a centralized system, a configuration change is performed once, on one computer. If the change requires a restart of software, some downtime is unavoidable. In contrast, a distributed system may be able to tolerate configuration changes that require restarts of individual nodes without downtime. On-the-fly upgrades like this are often the preferred method in the world of distributed database systems, where the feature is often called 'rolling restart' [25]. In practice the difficulty of having a dissimilar configuration for each node may not be great, since the configuration of each node must in any case be managed individually if changes are performed in a staggered fashion.
2.2 Overcoming Faults
Storey [27] divides techniques for overcoming the effects of faults into four categories: fault avoidance, fault removal, fault detection and fault tolerance. Fault avoidance covers techniques applied at the design stage, fault removal covers techniques applied during testing, and fault detection and fault tolerance cover detecting faults and mitigating their effects when the system is operational. An example of fault avoidance would be the use of formal methods during software development to prove that the software matches its specification. Fault detection and fault tolerance are related in that fault tolerance in active systems typically requires some form of fault detection so that faulty parts of the system can be isolated or spare components activated, and the fault reported so that it can be repaired.
Several techniques for creating fault-tolerant software are described in the literature. The Wikipedia article on Software Fault Tolerance [31] lists Recovery Blocks, N-version Software and Self-Checking Software. In addition, Storey [27] mentions Formal Methods. Of these, Recovery Blocks are these days a mainstream feature in object-oriented programming languages such as C++, Java, Python and Ruby in the form of try-catch structures. Storey also mentions other language features common in today's languages, such as pointer safety and goto-less program structure, as enhancing the reliability of software [27].
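As a rough illustration of the recovery-block idea expressed with ordinary exception handling, the sketch below runs a primary routine, checks its result with an acceptance test and falls back to an alternate routine on failure. The routines and the acceptance test are hypothetical.

    def primary(x):
        return 1.0 / x              # preferred variant; fails for x == 0

    def alternate(x):
        return 0.0                  # simpler, more conservative variant

    def acceptable(result):
        return result == result     # acceptance test: reject NaN results

    def recovery_block(x):
        """Run the primary variant; fall back to the alternate if it fails its test."""
        try:
            result = primary(x)
            if acceptable(result):
                return result
            raise ValueError("acceptance test failed")
        except Exception:
            return alternate(x)

    print(recovery_block(4), recovery_block(0))   # 0.25 0.0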
N-version (multiversion) software aims to achieve redundancy by creating multiple versions of the same software function. The idea behind multiversion software is that the different versions, called variants, will have different faults, and thus correct operation can be ensured by comparing their results and selecting the result that is most popular. Both Storey [27] and Lyu [22] mention that common-cause faults have been found to be surprisingly common when multiversion programming has been applied. To avoid common-cause faults, the variants should be developed with as much diversity as possible. For example, separate hardware platforms, programming languages and development tools increase the likelihood of the different program versions actually having different faults.
Multiversion software also usually has a single point of failure, namely the component that selects the final result based on the variant results. However, it should be a simple component, maybe so simple as to allow exhaustive testing. As techniques for combining variant results, Lyu [23] mentions majority voting and median voting among others.
Majority voting simply picks the majority value, if any. Majority voting cannot produce a result in all cases, namely in situations where no majority exists. For example, if three variants each produce a different result, no majority exists, and some other solution is required. Some possibilities in this case are switching control to a non-computerized backup system, or shutting down the whole system into a safe state. Median voting is an interesting alternative in that for some special cases it allows the variants to be implemented so that their results do not have to match exactly in order for the combined result to be useful. For example, if diverse algorithms on diverse hardware are used to compute the deflection of a control surface of an aircraft, combining their outputs with a median filter would allow the algorithms to produce slightly different results for common cases, yet choose a common value in case one algorithm produces obviously wrong results.
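A minimal sketch of the two result-combination strategies for numeric variant outputs; the example values are invented:

    from collections import Counter
    from statistics import median

    def majority_vote(results):
        """Return the value produced by a strict majority of variants, or None."""
        value, count = Counter(results).most_common(1)[0]
        return value if count > len(results) / 2 else None

    def median_vote(results):
        """Return the median value; tolerates variants that differ slightly."""
        return median(results)

    outputs = [12.0, 12.1, 37.5]    # e.g. control-surface deflections from three variants
    print(majority_vote(outputs))   # None: no exact majority exists
    print(median_vote(outputs))     # 12.1: the obviously wrong 37.5 is ignored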
The article on Self-Checking Software in Wikipedia [31] is actually about N-version Self-Checking Programming as described in Lyu [22, chapter 3], wherein the N-version aspect is the source of the redundancy necessary to tolerate faults and the self-checking part distinguishes it from regular N-version programming as described by Storey [27]. The difference is that in regular N-version programming an external component compares the results of the N diverse programs and determines the correct output, whereas in N-version self-checking programming each self-checking component must determine whether its own result is correct and signal the other components in case it detects a fault in its output.
Correct use of formal methods ensures that software matches its specification. For them to be applicable, a formal specification must first be created. Some software development standards, such as UK Defense Standard 00-55, require the use of formal methods for safety-related software [12]. Techniques borrowed from formal methods are also used in less rigorous settings to find bugs in existing software [13].
In distributed systems, additional techniques are required that allow the system as a whole to proceed even if components fail. The problem of agreement in distributed systems is called the consensus problem. In theory, it is impossible to implement an algorithm solving the distributed consensus problem in an asynchronous network, that is, a network that does not guarantee delivery of messages in bounded time, if even a single process may fail. In practice this is overcome by employing fault detectors based on timeouts. In addition to distributed consensus, distributed transactions feature widely in the literature. Transactions are a special case of distributed consensus, and a plethora of specialized algorithms exist for handling them; lately, however, the trend has perhaps been towards building databases on more generic consensus primitives. For example, Google's BigTable database is essentially based on the generic Paxos algorithm for solving distributed consensus. [16]
2.3 Analysis techniques
2.3.1 Failure Modes and Effects Analysis
Failure Modes and Effects Analysis (FMEA) was originally developed in the United States for military applications and codified in MIL-P-1629 in 1949. Later revisions were standardized in MIL-STD-1629 and MIL-STD-1629A. Early adopters of FMEA in civil applications include the aerospace and automotive industries. According to Haapanen and Helminen [18], the academic record of applying FMEA to software development originates from the late 1970s. Haapanen and Helminen mention a paper by Reifer published in 1979 titled Software Failure Modes and Effects Analysis, and in this paper Reifer mentions some earlier work on software reliability, but nothing dating back further than 1974. [26] [18]
Failure Modes, Effects and Criticality Analysis (FMECA) is a development of FMEA that includes assessment of the criticality of failures. Criticality means, according to Haapanen and Helminen [18], "a relative measure of the consequences of a failure mode and its frequency of occurrences". FMECA was part of MIL-STD-1629A, which was published in 1980. In this thesis I will perform a qualitative criticality analysis of identified failures in Chapter 3.
The FMECA procedure itself is very simple. The procedure described here is based on the description in Storey [27].
For each system component:
1. Determine failure modes of the component
2. Determine consequences of failure of the component in each failure mode
3. Determine criticality of the failure based on its consequences and likelihood of failure
The result of FMECA is a table that contains a description of the consequences and criticality of all single-component failures.
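The procedure maps naturally onto a simple table-building loop; a minimal sketch with invented components and a placeholder analysis function:

    from dataclasses import dataclass

    @dataclass
    class FmecaRow:
        component: str
        failure_mode: str
        effects: str
        criticality: str            # qualitative, e.g. "low", "medium", "high"

    def fmeca(components, analyse):
        """Apply the three steps above to every component and collect the table."""
        return [FmecaRow(c, mode, effects, criticality)
                for c in components
                for mode, effects, criticality in analyse(c)]

    def analyse(component):
        # Placeholder: in practice the failure modes, effects and criticality
        # ratings come from expert judgement, not from code.
        return [("connectivity lost", "node unreachable", "medium")]

    for row in fmeca(["CPU1-SW1 cable", "SW1-R1 link"], analyse):
        print(row)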
The limitations of FMECA lie in its simplicity. It prescribes analysis of all system components, which soon becomes burdensome on larger systems. Appropriate modularization helps with this issue: if module interfaces are sufficiently well-defined, internal failures of a module can be treated at a higher level as failures of the larger module, reducing the complexity of analysis at the higher level. A larger, more difficult problem is that, as prescribed, FMECA limits analysis to single-component failures. Consequences of simultaneous failures of multiple components are not covered by the analysis. For example, FMECA analysis of a dual ring network topology would show that no single-link failure partitions the network, but it would not cover the two-link failure cases that do partition the network.

It is difficult to envision how FMECA could practically be extended to multi-component failures, since already the obvious next step of applying the procedure to component pairs is often infeasible: the number of component pairs in a system grows quadratically with the number of components. As already mentioned, proper modularization of the system could help somewhat, but even for a small component count the number of component pairs is prohibitively large. However, in certain cases reduction of the analysis based on symmetries might make analysis of dual-component failures feasible. For example, in a dual ring network with N identical nodes (Figure 2.1a), a single-link failure has 2N identical cases (Figure 2.1b) and the N(2N − 1) dual-link failures can be reduced to only three cases with different behavior (Figure 2.2): links in the same direction, both links between one pair of nodes, and links in different directions between different pairs of nodes. The first two have no effect on communications and the third splits the network in two. It is difficult to see how this could be generalized, though.
Figure 2.1: Dual ring network example ((a) healthy network, (b) single-link failure; diagrams omitted)
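The symmetry argument above can be checked mechanically. The following minimal sketch, which is not part of the thesis test programs, enumerates every dual-link failure of a directed dual ring with N = 5 nodes and counts how many of them break full reachability:

    from itertools import combinations

    def fully_reachable(nodes, links, src):
        """True if src can reach every node over the remaining directed links."""
        seen, stack = {src}, [src]
        while stack:
            u = stack.pop()
            for a, b in links:
                if a == u and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen == set(nodes)

    N = 5
    nodes = list(range(N))
    clockwise = [(i, (i + 1) % N) for i in nodes]
    counter = [((i + 1) % N, i) for i in nodes]
    all_links = clockwise + counter                   # 2N unidirectional links

    dual_failures = list(combinations(all_links, 2))  # N(2N - 1) cases
    partitioning = 0
    for failed in dual_failures:
        remaining = [l for l in all_links if l not in failed]
        if not all(fully_reachable(nodes, remaining, s) for s in nodes):
            partitioning += 1

    # Expected: only the cases where the two links run in different directions
    # between different node pairs partition the network.
    print(len(dual_failures), "dual-link failures,", partitioning, "partition the network")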
2.3.2 Fault Tree Analysis
According to the NASA Office of Safety and Mission Assurance [24], the history of Fault Tree Analysis (FTA) dates back to the US aerospace and missile programs, where FTA was popular in the 1960s. Towhidnejad et al. [29] mention that FTA evolved in the aerospace industry in the early 1960s. Nowadays FTA and other Probabilistic Risk Assessment (PRA) techniques are used for example in the nuclear and aerospace industries. [24]
Storey [27] does not specifically mention probabilities in the context of FTA, and the NASA Office of Safety and Mission Assurance [24] specifically mentions that the Fault Tree (FT) that is the result of FTA is a "qualitative model". According to Towhidnejad et al. [29], however, FTA is associated with a probabilistic approach to system analysis, and in NASA Office of Safety and Mission Assurance [24] probabilistic aspects are also introduced later. In this thesis I will only perform qualitative FTA-type analysis in Chapter 3.
The FTA procedure is in some ways the opposite of FMECA. In FTA the starting point is a top event, the causes of which are to be determined. The process is repeated recursively until the level of "basic events" is reached. The question in FTA is thus "What would have to happen for event X to happen?" rather than "What would happen were event X to happen?" as in FMEA. FTA is also advertised as a graphical method, with a well-defined graphical notation for the tree structure produced through the recursion mentioned above [27]. An example of the graphical representation is in Figure 2.3.
Figure 2.2: Dual ring network multiple failure example ((a) links lost in the same direction; diagrams omitted)
Note that individual fault events are atomic and are combined with Boolean operators when multiple lower-level faults are required to cause a higher-level fault.
FTA applied to the ring network example of the previous section with the top-level event "Network is partitioned" is presented in Figure 2.4. The reasoning already described in the previous section is visualized in the FTA model. However, if the symmetry arguments from the previous example were not applied to the FTA, the tree would quickly grow prohibitively large (Figure 2.5). Also, there is nothing inherent in the construction of the Fault Tree that would ensure that faults caused by multiple failures are noticed. However, the focus in FTA is on determining causes for a specific event, which helps concentrate analysis on relevant aspects of the system.
Figure 2.3: Fault Tree Analysis notation example (tree diagram omitted; example top event: loss of cooling)
In the literature FTA is mostly mentioned in the context of safety-critical systems. However, it is also useful in more mundane software development and system design tasks. The output of FTA can be directly used as a guide for finding possible causes of problems in running software or operative systems. Automated construction of Fault Trees from programs has been researched by Friedman [17], although another name for the end result might be more suitable, since the top event is not necessarily a fault but rather any state of the program.
Also note how the selection of the top-level event affects FTA analysis. If the top-level event "Single-failure tolerance lost" is selected, the resulting FTA is quite different, as can be seen in Figure 2.6. The process for selecting appropriate top-level events is not part of the FTA procedure and requires expertise beyond simply applying a prescribed method to a system. In software systems, both selecting appropriate top-level events and determining an appropriate bottom level for the analysis are challenging because of system complexity. If a bottom level is not set, then eventually all analyses of software programs end up at causes like "arbitrary memory corruption" which
can cause any kind of behavior within the limits set by the laws of physics.

Figure 2.4: FTA of dual-ring network with top-level event "Network partition" (tree diagram omitted)
2.3.3 Hazard and Operability Studies
Hazard and Operability Studies (HAZOP) is a technique developed in the 1960s for analyzing hazards in chemical processes. According to Storey [27], it has since become popular in other industries as well.
The roots in the chemical industry are apparent from the description by Storey [27], where the process is described as starting with a group of engineers studying the operation of a process in steady state, and the effects of deviations from that steady state. The procedure undoubtedly fits a continuous chemical process well, but requires adjustments to be applicable in other industries. The HAZOP procedure is also similar to FMEA in that one is supposed to pick a deviation, find out what could cause such a deviation, and what the deviation could in turn cause. This is better reflected in the German acronym PAAG (Prognose von Störungen, Auffinden von Ursachen, Abschätzen der Auswirkungen, Gegenmaßnahmen), or, in English, prediction of deviations, finding of causes, estimation of effects, and countermeasures [11].
Figure 2.5: Naive FTA of the dual-ring network without symmetry reduction (tree diagram omitted; it enumerates individual dual-link loss combinations such as the loss of the links between A and B, between A and E, and between B and C)
Figure 2.6: FTA of dual-ring network with top-level event "Single failure tolerance lost" (tree diagram omitted; the top event "Loss of single fault tolerance" is caused by the loss of any single link: A-B, B-C, C-D, D-E or E-A)
In HAZOP, guide words are used to ease the discovery of potential failure types. Examples of guide words are "no", "more", "less" and "reverse", which are easily applicable to, for example, material flows in a continuous chemical process, but perhaps less easily to computer systems. In computer systems, there is for example no reservoir pool from which traffic can flow into the network if the current traffic flow is "less" than expected. It is still possible to apply the same guide words to a limited extent, though, in the sense that for example "more queries" could lead to analysis of the effect of an overload of otherwise well-formed queries on a query-oriented computer system. Also, overflow and underflow conditions in input values, missing fields in protocol objects and the like should of course be examined. However, that level of analysis is usually done in unit tests, and is not part of whole-system analysis. Conceivably the guide words of HAZOP would indeed be well-suited for unit test construction.
Chapter 3
System Architecture
In this chapter I describe the system architecture of the high-availability command and control system and how it enables tolerance of all single-component hardware failures, some multiple-component hardware failures and certain software failures.
3.1 Background
The background for this thesis is a command and control system with high availability. The system was described at a very high level in Chapter 1. In this section I elaborate on the requirements of the command and control system from which the system architecture in the rest of this chapter is derived.
3.1.1 Availability and Reliability
It is clear that the command and control system should remain operational all the time. It is equally clear that it does not have to be as reliable as, for example, the cooling systems of nuclear power plants. Exact reliability and availability requirements are, however, rather unclear, since no generally applied reliability standards exist for command and control systems, unlike for nuclear facilities [28].
As noted in Chapter 1, it should be possible for operators normally responsible for one area to take control of units in another area. To achieve this, operators in all the control rooms must have access to all the information necessary for taking over control of units in another area. The information must be up-to-date, and most importantly it must not be possible for an operator to perform operations based on outdated data, such as dispatching a unit
that has already been dispatched by another operator, but is still shown as free on his screen because of network delays.
3.1.2 Failure Model
Failures of the command and control system can be divided into multiple categories with varying severity. One category is failures that result in system unavailability for all operators. Paradoxically, this is perhaps the easiest situation from the perspective of operating procedures: all operators must simply switch to a manual backup procedure. Similarly, failures that result in unavailability of the system for a single operator, or for operators located in a single control room, can be dealt with by switching control of the affected area to another operator in the same control room, or to another control room entirely if the whole control room is unavailable.
Besides failures that cause total unavailability of the system for some subset of the operators, the system might also experience a partial failure that affects all operators, for a myriad of reasons. For example, if the TETRA terminal in a vehicle loses power, communications with the vehicle are disrupted, and the system loses the ability to locate the vehicle and communicate with it. These kinds of failures are expected, and usually detected with timeouts and acknowledgements in communications protocols. If the system is correctly designed and implemented, they will be detected and either mitigated or reported to the user.
For example, the system always shows the location of a vehicle with a timestamp indicating when the location report was received, so that the operator can detect if some vehicle is not sending new location reports. Similarly, if the user attempts to dispatch a vehicle on a mission, and there are communication problems with the vehicle, the system will first attempt to mitigate the failure by resending the message and, after a certain number of failed retries, notify the operator that acknowledgement for the dispatch message was not received so that he can take appropriate action.
In this thesis I will concentrate on the use of a distributed database to mitigate the effects of hardware and network failures in the internal components of the system. In particular, I will not spend effort in an attempt to prove that the system is free of software bugs or able to tolerate malicious behavior from internal components. In fact, it is easy to imagine simple software problems that would result in difficult-to-detect problems on a running system: a bug in the text encoding routines for outgoing TETRA SDS messages could cause a dispatch order to be illegible, or worse, legible but wrong, at the receiving terminal.
Since the command and control system is used for disaster response, it
should be resistant to plausible disasters, such as a fire in a datacenter where the system is running, preferably without human intervention. If the system is resistant to the loss of a whole datacenter, it can obviously also be made resistant to the failure of any component inside the datacenter by handling the failure the same way as the loss of the whole datacenter. However, this may not be desirable for reasons of efficiency, so I will also look at handling failures at a lower level.
3.2 Network Communications Architecture
Conceptually the system operates as described in Chapter 1 and elaborated above. Dispatchers connect to the system using client software running on their workstations. The client software connects to a backend system that runs in multiple data centers. Multiple data centers are exposed to the dispatcher so that the dispatcher may choose which datacenter to connect to. The primary procedure in case of problems with one datacenter is for the client software to automatically switch to a different datacenter. The switch should not lose current state of the client, but may cause an interruption of a few seconds to client operations. See Figure 3.1 for an overview.
To enable switching datacenters at will, the backend system must maintain consensus spanning multiple data centers. The minimum number of nodes for a system that maintains availability and consistency upon a single crash-type fault is three, according to Lamport [21]. It is obvious that one node is not enough (it is unavailable upon a crash), and with two nodes it is impossible to distinguish failure of the interconnection between the two nodes from one node crashing. Thus both nodes must stop upon a communication failure in order to maintain consistency; otherwise it could be that the failure was in the interconnection and both nodes could proceed, causing divergence in system states.
Three nodes are sufficient to distinguish the failure of a network link between two nodes from the failure of one of the nodes using a simple majority vote. It is not even necessary for all the nodes to store the database. One node may instead act as a witness for the other nodes, allowing them to decide whether the other data node is down or the interconnection between the data nodes has failed. However, the system architecture assumes that all nodes also store the data. This has implications for data durability upon multiple component failures.
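The quorum arithmetic behind the three-node minimum, spelled out as a small sketch:

    # To stay available and consistent through f crash faults, a majority-based
    # cluster needs n >= 2f + 1 nodes, and a quorum is floor(n/2) + 1 nodes.
    f = 1                   # tolerate one crashed node or one severed link
    n = 2 * f + 1           # = 3 nodes, one per datacenter
    quorum = n // 2 + 1     # = 2 nodes can keep serving while the third is cut off
    print(n, quorum)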
Within a minimal 3-node backend, each of the nodes maintains connectivity with both of the other nodes. The logical network is thus a ring network as presented in Chapter 2. See Figure 3.2 for an illustration.
Figure 3.1: System communications architecture overview (diagram of datacenters DC1, DC2 and DC3 omitted)
In reality, it is likely that the network topology also resembles a star, since the connections from e.g. DC1 to DC2 and DC3 are not actually independent, but at least inside DC1 likely pass through common wiring and switching equipment (see Figures 3.1 and 3.9). This is not an issue, since it is expected that network connectivity within data centers has redundant physical links with quick enough failover to prevent triggering failure detectors in the actual backend software. Even if failure detectors are triggered, the problem is small, since the system is designed to tolerate the failure of a datacenter.
Figure 3.2: Cluster communications architecture overview (ring diagram of DC1, DC2 and DC3 omitted)
The network configuration described above is later assumed when describing the software architecture and database requirements. The test system, described in detail in Chapter 5, is also designed to simulate this configuration.
3.3 Software Architecture
At a high level the application software of the command and control system uses a messaging system to communicate changes to other application nodes in real time and a database to persistently store the current state. Figure 3.3 illustrates this. Dashed lines in the figure are connections to other datacenters. Among the information stored is the current state of each unit. Updates to unit state may be initiated by the client software or by an external system connected to any of the application nodes. I will not describe the connectivity with external systems in detail here, since from the application's perspective it can be handled the same way as updates initiated by client software.
Figure 3.3: Software architecture overview (diagram omitted; components: Client, Application, DB and MQ, with dashed connections to other datacenters)
It is imperative that state updates are committed to the database before being broadcast over the messaging system, since upon restart an application node will first start listening to updates from the messaging system and then refresh its internal state from the database. If an update were first broadcast using the messaging system and only then became visible through the database, the application node might start listening for updates from the messaging system after the update had been broadcast there and still receive an old version of the object from the database. The application nodes also partially keep the system state in memory, so that when a client application fetches a particular object, it is primarily returned from memory by the application node and, if not present in memory, retrieved from the database. Application nodes also forward relevant updates to the clients connected to them.
The database should also be causally consistent, that is, if client A performs a write, then communicates with client B, and client B then does a read, client B should not be able to see a version of the written item that is older
than what A wrote. This is not an absolute requirement, since with the described system architecture, lack of causal consistency causes unnecessary conflicts but does not cause malfunction.
Application software in the command and control system is designed so that consistency can be maintained as long as the underlying database provides an atomic update primitive. The atomic update primitive must be able to provide a guarantee similar to the CAS memory operation commonly found in modern processor instruction sets. As a memory operation, CAS replaces the value at address X with value B if the current value is A; otherwise it does nothing and somehow signals this. In a database setting, some sort of row or object identifier replaces the address, but otherwise the operation remains the same. The ABA problem is avoided by using version counters. Importantly, the software is designed so that it does not require transactions that span multiple rows or objects.
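A minimal sketch of the kind of version-checked conditional update assumed here; a plain in-memory dictionary stands in for the database and all names are invented:

    class VersionedStore:
        """Toy key-value store with a CAS-style conditional update."""

        def __init__(self):
            self._data = {}                          # key -> (version, value)

        def get(self, key):
            return self._data.get(key, (0, None))

        def compare_and_set(self, key, expected_version, new_value):
            version, _ = self.get(key)
            if version != expected_version:          # someone else updated first
                return False
            # Monotonically increasing version counters also avoid the ABA problem.
            self._data[key] = (version + 1, new_value)
            return True

    store = VersionedStore()
    print(store.compare_and_set("unit-X", 0, {"status": "free"}))    # True
    print(store.compare_and_set("unit-X", 0, {"status": "busy"}))    # False: stale version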
The messaging protocol is designed so that messages are idempotent. For example, a state update for unit X contains the complete unit state, including the version number, rather than just the updated fields. Including version numbers in messages also allows nodes to ignore obsolete information. For example, it is possible that nodes A and B could update the state of unit X in quick succession so that the update messages are delivered out of order to node C. Using the version information, C can then ignore the obsolete update from A.
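On the receiving side, the version number is what makes applying such messages idempotent and order-insensitive; a minimal sketch with invented field names:

    def apply_update(local_state, message):
        """Apply a full-state update message only if it is newer than what we hold."""
        current = local_state.get(message["unit"])
        if current is not None and current["version"] >= message["version"]:
            return False                    # duplicate or obsolete update: ignore it
        local_state[message["unit"]] = message
        return True

    state = {}
    apply_update(state, {"unit": "X", "version": 2, "status": "dispatched"})  # arrives first
    apply_update(state, {"unit": "X", "version": 1, "status": "free"})        # obsolete, ignored
    print(state["X"]["version"])            # 2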
3.3.1 Update Operation
Figure 3.4: Successful update operation (sequence diagram omitted; messages: Update(X,1,2) from the client via the application server to the database, Success from the database, Updated(X,2) broadcast on the messaging system, and Success back to the client)
In the nominal case, a state update for unit X initiated by a client is performed as shown in Figure 3.4. First the client application requests the application
server to update unit X from version 1 to version 2. The application server requests the database server to perform the same update. In the nominal case the update succeeds and the messaging system is used to communicate the update to the other application server nodes. Finally the application server informs the client that the update was successful.
In case the client does not receive a success response within a timeout, it displays a failure message to the user. The software then switches to another application server, on which the update procedure succeeds. Figure 3.5 illustrates the update procedure in case a failure occurs on Application Server 1 before it updates the database.
Figure 3.5: Application server crashes before performing database update (sequence diagram omitted)
If the database update had already been performed, the recovery procedure is different. When application server 2 attempts to perform the update for the client, the database operation fails because the current version (version 2, as updated by application server 1) does not match the version provided (version 1, provided by the client). Application server 2 then fetches the current version from the database and compares it with the new version provided by the client. Since they are the same, the database update had already been completed before, and the server application proceeds to broadcast the update via the messaging system. Since messages are idempotent, it does not matter whether the crash of the original application server happened before the message was broadcast as in Figure 3.6 or afterwards as in Figure 3.7.
Figure 3.6: Application server crashes after performing database update but before broadcasting the update (sequence diagram omitted)
Figure 3.7: Application server crashes after performing database update and broadcasting it (sequence diagram omitted)
The version comparison detailed above is also used to detect actual conflicts. In Figure 3.8 two clients race to update unit X and client 2 wins the race. The application node serving client 1 receives a failure indicating a version conflict, as in Figure 3.6 or 3.7. However, after it fetches the current version from the database, that version does not match the version that client 1 offered as version 2. The only possibility upon a conflict like this is to return an error to the client, since the system does not know how to resolve the conflict.
Figure 3.8: Two clients race to update unit X (sequence diagram omitted)
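Putting the recovery rules of Figures 3.5 to 3.8 together, the application server's handling of a failed conditional update reduces to one comparison. The sketch below builds on the version-checked store sketched earlier; the broadcast callback and the error handling are simplified assumptions.

    def handle_update(store, broadcast, key, expected_version, new_value):
        """Perform a client's update, tolerating a crash of a previous application server."""
        if store.compare_and_set(key, expected_version, new_value):
            broadcast(key, expected_version + 1, new_value)   # nominal case (Figure 3.4)
            return "ok"
        current_version, current_value = store.get(key)
        if current_value == new_value:
            # A previous application server already committed this exact update before
            # crashing (Figures 3.6 and 3.7): rebroadcast it; messages are idempotent.
            broadcast(key, current_version, current_value)
            return "ok"
        return "conflict"    # genuinely conflicting concurrent update (Figure 3.8)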
3.4 FMEA Analysis of System
In this section I present an FMEA analysis of the system. I concentrate on the hardware components of a concrete derivative of the abstract network configuration shown in Figure 3.2. In this concrete system, each datacenter has one physical computer (CPUn) connected to a switch (SWn) with two cables using interface bonding for redundancy. The switch has a single external connection to an external routed network, the topology of which is such that routes to other data centers are symmetric and common up to a point (Rn) but split after that. The components are shown in Figure 3.9.
In this analysis, the physical computers are treated as a single component. In a real installation, the computers will have redundant subsystems such as multiple disks and power supplies, but also single points of failure such as the motherboard chipset, and some analysis of the effects of subsystem failures should be performed. However, this FMEA analysis is not complete at the sub-computer level because the configurations of individual computers are not so standardized as to facilitate analysis of anything except actual installations
of the system, and no such installations are available for analysis at the present time. Similarly, as mentioned in Chapter 1, software errors are not part of this analysis.

Figure 3.9: Network configuration under FMEA analysis (diagram omitted; each datacenter DCn contains a computer CPUn connected to a switch SWn, which connects to the routed network at Rn; a client connects through the routed network)
As expected, the FMEA shows that no single non-byzantine failure will cause the system to stop operating. Since the system is not designed to tolerate byzantine behavior, the result is good. However, it should be noted that increased latency, due to for example misconfiguration of the network, will not be detected in the system unless the latency is so high as to trigger timeouts in network protocols. Even lower latency levels will, however, result in lower system performance. In an actual installation, SLAs (service level agreements) should be used to ensure that network links have sufficiently low latency, and in addition latencies should be monitored with standard tools.
Table 3.1: FMEA analysis of the system

Component: CPUn-SWn alternate network cable
Failure mode: Cable is cut
Immediate effects: No effect
Detection: Software on CPUn detects the failure through ARP failures or through MII sniffing.
Automatic recovery procedure: No automatic recovery
Effects after automatic recovery procedure: No automatic recovery

Component: SWn-Rn connectivity
Failure mode: Connectivity lost
Immediate effects: Transactions do not reach CPUn.
Detection: Software on other CPUs detects the error through absence of communications.
Automatic recovery procedure: Software on other CPUs forms a new cluster that resumes service.
Effects after automatic recovery procedure: After the new cluster has been formed, service continues.

Component: SWn-Rn connectivity
Failure mode: Increased latency
Immediate effects: Decreased system throughput
Detection: Not detected unless the latency triggers protocol timeouts.
Automatic recovery procedure: No automatic recovery
Effects after automatic recovery procedure: No automatic recovery

Component: Ra-Rb connectivity
Failure mode: Connectivity lost
Immediate effects: CPUa and CPUb cannot communicate with each other.
Detection: Software on CPUa and CPUb detects the communication failure.
Automatic recovery procedure: Software on other CPUs forms a new cluster that resumes service.
Effects after automatic recovery procedure: After the new cluster has been formed, service continues.

Component: Ra-Rb connectivity
Failure mode: Increased latency
Immediate effects: Decreased system throughput
Detection: Not detected unless the latency triggers protocol timeouts.
Automatic recovery procedure: No automatic recovery
Effects after automatic recovery procedure: No automatic recovery
3.5 FTA Analysis of System
In this section I present an FTA analysis of the system network configuration based on the following root event: "service inaccessible to client". See Figure 3.9 for the network configuration on which this analysis is based. Three analysis trees are generated: one for the entire system (Figure 3.10), one for the case where inaccessibility is caused by problems in the cluster (Figure 3.11) and one for the failure of a data center and its possible causes (Figure 3.12). All data centers are identical, and thus the single-datacenter analysis is applicable to all three data centers in the system. In the FTA analysis, each datacenter is also considered to include its external network link, since it is assumed that only a single link exists, and this simplifies the higher-level analysis.
Figure 3.10: System-level FTA
The failure modes discovered through FTA are not surprising. The only difference from analysis of the abstract dual-ring model done in Chapter 2 is that it is assumed that connectivity can break in such a way that node 1 can communicate with node 2 and node 2 with node 3 but node 1 may still be unable to communicate with node 3. This produces a new failure mode for the cluster, named “Datacenter and intra-datacenter link failed” in Figure 3.11.
Figure 3.11: Cluster FTA
Figure 3.12: Datacenter FTA
3.6 Software Reliability Considerations
In software reliability literature such as Storey [27], the problem of common-cause failures is often mentioned as an issue particularly prevalent in software systems. The reason why common-cause failures are particularly problematic for software systems is that software does not wear out, and thus most of the failure categories that apply to mechanical and electrical systems are not applicable. Naturally, software systems still require some underlying hardware to function, but failures in that hardware can largely be tolerated through hardware-level redundancy such as parallel power supplies or ECC RAM, through mixed hardware and firmware means such as multiple disks in a RAID configuration, or through multiple computer units and higher-level clustering software.
Even though multiple-computer configurations help tolerate hardware failures, the software itself remains a single point of failure. Design mistakes are automatically replicated to all copies of the software. If all the computers run identical software, a fault in the software may easily cause an identical error on all computers in a cluster, possibly stopping operation of the whole computer system. The solution is to ensure that the software running on different computers is different, or diverse. However, all instances of the software must still have identical external behavior, which in practice leads to some common-cause failures even in different programs written against the same specification. It is also notable that software systems tend to be complex, and even system specifications often contain mistakes, which will propagate to all correct implementations of the specification.
With some of the database systems evaluated, some amount of software diversity can be achieved by running different versions of the database software in one cluster. This is possible to some extent, since for most of the evaluated systems the preferred software update method is a so-called rolling update, in which nodes are taken down and updated one node at a time. There is no apparent reason why multiple versions could not be left operational as well. However, for many of the databases it is not specified whether more than two versions can be operated simultaneously. If not, two of the three cluster nodes would still run the same software, possibly suffering from common-cause failures. Two of three cluster nodes failing would cause system downtime, and thus this is not a very attractive configuration, especially since new versions of the database software typically fix issues, and running some nodes without these fixes would expose the system to known failures.
Another method of achieving software diversity in a complex software system that uses many standard components, such as the C library or the Java runtime, would be to use, on different systems, versions of the standard components that are as different as possible. For example, the C library could be GNU C Library, BSD libc or musl. For Java runtimes there are fewer high-performance alternatives, but at least Oracle and IBM offer one. It would even be possible to use diverse operating systems, such as FreeBSD, NetBSD and Linux, in the same cluster.
3.7 Conclusions on Analyses
FMEA and FTA analyses of the architectural model of the system produce further evidence of the suitability of the architecture for high-availability operation. It appears that the design goal of single-hardware-fault tolerance is achieved at the architectural level. However, since the architecture is so simple, this was evident from the start. The main use for methodical analysis in this case is not as a design tool but as documentation. As documentation, FMEA and FTA benefit from having a standard structure, which makes them quicker to interpret than, for example, free-form text.
It should also be noted that the architectural model presented here is a simplification; in reality, for example, the network topologies between deployment data centers should be analyzed to discover whether they comply with the architecture or not. If a noncompliant network topology is discovered, its effects on fault tolerance should be analyzed separately. This is an example of componentization, which allows simplification of the higher-level model to a level at which it can be analyzed within economical bounds. In general, componentization also allows generalization of the analysis so that it is not limited to a specific instance of the system.
Chapter 4
Evaluated Database Systems
I considered many databases for evaluation. In the following sections I first present the requirements for the database, based on the system architecture described in Chapter 3, and then list both the databases that I selected for evaluation based on those requirements and the ones that I rejected, together with the reasons for rejection.
4.1 Database Requirements
As presented in Chapter 3, the command and control system is designed to depend on the database for highly available shared state. The ability to perform CAS-like atomic updates is the primary functional requirement for the database. In SQL, the necessary construct is SELECT FOR UPDATE, which ensures that the row is not updated by others before the current transaction ends. In non-SQL databases the APIs vary, but typically the procedure is optimistic, so that the SELECT-like read operation and the UPDATE-like write operation are not connected. Instead, the write operation takes both the old and the new version as parameters, and only succeeds if the old version is current. This is exactly equivalent to CAS.
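To make the optimistic variant concrete, the following sketch shows the read-modify-write retry loop built on a hypothetical versioned key-value client; the KeyValueStore class, its method names and the key used are illustrative only and do not correspond to any particular database API.

# Hypothetical in-memory store used only to illustrate the CAS pattern:
# get() returns (value, version) and put_if_version() succeeds only if
# the stored version still matches the expected one.
class VersionConflict(Exception):
    """Raised when the stored version no longer matches the expected one."""

class KeyValueStore:
    def __init__(self):
        self._data = {}  # key -> (value, version)

    def get(self, key):
        return self._data.get(key, (None, 0))

    def put_if_version(self, key, new_value, expected_version):
        _, current_version = self._data.get(key, (None, 0))
        if current_version != expected_version:
            raise VersionConflict(key)
        self._data[key] = (new_value, current_version + 1)

def atomic_update(store, key, modify):
    """CAS-style read-modify-write: retry until the conditional write succeeds."""
    while True:
        value, version = store.get(key)
        try:
            store.put_if_version(key, modify(value), version)
            return
        except VersionConflict:
            continue  # another writer won the race; re-read and retry

store = KeyValueStore()
atomic_update(store, "operators", lambda v: (v or []) + ["operator-1"])
print(store.get("operators"))

The essential property is that the write inside the loop can only succeed against the exact version that was read, which is the CAS semantics required above.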
Additional requirements for the database stem from the requirement that the system remain operational despite datacenter-scale failures. To achieve necessary fault tolerance, the database system must support multi-datacenter installations with some form of synchronous replication so that atomic updates are available cluster-wide. Successful commits must be replicated to at least two sites so that failure of any single site does not cause data loss. In addition, the database system must have adequate throughput both before and after hardware failures. The database system should also allow backups to be made of a live system without disruption to service.
Many modern so-called NoSQL databases do provide excellent throughput, but with limited consistency guarantees. The reasons why they do not provide cluster-wide atomic updates vary, but generally it is a design choice that allows writes to complete even when a quorum is not available. Even when a database offers quorum writes as an option, the quorum often only ensures that the write is durable, not that a conflicting write cannot succeed concurrently. In essence, the database assumes that writes are independent in that they must not rely on previous values for the same key. Typically both conflicting writes succeed, and depending on the database the application developer must either reconcile the conflicting updates at read time or rely on some sort of timestamp comparison to resolve the conflict automatically.
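The practical consequence can be seen in a deliberately simplified sketch of two clients performing a read-modify-write against a store that accepts blind, last-write-wins updates; the interleaving is simulated in straight-line code and the store is just a dictionary standing in for any database without conditional writes.

# Two clients perform a read-modify-write against a store that accepts
# blind writes (last write wins); one increment is silently lost.
store = {"counter": 10}

# Both clients read the same value...
seen_by_a = store["counter"]
seen_by_b = store["counter"]

# ...and both writes "succeed", even though they conflict.
store["counter"] = seen_by_a + 1
store["counter"] = seen_by_b + 1

print(store["counter"])  # 11, not 12: the first increment was overwritten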
4.2 Rejected Databases
The following sections describe the database systems, as well as some approaches to the problem that are not based on a single database system, that I rejected based on a cursory survey.
4.2.1 XA-based Two-phase Commit
The update problem could possibly be solved using any database that supports the standard XA two-phase commit (2PC) protocol together with an external transaction manager such as Atomikos or JBoss Transactions. However, 2PC is not suited for a high-availability system that should recover quickly from failures, since the behavior of 2PC upon member failure at certain points requires waiting for that member to come back up before processing of other transactions can proceed on the remaining members [16, chapter 14.3]. The cause for this design decision is clear. The XA protocol was designed to allow atomic commits to happen across disparate systems (such as a message queue and different database engines), where the main concern is that the transaction either happens on all the systems or on none of them. The problem here, however, is to allow atomic commits to happen in a distributed environment so that the system can proceed even when some nodes are not present.
In addition to the potentially hazardous failure mode, the application itself would have to manage replication by writing to each replica within a single transaction. The application would thus have to know about all the replicas, and for example bringing down one replica and replacing it with another would require configuration changes to the application itself. On the whole, it appears that a solution based on standard XA 2PC is likely to be a source of much trouble and not worth investigating further.
4.2.2 Neo4j
Neo4j1 is a graph database package developed by Neo Technology. It is distributed under the GPLv3 [7] and AGPLv3 [6] licenses, and commercial licensing is also possible. The database can also be used as a regular object store, and offers transactions and clustering for high availability. However, transactions are only available through the native Java API. The only API offered to other languages is an HTTP REST API that does not provide transactions or even simpler conditional updates.
I decided to not proceed further in my tests with Neo4j because of the interface limitations.
4.2.3 MySQL
MySQL2 is a well-known database system nowadays developed by Oracle Corporation. Oracle distributes MySQL under GPLv2 [2] license and commercial licensing is also possible. MySQL APIs exist for most popular programming languages and MySQL offers transactions and replication for high availability.
In MySQL version 5.5, only asynchronous replication was available, so I decided not to evaluate MySQL further. During the writing of this thesis Oracle released MySQL 5.6 with semisynchronous replication. A cursory look at the semisynchronous replication feature suggests that with high synchronization timeout values it might have been suitable for further testing.
Several forks of MySQL also exist. After a cursory review of the biggest three (Drizzle, MariaDB and Percona Server) I found that none of them provide synchronous replication.
MySQL MMM
MySQL MMM3 is a multi-master replication system built on top of standard MySQL. It does not offer any consistency guarantees for concurrent conflicting writes. A write made to one master is simply replicated to the other asynchronously. Because MySQL MMM does not provide cluster-wide atomic commits, I decided to not evaluate it further.
1http://neo4j.org/ 2http://www.mysql.com/ 3http://mysql-mmm.org/
4.2.4 PostgreSQL
PostgreSQL4 is an open source database system developed outside the control of any single company. PostgreSQL is distributed under the PostgreSQL license [8], which is similar to the standard MIT [1] license. APIs for PostgreSQL exist for most popular programming languages.
PostgreSQL 9.1 is the first version that offers synchronous replication. However, PostgreSQL 9.1 limits synchronous replication to a single target [10]. It would be possible to build a system fulfilling all the database requirements on top of PostgreSQL, but I decided not to start such an ambitious project within the scope of this thesis, and thus abandoned further evaluation of PostgreSQL.
4.2.5 HBase
HBase5 is an open source distributed database built on the Apache Hadoop distributed software package. HBase is distributed under Apache Software License version 2.0 [5]. The native API for HBase is only available in Java; however, an API based on the Thrift RPC system can be used from many popular languages such as Python and C++. HBase allows atomic updates of single rows through the checkAndPut API call.
HBase depends on the Hadoop Distributed Filesystem (HDFS) and Zookeeper for clustering. HDFS is used to achieve shared storage through which HBase nodes access data. In Hadoop versions before 2.0.0 the HDFS architecture has a single point of failure (SPOF) in the form of the NameNode. At the time of writing of this thesis, the initial release of Hadoop 2.0.0 partially remedies this issue through the HDFS HA feature, although it still only provides master-standby operation and failover remains manual.
I originally abandoned further evaluation of HBase because of the HDFS single point of failure issue.
4.2.6 Redis
Redis6 is an open source key-value store. Redis is distributed under a BSD 3-clause [4] license. APIs for Redis are available for most popular languages. Redis supports compare-and-set type operations through its WATCH, MULTI and EXEC commands.
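As a rough illustration, a compare-and-set style counter increment with the redis-py client could look like the following; the key name and increment logic are arbitrary, and details such as exception names may vary between client versions.

import redis

# Optimistic compare-and-set with WATCH/MULTI/EXEC: the queued commands
# are discarded if the watched key changes between the read and the EXEC.
def checked_increment(client, key):
    with client.pipeline() as pipe:
        while True:
            try:
                pipe.watch(key)                    # WATCH
                current = int(pipe.get(key) or 0)  # plain read while watching
                pipe.multi()                       # MULTI
                pipe.set(key, current + 1)
                pipe.execute()                     # EXEC; raises on conflict
                return current + 1
            except redis.WatchError:
                continue  # another client modified the key; retry

client = redis.Redis(host="localhost", port=6379)
print(checked_increment(client, "example-counter"))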
Redis implements master-slave type asynchronous replication. I abandoned further evaluation of Redis, since it does not support synchronous replication and thus transactions may be lost after the master node has confirmed them to the client but before it has replicated them to any slaves. There is also a specification for Redis Cluster, which would support synchronous replication, but it has not been implemented yet.
4http://www.postgresql.org/ 5http://hbase.apache.org/ 6http://redis.io/
4.2.7 Hypertable
Hypertable7 is an open source distributed database built on the Apache Hadoop distributed software package. Hypertable is developed by Hypertable Inc. and available under the GPLv3 [7] license as well as under a commercial license to be negotiated separately with Hypertable Inc. APIs for Hypertable are based on the Thrift RPC system, and bindings are available for several popular programming languages. Hypertable does not support atomic operations beyond counters. Hypertable's high availability features are restricted by its use of Hadoop HDFS, which introduces a single point of failure.
I abandoned further evaluation of Hypertable because of its lack of support for atomic updates of even single rows and because of the HDFS single point of failure issue.
4.2.8 H-Store
H-Store8 is an experimental memory-based database developed in collaboration between MIT, Brown University, Yale University and HP Labs. H-Store is available under GPLv3 [7] license. The H-Store website states that it is experimental and unsuitable for production environments, which is why I abandoned further evaluation of it.
4.2.9 Infinispan
Infinispan9 is an in-memory key/value store developed by Red Hat, originally designed to be used as a cache. It is available under LGPLv2.1 [3] license. Infinispan supports atomic updates, on-disk storage and clustering with synchronous replication. However, it has the same failing as MySQL Cluster, namely that it cannot be configured to prevent writes from succeeding in a minority partition of the cluster.
7http://hypertable.org/ 8http://hstore.cs.brown.edu/ 9http://www.jboss.org/infinispan/
4.2.10 Project Voldemort
Project Voldemort10 is a distributed key-value store developed at LinkedIn. Project Voldemort is available under Apache Software License 2.0 [5]. Voldemort uses vector clocks and read repair to provide eventual consistency, but it is unclear whether it can be used to implement fault-tolerant atomic updates. Rather than attempt to do so, I abandoned further evaluation of Project Voldemort.
4.2.11 Membase / Couchbase
Couchbase11 is a distributed database developed by Couchbase. Couchbase is available under Apache Software License 2.0 [5]. APIs for Couchbase are available for various popular programming languages, including Java, C and Python. Couchbase supports replication, but I could not find a complete description of the replication semantics when I originally selected databases for further evaluation in summer 2011. At that time it appeared that Couchbase 2.0 might include some improvements to replication, but only developer preview versions were available. As of July 2012, Couchbase 2.0 is still available only as developer preview versions, and the features offered by the replication subsystem remain unclear.
4.2.12 Terrastore
Terrastore12 is an open-source NoSQL database licensed under Apache Software License 2.0 [5]. It appears to be an independent project hosted on Google Code. Terrastore can be accessed using a Java API or an HTTP API. Both interfaces offer conditional updates.
The replication features of Terrastore are based on the Terracotta in-memory clustering technology from Terracotta Inc. However, I could not find information on the durability or persistence features of Terrastore in the Terrastore wiki, so I abandoned further evaluation.
4.2.13 Hibari
Hibari13 is a strongly consistent key-value store licensed under Apache Software License 2.0 [5]. Hibari was originally developed by Gemini Mobile Technologies Inc. Hibari has a native Erlang API and a cross-platform API based on the Thrift RPC system. When I originally read the Hibari documentation it appeared that it did not support atomic conditional updates, which is why I did not select it for further evaluation. However, upon later reading it appears that I was mistaken and I should have evaluated it more carefully.
10http://project-voldemort.com/ 11http://www.couchbase.com/ 12http://code.google.com/p/terrastore/ 13http://hibari.github.com/hibari-doc/
4.2.14 Scalaris
Scalaris14 is a transactional distributed key-value store. The development of Scalaris has been funded by Zuse Institute Berlin, onScale solutions GmbH and several EU projects. Scalaris is available under Apache Software License 2.0 [5]. I originally decided not to evaluate Scalaris further because it did not appear to be ready for production use. At the time of writing of this chapter in July 2012, the links to the Users and Developers Guide on the Scalaris homepage do not lead anywhere, so it would appear that the original decision was correct.
4.2.15 GT.M
GT.M15 is a key-value database engine originally developed by Greystone Technology Corp. Nowadays it is maintained by Fidelity Information Services. GT.M is available under GPLv2 [2] license. APIs for GT.M are available for some popular languages, including Python. GT.M offers ACID transactions.
GT.M offers Business Continuity replication, which on a closer look appears to be asynchronous replication for disaster recovery purposes. I abandoned further evaluation of GT.M because it does not have synchronous replication.
4.2.16 OrientDB
OrientDB16 is an open-source graph-document database system. OrientDB is distributed under Apache Software License 2.0 [5]. The native language for OrientDB is Java. In addition to Java, language bindings are available for at least Python, using the HTTP interface of OrientDB. The HTTP REST API of OrientDB is limited in that it does not offer conditional updates. Conditional updates are, however, supported using the native Java API.
14http://code.google.com/p/scalaris/ 15http://www.fisglobal.com/products-technologyplatforms-gtm 16http://www.orientdb.org/
OrientDB supports both synchronous and asynchronous replication. It is not clear what visibility guarantees the replication offers. I abandoned further evaluation of OrientDB because of the limitations of its cross-language support.
4.2.17 Kyoto Tycoon
Kyoto Tycoon17 is a key-value database engine developed and maintained by FAL Labs. Kyoto Tycoon is distributed under GPLv3 [7] license. APIs for Kyoto Tycoon exist at least for C/C++ and Python. Kyoto Tycoon supports atomic updates.
High-availability features of Kyoto Tycoon are limited to hot backup and asynchronous replication. I abandoned further evaluation of Kyoto Tycoon because of its lack of synchronous replication.
4.2.18 CouchDB
CouchDB18 is an open-source distributed database. CouchDB is distributed under Apache Software License 2 [5]. CouchDB is accessed using an HTTP API and bindings are available for many popular programming languages including Java and Python. CouchDB supports atomic updates on a single server but not cluster-wide.
CouchDB supports peer-to-peer replication with automatic conflict resolution that ensures all nodes resolve conflicts the same way. The replication is not visible to users in the sense that users cannot, for example, select how many replicas must receive a write before it is considered successful. Manual conflict resolution that differs from the automated procedure is also possible, since the copies that lose in the automatic resolution process can still be accessed. I did not consider CouchDB for further evaluation because of the limited atomic update support.
4.3 Databases Selected for Limited Evaluation
Databases selected for limited evaluation were evaluated for throughput in a non-conflicting update test without failures and for fault tolerance with a suite of fault-inducing tests. The databases selected for limited evaluation are popular among application developers but are not suitable for use as the main database for the command and control system, because they do not support cluster-wide atomic updates.
17http://fallabs.com/kyototycoon/ 18http://couchdb.apache.org/
All evaluated databases can be set up in a cluster so that the cluster does not have a single point of failure, and their APIs allow the client to specify that the write must be replicated to a certain number of hosts before the write is considered successful. These features make them suitable for write-mostly applications such as logging.
4.3.1 Cassandra
Cassandra19 is an open-source distributed database system. Cassandra is distributed under Apache Software License 2.0 [5]. The Cassandra API is based on the Thrift RPC system and bindings exist for many popular languages such as Java, Python and Ruby. I used version 1.1.2 of Cassandra in my tests.
Consistency requirements in Cassandra are specified for each write and read operation separately. If readers read from a quorum of nodes and writers write to a quorum of nodes, then causal consistency exists between writes and reads, so that after a write has completed, all readers will see that write. However, it is not possible to detect conflicting writes, which makes atomic updates impossible to implement, and thus I only consider Cassandra suitable for limited evaluation.
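As an illustration of per-operation consistency levels, the following sketch uses the pycassa client (one of the Thrift-based Python clients available for the Cassandra version tested); the keyspace, column family, hostnames and keys are placeholders, and the exact import path for ConsistencyLevel may differ between client versions.

import pycassa
from pycassa.cassandra.ttypes import ConsistencyLevel  # path may vary by version

# Quorum writes combined with quorum reads: once a write has completed,
# any subsequent quorum read will observe it. The keyspace and column
# family are assumed to exist already.
pool = pycassa.ConnectionPool(
    "CommandKeyspace",
    ["dc1-node:9160", "dc2-node:9160", "dc3-node:9160"])
state = pycassa.ColumnFamily(pool, "SharedState")

state.insert("mission-42", {"status": "active"},
             write_consistency_level=ConsistencyLevel.QUORUM)

row = state.get("mission-42",
                read_consistency_level=ConsistencyLevel.QUORUM)
print(row["status"])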
Cassandra has limited support for backups. Each node can be backed up separately, but there is no way to get a backup of the whole cluster at a point in time. Considering that atomic updates are not possible, and thus Cassandra is not suited for storing things that must be updated in a consistent fashion, this is probably acceptable.
Based on the Cassandra documentation20, the nodetool repair command should be run on each node periodically to ensure that data that is not written to all nodes at commit time and is rarely read gets replicated appropriately. No other periodic maintenance tasks are mentioned in the documentation. In addition to repairing data, nodetool also allows other maintenance tasks to be performed, such as removing nodes from the cluster and rebalancing the hash ring Cassandra uses to locate data in the cluster. A new node is added by starting it with a configuration that includes some hosts of the existing cluster in its so-called “seed list”.
19http://cassandra.apache.org/ 20http://wiki.apache.org/cassandra/
4.3.2 Riak
Riak21 is a distributed database system developed by Basho Technologies Inc. Riak is distributed under Apache Software License 2.0 [5]. Riak is accessed via a REST-style HTTP API and bindings exist for several popular programming languages including Java, Python and Ruby. I used version 1.1.4 of Riak in my tests.
Similarly to Cassandra, consistency requirements in Riak are specified for each write and read operation separately, and causal consistency is achievable in the same way. As with Cassandra, atomic updates are not possible because conflicts are resolved at read time: when conflicting writes occur, the reader must select which one is retained. Because of this impossibility of atomic updates I only consider Riak suitable for limited evaluation.
Based on the dual-fault tests, Riak cannot ensure that writes are only successful if a quorum of nodes is available. I stopped evaluation of Riak once I noticed this behavior, so no details on its backup mechanism or administrative tools are presented here.
4.4 Databases Selected for Full-Scale Evaluation
Databases selected for full-scale evaluation were evaluated for throughput, without failures, in non-conflicting update tests and in update tests with various conflict rates, and for fault tolerance with a suite of fault-inducing tests.
4.4.1 Galera
Galera22 is a multi-master clustering solution for MySQL developed by Codership Oy. Galera is distributed under GPLv2 [2]. Galera is an add-on to MySQL, so standard MySQL clients can be used to access it. I used version 2.2rc1 in my tests.
Galera offers transparent multi-master clustering on top of standard MySQL. Galera uses synchronous replication that ensures that only one of a set of concurrent conflicting transactions can commit. Galera also ensures that only a partition of the cluster containing a majority of the nodes can process transactions, thus ensuring that successful commits are always replicated to at least two sites in the command and control system architecture.
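Because conflicting transactions are resolved at commit time, a multi-master client needs a retry loop around its transactions. The sketch below uses the MySQLdb client and assumes, as Galera documentation commonly describes, that a replication conflict surfaces to the client as a deadlock-style error on commit; the schema, credentials and error handling are placeholders.

import MySQLdb

# Application-level retry loop for a multi-master Galera node. The
# assumption here is that a certification conflict detected at commit
# time is reported like a deadlock (an OperationalError in MySQLdb).
def reserve_task(host, task_id, operator):
    conn = MySQLdb.connect(host=host, user="app", passwd="secret", db="c2")
    try:
        while True:
            try:
                cur = conn.cursor()
                cur.execute("SELECT owner FROM tasks WHERE id = %s FOR UPDATE",
                            (task_id,))
                row = cur.fetchone()
                if row is None or row[0] is not None:
                    conn.rollback()
                    return False          # missing or already reserved
                cur.execute("UPDATE tasks SET owner = %s WHERE id = %s",
                            (operator, task_id))
                conn.commit()             # conflicts are detected here
                return True
            except MySQLdb.OperationalError:
                conn.rollback()           # lost to a commit on another node
                continue                  # retry the whole transaction
    finally:
        conn.close()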
21http://basho.com/products/riak-overview/ 22http://codership.com/
Galera maintains a copy of all data on each cluster node, so it is possible to take a block device-level snapshot of the data on one node and use that as a backup. Galera also supports standard MySQL/InnoDB backup tools such as Percona XtraBackup.
Based on the documentation on the Galera website, a Galera cluster does not require any periodic maintenance. Connecting nodes to and removing nodes from the cluster is done by manipulating the values of configuration variables, either through the configuration file or through the MySQL command line interface. Cluster state can also be monitored through the regular MySQL interface by inspecting configuration and state variables. Each Galera node has different configuration settings, since each node must be configured with the addresses of the other nodes in the cluster.
4.4.2 MongoDB
MongoDB23 is a document database system developed by 10gen Inc. The MongoDB server itself is distributed under AGPLv3 [6] license, but the client APIs developed by 10gen are distributed under Apache Software License 2.0 [5]. Client APIs are available for many popular programming languages including C++, Java and Python. MongoDB allows atomic updates of single documents. I used version 2.0.6 of MongoDB in my tests.
MongoDB clustering works in two dimensions that are managed separately. To increase throughput, data is distributed to multiple shards. Each shard is backed by either a single node or a cluster that facilitates replication. For this thesis, I ignore the sharding dimension, since the main focus is on high availability. In MongoDB terminology a replicating cluster is called a replica set. A replica set has one master and a number of slaves. By default all reads and writes go to the master, facilitating causal consistency and atomic updates. The slaves asynchronously pull updates from the master, but write operations support an option to specify that the write is not considered complete until a specific set of slaves has replicated the update.
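The write-acknowledgement option can be illustrated with pymongo; note that this sketch uses a newer driver API than the MongoDB 2.0.6 setup tested in this thesis, and the connection string, replica set name and document fields are placeholders.

from pymongo import MongoClient

# Connect to a replica set and require acknowledgement from a majority
# of members before a write is reported successful.
client = MongoClient(
    "mongodb://dc1-node,dc2-node,dc3-node/?replicaSet=c2rs",
    w="majority")          # write concern for all writes on this client
db = client.command_control

# Atomic single-document update, conditional on the current revision.
result = db.state.update_one(
    {"_id": "mission-42", "revision": 7},
    {"$set": {"status": "active"}, "$inc": {"revision": 1}})
print(result.modified_count)  # 0 if the expected revision was not current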
MongoDB only offers READ UNCOMMITTED semantics cluster-wide for transactions. Writes done to the primary are visible to reads on the primary before they have been replicated to the secondaries. In the extreme case this means that a client may read an object from the current primary, but the object may disappear if the primary immediately fails and the new primary that is elected has not yet received it. There are also theoretical limitations on data size that stem from how MongoDB handles access to on-disk resources. The on-disk data representation is mapped into memory, so there must be enough address space to map all data. This is not a practical limitation on systems with a 64-bit address space, but 32-bit systems are typically limited to about two gigabytes of data.
23http://www.mongodb.org/
MongoDB can be backed up by copying the data directory from a filesystem snapshot, as long as MongoDB has journaling enabled. Only a single node of a replica set needs to be backed up. Additionally, MongoDB allows point-in-time backups even without filesystem snapshots with the mongodump utility.
MongoDB does not have a separate utility for cluster administration. Administrative tasks, such as adding and removing replicas from replica sets, are performed using special commands with the regular command line application. Configuration files on all nodes in a MongoDB replica set are identical.
4.4.3 MySQL Cluster
MySQL Cluster24 is a high-availability database system that replaces the standard storage engines in MySQL with a cluster system that allows synchronous replication. Nowadays Oracle packages it separately from the standard MySQL software, so I also present it separately here. Oracle offers MySQL Cluster under GPLv2 [2] license and various commercial licensing schemes. MySQL Cluster can be accessed using standard MySQL APIs, so its programming language support is good. MySQL Cluster also has a separate API for directly accessing the replicated storage system. I tested version 7.1.15a of