Aalto University School of Science
Degree Programme of Computer Science and Engineering

Tuure Laurinolli

High-Availability Database Systems: Evaluation of Existing Open Source Solutions

Master's Thesis
Espoo, November 19, 2012

Supervisor: Professor Heikki Saikkonen
Instructor: Timo Lättilä, M.Sc. (Tech.)
ABSTRACT OF MASTER’S THESIS
Author: Tuure Laurinolli
Date: November 19, 2012 Pages: 90
Professorship: Software Systems Code: T-106
Supervisor: Professor Heikki Saikkonen
In recent years the number of open-source database systems offering high-availability functionality has exploded. The functionality offered ranges from simple one-to-one asynchronous replication to self-managing clustering that both partitions and replicates data automatically.
In this thesis I evaluated database systems for use as the basis for high availability of a command and control system that should remain available to operators even upon loss of a whole datacenter. In the first phase of evaluation I eliminated systems that appeared to be unsuitable based on documentation. In the second phase I tested both the throughput and the fault tolerance characteristics of the remaining systems in a simulated WAN environment.

In the first phase I reviewed 24 database systems, of which I selected six, split into two categories based on consistency characteristics, for further evaluation. Experimental evaluation showed that two of these six did not actually fulfill my requirements. Of the remaining four systems, MongoDB proved troublesome in my fault tolerance tests, although the issues seemed resolvable, and Galera's slight issues were due to its configuration mechanism. This left one system in each category: Zookeeper and Cassandra, which did not exhibit any problems in my tests.
Keywords: database, distributed system, consistency, latency, causality
Language: English
ABSTRACT OF MASTER'S THESIS (TRANSLATED FROM FINNISH)
Date: November 19, 2012 Pages: 90
Professorship: Software Technology Code: T-106
Supervisor: Professor Heikki Saikkonen
Instructor: Timo Lättilä, M.Sc. (Tech.)

In this master's thesis I evaluated the suitability of database systems as the basis for high-availability functionality in a command and control system that must remain available even when an entire datacenter fails. In the first phase of the evaluation I eliminated, based on documentation, the clearly unsuitable systems. In the second phase I tested both the fault tolerance and the throughput of the systems in a simulated high-latency network.

In the first phase I examined 24 database systems, of which I selected six for closer evaluation. I divided the systems selected for closer evaluation into two categories based on their consistency characteristics. In the experiments I found that two of these six did not fulfill the requirements I had set. Of the remaining four systems, MongoDB caused problems in my fault tolerance tests, although the problems appeared to be fixable, and Galera's minor problems were due to its configuration mechanism. This left Zookeeper from the first category and Cassandra from the second, in neither of which did my tests find fault tolerance problems.

Keywords: database, distributed system, consistency, latency, causality
Language: English
Acknowledgements
I would like to thank Portalify Ltd for offering me an interesting thesis project and ample time to work on it. At Portalify I would especially like to thank my instructor, Timo Lättilä, M.Sc., for putting me on the right track from the start. Outside Portalify, I would like to thank Professor Heikki Saikkonen for taking the time to supervise my thesis.

I also want to thank my friends and family for providing me with support and, perhaps even more importantly, welcome distractions. Aalto on Waves was downright disruptive, and learning to fly at Polyteknikkojen Ilmailukerho took its time too. However, the constant support of old friends was the most important. Thank you, Juha and #kumikanaultimate!
Helsinki, November 19, 2012
Abbreviations and Acronyms
2PC    Two-phase Commit
ACID   Atomicity, Consistency, Isolation, Durability
API    Application Programming Interface
ARP    Address Resolution Protocol
CAS    Compare And Set
FMEA   Failure Modes and Effects Analysis
FMECA  Failure Modes, Effects and Criticality Analysis
FTA    Fault Tree Analysis
HAPS   High Availability Power System
HTTP   Hypertext Transfer Protocol
JSON   JavaScript Object Notation
LAN    Local Area Network
MII    Media Independent Interface
NAT    Network Address Translation
PRA    Probabilistic Risk Assessment
REST   Representational State Transfer
RPC    Remote Procedure Call
RTT    Round-Trip Time
SDS    Short Data Service
SLA    Service Level Agreement
SSD    Solid State Drive
SQL    Structured Query Language
TAP    Linux network tap
TCP    Transmission Control Protocol
TETRA  Terrestrial Trunked Radio
VM     Virtual Machine
WAN    Wide Area Network
XA     X/Open Extended Architecture
Contents

Abbreviations and Acronyms

1 Introduction
  1.1 High-Availability Command and Control System
  1.2 Open-Source Database Systems
  1.3 Evaluation of Selected Databases
  1.4 Structure of the Thesis

2 High Availability and Fault Tolerance
  2.1 Terminology
  2.2 Overcoming Faults
  2.3 Analysis techniques

3 System Architecture
  3.1 Background
  3.2 Network Communications Architecture
  3.3 Software Architecture
  3.4 FMEA Analysis of System
  3.5 FTA Analysis of System
  3.6 Software Reliability Considerations
  3.7 Conclusions on Analyses

4 Evaluated Database Systems
  4.1 Database Requirements
  4.2 Rejected Databases
  4.3 Databases Selected for Limited Evaluation
  4.4 Databases Selected for Full-Scale Evaluation

5 Experiment Methodology
  5.1 Test System
  5.2 Test Programs
  5.3 Fault Simulation
  5.4 Test Runs

6 Experiment Results
  6.1 Throughput Results
  6.2 Fault Simulation Results

7 Comparison of Evaluated Systems
  7.1 Full-Scale Evaluation
  7.2 Limited Evaluation

8 Conclusions

B Remaining fault test results
Chapter 1

Introduction
In this thesis I present my research related to the adoption of an existing open-source database system as the basis for high availability in a command and control system being developed by Portalify Ltd.
1.1 High-Availability Command and Control System
The command and control system is designed to support the operations of rescue personnel by automatically tracking the status and location of field units so that dispatching operators always have a correct and up-to-date view of available units. It tracks locations of TETRA handsets and vehicle radios, and handles status messages sent by field personnel in response to events such as receiving dispatch orders. The system also allows operators to dispatch a unit on a mission, and automatically sends the necessary information to the unit.
The system should scale to installations that span large geographical areas, with dispatching operators located in multiple, geographically diverse control rooms, and thousands of controlled units spread over the geographical area. Typically operators in one control room would be responsible for controlling units in a specific area, but it should be possible for another control room to take over the area in case the original control room cannot handle its tasks because it has, for example, lost electrical power.
In this thesis I concentrate on hardware fault tolerance of the
command and control system and also the database system, since
studying software faults of large, existing software systems
appears to be an unsolved problem. However, I touch on higher-level
approaches that could be used to enhance software fault tolerance
of a complex system in practice in Chapter 3.
I introduce terminology and analysis methods related to availability and fault tolerance in Chapter 2. In Chapter 3 I present more elaborate requirements for the system, a system architecture based on those requirements and a fault-tolerance analysis of the architecture model based on the analysis methods introduced in Chapter 2.
1.2 Open-Source Database Systems
The system described above must be able to share data between operators working on different workstations, located in different control rooms, distributed across a country. A database system for storing the data and controlling access to it is required. Because of the fault tolerance requirements presented in Chapter 3, the database system must be geographically distributed.

The main functional requirement for the database is that it must provide an atomic update primitive, preferably with causal consistency and read committed visibility semantics. The main non-functional requirements are quick, automatic handling of software, network and hardware faults and adequate throughput when clustered over a high-latency network. Even fairly low throughput is acceptable.
I limit the evaluation to open-source database systems both because of the apparently high cost of commercial high-availability database systems, such as Oracle, and because it is not possible to inspect how commercial, closed-source systems actually work. The transparency of open-source systems is not only beneficial for research purposes; it is also an operational benefit in that it is actually possible to find and fix problems in the system without having to rely on the database vendor for support. Already during the writing of this thesis I reported issues to several projects and fixed problems in database interfaces to be able to run my tests.
I present the requirements placed on the database system and
introduce a wide variety of open-source high-availability database
systems in Chapter 4.
1.3 Evaluation of Selected Databases
Since the main objective for the system in question is to find a distributed database system that is fault tolerant, I use a virtualized test environment that is capable of injecting faults and latency into a distributed system. The test environment uses VirtualBox (https://www.virtualbox.org/), Netem [20] and virtualized Debian (http://www.debian.org/) systems capable of running all the tested database systems. I test the fault tolerance characteristics of the selected database systems in this environment by injecting process, network and hardware faults and measuring the effects on clients connected to different nodes of the database cluster.

In addition to fault tolerance, I test the update throughput of the database systems in various high-latency configurations with varying numbers of clients.
I elaborate on the test environment in Chapter 5 and present the test results in Chapter 6, as well as a comparison of the evaluated systems based on the test results and features in Chapter 7.
1.4 Structure of the Thesis
In Chapter 1 I introduce the product from which the criteria for evaluating the databases are derived. In Chapter 2 I introduce high-availability and fault tolerance terminology and fault tolerance analysis procedures used in other fields. In Chapter 3 I describe the system architecture, how fault tolerance can be achieved with it and the requirements it places on the database. In Chapter 4 I describe the various databases that I considered when selecting systems for evaluation, and explain how the evaluated databases were selected. In Chapter 5 I present the test methodology used in obtaining data for the evaluation of the databases. In Chapter 6 I present test results for several databases using the test methods from Chapter 5. In Chapter 7 I compare the evaluated databases based on the results presented in Chapter 6. In Chapter 8 I present conclusions about the suitability of different databases for use as the basis for sharing state in a high-availability command and control system.
Chapter 2

High Availability and Fault Tolerance
A high-availability database system is a database system with the characteristic that an operator can achieve high availability using it. The meaning of availability, and how it relates to fault tolerance, is discussed below.
2.1 Terminology
The terms related to availability, and the meaning of availability itself, need to be carefully defined in order to be useful. In systems that operate continuously, availability is often defined as the probability that the system is operating correctly at any given time [27]. This definition is problematic when applied to query-oriented computer systems such as databases, for which continuous availability is not easily defined, since the availability of the system is only measurable when a query is performed.
2.1.1 Availability
It is somewhat difficult to measure availability even when queries are performed. What is the availability of the system if query A executes successfully, but during its execution, query B begins executing and fails because of a spurious network error? Whether or not this is possible depends on the design of the network protocol that the database cluster uses in its internal communications, but it is certainly imaginable that some of the evaluated systems could allow this kind of behavior.
Even without going as far as proposing simultaneity of failure and success, it is usually not enough for a query to eventually complete for it to be considered successful. Instead, there is usually an external requirement limiting the execution time of database queries when the database is part of a larger system. In addition, due to the concurrency control paradigms employed in certain databases, some queries are actually expected to fail. This happens when multiple clients attempt to concurrently update an entity in a system with optimistic concurrency control. Instead of one of two clients remaining blocked on a lock waiting for the other to complete its update, at least one of the updates must fail at or before commit time.
The interface from the rest of the command and control system to the database system is designed so that the failure of an individual query is not disastrous. Nor is unavailability of the database system for a few seconds upon, for example, failure of the underlying hardware a problem for the rest of the system. The database system should thus only be considered unavailable when queries take a disproportionately long time to execute or when they fail because of an error in the database system instead of a transient error resulting from the concurrent access protocol.
2.1.2 Reliability
While availability is usually defined in terms of the probability that the system is operating correctly at a point in time during continuous operation, reliability is defined as the probability that the system keeps operating correctly without failures for a defined period of time [27]. For the envisaged system, reliability is not a good metric, since the system does not have a well-defined lifetime over which reliability would be meaningful to measure.
For example, if the system had a second of downtime every 10 minutes, its availability would be about 0.998 but its reliability over any 10-minute period would be 0. For the expected use case with short queries this might be entirely acceptable. However, for a batch system performing video processing tasks that each take 30 minutes, the reliability figure above would be absolutely disastrous, since no task could ever finish.
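The distinction can be made concrete with a small calculation, assuming (for the sake of the sketch) that the outage is a deterministic one second out of every 600-second cycle:

```python
# Availability vs. reliability for a system that is down for
# exactly 1 second out of every 600-second (10-minute) cycle.

CYCLE = 600          # seconds per cycle
DOWNTIME = 1         # seconds of downtime per cycle

# Availability: fraction of time the system is up.
availability = (CYCLE - DOWNTIME) / CYCLE          # 599/600 ~ 0.998

def reliability(mission_seconds):
    """Probability of a failure-free run of the given length.

    With a deterministic outage every CYCLE seconds, any mission
    longer than the outage-free window must hit an outage.
    """
    return 1.0 if mission_seconds <= CYCLE - DOWNTIME else 0.0

print(round(availability, 3))      # 0.998
print(reliability(10 * 60))        # 0.0 -> a 10-minute task always fails
print(reliability(30))             # 1.0 -> a short query can fit between outages
```

A high availability figure thus says nothing about whether long-running work can ever complete, which is why reliability is the relevant metric for batch workloads but not for this system's short queries.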
2.1.3 Faults and Errors
According to Storey, “A fault is a defect within the system” and
can take many forms such as hardware failures and software design
mistakes. An error on the other hand is “a deviation from the
required operation of the system or subsystem” [27]. Storey further
classifies faults into random and systematic faults. Random faults
include hardware faults due to wear and tear, cosmic rays and other
random events. Systematic faults are faults due to design
mistakes.
A fault may cause an error, but the operation of the system may also mask the fault. For example, a software design mistake will not result in an error if the part of the software that contains the mistake is never executed. Hardware faults may similarly be masked. If a switch is never used, a fault in its installation cannot produce an error, and a fault in a computer hard disk may stay dormant for the lifetime of the system if the faulty sector is never accessed. In fact, a modern PC CPU contains hundreds of design faults [15], yet millions of the devices are in use every day without any apparent errors arising from these faults.
Storey also defines data integrity as “the ability of the system to prevent damage to its own database and to detect, and possibly correct, errors that do occur” [27]. Database terminology for data integrity is usually more nuanced, using terms such as atomicity, consistency, isolation and durability to describe characteristics of transactions in the database. According to Wikipedia [30], the terms originate from Haerder and Reuter [19]. A database system that ensures data integrity as defined by Haerder and Reuter is also fail-safe as defined by Storey in the sense that no committed transactions are lost upon error; thus errors do not endanger system state, they just prevent accessing or changing it.
2.1.4 Maintainability
Another concept of interest defined by Storey is maintainability. He defines maintainability as “the ability of a system to be maintained” and maintenance as “the action taken to retain a system in, or return a system to, its designed operating condition” [27]. In computer systems, common maintenance tasks often include ensuring that sufficient resources, such as disk space, are available in the system, backing up the system state to external media and applying configuration changes and software updates.
With databases, two intermingled properties often come up with backups. Most preferably the backup should be atomic, that is, reflect the state of the database at a single point in time. The backup should also have no effect on the normal operations of the database system, that is, answering queries and performing updates. Some database systems achieve the first property but fall short of the second, specifically because achieving the first requires blocking all write operations so that the backup can complete without interference from updates. Others choose to achieve the second property but fail on the first one, yet it is usually possible to achieve both if filesystem-level snapshots are available or if the database uses a multiversion concurrency control scheme. In the first case, restoring a backup made by copying the filesystem snapshot is equivalent to restarting the database after a power failure. In the latter case, the database explicitly keeps track of the lifetime of items so that during a backup old, deleted versions of items are simply kept around until the backup completes, and new items are not included in the backup.
Another maintainability issue with a complex system is management of the system configuration over time. In addition to the issues that arise in managing the configuration of a centralized software system, distributed systems have additional complexity related to ensuring that the whole system has a compatible configuration. For example, some distributed systems require that all nodes have mostly identical but subtly different configurations, because each node's configuration must specify the addresses of all other nodes but not the node itself.
On a centralized system, a configuration change is performed once, on one computer. If the change requires a restart of software, some downtime is unavoidable. In contrast, a distributed system may be able to tolerate configuration changes that require restarts of individual nodes without downtime. On-the-fly upgrades like this are often the preferred method in the world of distributed database systems, where the feature is often called a 'rolling restart' [25]. In practice the difficulty of having a dissimilar configuration for each node may not be great, since the configuration of each node must in any case be managed individually if changes are performed in a staggered fashion.
2.2 Overcoming Faults
Storey [27] divides techniques for overcoming the effects of faults into four categories: fault avoidance, fault removal, fault detection and fault tolerance. Fault avoidance covers techniques applied at the design stage, fault removal covers techniques applied during testing, and fault detection and fault tolerance cover detecting faults and mitigating their effects when the system is operational. An example of fault avoidance would be the use of formal methods during software development to prove that software matches its specification. Fault detection and fault tolerance are related in that fault tolerance in active systems typically requires some form of fault detection so that faulty parts of the system can be isolated or spare components activated, and the fault reported so that it can be repaired.
Several techniques for creating fault-tolerant software are described in the literature. The Wikipedia article on Software Fault Tolerance [31] lists Recovery Blocks, N-version Software and Self-Checking Software. In addition, Storey [27] mentions Formal Methods. Of these, Recovery Blocks are these days a mainstream feature of object-oriented programming languages such as C++, Java, Python and Ruby in the form of try-catch structures. Storey also mentions other language features common in today's languages, such as pointer safety and goto-less program structure, as enhancing the reliability of software [27].
N-version (multiversion) software aims to achieve redundancy by creating multiple versions of the same software function. The idea behind multiversion software is that the different versions, called variants, will have different faults, and thus correct operation can be ensured by comparing their results and selecting the result that is most popular. Both Storey [27] and Lyu [22] mention that common-cause faults have been found to be surprisingly common when multiversion programming has been applied. To avoid common-cause faults, the variants should be developed with as much diversity as possible. For example, separate hardware platforms, programming languages and development tools increase the likelihood of the different program versions actually having different faults.
Multiversion software also usually has a single point of failure, namely the component that selects the final result based on the variant results. However, it should be a simple component, maybe so simple as to allow exhaustive testing. As techniques for combining variant results, Lyu [23] mentions majority voting and median voting among others.
Majority voting simply picks the majority value, if any. Majority voting cannot produce a result in all cases, namely in situations where no majority exists. For example, if three variants each produce a different result, no majority exists, and some other solution is required. Some possibilities in this case are switching control to a non-computerized backup system, or shutting down the whole system into a safe state. Median voting is an interesting alternative in that for some special cases it allows the variants to be implemented so that their results do not have to match exactly in order for the combined result to be useful. For example, if diverse algorithms on diverse hardware are used to compute the deflection of a control surface of an aircraft, combining their outputs with a median filter would allow the algorithms to produce slightly different results for common cases, yet choose a common value in case one algorithm produces obviously wrong results.
The article on Self-Checking Software in Wikipedia [31] is actually about N-version Self-Checking Programming as described in Lyu [22, chapter 3], wherein the N-version aspect is the source of the redundancy necessary to tolerate faults and the self-checking part distinguishes it from regular N-version programming as described by Storey [27]. In regular N-version programming, an external component compares the results of the N diverse programs and determines the correct output, whereas in N-version self-checking programming each self-checking component must determine whether its result is correct and signal the other components in case it detects a fault in its output.
Correct use of formal methods ensures that software matches its specification. For them to be applicable, a formal specification must first be created. Some software development standards, such as UK Defence Standard 00-55, require the use of formal methods for safety-related software [12]. Techniques borrowed from formal methods are also used in less rigorous settings to find bugs in existing software [13].
In distributed systems, additional techniques that allow the system as a whole to proceed even if components fail are required. The problem of agreement in distributed systems is called the consensus problem. In theory, it is impossible to implement an algorithm solving the distributed consensus problem in an asynchronous network, that is, a network that does not guarantee delivery of messages in bounded time. In practice this is overcome by employing fault detectors based on timeouts. In addition to distributed consensus, distributed transactions feature widely in the literature. Transactions are a special case of distributed consensus, but a plethora of specialized algorithms exist for handling them; lately, however, the trend has perhaps been towards building databases on more generic consensus primitives. For example, Google's BigTable database is essentially based on the generic Paxos algorithm for solving distributed consensus. [16]
2.3 Analysis techniques
2.3.1 Failure Modes and Effects Analysis
Failure Modes and Effects Analysis (FMEA) was originally developed in the United States for military applications and codified in MIL-P-1629 in 1949. Later revisions were standardized as MIL-STD-1629 and MIL-STD-1629A. Early adopters of FMEA in civil applications include the aerospace and automotive industries. According to Haapanen and Helminen [18], the academic record of the application of FMEA to software development originates from the late 1970s. They mention a paper by Reifer published in 1979 titled Software Failure Modes and Effects Analysis; in that paper Reifer mentions some earlier work on software reliability, but nothing dating back further than 1974. [26] [18]
Failure Modes, Effects and Criticality Analysis (FMECA) is a development of FMEA that includes assessment of the criticality of failures. Criticality means, according to Haapanen and Helminen [18], “a relative measure of the consequences of a failure mode and its frequency of occurrences”. FMECA was part of MIL-STD-1629A, which was published in 1980. In this thesis I will perform qualitative criticality analysis of identified failures in Chapter 3.
The FMECA procedure itself is very simple. The procedure described here is based on the description in Storey [27]. For each system component:

1. Determine the failure modes of the component
2. Determine the consequences of failure of the component in each failure mode
3. Determine the criticality of each failure based on its consequences and likelihood

The result of FMECA is a table that contains a description of the consequences and criticality of all single-component failures.
The limitations of FMECA lie in its simplicity. It prescribes analysis of all system components, which soon becomes burdensome in larger systems. Appropriate modularization helps with this issue. If module interfaces are sufficiently well-defined, internal failures of a module can be treated at a higher level as failures of the larger module, reducing the complexity of analysis at the higher level. A larger, more difficult problem is that, as prescribed, FMECA limits analysis to single-component failures. Consequences of simultaneous failures of multiple components are not covered by the analysis. For example, FMECA analysis of a dual ring network topology would show that no single-link failure partitions the network, but would not cover the two-link failure cases which do partition the network.
It is difficult to envision how FMECA could practically be extended to multi-component failures, since already the obvious next step of applying the procedure to component pairs is often infeasible, because the number of component pairs in a system grows quadratically with the number of components. As already mentioned, proper modularization of the system could help somewhat, but even for small component counts, the number of component pairs is prohibitively large. However, in certain cases reduction of the analysis based on symmetries might make analysis of dual-component failures feasible. For example, in a dual ring network with N identical nodes (Figure 2.1a), a single-link failure has 2N identical cases (Figure 2.1b) and the N(2N − 1) dual-link failure pairs can be reduced to only three cases with different behavior (Figure 2.2): links in the same direction, both links between one pair of nodes, and links in different directions between different pairs of nodes. The first two have no effect on communications and the third splits the network in two. It is difficult to see how this could be generalized, though.
[Figure 2.1: Dual ring network; (a) healthy network, (b) single-link failure]
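The symmetry argument can be checked mechanically by brute-force enumeration. The following self-contained sketch models a dual ring of N = 5 nodes as directed links and treats the network as partitioned when it is no longer strongly connected:

```python
from itertools import combinations

N = 5  # number of nodes in the dual ring

def links():
    """Directed links of a dual ring: one clockwise and one
    counterclockwise link between each pair of neighboring nodes."""
    cw = [(i, (i + 1) % N) for i in range(N)]
    ccw = [((i + 1) % N, i) for i in range(N)]
    return cw + ccw

def strongly_connected(removed):
    """True if every node can still reach every other node."""
    edges = [e for e in links() if e not in removed]
    def reachable(start):
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for a, b in edges:
                if a == u and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen
    return all(len(reachable(n)) == N for n in range(N))

# Every single-link failure is tolerated.
assert all(strongly_connected({l}) for l in links())

# Of the dual-link failures, only opposite-direction links between
# different node pairs partition the network: N*(N-1) pairs for N = 5.
partitioned = [p for p in combinations(links(), 2)
               if not strongly_connected(set(p))]
print(len(partitioned))   # 20
```

The enumeration confirms the classification above: same-direction pairs and both links of one neighbor pair leave the network connected, while the N(N − 1) = 20 opposite-direction, different-pair combinations all cut it.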
2.3.2 Fault Tree Analysis
According to the NASA Office of Safety and Mission Assurance [24], the history of Fault Tree Analysis (FTA) dates back to the US aerospace and missile programs, where FTA was popular in the 1960s. Towhidnejad et al. [29] mention that FTA evolved in the aerospace industry in the early 1960s. Nowadays FTA and other Probabilistic Risk Assessment (PRA) techniques are used, for example, in the nuclear and aerospace industries. [24]
Storey [27] does not specifically mention probabilities in the context of FTA, and the NASA Office of Safety and Mission Assurance [24] specifically mentions that the Fault Tree (FT) that results from FTA is a “qualitative model”. According to Towhidnejad et al. [29], however, FTA is associated with a probabilistic approach to system analysis, and in NASA Office of Safety and Mission Assurance [24] probabilistic aspects are also introduced later. In this thesis I will only perform qualitative FTA-type analysis in Chapter 3.
The FTA procedure is in some ways the opposite of FMECA. In FTA the
starting point is a top event, the causes of which are to be
determined. The process is repeated recursively until the level of
"basic events" is reached. The question in FTA is thus "What would
have to happen for event X to happen?" rather than "What would
happen were event X to happen?" as in FMEA. FTA is also advertised
as a graphical method, with a well-defined graphical notation for
the tree structure produced through the recursion mentioned
Figure 2.2: Dual ring network multiple failure example
above [27]. An example of the graphical representation is shown in
Figure 2.3. Note that individual fault events are atomic and are
combined with Boolean operators when multiple lower-level faults are
required to cause a higher-level fault.
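The combination of atomic events with Boolean gates can be expressed directly in code. The sketch below is an illustration only; it echoes the cooling example of Figure 2.3, but the exact gate structure shown here is hypothetical:

```python
# A fault tree node is either a basic event (a string) or a gate:
# ('AND', [children]) or ('OR', [children]).
def occurs(node, basic_events):
    if isinstance(node, str):
        return basic_events.get(node, False)
    gate, children = node
    child_results = [occurs(c, basic_events) for c in children]
    return all(child_results) if gate == 'AND' else any(child_results)

# Hypothetical structure: cooling is lost if both redundant pumps fail,
# or if coolant is lost outright.
loss_of_cooling = ('OR', [
    ('AND', ['pump 1 fails', 'pump 2 fails']),
    'loss of coolant',
])

assert not occurs(loss_of_cooling, {'pump 1 fails': True})
assert occurs(loss_of_cooling, {'pump 1 fails': True, 'pump 2 fails': True})
```

Evaluating the top event from a set of basic events in this way corresponds to the qualitative reading of the tree; a probabilistic reading would attach probabilities to the basic events instead.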
FTA applied to the ring network example of the previous section,
with top-level event "Network is partitioned", is presented in
Figure 2.4. The reasoning already described in the previous section
is visualized in the FTA model. However, if the symmetry arguments
from the previous example were not applied, the tree would quickly
grow prohibitively large (Figure 2.5). Also, there is nothing
inherent in the construction of the Fault Tree that ensures that
faults caused by multiple failures are noticed. However, the focus
in FTA is on determining the causes of a specific event, which helps
concentrate analysis on the relevant aspects of the system.
Figure 2.3: Fault Tree Analysis notation example
In the literature, FTA is mostly mentioned in the context of
safety-critical systems. However, it is also useful in more mundane
software development and system design tasks. The output of FTA can
be used directly as a guide for finding possible causes of problems
in running software or operational systems. Automated construction
of Fault Trees from programs has been researched by Friedman [17],
although another name for the end result might be more suitable,
since the top event is not necessarily a fault but can be any state
of the program.
Also note how the selection of the top-level event affects the
analysis. If the top-level event "Single-failure tolerance lost" is
selected, the resulting fault tree is quite different, as can be
seen in Figure 2.6. The process of selecting appropriate top-level
events is not part of the FTA procedure and requires expertise
beyond simply applying a prescribed method to a system. In software
systems, both selecting appropriate top-level events and determining
an appropriate bottom level for the analysis are challenging because
of system complexity. If no bottom level is set, then eventually all
analyses of software programs end up at causes like "arbitrary
memory corruption", which
Figure 2.4: FTA of dual-ring network with top-level event "Network partition"
can cause any kind of behavior within the limits set by the laws of
physics.
2.3.3 Hazard and Operability Studies
Hazard and Operability Studies (HAZOP) is a technique developed in
the 1960s for analyzing hazards in chemical processes. According to
Storey [27], it has since become popular in other industries as
well.
The roots in the chemical industry are apparent from the description
by Storey [27], in which the process starts with a group of
engineers studying the operation of a process in steady state and
the effects of deviations from that steady state. The procedure
undoubtedly fits a continuous chemical process well, but requires
adjustments to be applicable in other industries. The HAZOP
procedure is also similar to FMEA in that one is supposed to pick a
deviation, find out what could cause such a deviation, and what the
deviation could in turn cause. This is better reflected in the
German acronym PAAG (Prognose von Störungen, Auffinden von Ursachen,
Abschätzen der Auswirkungen, Gegenmaßnahmen: prediction of
deviations, finding of causes, estimation of effects,
countermeasures) [11].
Figure 2.5: Naive FTA of the dual-ring network without symmetry reduction
Figure 2.6: FTA of dual-ring network with top-level event "Single
failure tolerance lost" (basic events: loss of the link between each
adjacent node pair A-B, B-C, C-D, D-E, E-A)
In HAZOP, guide words are used to ease the discovery of potential
failure types. Examples of guide words are "no", "more", "less" and
"reverse", which are easily applicable to, for example, material
flows in a continuous chemical process, but perhaps less easily to
computer systems. In a computer system there is, for example, no
reservoir from which traffic could flow into the network if the
current traffic flow is "less" than expected. It is still possible
to apply the same guide words to a limited extent, though, in the
sense that for example "more queries" could lead to analysis of the
effect of an overload of otherwise well-formed queries on a
query-oriented computer system. Also, overflow and underflow
conditions in input values, missing fields in protocol objects and
the like should of course be examined. However, that level of
analysis is usually done in unit tests, and is not part of
whole-system analysis. Conceivably the guide words of HAZOP would
indeed be well-suited for unit test construction.
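As a rough illustration of that idea, guide words can parametrize test inputs. Everything below is hypothetical: `handle_queries` is a stand-in for some real query-oriented system, and the chosen deviations only demonstrate the mapping from guide word to test case.

```python
NOMINAL_QPS = 100  # hypothetical nominal query load

# HAZOP guide words mapped to deviations of the query load.
GUIDE_WORD_LOADS = {
    'no': 0,
    'less': NOMINAL_QPS // 10,
    'more': NOMINAL_QPS * 100,
}

def handle_queries(qps):
    # Stand-in for the system under test: it must degrade gracefully
    # (report overload) rather than fail on unexpected load levels.
    return 'overloaded' if qps > NOMINAL_QPS * 10 else 'ok'

# One test case per guide word: the system must produce a defined
# outcome for every deviation, never an unhandled failure.
for guide_word, qps in GUIDE_WORD_LOADS.items():
    outcome = handle_queries(qps)
    assert outcome in ('ok', 'overloaded'), guide_word
```

A real test suite would of course check stronger properties per deviation; the point is only that each guide word generates a concrete test case.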
Chapter 3
System Architecture
In this chapter I describe the system architecture of the
high-availability command and control system and how it enables
tolerance of all single-component hardware failures, some
multiple-component hardware failures and certain software failures.
3.1 Background
The background for this thesis is a command and control system with
high availability. The system was described at a very high level in
Chapter 1. In this section I elaborate on the requirements of the
command and control system from which the system architecture in
the rest of this chapter is derived.
3.1.1 Availability and Reliability
It is clear that the command and control system should remain
operational all the time. It is equally clear that it does not have
to be as reliable as, for example, cooling systems of nuclear power
plants. Exact reliability and availability requirements are,
however, rather unclear, since no generally applied reliability
standards exist for command and control systems, unlike for nuclear
facilities [28].
As noted in Chapter 1, it should be possible for operators normally
responsible for one area to take control of units in another area.
To achieve this, operators in all the control rooms must have access
to all the information necessary for taking over control of units in
another area. The information must be up-to-date, and most
importantly it must not be possible for an operator to act on
outdated data, for example to dispatch a unit
that has already been dispatched by another operator, but is still
shown as free on his screen because of network delays.
3.1.2 Failure Model
Failures of the command and control system can be divided into
multiple categories of varying severity. One category is failures
that result in system unavailability for all operators.
Paradoxically, this is perhaps the easiest situation from the
perspective of operating procedures: all operators must simply
switch to a manual backup procedure. Similarly, failures that make
the system unavailable to a single operator, or to the operators in
a single control room, can be dealt with by switching control of the
affected area to another operator in the same control room, or to
another control room entirely if the whole control room is
unavailable.
Besides failures that cause total unavailability of the system for
some subset of the operators, the system might also experience a
partial failure that affects all operators, for a myriad of reasons.
For example, if the TETRA terminal in a vehicle loses power,
communications with the vehicle are disrupted, and the system loses
the ability to locate the vehicle and communicate with it. These
kinds of failures are expected, and are usually detected with
timeouts and acknowledgements in communications protocols. If the
system is correctly designed and implemented, they will be detected
and either mitigated or reported to the user.
For example, the system always shows the location of a vehicle
together with a timestamp indicating when the location report was
received, so that the operator can detect if some vehicle is not
sending new location reports. Similarly, if the user attempts to
dispatch a vehicle on a mission and there are communication problems
with the vehicle, the system will first attempt to mitigate the
failure by resending the message and, after a certain number of
failed retries, notify the operator that no acknowledgement for the
dispatch message was received, so that he can take appropriate
action.
In this thesis I concentrate on the use of a distributed database to
mitigate the effects of hardware and network failures in internal
components of the system. In particular, I will not attempt to prove
that the system is free of software bugs or able to tolerate
malicious behavior from internal components. In fact, it is easy to
imagine simple software problems that would result in
difficult-to-detect problems in a running system. A bug in the text
encoding routines for outgoing TETRA SDS messages could cause a
dispatch order to be illegible, or worse, legible but wrong, at the
receiving terminal.
Since the command and control system is used for disaster response,
it
should be resistant to plausible disasters, such as a fire in a
datacenter where the system is running, preferably without human
intervention. If the system is resistant to the loss of a whole
datacenter, it is obviously also resistant to the failure of any
component inside the datacenter, provided the failure is handled the
same way as the loss of the whole datacenter. However, this may not
be desirable for reasons of efficiency, so I will also look at
handling failures at a lower level.
3.2 Network Communications Architecture
Conceptually the system operates as described in Chapter 1 and
elaborated above. Dispatchers connect to the system using client
software running on their workstations. The client software connects
to a backend system that runs in multiple data centers. The multiple
data centers are exposed to the dispatcher so that the dispatcher
may choose which datacenter to connect to. The primary procedure in
case of problems with one datacenter is for the client software to
automatically switch to a different datacenter. The switch should
not lose the current state of the client, but may cause an
interruption of a few seconds to client operations. See Figure 3.1
for an overview.
To enable switching datacenters at will, the backend system must
maintain consensus spanning multiple data centers. The minimum
number of nodes for a system that maintains availability and
consistency upon a single crash-type fault is three, according to
Lamport [21]. It is obvious that one node is not enough (it is
unavailable upon a crash), and with two nodes it is impossible to
distinguish failure of the interconnection between the nodes from a
crash of one node. Thus both nodes must stop upon communication
failure in order to maintain consistency; otherwise the failure
might have been in the interconnection, and both nodes proceeding
would cause their states to diverge.
Three nodes are sufficient to distinguish the failure of a network
link between two nodes from the failure of one of the nodes, using a
simple majority vote. It is not even necessary for all the nodes to
store the database. One node may instead act as a witness for the
other nodes, allowing them to decide whether the other data node is
down or the interconnection between the data nodes has failed.
However, the system architecture assumes that all nodes also store
the data. This has implications for data durability upon multiple
component failures.
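The majority-vote argument reduces to a single inequality. A minimal sketch (the datacenter names are illustrative):

```python
def has_quorum(visible_nodes, cluster_size):
    # Strict majority: at most one side of any partition can satisfy
    # this, so the two sides can never both proceed and diverge.
    return 2 * len(visible_nodes) > cluster_size

# Two nodes cannot tell a peer crash from a link failure: after a
# split, each side sees only itself, neither retains quorum, and both
# must stop.
assert not has_quorum({'DC1'}, 2)

# With three nodes, the majority side continues and the minority side
# stops, whether the third node stores data or merely acts as a witness.
assert has_quorum({'DC1', 'DC2'}, 3)
assert not has_quorum({'DC3'}, 3)
```

This is only the safety half of the argument; an actual consensus protocol such as Lamport's [21] also has to handle message loss and reordering within the majority side.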
Within a minimal 3-node backend, each node maintains connectivity
with both of the other nodes. The logical network is thus a ring
network as presented in Chapter 2. See Figure 3.2 for
illustration.
Figure 3.1: System communications architecture overview
In reality, it is likely that the network topology also resembles a
star, since the connections from e.g. DC1 to DC2 and DC3 are not
actually independent but, at least inside DC1, likely pass through
common wiring and switching equipment (see Figures 3.1 and 3.9).
This is not an issue, since network connectivity within data centers
is expected to have redundant physical links with failover quick
enough not to trigger the failure detectors in the actual backend
software. Even if the failure detectors are triggered, the problem
is small, since the system is designed to tolerate the failure of a
whole datacenter.
Figure 3.2: Cluster communications architecture overview
The network configuration described above is assumed later when
describing the software architecture and database requirements. The
test system, described in detail in Chapter 5, is also designed to
simulate this configuration.
3.3 Software Architecture
At a high level, the application software of the command and control
system uses a messaging system to communicate changes to other
application nodes in real time and a database to persistently store
the current state. Figure 3.3 illustrates this. Dashed lines in the
figure are connections to other datacenters. Among the information
stored is the current state of each unit. Updates to unit state may
be initiated by the client software or by an external system
connected to any of the application nodes. I will not describe
connectivity with external systems in detail here, since from the
application's perspective it can be handled the same way as updates
initiated by client software.
Figure 3.3: Software architecture overview
It is imperative that state updates are committed to the database
before being broadcast over the messaging system, since upon restart
an application node will first start listening to updates from the
messaging system and then refresh its internal state from the
database. If an update were first broadcast over the messaging
system and only then became visible through the database, an
application node might start listening for updates after the update
had been broadcast and still receive an old version of the object
from the database. The application nodes also keep part of the
system state in memory, so that when a client application fetches a
particular object, the application node primarily returns it from
memory and, if it is not present in memory, retrieves it from the
database. Application nodes also forward relevant updates to the
clients connected to them.
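The restart rule above can be sketched as follows (class and method names are my own, not the actual system's):

```python
class ApplicationNode:
    """Sketch of the restart rule: subscribe to the messaging system
    first, then load the database snapshot. With the per-object version
    check, an update delivered before the snapshot loads cannot be
    overwritten by the older snapshot copy, and replayed messages are
    harmless."""

    def __init__(self, db, mq):
        self.db, self.mq = db, mq
        self.state = {}  # obj_id -> {'version': ..., ...}

    def start(self):
        self.mq.subscribe(self.apply)                # 1. listen first
        for obj_id, obj in self.db.snapshot().items():
            self.apply(obj_id, obj)                  # 2. then load state

    def apply(self, obj_id, obj):
        current = self.state.get(obj_id)
        # Keep an object only if it is newer than what we already hold.
        if current is None or obj['version'] > current['version']:
            self.state[obj_id] = obj

# Minimal stand-ins for the database and the messaging system:
class FakeDB:
    def snapshot(self):
        return {'X': {'version': 1, 'state': 'free'}}

class FakeMQ:
    def subscribe(self, callback):
        self.deliver = callback

db, mq = FakeDB(), FakeMQ()
node = ApplicationNode(db, mq)
node.start()
mq.deliver('X', {'version': 2, 'state': 'dispatched'})
mq.deliver('X', {'version': 1, 'state': 'free'})  # stale replay, ignored
assert node.state['X'] == {'version': 2, 'state': 'dispatched'}
```

Reversing the two steps in `start` reintroduces exactly the race described above: an update broadcast between the snapshot read and the subscription would be lost.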
The database should also be causally consistent, that is, if client
A performs a write, then communicates with client B, and client B
then does a read, client B should not be able to see a version of
the written item older
than what A wrote. This is not an absolute requirement, since with
the described system architecture, lack of causal consistency causes
unnecessary conflicts but does not cause malfunction.
The application software of the command and control system is
designed so that consistency can be maintained as long as the
underlying database provides an atomic update primitive. The atomic
update primitive must provide a guarantee similar to the CAS memory
operation commonly found in modern processor instruction sets. As a
memory operation, CAS replaces the value at address X with value B
if the current value is A; otherwise it does nothing and signals
failure. In a database setting, some sort of row or object
identifier replaces the address, but otherwise the operation remains
the same. The ABA problem is avoided by using version counters.
Importantly, the software is designed so that it does not require
transactions that span multiple rows or objects.
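The primitive can be sketched in a few lines (an in-memory illustration of the semantics, not any particular database's API):

```python
class VersionedStore:
    """CAS keyed by a monotonically increasing version counter. Because
    the counter never repeats, the ABA problem cannot arise: even if a
    value is changed and then changed back, the version differs."""

    def __init__(self):
        self.rows = {}  # key -> (version, value)

    def read(self, key):
        return self.rows.get(key, (0, None))

    def compare_and_set(self, key, expected_version, new_value):
        version, _ = self.read(key)
        if version != expected_version:
            return False  # a concurrent update got there first
        self.rows[key] = (version + 1, new_value)
        return True

store = VersionedStore()
assert store.compare_and_set('unit-X', 0, 'free')
assert not store.compare_and_set('unit-X', 0, 'dispatched')  # stale version
assert store.compare_and_set('unit-X', 1, 'dispatched')
```

A caller that loses the race re-reads the current version and decides whether to retry, exactly as in the update procedure described in the next section.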
The messaging protocol is designed so that messages are idempotent.
For example, a state update for unit X contains the complete unit
state, including the version number, rather than just the updated
fields. Including version numbers in messages also allows nodes to
ignore obsolete information. For example, nodes A and B could update
the state of unit X in such quick succession that the update
messages are delivered out of order to node C. Using the version
information, C can then ignore the obsolete update from A.
3.3.1 Update Operation
Figure 3.4: Successful update operation
In the nominal case, a state update for unit X initiated by a client
is performed as shown in Figure 3.4. First the client application
requests the application
server to update unit X from version 1 to version 2. The application
server requests the database server to perform the same update. In
the nominal case the update succeeds, and the messaging system is
used to communicate the update to the other application server
nodes. Finally the application server informs the client that the
update was successful.
If the client does not receive a success response within a timeout,
it displays a failure message to the user. The software then
switches to another application server, on which the update
procedure succeeds. Figure 3.5 illustrates the update procedure in
the case where a failure occurs on Application Server 1 before it
updates the database.
Figure 3.5: Application server crashes before performing database update
If the database update had already been performed, the recovery
procedure is different. When application server 2 attempts to
perform the update for the client, the database operation fails
because the current version (version 2, as updated by application
server 1) does not match the version provided (version 1, provided
by the client). Application server 2 then fetches the current
version from the database and compares it with the new version
provided by the client. Since they are the same, the database update
had already been completed, and the application server proceeds to
broadcast the update via the messaging system. Since messages are
idempotent, it does not matter whether the crash of the original
application server happened before the message was broadcast, as in
Figure 3.6, or afterwards, as in Figure 3.7.
Figure 3.6: Application server crashes after performing database
update but before broadcasting the update
Figure 3.7: Application server crashes after performing database
update and broadcasting it
The version comparison detailed above is also used to detect actual
conflicts. In Figure 3.8, two clients race to update unit X and
client 2 wins the race. The application node serving client 1
receives a failure indicating a version conflict, as in Figure 3.6
or 3.7. However, the current version fetched from the database does
not match the new version that client 1 was offering as version 2.
The only possibility upon a conflict like this is to return an error
to the client, since the system does not know how to resolve the
conflict.
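The duplicate-versus-conflict decision can be sketched as follows (structure and names are mine; `FakeDB` and `FakeMQ` are minimal stand-ins for a database with the CAS primitive and for the messaging system):

```python
class FakeDB:
    def __init__(self):
        self.rows = {}  # obj_id -> (version, value)

    def read(self, obj_id):
        return self.rows.get(obj_id, (0, None))

    def compare_and_set(self, obj_id, expected_version, value):
        version, _ = self.read(obj_id)
        if version != expected_version:
            return False
        self.rows[obj_id] = (version + 1, value)
        return True

class FakeMQ:
    def __init__(self):
        self.sent = []

    def broadcast(self, obj_id, value):
        self.sent.append((obj_id, value))

def handle_update(db, mq, obj_id, old_version, proposed):
    """A failed CAS is a completed duplicate if the stored value equals
    the proposal (the crashed server already committed it); any other
    mismatch is a real conflict and is returned to the client."""
    if db.compare_and_set(obj_id, old_version, proposed):
        mq.broadcast(obj_id, proposed)  # nominal path
        return 'success'
    _, current_value = db.read(obj_id)
    if current_value == proposed:
        mq.broadcast(obj_id, proposed)  # rebroadcast is safe: idempotent
        return 'success'
    return 'conflict'

db, mq = FakeDB(), FakeMQ()
assert handle_update(db, mq, 'X', 0, 'dispatched') == 'success'
assert handle_update(db, mq, 'X', 0, 'dispatched') == 'success'  # duplicate
assert handle_update(db, mq, 'X', 0, 'recalled') == 'conflict'
```

The double broadcast in the duplicate case is harmless precisely because the messages carry the full versioned state, as described in Section 3.3.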
Figure 3.8: Two clients race to update unit X; client 2 wins the race
3.4 FMEA Analysis of System
In this section I present an FMEA analysis of the system. I
concentrate on the hardware components of a concrete derivative of
the abstract network configuration shown in Figure 3.2. In this
concrete system, each datacenter has one physical computer (CPUn)
connected to a switch (SWn) with two cables, using interface bonding
for redundancy. Each switch has a single external connection to an
external routed network, whose topology is such that routes to the
other data centers are symmetric and common up to a point (Rn) but
split after that. The components are shown in Figure 3.9.
In this analysis, the physical computers are treated as single
components. In a real installation, the computers will have
redundant subsystems such as multiple disks and power supplies, but
also single points of failure such as the motherboard chipset, and
some analysis of the effects of subsystem failures should be
performed. However, this FMEA analysis is not carried to the
sub-computer level, because the configurations of individual
computers are not so standardized as to facilitate analysis of
anything except actual installations
Figure 3.9: Network configuration under FMEA analysis
of the system, and no such installations are available for analysis
at the present time. Similarly, as mentioned in Chapter 1, software
errors are not part of this analysis.
As expected, the FMEA shows that no single non-Byzantine failure
will cause the system to stop operating. Since the system is not
designed to tolerate Byzantine behavior, this is a good result. It
should be noted, however, that increased latency due to, for
example, misconfiguration of the network will not be detected by the
system unless the latency is high enough to trigger timeouts in the
network protocols. Even latencies below that threshold will,
however, reduce system performance. In an actual installation, SLAs
(service level agreements) should be used to ensure that network
links have sufficiently low latency, and latencies should
additionally be monitored with standard tools.
Table 3.1: FMEA analysis of the system (continued)

| Component | Failure | Immediate effects | Detection | Automatic recovery procedure | Effects after automatic recovery procedure |
| CPUn-SWn alternate network cable | Cable is cut | No effect | Software on CPUn detects the failure through ARP failures or through MII sniffing. | No automatic recovery | No automatic recovery |
| SWn-Rn connectivity | — | Transactions do not reach CPUn. | Software on other CPUs detects the error through absence of communications. | Software on other CPUs forms a new cluster that resumes service. | After the new cluster has been formed, service continues. |
| SWn-Rn connectivity | Increased latency | Decreased system throughput | — | No automatic recovery | No automatic recovery |
| Ra-Rb connectivity | — | — | Software on CPUa and CPUb detects communication failure. | Software on other CPUs forms a new cluster that resumes service. | After the new cluster has been formed, service continues. |
| Ra-Rb connectivity | Increased latency | Decreased system throughput | — | No automatic recovery | No automatic recovery |
3.5 FTA Analysis of System
In this section I present an FTA analysis of the system network
configuration, based on the top event "service inaccessible to
client". See Figure 3.9 for the network configuration on which this
analysis is based. Three analysis trees are generated: one for the
entire system (Figure 3.10), one for the case where inaccessibility
is caused by problems in the cluster (Figure 3.11) and one for the
failure of a data center and its possible causes (Figure 3.12). All
data centers are identical, and thus the single-datacenter analysis
is applicable to all three data centers in the system. In the FTA,
each datacenter is also considered to include its external network
link, since it is assumed that only a single such link exists, and
this simplifies the higher-level analysis.
Figure 3.10: System-level FTA
The failure modes discovered through FTA are not surprising. The
only difference from the analysis of the abstract dual-ring model in
Chapter 2 is the assumption that connectivity can break in such a
way that node 1 can communicate with node 2 and node 2 with node 3,
while node 1 is still unable to communicate with node 3. This
produces a new failure mode for the cluster, named "Datacenter and
intra-datacenter link failed" in Figure 3.11.
Figure 3.11: Cluster-level FTA (combinations of datacenter failures
DC1-DC3 and failures of the links R1-R2, R1-R3 and R2-R3)
Figure 3.12: Datacenter FTA
3.6 Software Reliability Considerations
In the software reliability literature, such as Storey [27], the
problem of common-cause failures is often mentioned as particularly
prevalent in software systems. The reason common-cause failures are
particularly problematic for software is that software does not wear
out, and thus most of the failure categories that apply to
mechanical and electrical systems are not applicable. Naturally,
software systems still require some underlying hardware to function,
but failures in that hardware can largely be tolerated through
hardware-level redundancy such as parallel power supplies or ECC
RAM, through mixed hardware and firmware means such as multiple
disks in a RAID configuration, or through multiple computer units
and higher-level clustering software.
Even though multiple-computer configurations help tolerate hardware
failures, the software itself remains a single point of failure.
Design mistakes are automatically replicated to all copies of the
software. If all the computers run identical software, a fault in
the software may easily cause an identical error on all computers in
a cluster, possibly stopping operation of the whole system. The
solution is to ensure that the software running on different
computers is different, or diverse. However, all instances of the
software must still have identical external behavior, which in
practice leads to some common-cause failures even in different
programs written against the same specification. It is also notable
that software systems tend to be complex, and even system
specifications often contain mistakes, which propagate to all
correct implementations of the specification.
With some of the database systems evaluated, some amount of software
diversity can be achieved by running different versions of the
database software in one cluster. This is possible to some extent,
since for most of the evaluated systems the preferred software
update method is a so-called rolling update, in which nodes are
taken down and updated one at a time. There is no apparent reason
why multiple versions could not be left operational as well.
However, for many of the databases it is not specified whether more
than two versions can be operated simultaneously. If not, two of the
three cluster nodes would still run the same software, possibly
suffering from common-cause failures. Two of three cluster nodes
failing would cause system downtime, and thus this is not a very
attractive configuration, especially since new versions of the
database software typically fix issues, and running some nodes
without these fixes would expose the system to known failures.
Another method of achieving software diversity in a complex software
system that uses many standard components, such as the C library or
the Java runtime, would be to use versions of the standard
components on different systems that are as different as possible.
For example, the C library could be the GNU C Library, BSD libc or
musl. For Java runtimes there are fewer high-performance
alternatives, but at least Oracle and IBM offer them. It would even
be possible to use diverse operating systems, such as FreeBSD,
NetBSD and Linux, in the same cluster.
3.7 Conclusions on Analyses
The FMEA and FTA analyses of the architectural model of the system
provide further evidence of the suitability of the architecture for
high-availability operation. It appears that the design goal of
single-hardware-fault tolerance is achieved at the architectural
level. However, since the architecture is so simple, this was
evident from the start. The main use for methodical analysis in this
case is not as a design tool but as documentation. As documentation,
FMEA and FTA benefit from having a standard structure, which makes
them quicker to interpret than, for example, free-form text.

It should also be noted that the architectural model presented here
is a simplification; in reality, for example, the network topologies
between deployment data centers should be analyzed to discover
whether they comply with the architecture. If a noncompliant network
topology is discovered, its effects on fault tolerance should be
analyzed separately. This is an example of componentization, which
allows simplification of the higher-level model to a level at which
it can be analyzed within economical bounds. In general,
componentization also allows generalization of the analysis so that
it is not limited to a specific instance of the system.
Chapter 4
Evaluated Database Systems
I considered many databases for evaluation. In the following
sections I first present requirements for the database, based on the
system architecture described in Chapter 3, and then list both the
databases that I selected for evaluation based on the requirements
and the ones that I rejected, together with the reasons for
rejection.
4.1 Database Requirements
As presented in Chapter 3, the command and control system is
designed to depend on the database for highly available shared
state. The ability to perform CAS-like atomic updates is the primary
functional requirement for the database. In SQL, the necessary
construct is SELECT FOR UPDATE, which ensures that the row is not
updated by others before the current transaction ends. In non-SQL
databases the APIs vary, but typically the procedure is optimistic,
so that the SELECT-like read operation and the UPDATE-like write
operation are not connected. Instead, the write operation takes both
the old and the new version as parameters, and only succeeds if the
old version is current. This is exactly equivalent to CAS.
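The optimistic variant can also be expressed in plain SQL by making the version part of the UPDATE predicate, so that no SELECT FOR UPDATE lock is needed. A sketch using SQLite (the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE units (id TEXT PRIMARY KEY, version INTEGER, state TEXT)')
conn.execute("INSERT INTO units VALUES ('X', 1, 'free')")

def cas_update(conn, unit_id, old_version, new_state):
    # The version in the WHERE clause makes the write conditional,
    # which is equivalent to CAS; rowcount reveals whether this
    # update won the race.
    cur = conn.execute(
        'UPDATE units SET state = ?, version = version + 1 '
        'WHERE id = ? AND version = ?',
        (new_state, unit_id, old_version))
    return cur.rowcount == 1

assert cas_update(conn, 'X', 1, 'dispatched')
assert not cas_update(conn, 'X', 1, 'recalled')  # stale version: no-op
```

A single-statement UPDATE like this is atomic within one database node; the harder requirement discussed below is making the same guarantee hold cluster-wide.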
Additional requirements for the database stem from the requirement
that the system remain operational despite datacenter-scale
failures. To achieve the necessary fault tolerance, the database
system must support multi-datacenter installations with some form of
synchronous replication, so that atomic updates are available
cluster-wide. Successful commits must be replicated to at least two
sites so that the failure of any single site does not cause data
loss. In addition, the database system must have adequate throughput
both before and after hardware failures. The database system should
also allow backups to be made of a live system without disruption to
service.
Many modern so-called NoSQL databases provide excellent throughput,
but with limited consistency guarantees. The reasons why they do not
provide cluster-wide atomic updates vary, but generally it is a
design choice that allows writes to complete even when a quorum is
not available. Even when a database offers quorum writes as an
option, the quorum often only ensures that the write is durable,
not that a conflicting write cannot succeed concurrently. In
essence, the database assumes that writes are independent, in that
they must not rely on previous values for the same key. Typically
both conflicting writes succeed, and depending on the database, the
application developer must either reconcile the conflicting updates
at read time, or some sort of timestamp comparison is used to
resolve the conflict automatically.
4.2 Rejected Databases
The following sections describe the database systems, and some
approaches to the problem that are not based on a single database
system, that I rejected based on a cursory survey.
4.2.1 XA-based Two-phase Commit
The update problem could possibly be solved using any database that
supports the standard XA two-phase commit (2PC) protocol together with
an external transaction manager such as Atomikos or JBoss
Transactions. However, 2PC is not suited for a high-availability
system that should recover quickly from failures, since the
behavior of 2PC upon member failure at certain points requires
waiting for that member to come back up before processing of other
transactions can proceed on the remaining members. [16, chapter
14.3] The cause of this design decision is clear. The XA protocol
was designed to allow atomic commits to happen across disparate
systems (such as a message queue and different database engines),
where the main concern is that the transaction either happens on
all the systems or on none of them. The problem here, however, is to
allow atomic commits to happen in a distributed environment so that
the system can proceed even when some nodes are not present.
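The blocking behavior of 2PC can be illustrated with a toy Python model. The Participant class and its methods are hypothetical simplifications for illustration, not any real XA API:

```python
class Participant:
    """Toy model of an XA-style resource manager."""
    def __init__(self, name):
        self.name = name
        self.prepared = False   # while True, locks are held on this member
        self.reachable = True

    def prepare(self):
        self.prepared = True    # vote yes: locks are now held
        return True

    def finish(self, commit):
        if not self.reachable:
            raise ConnectionError(self.name)
        self.prepared = False   # decision received, locks released

def two_phase_commit(participants):
    # Phase 1: every participant votes and takes its locks.
    for p in participants:
        p.prepare()
    # Phase 2: deliver the decision. A participant that cannot be reached
    # keeps holding its locks until it recovers and learns the outcome,
    # so other transactions touching the same data must wait.
    blocked = []
    for p in participants:
        try:
            p.finish(commit=True)
        except ConnectionError:
            blocked.append(p)
    return blocked
```

Running `two_phase_commit` with one participant unreachable after the prepare phase leaves that participant prepared, i.e. still holding its locks; this is the failure mode that makes recovery slow.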
In addition to the potentially hazardous failure mode, the application
itself would have to manage replication by writing to each replica
within a single transaction. The application would thus have to
know about all the replicas, and, for example, bringing down one
replica and replacing it with another would require configuration
changes to the application itself. On the whole, it appears that a
solution based on standard XA 2PC is likely to be a source
of much trouble and not worth investigating further.
4.2.2 Neo4j
Neo4j1 is a graph database package developed by Neo Technology. It
is distributed under the GPLv3 [7] and AGPLv3 [6] licenses, and
commercial licensing is also possible. The database can also be
used as a regular object store, and it offers transactions and
clustering for high availability. However, transactions are only
available through the native Java API. The only API offered to
other languages is an HTTP REST API that provides neither
transactions nor even simpler conditional updates.
I decided not to evaluate Neo4j further because of these interface
limitations.
4.2.3 MySQL
MySQL2 is a well-known database system nowadays developed by Oracle
Corporation. Oracle distributes MySQL under the GPLv2 [2] license, and
commercial licensing is also possible. MySQL APIs exist for most
popular programming languages, and MySQL offers transactions and
replication for high availability.
In MySQL version 5.5, only asynchronous replication was available,
so I decided not to evaluate MySQL further. During the writing of
this thesis, Oracle released MySQL 5.6 with semisynchronous
replication. A cursory look at the semisynchronous replication
feature suggests that with high synchronization timeout values it
might have been suitable for further testing.
Several forks of MySQL also exist. After a cursory review of the
biggest three (Drizzle, MariaDB and Percona Server) I found that
none of them provide synchronous replication.
MySQL MMM
MySQL MMM3 is a multi-master replication system built on top of
standard MySQL. It does not offer any consistency guarantees for
concurrent conflicting writes: a write made to one master is simply
replicated to the other asynchronously. Because MySQL MMM does not
provide cluster-wide atomic commits, I decided not to evaluate it
further.
1 http://neo4j.org/
2 http://www.mysql.com/
3 http://mysql-mmm.org/
4.2.4 PostgreSQL
PostgreSQL4 is an open source database system developed outside the
control of any single company. PostgreSQL is distributed under
the PostgreSQL license [8], which is similar to the standard MIT
[1] license. APIs for PostgreSQL exist for most popular
programming languages.
PostgreSQL 9.1 is the first version that offers synchronous
replication. However, PostgreSQL 9.1 limits synchronous replication
to a single target [10]. It would be possible to build a system
fulfilling all the database requirements on top of PostgreSQL, but
I decided not to start such an ambitious project in the scope of
this thesis, thus abandoning further evaluation of
PostgreSQL.
4.2.5 HBase
HBase5 is an open source distributed database built on the Apache
Hadoop distributed software package. HBase is distributed under the
Apache Software License version 2.0 [5]. The native API for HBase is
only available in Java; however, an API based on the Thrift RPC
system can be used from many popular languages such as Python and
C++. HBase allows atomic updates of single rows through the checkAndPut
API call.
HBase depends on the Hadoop Distributed Filesystem (HDFS) and ZooKeeper
for clustering. HDFS is used to achieve shared storage through
which HBase nodes access data. In Hadoop versions before 2.0.0, the HDFS
architecture has a single point of failure (SPOF) in the form of
the NameNode. The initial release of Hadoop 2.0.0, published during the
writing of this thesis, partially remedies this issue through the
HDFS HA feature, although it still only provides master-standby
operation and failover remains manual.
I originally abandoned further evaluation of HBase because of the
HDFS single point of failure issue.
4.2.6 Redis
Redis6 is an open source key-value store. Redis is distributed
under a BSD 3-clause [4] license. APIs for Redis are available for
most popular languages. Redis supports compare-and-set type
operations through its WATCH, MULTI and EXEC commands.
Redis implements master-slave type asynchronous replication. I
abandoned further evaluation of Redis, since it does not support
synchronous replication, and thus transactions may be lost after the
master node has confirmed them to the client but before the master
node has replicated them to any slaves. There is also a
specification for Redis Cluster, which would support synchronous
replication, but it has not been implemented yet.
4 http://www.postgresql.org/
5 http://hbase.apache.org/
6 http://redis.io/
4.2.7 Hypertable
Hypertable7 is an open source distributed database built on the Apache
Hadoop distributed software package. Hypertable is developed by
Hypertable Inc. and available under the GPLv3 [7] license as well as
a commercial license to be negotiated separately with Hypertable Inc.
APIs for Hypertable are based on the Thrift RPC system, and
bindings are available for several popular programming languages.
Hypertable does not support atomic operations beyond counters.
Hypertable’s high availability features are further restricted by its
use of Hadoop HDFS, which introduces a single point of failure.
I abandoned further evaluation of Hypertable because it lacks
support for atomic updates of even single rows and because of the
HDFS single point of failure issue.
4.2.8 H-Store
H-Store8 is an experimental memory-based database developed in
collaboration between MIT, Brown University, Yale University and
HP Labs. H-Store is available under the GPLv3 [7] license. The H-Store
website states that it is experimental and unsuitable for
production environments, which is why I abandoned further
evaluation of it.
4.2.9 Infinispan
Infinispan9 is an in-memory key/value store developed by RedHat,
originally designed to be used as a cache. It is available under the
LGPLv2.1 [3] license. Infinispan supports atomic updates, on-disk
storage and clustering with synchronous replication. However, it
has the same failing as MySQL Cluster, namely that it cannot be
configured to reject writes made to a minority partition of the
cluster.
7 http://hypertable.org/
8 http://hstore.cs.brown.edu/
9 http://www.jboss.org/infinispan/
4.2.10 Project Voldemort
Project Voldemort10 is a distributed key-value store developed at
LinkedIn. Project Voldemort is available under the Apache Software
License 2.0 [5]. Voldemort uses vector clocks and read repair to
provide eventual consistency, but it is unclear whether it can be
used to implement fault-tolerant atomic updates. Rather than
attempt to do so, I abandoned further evaluation of Project
Voldemort.
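As a rough illustration of how a vector-clock-based store such as Voldemort classifies two versions of a value, the following sketch implements the standard partial-order comparison; the function is my own illustration, not Voldemort's API:

```python
def compare(vc_a, vc_b):
    """Compare two vector clocks, given as dicts mapping node -> counter.
    Returns 'before', 'after', 'equal', or 'concurrent'."""
    nodes = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # vc_a causally precedes vc_b; vc_b supersedes it
    if b_le_a:
        return "after"
    return "concurrent"      # conflicting writes: read repair must reconcile
```

Two "concurrent" versions are exactly the case where the store cannot decide an order itself, which is why atomic updates cannot be built on this mechanism alone.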
4.2.11 Membase / Couchbase
Couchbase11 is a distributed database developed by Couchbase.
Couchbase is available under the Apache Software License 2.0 [5]. APIs
for Couchbase are available for various popular programming
languages, including Java, C and Python. Couchbase supports
replication, but I could not find a complete description of the
replication semantics when I originally selected databases for
further evaluation in summer 2011. At that time it appeared that
Couchbase 2.0 might include some improvements to replication, but
only developer preview versions were available. As of July 2012,
Couchbase 2.0 is still available only as developer preview
versions, and the features offered by the replication subsystem
remain unclear.
4.2.12 Terrastore
Terrastore12 is an open-source NoSQL database licensed under the Apache
Software License 2.0 [5]. It appears to be an independent project
hosted on Google Code. Terrastore can be accessed using a Java API
or an HTTP API, and both interfaces offer conditional updates.
The replication features of Terrastore are based on the Terracotta
in-memory clustering technology from Terracotta Inc. However, I
could not find information on the durability or persistence features
of Terrastore in the Terrastore wiki, so I abandoned further
evaluation.
4.2.13 Hibari
Hibari13 is a strongly consistent key-value store licensed under the
Apache Software License 2.0 [5]. Hibari was originally developed
by Gemini Mobile Technologies Inc. Hibari has a native Erlang API and
a cross-platform API based on the Thrift RPC system. When I
originally read the Hibari documentation, it appeared that Hibari did
not support atomic conditional updates, which is why I did not select
it for further evaluation. However, upon later reading it appears
that I was mistaken and should have evaluated it more
carefully.
10 http://project-voldemort.com/
11 http://www.couchbase.com/
12 http://code.google.com/p/terrastore/
13 http://hibari.github.com/hibari-doc/
4.2.14 Scalaris
Scalaris14 is a transactional distributed key-value store. The
development of Scalaris has been funded by Zuse Institute Berlin,
onScale solutions GmbH and several EU projects. Scalaris is
available under the Apache Software License 2.0 [5]. I originally
decided not to evaluate Scalaris further because it appeared not to
be ready for production use. At the time of writing this chapter, in
July 2012, the links to the Users and Developers Guides on the
Scalaris homepage do not lead anywhere, so it would appear that the
original decision was correct.
4.2.15 GT.M
GT.M15 is a key-value database engine originally developed by
Greystone Technology Corp. Nowadays it is maintained by Fidelity
Information Services. GT.M is available under the GPLv2 [2] license.
APIs for GT.M are available for some popular languages, including
Python. GT.M offers ACID transactions.
GT.M offers Business Continuity replication, which on closer
inspection appears to be asynchronous replication for disaster
recovery purposes. I abandoned further evaluation of GT.M because it
does not have synchronous replication.
4.2.16 OrientDB
OrientDB16 is an open-source graph-document database system.
OrientDB is distributed under Apache Software License 2.0 [5]. The
native language for OrientDB is Java. In addition to Java, language
bindings are available for at least Python, using the HTTP
interface of OrientDB. The HTTP REST API of OrientDB is limited in
that it does not offer conditional updates. Conditional updates
are, however, supported using the native Java API.
14http://code.google.com/p/scalaris/
15http://www.fisglobal.com/products-technologyplatforms-gtm
16http://www.orientdb.org/
OrientDB supports both synchronous and asynchronous replication. It
is not clear what visibility guarantees the replication offers. I
abandoned further evaluation of OrientDB because of the
limitations of its cross-language support.
4.2.17 Kyoto Tycoon
Kyoto Tycoon17 is a key-value database engine developed and
maintained by FAL Labs. Kyoto Tycoon is distributed under the GPLv3 [7]
license. APIs for Kyoto Tycoon exist for at least C/C++ and
Python. Kyoto Tycoon supports atomic updates.
The high-availability features of Kyoto Tycoon are limited to hot
backup and asynchronous replication. I abandoned further evaluation
of Kyoto Tycoon because of its lack of synchronous
replication.
4.2.18 CouchDB
CouchDB18 is an open-source distributed database. CouchDB is
distributed under the Apache Software License 2.0 [5]. CouchDB is
accessed using an HTTP API, and bindings are available for many
popular programming languages including Java and Python. CouchDB
supports atomic updates on a single server but not
cluster-wide.
CouchDB supports peer-to-peer replication with automatic conflict
resolution that ensures all nodes resolve conflicts the same way.
The replication is not visible to users, in that users cannot, for
example, select how many replicas must receive a write before it
is considered successful. Manual conflict resolution that
differs from the automated procedure is also possible, since the losing
copies from the automatic resolution process can also be accessed. I
did not consider CouchDB for further evaluation because of the
limited atomic update support.
4.3 Databases Selected for Limited Evaluation
The databases selected for limited evaluation were evaluated for
throughput in a non-conflicting update test without failures and for
fault tolerance with a suite of fault-inducing tests. The databases
selected for limited evaluation are popular among application
developers, but they are not suitable for use as the main database
for the command and control system because they do not support
cluster-wide atomic updates.
All evaluated databases can be set up in a cluster so that the
cluster does not have a single point of failure, and their APIs
allow the client to specify that a write must be replicated to a
certain number of hosts before the write is considered successful.
These features make them suitable for write-mostly applications
such as logging.
17 http://fallabs.com/kyototycoon/
18 http://couchdb.apache.org/
4.3.1 Cassandra
Cassandra19 is an open-source distributed database system.
Cassandra is distributed under the Apache Software License 2.0 [5].
The Cassandra API is based on the Thrift RPC system, and bindings
exist for many popular languages such as Java, Python and Ruby. I
used version 1.1.2 of Cassandra in my tests.
Consistency requirements in Cassandra are specified for each write
and read operation separately. If readers read from a quorum of
nodes and writers write to a quorum of nodes, then causal
consistency exists between writes and reads, so that after a write
has completed, all readers will see that write. However, it is not
possible to detect conflicting writes, which makes atomic updates
impossible to implement, and thus I only consider Cassandra suitable
for limited evaluation.
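The quorum overlap argument can be checked exhaustively for small clusters. The sketch below verifies that every read quorum intersects every write quorum exactly when R + W > N; the function and its name are my own illustration:

```python
from itertools import combinations

def quorums_always_overlap(n, r, w):
    """True if every read quorum of size r shares at least one node with
    every write quorum of size w in an n-node cluster (i.e. r + w > n)."""
    nodes = set(range(n))
    return all(set(read) & set(write)
               for read in combinations(nodes, r)
               for write in combinations(nodes, w))

# With n=3 and quorum reads/writes (r=w=2), a read always reaches a node
# holding the last completed write. Two write quorums also always overlap,
# yet the shared node does not reject the second write; it simply stores
# it, so both conflicting writes succeed and the conflict surfaces later.
```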
Cassandra has limited support for backups. Each node can be backed
up separately, but there is no way to get a backup of the whole
cluster at a point in time. Considering that atomic updates are not
possible, and thus Cassandra is not suited for storing things that
must be updated in a consistent fashion, this is probably
acceptable.
Based on the Cassandra documentation20, the nodetool repair command
should be run on each node periodically, to ensure that data that is
not written to all nodes at commit and is rarely read gets replicated
appropriately. No other periodic maintenance tasks are mentioned in
the documentation. In addition to repairing data, nodetool also
allows other maintenance tasks to be performed, such as removing
nodes from the cluster and rebalancing the hash ring Cassandra uses
to locate data in the cluster. A new node is added by starting it
with a configuration that includes some hosts of the existing
cluster in the so-called “seed list”.
19http://cassandra.apache.org/
20http://wiki.apache.org/cassandra/
4.3.2 Riak
Riak21 is a distributed database system developed by Basho
Technologies Inc. Riak is distributed under Apache Software License
2.0 [5]. Riak is accessed via a REST-style HTTP API and bindings
exist for several popular programming languages including Java,
Python and Ruby. I used version 1.1.4 of Riak in my tests.
Similarly to Cassandra, consistency requirements in Riak are
specified for each write and read operation separately. Causal
consistency is achievable as with Cassandra. As with Cassandra,
atomic updates are not possible because conflicts are resolved at
read time: when conflicting writes occur, the reader must select
which one is retained. Because of this impossibility of atomic
updates, I only consider Riak suitable for limited evaluation.
Based on my dual-fault tests, Riak cannot ensure that writes succeed
only if a quorum of nodes is available. I stopped evaluation of Riak
once I noticed this behavior, so no details on its backup mechanism
or administrative tools are presented here.
4.4 Databases Selected for Full-Scale Evaluation
The databases selected for full-scale evaluation were evaluated for
throughput, both in non-conflicting update tests and in update tests
with various conflict rates without failures, and for fault tolerance
with a suite of fault-inducing tests.
4.4.1 Galera
Galera22 is a multi-master clustering solution for MySQL developed by
Codership Oy. Galera is distributed under the GPLv2 [2] license. Galera
is an add-on to MySQL, so standard MySQL clients can be used to access
it. I used version 2.2rc1 in my tests.
Galera offers transparent multi-master clustering on top of standard
MySQL. Galera uses synchronous replication that ensures that only one
of several concurrent conflicting transactions can commit. Galera also
ensures that only a partition of the cluster with the majority of
nodes in it can process transactions, thus ensuring that successful
commits are always replicated to at least two sites in the command
and control system architecture.
21http://basho.com/products/riak-overview/
22http://codership.com/
Galera maintains a copy of all data on each cluster node, so it is
possible to take a block device-level snapshot of the data on one
node and to use that as a backup. Galera also supports standard
MySQL/InnoDB backup tools such as Percona XtraBackup.
Based on documentation on the Galera website, a Galera cluster does
not require any periodic maintenance. Adding and removing nodes from
the cluster is done by manipulating the values of configuration
variables, either through the configuration file or through the MySQL
command line interface. Cluster state can also be monitored through
the regular MySQL interface by inspecting configuration and state
variables. Each Galera node has different configuration settings,
since each node must be configured with the addresses of the other
nodes in the cluster.
4.4.2 MongoDB
MongoDB23 is a document database system developed by 10gen Inc. The
MongoDB server itself is distributed under the AGPLv3 [6] license, but
the client APIs developed by 10gen are distributed under the Apache
Software License 2.0 [5]. Client APIs are available for many
popular programming languages including C++, Java and Python.
MongoDB allows atomic updates of single documents. I used version
2.0.6 of MongoDB in my tests.
MongoDB clustering works in two dimensions that are managed
separately. To increase throughput, data is distributed to multiple
shards. Each shard is backed by either a single node or a cluster
that facilitates replication. For this thesis, I ignore the
sharding dimension, since the main focus is on high availability.
In MongoDB terminology a replicating cluster is called a replica
set. A replica set has one master and a number of slaves. By
default all reads and writes go to the master, facilitating causal
consistency and atomic updates. The slaves asynchronously pull
updates from the master, but write operations support an option to
specify that the write is not considered complete until a specific
set of slaves has replicated the update.
MongoDB only offers READ UNCOMMITTED semantics cluster-wide for
transactions. Writes done to the primary are visible to reads on
the primary before they have been replicated to secondaries. In the
extreme case this means that a client may read an object from the
current primary, but the object may disappear if the primary
immediately fails and the newly elected primary had not yet
received it. There are also theoretical limitations on data size
that stem from how MongoDB handles access to on-disk resources. The
on-disk data representation is mapped into memory, so there must be
enough address space to map all data. This is not a practical
limitation on 64-bit systems, but 32-bit systems are typically
limited to about two gigabytes of data.
23 http://www.mongodb.org/
MongoDB can be backed up by copying the data directory from a
filesystem snapshot as long as MongoDB has journaling enabled.
Only a single node of a replica set needs to be backed up.
Additionally, MongoDB allows point-in-time backups even without
filesystem snapshots with the mongodump utility.
MongoDB does not have a separate utility for cluster administration.
Administrative tasks, such as adding and removing replicas from
replica sets, are performed using special commands with the regular
command line application. Configuration files on all nodes in a
MongoDB replica set are identical.
4.4.3 MySQL Cluster
MySQL Cluster24 is a high-availability database system that
replaces the standard storage engines in MySQL with a cluster
system that allows synchronous replication. Nowadays Oracle
packages it separately from the standard MySQL software, so I
also present it separately here. Oracle offers MySQL Cluster under
the GPLv2 [2] license and various commercial licensing schemes. MySQL
Cluster can be accessed using standard MySQL APIs, so its
programming language support is good. MySQL Cluster also has a
separate API for directly accessing the replicated storage system. I
tested version 7.1.15a of