Virtual Full Replication for Scalable
Distributed Real-Time Databases
Thesis Proposal
Technical Report HS-IKI-TR-06-006
Gunnar Mathiason
University of Skovde
June, 2006
Abstract
Distributed real-time systems increase in size and complexity, and the nodes in such
systems become difficult to implement and test. In particular, communication for synchronization
of shared information in groups of nodes becomes complex to manage. Several
authors have proposed using a distributed database as a communication subsystem,
to off-load database applications from explicit communication. This delegates the task of
information dissemination to the replication mechanisms of the database. With
increasingly larger systems, however, there is a need to manage the scalability of such
a database approach. Furthermore, timeliness for database clients requires predictable
resource usage, and scalability requires bounded resource usage in the database system.
Thus, predictable resource management is an essential function for realizing timeliness in
a large-scale setting.
We discuss scalability problems and methods for distributed real-time databases in the
context of the DeeDS database prototype. Here, all transactions can execute in a timely
manner at the local node due to main memory residence, full replication, and detached
replication of updates. Full replication contributes to timeliness and availability, but has
a high cost in excessive usage of bandwidth, storage, and processing, since all updates are
sent to all nodes regardless of whether they will be used there. In particular, unbounded
resource usage is an obstacle to building large-scale distributed databases.
For many application scenarios it can be assumed that most of the database is shared
by only a limited number of nodes. Under this assumption it is reasonable to believe that
the degree of replication can be bounded, so that a bound can also be set on resource
usage.
The thesis proposal identifies and elaborates research problems for bounding resource
usage in large-scale distributed real-time databases. One objective is to bound resource
usage by taking advantage of pre-specified data needs, but also by detecting unspecified
data needs and adapting resource management accordingly. We elaborate and evaluate the
concept of virtual full replication, which provides an image of a fully replicated database to
database clients. It makes data objects available where needed, while fulfilling timeliness
and consistency requirements on the data.
In the first part of our work, virtual full replication makes data available where needed
by taking advantage of pre-specified data accesses to the distributed database. For hard
real-time systems, the required data accesses are usually known, since such systems need
to be well specified to guarantee timeliness. However, there are many applications where
a specification of data accesses cannot be done before execution. The second part of
our work extends virtual full replication to be used with such applications. By detecting
new and changed data accesses during execution and adapting database replication, virtual
full replication can continuously provide the image of full replication while preserving
scalability.
One objective of the thesis work is to quantify scalability in the database context,
so that actual benefits and achievements can be evaluated. Further, we identify the
conditions for setting bounds on resource usage for scalability, under both static and
dynamic data requirements.
Contents
1 Introduction 7
1.1 Document layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Background 9
2.1 A distributed real-time database architecture . . . . . . . . . . . . . . . . . 9
2.2 The concept of scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Scalability in a fully replicated real-time main memory database . . . . . . 11
2.4 Virtual full replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.1 Basic notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Incremental recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Problem formulation 15
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Problem decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.5 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4 Methodology 19
4.1 Virtual full replication for scalability . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Static segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.1 Static segmentation algorithm . . . . . . . . . . . . . . . . . . . . . . 22
4.2.2 Properties, dependencies and rules . . . . . . . . . . . . . . . . . . . 25
4.2.3 Analysis model and assumptions . . . . . . . . . . . . . . . . . . . . 26
4.2.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Simulation for scalability evaluation . . . . . . . . . . . . . . . . . . . . . . 31
4.3.1 Motivation for a simulation study . . . . . . . . . . . . . . . . . . . . 32
4.3.2 Simulation objectives . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3.3 Validation of software simulations . . . . . . . . . . . . . . . . . . . 33
4.3.4 Modeling detail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.5 Approaches for validity and credibility . . . . . . . . . . . . . . . . . 35
4.3.6 Experimental process and experiment design . . . . . . . . . . . . . 36
4.3.7 Simulator implementation . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3.8 Experiment execution . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.9 Discussion and results . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4 Adaptive segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4.1 Adaptive segmentation with pre-fetch . . . . . . . . . . . . . . . . . 43
4.5 A framework of properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.6 Case study and implementation . . . . . . . . . . . . . . . . . . . . . . . . . 44
5 Related work 46
5.1 Strict consistency replication . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 Replication with relaxed consistency . . . . . . . . . . . . . . . . . . . . . . 46
5.3 Partial replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4 Asynchronous replication as a scalability approach . . . . . . . . . . . . . . 47
5.5 Adaptive replication and local replicas . . . . . . . . . . . . . . . . . . . . . 47
5.6 Usage of data properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6 Conclusions 48
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.2 Initial results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.3 Expected contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.4 Expected future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
A Project plan and publications 51
A.1 Time plan and milestones . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
A.2 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
A.2.1 Current . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
A.2.2 Planned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
List of papers
This Thesis Proposal is based on the work in the following papers.
G. Mathiason and S. F. Andler. Virtual full replication: Achieving scalability in distributed
real-time main-memory systems. In Proc. of the Work-in-Progress Session of
the 15th Euromicro Conf. on Real-Time Systems, pages 33-36, July 2003. (ISBN
972-8688-11-3) [52]
G. Mathiason, S. F. Andler, and D. Jagszent. Virtual full replication by static segmentation
for multiple properties of data objects. In Proceedings of Real-time in Sweden
(RTiS 05), pages 11-18, Aug 2005. (ISBN 91-631-7349-2, ISSN 1653-2325) [53]
G. Mathiason. A simulation approach for evaluating scalability of a virtually fully replicated
real-time database. Technical Report HS-IKI-TR-06-002, University of Skovde,
Sweden, Mar 2006. [50]
1 Introduction
In a distributed database, the availability of data can be improved by allocating data at
the local nodes where the data is used. For real-time databases, transaction timeliness is a
major concern. Data can be allocated to the local node to avoid remote data access, with
its risk of unpredictable delays on the network. Further, multiple replicas of the same data
at different nodes provide data redundancy that can be used for recovery from failures,
avoiding corruption or loss of data when nodes fail. In a fully replicated database the
entire database is available at all nodes. This offers full local availability of all data objects
at each node.
The DeeDS [6] database prototype stores the fully replicated database in main memory
to make transaction timeliness independent of disk accesses. Full replication of data
together with detached replication of updates allows transactions to execute entirely on
the local node, independent of network delays. Detached replication sends updates to other
nodes independently of the execution of the transaction. Database clients get the perception
of a single local database and need not consider specific data locations or how
to synchronize concurrent updates of different replicas of the same data.
Full replication uses excessive system resources, since the system must replicate all
updates to all the nodes. This causes a scalability problem for resource
usage, such as bandwidth for replication of updates, storage for data replicas, and processing
for replicating updates and resolving conflicts for concurrent updates of the same data
object at different nodes.
With existing approaches, real-time databases do not scale well, while the need for
increasingly larger real-time databases keeps growing. With this thesis we aim to show how
to bound resource usage in a distributed real-time database such as DeeDS, by using virtual
full replication to make it scalable. We also quantify scalability for the domain, to allow
an evaluation of the benefits.
We elaborate virtual full replication [6] as an approach to manage resources for scalability,
by replicating and storing updates only for those data objects that are used at a node,
and by maintaining scalability over time. This avoids irrelevant replication and maintains
real-time properties, while providing an image of full replication for local availability of data.
Virtual full replication reduces resource usage to meet the actual need, rather than using
resources to replicate all updates blindly to all nodes, resulting in a scalable distributed
real-time database. Virtual full replication uses knowledge about the needs of the application,
and uses application requirements for properties of the data, such as location,
consistency model, and storage media, to support timeliness requirements of the clients
in a scalable system. Furthermore, properties have relations that can be used to control
resource usage in more detail.
This Thesis Proposal claims that a replicated database with virtual full replication
and eventual consistency enables large-scale distributed real-time databases, where each
database client has an image of a fully available local database.
1.1 Document layout
Section 2 contains a background on distributed real-time databases and in particular how
full replication may support timeliness. A generic approach for scalability evaluation is
introduced, indicating how such an approach can be used for a scalability evaluation of a
distributed database. In section 3 the problem of scalability in a fully replicated database
is detailed; we show how we decompose the problem and how it can be defined in terms
of assessable objectives. Section 4 describes our methodology in detail, for how to reach
the objectives. Some of the steps in the methodology have already been performed and
their results are presented in conjunction, while other steps remain to be done. Section 5
describes related work, both for distributed real-time databases and for other areas that
relate to our approach. Finally, section 6 summarizes the conclusions of this proposal,
highlights the expected contributions and expected consequences for subsequent future
work.
2 Background
2.1 A distributed real-time database architecture
The main property of real-time systems is timeliness, which can only be achieved with
predictable resource usage and sufficiently efficient execution. Predictability is essential:
for a hard real-time system the consequence of a missed deadline may be fatal,
while for a soft real-time system a missed deadline lowers the value of the provided service.
Thus, for real-time systems, predictable resource usage is the primary design concern that
enables timeliness.
To improve predictability and efficiency, the database of the distributed real-time database
system DeeDS [6] resides entirely in main memory, removing the dependency on disk
I/O delays caused by unpredictable access times for hard drives. Also, accesses to main
memory are many times faster.
To further improve predictability of database transactions, the database is fully replicated
to all nodes, making transaction execution independent of network delays or network
partitioning. With full replication there is no need for remote data access during transactions.
Replication also improves fault tolerance, since there are redundant copies of the
data. Full replication allows transactions to have all of their operations running at the local
node. A fully replicated database with detached replication [32], where replication is done
after transaction commit, allows independent updates [18], that is, concurrent and unsynchronized
updates of replicas of the same data objects. Independent updates may cause
database replicas to become inconsistent, and inconsistencies must be resolved in the replication
process by a conflict detection and resolution mechanism. In DeeDS, updates are
replicated to all nodes detached from transaction execution, by propagation from the node
executing an updating transaction after transaction commit, and integration of replicated
updates at all the other nodes. Conflicting updates resulting from independent updates are
resolved at integration time. Temporary inconsistencies are allowed and guaranteed to be
resolved at some point in time, giving the database the property of eventual consistency.
Applications that use eventually consistent databases need to be tolerant to temporarily
inconsistent replicas, and many distributed applications are. Further, in an eventually consistent
database that supports a bounded time for such temporary inconsistencies (which
can be achieved by using bounded-time replication), applications that have requirements
on timely replication and consistency can use a temporarily inconsistent database as well.
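As an illustration of this mechanism, the toy model below commits updates locally and integrates propagated updates afterwards. The logical timestamps and the last-writer-wins policy are simplifying assumptions for illustration only, not the conflict detection and resolution mechanism actually used in DeeDS:

```python
from itertools import count

_clock = count(1)  # logical timestamps; a toy stand-in for real clocks

class Node:
    """Toy model of detached replication with eventual consistency."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # object -> (timestamp, value)

    def commit(self, obj, value):
        # Local commit: no network involved, independent of network delays.
        ts = next(_clock)
        self.store[obj] = (ts, value)
        return (obj, ts, value)  # update record, propagated detached later

    def integrate(self, update):
        # Integration of a propagated update: a conflicting concurrent
        # update is detected by comparing timestamps and resolved here
        # by last-writer-wins (an illustrative policy only).
        obj, ts, value = update
        if obj not in self.store or ts > self.store[obj][0]:
            self.store[obj] = (ts, value)

a, b = Node("A"), Node("B")
u1 = a.commit("x", 1)              # independent updates of the same object
u2 = b.commit("x", 2)              # replicas are temporarily inconsistent
a.integrate(u2); b.integrate(u1)   # detached replication in both directions
print(a.store["x"] == b.store["x"])  # True: the replicas converge
```

Between the two commits and the two integrations, the replicas of x differ; eventual consistency means that this temporary inconsistency is guaranteed to be resolved, here at integration time.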
In a fully replicated database using detached replication, a number of predictability
problems associated with synchronization of concurrent updates at different nodes can be
avoided, such as agreement protocols, distributed locking of replicas of objects,
and reliance on stable communication to access data. Also, application programming is
easier, since the application programmer may assume that the entire database is available
and that the application program has exclusive access to it.
2.2 The concept of scalability
Scalability is a concept that may appear intuitively obvious. It is used in many areas
of research, but generic definitions and theoretical frameworks with metrics are scarce.
A system is scalable if the growth function for the required amount of
resources, req(p), does not exceed the function for the available amount of resources, res(p),
when the system is scaled up in some system parameter p (also called the scale factor).
The resource usage may follow some function of the scale factor, and the upper bound for
this function, O(req(p)), must not exceed the function of available resources. For a system
with linear scalability this relation must be valid for all sizes of p, but for other systems
scalability may be related to only certain sizes of p. In a few research areas, scalability
concepts are well developed and related metrics for scalability are available, namely in
the areas of parallel computing systems [86] [57], in particular for resource management
[55] and for shared virtual memory [74], for design of system architectures for distributed
systems [16], and for network resource management [3].
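The scalability condition above can be made concrete with a minimal sketch. The growth functions req and res below are hypothetical assumptions for illustration, not measurements of any actual system:

```python
# A system is scalable over a range of the scale factor p if the required
# resources never exceed the available resources over that range.

def is_scalable(req, res, p_range):
    """True if req(p) <= res(p) for every p in p_range."""
    return all(req(p) <= res(p) for p in p_range)

# Hypothetical example: replication bandwidth growing linearly with the
# number of nodes p, against a constant available link capacity.
req = lambda p: 40 * p   # required bandwidth (kB/s)
res = lambda p: 10_000   # available bandwidth (kB/s)

print(is_scalable(req, res, range(1, 100)))   # True: holds up to 99 nodes
print(is_scalable(req, res, range(1, 1000)))  # False: req(p) > res(p) for p > 250
```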
Frolund and Garg define a list of generic terms for scalability analysis in distributed
application design [29]:
• Scalability: A distributed (software) design D is scalable if its performance model
predicts that there are possible deployment and implementation configurations of D
that would meet the Quality of Service (QoS) expected by the end user, within the
scalability tolerance, over a range of scale factor variations, within the scalability
limits.
• Scale factor: A variable that captures a dimension of the input vector that defines
the usage of a system.
• Scaling enabler: Entities of design, implementation or deployment that can be
changed to enable scalability of the design.
Further, the authors also define Scalability point, Scalability tolerance, Scalability limits
and Scaling parameters, which we exclude here.
We can map the terms above into our context. QoS can be seen as the timeliness of local
transactions (deadline miss ratio) and the level of consistency (consistency model properties
and the bound on replication delay). Scale factors may be the database size, the number
of nodes, the ratio and frequency of update transactions, or others.
Jogalekar and Woodside [40] have presented a generic metric for scalability in distributed
systems based on productivity; a system is regarded as scalable if productivity is
maintained as the scale changes. Given the quantities:
• λ(k) = throughput in responses/sec, at scale k
• f(k) = average value of each response, calculated from its quality of service at scale
k
• C(k) = cost at scale k, expressed as running cost per second to be uniform with λ
The value function f(k) includes appropriate system measures, such as response delay,
availability, timeouts, or probability of data loss. The productivity F(k) is the value
delivered per second, divided by the cost per second:
F(k) = λ(k) ∗ f(k) / C(k) (1)
The scalability metric ψ(k1, k2) relates the productivity at two different scales such
that
ψ(k1, k2) = F(k2) / F(k1) (2)
With ψ(kx, ky), design alternatives x and y can be compared, such that a higher ratio
indicates better scalability for ky. It is not possible to compare scalability between different
systems, since the ratio is based on system-specific value functions, but scalability can be
evaluated for finding alternative scaling enablers and settings of scale factors for a particular
system. The authors use the scalability metric for evaluating actual systems, and the metric
has also been used by other authors with other value functions [17] [42]. Similarly, we need
to define an appropriate value function for a distributed real-time database, which would
be connected to the real-time properties of the system. Typically this would include the
timeliness of transactions for different classes of transactions. The cost function would
relate to the amount of resources used, in terms of bandwidth, storage, and processing
time.
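Equations 1 and 2 translate directly into code. The throughput, value, and cost figures below are hypothetical illustrations, not measurements of any actual system:

```python
def productivity(throughput, value, cost):
    """F(k) = lambda(k) * f(k) / C(k): value delivered per unit cost."""
    return throughput * value / cost

def scalability(F1, F2):
    """psi(k1, k2) = F(k2) / F(k1); a ratio near or above 1 indicates
    that productivity is maintained at the larger scale."""
    return F2 / F1

# Hypothetical measurements at scales k1 = 10 and k2 = 100 nodes:
F1 = productivity(throughput=500, value=0.9, cost=10.0)    # F(k1) = 45.0
F2 = productivity(throughput=4000, value=0.8, cost=80.0)   # F(k2) = 40.0

print(scalability(F1, F2))  # ~0.89: productivity is not quite maintained
```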
2.3 Scalability in a fully replicated real-time main memory database
For a fully replicated database with immediate consistency, updates are replicated to
all data replicas during the execution of the updating transaction, using a distributed
agreement algorithm to update all replicas at one instant, so that there is no state in which
replicas can differ during the update. Such updates must lock a majority of the replicas of the
updated object during the transaction. With detached replication, transaction timeliness
does not depend on the timeliness of locking replicas, on network delays, or on the waiting
time for release of locks set by transactions at other nodes. Thus, detached replication is a
scaling enabler for such a database. However, a fully replicated database with detached
replication has another scalability problem in that all updates need to be sent to all other
nodes, regardless of whether the data will be used there or not. Full replication also
requires that replicas of all data objects be stored at all the nodes, independent of
whether the data ever will be used there or not. Also, updates in fully replicated databases
must be integrated at all nodes, requiring integration processing of updates at all nodes.
Under the assumption that replicas of data objects will only be used at a bounded
subset of the nodes, the required degree of replication becomes lower than in a fully
replicated database. The resource usage of bandwidth, storage and processing depends on
the degree of replication, and these resources are wasted in a fully replicated database,
compared to a database with a lower degree of replication. With replication of only those
data objects that are actually used, resources can be saved and thereby scalability can be
improved.
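A back-of-envelope sketch illustrates the savings. With detached replication, each committed update is propagated to every other node holding a replica, so the message volume grows with the degree of replication; all numbers below are hypothetical assumptions:

```python
def replication_messages(updates_per_sec, degree_of_replication):
    """Propagation messages per second: each update is sent to every
    other node that holds a replica of the updated object."""
    return updates_per_sec * (degree_of_replication - 1)

nodes = 100      # nodes in the system
updates = 1000   # updates/sec in the whole system

full = replication_messages(updates, nodes)   # full replication: all nodes
bounded = replication_messages(updates, 5)    # data used at <= 5 nodes

print(full, bounded)  # 99000 4000: roughly 25 times fewer messages
```

Storage and integration processing scale in the same way, since both are proportional to the number of replicas being maintained.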
2.4 Virtual full replication
With virtual full replication [6] the database clients have an image of a fully replicated
database. The database system manages knowledge of what is needed for database clients
to perceive such an image. This knowledge includes a specification of the data accesses
required by database clients. Virtual full replication uses this knowledge to replicate data
objects to the subset of the nodes where the data is needed, and also to replicate with the
least resource usage needed to maintain the image. With virtual full replication, several
consistency models for data may coexist in the same database, and the knowledge is also
used to ensure the consistency model for each data object.
Scalable usage of resources depends on the number of replicas to update. Since a
virtually fully replicated database has fewer replicas, resource consumption is lower than
in a fully replicated database, where there are as many replicas as there are nodes.
The overall degree of replication is lower, and replication processing that serves no purpose
is avoided. Only those data objects that are used at a node are replicated there, which
reduces resource usage of bandwidth, storage, and processing. With such resource
management, scalability is improved without changing the application’s assumption of
having a complete database replica available at the local node.
However, virtual full replication that considers only specified data accesses does not
make data available for accesses to arbitrary data at an arbitrarily selected node, which
makes such a system less flexible than a fully replicated database. For unspecified data
accesses and for changes in data requirements, virtual full replication adapts by detecting
changes and reconfiguring replica allocation and replication. This preserves scalability by
managing the degree of replication over time.
Segmentation of the database is an approach for limiting the degree of replication in a
virtually fully replicated database. A segment is a subset of the objects in the database,
and each segment has an individual degree of replication over the nodes. The degree of
replication is a result of allocating segments only to the nodes where their data objects are
accessed. This is typically far fewer nodes than in a fully replicated database. Also,
a database may have multiple segmentations, each for a different purpose. Segmenting for
data availability strives to minimize the degree of replication, while a segmentation for
consistency determines the method for replicating updates in the database.
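The core of segmentation can be sketched as grouping objects by the sets of nodes that access them. The access specification below is a hypothetical example, and the grouping rule is a simplification of the static segmentation approach discussed in section 4.2:

```python
from collections import defaultdict

def segment(accesses):
    """accesses: object -> set of nodes that read or write the object.
    Objects with identical access sets share a segment, and the segment
    is allocated only to those nodes."""
    segments = defaultdict(set)
    for obj, accessing_nodes in accesses.items():
        segments[frozenset(accessing_nodes)].add(obj)
    return dict(segments)

# Hypothetical access specification:
accesses = {
    "o1": {"N1", "N2"},
    "o2": {"N1", "N2"},   # same access set as o1 -> same segment
    "o3": {"N3"},
}
segs = segment(accesses)
print(sorted(segs[frozenset({"N1", "N2"})]))  # ['o1', 'o2']

# The degree of replication of each segment is the size of its node set,
# typically far fewer than the total number of nodes:
degrees = {tuple(sorted(k)): len(k) for k in segs}
print(degrees)  # {('N1', 'N2'): 2, ('N3',): 1}
```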
2.4.1 Basic notation
In this thesis proposal the following notation is used.
A database maintains a finite set of logical data objects O = {o0, o1, ...}, representing
database values. Object replicas are physical manifestations of logical objects. A distributed
database is stored at a finite set of nodes N = {N0, N1, ...}. A replicated database contains
a set of object replicas R = {r0, r1, ...}.
The function R : O × N → R identifies the replica r ∈ R of a logical object o ∈ O
on a node N ∈ N, if such a replica exists: R(o, N) = r if r is the replica of o on node
N. If no such replica exists, R(o, N) = null. node(r) is the node where replica r is
located, i.e. node(R(o, N)) = N. object(r) is the logical object that r represents, i.e.
object(R(o, N)) = o.
A distributed global database (or simply database) D is a tuple <O, R, N>, where
O is the set of objects in D, R is the set of replicas of objects in O, and N is the set
of nodes such that each node N ∈ N hosts at least one replica in R, i.e.
N = {N | ∃r ∈ R (node(r) = N)}.
We model transaction programs, T, by a set of parameters: the set of objects
read by the transaction, READ_T (the read set); the set of objects written by the transaction,
WRITE_T (the write set); and the conflict set, CONFLICT_T, the set of objects that
conflict with updates at other nodes. The transaction program T can thus be defined as
T = {READ_T, WRITE_T, CONFLICT_T}. We refer to the size of the read set as
r_T = |READ_T|, the size of the write set as w_T = |WRITE_T|, and the size of the conflict
set as c_T = |CONFLICT_T|. Also, the working set WS_T is the union of the read and
write sets of the transaction program: WS_T = READ_T ∪ WRITE_T.
A transaction instance T_j of a transaction program executes at a given node n
with a certain maximal frequency f_j. We define such a transaction instance by a tuple
T_j = <f_j, n, T>. When the node for execution is implicit and we only need the
sizes of the read, write, and conflict sets, we simplify the notation to T_j = <f_j, r_j, w_j, c_j>.
Further, node(T_j) is the node where transaction T_j is executed, and begin(T_j) is the time
at which transaction T_j began its execution. commit(T_j) is the commit time of T_j and
abort(T_j) is the abort time of T_j (if T_j is aborted, commit(T_j) = ⊥; if T_j is committed,
abort(T_j) = ⊥). end(T_j) is the completion time of T_j.
2.4.2 Definition
Virtual full replication ensures that for each transaction that reads or writes database objects
at a node, there exists a local replica of each accessed object. Formally,
∀o ∈ O ∀T (o ∈ (READ_T ∪ WRITE_T) → ∃r ∈ R (r = R(o, node(T))))
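The property can be checked mechanically for a given replica allocation. The data structures in the sketch below are illustrative assumptions:

```python
def satisfies_vfr(transactions, replicas):
    """transactions: list of (node, read_set, write_set) tuples;
    replicas: object -> set of nodes holding a replica.
    True if every object in each transaction's working set has a
    replica at the node where the transaction executes."""
    return all(
        node in replicas.get(obj, set())
        for node, reads, writes in transactions
        for obj in reads | writes  # the transaction's working set
    )

replicas = {"o1": {"N1", "N2"}, "o2": {"N2"}}
txs = [("N1", {"o1"}, set()), ("N2", {"o1"}, {"o2"})]
print(satisfies_vfr(txs, replicas))                      # True
print(satisfies_vfr([("N3", {"o1"}, set())], replicas))  # False: no replica at N3
```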
2.5 Incremental recovery
With incremental recovery, a node in a fully replicated distributed database can be recovered
into a consistent database replica without any of the other working replicas needing to
be stopped or even locked [45]. Incremental recovery has been proven to give a consistent
database copy at the recovered node.
In a main memory database, data is stored in memory pages, the smallest
memory entity that is accessed during read or write operations. It is assumed that the memory
management circuitry ensures that a read or write operation has exclusive access to a
certain memory page during the operation. Fuzzy checkpointing uses this mechanism for
sequentially copying all memory pages at a node (the recovery source), and such a copy
can be sent over the network to recover a failed node (the recovery target). Selecting a
recovery source is done by a negotiation process, to select the most appropriate source
node. Each page that is copied is logged at the recovery source, and pages in the log that
are updated after the memory page was sent to the target node need to be sent again.
Such updated memory pages are forwarded to the recovery target as long as the
fuzzy checkpointing proceeds. Once the entire database has been copied and all subsequent
updates have been forwarded, the fuzzy checkpoint finishes atomically with a consistent
replica at the recovered node.
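The recovery procedure can be sketched as follows. The page store and the feed of late updates are hypothetical simplifications that ignore the source negotiation and the network transfer:

```python
def fuzzy_checkpoint(source_pages, updates_during_copy):
    """source_pages: page_id -> contents at the recovery source.
    updates_during_copy: page_id -> new contents written while the
    copy was in progress. Pages updated after being sent must be
    forwarded again, so the target ends up consistent."""
    target = {}
    for page_id, contents in source_pages.items():
        target[page_id] = contents  # sequential fuzzy copy of all pages
    for page_id, contents in updates_during_copy.items():
        target[page_id] = contents  # forward pages updated after copying
    return target

pages = {0: "a", 1: "b", 2: "c"}
late_updates = {1: "b_new"}  # page 1 was updated mid-checkpoint
print(fuzzy_checkpoint(pages, late_updates))  # {0: 'a', 1: 'b_new', 2: 'c'}
```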
3 Problem formulation
3.1 Motivation
A fully replicated distributed database scales badly since it replicates all updates to all
nodes, using excess resources. Scalability is an increasingly important issue for distributed
real-time databases, since the number of nodes, the database size, the number of users
and the workload involved in typical distributed database applications are increasing. For
many such systems it can be assumed that only a fraction of the replicated data is used
at the local node, which is motivated by hot-spot and locality behavior of accesses in
distributed databases.
This thesis argues that a fully replicated distributed main memory real-time database
can be made scalable by effective resource management using virtual full replication, and
that degrees of scalability can be quantified by metrics. Different scale factors and
scaling enablers, which influence resource usage differently, are varied to evaluate the scalability
of the database. Also, resource usage is compared to alternative approaches for timeliness
in large-scale distributed databases, such as fully replicated databases, or approaches that
use partial replication and remote transactions.
For many applications, we believe that a distributed real-time database can be a suitable
infrastructure for communication between nodes. Publishing and exchanging data
through a distributed real-time database facilitates structured storage and access, implicit
consistency management between replicas, fault tolerance, and higher independence
through lower coupling between applications. With a distributed database as an infrastructure
there is no need to explicitly coordinate communication in a distributed application, which
reduces complexity for the communicating application, in particular where the groups that
communicate change often.
Consider a wildfire fighting mission. In such a dynamic situation, hundreds of different
actors need to coordinate and distribute information in real time. In such a scenario, actors
and information are added and removed dynamically as the mission situation suddenly
changes. A distributed database has been pointed out as a suitable infrastructure for emergency
management [75], and could be used as a white-board (also called a ’black-board’ [56],
in particular as a software architecture approach [30], or for loosely coupled agents [54])
for reading and publishing current information about the state of a mission, supporting
situation awareness from information assembled from all the actors. Using a database,
which implies the usage of transactions, ensures consistency of the information and also avoids
the need for specific addressing within the communication infrastructure. Ethnographical
field studies [43] show that such an infrastructure supports sense-making by enabling humans
to interact, and it also gives support for the chain of command and strengthens
organizational awareness, which is essential for the success of the mission [58]. In such an
infrastructure, actors may have access to the complete information when needed, but each
actor will most of the time use only parts of the information, for their local actions and
for collaboration with nearby peers. By using virtual full replication, the scalability of such
a distributed database can be preserved.
Virtual full replication uses known properties of the data objects, the applications, and
the database system to reduce resource usage by resource management based on the
data needs. We have shown a database segmentation approach that considers multiple
properties of data, and relations between the data properties. So far, only a few selected
properties have been used, but an extended set of useful properties and their relations remains
to be defined, to set up segments that meet the data requirements of database applications.
Since properties are related, it can be expected that groups of properties can be structured
into profiles of data properties, to reduce the resulting number of segments in our approach.
3.2 Problem statement
Predictable execution of transactions is a fundamental requirement for distributed real-time databases. Predictable transactions can be achieved by predictable resource usage and sufficient efficiency of execution, and by excluding sources of unpredictability from the execution of transactions. With main memory residence, access times become independent of disk access times. With full replication, the entire database becomes locally available and local accesses become independent of network delays. With detached replication, the network delays for updating replicas are separated from the execution time of the transaction. Combining main memory residence, full replication, and detached replication of updates gives predictable transactions in a distributed real-time database.
Such a database does not scale well, since all updates must be replicated to all other nodes. By using knowledge about data needs, irrelevant replication can be avoided and the database becomes scalable, but such replication makes data available only for the data needs that are known prior to execution. Thereby, the database loses the flexibility to execute arbitrary transactions at arbitrary nodes. To maintain scalability while keeping data available, the virtually fully replicated database also needs to detect and adapt to unspecified data needs that arise during execution.
Our initial work shows that segmentation of the database, based on a priori known data needs and on database and application properties, improves scalability through scalable usage of three key resources. Therefore, reconfigurable segmentation of the database seems to be a viable extension for adapting to changed data needs and for regaining the flexibility lost by fixed segmentation. Further, the scalability of such an approach needs to be evaluated. A quantification of scalability is needed to evaluate the influence of different scale factors and of alternative scaling enablers.
This thesis explores, by detailed simulation, the scalability of a large-scale distributed real-time database that manages resources by using virtual full replication. Such a database is also capable of adapting to changed data needs while preserving the real-time properties of the application. The hypothesis of this thesis is that such a system is scalable, and that scalability can be preserved by adaptation during execution.
3.3 Problem decomposition
The problem has several aspects and we choose to divide it into the following components:
• Concept elaboration. The concept of virtual full replication has been pointed out as an approach for providing an image of full replication, while replicating only what is needed to create such an image [6]. The approach needs to be elaborated with algorithms and an architecture that can support it. Also, the conditions under which virtual full replication can provide such an image need to be identified.
• Formation and allocation of units of replication. Segmentation of the database, and replication of segments only to the nodes where the data is used, seems to be a useful approach for bounding resource requirements. The cost of segmentation needs to be elaborated: both the processing cost of running a segmentation algorithm, in particular during system run time, and the storage cost of the data structures needed to maintain segments. A part of this subproblem is to define how segments are appropriately formed and adapted, and which data properties to consider for saving resources by segmentation. Also, architectural support for segment management and for the replication of updates needs to be identified.
• Adaptation of database configuration. To maintain scalability of the database throughout execution, the virtually fully replicated database will need to adapt to the changed data needs of the database clients. For this purpose, adaptation algorithms and architectures need to be developed and evaluated, as well as the types of changes to consider for adaptation.
• Properties and their relations. Virtual full replication is based on knowledge about data properties and application requirements. In our present work, we have used only a few of the data properties that could be considered for segmentation and resource management. Further, we have shown that there exist relations between some of the usable properties, and such relations may be used to improve the management of resources. However, we have not yet defined a full framework of properties and their relations. In such a framework it may be useful to define application profiles of related properties that match typical groups of applications, to reduce the amount of information used for resource management.
• Scalability analysis and evaluation. From our initial studies we see that different scale factors influence scalability differently. So far, we have evaluated only a few scale factors, such as the number of nodes and the bound on the replication degree, and we will add others for a more extensive evaluation. The concept of virtual full replication does not limit how the image of full replication is built. Therefore, alternative scaling enablers may be explored and combined for resource management, including alternative approaches to the replication of updates, segmentation that considers differences in the cost of communication links, or different propagation topologies.
3.4 Aims
Our aims with this thesis are:
A1 To bound resource usage to achieve scalability in a distributed main memory real-
time database, by exploring virtual full replication as a scaling enabler.
A2 To show how scalability can be quantified for such a system.
A3 To show how properties and their relations can be used in resource management for
virtual full replication.
3.5 Objectives
O1 Elaborate the concept of virtual full replication.
O2 Bound the resources used by virtual full replication for expected data requirements.
O3 Bound the resources used by virtual full replication for unexpected data requirements.
O4 Define conditions for scalability, for expected and unexpected requirements.
O5 Quantify scalability in the domain, and evaluate the scalability achieved relative to these conditions.
4 Methodology
We develop and evaluate an approach to resource management for scalability based on virtual full replication. We assess scalability using analysis, simulation, and implementation, while evaluating the proposed scalability enablers. We start by elaborating the problem of full replication and present possible approaches to virtual full replication. As the next steps we refine the approach by using static and adaptive segmentation, and we elaborate segmentation algorithms for multiple properties of data and multiple requirements from database applications. For an initial evaluation of segmentation, we implement static segmentation in the DeeDS database prototype. We develop a simulation for evaluating adaptive segmentation and for studying large-scale systems. Adaptive segmentation is more complex to analyze, and also to implement in the actual database system, so we study its scalability in an accurate simulation of the system. As a sanity check of the simulation, we replicate the experiments performed with the existing DeeDS implementation. Further, the approach for adaptive segmentation is extended with pre-fetching of data for availability. Finally, we consolidate the findings about data properties and their relations, as used in our segmentation approach, into a documented framework of useful properties that can be used to improve the scalability of distributed real-time databases. Some of the steps in our methodology have already been completed, while other steps remain. Some of the described steps can optionally be excluded, which is indicated at each step.
4.1 Virtual full replication for scalability
M1: In the first research step, virtual full replication has been introduced and segmentation has been proposed as a means for achieving it [52]. The concept of virtual full replication was elaborated. By using knowledge about the actual data needs, and by recognizing differences in the properties of the data, segments can be allocated only to the nodes where the data is used. This bounds resource requirements at a lower level, while maintaining the same degree of (perceived) availability of data for the application. It reduces flexibility compared to a fully replicated database, since not all objects are available at all nodes; consequently, availability for unspecified accesses cannot be guaranteed. For hard real-time systems, access patterns and resource requirements are usually known prior to execution and the data needs can easily be specified. For such systems the reduced flexibility is not a problem, since virtual full replication then guarantees availability for all data needs.
The aspect of cohesive nodes is emphasized, where some nodes share information more frequently than others, or some nodes use data under the same consistency model. The paper describes work in progress and points out essential research problems in the area, such as the need for an analysis model of how system parameters influence critical resources. A first definition of scalability for a fully replicated database ('replication effort') was presented, and an algorithm for segmentation was introduced, although with high complexity and a combinatorial problem when using multiple data properties.
Multiple consistency models may coexist in the same database, since different segments can use different consistency models. To evaluate an approach for this, the coexistence of bounded and unbounded replication times for updates was implemented in the DeeDS prototype [48], following the proposed architecture for virtual full replication. In this implementation there are two segments with different types of eventual consistency: one with unbounded replication delays and one with a bound on replication delays.
4.2 Static segmentation
M2: In the second research step, we pursued deeper knowledge about static segmentation.
The segmentation approach for virtual full replication was refined [53].
We explore how virtual full replication can be supported by grouping data objects into segments. The database is segmented to group data objects that need to be accessed at the same set of nodes, and each segment is allocated only at the nodes where its data objects need to be accessed. Segmentation is an approach to managing resources by limiting the degree of replication for data objects, to avoid excessive resource usage. The data objects in a segment share key properties, where node allocation is one such property.
Segmenting a database for known data references is a trivial problem, since a partitioning and replication schema can be derived directly from the list of required accesses. This can be done by a database designer, or automatically from a list of operations in transactions. However, when other data properties are added, there is a combinatorial increase in the resulting number of segments with unique combinations of properties, as described in [49].
With a segmented database, each segmentation enables support for a specific property, and by allowing multiple segmentations on the same database it is possible to support several properties. A segmentation on consistency allows multiple concurrent consistency models in the same replicated database. This enables new types of applications for the DeeDS database system, since it allows data objects that cannot use eventual consistency. With a segmentation on storage medium, some segments may be allocated to disk instead of memory, when the requirement is to prioritize durability over timeliness or even efficiency.
By segmenting the database on combinations of properties we can find the largest possible physical segments, where all data objects of a segment can be managed uniformly. We can recover segments faster by block-reading an entire physical segment from a backup node, and we can also prioritize the recovery of critical segments.
When combinations of properties are introduced for the data to replicate, the number of segments multiplies with the number of data properties used and the number of values that each property can take. Consider a segmentation that allows two consistency models combined with segmentation for allocation. The resulting number of possible segments may double compared to a segmentation for allocation only. To find the segments in a database where multiple properties are combined, a naive algorithm needs to loop through the combinations of values of each property for all data objects. Also, the more properties that are used for segmenting the database, the smaller the segments become. In the extreme case, each object has its own unique combination of properties and each segment contains a single object. This gives an upper bound on the number of possible segments, although it is unlikely in a database where object usage tends to cluster into cohesive sets of data objects. Thus, there is a need for an approach to segmenting data on multiple properties that is efficient, that considers clustering, and that captures and uses knowledge of how properties can be combined.
In this step we have presented an algorithm that handles the combinatorial problem of segmenting a database on multiple properties, where multiple and overlapping segmentations of the database are allowed, and where dependencies between properties are considered [53]. We introduce logical segments that allow overlapping segmentations, where each segmentation represents a property or a set of properties of interest. From the logical segments, we can derive physical segments, such as allocation and recovery units. With the refined segmentation algorithm, we can segment a database by multiple properties without a combinatorial problem. The algorithm also recognizes the relations between data properties, which can be dependent on each other, unrelated, or even mutually dependent. The dependency relation implies that a dependent property cannot be supported if the property it depends upon is not supported. Relations between properties reduce the number of combinations of segmentations for multiple properties, and are the basis on which profiles may be created. Our approach allows control of relations through rules that can be applied to the set of combined properties. The approach also allows specification of data clustering, which sets the common properties equal for clustered objects, typically objects shared by cohesive nodes.
In current work, we use our algorithm for automatic segmentation and for the allocation of units of distribution (physical segments), based on application knowledge about the data used by transactions and on other knowledge about data properties originating in the application semantics.
For future work, we plan to extend the algorithm to maintain scalability at execution time, supporting mode changes and unspecified data requests.
4.2.1 Static segmentation algorithm
Our approach to static segmentation is based on associating a set of properties with each object; objects with the same or similar property sets can then be combined into segments. We introduce an example by first considering a single property: the nodes where the data is accessed.
Consider a scenario with the following five transaction programs executing as seven transaction instances, in a database with at least six objects replicated to at least five nodes: T1.1 = <r: o1, o6; w: o6; N1>, T1.2 = <r: o1, o6; w: o6; N4>, T2 = <r: o3; w: o5; N3>, T3 = <r: o5; w: o3; N2>, T4.1 = <r: o2; w: o2; N2>, T4.2 = <r: o2; w: o2; N5>, T5 = <r: o4; N3>. Based on these accesses, the objects have the following access sets:
o1 = {N1, N4}, o2 = {N2, N5}, o3 = {N2, N3}, o4 = {N3}, o5 = {N2, N3}, o6 = {N1, N4}.
These particular access sets give a segmentation with an optimal placement of data: s1 = <{o4}, N3>, s2 = <{o3, o5}, N2, N3>, s3 = <{o1, o6}, N1, N4>, s4 = <{o2}, N2, N5>.
For an implementation, we use a table to collect all properties of interest. For the algorithm used with the table, the data accesses of all transactions are marked in a table with data objects as rows and nodes as columns. See Figure 1 (left) for a database of 6 objects replicated to 5 nodes. Note that rows in the table could equally well represent groups of data objects, such as object classes or user-defined object clusters that share the same properties. By assigning a binary value to each column, each row can be interpreted as a binary number that forms an object key identifying the nodes where the object is accessed. By sorting the table on the object key, we get the table in Figure 1 (right). Passing through this table once, we can collect rows with the same key value into unique segments and allocate each segment at the nodes marked. See Algorithm 1 for pseudo code.
[Figure omitted: the access-location table for the 6 objects and 5 nodes, shown before and after sorting on the object key; column values 1, 2, 4, 8, 16 are assigned to nodes N1–N5, and sorted rows with equal keys form segments Seg 1–Seg 4.]
Figure 1: Segmentation principle, used for segment allocation
The sort operation dominates the computational complexity of this algorithm. Thus, the algorithm segments the database in O(o log o) time, where o is the number of objects in the database.
Such a property set can be extended with additional knowledge, and more properties of objects can be handled uniformly by extending the property sets. Assume that the database clients require that o2 must have a bound on its replication time, and that o1 and o6 are allowed to be temporarily inconsistent. Also, assume that objects o3, o4, o5 need to be immediately consistent. Further, assume that objects o1, o3, o5, o6 are specified to be stored in main memory, while objects o2, o4 are stored on disk. To control these multiple properties, we extend the property sets as follows (resulting in the table shown in Figure 2):
o1 = {N1, N4}, {asap}, {memory}
o2 = {N2, N5}, {bounded}, {disk}
o3 = {N2, N3}, {immediate}, {disk}
o4 = {N3}, {immediate}, {disk}
o5 = {N2, N3}, {immediate}, {memory}
o6 = {N1, N4}, {asap}, {memory}
The sort operation likewise dominates the computational complexity of the implemented algorithm. One sort operation is used for each segmentation of interest, and typically only a reduced number of segmentations are used. Thus, the algorithm segments the database in O(o log o) time for multiple properties as well, where o is the number of objects in the database. This is far better than the naive algorithm for multiple-property segmentation presented in [49], which has a computational complexity of O(o!).
By selecting particular properties, segmentations for special purposes can be generated
/* Mark the read and write sets */
clear(access);
for i ← 1 to numtransactions do
    for j ← 1 to T[i].numnodes do
        for k ← 1 to T[i].R.numreads do
            access(T[i].R.read[k], T[i].node[j]) ← 1;
        end
        for k ← 1 to T[i].W.numwrites do
            access(T[i].W.write[k], T[i].node[j]) ← 1;
        end
    end
end
/* Assign key values to node columns */
for j ← 1 to numnodes do
    access.colKey[j] ← 2^(j−1);
end
/* Calculate object keys */
for i ← 1 to numobjects do
    access.Key[i] ← BuildKey(access, i, numnodes);
end
/* Sort rows on object key */
SortTable(access.Key, access);
/* Find segments and allocations */
currKey ← −1;
currSeg ← NIL;
for i ← 1 to numobjects do
    if currKey ≠ access.Key(i) then
        currKey ← access.Key(i);
        currSeg ← NewSeg(currKey);
    end
    currSeg.Add(access.oid(i));
end
Algorithm 1: Segmentation algorithm for segment allocation
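The table-and-sort procedure of Algorithm 1 can be sketched compactly in Python. This is a minimal illustration, not the DeeDS implementation: hashing the frozen access set plays the role of the sorted binary object key, but it groups objects into segments the same way. The access sets are taken from the worked example above.

```python
from collections import defaultdict

def segment(access_sets):
    """Group objects with identical access sets into segments.

    access_sets: dict mapping object id -> set of node ids where the
    object is read or written. Returns a list of (objects, nodes) pairs,
    one per segment.
    """
    groups = defaultdict(list)
    for obj, nodes in access_sets.items():
        # frozenset is hashable, so it can serve as the grouping key.
        groups[frozenset(nodes)].append(obj)
    return [(sorted(objs), sorted(ns)) for ns, objs in groups.items()]

# Access sets from the worked example in Section 4.2.1
access = {
    "o1": {"N1", "N4"}, "o2": {"N2", "N5"}, "o3": {"N2", "N3"},
    "o4": {"N3"}, "o5": {"N2", "N3"}, "o6": {"N1", "N4"},
}
for objs, nodes in segment(access):
    print(objs, "->", nodes)
```

With hashing, the expected cost is O(o) rather than the O(o log o) of the sort-based table; the table formulation, however, extends more directly to multiple properties and to derived physical segments.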
[Figure omitted: the access table extended with consistency columns (imm, asap, bound; column values 32, 64, 128) and storage-medium columns (mem, disk; column values 256, 512), giving extended object keys.]
Figure 2: Multiple property table
from the complete property set. For instance, units of physical allocation can be generated from a subset containing only the properties 'access' and 'storage medium'. By executing the segmentation algorithm on this subset, the resulting segmentation gives the segments for physical allocation to nodes and for recovery of segments.
4.2.2 Properties, dependencies and rules
There may be conditions on how properties can be combined. For instance, to guarantee transaction timeliness three conditions must be fulfilled: the data needs to be stored in main memory, it must be available at the node where it is accessed, and the database needs to replicate updates by detached replication. Detached replication can be done as soon as possible (asap) or in bounded time (bounded). Consider o2 in the property set above. To guarantee timeliness, we need to change the storage of object o2 to main memory, making the storage property consistent with the replication-method property. Thus, we use a known dependency between properties to ensure consistency among the settings of the properties in use.
Also, we may combine objects o3 and o5 into the same segment by changing the storage of either of them. Both need immediate consistency, but there is no relation rule that requires a certain type of storage for that consistency model. We can select storage either in memory or on disk, but to put the objects in the same segment, both need to use the same storage setting. One choice could be to store them on disk, since disk storage is a cheaper resource than storage in memory.
Further, by additionally allocating o4 to node N2, we can create a segment consisting of objects o3, o4, o5. This allocation is no longer optimal with respect to the specification of known data accesses, but it may still be a valid choice, since it reduces the number of segments in the segmentation.
By considering properties of data and the relations between them, segmentations can be made consistent by fulfilling rules that specify the property dependency relation. There may also be other rules that influence the table, such as guaranteeing a minimum degree of data replication to provide a certain level of fault tolerance, or labeling clustered data, i.e., data objects that are replicated together and have the same property values. With such a clustering rule, the entries for the clustered data objects are set to the same value, typically based on the most restrictive combination of the properties of the clustered objects.
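A dependency rule of this kind can be applied mechanically to the property table. The sketch below is illustrative only; the rule, the property names, and the dictionary layout are assumptions based on the o2 example above, not the DeeDS rule engine:

```python
def apply_rules(props):
    """Enforce a property-dependency rule on a dict: object -> properties.

    Assumed rule from the example: an object whose replication must
    complete in bounded time also needs main-memory storage, since disk
    access would break the timeliness guarantee.
    """
    result = {}
    for obj, p in props.items():
        p = dict(p)  # copy, so the input table is left untouched
        if p["consistency"] == "bounded" and p["storage"] == "disk":
            p["storage"] = "memory"  # make storage consistent with replication
        result[obj] = p
    return result

props = {
    "o2": {"consistency": "bounded", "storage": "disk"},
    "o4": {"consistency": "immediate", "storage": "disk"},
}
fixed = apply_rules(props)
print(fixed["o2"]["storage"])  # forced to memory by the rule
print(fixed["o4"]["storage"])  # unchanged: no rule applies
```

A real rule set would be declarative rather than hard-coded, as noted for the implementation in Section 4.2.5, but the principle of rewriting property values to satisfy dependencies is the same.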
4.2.3 Analysis model and assumptions
To evaluate resource usage for the implementation of static segmentation, we present an analysis model for the usage of three key resources. Bandwidth usage is reduced when fewer nodes need to be updated. Such a system also requires less overall storage, since each node hosts only a subset of the database objects. As a consequence, the overall processing for conflict detection and resolution of inconsistent replicas is lowered, since fewer nodes host conflicting replicas.
We evaluate our approach by examining how three important system resources are used when selected system parameters are scaled up. The baseline for comparison is a fully replicated database with detached replication, such as DeeDS. First of all, a large-scale distributed database with many nodes requires scalable bandwidth usage, and full replication does not scale well since an update generates O(n^2) replication messages for n nodes. Secondly, large distributed and fully replicated databases store the entire database on each local node, requiring storage of O(nm) data objects for a database of m objects. Thirdly, conflict detection and resolution keep O(n^2) processing time, since every node that has a replica must resolve a conflict caused by any other node. Based on our database model, we present an analysis of how these three resources scale, both for a virtually fully replicated and for a fully replicated system.
We assume that the database is distributed to n nodes, and that with segmentation the degree of replication is limited to k ≤ n nodes. In such a database, every update needs to be replicated to k − 1 nodes. With network functions for multicast or broadcast, the system could replicate data to a specific set of nodes with a single send, reducing the processing time required to send the data and the number of messages sent on the network, but not the integration time or the conflict detection and resolution time of an update. Currently, we focus on networks with point-to-point communication only, since we argue that scalability in such networks is a problem that can be generalized to many other kinds of networks. Replication of several updates for the same segment replica could also be coordinated to reduce communication overhead, but in our current analysis model we assume that each update is sent individually. Thus, we consider each object update to generate k − 1 single update messages. Our simple analysis model does not consider individual degrees of replication ki for each segment, but assumes the same degree of replication, k, for all segments.
Selective replication by multicast has been studied before [4] and may be included in
future work as part of a larger study involving other network topologies than the sin-
gle network resource used in the current analysis model. We also consider coordinated
propagation of replication messages to be a future extension of the work in this paper.
The number of transaction instances in the database is denoted |Tj|. We characterize each transaction, Tj, by its frequency, fj, the size of its read set, rj, the size of its write set, wj, and the size of its conflict set, cj. Thus, for the purpose of the analysis, the characteristics of each transaction can be modeled as Tj = <fj, rj, wj, cj>.
An analysis of network usage is essential, since a distributed system does not scale well if the communication cost prevents large systems from being built. Bandwidth is a shared resource that is critical for the scalability of the system. We choose to model this resource as a single shared network link, as is the case with traditional Ethernet or with time-shared communication such as that used in wireless sensor networks. A single shared network resource is more restrictive for scalability than, for example, a switched network, where the topology can relieve congestion by distributing communication over multiple independent network links.
A segmented database stores k replicas of each database object, instead of the n replicas stored in a fully replicated database. For applications that typically have a low replication degree, k, much memory can be saved overall by segmentation, and scalability is improved. An analysis of storage could consider the storage of database object replicas, the segment administration data structures, and the internal variables of the database system. The analysis in this paper considers the storage of data object replicas only, since the additional data structures are expected to be small or independent of whether the system is fully replicated or segmented.
Processing time for storing updates on the local node includes the time for locking, updating, and unlocking the data objects, which we denote L. Updates of data objects that have replicas on other nodes use additional processing time for propagating the update, including logging, packing, marshalling, and sending the update on the network. We denote the processing time for propagation P. With point-to-point communication, the sending node spends P time for each node that has a replica of the updated data object. At a node receiving an update replicated from another node, the database system must integrate the update. This includes receiving, unpacking, unmarshalling, locking the data objects, detecting conflicts, resolving conflicts, updating the database objects, and finally unlocking the data objects that were replicated. We denote the time used for integration I, excluding conflict detection and resolution, while the time used for conflict detection and resolution is denoted C. Processing time C occurs only when there are conflicting updates, while I is spent on every update replicated to a node. Integration processing is needed at all nodes having a replica of the originally updated data object.
4.2.4 Analysis
We analyze scalability by examining how the chosen resources are used according to our
analysis model.
Bandwidth usage depends on the number of updates replicated, including the new values and their associated version vectors. In our model, every transaction Tj, executed at frequency fj, generates one network message for each member of its write set wj, to update its replicas at k − 1 nodes. Such network messages are generated by all transactions at all nodes. We express bandwidth usage in Equation 3.
(k − 1) · Σ_{i=1}^{|Tj|} wi · fi    [messages/sec]    (3)
From the formula we see that bandwidth usage scales with the degree of replication. For a virtually fully replicated database with a limit k on the degree of replication, bandwidth usage scales with the number of replicas, O(k), rather than with the number of nodes, O(n), as is the case for a fully replicated database. However, the number of transactions often depends on the number of nodes, O(n), so bandwidth usage becomes O(kn) for the virtually fully replicated database and O(n^2) for the fully replicated database.
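Equation 3 is straightforward to evaluate for a concrete workload. The sketch below is illustrative; the workload numbers are invented for the example, not taken from DeeDS measurements:

```python
def messages_per_sec(transactions, k):
    """Equation 3: each of the w_i objects written by transaction i,
    executed at frequency f_i, is sent to k - 1 replica nodes."""
    return (k - 1) * sum(f * w for (f, w) in transactions)

# Hypothetical workload: (frequency in transactions/sec, write-set size)
workload = [(10, 2), (5, 1), (1, 4)]
print(messages_per_sec(workload, k=3))   # bounded replication degree
print(messages_per_sec(workload, k=10))  # full replication on 10 nodes
```

With the bound k = 3 this workload generates 58 messages/sec regardless of how many nodes are added, while full replication on n = 10 nodes generates 261, illustrating the O(k) versus O(n) behaviour discussed above.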
For the storage of a database with s segments, there are ki replicas of each segment of size si. We express the required storage for a virtually fully replicated database in Equation 4.
Σ_{i=1}^{s} (si · ki)    [objects]    (4)
In our analysis model we assume the same degree of replication, k, for all segments: ∀i (ki = k). Thus, each of the o data objects in the database is replicated at k nodes, and the required storage can be expressed as o·k. With full replication to n nodes, n replicas of each data object are stored and the required storage is o·n. This means that the storage of a virtually fully replicated database scales with the limit on replication, k, rather than with the number of nodes, n, as is the case with a fully replicated database.
The processing time used for replication depends on the write set of each transaction, wi, resulting in propagation time, P, and integration time, I, for each node replicated to.
Additionally, the conflict set, ci, requires conflict detection and resolution time, C. Thus, we can express the processing time as Σ_{i=1}^{|Tj|} fi · {[L + (k − 1)P + (k − 1)I] · wi + [(k − 1)C] · ci}, or as Equation 5.

Σ_{i=1}^{|Tj|} fi · [L·wi + (k − 1)·{(P + I)·wi + C·ci}]    [sec]    (5)
Similarly to Equations 3 and 4, this formula shows that processing scales with the degree of replication rather than growing with the number of nodes in the system. Also, the sizes of the write set and the conflict set influence the amount of processing time required.
With multicast operations available on the network used, the analysis formulas become different for bandwidth and for processing. Replication of updates then sends only one message for all the replicas, and bandwidth usage can be expressed as in Equation 6.
For processing, each update is sent only once on the network, but there is still integration time, and conflict detection and resolution are also done at every replica receiving an update. Thus, the corresponding processing time can be expressed as in Equation 7.
Σ_{i=1}^{|Tj|} wi · fi    [messages/sec]    (6)
Σ_{i=1}^{|Tj|} fi · [L·wi + P·wi + (k − 1)·{I·wi + C·ci}]    [sec]    (7)
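Equations 5 and 7 can be compared directly in code. The sketch below is illustrative; the workload and the cost constants L, P, I, C are invented for the example:

```python
def processing_p2p(transactions, k, L, P, I, C):
    """Equation 5: point-to-point replication; propagation and
    integration are both paid once per extra replica."""
    return sum(f * (L * w + (k - 1) * ((P + I) * w + C * c))
               for (f, w, c) in transactions)

def processing_multicast(transactions, k, L, P, I, C):
    """Equation 7: one multicast send per update, but integration and
    conflict handling still occur at each of the k - 1 receivers."""
    return sum(f * (L * w + P * w + (k - 1) * (I * w + C * c))
               for (f, w, c) in transactions)

# Hypothetical workload: (frequency, write-set size, conflict-set size),
# with per-operation costs in seconds
workload = [(10, 2, 1), (5, 1, 0)]
costs = dict(L=1e-4, P=2e-4, I=3e-4, C=5e-4)
print(processing_p2p(workload, 3, **costs))
print(processing_multicast(workload, 3, **costs))
```

For this workload, multicast reduces the processing from about 0.038 to about 0.033 seconds of CPU time per second of operation: it removes the per-replica propagation term P, but not the per-replica integration and conflict-handling terms, which is why multicast helps bandwidth more than it helps processing.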
4.2.5 Implementation
The segmentation algorithm has been used to implement virtual full replication in the DeeDS prototype, where the database was segmented into fixed segments with individual degrees of replication [53].
In this implementation, transactions are specified in a configuration file that describes
the transactions with their data requirements, the data properties given by the application
semantics, and a specification of key capabilities of the database system. The transaction
specifications were translated into the property table. Some basic rules were hard-coded and
use only a few relations between properties; a generic system would need support for explicit
rule specification. The replication schema and physical segments are derived from the
property table using the static segmentation algorithm (Algorithm 1).
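The derivation of a replication schema can be pictured with a small sketch. This is only an interpretation of the idea behind Algorithm 1 (which is not reproduced here): data objects whose specified accessing node sets coincide are grouped into one segment:

```python
def segment(access_spec):
    """Group objects with identical accessing-node sets into segments.

    access_spec maps object id -> set of nodes that access the object, as
    derived from the property table (the structure is assumed for illustration).
    """
    segments = {}                    # frozen node set -> objects in that segment
    for obj, nodes in sorted(access_spec.items()):   # sorting: O(o log o)
        segments.setdefault(frozenset(nodes), []).append(obj)
    return segments

spec = {"o1": {"A", "B"}, "o2": {"A", "B"}, "o3": {"C"}}
schema = segment(spec)
# Two segments: {o1, o2} replicated at nodes {A, B}, and {o3} at node {C}.
```

The resulting schema directly gives the allocation: each segment is replicated exactly at the nodes in its key, so the degree of replication of a segment is the size of that node set.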
In the experiment we show here, all parameters and all but one scale factor were
fixed. The number of nodes was varied from 1 to 10, and the three resources of
bandwidth, storage and processing were measured. For virtual full replication, we limited
the maximum degree of replication to 3 replicas.
Figure 3: Bandwidth usage vs. Number of nodes

Figure 4: Storage vs. Degree of replication (segmentation overhead shown at bottom)

Figure 5: Processing vs. Number of nodes
Scalability has been examined by evaluating the usage of these resources. Additional
experiments were done in which we varied the degree of replication and the ratio of write
operations in transactions. In these experiments, we compared the original implementation
of a fully replicated database with the new segmentation implementation using full
replication, as well as a lower degree of replication typical of a virtually fully
replicated setting.
Figure 3 shows bandwidth usage for up to 10 nodes, with a replication degree of k = 3
for virtual full replication (VFR). For systems larger than 3 nodes, the bandwidth usage
remains constant with virtual full replication, while for a fully replicated (FR) database
it increases with each node added. The measurements show the total number of
messages sent on the network to replicate one fixed-size segment in the database during a
fixed-length run of the experiment.
Figure 4 shows how storage requirements change for systems of up to 10 nodes; the
measurements show the amount of storage for one fixed-size segment. For a virtually fully
replicated database, storage scales with the limit on replication, while a fully
replicated database scales with the number of nodes. In this experiment we also measured
the overall segment management storage overhead, shown at the bottom of Figure 4.
Segment management data scales with the number of nodes in the current implementation.
This needs to be further analyzed and improved, so that it ideally scales with the degree
of replication instead.
Figure 5 shows the processing used at each node to replicate updates. Processing for
the virtually fully replicated database (VFR) does not increase as nodes are added, but
is bounded by the degree of replication. The fully replicated database (FR) scales with
the number of nodes, while the implementation of virtual full replication on top of a
fully replicated database (FR new) adds processing overhead compared to FR.
4.2.6 Discussion
The curves match the behavior predicted by our analysis in Equations 3, 4 and 5 well. For
a full large-scale evaluation, experiments need to be run for a larger number of nodes, which
could not be done successfully with the current database prototype implementation. Our
analysis in Section 4.2.4 complements the implementation evaluation for this research step
by giving an analytical model of large-scale behavior. Our next step addresses the problem
of studying a large-scale implementation.
4.3 Simulation for scalability evaluation
M3: We use simulation to examine large-scale systems. The first simulation experiment
has been performed [50], strengthening the findings about bandwidth usage from the
implementation of static segmentation in DeeDS. The simulator will be further developed
to run experiments with 1000 nodes and more. Such large-scale systems cannot easily be
run as a DeeDS implementation, since that would require an infeasible amount of resources.
Our simulation approach uses a representation of bandwidth, storage and processing,
rather than using the actual resources as an implementation in the DeeDS prototype does.
One example of the advantage of simulation is the storage of the database itself: the
simulation does not store the actual data, but only information describing the data objects.
By using only a value for the size of each object instead of storing the object, the amount
of storage required for the data objects can still be modeled while using fewer resources.
With such a representation, network messages never carry actual data object values, only
information about the size and version of the update. A simulation of large-scale systems
requires that the system be represented so that the simulation itself is scalable. Since we
study resource usage by the replication mechanisms of DeeDS and its extensions for virtual
full replication, this is what is primarily modeled. The representation needs to be further
refined to be scalable to 1000 nodes and more.
Simulation enables an evaluation of static segmentation for large-scale virtually fully
replicated databases. In previous work we implemented virtual full replication in the
DeeDS database prototype. That implementation could not give precise results, since only
a few nodes could be run in the experiment, so only limited conclusions about large-scale
behavior could be drawn from it. With a simulation experiment, the database nodes are
simulated and the resources used can be represented as data entities rather than consuming
actual resources in the system.
A simulation is only a model of the real system, but with a properly designed
simulation we have the opportunity to understand the large-scale behavior of virtual
full replication. Before a simulation is used it must be shown to be valid for the purpose
of the study, so our approach in this research step is a two-stage effort. The first stage
has been to assemble relevant literature as a background on validation of simulations. The
second stage is to model and implement the simulation, and this work is in progress. An
initial simulation has been done and is reported in [50].
4.3.1 Motivation for a simulation study
Our motivations for simulating a distributed real-time database are:
• Large scale experiments with the prototype database system are infeasible, due to
the amount of hardware and the size of the installation required.
• The analysis done in the previous step, of resource usage for bandwidth, storage and
processing, gives a model and an estimate based on a set of assumptions. The
proposed algorithm for static segmentation needs to be used in a large-scale setting to
measure actual resource usage. A simulation that is tailored to mimic the actual
distributed database system may give a more detailed understanding of the large-scale
behavior of a virtually fully replicated database than the analysis alone.
• Executing large-scale experiments with the actual system would require more software
engineering effort than a simulation, since development of the database prototype
is expected to result in a fully working implementation. With a simulation, the
implementation can focus on the segmentation and replication processing in
isolation.
4.3.2 Simulation objectives
The objectives of the simulation in this experiment are:
• To measure the usage of three key resources, bandwidth, storage and processing
time, in a simulation of a distributed real-time database with eventual consistency
that uses virtual full replication.
• To detect how system parameters (the independent variables) influence resource usage
by measuring the usage for increasingly larger systems. In particular, we want to
identify variables that make resource usage grow at a non-scalable rate.
• To evaluate whether we can establish a useful simulation platform for our research,
in order to experiment with subsequent research ideas in the area.
4.3.3 Validation of software simulations
Simulating a phenomenon of interest in a computer system involves creating a model of
the system to be studied. This inherently means making a simplification of the actual
system that is valid only under a set of known assumptions. Introductions to simulation
can be found in [44], [38] (in particular parts IV and V) and [11]. To measure and
draw conclusions from such a model, the model must be justified as correct for
the intentions of the study. Thus, the simulation objectives need to be very clear. The
literature on validation of simulations is abundant, ranging from high-level approaches
establishing taxonomies for simulations in general, to detailed work on simulations of
parallel computing. Here, we relate work that has significance for validation of
simulations of large-scale systems, in particular distributed databases. Important work
includes [8], [11] and [69].
Simulation is useful where the behavior of the real system cannot easily be analyzed.
This includes cases where the input or the model has some stochastic component, or where
the computation of an analysis is complex. Statistics is important for evaluating simulation
results, but not all statistical techniques can properly be applied to simulations [41].
Statistics can also be used for validation of the simulation model itself [10].
In work by Shannon [72], it was concluded that a simulation cannot be a full representation
of the system under study; rather, the simulation must be reduced to meet the particular
objectives of the study. Detailed general-purpose simulations tend to cost much in processing
time. Validation of simulations is the process of determining whether a simulation model
accurately represents the system for the objectives of the study; it is a matter of creating
confidence that the model represents the system under the given objectives. Simulation
confidence is not a binary value, but is gradually strengthened by tests for validity. In [68],
concrete tests for validation are presented:

• Degenerate tests: how does the model's behavior change with changed parameters?
• Event validity: how does the sequence of events in the simulation correlate with the events of the real-world system?
• Extreme-condition tests: how does the model react to extreme and unlikely stimuli?
• Face validity: how does the model correspond to expert knowledge about the real system?
• Fixed values: how does the model react to typical values for all combinations of representative input variables?
• Historical data validation: how is historical data about the real system used for the simulation model?
• Internal validity: how does the system react to a series of replication runs for a stochastic model? For high variability the model can be questioned.
• Sensitivity analysis: how do changes in input parameters influence the output? The same effect should be seen in the real system, and the parameters with the highest effect on the output should be carefully evaluated for accuracy against the real system.
• Predictive validation: the outcomes of forecasting by the simulation and by the real system should correlate.
• Traces: execution path behavior should correlate.
• Turing tests: expert users of the real system are asked whether they can discriminate between outputs from the real system and the simulation.
The validation of our simulation is two-fold. First, the current DeeDS implementation
can run a limited subset of the experiments intended for the simulation; thus, a sanity
check is to run some selected experiments with the DeeDS implementation as well. Second,
the implementation will be validated through an examination by the DeeDS project
members. This will ensure that the simulation model follows the intentions and the
semantics of the actual DeeDS prototype, both as described in published research papers
and as embodied in the actual prototype implementation.
4.3.4 Modeling detail
The accuracy of a model must be sufficient to evaluate the objectives of the study at
hand. A more accurate model than required will result in a simulation that uses excessive
resources, and may not even be useful to run due to long execution times. To establish the
level of modeling detail, the simulation objectives must be defined. It is infeasible to model
all aspects of the system (absolute isomorphism), so the effort should be to model at a level
of detail and with a representation that is appropriate for the objectives. Only reality itself
is a generally valid model; the model and the simulation can only be valid in the context of
the assumptions and the objectives. Balci [7] uses a set of criteria for modeling: model
verification is the correct transformation into the simulation model; model validation is
compliance with the simulation objectives, within the domain of applicability; model testing
determines that the model and the simulation function properly; model accreditation is the
official certification to use the model for a specific purpose.
Graph-based approaches exist for modeling systems to be simulated, including state
charts and automata. Graphs are easily understood, and a clear notation of behavior
may increase confidence that the model is valid. Simulation graphs have long been used
to document simulation models and were developed into event graphs [71]. Modeling by
event graphs opens up ways of structuring the representation of behavior, as done with
event pattern mappings [31]. A derivative of event graphs is the resource graph [37],
where graph nodes represent events rather than states, and where transitions represent
conditions of the simulation. Resource graphs have been shown to be valuable, compared
to job-driven simulations, for simulations of large-scale and congested systems [67].
4.3.5 Approaches for validity and credibility
A model is validated by building confidence that it reflects the real-world problem in
terms of the objectives to evaluate. For validity it is more important to have a model with
confidence than a fully detailed and accurate model. Therefore, it is important to
know how to increase confidence in the model. Robinson [66] describes how validation
is improved by addressing model confidence:
Iterated verification and validation need to be used along with iterative development
of the model itself. Also, a given view of the world may not hold for validation: real-world
data is often inaccurate, and the world representation results from a certain interpretation
of the world. There may not even be a real-world situation to compare against; consider
a full-scale nuclear war, which is a likely subject for simulation but very hard to compare
to a valid real-world situation. Robinson refers to common techniques in the area for
handling typical problems in validation:

• Conceptual modeling: it is important that the modeler acquires a deep understanding of the real-world system to be modeled, which requires close collaboration with domain experts. In this way the modeler can understand alternative interpretations from which objectives and a conceptual model can be developed. The conceptual model needs to be validated, with review of the modeling objectives and the modeling approach by domain experts.
• Data validation: real-world data must be validated for accuracy by exploring and evaluating the data source to determine data reliability, completeness and consistency.
• White-box validation: inspection of the simulation implementation improves the correctness of the model implementation and its compliance with the conceptual model. Control of events, flow and logic is central; code reviews, execution trace analysis and output analysis are used for this.
• Black-box validation: inspection of the overall behavior of the model. This also includes correlation analysis comparing with the behavior of a real-world system. Comparison can also be done with another model or simulation of the same or a similar problem, but differences in simulation objectives and problems must be considered.
For a credible model there is agreement that the model reflects the real system
appropriately for the experiment [28]. To establish credibility, there must be agreement
on the assumptions of the model, and evidence of validation and verification is needed.
There is a risk that an agreed model, regarded as credible, is still not valid due to
improper validation. To further improve credibility, the model can be accredited.
Accreditation is the formal acceptance that the model and the simulation are approved
for use in a specific experiment or usage [9] [7].
4.3.6 Experimental process and experiment design
The experiment consists of the following steps:
1 Modeling detail and simulator implementation
The simulation implementation in this experiment mimics replication in the DeeDS
database prototype system. In particular, specific functions for database operations,
data replication, and the usage of resources in these operations are modeled in detail.
2 Model and implementation validation
The implementation will be fully reviewed by the DRTS research group's members, as
a walk-through of the implementation, including a discussion of proper representation
of the DeeDS database prototype with respect to the issues studied. This step
is important for verification, but also for validation. Also, the implementation is
compared to a few reference execution cases from the implementation of static
segmentation in DeeDS [53].
3 Experiment variables
In the first experiment we chose to study three system parameters (the independent
variables): the number of nodes, the size of the database and the share of update
operations in transactions. We will extend the study to include more independent
variables that could affect scalability, possibly using principal component analysis
[25] to group related variables. A full set of independent variables for such an experiment
would be: database size, number of nodes, ratio of update transactions, ratio of write
operations in transactions, ratio of conflicting updates, frequency of transactions,
distribution of transactions (where transactions are instantiated), and the bound on
degree of replication.
4.3.7 Simulator implementation
The implementation is based on an existing distributed database simulator, which models
a database system with distributed transactions and load balancing. The earlier purpose
of this simulator was to evaluate different approaches to load balancing, and several
papers have been published based on these simulations [79] [78]. The simulator was
previously assessed for modifiability for building the functions required to simulate
detached replication, and it was found to be a sufficient base. According to Law
and Kelton [44], a discrete-event simulation needs a number of base components, and
our simulation includes all of the components they point out.
• System state. The notion of a system state is represented by key variables in the
simulator we use.
• Simulation clock and a model of time. The simulator uses a "next event time
advance" model that steps up logical time to continuously process events. A simulation
with few, short events steps up time faster than a simulation that contains many events
with much processing in them.
• Event list. There are event queues both for individual nodes and for overall event
processing, such as network events. The simulator always processes the next
upcoming events from these queues. In addition, the simulator contains queues for
transaction states, such as "blocked" and "ready", with priorities for these actions.
There is a transaction scheduler, and transactions can be prioritized and modeled
as preemptable.
• Statistical counters. The simulator uses the statistical Java package Colt from CERN
and collects various statistics during execution.
• Report generator. After reaching a termination condition, the collected statistics are
assembled into an extensive report with key figures.
• Main program. The entire simulation is run in a single thread, coded in Java.
• Initialization routines.
• Library routines. A realistic load generator, based on parameters of actual system
inputs.
The simulation has been extended with key functions needed for our experiments, in an
order where each step can be validated separately. The following extensions have been
made:
1 Update accounting
Version vectors were implemented for managing concurrent updates. Detection of
concurrent updates includes both write-write conflicts and read-write conflicts.
2 Replication
DeeDS uses detached replication as the mechanism for replicating updates independently
of transaction execution, after transaction commit, and detached replication
was implemented in the simulation. The simulation implements the features needed
for detached replication, such as shadow page updates, replication logs, propagation
of database updates to other nodes after transaction commit, and conflict
detection by version vectors, and it has a simplified representation of conflict resolution.
By varying the bound on degree of replication from the number of nodes down to a
lower degree, resource usage for full replication can be compared to usage at different
lower degrees of replication.
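Conflict detection with version vectors can be sketched as follows, assuming one logical-update counter per node (a common formulation; the details of the DeeDS variant may differ):

```python
def dominates(v1, v2):
    """True if v1 has seen every update that v2 has seen."""
    nodes = set(v1) | set(v2)
    return all(v1.get(n, 0) >= v2.get(n, 0) for n in nodes)

def concurrent(v1, v2):
    """Neither vector dominates the other: the updates are concurrent."""
    return not dominates(v1, v2) and not dominates(v2, v1)

# Nodes A and B update the same object independently of each other:
a = {"A": 2, "B": 1}       # A's replica after a local write
b = {"A": 1, "B": 2}       # B's replica after a local write
print(concurrent(a, b))    # True -> conflict detected, resolution needed
```

If one vector dominates the other, the updates are causally ordered and integration can simply apply the newer value; only the concurrent case triggers resolution.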
3 Segmentation
To mimic virtual full replication by static segmentation, the database initialization
allocates data objects to randomly selected nodes in the system, where the number of
nodes for each data object allocation does not exceed the bound on degree of
replication. Replication of updates uses the data object allocations to replicate
only to the nodes that host the data objects.
Further, transactions in the simulation are generated so that they contain accesses only
to data objects that are allocated at the node where the transaction executes.
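The initialization step above can be sketched like this (illustrative Python, not the simulator's Java code; the fixed seed only makes the example repeatable):

```python
import random

def allocate(objects, nodes, k, seed=0):
    """Allocate each data object to k randomly selected nodes,
    where k is the bound on degree of replication."""
    rng = random.Random(seed)
    return {obj: set(rng.sample(nodes, k)) for obj in objects}

def local_objects(allocation, node):
    """Objects a transaction executing at `node` is allowed to access."""
    return [obj for obj, hosts in allocation.items() if node in hosts]

# 500 objects over 10 nodes with a replication bound of 3, as in the experiment:
allocation = allocate(range(500), list(range(10)), k=3)
accessible = local_objects(allocation, node=0)
```

Generating each transaction's read and write sets from `accessible` guarantees the static-segmentation property that no transaction touches data absent from its node.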
4 Load generation
The generation of load for the simulated database should reflect the load of a real
system. For virtual full replication with static segmentation, there are only transactions
that access data objects available at the node of execution. For experiments
with adaptive segmentation, the load must, in a controlled way, generate transactions
that try to access objects at 'new' nodes.
5 Instrumentation
To measure the network load, network usage accounting measures the size of simulated
network messages and accumulates the sizes into a total over a simulation run.
4.3.8 Experiment execution
The execution of the first simulation experiment was divided into three parts, in each of
which one independent variable was varied and the resulting bandwidth usage was measured.
For each sub-experiment we kept the other independent variables constant. The variables
used in the experiments were: number of nodes = 10, size of the database = 500 objects
(where data object size was set randomly between 1 and 128 bytes with uniform
distribution), bound on degree of replication = 3, number of transactions = 300, and
size of transaction = 2 update operations and 8 read operations. Each sub-experiment
was executed 10 times, and the average, maximum and minimum values of bandwidth usage
were recorded. The bandwidth usage was measured both for a fully replicated database
and a virtually fully replicated database, using the variable settings above.
1 Varying number of nodes. The number of nodes was varied from 1 to 30. (Figure 6)
2 Varying database size. The database size was varied from 100 to 2000 data objects.
(Figure 7)
3 Varying update operation share in transactions. Transaction size was fixed to 10
operations, but the number of update operations was varied from 2 to 20. (Figure 8)
4.3.9 Discussion and results
Figure 6: Number of nodes, VFR and FR, large systems (bytes sent vs. number of nodes; curves: VFR MAX, VFR MIN, FR MAX, FR MIN)
Figure 7: Database size, VFR and FR, large systems (bytes sent vs. database size [x100]; curves: VFR MAX, VFR MIN, FR MAX, FR MIN)
Figure 8: Update transactions (bytes sent vs. write operations in transactions; curves: VFR MAX, VFR MIN, FR MAX, FR MIN)
This experiment varies three important scale factors (number of nodes, database size,
share of write operations), while in our experiments with the actual DeeDS implementation
we reported the usage of key resources (bandwidth, storage, processing). Thus,
only one configuration in the simulation experiment is comparable with the DeeDS
prototype experiments: the bandwidth used when scaling the number of nodes. Additionally,
the DeeDS prototype experiment measured the bandwidth for replicating one single segment,
while the corresponding simulation experiment measured the total bandwidth usage;
therefore the diagrams are not identical. Keeping this difference in mind, we can compare
the growth of bandwidth usage when adding nodes to the system (Figure 3 and Figure 6).
We can estimate from the graphs that the resource usage grows as indicated by Equation 3.
However, the comparison is very limited and needs to be extended in subsequent work.
That work also includes extensions of the simulation so that storage and processing
resource usage are modeled, allowing the simulation to be compared with and validated
against the experiments with the DeeDS prototype implementation.
In an experiment like this, definitive conclusions can be drawn only for the studied ranges
of the independent variables; in our case, 1-30 nodes, 100-2000 database objects and 2-20
update operations in transactions. The examination of a validated simulation may be
sufficient for the purpose, but only an elaborated analytical model would allow conclusions
about bandwidth usage for arbitrary values of the independent variables.
By building a simulation for virtual full replication we have gained detailed insight
into how to model the database system, which can be used to refine the model. In a
simulation, variables can be better controlled, and the transaction load can be varied
under full control, which gives the opportunity to examine a large range of application
types. A simulation also enables the study of a system at a scale that would be impractical
with the actual system.
There are some additional scalability problems that need to be addressed. Detached
replication uses version vectors in which each node keeps a copy of every other node's
version vector. This is not scalable in the number of nodes, but dynamic version vectors
[65] can be used to solve this, using 'generations' of version vectors as done in work by
Gustavsson [32]. Furthermore, the current simulation implementation is not scalable due
to some as yet unknown design limitations, since some simulation settings run out of
resources on the computer where the simulation is executed. As a consequence, the
continued simulation implementation needs to be assessed for how the simulation model
should represent functions of the prototype system in a scalable, yet correct, way. This
problem will be addressed in the design of the subsequent simulation implementation.
4.4 Adaptive segmentation
M4: With a replication schema that meets the requirements of a static specification of
database usage, the virtually fully replicated database does not make data objects
available for unspecified data accesses. This is a disadvantage compared to a fully
replicated database. To address this problem for virtual full replication, we introduce
adaptive segmentation, which allows unspecified data accesses for soft transactions.
Data objects that are not allocated to a node are loaded on demand, and the database
is re-segmented based on the new information that becomes available at execution time.
The approach in this part of our method is still under development; below we present our
current findings and the related challenges.
The segmentation information at each node keeps track of which segments are allocated
to the node. This includes the data objects in a segment, the consistency model used by the
segment, the required deadlines for replication, and a list of all the nodes that the segment
is allocated to. When an access arrives that uses a data object not available at the
node, processing is initiated for loading this data object from another node.
The information about where data can be found is made available by listing all data
objects in the database with their residence and segment assignment, and storing this
list at each node. This simple approach does not scale with the size of the database,
and the representation needs to be elaborated in the work to come. One approach is to
store information only about the data objects available at the node. Missing data can
still be detected, but finding nodes from which the data can be loaded then requires a
(bounded) search.
For adaptiveness, we will primarily consider the following changes to the replication
schema: unspecified accesses, added and removed nodes, and added and removed data objects.
The level of fault tolerance is also considered: we can apply a minimum degree of
replication for a segment, and when a node fails, this degree needs to be restored. This
is similar to the problem of making data available for unspecified accesses, and we expect
to use the same approach to restore the minimum degree of replication.
The segmentation algorithm we have presented [53] is expected to be efficient enough
for adapting the segmentation online. Segment tables (based on the updated property sets)
can be updated when new information arrives, in O(o log o) time.
We propose to use incremental recovery [45] for loading segments that contain the data
objects that are detected as missing at a node. On-demand loading of segments will delay
transactions while loading a missing segment onto a node, and the timeliness will depend
on the timeliness of the network, which makes segmentation adaptation unsuitable for
hard transactions.
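The on-demand loading path for a soft transaction might be structured as in the sketch below. The class and the directory structure are illustrative assumptions, and the loading itself (incremental recovery) is abstracted away:

```python
class Node:
    def __init__(self, name, segments, directory):
        self.name = name
        self.segments = segments      # segments currently allocated locally
        self.directory = directory    # segment id -> set of hosting nodes

    def access(self, segment):
        if segment not in self.segments:
            # Unspecified access: the soft transaction is delayed while the
            # segment is loaded from another replica, then re-segmentation
            # records the new allocation.
            source = next(iter(self.directory[segment]))
            self.segments.add(segment)
            self.directory[segment].add(self.name)
            return f"loaded {segment} from {source}"
        return "local"                # subsequent accesses execute timely

node = Node("C", set(), {"s1": {"A"}, "s2": {"A", "B"}})
print(node.access("s1"))   # first access pays the loading delay
print(node.access("s1"))   # prints "local": no further delay
```

Note that only the first access is delayed; once the directory records the new replica, later transactions on the segment execute locally, matching the behavior described above.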
To maintain scalability and to reduce storage usage over time, segments also need
to be unloaded from nodes when they are no longer used. The load and unload policies are
application dependent. Some applications may not need to unload segments, but will
reach an equilibrium where all segments ever needed are available and continuously
in use. Other applications, with mode-changing characteristics, may need to switch data
sets when changing the mode of execution. The combination of the frequency of changes
and the available storage may cause 'thrashing' that keeps the database system busy
loading and unloading data, leaving no processing time for database transactions.
This is an open problem that needs to be studied in this part of our work.
Reconfiguration of segments takes effect immediately at the local node, and as soon as
data has been loaded to the local node, the pending transaction can continue its execution.
Subsequent transactions on the loaded segment will not need to wait, but will execute
timely at the node. Reconfiguration of segments is assumed to always be initiated at the
node of an access, and then propagated to all other nodes that need the information about
the change. In the DeeDS database, such propagation can be done by the database
replication mechanism, detached from the update of the local segment knowledge. Since
such propagation has eventual consistency, the other nodes do not receive the information
at the same time. Therefore, user updates of data objects in a segment that is being
loaded to a node need to be forwarded from the node the data is loaded from, as
long as not all nodes have received the propagated change of segmentation. The problem
of propagating and forwarding updates to nodes where segments have been newly loaded is closely
connected to the issue of how information about segments is represented at nodes, and the
two will be addressed together.
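The forwarding period can be sketched as bookkeeping at the source node: keep forwarding user updates to newly added replicas until every node has acknowledged the new segment schema. All names here are illustrative, not part of the DeeDS design.

```python
class SourceNode:
    """Sketch of update forwarding during a segment reconfiguration.

    While the new segment schema propagates with eventual consistency,
    the node the segment was loaded from forwards user updates to the
    newly added replica, so no update is lost before every node has
    learned about the change.
    """

    def __init__(self, all_nodes):
        self.all_nodes = set(all_nodes)
        self.acked = set()          # nodes known to have the new schema
        self.forward_to = set()     # replicas added by reconfiguration

    def add_replica(self, node):
        self.forward_to.add(node)

    def on_schema_ack(self, node):
        self.acked.add(node)
        if self.acked == self.all_nodes:
            # Everyone now replicates to the new node directly, so
            # explicit forwarding is no longer needed.
            self.forward_to.clear()

    def targets(self):
        return set(self.forward_to)

src = SourceNode({"n1", "n2", "n3"})
src.add_replica("n3")
print(src.targets())          # {'n3'}: still forwarding
for n in ("n1", "n2", "n3"):
    src.on_schema_ack(n)
print(src.targets())          # set(): propagation complete
```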
The conditions for maintaining scalability are an important issue in this part of the
work. Applications that require most of the data at all nodes will closely resemble a fully
replicated database, and be less scalable than a database where clients require only a
low degree of replication. We need to find out in more detail for which types of applications
virtual full replication is beneficial, and for which it only adds overhead.
4.4.1 Adaptive segmentation with pre-fetch
M5: With adaptive recovery of segments to nodes, virtual full replication can make data
available on request for soft transactions. For hard transactions, data must be available at
the time the transaction executes at the node. We extend adaptive segmentation with pre-
fetch of data to nodes, to make data available at a certain time point, or at repeating time
points separated by an interval. This enables adaptation for hard transactions as well.
One condition for this is that the time instant of an access, or the frequency of repeating
accesses, can be specified. Another condition is that segments must be recovered in
bounded time, which implies that the system uses a real-time network.
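For specified access instants, the pre-fetch deadline follows directly: recovery must start no later than the access time minus the worst-case recovery time. A sketch, assuming such a bound (`wcrt`) exists on a real-time network; the names are illustrative.

```python
def prefetch_schedule(accesses, wcrt):
    """Compute the latest time to start segment recovery for each
    specified access, so the data is local when the hard transaction
    runs. `wcrt` is the assumed worst-case recovery time over a
    real-time network; without such a bound no guarantee is possible.
    """
    return {seg: t - wcrt for seg, t in accesses.items()}

# Hypothetical accesses: segment -> time instant of a hard transaction.
print(prefetch_schedule({"s1": 100.0, "s2": 250.0}, wcrt=20.0))
# {'s1': 80.0, 's2': 230.0}
```

For periodic accesses with interval T, the same computation applies per period, starting recovery `wcrt` before each release time.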
Since many real-time systems run transactions periodically, or at least with some re-
peating pattern, it may additionally be possible to detect the patterns of the accesses.
With simple mechanisms for detecting periodic accesses, we may be able to detect
when data needs to be available. This can be extended to include advanced pattern detec-
tion and also pattern prediction. It seems unlikely that detection and prediction can be
correct all the time, so they may only be used to improve availability for soft real-time
transactions. Pre-fetch of data by advanced pattern detection and prediction is an optional
step in our methodology.
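A simple periodicity detector of the kind envisioned here might just test whether inter-arrival times are stable and, if so, predict the next access one period ahead. This is an illustrative sketch, not a mechanism defined by the thesis; the tolerance parameter is an assumption.

```python
def detect_period(timestamps, tolerance=0.1):
    """Detect a periodic access pattern from access timestamps.

    If all successive inter-arrival times agree with their mean within
    a relative `tolerance`, return (period, predicted_next_access);
    otherwise return None. Mispredictions are possible, so this only
    helps soft transactions.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        return None
    mean = sum(gaps) / len(gaps)
    if all(abs(g - mean) <= tolerance * mean for g in gaps):
        return mean, timestamps[-1] + mean
    return None

print(detect_period([10.0, 20.0, 30.0, 40.0]))   # (10.0, 50.0)
print(detect_period([10.0, 13.0, 30.0, 31.0]))   # None: not periodic
```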
To maintain scalability, data objects that are not accessed for a long time may need
to be removed from nodes. This cost needs to be weighed against the setup time for data
objects. Adding and removing data objects at nodes resembles virtual memory, cache
coherency, and garbage collection, and our approach will be related to these concepts.
Removal of objects is critical for hard transactions and must be done in a way that
guarantees availability at the time of access by the hard transaction.
Replicas of data objects that have been assigned to nodes, either by a pre-specified
static segmentation or by pre-fetching of data for hard transactions, cannot be removed
if there is a risk that removal jeopardizes the timeliness of a hard transaction. Pinning
such data objects is one approach to prevent them from being removed from memory
by a generic removal policy based on known accesses. Removal may also mean moving
data from memory to disk at the same node. Pinning an object requires an explicit
specification, in a similar manner as static segmentation or as an explicit specification of
time instants for accesses by hard transactions.
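Pinning can be sketched as a filter in the removal policy: replicas that are pinned, or that are needed by a known future access, are never eviction candidates. The names below are illustrative.

```python
def evictable(replicas, pinned, known_accesses):
    """Return replicas that a generic removal policy may remove.

    A replica is protected if it is explicitly pinned (e.g. pre-specified
    for hard transactions) or needed by any known future access; only
    the remaining replicas are eviction candidates.
    """
    needed = pinned | known_accesses
    return [obj for obj in replicas if obj not in needed]

replicas = {"a", "b", "c", "d"}
print(sorted(evictable(replicas, pinned={"a"}, known_accesses={"b"})))
# ['c', 'd']
```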
4.5 A framework of properties
M6: We have presented an initial approach in which we can express the relations between
data, application, and architecture properties [53]. We elaborate this into a generic frame-
work that can express properties and relations. This part of our methodology is more of a
result than a separate step, since its content follows from the preceding methodology
steps. Optionally, this step includes a study of typical applications and a mapping of their
properties into our framework. This can be done by a literature study or by studying
actual applications. Applications are expected to be available through the research group’s
collaborating partners in the car industry (several Volvo companies), in real-time
operating systems (ENEA), and in real-time database applications (Polyhedra).
4.6 Case study and implementation
M7: The properties of bounded resources for scalability, adaptation by reconfiguration,
and the ability of disconnected operation may be well suited to a wireless sensor network.
Compared to our system model, which assumes a high-bandwidth, high-connectivity,
low-latency network, a wireless sensor network has low bandwidth, lower connectivity,
and high latency due to multi-hop communication. Despite our current network model, we
have properties that seem to match those of a wireless sensor network, such as
scalability and disconnected operation, and we aim to do a case study to evaluate virtual
full replication in such a context.
Also, we have initiated a collaboration project with Polyhedra, one of the major
suppliers of real-time databases. A potential outcome of this project is to apply virtual full
replication in practice and enable replicated database nodes with real-time performance,
where database clients get the same service independently of which node is accessed. This
part of the methodology is an optional step; a successful collaboration project will
enable a commercial system context for validating the research, and give an opportunity
to apply the findings of this thesis in different problem domains of Polyhedra clients.
5 Related work
In addition to the related work referred to in each section of this Thesis Proposal, the
following areas are related to the work.
5.1 Strict consistency replication
In the area of distributed real-time databases there have been many efforts to define syn-
chronization mechanisms that support strict consistency between replicas. Efforts for strict
consistency in large distributed databases include distributed two/three-phase commit,
one-copy serializability correctness [12], concepts of consistency control and confinement,
master-copy replication, and quorum consensus approaches. Helal et al. [34] give a good
introduction to this area. It is complex to provide predictability in replicated databases
with strict consistency, in particular at network partitioning, but approaches exist [21]
[1]. This system model differs from ours, where we allow temporal inconsistencies
to favor availability and predictability.
5.2 Replication with relaxed consistency
Relaxing the requirement of fully consistent replicas enables systems that are more
concurrent, since fewer replicas need to be involved in updates. Many approaches exist for
controlling consistency between replicas under relaxed consistency. Work on consistency
control includes handling partitioning [24] and containing consistency divergence [85].
Several approaches exist for consistency management, such as eventual consistency [14],
epsilon serializability [62], and eventually-serializable consistency [27]. With relaxed
consistency there can be concurrent updates to different replicas of the same data.
There are write-write conflicts, where replicas are concurrently written and the last update
is the one that is stored. There are also read-write conflicts, where a write operation
updates a replica using data that is not consistent with other replicas. To detect write-
write and read-write conflicts between replicas, version vectors [60] and variants [65] can
be used. Also, global time or logical clocks can be used to find the order between updates.
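Version vector comparison, which this kind of conflict detection builds on, can be sketched as follows. This is the standard technique, not specific to any of the cited systems; each replica carries a map from node id to that node's update count.

```python
def compare(v1, v2):
    """Compare two version vectors (node id -> update count).

    Returns 'equal', 'before', 'after', or 'conflict'; 'conflict'
    means neither vector dominates the other, i.e. the replicas were
    written concurrently (a write-write conflict).
    """
    nodes = set(v1) | set(v2)
    le = all(v1.get(n, 0) <= v2.get(n, 0) for n in nodes)
    ge = all(v1.get(n, 0) >= v2.get(n, 0) for n in nodes)
    if le and ge:
        return "equal"
    if le:
        return "before"   # v1's updates are a subset of v2's history
    if ge:
        return "after"
    return "conflict"

print(compare({"n1": 2, "n2": 1}, {"n1": 2, "n2": 3}))  # 'before'
print(compare({"n1": 3, "n2": 1}, {"n1": 2, "n2": 3}))  # 'conflict'
```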
5.3 Partial replication
For partially replicated distributed databases and also partially replicated distributed file
systems many approaches exist. Strict consistency models include: Coda file system [70],
the Replica modularization technique [76], Group communication based partial replication
[4], Allocation to nodes for database partitions [20]. Weak consistency system models in-
clude: Optimistically replicated file systems [64], Quasi copies [5]. Peer-to-peer techniques
46
(P2P) are used for sharing large data entities [22] [13], but often optimize on storage
structures and storage allocation to make search for particular content efficient [47], or for
improving efficiency by different replication and caching strategies [46] [63].
5.4 Asynchronous replication as a scalability approach
The Ficus file system [33] uses partial optimistic replication of files, but without a concept
of granularity. For replication models of large-scale distributed databases there are several
approaches. A hierarchical asynchronous replication protocol [2] replicates to neighbor nodes
using two consistency models, eventually resulting in replication to all nodes of the system.
Epidemic replication is used to disseminate updates [35]. Optimal placement of replicas
allows for large systems [83]. Recent approaches for selective replication in sensor networks
include ’JITRTR’ [61] [77] and ’ORDER’ [78]. A framework is available for comparing
replication techniques for distributed systems, in particular for distributed databases [80].
5.5 Adaptive replication and local replicas
The need for adaptive change of allocation and replication of database objects in (large)
distributed databases has been identified in the literature [82], including relational [15] and
object database approaches [84]. Our approach to adaptive allocation of data objects
(both allocation and deallocation) also relates to the abundant work in the areas
of cache coherency [5] [81] [59], including for mobile databases [19], database buffer management
[36] [39] [23], and virtual memory [26]. We see some differences between our work and the
work done in these areas. With caching, copies are kept locally or in memory to improve
performance. Cached data can be evicted from the cache to free memory for more
frequently used data, which consequently slows down accesses to the non-cached data
in secondary memory. With our approach, we do not allow any access to data other
than in local memory, and we cannot evict data objects from the local node for any
of the known accesses. Also, while virtual memory loads a previously unused piece of
memory from disk, adaptive segments contain data already in use at other nodes.
5.6 Usage of data properties
Sivasankaran et al. [73] present an approach for using data properties for improved
performance and availability in real-time systems that uses cache coherency and buffer
management for several levels of memory, but for a non-distributed database.
6 Conclusions
6.1 Summary
The problem of scalability of large distributed real-time databases is pursued. Several
approaches exist for bounding transaction timeliness and distributed agreement on
updates in distributed real-time databases, and often such approaches aim to bound the time
for distributed execution of transactions. In our approach, there are no distributed trans-
actions that the timeliness of user transactions depends on. All data is available at the
local node in main memory, and replication is done detached from the execution of the
transaction, so all user transactions execute timely at the local node, independently of
network delays and of distributed agreement on updates.
This thesis proposal suggests an approach, for such a database system, that addresses
the problem of scalability. A fully replicated database has a high cost of resource usage
that prevents large systems from being built. Scalability is evaluated in terms of the bandwidth,
storage, and processing time used for replicating the entire database to all nodes. Under
the assumption that, in many applications, fractions of the database are each shared by only a
few nodes, the overall degree of replication can be reduced. With resource management
that uses knowledge about the actual data needs of database clients, replication can be
directed to where it is needed instead of blindly replicating to all nodes. For the applications, the
database keeps an image of full replication, and the database is said to be virtually fully
replicated.
The first part of the thesis proposal considers statically specified data accesses. The result-
ing replication schema provides the data at the nodes where it is specified to be used, but
lacks the advantage of a fully replicated database, where transactions can access arbitrary
data at any node. Our first implementation of virtual full replication partitions the data-
base into statically specified segments, where each segment has an individual degree of
replication, which bounds the usage of resources. For many applications it is not possible
to know all data accesses before the system executes. To address this problem, we suggest
an adaptive database system that can provide data at nodes even if data accesses cannot
be specified in advance, by adapting to changed data requirements. The implementation will
use our segmentation algorithm to adapt segmentation and replication to follow the actual
data needs over time, while maintaining database scalability.
6.2 Initial results
We have introduced the concept of virtual full replication and proposed an architecture
to support it. We have presented an approach for static segmentation that avoids a
combinatorial problem, which would give a large number of small segments and be computationally
complex to generate. This approach handles segmentation on multiple properties, and
multiple segmentations can be generated in O(o log o) time, where o is the number of objects.
We have presented an initial analysis model for three key resources that can be used for
scalability analysis of static segmentation. Our table-based implementation approach is
expected to be useful with adaptive segmentation as well, which is our next step.
We have implemented static segmentation in the DeeDS database prototype and measured
the usage of three resources to evaluate scalability. We have implemented a simulation
model to evaluate static segmentation, which can be used for studying the large-scale behavior
of a virtually fully replicated distributed main memory database according to the database
system model we use.
6.3 Expected contributions
The thesis will present virtual full replication in detail, for a scalable distributed real-
time database that supports the required properties of the database clients and bounds
resource usage for scalability. We quantify scalability in the distributed database setting
and evaluate the scalability achieved. We also identify the usage conditions for scalability
and the limitations of the concept. A framework for properties and their relations
in typical applications is used for resource management. Virtual full replication is
evaluated by simulation and by implementation, optionally in an industrial setting.
6.4 Expected future work
The data access patterns for pre-fetching data studied in this thesis proposal are limited
to simple access patterns. Here, we consider accesses as single time instants of transactions,
or as periodically executing transactions with a minimum inter-arrival time. There is
abundant work available in the area of pattern detection and prediction, and it is
expected that the waiting time for soft transactions while adapting the replication schema
can be improved by pre-fetching data with more precision, or even by speculatively loading
data to nodes. Branch prediction for execution caches is one area where similar problems
are expected to be found.
In this thesis proposal we exclude giving guidelines for how to design database appli-
cations based on the data properties and relations we define in our work. As a result
of the thesis, we expect to know the conditions under which virtual full replication can provide
scalability, and its limitations. The next step would be to apply these findings to
applications.
The findings in this work may point out certain application types that benefit most
from virtual full replication. An implementation project using it in an industrial application
would validate the concept and our approach for large-scale distributed real-time databases.
A Project plan and publications
The project plan contains both past and upcoming activities. Past activities can be seen
in Table 1.
Period          Activity
Mid ’02         Admitted as PhD student
Sep ’02         Segmentation in a Distributed Main Memory Real-Time Database, MSc Thesis [49] (M1)
Apr ’03         Virtual Full Replication: Achieving Scalability in Distributed Real-Time Main Memory Systems [52] (M1)
Sep-Dec ’03     Bounded delay replication, definition and experiment (M1)
Dec ’04         Real-time communication through distributed resource reservation [51]
Sep ’04-May ’05 Implementation of segments in the DeeDS prototype database system, definition and experiment (M2)
Aug ’05         Virtual Full Replication by Static Segmentation for Multiple Properties of Data Objects [53] (M2)
Fall ’05        Simulator implementation and experiments for static segmentation (M3)

Table 1: Past activities
A.1 Time plan and milestones
Spring 2006
MILESTONE : Thesis Proposal at University of Skovde: Method and initial results.
Fall 2006
Implement simulation of adaptive segmentation, with support for multiple data proper-
ties. Experiments for adaptive segmentation. (M4)
Publication: ”Adaptive Segmentation for Virtual Full Replication”. (M4)
Thesis: Preliminary results
Spring 2007
Extend simulation of adaptive segmentation with pre-fetch. Experiments for adaptive
segmentation with pre-fetch. (M5)
Publication: ”Adaptive Segmentation with Pre-fetch for Virtual Full Replication”. (M5)
Case-study and publication: Virtual Full Replication by adaptive segmentation in a
wireless sensor network. (M7)
Optional implementation of virtual full replication in a commercial database system.
(M7)
MILESTONE : Thesis results
Write-up
Fall 2007
Publication: ”A data property framework for virtual full replication in distributed real-
time databases”. (M6)
Write-up
MILESTONE : Thesis defense
A.2 Publications
A.2.1 Current
Segmentation in a Distributed Main Memory Real-Time Database [49]
This Master’s Thesis outlines the approach of segmenting a database for lower resource
usage, where a few nodes share parts of the database. An initial segmentation algorithm,
basic architectural support, and an approach for specifying data needs are presented.
Virtual Full Replication: Achieving Scalability in Distributed Main
Memory Real-Time Systems [52] This work-in-progress paper relates the scal-
ability problem to a typical communication scenario and discusses the complexity of com-
munication in it. We show how segmentation can be used to maintain availability while
lowering resource usage in replication effort and storage requirements.
Real-time communication through distributed resource reservation [51]
This technical report presents an approach for real-time communication over Ethernet.
Real-time communication is required for bounded-time replication in a segmented data-
base. Different segments may have different requirements on the timeliness of replication
delays on the network, and this approach ensures bounded propagation time by distrib-
uted agreement on bandwidth usage among participating nodes, which enables bounded-
time replication of database updates.
Virtual Full Replication by Static Segmentation for Multiple Proper-
ties of Data Objects [53] This conference paper presents a table-based approach
for controlling multiple properties of data, used in a segmentation algorithm that
runs in O(o log o), where o is the number of database objects. Such a table allows appli-
cation of rules for how properties may be combined, such as limitations on combinations
of properties and clustering of data objects. From such a property table, both logical and
physical segments may be derived. The paper also introduces a resource analysis model
for bandwidth, storage, and processing time that can be used for scalability analysis. We
present measurement results from an implementation of static segmentation in DeeDS.
A.2.2 Planned
Adaptive Segmentation for Virtual Full Replication This paper will present
an extension of the table approach with algorithms and mechanisms for adapting
segmentation and replication to suit changed data needs. Multiple data properties and
their relations are considered for adaptive (re-)segmentation of the database. Data objects
used by hard transactions need to be available to guarantee the availability and timeliness
of such transactions. Changes in the segment configuration can be made in bounded
time, but real-time communication is required for timely setup of data objects over the
network. Once data objects are set up at a node by segment recovery, subsequent data
accesses are timely. We use a simulation approach to show that scalability is maintained
throughout execution, supporting timeliness of hard transactions and efficient execution
of soft transactions.
Adaptive Segmentation with Pre-fetch for Virtual Full Replication Al-
location of data objects to nodes delays those soft transactions that access data objects
not yet available at a node. Thus, the efficiency of soft transactions depends on
availability, and to improve availability data objects are assigned to nodes by pre-fetching
objects to nodes where accesses can be expected. Pre-fetching may counteract the effort of
limiting the number of replicas of objects that was introduced by virtual full replication.
However, for the timeliness of transactions, availability needs to be prioritized over optimal
object allocation, and the paper will examine the conditions for guaranteeing availability
in a scalable and adaptive system. We will present results from an experiment that shows
how scalability is influenced by pre-fetching data in an adaptive database with virtual full
replication.
A data property framework for virtual full replication in distributed
real-time databases Properties of the database clients, of the data itself, and
of the architecture are used for achieving virtual full replication. Considering the relations
between a selected set of key properties allows improved solutions for availability and scal-
ability for real-time applications using the database. Some relations may also restrict
usage and support for timeliness, and these important cases are defined. This paper will
present properties and relations in typical real-time applications and relate them in useful
usage profiles.
References
[1] A. E. Abbadi and S. Toueg. Maintaining availability in partitioned replicated data-
bases. ACM Trans. Database Syst., 14(2):264–290, 1989.
[2] N. Adly, M. Nagi, and J. Bacon. A hierarchical asynchronous replication protocol for
large scale systems. In Proceedings of the IEEE Workshop on Advances in Parallel
and Distributed Systems, pages 152–157, 1993.
[3] C. Allison, P. Harrington, F. Huang, and M. Livesey. Scalable services for resource
management in distributed and networked environments. In SDNE ’96: Proceedings
of the 3rd Workshop on Services in Distributed and Networked Environments (SDNE
’96), pages 98–105, Washington DC, USA, 1996. IEEE Computer Society.
[4] G. Alonso. Partial database replication and group communication primitives (ex-
tended abstract). In Proceedings of the 2nd European Research Seminar on Advances
in Distributed Systems (ERSADS’97), pages 171–176, January 1997.
[5] R. Alonso, D. Barbara, and H. García-Molina. Data caching issues in an information
retrieval system. ACM Trans. Database Syst., 15(3):359–384, 1990.
[6] S. Andler, J. Hansson, J. Eriksson, J. Mellin, M. Berndtsson, and B. Eftring.
DeeDS towards a distributed and active real-time database system. SIGMOD Record,
25(1):38–40, March 1996.
[7] O. Balci. Verification, validation and accreditation of simulation models. In WSC ’97:
Proceedings of the 29th conference on Winter simulation, pages 135–141, New York,
NY, USA, 1997. ACM Press.
[8] O. Balci. Verification, validation and testing. In J. Banks, editor, Handbook of
simulation. John Wiley, N.Y., 1998.
[9] O. Balci, P. Glosow, P. Muessig, E. Page, J. Sikora, S. Solick, and S. Youngblood.
Department of Defense Verification, Validation and Accreditation, Recommended
Practice Guidelines. Defense Modeling and Simulation Office, Alexandria, VA, 1996.
[10] O. Balci and R. G. Sargent. A methodology for cost-risk analysis in the statistical
validation of simulation models. Commun. ACM, 24(4):190–197, 1981.
[11] J. Banks, J. Carson, and B. Nelson. Discrete-Event System Simulation. Prentice-Hall,
Upper Saddle River, New Jersey, 2 edition, 1996.
[12] P. Bernstein and N. Goodman. An algorithm for concurrency control and recovery in
replicated distributed databases. ACM Transactions on Database Systems, 9(4):596–
615, December 1984.
[13] R. Bhagwan, D. Moore, S. Savage, and G. M. Voelker. Replication strategies for highly
available peer-to-peer storage. In A. Schiper, A. Shvartsman, H. Weatherspoon, and
B. Zhao, editors, Future Directions in Distributed Computing: Research and Position
Papers, volume 2584 of Lecture Notes in Computer Science, pages 153–157. Springer-
Verlag, Heidelberg, July 2003.
[14] A. Birrell, R. Levin, R. Needham, and M. Schroeder. Grapevine: an exercise in
distributed computing. Communications of the ACM, 25(4):260–274, April 1982.
[15] A. Brunstrom, S. T. Leutenegger, and R. Simha. Experimental evaluation of dy-
namic data allocation strategies in a distributed database with changing workloads.
In Proceedings of the fourth international conference on Information and knowledge
management, pages 395–402, 1995.
[16] A.-L. Burness, R. Titmuss, C. Lebre, K. Brown, and A. Brookland. Scalability evalua-
tion of a distributed agent system. In Distributed Systems Engineering 6, The British
Computer Society. IEE and IOP Publishing Ltd, 1999.
[17] L. Carrillo, J. L. Marzo, and P. Vila. About the scalability and case study of antnet
routing. In Proceedings of CCIA, Oct 2003.
[18] S. Ceri, M. A. W. Houtsma, A. M. Keller, and P. Samarati. Independent updates and
incremental agreement in replicated databases. Distrib. Parallel Databases, 3(3):225–
246, 1995.
[19] B. Y. Chan, A. Si, and H. V. Leong. A framework for cache management for mobile
databases: Design and evaluation. Distributed and Parallel Databases, 10(1):23–57,
July 2001.
[20] W. W. Chu, B. A. Ribeiro-Neto, and P. H. Ngai. Object allocation in distributed
systems with virtual replication. In F. Golshani, editor, Proceedings of the Eighth
International Conference on Data Engineering, pages 238–245, Tempe, Arizona, Feb-
ruary 3-7 1992.
[21] B. A. Coan, B. M. Oki, and E. K. Kolodner. Limitations on database availability when
networks partition. In Proceedings of the fifth annual ACM symposium on Principles
of distributed computing, pages 187–194, 1986.
[22] E. Cohen and S. Shenker. Replication strategies in unstructured peer-to-peer net-
works. In Proceedings of the 2002 conference on Applications, technologies, architec-
tures, and protocols for computer communications, pages 177–190, 2002.
[23] A. Datta, S. Mukherjee, and I. R. Viguier. Buffer management in real-time active
database systems. The Journal of Systems and Software, 42(3):227–246, 1998.
[24] S. Davidson. Optimism and consistency in partitioned distributed database system.
ACM Transactions on Database Systems, 9(3):456–481, 1984.
[25] G. Dunteman. Principal component analysis. Technical Report 07-69, Sage University
Paper, 1989.
[26] W. Effelsberg and T. Haerder. Principles of database buffer management. ACM
Trans. Database Syst., 9(4):560–595, 1984.
[27] A. Fekete, D. Gupta, V. Luchangco, N. A. Lynch, and A. A. Shvartsman. Eventually-
serializable data services. In Symposium on Principles of Distributed Computing,
pages 300–309, May 1996.
[28] C. A. Fossett, D. Harrison, H. Weintrob, and S. I. Gass. An assessment procedure for
simulation models: a case study. Oper. Res., 39(5):710–723, 1991.
[29] S. Frolund and P. Garg. Design-time simulation of a large-scale, distributed object
system. ACM Trans. Model. Comput. Simul., 8(4):374–400, 1998.
[30] D. Garlan. Software architecture: a roadmap. In ICSE ’00: Proceedings of the
Conference on The Future of Software Engineering, pages 91–101, New York, NY,
USA, 2000. ACM Press.
[31] B. A. Gennart and D. C. Luckham. Validating discrete event simulations using event
pattern mappings. In DAC ’92: Proceedings of the 29th ACM/IEEE conference on
Design automation, pages 414–419, Los Alamitos, CA, USA, 1992. IEEE Computer
Society Press.
[32] S. Gustavsson and S. Andler. Continuous consistency management in distributed
real-time databases with multiple writers of replicated data. In Workshop on parallel
and distributed real-time systems, Denver, CO, April 2005.
[33] R. Guy, J. Heidemann, W. Mak, T. Page Jr., G. Popek, and D. Rothmeier. Imple-
mentation of the Ficus replicated file system. In USENIX Conference Proceedings,
pages 63–71, June 1990.
[34] A. Helal, A. Heddaya, and B. Bhargava. Replication techniques in distributed systems.
Kluwer Academic Publishers, 1996.
[35] J. Holliday, D. Agrawal, and A. El Abbadi. Partial database replication using epi-
demic communication. In Proceedings of the 22nd IEEE International Conference on
Distributed Computing Systems (ICDCS02), pages 485–493, 2002.
[36] J. Huang and J. A. Stankovic. Buffer management in real-time databases. Technical
Report UM-CS-1990-065, University of Massachusetts, 1990.
[37] P. Hyden, L. Schruben, and T. Roeder. Resource graphs for modeling large-scale,
highly congested systems. In WSC ’01: Proceedings of the 33rd conference on Winter
simulation, pages 523–529, Washington, DC, USA, 2001. IEEE Computer Society.
[38] R. Jain. The art of computer systems performance analysis. John Wiley and Sons,
New York, 1991.
[39] R. Jauhari, M. J. Carey, and M. Livny. Priority-hints: an algorithm for priority-based
buffer management. In Proceedings of the sixteenth international conference on Very
large databases, pages 708–721. Morgan Kaufmann Publishers Inc., 1990.
[40] P. Jogalekar and M. Woodside. Evaluating the scalability of distributed systems.
IEEE Trans. Parallel Distrib. Syst., 11(6):589–603, 2000.
[41] J. P. C. Kleijnen. Validation of models: statistical techniques and data availability. In
WSC ’99: Proceedings of the 31st conference on Winter simulation, pages 647–654,
New York, NY, USA, 1999. ACM Press.
[42] A. Lahmadi, L. Andrey, and O. Festor. On the impact of management on the perfor-
mance of a managed system: a JMX-based management case study. In J. Schonwalder
and J. Serrat, editors, Ambient Networks: 16th Intl. WS on Distr. Systems: Oper-
ations and Management, volume 3775, pages 24–35, Oct 2005. Lecture Notes in
Computer Science.
[43] J. Landgren. Supporting fire crew sensemaking enroute to incidents. Int. J. Emer-
gency Management, 2(3), 2005.
[44] A. M. Law and D. M. Kelton. Simulation Modeling and Analysis. McGraw-Hill Higher
Education, Boston, 3rd edition, 2000.
[45] E. Leifsson. Recovery in distributed real-time database systems (HS-IDA-MD-99-
009). Master’s thesis, University of Skovde, Sweden, 1999.
[46] Z. Lu and K. McKinley. Partial replica selection based on relevance for information
retrieval. In Proceedings of SIGIR ’99, 22nd International Conference on Research
and Development in Information Retrieval, pages 97–104, 1999.
[47] Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker. Search and replication in unstructured
peer-to-peer networks. In Proceedings of the 2002 ACM SIGMETRICS, Measurement
and modeling of computer systems, 2002.
[48] J. Matheis and M. Mussig. Bounded delay replication in distributed databases with
eventual consistency (HS-IDA-MD-03-105). Master’s thesis, University of Skovde,
Sweden, 2003.
[49] G. Mathiason. Segmentation in a distributed real-time main-memory database (HS-
IDA-MD-02-008). Master’s thesis, University of Skovde, Sweden, 2002.
[50] G. Mathiason. A simulation approach for evaluating scalability of a virtually fully
replicated real-time database. Technical Report HS-IKI-TR-06-002, University of
Skovde, Sweden, Mar 2006.
[51] G. Mathiason and M. Amirijoo. Real-time communication through a distributed
resource reservation approach. Technical Report HS-IKI-TR-04-004, University of
Skovde, Sweden, Dec 2004.
[52] G. Mathiason and S. F. Andler. Virtual full replication: Achieving scalability in
distributed real-time main-memory systems. In Proc. of the Work-in-Progress Session
of the 15th Euromicro Conf. on Real-Time Systems, pages 33–36, July 2003.
[53] G. Mathiason, S. F. Andler, and D. Jagszent. Virtual full replication by static seg-
mentation for multiple properties of data objects. In Proceedings of Real-time in
Sweden (RTiS 05), pages 11–18, Aug 2005.
[54] J. McManus and W. Bynum. Design and analysis techniques for concurrent black-
board systems. IEEE Transactions on Systems, Man and Cybernetics, 26(6):669–680,
1996.
[55] A. Mitra, M. Maheswaran, and S. Ali. Measuring scalability of resource management
systems. In IPDPS ’05: Proceedings of the 19th IEEE International Parallel and Dis-
tributed Processing Symposium (IPDPS’05) - Workshop 1, page 119.2, Washington,
DC, USA, 2005. IEEE Computer Society.
[56] H. P. Nii. The blackboard model of problem solving. AI Mag., 7(2):38–53, 1986.
[57] D. Nussbaum and A. Agarwal. Scalability of parallel machines. Commun. ACM,
34(3):57–61, 1991.
[58] A. Oomes. Organization awareness in crisis management. In B. Carle and B. Van de
Walle, editors, Proc. ISCRAM2004, pages 63–68, 2004.
[59] S. Park, D. Lee, M. Lim, and C. Yu. Scalable data management using user-based
caching and prefetching in distributed virtual environments. In Proceedings of the
ACM symposium on Virtual reality software and technology, pages 121–126, November
2001.
[60] D. Parker and R. Ramos. A distributed file system architecture supporting high avail-
ability. In Proceedings of the 6th Berkeley Workshop on Distributed Data Management
and Computer Networks, pages 161–183, February 1982.
[61] P. Peddi and L. C. DiPippo. A replication strategy for distributed real-time object
oriented databases. In Proceedings of the Fifth IEEE International Symposium on
Object-Oriented Real-Time Distributed Computing, pages 129–136, Washington DC,
Apr 2002.
[62] C. Pu and A. Leff. Replica control in distributed systems: an asynchronous approach.
ACM SIGMOD Record, 20(2):377–386, 1991.
[63] K. Ranganathan, A. Iamnitchi, and I. Foster. Improving data availability through
dynamic model-driven replication in large peer-to-peer communities. In CCGRID ’02:
Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing
and the Grid, page 376, Washington, DC, USA, 2002. IEEE Computer Society.
[64] D. Ratner. Roam: A scalable replication system for mobile and distributed computing.
PhD thesis, University of California, Los Angeles, Jan 1998.
[65] D. Ratner, P. Reiher, and G. J. Popek. Dynamic version vector maintenance. Technical
Report CSD-970022, University of California, Los Angeles, 1997.
[66] S. Robinson. Simulation model verification and validation: increasing the users’
confidence. In WSC ’97: Proceedings of the 29th conference on Winter simulation,
pages 53–59, New York, NY, USA, 1997. ACM Press.
[67] T. Roeder, S. Fischbein, M. Janakiram, and L. Schruben. Resource-driven and
job-driven simulations. In Proceedings of the 2002 International Conference on Mod-
eling and Analysis of Semiconductor Manufacturing, pages 78–83, 2002.
[68] R. G. Sargent. Validation and verification of simulation models. In WSC ’92: Pro-
ceedings of the 24th conference on Winter simulation, pages 104–114, New York, NY,
USA, 1992. ACM Press.
[69] R. G. Sargent. Verifying and validating simulation models. In WSC ’96: Proceedings
of the 28th conference on Winter simulation, pages 55–64, New York, NY, USA, 1996.
ACM Press.
[70] M. Satyanarayanan, J. Kistler, P. Kumar, M. Okasaki, E. Siegel, and D. Steere. Coda:
A highly available file system for a distributed workstation environment. IEEE Trans.
Comput., 39(4):447–459, Apr 1990.
[71] L. Schruben. Simulation modeling with event graphs. Commun. ACM, 26(11):957–
963, 1983.
[72] R. E. Shannon. Tests for the verification and validation of computer simulation
models. In WSC ’81: Proceedings of the 13th conference on Winter simulation, pages
573–577, Piscataway, NJ, USA, 1981. IEEE Press.
[73] R. M. Sivasankaran, K. Ramamritham, J. A. Stankovic, and D. Towsley. Data place-
ment, logging and recovery in real-time active databases. In M. Berndtsson and
J. Hansson, editors, Proceedings of the First International Workshop on Active and
Real-Time Database Systems, pages 226–241, Skovde, Sweden, June 1995.
[74] X.-H. Sun and J. Zhu. Performance considerations of shared virtual memory machines.
IEEE Transactions on Parallel and Distributed Systems, 6(11):1185–1194, Nov 1995.
[75] B. Tatomir and L. Rothkrantz. Crisis management using mobile ad-hoc wireless
networks. In Proc. of Information Systems for Crisis Response and Management
ISCRAM 2005, pages 147–149, Apr 2005.
[76] P. Triantafillou and F. Xiao. Supporting partial data accesses to replicated data.
In Proceedings of the Tenth International Conference on Data Engineering, February
14-18, 1994, Houston, Texas, USA, pages 32–42, 1994.
[77] A. Uvarov and V. Fay-Wolfe. Towards a definition of the real-time data distribution
problem space. In Proceedings of the 23rd IEEE International Conference on
Distributed Computing Systems Workshop (DDRTS03), pages 170–175, May 2003.
[78] Y. Wei, A. Aslinger, S. H. Son, and J. A. Stankovic. ORDER: A dynamic replication
algorithm for periodic transactions in distributed real-time databases. In Proceedings
of Real-time and Embedded Computing Systems and Applications (RTCSA04), pages
152–169, Aug 2004.
[79] Y. Wei, S. H. Son, J. A. Stankovic, and K. Kang. QoS management in distributed
real-time databases. In Proceedings of the 24th IEEE Real-Time Systems Symposium,
Cancun, Mexico, pages 86–97, Dec 2003.
[80] M. Wiesmann, F. Pedone, A. Schiper, B. Kemme, and G. Alonso. Understanding
replication in databases and distributed systems. In Proceedings of the 20th IEEE
International Conference on Distributed Computing Systems (ICDCS00), pages 464–
474, Apr 2000.
[81] O. Wolfson and Y. Huang. Competitive analysis of caching in distributed databases.
IEEE Transactions on Parallel and Distributed Systems, 9(4):391–409, Apr 1998.
[82] O. Wolfson, S. Jajodia, and Y. Huang. An adaptive data replication algorithm. ACM
Transactions on Database Systems, 22(2):255–314, 1997.
[83] O. Wolfson and A. Milo. The multicast policy and its relationship to replicated data
placement. ACM Transactions on Database Systems, 16(1):181–205, 1991.
[84] L. Wujuan and B. Veeravalli. An adaptive object allocation and replication algorithm
in distributed databases. In Proceedings of the 23rd IEEE International Conference
on Distributed Computing Systems Workshop (DARES), pages 132–137, 2003.
[85] H. Yu and A. Vahdat. Design and evaluation of a conit-based continuous consistency
model for replicated services. ACM Transactions on Computer Systems, 20(3):239–
282, 2002.
[86] J. R. Zirbas, D. J. Reble, and R. E. vanKooten. Measuring the scalability of parallel
computer systems. In Supercomputing ’89: Proceedings of the 1989 ACM/IEEE con-
ference on Supercomputing, pages 832–841, New York, NY, USA, 1989. ACM Press.