Virtual Full Replication for Scalable
Distributed Real-Time Databases
Thesis Proposal
Technical Report HS-IKI-TR-06-006
Gunnar Mathiason
University of Skovde
June, 2006
Abstract
Distributed real-time systems increase in size and complexity, and the nodes in such
systems become difficult to implement and test. In particular, communication for synchronization
of shared information in groups of nodes becomes complex to manage. Several
authors have proposed using a distributed database as a communication subsystem,
to off-load database applications from explicit communication. This delegates the task of
information dissemination to the replication mechanisms of the database. With
increasingly larger systems, however, there is a need to manage the scalability of such
a database approach. Furthermore, timeliness for database clients requires predictable
resource usage, and scalability requires bounded resource usage in the database system.
Thus, predictable resource management is an essential function for realizing timeliness in
a large-scale setting.
We discuss scalability problems and methods for distributed real-time databases in the
context of the DeeDS database prototype. Here, all transactions can execute in a timely
manner at the local node due to main memory residence, full replication, and detached
replication of updates. Full replication contributes to timeliness and availability, but has
a high cost in excessive usage of bandwidth, storage, and processing, since all updates are
sent to all nodes regardless of whether they will be used there. In particular, unbounded
resource usage is an obstacle to building large-scale distributed databases.
For many application scenarios it can be assumed that most of the database is shared
by only a limited number of nodes. Under this assumption it is reasonable to believe that
the degree of replication can be bounded, so that a bound can also be set on resource
usage.
The thesis proposal identifies and elaborates research problems for bounding resource
usage in large-scale distributed real-time databases. One objective is to bound resource
usage by taking advantage of pre-specified data needs, but also by detecting unspecified
data needs and adapting resource management accordingly. We elaborate and evaluate the
concept of virtual full replication, which provides an image of a fully replicated database to
database clients. It makes data objects available where needed, while fulfilling timeliness
and consistency requirements on the data.
In the first part of our work, virtual full replication makes data available where needed
by taking advantage of pre-specified data accesses to the distributed database. For hard
real-time systems, the required data accesses are usually known, since such systems need
to be well specified to guarantee timeliness. However, there are many applications where
a specification of data accesses cannot be done before execution. The second part of
our work extends virtual full replication to be used with such applications. By detecting
new and changed data accesses during execution and adapting database replication, virtual
full replication can continuously provide the image of full replication while preserving
scalability.
One objective of the thesis work is to quantify scalability in the database context,
so that actual benefits and achievements can be evaluated. Further, we identify the
conditions for setting bounds on resource usage for scalability, under both static and
dynamic data requirements.
Contents
1 Introduction 7
1.1 Document layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Background 9
2.1 A distributed real-time database architecture . . . . . . . . . . . . . . . . . 9
2.2 The concept of scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Scalability in a fully replicated real-time main memory database . . . . . . 11
2.4 Virtual full replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.1 Basic notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Incremental recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Problem formulation 15
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Problem decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.5 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4 Methodology 19
4.1 Virtual full replication for scalability . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Static segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.1 Static segmentation algorithm . . . . . . . . . . . . . . . . . . . . . . 22
4.2.2 Properties, dependencies and rules . . . . . . . . . . . . . . . . . . . 25
4.2.3 Analysis model and assumptions . . . . . . . . . . . . . . . . . . . . 26
4.2.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Simulation for scalability evaluation . . . . . . . . . . . . . . . . . . . . . . 31
4.3.1 Motivation for a simulation study . . . . . . . . . . . . . . . . . . . . 32
4.3.2 Simulation objectives . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3.3 Validation of software simulations . . . . . . . . . . . . . . . . . . . 33
4.3.4 Modeling detail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.5 Approaches for validity and credibility . . . . . . . . . . . . . . . . . 35
4.3.6 Experimental process and experiment design . . . . . . . . . . . . . 36
4.3.7 Simulator implementation . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3.8 Experiment execution . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.9 Discussion and results . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4 Adaptive segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4.1 Adaptive segmentation with pre-fetch . . . . . . . . . . . . . . . . . 43
4.5 A framework of properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.6 Case study and implementation . . . . . . . . . . . . . . . . . . . . . . . . . 44
5 Related work 46
5.1 Strict consistency replication . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 Replication with relaxed consistency . . . . . . . . . . . . . . . . . . . . . . 46
5.3 Partial replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4 Asynchronous replication as a scalability approach . . . . . . . . . . . . . . 47
5.5 Adaptive replication and local replicas . . . . . . . . . . . . . . . . . . . . . 47
5.6 Usage of data properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6 Conclusions 48
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.2 Initial results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.3 Expected contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.4 Expected future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
A Project plan and publications 51
A.1 Time plan and milestones . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
A.2 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
A.2.1 Current . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
A.2.2 Planned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
List of papers
This Thesis Proposal is based on the work in the following papers.
G. Mathiason and S. F. Andler. Virtual full replication: Achieving scalability in distributed
real-time main-memory systems. In Proc. of the Work-in-Progress Session of
the 15th Euromicro Conf. on Real-Time Systems, pages 33-36, July 2003. (ISBN
972-8688-11-3) [52]
G. Mathiason, S. F. Andler, and D. Jagszent. Virtual full replication by static segmentation
for multiple properties of data objects. In Proceedings of Real-time in Sweden
(RTiS 05), pages 11-18, Aug 2005. (ISBN 91-631-7349-2, ISSN 1653-2325) [53]
G. Mathiason. A simulation approach for evaluating scalability of a virtually fully replicated
real-time database. Technical Report HS-IKI-TR-06-002, University of Skovde,
Sweden, Mar 2006. [50]
1 Introduction
In a distributed database, the availability of data can be improved by allocating data at
the local nodes where the data is used. For real-time databases, transaction timeliness is a
major concern. Data can be allocated to the local node to avoid remote data access, with
its risk of unpredictable delays on the network. Further, multiple replicas of the same data
at different nodes provide data redundancy that can be used for recovery from failures,
avoiding corruption or loss of data when nodes fail. In a fully replicated database the
entire database is available at all nodes. This offers full local availability of all data objects
at each node.
The DeeDS [6] database prototype stores the fully replicated database in main memory
to make transaction timeliness independent of disk accesses. Full replication of data
together with detached replication of updates allows transactions to execute entirely on
the local node, independent of network delays. Detached replication sends updates to other
nodes independently of the execution of the transaction. Database clients get the perception
of a single local database and need not consider specific data locations or how
to synchronize concurrent updates of different replicas of the same data.
Full replication uses excessive system resources, since the system must replicate all
updates to all the nodes. This causes a scalability problem for resource
usage, such as bandwidth for replication of updates, storage for data replicas, and processing
for replicating updates and resolving conflicts for concurrent updates of the same data
object at different nodes.
With existing approaches, real-time databases do not scale well, while the need for
increasingly larger real-time databases keeps growing. With this thesis we aim to show how
to bound resource usage in a distributed real-time database such as DeeDS, by using virtual
full replication to make it scalable. We also quantify scalability for the domain, to allow
an evaluation of the benefits.
We elaborate virtual full replication [6] as an approach to manage resources for scalability,
by replicating and storing updates only for those data objects that are used at a node,
and by maintaining scalability over time. This avoids irrelevant replication and maintains
real-time properties, while providing an image of full replication for local availability of data.
Virtual full replication reduces resource usage to meet the actual need, rather than using
resources to replicate all updates blindly to all nodes, resulting in a scalable distributed
real-time database. Virtual full replication uses knowledge about the needs of the application,
and uses application requirements for properties of the data, such as location,
consistency model, and storage media, to support timeliness requirements of the clients
in a scalable system. Furthermore, properties have relations that can be used to control
resource usage in more detail.
This Thesis Proposal claims that a replicated database with virtual full replication
and eventual consistency enables large-scale distributed real-time databases, where each
database client has an image of a fully available local database.
1.1 Document layout
Section 2 contains a background on distributed real-time databases and in particular how
full replication may support timeliness. A generic approach for scalability evaluation is
introduced, indicating how such an approach can be used for a scalability evaluation of a
distributed database. In section 3 the problem of scalability in a fully replicated database
is detailed; we show how we decompose the problem and how it can be defined in terms
of assessable objectives. Section 4 describes our methodology in detail, for how to reach
the objectives. Some of the steps in the methodology have already been performed and
their results are presented in conjunction, while other steps remain to be done. Section 5
describes related work, both for distributed real-time databases and for other areas that
relate to our approach. Finally, section 6 summarizes the conclusions of this proposal,
highlights the expected contributions and expected consequences for subsequent future
work.
2 Background
2.1 A distributed real-time database architecture
The main property of real-time systems is timeliness, which can only be achieved with
predictable resource usage and sufficiently efficient execution. Predictability is essential:
for a hard real-time system the consequence of a missed deadline may be fatal,
while for a soft real-time system a missed deadline lowers the value of the provided service.
Thus, for real-time systems, predictable resource usage is the primary design concern that
enables timeliness.
To improve predictability and efficiency, the database of the distributed real-time database
system DeeDS [6] resides entirely in main memory, removing the dependency on disk
I/O delays caused by unpredictable access times for hard drives. Also, accesses to main
memory are many times faster.
To further improve predictability of database transactions, the database is fully replicated
to all nodes, making transaction execution independent of network delays or network
partitioning. With full replication there is no need for remote data access during transactions.
Replication also improves fault tolerance, since there are redundant copies of the
data. Full replication allows transactions to have all of their operations running at the local
node. A fully replicated database with detached replication [32], where replication is done
after transaction commit, allows independent updates [18], that is, concurrent and unsynchronized
updates of replicas of the same data objects. Independent updates may cause
database replicas to become inconsistent, and inconsistencies must be resolved in the replication
process by a conflict detection and resolution mechanism. In DeeDS, updates are
replicated to all nodes detached from transaction execution, by propagation from the node
executing an updating transaction after transaction commit, and integration of replicated
updates at all the other nodes. Conflicting updates resulting from independent updates are
resolved at integration time. Temporary inconsistencies are allowed and guaranteed to be
resolved at some point in time, giving the database the property of eventual consistency.
Applications that use eventually consistent databases need to be tolerant to temporarily
inconsistent replicas, and many distributed applications are. Further, in an eventually consistent
database that supports a bounded time for such temporary inconsistencies (which
can be achieved by using bounded-time replication), applications that have requirements
on timely replication and consistency can use a temporarily inconsistent database as well.
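As an illustration of this mechanism, the toy model below commits updates locally and integrates propagated updates afterwards. The logical timestamps and the last-writer-wins policy are simplifying assumptions for illustration only, not the conflict detection and resolution mechanism actually used in DeeDS:

```python
from itertools import count

_clock = count(1)  # logical timestamps; a toy stand-in for real clocks

class Node:
    """Toy model of detached replication with eventual consistency."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # object -> (timestamp, value)

    def commit(self, obj, value):
        # Local commit: no network involved, independent of network delays.
        ts = next(_clock)
        self.store[obj] = (ts, value)
        return (obj, ts, value)  # update record, propagated detached later

    def integrate(self, update):
        # Integration of a propagated update: a conflicting concurrent
        # update is detected by comparing timestamps and resolved here
        # by last-writer-wins (an illustrative policy only).
        obj, ts, value = update
        if obj not in self.store or ts > self.store[obj][0]:
            self.store[obj] = (ts, value)

a, b = Node("A"), Node("B")
u1 = a.commit("x", 1)              # independent updates of the same object
u2 = b.commit("x", 2)              # replicas are temporarily inconsistent
a.integrate(u2); b.integrate(u1)   # detached replication in both directions
print(a.store["x"] == b.store["x"])  # True: the replicas converge
```

Between the two commits and the two integrations, the replicas of x differ; eventual consistency means that this temporary inconsistency is guaranteed to be resolved, here at integration time.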
In a fully replicated database using detached replication, a number of predictability
problems associated with synchronization of concurrent updates at different nodes can be
avoided, such as agreement protocols, distributed locking of replicas of objects,
and reliance on stable communication to access data. Also, application programming is
easier, since the application programmer may assume that the entire database is available
and that the application program has exclusive access to it.
2.2 The concept of scalability
Scalability is a concept that may appear intuitively obvious. It is used in many areas
of research, but generic definitions and theoretical frameworks with metrics are scarce.
A system is scalable if the growth function for the required amount of
resources, req(p), does not exceed the function for the available amount of resources, res(p),
when the system is scaled up in some system parameter p (also called the scale factor).
The resource usage may follow some function of the scale factor, and the upper bound for
this function, O(req(p)), must not exceed the function of available resources. For a system
with linear scalability this relation must be valid for all sizes of p, but for other systems
scalability may be related to only certain sizes of p. In a few research areas, scalability
concepts are well developed and related metrics for scalability are available, namely in
the areas of parallel computing systems [86] [57], in particular for resource management
[55] and for shared virtual memory [74], for design of system architectures for distributed
systems [16], and for network resource management [3].
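The scalability condition above can be made concrete with a minimal sketch. The growth functions req and res below are hypothetical assumptions for illustration, not measurements of any actual system:

```python
# A system is scalable over a range of the scale factor p if the required
# resources never exceed the available resources over that range.

def is_scalable(req, res, p_range):
    """True if req(p) <= res(p) for every p in p_range."""
    return all(req(p) <= res(p) for p in p_range)

# Hypothetical example: replication bandwidth growing linearly with the
# number of nodes p, against a constant available link capacity.
req = lambda p: 40 * p   # required bandwidth (kB/s)
res = lambda p: 10_000   # available bandwidth (kB/s)

print(is_scalable(req, res, range(1, 100)))   # True: holds up to 99 nodes
print(is_scalable(req, res, range(1, 1000)))  # False: req(p) > res(p) for p > 250
```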
Frolund and Garg define a list of generic terms for scalability analysis in distributed
application design [29]:
• Scalability: A distributed (software) design D is scalable if its performance model
predicts that there are possible deployment and implementation configurations of D
that would meet the Quality of Service (QoS) expected by the end user, within the
scalability tolerance, over a range of scale factor variations, within the scalability
limits.
• Scale factor: A variable that captures a dimension of the input vector that defines
the usage of a system.
• Scaling enabler: Entities of design, implementation or deployment that can be
changed to enable scalability of the design.
Further, the authors also define Scalability point, Scalability tolerance, Scalability limits
and Scaling parameters, which we exclude here.
We can map the terms above into our context. QoS can be seen as the timeliness of local
transactions (deadline miss ratio) and the level of consistency (consistency model properties
and the bound on replication delay). Scale factors may be the database size, the number
of nodes, the ratio and frequency of update transactions, or others.
Jogalekar and Woodside [40] have presented a generic metric for scalability in distributed
systems based on productivity; a system is regarded as scalable if productivity is
maintained as the scale changes. Given the quantities:
• λ(k) = throughput in responses/sec, at scale k
• f(k) = average value of each response, calculated from its quality of service at scale
k
• C(k) = cost at scale k, expressed as running cost per second to be uniform with λ
The value function f(k) includes appropriate system measures, such as response delay,
availability, timeouts, or probability of data loss. The productivity F(k) is the value
delivered per second, divided by the cost per second:
F(k) = λ(k) ∗ f(k) / C(k) (1)
The scalability metric ψ(k1, k2) relates the productivity at two different scales such
that
ψ(k1, k2) = F(k2) / F(k1) (2)
With ψ(kx, ky), design alternatives x and y can be compared, such that a higher ratio
indicates better scalability for ky. It is not possible to compare scalability between different
systems, since the ratio is based on system-specific value functions, but scalability can be
evaluated for finding alternative scaling enablers and settings of scale factors for a particular
system. The authors use the scalability metric for evaluating actual systems, and the metric
has also been used by other authors with other value functions [17] [42]. Similarly, we need
to define an appropriate value function for a distributed real-time database, which would
be connected to the real-time properties of the system. Typically this would include the
timeliness of transactions for different classes of transactions. The cost function would
relate to the amount of resources used, in terms of bandwidth, storage, and processing
time.
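Equations 1 and 2 translate directly into code. The throughput, value, and cost figures below are hypothetical illustrations, not measurements of any actual system:

```python
def productivity(throughput, value, cost):
    """F(k) = lambda(k) * f(k) / C(k): value delivered per unit cost."""
    return throughput * value / cost

def scalability(F1, F2):
    """psi(k1, k2) = F(k2) / F(k1); a ratio near or above 1 indicates
    that productivity is maintained at the larger scale."""
    return F2 / F1

# Hypothetical measurements at scales k1 = 10 and k2 = 100 nodes:
F1 = productivity(throughput=500, value=0.9, cost=10.0)    # F(k1) = 45.0
F2 = productivity(throughput=4000, value=0.8, cost=80.0)   # F(k2) = 40.0

print(scalability(F1, F2))  # ~0.89: productivity is not quite maintained
```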
2.3 Scalability in a fully replicated real-time main memory database
For a fully replicated database with immediate consistency, updates are replicated to
all data replicas during the execution of the updating transaction, using a distributed
agreement algorithm to update all replicas at one instant, so that there is no state in which
replicas can differ during the update. Such updates must lock a majority of the replicas of the
updated object during the transaction. With detached replication, transaction timeliness
does not depend on the timeliness of locking replicas, on network delays, or on the waiting
time for release of locks set by transactions at other nodes. Thus, detached replication is a
scaling enabler for such a database. However, a fully replicated database with detached
replication has another scalability problem in that all updates need to be sent to all other
nodes, regardless of whether the data will be used there or not. Full replication also
requires that replicas of all data objects be stored at all the nodes, independent of
whether the data ever will be used there or not. Also, updates in fully replicated databases
must be integrated at all nodes, requiring integration processing of updates at all nodes.
Under the assumption that replicas of data objects will only be used at a bounded
subset of the nodes, the required degree of replication becomes lower than in a fully
replicated database. The resource usage of bandwidth, storage and processing depends on
the degree of replication, and these resources are wasted in a fully replicated database,
compared to a database with a lower degree of replication. With replication of only those
data objects that are actually used, resources can be saved and thereby scalability can be
improved.
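A back-of-envelope sketch illustrates the savings. With detached replication, each committed update is propagated to every other node holding a replica, so the message volume grows with the degree of replication; all numbers below are hypothetical assumptions:

```python
def replication_messages(updates_per_sec, degree_of_replication):
    """Propagation messages per second: each update is sent to every
    other node that holds a replica of the updated object."""
    return updates_per_sec * (degree_of_replication - 1)

nodes = 100      # nodes in the system
updates = 1000   # updates/sec in the whole system

full = replication_messages(updates, nodes)   # full replication: all nodes
bounded = replication_messages(updates, 5)    # data used at <= 5 nodes

print(full, bounded)  # 99000 4000: roughly 25 times fewer messages
```

Storage and integration processing scale in the same way, since both are proportional to the number of replicas being maintained.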
2.4 Virtual full replication
With virtual full replication [6] the database clients have an image of a fully replicated
database. The database system manages knowledge of what is needed for database clients
to perceive such an image. This knowledge includes a specification of the data accesses
required by database clients. Virtual full replication uses this knowledge to replicate data
objects to the subset of the nodes where the data is needed, and also to replicate with the
least resource usage needed to maintain the image. With virtual full replication, several
consistency models for data may coexist in the same database, and the knowledge is also
used to ensure the consistency model for each data object.
Scalable usage of resources depends on the number of replicas to update. Since a
virtually fully replicated database has fewer replicas, resource consumption is lower than
in a fully replicated database, where there are as many replicas as there are nodes.
The overall degree of replication is lower, and replication processing that serves no purpose
is avoided. Only those data objects that are used at a node are replicated there, which
reduces resource usage of bandwidth, storage, and processing. With such resource
management, scalability is improved without changing the application’s assumption of
having a complete database replica available at the local node.
However, virtual full replication that considers only specified data accesses does not
make data available for accesses to arbitrary data at an arbitrarily selected node, which
makes such a system less flexible than a fully replicated database. For unspecified data
accesses and for changes in data requirements, virtual full replication adapts by detecting
changes and reconfiguring replica allocation and replication. This preserves scalability by
managing the degree of replication over time.
Segmentation of the database is an approach for limiting the degree of replication in a
virtually fully replicated database. A segment is a subset of the objects in the database,
and each segment has an individual degree of replication over the nodes. The degree of
replication is a result of allocating segments only to the nodes where their data objects are
accessed. This is typically far fewer nodes than in a fully replicated database. Also,
a database may have multiple segmentations, each for a different purpose. Segmenting for
data availability strives to minimize the degree of replication, while a segmentation for
consistency determines the method for replicating updates in the database.
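The core of segmentation can be sketched as grouping objects by the sets of nodes that access them. The access specification below is a hypothetical example, and the grouping rule is a simplification of the static segmentation approach discussed in section 4.2:

```python
from collections import defaultdict

def segment(accesses):
    """accesses: object -> set of nodes that read or write the object.
    Objects with identical access sets share a segment, and the segment
    is allocated only to those nodes."""
    segments = defaultdict(set)
    for obj, accessing_nodes in accesses.items():
        segments[frozenset(accessing_nodes)].add(obj)
    return dict(segments)

# Hypothetical access specification:
accesses = {
    "o1": {"N1", "N2"},
    "o2": {"N1", "N2"},   # same access set as o1 -> same segment
    "o3": {"N3"},
}
segs = segment(accesses)
print(sorted(segs[frozenset({"N1", "N2"})]))  # ['o1', 'o2']

# The degree of replication of each segment is the size of its node set,
# typically far fewer than the total number of nodes:
degrees = {tuple(sorted(k)): len(k) for k in segs}
print(degrees)  # {('N1', 'N2'): 2, ('N3',): 1}
```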
2.4.1 Basic notation
In this thesis proposal the following notation is used.
A database maintains a finite set of logical data objects O = {o0, o1, ...}, representing
database values. Object replicas are physical manifestations of logical objects. A distributed
database is stored at a finite set of nodes N = {N0, N1, ...}. A replicated database contains
a set of object replicas R = {r0, r1, ...}.
The function R : O × N → R identifies the replica r ∈ R of a logical object o ∈ O
on a node N ∈ N, if such a replica exists: R(o, N) = r if r is the replica of o on node
N. If no such replica exists, R(o, N) = null. node(r) is the node where replica r is
located, i.e. node(R(o, N)) = N. object(r) is the logical object that r represents, i.e.
object(R(o, N)) = o.
A distributed global database (or simply database) D is a tuple <O, R, N>, where
O is the set of objects in D, R is the set of replicas of objects in O, and N is the set
of nodes such that each node N ∈ N hosts at least one replica in R, i.e.
N = {N | ∃r ∈ R (node(r) = N)}.
We model transaction programs, T, by a set of parameters: the set of objects
read by the transaction, READ_T (the read set); the set of objects written by the transaction,
WRITE_T (the write set); and the conflict set, CONFLICT_T, the set of objects that
conflict with updates at other nodes. The transaction program T can thus be defined as
T = {READ_T, WRITE_T, CONFLICT_T}. We refer to the size of the read set as
r_T = |READ_T|, the size of the write set as w_T = |WRITE_T|, and the size of the conflict
set as c_T = |CONFLICT_T|. Also, the working set WS_T is the union of the read and
write sets of the transaction program: WS_T = READ_T ∪ WRITE_T.
A transaction instance T_j of a transaction program executes at a given node n
with a certain maximal frequency f_j. We define such a transaction instance by a tuple
T_j = <f_j, n, T>. When the node for execution is implicit and we only need the
sizes of the read, write, and conflict sets, we simplify the notation to T_j = <f_j, r_j, w_j, c_j>.
Further, node(T_j) is the node where transaction T_j is executed, and begin(T_j) is the time
at which transaction T_j began its execution. commit(T_j) is the commit time of T_j and
abort(T_j) is the abort time of T_j (if T_j is aborted, commit(T_j) = ⊥; if T_j is committed,
abort(T_j) = ⊥). end(T_j) is the completion time of T_j.
2.4.2 Definition
Virtual full replication ensures that for each transaction that reads or writes database objects
at a node, there exists a local replica of each accessed object. Formally,
∀o ∈ O ∀T (o ∈ (READ_T ∪ WRITE_T) → ∃r ∈ R (r = R(o, node(T))))
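The property can be checked mechanically for a given replica allocation. The data structures in the sketch below are illustrative assumptions:

```python
def satisfies_vfr(transactions, replicas):
    """transactions: list of (node, read_set, write_set) tuples;
    replicas: object -> set of nodes holding a replica.
    True if every object in each transaction's working set has a
    replica at the node where the transaction executes."""
    return all(
        node in replicas.get(obj, set())
        for node, reads, writes in transactions
        for obj in reads | writes  # the transaction's working set
    )

replicas = {"o1": {"N1", "N2"}, "o2": {"N2"}}
txs = [("N1", {"o1"}, set()), ("N2", {"o1"}, {"o2"})]
print(satisfies_vfr(txs, replicas))                      # True
print(satisfies_vfr([("N3", {"o1"}, set())], replicas))  # False: no replica at N3
```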
2.5 Incremental recovery
With incremental recovery, a node in a fully replicated distributed database can be recovered
into a consistent database replica without any of the other working replicas needing to
be stopped or even locked [45]. Incremental recovery has been proven to give a consistent
database copy at the recovered node.
In a main memory database, data is stored in memory pages, the smallest
memory entity that is accessed during read or write operations. It is assumed that the memory
management circuitry ensures that a read or write operation has exclusive access to a
certain memory page during the operation. Fuzzy checkpointing uses this mechanism for
sequentially copying all memory pages at a node (the recovery source), and such a copy
can be sent over the network to recover a failed node (the recovery target). Selecting a
recovery source is done by a negotiation process, to select the most appropriate source
node. Each page that is copied is logged at the recovery source, and pages in the log that
are updated after the memory page was sent to the target node need to be sent again.
Such updated memory pages are forwarded to the recovery target as long as the
fuzzy checkpointing proceeds. Once the entire database has been copied and all subsequent
updates have been forwarded, the fuzzy checkpoint finishes atomically with a consistent
replica at the recovered node.
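The recovery procedure can be sketched as follows. The page store and the feed of late updates are hypothetical simplifications that ignore the source negotiation and the network transfer:

```python
def fuzzy_checkpoint(source_pages, updates_during_copy):
    """source_pages: page_id -> contents at the recovery source.
    updates_during_copy: page_id -> new contents written while the
    copy was in progress. Pages updated after being sent must be
    forwarded again, so the target ends up consistent."""
    target = {}
    for page_id, contents in source_pages.items():
        target[page_id] = contents  # sequential fuzzy copy of all pages
    for page_id, contents in updates_during_copy.items():
        target[page_id] = contents  # forward pages updated after copying
    return target

pages = {0: "a", 1: "b", 2: "c"}
late_updates = {1: "b_new"}  # page 1 was updated mid-checkpoint
print(fuzzy_checkpoint(pages, late_updates))  # {0: 'a', 1: 'b_new', 2: 'c'}
```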
3 Problem formulation
3.1 Motivation
A fully replicated distributed database scales badly since it replicates all updates to all
nodes, using excess resources. Scalability is an increasingly important issue for distributed
real-time databases, since the number of nodes, the database size, the number of users
and the workload involved in typical distributed database applications are increasing. For
many such systems it can be assumed that only a fraction of the replicated data is used
at the local node, which is motivated by hot-spot and locality behavior of accesses in
distributed databases.
This thesis argues that a fully replicated distributed main memory real-time database
can be made scalable by effective resource management using virtual full replication, and
that degrees of scalability can be quantified by metrics. Different scale factors and
scaling enablers, which influence resource usage differently, are varied to evaluate the scalability
of the database. Also, resource usage is compared to alternative approaches for timeliness
in large-scale distributed databases, such as fully replicated databases, or approaches that
use partial replication and remote transactions.
For many applications, we believe that a distributed real-time database can be a suitable
infrastructure for communication between nodes. Publishing and exchanging data
through a distributed real-time database facilitates structured storage and access, implicit
consistency management between replicas, fault tolerance, and higher independence
through lower coupling between applications. With a distributed database as an infrastructure
there is no need to explicitly coordinate communication in a distributed application, which
reduces complexity for the communicating application, in particular where the groups that
communicate change often.
Consider a wildfire fighting mission. In such a dynamic situation, hundreds of different
actors need to coordinate and distribute information in real time. In such a scenario, actors
and information are added and removed dynamically as the mission situation suddenly
changes. A distributed database has been pointed out as a suitable infrastructure for emergency
management [75], and could be used as a white-board (also called a ’black-board’ [56],
in particular as a software architecture approach [30], or for loosely coupled agents [54])
for reading and publishing current information about the state of a mission, supporting
situation awareness from information assembled from all the actors. Using a database,
which implies the usage of transactions, ensures consistency of the information and also avoids
the need for specific addressing within the communication infrastructure. Ethnographical
field studies [43] show that such an infrastructure supports sense-making by enabling humans
to interact, and it also gives support for the chain of command and strengthens
organizational awareness, which is essential for the success of the mission [58]. In such an
infrastructure, actors may have access to the complete information when needed, but each
actor will most of the time use only parts of the information, for their local actions and
for collaboration with nearby peers. By using virtual full replication, the scalability of such
a distributed database can be preserved.
Virtual full replication uses known properties of the data objects, the applications, and
the database system to reduce resource usage by resource management based on the
data needs. We have shown a database segmentation approach that considers multiple
properties of data, and relations between the data properties. So far, only a few selected
properties have been used, but an extended set of useful properties and their relations remains
to be defined, to set up segments that meet the data requirements of database applications.
Since properties are related, it can be expected that groups of properties can be structured
into profiles of data properties, to reduce the resulting number of segments in our approach.
3.2 Problem statement
Predictable execution of transactions is a fundamental requirement for distributed real-time databases. Predictable transactions can be achieved by predictable resource usage and sufficient efficiency of execution, and by excluding sources of unpredictability from the execution of transactions. With main memory residence, access times become independent of disk access times. With full replication, the entire database becomes locally available and local accesses become independent of network delays. With detached replication, the network delays for updating replicas are separated from the execution time of the transaction. Combining main memory residence, full replication, and detached replication of updates gives predictable transactions in a distributed real-time database.
Such a database does not scale well, since all updates must be replicated to all other nodes. By using knowledge about data needs, irrelevant replication can be avoided and the database becomes scalable, but such replication makes data available only for the data needs that are known prior to execution. Thereby, the database loses the flexibility to execute arbitrary transactions at arbitrary nodes. To maintain scalability while keeping data available, the virtually fully replicated database also needs to detect and adapt to unspecified data needs that arise during execution.
Our initial work shows that segmentation of the database, based on a priori known data needs and on database and application properties, improves scalability through scalable usage of three key resources. Therefore, reconfigurable segmentation of the database seems to be a viable extension for adapting to changed data needs and for regaining the flexibility lost by fixed segmentation. Further, the scalability of such an approach needs to be evaluated. A quantification of scalability is needed to evaluate the influence of different scale factors and of alternative scaling enablers.
This thesis explores, by detailed simulation, the scalability of a large-scale distributed real-time database that manages resources by using virtual full replication. Such a database is also capable of adapting to changed data needs while preserving the real-time properties of the application. The hypothesis of this thesis is that such a system is scalable, and that scalability can be preserved by adaptation during execution.
3.3 Problem decomposition
The problem has several aspects and we choose to divide it into the following components:
• Concept elaboration. The concept of virtual full replication has been pointed out as an approach for providing an image of full replication, while replicating only what is needed to create such an image [6]. The approach needs to be elaborated with algorithms and an architecture that can support it. Also, the conditions under which virtual full replication can provide such an image need to be identified.
• Formation and allocation of units of replication. Segmentation of the database, and replication of segments only to the nodes where the data is used, seems to be a useful approach for bounding resource requirements. The cost of segmentation needs to be elaborated: both the processing cost of running a segmentation algorithm, in particular during system run time, and the storage cost of the data structures needed to maintain segments. A part of this subproblem is to define how segments are appropriately formed and adapted, and which data properties to consider for saving resources by segmentation. Also, architectural support for segment management and for the replication of updates needs to be identified.
• Adaptation of database configuration. To maintain scalability of the database throughout execution, the virtually fully replicated database will need to adapt to the changed data needs of the database clients. For this purpose, adaptation algorithms and architectures need to be developed and evaluated, as well as the types of changes to consider for adaptation.
• Properties and their relations. Virtual full replication is based on knowledge about data properties and application requirements. In our present work, we have used only a few of the data properties that could be considered for segmentation and resource management. Further, we have shown that there exist relations between some of the usable properties, and such relations may be used to improve the management of resources. However, we have not yet defined a full framework of properties and their relations. In such a framework it may be useful to define application profiles of related properties that match typical groups of applications, to reduce the amount of information used for resource management.
• Scalability analysis and evaluation. From our initial studies we see that different scale factors influence scalability differently. So far, we have evaluated only a few scale factors, such as the number of nodes and the bound on the replication degree, and we will add others for a more extensive evaluation. The concept of virtual full replication does not limit how the image of full replication is built. Therefore, alternative scaling enablers may be explored and combined for resource management, including alternative approaches to the replication of updates, segmentation that considers differences in the cost of communication links, or different propagation topologies.
3.4 Aims
Our aims with this thesis are:
A1 To bound resource usage to achieve scalability in a distributed main memory real-
time database, by exploring virtual full replication as a scaling enabler.
A2 To show how scalability can be quantified for such a system.
A3 To show how properties and their relations can be used in resource management for
virtual full replication.
3.5 Objectives
O1 Elaborate the concept of virtual full replication.
O2 Bound the resources used by virtual full replication for expected data requirements.
O3 Bound the resources used by virtual full replication for unexpected data requirements.
O4 Define conditions for scalability, for expected and unexpected requirements.
O5 Quantify scalability in the domain, and evaluate the scalability achieved relative to these conditions.
4 Methodology
We develop and evaluate an approach to resource management for scalability based on virtual full replication. We assess scalability using analysis, simulation, and implementation, while evaluating the proposed scalability enablers. We start by elaborating the problem of full replication and present possible approaches to virtual full replication. As the next steps we refine the approach by using static and adaptive segmentation, and we elaborate segmentation algorithms for multiple properties of data and multiple requirements from database applications. For an initial evaluation of segmentation, we implement static segmentation in the DeeDS database prototype. We develop a simulation for evaluating adaptive segmentation and for studying large-scale systems. Adaptive segmentation is more complex to analyze, and also to implement in the actual database system, so we study its scalability in an accurate simulation of the system. As a sanity check of the simulation, we replicate the experiments performed with the existing DeeDS implementation. Further, the approach for adaptive segmentation is extended with pre-fetching of data for availability. Finally, we consolidate the findings about data properties and their relations, as used in our segmentation approach, into a documented framework of useful properties that can be used to improve the scalability of distributed real-time databases. Some of the steps in our methodology have already been completed, while other steps remain. Some of the described steps can optionally be excluded, which is indicated at each step.
4.1 Virtual full replication for scalability
M1: In the first research step, virtual full replication has been introduced and segmentation has been proposed as a means for achieving it [52]. The concept of virtual full replication was elaborated. By using knowledge about the actual data needs, and by recognizing differences in the properties of the data, segments can be allocated only to the nodes where the data is used. This bounds resource requirements at a lower level, while maintaining the same degree of (perceived) availability of data for the application. It reduces flexibility compared to a fully replicated database, since not all objects are available at all nodes; consequently, availability for unspecified accesses cannot be guaranteed. For hard real-time systems, access patterns and resource requirements are usually known prior to execution and the data needs can easily be specified. For such systems the reduced flexibility is not a problem, since virtual full replication then guarantees availability for all data needs.
The aspect of cohesive nodes is emphasized, where some nodes share information more frequently than others, or some nodes use data under the same consistency model. The paper describes work in progress and points out essential research problems in the area, such as the need for an analysis model of how system parameters influence critical resources. A first definition of scalability for a fully replicated database ('replication effort') was presented, and an algorithm for segmentation was introduced, although with high complexity and a combinatorial problem when using multiple data properties.
Multiple consistency models may coexist in the same database, since different segments can use different consistency models. To evaluate an approach for this, the coexistence of bounded and unbounded replication times for updates was implemented in the DeeDS prototype [48], following the proposed architecture for virtual full replication. In this implementation there are two segments with different types of eventual consistency: one with unbounded replication delays and one with a bound on replication delays.
4.2 Static segmentation
M2: In the second research step, we pursued deeper knowledge about static segmentation.
The segmentation approach for virtual full replication was refined [53].
We explore how virtual full replication can be supported by grouping data objects into segments. The database is segmented to group data objects that need to be accessed at the same set of nodes, and each segment is allocated only at the nodes where its data objects need to be accessed. Segmentation is an approach to managing resources by limiting the degree of replication for data objects, to avoid excessive resource usage. The data objects in a segment share key properties, where node allocation is one such property.
Segmenting a database for known data references is a trivial problem, since a partitioning and replication schema can be derived directly from the list of required accesses. This can be done by a database designer, or automatically from a list of operations in transactions. However, when other data properties are added, there is a combinatorial increase in the resulting number of segments with unique combinations of properties, as described in [49].
With a segmented database, each segmentation enables support for a specific property, and by allowing multiple segmentations on the same database it is possible to support several properties. A segmentation on consistency allows multiple concurrent consistency models in the same replicated database. This enables new types of applications for the DeeDS database system, since it allows data objects that cannot use eventual consistency. With a segmentation on storage medium, some segments may be allocated to disk instead of memory, when the requirement is to prioritize durability over timeliness or even efficiency.
By segmenting the database on combinations of properties we can find the largest possible physical segments, where all data objects of a segment can be managed uniformly. We can recover segments faster by block-reading an entire physical segment from a backup node, and we can also prioritize the recovery of critical segments.
When combinations of properties are introduced for the data to replicate, the number of segments multiplies with the number of data properties used and the number of values that each property can take. Consider a segmentation that allows two consistency models combined with segmentation for allocation. The resulting number of possible segments may double compared to a segmentation for allocation only. To find the segments in a database where multiple properties are combined, a naive algorithm needs to loop through the combinations of values of each property for all data objects. Also, the more properties that are used for segmenting the database, the smaller the segments become. In the extreme case, each object has its own unique combination of properties and each segment contains a single object. This gives an upper bound on the number of possible segments, although it is unlikely in a database where object usage tends to cluster into cohesive sets of data objects. Thus, there is a need for an approach to segmenting data on multiple properties that is efficient, that considers clustering, and that captures and uses knowledge of how properties can be combined.
In this step we have presented an algorithm that handles the combinatorial problem of segmenting a database on multiple properties, where multiple and overlapping segmentations of the database are allowed, and where dependencies between properties are considered [53]. We introduce logical segments that allow overlapping segmentations, where each segmentation represents a property or a set of properties of interest. From the logical segments, we can derive physical segments, such as allocation and recovery units. With the refined segmentation algorithm, we can segment a database by multiple properties without a combinatorial problem. The algorithm also recognizes the relations between data properties, which can be dependent on each other, unrelated, or even mutually dependent. The dependency relation implies that a dependent property cannot be supported if the property it depends upon is not supported. Relations between properties reduce the number of combinations of segmentations for multiple properties, and are the basis on which profiles may be created. Our approach allows control of relations through rules that can be applied to the set of combined properties. The approach also allows specification of data clustering, which sets the common properties equal for clustered objects, typically objects shared by cohesive nodes.
In current work, we use our algorithm for automatic segmentation and for the allocation of units of distribution (physical segments), based on application knowledge about the data used by transactions and on other knowledge about data properties originating in the application semantics.
For future work, we plan to extend the algorithm to maintain scalability at execution time, supporting mode changes and unspecified data requests.
4.2.1 Static segmentation algorithm
Our approach to static segmentation is based on associating a set of properties with each object; objects with the same or similar property sets can then be combined into segments. We introduce an example by first considering a single property: the nodes where the data is accessed.
Consider a scenario with the following five transaction programs executing as seven transaction instances, in a database with at least six objects replicated to at least five nodes: T1.1 = <r: o1, o6; w: o6; N1>, T1.2 = <r: o1, o6; w: o6; N4>, T2 = <r: o3; w: o5; N3>, T3 = <r: o5; w: o3; N2>, T4.1 = <r: o2; w: o2; N2>, T4.2 = <r: o2; w: o2; N5>, T5 = <r: o4; N3>. Based on these accesses, the objects have the following access sets:
o1 = {N1, N4}, o2 = {N2, N5}, o3 = {N2, N3}, o4 = {N3}, o5 = {N2, N3}, o6 = {N1, N4}.
These particular access sets give a segmentation with an optimal placement of data: s1 = <{o4}, N3>, s2 = <{o3, o5}, N2, N3>, s3 = <{o1, o6}, N1, N4>, s4 = <{o2}, N2, N5>.
For an implementation, we use a table to collect all properties of interest. For the algorithm used with the table, the data accesses of all transactions are marked in a table with data objects as rows and nodes as columns. See Figure 1 (left) for a database of 6 objects replicated to 5 nodes. Note that rows in the table could equally well represent groups of data objects, such as object classes or user-defined object clusters that share the same properties. By assigning a binary value to each column, each row can be interpreted as a binary number that forms an object key identifying the nodes where the object is accessed. By sorting the table on the object key, we get the table in Figure 1 (right). Passing through this table once, we can collect rows with the same key value into unique segments and allocate each segment at the nodes marked. See Algorithm 1 for pseudo code.
[Figure omitted: the access-location table for the 6 objects and 5 nodes, shown before and after sorting on the object key; column values 1, 2, 4, 8, 16 are assigned to nodes N1–N5, and sorted rows with equal keys form segments Seg 1–Seg 4.]
Figure 1: Segmentation principle, used for segment allocation
The sort operation dominates the computational complexity of this algorithm. Thus, the algorithm segments the database in O(o log o) time, where o is the number of objects in the database.
Such a property set can be extended with additional knowledge, and more properties of objects can be handled uniformly by extending the property sets. Assume that the database clients require that o2 must have a bound on its replication time, and that o1 and o6 are allowed to be temporarily inconsistent. Also, assume that objects o3, o4, o5 need to be immediately consistent. Further, assume that objects o1, o3, o5, o6 are specified to be stored in main memory, while objects o2, o4 are stored on disk. To control these multiple properties, we extend the property sets as follows (resulting in the table shown in Figure 2):
o1 = {N1, N4}, {asap}, {memory}
o2 = {N2, N5}, {bounded}, {disk}
o3 = {N2, N3}, {immediate}, {disk}
o4 = {N3}, {immediate}, {disk}
o5 = {N2, N3}, {immediate}, {memory}
o6 = {N1, N4}, {asap}, {memory}
The sort operation likewise dominates the computational complexity of the implemented algorithm. One sort operation is used for each segmentation of interest, and typically only a reduced number of segmentations are used. Thus, the algorithm segments the database in O(o log o) time for multiple properties as well, where o is the number of objects in the database. This is far better than the naive algorithm for multiple-property segmentation presented in [49], which has a computational complexity of O(o!).
By selecting particular properties, segmentations for special purposes can be generated
/* Mark the read and write sets */
clear(access);
for i ← 1 to numtransactions do
    for j ← 1 to T[i].numnodes do
        for k ← 1 to T[i].R.numreads do
            access(T[i].R.read[k], T[i].node[j]) ← 1;
        end
        for k ← 1 to T[i].W.numwrites do
            access(T[i].W.write[k], T[i].node[j]) ← 1;
        end
    end
end
/* Assign key values to node columns */
for j ← 1 to numnodes do
    access.colKey[j] ← 2^(j−1);
end
/* Calculate object keys */
for i ← 1 to numobjects do
    access.Key[i] ← BuildKey(access, i, numnodes);
end
/* Sort rows on object key */
SortTable(access.Key, access);
/* Find segments and allocations */
currKey ← −1;
currSeg ← NIL;
for i ← 1 to numobjects do
    if currKey ≠ access.Key(i) then
        currKey ← access.Key(i);
        currSeg ← NewSeg(currKey);
    end
    currSeg.Add(access.oid(i));
end
Algorithm 1: Segmentation algorithm for segment allocation
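The table-and-sort procedure of Algorithm 1 can be sketched compactly in Python. This is a minimal illustration, not the DeeDS implementation: hashing the frozen access set plays the role of the sorted binary object key, but it groups objects into segments the same way. The access sets are taken from the worked example above.

```python
from collections import defaultdict

def segment(access_sets):
    """Group objects with identical access sets into segments.

    access_sets: dict mapping object id -> set of node ids where the
    object is read or written. Returns a list of (objects, nodes) pairs,
    one per segment.
    """
    groups = defaultdict(list)
    for obj, nodes in access_sets.items():
        # frozenset is hashable, so it can serve as the grouping key.
        groups[frozenset(nodes)].append(obj)
    return [(sorted(objs), sorted(ns)) for ns, objs in groups.items()]

# Access sets from the worked example in Section 4.2.1
access = {
    "o1": {"N1", "N4"}, "o2": {"N2", "N5"}, "o3": {"N2", "N3"},
    "o4": {"N3"}, "o5": {"N2", "N3"}, "o6": {"N1", "N4"},
}
for objs, nodes in segment(access):
    print(objs, "->", nodes)
```

With hashing, the expected cost is O(o) rather than the O(o log o) of the sort-based table; the table formulation, however, extends more directly to multiple properties and to derived physical segments.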
[Figure omitted: the access table extended with consistency columns (imm, asap, bound; column values 32, 64, 128) and storage-medium columns (mem, disk; column values 256, 512), giving extended object keys.]
Figure 2: Multiple property table
from the complete property set. For instance, units of physical allocation can be generated from a subset containing only the properties 'access' and 'storage medium'. By executing the segmentation algorithm on this subset, the resulting segmentation gives the segments for physical allocation to nodes and for recovery of segments.
4.2.2 Properties, dependencies and rules
There may be conditions on how properties can be combined. For instance, to guarantee transaction timeliness three conditions must be fulfilled: the data needs to be stored in main memory, it must be available at the node where it is accessed, and the database needs to replicate updates by detached replication. Detached replication can be done as soon as possible (asap) or in bounded time (bounded). Consider o2 in the property set above. To guarantee timeliness, we need to change the storage of object o2 to main memory, making the storage property consistent with the replication-method property. Thus, we use a known dependency between properties to ensure consistency among the settings of the properties in use.
Also, we may combine objects o3 and o5 into the same segment by changing the storage of either of them. Both need immediate consistency, but there is no relation rule that requires a certain type of storage for that consistency model. We can select storage either in memory or on disk, but to put the objects in the same segment, both need to use the same storage setting. One choice could be to store them on disk, since disk storage is a cheaper resource than storage in memory.
Further, by additionally allocating o4 to node N2, we can create a segment consisting of objects o3, o4, o5. This allocation is no longer optimal with respect to the specification of known data accesses, but it may still be a valid choice, since it reduces the number of segments in the segmentation.
By considering properties of data and the relations between them, segmentations can be made consistent by fulfilling rules that specify the property dependency relation. There may also be other rules that influence the table, such as guaranteeing a minimum degree of data replication to provide a certain level of fault tolerance, or labeling clustered data, i.e., data objects that are replicated together and have the same property values. With such a clustering rule, the entries for the clustered data objects are set to the same value, typically based on the most restrictive combination of the properties of the clustered objects.
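A dependency rule of this kind can be applied mechanically to the property table. The sketch below is illustrative only; the rule, the property names, and the dictionary layout are assumptions based on the o2 example above, not the DeeDS rule engine:

```python
def apply_rules(props):
    """Enforce a property-dependency rule on a dict: object -> properties.

    Assumed rule from the example: an object whose replication must
    complete in bounded time also needs main-memory storage, since disk
    access would break the timeliness guarantee.
    """
    result = {}
    for obj, p in props.items():
        p = dict(p)  # copy, so the input table is left untouched
        if p["consistency"] == "bounded" and p["storage"] == "disk":
            p["storage"] = "memory"  # make storage consistent with replication
        result[obj] = p
    return result

props = {
    "o2": {"consistency": "bounded", "storage": "disk"},
    "o4": {"consistency": "immediate", "storage": "disk"},
}
fixed = apply_rules(props)
print(fixed["o2"]["storage"])  # forced to memory by the rule
print(fixed["o4"]["storage"])  # unchanged: no rule applies
```

A real rule set would be declarative rather than hard-coded, as noted for the implementation in Section 4.2.5, but the principle of rewriting property values to satisfy dependencies is the same.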
4.2.3 Analysis model and assumptions
To evaluate resource usage for the implementation of static segmentation, we present an analysis model for the usage of three key resources. Bandwidth usage is reduced when fewer nodes need to be updated. Such a system also requires less overall storage, since each node hosts only a subset of the database objects. As a consequence, the overall processing for conflict detection and resolution of inconsistent replicas is lowered, since fewer nodes host conflicting replicas.
We evaluate our approach by examining how three important system resources are used when selected system parameters are scaled up. The baseline for comparison is a fully replicated database with detached replication, such as DeeDS. First of all, a large-scale distributed database with many nodes requires scalable bandwidth usage, and full replication does not scale well since an update generates O(n^2) replication messages for n nodes. Secondly, large distributed and fully replicated databases store the entire database on each local node, requiring storage of O(nm) data objects for a database of m objects. Thirdly, conflict detection and resolution keep O(n^2) processing time, since every node that has a replica must resolve a conflict caused by any other node. Based on our database model, we present an analysis of how these three resources scale, both for a virtually fully replicated and for a fully replicated system.
We assume that the database is distributed to n nodes, and that with segmentation the degree of replication is limited to k ≤ n nodes. In such a database, every update needs to be replicated to k − 1 nodes. With network functions for multicast or broadcast, the system could replicate data to a specific set of nodes with a single send, reducing the processing time required to send the data and the number of messages sent on the network, but not the integration time or the conflict detection and resolution time of an update. Currently, we focus on networks with point-to-point communication only, since we argue that scalability in such networks is a problem that can be generalized to many other kinds of networks. Replication of several updates for the same segment replica could also be coordinated to reduce communication overhead, but in our current analysis model we assume that each update is sent individually. Thus, we consider each object update to generate k − 1 single update messages. Our simple analysis model does not consider individual degrees of replication ki for each segment, but assumes the same degree of replication, k, for all segments.
Selective replication by multicast has been studied before [4] and may be included in
future work as part of a larger study involving other network topologies than the sin-
gle network resource used in the current analysis model. We also consider coordinated
propagation of replication messages to be a future extension of the work in this paper.
The number of transaction instances in the database is denoted |Tj|. We characterize each transaction, Tj, by its frequency, fj, the size of its read set, rj, the size of its write set, wj, and the size of its conflict set, cj. Thus, for the purpose of the analysis, the characteristics of each transaction can be modeled as Tj = <fj, rj, wj, cj>.
An analysis of network usage is essential, since a distributed system does not scale well if the communication cost prevents large systems from being built. Bandwidth is a shared resource that is critical for the scalability of the system. We choose to model this resource as a single shared network link, as is the case with traditional Ethernet or with time-shared communication such as that used in wireless sensor networks. A single shared network resource is more restrictive for scalability than, for example, a switched network, where the topology can relieve congestion by distributing communication over multiple independent network links.
A segmented database stores k replicas of each database object, instead of the n replicas stored in a fully replicated database. For applications that typically have a low replication degree, k, much memory can be saved overall by segmentation, and scalability is improved. An analysis of storage could consider the storage of database object replicas, the segment administration data structures, and the internal variables of the database system. The analysis in this paper considers the storage of data object replicas only, since the additional data structures are expected to be small or independent of whether the system is fully replicated or segmented.
Processing time for storing updates on the local node includes the time for locking, updating, and unlocking the data objects, which we denote L. Updates of data objects that have replicas on other nodes use additional processing time for propagating the update, including logging, packing, marshalling, and sending the update on the network. We denote the processing time for propagation P. With point-to-point communication, the sending node spends P time for each node that has a replica of the updated data object. At a node receiving an update replicated from another node, the database system must integrate the update. This includes receiving, unpacking, unmarshalling, locking the data objects, detecting conflicts, resolving conflicts, updating the database objects, and finally unlocking the data objects that were replicated. We denote the time used for integration I, excluding conflict detection and resolution, while the time used for conflict detection and resolution is denoted C. Processing time C occurs only when there are conflicting updates, while I is spent on every update replicated to a node. Integration processing is needed at all nodes having a replica of the originally updated data object.
4.2.4 Analysis
We analyze scalability by examining how the chosen resources are used according to our
analysis model.
Bandwidth usage depends on the number of updates replicated, including the new values and their associated version vectors. In our model, every transaction Tj, executed at frequency fj, generates one network message for each member of its write set wj, to update its replicas at k − 1 nodes. Such network messages are generated by all transactions at all nodes. We express bandwidth usage in Equation 3.
(k − 1) · Σ_{i=1}^{|Tj|} wi · fi    [messages/sec]    (3)
From the formula we see that bandwidth usage scales with the degree of replication. For a virtually fully replicated database with a limit k on the degree of replication, bandwidth usage scales with the number of replicas, O(k), rather than with the number of nodes, O(n), as is the case for a fully replicated database. However, the number of transactions often depends on the number of nodes, O(n), so bandwidth usage becomes O(kn) for the virtually fully replicated database and O(n^2) for the fully replicated database.
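Equation 3 is straightforward to evaluate for a concrete workload. The sketch below is illustrative; the workload numbers are invented for the example, not taken from DeeDS measurements:

```python
def messages_per_sec(transactions, k):
    """Equation 3: each of the w_i objects written by transaction i,
    executed at frequency f_i, is sent to k - 1 replica nodes."""
    return (k - 1) * sum(f * w for (f, w) in transactions)

# Hypothetical workload: (frequency in transactions/sec, write-set size)
workload = [(10, 2), (5, 1), (1, 4)]
print(messages_per_sec(workload, k=3))   # bounded replication degree
print(messages_per_sec(workload, k=10))  # full replication on 10 nodes
```

With the bound k = 3 this workload generates 58 messages/sec regardless of how many nodes are added, while full replication on n = 10 nodes generates 261, illustrating the O(k) versus O(n) behaviour discussed above.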
For the storage of a database with s segments, there are ki replicas of each segment of size si. We express the required storage for a virtually fully replicated database in Equation 4.
Σ_{i=1}^{s} (si · ki)    [objects]    (4)
In our analysis model we assume the same degree of replication, k, for all segments: ∀i (ki = k). Thus, each of the o data objects in the database is replicated at k nodes, and the required storage can be expressed as o·k. With full replication to n nodes, n replicas of each data object are stored and the required storage is o·n. This means that the storage of a virtually fully replicated database scales with the limit on replication, k, rather than with the number of nodes, n, as is the case with a fully replicated database.
The processing time used for replication depends on the write set of each transaction, wi, resulting in propagation time, P, and integration time, I, for each node replicated to.
Additionally, the conflict set, ci, requires conflict detection and resolution time, C. Thus, we can express the processing time as Σ_{i=1}^{|Tj|} fi · {[L + (k − 1)P + (k − 1)I] · wi + [(k − 1)C] · ci}, or as Equation 5.

Σ_{i=1}^{|Tj|} fi · [L·wi + (k − 1)·{(P + I)·wi + C·ci}]    [sec]    (5)
Similarly to Equations 3 and 4, this formula shows that processing scales with the degree of replication rather than growing with the number of nodes in the system. Also, the sizes of the write set and the conflict set influence the amount of processing time required.
With multicast operations available on the network used, the analysis formulas become different for bandwidth and for processing. Replication of updates then sends only one message for all the replicas, and bandwidth usage can be expressed as in Equation 6.
For processing, each update is sent only once on the network, but there is still integration time, and conflict detection and resolution are also done at every replica receiving an update. Thus, the corresponding processing time can be expressed as in Equation 7.
Σ_{i=1}^{|Tj|} wi · fi    [messages/sec]    (6)
Σ_{i=1}^{|Tj|} fi · [L·wi + P·wi + (k − 1)·{I·wi + C·ci}]    [sec]    (7)
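Equations 5 and 7 can be compared directly in code. The sketch below is illustrative; the workload and the cost constants L, P, I, C are invented for the example:

```python
def processing_p2p(transactions, k, L, P, I, C):
    """Equation 5: point-to-point replication; propagation and
    integration are both paid once per extra replica."""
    return sum(f * (L * w + (k - 1) * ((P + I) * w + C * c))
               for (f, w, c) in transactions)

def processing_multicast(transactions, k, L, P, I, C):
    """Equation 7: one multicast send per update, but integration and
    conflict handling still occur at each of the k - 1 receivers."""
    return sum(f * (L * w + P * w + (k - 1) * (I * w + C * c))
               for (f, w, c) in transactions)

# Hypothetical workload: (frequency, write-set size, conflict-set size),
# with per-operation costs in seconds
workload = [(10, 2, 1), (5, 1, 0)]
costs = dict(L=1e-4, P=2e-4, I=3e-4, C=5e-4)
print(processing_p2p(workload, 3, **costs))
print(processing_multicast(workload, 3, **costs))
```

For this workload, multicast reduces the processing from about 0.038 to about 0.033 seconds of CPU time per second of operation: it removes the per-replica propagation term P, but not the per-replica integration and conflict-handling terms, which is why multicast helps bandwidth more than it helps processing.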
4.2.5 Implementation
The segmentation algorithm has been used to implement virtual full replication in the DeeDS prototype, where the database was segmented into fixed segments with individual degrees of replication [53].
In this implementation, transactions are specified in a configuration file that describes
the transactions with their data requirements, the data properties given by the application
semantics, and a specification of key capabilities of the database system. The transaction
specifications were translated into the property table. Some basic rules were hard-coded and
use only a few relations between properties; a generic system would need support for explicit
rule specification. The replication schema and physical segments are derived from the
property table using the static segmentation algorithm (Algorithm 1).
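The derivation of a replication schema can be pictured with a small sketch. This is only an interpretation of the idea behind Algorithm 1 (which is not reproduced here): data objects whose specified accessing node sets coincide are grouped into one segment:

```python
def segment(access_spec):
    """Group objects with identical accessing-node sets into segments.

    access_spec maps object id -> set of nodes that access the object, as
    derived from the property table (the structure is assumed for illustration).
    """
    segments = {}                    # frozen node set -> objects in that segment
    for obj, nodes in sorted(access_spec.items()):   # sorting: O(o log o)
        segments.setdefault(frozenset(nodes), []).append(obj)
    return segments

spec = {"o1": {"A", "B"}, "o2": {"A", "B"}, "o3": {"C"}}
schema = segment(spec)
# Two segments: {o1, o2} replicated at nodes {A, B}, and {o3} at node {C}.
```

The resulting schema directly gives the allocation: each segment is replicated exactly at the nodes in its key, so the degree of replication of a segment is the size of that node set.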
In the experiment we show here, all parameters and all but one scale factor were
fixed. The number of nodes was varied from 1 to 10, and the three resources of
bandwidth, storage and processing were measured. For virtual full replication, we limited
the maximum degree of replication to 3 replicas.
Figure 3: Bandwidth usage vs. Number of nodes

Figure 4: Storage vs. Degree of replication (segmentation overhead shown at bottom)

Figure 5: Processing vs. Number of nodes
Scalability has been examined by evaluating the usage of these resources. Additional
experiments were done in which we varied the degree of replication and the ratio of write
operations in transactions. In these experiments, we compared the original implementation
of a fully replicated database with the new segmentation implementation using full
replication, as well as a lower degree of replication typical of a virtually fully
replicated setting.
Figure 3 shows bandwidth usage for up to 10 nodes, with a replication degree of k = 3
for virtual full replication (VFR). For systems larger than 3 nodes, the bandwidth usage
remains constant with virtual full replication, while for a fully replicated (FR) database
it increases with each node added. The measurements show the total number of
messages sent on the network to replicate one fixed-size segment in the database during a
fixed-length run of the experiment.
Figure 4 shows how storage requirements change for systems of up to 10 nodes; the
measurements show the amount of storage for one fixed-size segment. For a virtually fully
replicated database, storage scales with the limit on replication, while a fully
replicated database scales with the number of nodes. In this experiment we also measured
the overall segment management storage overhead, shown at the bottom of Figure 4.
Segment management data scales with the number of nodes in the current implementation.
This needs to be further analyzed and improved, so that it ideally scales with the degree
of replication instead.
Figure 5 shows the processing used at each node to replicate updates. Processing for
the virtually fully replicated database (VFR) does not increase as nodes are added, but
is bounded by the degree of replication. The fully replicated database (FR) scales with
the number of nodes, while the implementation of virtual full replication on top of a
fully replicated database (FR new) adds processing overhead compared to FR.
4.2.6 Discussion
The curves match the behavior predicted by our analysis in Equations 3, 4 and 5 well. For
a full large-scale evaluation, experiments need to be run for a larger number of nodes, which
could not be done successfully with the current database prototype implementation. Our
analysis in Section 4.2.4 complements the implementation evaluation for this research step
by giving an analytical model of large-scale behavior. Our next step addresses the problem
of studying a large-scale implementation.
4.3 Simulation for scalability evaluation
M3: We use simulation to examine large-scale systems. The first simulation experiment
has been performed [50], strengthening the findings about bandwidth usage from the
implementation of static segmentation in DeeDS. The simulator will be further developed
to run experiments with 1000 nodes and more. Such large-scale systems cannot easily be
run as a DeeDS implementation, since that would require an infeasible amount of resources.
Our simulation approach uses a representation of bandwidth, storage and processing,
rather than using the actual resources as an implementation in the DeeDS prototype does.
One example of the advantage of simulation is the storage of the database itself: the
simulation does not store the actual data, but only information describing the data objects.
By using only a value for the size of each object instead of storing the object, the amount
of storage required for the data objects can still be modeled while using fewer resources.
With such a representation, network messages never carry actual data object values, only
information about the size and version of the update. A simulation of large-scale systems
requires that the system be represented so that the simulation itself is scalable. Since we
study resource usage by the replication mechanisms of DeeDS and its extensions for virtual
full replication, this is what is primarily modeled. The representation needs to be further
refined to be scalable to 1000 nodes and more.
Simulation enables an evaluation of static segmentation for large-scale virtually fully
replicated databases. In previous work we implemented virtual full replication in the
DeeDS database prototype. That implementation could not give precise results, since only
a few nodes could be run in the experiment, so only limited conclusions about large-scale
behavior could be drawn from it. With a simulation experiment, the database nodes are
simulated and the resources used can be represented as data entities rather than consuming
actual resources in the system.
A simulation is only a model of the real system, but with a properly designed
simulation we have the opportunity to understand the large-scale behavior of virtual
full replication. Before a simulation is used it must be shown to be valid for the purpose
of the study, so our approach in this research step is a two-stage effort. The first stage
has been to assemble relevant literature as a background on validation of simulations. The
second stage is to model and implement the simulation, and this work is in progress. An
initial simulation has been done and is reported in [50].
4.3.1 Motivation for a simulation study
Our motivations for simulating a distributed real-time database are:
• Large scale experiments with the prototype database system are infeasible, due to
the amount of hardware and the size of the installation required.
• The analysis done in the previous step, of resource usage for bandwidth, storage and
processing, gives a model and an estimate based on a set of assumptions. The
proposed algorithm for static segmentation needs to be used in a large-scale setting to
measure actual resource usage. A simulation that is tailored to mimic the actual
distributed database system may give a more detailed understanding of the large-scale
behavior of a virtually fully replicated database than the analysis alone.
• Executing large-scale experiments with the actual system would require more software
engineering effort than a simulation, since development of the database prototype
is expected to result in a fully working implementation. With a simulation, the
implementation can focus on the segmentation and replication processing in
isolation.
4.3.2 Simulation objectives
The objectives of the simulation in this experiment are:
• To measure the usage of three key resources, bandwidth, storage and processing
time, in a simulation of a distributed real-time database with eventual consistency
that uses virtual full replication.
• To detect how system parameters (the independent variables) influence resource usage
by measuring the usage for increasingly larger systems. In particular, we want to
identify variables that make resource usage grow at a non-scalable rate.
• To evaluate whether we can establish a useful simulation platform for our research,
in order to experiment with subsequent research ideas in the area.
4.3.3 Validation of software simulations
Simulating a phenomenon of interest in a computer system involves creating a model of
the system to be studied. This inherently means making a simplification of the actual
system that is valid only under a set of known assumptions. Introductions to simulation
can be found in [44], [38] (in particular parts IV and V) and [11]. To measure and
draw conclusions from such a model, the model must be justified as correct for
the intentions of the study. Thus, the simulation objectives need to be very clear. The
literature on validation of simulations is abundant, ranging from high-level approaches
establishing taxonomies for simulations in general, to detailed work on simulations of
parallel computing. Here, we relate work that has significance for validation of
simulations of large-scale systems, in particular distributed databases. Important work
includes [8], [11] and [69].
Simulation is useful where the behavior of the real system cannot easily be analyzed.
This includes cases where the input or the model has some stochastic component, or where
the computation of an analysis is complex. Statistics is important for evaluating simulation
results, but not all statistical techniques can properly be applied to simulations [41].
Statistics can also be used for validation of the simulation model itself [10].
In work by Shannon [72], it was concluded that a simulation cannot be a full representation
of the system under study; rather, the simulation must be reduced to meet the particular
objectives of the study. Detailed general-purpose simulations tend to cost much in processing
time. Validation of simulations is the process of determining whether a simulation model
accurately represents the system for the objectives of the study; it is a matter of creating
confidence that the model represents the system under the given objectives. Simulation
confidence is not a binary value, but is gradually strengthened by tests for validity. In [68],
concrete tests for validation are presented:

• Degenerate tests: how does the model's behavior change with changed parameters?
• Event validity: how does the sequence of events in the simulation correlate with the events of the real-world system?
• Extreme-condition tests: how does the model react to extreme and unlikely stimuli?
• Face validity: how does the model correspond to expert knowledge about the real system?
• Fixed values: how does the model react to typical values for all combinations of representative input variables?
• Historical data validation: how is historical data about the real system used for the simulation model?
• Internal validity: how does the system react to a series of replication runs for a stochastic model? For high variability the model can be questioned.
• Sensitivity analysis: how do changes in input parameters influence the output? The same effect should be seen in the real system, and the parameters with the highest effect on the output should be carefully evaluated for accuracy against the real system.
• Predictive validation: the outcomes of forecasting by the simulation and by the real system should correlate.
• Traces: execution path behavior should correlate.
• Turing tests: expert users of the real system are asked whether they can discriminate between outputs from the real system and the simulation.
The validation of our simulation is two-fold. First, the current DeeDS implementation
can run a limited subset of the experiments intended for the simulation; thus, a sanity
check is to run some selected experiments with the DeeDS implementation as well. Second,
the implementation will be validated through an examination by the DeeDS project
members. This will ensure that the simulation model follows the intentions and the
semantics of the actual DeeDS prototype, both as described in published research papers
and as embodied in the actual prototype implementation.
4.3.4 Modeling detail
The accuracy of a model must be sufficient to evaluate the objectives of the study at
hand. A more accurate model than required will result in a simulation that uses excessive
resources, and may not even be useful to run due to long execution times. To establish the
level of modeling detail, the simulation objectives must be defined. It is infeasible to model
all aspects of the system (absolute isomorphism), so the effort should be to model at a level
of detail and with a representation that is appropriate for the objectives. Only reality itself
is a generally valid model; the model and the simulation can only be valid in the context of
the assumptions and the objectives. Balci [7] uses a set of criteria for modeling: model
verification is the correct transformation into the simulation model; model validation is
compliance with the simulation objectives, within the domain of applicability; model testing
determines that the model and the simulation function properly; model accreditation is the
official certification to use the model for a specific purpose.
Graph-based approaches exist for modeling systems to be simulated, including state
charts and automata. Graphs are easily understood, and a clear notation of behavior
may increase confidence that the model is valid. Simulation graphs have long been used
to document simulation models and were developed into event graphs [71]. Modeling by
event graphs opens up ways of structuring the representation of behavior, as done with
event pattern mappings [31]. A derivative of event graphs is the resource graph [37],
where graph nodes represent events rather than states, and where transitions represent
conditions of the simulation. Resource graphs have been shown to be valuable, compared
to job-driven simulations, for simulations of large-scale and congested systems [67].
4.3.5 Approaches for validity and credibility
A model is validated by building confidence that it reflects the real-world problem in
terms of the objectives to evaluate. For validity it is more important to have a model with
confidence than a fully detailed and accurate model. Therefore, it is important to
know how to increase confidence in the model. Robinson [66] describes how validation
is improved by addressing model confidence:
Iterated verification and validation need to be used along with iterative development
of the model itself. Also, a given view of the world may not hold for validation: real-world
data is often inaccurate, and the world representation results from a certain interpretation
of the world. There may not even be a real-world situation to compare against; consider
a full-scale nuclear war, which is a likely subject for simulation but very hard to compare
to a valid real-world situation. Robinson refers to common techniques in the area for
handling typical problems in validation:

• Conceptual modeling: it is important that the modeler acquires a deep understanding of the real-world system to be modeled, which requires close collaboration with domain experts. In this way the modeler can understand alternative interpretations from which objectives and a conceptual model can be developed. The conceptual model needs to be validated, with review of the modeling objectives and the modeling approach by domain experts.
• Data validation: real-world data must be validated for accuracy by exploring and evaluating the data source to determine data reliability, completeness and consistency.
• White-box validation: inspection of the simulation implementation improves the correctness of the model implementation and its compliance with the conceptual model. Control of events, flow and logic is central; code reviews, execution trace analysis and output analysis are used for this.
• Black-box validation: inspection of the overall behavior of the model. This also includes correlation analysis comparing with the behavior of a real-world system. Comparison can also be done with another model or simulation of the same or a similar problem, but differences in simulation objectives and problems must be considered.
For a credible model there is agreement that the model reflects the real system
appropriately for the experiment [28]. To establish credibility, there must be agreement
on the assumptions of the model, and evidence of validation and verification is needed.
There is a risk that an agreed model, regarded as credible, is still not valid due to
improper validation. To further improve credibility, the model can be accredited.
Accreditation is the formal acceptance that the model and the simulation are approved
for use in a specific experiment or usage [9] [7].
4.3.6 Experimental process and experiment design
The experiment consists of the following steps:
1 Modeling detail and simulator implementation
The simulation implementation in this experiment mimics replication in the DeeDS
database prototype system. In particular, specific functions for database operations,
data replication, and the usage of resources in these operations are modeled in detail.
2 Model and implementation validation
The implementation will be fully reviewed by the DRTS research group's members, as
a walk-through of the implementation, including a discussion of proper representation
of the DeeDS database prototype with respect to the issues studied. This step
is important for verification, but also for validation. Also, the implementation is
compared to a few reference execution cases from the implementation of static
segmentation in DeeDS [53].
3 Experiment variables
In the first experiment we chose to study three system parameters (the independent
variables): the number of nodes, the size of the database and the share of update
operations in transactions. We will extend the study to include more independent
variables that could affect scalability, possibly using principal component analysis
[25] to group related variables. A full set of independent variables for such an experiment
would be: database size, number of nodes, ratio of update transactions, ratio of write
operations in transactions, ratio of conflicting updates, frequency of transactions,
distribution of transactions (where transactions are instantiated), and the bound on
degree of replication.
4.3.7 Simulator implementation
The implementation is based on an existing distributed database simulator, which models
a database system with distributed transactions and load balancing. The earlier purpose
of this simulator was to evaluate different approaches to load balancing, and several
papers have been published based on these simulations [79] [78]. The simulator was
previously assessed for modifiability for building the functions required to simulate
detached replication, and it was found to be a sufficient base. According to Law
and Kelton [44], a discrete-event simulation needs a number of base components, and
our simulation includes all of the components they point out.
• System state. The notion of a system state is represented by key variables in the
simulator we use.
• Simulation clock and a model of time. The simulator uses a "next event time
advance" model that steps up logical time to continuously process events. A simulation
with few, short events steps up time faster than a simulation that contains many events
with much processing in them.
• Event list. There are event queues both for individual nodes and for overall event
processing, such as network events. The simulator always processes the next
upcoming events from these queues. In addition, the simulator contains queues for
transaction states, such as "blocked" and "ready", with priorities for these actions.
There is a transaction scheduler, and transactions can be prioritized and modeled
as preemptable.
• Statistical counters. The simulator uses the statistical Java package Colt from CERN
and collects various statistics during execution.
• Report generator. After reaching a termination condition, the collected statistics are
assembled into an extensive report with key figures.
• Main program. The entire simulation is run in a single thread, coded in Java.
• Initialization routines.
• Library routines. A realistic load generator, based on parameters of actual system
inputs.
The simulation has been extended with key functions needed for our experiments, in an
order where each step can be validated separately. The following extensions have been
made:
1 Update accounting
Version vectors were implemented for managing concurrent updates. Detection of
concurrent updates includes both write-write conflicts and read-write conflicts.
2 Replication
DeeDS uses detached replication as the mechanism for replicating updates independently
of transaction execution, after transaction commit, and detached replication
was implemented in the simulation. The simulation implements the features needed
for detached replication, such as shadow page updates, replication logs, propagation
of database updates to other nodes after transaction commit, and conflict
detection by version vectors, and it has a simplified representation of conflict resolution.
By varying the bound on degree of replication from the number of nodes down to a
lower degree, resource usage for full replication can be compared to usage at different
lower degrees of replication.
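Conflict detection with version vectors can be sketched as follows, assuming one logical-update counter per node (a common formulation; the details of the DeeDS variant may differ):

```python
def dominates(v1, v2):
    """True if v1 has seen every update that v2 has seen."""
    nodes = set(v1) | set(v2)
    return all(v1.get(n, 0) >= v2.get(n, 0) for n in nodes)

def concurrent(v1, v2):
    """Neither vector dominates the other: the updates are concurrent."""
    return not dominates(v1, v2) and not dominates(v2, v1)

# Nodes A and B update the same object independently of each other:
a = {"A": 2, "B": 1}       # A's replica after a local write
b = {"A": 1, "B": 2}       # B's replica after a local write
print(concurrent(a, b))    # True -> conflict detected, resolution needed
```

If one vector dominates the other, the updates are causally ordered and integration can simply apply the newer value; only the concurrent case triggers resolution.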
3 Segmentation
To mimic virtual full replication by static segmentation, the database initialization
allocates data objects to randomly selected nodes in the system, where the number of
nodes for each data object allocation does not exceed the bound on degree of
replication. Replication of updates uses the data object allocations to replicate
only to the nodes that host the data objects.
Further, transactions in the simulation are generated so that they contain accesses only
to data objects that are allocated at the node where the transaction executes.
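The initialization step above can be sketched like this (illustrative Python, not the simulator's Java code; the fixed seed only makes the example repeatable):

```python
import random

def allocate(objects, nodes, k, seed=0):
    """Allocate each data object to k randomly selected nodes,
    where k is the bound on degree of replication."""
    rng = random.Random(seed)
    return {obj: set(rng.sample(nodes, k)) for obj in objects}

def local_objects(allocation, node):
    """Objects a transaction executing at `node` is allowed to access."""
    return [obj for obj, hosts in allocation.items() if node in hosts]

# 500 objects over 10 nodes with a replication bound of 3, as in the experiment:
allocation = allocate(range(500), list(range(10)), k=3)
accessible = local_objects(allocation, node=0)
```

Generating each transaction's read and write sets from `accessible` guarantees the static-segmentation property that no transaction touches data absent from its node.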
4 Load generation
The generation of load for the simulated database should reflect the load of a real
system. For virtual full replication with static segmentation, there are only transactions
that access data objects available at the node of execution. For experiments
with adaptive segmentation, the load must, in a controlled way, generate transactions
that try to access objects at 'new' nodes.
5 Instrumentation
To measure the network load, network usage accounting measures the size of simulated
network messages and accumulates the sizes into a total over a simulation run.
4.3.8 Experiment execution
The execution of the first simulation experiment was divided into three parts, in each of
which one independent variable was varied and the resulting bandwidth usage was measured.
For each sub-experiment we kept the other independent variables constant. The variables
used in the experiments were: number of nodes = 10, size of the database = 500 objects
(where data object size was set randomly between 1 and 128 bytes with uniform
distribution), bound on degree of replication = 3, number of transactions = 300, and
size of transaction = 2 update operations and 8 read operations. Each sub-experiment
was executed 10 times, and the average, maximum and minimum values of bandwidth usage
were recorded. The bandwidth usage was measured both for a fully replicated database
and a virtually fully replicated database, using the variable settings above.
1 Varying number of nodes. The number of nodes was varied from 1 to 30. (Figure 6)
2 Varying database size. The database size was varied from 100 to 2000 data objects.
(Figure 7)
3 Varying update operation share in transactions. Transaction size was fixed to 10
operations, but the number of update operations was varied from 2 to 20. (Figure 8)
4.3.9 Discussion and results
Figure 6: Number of nodes, VFR and FR, large systems (bytes sent vs. number of nodes; curves: VFR MAX, VFR MIN, FR MAX, FR MIN)
Figure 7: Database size, VFR and FR, large systems (bytes sent vs. database size [x100]; curves: VFR MAX, VFR MIN, FR MAX, FR MIN)
Figure 8: Update transactions (bytes sent vs. write operations in transactions; curves: VFR MAX, VFR MIN, FR MAX, FR MIN)
This experiment varies three important scale factors (number of nodes, database size,
share of write operations), while in our experiments with the actual DeeDS implementation
we reported the usage of key resources (bandwidth, storage, processing). Thus,
only one configuration in the simulation experiment is comparable with the DeeDS
prototype experiments: the bandwidth used when scaling the number of nodes. Additionally,
the DeeDS prototype experiment measured the bandwidth for replicating one single segment,
while the corresponding simulation experiment measured the total bandwidth usage;
therefore the diagrams are not identical. Keeping this difference in mind, we can compare
the growth of bandwidth usage when adding nodes to the system (Figure 3 and Figure 6).
We can estimate from the graphs that the resource usage grows as indicated by Equation 3.
However, the comparison is very limited and needs to be extended in subsequent work.
That work also includes extensions of the simulation so that storage and processing
resource usage are modeled, allowing the simulation to be compared with and validated
against the experiments with the DeeDS prototype implementation.
In an experiment like this, definitive conclusions can be drawn only for the studied ranges
of the independent variables; in our case, 1-30 nodes, 100-2000 database objects and 2-20
update operations in transactions. The examination of a validated simulation may be
sufficient for the purpose, but only an elaborated analytical model would allow conclusions
about bandwidth usage for arbitrary values of the independent variables.
By building a simulation for virtual full replication we have gained detailed insight
into how to model the database system, which can be used to refine the model. In a
simulation, variables can be better controlled, and the transaction load can be varied
under full control, which gives the opportunity to examine a large range of application
types. A simulation also enables the study of a system at a scale that would be impractical
with the actual system.
There are some additional scalability problems that need to be addressed. Detached
replication uses version vectors in which each node keeps a copy of every other node's
version vector. This is not scalable in the number of nodes, but dynamic version vectors
[65] can be used to solve this, using 'generations' of version vectors as done in work by
Gustavsson [32]. Furthermore, the current simulation implementation is not scalable due
to some as yet unknown design limitations, since some simulation settings run out of
resources on the computer where the simulation is executed. As a consequence, the
continued simulation implementation needs to be assessed for how the simulation model
should represent functions of the prototype system in a scalable, yet correct, way. This
problem will be addressed in the design of the subsequent simulation implementation.
4.4 Adaptive segmentation
M4: With a replication schema that meets the requirements of a static specification of
database usage, the virtually fully replicated database does not make data objects
available for unspecified data accesses. This is a disadvantage compared to a fully
replicated database. To address this problem for virtual full replication, we introduce
adaptive segmentation, which allows unspecified data accesses for soft transactions.
Data objects that are not allocated to a node are loaded on demand, and the database
is re-segmented based on the new information that becomes available at execution time.
The approach in this part of our method is still under development; below we present our
current findings and the related challenges.
The segmentation information at each node keeps track of which segments are allocated
to the node. This includes the data objects in a segment, the consistency model used by the
segment, the required deadlines for replication, and a list of all the nodes that the segment
is allocated to. When an access arrives that uses a data object not available at the
node, processing is initiated for loading this data object from another node.
The information about where data can be found is made available by listing all data
objects in the database with their residence and segment assignment, and storing this
list at each node. This simple approach does not scale with the size of the database,
and the representation needs to be elaborated in the work to come. One approach is to
store information only about the data objects available at the node. Missing data can
still be detected, but finding nodes from which the data can be loaded then requires a
(bounded) search.
For adaptiveness, we will primarily consider the following changes to the replication
schema: unspecified accesses, added and removed nodes, and added and removed data objects.
The level of fault tolerance is also considered: we can apply a minimum degree of
replication for a segment, and when a node fails, this degree needs to be restored. This
is similar to the problem of making data available for unspecified accesses, and we expect
to use the same approach to restore the minimum degree of replication.
The segmentation algorithm we have presented [53] is expected to be efficient enough
for adapting the segmentation online. Segment tables (based on the updated property sets)
can be updated when new information arrives, in O(o log o) time.
We propose to use incremental recovery [45] for loading segments that contain the data
objects that are detected as missing at a node. On-demand loading of segments will delay
transactions while loading a missing segment onto a node, and the timeliness will depend
on the timeliness of the network, which makes segmentation adaptation unsuitable for
hard transactions.
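The on-demand loading path for a soft transaction might be structured as in the sketch below. The class and the directory structure are illustrative assumptions, and the loading itself (incremental recovery) is abstracted away:

```python
class Node:
    def __init__(self, name, segments, directory):
        self.name = name
        self.segments = segments      # segments currently allocated locally
        self.directory = directory    # segment id -> set of hosting nodes

    def access(self, segment):
        if segment not in self.segments:
            # Unspecified access: the soft transaction is delayed while the
            # segment is loaded from another replica, then re-segmentation
            # records the new allocation.
            source = next(iter(self.directory[segment]))
            self.segments.add(segment)
            self.directory[segment].add(self.name)
            return f"loaded {segment} from {source}"
        return "local"                # subsequent accesses execute timely

node = Node("C", set(), {"s1": {"A"}, "s2": {"A", "B"}})
print(node.access("s1"))   # first access pays the loading delay
print(node.access("s1"))   # prints "local": no further delay
```

Note that only the first access is delayed; once the directory records the new replica, later transactions on the segment execute locally, matching the behavior described above.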
To maintain scalability and to reduce storage usage over time, segments also need
to be unloaded from nodes when they are no longer used. The load and unload policies are
application dependent. Some applications may not need to unload segments, but will
reach an equilibrium where all segments ever needed are available and continuously
in use. Other applications, with mode-changing characteristics, may need to switch data
sets when changing the mode of execution. The combination of the frequency of changes
and the available storage may cause 'thrashing' that keeps the database system busy
loading and unloading data, leaving no processing time for database transactions.
This is an open problem that needs to be studied in this part of our work.
Reconfiguration of segments takes effect immediately at the local node, and as soon as
data has been loaded to the local node, the pending transaction can continue its execution.
Subsequent transactions on the loaded segment will not need to wait, but will execute
timely at the node. Reconfiguration of segments is assumed to always be initiated at the
node of an access, and then propagated to all other nodes that need the information about
the change. In the DeeDS database, such propagation can be done by the database
replication mechanism, detached from the update of the local segment knowledge. Since
such propagation has eventual consistency, the other nodes do not receive the information
at the same time. Therefore, user updates of data objects in a segment that is being
loaded to a node need to be forwarded from the node the data is loaded from, as
long as not all nodes have received the propagated change of segmentation. The problem
of propagating and forwarding updates to nodes where segments have been newly loaded is closely
connected to the issue of how information about segments is represented at nodes, and the
two will be addressed together.
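The forwarding period can be sketched as bookkeeping at the source node: keep forwarding user updates to newly added replicas until every node has acknowledged the new segment schema. All names here are illustrative, not part of the DeeDS design.

```python
class SourceNode:
    """Sketch of update forwarding during a segment reconfiguration.

    While the new segment schema propagates with eventual consistency,
    the node the segment was loaded from forwards user updates to the
    newly added replica, so no update is lost before every node has
    learned about the change.
    """

    def __init__(self, all_nodes):
        self.all_nodes = set(all_nodes)
        self.acked = set()          # nodes known to have the new schema
        self.forward_to = set()     # replicas added by reconfiguration

    def add_replica(self, node):
        self.forward_to.add(node)

    def on_schema_ack(self, node):
        self.acked.add(node)
        if self.acked == self.all_nodes:
            # Everyone now replicates to the new node directly, so
            # explicit forwarding is no longer needed.
            self.forward_to.clear()

    def targets(self):
        return set(self.forward_to)

src = SourceNode({"n1", "n2", "n3"})
src.add_replica("n3")
print(src.targets())          # {'n3'}: still forwarding
for n in ("n1", "n2", "n3"):
    src.on_schema_ack(n)
print(src.targets())          # set(): propagation complete
```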
The conditions for maintaining scalability are an important issue in this part of the
work. Applications that require most of the data at all nodes will closely resemble a fully
replicated database, and be less scalable than a database where clients require only a
low degree of replication. We need to find out in more detail for which types of applications
virtual full replication is beneficial, and for which it only adds overhead.
4.4.1 Adaptive segmentation with pre-fetch
M5: With adaptive recovery of segments to nodes, virtual full replication can make data
available on request for soft transactions. For hard transactions, data must be available at
the time the transaction executes at the node. We extend adaptive segmentation with pre-
fetch of data to nodes, to make data available at a certain time point, or at repeating time
points separated by an interval. This enables adaptation for hard transactions as well.
One condition for this is that the time instant of an access, or the frequency of repeating
accesses, can be specified. Another condition is that segments must be recovered in
bounded time, which implies that the system uses a real-time network.
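For specified access instants, the pre-fetch deadline follows directly: recovery must start no later than the access time minus the worst-case recovery time. A sketch, assuming such a bound (`wcrt`) exists on a real-time network; the names are illustrative.

```python
def prefetch_schedule(accesses, wcrt):
    """Compute the latest time to start segment recovery for each
    specified access, so the data is local when the hard transaction
    runs. `wcrt` is the assumed worst-case recovery time over a
    real-time network; without such a bound no guarantee is possible.
    """
    return {seg: t - wcrt for seg, t in accesses.items()}

# Hypothetical accesses: segment -> time instant of a hard transaction.
print(prefetch_schedule({"s1": 100.0, "s2": 250.0}, wcrt=20.0))
# {'s1': 80.0, 's2': 230.0}
```

For periodic accesses with interval T, the same computation applies per period, starting recovery `wcrt` before each release time.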
Since many real-time systems run transactions periodically, or at least with some re-
peating pattern, it may additionally be possible to detect the patterns of the accesses.
With simple mechanisms for detecting periodic accesses, we may be able to detect
when data needs to be available. This can be extended to include advanced pattern detec-
tion and also pattern prediction. It seems unlikely that detection and prediction can be
correct all the time, so they may only be used to improve availability for soft real-time
transactions. Pre-fetch of data by advanced pattern detection and prediction is an optional
step in our methodology.
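A simple periodicity detector of the kind envisioned here might just test whether inter-arrival times are stable and, if so, predict the next access one period ahead. This is an illustrative sketch, not a mechanism defined by the thesis; the tolerance parameter is an assumption.

```python
def detect_period(timestamps, tolerance=0.1):
    """Detect a periodic access pattern from access timestamps.

    If all successive inter-arrival times agree with their mean within
    a relative `tolerance`, return (period, predicted_next_access);
    otherwise return None. Mispredictions are possible, so this only
    helps soft transactions.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        return None
    mean = sum(gaps) / len(gaps)
    if all(abs(g - mean) <= tolerance * mean for g in gaps):
        return mean, timestamps[-1] + mean
    return None

print(detect_period([10.0, 20.0, 30.0, 40.0]))   # (10.0, 50.0)
print(detect_period([10.0, 13.0, 30.0, 31.0]))   # None: not periodic
```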
To maintain scalability, data objects that are not accessed for a long time may need
to be removed from nodes. This cost needs to be weighed against the setup time for data
objects. Adding and removing data objects at nodes resembles virtual memory, cache
coherency, and garbage collection, and our approach will be related to these concepts.
Removal of objects is critical for hard transactions and must be done in a way that
guarantees availability at the time of access by the hard transaction.
Replicas of data objects that have been assigned to nodes, either by a pre-specified
static segmentation or by pre-fetching of data for hard transactions, cannot be removed
if there is a risk that removal jeopardizes the timeliness of a hard transaction. Pinning
such data objects is one approach to prevent them from being removed from memory
by a generic removal policy based on known accesses. Removal may also mean moving
data from memory to disk at the same node. Pinning an object requires an explicit
specification, in a similar manner as static segmentation or as an explicit specification of
time instants for accesses by hard transactions.
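Pinning can be sketched as a filter in the removal policy: replicas that are pinned, or that are needed by a known future access, are never eviction candidates. The names below are illustrative.

```python
def evictable(replicas, pinned, known_accesses):
    """Return replicas that a generic removal policy may remove.

    A replica is protected if it is explicitly pinned (e.g. pre-specified
    for hard transactions) or needed by any known future access; only
    the remaining replicas are eviction candidates.
    """
    needed = pinned | known_accesses
    return [obj for obj in replicas if obj not in needed]

replicas = {"a", "b", "c", "d"}
print(sorted(evictable(replicas, pinned={"a"}, known_accesses={"b"})))
# ['c', 'd']
```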
4.5 A framework of properties
M6: We have presented an initial approach in which we can express the relations between
data, application, and architecture properties [53]. We elaborate this into a generic frame-
work that can express properties and relations. This part of our methodology is more of a
result than a separate step, since its content follows from the preceding methodology
steps. Optionally, this step includes a study of typical applications and a mapping of their
properties into our framework. This can be done by a literature study or by studying
actual applications. Applications are expected to be available through the research group’s
collaborating partners in the car industry (several Volvo companies), in real-time
operating systems (ENEA), and in real-time database applications (Polyhedra).
4.6 Case study and implementation
M7: The properties of bounded resources for scalability, adaptation by reconfiguration,
and the ability of disconnected operation may be well suited to a wireless sensor network.
Compared to our system model, which assumes a high-bandwidth, high-connectivity,
low-latency network, a wireless sensor network has low bandwidth, lower connectivity,
and high latency due to multi-hop communication. Despite our current network model, we
have properties that seem to match those of a wireless sensor network, such as
scalability and disconnected operation, and we aim to do a case study to evaluate virtual
full replication in such a context.
Also, we have initiated a collaboration project with Polyhedra, one of the major
suppliers of real-time databases. A potential outcome of this project is to apply virtual full
replication in practice and enable replicated database nodes with real-time performance,
where database clients get the same service independently of which node is accessed. This
part of the methodology is an optional step; a successful collaboration project will
enable a commercial system context for validating the research, and give an opportunity
to apply the findings of this thesis in different problem domains of Polyhedra clients.
5 Related work
In addition to the related work referred to in each section of this Thesis Proposal, the
following areas are related to the work.
5.1 Strict consistency replication
In the area of distributed real-time databases there have been many efforts to define syn-
chronization mechanisms that support strict consistency between replicas. Efforts for strict
consistency in large distributed databases include distributed two/three-phase commit,
one-copy serializability correctness [12], concepts of consistency control and confinement,
master-copy replication, and quorum consensus approaches. Helal et al. [34] give a good
introduction to this area. It is complex to provide predictability in replicated databases
with strict consistency, in particular at network partitioning, but approaches exist [21]
[1]. This system model differs from ours, where we allow temporal inconsistencies
to favor availability and predictability.
5.2 Replication with relaxed consistency
Relaxing the requirement of fully consistent replicas enables systems that are more
concurrent, since fewer replicas need to be involved in updates. Many approaches exist for
controlling consistency between replicas under relaxed consistency. Work on consistency
control includes handling partitioning [24] and containing consistency divergence [85].
Several approaches exist for consistency management, such as eventual consistency [14],
epsilon serializability [62], and eventually-serializable consistency [27]. With relaxed
consistency there can be concurrent updates to different replicas of the same data.
There are write-write conflicts, where replicas are concurrently written and the last update
is the one that is stored. There are also read-write conflicts, where a write operation
updates a replica using data that is not consistent with other replicas. To detect write-
write and read-write conflicts between replicas, version vectors [60] and variants [65] can
be used. Also, global time or logical clocks can be used to find the order between updates.
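Version vector comparison, which this kind of conflict detection builds on, can be sketched as follows. This is the standard technique, not specific to any of the cited systems; each replica carries a map from node id to that node's update count.

```python
def compare(v1, v2):
    """Compare two version vectors (node id -> update count).

    Returns 'equal', 'before', 'after', or 'conflict'; 'conflict'
    means neither vector dominates the other, i.e. the replicas were
    written concurrently (a write-write conflict).
    """
    nodes = set(v1) | set(v2)
    le = all(v1.get(n, 0) <= v2.get(n, 0) for n in nodes)
    ge = all(v1.get(n, 0) >= v2.get(n, 0) for n in nodes)
    if le and ge:
        return "equal"
    if le:
        return "before"   # v1's updates are a subset of v2's history
    if ge:
        return "after"
    return "conflict"

print(compare({"n1": 2, "n2": 1}, {"n1": 2, "n2": 3}))  # 'before'
print(compare({"n1": 3, "n2": 1}, {"n1": 2, "n2": 3}))  # 'conflict'
```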
5.3 Partial replication
For partially replicated distributed databases and also partially replicated distributed file
systems many approaches exist. Strict consistency models include: Coda file system [70],
the Replica modularization technique [76], Group communication based partial replication
[4], Allocation to nodes for database partitions [20]. Weak consistency system models in-
clude: Optimistically replicated file systems [64], Quasi copies [5]. Peer-to-peer techniques
46
(P2P) are used for sharing large data entities [22] [13], but often optimize on storage
structures and storage allocation to make search for particular content efficient [47], or for
improving efficiency by different replication and caching strategies [46] [63].
5.4 Asynchronous replication as a scalability approach
The Ficus file system [33] uses partial optimistic replication of files, but without a concept
of granularity. For replication models of large-scale distributed databases there are several
approaches. A hierarchical asynchronous replication protocol [2] replicates to neighbor nodes
using two consistency models, eventually resulting in replication to all nodes of the system.
Epidemic replication is used to disseminate updates [35]. Optimal placement of replicas
allows for large systems [83]. Recent approaches for selective replication in sensor networks
include ’JITRTR’ [61] [77] and ’ORDER’ [78]. A framework is available for comparing
replication techniques for distributed systems, in particular for distributed databases [80].
5.5 Adaptive replication and local replicas
The need for adaptive change of allocation and replication of database objects in (large)
distributed databases has been identified in the literature [82], including relational [15] and
object database approaches [84]. Our approach to adaptive allocation of data objects
(both allocation and deallocation) also relates to the abundant work in the areas
of cache coherency [5] [81] [59], including for mobile databases [19], database buffer management
[36] [39] [23], and virtual memory [26]. We see some differences between our work and the
work done in these areas. With caching, copies are kept locally or in memory to improve
performance. Cached data can be evicted from the cache to free memory for more
frequently used data, which consequently slows down accesses to the non-cached data
in secondary memory. With our approach, we do not allow any access to data other
than in local memory, and we cannot evict data objects from the local node for any
of the known accesses. Also, while virtual memory loads a previously unused piece of
memory from disk, adaptive segments contain data already in use at other nodes.
5.6 Usage of data properties
Sivasankaran et al. [73] present an approach for using data properties for improved
performance and availability in real-time systems that uses cache coherency and buffer
management for several levels of memory, but for a non-distributed database.
6 Conclusions
6.1 Summary
The problem of scalability of large distributed real-time databases is pursued. Several
approaches exist for bounding transaction timeliness and distributed agreement on
updates in distributed real-time databases, and often such approaches aim to bound the time
for distributed execution of transactions. In our approach, there are no distributed trans-
actions that the timeliness of user transactions depends on. All data is available at the
local node in main memory, and replication is done detached from the execution of the
transaction, so all user transactions execute timely at the local node, independently of
network delays and of distributed agreement on updates.
This thesis proposal suggests an approach, for such a database system, that addresses
the problem of scalability. A fully replicated database has a high cost of resource usage
that prevents large systems from being built. Scalability is evaluated in terms of the bandwidth,
storage, and processing time used for replicating the entire database to all nodes. Under
the assumption that, in many applications, fractions of the database are each shared by only a
few nodes, the overall degree of replication can be reduced. With resource management
that uses knowledge about the actual data needs of database clients, replication can be
directed to where it is needed instead of blindly replicating to all nodes. For the applications, the
database keeps an image of full replication, and the database is said to be virtually fully
replicated.
The first part of the thesis proposal considers statically specified data accesses. The result-
ing replication schema provides the data at the nodes where it is specified to be used, but
lacks the advantage of a fully replicated database, where transactions can access arbitrary
data at any node. Our first implementation of virtual full replication partitions the data-
base into statically specified segments, where each segment has an individual degree of
replication, which bounds the usage of resources. For many applications it is not possible
to know all data accesses before the system executes. To address this problem, we suggest
an adaptive database system that can provide data at nodes even if data accesses cannot
be specified in advance, by adapting to changed data requirements. The implementation will
use our segmentation algorithm to adapt segmentation and replication to follow the actual
data needs over time, while maintaining database scalability.
6.2 Initial results
We have introduced the concept of virtual full replication and proposed an architecture
to support it. We have presented an approach for static segmentation that avoids a
combinatorial problem, which would give a large number of small segments and be computationally
complex to generate. This approach handles segmentation on multiple properties, and
multiple segmentations can be generated in O(o log o) time, where o is the number of objects.
We have presented an initial analysis model for three key resources that can be used for
scalability analysis of static segmentation. Our table-based implementation approach is
expected to be useful with adaptive segmentation as well, which is our next step.
We have implemented static segmentation in the DeeDS database prototype and measured
the usage of three resources to evaluate scalability. We have implemented a simulation
model to evaluate static segmentation, which can be used for studying the large-scale behavior
of a virtually fully replicated distributed main memory database according to the database
system model we use.
6.3 Expected contributions
The thesis will present virtual full replication in detail, for a scalable distributed real-
time database that supports the required properties of the database clients and bounds
resource usage for scalability. We quantify scalability in the distributed database setting
and evaluate the scalability achieved. We also identify the usage conditions for scalability
and the limitations of the concept. A framework for properties and their relations
in typical applications is used for resource management. Virtual full replication is
evaluated by simulation and by implementation, optionally in an industrial setting.
6.4 Expected future work
The data access patterns for pre-fetching data studied in this thesis proposal are limited
to simple access patterns. Here, we consider accesses as single time instants of transactions,
or as periodically executing transactions with a minimum inter-arrival time. There is
abundant work available in the area of pattern detection and prediction, and it is
expected that the waiting time for soft transactions while adapting the replication schema
can be improved by pre-fetching data with more precision, or even by speculatively loading
data to nodes. Branch prediction for execution caches is one area where similar problems
are expected to be found.
In this thesis proposal we exclude giving guidelines for how to design database appli-
cations based on the data properties and relations we define in our work. As a result
of the thesis, we expect to know the conditions under which virtual full replication can provide
scalability, and its limitations. The next step would be to apply these findings to
applications.
The findings in this work may point out certain application types that benefit most
from virtual full replication. An implementation project using it in an industrial application
would validate the concept and our approach for large-scale distributed real-time databases.
A Project plan and publications
The project plan contains both past and upcoming activities. Past activities can be seen
in Table 1.
Period          Activity
Mid ’02         Admitted as PhD student
Sep ’02         Segmentation in a Distributed Main Memory Real-Time Database, MSc Thesis [49] (M1)
Apr ’03         Virtual Full Replication: Achieving Scalability in Distributed Real-Time Main Memory Systems [52] (M1)
Sep-Dec ’03     Bounded delay replication, definition and experiment (M1)
Dec ’04         Real-time communication through distributed resource reservation [51]
Sep ’04-May ’05 Implementation of segments in the DeeDS prototype database system, definition and experiment (M2)
Aug ’05         Virtual Full Replication by Static Segmentation for Multiple Properties of Data Objects [53] (M2)
Fall ’05        Simulator implementation and experiments for static segmentation (M3)

Table 1: Past activities
A.1 Time plan and milestones
Spring 2006
MILESTONE : Thesis Proposal at University of Skovde: Method and initial results.
Fall 2006
Implement simulation of adaptive segmentation, with support for multiple data proper-
ties. Experiments for adaptive segmentation. (M4)
Publication: ”Adaptive Segmentation for Virtual Full Replication”. (M4)
Thesis: Preliminary results
Spring 2007
Extend simulation of adaptive segmentation with pre-fetch. Experiments for adaptive
segmentation with pre-fetch. (M5)
Publication: ”Adaptive Segmentation with Pre-fetch for Virtual Full Replication”. (M5)
Case-study and publication: Virtual Full Replication by adaptive segmentation in a
wireless sensor network. (M7)
Optional implementation of virtual full replication in a commercial database system.
(M7)
MILESTONE : Thesis results
Write-up
Fall 2007
Publication: ”A data property framework for virtual full replication in distributed real-
time databases”. (M6)
Write-up
MILESTONE : Thesis defense
A.2 Publications
A.2.1 Current
Segmentation in a Distributed Main Memory Real-Time Database [49]
This Master’s Thesis outlines the approach of segmenting a database for lower resource
usage, where a few nodes share parts of the database. An initial segmentation algorithm,
basic architectural support, and an approach for specifying data needs are presented.
Virtual Full Replication: Achieving Scalability in Distributed Main
Memory Real-Time Systems [52] This work-in-progress paper relates the scal-
ability problem to a typical communication scenario and discusses the complexity of com-
munication in it. We show how segmentation can be used to maintain availability while
lowering resource usage in replication effort and storage requirements.
Real-time communication through distributed resource reservation [51]
This technical report presents an approach for real-time communication over Ethernet.
Real-time communication is required for bounded-time replication in a segmented data-
base. Different segments may have different requirements on the timeliness of replication
delays on the network, and this approach ensures bounded propagation time by distrib-
uted agreement on bandwidth usage among participating nodes, which enables bounded-
time replication of database updates.
Virtual Full Replication by Static Segmentation for Multiple Proper-
ties of Data Objects [53] This conference paper presents a table-based approach
for controlling multiple properties of data, used in a segmentation algorithm that
runs in O(o log o), where o is the number of database objects. Such a table allows appli-
cation of rules for how properties may be combined, such as limitations on combinations
of properties and clustering of data objects. From such a property table, both logical and
physical segments may be derived. The paper also introduces a resource analysis model
for bandwidth, storage, and processing time that can be used for scalability analysis. We
present measurement results from an implementation of static segmentation in DeeDS.
A.2.2 Planned
Adaptive Segmentation for Virtual Full Replication This paper will present
an extension of the table approach with algorithms and mechanisms for adapting
segmentation and replication to suit changed data needs. Multiple data properties and
their relations are considered for adaptive (re-)segmentation of the database. Data objects
used by hard transactions need to be available to guarantee the availability and timeliness
of such transactions. Changes in the segment configuration can be made in bounded
time, but real-time communication is required for timely setup of data objects over the
network. Once data objects are set up at a node by segment recovery, subsequent data
accesses are timely. We use a simulation approach to show that scalability is maintained
throughout execution, supporting timeliness of hard transactions and efficient execution
of soft transactions.
Adaptive Segmentation with Pre-fetch for Virtual Full Replication Al-
location of data objects to nodes delays those soft transactions that access data objects
not yet available at a node. Thus, the efficiency of soft transactions depends on
availability, and to improve availability data objects are assigned to nodes by pre-fetching
objects to nodes where accesses can be expected. Pre-fetching may counteract the effort of
limiting the number of replicas of objects that was introduced by virtual full replication.
However, for the timeliness of transactions, availability needs to be prioritized over optimal
object allocation, and the paper will examine the conditions for guaranteeing availability
in a scalable and adaptive system. We will present results from an experiment that shows
how scalability is influenced by pre-fetching data in an adaptive database with virtual full
replication.
A data property framework for virtual full replication in distributed
real-time databases Properties of the database clients, of the data itself, and
of the architecture are used for achieving virtual full replication. Considering the relations
between a selected set of key properties allows improved solutions for availability and scal-
ability for real-time applications using the database. Some relations may also restrict
usage and support for timeliness, and these important cases are defined. This paper will
present properties and relations in typical real-time applications and relate them in useful
usage profiles.
References
[1] A. E. Abbadi and S. Toueg. Maintaining availability in partitioned replicated data-
bases. ACM Trans. Database Syst., 14(2):264–290, 1989.
[2] N. Adly, M. Nagi, and J. Bacon. A hierarchical asynchronous replication protocol for
large scale systems. In Proceedings of the IEEE Workshop on Advances in Parallel
and Distributed Systems, pages 152–157, 1993.
[3] C. Allison, P. Harrington, F. Huang, and M. Livesey. Scalable services for resource
management in distributed and networked environments. In SDNE ’96: Proceedings
of the 3rd Workshop on Services in Distributed and Networked Environments (SDNE
’96), pages 98–105, Washington DC, USA, 1996. IEEE Computer Society.
[4] G. Alonso. Partial database replication and group communication primitives (ex-
tended abstract). In Proceedings of the 2nd European Research Seminar on Advances
in Distributed Systems (ERSADS’97), pages 171–176, January 1997.
[5] R. Alonso, D. Barbara, and H. García-Molina. Data caching issues in an information
retrieval system. ACM Trans. Database Syst., 15(3):359–384, 1990.
[6] S. Andler, J. Hansson, J. Eriksson, J. Mellin, M. Berndtsson, and B. Eftring.
DeeDS towards a distributed and active real-time database system. SIGMOD Record,
25(1):38–40, March 1996.
[7] O. Balci. Verification, validation and accreditation of simulation models. In WSC ’97:
Proceedings of the 29th conference on Winter simulation, pages 135–141, New York,
NY, USA, 1997. ACM Press.
[8] O. Balci. Verification, validation and testing. In J. Banks, editor, Handbook of
simulation. John Wiley, N.Y., 1998.
[9] O. Balci, P. Glosow, P. Muessig, E. Page, J. Sikora, S. Solick, and S. Youngblood.
Department of Defense Verification, Validation and Accreditation, Recommended
Practice Guidelines. Defense Modeling and Simulation Office, Alexandria, VA, 1996.
[10] O. Balci and R. G. Sargent. A methodology for cost-risk analysis in the statistical
validation of simulation models. Commun. ACM, 24(4):190–197, 1981.
[11] J. Banks, J. Carson, and B. Nelson. Discrete-Event System Simulation. Prentice-Hall,
Upper Saddle River, New Jersey, 2 edition, 1996.
[12] P. Bernstein and N. Goodman. An algorithm for concurrency control and recovery in
replicated distributed databases. ACM Transactions on Database Systems, 9(4):596–
615, December 1984.
[13] R. Bhagwan, D. Moore, S. Savage, and G. M. Voelker. Replication strategies for highly
available peer-to-peer storage. In A. Schiper, A. Shvartsman, H. Weatherspoon, and
B. Zhao, editors, Future Directions in Distributed Computing: Research and Position
Papers, volume 2584 of Lecture Notes in Computer Science, pages 153–157. Springer-
Verlag, Heidelberg, July 2003.
[14] A. Birrell, R. Levin, R. Needham, and M. Schroeder. Grapevine: an exercise in
distributed computing. Communications of the ACM, 25(4):260–274, April 1982.
[15] A. Brunstrom, S. T. Leutenegger, and R. Simha. Experimental evaluation of dy-
namic data allocation strategies in a distributed database with changing workloads.
In Proceedings of the fourth international conference on Information and knowledge
management, pages 395–402, 1995.
[16] A.-L. Burness, R. Titmuss, C. Lebre, K. Brown, and A. Brookland. Scalability evalua-
tion of a distributed agent system. In Distributed Systems Engineering 6, The British
Computer Society. IEE and IOP Publishing Ltd, 1999.
[17] L. Carrillo, J. L. Marzo, and P. Vila. About the scalability and case study of antnet
routing. In Proceedings of CCIA, Oct 2003.
[18] S. Ceri, M. A. W. Houtsma, A. M. Keller, and P. Samarati. Independent updates and
incremental agreement in replicated databases. Distrib. Parallel Databases, 3(3):225–
246, 1995.
[19] B. Y. Chan, A. Si, and H. V. Leong. A framework for cache management for mobile
databases: Design and evaluation. Distributed and Parallel Databases, 10(1):23–57,
July 2001.
[20] W. W. Chu, B. A. Ribeiro-Neto, and P. H. Ngai. Object allocation in distributed
systems with virtual replication. In F. Golshani, editor, Proceedings of the Eighth
International Conference on Data Engineering, pages 238–245, Tempe, Arizona, Feb-
ruary 3-7 1992.
[21] B. A. Coan, B. M. Oki, and E. K. Kolodner. Limitations on database availability when
networks partition. In Proceedings of the fifth annual ACM symposium on Principles
of distributed computing, pages 187–194, 1986.
[22] E. Cohen and S. Shenker. Replication strategies in unstructured peer-to-peer net-
works. In Proceedings of the 2002 conference on Applications, technologies, architec-
tures, and protocols for computer communications, pages 177–190, 2002.
[23] A. Datta, S. Mukherjee, and I. R. Viguier. Buffer management in real-time active
database systems. The Journal of Systems and Software, 42(3):227–246, 1998.
[24] S. Davidson. Optimism and consistency in partitioned distributed database system.
ACM Transactions on Database Systems, 9(3):456–481, 1984.
[25] G. Dunteman. Principal component analysis. Technical Report 07-69, Sage University
Paper, 1989.
[26] W. Effelsberg and T. Haerder. Principles of database buffer management. ACM
Trans. Database Syst., 9(4):560–595, 1984.
[27] A. Fekete, D. Gupta, V. Luchangco, N. A. Lynch, and A. A. Shvartsman. Eventually-
serializable data services. In Symposium on Principles of Distributed Computing,
pages 300–309, May 1996.
[28] C. A. Fossett, D. Harrison, H. Weintrob, and S. I. Gass. An assessment procedure for
simulation models: a case study. Oper. Res., 39(5):710–723, 1991.
[29] S. Frolund and P. Garg. Design-time simulation of a large-scale, distributed object
system. ACM Trans. Model. Comput. Simul., 8(4):374–400, 1998.
[30] D. Garlan. Software architecture: a roadmap. In ICSE ’00: Proceedings of the
Conference on The Future of Software Engineering, pages 91–101, New York, NY,
USA, 2000. ACM Press.
[31] B. A. Gennart and D. C. Luckham. Validating discrete event simulations using event
pattern mappings. In DAC ’92: Proceedings of the 29th ACM/IEEE conference on
Design automation, pages 414–419, Los Alamitos, CA, USA, 1992. IEEE Computer
Society Press.
[32] S. Gustavsson and S. Andler. Continuous consistency management in distributed
real-time databases with multiple writers of replicated data. In Workshop on parallel
and distributed real-time systems, Denver, CO, April 2005.
[33] R. Guy, J. Heidemann, W. Mak, T. Page Jr., G. Popek, and D. Rothmeier. Imple-
mentation of the Ficus replicated file system. In USENIX Conference Proceedings,
pages 63–71, June 1990.
[34] A. Helal, A. Heddaya, and B. Bhargava. Replication techniques in distributed systems.
Kluwer Academic Publishers, 1996.
[35] J. Holliday, D. Agrawal, and A. El Abbadi. Partial database replication using epi-
demic communication. In Proceedings of the 22nd IEEE International Conference on
Distributed Computing Systems (ICDCS02), pages 485–493, 2002.
[36] J. Huang and J. A. Stankovic. Buffer management in real-time databases. Technical
Report UM-CS-1990-065, University of Massachusetts, 1990.
[37] P. Hyden, L. Schruben, and T. Roeder. Resource graphs for modeling large-scale,
highly congested systems. In WSC ’01: Proceedings of the 33rd conference on Winter
simulation, pages 523–529, Washington, DC, USA, 2001. IEEE Computer Society.
[38] R. Jain. The art of computer systems performance analysis. John Wiley and Sons,
New York, 1991.
[39] R. Jauhari, M. J. Carey, and M. Livny. Priority-hints: an algorithm for priority-based
buffer management. In Proceedings of the sixteenth international conference on Very
large databases, pages 708–721. Morgan Kaufmann Publishers Inc., 1990.
[40] P. Jogalekar and M. Woodside. Evaluating the scalability of distributed systems.
IEEE Trans. Parallel Distrib. Syst., 11(6):589–603, 2000.
[41] J. P. C. Kleijnen. Validation of models: statistical techniques and data availability. In
WSC ’99: Proceedings of the 31st conference on Winter simulation, pages 647–654,
New York, NY, USA, 1999. ACM Press.
[42] A. Lahmadi, L. Andrey, and O. Festor. On the impact of management on the perfor-
mance of a managed system: a JMX-based management case study. In J. Schonwalder
and J. Serrat, editors, Ambient Networks: 16th Intl. WS on Distr. Systems: Oper-
ations and Management, volume 3775, pages 24–35, Oct 2005. Lecture Notes in
Computer Science.
[43] J. Landgren. Supporting fire crew sensemaking enroute to incidents. Int. J. Emer-
gency Management, 2(3), 2005.
[44] A. M. Law and D. M. Kelton. Simulation Modeling and Analysis. McGraw-Hill Higher
Education, Boston, 3rd edition, 2000.
[45] E. Leifsson. Recovery in distributed real-time database systems (HS-IDA-MD-99-
009). Master’s thesis, University of Skovde, Sweden, 1999.
[46] Z. Lu and K. McKinley. Partial replica selection based on relevance for information
retrieval. In Proceedings of SIGIR ’99, 22nd International Conference on Research
and Development in Information Retrieval, pages 97–104, 1999.
[47] Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker. Search and replication in unstructured
peer-to-peer networks. In Proceedings of the 2002 ACM SIGMETRICS, Measurement
and modeling of computer systems, 2002.
[48] J. Matheis and M. Mussig. Bounded delay replication in distributed databases with
eventual consistency (HS-IDA-MD-03-105). Master’s thesis, University of Skovde,
Sweden, 2003.
[49] G. Mathiason. Segmentation in a distributed real-time main-memory database (HS-
IDA-MD-02-008). Master’s thesis, University of Skovde, Sweden, 2002.
[50] G. Mathiason. A simulation approach for evaluating scalability of a virtually fully
replicated real-time database. Technical Report HS-IKI-TR-06-002, University of
Skovde, Sweden, Mar 2006.
[51] G. Mathiason and M. Amirijoo. Real-time communication through a distributed
resource reservation approach. Technical Report HS-IKI-TR-04-004, University of
Skovde, Sweden, Dec 2004.
[52] G. Mathiason and S. F. Andler. Virtual full replication: Achieving scalability in
distributed real-time main-memory systems. In Proc. of the Work-in-Progress Session
of the 15th Euromicro Conf. on Real-Time Systems, pages 33–36, July 2003.
[53] G. Mathiason, S. F. Andler, and D. Jagszent. Virtual full replication by static seg-
mentation for multiple properties of data objects. In Proceedings of Real-time in
Sweden (RTiS 05), pages 11–18, Aug 2005.
[54] J. McManus and W. Bynum. Design and analysis techniques for concurrent black-
board systems. IEEE Transactions on Systems, Man and Cybernetics, 26(6):669–680,
1996.
[55] A. Mitra, M. Maheswaran, and S. Ali. Measuring scalability of resource management
systems. In IPDPS ’05: Proceedings of the 19th IEEE International Parallel and Dis-
tributed Processing Symposium (IPDPS’05) - Workshop 1, page 119.2, Washington,
DC, USA, 2005. IEEE Computer Society.
[56] H. P. Nii. The blackboard model of problem solving. AI Mag., 7(2):38–53, 1986.
[57] D. Nussbaum and A. Agarwal. Scalability of parallel machines. Commun. ACM,
34(3):57–61, 1991.
[58] A. Oomes. Organization awareness in crisis management. In B. Carle and B. Van de
Walle, editors, Proc. ISCRAM2004, pages 63–68, 2004.
[59] S. Park, D. Lee, M. Lim, and C. Yu. Scalable data management using user-based
caching and prefetching in distributed virtual environments. In Proceedings of the
ACM symposium on Virtual reality software and technology, pages 121–126, November
2001.
[60] D. Parker and R. Ramos. A distributed file system architecture supporting high avail-
ability. In Proceedings of the 6th Berkeley Workshop on Distributed Data Management
and Computer Networks, pages 161–183, February 1982.
[61] P. Peddi and L. C. DiPippo. A replication strategy for distributed real-time object
oriented databases. In Proceedings of the Fifth IEEE International Symposium on
Object-Oriented Real-Time Distributed Computing, pages 129–136, Washington DC,
Apr 2002.
[62] C. Pu and A. Leff. Replica control in distributed systems: an asynchronous approach.
ACM SIGMOD Record, 20(2):377–386, 1991.
[63] K. Ranganathan, A. Iamnitchi, and I. Foster. Improving data availability through
dynamic model-driven replication in large peer-to-peer communities. In CCGRID ’02:
Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing
and the Grid, page 376, Washington, DC, USA, 2002. IEEE Computer Society.
[64] D. Ratner. Roam: A scalable replication system for mobile and distributed computing.
PhD thesis, University of California, Los Angeles, Jan 1998.
[65] D. Ratner, P. Reiher, and G. J. Popek. Dynamic version vector maintenance. Technical
Report CSD-970022, University of California, Los Angeles, 1997.
[66] S. Robinson. Simulation model verification and validation: increasing the users’
confidence. In WSC ’97: Proceedings of the 29th conference on Winter simulation,
pages 53–59, New York, NY, USA, 1997. ACM Press.
[67] T. Roeder, S. Fischbein, M. Janakiram, and L. Schruben. Resource-driven and
job-driven simulations. In Proceedings of the 2002 International Conference on Mod-
eling and Analysis of Semiconductor Manufacturing, pages 78–83, 2002.
[68] R. G. Sargent. Validation and verification of simulation models. In WSC ’92: Pro-
ceedings of the 24th conference on Winter simulation, pages 104–114, New York, NY,
USA, 1992. ACM Press.
[69] R. G. Sargent. Verifying and validating simulation models. In WSC ’96: Proceedings
of the 28th conference on Winter simulation, pages 55–64, New York, NY, USA, 1996.
ACM Press.
[70] M. Satyanarayanan, J. Kistler, P. Kumar, M. Okasaki, E. Siegel, and D. Steere. Coda:
A highly available file system for a distributed workstation environment. IEEE Trans.
Comput., 39(4):447–459, Apr 1990.
[71] L. Schruben. Simulation modeling with event graphs. Commun. ACM, 26(11):957–
963, 1983.
[72] R. E. Shannon. Tests for the verification and validation of computer simulation
models. In WSC ’81: Proceedings of the 13th conference on Winter simulation, pages
573–577, Piscataway, NJ, USA, 1981. IEEE Press.
[73] R. M. Sivasankaran, K. Ramamritham, J. A. Stankovic, and D. Towsley. Data place-
ment, logging and recovery in real-time active databases. In M. Berndtsson and
J. Hansson, editors, Proceedings of the First International Workshop on Active and
Real-Time Database Systems, pages 226–241, Skovde, Sweden, June 1995.
[74] X.-H. Sun and J. Zhu. Performance considerations of shared virtual memory machines.
IEEE Transactions on Parallel and Distributed Systems, 6(11):1185–1194, Nov 1995.
[75] B. Tatomir and L. Rothkrantz. Crisis management using mobile ad-hoc wireless
networks. In Proc. of Information Systems for Crisis Response and Management
ISCRAM 2005, pages 147–149, Apr 2005.
[76] P. Triantafillou and F. Xiao. Supporting partial data accesses to replicated data.
In Proceedings of the Tenth International Conference on Data Engineering, February
14-18, 1994, Houston, Texas, USA, pages 32–42, 1994.
[77] A. Uvarov and V. Fay-Wolfe. Towards a definition of the real-time data distribution
problem space. In Proceedings of the 23rd IEEE International Conference on
Distributed Computing Systems Workshop (DDRTS03), pages 170–175, May 2003.
[78] Y. Wei, A. Aslinger, S. H. Son, and J. A. Stankovic. ORDER: A dynamic replication
algorithm for periodic transactions in distributed real-time databases. In Proceedings
of Real-time and Embedded Computing Systems and Applications (RTCSA04), pages
152–169, Aug 2004.
[79] Y. Wei, S. H. Son, J. A. Stankovic, and K. Kang. QoS management in distributed
real-time databases. In Proceedings of the 24th IEEE Real-Time Systems Symposium,
Cancun, Mexico, pages 86–97, Dec 2003.
[80] M. Wiesmann, F. Pedone, A. Schiper, B. Kemme, and G. Alonso. Understanding
replication in databases and distributed systems. In Proceedings of the 20th IEEE
International Conference on Distributed Computing Systems (ICDCS00), pages 464–
474, Apr 2000.
[81] O. Wolfson and Y. Huang. Competitive analysis of caching in distributed databases.
IEEE Transactions on Parallel and Distributed Systems, 9(4):391–409, Apr 1998.
[82] O. Wolfson, S. Jajodia, and Y. Huang. An adaptive data replication algorithm. ACM
Transactions on Database Systems, 22(2):255–314, 1997.
[83] O. Wolfson and A. Milo. The multicast policy and its relationship to replicated data
placement. ACM Transactions on Database Systems, 16(1):181–205, 1991.
[84] L. Wujuan and B. Veeravalli. An adaptive object allocation and replication algorithm
in distributed databases. In Proceedings of the 23rd IEEE International Conference
on Distributed Computing Systems Workshop (DARES), pages 132–137, 2003.
[85] H. Yu and A. Vahdat. Design and evaluation of a conit-based continuous consistency
model for replicated services. ACM Transactions on Computer Systems, 20(3):239–
282, 2002.
[86] J. R. Zirbas, D. J. Reble, and R. E. vanKooten. Measuring the scalability of parallel
computer systems. In Supercomputing ’89: Proceedings of the 1989 ACM/IEEE con-
ference on Supercomputing, pages 832–841, New York, NY, USA, 1989. ACM Press.