
Durability for Memory-Based Key-Value Stores

Kiarash Rezahanjani

Dissertation for European Master in Distributed Computing Programme

Supervisor: Flavio Junqueira
Tutor: Yolanda Becerra

Jury

President: Felix Freitag (UPC)
Secretary: Jordi Guitart (UPC)
Vocal: Johan Montelius (KTH)

July 4, 2012


Acknowledgements

I would like to thank Flavio Junqueira, Vincent Leroy and Yolanda Becerra who helped me in this work, especially when my steps faltered. Moreover, I owe my gratitude to my parents, Souri and Mohammad, who have been a constant source of love, motivation, support and strength all these years.


Hosting Institution

Yahoo! Inc. is the world's largest global online network of integrated services, with more than 500 million users worldwide. Yahoo! Inc. provides Internet services to users, advertisers, publishers, and developers worldwide. The company owns and operates online properties and services, provides advertising offerings and access to Internet users through its distribution network of third-party entities, and offers marketing services to advertisers and publishers.

Its social media sites include Yahoo! Groups, Yahoo! Answers, and Flickr, which allow users to organize into groups and to share knowledge and photos. Its search products comprise Yahoo! Search, Yahoo! Local, Yahoo! Yellow Pages, and Yahoo! Maps, which help users navigate the Internet and search for information. Yahoo! also provides a large number of specific communication, information and lifestyle services. In the business domain, Yahoo! HotJobs provides solutions for employers, staffing firms, and job seekers, and Yahoo! Small Business offers an integrated suite of fee-based online services, including web hosting, business mail and an e-commerce platform.

Yahoo! Research Barcelona is a research lab hosted in the Barcelona Media Innovation Center that focuses on scalable computing, web retrieval, data mining and social media, including distributed and semantic search. This work was done in the Scalable Computing group of Yahoo! Research Barcelona.

Barcelona, July 4, 2012

Kiarash Rezahanjani


Abstract

The emergence of multicore architectures as well as larger, less expensive RAM has made it possible to leverage the performance superiority of main memory for large databases. Increasingly, large-scale applications demanding high performance have also made RAM an appealing candidate for primary storage. However, conventional DRAM is volatile, meaning that hardware or software crashes result in the loss of data. The existing solutions to this, such as write-ahead logging and replication, result in either partial loss of data or significant performance reduction. We propose an approach to provide durability to memory databases with a negligible overhead and a low probability of data loss. We exploit known techniques such as chain replication, write-ahead logging and sequential writes to disk to provide durability while maintaining the high throughput and the low latency of main memory.


Contents

1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Structure of the Document

2 Background and Related Work
  2.1 Background
    2.1.1 Memory Database
    2.1.2 Stable Storage
    2.1.3 Recovery
      2.1.3.1 Checkpoint
      2.1.3.2 Message Logging
      2.1.3.3 Pessimistic vs. Optimistic Logging
    2.1.4 Replication
      2.1.4.1 Replication Through Atomic Broadcast
      2.1.4.2 Chain Replication
    2.1.5 Disk vs RAM
  2.2 Related Work
    2.2.1 Redis
    2.2.2 RAMCloud
    2.2.3 Bookkeeper
    2.2.4 HDFS
  2.3 Discussion

3 Design and Architecture
  3.1 Durability
  3.2 Target Systems
  3.3 Design Goals
  3.4 Design Decisions
  3.5 System Properties
    3.5.1 Fault Tolerance Model
    3.5.2 Availability
    3.5.3 Scalability
    3.5.4 Safety
      3.5.4.1 Consistent Replicas and Correct Recovered State
      3.5.4.2 Integrity
    3.5.5 Operational Constraints
  3.6 Architecture
    3.6.1 Abstractions
    3.6.2 Coordination of Distributed Processes
    3.6.3 Server Components
      3.6.3.1 Coordination Protocol
      3.6.3.2 Concurrency
    3.6.4 Stable Storage Unit (SSU)
    3.6.5 Load Balancing
    3.6.6 Failover
    3.6.7 API
  3.7 Implementation

4 Experimental Evaluation
  4.1 Network Latency
  4.2 Stable Storage Performance
    4.2.1 Impact of Log Entry Size
    4.2.2 Impact of Replication Factor
    4.2.3 Impact of Persistence on Disk
  4.3 Load Test
  4.4 Durability and Performance Comparison

5 Conclusions
  5.1 Conclusions
  5.2 Future Work

References


List of Figures

2.1 Buffered logging in RAMCloud. Based on (1).
2.2 Bookkeeper write operation. Extracted from a Bookkeeper presentation slide (2).
2.3 Pipeline during block construction. Based on (3).
3.1 System entities.
3.2 Leader states.
3.3 Follower states.
3.4 Log server operation.
3.5 Storage unit.
3.6 Clustering decision based on the servers' available resources.
3.7 Failover.
4.1 Throughput vs. latency of our stable storage unit for different entry sizes with a replication factor of three.
4.2 Throughput vs. latency of a stable storage unit with replication factors of two and three for a log entry size of 200 bytes.
4.3 Throughput vs. latency of a stable storage unit for log entries of 200 bytes, with persistence to local disk enabled and disabled.
4.4 Throughput of a stable storage unit under sustained load.
4.5 Latency of a stable storage unit under sustained load.
4.6 Performance comparison of a stable storage unit and a hard disk.

List of Tables

4.1 RPC latency for different packet sizes within a datacenter.
4.2 Latency and throughput for a single client synchronously writing to stable storage unit.

1 Introduction

1.1 Motivation

In the past decades, disk has been the primary storage medium. Magnetic disks offer reliable storage and a large capacity at a low cost. Although disk capacity has improved dramatically over the past decades, the access latency and bandwidth of disks have not shown such improvements. Disk bandwidth can be improved by aggregating the bandwidth of several disks (e.g., RAID), but high access latency remains an issue.

To mitigate these shortcomings and improve the performance of disk-based approaches, a number of techniques are employed, such as adding caching layers and data striping. However, these techniques complicate large-scale application development and often become costly.

In comparison to disk, RAM (referring to DRAM) offers hundreds of times higher bandwidth and thousands of times lower latency. In today's datacenters, commodity machines with up to 32 gigabytes of DRAM are common, and it is cost-effective to have up to 64 GB of DRAM (1). This makes it possible to deploy terabytes of data entirely in a few dozen commodity machines by aggregating their RAM. The superior performance of RAM and its dropping cost have made it an attractive storage medium for applications demanding low latency and high throughput. As examples of such applications, the Google search engine keeps its entire index in RAM (4), the social network LinkedIn stores the social graph of all its members in memory, and Google Bigtable holds the block indexes of its SSTables in memory (5). This trend can also be seen in the appearance of many in-memory databases, such as Redis (6) and Couchbase (7), that use memory as their primary storage.

Despite the advantages of RAM over disk, RAM is subject to one major issue: volatility and, consequently, non-durability. Therefore, in the event of a power outage or a hardware or software failure, data stored in RAM will be lost. In memory-based storage systems operating on commodity machines, providing durability while maintaining good performance is a major challenge. The majority of existing techniques to provide durability of data, such as checkpointing and write-ahead logging, either do not guarantee persistence of the entire data or result in significant performance degradation. For example, in periodic checkpointing, committed updates in the interval between the last checkpoint and the failure point are lost, and in the case of write-ahead logging to disk, the write latency is tied to the disk access latency.

This work proposes a solution to provide durability for memory databases while preserving their high performance.

1.2 Contributions

We propose an approach to provide durability for a cluster of memory databases, on a set of commodity servers, with negligible impact on database performance. We have designed and implemented a highly available stable storage system that provides low-latency, high-throughput write operations, allowing a memory database to log its state changes. This enables durable writes with low latency and recovery of the latest database state in case of failure.

Our stable storage consists of a set of storage units that collectively provide fault tolerance and load balancing. Each storage unit consists of a set of servers; each server performs asynchronous message logging to record changes of the database state. Log entries are replicated in the memory of all the servers in the storage unit through chain replication. This minimizes the possibility of data loss caused by asynchronous writes in the case of server failures and increases the availability of logs for the purpose of recovery. Each server exploits the maximum throughput of its hard disk by writing the log entries sequentially.

Our solution is tailored to a large cluster of memory-based databases that store data in the form of key-value pairs and comply with the characteristics of social network platforms. The evaluation results indicate that our approach enables durable write operations with a latency of less than one millisecond while providing a good level of durability. The results also indicate that our storage solution is able to outperform conventional write-ahead logging on local disk in terms of latency. In addition to low response time, the system is designed to achieve high availability and read throughput through replication of log entries on several servers. The design also accommodates scalability by minimizing the interactions amongst the servers and utilizing local resources.

1.3 Structure of the Document

The rest of this document is organized as follows. Chapter 2 provides a brief introduction to several techniques and concepts related to this work. Further in that chapter, we review four systems that have influenced the design and discuss the approach used by each of them. In Chapter 3 we present our solution to the durability problem; we describe the properties of our system as well as the architecture and the implementation. Chapter 4 presents the results of the experimental evaluation and analyzes them. Finally, Chapter 5 concludes this thesis by summarizing its main points and presenting future work.


2 Background and Related Work

2.1 Background

2.1.1 Memory Database

In-memory or main memory database systems store the data permanently in main memory, and disk is usually used only for backup. In disk-oriented databases, data is stored on disk and may be cached in memory for faster access. Memcached (8) is an in-memory key-value store that is widely used for such a purpose. For example, Facebook uses Memcached to put data from MySQL databases into memory (9), and consistency between the Memcached and MySQL servers is managed by application software. In both kinds of systems an object can be kept in memory or on disk, but the major difference is that in main memory databases the primary copy of an object lives in memory, whereas in disk-oriented databases the primary copy lives on disk. Main memory databases have several properties that differ from disk-oriented databases, and here we mention the ones most relevant to this project.

The layout of data stored on disk is important: for example, sequential and random access to data stored on disk show a major performance difference, while the access method to memory is of little importance. Memory databases use data structures that allow leveraging the performance benefits of main memory. For example, the T-tree is mainly used for indexing in memory databases, while the B-tree is preferred for indexes of disk-based relational databases (10).

Main memory databases are able to provide far faster access times and higher throughput than disk-oriented databases. However, the latter provide stronger durability, as main memory is volatile and, in case of a process crash or power outage, data residing in memory will be lost (11). To mitigate this issue, disk is used as a backup for memory databases; hence, after a crash the database can be recovered. We will discuss several approaches to provide durability of data and recovery of the system state.


2.1.2 Stable Storage

There are three storage categories (12):

1. Memory storage, which loses data at the time of a process or machine failure or a power outage.

2. Disk storage, which survives power outages and process failures, except for disk-related crashes such as a disk head crash or bad sectors.

3. Stable storage, which survives any type of failure and provides a high degree of fault tolerance, usually achieved through replication. This storage model suits applications which require reading back the correct data after writing, with a very small probability of data loss.

2.1.3 Recovery

Recovery techniques in a distributed environment can become complicated when a globally consistent state has to be recovered at several nodes and there are several writers or readers. Our approach is based on a single-writer, single-reader model that simplifies recovery; hence we discuss the recovery techniques under this model.

2.1.3.1 Checkpoint

Checkpointing (snapshotting) is a technique in fault-tolerant distributed systems to enable backward recovery by saving the system state from time to time onto stable storage. Checkpointing is a suitable option for backup and disaster recovery, as it allows having different versions of the system state at different points in time. Since checkpointing produces a single file that can be compressed, it can easily and quickly be transferred over the network to other data centers to enhance the availability and recovery of the service. A checkpoint is a ready state of the system; therefore it is only required to read the snapshot to reconstruct the state, and there is no need for further processing.

The downside of this approach is that it stores a snapshot of the server state at one point in time after another, which means that a failure at any point in time will result in losing all the changes made from the last snapshot up to the failure point. This characteristic makes the method undesirable if the latest state needs to be recovered.

In practice, checkpointing is implemented by forking a child process (with copy-on-write semantics) to persist the state (13). This could significantly slow down a parent process serving a large dataset, or interrupt the service for hundreds of milliseconds, particularly on a machine with poor CPU performance. This can especially become an issue when the system is at its peak load.

2.1.3.2 Message Logging

It is not always possible to recover the latest state of a database using snapshots: to have a more recent state, more frequent snapshots are required, which yields a high cost in terms of the operations required to write the entire state to stable storage. To reduce the number of checkpoints and enable recovery of the latest state, the message logging technique can be used.

In message logging, a sequence number is associated with the messages that are recorded onto stable storage. The underlying idea of message logging is to use the logs stored in stable storage together with a checkpointed state (as a starting point) to reconstruct the latest state by replaying the logs on the given checkpoint. The checkpoint is only needed to limit the number of logs, hence shortening the recovery time.

Message logging requires that after completion of recovery no orphan processes exist. An orphan process is a process that survived the crash but is in a state inconsistent with the recovered process (14). In Chapter 3 we will discuss this property in our design.
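
As a concrete illustration of this recovery scheme, the sketch below rebuilds a key-value state from a checkpoint plus sequence-numbered log entries. The types and names are illustrative assumptions, not taken from the thesis implementation.

    // Recovery by log replay: start from the latest checkpoint and re-apply the
    // logged operations whose sequence numbers are newer than the checkpoint.
    import java.util.List;
    import java.util.Map;

    record LogEntry(long seqNo, String key, String value) {}

    class Recovery {
        static Map<String, String> recover(Map<String, String> checkpoint,
                                           long checkpointSeqNo,
                                           List<LogEntry> log) {
            Map<String, String> state = new java.util.HashMap<>(checkpoint);
            for (LogEntry e : log) {
                if (e.seqNo() > checkpointSeqNo) {   // skip entries already covered by the checkpoint
                    state.put(e.key(), e.value());   // re-apply the update in log order
                }
            }
            return state;
        }
    }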

2.1.3.3 Pessimistic vs. Optimistic Logging

Message logging can be divided into two categories: optimistic logging and pessimistic logging (14). Logging takes time, and logging methods can be categorized depending on whether a process waits to ensure that every event is safely logged before the process can impact the rest of the system.

Processes that do not wait for the completion of logging of an event are optimistic, and processes that block sending a message until the completion of logging of the previous message are pessimistic. Pessimistic logging sacrifices better performance during failure-free runs for the guarantee of recovering a state consistent with that of the crashed process.

In conclusion, optimistic logging is desirable from a performance point of view and is suitable for systems with a low failure rate. Pessimistic logging is suitable for systems with a high failure rate or systems in which reliability is critical.

Write-ahead logging (WAL) can be considered an example of the pessimistic method: the logs must be persisted before the changes take place. WAL is widely used in databases to implement roll-forward (redo) recovery.
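
The following minimal sketch shows pessimistic write-ahead logging in a single-writer setting: the entry is forced to the log before the in-memory state changes. The class and file names are hypothetical and are not part of this work.

    // Pessimistic (write-ahead) logging: the update is appended to the log and synced
    // to disk before the in-memory state is modified.
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    class WalStore {
        private final Map<String, String> memory = new HashMap<>();
        private final FileOutputStream log;

        WalStore(String logPath) throws IOException {
            this.log = new FileOutputStream(logPath, true);   // append-only log file
        }

        void set(String key, String value) throws IOException {
            String entry = "SET " + key + " " + value + "\n";
            log.write(entry.getBytes(StandardCharsets.UTF_8));
            log.getFD().sync();            // pessimistic: block until the entry is durable
            memory.put(key, value);        // only then apply the change in memory
        }
    }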

2.1.4 Replication

There are two main reasons for replication: scalability and reliability. Replication enables fault tolerance, as in the event of a crash the system can continue working using the other available replicas. Replication can also be used to improve performance and scalability: when many processes access a service provided by a single server, replication can be used to divide the load among several servers.

There is a variety of replication techniques with different consistency models. In this document we explain two major replication techniques, and later we describe how our system benefits from replication to improve its reliability and minimize data loss.

2.1.4.1 Replication Through Atomic Broadcast

Atomic broadcast, or total order broadcast, is a well-known approach that guarantees that all messages are received reliably and in the same order by all participants (15). Using atomic broadcast, all updates can be delivered and processed in order; this property can be used to create a replicated data store in which all the replicas have consistent states.

2.1.4.2 Chain Replication

Chain replication is a simple, straightforward replication protocol intended to support high throughput and high availability without sacrificing strong consistency guarantees. In chain replication, servers are linearly ordered to form a chain.

The first server in the chain, which is the entry point for update requests, is called the head, and the last server, which sends the replies, is called the tail. Each update request enters at the head of the chain and, after being processed by the head, the state changes are forwarded along a reliable FIFO channel to the next node in the chain; this continues in the same manner until the update reaches the tail. Queries are handled by forwarding them to the tail of the chain.

This method is not tolerant to network partitions, but in exchange it offers high throughput, scalability and consistency (16).
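
A minimal sketch of the protocol described above is given below; in-process method calls stand in for the reliable FIFO network channels, and the class names are illustrative rather than taken from any real implementation.

    // Chain replication sketch: an update enters at the head, is applied locally, and is
    // forwarded down the chain until it reaches the tail; queries are served by the tail.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class ChainNode {
        private final Map<String, String> store = new ConcurrentHashMap<>();
        private final ChainNode successor;          // null at the tail

        ChainNode(ChainNode successor) { this.successor = successor; }

        // Called by the client at the head, or by the predecessor for inner nodes.
        void update(String key, String value) {
            store.put(key, value);                  // apply the state change locally
            if (successor != null) {
                successor.update(key, value);       // forward along the chain (FIFO, reliable)
            } else {
                // tail: a real system would reply to the client here
            }
        }

        // Queries go to the tail, which holds only fully replicated updates.
        String query(String key) { return store.get(key); }
    }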

2.1.5 Disk vs RAM

Magnetic disk and RAM have several well-known differences. RAM access time is orders of magnitude lower than that of magnetic disk, and its throughput is orders of magnitude higher. The access time for a record on magnetic disk consists of seek time, rotational latency and transfer time; among the three, seek time is dominant when records are not large (in the range of megabytes). The seek time of a disk is several milliseconds, and the transfer time varies depending on the bandwidth: for instance, for 1 MB the transfer time is 10 ms on a disk with a bandwidth of 100 MB/s. On the other hand, the access latency of a record in memory is a few nanoseconds and memory bandwidth is several gigabytes per second (17), (18). This means RAM performs orders of magnitude better in terms of both latency and throughput.
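
As a rough illustration with the figures above (taking an assumed 5 ms for the combined seek and rotational delay, and an assumed DRAM bandwidth of 10 GB/s):

    $t_{\text{disk}} \approx t_{\text{seek+rot}} + \frac{\text{size}}{\text{bandwidth}} \approx 5\,\text{ms} + \frac{1\,\text{MB}}{100\,\text{MB/s}} = 15\,\text{ms}, \qquad t_{\text{RAM}} \approx \frac{1\,\text{MB}}{10\,\text{GB/s}} = 0.1\,\text{ms}.$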

The access method and the way data is structured in RAM make little difference to RAM performance, but this is not the case for disk. Sequential writes to disk provide far better latency and throughput than random writes, because they eliminate the need for constant seek operations (19). Everest (20) is an example of a system that uses sequential writes to disk in order to increase throughput.

The other difference between RAM and magnetic disk is volatility. Memory is volatile, and data will be lost at the time of a power outage or a crash of the process referencing the data. Magnetic disk is non-volatile storage, and data written to disk survives power outages and process crashes. However, writing to disk (forcing the data to disk) does not guarantee that the data is persisted immediately. Disks have a cache layer which is volatile; therefore, loss of power to the cache results in loss of the data being written to disk. One solution is to disable the cache, though this is not practical as it significantly degrades the disk performance and hence that of the application writing to disk. Other solutions are to use non-volatile RAM, as in NetApp filers (21), or disks with a battery-backed write cache, such as HP SmartArray RAID controllers; these provide a power source independent of the external environment to maintain the data for a short time, allowing it to be written to disk at the time of a power outage (22). However, the latter options are not considered commodity hardware.

2.2 Related Work

In this part we present some of the existing systems related to this work that have influenced our solution in one way or another. Another reason to select the following systems is that the collection of approaches they implement represents a comprehensive set of the common methods applied to provide durability for main memory databases.

We describe:

• Redis (23), an in-memory database that uses writes to local disk as well as replication to achieve durability.

• Bookkeeper (24), which provides write-ahead logging as a reliable, fault-tolerant distributed service.

• RAMCloud (1), a new approach to datacenter storage that keeps the data entirely in the DRAM of thousands of commodity servers.

• HDFS (3), a highly available distributed file system with append-only capability for key-value pairs.

At the end we discuss the pros and cons of the approach taken by each of these systems.

2.2.1 Redis

Redis (23) is an in-memory key-value store that aims at providing low latency. To meet this objective, the Redis server holds the entire dataset in memory to avoid page swapping between memory and disk, and consequently the serialization/deserialization process. Redis provides a comprehensive set of options for durability of data, as follows.

1. Replication of the full state in memory: Redis applies a master-slave replication model in which all the slave servers synchronize their states with the master server. The synchronization process is performed using non-blocking operations on both master and slaves; therefore they are able to serve clients' queries while performing synchronization. This implies the eventual consistency model of Redis, meaning that slave servers might reply to clients' queries with an old version of the data while performing synchronization. MongoDB is another example of a database system that uses a similar technique for replication (25). Redis implements a single-writer, multiple-readers model in which clients are able to read from any replica but are only permitted to write to one server. This model, along with eventual consistency, ensures that all the replicas will eventually be in the same state, while maintaining good performance in terms of latency and read throughput.

2. The other durability method of Redis is persisting the data to local disk using point-in-time snapshots (checkpoints) at specified intervals. In this method, the Redis server stores the entire state of the database onto the local disk every T seconds or every W write operations. Copy-on-write semantics are applied to avoid interrupting the service while persisting the data to disk.

3. Asynchronous logging is another approach taken by Redis to provide durability. Write operations are buffered in memory and flushed to disk by a background process in an append-only fashion. The time to sync the data depends on the sync policy specified in the configuration parameters (flush to disk every second or on every write) (26).

2.2.2 RAMCloud

RAMCloud (1) is a large-scale storage system designed for cloud-scale, data-intensive applications requiring very low latency and high throughput. It stores a large volume of data entirely in DRAM by aggregating the main memory of hundreds or thousands of commodity machines, and it aims at providing the same level of durability as disk by using a mixture of replication and backup techniques.

RAMCloud applies a buffered logging method for durability that uses both memory replication and logging to disk. In RAMCloud, only one copy of every object is kept in memory, and the backups are stored on the disks of several machines. The primary server updates its state upon receiving a write query and forwards the log entry to the backup servers; an acknowledgement is sent by a backup server once the log entry is stored in its memory. A write operation returns at the primary server once all the backup servers have acknowledged. Backup servers write the log entries to disk asynchronously and then remove them from memory (Figure 2.1).

Figure 2.1: Buffered logging in RAMCloud. Based on (1).

To recover quickly and avoid disruption of the service, RAMCloud applies two optimizations. The first is truncating the logs to reduce the amount of data that has to be read during recovery; this can be achieved by creating frequent checkpoints and discarding the logs up to that point, or by occasionally cleaning stale logs to reduce the size of the log file. The second optimization is to divide the DRAM of each primary server into hundreds of shards and to assign each shard to one backup server. At the time of a crash, each backup server reads the logs and acts as a temporary primary server until the full state of the failed server can be reconstructed.
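
The write path of buffered logging described above can be sketched as follows. This is an illustration of the protocol as summarized here, with hypothetical class names; it is not RAMCloud code.

    // Buffered logging sketch: the primary returns a write once every backup has the
    // log entry in memory; each backup persists the entry to disk later, in the background.
    import java.util.List;
    import java.util.concurrent.*;

    class Primary {
        private final List<Backup> backups;
        private final ExecutorService pool = Executors.newCachedThreadPool();

        Primary(List<Backup> backups) { this.backups = backups; }

        void write(byte[] logEntry) throws Exception {
            List<Future<?>> acks = new java.util.ArrayList<>();
            for (Backup b : backups) {
                acks.add(pool.submit(() -> b.bufferInMemory(logEntry)));  // replicate to backup memory
            }
            for (Future<?> ack : acks) ack.get();   // wait for every backup to acknowledge
            // only now is the write reported as successful to the client
        }
    }

    class Backup {
        private final BlockingQueue<byte[]> buffer = new LinkedBlockingQueue<>();

        void bufferInMemory(byte[] entry) { buffer.add(entry); }   // ack implied by returning

        // A background thread would drain 'buffer' and append entries to disk asynchronously.
    }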

2.2.3 Bookkeeper

Bookkeeper (24) provides write-ahead logging as a reliable distributed service (D-WAL). It is designed to tolerate failures by replicating the logs in several locations. It ensures that write-ahead logs are durable and available to other servers, so that in the event of a failure other servers can take over and resume the service.

Bookkeeper implements WAL by replicating log entries across remote servers using a simple quorum protocol. A write is successful if the entry is successfully written to all the servers in a quorum. A quorum of size f+1 is needed to tolerate the concurrent failure of f servers.

Bookkeeper also allows aggregating disk bandwidth by striping logs across multiple servers. An application using the Bookkeeper service is able to choose the quorum size as well as the number of servers used for logging. When the number of selected servers is greater than the quorum size, Bookkeeper stripes the logs among the servers. Figure 2.2 shows the Bookkeeper write operation and how it takes advantage of striping.

Figure 2.2: Bookkeeper write operation. Extracted from a Bookkeeper presentation slide (2).

In Figure 2.2, a ledger corresponds to a log file of an application, a bookie is a storage server storing ledgers, and the BK client is used by an application to process requests and interact with the bookies.

Assume a client selects three bookies and a quorum size of two. Bookkeeper performs striping by switching between quorums and spreading the load among the three bookies. This distributes the load among the servers, and if a server crashes the service continues without interruption. A client can read different entries from different bookies, which allows a higher read throughput by aggregating the read throughput of the individual servers. Bookkeeper also writes to disk sequentially, interleaving the entries into a single file, and stores an index of the entries in order to locate and read them. This maximizes the utilization of disk bandwidth and the throughput.

Bookkeeper follows a single-writer, multiple-reader model and guarantees that once a ledger is closed by a client, all readers read the same data.
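
To make the striping pattern concrete, the small sketch below prints which bookies receive each entry for an ensemble of three bookies and a write quorum of two. It only illustrates the rotation of quorums described above; it does not use the BookKeeper API, and the rotation rule shown is an assumption for illustration.

    // Striping illustration: entry i is assumed to go to bookies (i mod 3) and ((i+1) mod 3),
    // so consecutive entries rotate over the ensemble and spread the load.
    class StripingExample {
        public static void main(String[] args) {
            int ensemble = 3, quorum = 2;
            for (long entryId = 0; entryId < 6; entryId++) {
                StringBuilder target = new StringBuilder();
                for (int r = 0; r < quorum; r++) {
                    target.append("bookie-").append((entryId + r) % ensemble).append(' ');
                }
                System.out.println("entry " + entryId + " -> " + target.toString().trim());
            }
        }
    }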

2.2.4 HDFS

HDFS (3) is a scalable distributed file system for reliable storage of large datasets that delivers data at high bandwidth to applications. What makes HDFS interesting with regard to our work is the way it performs I/O operations and achieves high reliability and availability.

HDFS allows an application to create a new file and write to it. HDFS implements a single-writer, multiple-reader model: when a client opens a file for writing, no other client is permitted to write to the same file until the file is closed. After a file is closed its content cannot be altered, although new bytes can be appended to it. HDFS splits a file into large blocks and stores the replicas of each block on different DataNodes. The NameNode stores the namespace tree and the mapping of file blocks to DataNodes.

When writing to a file, if there is a need for a new block, the NameNode allocates a new block and assigns a set of DataNodes to store the replicas of the block; these DataNodes then form a pipeline (chain), as shown in Figure 2.3. Data is buffered at the client side, and when the buffer is full the bytes are pushed through the pipeline; this reduces the overhead of packet headers. The DataNodes are ordered in such a way as to minimize the distance from the client to the last node in the pipeline, thereby minimizing latency. The HDFSFileSink operator in a DataNode buffers the writes, and the buffer is written to disk only when adding the next tuple would exceed the size of the buffer. Thereby, each server writes to disk asynchronously, which enables low write latency in HDFS.

Figure 2.3: Pipeline during block construction. Based on (3).

Placement of block replicas is critical for reliability, availability, and network bandwidth utilization. HDFS applies an interesting strategy to place the replicas: it provides a tradeoff between minimizing the write cost and maximizing reliability, availability and read bandwidth. HDFS places the first replica of each block on the same node as the writer, and the second and third on two different nodes in a different rack. HDFS enforces two restrictions: a DataNode cannot store more than one replica of any block, and, provided that there are sufficient racks in the cluster, no rack should store more than two replicas of any block. In this way, HDFS minimizes the probability of correlated failures, as the failure of two nodes in the same rack is more likely than that of two nodes in different racks, which maximizes availability and read bandwidth (3).

2.3 Discussion

We summarize the approaches towards durability into four major categories.

• Replication of the full state into several locations.

• Periodic snapshots of the system state.

• Asynchronous logging of writes onto a stable storage.

• Synchronous logging of writes onto a stable storage.

The full replication approach along with eventual consistency (e.g., Redis) ensures that all the replicas will eventually be in the same state, while maintaining good performance in terms of latency and read throughput.

This approach provides low latency and high read throughput that scales linearly with the number of slave servers, because all the read and write queries can be served from memory without involving the disk. However, this approach has one major drawback: a large memory requirement.

This method becomes costly in terms of hardware and, more importantly, utility cost when we have a large cluster of servers. DRAM is volatile and requires constant electric power, meaning the machines need to be powered at all times. For example, in today's datacenters the largest amount of DRAM that is cost-effective is 64 GB (1); in such a datacenter, storing 1 TB of data requires 16 machines. To have a replication factor of three, which is considered the norm for a good level of durability (3), we need 32 extra servers. Even though this approach offers great benefits, it is not a proper choice for a large cluster of in-memory databases, as it becomes costly.

The other drawback is the possibility of data loss. For example, in the case of Redis, the master server replies to updates before replication on the slave servers has completed (for lower latency); hence, if the master fails between replying to an update and sending the update to the replicas, the data can be lost. To prevent such a risk, the update should not return until all the replicas have received it, although this increases the latency. This is a tradeoff that needs to be made between high performance and durability. The other risk associated with this approach is that in case of concurrent failure of all the servers holding the replicas (a datacenter power outage), the entire data will be permanently lost. To mitigate this issue, data can be replicated in multiple datacenters; however, this results in a high update latency (hundreds of milliseconds) for blocking calls or partial loss of data for asynchronous calls.

Redis provides periodic snapshots. This is a good choice for backup and disaster recovery, as it allows having different versions of the system state at different points in time. Since the full state is contained in a single file, it can be compressed and transferred to other datacenters to enhance the availability and recovery of the service.

A periodic snapshot stores the server state at one point in time after another; however, a failure at any point in time will result in losing all the updates from the last snapshot up to the failure point. This property makes the method undesirable when the latest state needs to be recovered. The other point to consider is that forking a child process to persist the state could significantly slow down a parent process serving a large dataset, or interrupt the service.


In comparison to snapshots, logging provides better durability, as every write operation can be written to disk. To improve performance, write operations are batched in memory before being written to disk; thus, a failure results in the loss of the buffered data. Logging performance can be improved by writing the updates to disk in an append-only fashion. This avoids the long latency of seek operations on disk (on dedicated disks) by writing the logs sequentially. Therefore, if the sink thread is the only thread writing to the disk (in append-only fashion), it can achieve better write throughput.

Logging provides stronger durability than snapshots, but it results in a larger log file and a slower recovery process, since all the logs need to be replayed in order to rebuild the full state of the dataset. To accelerate the recovery process, the number of logs required to rebuild the state should be reduced. Two major techniques for truncating the log file are as follows. The system state can be checkpointed frequently, so that the logs before the checkpoint can be removed. The other technique, which is implemented by Redis, is cleaning old logs: Redis rewrites the log file in the background to drop unneeded logs and minimize the log file size.

Asynchronous logging to disk provides better performance than synchronous logging; however, it increases the possibility of losing updates. Asynchronous logging is therefore usually combined with replication to mitigate this issue. RAMCloud takes this approach by replicating the logs through broadcast, referring to it as buffered logging, which allows writes (and reads) to proceed at the speed of RAM along with good durability and availability. Buffered logging allows a high write throughput; however, if writes continue at a sustained rate higher than the disk throughput, they eventually fill the entire OS memory and the throughput drops to that of the disk. Therefore, buffered logging provides good performance as long as free memory is available.

Moreover, buffered logging does not always guarantee durability, as in case of a sudden power outage the buffered data will be lost. Therefore, it is suitable for applications that can afford the loss of some updates. To deal with such scenarios, cross-datacenter replication can be used; however, write performance is then expected to drop significantly.

HDFS provides an append-only operation that can be used for the purpose of logging; HBase is an example of an application using this capability of HDFS for logging (27). The idea is similar to RAMCloud, though the major difference is that the replication model applied in HDFS is similar to chain replication, which enables high write throughput. HDFS buffers the bytes in memory and writes a big chunk of data to disk when the buffer is full. HDFS creates one file per client on each machine holding replicas, which means that if multiple clients concurrently write to file blocks located on the same machine, the write performance degrades, as writing to several files on the same disk requires frequent seek operations. HDFS addresses correlated failures through a smart replication strategy, placing the replicas on different machines in multiple racks.

In the case of Bookkeeper, the quorum approach consumes more resources from one of the participants, as that participant needs to perform the multicast. For instance, in Bookkeeper the client multicasts the log entries across several bookies; consequently, this consumes more bandwidth and CPU power at the client. One way to resolve this could be to outsource the replication responsibility to the server ensemble and create a more balanced replication strategy. For example, Zookeeper (28), a coordination system for distributed processes, applies a quorum-based approach on the server side for replication by implementing a totally ordered broadcast protocol called Zab (29); however, this complicates the server implementation.

Our design decisions for approaching the durability problem in memory databases are mostly influenced by the approaches described above. In the next chapter, we describe our solution in detail.


3 Design and Architecture

In this chapter, we define durability with respect to this work and describe how we approach the durability problem in memory-based key-value stores. We explain the system design and its properties, and finally how the system is built.

3.1 Durability

For the purpose of this work, "durability" means that if the value v corresponding to key k is updated to u at time t, then a read of key k at time t' such that t' > t must return u, provided no updates occurred between t and t'. We assume that this durability condition holds for a memory database as long as no crash has occurred. This work addresses the durability of a memory database (in our case, a key-value store) such that the latest committed value of every key can be retrieved after a crash.
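
The same condition can be written compactly as follows (a direct transcription of the definition above, not an additional requirement):

    $\big(\mathit{write}(k,u) \text{ at } t\big) \;\wedge\; \big(\nexists\, \mathit{write}(k,\cdot) \text{ in } (t,t')\big) \;\Rightarrow\; \mathit{read}(k) \text{ at } t' > t \text{ returns } u.$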

3.2 Target Systems

The proposed system design is tailored to provide durability for a cluster of in-memory databases that store data in the form of key-value pairs and comply with the following specifications.

1. The dataset is large and the cluster of in-memory key-value stores consists of at least dozens of machines.

2. The size of a write query (update/insert/delete) varies from a few hundred bytes to a few kilobytes (an example of a write query is "SET K V", which sets the value of key K to V).

3. The workload is read-dominant (10-20% of the queries are writes).

4. High availability of the service is important.

The above specification is common for social networking platforms such as Facebook, Twitter and Yahoo! News Activity, which store large amounts of data in main memory and process large numbers of events. For example, in 2008 alone, Facebook was serving 28 terabytes of data from memory (30), and this number is increasing. Based on (31), in the Facebook cluster less than 6 percent of the queries are write queries. In social network platforms, users' write queries are generally small (less than 4 KB) (32); for example, the size of a Twitter message is limited to 140 characters (33).

3.3 Design Goals

• In our design we aim to provide a high level of durability such that, in the event of a crash, the latest state of the system is recoverable with a low probability of data loss. The objective is to achieve this goal with minimal impact on the performance of the memory database (read operations do not make any changes to the database state; thereby, only the write operations need to be durable).

• We need to ensure that our system is highly available, so that changes to the database state can be reliably recorded to stable storage and the records can be read at the time of recovery.

• The system needs to scale with an increasing number of databases and write operations.

• Maximizing the utilization of the local resources of the database cluster is another objective: we try to avoid additional dependencies on external systems and to create a self-contained application.

• Any guarantee about the durability of a write should be provided before acknowledging the success of the write operation to the writer.

• Our durability mechanism should enable a low recovery time to enhance the availability of the database service.

3.4 Design Decisions

In Section 2.2, we described and discussed the common approaches towards durability in memory-based databases. In this section, we explain our design decisions with respect to the target systems and the objectives.

Checkpoint vs. Logging. As checkpointing consumes a considerable amount of resources and always leaves the possibility of data loss, we choose to use message logging to persist the changes of the database state; thereby, the state can be reconstructed by replaying the logs. (To reduce the recovery time and limit the number of logs, a snapshot of the system state is needed, or the unneeded logs should be truncated before recovery. To eliminate the cost of this process during operation, a background process can be assigned to reconstruct the system state and store it in stable storage when the system is not under stress. This is part of the future work.)

Pessimistic vs. Optimistic Logging. We choose to use pessimistic logging to ensure that the changes take place only after they are durable in a stable storage system. Low latency is one of our main objectives; in order to achieve it, we create a stable storage by a mixture of in-memory replication and asynchronous logging of the changes of the database state. This allows storing log entries in several locations while providing a low response time. We call the set of servers cooperating to perform replication and logging a stable storage unit, or SSU.

Asynchronous vs. Synchronous. Asynchronous logging is the core of our design to provide a low response time. The reason to choose asynchronous logging is to eliminate the latency of writing the logs to disk. However, since DRAM is volatile, this method carries the risk of losing the logs upon a crash. To address this issue, we replicate the logs in the memory of several machines before acknowledging the durability of the write. In this way, we can significantly reduce the probability of data loss, as it is very unlikely that all the machines crash at the same time (3). The design targets low latency and high throughput for write operations by trading guaranteed durability for a low probability of data loss. Further in this chapter, we discuss the possibility of losing data and the reliability of this method.
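
The sketch below illustrates this decision: a log server buffers each entry in memory, forwards it to the next server in its unit, and only a background thread touches the disk, so an append returns as soon as the entry sits in the memory of every server. The classes are illustrative, simplify away networking and acknowledgements, and are not the thesis implementation.

    // Asynchronous logging with in-memory replication: append() returns once the entry is
    // buffered on every server; a background thread persists entries to disk later.
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class LogServer {
        private final BlockingQueue<byte[]> diskQueue = new LinkedBlockingQueue<>();
        private final LogServer next;               // next server in the unit, null at the last one

        LogServer(LogServer next) {
            this.next = next;
            Thread flusher = new Thread(this::flushLoop);   // background disk writer
            flusher.setDaemon(true);
            flusher.start();
        }

        // Returns only when the entry is buffered in the memory of every server in the unit.
        void append(byte[] entry) {
            diskQueue.add(entry);                   // buffer for asynchronous persistence
            if (next != null) next.append(entry);   // replicate to the next server
        }

        private void flushLoop() {
            while (true) {
                try {
                    byte[] e = diskQueue.take();
                    // a real server would append 'e' sequentially to its local log file here
                } catch (InterruptedException ie) {
                    return;
                }
            }
        }
    }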

Chain Replication vs. Broadcast. In order to replicate the logs, we choose to use chain replication, for two main reasons. 1) Chain replication puts nearly the same load on the resources of each server, while in broadcast one of the participants uses more resources than the others; this provides an implicit load balancing. 2) Chain replication enables high-throughput logging, as the symmetric load on the servers allows utilizing the maximum resources of each server and minimizes the chance of a bottleneck appearing. We also performed an experiment to help us with the decision: we measured the latency caused by network transmission using either approach. We discuss the experiment in Chapter 4.


Local Disk vs. Remote File System Logs can be persisted either on the local disks of the servers or in an existing reliable remote file system (e.g. NFS, HDFS). We choose to use the local disk of each server to maximize the utilization of local resources, reduce dependencies and avoid using network bandwidth for persistence. As the logs are replicated in the memory of several machines and all the machines persist the logs onto their disks, we obtain replicas of the logs on several hard disks. This enhances the availability of the logs at the time of recovery and accelerates the recovery process, since different partitions of the logs can be read from different servers (hence aggregating the disk bandwidth of the replicas).

Faster Recovery vs. Higher Write Performance During peak periods, when the system is under sustained intensive load, if the write throughput to the storage is higher than the write throughput to disk, the servers' buffers eventually become saturated and the performance degrades significantly. It is therefore important to fully utilize the disk bandwidth and minimize the write latency in order to prevent saturation of the buffers as much as possible. We write the logs in an append-only fashion, sequentially writing them to disk in a single file, to eliminate seek time and maximize disk throughput utilization. Therefore, we need to interleave the logs from all the writers into a single file.

As opposed to having one file per writer, this method (sequential writes to a single file) makes the recovery process slower, since to recover the logs belonging to one writer we need to read all the logs in the file sequentially. Recovery needs to be done only at the time of a crash, which happens rarely, whereas logging needs to be performed constantly. Thus, we choose faster logging over faster recovery. Read performance could, however, be improved by indexing the log entries (BookKeeper (24) implements this method).
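As a concrete illustration of this append-only, interleaved layout, the following minimal Java sketch appends length-prefixed records tagged with a writer id to a single file. The record layout and class name are assumptions for illustration, not the prototype's actual format.

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Minimal sketch: one append-only file shared by all writers.
public class AppendOnlyLog {
    private final DataOutputStream out;

    public AppendOnlyLog(String path) throws IOException {
        // Open in append mode so entries from all writers are written sequentially to one file.
        this.out = new DataOutputStream(new FileOutputStream(path, true));
    }

    public synchronized void append(int clientId, long logId, byte[] payload) throws IOException {
        out.writeInt(clientId);        // identifies the writer the entry belongs to
        out.writeLong(logId);          // monotonically increasing per writer
        out.writeInt(payload.length);  // length prefix so records can be scanned back
        out.write(payload);
    }

    public synchronized void flush() throws IOException {
        out.flush();                   // callers may group several appends per flush
    }
}

During recovery, the whole file is scanned and only records whose clientId matches the recovering writer are replayed, which is exactly why this layout trades recovery speed for logging speed.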

Transport layer protocol We choose to use TCP/IP for communication as we want to

deliver the messages in order and reliably to provide a consistent view of the logs (and the stored

files) among all the servers in a chain.

3.5 System Properties

Our stable storage system consists of a set of stable storage units (storage unit or SSU).

Every stable storage unit consists of several servers, each persisting log entries onto its local


disk. A writer process writes to only one of the stable storage units. A storage unit follows a

fail-stop model; upon an SSU crash, its clients write to another storage unit. The system environment allows failure detection through a membership management service provided by an external system.

Our solution follows a single-writer, single-reader model. The log entries of a database application are written to the stable storage by only one process. The process that writes the logs is the same process (same identifier) that reads them back from the storage. The read operation only needs to be performed at the time of recovery. Therefore, read and write operations on the same data item are never performed simultaneously.

A reader can read the logs from more than one server within the storage unit, as all the servers store an identical set of data (the acknowledged log entries). A process writes to a different storage unit if its storage unit fails or if the storage unit decides to disconnect the process.

3.5.1 Fault Tolerance Model

The system needs to be fault tolerant to continue its service at the time of failure. We achieve fault tolerance through replication. In our system, persistence of an acknowledged log is guaranteed under f simultaneous server failures if we have f + 1 servers in the replication chain. However, to guarantee stable storage of a log we require f + 2 servers to tolerate f simultaneous failures.

We implement a fail-stop model. A server halts in response to a failure, and we assume that a server's crash can be detected by all the other servers in the storage unit. In the event of a server crash, the storage unit stops serving all its writers and only persists the logs remaining in its servers' buffers onto disk (writers connect to another storage unit to continue logging). Once all the logs are persisted to disk, all servers restart and become available to form a new storage unit.

An alternative option to deal with failures is to repair the storage unit. However, for the following reasons, we prefer to re-create a storage unit and avoid repairing it. Repairing a storage unit requires addressing many failure scenarios, which complicates the implementation. In addition, the possibility of corner cases that have not been taken into account, as well as the possibility of additional failures during the repair, further complicates matters.


3.5.2 Availability

The system allows the creation of many stable storage units. Each storage unit can provide a different replication factor. Larger replication factors (the number of servers in the replication chain within a storage unit) provide three advantages:

• higher availability of the stored entries, since all the servers within the storage unit host a replica of the logs.

• lower probability of data loss when correlated failures occur, since it is more likely that at least one server holding the buffered data survives the failure and persists it to disk.

• higher read bandwidth, by aggregating the bandwidth of the servers hosting the replicas.

Therefore, a higher replication factor for a storage unit enhances data availability and read throughput and provides stronger durability. However, in the case of a catastrophic failure of all the servers in a storage unit, data can be lost. A larger number of storage units enhances the availability of write operations, since writers can continue logging upon a storage unit crash.

3.5.3 Scalability

In our storage system, every storage unit is independent of every other unit and there are

no shared resources or coordination amongst them. The independent nature of the storage units

allows adding new units without impacting the service performance. The load is divided by assigning different sets of writers to different units. To create a storage unit, a set of servers with the closest resource usage is selected in order to prevent any single server from becoming a bottleneck in the chain. This allows maximum resource utilization within a storage unit.

3.5.4 Safety

3.5.4.1 Consistent Replicas and Correct Recovered State

We rely on the TCP protocol to transfer the messages between the nodes in order, reliably and without duplication. The servers in a chain are connected by a single TCP channel, and messages are forwarded and persisted in the same order in which they have been received. This ensures that


all the servers in a chain view and store the logs in the same order in which the messages are sent by the writer (the writer also writes through a single TCP channel). In our system, it is not possible to recover an incorrect or stale database state without the knowledge of the recovery processor (reader). Every writer is represented by a unique Id (client Id) and every log is uniquely identified by the combination of the client Id and a log Id. The log Id increases by one for every new log entry. During recovery of a database state, logs are read and replayed in order, and in case of a missing (or duplicate) log, the recovery processor is able to detect the missing (or duplicate) entry.
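The following minimal sketch shows how a recovery processor could exploit the per-writer log Id to detect gaps and duplicates during replay. The class and the convention that ids start at 1 are assumptions for illustration only.

// Hypothetical gap/duplicate detector for replaying one writer's log stream.
public class ReplayChecker {
    private long expectedLogId = 1; // assumed starting id; adjust to the real convention

    public void onEntry(long logId) {
        if (logId == expectedLogId) {
            expectedLogId++;          // in-order entry: apply it to the recovered state
        } else if (logId < expectedLogId) {
            // duplicate entry: it has already been applied, so skip it
        } else {
            throw new IllegalStateException(
                "Missing log entries " + expectedLogId + " to " + (logId - 1));
        }
    }
}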

3.5.4.2 Integrity

During recovery, we need to ensure that the object being read is not corrupted. This requires adding a checksum to every stored object to enable verification of the data being read (this feature is not part of the implementation).
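Since this feature is left for future work, the following is only a hedged sketch of how a per-entry checksum could be computed and verified with the standard CRC32 class.

import java.util.zip.CRC32;

// Illustrative checksum helper; the stored checksum would accompany each entry on disk.
public final class Checksums {
    public static long checksum(byte[] entry) {
        CRC32 crc = new CRC32();
        crc.update(entry);
        return crc.getValue();
    }

    public static boolean verify(byte[] entry, long storedChecksum) {
        return checksum(entry) == storedChecksum; // false indicates a corrupted entry
    }
}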

3.5.5 Operational Constraints

The availability of the Zookeeper quorum is essential for the availability and operation of the system, since we rely on Zookeeper for failure detection and for accessing metadata regarding the nodes. The availability of write operations depends on the availability of a storage unit, and the availability of read operations requires at least one server that stores the requested logs.

In order to operate continuously, at least two storage units should be available, so that service can quickly resume upon a storage unit crash. For example, in the current implementation we require six servers to have two storage units with a replication factor of three. This requirement can be reduced by repairing a storage unit upon a crash, replacing the failed server from a pool of available servers. However, this complicates the implementation.

3.6 Architecture

The main idea is to create several storage units, each capable of storing and retrieving a stream of logs reliably. Each storage unit consists of a number of coordinated servers that perform chain replication and asynchronously persist the logs onto their local disks to provide a lower


response time. Each log entry is acknowledged only after it is replicated in all the servers within

the storage unit. Hence, we ensure that the logs are preserved in the event of failure of some of the servers. The number of servers in each storage unit is equal to the replication factor it provides. In this section, we describe the architecture of the system (with respect to the write operation).

3.6.1 Abstractions

Our system consists of three types of processes, organized into stable storage units. Figure 3.1 illustrates these entities and below we describe their functions.

• Log server processes (server) form a storage unit and asynchronously store the log

entries on local disks (in append-only fashion). They also read and stream the requested

logs from the local disk upon the request of a client process at the time of recovery. In Figure 3.1, the head, tail and middle nodes are log server processes.

• Stable Storage Unit or storage unit (SSU) provides stable storage of log entries. It

consists of a number of machines hosting two types of processes: a log server process (one on each machine) to replicate and store the logs, and a state builder process. The number of machines is equal to the replication factor provided by the stable storage unit.

• Client process (writer/reader) processes requests (writes) from an application and

creates log entries. It streams the entries to an appropriate storage unit and responds to

the application. The client process also reads the logs and reconstructs the database state

at the time of failure (the read operation is future work).

• State builder processes are the background processes that read the logs from the local

disk to compute the latest value of each key. Once the values are computed, they are stored

on disk and the old logs are removed. The purpose of this process is to reduce the recovery time by eagerly preparing the latest state of the key-value store. This process takes place whenever the system is not under stress (part of future work).


Figure 3.1: System entities.

3.6.2 Coordination of Distributed Processes

Zookeeper is a coordination system for distributed processes (28). We use Zookeeper for

membership management and storing metadata about the server processes, storage units, and

client processes. Data in Zookeeper is organized as a tree, and each node is called a znode. There are two types of znodes: ephemeral and permanent. An ephemeral znode exists as long as the session of the Zookeeper client that created it is alive, so ephemeral znodes can be used to detect the failure of a process. A permanent znode stores its data permanently and ensures it remains available.

We use this metadata and the Zookeeper membership service to coordinate server processes when creating storage units and to detect failures. Client processes also use the Zookeeper service to locate storage units and detect their failures. Below we describe the metadata and the types of znodes used in our system.

MetaData

• Log Server znode (ephemeral)

– IP/Port for coordination protocol

– IP/Port for streaming

– Rack: the rack where the server is located

– Status: accept or reject storage join request

– Load status: updated resource utilization status

• Storage unit znode (ephemeral)


– Replication factor

– Status: accept/reject new clients

– List of log servers

– Load status: load of the log server with highest resource utilization

• File map znode (permanent)

– Mapping of logs to servers

• Client znode (ephemeral)

– Only used for failure detection

• Global view znode (permanent)

– List of servers and their roles (leader/follower) used to form stable storage units
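To make the ephemeral-znode mechanism above more concrete, the following hedged sketch shows how a log server might register itself and watch a peer using the standard ZooKeeper Java client. The znode paths, data layout and class name are illustrative assumptions, not the actual schema used by the prototype.

import org.apache.zookeeper.*;

// Hypothetical registration/watch helper built on the ZooKeeper client API.
public class ServerRegistration implements Watcher {
    private final ZooKeeper zk;

    public ServerRegistration(String connectString) throws Exception {
        zk = new ZooKeeper(connectString, 5000, this); // 5 s session timeout (assumed)
    }

    public String register(String serverId, byte[] status) throws Exception {
        // The ephemeral znode disappears when this server's session dies, signalling a crash.
        return zk.create("/servers/" + serverId, status,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    public void watchPeer(String serverId) throws Exception {
        // A watch fires a NodeDeleted event if the peer's ephemeral znode goes away.
        zk.exists("/servers/" + serverId, this);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
            // The watched peer failed: abort chain formation or trigger failover here.
        }
    }
}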

3.6.3 Server Components

A log server process creates an ephemeral znode in Zookeeper upon its start and it constantly

updates its status data at this node. This process follows a protocol that allows it to cooperate

with other processes to form a storage unit. We first describe this protocol, and then we explain

how an individual log server operates within a storage unit.

3.6.3.1 Coordination Protocol

This protocol is used to form a replication chain (storage unit) and operates very similarly

to two-phase commit (12). The protocol defines two server roles: leader and follower. The leader is responsible for contacting the followers and managing the creation of a storage unit. Followers act as passive processes and only respond to the leader. Figures 3.2 and 3.3 describe the state transitions of the leader and the follower.

If a server process is not part of a storage unit, it sets its state to the listening state. In the listening state, a process frequently checks the global view data. If the process is listed as a leader, it reads the list of its followers' addresses. It then sends the followers a join-request message and sets a failure detector (a watch flag on their ephemeral znodes) to detect their failures.


Figure 3.2: Leader states.

Figure 3.3: Follower states.


Followers are able to accept or reject the join request depending on their available resources. If a follower fails or rejects the request, the leader triggers the abort process and all processes resume their initial state. In order to abort, the leader sends an abort message to all the followers. Upon receiving an abort message, each follower (and the leader itself) cleans all its data structures and returns to the initial state (Figure 3.3).

Each follower sets a failure detector for the leader before accepting the join request so that, in case of leader failure, it can detect the failure and resume its previous state. If all the followers accept the join request, the leader sends a connection-request message carrying an ordered list of servers (including the leader). Each server connects to the previous and next servers in the list as its predecessor and successor in the chain. Once a server is connected and ready to stream data, it sends a connect-completion signal to the leader. If a server fails to connect or crashes, the leader aborts the process. Otherwise, a complete chain of servers is ready, and the leader creates a znode for the new storage unit and sends a start signal, along with the znode path of the storage unit, to all the followers to start the service (Figure 3.2).

3.6.3.2 Concurrency

Each server process consists of three main threads operating concurrently based on a producer-consumer model. Figure 3.4 shows how the threads in a single server operate and interact through

shared data structures.

The three data structures shared among the threads are:

• DataBuffer stores the log entries in memory.

• SenderQueue keeps the ordered indices of the log entries that should be either sent to the next server (head or middle server) or acknowledged to the client (tail server).

• PersistQueue holds an ordered index of the log entries that should be written to disk.

The receiver thread reads the entries from the TCP buffer and inserts them into the DataBuffer. It also inserts the index of each entry into the SenderQueue. The DataBuffer has a pre-specified size (a number of entries); if the DataBuffer is full, the receiver thread must wait until an entry is removed from it.


Figure 3.4: Log server operation.

The sender thread waits until an index exists in the SenderQueue. It reads the index from the SenderQueue to find and read the corresponding entry from the DataBuffer. If the server is the tail server for the entry, it sends an acknowledgment to the corresponding client indicating that the entry has been replicated in all the servers. If the server is not the tail, it simply sends the entry to the next server in the chain (its successor). Once the message is sent to the next hop, the sender thread puts the index of the entry in the PersistQueue.

The persister thread waits until an index exists in the PersistQueue. It uses the index to read the entry from the DataBuffer and persists the entry to disk in an append-only fashion. This is the only thread persisting entries, so all the entries from different clients are interleaved into a single file. Once an entry is written to disk, it is removed from the DataBuffer by this thread.
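The following minimal sketch outlines this three-thread, producer-consumer layout with standard Java concurrency utilities. The types are simplified assumptions (the prototype's buffer, for example, uses hashing over entry indices and enforces a bounded size), so this is an illustration of the structure rather than the actual implementation.

import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the receiver -> sender -> persister pipeline inside one log server.
public class LogServerPipeline {
    private final Map<Long, byte[]> dataBuffer = new ConcurrentHashMap<>();
    private final BlockingQueue<Long> senderQueue = new LinkedBlockingQueue<>();
    private final BlockingQueue<Long> persistQueue = new LinkedBlockingQueue<>();

    // Receiver thread: store the entry and hand its index to the sender.
    void onReceive(long index, byte[] entry) throws InterruptedException {
        dataBuffer.put(index, entry);   // a real bounded buffer would block here when full
        senderQueue.put(index);
    }

    // Sender thread: forward the entry (or acknowledge it, if tail), then queue it for persistence.
    void senderLoop(boolean isTail) throws InterruptedException {
        while (true) {
            long index = senderQueue.take();
            byte[] entry = dataBuffer.get(index);
            if (isTail) {
                // send an acknowledgment for 'entry' to the corresponding client
            } else {
                // send 'entry' to the successor in the chain
            }
            persistQueue.put(index);
        }
    }

    // Persister thread: append the entry to the local log file, then free the buffer slot.
    void persisterLoop() throws InterruptedException {
        while (true) {
            long index = persistQueue.take();
            byte[] entry = dataBuffer.remove(index);
            // append 'entry' to the single local log file here (append-only write)
        }
    }
}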

3.6.4 Stable Storage Unit (SSU)

A stable storage unit consists of a set of servers forming a replication chain. It ensures the replication and availability of the delivered entries. One of the servers acts as a leader and holds the lease on the znode of the storage unit. The storage unit is considered failed when one or more of its servers crash. Upon a storage unit crash, the service is stopped and the unit only persists the entries remaining in memory to disk. When the leader crashes, the znode is removed automatically; when any other server fails, the leader removes the znode. Therefore, all clients are notified about the storage failure.

In a storage unit, every server can act as head, tail or middle server. Clients can connect to


Figure 3.5: Storage unit.

any server in the chain. The server acting as the entry point is the head of the chain for that client

and the last server in the chain (which sends the acknowledgment) is the tail. Figure 3.5 shows

how several clients can stream to the storage unit.

3.6.5 Load Balancing

We make load balancing decisions at three points.

• We ensure that the set of servers selected to create a storage unit (i.e., to perform chain replication) have nearly the same load. This minimizes the chance of a bottleneck appearing in the chain and maximizes the resource utilization of each server. Figure 3.6 shows how servers

are clustered to form a storage unit.

• One of the servers within the storage unit constantly updates the storage unit's available resources and status in its Zookeeper znode. This enables the clients to select the storage unit with the lowest load by reading this data from the Zookeeper servers.

• In a storage unit, every server can act as the head, tail or middle server. The tail consumes less bandwidth, since it only sends acknowledgments to the client, while the head and middle servers need to transfer each entry to the next server in the chain. Hence, if all the clients choose the same server as the head, the tail server consumes half the bandwidth compared to the rest of the servers. To mitigate this issue, clients of one storage unit connect to different servers. In the current implementation, clients randomly choose one server in the storage


unit as the head. However, this can be improved by connecting clients to different servers in a round-robin fashion, so that every server serves a nearly equal number of clients as the head (and, consequently, as the tail). Figure 3.5 also shows the distribution of load through the selection of different head servers by the clients.

Figure 3.6: Clustering decision based on the servers' available resources.

The decision to cluster the servers is based on their available resources. One of the log server processes is in charge of compiling the servers' data and making the clustering decisions. It reads the servers' data from Zookeeper and sorts the available servers based on their free resources. Using this information, servers with a similar amount of available resources are grouped together. The server with the largest amount of free resources in the group is chosen as the leader. This process is performed frequently and the output is written to the "Global View" znode in Zookeeper. Each process frequently reads this data to determine its group and its role. Upon a crash of the process responsible for updating the "Global View" znode, another process takes over the job. Our current implementation does not provide dynamic load balancing; this is part of the future work.
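A minimal sketch of this clustering step is shown below: servers are sorted by free resources and the sorted list is cut into groups of chain length, with the least-loaded server of each group acting as the leader. The ServerInfo type and the single scalar load metric are simplifying assumptions.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical clustering helper: group servers with similar free resources into chains.
public class Clustering {
    public record ServerInfo(String id, double freeResources) {}

    public static List<List<ServerInfo>> cluster(List<ServerInfo> servers, int chainLength) {
        List<ServerInfo> sorted = new ArrayList<>(servers);
        // Most free resources first, so adjacent servers in the list have similar load.
        sorted.sort(Comparator.comparingDouble(ServerInfo::freeResources).reversed());

        List<List<ServerInfo>> groups = new ArrayList<>();
        for (int i = 0; i + chainLength <= sorted.size(); i += chainLength) {
            // The first server of each group has the most free resources and becomes the leader.
            groups.add(new ArrayList<>(sorted.subList(i, i + chainLength)));
        }
        return groups; // any leftover servers stay unassigned until the next clustering round
    }
}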

3.6.6 Failover

In the event of a storage unit failure (a failure of any of its servers), the clients of that storage unit are able to detect the failure and find another storage unit by querying Zookeeper. A client connects to another storage unit (if one is available) to continue writing its logs. An alternative that shortens the service disruption is to allow the client to hold connections to two storage units; upon the crash of the one it is writing to, it immediately resumes operation by switching to the other storage unit.


Figure 3.7: Failover.

3.6.7 API

The following APIs can be used by an application (a memory-based datastore) to write into

the stable storage system.

• connect(replicationFactor); finds a storage unit with the specified replication factor and randomly connects to one of the servers within that storage unit.

• addEntry(entry); submits an entry to the storage unit and blocks when the number of unacknowledged entries reaches the specified window size.

• setWindowSize(windowSize); sets the number of entries that can be sent without receiving acknowledgments. Setting the window size to one is equivalent to synchronous communication (the default window size is one).

• close(); closes the connection to the service.
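The sketch below illustrates how an application might drive this API. The StableStorageClient interface is only a stand-in mirroring the calls listed above; the actual client class and signatures in the prototype may differ.

// Hypothetical client interface mirroring the API above (names and signatures are assumptions).
interface StableStorageClient extends AutoCloseable {
    void connect(int replicationFactor) throws Exception;
    void setWindowSize(int windowSize);
    void addEntry(byte[] entry) throws Exception;
    void close() throws Exception;
}

// Illustrative usage: log a batch of write operations with a window of ten outstanding entries.
class ApiUsageExample {
    static void writeAll(StableStorageClient client, byte[][] entries) throws Exception {
        client.connect(3);          // join a storage unit with replication factor three
        client.setWindowSize(10);   // allow up to ten unacknowledged entries in flight
        for (byte[] entry : entries) {
            client.addEntry(entry); // blocks once ten entries are outstanding
        }
        client.close();
    }
}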

3.7 Implementation

We have implemented a prototype of the storage system in Java and the source code is

publicly available in a github repository (34). For network communication, we use Netty

(35), an asynchronous event-driven network application framework, and for efficient serializa-

tion/deserialization, we use Protocol Buffers (36). We have implemented a cluster of stable storage units (SSUs) and an SSU client with a failover mechanism. Parameters such as the replication factor, buffer size and client window size are configurable. We have not implemented read

operations and recovery, and they are part of the future work.


4 Experimental Evaluation

This chapter describes a set of experimental evaluations. All the experiments are conducted

on a set of identical machines with the following specifications:

• Two Intel(R) Xeon(R) L5420 processors, 2.50 GHz, 4 cores each

• 1 TB SATA drive with a spin speed of 7200 RPM

• 1 Gb/s network interface

• 16 GB of DRAM

Each experiment has a specific configuration that is explained in the respective section. We

begin with an experiment that enables us to estimate the expected network latency of chain

replication and atomic broadcast. Then, we evaluate our stable storage in terms of throughput

and latency. We show the system behavior under a sustained entry rate and, finally, we compare the performance of our storage with that of a hard disk, the most common means of storage.

4.1 Network Latency

The goal of the experiment is to determine the lowest possible latency that can be achieved

using chain replication and broadcast. We only account for the network delay in our computation, as the processing time at each node depends on the implementation. In this experiment, we measure the average transfer time for different entry sizes between two machines located within the same rack and in different racks of a datacenter. We conducted this experiment before the design stage

to help us with our design decisions.

To estimate the network latency of each replication technique, we first measure the transfer

time from one machine to another. In Table 4.1, each latency column contains two numbers (first/second). The first number indicates the elapsed time starting at the point where the application starts writing to the socket and ending when it receives an acknowledgement from the receiver; the receiver sends the acknowledgement after reading and discarding the entry. The second number simply excludes the time for the acknowledgement from the receiver to the sender.

Size     Latency within one rack (microseconds)     Latency across racks (microseconds)
256 B    107/62                                     240/130
512 B    138/93                                     285/175
1 KB     178/133                                    357/247

Table 4.1: RPC latency for different packet sizes within a datacenter.

To compute the second number, we

need to know the time required for the transmission of an acknowledgement message. We estimate this time by measuring the round-trip time of an ICMP packet, sending a ping message from one server to the other and dividing it by two. The transmission time of a 50-byte ICMP packet from one server to another within the same rack is 45 microseconds, and across racks it is 110 microseconds.

Using the measured latencies in Table 4.1 and the estimated time for an acknowledgement, we now analytically compute the expected network latency for chain replication and broadcast. Our computation assumes a replication factor of three and a packet size of 256 bytes (a typical write query in social network platforms is limited to a few hundred bytes; for example, Twitter limits tweets to 140 characters (33)). For chain replication, we assume the tail is in a different rack, and for broadcast, that one of the slaves resides in a different rack from the other servers. In a chain of three servers, an entry is sent from the client to the head server (62µs); the head server then reads the entry and sends it to the middle server (62µs). Similarly, the middle server reads the entry and sends it to the tail (130µs). Finally, the client receives an acknowledgement from the tail (110µs). In total, chain replication with a replication factor of three (performed across two racks) and an entry size of 256 bytes yields a total latency of 364 microseconds (this time excludes the computation time at each node).

In atomic broadcast, an entry is first transferred to the master (62µs), and then the master broadcasts the entry to the slaves (130µs for the slave in a different rack). Once the master receives the acknowledgements from all the slaves (110µs), it can acknowledge the message to the client (45µs). This yields a total latency of 347 microseconds.
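Collecting the per-hop figures above (replication factor three, 256-byte entries, times in microseconds), the two estimates reduce to simple sums:

\begin{align*}
T_{\mathrm{chain}} &= 62 + 62 + 130 + 110 = 364\ \mu\mathrm{s},\\
T_{\mathrm{broadcast}} &= 62 + 130 + 110 + 45 = 347\ \mu\mathrm{s}.
\end{align*}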

The network latency of chain replication and atomic broadcast is nearly the same for a replication factor of three. However, the bandwidth utilization of the master node in atomic broadcast


is 50% more since the master receives an entry and sends it to two slaves. For higher replication

factors, the master node can easily become a bottleneck as it consumes more bandwidth and

CPU power to transmit the packets compared to chain replication. In chain replication, every node consumes a similar amount of resources, and increasing the replication factor only increases the latency without impacting the throughput.

4.2 Stable Storage Performance

As explained in Chapter 3, our stable storage consists of several stable storage units. The following experiments evaluate the performance of a single stable storage unit. In these experiments, we set TCP_NODELAY to true to ensure that TCP does not batch small packets and that they are sent immediately. This enables us to measure the actual performance of the system for small packets. In the experiments, large chunks of entries are periodically written to disk, so the maximum disk bandwidth is utilized. This prevents the disk from becoming a bottleneck in the following experiments (the disk bandwidth is higher than the network bandwidth available to our servers).
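For reference, this setting corresponds to disabling Nagle's algorithm on the connection; a plain-socket equivalent is sketched below (the prototype presumably configures the same flag through Netty's channel options).

import java.io.IOException;
import java.net.Socket;

// Illustrative plain-socket form of the TCP_NODELAY setting used in the experiments.
public class NoDelaySocket {
    public static Socket open(String host, int port) throws IOException {
        Socket socket = new Socket(host, port);
        socket.setTcpNoDelay(true); // send small entries immediately instead of batching them
        return socket;
    }
}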

To measure the performance, we use a single client and increase the load on the storage

unit by increasing the client window size, ranging from a window size of one (synchronous) up to one hundred outstanding entries. The results are averaged over three runs of 200,000 write

operations. All the following experiments are performed within the same rack and there is no

cross-rack traffic. Placement of servers in different racks adds a constant delay to the latency.

4.2.1 Impact of Log Entry Size

The objective of this experiment is to measure the performance of our stable storage unit

and the performance that a single client can expect from the stable storage.

In order to measure the lowest latency a client can expect from the system, we run an experiment using a single client with a window size of one and a replication factor of three. As can be seen in Table 4.2, a client can write entries ranging from 200 bytes to 4 KB with a latency of less than one millisecond. Figure 4.1 shows the performance of the storage unit with different log entry sizes as the window size increases. For an entry size of 200 bytes, the throughput increases up


to 34600 entries/sec, and for 4 KB entries the throughput increases up to 7200 entries/sec. Moreover, for an entry size of 200 bytes, depending on the throughput, latencies vary from 0.45 ms to 2.5 ms, and for an entry size of 4 KB they vary from 1 ms to 3 ms.

The results indicate that for smaller log entries (a few bytes to hundreds of bytes) the performance does not vary significantly, but larger packet sizes notably degrade both the throughput and the latency. For a packet size of 200 bytes at its highest throughput, 34600 entries/sec, the CPU utilization of each server increases to nearly 300 percent (three threads running on each server), while the network bandwidth utilization of each server stays below 100 Mb/s. This shows that our system is CPU-bound for small packet sizes due to the high overhead of packet processing. Another contributor to the high CPU usage is the way the buffer has been implemented: the current implementation of the buffer uses hashing to locate the entries and does not pre-allocate memory. A more sophisticated buffer implementation would certainly increase the throughput.

For larger packet sizes (4 KB and 16 KB), the CPU utilization comes close to the saturation point, although it does not become fully saturated. The reason is that our system is network-bound for such large entries. For the highest throughput of 4 KB entries, 7900 entries/sec, the network bandwidth usage of each server (except the tail) exceeds 500 Mb/s. This is the highest network bandwidth each of our servers can utilize using a single thread writing to a single channel. In the current implementation, servers are connected using a single socket connection, and a single connection is not able to utilize the full bandwidth (37). To fully utilize the available bandwidth, we would need to increase the number of concurrent connections between the servers. However, this would cause the servers in a chain to receive and persist the messages in different orders and might complicate the read operation. Another observation from Figure 4.1 is that for large entry sizes the throughput in terms of the number of entries is lower, although the network bandwidth is utilized more efficiently. To benefit from this property of the system, clients can batch small entries before sending them to the storage unit. This also lowers the overhead of TCP/IP packet headers (at the cost of higher latency).

4.2.2 Impact of Replication Factor

This experiment determines the behavior of the system with different replication factors and the additional latency caused by an extra replica.


Entry size (bytes) Throughput (entries/sec) Latency (ms)

5 2440 0.39

200 2200 0.45

1024 1600 0.62

4096 971.52 0.99

Table 4.2: Latency and throughput for a single client synchronously writing to a stable storage unit.

Figure 4.1: Throughput vs. Latency graph for our stable storage unit for different entry sizes with a replication factor of three.

Figure 4.2 shows the impact of the replication factor on throughput and latency. As can be seen from the figure, increasing the

replication factor from two to three does not impact the throughput but only the latency. For an entry size of 200 bytes and throughput ranging from 5000 to 34000 entries/sec, the latency difference lies between 100µs and 350µs. This behavior is expected since, in chain replication, each server handles the same load regardless of the chain's length; thus, the number of servers in a chain does not impact the throughput (as long as none becomes a bottleneck). Adding an extra server to a chain only increases the latency of a request by a sub-millisecond amount, since the distance of the tail from the client increases. A larger number of replicas provides higher reliability and availability as well as higher read throughput. These benefits can be obtained with sub-millisecond additional latency and without degrading the write throughput.


Figure 4.2: Throughput vs. Latency for a stable storage unit with replication factors of two and three for a log entry size of 200 bytes.

4.2.3 Impact of Persistence on Disk

This experiment shows how the system benefits from asynchronous logging to improve performance. Figure 4.3 shows how our asynchronous approach eliminates the impact of the low performance of the disk on the write performance of stable storage units. The red line shows the storage unit performance when it only performs in-memory chain replication, meaning that entries are not persisted on disk and every server removes a log entry from its memory once it is sent to the next hop. The blue line shows the storage unit performance when it removes the entries only after persisting them to disk. As can be seen, the storage unit performance is identical in both cases. This also indicates a correct implementation of asynchronous logging, as disabling persistence to disk does not impact the latency. Nevertheless, the high performance lasts only up to the point at which none of the servers' buffers in the chain is filled. The next experiment investigates the system behavior when the servers' buffers are full.

4.3 Load Test

Asynchronous logging of entries provides a significant performance gain. However, under a sustained load with an average throughput higher than the hard disk write throughput, the buffer eventually fills up and the performance drops to that of the slowest disk in the chain. In this experiment, we explore the behavior of our system in such a situation.

Figure 4.3: Throughput vs. Latency of a stable storage unit for log entries of 200 bytes, when persistence to local disk is enabled and disabled.

To conduct the experiment, the replication factor of the storage unit is set to three, each server has a buffer size of 100,000 entries, and writes are forced to disk upon every five writes. As a side note, in the current implementation the buffer capacity is defined in terms of the number of entries, and this number is configurable. This is fine for the purpose of prototyping; however, in a more sophisticated system this limit should be imposed on the number of bytes.
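A minimal sketch of the "force every N writes" policy used here (N = 5 in this experiment) is shown below; the class name and file handling are assumptions for illustration.

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

// Hypothetical writer that forces buffered data to the physical device every N appends.
public class BatchedForceWriter {
    private final FileOutputStream out;
    private final FileChannel channel;
    private final int forceEvery;
    private int pending = 0;

    public BatchedForceWriter(String path, int forceEvery) throws IOException {
        this.out = new FileOutputStream(path, true); // append-only log file
        this.channel = out.getChannel();
        this.forceEvery = forceEvery;
    }

    public void write(byte[] entry) throws IOException {
        out.write(entry);
        if (++pending >= forceEvery) {
            channel.force(false); // flush file data (not metadata) to the device
            pending = 0;
        }
    }
}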

Figures 4.4 and 4.5 illustrate how the latency and throughput of the system change when the buffer fills up under sustained load. The horizontal axis indicates the number of entries sent to the storage unit, starting from a point in time before the buffer is saturated; it can be thought of as the passage of time. As can be seen in Figure 4.4, when the buffer is filled the throughput drops dramatically. The throughput for entry sizes of 200 bytes, 1 KB and 4 KB drops from 11100, 8770 and 5540 entries/sec to 2741, 2328 and 1479 entries/sec, respectively. The latter throughput numbers (after the buffer is saturated) are nearly equal to the throughput of the hard disk for entry sizes equal to the total size of the batched entries (each batch consists of five entries in this experiment). Figure 4.5 illustrates a similar effect on latency. For entry sizes of 200 bytes, 1 KB and 4 KB, the latency increases from 0.75, 1.17 and 3.01 ms to 3.01, 4.02 and 15.10 ms once the buffer is filled. The reason for this significant rise in latency after the saturation of the buffer is that an entry can be written to the buffer only when another entry (the head of the Persistence Queue) is written to disk and removed from the buffer. In short, the write latency of the stable storage unit is tied to the latency of the hard disk once the buffer is filled.

Figure 4.4: Throughput of stable storage unit under sustained load.

Figure 4.5: Latency of stable storage unit under sustained load.

4.4 Durability and Performance Comparison

We have described several existing systems and approaches towards durability in Section 2.2. One common approach is to persist log entries on the local hard disk (as Redis does). To ensure every entry is written to disk, we need to force a flush upon every write. However, this does not guarantee the durability of data on commodity hardware for all failure cases. Hard disks have a cache layer that buffers the data and writes it to the disk periodically. Data


written to disk’s buffer survives the process crash; however this data will be lost in the case of

power outage or hardware failure. To guarantee the persistence of data in all the cases, disk

cache must be disabled but this is not acceptable as the disk performance drops significantly.

Using redundent array of inexpensive disks (RAID) write/read throughput can be improved (38)

and also disk failures can be tolerated (e.g. level 1 RAID), but this still does not address the

power outage problem.

Our stable storage persists entries to disk and, in order to address process crashes and power failures, replicates the entries on several servers to guarantee the durability of every entry.

To establish a baseline, we have conducted an experiment to measure the performance of a hard disk in terms of the latency of append-only operations. Using RAID, the throughput can be multiplied, but the latency does not improve. Figure 4.6 compares the latency of our stable storage unit with that of the hard disk. As can be seen, for an entry size of 200 bytes the storage unit outperforms the disk by a factor of 4 even when the disk cache is enabled, and by a factor of 2 for entry sizes of 1 KB and 4 KB. Disabling the disk cache increases the disk latency from 2 ms to almost 50 ms; with the cache disabled, the disk performs dozens of times slower than the stable storage unit. Moreover, our stable storage also provides higher reliability and availability by replicating data on several remote servers. This can also be used to increase the read throughput by aggregating the throughput of the servers hosting the replicas, hence accelerating the recovery process.


Figure 4.6: Performance comparison of stable storage unit and hard disk.


5 Conclusions

5.1 Conclusions

Memory-based databases outperform disk-based databases, but their superior performance is hindered by the volatility of DRAM and the consequent data loss at the time of failure. Existing techniques to provide durability, such as checkpointing and write-ahead logging to hard disk, either do not guarantee the persistence of the entire data or result in significant performance degradation. Moreover, full in-memory replication of database state becomes costly for large deployments. We have proposed an approach to provide durability to memory-based key-value stores by creating a high-performance stable storage system. The system consists of a set of stable storage units capable of storing log entries for every write operation to a memory database with low response time. Each stable storage unit ensures the persistence of each log entry by replicating the entry in the memory of several servers prior to acknowledging the operation. Entries are stored in the memory of each server and asynchronously written to disk in an append-only fashion. We apply chain replication by forming a pipeline of servers, achieving high write throughput. Our stable storage unit implements the fail-stop model and, in case of failure, its clients simply switch to another storage unit without significant disruption of operation.

The evaluation results demonstrate that our stable storage system can provide durable writes with low latency, while providing high data availability. For log entry sizes of 200 bytes to 4 KB and a replication factor of three, we achieve write latencies of less than one millisecond and a maximum throughput of 7900 entries/sec for 4 KB entries and 34600 entries/sec for 200-byte entries. An additional replica can be added to a stable storage unit with a negligible increase in latency and no impact on the system throughput. The evaluation results also show that, in terms of latency, our approach outperforms conventional write-ahead logging to disk.

We believe our solution enables memory-based databases to achieve durability while preserving

their high performance.


5.2 Future Work

The current implementation supports the write operation to stable storage. The design and implementation of the read operation for recovery purposes is part of the future work. There are several possible extensions to the current system. Rack-awareness can be incorporated into the server clustering decisions (used to create stable storage units) to enhance the reliability and availability of the storage, as well as to provide more efficient bandwidth utilization. The current load balancing decisions are made before the creation of storage units and by distributing clients amongst the stable storage units; an extension would be to enable the system to adapt to changes in server load when the servers share resources. Another extension is to provide a self-healing mechanism for our stable storage units, such that a failed server within a storage unit can be replaced to resume the service. Lastly, we are exploring the memory databases that could benefit from our system to gain durability and performance.


References

[1] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazieres, S. Mi-

tra, A. Narayanan, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and

R. Stutsman, “The case for ramclouds: scalable high-performance storage entirely in

dram,” SIGOPS Oper. Syst. Rev., vol. 43, pp. 92–105, Jan. 2010.

[2] “Apache BookKeeper presentation - apache software foundation.”

https://cwiki.apache.org/confluence/display/bookkeeper/BookKeeper+presentations,

June 2012.

[3] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The hadoop distributed file system,”

in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on,

pp. 1–10, 2010.

[4] D. Ongaro, S. Rumble, R. Stutsman, J. Ousterhout, and M. Rosenblum, “Fast crash recov-

ery in ramcloud,” in Proceedings of the Twenty-Third ACM Symposium on Operating

Systems Principles, pp. 29–41, ACM, 2011.

[5] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra,

A. Fikes, and R. E. Gruber, “Bigtable: A distributed storage system for structured

data,” ACM Trans. Comput. Syst., vol. 26, pp. 4:1–4:26, June 2008.

[6] “Redis: Remote dictionary server.” http://redis.io, June 2012.

[7] “Couchbase.” http://www.couchbase.com/, June 2012.

[8] “memcached - a distributed memory object caching system.” http://memcached.org/, June

2012.

[9] J. Sobel, “Scaling out.” http://www.facebook.com/note.php?note_id=23844338919, June

2012.

[10] H. Lu, Y. Y. Ng, and Z. Tian, “T-tree or b-tree: Main memory database index structure

revisited,” in Database Conference, 2000. ADC 2000. Proceedings. 11th Australasian,

pp. 65–73, 2000.


[11] H. Garcia-Molina and K. Salem, “Main memory database systems: An overview,” Knowl-

edge and Data Engineering, IEEE Transactions on, vol. 4, no. 6, pp. 509–516, 1992.

[12] A. S. Tanenbaum and M. Van Steen, Distributed systems: principles and paradigms. 2002.

second ed.

[13] J. Duell, “The design and implementation of berkeley lab’s linux checkpoint/restart,” 2005.

[14] L. Alvisi and K. Marzullo, “Message logging: Pessimistic, optimistic, causal, and optimal,”

Software Engineering, IEEE Transactions on, vol. 24, no. 2, pp. 149–159, 1998.

[15] X. Defago, A. Schiper, and P. Urban, “Total order broadcast and multicast algorithms:

Taxonomy and survey,” ACM Computing Surveys (CSUR), vol. 36, no. 4, pp. 372–421,

2004.

[16] R. van Renesse and F. B. Schneider, “Chain replication for supporting high throughput and

availability,” in Proceedings of the 6th conference on Symposium on Opearting Systems

Design & Implementation - Volume 6, OSDI’04, (Berkeley, CA, USA), pp. 7–7, USENIX

Association, 2004.

[17] “DRAM and memory system trends.” www.research.ibm.com/ismm04/slides/woo.pdf,

June 2012.

[18] J. Dean, “Large-scale distributed systems at google: Current systems and future directions,”

2009.

[19] A. Jacobs, “The pathologies of big data,” Communications of the ACM, vol. 52, no. 8,

pp. 36–44, 2009.

[20] D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, and A. Rowstron, “Everest: Scaling

down peak loads through I/O off-loading,” in Proceedings of the 8th USENIX conference

on Operating systems design and implementation, pp. 15–28, 2008.

[21] “Introduction to NetApp infinite volume.” http://www.netapp.com/templates/mediaView?m=tr-

4037.pdf&cc=us&wid=159193675&mid=78187885, June 2012.

[22] “HP smart array controller technology.” http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00687518/c00687518.pdf.

[23] “Documentation of redis.” http://redis.io/documentation, June 2012.

[24] “Apache BookKeeper.” http://zookeeper.apache.org/bookkeeper/, June 2012.

[25] “Replication - MongoDB.” http://www.mongodb.org/display/DOCS/Replication, June

2012.


[26] “Redis persistence a redis.” http://redis.io/topics/persistence, June 2012.

[27] “HBase - Hdfs Sync Support - hadoop wiki.” http://wiki.apache.org/hadoop/Hbase/HdfsSyncSupport,

June 2012.

[28] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, “Zookeeper: wait-free coordination

for internet-scale systems,” in Proceedings of the 2010 USENIX conference on USENIX

annual technical conference, USENIXATC’10, (Berkeley, CA, USA), pp. 11–11, USENIX

Association, 2010.

[29] B. Reed and F. P. Junqueira, “A simple totally ordered broadcast protocol,” in Proceedings

of the 2nd Workshop on Large-Scale Distributed Systems and Middleware, LADIS ’08,

(New York, NY, USA), pp. 2:1–2:6, ACM, 2008.

[30] P. Saab, “Scaling memcache at facebook.” http://www.facebook.com/note.php?note_id=39391378919,

Dec. 2008.

[31] M. Kwiatkowski, “Memcache at facebook.” qcontokyo.com/, 2010.

[32] S. Ding, K. Lai, and D. Wang, “A study on the characteristics of the data traffic of online

social networks,” in Communications (ICC), 2011 IEEE International Conference on,

p. 15, 2011.

[33] J. K. Todd Fast, “Twitter data - a simple, open proposal for embedding data in twitter

messages - home.” http://twitterdata.org/, June 2009.

[34] K. Rezahanjani, “Source code - high performance stable storage.”

https://github.com/Kiarashrezahanjani/in-memory-chain-replication/tree/complete,

June 2012.

[35] “Netty - the java NIO client server socket framework - JBoss community.”

http://www.jboss.org/netty, June 2012.

[36] “Protocol buffers - google’s data interchange format.” http://code.google.com/p/protobuf/,

June 2012.

[37] “Performance comparison between NIO frameworks.” http://gleamynode.net/articles/2232/,

Oct. 2008.

[38] D. Patterson, P. Chen, G. Gibson, and R. Katz, “Introduction to redundant arrays of inex-

pensive disks (raid),” in COMPCON Spring’89. Thirty-Fourth IEEE Computer Society


International Conference: Intellectual Leverage, Digest of Papers., pp. 112–117, IEEE,

1989.