NoSQL Database for Software Project Data

Anna Björklund

January 18, 2011

Master's Thesis in Computing Science, 30 credits

Supervisor at CS-UmU: Ola Ågren

Examiner: Fredrik Georgsson

Umeå University

Department of Computing Science

SE-901 87 UMEÅ

SWEDEN

Abstract

The field of databases has exploded in the last couple of years. New architectures try to meet the need to store more and more data and new kinds of data. The old relational model is no longer the only way, and the NoSQL movement is not a trend but a new way of making the database fit the data, not the other way around.

This master's thesis aims to find an efficient and well-designed solution for storing and retrieving huge amounts of software project data at Tieto. It starts by looking at different architectures and trying three of them to see if any can solve the problem. The three databases selected are the relational database PostgreSQL, the graph database Neo4j and the key value store Berkeley DB. All three are implemented behind a Web service and time is measured to find out which, if any, can handle the data at Tieto. In the end it is clear that the best database for Tieto is Berkeley DB. Even though Neo4j is almost as fast, it is still new and not as mature as Berkeley DB.

Contents

1 Introduction
    1.1 Paper outline

2 Modern Databases
    2.1 A brief history
    2.2 The CAP Theorem
        2.2.1 ACID v. BASE - two different ways of achieving partitioning
    2.3 Storing data today
        2.3.1 Column store
        2.3.2 Key value store
        2.3.3 Document store
        2.3.4 Graph database

3 The problem at Tieto
    3.1 The data
    3.2 The Questions
    3.3 The databases

4 The solutions
    4.1 PostgreSQL
        4.1.1 The questions
        4.1.2 Strengths and weaknesses
    4.2 Neo4j
        4.2.1 The questions
        4.2.2 Strengths and weaknesses
    4.3 Berkeley DB
        4.3.1 The questions
        4.3.2 Strengths and weaknesses

5 Results

6 Conclusions
    6.1 Future work

7 Acknowledgements

References

A Data from test runs
    A.1 Server times
    A.2 Client times

List of Figures

3.1 An overview of the data

3.2 An example of how a property is shared between different nodes

3.3 An example of a consistent tree

3.4 An example of an inconsistent tree

4.1 An example of how a shared property gets duplicated in Berkeley DB

5.1 Total time for the client.

5.2 Client times for 10% of the data without PostgreSQL

5.3 Total time for the server.

5.4 Client times for question 1-6

5.5 Total time for the client for question 7 at different amounts of data

5.6 Client times for question 8-13

List of Tables

2.1 An example of data organized in a table

4.1 The table returned by PostgreSQL for the inconsistent example

A.1 The server times for 0.1% of the data. Time in milliseconds.

A.2 The server times for 1% of the data. Time in milliseconds.

A.3 The server times for 10% of the data. Time in milliseconds.

A.4 The server times for 100% of the data. Time in milliseconds.

A.5 The server times for the hash table. Time in milliseconds.

A.6 Client times for all test runs at 0.1%. Time in seconds.

A.7 Client times for all test runs at 1%. Time in seconds.

A.8 Client times for all test runs at 10%. Time in seconds.

A.9 Client times for all test runs at 100%. Time in seconds.

Chapter 1

Introduction

Today there exist many different types of databases, not only the traditional relational SQL database but several other architectures designed to handle different types of data. From the 70s and into the new millennium the relational model was dominant and almost all databases followed the same basic architecture. At the beginning of the new millennium developers started to realize that their data did not fit the relational model, and some of them started to develop other architectures for storing data in databases. Choosing a database today is much more complex than deciding on a vendor for a relational database; the main problem is deciding which storage architecture is best suited for the data. When that decision is made it is time to choose a vendor that meets the company's requirements regarding price, reliability and so forth. This paper looks at three different database solutions for software project data at Tieto Umeå and compares them. First a theoretical comparison is made, and then all three are implemented and tested to see which is fastest on the real data.

1.1 Paper outline

Chapter 2 begins with a brief history and then takes a deeper look at the different solutions for data storing that exist today.

Chapter 3 takes a deeper look at the problem at Tieto, the data they have and how this data fits different architectures.

Chapter 4 describes the three different solutions implemented and where the strengths and weaknesses lie in each solution from a theoretical point of view.

Chapter 5 presents the result of the implementation with extra attention to performance and the specific requirements from Tieto.

Chapter 6 addresses what is left to do and how Tieto can move forward with this.

Chapter 2

Modern Databases

2.1 A brief history

In the 70s databases were a growing field and there was some debate on how to organize the data. IBM developed System R, the first system to implement the Structured Query Language (SQL). System R is the foundation for many of today's popular DBMSs (Database Management Systems) [9]. The hardware of the 70s and 80s was very different from today's. Today processors are thousands of times faster, memory is thousands of times larger and the main bottleneck is the bandwidth between disk and main memory. The main market for RDBMSs (Relational DBMSs) in those days was business data processing; today there are many different markets with completely different requirements. Yet another difference is the user interface: in the beginning there was a text terminal and today there is a graphical interface. Despite the changes in requirements and hardware the relational model was the dominant one until the beginning of the new millennium. At that time developers started to think outside the box and realized that they had data that did not fit the relational model. Several started to develop different ways to organize their data depending on their specific needs. There were some products, but most of them were only available within the company and for a specific solution.

The phrase NoSQL was first used in 1998 as the name of a lightweight relational database that did not expose an SQL interface. In early 2009 it was reused by the organizers of an event to discuss open source distributed databases, as a reference to the naming convention of traditional relational databases such as MySQL and PostgreSQL. Today the expression is often read as Not only SQL and denotes the movement towards database solutions other than the relational database. The idea is not that relational databases are bad or wrong, just that in some cases the relational model isn't enough. If the relational model fits the data then it is a good idea to use it. But if the data does not fit the relational model it is worthwhile to look at another type of database. The two main disadvantages of RDBMSs are that they do not scale easily (Section 2.2 explains why) and that they often fail at capturing the relations between the data. Only a few years ago these were not big problems, but the amount of data in store today is vastly larger than only ten years ago. The continuing trends of cloud computing and growth of social networks will only fuel the need for large data stores even more.

2.2 The CAP Theorem

To be able to discuss the different database solutions that exist today it is important to have an understanding of the CAP theorem [3]. The CAP theorem states that it is impossible for a Web service to guarantee all three of the following properties:

Consistency – all clients have the same view of the same data at all times

Availability – all clients can always read and write

Partition-tolerance – the system works well despite physical network partitions

All three are desirable for all Web services, but at PODC 2000 Brewer [3] made the conjecture that it is impossible to have all three; a Web service can choose at most two of the three. In 2002 Gilbert and Lynch proved that Brewer was right for an asynchronous network, which is a network where the nodes do not share a clock but have to rely on the messages that are sent between them. Since this is the case for most Web services it has a major impact on the choice of model for storing data. The CAP theorem states that any database solution can only fulfill two of the criteria and that it is up to the architecture to choose which two.

Most relational databases can promise consistency and availability, and this is good for smaller systems. If this is the main goal, the data fits the relational model and there are no requirements on uptime, then the relational model is a good choice. If there are requirements on uptime or the data is massive, it might be necessary to partition the data between several nodes and compromise on one of the other two properties. A single node can never guarantee a given uptime, and for some companies this is so important that they can tolerate a database that is inconsistent at times in order to guarantee availability. One important note is that inconsistency here does not mean that the data is wrong; it only means that the database cannot guarantee that every node has the exact same picture of the data at all times. It does guarantee that all nodes will have the same picture at some point in time. This is referred to as eventual consistency: as the term implies, the database will become consistent at some time but is not consistent all the time.

2.2.1 ACID v. BASE - two different ways of achieving partitioning

If a database needs to be physically partitioned then the CAP theorem states that it needs to give up either A (availability) or C (consistency). ACID (Atomicity, Consistency, Isolation, Durability) and BASE (Basically Available, Soft state, Eventually consistent) are two different ways of doing this. ACID and BASE are not databases but organisation schemes that give guidelines for how a database can operate to be as good as it can be on the third criterion.

In 1981, Jim Gray [4] proposed how a partitioned database could guarantee consistency by making sure updates were done in transactions that followed some given guidelines. Today transactions are something natural and most databases support them; some even demand them for some types of update. But in 1981, when Jim Gray reinvented transactions, they were something new, and it is on that foundation most systems are built today. The properties of a transaction are

Consistency – a transaction only commits if it preserves the consistency of the database

Atomicity – a transaction either commits or not, it acts as an atomic operation

Durability – once a transaction is committed, it cannot be rolled back

Isolation – no other transaction can see the events in a non-committed transaction

In the original paper there was no I; it was added in 1983 by Andreas Reuter and Theo Haerder [5] to form the acronym ACID (Atomicity, Consistency, Isolation, Durability). In 1998 Jim Gray was awarded the Turing Award¹ for seminal contributions to database and transaction processing research and technical leadership in system implementation [14]. Gray's rules guarantee that a database stays consistent even when partitioned, but according to the CAP theorem the price for this must be availability. Another problem with the rules is performance; keeping the data consistent has a cost. As a solution to these problems Dan Pritchett of eBay [7] suggested that trading some consistency for availability can lead to dramatic improvements in scalability. His solution has the acronym BASE (Basically Available, Soft state, Eventually consistent) and uses functional partitioning as a method of partitioning the data. Functional partitioning depends on the actual data stored, and for some systems this technique will not work well. It allows some data to be inconsistent between different partitions for some period of time and uses persistent messaging to make the data consistent at a later point. The main point is to allow some of the data to be inconsistent at some times but not all the time, hence the eventually consistent part of the acronym.
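The eventually consistent idea can be illustrated with a minimal Java sketch. It is not taken from the BASE paper: the class name, the two-replica setup and the in-memory queue that stands in for persistent messaging are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: two replicas of the same data. A write is applied to
// replica A at once and forwarded to replica B through a message queue.
class EventualConsistencySketch {
    private final Map<String, String> replicaA = new HashMap<String, String>();
    private final Map<String, String> replicaB = new HashMap<String, String>();
    // Stands in for persistent messaging; a real system would use a durable queue.
    private final Queue<String[]> pending = new ArrayDeque<String[]>();

    // The write is accepted as soon as replica A has stored it.
    void write(String key, String value) {
        replicaA.put(key, value);
        pending.add(new String[] { key, value });
    }

    String readFromA(String key) {
        return replicaA.get(key);
    }

    String readFromB(String key) {
        return replicaB.get(key);
    }

    // Delivers all queued updates to replica B in the order they were written.
    void deliverPending() {
        while (!pending.isEmpty()) {
            String[] update = pending.remove();
            replicaB.put(update[0], update[1]);
        }
    }

    public static void main(String[] args) {
        EventualConsistencySketch db = new EventualConsistencySketch();
        db.write("balance", "100");
        // Before delivery the replicas can disagree.
        System.out.println("A=" + db.readFromA("balance") + " B=" + db.readFromB("balance"));
        db.deliverPending();
        // After delivery both replicas hold the same value.
        System.out.println("A=" + db.readFromA("balance") + " B=" + db.readFromB("balance"));
    }
}
```

Between `write` and `deliverPending` a reader of replica B sees stale data; that window is the price paid for accepting the write without waiting for both replicas.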

The notion of having inconsistent data, even if only for a moment, is very scary to some computer scientists. The important thing is to choose which data to allow inconsistency on and to partition the system accordingly. This is something we all come in contact with at some point in our lives; for example, when we pay with a credit card it takes a day or two before the payment can be seen on the bank statement. Another example is Amazon and their solution Dynamo [2]. They risk losing millions in revenue if customers cannot access their web store at all times, because they have customers all around the world. When it is night in one part of the world it is daytime in another, and millions of potential customers choose between Amazon and another online book store. If Amazon tolerated downtime on parts of the store, the word would spread and they would risk losing reputation and customers. Because of this they tolerate that different nodes have different views of some of the data for short periods of time.

It is worth mentioning that not all non-relational databases operate in the same space of the CAP theorem and there is no clear way of saying that a specific type of NoSQL database is in any specific area. Today there are several solutions that operate in different areas of the CAP theorem and the same database can exist in different areas depending on configuration.

2.3 Storing data today

Today there exist some famous non-relational database systems: Google's Bigtable, Amazon's Dynamo and Cassandra (used by Facebook and Twitter), to name a few. There are also several open source solutions of varying quality. There is no reason to choose an RDBMS and then try to fit the data into it; instead there is money and time to be saved by choosing carefully and finding the model that fits the data best. The term NoSQL does not denote a specific type of database; it covers several different types of non-relational databases that all have different characteristics and are suitable for different types of data and different situations.

¹ The Turing Award is recognized as the "highest distinction in computer science" and the "Nobel Prize of computing".

2.3.1 Column store

Most data today is organized in tables; this is a model that suits lots of data and is easy to understand for most humans.

Name            Birth date   Address          Zipcode   City
John Svensson   1976-02-10   Nygatan 12       123 94    Örebro
Malin Olsson    1986-09-23   Storgatan 54     345 19    Göteborg
Ove Nykvist     1967-05-02   Hammargränd 2    735 12    Sundsvall

Table 2.1: An example of data organized in a table

The data cannot be stored in two dimensions on disk since a disk is accessed sequentially. The traditional way for an RDBMS to organize the data is in records that are placed contiguously in storage. This row-oriented architecture gives fast writes and is called write optimized.

John Svensson, 1976-02-10, Nygatan 12, 123 94, Örebro

Malin Olsson, 1986-09-23, Storgatan 54, 345 19, Göteborg

Ove Nykvist, 1967-05-02, Hammargränd 2, 735 12, Sundsvall

This is optimized for systems that do lots of writes, but it does not work well for systems that handle few writes with lots of data in each write and lots of querying in between the writes. In that case a read-optimized system is better suited, and one way to achieve this is a column-oriented organization [10]. In a column store the data is stored in columns instead, making it faster to read a particular column into memory and to make calculations on all values in a column.

John Svensson, Malin Olsson, Ove Nykvist

1976-02-10, 1986-09-23, 1967-05-02

Nygatan 12, Storgatan 54, Hammargränd 2

123 94, 345 19, 735 12

Örebro, Göteborg, Sundsvall

The values of each column are stored together, and when querying it is not necessary to read unimportant columns into memory, which makes some types of operations faster. One disadvantage of this type of storage is that it makes joins very time consuming, and some column stores do not support join operations on the data. Cassandra is one of the most famous of the wide column stores and is used by both Facebook and Twitter; Twissandra is a Twitter-like example application built on Cassandra. Another famous one is Google's Bigtable.
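The difference between the two layouts can be sketched in a few lines of Java. This is only an in-memory illustration using three of the attributes from Table 2.1; the class and method names are hypothetical, and a real column store works on disk blocks rather than arrays.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the same three records in a row layout and a column layout.
class ColumnLayoutSketch {
    // Row-oriented: each record (name, birth date, zipcode) is kept together.
    static final String[][] ROWS = {
        { "John Svensson", "1976-02-10", "123 94" },
        { "Malin Olsson", "1986-09-23", "345 19" },
        { "Ove Nykvist", "1967-05-02", "735 12" }
    };

    // Column-oriented: all values of one attribute are kept together.
    // Assumes at least one row and rows of equal length.
    static String[][] toColumns(String[][] rows) {
        int columnCount = rows[0].length;
        String[][] columns = new String[columnCount][rows.length];
        for (int r = 0; r < rows.length; r++) {
            for (int c = 0; c < columnCount; c++) {
                columns[c][r] = rows[r][c];
            }
        }
        return columns;
    }

    // Reading one attribute from the row layout visits every record.
    static List<String> readAttributeFromRows(String[][] rows, int attribute) {
        List<String> values = new ArrayList<String>();
        for (String[] row : rows) {
            values.add(row[attribute]);
        }
        return values;
    }

    // Reading one attribute from the column layout uses a single array.
    static List<String> readAttributeFromColumns(String[][] columns, int attribute) {
        return new ArrayList<String>(Arrays.asList(columns[attribute]));
    }

    public static void main(String[] args) {
        String[][] columns = toColumns(ROWS);
        System.out.println(readAttributeFromRows(ROWS, 1));
        System.out.println(readAttributeFromColumns(columns, 1));
    }
}
```

Both read methods return the same birth dates; the point is that the column layout finds them in one contiguous array, while the row layout has to step over the other attributes of every record.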

2.3.2 Key value store

A key value store stores anything as a key/value pair. The key is used to access the stored value and the stored value can be anything. This may seem very simple and it is, but only on the surface; the database engine handling the persistent data is often very advanced. The main advantage of this type of storage is that it is schemaless; in theory there are no constraints on the key or the value. In practice the key is usually some primitive of the programming language: a string, an integer or something like that. The value can be anything but is usually an object of the implementing language or a string. Another advantage is the speed and ease with which data can be stored in a persistent way. It is easy to support physical partitioning, and most key value stores support the eventually consistent idea behind BASE. The main disadvantage is that the inherent relations between data are lost, and since anything can be stored it is up to the client to interpret the data returned by the store.

The most famous of the key value stores is Amazon's Dynamo, which was discussed earlier. An open source alternative to Dynamo is Voldemort [13]. Another key value store is Berkeley DB, which is one of the databases used in this paper. More on that will follow later.
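That the store knows nothing about the value, and that the client has to interpret it, can be shown with a small Java sketch. It is a hypothetical in-memory store, not the Berkeley DB API; the `name;level` encoding is an assumption chosen only to have something to decode.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory key value store; it is not the Berkeley DB API.
class KeyValueSketch {
    // The store sees values only as opaque bytes and knows nothing about their structure.
    private final Map<String, byte[]> data = new HashMap<String, byte[]>();

    void put(String key, byte[] value) {
        data.put(key, value.clone());
    }

    // Returns a copy of the stored bytes, or null if the key is unknown.
    byte[] get(String key) {
        byte[] value = data.get(key);
        return value == null ? null : value.clone();
    }

    // Client-side encoding: the client decides that a node is stored as "name;level".
    static byte[] encodeNode(String name, String level) {
        return (name + ";" + level).getBytes(StandardCharsets.UTF_8);
    }

    // Client-side decoding: only the client knows how to split the bytes again.
    static String[] decodeNode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).split(";");
    }

    public static void main(String[] args) {
        KeyValueSketch store = new KeyValueSketch();
        store.put("node:17", encodeNode("C17", "C"));
        String[] node = decodeNode(store.get("node:17"));
        System.out.println(node[0] + " is on level " + node[1]);
    }
}
```

A query such as "all nodes on level C" cannot be answered by this store without reading and decoding every value, which is the lost-relations disadvantage described above.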

2.3.3 Document store

A document store is a special kind of key value store; it does not store the document (the value) as an opaque mass of data but uses information in the document to index it. Because of this there are demands on the data in the document; it has to be structured in some way. This is usually accomplished with XML, JSON or something else that the database can understand. This allows queries on the data and not just on the keys, as is the case for a key value store. It also allows for a much more flexible solution than an RDBMS since the database has no schema. There is no problem adding attributes to records after they have been inserted into the database, even if the attribute is something not even conceived at design time. Such flexibility is hard to achieve with a traditional RDBMS.

Some famous document stores are Raven [8], MongoDB [1] and CouchDB [12].
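As a rough illustration of how indexing on document content differs from a plain key value store, the following Java sketch keeps documents as maps of field names to values and builds a secondary index over every field. It is a hypothetical in-memory model and does not reflect the API of any of the products named above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory document store; documents are maps of field name to value.
class DocumentStoreSketch {
    private final Map<String, Map<String, String>> documents =
            new HashMap<String, Map<String, String>>();
    // Secondary index: field name -> field value -> keys of matching documents.
    private final Map<String, Map<String, List<String>>> index =
            new HashMap<String, Map<String, List<String>>>();

    // Stores a document and indexes every field it contains. No schema is
    // declared, so documents may carry different fields.
    void put(String key, Map<String, String> document) {
        if (documents.containsKey(key)) {
            // Simplification: updates are not supported in this sketch.
            throw new IllegalArgumentException("key already stored: " + key);
        }
        documents.put(key, new HashMap<String, String>(document));
        for (Map.Entry<String, String> field : document.entrySet()) {
            Map<String, List<String>> byValue = index.get(field.getKey());
            if (byValue == null) {
                byValue = new HashMap<String, List<String>>();
                index.put(field.getKey(), byValue);
            }
            List<String> keys = byValue.get(field.getValue());
            if (keys == null) {
                keys = new ArrayList<String>();
                byValue.put(field.getValue(), keys);
            }
            keys.add(key);
        }
    }

    // Query on the content of the documents, not on their keys.
    List<String> findByField(String field, String value) {
        Map<String, List<String>> byValue = index.get(field);
        if (byValue == null || !byValue.containsKey(value)) {
            return new ArrayList<String>();
        }
        return new ArrayList<String>(byValue.get(value));
    }

    public static void main(String[] args) {
        DocumentStoreSketch store = new DocumentStoreSketch();
        Map<String, String> first = new HashMap<String, String>();
        first.put("city", "Sundsvall");
        store.put("p1", first);
        Map<String, String> second = new HashMap<String, String>();
        second.put("city", "Sundsvall");
        second.put("phone", "555-0100"); // a field the first document does not have
        store.put("p2", second);
        System.out.println(store.findByField("city", "Sundsvall"));
        System.out.println(store.findByField("phone", "555-0100"));
    }
}
```

The second document adds a `phone` field that was never declared anywhere, and it becomes queryable immediately; this is the schema flexibility the section describes.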

2.3.4 Graph database

In a graph database the data is stored as nodes and edges between the nodes. The nodes have attributes or properties and the edges have types or names. The data is extracted by traversing the nodes and edges in different ways. Some vendors include some way of indexing the nodes for easy access. The main advantage of this type of storage is the possibility to traverse the nodes with known mathematical graph traversal algorithms. As with all these different ways of storing data it will only work if the data fits the model. If the data is tabular in nature with little or no relationship between nodes, this model will work poorly and it would be better to use another type of database.
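A breadth-first traversal is one of the well-known graph algorithms referred to here. The Java sketch below runs it over a small in-memory adjacency list; the class and the node names are hypothetical and it does not use the Neo4j API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory graph; it does not use the Neo4j API.
class GraphTraversalSketch {
    // Adjacency list: node name -> names of the nodes its outgoing edges point to.
    private final Map<String, List<String>> edges = new HashMap<String, List<String>>();

    void addEdge(String from, String to) {
        List<String> targets = edges.get(from);
        if (targets == null) {
            targets = new ArrayList<String>();
            edges.put(from, targets);
        }
        targets.add(to);
    }

    // Breadth-first traversal: every node reachable from start, in visiting order.
    List<String> reachableFrom(String start) {
        List<String> visited = new ArrayList<String>();
        Set<String> seen = new HashSet<String>();
        Deque<String> queue = new ArrayDeque<String>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            visited.add(node);
            List<String> targets = edges.get(node);
            if (targets == null) {
                continue; // a node without outgoing edges
            }
            for (String target : targets) {
                if (seen.add(target)) { // add() returns false if already seen
                    queue.add(target);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        GraphTraversalSketch graph = new GraphTraversalSketch();
        graph.addEdge("A1", "B1");
        graph.addEdge("A1", "B2");
        graph.addEdge("B1", "C1");
        graph.addEdge("B2", "C1"); // C1 is shared by B1 and B2
        System.out.println(graph.reachableFrom("A1"));
    }
}
```

The `seen` set ensures that a node shared by several parents, such as `C1` here, is visited only once.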

The notion of storing data in something other than an RDBMS is nothing new; there have been several projects for as long as there have been computers. In the last years there has been an exponential growth of data, and the need to use something other than an RDBMS has grown with it.

Chapter 3

The problem at Tieto

The database used at Tieto today has two major problems: it is designed to handle any type of data, and scalability was not an issue for the designers. This became a big problem when the amount of data grew much larger than anticipated at design time. One of the questions Tieto wanted answered is whether a more specific design of an RDBMS can help with scalability. At the same time they want to know if a different kind of database can do the job better. The choice of databases is described later; first a look at the data at Tieto.

3.1 The data

The content and the nature of the content of the data is a company secret. Therefore this thesis will only give a general schematic picture of the data and not use the correct names or labels.

The easiest way of describing the data is by using a graph. There are four different levels of nodes: A, B, C and D. There are no edges between nodes of the same level, only edges to nodes of the adjacent level. The information about the relationships lies entirely on the upper nodes. The C nodes have information about which D nodes they relate to, the B nodes have information about which C nodes they relate to, and so forth. A C node has no information about which B nodes it connects to. This is the nature of the data; in the implementations the relationships exist both ways. Because of this it is a requirement for all solutions that nodes are entered in the right order. If a B node is entered into the database before all the C nodes it relates to are there, the B node cannot be stored.
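The insertion-order requirement can be sketched as follows. This is a hypothetical in-memory model with invented names, not the code of any of the three implementations; it only shows the rule that a node is rejected when a node it refers to is missing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical store that enforces the insertion order described above.
class OrderedInsertSketch {
    // Node id -> ids of the nodes on the level below that it refers to.
    private final Map<String, List<String>> children = new HashMap<String, List<String>>();

    // Stores a node only if every node it refers to is already stored.
    // Returns false and stores nothing otherwise.
    boolean insert(String id, List<String> refersTo) {
        for (String child : refersTo) {
            if (!children.containsKey(child)) {
                return false;
            }
        }
        children.put(id, new ArrayList<String>(refersTo));
        return true;
    }

    boolean contains(String id) {
        return children.containsKey(id);
    }

    public static void main(String[] args) {
        OrderedInsertSketch store = new OrderedInsertSketch();
        // Wrong order: B1 refers to C1, which is not stored yet.
        System.out.println(store.insert("B1", Arrays.asList("C1")));
        // Right order: D first, then C, then B.
        System.out.println(store.insert("D1", Collections.<String>emptyList()));
        System.out.println(store.insert("C1", Arrays.asList("D1")));
        System.out.println(store.insert("B1", Arrays.asList("C1")));
    }
}
```

Loading the data bottom-up (all D nodes, then C, then B, then A) is the simplest way to satisfy the rule.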

Figure 3.1 is a picture to help understand the organization of the data and also gives an idea of how many nodes there are in each level. In total there are 4 million nodes, 40 million relationships between nodes and 100 million values.

This is a simplification of the real data. In the real data there are some edges from A to C, the nodes also have some metadata attached to them, and each node has a predecessor. These properties are removed from the data for this thesis to make the implementation a little easier. Since the main goal of this thesis is to implement and test different databases and see if they can handle the amount of data, these simplifications should not make the result differ too much from reality. The picture is an overview of how the data is organized; the B and C nodes are very similar and have properties that they share, which makes the picture a little more complicated. Figure 3.2 shows a small section of the big graph and illustrates how one of the properties of the data in the nodes makes them connected.

Figure 3.1: An overview of the data

A node has on average zero or one connection to properties, but the number varies a lot and some nodes have up to ten. The nature of the graph (the number of edges between the different levels, the number of nodes and the number of properties shared between the nodes) differs a lot depending on where in the graph the calculations are made. This makes it harder to implement an optimal solution. An algorithm that works nicely in one part of the graph may be a catastrophe in another part. These implementations try to be as good as possible for the majority of the graph but are not optimal for any one part of it.

Figure 3.2: An example of how a property is shared between different nodes

3.2 The Questions

The questions asked to the database can be divided into two different types. The first type simply gets a set of nodes with a given property. One or two properties are given and all nodes returned must have the given properties. In PostgreSQL a typical question is SELECT * FROM table WHERE table.a = X AND table.b = Y. These are called the simple questions without connections and there are nine of them in this solution. The other type is a set of more complicated questions where the connections between the nodes are explored. These questions are

– Return a sub tree of a given A

– Return all B that connect to a given C

– Check the difference between two A, return all nodes unique to A1, unique to A2 andall nodes A1 and A2 have in common in three different lists

– Check if a tree under A is consistent and if it is inconsistent return how it is inconsistent
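The third question in the list, the difference between two A, comes down to three set operations once the nodes under each A have been collected. The Java sketch below assumes that the two sub trees are already available as sets of node ids; the class name, the method name and the ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the difference between the sub trees of two A nodes.
class SubTreeDiffSketch {
    // Returns three sets in this order: nodes only under A1, nodes only
    // under A2, and nodes that both sub trees have in common.
    static List<Set<String>> diff(Set<String> underA1, Set<String> underA2) {
        Set<String> onlyA1 = new TreeSet<String>(underA1);
        onlyA1.removeAll(underA2);
        Set<String> onlyA2 = new TreeSet<String>(underA2);
        onlyA2.removeAll(underA1);
        Set<String> common = new TreeSet<String>(underA1);
        common.retainAll(underA2);
        List<Set<String>> result = new ArrayList<Set<String>>();
        result.add(onlyA1);
        result.add(onlyA2);
        result.add(common);
        return result;
    }

    public static void main(String[] args) {
        Set<String> underA1 = new TreeSet<String>(Arrays.asList("B1", "B2", "C1"));
        Set<String> underA2 = new TreeSet<String>(Arrays.asList("B2", "B3", "C1"));
        List<Set<String>> result = diff(underA1, underA2);
        System.out.println("only A1: " + result.get(0));
        System.out.println("only A2: " + result.get(1));
        System.out.println("common:  " + result.get(2));
    }
}
```

The input sets are copied before they are modified, so the caller's sets are left unchanged.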

A sub tree under an A is inconsistent if two C nodes have the same value in a specific attribute but are not the same node. For every case of inconsistency the database returns the value of the attribute and all B-C pairs where the C node has that value. If the sub tree is consistent the question only returns true. To illustrate this see Figures 3.3 and 3.4. In Figure 3.3 the sub tree of A is consistent, but in Figure 3.4 there are two C nodes that have the same value a=4 but are not the same node. In this case the question will return a=4 and a list of B-C pairs: B2-C2, B3-C4 and B4-C4.

Figure 3.3: An example of a consistent tree

Note that the attribute in question is a very specific attribute and that this is only interesting within the sub tree under a given A. In two different sub trees there are some C nodes with the same value that are not the same C node, but this is permitted. It is only within the sub tree of one A that all C nodes with the same value in the attribute must be the same node. How and why inconsistencies occur is too closely connected to the nature of the data to be revealed here. They do occur at some points, and the solution at Tieto today cannot tell in which nodes the problem is; it only answers true or false depending on whether the sub tree is consistent or not.
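The consistency rule can be written down as a short Java sketch. It is not one of the three implementations: the method name, the representation of the sub tree as a list of B-C edges, and the concrete tree in `main` are assumptions, chosen so that the result matches the pairs B2-C2, B3-C4 and B4-C4 named in the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of the consistency check for the sub tree of one A.
class ConsistencyCheckSketch {
    // bcEdges: the B-C edges of the sub tree, each as {bId, cId}.
    // attributeOfC: value of the specific attribute for each C node.
    // Returns, for every value held by more than one distinct C node, the
    // B-C pairs whose C node has that value. An empty map means consistent.
    static Map<Integer, List<String>> findInconsistencies(
            List<String[]> bcEdges, Map<String, Integer> attributeOfC) {
        Map<Integer, Set<String>> cNodesByValue = new TreeMap<Integer, Set<String>>();
        Map<Integer, List<String>> pairsByValue = new TreeMap<Integer, List<String>>();
        for (String[] edge : bcEdges) {
            Integer value = attributeOfC.get(edge[1]);
            if (value == null) {
                continue; // assumption: a C node without the attribute is ignored
            }
            if (!cNodesByValue.containsKey(value)) {
                cNodesByValue.put(value, new TreeSet<String>());
                pairsByValue.put(value, new ArrayList<String>());
            }
            cNodesByValue.get(value).add(edge[1]);
            pairsByValue.get(value).add(edge[0] + "-" + edge[1]);
        }
        Map<Integer, List<String>> inconsistent = new TreeMap<Integer, List<String>>();
        for (Map.Entry<Integer, Set<String>> entry : cNodesByValue.entrySet()) {
            if (entry.getValue().size() > 1) {
                inconsistent.put(entry.getKey(), pairsByValue.get(entry.getKey()));
            }
        }
        return inconsistent;
    }

    public static void main(String[] args) {
        List<String[]> edges = Arrays.asList(
                new String[] { "B1", "C1" },
                new String[] { "B2", "C2" },
                new String[] { "B3", "C4" },
                new String[] { "B4", "C4" });
        Map<String, Integer> attribute = new HashMap<String, Integer>();
        attribute.put("C1", 1);
        attribute.put("C2", 4); // same value as C4 but a different node
        attribute.put("C4", 4);
        System.out.println(findInconsistencies(edges, attribute));
    }
}
```

Two B nodes pointing to the same C node (B3 and B4 to C4) are not a problem on their own; the value 4 is reported only because C2 and C4 are distinct nodes that share it.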

Figure 3.4: An example of an inconsistent tree

3.3 The databases

One of the databases chosen was PostgreSQL, a free open source RDBMS that is known to have good performance. One major advantage was that Tieto already had it installed for other testing purposes. Because of the nature of the data one of the databases is a graph database, and the choice was Neo4j, a Swedish open source graph database. One major advantage was that it is written in Java and well documented. The last database was a choice between a document store and a key value store. This was because they are very different in their architecture from the other two, and it is interesting to see if they are as good as they should be on the simple requests and how bad they are on the more complicated questions. Several were considered but the choice fell on Berkeley DB, a key value store that had recently been rewritten for Java. It previously only existed in C with a library that could run the C version from Java, but with the new version the cost of inter-language translation is avoided. Berkeley DB is also well documented and has several examples.


Chapter 4

The solutions

The solution is implemented in Java 1.6 as a Web service. This was a requirement from Tieto as this would make it easier to integrate into their existing systems. The three different data stores implement the same interface and are therefore interchangeable without making any changes to the Web service methods. Since this is intended as an internal system the data source is trusted and there are no protections against malicious data. The only protection that exists is against faulty data. This may seem a bit strange, and if any of these databases are integrated into Tieto's existing systems they need some work in this area. The main reason for this limitation is time. The time for this thesis is limited and it was decided that it was better to get as much functionality as possible instead of spending time on error handling for something that may be discarded.
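The idea of interchangeable data stores behind one interface can be illustrated with a minimal sketch. The names NodeStore, InMemoryNodeStore and NodeService, as well as the two methods, are hypothetical and chosen only for this illustration; the actual interface used in the implementation is not reproduced in this report.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface: every data store offers the same operations,
// so the Web service methods never depend on a concrete database.
interface NodeStore {
    void put(String id, String value);
    String get(String id); // returns null when the id is unknown
}

// Stand-in implementation backed by a map. A PostgreSQL, Neo4j or
// Berkeley DB store would implement the same two methods.
class InMemoryNodeStore implements NodeStore {
    private final Map<String, String> nodes = new HashMap<String, String>();

    public void put(String id, String value) {
        nodes.put(id, value);
    }

    public String get(String id) {
        return nodes.get(id);
    }
}

// The service only sees the interface, so the store can be swapped freely.
class NodeService {
    private final NodeStore store;

    NodeService(NodeStore store) {
        this.store = store;
    }

    String describe(String id) {
        String value = store.get(id);
        return value == null ? "unknown node " + id : id + "=" + value;
    }
}
```

Swapping the database then only means passing another NodeStore implementation to the service constructor.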

All three solutions operate in the same space of the CAP theorem; none of them is physically partitioned. Both Neo4j and Berkeley DB can be partitioned if the need should arise.

4.1 PostgreSQL

The PostgreSQL solution is implemented in version 8 and uses the java.sql.* library to communicate with the database. The main table structure is straightforward: one table for each of the levels of nodes with a serial id field and three tables containing the relationships between the different layers of nodes. The property described in Section 3.1 is also in its own table with a join table that tells which property belongs to which node. This is not the structure of Tieto's current solution; this is a new design that tries to be as good as it can be for this version of the data.

One of the main problems with PostgreSQL can be found in this design: there are approximately 18 million rows in the table joining B and C nodes. All questions regarding the relationship between the B and C nodes will be costly any way it is done. When asking for a specific B it is necessary to query over this table because one important part of a B is which C it connects to. The information about these connections must be in the database and the other solutions for achieving this would have other problems. One way would be to let the table for B nodes contain this information, but the number of connections differs widely between different B nodes. This makes it hard to have any other solution than the one chosen. Pruning the data and discarding some of the nodes is not an option; all data is still relevant in one way or another. Another possibility would be if the graph could be divided into several smaller sub graphs. Then the table could be split into several smaller tables. This is not possible for this data, not even when looking at only the B and C levels, so this cannot be done without having more than one copy of some of the nodes.

a    C.id   B.id
3    C1     B1
3    C1     B2
4    C2     B2
4    C4     B3
4    C4     B5
2    C3     B3
2    C3     B4

Table 4.1: The table returned by PostgreSQL for the inconsistent example

The logic of this solution strives to make as few queries to the database as possible without having to loop over the resulting rows more than once. The overhead of querying the database is something to consider, and it is worth a little more logic in the program to not have to query the database more than once for each question. This is the only solution that requires anything other than a Java library, since the PostgreSQL server runs independently from the Java program. In the test runs the PostgreSQL server was run on the same computer, thus eliminating the time it would take the data to be transferred over a network.

4.1.1 The questions

In this section the more interesting solutions will be described in some detail. The first nine questions are simple selects with some joins for retrieving the data. They are not particularly interesting or special. The queries are written to allow the query planner in PostgreSQL as much freedom as possible, since it is probably better at the planning than the author of this paper.

Questions 10 and 11 do nothing special; they only return the sub tree or the list of B nodes. Question 12 gets the unique labelling strings for the two sub trees of the two different A nodes. It then uses Java's set operations with a hash set on the two sets of strings to get the three different subsets. Given the results of the test runs this is probably not the optimal way of doing this, even though Java is good at hashing strings. If any more work is done on this solution, this question is definitely worth looking at and implementing a better solution for.
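As an illustration of the set operations used for question 12, the following is a minimal sketch using java.util.HashSet. The class name SubtreeDiff and its methods are hypothetical, and the three subsets are assumed to be the labels found only in the first sub tree, only in the second sub tree, and in both.

```java
import java.util.HashSet;
import java.util.Set;

class SubtreeDiff {
    // Labels found in the first sub tree but not in the second.
    // The inputs are copied, so the caller's sets are left unchanged.
    static Set<String> onlyInFirst(Set<String> first, Set<String> second) {
        Set<String> result = new HashSet<String>(first);
        result.removeAll(second);
        return result;
    }

    // Labels found in both sub trees.
    static Set<String> inBoth(Set<String> first, Set<String> second) {
        Set<String> result = new HashSet<String>(first);
        result.retainAll(second);
        return result;
    }
}
```

The third subset, the labels found only in the second sub tree, is obtained by calling onlyInFirst with the arguments swapped.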

The last query determines whether a sub tree of A is consistent or not. For examples of this see Figures 3.3 and 3.4. This was really hard to implement and the final solution is one that uses PostgreSQL for the most part and some Java for the final logic. The query to PostgreSQL returns Table 4.1 for the previous inconsistent example. This is sorted primarily on the value of a and secondarily on C. The Java program then uses the following algorithm to remove the rows that do not contain an inconsistent a-value. The return structure is not in row format but contains the same information, organized in a slightly different way. Note that before the first row is processed the variables keeping track of the previous row are set to empty strings; they therefore hold a value but match nothing from the database.

1 Fetch the values for the new row
2 is this a-value the same as the previous row
    2.1 is this C the same as the previous row
        2.1.1 keep C and D in a structure
    2.2 else (the C is different meaning an inconsistent tree)
        2.2.1 set this a-value as inconsistent
        2.2.2 keep C and B nodes in a structure
3 else (this is not the same a-value as the previous row)
    3.1 is the previous a inconsistent
        3.1.1 set the return structure as inconsistent
        3.1.2 save all data in the return structure
    3.2 else (the previous a is consistent)
        3.2.1 discard all saved data
4 set this row as the previous row
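The steps above can be sketched in Java as follows. This is not the code of the actual implementation: the class name ConsistencyScan, the row layout {a, C, B} and the returned map are assumptions made for the illustration. The sketch also assumes that every row is kept as a B-C pair (the rows of Table 4.1 contain only a, C and B), that the first row of a new a-value is kept as well, and that the last group of rows is settled after the loop, which the list does not spell out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ConsistencyScan {
    // Returns, for every inconsistent a-value, its B-C pairs.
    // An empty map means that the sub tree is consistent.
    // Each row is {a, C, B}; rows must be sorted on a, then on C.
    static Map<String, List<String>> inconsistent(List<String[]> rows) {
        Map<String, List<String>> result = new LinkedHashMap<String, List<String>>();
        String prevA = "";                  // empty strings match nothing in the data
        String prevC = "";
        boolean groupInconsistent = false;
        List<String> saved = new ArrayList<String>();

        for (String[] row : rows) {         // step 1: fetch the new row
            String a = row[0], c = row[1], b = row[2];
            if (a.equals(prevA)) {          // step 2: same a-value
                if (!c.equals(prevC)) {
                    groupInconsistent = true;   // step 2.2.1: same a, new C
                }
            } else {                        // step 3: a new a-value begins
                if (groupInconsistent) {
                    result.put(prevA, saved);   // steps 3.1.1 and 3.1.2
                }
                saved = new ArrayList<String>(); // step 3.2.1 (or a fresh list)
                groupInconsistent = false;
            }
            saved.add(b + "-" + c);         // steps 2.1.1 and 2.2.2
            prevA = a;                      // step 4
            prevC = c;
        }
        // Assumption: the last group is settled once all rows are read.
        if (groupInconsistent) {
            result.put(prevA, saved);
        }
        return result;
    }
}
```

Because the rows are sorted, one pass over the result set is enough and only the rows of the current a-value have to be held in memory.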

4.1.2 Strengths and weaknesses

The implementation of PostgreSQL was the first one, since this was the database most familiar from previous experience. In the end this database proved to be the hardest and most time consuming to implement. The amount of code needed to handle calls to the database, exceptions and similar things is massive. Almost all exception handling is simply printing an error message on stderr and moving on or returning false, since there is no use spending time implementing fancy error handling for something that may be discarded shortly. All changes to the database are made in transactions, and time and effort was spent on making sure the data in the database did not get corrupted. One problem with PostgreSQL and other SQL databases is that the programmer needs to be good at SQL to be able to handle writing the queries, setting up tables and similar things. There is a big hurdle to get over in order to do things nicely and efficiently.

4.2 Neo4j

Neo4j is a graph database and as such it uses nodes and edges to store data. A node can have several properties and several edges, called relationships, to other nodes. These relationships must be of a specific type. A relationship can also have several properties just like the nodes. A property is a key that is a string and a value. The value must be one of Java's primitive types, a string or an array of primitives or strings. The data is retrieved from the database by traversing the graph.

Since the data is in a graph structure there was no need to think of any other structure for storing the data. All attributes in a node are stored as properties and the relationships are set as the nodes get entered into the database. Relationships are set both ways, so between two nodes there are two relationships with different types, one going up and one going down. This helps with the traversing of the tree, making sure that only nodes in the right direction get traversed.

To index the nodes this solution uses the LuceneIndexService, which is closely integrated with the database but not a part of it. There is no indexing in the graph engine, but this semi built-in index service uses Lucene as backend and is as close as it gets to being an integrated index. Neo4j is not intended as a key-value store and therefore indexing is not a priority; the main way of finding the right nodes should be by traversing the graph with different algorithms. Because the indexing is separate from the database it is possible to remove a node from the database and still have it indexed. The index service may still return the node but it will not have the correct properties set, and when asked for the value of the property an exception will be thrown. The first time this happened it was a somewhat hard error to find. The database said that the property was not set but the problem was that the node was still indexed by the property although it had been removed from the database. It is vital that all index entries are removed when a node is removed from the database. The only time nodes are removed from the database in this solution is when all data is removed for testing purposes.

This solution uses version 1.1 of Neo4j. The new version 1.2 came out in December 2010. The implementation was finished in November 2010 and has not been checked against the new version of the database. One of the main differences is how the indexing is handled. If any future work is done on this solution, one of the first steps should be integrating the new way of indexing. More information on this can be found on Neo4j's web page [11].

4.2.1 The questions

The first nine questions are handled by the index service. All nodes that are put in the database get indexed on the different properties that are needed for this. When the correct node has been found the information is moved from the node to the Java object that gets returned. A new object is created for every node that is returned; it is not the same object that was put in the database but it has the same information.

The more complex questions use the graph structure of Neo4j to return the correct nodes. Question 10 gets the correct A from the index service and then simply traverses the sub tree and returns the nodes. Neo4j makes this traversal very easy; it is possible to ask for every node that is within a maximum depth from this node and that can be reached through the correct relationship type. The same is true for question 11: the correct C node is found by asking the index service for the node and then asking the database for all its neighbours with the correct relationship.

Question 12 uses Java's hash set in the same way as the PostgreSQL solution to calculate the different sets from the given sub trees. As previously mentioned this is probably not the best way and a different algorithm should be considered if any more work is done on this database.

The consistency check really uses the graph properties of this database. It begins by making a depth first search with a maximum depth of two. For every C node it saves the a-value and the C node in a hash table. The a-value is set as key and the C node is put in a list. When all nodes have been traversed the program goes through the hash table, searching for a-values that have more than one C node in the list. If such a list is found, the sub tree is inconsistent. To get the right B nodes all B nodes are examined and the ones with the correct A are stored in the return structure. In most cases a sub tree will be consistent, and this should be faster than Berkeley DB for those cases since it really uses the nature of the graph and does the extra work only when it is needed.
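The hash table step of this check can be sketched without any Neo4j specific code. The class AValueIndex and its methods are hypothetical, and the traversal itself is left out: visit is assumed to be called for every C node that the depth first search reaches. As a small deviation from the description, the sketch stores the C nodes in a set instead of a list, so that a C node reached through several B nodes is only counted once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class AValueIndex {
    // a-value -> ids of the distinct C nodes that carry this value
    private final Map<String, Set<String>> cNodesByA = new HashMap<String, Set<String>>();

    // Called for every C node reached by the depth first search.
    void visit(String aValue, String cNodeId) {
        Set<String> cNodes = cNodesByA.get(aValue);
        if (cNodes == null) {
            cNodes = new LinkedHashSet<String>();
            cNodesByA.put(aValue, cNodes);
        }
        cNodes.add(cNodeId);
    }

    // a-values that occur in more than one distinct C node.
    // An empty list means that the sub tree is consistent.
    List<String> inconsistentValues() {
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, Set<String>> entry : cNodesByA.entrySet()) {
            if (entry.getValue().size() > 1) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Only when inconsistentValues returns something does the more expensive step of collecting the matching B nodes have to be carried out.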

4.2.2 Strengths and weaknesses

It was fairly easy to start implementing Neo4j. There are some examples on the web site and really good API documentation for all of the classes. The only major problem was described above: if an index entry was not removed it showed up as really strange behaviour and the root cause was hard to find. A completely missing functionality is the ability to truncate the entire database. Since this was a test there was a need to store data, perform the tests and then remove all data to make the database clean for the next test. This is very easy in both PostgreSQL and Berkeley DB but no easy solution was found for Neo4j. Neo4j is still young and the change of indexing from version 1.1 to 1.2 shows that big changes are still being made. Both PostgreSQL and Berkeley DB are older and more mature products.

Neo4j requires really fast discs or huge amounts of memory to work well. The Linux machine that was used in the development of this thesis had some real problems with speed. The test runs were done on a solid state drive, and Neo4j needs better hardware than the other two databases.

Because the data is organized as a graph there are several graph algorithms that can be used to solve various problems. A graph is easy to understand and most data with lots of relationships can be described as a graph. There exist programs that allow for a graphical presentation of the data in the database but none were tested by the author of this thesis.

4.3 Berkeley DB

Berkeley DB is a key-value store that was originally written in C but now has a completely rewritten version in Java. Berkeley DB has been owned by Oracle since 2006. Berkeley DB stores any Java class that is set to be persistent. It uses annotations to mark the class as persistent or as an entity class and to mark the members of the class that are primary and secondary keys. A secondary key has one of four different ways of relating to other instances of the same class. An example will clarify this.

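import com.sleepycat.persist.model.Entity;
import com.sleepycat.persist.model.PrimaryKey;
import com.sleepycat.persist.model.SecondaryKey;
import static com.sleepycat.persist.model.Relationship.*;

@Entity
class ExampleClass {
    @PrimaryKey
    long id;

    // unique for every instance
    @SecondaryKey(relate=ONE_TO_ONE)
    int ssn;

    // one value per instance, shared with other instances
    @SecondaryKey(relate=MANY_TO_ONE)
    String name;

    // many values per instance, none of them shared
    @SecondaryKey(relate=ONE_TO_MANY)
    String[] email;

    // many values per instance, may be shared
    @SecondaryKey(relate=MANY_TO_MANY)
    String[] family;
}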

ONE_TO_ONE says that the value is unique for every instance in the database. A primary key is of this type but it is unusual for secondary keys. MANY_TO_ONE means that this instance has only one value but shares it with several other instances of this class. If an instance may have many values but no other instance may have any of the same values, the relation type is ONE_TO_MANY, and if an instance can have many values that it shares with others the relation type is MANY_TO_MANY. For more information on implementation details, see the API [6].

Since Berkeley DB requires all objects that are stored to be set as persistent, the information has to be moved from the original incoming object to a Berkeley DB object that looks the same except for the Berkeley DB specific annotations. When returning an object from the database the reverse is done to make sure the correct type of object gets returned. When doing this some information about relationships between the nodes is lost. In Figure 3.2 there is a shared property between the nodes. In Berkeley DB each node will get its own copy of this property and when returned the nodes will have different but identical copies of this property, see Figure 4.1.

Figure 4.1: An example of how a shared property gets duplicated in Berkeley DB

4.3.1 The questions

As with the others the first nine questions were easy once the indexing part was understood. If a Berkeley DB object could be returned the search methods would consist of only one line of code. Returning a sub tree requires first getting the right A node from the database, then looping over all its B nodes, getting each node from the database and finding all C nodes it connects to. The same then needs to be done for all C nodes to get all D nodes. This means that several nodes may get visited more than once and that is not optimal. If a get from the database is costly and there are lots of C nodes that belong to different B nodes in the same sub tree this approach will be expensive. But if there are only a few of these nodes it will cost more to keep track of which have already been explored than to let them get explored once more. Therefore this solution uses the naive approach and lets the same C node get explored several times.

Getting down the tree is fairly easy with Berkeley DB, as with both the other databases. In the other two there is also information on how to get up the tree but this information does not exist in Berkeley DB. Information on which B nodes a C node belongs to is only stored in the B nodes and therefore it must be searched for in the B nodes. The search is not hard: the id of the C nodes is set to be a secondary index, so the search is simple but possibly time consuming. The difference between two sub trees is handled in much the same way as with the other databases. The difference here is in the speed of the initial search, since it is done twice.

The inconsistency check was hard to do in Berkeley DB. In Neo4j it was possible to store only the C nodes and the a-value and then get the information if the sub tree is inconsistent. In Berkeley DB this would be much harder since there is no link from a C node to its B nodes and on to the A node to make sure only the correct B nodes get returned.

The supervisor at Tieto, Anders Martinsson, made a solution for his hash table and that solution seemed to be a good approach: it stores all the information needed as it makes a depth first search of the sub tree. In Neo4j only the a-value and its corresponding C node are saved, but there it is possible to retrieve the information about the B nodes without having to search the whole tree again. In Berkeley DB this is not possible so all information needed to be saved as the initial search proceeded. The hash table has the structure Hashtable<String, Hashtable<String, List<String>>> to keep track of all nodes visited so far. The outer hash table has the a-value as key and a hash table as value. The inner hash table has the C node as key and a list of B nodes as value. When the initial search is done it is simply a matter of looking at the size of each of the inner hash tables to see if any of them is greater than one. If so, that a-value is in more than one C node and the tree is inconsistent.
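A minimal sketch of this nested structure is given below. Only the type Hashtable<String, Hashtable<String, List<String>>> is taken from the description; the class VisitedNodes and its methods are hypothetical, and record is assumed to be called for every B-C connection seen during the depth first search.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

class VisitedNodes {
    // a-value -> (C node id -> ids of the B nodes that lead to that C node)
    private final Hashtable<String, Hashtable<String, List<String>>> visited =
            new Hashtable<String, Hashtable<String, List<String>>>();

    // Record one B-C connection seen during the depth first search.
    void record(String aValue, String cNode, String bNode) {
        Hashtable<String, List<String>> byC = visited.get(aValue);
        if (byC == null) {
            byC = new Hashtable<String, List<String>>();
            visited.put(aValue, byC);
        }
        List<String> bNodes = byC.get(cNode);
        if (bNodes == null) {
            bNodes = new ArrayList<String>();
            byC.put(cNode, bNodes);
        }
        bNodes.add(bNode);
    }

    // The tree is consistent when no a-value maps to more than one C node.
    boolean isConsistent() {
        for (Hashtable<String, List<String>> byC : visited.values()) {
            if (byC.size() > 1) {
                return false;
            }
        }
        return true;
    }

    // B nodes recorded for one a-value and C node (empty if none).
    List<String> bNodesFor(String aValue, String cNode) {
        Hashtable<String, List<String>> byC = visited.get(aValue);
        List<String> bNodes = (byC == null) ? null : byC.get(cNode);
        return bNodes == null ? new ArrayList<String>() : bNodes;
    }
}
```

Since the B nodes are stored next to each C node, the B-C pairs of an inconsistent a-value can be reported without searching the tree a second time.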

4.3.2 Strengths and weaknesses

It was a little bit harder than with Neo4j to get started on the implementation, but after the initial hurdle was cleared there were few problems with getting the code to work. When first tested on the development machine the first reaction was that it was really fast. Those tests were only to test the functionality but even then there was a clear difference in the speed of the test program.

The author of this paper has tried to find some graphical program to view the data in the database and handle it manually. No such program has been found but it would be good if it existed.


Chapter 5

Results

The test program was developed by Anders Martinsson, my supervisor at Tieto. All test runs were done by him on his computer.

The test runs were made on four different percentages of the data: 0.1%, 1%, 10% and 100%. The test was run 2048 times for each question, except for Neo4j and PostgreSQL at 100%, because they took so long that it was not possible to let them run that many tests. Neo4j had 512 tests and PostgreSQL had 256 tests per question. Time was measured at both client and server; the client measured the wall clock time for all test runs in milliseconds and the server measured each question's time in nanoseconds. In total there were almost 500 000 times to analyse at the end of all test runs. The server times will not be presented in full here; only the mean, median and a trimmed mean will be presented. The trimmed mean has 5% cut off at each end to get rid of any extreme values. All times are in tables in Appendix A. Note that in the listings of client times for 100% the numbers for Neo4j and PostgreSQL are multiplied by 4 and 8 respectively (2048/512 and 2048/256) to give an accurate picture for comparison.
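To make the trimmed mean concrete, the following sketch sorts the samples, cuts a given fraction from each end and averages the rest. The class name TrimmedMean is hypothetical, and rounding the number of removed samples down is an assumption; the exact rule used for the numbers in Appendix A is not stated here. The fraction must be below 0.5 and the sample array must not be empty.

```java
import java.util.Arrays;

class TrimmedMean {
    // Mean of the values that remain after cutting `fraction` of the
    // samples from each end of the sorted data (0.05 for a 5% trim).
    static double of(long[] samples, double fraction) {
        long[] sorted = samples.clone();    // leave the caller's array untouched
        Arrays.sort(sorted);
        // Number of samples dropped at each end, rounded down (assumption).
        int cut = (int) Math.floor(sorted.length * fraction);
        double sum = 0;
        for (int i = cut; i < sorted.length - cut; i++) {
            sum += sorted[i];
        }
        return sum / (sorted.length - 2 * cut);
    }
}
```

With 2048 server times per question and a fraction of 0.05, this drops the 102 smallest and the 102 largest times before averaging.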

This is maybe the most important result of them all. The fact that the total time for running the entire test set was so high that it could not be completed for either Neo4j or PostgreSQL is very telling of which database is faster for the large amount of data.

The test program was developed at the same time as the databases and to test it Anders designed a hash table to handle the data. This is not a persistent database but for comparison reasons test runs were done with this as well for the three smaller data sets.

First a look at the client times for the different questions at the different percentage levels. Note that the scale on the x-axis is logarithmic and not linear.

It can be seen in Figure 5.1(a) and 5.1(b) that PostgreSQL has a high overhead cost since it needs to call an external database. For that amount of data there can be no other explanation as to why the cost is so much higher. The impact of this overhead should decrease as the amount of data increases. Neo4j struggles with some of the questions when it comes to 100% of the data but Berkeley DB is still quite fast. Neither Neo4j nor Berkeley DB can keep all the information in memory and both need to read from disc. This seems to affect Neo4j more than Berkeley DB, even though they both keep their data on the SSD.

If the client times are divided by the number of test runs, Berkeley DB completes query 12 in less than 5 seconds, Neo4j needs almost 25 seconds but PostgreSQL needs as much as 115 seconds or almost 2 minutes. Even if this is a question asked once a day, 2 minutes is a very long time to wait for an answer. One other interesting thing that is obvious from these graphs is that the hardest of the more complex questions seem to be question 12, the difference between two sub trees, and question 10, returning a sub tree of A. When comparing the times of these two questions for 10% of the data, the time for question 12 is almost double that of question 10, which means that the time spent calculating the sets is almost nothing compared to the time of getting the sub trees. For 100% the time is still only slightly more than double the time for getting one sub tree.

Figure 5.1: Total time for the client. Time in seconds per question number for PostgreSQL, Neo4j, Berkeley DB and the hash table: (a) 0.1% of the data, (b) 1% of the data, (c) 10% of the data, (d) 100% of the data (without the hash table).

One interesting fact is that Berkeley DB and Neo4j can almost keep up with the hash table. One of the main reasons for this is probably the speed of the solid state drive. With a slower hard drive this would probably not be possible, especially for Neo4j. In Figure 5.2 the client times for Neo4j, Berkeley DB and the hash table are plotted for 10% of the data. For question 9 it appears that the hash table is the slowest, but only just.

On the server side the times plotted are the trimmed means because they are generally somewhere in between the mean and the median. The graphs are almost the same; PostgreSQL is the slowest for almost every question.

One thing that becomes apparent from these figures is that some of the simple questions are not so simple after all. A look at question 7 in Figure 5.5 shows very interesting results. The maximum seems to be when only 1% of the data is tested, except for Neo4j which peaks at 100% when the rest of them actually go down in time. There are several possible explanations for this but the most likely is that the data scales badly for this example. If the original data contains 20 unique values of the attribute in question, this drops to 0.2 for 1% and is rounded to 1. That means that a much larger portion of the data is returned than for 10%. A closer look at the exact behaviour of the data and the databases for this question would be interesting but it probably does not influence the final result.

Figure 5.2: Client times for 10% of the data without PostgreSQL. Time in seconds per question number for Neo4j, Berkeley DB and the hash table.


Figure 5.3: Total time for the server. Time in seconds per question number for PostgreSQL, Neo4j, Berkeley DB and the hash table: (a) 0.1% of the data, (b) 1% of the data, (c) 10% of the data, (d) 100% of the data (without the hash table).


Figure 5.4: Client times for question 1-6. Time in seconds per amount of data (0.1%, 1%, 10% and 100%) for PostgreSQL, Neo4j, Berkeley DB and the hash table: (a) Question 1, (b) Question 2, (c) Question 3, (d) Question 4, (e) Question 5, (f) Question 6.


Figure 5.5: Total time for the client for question 7 at different amounts of data. Time in seconds per amount of data (0.1%, 1%, 10% and 100%) for PostgreSQL, Neo4j, Berkeley DB and the hash table.


Figure 5.6: Client times for question 8-13. Time in seconds per amount of data (0.1%, 1%, 10% and 100%) for PostgreSQL, Neo4j, Berkeley DB and the hash table: (a) Question 8, (b) Question 9, (c) Question 10, (d) Question 11, (e) Question 12, (f) Question 13.


Chapter 6

Conclusions

In this paper a question was asked: is there any database that can handle the amount of data that Tieto has, and how good can it get? To answer this question it was necessary to take a look at what architectures exist today and find some different databases that could do the job. Based on the nature of the data it became clear that one should be a graph database, and Neo4j was chosen because it is considered to be one of the best on the market. The relational database was chosen based on the fact that it already existed in the company and that it is one of the fastest relational databases available. A third and completely different database was needed and Berkeley DB is a key value store that had all the qualities.

All three were implemented as a Web service and their performance was measured. The results were pretty clear: the database that is the best at handling the data is Berkeley DB, even for the questions that are closely connected to the graph structure of the data. This was a surprising result. Even though the data fits the graph model best, Neo4j just was not fast enough to be able to use its advantage.

There is also the problem with the maturity of the product. The first version of Neo4j was released in February 2010, version 1.1 half a year later and in December 2010 yet another version. Neo4j needs time to mature and become a more stable product before it suits companies such as Tieto.

6.1 Future work

The results are promising, and continued development of the Berkeley DB part of the solution is definitely worthwhile. Even though a solution exists today, it is not optimal, and at the rate the data is growing Tieto may find itself in trouble a lot sooner than anticipated. The real data has some properties that were excluded from this first test to make the task a little easier. A good first step would be to identify these properties and start implementing them as well, to see if the results still hold.


Chapter 7

Acknowledgements

I would like to start by thanking my supervisor at Tieto, Anders Martinsson, for all his support and help. Without him this master thesis would not have existed, since the whole thing was his idea. I also thank everyone at Tieto's office in Umeå for making my workday a pleasant time.

I thank my internal supervisor at the Department of Computing Science at Umeå University, Ola Ågren.

Last but not least I thank my husband for everything he has done to support me throughout the entire master thesis project.


References

[1] 10Gen. Mongodb. http://www.mongodb.org/, August 18 2010.

[2] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati,Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, andWerner Vogels. Dynamo: Amazon’s highly available key-value store. In Proc. SOSP ,41:205–220, 2007.

[3] Seth Gilbert and Nancy Lynch. Brewer’s Conjecture and the Feasibility of Consistent,Available, Partition-Tolerant Web Services. ACM SIGACT News, 33, 2002.

[4] Jim Gray. The Transaction Concept: Virtues and Limitations. Proceedings of Seventh 

International Conference on Very Large Databases, 1981.

[5] Theo Haerder and Andreas Reuter. Principles of transaction-oriented database recov-ery. Computing Surveys, 15(4), 1983.

[6] Oracle. http://www.systomath.com/doc/BerkeleyDb-4.7/html/java/, December 62010.

[7] Dan Pritchett. BASE: An Acid Alternative. Queue, 6(3), 2008.

[8] Hibernating Rhinos. Raven DB. http://ravendb.net/, August 18 2010.[9] Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil

Hachem, and Pat Helland. The End of an Architectural Era (It’s Time for a CompleteRewrite). VLDB ’07 , 2007.

[10] Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack,Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O’Neil, PatO’Neil, Alex Rasin, Nga Tran, and Stan Zdonik. C-Store: A Column-oriented DBMS.VLDB , pages 553–564, 2005.

[11] Neo Technology. Neo4J, the graph database. http://neo4j.org/, December 6 2010.

[12] The Apache Software Foundation. The CouchDB Project.

http://couchdb.apache.org/, August 18 2010.

[13] Project Voldemort. Project Voldemort, A distributed database.http://project-voldemort.com/, August 18 2010.

[14] Wikipedia. Turing award. http://en.wikipedia.org/wiki/Turing Award, October 52010.


Appendix A

Data from test runs

Here is the raw data from the test runs. The server times are rounded to 4 significant digits. The values for the hash table did not fit in the table with the others and are therefore given in a table of their own.
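The three statistics reported per question in the server-time tables can be computed along the following lines. This is a minimal sketch in Python, not the tooling used for the thesis. It assumes that the "trim" column is a trimmed mean. The trimming fraction of 10% and the helper names round_sig, trimmed_mean and summarize are hypothetical, since the text does not state them.

```python
import math
import statistics


def round_sig(x, digits=4):
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(x)))
    return round(x, digits - 1 - exponent)


def trimmed_mean(samples, fraction=0.10):
    """Mean after dropping `fraction` of the samples, half from each end.

    The 10% default is an assumption made for illustration only.
    """
    ordered = sorted(samples)
    k = int(len(ordered) * fraction / 2)  # samples dropped at each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.fmean(kept)


def summarize(samples, fraction=0.10):
    """Return mean, trimmed mean and median, each rounded to 4 significant digits."""
    return {
        "mean": round_sig(statistics.fmean(samples)),
        "trim": round_sig(trimmed_mean(samples, fraction)),
        "median": round_sig(statistics.median(samples)),
    }
```

A trimmed mean and a median are less sensitive to a few unusually slow requests than the plain mean, which may be why all three are reported side by side.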

A.1 Server times

     PostgreSQL                    Neo4j                         Berkeley DB
     mean      trim      median    mean      trim      median    mean      trim      median
Q1   13.41     13.14     12.9      0.1889    0.1657    0.1647    0.1363    0.114     0.1078
Q2   12.38     12.37     12.36     0.09073   0.09009   0.09068   0.05806   0.05725   0.05615
Q3   0.7108    0.7051    0.7035    0.0648    0.06052   0.06033   0.04722   0.04666   0.04629
Q4   82.43     81.49     68.1      3.679     3.648     2.709     3.36      3.295     2.403
Q5   53.41     53.37     53.36     8.909     8.704     8.457     9.297     9.134     8.978
Q6   298.4     297.7     297.4     18.74     18.59     18.6      16.6      16.29     15.98
Q7   326.9     326.2     326       27.47     27.22     27.29     25.53     25.16     24.6
Q8   27.97     27.89     28.13     3.619     3.575     3.563     0.421     0.4028    0.4147
Q9   32.83     32.74     32.53     0.8853    0.8656    0.8562    0.813     0.7952    0.7852
Q10  29.17     29.1      28.86     0.3765    0.3518    0.346     0.3681    0.3437    0.3003
Q11  5.288     5.285     5.408     0.06893   0.06775   0.0664    0.2815    0.2452    0.1964
Q12  59.67     59.52     59.18     0.5257    0.4977    0.5086    0.5271    0.4973    0.5486
Q13  1.722     1.715     1.713     0.09877   0.09506   0.09334   0.3445    0.3281    0.2868

Table A.1: The server times for 0.1% of the data. Time in milliseconds.


     PostgreSQL                    Neo4j                         Berkeley DB
     mean      trim      median    mean      trim      median    mean      trimmean  median
Q1   18.8      18.58     18.29     0.1265    0.1246    0.1218    0.06133   0.06115   0.05767
Q2   17.91     17.91     17.9      0.08418   0.08166   0.08158   0.03206   0.03178   0.03149
Q3   0.7133    0.7109    0.7107    0.06575   0.06573   0.0664    0.0218    0.02161   0.02163
Q4   107.1     105.6     105.1     3.795     3.734     3.668     3.381     3.267     3.075
Q5   161.3     161.9     163.1     14.96     14.99     15.09     17.13     17.09     17.01
Q6   631.4     631.4     639.3     31.5      31.47     31.89     25.94     25.8      25.94
Q7   1044      1061      1092      85.15     86.93     105.3     105.9     106.3     120.9
Q8   26.76     26.91     27.09     29.22     30.14     33.4      0.4238    0.4219    0.4153
Q9   65.11     65.04     64.87     1.024     1.018     1.016     0.8452    0.8292    0.8298
Q10  160.2     159.7     159.9     4.833     4.811     4.807     3.21      3.119     3.099
Q11  5.266     5.239     5.225     0.1018    0.09598   0.08651   0.9761    0.8173    0.6008
Q12  319.7     319.2     318.2     9.418     9.499     9.53      6.97      6.929     6.206
Q13  3.434     3.429     3.429     0.3105    0.3028    0.3035    8.692     8.54      8.136

Table A.2: The server times for 1% of the data. Time in milliseconds.

     PostgreSQL                    Neo4j                         Berkeley DB
     mean      trim      median    mean      trim      median    mean      trim      median
Q1   61.75     61.62     61.54     0.5108    0.4027    0.388     0.1131    0.1026    0.1032
Q2   61.37     61.37     61.33     0.1364    0.1335    0.1332    0.05555   0.05541   0.0554
Q3   1.072     1.066     1.064     0.1775    0.1761    0.1757    0.08425   0.0829    0.08234
Q4   200.9     199.4     201.9     5.118     5.015     5.004     3.745     3.69      3.676
Q5   209.4     209.3     209.2     21.59     21.27     21.23     21.62     21.48     21.37
Q6   1097      1096      1098      46.82     46.57     46.37     37.03     36.88     36.44
Q7   567.3     566.7     567.2     39.17     39.23     39.04     179.3     179.8     187.4
Q8   54.51     49.95     18.76     32.76     32.59     32.5      0.5008    0.4963    0.4841
Q9   382.7     381.8     375.2     1.347     1.338     1.33      1.007     0.9983    0.9945
Q10  1808      1807      1806      69.08     60.8      60.32     37.71     37.24     37.21
Q11  6.072     6.037     6.335     0.187     0.1742    0.1559    1.481     1.245     0.9066
Q12  3641      3639      3637      120       119.9     120.3     74.42     74.81     74.77
Q13  27.5      27.49     27.48     3.368     3.238     3.237     114.1     113.9     113.8

Table A.3: The server times for 10% of the data. Time in milliseconds.


     PostgreSQL                    Neo4j                         Berkeley DB
     mean      trim      median    mean      trim      median    mean      trim      median
Q1   501.6     492.7     492.6     55.14     17.95     14.56     1.332     1.147     0.896
Q2   495.7     495.5     495.1     18.87     9.12      8.794     0.6003    0.5044    0.4637
Q3   19.53     19.38     17.89     90.93     63.22     46.37     0.5779    0.5593    0.5589
Q4   563.7     562.8     558.6     394.9     353.3     163.5     21.14     21.09     22.88
Q5   761.8     747.9     697.6     338.4     291       249.7     157.2     153       22.5
Q6   2153      2144      2135      4075      4096      4092      84.07     77.21     36.11
Q7   463.2     462.7     462.4     584.4     524.9     490.5     378       262.8     193
Q8   57.72     25.97     22.05     693.7     599.5     357.8     3.302     3.301     3.269
Q9   4059      4058      4058      93.71     55.95     32.2      5.644     5.669     5.648
Q10  52980     52890     52660     8455      8473      8559      504.5     442.7     410.7
Q11  21.65     18.87     14.65     16        12.37     8.832     24.99     17.14     11.91
Q12  112400    111900    112200    22320     22540     22750     740.5     740.8     731.8
Q13  374.3     373.9     371.8     1540      1505      1489      1136      1134      1133

Table A.4: The server times for 100% of the data. Time in milliseconds.
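The comparison drawn in the conclusions can be checked directly against Table A.4. The following Python sketch copies the mean server times at 100% of the data from that table and picks the fastest database for each question. The names MEAN_MS_100, fastest_per_question and count_wins are hypothetical and exist only for this illustration.

```python
from collections import Counter

DATABASES = ("PostgreSQL", "Neo4j", "Berkeley DB")

# Mean server times in milliseconds at 100% of the data, copied from Table A.4.
# Tuple order follows DATABASES.
MEAN_MS_100 = {
    "Q1": (501.6, 55.14, 1.332),
    "Q2": (495.7, 18.87, 0.6003),
    "Q3": (19.53, 90.93, 0.5779),
    "Q4": (563.7, 394.9, 21.14),
    "Q5": (761.8, 338.4, 157.2),
    "Q6": (2153, 4075, 84.07),
    "Q7": (463.2, 584.4, 378),
    "Q8": (57.72, 693.7, 3.302),
    "Q9": (4059, 93.71, 5.644),
    "Q10": (52980, 8455, 504.5),
    "Q11": (21.65, 16, 24.99),
    "Q12": (112400, 22320, 740.5),
    "Q13": (374.3, 1540, 1136),
}


def fastest_per_question(table):
    """Map each question to the database with the lowest time."""
    result = {}
    for question, times in table.items():
        best_index = min(range(len(times)), key=times.__getitem__)
        result[question] = DATABASES[best_index]
    return result


def count_wins(table):
    """Count for how many questions each database is the fastest."""
    return Counter(fastest_per_question(table).values())


if __name__ == "__main__":
    for question, database in fastest_per_question(MEAN_MS_100).items():
        print(question, database)
    print(dict(count_wins(MEAN_MS_100)))
```

On these mean values Berkeley DB is the fastest for 11 of the 13 questions, Neo4j is the fastest for Q11 and PostgreSQL for Q13.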

     0.1%                          1%                            10%
     mean      trim      median    mean      trim      median    mean      trim      median
Q1   0.001531  0.001521  0.001518  0.00117   0.001157  0.001138  0.00144   0.00142   0.001518
Q2   0.001579  0.001573  0.001518  0.001185  0.001189  0.001138  0.001467  0.001458  0.001518
Q3   0.001787  0.001779  0.001897  0.001569  0.001563  0.001518  0.001744  0.001719  0.001897
Q4   0.008888  0.008829  0.008348  0.008475  0.008419  0.007589  0.009677  0.009643  0.009106
Q5   0.02186   0.02147   0.02125   0.03216   0.03121   0.03111   0.0408    0.04063   0.0406
Q6   0.01621   0.01608   0.01594   0.02497   0.02483   0.02466   0.03602   0.03593   0.03605
Q7   0.1925    0.1872    0.1867    1.114     1.134     1.293     1.242     1.239     1.233
Q8   0.0371    0.03555   0.03529   0.2556    0.261     0.2831    0.62      0.6182    0.6575
Q9   0.05229   0.05061   0.05122   0.6377    0.6278    0.631     11.55     11.56     11.97
Q10  0.008697  0.008618  0.009106  0.05151   0.05172   0.05084   0.5432    0.5411    0.5418
Q11  0.004694  0.004403  0.004174  0.006041  0.005718  0.005312  0.007546  0.006902  0.006071
Q12  0.0187    0.01792   0.01404   0.1213    0.1217    0.1226    1.366     1.365     1.366
Q13  0.01769   0.0172    0.0167    0.08085   0.07295   0.06412   0.8914    0.8829    0.8852

Table A.5: The server times for the hash table. Time in milliseconds.


A.2 Client times

Here are the times recorded by the server, in seconds. Each value is the total time for 2048 test runs. The times for Neo4j and PostgreSQL are multiplied by an appropriate scalar to give a time for comparison.

     PostgreSQL   Neo4j        Berkeley DB  Hash table
Q1   29.75        2.221        2.217        1.86
Q2   26.72        1.383        1.455        1.239
Q3   2.54         1.147        1.142        0.952
Q4   198.7        37.52        38.9         30.42
Q5   175.4        83.29        97.62        74.35
Q6   75.15        181.7        182.7        139.2
Q7   876.9        262.6        285.8        215.1
Q8   61.33        10.94        4.53         3.561
Q9   73.91        7.958        8.266        6.402
Q10  63.16        3.745        4.004        3.065
Q11  11.84        0.889        1.341        0.674
Q12  127.1        5.233        5.645        4.502
Q13  4.52         1.195        1.83         1.006

Table A.6: Client times for all test runs at 0.1%. Time in seconds.
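Because each client time is a total over 2048 test runs, it can be set against the per-request server times in Section A.1. The Python sketch below multiplies a mean server time by the number of runs to estimate how much of a client total was spent inside the database call. The function names are hypothetical, and the sketch assumes that the client total and the server mean describe the same runs, which the text does not state explicitly.

```python
RUNS = 2048  # number of test runs behind each client total


def server_total_seconds(mean_ms, runs=RUNS):
    """Total server-side time in seconds for `runs` requests of `mean_ms` each."""
    return mean_ms * runs / 1000.0


def overhead_seconds(client_total_s, mean_ms, runs=RUNS):
    """Part of the client total that is not explained by the server time."""
    return client_total_s - server_total_seconds(mean_ms, runs)


if __name__ == "__main__":
    # Q1 at 0.1% of the data: server means from Table A.1, client totals from Table A.6.
    for name, mean_ms, client_s in (
        ("PostgreSQL", 13.41, 29.75),
        ("Berkeley DB", 0.1363, 2.217),
    ):
        inside = server_total_seconds(mean_ms)
        outside = overhead_seconds(client_s, mean_ms)
        print(f"{name}: {inside:.3f} s in the database call, {outside:.3f} s elsewhere")
```

For Q1 at 0.1% this gives roughly 27.46 s of the 29.75 s client total for PostgreSQL, but only about 0.28 s of the 2.217 s client total for Berkeley DB. Under the stated assumption, most of the client time for the faster databases is spent outside the database call itself.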

     PostgreSQL   Neo4j        Berkeley DB  Hash table
Q1   39.84        1.296        1.357        1.007
Q2   37.74        0.983        0.962        0.776
Q3   2.192        0.872        0.837        0.687
Q4   242.8        31.02        39.59        23.9
Q5   457          152.6        182.6        133.8
Q6   1491         263.7        268.6        197.7
Q7   2754         791.2        905.3        641.7
Q8   57.92        63.22        3.875        3.431
Q9   139.8        8.397        9.377        7.722
Q10  350.7        32.57        31.66        23.22
Q11  11.58        0.956        3.038        0.743
Q12  698.5        63.31        67           45.28
Q13  7.891        1.485        19.59        0.984

Table A.7: Client times for all test runs at 1%. Time in seconds.


     PostgreSQL   Neo4j        Berkeley DB  Hash table
Q1   127.9        2.628        1.457        1.193
Q2   126.9        1.38         0.998        0.865
Q3   3.478        2.006        1.533        1.273
Q4   436.8        41.96        53.12        26.51
Q5   574.5        193          204.2        154.5
Q6   2494         360.1        334.2        254.2
Q7   1270         193.1        481.6        114.4
Q8   113          68.62        2.041        2.262
Q9   791          9.891        9.14         31.07
Q10  3940         400.3        328.9        246.1
Q11  13.36        1.405        3.874        0.794
Q12  7932         762.5        661.7        496.3
Q13  58.98        9.899        236.7        4.437

Table A.8: Client times for all test runs at 10%. Time in seconds.

     PostgreSQL   Neo4j        Berkeley DB
Q1   1032         116.9        4.127
Q2   1018         41.56        2.143
Q3   48.68        198.1        7.885
Q4   1181         838.1        69.37
Q5   1710         844.6        494.7
Q6   4659         8604         428.8
Q7   961.6        1212         78.14
Q8   120.6        1423         7.619
Q9   8322         201.1        18.67
Q10  111000       19970        3758
Q11  47.34        35.38        52.44
Q12  235200       50860        9654
Q13  789.7        3186         2349

Table A.9: Client times for all test runs at 100%. Time in seconds.