
CS 708 Seminar
NoSQL

Prepared by,
Fayaz Yusuf Khan
Reg. No: CSU071/16

Guided by,
Nimi Prakash P.
System Analyst

Computer Science & Engineering Department

September 29, 2010


NoSQL DEFINITION

Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly. Often more characteristics apply, such as: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge amount of data, and more. So the misleading term "NoSQL" (the community now translates it mostly with "not only SQL") should be seen as an alias to something like the definition above. [1]


Contents

1 Introduction
    1.1 Why relational databases are not enough
    1.2 What NoSQL has to offer
    1.3 ACIDs & BASEs
        1.3.1 ACID Properties of Relational Databases
        1.3.2 CAP
        1.3.3 BASE

2 NoSQL Features
    2.1 No entity joins
    2.2 Eventual Consistency
        2.2.1 Historical Perspective
        2.2.2 Consistency — Client and Server
    2.3 Cloud Based Memory Architecture
        2.3.1 Memory Based Architectures

3 Different NoSQL Database Choices
    3.1 Document Databases & BigTable
    3.2 Graph Databases
    3.3 MapReduce
    3.4 Distributed Key-Value Stores

4 NoSQL: Merits & Demerits
    4.1 Merits
        4.1.1 Semi-Structured Data
        4.1.2 Alternative Model Paradigms
        4.1.3 Multi-Valued Properties
        4.1.4 Generalized Analytics
        4.1.5 Version History
        4.1.6 Predictable Scalability
        4.1.7 Schema Evolution
        4.1.8 More Natural Fit with Code
    4.2 Demerits
        4.2.1 Limitations on Analytics

5 Conclusion
    5.1 Data inconsistency
    5.2 Making a Decision

A Different popular NoSQL databases
    A.1 The Shortlist
    A.2 Cloud-Service Contenders
        A.2.1 Amazon: SimpleDB
        A.2.2 Google AppEngine Data Store
        A.2.3 Microsoft: SQL Data Services
    A.3 Non-Cloud Service Contenders
        A.3.1 Tokyo Cabinet
        A.3.2 CouchDB
        A.3.3 Project Voldemort
        A.3.4 Mongo
        A.3.5 Drizzle
        A.3.6 Cassandra
        A.3.7 BigTable
        A.3.8 Dynamo


Chapter 1

Introduction

The history of the relational database has been one of continual adversity: initially, many claimed that mathematical set-based models could never be the basis for efficient database implementations; later, aspiring object oriented databases claimed they would remove the "middle man" of relational databases from the OO design and persistence process. In all of these cases, through a combination of sound concepts, elegant implementation, and general applicability, relational databases have become and remained the lingua franca of data storage and manipulation.

Most recently, a new contender has arisen to challenge the supremacy of relational databases. Referred to generally as "non-relational databases" (among other names), this class of storage engine seeks to break down the rigidity of the relational model, in exchange for leaner models that can perform and scale at higher levels, using various models (including key/value pairs, sharded arrays, and document-oriented approaches) which can be created and read efficiently as the basic unit of data storage. Primarily, these new technologies have arisen in situations where traditional relational database systems would be extremely challenging to scale to the degree needed for global systems (for example, at companies such as Google, Yahoo, Amazon, LinkedIn, etc., which regularly collect, store and analyze massive data sets with extremely high transactional throughput and low latency). As of this writing, there exist dozens of variants of this new model, each with different capabilities and trade-offs, but all with the general property that traditional relational design — as practiced on relational database management systems like Oracle, Sybase, etc. — is neither possible nor desired. [11]


1.1 Why relational databases are not enough

Even though RDBMS have provided database users with the best mix of simplicity, robustness, flexibility, performance, scalability, and compatibility, their performance in each of these areas is not necessarily better than that of an alternate solution pursuing one of these benefits in isolation. This has not been much of a problem so far because the universal dominance of RDBMS has outweighed the need to push any of these boundaries. Nonetheless, if you really had a need that couldn't be answered by a generic relational database, alternatives have always been around to fill those niches.

Today, we are in a slightly different situation. For an increasing number of applications, one of these benefits is becoming more and more critical; and while still considered a niche, it is rapidly becoming mainstream, so much so that for an increasing number of database users this requirement is beginning to eclipse others in importance. That benefit is scalability. As more and more applications are launched in environments that have massive workloads, such as web services, their scalability requirements can, first of all, change very quickly and, secondly, grow very large. The first scenario can be difficult to manage if you have a relational database sitting on a single in-house server. For example, if your load triples overnight, how quickly can you upgrade your hardware? The second scenario can be too difficult to manage with a relational database in general.

Relational databases scale well, but usually only when that scaling happens on a single server node. When the capacity of that single node is reached, you need to scale out and distribute that load across multiple server nodes. This is when the complexity of relational databases starts to rub against their potential to scale. Try scaling to hundreds or thousands of nodes, rather than a few, and the complexities become overwhelming, and the characteristics that make RDBMS so appealing drastically reduce their viability as platforms for large distributed systems.

For cloud services to be viable, vendors have had to address this limitation, because a cloud platform without a scalable data store is not much of a platform at all. So, to provide customers with a scalable place to store application data, vendors had only one real option. They had to implement a new type of database system that focuses on scalability, at the expense of the other benefits that come with relational databases.

These efforts, combined with those of existing niche vendors, have led to the rise of a new breed of database management system. [2]


1.2 What NoSQL has to offer

This new kind of database management system is commonly called a key/value store. In fact, no official name yet exists, so you may see it referred to as document-oriented, Internet-facing, attribute-oriented, distributed database (although this can be relational also), sharded sorted arrays, distributed hash table, and key/value database. While each of these names points to specific traits of this new approach, they are all variations on one theme, which we'll call key/value databases.

Whatever you call it, this "new" type of database has been around for a long time and has been used for specialized applications for which the generic relational database was ill-suited. But without the scale that web and cloud applications have brought, it would have remained a mostly unused subset. Now, the challenge is to recognize whether it or a relational database would be better suited to a particular application.

Relational databases and key/value databases are fundamentally different and designed to meet different needs. A side-by-side comparison only takes you so far in understanding these differences, but to begin, let's lay one down: [2]

1.3 ACIDs & BASEs

1.3.1 ACID Properties of Relational Databases

• The claim to fame for relational databases is that they make the ACID promise:

– Atomicity — a transaction is all or nothing.

– Consistency — only valid data is written to the database.

– Isolation — pretend all transactions are happening serially and the data is correct.

– Durability — what you write is what you get.

• The problem with ACID is that it gives too much; it trips up when trying to scale a system across multiple nodes.

• Downtime is unacceptable, so the system needs to be reliable. Reliability requires multiple nodes to handle machine failures.

• Making a scalable system that can handle lots and lots of reads and writes requires many more nodes.


Database Definition

Relational Database: Database contains tables, tables contain columns and rows, and rows are made up of column values. Rows within a table all have the same schema.
Key/Value Database: Domains can initially be thought of like a table, but unlike a table you don't define any schema for a domain. A domain is basically a bucket that you put items into. Items within a single domain can have differing schema.

Relational Database: The data model is well defined in advance. A schema is strongly typed and it has constraints and relationships that enforce data integrity.
Key/Value Database: Items are identified by keys, and a given item can have a dynamic set of attributes attached to it.

Relational Database: The data model is based on a "natural" representation of the data it contains, not on an application's functionality.
Key/Value Database: In some implementations, attributes are all of a string type. In other implementations, attributes have simple types that reflect code types, such as ints, string arrays, and lists.

Relational Database: The data model is normalized to remove data duplication. Normalization establishes table relationships. Relationships associate data between tables.
Key/Value Database: No relationships are explicitly defined between domains or within a given domain.

Table 1.1: Fundamental differences between relational databases and key/value stores

• Once we try to scale ACID across many machines we hit problems with network failures and delays. The algorithms don't work in a distributed environment at any acceptable speed.

1.3.2 CAP

• If we can’t have all of the ACID guarantees it turns out we can have twoof the following three characteristics:

– Consistency — your data is correct all the time. What you write iswhat you read.

– Availability — you can read and write and write your data all thetime.


– Partition Tolerance — if one or more nodes fail, the system still works and becomes consistent when the system comes back on-line.

1.3.3 BASE

• The types of large systems based on CAP aren't ACID; they are BASE:

– Basically Available — system seems to work all the time.

– Soft State — it doesn’t have to be consistent all the time.

– Eventually Consistent — becomes consistent at some later time.

• Everyone who builds big applications builds them on CAP and BASE: Google, Yahoo, Facebook, Amazon, eBay, etc. [7]


Chapter 2

NoSQL Features

2.1 No entity joins

Key/value databases are item-oriented, meaning all relevant data relating to an item are stored within that item. A domain (which you can think of as a table) can contain vastly different items. For example, a domain may contain customer items and order items. This means that data are commonly duplicated between items in a domain. This is accepted practice because disk space is relatively cheap. But this model allows a single item to contain all relevant data, which improves scalability by eliminating the need to join data from multiple tables. With a relational database, such data needs to be joined to be able to regroup relevant attributes.

But while the need for relationships is greatly reduced with key/value databases, certain ones are inevitable. These relationships usually exist among core entities. For example, an ordering system would have items that contain data about customers, products, and orders. Whether these reside on the same domain or separate domains is irrelevant; but when a customer places an order, you would likely not want to store both the customer's and product's attributes in the same order item.

Instead, orders would need to contain relevant keys that point to the customer and product. While this is perfectly doable in a key/value database, these relationships are not defined in the data model itself, and so the database management system cannot enforce the integrity of the relationships. This means you can delete customers and the products they have ordered. The responsibility of ensuring data integrity falls entirely to the application. [2]
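As a rough illustration of this trade-off, the sketch below models an order as a single item that embeds the attribute values it needs for display, plus plain reference keys back to the customer and product items. The store, the entity names and the place_order helper are hypothetical; the point is only that any referential integrity check lives in application code, not in the database.

    # A minimal in-memory stand-in for a key/value domain (hypothetical).
    store = {
        "customer:42": {"name": "Homer Simpson", "email": "homer@simpson.com"},
        "product:7": {"title": "Duff Beer, 6-pack", "price": 5.99},
    }

    def place_order(order_id, customer_key, product_key, quantity):
        # Integrity is the application's job: the database will not stop us
        # from referencing a customer or product that no longer exists.
        if customer_key not in store or product_key not in store:
            raise ValueError("dangling reference; refusing to create order")
        product = store[product_key]
        store["order:%s" % order_id] = {
            "customer_key": customer_key,   # reference key, not a join
            "product_key": product_key,
            # duplicated (denormalized) data, acceptable because disk is cheap
            "product_title": product["title"],
            "unit_price": product["price"],
            "quantity": quantity,
        }

    place_order(1, "customer:42", "product:7", 2)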

Data Access

Relational Database: Data is created, updated, deleted, and retrieved using SQL.
Key/Value Database: Data is created, updated, deleted, and retrieved using API method calls.

Relational Database: SQL queries can access data from a single table or multiple tables through table joins.
Key/Value Database: Some implementations provide basic SQL-like syntax for defining filter criteria.

Relational Database: SQL queries include functions for aggregation and complex filtering.
Key/Value Database: Often only basic filter predicates (such as =, ≠, <, >, ≤, and ≥) can be applied.

Relational Database: Usually contain means of embedding logic close to data storage, such as triggers, stored procedures, and functions.
Key/Value Database: All application and data integrity logic is contained in the application code.

Table 2.1: Data access for relational databases and key/value stores

Application Interface

Relational Database: Tend to have their own specific API, or make use of a generic API such as OLE-DB or ODBC.
Key/Value Database: Tend to provide SOAP and/or REST APIs over which data access calls can be made.

Relational Database: Data is stored in a format that represents its natural structure, so it must be mapped between the application code structure and the relational structure.
Key/Value Database: Data can be more effectively stored in a form compatible with the application code's structure, requiring only "plumbing" code for the object.

Table 2.2: Application interfaces for relational databases and key/value stores


2.2 Eventual Consistency

Eventually Consistent - Building reliable distributed systems at a worldwide scale demands trade-offs between consistency and availability.

At the foundation of Amazon's cloud computing are infrastructure services such as Amazon's S3 (Simple Storage Service), SimpleDB, and EC2 (Elastic Compute Cloud) that provide the resources for constructing Internet-scale computing platforms and a great variety of applications. The requirements placed on these infrastructure services are very strict; they need to score high marks in the areas of security, scalability, availability, performance, and cost effectiveness, and they need to meet these requirements while serving millions of customers around the globe, continuously.

Under the covers these services are massive distributed systems that operate on a worldwide scale. This scale creates additional challenges, because when a system processes trillions and trillions of requests, events that normally have a low probability of occurrence are now guaranteed to happen and need to be accounted for up front in the design and architecture of the system. Given the worldwide scope of these systems, we use replication techniques ubiquitously to guarantee consistent performance and high availability. Although replication brings us closer to our goals, it cannot achieve them in a perfectly transparent manner; under a number of conditions the customers of these services will be confronted with the consequences of using replication techniques inside the services.

One of the ways in which this manifests itself is in the type of data consistency that is provided, particularly when the underlying distributed system provides an eventual consistency model for data replication. When designing these large-scale systems, we ought to use a set of guiding principles and abstractions related to large-scale data replication and focus on the trade-offs between high availability and data consistency. [4]

2.2.1 Historical Perspective

In an ideal world there would be only one consistency model: when an update is made all observers would see that update. The first time this surfaced as difficult to achieve was in the database systems of the late '70s. The best "period piece" on this topic is "Notes on Distributed Databases" by Bruce Lindsay et al. It lays out the fundamental principles for database replication and discusses a number of techniques that deal with achieving consistency. Many of these techniques try to achieve distribution transparency — that is, to the user of the system it appears as if there is only one system instead of a number of collaborating systems. Many systems during this time took the approach that it was better to fail the complete system than to break this transparency.


In the mid-'90s, with the rise of larger Internet systems, these practices were revisited. At that time people began to consider the idea that availability was perhaps the most important property of these systems, but they were struggling with what it should be traded off against. Eric Brewer, systems professor at the University of California, Berkeley, and at that time head of Inktomi, brought the different trade-offs together in a keynote address to the PODC (Principles of Distributed Computing) conference in 2000. He presented the CAP theorem, which states that of three properties of shared-data systems — data consistency, system availability, and tolerance to network partition — only two can be achieved at any given time. A more formal confirmation can be found in a 2002 paper by Seth Gilbert and Nancy Lynch.

A system that is not tolerant to network partitions can achieve data consistency and availability, and often does so by using transaction protocols. To make this work, client and storage systems must be part of the same environment; they fail as a whole under certain scenarios, and as such, clients cannot observe partitions. An important observation is that in larger distributed-scale systems, network partitions are a given; therefore, consistency and availability cannot be achieved at the same time. This means that there are two choices on what to drop: relaxing consistency will allow the system to remain highly available under the partitionable conditions, whereas making consistency a priority means that under certain conditions the system will not be available.

Both options require the client developer to be aware of what the system is offering. If the system emphasizes consistency, the developer has to deal with the fact that the system may not be available to take, for example, a write. If this write fails because of system unavailability, then the developer will have to deal with what to do with the data to be written. If the system emphasizes availability, it may always accept the write, but under certain conditions a read will not reflect the result of a recently completed write. The developer then has to decide whether the client requires access to the absolute latest update all the time. There is a range of applications that can handle slightly stale data, and they are served well under this model.

In principle the consistency property of transaction systems as defined in the ACID properties (atomicity, consistency, isolation, durability) is a different kind of consistency guarantee. In ACID, consistency relates to the guarantee that when a transaction is finished the database is in a consistent state; for example, when transferring money from one account to another the total amount held in both accounts should not change. In ACID-based systems, this kind of consistency is often the responsibility of the developer writing the transaction but can be assisted by the database managing integrity constraints. [4]


2.2.2 Consistency — Client and Server

There are two ways of looking at consistency. One is from the developer/client point of view: how they observe data updates. The second way is from the server side: how updates flow through the system and what guarantees systems can give with respect to updates. [4]

Client-side Consistency

The client side has these components:

A storage system. For the moment we'll treat it as a black box, but one should assume that under the covers it is something of large scale and highly distributed, and that it is built to guarantee durability and availability.

Process A. This is a process that writes to and reads from the storage system.

Processes B and C. These two processes are independent of process A and write to and read from the storage system. It is irrelevant whether these are really processes or threads within the same process; what is important is that they are independent and need to communicate to share information.

Client-side consistency has to do with how and when observers (in this case the processes A, B, or C) see updates made to a data object in the storage systems. In the following examples illustrating the different types of consistency, process A has made an update to a data object:

Strong consistency. After the update completes, any subsequent access (by A, B, or C) will return the updated value.

Weak consistency. The system does not guarantee that subsequent accesses will return the updated value. A number of conditions need to be met before the value will be returned. The period between the update and the moment when it is guaranteed that any observer will always see the updated value is dubbed the inconsistency window.

Eventual consistency. This is a specific form of weak consistency; the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value. If no failures occur, the maximum size of the inconsistency window can be determined based on factors such as communication delays, the load on the system, and the number of replicas involved in the replication scheme. The most popular system that implements eventual consistency is DNS (Domain Name System). Updates to a name are distributed according to a configured pattern and in combination with time-controlled caches; eventually, all clients will see the update.

The eventual consistency model has a number of variations that are important to consider:

Causal consistency. If process A has communicated to process B that it has updated a data item, a subsequent access by process B will return the updated value, and a write is guaranteed to supersede the earlier write. Access by process C that has no causal relationship to process A is subject to the normal eventual consistency rules.

Read-your-writes consistency. This is an important model where process A, after it has updated a data item, always accesses the updated value and will never see an older value. This is a special case of the causal consistency model.

Session consistency. This is a practical version of the previous model, where a process accesses the storage system in the context of a session. As long as the session exists, the system guarantees read-your-writes consistency. If the session terminates because of a certain failure scenario, a new session needs to be created and the guarantees do not overlap the sessions.

Monotonic read consistency. If a process has seen a particular value for the object, any subsequent accesses will never return any previous values.

Monotonic write consistency. In this case the system guarantees to serialize the writes by the same process. Systems that do not guarantee this level of consistency are notoriously hard to program.

A number of these properties can be combined. For example, one can get monotonic reads combined with session-level consistency. From a practical point of view these two properties (monotonic reads and read-your-writes) are most desirable in an eventual consistency system, but not always required. These two properties make it simpler for developers to build applications, while allowing the storage system to relax consistency and provide high availability.

As you can see from these variations, quite a few different scenarios are possible. It depends on the particular applications whether or not one can deal with the consequences.

Eventual consistency is not some esoteric property of extreme distributed systems. Many modern RDBMSs (relational database management systems) that provide primary-backup reliability implement their replication techniques in both synchronous and asynchronous modes. In synchronous mode the replica update is part of the transaction. In asynchronous mode the updates arrive at the backup in a delayed manner, often through log shipping. In the latter mode, if the primary fails before the logs are shipped, reading from the promoted backup will produce old, inconsistent values. Also, to support better scalable read performance, RDBMSs have started to provide the ability to read from the backup, which is a classical case of providing eventual consistency guarantees in which the inconsistency windows depend on the periodicity of the log shipping. [4]

Server-side Consistency

On the server side we need to take a deeper look at how updates flow through the system to understand what drives the different modes that the developer who uses the system can experience. Let's establish a few definitions before getting started:

N = the number of nodes that store replicas of the data

W = the number of replicas that need to acknowledge the receipt of the update before the update completes

R = the number of replicas that are contacted when a data object is accessed through a read operation

If W + R > N, then the write set and the read set always overlap and one can guarantee strong consistency. In the primary-backup RDBMS scenario, which implements synchronous replication, N = 2, W = 2, and R = 1. No matter from which replica the client reads, it will always get a consistent answer. In asynchronous replication with reading from the backup enabled, N = 2, W = 1, and R = 1. In this case R + W = N, and consistency cannot be guaranteed.

The problem with these configurations, which are basic quorum protocols, is that when the system cannot write to W nodes because of failures, the write operation has to fail, marking the unavailability of the system. With N = 3 and W = 3 and only two nodes available, the system will have to fail the write.

In distributed-storage systems that need to provide high performance and high availability, the number of replicas is in general higher than two. Systems that focus solely on fault tolerance often use N = 3 (with W = 2 and R = 2 configurations). Systems that need to serve very high read loads often replicate their data beyond what is required for fault tolerance; N can be tens or even hundreds of nodes, with R configured to 1 such that a single read will return a result. Systems that are concerned with consistency are set to W = N for updates, which may decrease the probability of the write succeeding. A common configuration for these systems that are concerned about fault tolerance but not consistency is to run with W = 1 to get minimal durability of the update and then rely on a lazy (epidemic) technique to update the other replicas.

How to configure N, W, and R depends on what the common case is and which performance path needs to be optimized. In R = 1 and N = W we optimize for the read case, and in W = 1 and R = N we optimize for a very fast write. Of course, in the latter case, durability is not guaranteed in the presence of failures, and if W < (N + 1)/2, there is the possibility of conflicting writes when the write sets do not overlap.

Weak/eventual consistency arises when W + R ≤ N, meaning that there is a possibility that the read and write set will not overlap. If this is a deliberate configuration and not based on a failure case, then it hardly makes sense to set R to anything but 1. This happens in two very common cases: the first is the massive replication for read scaling mentioned earlier; the second is where data access is more complicated. In a simple key-value model it is easy to compare versions to determine the latest value written to the system, but in systems that return sets of objects it is more difficult to determine what the correct latest set should be. In most of these systems where the write set is smaller than the replica set, a mechanism is in place that applies the updates in a lazy manner to the remaining nodes in the replica's set. The period until all replicas have been updated is the inconsistency window discussed before. If W + R ≤ N, then the system is vulnerable to reading from nodes that have not yet received the updates.
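The relationship between N, W, and R described above reduces to a small piece of arithmetic. The following sketch is only an illustration of that rule (the function name and its return labels are invented for this example):

    def consistency_mode(n, w, r):
        # Strong consistency: every read set is guaranteed to overlap
        # every write set.
        if w + r > n:
            return "strong"
        # Otherwise reads may hit replicas that have not yet seen the update.
        # (Separately, W < (N + 1) / 2 admits conflicting concurrent writes.)
        return "weak/eventual"

    # Synchronous primary-backup replication: N=2, W=2, R=1 -> strong.
    print(consistency_mode(2, 2, 1))
    # Asynchronous replication with reads from the backup: N=2, W=1, R=1 -> weak.
    print(consistency_mode(2, 1, 1))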

Whether or not read-your-writes, session, and monotonic consistency can be achieved depends in general on the "stickiness" of clients to the server that executes the distributed protocol for them. If this is the same server every time, then it is relatively easy to guarantee read-your-writes and monotonic reads. This makes it slightly harder to manage load balancing and fault tolerance, but it is a simple solution. Using sessions, which are sticky, makes this explicit and provides an exposure level that clients can reason about.

Sometimes the client implements read-your-writes and monotonic reads. By adding versions on writes, the client discards reads of values with versions that precede the last-seen version.
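A minimal sketch of that client-side technique, assuming the storage system returns a monotonically increasing version number with every read (the class and method names here are illustrative, not part of any particular product):

    class MonotonicClient:
        """Discards stale reads by remembering the highest version seen so far."""

        def __init__(self, storage):
            self.storage = storage      # any object with get(key) -> (version, value)
            self.last_seen = {}         # key -> highest version observed

        def read(self, key):
            version, value = self.storage.get(key)
            if version < self.last_seen.get(key, -1):
                # An older replica answered; ignore it to preserve monotonic reads.
                return None
            self.last_seen[key] = version
            return value

    class FakeStorage:
        def get(self, key):
            return (1, "value-1")

    client = MonotonicClient(FakeStorage())
    print(client.read("x"))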

Partitions happen when some nodes in the system cannot reach other nodes, but both sets are reachable by groups of clients. If you use a classical majority quorum approach, then the partition that has W nodes of the replica set can continue to take updates while the other partition becomes unavailable. The same is true for the read set. Given that these two sets overlap, by definition the minority set becomes unavailable. Partitions don't happen frequently, but they do occur between data centers, as well as inside data centers.

In some applications the unavailability of any of the partitions is unacceptable, and it is important that the clients that can reach that partition make progress. In that case both sides assign a new set of storage nodes to receive the data, and a merge operation is executed when the partition heals. [4]

2.3 Cloud Based Memory Architecture

2.3.1 Memory Based Architectures

Google query results are now served in under an astonishingly fast 200 ms, down from 1000 ms in the olden days. The vast majority of this great performance improvement is due to holding indexes completely in memory. Thousands of machines process each query in order to make search results appear nearly instantaneously.

This text was adapted from notes on Google Fellow Jeff Dean's keynote speech at WSDM 2009.

What makes Memory Based Architectures different from traditional architectures is that memory is the system of record. Typically, disk-based databases have been the system of record: all the data is stored on the disk. Disks being slow, we have ended up wrapping them in complicated caching and distributed file systems to make them perform.

Even though memory is used all over the place as a cache, it is simply assumed that the cache can be invalidated at any time. In Memory Based Architectures, memory is where the "official" data values are stored.

Caching also serves a different purpose. The purpose behind cache-based architectures is to minimize the data bottleneck due to disk. Memory-based architectures can address the entire end-to-end application stack. Data in memory can be of higher reliability and availability than in traditional architectures.

Memory Based Architectures initially developed out of the need in some application spaces for very low latencies. The dramatic drop in RAM prices, along with the ability of servers to handle larger and larger amounts of RAM, has brought memory architectures to the verge of going mainstream. For example, someone recently calculated that 1 TB of RAM across 40 servers, at 24 GB per server, would cost an additional $40,000, which is really quite affordable given the cost of the servers. Projecting out, 1U and 2U rack-mounted servers will soon support a terabyte or more of memory. [6]

RAM: High Bandwidth and Low Latency

Compared to disk, RAM is a high-bandwidth and low-latency storage medium. The bandwidth of RAM is typically around 5 GB/s, while the bandwidth of disk is about 100 MB/s, roughly fifty times lower. Modern hard drives have latencies under 13 milliseconds, and when many applications are queued for disk reads, latencies can easily reach the many-second range. Memory latency is in the 5 nanosecond range, several orders of magnitude lower. [6]

RAM is the New Disk

The superiority of RAM is at the heart of the "RAM is the new disk" paradigm. As an architecture it combines the holy quadrinity of computing:

• Performance is better because data is accessed from memory instead of through a database to a disk.

• Scalability is linear because as more servers are added, data is transparently load balanced across the servers, so there is an automated in-memory sharding.

• Availability is higher because multiple copies of data are kept in memory and the entire system reroutes on failure.

• Application development is faster because there's only one layer of software to deal with, the cache, and its API is simple. All the complexity is hidden from the programmer, which means all a developer has to do is get and put data.

Accessing disk on the critical path of any transaction limits both throughput and latency. Committing a transaction over the network in-memory is faster than writing through to disk, and reading data from memory is also faster than reading data from disk. So the idea is to skip disk, except perhaps as an asynchronous write-behind option, archival storage, and for large files. [6]


Chapter 3

Different NoSQL Database Choices

3.1 Document Databases & BigTable

The BigTable paper describes how Google developed their own massively scalable database for internal use, as the basis for several of their services. The data model is quite different from relational databases: columns don't need to be predefined, and rows can be added with any set of columns. Empty columns are not stored at all.

BigTable inspired many developers to write their own implementations of this data model; amongst the most popular are HBase, Hypertable and Cassandra. The lack of a predefined schema can make these databases attractive in applications where the attributes of objects are not known in advance, or change frequently.

Document databases have a related data model (although the way they handle concurrency and distributed servers can be quite different): a BigTable row with its arbitrary number of columns/attributes corresponds to a document in a document database, which is typically a tree of objects containing attribute values and lists, often with a mapping to JSON or XML. Open source document databases include Project Voldemort, CouchDB, MongoDB, ThruDB and Jackrabbit.

How is this different from just dumping JSON strings into MySQL? Document databases can actually work with the structure of the documents, for example extracting, indexing, aggregating and filtering based on attribute values within the documents. Alternatively you could of course build the attribute indexing yourself, but I wouldn't recommend that unless it makes working with your legacy code easier.
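For instance, with a document store such as MongoDB, a query can reach into the structure of each document and filter on a nested attribute. The database and collection names and field names below are made up for illustration, and the sketch assumes a locally running MongoDB server and the PyMongo driver:

    from pymongo import MongoClient

    client = MongoClient()          # assumes mongod on localhost:27017
    orders = client.shop.orders     # hypothetical database and collection

    # Store a document whose structure the database can index and query.
    orders.insert_one({
        "customer": {"name": "Homer", "city": "Springfield"},
        "items": [{"sku": "duff-6pack", "qty": 2}],
    })

    # Filter on an attribute nested inside the document.
    for order in orders.find({"customer.city": "Springfield"}):
        print(order["items"])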


The big limitation of BigTables and document databases is that most implementations cannot perform joins or transactions spanning several rows or documents. This restriction is deliberate, because it allows the database to do automatic partitioning, which can be important for scaling — see section 3.4 on distributed key-value stores below. If the structure of our data is lots of independent documents, this is not a problem — but if the data fits nicely into a relational model and we need joins, we shouldn't try to force it into a document model. [9]

3.2 Graph Databases

Graph databases live at the opposite end of the spectrum. While document databases are good for storing data which is structured in the form of lots of independent documents, graph databases focus on the relationships between items — a better fit for highly interconnected data models.

Standard SQL cannot query transitive relationships, i.e. variable-length chains of joins which continue until some condition is reached. Graph databases, on the other hand, are optimised precisely for this kind of data.

There is less choice in graph databases than there is in document databases: Neo4j, AllegroGraph and Sesame (which typically uses MySQL or PostgreSQL as a storage back-end) are ones to look at. FreeBase and DirectedEdge have developed graph databases for their internal use.

Graph databases are often associated with the semantic web and RDF datastores, which is one of the applications they are used for. [9]

3.3 MapReduce

Popularised by another Google paper, MapReduce is a way of writing batch processing jobs without having to worry about infrastructure. Different databases lend themselves more or less well to MapReduce.

Hadoop is the big one amongst the open MapReduce implementations, and Skynet and Disco are also worth looking at. CouchDB also includes some MapReduce ideas on a smaller scale. [9]
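The programming model itself is small: a map function turns each input record into intermediate key/value pairs, and a reduce function folds together all values that share a key. The sketch below is a single-process illustration of that model (not the Hadoop or Disco API), using the classic word-count example:

    from collections import defaultdict

    def map_phase(document):
        # Emit one (word, 1) pair per word in the input record.
        for word in document.split():
            yield word.lower(), 1

    def reduce_phase(word, counts):
        # Fold all intermediate values that share the same key.
        return word, sum(counts)

    def run_mapreduce(documents):
        intermediate = defaultdict(list)
        for doc in documents:                       # map + shuffle
            for key, value in map_phase(doc):
                intermediate[key].append(value)
        return dict(reduce_phase(k, v) for k, v in intermediate.items())

    print(run_mapreduce(["the quick brown fox", "the lazy dog"]))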

3.4 Distributed Key-Value Stores

A key-value store is a very simple concept, much like a hash table: you can retrieve an item based on its key, you can insert a key/value pair, and you can delete a key/value pair. The value can just be an opaque list of bytes, or might be a structured document (most of the document databases and BigTable implementations above can also be considered to be key-value stores).
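That small interface is exactly what makes transparent partitioning possible: because every operation names a single key, the store can hash the key to pick a node without understanding the value at all. A toy sketch of the idea follows; the node list and hashing scheme are deliberately simplified, and real systems use consistent hashing plus replication:

    import hashlib

    class ShardedKVStore:
        """Get/put/delete routed to one of several nodes by hashing the key."""

        def __init__(self, node_count):
            self.nodes = [dict() for _ in range(node_count)]  # stand-ins for servers

        def _node_for(self, key):
            digest = hashlib.md5(key.encode("utf-8")).hexdigest()
            return self.nodes[int(digest, 16) % len(self.nodes)]

        def put(self, key, value):
            self._node_for(key)[key] = value

        def get(self, key):
            return self._node_for(key).get(key)

        def delete(self, key):
            self._node_for(key).pop(key, None)

    store = ShardedKVStore(node_count=4)
    store.put("customer:42", b"opaque bytes or a structured document")
    print(store.get("customer:42"))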

Document databases, graph databases and MapReduce introduce new data models and new ways of thinking which can be useful even in small-scale applications. Distributed key-value stores, on the other hand, are really just about scalability. They can scale to truly vast amounts of data — much more than a single server could hold.

Distributed databases can transparently partition and replicate the data across many machines in a cluster. We don't need to figure out a sharding scheme to decide on which server we can find a particular piece of data; the database can locate it for us. If one server dies, no problem — others can immediately take over. If we need more resources, we just add servers to the cluster, and the database will automatically give them a share of the load and the data.

When choosing a key-value store we need to decide whether it should be optimised for low latency (for lightning-fast data access during the request-response cycle) or for high throughput (which is needed for batch processing jobs).

Other than the BigTables and document databases above, Scalaris, Dynomite and Ringo provide certain data consistency guarantees while taking care of partitioning and distributing the dataset. MemcacheDB and Tokyo Cabinet (with Tokyo Tyrant for network service and LightCloud to make it distributed) focus on latency.

The caveat about limited transactions and joins applies even more strongly for distributed databases. Different implementations take different approaches, but in general, if we need to read several items, manipulate them in some way and then write them back, there is no guarantee that we will end up in a consistent state immediately (although many implementations try to become eventually consistent by resolving write conflicts or using distributed transaction protocols). [9]


Chapter 4

NoSQL: Merits & Demerits

4.1 Merits

There is a long list of potential advantages to using non-relational databases. Of course, not all non-relational databases are the same; but the following list covers areas common to many of them. [11]

4.1.1 Semi-Structured Data

Here we have a structure where each entity can have any number of properties defined at run-time. This approach is clearly helpful in domains where the problem is itself amenable to expansion or change over time. We can begin simply, and alter the details of our problem as we go with minimal administrative burden. This approach has much in common with the imputed typing systems of scripting languages like Python, which, while often less efficient than strongly typed languages like C and Java, usually more than make up for this deficiency by giving programmers improved usability; they can get started quickly and add structure and overhead only as needed.

But there is another, more important aspect to this tendency towards storing non-structured, or semi-structured, data: the idea that our understanding of a problem, and its data, might legitimately emerge over time, and be entirely data-driven after the fact. As one observer put it:

RDBMSs are designed to model very highly and statically structured data which has been modeled with mathematical precision — data and designs that do not meet these criteria, such as data designed for direct human consumption, lose the advantages of the relational model, and result in poorer maintainability than with less stringent models.

This kind of emergent behavior is atypical when dealing with the programming problems of the past 40 years, such as accounting systems, desktop word processing software, etc. However, many of today's interesting problems involve unpredictable behavior and inputs from extremely large populations; consider web search, social network graphs, large scale purchasing habits, etc. In these "messy" arenas, the impulse to exactly model and define all the possible structures in the data in advance is exactly the wrong approach. Relational data design tends to turn programmers into "structure first" proponents, but in many cases, the rest of the world (including the users we are writing programs for) is thinking "data first". [11]

4.1.2 Alternative Model Paradigms

Modelling data in terms of relations, tuples and attributes — or equivalently, tables, rows and columns — is but one conceptual approach. There are entirely different ways of considering, planning, and designing a data model. These include hierarchical trees, arbitrary graphs, structured objects, cube or star schema analytical approaches, tuple spaces, and even undifferentiated (emergent) storage. By moving into the realm of semi-structured non-relational data, we gain the possibility of accessing our data along these lines instead of simply in relational database terms.

Consider, for example, graph-oriented databases such as Neo4j. This paradigm attempts to map persistent storage capabilities directly onto the graph model of computation: sets of nodes connected by sets of edges. The database engine then innately provides many algorithmic services that one would expect on graph representations: establishing spanning trees, finding shortest paths, depth- and breadth-first search, etc.

Object databases are another paradigm that has, at various times, appeared poised to challenge the supremacy of the relational database. An example of a current contender in this space is Persevere (http://www.persvr.org/), which is an object store for JSON (JavaScript Object Notation) data. Advantages gained in this space include a consistent execution model between the storage engine and the client platform (JavaScript, in this case), and the ability to natively store objects without any translation layer.

Here again, the general principle is that by moving away from the strictly modeled structure of SQL, we untie the hands of developers to model data in terms they may be more familiar with, or that may be more conducive to solving the problem at hand. This is very attractive to many developers. [11]


4.1.3 Multi-Valued Properties

Even within the bounds of the more traditional relational approach, there are ways in which the semi-structured approach of non-relational databases can give us a helping hand in conceptual data design. One of these is by way of multi-valued properties — that is, attributes that can simultaneously take on more than one value.

A credo of relational database design is that for any given tuple in a relation, there is only one value for any given attribute; storing multiple values in the same attribute for the same tuple is considered very bad practice, and is not supported by standard SQL. Generally, cases where one might be tempted to store multiple values in the same attribute indicate that the design needs further normalization.

As an example, consider a User relation, with an attribute email. Since people typically have more than one email address, a simple (but wrong, at least for relational database design) decision might be to store the email addresses as a comma-delimited list within the "emails" attribute.

The problems with this are myriad - for example, simple membership tests like

SELECT * FROM User WHERE emails = 'homer@simpson.com';

will fail if there is more than one email address in the list, because that is no longer the value of the attribute; a more general test using wildcards such as

SELECT * FROM User WHERE emails LIKE '%homer@simpson.com%';

will succeed, but raises serious performance issues in that it defeats the use of indexes and causes the database engine to do (at best) linear-time text pattern searches against every value in the table. Worse, it may actually impact correctness if entries in the list can be proper substrings of each other (as in the list "car, cart, art").

The proper way to design for this situation, in a relational model, is to normalize the email addresses into their own table, with a foreign key relationship to the user table.

This is a design strategy that is frequently applied to many situations in standard relational database design, even recursively: if you sense a one-to-many relationship in an attribute, break it out into two relations with a foreign key.

The trouble with this pattern, however, is that it still does not elegantly serve all the possible use cases of such data, especially in situations with a low cardinality; either it is overkill, or it is a clumsy way to store data. In the above example, there is a very small set of use cases that we might typically handle with email addresses, including:

• Return the user, along with their one "primary" email address, for normal operations involving sending an email to the user.


• Return the user with a list of all their email addresses, for showing on a "profile" screen, for example.

• Find which user (if any) has a given email address.

The first situation requires an additional attribute along the lines of is_primary on the email table, not to mention logic to ensure that only one email tuple per user is marked as primary (which cannot be done natively in a relational database, because a UNIQUE constraint on the user_id and the is_primary field would only allow one primary and one non-primary email address per user_id). Alternately, a primary_email field can be kept on the User table, acting as a cache of which email address is the primary one; this too requires coordination by code to ensure that this field actually exists in the User_Email table, etc.

To use standard SQL to return a single tuple containing the user and all of their email addresses, comma delimited like our original ("wrong") design concept, is actually quite difficult under this two-table structure. Standard SQL has no way of rendering this output, which is surprising considering how common it is. The only mechanisms would be constructing intermediate temporary tables of the information, looping through records of the join relation and outputting one tuple per user_id with the concatenation of email addresses as an attribute.

Under key/value stores, we have an entirely different paradigm for this problem, and one which much more closely matches the real-world uses of such data. We can simply model the email attribute as a substructure: a list of emails within the attribute.

For example, Google App Engine has a "List" type that can store exactly this type of information as an attribute:

class User(db.Model):
    name = db.StringProperty()
    emails = db.StringListProperty()

The query system then has the ability not only to return the contained lists as structured data, but also to do membership queries, such as:

results = db.GqlQuery(
    "SELECT * FROM User WHERE emails = 'homer@simpson.com'"
)

Since order is preserved, the semantics of "primary" versus "additional" can be encoded into the order of items, so no additional attribute is needed for this purpose; we can always get the primary email by saying something like results.emails[0].

In effect, we have expressed our actual data requirements in a much more succinct and powerful way using this notation, without any noticeable loss in precision, abstraction, or expressive power. [11]


4.1.4 Generalized Analytics

If the nature of the analysis falls outside of SQL's standard set of operations, it can be extremely difficult to produce results within the operational silo of SQL queries. Worse, this has a pernicious effect on the mindset of data developers, sometimes called "SQL Myopia": if you can't do it in SQL, you can't do it.1

This is unfortunate, because there are many interesting and useful modes of interacting with data sets that are outside of this paradigm — consider matrix transformations, data mining, clustering, Bayesian filtering, probability analysis, etc.

Additionally, besides simply lacking Turing-completeness,2 SQL has a long list of faults that non-SQL developers regularly present. These include a verbose, non-customizable syntax; the inability to reduce nested constructions to recursive calls, or generally to work with graphs, trees, or nested structures; inconsistency in specific implementations between vendors, despite standardization; and so forth. It is no wonder that the moniker for the current non-relational database movement is converging on the tag "NOSQL": it is a limited, inelegant language. [11]

4.1.5 Version History

Part of the design of many (but not all) non-relational databases is the explicit inclusion of version history in the storage unit of data. For example, when you store the value 123 in an attribute, and later change it to the value 234, your data store actually now contains both values, each with a timestamp or vector clock version stamp. This approach has many benefits from an efficiency point of view: primary interaction with the database disks is always in write-forward mode, and multi-version concurrency control can be easily modeled with this structure.

From a modeling point of view, however, there are other distinct advantages to this format. One of them is the ability to intentionally keep, and interact with, older versions of data in a structured way. An example of this, which almost certainly uses the versioned characteristics of Google's Bigtable infrastructure, is Google Docs: any document can be instantly viewed in, or reverted to, its state at any point in its history — a granular, infinite "undo".
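A sketch of the idea, using a plain dictionary of attribute histories in place of a real store (the class and method names are invented for this illustration): every write appends a timestamped value, ordinary reads return the newest one, and older versions stay available for a point-in-time lookup.

    import time

    class VersionedStore:
        """Keeps every (timestamp, value) pair written to each attribute."""

        def __init__(self):
            self.history = {}               # key -> list of (timestamp, value)

        def put(self, key, value, timestamp=None):
            ts = timestamp if timestamp is not None else time.time()
            self.history.setdefault(key, []).append((ts, value))

        def get(self, key):
            # The latest version wins for ordinary reads.
            return self.history[key][-1][1]

        def get_as_of(self, key, timestamp):
            # Walk back through history for a point-in-time view (the "undo").
            candidates = [v for ts, v in self.history[key] if ts <= timestamp]
            return candidates[-1] if candidates else None

    store = VersionedStore()
    store.put("counter", 123, timestamp=1.0)
    store.put("counter", 234, timestamp=2.0)
    print(store.get("counter"))             # 234
    print(store.get_as_of("counter", 1.5))  # 123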

Implementing this kind of revision ability in typical relational database applications is prohibitive both from a programming complexity standpoint (this ability must be consciously designed into each entity that might need it) as well as from a performance standpoint.3

1 Note that this is not a fault of the relational model itself — only of SQL, which is ultimately just one possible declarative grammar for interacting with relational structures.

2 For the record, this lack of Turing-completeness is by design, so that all queries would be able to run in bounded time; never mind that every major commercial vendor has extended SQL with operations that do make it Turing complete, albeit still awkward.

We have two main options when keeping a history of information for a table. On the one hand, we can keep a full additional copy of every row whenever it changes. This can be done in place, by adding an additional component to the primary key which is a timestamp or version number.

This is problematic in that all application code that interacts with this entity needs to know about the versioning scheme; it also complicates the indexing of the entities, because relational database storage with a composite primary key including a date is significantly less optimized than for a single integer key.

Alternately, the entire-row history method can be done in a secondary table which only keeps historical records, much like a log.

This is less obtrusive on the application (which need not even be aware of its existence, especially if it is produced via a database-level procedure or trigger), and has the benefit that it can be populated asynchronously.

However, both of these cases require O(sn) storage, where s is the row size and n is the number of updates. For large row sizes, this approach can be prohibitive.

The other mechanism for doing this is to keep what amounts to an Entity/Attribute/Value table for the historical changes: a table where only the changed value is kept. This is easier to do in situations where the table design itself is already in the EAV paradigm, but can still be done dynamically (if not efficiently) by using the string name of the updated attribute.

For sparsely updated tables, this approach does save space over the entire-row versions, but it suffers from the drawback that any use of this data via interactive SQL queries is nearly impossible.

Overall, the non-relational database stores that support column-based version history have a huge advantage in any situation where the application might need this level of historical data snapshots. [11]

4.1.6 Predictable Scalability

These databases are simple and thus scale much better than today's relational databases. If you are putting together a system in-house and intend to throw dozens or hundreds of servers behind your data store to cope with what you expect will be a massive demand in scale, then consider a key/value store.

3 Consider how many products implemented on traditional relational databases you know of that offer any kind of undo functionality.


This definitively impacts the modelling concepts supported by the systems,because it elevates scalability concerns to a first class modeling directive — partof the logical and conceptual modeling process itself. Rather than designing anelegant relational model and only later considering how it might reasonably be“sharded” or replicated in such a way as to provide high availability in variousfailure scenarios (typically accompanied by great cost, in commercial relationaldatabase products), instead the bedrock of the logical design asks: how can weconceive of this data in such a way that it is scalable by its definition?

As an example, consider the mechanism for establishing the locality of transactions in Bigtable and its ilk (including the Google App Engine data store). Obviously, when involving multiple entities in a transaction on a distributed data store, it is desirable to restrict the number of nodes that actually must participate in the transaction. (While protocols do of course exist for distributed transactions, the performance of these protocols suffers immensely as the size of the machine cluster increases, because the risk of a node failure, and thus a timeout on the distributed transaction, increases.) It is therefore most beneficial to couple related entities tightly, and unrelated entities loosely, so that the most common entities to participate in a transaction would be those that are already tightly coupled. In a relational database, you might use foreign key relationships to indicate related entities, but the relationship carries no additional information that might indicate "these two things are likely to participate in transactions together".

By contrast, in Bigtable, this is enabled by allowing entities to indicate an "ancestor" relation chain, of any depth. That is, entity A can declare entity B its "parent", and henceforth, the data store organizes the physical representation of these entities on one (or a small number of) physical machines, so that they can easily participate in shared transactions. This is a natural design inclination, but one that is not easily expressed in the world of relational databases (you could certainly provide self-relationships on entities, but since SQL does not readily express recursive relationships, that is only beneficial in cases where the self-relationship is a key part of the data design itself, with business import).
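
A minimal sketch of the idea (a hypothetical key structure in plain Ruby, not the actual Bigtable or App Engine API): the ancestor chain becomes part of each entity's key, so entities that share a root ancestor can be stored, and transacted on, together:

# An entity's full key embeds its ancestor path, e.g. a comment under a post
# under a blog. Everything sharing the root ("Blog", "b1") forms one group
# that the data store can co-locate on a small set of machines.
post_key    = [["Blog", "b1"], ["Post", "p7"]]
comment_key = [["Blog", "b1"], ["Post", "p7"], ["Comment", "c3"]]

def same_entity_group?(a, b)
  a.first == b.first   # same root ancestor => eligible for one local transaction
end

puts same_entity_group?(post_key, comment_key)  # => true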

Many commercial relational database vendors make the claim that their solutions are highly scalable. This is true, but there are two caveats. First, of course, is cost: sharded, replicated instances of Oracle or DB2 are not a cheap commodity, and the cost scales with the load. Second, however, and less obvious, is the predictability factor. This is highly touted by systems such as Project Voldemort, which point out that with a simple data model, as in many non-relational databases, not only can you scale more easily, but you can scale more predictably: the requirements to support additional operations, in terms of CPU and memory, are known fairly exactly, so load planning can be an exact science. Compare this with SQL/relational database scaling, which is highly unpredictable due to the complex nature of the RDBMS engine.

There are, naturally, other criteria that are involved in the quest for performance and scalability, including topics like low-level data storage (B-tree-like storage formats, disk access patterns, solid-state storage, etc.); issues with the raw networking of systems and their communications overhead; data reliability, considered both for single-node and multi-node systems; etc.

Because key/value databases easily and dynamically scale, they are also the database of choice for vendors who provide a multi-user, web services platform data store. The database provides a relatively cheap data store platform with massive potential to scale. Users typically only pay for what they use, but their usage can increase as their needs increase. Meanwhile, the vendor can scale the platform dynamically based on the total user load, with little limitation on the entire platform's size. [2, 11]

4.1.7 Schema Evolution

In addition to the static existence of a database schema, it is also important to consider what happens over time as an application's needs or requirements change. Non-relational databases have a distinct advantage in this realm, because they offer more options for how the version update should proceed.

To be sure, relational databases have mechanisms for handling ongoing updates to data schema; indeed, one of the strengths of the relational model is that the schema is data: databases keep system tables which define schema metadata, which are handled by the exact same database primitives as user-space tables. This generality has advantages in terms of manageability, but it also provides a clean abstraction that vendors can use to provide valuable schema update facilities. Indeed, commercial RDBMS products have applied a great deal of engineering resources to the problem, and have developed sophisticated mechanisms that allow production databases to ALTER their schema without downtime in most scenarios (see footnote 4). However, there are two issues with the relational database approach to this.

Footnote 4: Non-commercial databases such as MySQL also have mechanisms such as this, but in general their methods are much less sophisticated, often requiring downtime to do even simple operations such as rebuilding indices.

First, relational database schemas exist in only one state at any given time. This means that if the specific form of an attribute changes, it must change immediately for all records, even in cases where the new form of the attribute would rightfully require processing that the database cannot do (for example, application-specific business logic). It also implies that if there is a high-volume update, such as one that might need to write many gigabytes of changed data back to disk, the RDBMS is obligated to do this operation atomically and in real time (because DDL updates are transactional); regardless of how efficiently implemented it is, this type of operation cannot be made seamless in a highly transactional production environment.

Second, the release of relational database schema changes typically requires precise coordination with application-layer code; the code version must exactly match the data version. In any highly available application, there is a high likelihood that this implies downtime (see footnote 5), or at least advanced operational coordination that takes a great deal of precision and energy.

Non-relational databases, by comparison, can use a very different approach for schema versioning. Because the schema (in many cases) is not enforced at the data engine level, it is up to the application to enforce (and migrate) the schema. Therefore, a schema change can be gradually introduced by code that understands how to interact with both the N − 1 version and the N version, and leaves each entity updated as it is touched. "Gardener" processes can then periodically sweep through the data store, updating nodes as a lower-priority process.
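
A sketch of this lazy-migration pattern (the entity format and helper names here are invented for illustration): reader code understands both version N − 1 and version N, upgrades records as it touches them, and a background "gardener" sweeps up the rest:

CURRENT_VERSION = 2

# Version 1 stored a single "name" field; version 2 splits it in two.
def upgrade(entity)
  return entity if entity["version"] == CURRENT_VERSION
  e = entity.dup
  first, last = e.delete("name").to_s.split(" ", 2)
  e.merge("first_name" => first, "last_name" => last,
          "version" => CURRENT_VERSION)
end

def read_user(store, key)
  entity   = store[key]
  upgraded = upgrade(entity)
  store[key] = upgraded if upgraded != entity   # write back as it is touched
  upgraded
end

# Gardener: low-priority sweep that upgrades anything not yet touched.
def garden(store)
  store.each_key { |k| read_user(store, k) }
end

store = { "user:1" => { "version" => 1, "name" => "Ada Lovelace" } }
p read_user(store, "user:1")
# => {"version"=>2, "first_name"=>"Ada", "last_name"=>"Lovelace"}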

Naturally, this approach produces more complex code in the short term, especially if the schema of the data is relied upon by analytical (map/reduce) jobs. But in many cases, the knowledge that no downtime will be required during a schema evolution is worth the additional complexity. In fact, this approach might be seen to encourage a more agile development methodology, because each change to the internal schema of the application's data is bundled with the update to the codebase, and can be collectively versioned and managed accordingly.

4.1.8 More Natural Fit with Code

Relational data models and Application Code Object Models are typically built differently, which leads to incompatibilities. Developers overcome these incompatibilities with code that maps relational models to their object models, a process commonly referred to as object-to-relational mapping. This process, which essentially amounts to "plumbing" code and has no clear and immediate value, can take up a significant chunk of the time and effort that goes into developing the application. On the other hand, many key/value databases retain data in a structure that maps more directly to the object classes used in the underlying application code, which can significantly reduce development time. [2]

Footnote 5: The exception to this is that, thanks to the relational model's implicit lack of attribute order, there are situations in which new attributes can be added and it is guaranteed that no application code would even know of the existence of the new attributes, let alone be affected by them. This is a case where the relational model has the upper hand; however, because it is not a comprehensive solution for every situation, the end result is that, for safety, most relational database schema updates are treated as downtime events.
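
As a hedged sketch of this contrast (hypothetical classes and an in-memory store standing in for a real document database): a document store can accept something very close to the application object's own structure, where a relational mapping would first have to flatten it into rows across several tables:

# Application object as the code sees it.
Order = Struct.new(:id, :customer, :items)

order = Order.new("o-1001", { "name" => "Bob" },
                  [{ "sku" => "A1", "qty" => 2 }, { "sku" => "B9", "qty" => 1 }])

# Key/value or document store: serialize the object as-is under its key.
require "json"
document_store = {}
document_store[order.id] = JSON.generate(
  "customer" => order.customer, "items" => order.items)

# A relational mapping would instead need "plumbing": an orders row, a
# customers row, and one order_items row per item, plus joins to read it back.
p JSON.parse(document_store["o-1001"])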

4.2 Demerits

The inherent constraints of a relational database ensure that data at the lowest level have integrity. Data that violate integrity constraints cannot physically be entered into the database. These constraints don't exist in a key/value database, so the responsibility for ensuring data integrity falls entirely to the application. But application code often carries bugs. Bugs in a properly designed relational database usually don't lead to data integrity issues; bugs in a key/value database, however, quite easily lead to data integrity issues.

One of the other key benefits of a relational database is that it forces you to go through a data modeling process. If done well, this modeling process creates in the database a logical structure that reflects the data it is to contain, rather than reflecting the structure of the application. Data, then, become somewhat application-independent, which means other applications can use the same data set and application logic can be changed without disrupting the underlying data model. To facilitate this process with a key/value database, try replacing the relational data modeling exercise with a class modeling exercise, which creates generic classes based on the natural structure of the data.

And don’t forget about compatibility. Unlike relational databases, cloud-oriented databases have little in the way of shared standards. While they allshare similar concepts, they each have their own API, specific query interfaces,and peculiarities. So, you will need to really trust your vendor, because you won’tsimply be able to switch down the line if you’re not happy with the service. Andbecause almost all current key/value databases are still in beta, that trust is farriskier than with old-school relational databases. [2]

4.2.1 Limitations on Analytics

In the cloud, key/value databases are usually multi-tenanted, which means that a lot of users and applications will use the same system. To prevent any one process from overloading the shared environment, most cloud data stores strictly limit the total impact that any single query can cause. For example, with SimpleDB, you can't run a query that takes longer than 5 seconds. With Google's AppEngine Datastore, you can't retrieve more than 1000 items for any given query.

These limitations aren’t a problem for your bread-and-butter application logic(adding, updating, deleting, and retrieving small numbers of items). But whathappens when your application becomes successful? You have attracted many

34

Page 37: Seminar Nosql

users and gained lots of data, and now you want to create new value for yourusers or perhaps use the data to generate new revenue. You may find yourselfseverely limited in running even straightforward analysis-style queries. Things liketracking usage patterns and providing recommendations based on user historiesmay be difficult at best, and impossible at worst, with this type of databaseplatform.

In this case, you will likely have to implement a separate analytical database, populated from your key/value database, on which such analytics can be executed. Think in advance of where and how you would be able to do that. Would you host it in the cloud or invest in on-site infrastructure? Would latency between you and the cloud-service provider pose a problem? Does your current cloud-based key/value database support this? If you have 100 million items in your key/value database, but can only pull out 1000 items at a time, how long would queries take?
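
To make the last question concrete, here is a rough sketch (the fetch_batch and analytical_db names are placeholders, not a real API): scanning 100 million items 1000 at a time means on the order of 100,000 round trips, so at an assumed 50 ms per call the export alone takes over an hour before any analysis starts:

TOTAL_ITEMS     = 100_000_000
BATCH_SIZE      = 1_000          # e.g. an AppEngine-style per-query cap
LATENCY_SECONDS = 0.05           # assumed 50 ms per round trip

batches = (TOTAL_ITEMS / BATCH_SIZE.to_f).ceil
puts "#{batches} round trips, ~#{(batches * LATENCY_SECONDS / 3600.0).round(1)} hours"
# => 100000 round trips, ~1.4 hours

# The export loop itself would look something like this:
# cursor = nil
# loop do
#   items, cursor = fetch_batch(limit: BATCH_SIZE, cursor: cursor)  # hypothetical
#   analytical_db.bulk_insert(items)                                # hypothetical
#   break if cursor.nil?
# end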

Ultimately, while scale is a consideration, don't put it ahead of your ability to turn data into an asset of its own. All the scaling in the world is useless if your users have moved on to your competitor because it has cooler, more personalized features. [2]


Chapter 5

Conclusion

5.1 Data inconsistency

Data inconsistency in large-scale reliable distributed systems has to be tolerated for two reasons: improving read and write performance under highly concurrent conditions; and handling partition cases where a majority model would render part of the system unavailable even though the nodes are up and running.

Whether or not inconsistencies are acceptable depends on the client application. In all cases the developer needs to be aware of the consistency guarantees provided by the storage system and take them into account when developing applications. There are a number of practical improvements to the eventual consistency model, such as session-level consistency and monotonic reads, which provide better tools for the developer. Many times the application is capable of handling the eventual consistency guarantees of the storage system without any problem. A specific popular case is a Web site in which we can have the notion of user-perceived consistency. In this scenario the inconsistency window needs to be smaller than the time expected for the customer to return for the next page load. This allows for updates to propagate through the system before the next read is expected. [4]

BigTable inspired many developers to write their own implementations of this data model; amongst the most popular are HBase, Hypertable and Cassandra. The lack of a predefined schema can make these databases attractive in applications where the attributes of objects are not known in advance, or change frequently.

Document databases have a related data model (although the way they handle concurrency and distributed servers can be quite different): a BigTable row with its arbitrary number of columns/attributes corresponds to a document in a document database, which is typically a tree of objects containing attribute values and lists, often with a mapping to JSON or XML. Open source document databases include CouchDB, MongoDB, ThruDB and Jackrabbit.

How is this different from just dumping JSON strings into MySQL? Document databases can actually work with the structure of the documents, for example extracting, indexing, aggregating and filtering based on attribute values within the documents. Alternatively you could of course build the attribute indexing yourself, but I wouldn't recommend that unless it makes working with your legacy code easier.

The big limitation of BigTables and document databases is that most implementations cannot perform joins or transactions spanning several rows or documents. This restriction is deliberate, because it allows the database to do automatic partitioning, which can be important for scaling (see Section 3.4 on distributed key-value stores). If the structure of your data is lots of independent documents, this is not a problem; but if your data fits nicely into a relational model and you need joins, please don't try to force it into a document model.

5.2 Making a Decision

Ultimately, there are four reasons why you would choose a non-relational key/value database platform for your application:

1. Your data is heavily document-oriented, making it a more natural fit with the key/value data model than the relational data model.

2. Your development environment is heavily object-oriented, and a key/value database could minimize the need for “plumbing” code.

3. The data store is cheap and integrates easily with your vendor's web services platform.

4. Your foremost concern is on-demand, high-end scalability — that is, large-scale, distributed scalability, the kind that can't be achieved simply by scaling up.

But in making your decision, remember the database's limitations and the risks you face by branching off the relational path.

For all other requirements, you are probably best off with the good old RDBMS. So, is the relational database doomed? Clearly not. Well, not yet at least. [2]


Appendix A

Different popular NoSQL databases

A.1 The Shortlist

Table A.1 provides a list of projects that could potentially replace a group of relational database shards. Some of these are much more than key-value stores, and aren't suitable for low-latency data serving, but are interesting nonetheless. [3]

A.2 Cloud-Service Contenders

A number of web service vendors now offer multi-tenanted key/value databases on a pay-as-you-go basis. Most of them meet the criteria discussed to this point, but each has unique features and varies from the general standards described thus far. Let's take a look now at particular databases, namely SimpleDB, Google AppEngine Datastore, and SQL Data Services. [2]

A.2.1 Amazon: SimpleDB

SimpleDB is an attribute-oriented key/value database available on the Amazon Web Services platform. SimpleDB is still in public beta; in the meantime, users can sign up online for a “free” version – free, that is, until you exceed your usage limits.


Name | Language | Fault-tolerance | Persistence | Client Protocol | Data model | Docs | Community
Project Voldemort | Java | partitioned, replicated, read-repair | Pluggable: BerkeleyDB, MySQL | Java API | Structured / blob / text | A | LinkedIn, no
Ringo | Erlang | partitioned, replicated, immutable | Custom on-disk (append only log) | HTTP | blob | B | Nokia, no
Scalaris | Erlang | partitioned, replicated, paxos | In-memory only | Erlang, Java, HTTP | blob | B | OnScale, no
Kai | Erlang | partitioned, replicated? | On-disk Dets file | Memcached | blob | C | no
Dynomite | Erlang | partitioned, replicated | Pluggable: couch, dets | Custom ASCII, Thrift | blob | D+ | Powerset, no
MemcacheDB | C | replication | BerkeleyDB | Memcached | blob | B | some
ThruDB | C++ | replication | Pluggable: BerkeleyDB, Custom, MySQL, S3 | Thrift | Document oriented | C+ | Third rail, unsure
CouchDB | Erlang | replication, partitioning? | Custom on-disk | HTTP, JSON | Document oriented (JSON) | A | Apache, yes
Cassandra | Java | replication, partitioning | Custom on-disk | Thrift | BigTable meets Dynamo | A | Facebook, no
HBase | Java | replication, partitioning | Custom on-disk | Custom API, Thrift, Rest | BigTable | F | Apache, yes
Hypertable | C++ | replication, partitioning | Custom on-disk | Thrift, other | BigTable | A | Zvents, Baidu, yes

Table A.1: Some NoSQL initiatives

SimpleDB has several limitations. First, a query can only execute for a maximum of 5 seconds. Secondly, there are no data types apart from strings. Everything is stored, retrieved, and compared as a string, so date comparisons won't work unless you convert all dates to ISO 8601 format. Thirdly, the maximum size of any string is limited to 1024 bytes, which limits how much text (i.e. product descriptions, etc.) you can store in a single attribute. But because the schema is dynamic and flexible, you can get around the limit by adding “ProductDescription1”, “ProductDescription2”, etc. The catch is that an item is limited to 256 attributes. While SimpleDB is in beta, domains can't be larger than 10GB, and entire databases cannot exceed 1TB.
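
A small sketch of the two workarounds mentioned above, in plain Ruby and independent of any particular SimpleDB client library: storing dates as ISO 8601 strings so that string comparison matches chronological order, and splitting long text across numbered attributes to stay under the 1024-byte limit:

require "time"

# Dates as ISO 8601 strings: string comparison == chronological comparison.
created_at = Time.utc(2010, 9, 29).iso8601        # => "2010-09-29T00:00:00Z"
older      = Time.utc(2009, 1, 15).iso8601
puts older < created_at                            # => true

# Long text split across numbered attributes of <= 1024 bytes each.
def to_attributes(name, text, limit = 1024)
  chunks = text.scan(/.{1,#{limit}}/m)
  chunks.each_with_index.map { |c, i| ["#{name}#{i + 1}", c] }.to_h
end

description = "x" * 2500
p to_attributes("ProductDescription", description).keys
# => ["ProductDescription1", "ProductDescription2", "ProductDescription3"]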

One key feature of SimpleDB is that it uses an eventual consistency model. This consistency model is good for concurrency, but means that after you have changed an attribute for an item, those changes may not be reflected in read operations that immediately follow. While the chances of this actually happening are low, you should account for such situations. For example, you don't want to sell the last concert ticket in your event booking system to five people because your data wasn't consistent at the time of sale. [2]

A.2.2 Google AppEngine Data Store

Google’s AppEngine Datastore is built on BigTable, Google’s internal storagesystem for handling structured data. In and of itself, the AppEngine Datastore isnot a direct access mechanism to BigTable, but can be thought of as a simplifiedinterface on top of BigTable.

The AppEngine Datastore supports much richer data types within items than SimpleDB, including list types, which contain collections within a single item.

You will almost certainly use this data store if you plan on building applications within the Google AppEngine. However, unlike with SimpleDB, you cannot currently interface with the AppEngine Datastore (or with BigTable) using an application outside of Google's web service platform. [2]

A.2.3 Microsoft: SQL Data Services

SQL Data Services is part of the Microsoft Azure Web Services platform. The SDS service is also in beta and so is free but has limits on the size of databases. SQL Data Services is actually an application itself that sits on top of many SQL servers, which make up the underlying data storage for the SDS platform. While the underlying data stores may be relational, you don't have access to these; SDS is a key/value store, like the other platforms discussed thus far.

Microsoft seems to be alone among these three vendors in acknowledging that while key/value stores are great for scalability, they come at the great expense of data management, when compared to RDBMS. Microsoft's approach seems to be to strip to the bare bones to get the scaling and distribution mechanisms right, and then over time build up, adding features that help bridge the gap between the key/value store and relational database platform. [2]


A.3 Non-Cloud Service Contenders

Outside the cloud, a number of key/value database software products exist that can be installed in-house. Almost all of these products are still young, either in alpha or beta, but most are also open source; having access to the code, you can perhaps be more aware of potential issues and limitations than you would with closed-source vendors. [2]

A.3.1 Tokyo Cabinet

Developed and sponsored by Mixi Inc., Tokyo Cabinet is an incredibly fast and feature-rich database library. [5]

Tokyo Cabinet Highlights

Speed and efficiency are two consistent themes for Tokyo Cabinet. Benchmarks show that it only takes 0.7 seconds to store 1 million records in the regular hash table and 1.6 seconds for the B-Tree engine. To achieve this, the overhead per record is kept as low as possible, ranging between 5 and 20 bytes: 5 bytes for B-Tree, 16–20 bytes for the Hash-table engine. And if small overhead is not enough, Tokyo Cabinet also has native support for Lempel-Ziv or BWT compression algorithms, which can reduce your database to 25% of its size (typical text compression rate). Also, it is thread safe (uses pthreads) and offers row-level locking. [5]

Features

• Similar use cases as for BerkeleyDB.

• Disk persistence. Can store data larger than RAM.

• Performs well.

• Actively developed. Lots of developers adding new features (but not bug fixes).

• Similar replication strategy to MySQL. Not useful for scalability as it limits the write throughput to one node.

• Optional compressed pages, so it has some compression advantages. [7]

42

Page 45: Seminar Nosql

Hash and B-Tree Database Engines

The Hash database engine is a direct competitor to BerkeleyDB and other key-value stores: one key, one value, no duplicates, and very fast.

Functionally, the B-Tree database engine is equivalent to the Hash database. However, because of its underlying structure, the keys can be ordered via a user-specified function, which in turn allows us to do prefix and range matching on a key, as well as traverse the entries in order. Let's look at some examples:

require "rubygems"
require "tokyocabinet"

include TokyoCabinet

bdb = BDB::new  # B-Tree database; keys may have multiple values
bdb.open("casket.bdb", BDB::OWRITER | BDB::OCREAT)

# store records in the database, allowing duplicates
bdb.putdup("key1", "value1")
bdb.putdup("key1", "value2")
bdb.put("key2", "value3")
bdb.put("key3", "value4")

# retrieve all values
p bdb.getlist("key1")
# => ["value1", "value2"]

# range query, find all matching keys
p bdb.range("key1", true, "key3", true)
# => ["key1", "key2", "key3"]

43

Page 46: Seminar Nosql

Fixed-length and Table Database Engines

Next, we have the ‘fixed length’ engine, which is best understood as a simple array. There is absolutely no hashing and access is done via natural number keys, which also means no key overhead. This method is extremely fast.

Saving best for last, we have the Table engine, which mimics a relational database, except that it requires no predefined schema (in this, it is a close cousin to CouchDB, which allows arbitrary properties on any object). Each record still has a primary key, but we are allowed to declare arbitrary indexes on our columns, and even perform queries on them:

require "rubygems"
require "rufus/tokyo/cabinet/table"

t = Rufus::Tokyo::Table.new('table.tdb', :create, :write)

# populate table with arbitrary data (no schema!)
t['pk0'] = { 'name' => 'alfred', 'age' => '22', 'sex' => 'male' }
t['pk1'] = { 'name' => 'bob', 'age' => '18' }
t['pk2'] = { 'name' => 'charly', 'age' => '45', 'nickname' => 'charlie' }
t['pk3'] = { 'name' => 'doug', 'age' => '77' }
t['pk4'] = { 'name' => 'ephrem', 'age' => '32' }

# query table for age >= 32
p t.query { |q|
  q.add_condition 'age', :numge, '32'
  q.order_by 'age'
}
# => [ {"name"=>"ephrem", :pk=>"pk4", "age"=>"32"},
#      {"name"=>"charly", :pk=>"pk2", "nickname"=>"charlie", "age"=>"45"},
#      {"name"=>"doug", :pk=>"pk3", "age"=>"77"} ]

t.close


A.3.2 CouchDB

CouchDB is a free, open-source, distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Derived from the key/value store, it uses JSON to define an item's schema. Data is stored in ‘documents’, which are essentially key/value maps themselves. CouchDB is meant to bridge the gap between document-oriented and relational databases by allowing “views” to be dynamically created using JavaScript. These views map the document data onto a table-like structure that can be indexed and queried. It can also do full text indexing of the documents.

At the moment, CouchDB isn't really a distributed database. It has replication functions that allow data to be synchronized across servers, but this isn't the kind of distribution needed to build highly scalable environments. The CouchDB community, though, is no doubt working on this. [3, 2]

A.3.3 Project Voldemort

Project Voldemort is a distributed key/value database that is intended to scale horizontally across a large number of servers. It spawned from work done at LinkedIn and is reportedly used there for a few systems that have very high scalability requirements. Project Voldemort also uses an eventual consistency model, based on Amazon's.

Project Voldemort handles replication and partitioning of data, and appears to be well written and designed; it is implemented in Java. [3, 2, 10]

A.3.4 Mongo

Mongo is the database system being developed at 10gen by Geir Magnusson and Dwight Merriman. Like CouchDB, Mongo is a document-oriented JSON database, except that it is designed to be a true object database, rather than a pure key/value store. Originally, 10gen focused on putting together a complete web services stack; more recently, though, it has refocused mainly on the Mongo database. [2]


Features

• Written in C++.

• Significantly faster than CouchDB.

• JSON and BSON (binary JSON-ish) formats.

• Asynchronous replication with auto-sharding.

• Supports indexes. Querying a property is quick because an index is automatically kept on updates. Trades off some write speed for more consistent read speed.

• Documents can be nested, unlike in CouchDB, which requires the application to keep track of relationships. The advantage is that the whole object does not have to be written and read in one piece, because the system knows about the relationship. An example is a blog post and its comments: in CouchDB the post and comments are stored together, and a view has to walk through all the comments even when you are only interested in the post itself. The result is better write and query performance (see the sketch after this list).

• More advanced queries than CouchDB. [7]
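
As a brief, hedged illustration of the nested-document point above (this uses the mongo Ruby driver's current API; method names such as insert_one and create_one may differ between driver versions, and the connection details are assumptions): a blog post with its comments embedded in a single document, plus a secondary index on an attribute:

require "mongo"

client = Mongo::Client.new(["127.0.0.1:27017"], database: "blog")
posts  = client[:posts]

# One document holds the post and its nested comments.
posts.insert_one(
  title:    "Why NoSQL?",
  author:   "fayaz",
  comments: [
    { user: "alice", text: "Nice overview" },
    { user: "bob",   text: "What about joins?" }
  ]
)

# Secondary index on author; kept up to date automatically on writes.
posts.indexes.create_one(author: 1)

# Query by an indexed property.
posts.find(author: "fayaz").each { |doc| puts doc[:title] }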

A.3.5 Drizzle

Drizzle can be thought of as a counter-approach to the problems that key/value stores are meant to solve. Drizzle began life as a spin-off of the MySQL (6.0) relational database. Over the last few months, its developers have removed a host of non-core features (including views, triggers, prepared statements, stored procedures, query cache, ACL, and a number of data types), with the aim of creating a leaner, simpler, faster database system. Drizzle can still store relational data; as Brian Aker of MySQL/Sun puts it, “There is no reason to throw out the baby with the bath water.” The aim is to build a semi-relational database platform tailored to web- and cloud-based apps running on systems with 16 cores or more. [2]


A.3.6 Cassandra

The source code for Cassandra was released recently by Facebook. They use it for inbox search. It's BigTable-esque, but uses a DHT, so it doesn't need a central server.

Originally developed at Facebook, it was written by some of the key engineers behind Amazon's famous Dynamo database.

Cassandra can be thought of as a huge 4-or-5-level associative array, where each dimension of the array gets a free index based on the keys in that level. The real power comes from that optional 5th level in the associative array, which can turn a simple key-value architecture into an architecture where you can now deal with sorted lists, based on an index of your own specification. That 5th level is called a SuperColumn, and it's one of the reasons that Cassandra stands out from the crowd.
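
A rough way to picture that layout, using plain Ruby hashes as an analogy rather than Cassandra's actual API (the keyspace, column family and supercolumn names here are invented):

# keyspace -> column family -> row key -> supercolumn -> column -> value
inbox_search = {
  "UserIndex" => {                       # column family
    "user:42" => {                       # row key
      "terms" => {                       # supercolumn: a sorted map of its own
        "cassandra" => ["msg17", "msg03"],
        "dynamo"    => ["msg21"]
      },
      "recent" => {
        "2010-09-28" => ["msg21", "msg17"]
      }
    }
  }
}

# Each level is indexed by its key, and columns inside a supercolumn are kept
# sorted by column name, which is what gives the cheap sorted-list behaviour.
p inbox_search["UserIndex"]["user:42"]["terms"].keys.sort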

Cassandra has no single point of failure, and can scale from one machine to several thousands of machines clustered in different data centers. It has no central master, so any data can be written to any of the nodes in the cluster, and can likewise be read from any other node in the cluster.

It provides knobs that can be tweaked to slide the scale between consistency and availability, depending on a particular application and problem domain. And it provides a high-availability guarantee: if one node goes down, another node will step in to replace it smoothly. [3, 8]
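
As a hedged illustration of what such a knob means in Dynamo-style terms (the exact tunables depend on the Cassandra version): if each key is replicated to N nodes, a write waits for acknowledgements from W replicas, and a read consults R replicas, then choosing R + W > N guarantees that every read overlaps with the latest acknowledged write on at least one replica. For example, N = 3 with W = 2 and R = 2 gives 2 + 2 > 3, so reads see the most recent write, while W = 1 and R = 1 trade that guarantee away for lower latency and higher availability.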

Pros:

• Open source.

• Incrementally scalable — as data grows, one can add more nodes to the storage mesh.

• Minimal administration — because it's incremental, we don't have to do a lot of up-front planning for migration. [7]

Cons:

• Not polished yet. It was built for inbox searching, so it may not work well for other use cases.

• No compression yet. [7]

A.3.7 BigTable

• Google BigTable — manages data across many nodes.


• Paxos (Chubby) — a distributed consensus protocol, used by the Chubby service that manages locks across systems.

• BigTable Characteristics:

– Stores data in tablets using GFS, a distributed file system.

– Compression — great gains in throughput; you can store more and reduce the IO bottleneck, because storing less means talking to the disks less, so performance improves.

– Single master — one node knows everything about all the other nodes (backed up and cached).

– Hybrid between row and column database:

∗ Row database — store objects together.

∗ Column database — store attributes of objects together. Makes sequential retrieval very fast, allows very efficient compression, and reduces disk seeks and random IO.

– Versioning

– Bloom filters — a probabilistic calculation on the data that cheaply tells whether a given piece of data can be found on a node (or in a tablet), so lookups can skip nodes and disk reads that definitely do not hold it.

– Eventually consistent — an append-only system using a row timestamp. When a client queries, it gets several versions and is in charge of picking the most recent.

• Pros:

– Compression is available.

– Clients are simple.

– Integrates with map-reduce.

• Cons:

– Proprietary to Google — Unavailable for our own use. [7]

A.3.8 Dynamo

• Amazon’s Dynamo — A giant distributed hash table.

• Uses consistent hashing to distribute data to one or more nodes for redundancy and performance (see the sketch after this list).


– Consistent hashing — a ring of nodes; a hash function picks which node(s) store a given piece of data.

– Consistency between nodes is based on vector clocks and read repair.

– Vector clocks — a timestamp on every row for every node that has written to it.

– Read repair — when a client does a read and the nodes disagree on the data, it's up to the client to select the correct data and tell the nodes the new correct state.

• Pros:

– No Master — eliminates single point of failure.

– Highly Available for Write — this is the partition-tolerance aspect of CAP. You can write to many nodes at once, so depending on the number of replicas maintained (which is configurable) you should always be able to write somewhere. So users will never see a write failure.

– Relatively simple, which is why we see so many clones.

• Cons:

– Proprietary.

– Clients have to be smart to handle read repair, rebalancing a cluster, hashing, etc. Client proxies can handle these responsibilities, but that adds another hop.

– No compression, which means IO is not reduced.

– Not suitable for column-like workloads; it's just a key-value store, so it's not optimized for analytics. Aggregate queries, for example, aren't in its wheelhouse. [7]
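
To make the consistent-hashing bullet above concrete, here is a minimal sketch: a toy ring with no virtual nodes or replication, not Dynamo's actual implementation:

require "digest/md5"

# Place nodes on a ring of hash values; a key is stored on the first node
# whose position is >= the key's hash (wrapping around to the start).
class HashRing
  def initialize(nodes)
    @ring = nodes.map { |n| [hash_of(n), n] }.sort_by(&:first)
  end

  def node_for(key)
    h = hash_of(key)
    entry = @ring.find { |pos, _| pos >= h } || @ring.first
    entry.last
  end

  private

  def hash_of(value)
    Digest::MD5.hexdigest(value.to_s).to_i(16)
  end
end

ring = HashRing.new(%w[node-a node-b node-c])
p ring.node_for("user:42")
# Adding or removing one node only remaps the keys adjacent to it on the
# ring, instead of reshuffling everything as a simple modulo scheme would.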


List of Tables

1.1 Fundamental differences between relational databases and key/value stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.1 Data access for relational databases and key/value stores . . . . . . . . 12
2.2 Data access for relational databases and key/value stores . . . . . . . . 12

A.1 Some NoSQL initiatives . . . . . . . . . . . . . . . . . . . . . . 40


Bibliography

[1] NoSQL Databases. http://nosql-database.org/

[2] “Is the Relational Database Doomed?”, Tony Bain. http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php

[3] “Anti-RDBMS: A list of distributed key-value stores”, Richard Jones. http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/

[4] “Eventually Consistent”, Werner Vogels (Amazon). http://www.allthingsdistributed.com/2008/12/eventually_consistent.html

[5] “Tokyo Cabinet: Beyond Key-Value Store”. http://www.igvita.com/2009/02/13/tokyo-cabinet-beyond-key-value-store/

[6] “Are cloud based memory architectures the next big thing?”, High Scalability Blog. http://highscalability.com/blog/2009/3/16/are-cloud-based-memory-architectures-the-next-big-thing.html

[7] “Drop ACID and think about Data”, High Scalability Blog. http://highscalability.com/drop-acid-and-think-about-data

[8] “Thoughts on NOSQL”, Eric Florenzano. http://www.eflorenzano.com/blog/post/my-thoughts-nosql/

[9] “Should you go beyond relational databases?”, Martin Kleppmann. http://thinkvitamin.com/dev/should-you-go-beyond-relational-databases/

[10] “Notes from the NoSQL Meetup”, Toby Negrin. http://developer.yahoo.net/blog/archives/2009/06/nosql_meetup.html

[11] “The mixed blessing of Non-Relational Databases”, Ian Thomas Varley. http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf
