

© 2011 Microsoft Corporation. All rights reserved.

NoSQL and the Windows Azure platform Investigation of an Unlikely Combination

Author

Andrew J. Brust, Blue Badge Insights, Inc.

Published

April 25, 2011

Applies to

Windows Azure, SQL Azure and NoSQL

Abstract

An introduction to NoSQL database technology, and its major subcategories, for those new to the subject;

an examination of NoSQL technologies available in the cloud using Windows Azure and SQL Azure; and a

critical discussion of the NoSQL and relational database approaches, including the suitability of each to

line-of-business application development.

Disclaimer

The research and opinions contained herein are the author’s own, and represent his perspective on the

topics discussed. While the author received support and assistance from Microsoft in the creation of this

paper, its thesis and conclusions do not constitute Microsoft’s official position, implied or explicit, on

NoSQL technology.


Contents

Introduction
What is NoSQL?
    Key-Value Stores
    Document Stores
    Wide Column Stores
    Graph Databases
        From Relational to Relationships
        Graphs and ORM
NoSQL Database Common Traits
    Shared Legacy: MapReduce, Hadoop, BigTable and HBase
    NoSQL Database Consistency
    Logical Models, Physical Models, and the Ubiquity of Key-Value Pairs
    NoSQL Indexing
NoSQL options on the Windows Azure Platform
    Azure Table Storage
    SQL Azure XML Columns
    SQL Azure Federation
    OData
        What the Support Means
    Running NoSQL Database Products using Azure Worker Roles, VM Roles and Azure Drive
    On-Premise Technologies
        SQL Server 2008/2008R2 “Beyond Relational” Features
        SQL Server Parallel Data Warehouse Edition
        Microsoft Research Dryad
NoSQL Upsides, Downsides
    Upsides
        Lightweight, low-friction
        Minimalist tool requirements
        Sharding & Replication
        Web Developer-Friendliness
        Cross-Platform, Cross-Device Operation
    Downsides
        Optimizations Have a Price
        Requirement to Query using a Procedural Language
        Necessity to Scale Manually
        Primitive Tooling
        Lack of ACID Transactional Capabilities in Some Products
Conclusion: Relational’s Continued Indispensability in Line-of-Business


Introduction

Just at the time when the database market seemed to many to be almost completely mature, a group of

non-relational data stores, collectively categorized as “NoSQL” databases, have attracted significant

attention. These databases are often employed in public, massively scaled Web site scenarios, where

traditional database features matter less, and fast fetching of relatively simple data sets matters most.

Many of these databases employ parallelized query mechanisms and horizontal partitioning, and allow storage

of heterogeneous, loosely-schematized data records.

With so much developer mindshare being focused on the Web these days, and with the constant thirst for

performance amongst technologists, especially for large Web applications, it’s no wonder that NoSQL

databases are seen favorably and used by an enthusiastic population of developers. As Cloud computing

grows, and given the proclivity of developers to conflate Web computing and scale with Cloud computing

and elasticity, interest in NoSQL databases amongst cloud developers is equally unsurprising. Together,

these streams of interest and visibility are significant; understandably, then, even users of traditional,

relational databases are exploring the question of whether NoSQL technology is something they should

use, too.

There’s no free lunch though. Although NoSQL databases do facilitate the performance and availability

that public Web properties sometimes require, the cost can be great. Things that users of a Relational

Database Management System (RDBMS) would take for granted, including some or all of: transactional,

atomic writes; indexing of non-key columns; query optimizers; and declarative, set-oriented query, are

sacrificed in the NoSQL world. In certain scenarios, that sacrifice is justified and acceptable. But in many

others, including line-of-business applications, that sacrifice is much less reasonable.

As with anything in the software world, when technologies enter the realm of phenomena, the prudent

thing to do is deconstruct and demystify them, understand and enumerate their various capabilities, then

judge if those capabilities merit the enthusiasm and justify a disruption. Specifically, in the realm of cloud

computing with the Microsoft stack, i.e. Windows Azure and SQL Azure, important questions arise with

respect to NoSQL, and need to be answered.

What exactly is NoSQL, and what characterizes its various subcategories? Are individual facets of NoSQL

database architectures available to Azure developers? Are they sufficient or will only a full-blown NoSQL

technology fulfill most requirements? Where in the Azure stack do these NoSQL technologies sit? For the

types of applications that .NET and SQL Server practitioners build, is NoSQL better than relational? Is it

even as good? These questions must be explored and answered before the larger question of NoSQL’s

(or relational’s) overall efficacy can be judged.

In this paper, we will define NoSQL, explore some of its history, review the various types of NoSQL

databases, and understand their respective features. We will determine the commonalities between the

various NoSQL subcategories and try to determine what basket of features seems to attract developers the

most. We’ll examine the scenarios where use of NoSQL makes the most sense. We’ll distill the

enumeration of NoSQL features down to the overall tradeoffs between NoSQL and relational databases.


We will also review the various components of the Azure stack that offer NoSQL technology, or

capabilities that are comparable to those found in NoSQL databases. We will look at Windows Azure

Storage, new and imminent features in SQL Azure, and even ways to deploy non-Microsoft, NoSQL

databases to the Azure cloud, to make them usable from .NET code that is also deployed there. By the

end of this paper, readers should have a good understanding of what NoSQL is all about and whether

individual NoSQL features, full-fledged NoSQL databases or continued use of relational technology will

work best for them.

Let’s now define NoSQL, by examining the general use cases that it serves. We’ll also discuss the

subcategories of NoSQL and take a more detailed look at each of them.

What is NoSQL?

There are scenarios in the software development world where data management is required, but what

many of us might think of as a full-fledged database is not. Think of that application you wrote once that

had a small amount of data to store, and did it using flat files, so you could avoid creating a database.

Maybe you needed to store a few bits of information about the current user; maybe you needed to store

application settings, or application state information, like window size and position; or perhaps you

needed to store and retrieve actual content – be it raw text, images, or media – and the file system

seemed to make more sense than a relational database as the repository.

Now imagine an application like that one you wrote, but which ran on the Web and needed to serve a

vast array of users distributed across the globe, many of them concurrently. You would find that your

database needs, while still technically modest in terms of query complexity, would almost certainly

outstrip what you could do comfortably using the file system. You’d need a server, or even a globally

distributed cluster of servers. The server or cluster would need to be highly scalable to meet the demands

of a popular Web-based application, and very fast at performing these relatively simple discrete store and

fetch operations. You would need a database, but probably not the relational one you’re used to.

The grouping of database engines collectively referred to as “NoSQL” is optimized for these workloads.

Most of them sport distributed architectures as a core feature. Many of them are Apache or independent

open source projects.

NoSQL databases are good at what they do, primarily by dispensing with many of the tenets of relational

database management. Many NoSQL databases trade off “ACID” (atomicity, consistency, isolation and

durability) guarantees in favor of very high performance in broad-scale, simple store-and-retrieve

scenarios. And as we mentioned already, NoSQL databases, to varying degrees, even allow for the

schema of data to differ from record to record. The “CAP” theorem says that databases may only excel at

two of the following three attributes: consistency, availability and partition tolerance. Relational databases

favor the first and last of those three properties; NoSQL databases favor the last two. In other words,

NoSQL intentionally de-emphasizes the rules and functionality of consistency that many database

administrators and developers think of as the very prerequisites of database management.


In his paper Amazon's Dynamo [1] (Dynamo is the online retailer’s foundational NoSQL database), Werner

Vogels, Amazon.com’s Chief Technology Officer, describes why such an approach is appropriate: “Most of

these services only store and retrieve data by primary key and do not require the complex querying and

management functionality offered by an RDBMS.” In other words, various systems on the Web, many of

which are consumer-facing, don’t have sophisticated database needs, but they nonetheless have a huge

burden. They must carry out their simple needs very, very quickly.

NoSQL databases handle these workloads well, but they make serious concessions on otherwise

mainstream database needs in order to do so. That is well-justified, but not always well-understood; in

fact there exist NoSQL practitioners who advocate the usage of NoSQL as a general database technology

applicable to the mainstream of application database needs. Such advocacy has caused some relational

database customers to have concerns that they should perhaps switch to NoSQL databases even for line-

of-business (LOB) applications.

Customers have these concerns despite the fact that most LOB apps require transactional guarantees, and

are well-served by normalized design and formal schema. This can be a controversial state of affairs and

we hope to sort out that controversy. For now though, let’s just say that NoSQL databases work well in

certain scenarios, and that sketching out what those scenarios are, and what they are not, is an important

goal of this paper.

To help enumerate those scenarios, it’s best that we discuss four subcategories that NoSQL databases

tend to break down into. Enumerations of such subcategories tend to vary, but they usually include Key-

Value Stores, Document Stores, Wide Column Stores and Graph Databases. Each NoSQL subcategory

serves certain scenarios best. To understand core NoSQL scenarios as well as we can, let’s explore the

various NoSQL subcategories and the specific types of applications and workloads they support most

ably.

Key-Value Stores

The Key-Value Store subcategory (summarized graphically in Figure 1) is perhaps the mother of all NoSQL database types. Most NoSQL databases feature key-value mechanisms, even if only behind the scenes. NoSQL databases that belong to the explicit Key-Value Store category use their namesake construct as the basic unit of storage. A key-value pair might consist of a key like “Phone Number” that is associated with a value like “(212) 555-1212.” Key-Value Stores contain records whose entire content is made up of such pairs; the structure of one record can differ from the others in the same collection.

Figure 1: Key-Value Stores often use the nomenclature of tables and rows, but the latter simply contain collections of key-value pairs, which vary from row to row.

[1] http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

If you do much programming, you’ll recognize this construct right away. That’s because collections,

dictionaries and associative arrays in the programming world work on the same principle. Data caches

work on the key-value principle as well. In fact, one prominent Key-Value Store, MemcacheDB, is API-

compatible with the Memcached open source cache.

The parallels between Key-Value Stores on the one hand, and collections, dictionaries, associative arrays

and caches on the other, are more than academic; they are significant. They show that NoSQL databases work well

in circumstances where data retrieval needs to be cache-like in speed and where the data which must be

stored and retrieved consists of small, simple collections of attributes and values.
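To make the parallel with dictionaries concrete, here is a minimal sketch in Python of a key-value "table" held in memory. It is an illustration only, not the API of any product named in this paper; the `put` and `get` helpers, the row keys and the sample values are all assumptions made for the example.

```python
# A toy in-memory key-value "table": each row is simply a dict of key-value pairs.
# Helper names, row keys and values are hypothetical, chosen for illustration.
customers = {}

def put(table, row_key, pairs):
    """Store (or overwrite) the key-value pairs held under row_key."""
    table[row_key] = dict(pairs)

def get(table, row_key, default=None):
    """Fetch a whole row by its key; there are no joins and no query language."""
    return table.get(row_key, default)

put(customers, "cust:1001", {"Name": "Chris", "Phone Number": "(212) 555-1212"})
# A second row in the same collection can carry a different set of keys.
put(customers, "cust:1002", {"Name": "Pat", "Landing Page URI": "/home",
                             "Top Products": ["A12", "B7"]})

print(get(customers, "cust:1001")["Phone Number"])
print(sorted(get(customers, "cust:1002").keys()))
```

A real Key-Value Store adds distribution, replication and persistence around this same store-by-key, fetch-by-key contract.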

Applications where Key-Value Stores would work well include anything where lists, like product

categories, individual product attributes, shopping cart contents and top n best-selling products, or

individual values like color schemes, a landing page URI, or a default account number, must be

maintained. Values can consist of long text content, not just numeric and short string data. As such,

content like comments, reviews, status messages or even private emails can be stored in a Key-Value

Store. Most of this data is non-hierarchical, so the lack of relational logic or join constructs is acceptable.

Some of this key-value-appropriate data (though probably not the long text content) is akin to lookup

data, or configuration and preference data, in smaller applications. For a desktop app, we could imagine

this data might be stored in a configuration file or a small, offline database. We could also imagine that

much of it might do well to be loaded in memory upon application startup. For a consumer-facing Web

app, the data is similarly straightforward, but the storage technology itself must be more capable. The

data must live in a repository that is distributed, fault tolerant, fast and highly available.

Beyond MemcacheDB and Dynamo lie other Key-Value Stores. Project Voldemort is an open source Key-

Value store that originated at LinkedIn; and Dynomite, Kai and Riak are open source derivatives of

Dynamo (which is not open source, nor publicly available, even though its architecture has been disclosed

through published papers).

Before we go on to describe other NoSQL database types, we must reiterate that almost all of them,

whether physically or conceptually, build upon Key-Value Store principles. Therefore you should expect

their applications to be more specialized than, but not wholly distinct from, those of Key-Value Stores

themselves.

Document Stores

Document Stores are NoSQL databases which treat what might be otherwise called “records” or “rows” as

“documents.” As with Key-Value Stores, each record can have a structure that differs widely from the

others. Each document consists of a set of keys and values, which can be compared to a relational table’s

field names and values. The Document Store data structure is summarized in Figure 2.

Two leading Document Stores, CouchDB and MongoDB, each use JavaScript data types for the values

stored in their documents. Because of this, their documents can be thought of as JavaScript objects and

can, in fact, be written and read in JSON (JavaScript Object Notation) format. That doesn’t mean

Document Stores equate to Object Databases, but it does mean that Document Stores have an affinity


with JavaScript programming and programmers. In fact, the native stored procedure/scripting language

for both CouchDB and MongoDB is JavaScript itself.
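As a rough illustration of documents as JSON objects, the sketch below uses Python's standard `json` module rather than either product's own client library. The two documents, their field names and the `_id` values are hypothetical; the point is only that documents in one collection can differ in structure and still round-trip through JSON text.

```python
import json

# Two hypothetical documents in the same collection; note the differing fields.
blog_post = {
    "_id": "post-1",
    "type": "post",
    "title": "Hello",
    "tags": ["nosql", "azure"],
    "comments": [{"author": "Chris", "text": "Nice post"}],
}
event = {
    "_id": "event-7",
    "type": "appointment",
    "when": "2011-04-25T09:00:00",
    "attendees": 3,
}

# The "collection" is keyed by document id, much like a Key-Value Store.
collection = {doc["_id"]: doc for doc in (blog_post, event)}

# Serialize one document to JSON text and read it back again.
wire_text = json.dumps(collection["post-1"], sort_keys=True)
round_tripped = json.loads(wire_text)

print(wire_text)
print(round_tripped["comments"][0]["author"])
```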

Documents can also contain attachments, making document stores useful for content management. The

fact that certain Document Stores feature versioning of their documents (i.e. old versions are retained and

all versions are numbered) makes them all the more useful for that purpose.

CouchDB and MongoDB have been used for an array of public-facing Web application types including

blog engines, event logs, appointment calendars, media stores, chat applications, cloud bookmark storage

and even Twitter clients.

An important facet of Document Stores is that the documents themselves can be addressed by unique URLs. And given the HTTP and URL orientation, document databases are automatically REST-friendly, as their APIs bear out. In the case of CouchDB, the HTTP orientation is developed to the point where the database can function as its own Web application server.

Here’s how: so-called Show Functions in CouchDB – JavaScript functions that render HTML with the return statement – can be stored in special documents called design documents, and each function within is accessible via URL. This means that entire Web applications can be implemented in a document database. Users visit a URL, code runs on the server and content is returned via the HTTP response stream, just as it would be with classic ASP, node.js, ASP.NET Web Pages or PHP.

This HTTP and application orientation distinguishes Document Stores from Key-Value Stores, the latter of

which are more general purpose in their implementation and application. That said, there are some

NoSQL taxonomies which do not recognize the Document Store category and instead label its members

as Key-Value Stores.

As you will see, the remaining two NoSQL subcategories utilize key-value technology as well.

Figure 2: Document Stores contain JSON objects, referred to as documents, each of which has a schema-free set of properties and values. Values may contain attachments, point to other documents, or directly contain them.

Wide Column Stores

Wide Column Stores, also known as Column Family Stores, manage key-value pairs, but they organize their storage in a semi-schematized and hierarchical pattern. Perhaps fittingly then, some of their nomenclature correlates with that of RDBMS technology. For example, the keys in a Wide Column Store are referred to as columns, and are stored in structures that are sometimes referred to as tables. In between the table and column level lie various intermediate structures that vary by product. For example, Apache Cassandra (originated by Facebook) features Super Columns. Hypertable and Apache HBase feature Column Families, and Google’s BigTable features Tablets. The hierarchical structure and some of the varying nomenclature of Wide Column Stores are summarized in Figure 3.

Although the schema within the intermediate structures can vary from row to row, tables and the

intermediate structures themselves must be declared. Therefore, Wide Column Stores, while they tolerate

schema variation at the “leaf” column level, are not completely schema-free. One could reasonably argue,

in fact, that schema changes at the non-leaf level in Wide Column Stores are more disruptive than

changes to table schemas in relational databases.

Wide Column Stores work well for a subset of requirements that Key-Value Stores accommodate, and many adopters of this category of NoSQL database cite the performance factors, over the structural ones, as reasons they chose it. But, clearly, Wide Column Stores are best for semi-structured data, rather than data whose structure is completely variable from row to row.

As an example, in a product catalog, we may have a collection of items, each of which has a size and a rating associated with it, and we may want to store these items together in a table. But certain items’ sizes may be represented by height, width and depth, others by radius, and still others by weight. The rating may be a star rating on a 1-5 scale (e.g. for a book), or a collection of sub-ratings on various attributes (e.g. freshness, flavor, color, moistness). Accommodating a grouping of entities with high-level characteristics in common, but with differing context-specific attributes, is one area where Wide Column Stores do well.
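The product catalog scenario can be sketched with nested dictionaries, as below. This is a simplified model under stated assumptions, not the storage layout or API of Cassandra, HBase or any other product: the `DECLARED_FAMILIES` set, the `put` helper and the sample rows are hypothetical. It shows the two levels described above: column families must be declared, while the columns inside them vary freely from row to row.

```python
# Simplified model: row key -> column family -> column name -> value.
# Column families are declared up front; columns inside them are free-form.
DECLARED_FAMILIES = {"size", "rating"}  # hypothetical declared schema
catalog = {}

def put(row_key, family, column, value):
    """Write one column value, rejecting families that were never declared."""
    if family not in DECLARED_FAMILIES:
        raise KeyError(f"undeclared column family: {family}")
    catalog.setdefault(row_key, {}).setdefault(family, {})[column] = value

# A book sized by height, width and depth, with a star rating.
put("book-1", "size", "height", 9.0)
put("book-1", "size", "width", 6.0)
put("book-1", "size", "depth", 1.2)
put("book-1", "rating", "stars", 4)

# A ball sized by radius, and a cake sized by weight with sub-ratings.
put("ball-1", "size", "radius", 11.0)
put("cake-1", "size", "weight", 2.5)
put("cake-1", "rating", "freshness", 5)
put("cake-1", "rating", "flavor", 4)

for row_key, families in catalog.items():
    print(row_key, families)
```

All three items live in the same table and share the same two families, yet no two rows have the same set of columns.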

In the relational world, traditionally, such context-specific attributes would each need to be stored in separate tables, with a foreign key in the main table to link them [2]. Joins and application-level merging of the datasets might be necessary. But Wide Column Stores allow such differently nuanced data to commingle in the same tables and query result sets.

Figure 3: Wide Column Stores contain tables (indicated above as “T”); Cassandra calls them “super-column families” (shown as “SCF”). These contain a key and columns (“C”) which consist of name/value pairs. Columns are subdivided into column families (“CF”), which are known as “super columns” (“SC”) in Cassandra. Columns are schema-free, but higher-level objects must be declared.

[2] Recent versions of major RDBMS products offer new features to accommodate this requirement without resorting to separate attribute tables. Such features in SQL Server and SQL Azure will be discussed later in this paper.

Graph Databases

Graph databases recognize entities in a business or other domain, and explicitly track the relationships

between them. In the graph database world, these entities are called nodes and the relationships between

them are called edges; both terms come from mathematical graph theory, as does this NoSQL

database subcategory’s name. An example of a graph database assertion (the fundamental atomic unit of

data expression) might be:

Chris city Auckland

Here, Chris and Auckland are nodes and city is an edge.

From Relational to Relationships

As we try to orient ourselves to graph databases from a relational frame of reference, we could think of an edge in a graph database (a predicate) as a join, and the subject and the object of that predicate (the Chris node and the Auckland node, respectively, in the above case) as rows in a table. Attributes of a node that have scalar values (for example the attribute Age might have a value of 45) can also be represented using edges and nodes, or as properties and values, depending on the specific graph database in use. In the former case, an edge might be thought of as a column, in a broad sense, rather than as a join. A collection of assertions is kept together in a graph. The structure of Graph Databases is illustrated in Figure 4.

New edges can be added (or old ones removed) at any time, allowing one-to-many and many-to-many

relationships to be expressed easily and avoiding anything like an intermediate relationship table that you

might use in a relational database to accommodate many-to-many joins.
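A small sketch may help here. It models a graph as a set of (subject, edge, object) assertions in plain Python; this is an assumption made for illustration and is not how any particular graph database stores or queries its data. The helper names and the "follows" edges are hypothetical.

```python
# Each assertion is a (subject, edge, object) triple, e.g. Chris city Auckland.
graph = set()

def add(subject, edge, obj):
    """Assert a new relationship; no schema change or link table is needed."""
    graph.add((subject, edge, obj))

def remove(subject, edge, obj):
    """Retract a relationship if it exists."""
    graph.discard((subject, edge, obj))

def objects(subject, edge):
    """All nodes reachable from subject along the given edge."""
    return sorted(o for s, e, o in graph if s == subject and e == edge)

def subjects(edge, obj):
    """All nodes that point at obj along the given edge."""
    return sorted(s for s, e, o in graph if e == edge and o == obj)

add("Chris", "city", "Auckland")
# A many-to-many relationship is just more edges of the same kind.
add("Chris", "follows", "Pat")
add("Chris", "follows", "Sam")
add("Sam", "follows", "Pat")

print(objects("Chris", "follows"))
print(subjects("follows", "Pat"))
```

In a relational design the "follows" relationship would typically need its own intermediate table; here it is expressed directly as edges that can be added or removed at any time.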

Social graphs fit into the graph database rubric nicely (as does the name). Constructs like friends,

followers, degrees of separation, lists, endorsements, status messages and responses to them are very

naturally accommodated in graph databases. Semantic Web data also maps quite nicely on to the graph

database structure.

Figure 4: Graph databases, like those in other NoSQL subcategories, may be key-value based, but they excel at tracking relationships (edges) between entities (nodes), in addition to the entities, keys and values, themselves. Sometimes even the key-value pairs are represented as edges and nodes.

Graphs and ORM

As we consider the concepts of properties, values and relationships, it starts to become clear that graph database theory has some alignment with object-relational modeling and ORM programming. This then raises the question of whether object databases belong in the NoSQL camp, or even whether they are in fact synonymous with graph databases. There really are no rules or strict definitions to provide authoritative answers to these questions, but there are differences in intent between graph and object databases. Object databases typically are schema based (even if the schema describes a class rather than a table) and are focused on entities and their properties. Graph databases are designed to accommodate slowly- or even rapidly-changing schemas and focus on relationships between entities more than the entities themselves.

Popular graph databases include AllegroGraph, Neo4j and Twitter’s FlockDB.

NoSQL Database Common Traits

Having now covered the four main NoSQL subcategories, and what distinguishes them, let’s take a look at

the qualities which each category’s products have in common. We’ll first look at a pair of technologies

from Google (and their Apache project counterparts) whose design principles pervade all NoSQL

subcategories. We’ll continue with a general look at the data consistency models employed in NoSQL

databases and the split between NoSQL’s physical and logical implementations. We’ll finish with a look at

NoSQL indexing and we’ll then be able to move to the next section and review the various features and

products within Windows Azure and SQL Azure that provide NoSQL functionality.

Shared Legacy: MapReduce, Hadoop, BigTable and HBase

It’s a good idea for us to take a look at two technologies which underlie, or have provided inspiration for,

many of the individual products in each NoSQL subcategory: Google’s MapReduce and

BigTable, along with their open source counterparts, Apache Hadoop and Apache HBase. Google MapReduce

and the open source Hadoop project provide generalized parallel job processing engines; Google

BigTable and the open source HBase are Wide Column Stores whose tables can serve as sources and

destinations for the MapReduce and Hadoop jobs, respectively.

Why are the job processing engines necessary? Because the less structured, less formal approaches

employed by NoSQL databases make querying them less straightforward than in the relational world, and

MapReduce/Hadoop help mitigate the burden.

Think about it: although explicit joins are not necessary in the NoSQL world, the permissive environment

and resulting inconsistency across records/entities/documents make for quite a bit more hunting and

gathering in order to satisfy a query. This is especially true for distributed NoSQL databases which store

their data across various servers, typically using a partitioning pattern called sharding (more on that later).

The lack of query optimizers, and corresponding query efficiencies, in NoSQL databases cries out for some

help.

NoSQL databases often require queries to be broken up and executed across multiple repositories on

different servers. At some point, the resulting segmented result sets need to be collected and unified. An


approach called map-reduce acknowledges and addresses this conundrum. Specifically, the process of

distributing the query across multiple agents is the Map step, and the process of coalescing the results

into a single result set is the Reduce step.
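To make the two steps concrete, here is a minimal sketch, assuming a hypothetical database whose data is split across three in-memory lists standing in for shards. The data and the names `map_shard`, `merge` and `reduce_results` are illustrative only and do not belong to any NoSQL product's API.

```python
from functools import reduce

# Hypothetical shards: each list stands in for the data held on one server.
SHARDS = [
    [{"city": "Paris", "sales": 120}, {"city": "Lyon", "sales": 80}],
    [{"city": "Paris", "sales": 200}],
    [{"city": "Nice", "sales": 50}, {"city": "Lyon", "sales": 20}],
]

def map_shard(shard):
    """Map step: run the query locally, producing per-shard partial totals."""
    totals = {}
    for row in shard:
        totals[row["city"]] = totals.get(row["city"], 0) + row["sales"]
    return totals

def merge(left, right):
    """Combine two partial result sets into one."""
    merged = dict(left)
    for city, sales in right.items():
        merged[city] = merged.get(city, 0) + sales
    return merged

def reduce_results(partials):
    """Reduce step: coalesce all partial results into a single result set."""
    return reduce(merge, partials, {})

def total_sales_by_city(shards):
    """Distribute the query to every shard, then unify the answers."""
    return reduce_results(map(map_shard, shards))
```

Calling `total_sales_by_city(SHARDS)` visits each shard once and returns one dictionary of totals per city. In a real deployment the map calls would run on separate servers; here they run in a single process purely for illustration.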

Map-reduce is a general algorithm, and is prevalent in functional programming languages – including F#

– which support the notion of map and reduce functions. MapReduce (without the hyphen) is the

patented software framework from Google that the company applies in the realm of managing large

datasets over clusters or other distributed topologies. Hadoop is the top-level Apache project which

implements map-reduce as a generalized highly parallel, divide-and-conquer batch job task manager.

Google MapReduce/BigTable and Apache Hadoop/HBase have their fingerprints all over most NoSQL

databases. For example, Apache CouchDB, one of the document store databases already discussed, is,

according to its Web site on apache.org, “queried and indexed in a MapReduce fashion.” Some would

argue that CouchDB’s map and reduce steps differ conceptually from those in MapReduce itself.

Nonetheless, the overarching map-reduce approach is the inspiration for the design of many NoSQL

products.

As effective as these mechanisms can be, they also introduce extra work for the database developer.

That’s because instead of providing a declarative language over distributed storage that could then be

implemented using map-reduce functionality under the covers, the architecture’s designers focused

primarily on the raw processing approach and never added a language abstraction. In the world of line-

of-business applications, the declarative power of SQL provides productivity that most organizations

count on. Map-reduce based systems, by and large, cannot provide that productivity.

A summary of the various NoSQL database subcategories, and the suitability of each to different scenarios

and requirements, including map-reduce, is presented in table form in Figure 5.

Figure 5: This chart shows the applicability of different NoSQL database types to different needs

or scenarios. Notice that wide column stores are more special-purposed than are the other

NoSQL subcategories, which are applicable in a variety of scenarios.


NoSQL Database Consistency

Many NoSQL databases use an “eventual consistency” model for database updates and schema changes.

This means that changes made at one replica will be transmitted asynchronously to the others. Domain

Name Servers on the Internet refresh themselves on this model, and that is exactly why DNS propagation

delay can allow some Internet users to navigate successfully to a new or updated domain name, while for

other users the name may not resolve correctly. Eventually, all users’ DNS servers are updated and the

anomaly disappears.

The sacrifice of propagation delay is acceptable when the alternative (a coordinated atomic update across

all DNS servers globally) is considered. The eventual consistency model allows updates to occur and DNS

server availability to be maintained, all for the price of a temporary, tolerable, well-understood anomaly in

the data.

Likewise, in the NoSQL context, eventual consistency makes possible discrepancies in data state between

replicas, and thus between users and locations, for a temporary period. As with DNS servers, such

concessions to consistency are made in the name of high availability and will eventually resolve.
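The behavior can be illustrated with a toy model. This is a sketch under simplifying assumptions: the replica names, the class names and the explicit `propagate` call are all hypothetical, and a real system would deliver updates asynchronously in the background rather than on demand.

```python
from collections import deque

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

class EventuallyConsistentStore:
    """Toy model: a write lands on one replica and is queued for the others."""

    def __init__(self, replica_names):
        self.replicas = {name: Replica(name) for name in replica_names}
        self.pending = deque()  # (target replica, key, value) not yet delivered

    def write(self, replica_name, key, value):
        # The write is acknowledged as soon as one replica has it.
        self.replicas[replica_name].data[key] = value
        for name in self.replicas:
            if name != replica_name:
                self.pending.append((name, key, value))

    def read(self, replica_name, key):
        # A read sees only what this particular replica currently holds.
        return self.replicas[replica_name].data.get(key)

    def propagate(self):
        """Deliver all queued updates (asynchronous in a real system)."""
        while self.pending:
            name, key, value = self.pending.popleft()
            self.replicas[name].data[key] = value
```

Between a `write` to one replica and the next `propagate`, a `read` against any other replica returns the old state. That window is the temporary discrepancy described above, and it closes once propagation completes.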

Not all NoSQL databases use eventual consistency. Some are fully transactional. Others use an optimistic

concurrency model. Some databases, like Apache Cassandra and Apache HBase, not only replicate over

time, but commit their initial writes to disk over a certain latency period as well. In other words, these

databases perform buffered writes by writing to memory initially (and to a log), rather than tables on disk.

This is done in order to batch up the writes, rather than have them execute one at a time, since batching

reduces the aggregate I/O time required. It is completely different from the update behavior of an RDBMS.
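A buffered write path of this kind can be sketched as follows. This is an illustrative model only, not the actual implementation in Cassandra or HBase; the class name, the `flush_threshold` parameter and the use of plain Python containers in place of files are all assumptions.

```python
class BufferedTable:
    """Toy model of a write path that buffers in memory and logs for safety."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.commit_log = []   # stands in for an append-only log file
        self.memtable = {}     # recent writes held in memory
        self.on_disk = {}      # stands in for the table files on disk
        self.flush_count = 0

    def write(self, key, value):
        self.commit_log.append((key, value))  # cheap sequential append
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the whole batch to 'disk' in one pass, then clear the buffers."""
        self.on_disk.update(self.memtable)
        self.memtable.clear()
        self.commit_log.clear()
        self.flush_count += 1

    def read(self, key):
        # Recent writes are served from memory before falling back to disk.
        if key in self.memtable:
            return self.memtable[key]
        return self.on_disk.get(key)
```

With a threshold of three, the first two writes touch only the log and the in-memory buffer; the third triggers a single batched flush. The log exists so that buffered writes could be replayed after a crash, which this sketch does not model.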

The liberal consistency regimes of many NoSQL databases are quite appropriate, in certain scenarios. It’s

important to remember that the transactional model is still the correct one in many others, including most

line-of-business applications. The supremacy of one model in certain circumstances does not render

established models obsolete in a variety, or even a majority, of others.

Consistency is not the only sacrifice made in the name of performance and high availability. For some

NoSQL databases, declarative query power is sacrificed as well. For example, “views” in CouchDB, rather

than being stored queries, are actually JavaScript programs that return data. They are somewhat akin to

stored procedures in the relational world, but even that analogy falters, as CouchDB views must iterate

through data imperatively rather than use the set-oriented constructs found in SQL.
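The contrast can be sketched side by side. CouchDB views are written in JavaScript; the example below uses Python purely for illustration, and neither function reflects CouchDB's actual API. The documents, field names and function names are hypothetical, and SQLite stands in for a relational engine.

```python
import sqlite3

DOCS = [
    {"_id": "1", "type": "order", "customer": "alice", "total": 30},
    {"_id": "2", "type": "order", "customer": "bob", "total": 45},
    {"_id": "3", "type": "customer", "name": "alice"},
    {"_id": "4", "type": "order", "customer": "alice", "total": 25},
]

def view_totals_by_customer(docs):
    """Imperative, view-style query: visit every document and emit what matches."""
    totals = {}
    for doc in docs:
        if doc.get("type") == "order":
            totals[doc["customer"]] = totals.get(doc["customer"], 0) + doc["total"]
    return totals

def sql_totals_by_customer(docs):
    """Declarative equivalent: state the result wanted and let the engine plan it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, total INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(d["customer"], d["total"]) for d in docs if d.get("type") == "order"],
    )
    rows = conn.execute(
        "SELECT customer, SUM(total) FROM orders GROUP BY customer"
    ).fetchall()
    conn.close()
    return dict(rows)
```

Both functions return the same totals. The difference is where the logic lives: in the first, the developer writes the iteration and the grouping by hand; in the second, a single `GROUP BY` statement leaves those decisions to the engine.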

The result is that individual query patterns must be optimized through code that anticipates them, rather

than through optimizing logic that encounters them. As with the consistency sacrifice, in some situations,

this may be perfectly acceptable. As we have discussed, many public Web applications perform a variety

of very simple queries and a small number of complex ones, all of which can be explicitly coded. But,

again, that’s not usually the case with LOB apps.

Logical Models, Physical Models, and the Ubiquity of Key-Value Pairs

The subcategory distinctions we’ve covered here are not only soft, but are logical model distinctions that

may or may not translate to the underlying physical models. For example, Cassandra, a Wide Column


Store, essentially imposes a logical “super column” hierarchy over key-value pairs. Key-Value Stores

underlie most other subcategories, either in terms of technique (such as how CouchDB’s documents are

actually key-value structures, in an overt fashion) or in implementation (such as how edges and nodes in a

graph database can be stored as key-value pairs as well, but behind the scenes).

Document Stores, Wide Column Stores, and Graph Databases are in some senses akin to domain specific

languages (DSL) in the programming world. While most NoSQL databases utilize key-value constructs,

distributed architectures and sharding, and allow for schema-free databases, the various NoSQL

subcategories provide different data interfaces, each of which works best in a subset of scenarios.

NoSQL Indexing

Despite the DSL analogy above, the common key-value substrate of most NoSQL databases does not

render the subcategories mere trivial abstractions. The quite wide spectrum of indexing features in the

various NoSQL databases makes this clear. Some NoSQL databases index on little else than the keys used

for rows/entities/documents and/or partitions. Others go a bit beyond this. For example, CouchDB

indexes documents only on their IDs and sequence (version) numbers, but it also creates indexes on

views. The AllegroGraph Graph Database, meanwhile, indexes everything (id, subject, predicate, object

and graph), automatically.

Some Key-Value and Wide Column Stores support so-called “secondary” indexes – a generic term for an

index built on the value of a property/column that is not the key. But secondary indexes are relatively

new features in some databases and still a bit immature. For example, Cassandra added secondary

indexes in version 0.7, which was just released on January 9, 2011. These secondary indexes are

essentially hash indexes only; support for bitmapped indexes, with which range criteria could be satisfied,

is in the works for a future release.

In the absence of secondary index support, some developers implement them on their own. The common

approach is to create a second table containing the values of the “indexed” column and their

corresponding row keys from the main table. This requirement is somewhat emblematic of NoSQL

databases in general: developers may need to implement on their own what could long be taken for

granted in an RDBMS. Again, in some situations, the tradeoff is deemed reasonable given the

performance and availability requirements, but the price should not be understated.
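The hand-built approach described above can be sketched like this, assuming a hypothetical key-value table held in Python dictionaries. The class and method names are illustrative; the point is that the application, not the database, carries the burden of keeping the second table in step with the first.

```python
class TableWithManualIndex:
    """Key-value 'main table' plus a second table acting as a secondary index."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.main = {}    # row key -> row (a dict of column values)
        self.index = {}   # indexed value -> set of row keys

    def put(self, row_key, row):
        # Remove any stale index entry before writing the new row.
        self.delete(row_key)
        self.main[row_key] = row
        value = row.get(self.indexed_column)
        if value is not None:
            self.index.setdefault(value, set()).add(row_key)

    def delete(self, row_key):
        old = self.main.pop(row_key, None)
        if old is None:
            return
        keys = self.index.get(old.get(self.indexed_column))
        if keys is not None:
            keys.discard(row_key)
            if not keys:
                del self.index[old[self.indexed_column]]

    def find_by_indexed_value(self, value):
        """Look up row keys in the index table, then fetch rows from the main table."""
        return [self.main[k] for k in sorted(self.index.get(value, ()))]
```

Every write now costs two updates, and a failure between them would leave the index out of step with the data. An RDBMS handles that bookkeeping transactionally; here it becomes the developer's responsibility.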

NoSQL options on the Windows Azure Platform

As we discussed in the paper’s introduction, a proper evaluation of NoSQL involves deconstructing and

deciding which features or characteristics are compelling. Next, you need to decide if those same features

or characteristics are available from technologies you already use. With that in mind, what follows is an

overview of certain Windows Azure and SQL Azure technologies (plus a few Microsoft on-premise


products and features) and which aspect of NoSQL technology each one implements. As you will see,

elements of NoSQL computing can pop up in some unexpected places.

Azure Table Storage

Azure Storage is probably the most compelling place to start on our tour of NoSQL in Azure. That’s

because Azure Table Storage is in fact a NoSQL database. Of the various categories of NoSQL database

discussed in the last section, Azure Table Storage fits most snugly with Key-Value Stores. Azure Storage

key-value pairs are called Properties; they belong to Entities which, in turn, are organized into so-called

Tables. Azure Table Storage features optimistic concurrency and, as with other NoSQL databases, is

schema-free, so the properties of each entity in a table may differ.
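The data model and its concurrency behavior can be approximated with an in-memory sketch. This is not the Azure client library: the class, its methods and the integer version tags are assumptions made for illustration. The partition key and row key mirror the way the service identifies entities, and the tag check imitates the kind of version comparison that optimistic concurrency relies on.

```python
import itertools

class ConcurrencyError(Exception):
    """Raised when an update is based on a stale copy of the entity."""

class InMemoryTable:
    def __init__(self):
        self._entities = {}  # (partition_key, row_key) -> (etag, properties)
        self._etags = itertools.count(1)

    def insert(self, partition_key, row_key, **properties):
        key = (partition_key, row_key)
        if key in self._entities:
            raise KeyError(f"entity {key} already exists")
        etag = next(self._etags)
        self._entities[key] = (etag, dict(properties))
        return etag

    def get(self, partition_key, row_key):
        etag, properties = self._entities[(partition_key, row_key)]
        return etag, dict(properties)

    def update(self, partition_key, row_key, expected_etag, **properties):
        """Succeeds only if nobody else changed the entity since it was read."""
        key = (partition_key, row_key)
        current_etag, _ = self._entities[key]
        if current_etag != expected_etag:
            raise ConcurrencyError(f"entity {key} was modified by another writer")
        etag = next(self._etags)
        self._entities[key] = (etag, dict(properties))
        return etag
```

Two entities in the same table can carry entirely different properties, since nothing declares a schema. No locks are taken on read; a writer that holds an out-of-date tag simply has its update rejected and must re-read and retry.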

Azure Table Storage does not support secondary indexes, and it’s not intended for use as a mainstream

database, especially since SQL Azure is available to handle relational database duties. But Azure Table

Storage is inexpensive ($0.15/GB/month and $0.01/10,000 transactions), easily programmed (via a .NET

client library, a LINQ client and a RESTful API), and scales over multiple servers, as needed, automatically.

Since Azure Table Storage is a bona fide NoSQL database, we could stop there. But it’s important to

realize that other Azure technologies allow for the implementation of NoSQL approaches. These options

are less about full-on NoSQL and more about cherry picking various NoSQL features when that is all that

is actually desired. Let’s continue by looking at those options.

SQL Azure XML Columns

We’ll declare here and now: using XML columns in SQL Azure data storage constitutes NoSQL database

storage. There are a number of reasons why this is the case. First, consider that an XML payload bears

much resemblance to a Document Store NoSQL database. Not only are XML documents just that (i.e.

documents) but they store a collection of elements and values, with those XML elements equivalent to

key-value pairs in Document Stores3.

The schema of an XML document can be changed at will (provided there’s no XSD schema in place – and

the Schema Collections feature of SQL Server that supports XSD is not even implemented in SQL Azure at

this time) and a collection of XML documents may or may not follow a given schema consistently. Again,

each of these qualities is common to SQL Azure XML columns and Document Stores.
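A short sketch shows the resemblance. The two payloads below are hypothetical values of the kind that might sit in the same XML column, and the helper names are illustrative; the parsing uses Python's standard library rather than any SQL Azure feature.

```python
import xml.etree.ElementTree as ET

# Two hypothetical payloads, as they might sit in the same XML column.
# Their element sets differ, and nothing forces them to agree.
ROWS = [
    (1, "<product><name>Kettle</name><watts>2200</watts></product>"),
    (2, "<product><name>Novel</name><author>A. Writer</author>"
        "<pages>310</pages></product>"),
]

def to_key_values(xml_text):
    """Flatten a one-level XML document into a dict of element name -> text."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

def find_rows_with(rows, element_name):
    """Return the ids of rows whose document contains the given element."""
    return [row_id for row_id, xml_text in rows
            if element_name in to_key_values(xml_text)]
```

Each document flattens to a set of key-value pairs, and the sets need not match from row to row. Finding the rows that contain a given element means inspecting every document, which is the same hunting and gathering that an unindexed Document Store requires.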

If that weren’t enough to convince you, then consider that the developer version of Azure Storage (i.e. the

emulator that runs on the local PC to use during development) is actually implemented using XML

columns in SQL Server Express Edition. That means all Azure developers have a full XML-data-as-NoSQL

proof-of-concept running on their development PCs.

This is more than coincidence; it’s about motivation: XML columns were added to SQL Server (and other

major relational database products) to accommodate databases with dynamic schema needs for certain

tables. Prior to XML in the database, the only way to accommodate changing schemas was to build out

“vertical” tables, whose column values were stored as rows in attribute value tables (as key-value pairs, in

fact).

3 This analogy works best if we think of XML documents as a non-hierarchical storage mechanism. If we

think of them as hierarchical (i.e. through the use of XML attributes or child elements) then an analogy

with Wide Column Stores becomes more appropriate.

So if we consider one of the major value propositions of NoSQL, namely flexibility around changing

schemas, we see that very scenario is the inspiration for the XML column feature in SQL Server (and now

in SQL Azure). Using XML for NoSQL computing needs is not a kluge, but rather a sensible alignment of

interests.

It is important to note, however, that unlike on-premise editions of SQL Server, SQL Azure does not

support indexes on XML columns. As long as your tables contain a scalar primary key column, then you’ll

have the option of a key-based index, though you will lack the equivalent of a secondary index.

SQL Azure Federation

NoSQL focuses quite heavily on the notion of horizontal scaling and “sharding.” Sharding (i.e. horizontal

partitioning) of databases accommodates the vast demand that many public Web products may

experience. Using map-reduce-style technology is a common NoSQL product solution for managing the

shards.

SQL Azure Federation, announced at the 2010 Professional Developers Conference (PDC), is a forthcoming

feature of SQL Azure which will allow individual SQL Azure databases to function as individual “shards” in

a larger virtual database. This feature provides a supportable approach to dealing with SQL Azure’s

current 50GB size limit on individual databases and enhances query performance while at the same time

retaining the RDBMS features that most LOB developers need.

SQL Azure Federation “Members” are the counterparts to NoSQL Shards. Shards are “federated” (hence

the name of the feature) and this is achieved through the creation of a so-called Federation Key. The key

is present in any table that will be distributed and each shard is defined in such a way that it is responsible

for storing rows whose federation keys are in a specific range of values4. If the distribution of values

changes over time, individual shards which become too large can be split into multiple ones. A significant

advantage of this splitting feature is that it takes place online, under load, without affecting database

availability or consistency. Once again, Azure lets us cherry-pick a NoSQL feature, without forcing us to

forfeit RDBMS underpinnings.
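The range-based routing and splitting described here can be modeled in a few lines. This is a conceptual sketch, not SQL Azure Federation syntax: the class and method names are hypothetical, keys are assumed to be non-negative integers, and the split runs in-process rather than online and under load.

```python
import bisect

class RangeFederation:
    """Toy model: each member stores rows whose key falls in a half-open range."""

    def __init__(self):
        # Member i covers keys from lower_bounds[i] up to, but not including,
        # lower_bounds[i + 1]; the last member is unbounded above.
        self.lower_bounds = [0]
        self.members = [{}]  # one dict of key -> row per member

    def _member_index(self, key):
        if key < self.lower_bounds[0]:
            raise ValueError("key is below the lowest federation range")
        return bisect.bisect_right(self.lower_bounds, key) - 1

    def put(self, key, row):
        self.members[self._member_index(key)][key] = row

    def get(self, key):
        return self.members[self._member_index(key)].get(key)

    def split(self, at_key):
        """Split the member that covers at_key into two members at that boundary."""
        i = self._member_index(at_key)
        if self.lower_bounds[i] == at_key:
            return  # already a boundary, nothing to do
        old = self.members[i]
        low = {k: v for k, v in old.items() if k < at_key}
        high = {k: v for k, v in old.items() if k >= at_key}
        self.members[i] = low
        self.lower_bounds.insert(i + 1, at_key)
        self.members.insert(i + 1, high)
```

Callers address rows by key value alone, and the routing finds the right member; they never name a member directly. After a split, the same `get` and `put` calls continue to work unchanged, which is what lets a member be divided without the application noticing.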

This first version of SQL Azure Federation will not have support for so-called fan-out queries. So it will not

have a map-reduce-style facility for taking a query that spans multiple members, splitting it automatically

into separate queries and merging the results of each into a single result set. But SQL Azure Federation

will have mapping functions, whereby a needed shard can be located by a specific Federation Key value

and need not be addressed by its physical database name. This makes programming the query

distribution simpler and it also provides the foundation for a full map-reduce-style fan out query

capability that could appear in a future release.5

4 In this way, a Federation Key is similar to an Azure Table Storage Partition Key.

OData

OData is Microsoft’s generalized XML data serialization format, based on the ATOM feed standard, and

RESTful API used to query, create and update data in the repositories it wraps. OData debuted as the

transmission format and API for data exposed by what is now called WCF Data Services (originally known

as project “Astoria,” then as ADO.NET Data Services). Typically, Astoria services act as RESTful wrappers

around Entity Framework data models. But with the generalization of the data format and REST

implementation, OData is now used by Microsoft and others to expose a variety of data sources. On-

premise Microsoft products and technologies that support OData interfaces include SQL Server Reporting

Services in SQL Server 2008 R2, SharePoint 2010 lists and Dynamics CRM 2011.

In the Azure world, both Azure Table Storage and SQL Azure support OData interfaces to their respective

tables. Azure Storage does so natively, while SQL Azure exposes its OData interface via a pre-release tool

(SQL Azure OData Service) that is, at the time of this writing, available from SQL Azure Labs. By logging into the tool

and enabling OData access with a single checkbox (either for anonymous access or access by specific

named users), the OData interface is made available immediately; there is no coding required to enable it.

What’s more, SQL Azure provides this RESTful interface while maintaining its conventional Tabular Data

Stream (TDS) interface. As such, SQL Azure provides developer simplicity while retaining its native

interface, and the performance necessary for heavy LOB workloads.

Windows Azure Marketplace DataMarket leverages OData as its native format for publishing the free and

subscription-based data feeds that comprise the service. This makes the OData format itself especially

valuable, and arguably more so than more generic XML data serialization formats, as it is at once an API

tool and a channel to commercial or public distribution of data.

What the Support Means

In practical terms, this broad support for OData on Azure means that most of its data-focused services can

be programmed via REST from most any development platform. The commands use intuitive URL

patterns and open HTTP verb conventions to provide a full data platform for key-value structured storage

(Azure Table Storage), relational data (SQL Azure) and de-normalized, processed data (DataMarket).

OData can return results not only in ATOM/XML format, but in JSON format too. This makes it conform

extremely well to the APIs of numerous NoSQL databases.
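As an illustration of those URL patterns, the sketch below composes an OData query string using the `$filter`, `$top` and `$format` system query options. The service root, entity set name and helper function are hypothetical, and whether a particular service honors each option is an assumption to verify against that service.

```python
from urllib.parse import quote, urlencode

def build_odata_url(service_root, entity_set, filter_expr=None, top=None,
                    as_json=False):
    """Compose an OData query URL from a service root and system query options."""
    options = {}
    if filter_expr is not None:
        options["$filter"] = filter_expr
    if top is not None:
        options["$top"] = str(top)
    if as_json:
        options["$format"] = "json"
    url = f"{service_root.rstrip('/')}/{quote(entity_set)}"
    if options:
        # Keep '$' readable in option names; percent-encode spaces and quotes.
        url += "?" + urlencode(options, safe="$", quote_via=quote)
    return url
```

A call such as `build_odata_url("https://example.invalid/service.svc", "Customers", "City eq 'Paris'", top=5, as_json=True)` yields a plain URL that any HTTP client on any platform could issue with a GET request. No client library is required, which is the low barrier to entry at issue here.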

Many NoSQL databases tout their support for REST, and the corresponding ease of use and low barrier to

entry this provides. Arguably many NoSQL proponents are drawn to these platforms because of their

simple RESTful interfaces. Given that Azure provides this same ease of use throughout the platform, we

can see once again that Azure addresses specific needs catered to by NoSQL platforms. In fact, Azure

provides for this need, and then goes beyond it: given Microsoft’s PowerPivot self-service BI tool, and its

ability to consume and analyze OData-formatted feeds using Azure’s RESTful services, Azure provides

self-service BI to customers and not just APIs to developers. This presents a very clear business case that

various NoSQL databases may be hard-pressed to counter.

5 Even in advance of such support, considering that map-reduce jobs must themselves be explicitly coded

or scripted in many NoSQL databases, the notion of writing an Azure Federation fan-out query through

code seems a reasonable task by comparison.

Running NoSQL Database Products using Azure Worker Roles, VM Roles and Azure Drive

If the desire or specific need is present to run a particular NoSQL database product, Worker and Virtual

Machine Roles make it possible to accommodate this setup on Azure, provided the NoSQL product has a

Windows Server-compatible version (and most do). The VM role allows customers to build their own

machine image, upload it as a virtual hard drive (VHD) file to their Azure accounts, and then spin up

instances of that image. Any properly licensed software can be installed in that machine image, including

various free NoSQL products. Likewise, a Worker Role can accommodate such customization, but any

products added to the baseline image must be xcopy-deployable or silently installed during the Worker

Role's startup task or its code's RoleEntryPoint.OnStart method.

There is one complication though: since Worker and even VM role instances may be recycled at any point,

local hard drive storage within the instance may at any time revert back to its baseline image state. So

unless the data in the instance is static and can itself be included with a VM Role image or placed on a

Worker Role image in a scripted manner at startup, data storage becomes an issue.

Luckily, the Windows Azure Drive offering provides a solution. Azure Drive allows a separate VHD file,

hosted in Azure Blob Storage, to be mounted as a mapped drive, within the Worker/VM Role instance,

through a simple .NET API. This means that a Worker/VM Role instance could have a NoSQL database

product installed on it, configured to read and write data to a mapped drive, and as long as the drive were

mounted before the NoSQL product initialized, all would be well. Scaling this to multiple Role instances

gets tricky, since a given VHD can be used as a read/write volume by only one instance at a time, but

there are ways to do it.

Is this solution optimal? Probably not. But it is workable and still runs within the Azure

managed platform, so you can avail yourself of the elasticity and other traits and features of the

Azure fabric’s management. For Microsoft customers who already have a substantial investment in SQL

Server and/or .NET, this is no trivial benefit. And readers who find compelling the argument that

NoSQL features and benefits can be had from existing Azure data products like Azure Storage, SQL Azure

and their OData interfaces, will likely find the need to run dedicated NoSQL products an edge case. With

that in mind, the Azure Worker Role/VM Role/Azure Drive option appears quite feasible.

On-Premise Technologies

Before we move on, three non-cloud technologies from Microsoft bear special mention, as they provide

their own implementations of the non-tabular data, fan-out query and map-reduce job execution

technology discussed in this paper.


SQL Server 2008/2008R2 “Beyond Relational” Features

With the release of SQL Server 2008, a number of features were added to the product under the moniker

“beyond relational.” There is an array of features in this category. The features most often identified

with it are the so-called spatial features that allow for efficient storage and processing of geo-spatial

information, such as latitude/longitude coordinates, polygons, points and lines. But “Beyond Relational”

goes beyond geospatial, and includes a set of features that one could classify as NoSQL-like in nature.

For example, the Sparse Columns feature effectively allows for loosely-schematized tables. Although all

possible columns do in fact need to be declared as part of a table’s definition, the values for columns

declared as sparse can be null, without introducing any storage overhead on a per-row basis (there is

some overhead at the table-level, however). So while the full schema of sparse columns is stored, the

physical content of each row in the table may differ, and drastically so, if necessary. Special filtered

indexes and filtered statistics can be used to maintain good performance in tables that use sparse

columns. Filestream columns allow Binary Large Object (BLOB) data to be stored in the server’s file

system rather than in the database per se. Hierarchies and the HierarchyID column type allow for the

representation of hierarchical data and provide explicit support for referencing and testing data in terms

of ancestors and descendants.

The XML data type is a beyond-relational feature as well and, as we have discussed, it is supported by SQL

Azure; spatial features and the HierarchyID column type are supported by SQL Azure as well. However,

Sparse Columns and Filestream features are not supported by SQL Azure at present. My take on this is

that the symmetry between SQL Server and SQL Azure will continue to increase and, as such, the

remaining Beyond Relational features will eventually be available in the cloud. When that happens,

developers who are attracted to specific facets of NoSQL databases will find SQL Azure even more

accommodating of their needs.

SQL Server Parallel Data Warehouse Edition

SQL Server Parallel Data Warehouse Edition (SQL PDW), which was born of the acquisition of DATAllegro

by Microsoft in 2008, is Microsoft’s maiden offering in the Massively Parallel Processing (MPP) database

space. The product allows horizontal scaling of SQL Server by providing an interface over a number of

instances of the product, each of which participates in a striped distribution of large data warehouse

databases. To the database client, the entire array of SQL Server instances appears as a unified whole, and

the queries sent to that single entity are appropriately split and dispatched by PDW to the appropriate

individual agents, with each constituent query being executed in parallel (hence the term MPP).
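The split-and-dispatch pattern can be sketched as follows. This is a conceptual illustration, not PDW's implementation: the data slices and function names are hypothetical, and a thread pool in one process stands in for separate SQL Server instances.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical slices of one large fact table, one per compute node.
NODE_SLICES = [
    [12.0, 8.0, 10.0],
    [30.0],
    [5.0, 15.0],
]

def node_partial_avg(values):
    """Runs on each node: return the pieces needed to rebuild a global average."""
    return sum(values), len(values)

def distributed_avg(slices):
    """Coordinator: dispatch the sub-queries in parallel, then combine the parts."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(node_partial_avg, slices))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count
```

The client asks one question and receives one answer, while the coordinator rewrites the query so that each node returns a sum and a count rather than its own average. Averaging the three per-node averages would give a different, incorrect result, because the slices hold different numbers of rows. Handling that kind of decomposition automatically is part of what the product does on the client's behalf.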

MPP shares qualities with both the sharding and map-reduce approaches to database management.

PDW provides more value than a raw MPP or map-reduce software implementation though. It is sold as

an appliance, such that compute, network and storage hardware are purchased together with the software.

PDW provides more evidence that if you seek specific capabilities of NoSQL, you may

find that the relational products you use today, or products from the same family, deliver those

capabilities to you, without the disruption that would come from migration to a new database platform.


Microsoft Research Dryad

Dryad is a Microsoft Research (MSR) project that implements a map-reduce style execution engine. Dryad

jobs consist of a series of programs that are connected by channels. The programs represent vertices, and

the channels represent edges. Together, these vertices and edges form a graph, and any such graph6, as

long as it is acyclic, can be executed by Dryad.

Like MapReduce or Hadoop, Dryad is an execution engine that manages jobs, processes input files and

produces output files. Dryad manages the execution of a graph’s vertices/programs across various nodes

in a compute cluster. Nodes may be physical machines, or cores within a machine. MSR explains that

Dryad subsumes map-reduce and also provides such infrastructural services as fault tolerance, re-

execution, scheduling, and accounting.
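The execution model can be approximated with a small sketch. This is not Dryad's API: `run_job` and its parameters are hypothetical, everything runs sequentially in one process, and the standard library's `graphlib` module (Python 3.9 or later) supplies the ordering and the acyclicity check. None of Dryad's fault tolerance, scheduling or distribution is modeled.

```python
from graphlib import CycleError, TopologicalSorter

def run_job(vertices, channels, source_input):
    """Run each vertex once all of its upstream vertices have produced output.

    vertices: name -> function taking a list of upstream outputs
    channels: list of (producer, consumer) edges
    """
    upstream = {name: [] for name in vertices}
    for producer, consumer in channels:
        upstream[consumer].append(producer)

    # Iterating this order raises CycleError if the channels form a cycle.
    order = TopologicalSorter(upstream).static_order()

    outputs = {}
    for name in order:
        # Vertices with no upstream channel read the job's source input.
        inputs = [outputs[p] for p in upstream[name]] or [source_input]
        outputs[name] = vertices[name](inputs)
    return outputs
```

A job in which one vertex feeds two others, both of which feed a final vertex, runs in an order that respects every channel. A pair of channels that point at each other is rejected with `CycleError` before any vertex runs, which mirrors the requirement that the graph be acyclic.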

Dryad is not a database, but it can coordinate the operations of multiple database servers. In fact,

Microsoft AdCenter uses Dryad to run multiple instances of SQL Server Integration Services (and SQL

Server RDBMS instances) for log processing.

Dryad is now available as a technology preview within the Windows HPC Server 2008 R2 high-

performance computing line. Furthermore, according to Microsoft Research, Dryad eventually will be

integrated with Microsoft SQL Server and Windows Azure. Dryad implements an execution model with

great affinity to the map-reduce approach so closely associated with NoSQL databases. It is therefore

crucial to the discussion of NoSQL computing in the Microsoft technology universe.

An enumeration of all the cloud and on-premise products and technologies discussed in this section is

presented in Figure 6.

6 Do not confuse Dryad’s graphs with those of Graph Databases. Though the vocabulary is quite similar,

the contexts are rather different.


Figure 6: These lists summarize the cloud and on-premise technologies from Microsoft which deliver genuine NoSQL

technology (e.g. Azure Table Storage) and/or features that NoSQL databases offer and which resonate with NoSQL

developers (like OData’s HTTP/REST APIs). We also enumerate the option of running open source NoSQL database

products in Azure compute instances, using Worker and VM Roles.

NoSQL Upsides, Downsides

We’ve already alluded to many of the relative pros and cons of dedicated NoSQL products and various

Azure technologies which, at the very least, nip away at the NoSQL feature list and deliver certain of their

advantages on an a la carte basis. Allusions are one thing, but it’s probably best that we work to

enumerate NoSQL’s upsides and downsides in a formal manner. By doing so, readers will be able to

evaluate their NoSQL needs in a no-nonsense fashion and then determine, given the Azure platform

capabilities, whether those needs necessitate use of dedicated NoSQL products.


Upsides

Lightweight, low-friction

Probably the most touted attribute of NoSQL database systems is their ease of provisioning, deployment

and integration into application code. Download, install, run a browser-based UI, create a new database,

and away you go. Since the products are open source, the licensing worries are reduced. Since there are

no schemas to declare with many NoSQL products, the database is ready as soon as you create it. And

since many NoSQL APIs are HTTP- and REST-based, and, for a number of NoSQL databases, a multitude

of client libraries for various programming environments are available, you can start coding quickly too.

Minimalist tool requirements

A number of NoSQL databases have browser-based UIs. After the product is installed, simply point your

browser at the server’s host name (or localhost, if you’re browsing on the server), a specific port and a

given virtual directory, and you may get a fully-functional UI in the browser for managing your databases,

and querying them too.

Sharding & Replication

Most NoSQL databases support the notion of sharding, which we have already discussed in the section on

SQL Azure Federation, above. Unlike SQL Azure though, the sharding facilities in most NoSQL databases

do support fan-out queries transparently. It seems reasonable that fan-out query capabilities will come to

SQL Azure in the future, but they’re not there now.

Many NoSQL databases also have simple replication facilities built in. In the relational world, replication

can be useful in branch office scenarios, but for the Web-centric focus of most NoSQL databases, it is

likely that geographic content distribution is more important. In other words, NoSQL database instances

can be created in various geographic regions, and then be configured for continuous replication such that

users can work against a database to which minimal network hops are required, with replication assuring

that each regional server gets data changes from the others.

Replication is also a disaster recovery tool, as the failure of a single replica can be addressed by the

swapping in of another. This is very important in both sharded and single-server implementations: in the

latter, the unitary server becomes a single point of failure; in the former, every single shard becomes a

point of failure as well. For this reason, sharding and replication are often used together.
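
The interplay of sharding, fan-out queries and replication can be sketched in a few lines of Python. This is an in-memory illustration under simplifying assumptions, not the mechanism of any particular product: keys are assigned to shards by hash, every replica is updated immediately on write, and the class and method names (`Shard`, `ShardedStore`, `fan_out`, `fail_over`) are hypothetical.

```python
import hashlib
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]


class Shard:
    """One partition of the data, kept on several replicas."""

    def __init__(self, name: str, replica_count: int = 2) -> None:
        self.name = name
        self.replicas: List[Dict[str, Document]] = [{} for _ in range(replica_count)]
        self.active = 0  # index of the replica currently serving reads

    def put(self, key: str, doc: Document) -> None:
        # Simplification: every replica is updated immediately on each write.
        for replica in self.replicas:
            replica[key] = doc

    def fail_over(self) -> None:
        # Simulate losing the active replica, then swap in the next one.
        self.replicas[self.active] = {}
        self.active = (self.active + 1) % len(self.replicas)

    def scan(self, predicate: Callable[[Document], bool]) -> List[Document]:
        return [d for d in self.replicas[self.active].values() if predicate(d)]


class ShardedStore:
    """Routes each key to one shard; queries fan out to all shards."""

    def __init__(self, shard_count: int = 3) -> None:
        self.shards = [Shard(f"shard-{i}") for i in range(shard_count)]

    def _shard_for(self, key: str) -> Shard:
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key: str, doc: Document) -> None:
        self._shard_for(key).put(key, doc)

    def fan_out(self, predicate: Callable[[Document], bool]) -> List[Document]:
        # The caller issues one query; the store visits every shard.
        results: List[Document] = []
        for shard in self.shards:
            results.extend(shard.scan(predicate))
        return results


store = ShardedStore(shard_count=3)
for i in range(10):
    store.put(f"user-{i}", {"id": i, "region": "EU" if i % 2 == 0 else "US"})

eu_before = store.fan_out(lambda d: d["region"] == "EU")

# Lose the active replica of every shard; the surviving replicas take over.
for shard in store.shards:
    shard.fail_over()

eu_after = store.fan_out(lambda d: d["region"] == "EU")
```

Without the second replica, the failure of any one shard would have removed part of the result set, which is why sharding and replication tend to be deployed together.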

Web Developer-Friendliness

Many Document Store databases use JavaScript Object Notation (JSON) as the internal storage format

and JavaScript as an internal scripting language. Therefore, writing an AJAX application against a

database in one of these products becomes much easier, as the objects in the application’s JavaScript

code can be directly written to, or read from, the database. This makes client-side (browser script-based)

data access code quite feasible and simple.
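
The absence of a mapping layer can be illustrated with a short round trip. The sketch below uses a Python dictionary as a stand-in for the JavaScript object discussed above; the field names, including `_id`, are illustrative assumptions rather than a required schema.

```python
import json

# A nested, schema-free document (illustrative field names).
note = {
    "_id": "note-42",
    "title": "Shopping list",
    "tags": ["home", "errands"],
    "author": {"name": "Ana", "followers": 12},
}

stored_text = json.dumps(note)    # the JSON text a Document Store would receive
loaded = json.loads(stored_text)  # the object an application would get back

round_trip_ok = loaded == note
first_tag = loaded["tags"][0]
```

The nested list and nested object survive the round trip unchanged, with no table design or object-relational mapping in between.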

Add to this the REST APIs used by most Document Store products, and the jQuery REST libraries available

to Web developers, and it becomes clear that the suitability of NoSQL products to JavaScript/jQuery-

based applications is high, with a reasonably low learning curve for many Web developers.

For certain NoSQL products, especially Document Stores, it seems almost a core design principle that the

databases function as an extension of JavaScript’s implementation of object orientation. While it would

probably be a stretch to call these NoSQL products object databases[7], that is a useful way to consider the

intent with which they are built, with respect to JavaScript developers and their code.

Cross-Platform, Cross-Device Operation

Most NoSQL database products run on multiple OSes and thus on multiple devices. Specifically, most of

them run on Windows clients and servers, as well as on Linux. Portability to Unix-like systems allows certain

of these products to run on Apple Mac OS, iOS and the Android operating system on phones and tablets[8].

For cloud computing though, the cloud servers are the host, and the only device compatibility that

becomes important is on the client side. And given the number of OData interfaces supported by Azure,

client compatibility with Microsoft’s cloud platform is quite high indeed.

Downsides

Having enumerated several facets of NoSQL databases that work out elegantly and advantageously, it’s

important to point out some of NoSQL products’ liabilities as well, especially with regard to

productivity and suitability to line-of-business application development.

Optimizations Have a Price

Usually in computing, an advantageous optimization for certain activities and patterns leads to less

functionality or flexibility in others. And with certain NoSQL databases, that is definitely the case.

Consider CouchDB, and its ability to read and write data very quickly, which in turn helps it facilitate the

Web scale capability which draws so many of its users to it.

On the write side, CouchDB can process things so quickly because the operation of writing to disk is in

fact deferred. Writes are buffered, which makes for better responsiveness, but leads to inconsistency in

the physical database in the short term and risk of data loss in the event of a crash or other outage before

the cache is committed to disk[9].
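
The trade-off of deferred writes can be sketched with a generic write-behind buffer. This is not CouchDB's storage engine; it is a simplified illustration, ignoring any log-based recovery, in which `BufferedStore`, its `flush_threshold` parameter and its `crash` method are hypothetical names.

```python
from typing import Any, Dict, List, Optional


class BufferedStore:
    """Acknowledges writes immediately; persists them only on flush."""

    def __init__(self, flush_threshold: int = 3) -> None:
        self.durable: Dict[str, Any] = {}  # stands in for data on disk
        self.buffer: Dict[str, Any] = {}   # pending writes held in memory
        self.flush_threshold = flush_threshold

    def write(self, key: str, value: Any) -> None:
        self.buffer[key] = value  # fast: no disk access on the write path
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        self.durable.update(self.buffer)
        self.buffer.clear()

    def read(self, key: str) -> Optional[Any]:
        if key in self.buffer:
            return self.buffer[key]
        return self.durable.get(key)

    def crash(self) -> List[str]:
        """Simulate an outage: anything still buffered is gone."""
        lost = list(self.buffer)
        self.buffer.clear()
        return lost


store = BufferedStore(flush_threshold=3)
for key, value in [("a", 1), ("b", 2), ("c", 3)]:
    store.write(key, value)  # the third write triggers a flush
store.write("d", 4)          # acknowledged, but only buffered

lost_keys = store.crash()
```

After the simulated crash, the three flushed writes are still readable while the fourth, although it was acknowledged to the caller, is not.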

[7] Recall that we had already drawn a parallel between Graph Databases and Object Databases. Here we do

so for Document Stores. As before, the distinctions between NoSQL categories are not cut and dried.

[8] At time of writing, CouchDB for Android is available as a developer alpha release.

[9] The lost data is recoverable from database log files. But the restore operation can prove inefficient.

On the read side, CouchDB cannot be queried in an ad hoc fashion at all. Instead, the database designer

must author a “view” containing JavaScript code that traverses CouchDB databases and returns a specific

result set. This requirement, of course, makes CouchDB less than suitable for ad hoc query activities, or

even for applications where the standard querying needs are in flux. The good news is that for

applications where the querying needs are well known and limited, CouchDB can work well, and the

overhead of a query optimizer need not impose itself. But for applications where requirements may shift

over time, capabilities are much more limited than with relational databases. This has some irony to it,

given the importance of schema flexibility (and thus accommodation of changing requirements) in NoSQL

databases overall.
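
The view requirement can be sketched as follows. Real CouchDB views are written in JavaScript and maintained by the server; this Python stand-in only mirrors the shape of the idea, in which a pre-authored map function emits key/value pairs for each document and queries are then answered from the resulting sorted index. The function names and sample documents are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Document = Dict[str, Any]
Row = Tuple[str, Any]


def build_view(docs: Iterable[Document],
               map_fn: Callable[[Document], Iterator[Row]]) -> List[Row]:
    """Run the map function over every document and sort the emitted rows by key."""
    rows: List[Row] = []
    for doc in docs:
        rows.extend(map_fn(doc))
    rows.sort(key=lambda row: row[0])
    return rows


def orders_by_customer(doc: Document) -> Iterator[Row]:
    """The pre-authored 'view': emit (customer, total) for order documents only."""
    if doc.get("type") == "order":
        yield (doc["customer"], doc["total"])


def query_view(view: List[Row], key: str) -> List[Any]:
    """Answer the one question the view was designed for."""
    return [value for row_key, value in view if row_key == key]


docs = [
    {"type": "order", "customer": "acme", "total": 120.0},
    {"type": "order", "customer": "globex", "total": 45.5},
    {"type": "order", "customer": "acme", "total": 80.0},
    {"type": "customer", "name": "acme", "region": "EU"},
]

view = build_view(docs, orders_by_customer)
acme_totals = query_view(view, "acme")
```

A question the view did not anticipate, such as finding all orders above a given total, cannot be answered from this index; it needs a new map function and a newly built view.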

Requirement to Query using a Procedural Language

A corollary to the above point on development of static views for querying is the procedural method by

which the code itself must traverse the database in order to produce its results. Instead of using the set-

based paradigm in SQL, NoSQL databases often must be traversed on a row-by-row (or document-by-

document, or entity-by-entity) basis. Each row/document/entity must be evaluated individually, and

declarative SQL operations like joins, which filter data more implicitly, are not available. What this does is

force a client-like data access model to be employed at the server which could, in turn, impair scalability

more than facilitate it.[10]

Of course, that statement really commingles two separate senses of the word “scalability.” For many Web

applications, scalability involves the elimination of latency in rather simple operations, such as pulling up

an individual note, writing out a status message, bringing up account settings for a specific customer, and

so forth. Another kind of scaling involves things such as efficient keyword searches over gigantic bodies

of data, limiting the value of specific fields to a certain range or aggregating numeric field values over a

large subset of data; this sense of scale is very important as well and procedural traversals do not often

enhance it.
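
The contrast between set-based and procedural querying discussed in this section can be sketched side by side. The example uses SQLite from Python's standard library purely as a convenient relational engine (it is not SQL Azure), and the tables, columns and sample rows are hypothetical. Both halves compute the same answer: total order value per customer in one region.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Acme', 'EU'), (2, 'Globex', 'US');
    INSERT INTO orders VALUES (1, 1, 120.0), (2, 2, 45.5), (3, 1, 80.0);
    """
)

# Set-based: state WHAT is wanted; the engine decides how to join and aggregate.
declarative = conn.execute(
    """
    SELECT c.name, SUM(o.total)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    WHERE c.region = 'EU'
    GROUP BY c.name
    """
).fetchall()

# Procedural: the code itself visits every record and performs the join by hand.
customers = [
    {"id": 1, "name": "Acme", "region": "EU"},
    {"id": 2, "name": "Globex", "region": "US"},
]
orders = [
    {"customer_id": 1, "total": 120.0},
    {"customer_id": 2, "total": 45.5},
    {"customer_id": 1, "total": 80.0},
]

totals = {}
for order in orders:
    for customer in customers:
        if customer["id"] == order["customer_id"] and customer["region"] == "EU":
            totals[customer["name"]] = totals.get(customer["name"], 0.0) + order["total"]

procedural = sorted(totals.items())
```

The procedural version evaluates every order against every customer, so its cost grows with the product of the two collection sizes unless the developer adds indexing by hand. In the declarative version that decision is left to the query optimizer.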

So perhaps it is unfair to say that, generally, procedural, row-wise data evaluation impairs scalability, since

notions of scale differ between classes of applications. But this assertion must hold true in the converse

as well, making it inaccurate to say, in a sweeping fashion, that NoSQL databases are more “scalable” or

“Web scale” than relational databases. The reality is that different applications have different needs,

different burdens and different points of stress (or failure). Scalability really is measured by the degree to

which these needs are met, burdens lifted and stresses reduced as the volume of data and/or user activity

grows, whether linearly or exponentially.

The best database for the job is just that: the best database for the job at hand. For some applications,

relational databases are not the optimal vehicle for storage and retrieval. For many others, NoSQL

databases would be quite inappropriate. So the most important question in evaluating options amongst

NoSQL databases, as well as evaluating the option of using them at all, hinges on the type of application

being written, the type of queries that must be expected and handled with relative ease, and the regularity

vs. variability of the data’s structure. That a certain type of database appears clumsy in certain situations

does not by itself render that type of database inappropriate if that situation is merely an edge case.

[10] SQL Server and SQL Azure provide this same data access option through cursors, but SQL Server

developers use cursors very sparingly to avoid the downside.

Necessity to Scale Manually

For various Web applications that are public facing, and whose data may be document-, user- or

message-oriented, NoSQL databases can work quite well. Their ability to stripe, replicate, cluster and

provide geographically distributed points of presence may form the perfect approach for the problem

space of these applications. The ad hoc, semi-federated nature of NoSQL clusters and replicas makes for

low-friction provisioning and helps assure that growth spurts in service usage and membership are

non-disruptive.

That said, there is still work involved, both in terms of resource monitoring and provisioning, that must be

done in order to meet these very demands. Meanwhile, a Platform as a Service cloud like Windows Azure,

with a data platform like SQL Azure to match, facilitates a more automated approach to both the

monitoring and provisioning which must be performed to make certain a site or application grows non-

disruptively. New Windows Azure Web and Worker roles can be spun up through clicks in the Azure

portal’s management interface, and they can be deactivated just as easily.

As a result, elasticity is achieved more laboriously with hosted NoSQL database applications. Replicas for

SQL Azure databases are created implicitly and the “cutover” from one replica to another is implicit as

well. The ramifications of this for NoSQL include extra effort and greater opportunity for error, which may

have a very real and measurable economic impact in labor costs and/or opportunity costs, as well as

greater risk exposure, to the companies building sites or providing services that use NoSQL databases[11].

Primitive Tooling

NoSQL databases are, in many cases, easier to get up and running than are relational databases. There’s

less up-front formality involved in terms of planning and design and, as a result, there’s a shorter distance

between concept and implementation. That’s exactly the kind of agility that growing companies and their

sites may need. There’s also far less complexity in tooling around these databases…simple, self-

explanatory browser-based management interfaces, straightforward REST programming interfaces and

conceptually simple key-value paradigms abound.

But tooling has its value, and that value tends to increase over time, when the imperative of raw

implementation has passed and need for smooth maintenance and troubleshooting becomes more

pronounced (and economically impactful). The design, diagnostic and operational monitoring capabilities

of SQL Server’s tools are significant, and have evolved over the roughly 20-year existence of the product.

These tools, including SQL Server Management Studio and its execution plan window, aid greatly in

preventing problems, and in solving them quickly when they do arise. NoSQL databases’ more minimalist

tooling approach leads to more manual and time-consuming management and troubleshooting than is

the case with SQL Azure (which is compatible with SQL Server’s tools), and may also make the process

more error prone. The cost impact of this can be significant.

[11] Some Web enterprises have large, dedicated technology staffs in place, who can handle this burden

well. But many corporate business units, and even IT departments, are not in that position.

Lack of ACID Transactional Capabilities in Some Products

Many NoSQL databases provide neither ACID guarantees nor support for large-scoped transactions. As

discussed previously, some products provide “eventual consistency” while others treat each database

operation as its own isolated transaction. This may be appropriate if the application need only provide

that level of reliability. For example, if social media status messages occasionally fail to post, users may

find it perfectly acceptable to discover the failure (by noticing the message never appears in a feed or

stream) and re-post the message. Furthermore, the occurrence of transactions that span more than a

single database operation may not be significant in certain apps. Note-taking applications must update

notes one at a time; blog posting is a simple operation; social networks may need to register a new

follower for a given user, and that’s a discrete operation. Unlike a financial system which may need to

execute a debit and credit as an atomic operation, many Web applications interact with data in a more

granular, minimalist way.

But for most corporate business applications, ACID guarantees are imperative. Debits and credits must

execute in an all-or-nothing fashion; ecommerce orders cannot be lost as customers will not be content to

recreate them from scratch. So, once again, the context of an application/service/site in large part

determines what defines standards of reliability and what determines whether certain advanced features

of a database are overkill or absolute necessities.
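
The all-or-nothing requirement for debits and credits can be sketched with a relational transaction. SQLite from Python's standard library stands in here for a relational engine such as SQL Azure; the `accounts` table, its `CHECK` constraint and the `transfer` helper are hypothetical and exist only to illustrate atomicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("checking", 100), ("savings", 50)],
)
conn.commit()


def transfer(conn: sqlite3.Connection, source: str, target: str, amount: int) -> bool:
    """Credit the target and debit the source as one atomic unit."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit would overdraw the account; the credit is undone as well.
        return False


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts"))


first_ok = transfer(conn, "checking", "savings", 30)
after_first = balances(conn)

second_ok = transfer(conn, "checking", "savings", 500)
after_second = balances(conn)
```

In the second transfer the credit succeeds and the debit fails, yet neither change is visible afterwards. In a store that treats each operation as its own isolated transaction, the application code would have to detect the failed debit and undo the credit itself.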

Conclusion: Relational’s Continued Indispensability in Line-of-Business

In this paper, we’ve investigated NoSQL’s general tenets. We have discussed each of its four major

subcategories: Key-Value Stores, Document Stores, Wide Column Stores and Graph Databases. We’ve

also reviewed the distributed nature of NoSQL databases, including the partitioning and replication

schemes many of them use. We have looked at NoSQL’s concurrency models and programming models,

and have explored the concepts around loosely schematized data. We reviewed MapReduce and

BigTable, and saw that they established a legacy that has influenced most, if not all, NoSQL products.

We also looked at Microsoft’s Azure cloud stack, including Windows Azure Table Storage, which is itself a

bona fide NoSQL database; various facets of SQL Azure; and support for OData in both Windows Azure

and SQL Azure. In doing so, we have seen how the Azure platform supports a full-on NoSQL approach as

well as the ability to implement various NoSQL features on an “a la carte” basis. Furthermore, we looked

at how Windows Azure Worker Roles and VM Roles support the installation and use of non-Microsoft

NoSQL databases, when and if nothing else will do. We digressed, slightly, to review the NoSQL qualities

of SQL Server’s “Beyond Relational” features and SQL Server Parallel Data Warehouse Edition; we briefly

discussed Dryad, Microsoft Research’s project providing map-reduce capabilities, and more.

We saw how NoSQL databases are suitable for data management that is light-duty but large-scale, and

how they work well for content management requirements of many stripes. We also saw, again and

again, that relational databases are best for line-of-business applications. The database consistency,

query optimization and set-based declarative query capability that relational databases have provided for

decades are still required by most LOB applications; this has not changed.

In business, data in a specific domain tends to be very regular and consistent in structure. For example,

most equities trades have the same fields, as do the counterparties involved in the trades. Most sales

invoices and line items in those invoices have consistent structure as well. When such regularity exists –

which is in fact quite often – relational databases work perfectly. Granted, they may need to be

appropriately scaled and tuned, but the overarching point is that the relational scheme is best in these

scenarios.

To understand the line-of-business versus structured storage distinction, it may be helpful to consider a

hypothetical large, online bookseller. This reseller likely keeps its catalog data in a NoSQL database. It

may do likewise with its Web content, reviews and perhaps even its reading lists. But in all likelihood, its

customer billing system, its inventory and supply chain systems, its publisher online inquiry systems and

its shipping application all use relational databases. We don’t know this for a fact about any one

bookseller, but the assumptions are nonetheless based on good rules of thumb for when and where each

type of database is best utilized.

The regular, consistent data scenario is the most common one in most corporate settings. Granted,

any number of outward-, consumer-facing Web applications, which are essentially content- and

relationship-driven, find a welcoming home in NoSQL structured stores.

So you must ask yourself: do I have irregularly schematized data, such that I need to use a NoSQL,

structured storage approach to storing and retrieving it? Try not to be led to a conclusion by fear (or

even guilt) over the issue of inflexibility. Just because schema-less databases let you store irregular data

doesn’t mean you’ll need that, and just because relational databases require you to go through steps that

can be disruptive in order to modify a table’s schema, doesn’t mean you’re somehow foolhardy for going

that route.

Consider a household analogy: if, as you build a house, you run wiring in conduit, external to your walls,

and surface-mount your fixtures, you’ll always be able to upgrade your wiring, or repair a wiring segment

gone bad. But if you know that the electrical, and maybe cable TV and computer network wiring to be

installed will suit your purposes for the long term, then it makes perfect sense to run your wiring in-wall.

You can always open the walls again if need be, and if you’re reasonably certain that you won’t need to,

then running the wiring internally is the right decision. It will look better to most people, make it easier to

push furniture against the wall and will, arguably, be somewhat safer. In general, your home will have a

more finished look to it. If one day your needs change and you need to open the walls again, that will not

necessarily mean you made a bad decision.

People should not let a relatively insignificant chance of disruption keep them from enjoying the

benefits of an otherwise sound choice. By the same token, customers should not let

the notion that their database schema may someday change force them into a decision of going with a

non-relational, loosely-schematized database.

As we have said, some applications by their nature manage data that is variable in structure, and NoSQL

databases may work very well for those applications. But if your app uses highly structured data – and

most line-of-business apps do – then why forgo the compatibility, data consistency, query optimization,

maturity, broad support and professional talent pool that a major relational database offers? You should

give that up only if the benefits of doing so outweigh the costs, and each such benefit should be

evaluated on a sober survey of likelihoods and risks.

But what if the “wires” in your “house” are changing a lot? What if you’ve got an app that manages a lot

of data that is ever-changing in structure and much of it functions as content on your Web site? Do you

need Cassandra or MongoDB or Neo4j on a hosted Linux server? Probably not. Azure tools like Azure

Table Storage, SQL Azure XML columns and OData may be viable options for your structured storage or

key-value retrieval needs. And if not, then running xcopy-deployable or silently-installable NoSQL

databases in Azure Worker Roles and Azure Drive, or running full blown NoSQL installations using Azure

VM Roles, may well work for you.

Hopefully this paper has made the choices clearer and your evaluation a more straightforward and

less “loaded” prospect. The Azure cloud provides for a spectrum of choice, rather than a single,

compulsory methodology. This provides flexibility and protection in a cost-effective, elastic computing

environment. And that’s really what “Web scale” should be all about.