CS 440 Database Management Systems

Lecture 14: NoSQL & NewSQL, Cont’d.

Some slides due to Magda Balazinska


Scaling by partitioning & replication

• Partition the data across machines
• Replicate the partitions
  – Good:
    • spreads read queries across replicas
  – Bad:
    • replicas must be kept consistent after write queries
  – Ugly:
    • difficult to scale transactions
      – two-phase commit is expensive
    • difficult to scale complex operations
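The partition-then-replicate idea above can be illustrated with a minimal sketch. The function names, the hash choice, and the "consecutive machines" placement rule are assumptions made for illustration only; real systems use more elaborate placement schemes.

```python
import hashlib

def partition_of(key, num_partitions):
    """Map a key to a partition using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

def replicas_of(partition, num_machines, copies):
    """Place `copies` replicas of a partition on consecutive machines (toy rule)."""
    return [(partition + i) % num_machines for i in range(copies)]

NUM_MACHINES = 4  # assumption: one partition per machine in this toy setup
for key in ["alice", "bob", "carol"]:
    p = partition_of(key, NUM_MACHINES)
    print(key, "-> partition", p, "replicated on machines",
          replicas_of(p, NUM_MACHINES, copies=2))
```

A read for a key can go to any machine in its replica list (the "good"), while a write has to reach all of them before the copies agree again (the "bad").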

NoSQL: Not Only SQL / Not Relational

• Goals
  – highly scalable data management system
  – flexible data model: records with different schemas
• They are willing to give up
  – Complex queries
    • e.g., no joins
  – ACID guarantees
    • weaker versions, e.g., eventual consistency
  – Multi-object transactions
• Not all NoSQL systems give up all of these properties

NoSQL key features

• Scale simple operations horizontally
  – key lookups, reads and writes of one record or a small number of records, simple selections
• Replicate/distribute data over many servers
• Simple call-level interface (contrast w/ SQL)
• Weaker concurrency model than ACID
• Efficient use of distributed indexes and RAM
• Flexible schema

Different types of NoSQL

Taxonomy based on the data models:
• Key-value stores
  – e.g., Dynamo, Project Voldemort, Memcached
• Document stores
  – e.g., SimpleDB, CouchDB, MongoDB
• Extensible record stores
  – e.g., Bigtable, HBase, Cassandra
• NewSQL: new type of RDBMSs
  – e.g., Megastore, VoltDB

Key-Value stores features

• Data model: (key, value) pairs
  – values are binary objects
  – no further schema
• Operations
  – insert, delete, and lookup operations on keys
  – no operation across multiple data items
• Consistency
  – replication with eventual consistency
    • e.g., vector clocks in Dynamo
  – goal: NEVER reject any writes (rejecting writes is bad for business!)
  – multiple versions, with conflict resolution during reads
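To make "multiple versions with conflict resolution during reads" concrete, here is a minimal sketch of vector clock comparison. It is a generic illustration, not Dynamo's implementation; the helper names and the node names X, Y, Z are hypothetical.

```python
def increment(clock, node):
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def descends(a, b):
    """True if version `a` has seen every update that version `b` has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    """Classify two versions by their vector clocks."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "concurrent"  # neither saw the other: the reader must reconcile

v1 = increment({}, "X")   # first write, handled by node X
v2 = increment(v1, "Y")   # update based on v1, handled by node Y
v3 = increment(v1, "Z")   # update based on v1, handled by node Z, unaware of v2

print(compare(v2, v1))  # a supersedes b
print(compare(v2, v3))  # concurrent
```

Because neither write is rejected, v2 and v3 both survive; a later read sees both versions and has to resolve the conflict.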

Key-Value stores features

• Use replication to provide fault tolerance
• Quorum replication in Dynamo
  – Each update creates a new version of an object
  – Vector clocks track causality between versions
  – Parameters:
    • N = number of copies (replicas) of each object
    • R = minimum number of nodes that must participate in a successful read
    • W = minimum number of nodes that must participate in a successful write
    • Quorum: R + W > N
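The reason for the condition R + W > N is that every read set then shares at least one replica with every write set, so a read always contacts some node holding the latest write. A small sketch that checks this by brute-force enumeration (function names are illustrative):

```python
from itertools import combinations

def is_quorum(n, r, w):
    """The quorum condition from the slide: R + W > N."""
    return r + w > n

def always_overlaps(n, r, w):
    """Enumerate all read sets and write sets and check that each pair shares a node."""
    nodes = range(n)
    return all(set(rs) & set(ws)
               for rs in combinations(nodes, r)
               for ws in combinations(nodes, w))

print(is_quorum(3, 2, 2), always_overlaps(3, 2, 2))  # True True
print(is_quorum(3, 1, 2), always_overlaps(3, 1, 2))  # False False
```

With N = 3, choosing R = 1 and W = 2 makes reads cheaper but allows a read to hit the one replica that missed the latest write.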

Key-Value stores internals

• Only primary index: lookup by key
  – No secondary indexes!
• Data remains in main memory
• Most systems also offer a persistence option
• Some offer ACID transactions, others do not
  – Multiversion concurrency control or locking

Multiversion Concurrency Control

• Idea: Let writers make a “new” copy while readers use an appropriate “old” copy:

[Figure: object O in the MAIN SEGMENT (current versions of DB objects), with older versions O’ and O’’ in the VERSION POOL (older versions that may be useful for some active readers).]

• Readers are always allowed to proceed.
  – But may be blocked until writer commits.

Multiversion CC (Contd.)

• Each version of an object has its writer’s TS as its WTS, and the TS of the Xact that most recently read this version as its RTS.

• Versions are chained backward; we can discard versions that are “too old to be of interest”.

• Each Xact is classified as Reader or Writer.
  – Writer may write some object; Reader never will.
  – Xact declares whether it is a Reader when it begins.

Reader Xact

• For each object to be read:
  – Finds newest version with WTS < TS(T). (Starts with current version in the main segment and chains backward through earlier versions.)
• Assuming that some version of every object exists from the beginning of time, Reader Xacts are never restarted.
  – However, might block until writer of the appropriate version commits.

[Figure: WTS timeline running from old to new, with T marked.]

Writer Xact

• To read an object, follows reader protocol.
• To write an object:
  – Finds newest version V s.t. WTS(V) < TS(T).
  – If RTS(V) < TS(T), T makes a copy CV of V, with a pointer to V, with WTS(CV) = TS(T), RTS(CV) = TS(T). (Write is buffered until T commits; other Xacts can see TS values but can’t read version CV.)
  – Else, reject write.

[Figure: WTS timeline running from old to new, showing T, version V with RTS(V), and the new copy CV.]
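The reader and writer protocols above can be sketched in a few lines. This is a simplified illustration, not a full implementation: it ignores commit buffering and blocking, assumes transaction timestamps are positive integers, and uses hypothetical class names (`Version`, `MVObject`).

```python
class Version:
    def __init__(self, value, wts, rts):
        self.value, self.wts, self.rts = value, wts, rts

class MVObject:
    def __init__(self, initial):
        # A version "from the beginning of time" so every reader finds one.
        self.versions = [Version(initial, wts=0, rts=0)]

    def _visible(self, ts):
        """Newest version V with WTS(V) < ts."""
        return max((v for v in self.versions if v.wts < ts), key=lambda v: v.wts)

    def read(self, ts):
        v = self._visible(ts)
        v.rts = max(v.rts, ts)  # remember the latest reader of this version
        return v.value

    def write(self, ts, value):
        v = self._visible(ts)
        if v.rts < ts:
            # Create the copy CV with WTS(CV) = RTS(CV) = TS(T).
            self.versions.append(Version(value, wts=ts, rts=ts))
            return True
        return False  # a later Xact already read V: reject the write

obj = MVObject("a")
print(obj.write(5, "b"))   # True: nothing newer than TS 5 has read the initial version
print(obj.read(10))        # b
print(obj.write(7, "c"))   # False: the version written at TS 5 was read at TS 10
print(obj.read(3))         # a  (old readers still see the old version)
```

The last line shows why readers never restart: a reader with an old timestamp simply chains back to a version that was current as of its start.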


Check out DynamoDB!

http://aws.amazon.com/dynamodb/

Taxonomy based on the data models:
• Key-value stores
  – e.g., Dynamo, Project Voldemort, Memcached
• Document stores
  – e.g., SimpleDB, CouchDB, MongoDB
• Extensible record stores
  – e.g., Bigtable, HBase, Cassandra
• NewSQL: new type of RDBMSs

Document stores

• A “document” is a pointer-less object
  – e.g., JSON
  – nested or not
  – schema-less
• They may have secondary indexes.
• Scalability
  – Replication (e.g., SimpleDB, CouchDB: the entire db is replicated)
  – Sharding (MongoDB)
  – Both
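As a rough illustration of schema-less documents with a secondary index, consider the toy in-memory store below. The class `DocumentStore` and its methods are hypothetical and do not correspond to the API of any of the systems named above; only top-level scalar fields are indexed.

```python
from collections import defaultdict

class DocumentStore:
    def __init__(self, indexed_fields):
        self.docs = {}  # primary index: doc_id -> document
        # One secondary index per field: value -> set of doc_ids
        self.indexes = {f: defaultdict(set) for f in indexed_fields}

    def delete(self, doc_id):
        old = self.docs.pop(doc_id, None)
        if old is not None:
            for field, index in self.indexes.items():
                if field in old:
                    index[old[field]].discard(doc_id)

    def put(self, doc_id, document):
        self.delete(doc_id)  # drop stale index entries on overwrite
        self.docs[doc_id] = document
        for field, index in self.indexes.items():
            if field in document:  # documents may lack the field entirely
                index[document[field]].add(doc_id)

    def find(self, field, value):
        """Look up documents through the secondary index on `field`."""
        return [self.docs[i] for i in sorted(self.indexes[field].get(value, ()))]

store = DocumentStore(indexed_fields=["city"])
store.put("u1", {"name": "Ann", "city": "Urbana", "address": {"zip": "61801"}})  # nested
store.put("u2", {"name": "Bob", "city": "Chicago"})
store.put("u3", {"name": "Cy"})  # no "city" field: still a valid document
print([d["name"] for d in store.find("city", "Urbana")])  # ['Ann']
```

The three documents have three different shapes, yet all can be stored and the indexed field can still be queried.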

Amazon SimpleDB (1/3)

• Partitioning
  – Data partitioned into domains: queries run within a domain
  – Domains seem to be the unit of replication. Limit: 10 GB
  – Can use domains to manually create parallelism
• Data model / schema
  – No fixed schema
  – Objects are defined with attribute-value pairs

Amazon SimpleDB (2/3)

• Indexing
  – Automatically indexes all attributes
• Support for writing
  – PUT and DELETE items in a domain
• Support for querying
  – GET by key
  – Selection + sort:

    SELECT output_list FROM domain_name [where expression] [sort_instructions] [limit limit]

  – A simple form of aggregation: count
  – Query is limited to 5 s and 1 MB of output (but can continue)

Amazon SimpleDB (3/3)

• Availability and consistency
  – Data is stored redundantly across multiple servers
  – Takes time for the update to propagate to all locations
    • Eventually consistent, but an immediate read might not show the change
  – Choose between consistent or eventually consistent read
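The difference between the two kinds of read can be seen in a toy simulation. This is only a sketch of the general idea, not how SimpleDB works internally; the class `ReplicatedItem`, the notion that replica 0 acknowledges writes first, and the explicit `propagate()` step are all assumptions for illustration.

```python
class ReplicatedItem:
    def __init__(self, value, num_replicas=3):
        self.replicas = [value] * num_replicas
        self.pending = []  # (replica_index, value) updates not yet applied

    def put(self, value):
        self.replicas[0] = value  # replica 0 acknowledges the write immediately
        self.pending = [(i, value) for i in range(1, len(self.replicas))]

    def propagate(self):
        """Deliver outstanding updates to the remaining replicas."""
        for i, value in self.pending:
            self.replicas[i] = value
        self.pending = []

    def read(self, replica=0, consistent=False):
        # A consistent read returns the latest acknowledged write;
        # an eventually consistent read returns whatever the chosen replica holds.
        return self.replicas[0] if consistent else self.replicas[replica]

item = ReplicatedItem("v1")
item.put("v2")
print(item.read(replica=2))        # v1  (update has not arrived yet)
print(item.read(consistent=True))  # v2
item.propagate()
print(item.read(replica=2))        # v2  (replicas have converged)
```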

• Extensible record stores
  – e.g., Bigtable, HBase, Cassandra


Extensible record stores

• Data model is rows and columns
• Typical access: Row ID, Column ID, Timestamp
• Scalability by splitting rows and columns over nodes
  – Rows: sharding on primary key
  – Columns: "column groups" = indication of which columns are to be stored together (e.g., customer name/address group, financial info group, login info group)

Google Bigtable

• Distributed storage system
• Designed to store structured data
• Scale to thousands of servers
• Store up to several hundred terabytes (maybe even petabytes)
• Perform backend bulk processing
• Perform real-time data serving
• To scale, Bigtable has a limited set of features

Bigtable data model

• Sparse, multidimensional sorted map

(row:string, column:string, time:int64) → string

Columns are grouped into families
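The map (row, column, time) → string can be mimicked with nested dictionaries. The sketch below is a toy model of the data model only, not of Bigtable's storage or API; the class `SparseTable` is hypothetical, and the row and column names are illustrative, with columns written as "family:qualifier".

```python
class SparseTable:
    def __init__(self):
        self.cells = {}  # (row, column) -> {timestamp: value}; absent cells cost nothing

    def put(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Newest value at or before `timestamp` (newest overall if None)."""
        versions = self.cells.get((row, column), {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None

    def scan(self, start_row, end_row):
        """Yield (row, column) keys in sorted row order, start inclusive, end exclusive."""
        for key in sorted(self.cells):
            if start_row <= key[0] < end_row:
                yield key

t = SparseTable()
t.put("com.cnn.www", "contents:", 3, "<html>crawl at t3")
t.put("com.cnn.www", "contents:", 5, "<html>crawl at t5")
t.put("com.example", "anchor:home", 1, "Example")
print(t.get("com.cnn.www", "contents:"))               # <html>crawl at t5
print(t.get("com.cnn.www", "contents:", timestamp=4))  # <html>crawl at t3
print(list(t.scan("com.a", "com.d")))                  # [('com.cnn.www', 'contents:')]
```

Keeping rows in sorted order is what makes range scans over neighboring row keys cheap.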

Bigtable key features

• Reads/writes of data under a single row key are atomic
  – Only single-row transactions!
• Data is stored in lexicographical order by row key
  – Improves data access locality
  – Horizontally partitioned into tablets
  – Tablets are the unit of distribution and load balancing
• Column families are the unit of access control
• Data is versioned (old versions garbage collected)
  – Ex: most recent three crawls of each page, with times

Bigtable API

• Data definition
  – Creating/deleting tables or column families
  – Changing access control rights
• Data manipulation
  – Writing or deleting values
  – Looking up values from individual rows
  – Iterating over a subset of the data in the table
• Can select on rows, columns, and timestamps

HBase

• Open source implementation of Bigtable
  http://hbase.apache.org/


Scalable RDBMS: NewSQL

• Means RDBMSs that offer sharding
• Key difference:
  – NoSQL systems make it difficult or impossible to perform large-scope operations and transactions (to ensure performance), while scalable RDBMSs do not preclude these operations; users pay a price only when they need them.
• Megastore, VoltDB, MySQL Cluster, Clustrix, ScaleDB

Megastore

• Implemented over Bigtable, used within Google
• Megastore is a layer on top of Bigtable
  – Transactions that span nodes
  – A database schema defined in a SQL-like language
  – Hierarchical paths that allow some limited joins
• Megastore is made available through the Google App Engine Datastore

VoltDB

• Main-memory RDBMS: no disk I/O, no buffer management!
• Sharded across a shared-nothing cluster
  – One transaction = one stored procedure
  – So both the data and processing are partitioned
• Transaction processing
  – SQL execution is single-threaded for each shard
  – Avoids all locking and latching overheads
• Synchronous multi-master replication for HA
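The "one transaction = one stored procedure, executed serially per shard" design can be sketched as follows. This is a conceptual toy, not VoltDB's API: real stored procedures are written and registered differently, and the names `Shard`, `Cluster`, `deposit`, and the routing by CRC32 are assumptions for illustration.

```python
import zlib

class Shard:
    """One partition: owns its data and runs procedures one at a time, so no locks."""
    def __init__(self):
        self.balances = {}

def deposit(shard, account, amount):
    """A 'stored procedure': the entire transaction is a single function call."""
    shard.balances[account] = shard.balances.get(account, 0) + amount
    return shard.balances[account]

class Cluster:
    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def run(self, procedure, partition_key, *args):
        # Route the whole transaction to the shard that owns the key.
        # In a real system each shard would execute on its own thread or node.
        index = zlib.crc32(partition_key.encode("utf-8")) % len(self.shards)
        return procedure(self.shards[index], partition_key, *args)

cluster = Cluster(num_shards=3)
print(cluster.run(deposit, "acct-1", 10))  # 10
print(cluster.run(deposit, "acct-1", 5))   # 15
```

Since every transaction touching "acct-1" runs to completion on the same shard before the next one starts, no two transactions ever interleave on that data, which is why locking and latching can be skipped.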

Application 1

• Web application that needs to display lots of customer information; the user data is rarely updated, and when it is, you know when it changes because updates go through the same interface.

Application 2

• Department of Motor Vehicles: look up objects by multiple fields (driver's name, license number, birth date, etc.); "eventual consistency" is OK, since updates are usually performed at a single location.

Application 3

• eBay-style application. Cluster customers by country; separate the rarely changed "core" customer information (address, email) from frequently updated info (current bids).


Application 4

• Everything else (e.g. a serious DMV application)

Criticism (from Stonebraker, CACM 2011)

• No ACID Equals No Interest in enterprises
  – Screwing up mission-critical data is a no-no-no
• Low-level Query Language is Death
  – Before SQL
• NoSQL means NoStandards
  – One (typical) large enterprise has 10,000 databases. These need accepted standards
