Introduction to NoSQL Databases - it.uu.se · Introduction to NoSQL Databases Tore Risch Information Technology. Uppsala University. 2015-12-06. DataBase Management Systems, DBMS

Introduction to NoSQL Databases

Tore Risch
Information Technology
Uppsala University
2015-12-06

DataBase Management Systems, DBMS

e.g. Oracle, MySQL, SQL Server

[Figure: Dynamic SQL Queries enter the DBMS box, which contains Query Processor, Data Manager and Meta-data; the DBMS accesses the Stored Data.]

Data managers, key/value stores

e.g. BerkeleyDB, Riak, Cassandra, DynamoDB

[Figure: Java/C++/JavaScript programs call the Data manager, which accesses the Stored Data.]
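As a rough illustration of what a data manager offers, the sketch below uses Python's standard dbm module, a small file-backed key/value store. It only stands in for the systems named above and is not the API of any of them; the file name and the keys are made up for the example.

```python
import dbm
import os
import tempfile

# A data manager offers put/get/delete by key and no query language.
path = os.path.join(tempfile.mkdtemp(), "kvstore")   # hypothetical location

with dbm.open(path, "c") as db:       # "c": create the store if it is missing
    db[b"user:1"] = b"Alice"          # put
    db[b"user:2"] = b"Bob"
    name = db[b"user:1"]              # get by exact key
    del db[b"user:2"]                 # delete
    remaining = sorted(db.keys())     # anything else is a program over the keys

print(name, remaining)
```

Anything beyond lookup by exact key, such as finding all users with a given name, has to be written as a program that iterates over the store.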

Evolution of DBMS technology

[Figure: timeline from 1960 to 2010, shown in three builds. Labels: Files; IMS; CODASYL; RDB; Object Stores (key/value); ORDB; Federated Databases; Web sources; Streaming data; DSMS; Distributed databases; Highly Distributed databases; MapReduce; BIG table; HIVE; Cloud databases; Google F1. The final build annotates the systems with the query language classes NoSQL, SQL, SQL+, SQL- and NewQL.]

Kinds of DBMS support

                    Simple data                                 Complex data
Query (SQL)         Relational DBMSs                            Object-Relational DBMSs
No Query (NoSQL)    File systems (scalable), Storage managers   Object Stores


Classification of Modern Database applications

[Figure: applications arranged by query support: Complex Queries (SQL, New QL), Simple Queries (SQL-) and No Queries (NoSQL). Applications shown: Business operations, Business analytics, Personal db; Text, data logs, Simple computations, Web documents; CAD system, Complex computations, Web surfing; Multimedia search, Custom Datatypes, Data streams, Web analytics; Web shop, E-business; Web search.]


Kind of Database support

[Figure: systems arranged by query support: Complex Queries (SQL, New QL), Simple Queries (SQL-) and No Queries (NoSQL). Systems shown: File systems, Content Mgmt. Syst.; Object Stores, Google BigTable, Yahoo Hadoop, Spark; Google App Engine (GQL); Amazon SimpleDB, Microsoft Azure; MongoDB, Couchbase; Relational DBMS; Object-Relational DBMS, Data Stream Mgmt. Syst., Google F1.]

What is a NoSQL Database?

• A key/value store
  Basic index manager, no complete query language
  – E.g. Google BigTable, Amazon Dynamo
• A web document database
  For web documents, not for small business transactions
  – E.g. MongoDB, CouchBase
• A DBMS with a limited query language
  Provides for high volume small business transactions
  – Sometimes called cloud databases
  – E.g. Google App Engine, Microsoft Azure, Amazon SimpleDB, MongoDB

What is a NoSQL Database?

• A DBMS where mapreduce is used instead of queries
  Manual programs to iterate over entire data sets
  – E.g. Hadoop, Spark, CouchDB, Dynamo
• A mapreduce engine with a query language on top:
  HIVE on top of Hadoop provides HiveQL
  – Provides non-procedural data analytics (select from groupby) without detailed programming
  – Executed in batch as parallel Hadoop jobs
• A DBMS with a new query language for new applications
  – Streambase, Virtuoso, Neo4J, Amos, Google F1
• Other non-relational databases
  – Including Object Stores

NoSQL Characteristics

• Highly distributed and parallel DBMS architecture
  – Typically runs on data centers
  – This is similar to parallel databases!
• Highly scalable DBMS by compromised consistency
  – No 2-phase commit as in distributed databases
  – Eventual consistency
    • Or perhaps never consistency
    • Similar isolation levels available in modern DBMSs too
  – Puts burden on programmer to handle consistency!
    • Race conditions
    • Recovery
    ⇒ Customized implementation of transactions
    WARNING: Consistency best handled in kernel!
  – Mainly suitable for applications not needing consistency
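The race-condition point can be made concrete with a small simulation. The sketch below uses a toy in-memory store (the class VersionedStore and its methods are hypothetical, not the API of any system mentioned here) to show how two clients doing read-modify-write without transactions lose an update, and how a conditional write on a version number pushes the consistency handling onto the programmer.

```python
class VersionedStore:
    """Toy in-memory key/value store that keeps a version per key."""

    def __init__(self):
        self._data = {}                      # key -> (version, value)

    def get(self, key):
        return self._data.get(key, (0, None))

    def put(self, key, value):
        """Blind write: the last writer wins."""
        version, _ = self.get(key)
        self._data[key] = (version + 1, value)

    def put_if_version(self, key, value, expected_version):
        """Conditional write: fails if someone else wrote in between."""
        version, _ = self.get(key)
        if version != expected_version:
            return False
        self._data[key] = (version + 1, value)
        return True


# Blind writes: both clients read the balance before either one writes.
store = VersionedStore()
store.put("balance", 100)
_, a = store.get("balance")
_, b = store.get("balance")
store.put("balance", a + 10)                 # client A deposits 10
store.put("balance", b + 20)                 # client B deposits 20
lost_update_balance = store.get("balance")[1]   # A's deposit is overwritten

# Same interleaving with conditional writes: client B has to retry.
store = VersionedStore()
store.put("balance", 100)
va, a = store.get("balance")
vb, b = store.get("balance")
ok_a = store.put_if_version("balance", a + 10, va)
ok_b = store.put_if_version("balance", b + 20, vb)   # rejected: stale version
if not ok_b:
    vb, b = store.get("balance")             # re-read, then try again
    ok_b = store.put_if_version("balance", b + 20, vb)
safe_balance = store.get("balance")[1]
```

The retry logic is exactly the kind of customized transaction code that a DBMS kernel would otherwise provide.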

NoSQL Characteristics

• New query languages for new applications
  – SQL- for cloud databases
    • Most simple web applications do not need full SQL
    • Simple SQL permits high scalability
    • Familiar model
    • Full SQL too complex for new systems
  – Graph query languages
    • SPARQL for RDF
      – RDF data model
      – For searching linked data (http://linkeddata.org/)
  – Stream query languages
    • CQL
      – Variant of SQL for streams
    • AmosQL (UU)
      – Functional parallel data stream query language

MapReduce

• Parallel batch processing using mapreduce, Hadoop, Spark
  – Highly scalable implementation of
    • parallel batch processing
    • of many instances of the same (e.g. Java) program
    • over large amounts of data stored in different files
    • Based on a scalable file system (e.g. HDFS)
• The mapreduce function:
  – Applies a (costly) user function (mapper) producing key/value pairs in parallel on many nodes accessing files in a cluster
  – Applies a user aggregate function (reducer) on the key/value pairs produced by the mapper
  – Very similar to GROUP BY in SQL
  – Read reference article on MapReduce

Mapreduce

[Figure: mapreduce data flow. Six data files are each processed by a Map task, the output of each Map task is partitioned, Reduce tasks consume the partitions, and an OutputWriter writes the result file. Data is exchanged through file I/O.]

Mapreduce code

function map(String name, String document):
  // name: document name, i.e. the HDFS file
  // document: document contents, parsed HDFS file tokens
  // Can make own parser as preprocessor
  for each word w in document:
    emit (w, 1)

function reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts for the word
  sum = 0
  for each pc in partialCounts:
    sum += ParseInt(pc)
  emit (word, sum)
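The pseudocode can be tried out on a single machine. The following is a minimal sketch in plain Python, not a Hadoop program: the driver map_reduce, the function names and the two sample documents are assumptions made for the example, and the grouping of values by key that a real framework does between the map and reduce phases is done here with a dictionary.

```python
from collections import defaultdict

def map_fn(name, document):
    """Emit (word, 1) for every word in the document."""
    for word in document.split():
        yield word, 1

def reduce_fn(word, partial_counts):
    """Sum the partial counts collected for one word."""
    return word, sum(partial_counts)

def map_reduce(documents):
    """Sequential stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

docs = {"a.txt": "to be or not to be", "b.txt": "to do"}   # made-up input
counts = map_reduce(docs)
print(counts)
```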

Mapreduce stages

• Input reader
  – System component that reads files from a scalable file system (e.g. HDFS) and sends them to map functions applied in parallel
• Map function
  – Applied in parallel on many different files
  – Parses input file data from HDFS
  – Does some (expensive) computation
  – Emits key/value pairs as result
  – Result stored by MapReduce system as file
• Partition function (optional)
  – Partitions output key/value pairs from the map function into groups of key/value pairs to be reduced in parallel
  – Usually hash partitioning
• Reduce function
  – Iterates over a set of key/value pairs to produce a reduced set of key/value pairs stored in the file system
  – C.f. aggregate functions
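The stages can be sketched end to end as follows. This is a single-process illustration under several assumptions: threads stand in for cluster nodes, an in-memory dictionary stands in for HDFS files, and all names (input_reader, partition, NUM_REDUCERS and so on) are hypothetical. zlib.crc32 is used for the hash partitioning so that the assignment of keys to reducers is the same on every run.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_REDUCERS = 2

def input_reader(files):
    """Yield (file name, contents); stands in for reading from HDFS."""
    yield from files.items()

def map_fn(item):
    """Parse one file and emit a (word, 1) pair per word."""
    name, text = item
    return [(word, 1) for word in text.split()]

def partition(key):
    """Hash partitioning: the same key always goes to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

def reduce_fn(pairs):
    """Aggregate the values of all pairs that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run(files):
    with ThreadPoolExecutor() as pool:
        map_outputs = list(pool.map(map_fn, input_reader(files)))
        buckets = [[] for _ in range(NUM_REDUCERS)]
        for pairs in map_outputs:
            for key, value in pairs:
                buckets[partition(key)].append((key, value))
        return list(pool.map(reduce_fn, buckets))

files = {"f1": "a b a", "f2": "b c", "f3": "a c c"}   # made-up input
per_reducer = run(files)
merged = {k: v for part in per_reducer for k, v in part.items()}
```

Because every key is routed to exactly one reducer, the per-reducer results are disjoint and can simply be concatenated.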

Wordcount in HiveQL

FROM (
  MAP docs.doctext USING 'python wc_mapper.py' AS (word, cnt)
  FROM docs
  CLUSTER BY word
) a
REDUCE word, cnt USING 'python wc_reduce.py';
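The contents of wc_mapper.py and wc_reduce.py are not shown in the slides. The sketch below is one plausible shape for them, written as two functions so that it can be run locally; it assumes the usual Hive streaming convention of one row per line with tab-separated columns, and it relies on CLUSTER BY word delivering all rows for a word to the same reducer in sorted order.

```python
from itertools import groupby

def wc_mapper(lines):
    """Each input line is one doctext value; emit 'word<TAB>1' per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def wc_reduce(lines):
    """Input lines are 'word<TAB>cnt', already grouped by word."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(rows, key=lambda row: row[0]):
        yield f"{word}\t{sum(int(cnt) for _, cnt in group)}"

# Local dry run; sorting stands in for CLUSTER BY word.
mapped = list(wc_mapper(["to be or not", "to be"]))
shuffled = sorted(mapped)
reduced = list(wc_reduce(shuffled))
print(reduced)
```

As standalone scripts, each function would be fed sys.stdin and its results printed line by line.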

Wordcount in SQL

• Alt 1, assume words in documents stored in table:
    CREATE TABLE docs(id INTEGER, word VARCHAR(20),
                      PRIMARY KEY(id))
  – The query becomes:
    SELECT d.word, COUNT(d.word)
    FROM docs d
    GROUP BY d.word
  – Problem: docs is a table, while the documents are stored in files => the data needs to be loaded first
• Alt 2, use user defined table function to access documents in file:
    SELECT d.word, COUNT(d.word)
    FROM mydocuments('C:/mydocuments') AS d
    GROUP BY d.word
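Alt 1 can be tried with Python's built-in sqlite3 module, used here only as a convenient stand-in for a relational DBMS. The sample words are made up, and an ORDER BY is added so that the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs(id INTEGER, word VARCHAR(20), PRIMARY KEY(id))"
)

# Loading step: one row per word occurrence, id is just a running number.
words = "to be or not to be".split()
conn.executemany(
    "INSERT INTO docs(id, word) VALUES (?, ?)", list(enumerate(words))
)

rows = conn.execute(
    "SELECT d.word, COUNT(d.word) FROM docs d GROUP BY d.word ORDER BY d.word"
).fetchall()
print(rows)
```

The explicit INSERT step is the loading cost mentioned under Alt 1.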

HIVEQL vs raw mapreduce

• Raw mapreduce:
  – Java (Python, C++, etc.) program does map and reduce
• Very common use of mapreduce:
  – Statistics collection over files (count, sum, stdev, etc.)
• HIVEQL handles most applications using basic statistics
  – >80% of applications
• When advanced statistics are not supported in HIVEQL (or SQL):
  • Alt 1: User defined aggregate functions in HIVE (UDAF)
    – Can be generally used in other queries too
  • Alt 2: Raw mapreduce
    – Code may be complicated
    – Code cannot be used in queries
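To show what a user defined aggregate function looks like, the sketch below registers a standard deviation aggregate in SQLite through Python's sqlite3 module. This is not Hive's UDAF interface; SQLite is used only because it runs anywhere, and the class Stdev, the simplified measures table and the sample values are assumptions made for the example.

```python
import math
import sqlite3

class Stdev:
    """Population standard deviation, computed incrementally."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def step(self, value):          # called once per input row
        self.n += 1
        self.total += value
        self.total_sq += value * value

    def finalize(self):             # called once per group
        if self.n == 0:
            return None
        mean = self.total / self.n
        return math.sqrt(max(self.total_sq / self.n - mean * mean, 0.0))

conn = sqlite3.connect(":memory:")
conn.create_aggregate("stdev", 1, Stdev)
conn.execute("CREATE TABLE measures(sensor TEXT, mv REAL)")
conn.executemany(
    "INSERT INTO measures VALUES (?, ?)",
    [("s1", 2.0), ("s1", 4.0), ("s2", 5.0), ("s2", 5.0)],
)
rows = conn.execute(
    "SELECT sensor, stdev(mv) FROM measures GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows)
```

Once registered, stdev can be combined with WHERE, GROUP BY and joins like any built-in aggregate, which is the reuse advantage of Alt 1.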

Reading raw data files to RDBMS

• Major point with mapreduce:
  – Saves the time needed to load a database and build indexes
• Modern DBMSs have bulk load facilities
  – Never use INSERT to bulk load!
  – At least do not commit after each insert!
    • Notice that MySQL commits after each SQL statement by default
  – Bulk load speed is comparable to file copy time
    • Orders of magnitude faster than naïve inserts
  – DBMS does automatic parallelization of the bulk loading
  – Binary and formatted bulk loading supported
  – Indexes are optional and can be built afterwards
    • Speeds up MySQL bulk loading considerably
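The two loading patterns can be contrasted in a few lines. The sketch uses SQLite through Python's sqlite3 module rather than MySQL or a dedicated bulk loader, so it illustrates only the commit and index advice, not real bulk load facilities; the table layout and the generated rows are made up. No timings are asserted, since an in-memory database hides most of the per-commit cost that a disk-backed database would pay.

```python
import sqlite3

ROWS = [(i, i * 0.5) for i in range(1000)]   # made-up measurements

def fresh_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measures(id INTEGER, mv REAL)")
    return conn

def load_naive(conn):
    """One INSERT and one COMMIT per row: the pattern to avoid."""
    for row in ROWS:
        conn.execute("INSERT INTO measures VALUES (?, ?)", row)
        conn.commit()

def load_batched(conn):
    """All rows in one transaction; the index is built afterwards."""
    with conn:                               # commits once on exit
        conn.executemany("INSERT INTO measures VALUES (?, ?)", ROWS)
    conn.execute("CREATE INDEX measures_mv ON measures(mv)")

naive, batched = fresh_db(), fresh_db()
load_naive(naive)
load_batched(batched)
count_naive = naive.execute("SELECT COUNT(*) FROM measures").fetchone()[0]
count_batched = batched.execute("SELECT COUNT(*) FROM measures").fetchone()[0]
```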

Relational Databases vs mapreduce

• Modern RDBMSs have user defined table functions (not MySQL) and user defined aggregate functions
  – Can read data from files using UDFs
  – User defined aggregate functions are incremental and can also be parallelized
• RDBMSs have indexing and high parallelism to provide scalability
  – Notice that it takes time to build an index during database loading
    => May slow down database loading considerably
• When is HIVE with mapreduce better than RDBs?
  – One shot queries when no indexing is needed
  – Massively parallel very expensive brute force computations
    • embarrassingly parallel computations

Performance of aggregate

• Schema: measures(sensor, bt, et, mv), index on mv
• SQL: SELECT COUNT(*) FROM measures WHERE mv > ?
• A single value is returned! Highly selective query.

• Sharded MongoDB (Mongo-AS) performs best, since each shard makes a parallel scan and sends a single result object to the coordinator (“embarrassingly parallel”).
• DB-C is fastest when very few records are aggregated over.
• Mongo-AS is slightly slower due to overhead in coordination among shards.

[Figure: two plots of execution time (s) versus selectivity (%) for Mongo-AS, Mongo, DB-C and DB-O. One plot covers selectivities from 0 to 5 % with times up to 16 s; the other covers 0 to 0.25 % with times up to 1 s.]

Source: http://www.it.uu.se/research/group/udbl/publ/IDEAS2015.pdf
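The reason a sharded COUNT parallelizes so well can be sketched as follows. This is a structural illustration only, not MongoDB code: Python lists stand in for shards, threads stand in for shard servers, and the function names and the generated mv values are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def make_shards(values, num_shards):
    """Round-robin distribution of measured values over shards."""
    return [values[i::num_shards] for i in range(num_shards)]

def shard_count(shard, threshold):
    """Local scan on one shard; only a single number leaves the shard."""
    return sum(1 for mv in shard if mv > threshold)

def sharded_count(shards, threshold):
    """Coordinator: run the local counts in parallel and add them up."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = list(pool.map(lambda s: shard_count(s, threshold), shards))
    return sum(partial), partial

values = list(range(100))            # made-up mv values 0..99
shards = make_shards(values, 4)
total, partial = sharded_count(shards, 89)
print(total, partial)
```

Each shard returns one integer regardless of how many rows it scans, so the coordinator's work stays constant as the data grows.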

Performance of range search

• SQL: SELECT * FROM measures WHERE mv > ?
• Many rows returned (non-selective query)

[Figure: two plots of execution time (s) with x-axes labeled "millions" (0 to 50 and 0 to 10). The first plot shows Mongo, Mongo-AS, DB-O and DB-C with times up to 4 000 s; the second shows Mongo-AS, Mongo and DB-C with times up to 250 s.]

• DB-O does not scale.
• DB-C switches from scanning the secondary index to a full table scan at 0.25% selectivity.
• The query optimizer of DB-C makes it the system with the most stable performance.
• Mongo-AS is slowest for non-selective queries.

Performance of Bulk Loading

• All systems offer scalable loading performance except DB-O and sharded MongoDB. MongoDB with weak consistency is fastest.
• Relaxing transactional overhead did not provide a significant performance improvement for any system (around 25%).

[Figure: execution time (hours, 0 to 40) versus number of measurements (millions, 0 to 1 000) for Mongo-AS-S, Mongo-AS, DB-O-S, DB-O, Mongo-S, Mongo, DB-C-S and DB-C.]
