7/31/2019 seminar topic nosql
July 11th, 2010
Agenda: Introduction
What is NoSQL?
What's wrong with RDBMS?
Why now?
Agenda: RDBMS vs. NoSQL
Scaling
CAP Theorem
ACID vs. BASE
Agenda: NoSQL Taxonomy
Key / Value
Column
Document
Graph
Agenda: How to choose?
Comparing Apples to Oranges
Polyglot Persistence
Introduction
Question: What do they all have in common?
Before we answer, some facts (July 2010, source: http://www.alexa.com):
Daily Page Views   Daily Visitors   Data size
7.8 x 10^9         620 x 10^6       Petabytes
7.1 x 10^9         500 x 10^6       Petabytes
550 x 10^6         56 x 10^6        Petabytes
350 x 10^6         37 x 10^6        Terabytes
82 x 10^6          12 x 10^6        Terabytes
Answer: They use NoSQL data stores
Why!?
Relational DBs Have Scaling Limitations
ACID doesn't scale well horizontally
Sharding breaks relations
Joins are inefficient
Transaction overhead
Schema is not flexible
  Predefined
  Hard to evolve
Why now?
NoSQL data stores predate RDBMS (1970)
  But remained a niche
  RDBMS became the most popular and generic option
Web 2.0 introduced new requirements:
  Exponential increase in data
  Information connectivity
  Semi-structured data
NoSQL data stores had answers
  When the time was right
  When RDBMSs didn't
It's theory time:
Scaling
Scaling Up
Adding resources to a single node in a system
  Add more CPUs or memory
  Move the system to a larger machine
Pros:
  Quick and simple
Cons:
  Outgrowing the capacity of the largest system available (Moore's law)
  Expensive
  Creates vendor lock-in
Scaling Out
Add more nodes to a system
Functional scaling (vertical)
  Grouping data by function and spreading functional groups across databases
Sharding (horizontal)
  Splitting the same functional data across multiple databases
Pros: More flexible
Cons: More complex
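The two scale-out styles can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions: the database names, the `FUNCTIONAL_GROUPS` table and the hash function are invented for the example and do not come from any particular product.

```javascript
// Functional scaling: each functional group lives in its own database.
// Group and database names are hypothetical.
const FUNCTIONAL_GROUPS = { orders: "orders_db", catalog: "catalog_db" };

// Sharding: the same functional data (users) is split across several databases.
const USER_SHARDS = ["users_db_0", "users_db_1", "users_db_2"];

// Simple deterministic string hash (djb2 variant), kept to 32 unsigned bits.
function hashKey(key) {
  let h = 5381;
  for (const ch of String(key)) {
    h = ((h * 33) ^ ch.codePointAt(0)) >>> 0;
  }
  return h;
}

// Pick the database that holds a record: by function first, then by shard.
function databaseFor(group, key) {
  if (group === "users") {
    return USER_SHARDS[hashKey(key) % USER_SHARDS.length];
  }
  const db = FUNCTIONAL_GROUPS[group];
  if (!db) throw new Error(`unknown functional group: ${group}`);
  return db;
}

console.log(databaseFor("orders", 42));     // orders_db
console.log(databaseFor("users", "john"));  // one of the users_db_N shards
```

Plain modulo routing moves most keys when the number of shards changes, which is one reason ring-based stores use consistent hashing instead.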
Distributed Databases
Many nodes
Same database
[Figure: Node 1, Node 2, Node 3]
What are the requirements from distributed databases?
Consistency
  All clients can see the same data
Availability
  All clients can always access data
Partition tolerance
  The ability to continue working when the network topology is broken
  The ability to recover once the network is healed
CAP Theorem (E. Brewer, N. Lynch)
You can fully satisfy at most 2 out of 3
  Compromise on the 3rd
Not all or nothing
  Choose various levels of consistency, availability or partition tolerance
Recognize which of the CAP rules your business needs for the task
CA: Consistency & Availability
Partition tolerance is compromised
  Single-site clusters (easier to ensure all nodes are always in contact)
  When a network partition occurs, the system blocks
  e.g. Two-Phase Commit (2PC)
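As a rough illustration of the Two-Phase Commit example, the sketch below runs the two phases over in-memory participants. The `makeParticipant` helper and its `prepare` / `commit` / `rollback` methods are hypothetical names chosen for this example. It does not model timeouts or network failure, which is exactly where a real coordinator would block.

```javascript
// Hypothetical participant: votes in phase 1, then commits or rolls back in phase 2.
function makeParticipant(name, canCommit) {
  return {
    name,
    state: "idle",
    prepare() {
      this.state = canCommit ? "prepared" : "refused";
      return canCommit;
    },
    commit() { this.state = "committed"; },
    rollback() { this.state = "rolled-back"; },
  };
}

function twoPhaseCommit(participants) {
  // Phase 1 (voting): every participant must answer before a decision is made.
  const votes = participants.map((p) => p.prepare());
  const allYes = votes.every(Boolean);

  // Phase 2 (completion): commit everywhere, or roll back everywhere.
  for (const p of participants) {
    if (allYes) p.commit();
    else p.rollback();
  }
  return allYes;
}

const nodes = [makeParticipant("node1", true), makeParticipant("node2", false)];
console.log(twoPhaseCommit(nodes), nodes.map((p) => p.state));
```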
CP: Consistency & Partition Tolerance
Availability is compromised
  Access to some data may be temporarily limited
  The rest is still consistent/accurate
  e.g. Sharded database
ACID vs. BASE
ACID: a quick recap
Atomicity
  When a part of the transaction fails, the entire transaction fails; database state is left unchanged
Consistency
  A transaction takes the database from one consistent state to another
Isolation
  A transaction can't see dirty state from other transactions
Durability
  Commit means commit.
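Atomicity is the easiest of the four properties to show in miniature. The sketch below is an in-memory toy, not a database engine: `runAtomically`, `debit` and `credit` are hypothetical helpers, and the "database" is a plain object. Operations run against a private copy, so a failure part-way through leaves the original state untouched.

```javascript
// Apply all operations or none: work on a copy and only hand it back on success.
function runAtomically(state, operations) {
  const draft = JSON.parse(JSON.stringify(state)); // private working copy
  try {
    for (const op of operations) op(draft);
  } catch (err) {
    return { committed: false, state }; // original state is left unchanged
  }
  return { committed: true, state: draft };
}

const debit = (name, amount) => (s) => {
  if (s[name] < amount) throw new Error("insufficient funds");
  s[name] -= amount;
};
const credit = (name, amount) => (s) => {
  s[name] += amount;
};

const accounts = { alice: 100, bob: 20 };
const ok = runAtomically(accounts, [debit("alice", 30), credit("bob", 30)]);
const failed = runAtomically(accounts, [credit("bob", 500), debit("alice", 500)]);
console.log(ok, failed);
```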
BASE
The CAP complement of ACID
Just had to be called BASE
Backronym:
  Basically Available
  Soft state
  Eventually consistent
RDBMS & ACID / NoSQL & BASE
RDBMSs strive to provide ACID guarantees
  ACID forces consistency
NoSQL solutions often scale through BASE
  BASE accepts that conflicts will happen
Taxonomy
Key / Value
Column
Document
Graph
Key / Value Databases
Key/Value e.g.: Riak
No single point of failure
  No machines are special or central
MapReduce queries (Erlang / JavaScript)
HTTP/JSON API
Ring cluster with automatic replication
Elastic / partition rebalancing
Written in: Erlang, C, JavaScript
Developed by: Basho Technologies
Java client: (jonjlee / riak-java-client)
Key/Value e.g.: Riak - Data Model
Key / Value pairs are stored in a Bucket
  A Bucket ~ a namespace
Versioning
  Each update is tracked by a Vector Clock
    An algorithm for determining ordering and detecting conflicts
  When in conflict
    Last wins / manual resolution
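To make "determining ordering and detecting conflicts" concrete, here is a minimal vector clock sketch. It is a generic textbook version written for this example, not Riak's internal representation: a clock is assumed to be a plain object mapping a node id to an update counter, and `increment` and `compare` are hypothetical helper names.

```javascript
// Record one more update made through the given node.
function increment(clock, nodeId) {
  return { ...clock, [nodeId]: (clock[nodeId] || 0) + 1 };
}

// Relation of clock a to clock b: "before", "after", "equal" or "conflict".
function compare(a, b) {
  let aAhead = false;
  let bAhead = false;
  for (const node of new Set([...Object.keys(a), ...Object.keys(b)])) {
    const av = a[node] || 0;
    const bv = b[node] || 0;
    if (av > bv) aAhead = true;
    if (bv > av) bAhead = true;
  }
  if (aAhead && bAhead) return "conflict"; // concurrent updates, neither descends from the other
  if (aAhead) return "after";
  if (bAhead) return "before";
  return "equal";
}

const base = increment({}, "node1");        // first write
const viaNode1 = increment(base, "node1");  // update seen by node1
const viaNode2 = increment(base, "node2");  // concurrent update seen by node2

console.log(compare(viaNode1, base));       // "after"
console.log(compare(viaNode1, viaNode2));   // "conflict"
```

A "conflict" result is the case where the store has to fall back on last-wins or hand both versions to the application for manual resolution.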
Key/Value e.g.: Riak - Example: REST API
Read an object:                              GET /riak/bucket/key
Store a new object:                          POST /riak/bucket
Store an object with existing key (update):  PUT /riak/bucket/key
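The three calls map naturally onto a small request builder. The sketch below only constructs request descriptions and sends nothing; `riakRequest` is a hypothetical helper, and the host and port in `BASE_URL` are assumptions for a local test node. Only the `/riak/bucket/key` path shapes come from the slide.

```javascript
const BASE_URL = "http://localhost:8098"; // assumed host and port, adjust as needed

// Build (but do not send) the HTTP request for one of the three operations.
function riakRequest(action, bucket, key, value) {
  const bucketUrl = `${BASE_URL}/riak/${encodeURIComponent(bucket)}`;
  switch (action) {
    case "read":
      return { method: "GET", url: `${bucketUrl}/${encodeURIComponent(key)}` };
    case "create": // no key: the store is expected to assign one
      return { method: "POST", url: bucketUrl, body: JSON.stringify(value) };
    case "update":
      return {
        method: "PUT",
        url: `${bucketUrl}/${encodeURIComponent(key)}`,
        body: JSON.stringify(value),
      };
    default:
      throw new Error(`unsupported action: ${action}`);
  }
}

console.log(riakRequest("read", "users", "john"));
console.log(riakRequest("update", "users", "john", { role: "admin" }));
```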
Key/Value e.g.: Riak - MapReduce
A framework supporting distributed computing on large data sets on clusters of machines
  Leverage parallel processing power
Introduced by Google
Inspired by map / reduce functions in functional programming
  Map step
  Reduce step
Key/Value e.g.: Riak - MapReduce example: Inverted Index
Map
  Parse each document
  Emit a sequence of <word, document ID> pairs
[Figure: Node1, Node2 and Node3 processing documents 1, 2 and 3]
Key/Value e.g.: Riak - MapReduce example: Inverted Index
Reduce
  Accept all pairs for a given word
  Sort the corresponding document IDs
  Emit a <word, list(document ID)> pair
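The inverted index example can be run end to end on a single machine. The sketch below is a plain JavaScript walk-through of the map, group and reduce steps, not Riak's MapReduce API. The three sample documents and the helper names (`mapDocument`, `groupByWord`, `reduceWord`) are made up for illustration.

```javascript
// Hypothetical input: document ID -> text.
const documents = {
  1: "nosql data stores scale",
  2: "relational data stores",
  3: "nosql scale out",
};

// Map: parse one document and emit [word, docId] pairs.
function mapDocument(docId, text) {
  return text
    .toLowerCase()
    .split(/\W+/)
    .filter(Boolean)
    .map((word) => [word, docId]);
}

// Shuffle: collect all pairs that share a word (done by the framework in a real cluster).
function groupByWord(pairs) {
  const groups = new Map();
  for (const [word, docId] of pairs) {
    if (!groups.has(word)) groups.set(word, []);
    groups.get(word).push(docId);
  }
  return groups;
}

// Reduce: sort the document IDs for one word and emit [word, list of docIds].
function reduceWord(word, docIds) {
  return [word, [...new Set(docIds)].sort((a, b) => a - b)];
}

const pairs = Object.entries(documents).flatMap(([id, text]) =>
  mapDocument(Number(id), text)
);
const invertedIndex = new Map(
  [...groupByWord(pairs)].map(([word, ids]) => reduceWord(word, ids))
);

console.log(invertedIndex.get("nosql")); // [ 1, 3 ]
console.log(invertedIndex.get("data"));  // [ 1, 2 ]
```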
BigTable and Column-Oriented Databases
Column Stores: BigTable derivatives
Conceptually a single, infinitely large table
Each row can have a different number of columns
Table is sparse: |rows| * |columns| > |values|
Based on Google's BigTable paper
E.g.
  Cassandra
  HBase
  Hypertable
Use Case: Manage products with diverse attributes
RDBMS:
  Create a central table with common attributes
  Create a table per product with unique attributes
  Use a join query
  Alternatively create a table that holds metadata on products
NoSQL:
  Column-oriented database
  Use arbitrary columns
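A small sketch shows what "arbitrary columns" means for the product use case and why the resulting table is sparse. The product rows and attribute names are invented for illustration; the count at the end checks the |rows| * |columns| > |values| relation on this sample.

```javascript
// Hypothetical product rows: each row carries only the columns it needs.
const products = new Map([
  ["book-1", { title: "NoSQL Basics", author: "J. Doe", pages: 320 }],
  ["tv-7", { title: "42in TV", screenInches: 42, resolution: "1080p" }],
]);

// Count distinct column names and stored values across all rows.
function sparsity(rows) {
  const columns = new Set();
  let values = 0;
  for (const row of rows.values()) {
    for (const name of Object.keys(row)) {
      columns.add(name);
      values += 1;
    }
  }
  return { rows: rows.size, columns: columns.size, values };
}

const stats = sparsity(products);
console.log(stats); // { rows: 2, columns: 5, values: 6 }
console.log(stats.rows * stats.columns > stats.values); // true: most cells are empty
```

A relational table with one column per attribute would store the four empty cells as NULLs; here they simply do not exist.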
Column Store e.g.: Cassandra
Data model: Google's BigTable
Infrastructure: Amazon Dynamo
Incremental scalability
Flexible schema
No single point of failure (Distributed P2P)
Optimistic replication (Gossip protocol)
Written in: Java
Developed by: Facebook
Java client: e.g. Hector / Thrift
Column e.g.: Cassandra - Data Model
Column
  Smallest increment of data: a tuple of name, value, timestamp
{
  name: "emailAddress",
  value: "[email protected]",
  timestamp: 123456789
}
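The timestamp in the tuple is what lets replicas settle on one value: when two versions of the same column disagree, the one with the newer timestamp wins. The sketch below illustrates that rule only. `makeColumn` and `reconcile` are hypothetical helpers and the e-mail addresses are placeholders.

```javascript
// A column is just a (name, value, timestamp) tuple.
function makeColumn(name, value, timestamp) {
  return { name, value, timestamp };
}

// Given two versions of the same column, keep the most recently written one.
function reconcile(a, b) {
  if (a.name !== b.name) {
    throw new Error("can only reconcile versions of the same column");
  }
  return b.timestamp > a.timestamp ? b : a;
}

const older = makeColumn("emailAddress", "old@example.com", 123456789);
const newer = makeColumn("emailAddress", "new@example.com", 123456999);

console.log(reconcile(older, newer).value); // new@example.com
console.log(reconcile(newer, older).value); // new@example.com (order does not matter)
```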
Column e.g.: Cassandra - SuperColumn
SuperColumn
  A sorted, associative, unbounded array of columns
{ // this is a SuperColumn
  name: "homeAddress",
  // with an unbounded array of Columns
  value: {
    // the key is the name of the Column
    street: {name: "street", value: "s", timestamp:...},
    city: {name: "city", value: "c", timestamp:...},
    zip: {name: "zip", value: "z", timestamp:...}
  }
}
Column e.g.: Cassandra - ColumnFamily
ColumnFamily
  A container (~Table) for columns sorted by their names
  Rows in a ColumnFamily are referenced and sorted by row keys
Users = { // ColumnFamily
  john: { // key to row in CF
    "role" : "admin",
    "status" : "offline",
    "nick" : "dude1934"
  }, // end row
  fred: { // another row
    "nick" : "freddy",
    "email" : "[email protected]",
    "age" : "25",
    "gender" : "male",
  }, // more rows
}
Column e.g.: Cassandra - Keyspace
Keyspace
  The outermost grouping of data (~DB Schema)
  Contains ColumnFamilies
  There is no imposed relationship between ColumnFamilies
Column e.g.: Cassandra - Example
[Figure: a Keyspace containing a Tweets CF and a Timeline CF]
Document Oriented Databases
Document Store
Store semi-structured documents (think JSON)
Document versioning
Map/Reduce based queries, sorting, aggregation, etc.
DB is aware of internal structure
E.g.
  MongoDB
  CouchDB
  JackRabbit (JCR, JSR 170)
Use Case: Blog with tagged posts and comments
RDBMS:
  Table for each: posts, comments, tags
  Foreign relations
NoSQL:
  Document storage
  Store post + tags + comments as a document
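The difference between the two approaches shows up when a post is read. In the sketch below, three arrays stand in for the relational tables and `assemblePost` (a hypothetical helper) plays the role of the join. Its result is the single nested object that a document store would keep as one unit. All sample rows are invented.

```javascript
// Relational layout: one "table" per entity, linked by postId.
const postsTable = [{ id: 3, author: "john", title: "Apples, Oranges and NOSQL" }];
const tagsTable = [
  { postId: 3, tag: "database" },
  { postId: 3, tag: "nosql" },
];
const commentsTable = [
  { postId: 3, text: "Comment 1" },
  { postId: 3, text: "Comment 2" },
];

// The "join": gather everything that refers to the post and nest it.
function assemblePost(postId) {
  const post = postsTable.find((p) => p.id === postId);
  if (!post) return null;
  return {
    ...post,
    tags: tagsTable.filter((t) => t.postId === postId).map((t) => t.tag),
    comments: commentsTable
      .filter((c) => c.postId === postId)
      .map((c) => ({ text: c.text })),
  };
}

// Document layout: this nested object is what gets stored and read back in one step.
const postDocument = assemblePost(3);
console.log(JSON.stringify(postDocument, null, 2));
```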
Document Store e.g.: MongoDB
MongoDB (from "humongous")
Manages collections of JSON-like documents (BSON)
Queries can return specific fields of documents
Supports secondary indexes
Atomic operations on single documents
Developed by: 10gen
Written in: C++
Clients: Java, Scala and more
Document e.g.: MongoDB - Example: Blog posts
Suppose you host a blog, where each post is tagged:
db.posts.save({
  _id : 3,
  author : "john",
  title : "Apples, Oranges and NOSQL",
  text : "This article will",
  tags : ["database", "nosql"]
});
Notice how posts have an array of tags
Document e.g.: MongoDB - Indexes and queries
MongoDB supports secondary indexes and a query optimizer
Compound indexes are also supported
db.posts.ensureIndex({ tags: 1 });
db.posts.ensureIndex({ author: 1 });
db.posts.find({ author: "john", tags: "nosql" });
// Result:
{
  "_id" : 3,
  "author" : "john",
  "title" : "Apples, Oranges and NOSQL",
  "text" : "This article will",
  "tags" : ["database", "nosql", "mongodb"]
}
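One detail of the query is easy to miss: `tags: "nosql"` matches a document whose `tags` field is an array containing that value. The sketch below imitates just that matching rule in plain JavaScript over an in-memory array. It is a simplified stand-in, not MongoDB's query engine, and `matches` is a hypothetical helper.

```javascript
const posts = [
  { _id: 3, author: "john", tags: ["database", "nosql", "mongodb"] },
  { _id: 4, author: "john", tags: ["sql"] },
  { _id: 5, author: "fred", tags: ["nosql"] },
];

// Equality match per field; for array fields, match if any element equals the wanted value.
function matches(doc, query) {
  return Object.entries(query).every(([field, wanted]) => {
    const actual = doc[field];
    return Array.isArray(actual) ? actual.includes(wanted) : actual === wanted;
  });
}

const found = posts.filter((p) => matches(p, { author: "john", tags: "nosql" }));
console.log(found.map((p) => p._id)); // [ 3 ]
```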
Document e.g.: MongoDB - Updates
Let's update our posts to include some comments:
db.posts.update({ _id: 3 }, {
  $inc: { comments_count: 4 },
  $pushAll: {
    comments: [
      { text: "Comment 1" },
      { text: "Comment 2", author: "Mr. T" },
      { text: "Comment 3" },
      { text: "Comment 4" }
    ]
  }
});
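To see what the two operators do to the stored document, the sketch below applies them to a plain object. `applyUpdate` is a hypothetical helper that imitates only `$inc` and `$pushAll` as used in the example above. Note that newer MongoDB releases replace `$pushAll` with `$push` combined with `$each`.

```javascript
// Apply $inc and $pushAll to a copy of the document and return the result.
function applyUpdate(doc, update) {
  const result = JSON.parse(JSON.stringify(doc));
  // $inc: add the amount to the field, treating a missing field as 0.
  for (const [field, amount] of Object.entries(update.$inc || {})) {
    result[field] = (result[field] || 0) + amount;
  }
  // $pushAll: append every listed item to the array field, creating it if missing.
  for (const [field, items] of Object.entries(update.$pushAll || {})) {
    result[field] = (result[field] || []).concat(items);
  }
  return result;
}

const before = { _id: 3, author: "john", tags: ["database", "nosql"] };
const after = applyUpdate(before, {
  $inc: { comments_count: 4 },
  $pushAll: {
    comments: [
      { text: "Comment 1" },
      { text: "Comment 2", author: "Mr. T" },
      { text: "Comment 3" },
      { text: "Comment 4" },
    ],
  },
});

console.log(after.comments_count, after.comments.length); // 4 4
```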
Graph Databases
Graph databases
Inspired by mathematical graph theory: G = (V, E)
Models the structure of data
Navigational data model
Scalability / data complexity
Data model: Key-Value pairs on Edges / Nodes
Relationships: Edges between Nodes
E.g.
  Neo4j
  Pregel (Google's PageRank)
  AllegroGraph
Use Case: Connected data - deep relationship links between users in a social network
RDBMS:
  Complex recursive algorithm
  Multiple self joins
  Round trips to DB / bulk read and resolve in RAM
NoSQL:
  Graph storage
  Network traversal
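"Network traversal" for this use case usually means a breadth-first walk from one user out to a fixed depth. The sketch below does that over an in-memory adjacency list. The sample users and the `withinDepth` helper are made up for illustration, and it is not the API of Neo4j or any other graph database.

```javascript
// Hypothetical social graph: user -> users they are linked to.
const follows = {
  alice: ["bob", "carol"],
  bob: ["dave"],
  carol: ["dave", "erin"],
  dave: ["frank"],
  erin: [],
  frank: [],
};

// Breadth-first traversal: every user reachable within maxDepth hops, with hop count.
function withinDepth(graph, start, maxDepth) {
  const depthOf = new Map([[start, 0]]);
  const queue = [start];
  while (queue.length > 0) {
    const node = queue.shift();
    const depth = depthOf.get(node);
    if (depth === maxDepth) continue; // do not expand past the depth limit
    for (const next of graph[node] || []) {
      if (!depthOf.has(next)) {
        depthOf.set(next, depth + 1);
        queue.push(next);
      }
    }
  }
  depthOf.delete(start); // the starting user is not part of the result
  return depthOf;
}

const network = withinDepth(follows, "alice", 2);
console.log([...network]); // bob and carol at depth 1, dave and erin at depth 2
```

Each extra level of depth here is one more loop iteration. In the relational version it is one more self join or one more round trip.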
Graph e.g.: Neo4j
http://neo4j.org/
Comparing Apples to Oranges
Comparing Data Structures
RDBMS
  Databases contain tables, columns and rows
  All rows have the same structure
  Inherent ORM mismatch
NoSQL
  Choose your data structure
  Data is stored in its natural structure (e.g. Documents, Graphs, Objects)
Comparing Schema Flexibility
RDBMS
  Strict schema, difficult to evolve
  Maintains relations and forces data integrity
NoSQL
  Structure of data can be changed dynamically
    e.g. Column stores: Cassandra
  Data can sometimes be completely opaque
    e.g. Key/Value: Project Voldemort
Comparing Normalization & Relations
RDBMS
  The data model is normalized to remove data duplication
  Normalization establishes table relations
NoSQL
  Denormalization is not a dirty word
  Relations are not explicitly defined
  Related data is usually grouped and stored as one unit
    E.g. document, column
Comparing Data Access
RDBMS
  CRUD operations using SQL
  Access data from multiple tables using SQL joins
  Generic API such as JDBC
NoSQL
  Proprietary APIs and DSLs (e.g. Pig / Hive / Gremlin)
  MapReduce, graph traversals
  REST APIs, portable serialization formats
    BSON, JSON, Apache Thrift, Memcached
Comparing Reporting Capabilities
RDBMS
  Slice and dice data, then reassemble any way you like
NoSQL
  Hard to repurpose data for ad-hoc usage
  Plan ahead
  Think in advance
    How and what you store
    Data access patterns
Summary
Why NOSQL / BASE
ACID ruled exclusively in the last 40 years
  Doesn't compromise on consistency
Database industry neglected distributed DBs w/ availability
Vacuum was filled with NoSQL BASE architectures
  Strict A and P, minimize C compromise
Relational databases are now trying to catch up
NoSQL Limitations
Missing some query capabilities
  Joins / composite transactions
Eventual consistency: not for every problem
Not a drop-in replacement for RDBMS on ACID
No standardization -> product lock-in
Relatively immature (support, bugs, community)
Summary
Relational databases and NoSQL databases are designed to meet different needs
RDBMS-only should not be a default
NOSQL databases outperform RDBMSs in their particular niche
No one size fits all / no silver bullet
  ...but you don't have to choose one
Choose the right tool for the job
Polyglot Persistence
Poly: many; Glot: language
Meshing up persistence mechanisms to best meet requirements
Good integration stories:
  E.g. Neo4j + JDBC using JTA