seminar topic nosql


  • 7/31/2019 seminar topic nosql


    July 11th, 2010


    Agenda

    Introduction
      What is NoSQL?
      What's wrong with RDBMS?
      Why now?
    Scaling
    CAP Theorem
    ACID vs. BASE
    RDBMS vs. NoSQL
    NoSQL Taxonomy
      Key / Value
      Column
      Document
      Graph
    Comparing Apples to Oranges
    Polyglot Persistence
    How to choose?


    Introduction


    Question: What do they all have in common?

    Before we answer, some facts:

    Daily Page Views    Daily Visitors    Data size
    7.8 x 10^9          620 x 10^6        Petabytes
    7.1 x 10^9          500 x 10^6        Petabytes
    550 x 10^6          56 x 10^6         Petabytes
    350 x 10^6          37 x 10^6         Terabytes
    82 x 10^6           12 x 10^6         Terabytes

    July, 2010: http://www.alexa.com


    Answer: They use NoSQL data stores

    Why!?

    Relational DBs Have Scaling Limitations

    ACID doesn't scale well horizontally
      Sharding breaks relations
      Joins are inefficient
      Transaction overhead
    Schema is not flexible
      Predefined
      Hard to evolve


    Why now?

    NoSQL data stores predate RDBMS (1970)
      But remained a niche
      RDBMS was the most popular and generic option
    Web 2.0 introduced new requirements:
      Exponential increase in data
      Information connectivity
      Semi-structured data
    NoSQL data stores had answers
      When the time was right
      When RDBMSs didn't

    It's theory time:


    Scaling

    Scaling Up

    Adding resources to a single node in a system
      Add more CPUs or memory
      Move the system to a larger machine
    Pros:
      Quick and simple
    Cons:
      Outgrowing the capacity of the largest system available (Moore's law)
      Expensive
      Creates vendor lock-in

    Scaling Out

    Add more nodes to a system
      Functional scaling (vertical)
        Grouping data by function and spreading functional groups across databases
      Sharding (horizontal)
        Splitting the same functional data across multiple databases
    Pros: More flexible
    Cons: More complex
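As a rough illustration of sharding, the sketch below routes each key to one of several databases by hashing it. The shard names and the `shard_for` helper are hypothetical and not taken from the slides; production systems often prefer consistent hashing so that adding a shard moves fewer keys.

```python
import hashlib

# Hypothetical database names, used only for illustration.
SHARDS = ["db-0", "db-1", "db-2"]


def shard_for(key, shards=SHARDS):
    """Pick a shard from a stable hash of the key.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and would route keys inconsistently.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


users = ["alice", "bob", "carol", "dave"]
placement = {user: shard_for(user) for user in users}
```

Every client that uses the same function and shard list sends a given user to the same database, which is what makes horizontal splitting workable.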

    Distributed Databases

    Many nodes
    Same database

    [Diagram: Node 1, Node 2, Node 3]

    What are the requirements from distributed databases?

    Consistency
      All clients can see the same data
    Availability
      All clients can always access data
    Partition tolerance
      The ability to continue working when the network topology is broken
      The ability to recover once the network is healed

    CAP Theorem (E. Brewer, N. Lynch)

    You can fully satisfy at most 2 out of 3
      Compromise on the 3rd
    Not all or nothing
      Choose various levels of consistency, availability or partition tolerance
    Recognize which of the CAP rules your business needs for the task

    CA: Consistency & Availability

    Partition tolerance is compromised
      Single-site clusters (easier to ensure all nodes are always in contact)
      When a network partition occurs, the system blocks
      e.g. Two Phase Commit (2PC)

    CP: Consistency & Partition Tolerance

    Availability is compromised
      Access to some data may be temporarily limited
      The rest is still consistent/accurate
      e.g. Sharded database
      TBD sample


    ACID vs. BASE

    ACID: a quick recap

    Atomicity
      When a part of the transaction fails, the entire transaction fails;
      the database state is left unchanged
    Consistency
      A transaction takes the database from one consistent state to another
    Isolation
      A transaction can't see dirty state from other transactions
    Durability
      Commit means commit.
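To make atomicity concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the names and the `transfer` helper are made up for illustration. The second statement of the transfer violates a constraint, so the first statement is rolled back too.

```python
import sqlite3

# In-memory database with a hypothetical accounts table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.execute("INSERT INTO accounts VALUES ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction; a failure undoes both updates."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Alice only has 100, so the debit breaks the CHECK constraint.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer both balances are unchanged, even though the credit to `bob` had already been executed inside the transaction.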

    BASE

    The CAP complement of ACID
    Just had to be called BASE
    Backronym:
      Basically Available
      Soft State
      Eventually Consistent

    RDBMS & ACID / NoSQL & BASE

    RDBMSs strive to provide ACID guarantees
      ACID forces consistency
    NoSQL solutions often scale through BASE
      BASE accepts that conflicts will happen
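One way to picture "eventually consistent" is a toy simulation of two replicas that accept writes independently and reconcile later. The `Replica` class and its last-write-wins rule are assumptions made for this sketch; real stores use a range of reconciliation strategies.

```python
class Replica:
    """Toy replica holding key -> (timestamp, value); illustrative only."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, timestamp):
        # Keep the entry with the newest timestamp (last write wins).
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        """Pull the other replica's entries; newer timestamps win."""
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)


a, b = Replica(), Replica()
a.write("status", "online", timestamp=1)
b.write("status", "offline", timestamp=2)  # later write on another node

stale = a.read("status")  # replica a has not seen the newer write yet

# Once the replicas exchange state, they converge on the same value.
a.sync_from(b)
b.sync_from(a)
```

Between the write and the sync, the two replicas give different answers; that window is the consistency BASE gives up in exchange for availability.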


    Taxonomy

    Key / Value
    Column
    Graph
    Document


    Key / Value Databases

    Key/Value e.g.: Riak

    No single point of failure
      No machines are special or central
    MapReduce queries (Erlang / JavaScript)
    HTTP/JSON API
    Ring cluster with automatic replication
    Elastic / partition rebalancing
    Written in: Erlang, C, JavaScript
    Developed by: Basho Technologies
    Java client: (jonjlee / riak-java-client)

    Key/Value e.g.: Riak
    Data Model

    Key / Value pairs are stored in a Bucket
      A Bucket ~ a namespace
    Versioning
      Each update is tracked by a Vector Clock
        An algorithm for determining ordering and detecting conflicts
      When in conflict
        Last wins / manual resolution
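The idea behind a vector clock can be sketched in a few lines. This is a generic illustration, not Riak's implementation; the `increment` and `compare` helpers and the node names are hypothetical.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a, b):
    """Return 'equal', 'before', 'after' or 'conflict' for clocks a, b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"   # a happened before b
    if b_le_a:
        return "after"    # a happened after b
    return "conflict"     # concurrent updates: neither descends from the other


v1 = increment({}, "node1")   # first write, handled by node1
v2 = increment(v1, "node2")   # update that descends from v1
v3 = increment(v1, "node3")   # another update from v1, concurrent with v2
```

Comparing `v2` with `v3` yields a conflict, which is the case where the store falls back to "last wins" or hands both versions to the application.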

    Key/Value e.g.: Riak
    Example: REST API

    Read an object
      GET /riak/bucket/key
    Store a new object
      POST /riak/bucket
    Store an object with existing key (update)
      PUT /riak/bucket/key
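The three calls above could be prepared from a client roughly as follows. This sketch only builds the HTTP requests with the standard library and does not send them. The base URL, the `users` bucket, the JSON payloads and the helper names are assumptions, not part of the slides.

```python
import json
from urllib.request import Request

# Assumed node address; adjust to wherever the store actually listens.
BASE_URL = "http://localhost:8098"
JSON_HEADERS = {"Content-Type": "application/json"}


def read_object(bucket, key):
    """GET /riak/bucket/key"""
    return Request(f"{BASE_URL}/riak/{bucket}/{key}", method="GET")


def store_new_object(bucket, value):
    """POST /riak/bucket (no key is given in the URL)"""
    body = json.dumps(value).encode("utf-8")
    return Request(f"{BASE_URL}/riak/{bucket}", data=body,
                   headers=JSON_HEADERS, method="POST")


def update_object(bucket, key, value):
    """PUT /riak/bucket/key (store under an existing key)"""
    body = json.dumps(value).encode("utf-8")
    return Request(f"{BASE_URL}/riak/{bucket}/{key}", data=body,
                   headers=JSON_HEADERS, method="PUT")


requests = [
    read_object("users", "john"),
    store_new_object("users", {"nick": "freddy"}),
    update_object("users", "john", {"status": "offline"}),
]
summary = [(r.get_method(), r.selector) for r in requests]
```

Passing any of these objects to `urllib.request.urlopen` would perform the call against a running node.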

    Key/Value e.g.: Riak
    MapReduce

    A framework supporting distributed computing on large data sets on clusters of machines
      Leverages parallel processing power
    Introduced by Google
    Inspired by map / reduce functions in functional programming
      Map step
      Reduce step

    Key/Value e.g.: Riak
    MapReduce example: Inverted Index

    Map
      Parse each document
      Emit a sequence of <word, document ID> pairs

    [Diagram: documents 1, 2 and 3 mapped on Node1, Node2 and Node3]

    Key/Value e.g.: Riak
    MapReduce example: Inverted Index

    Reduce
      Accept all pairs for a given word
      Sort the corresponding document IDs
      Emit a <word, list(document ID)> pair
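The map and reduce steps of the inverted-index example can be sketched on a single machine. The three sample documents are invented for illustration, and the grouping loop stands in for the shuffle that a real framework performs between nodes.

```python
from collections import defaultdict

# Hypothetical documents keyed by document ID.
documents = {
    1: "apples and oranges",
    2: "oranges and nosql",
    3: "nosql apples",
}


def map_step(doc_id, text):
    """Parse one document and emit (word, document ID) pairs."""
    for word in text.split():
        yield (word, doc_id)


def reduce_step(word, doc_ids):
    """Accept all IDs for one word and emit (word, sorted ID list)."""
    return (word, sorted(set(doc_ids)))


# Group the mapped pairs by word (the "shuffle" phase).
grouped = defaultdict(list)
for doc_id, text in documents.items():
    for word, emitted_id in map_step(doc_id, text):
        grouped[word].append(emitted_id)

inverted_index = dict(reduce_step(word, ids) for word, ids in grouped.items())
```

Because each `map_step` call touches only one document and each `reduce_step` call touches only one word, both phases can be spread across nodes.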

    BigTable and Column Oriented Databases

    Column Stores: BigTable derivatives

    Conceptually a single, infinitely large table
    Each row can have a different number of columns
    Table is sparse: |rows| * |columns| > |values|
    Based on Google's BigTable paper
    E.g.
      Cassandra
      HBase
      Hypertable

    Use Case: Manage products with diverse attributes

    RDBMS:
      Create a central table with common attributes
      Create a table per product with unique attributes
      Use a join query
      Alternatively, create a table that holds metadata on products
    NoSQL:
      Column oriented database
      Use arbitrary columns
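The product use case can be pictured with plain dictionaries standing in for rows with arbitrary columns. The product keys and attributes are invented for this sketch. It also shows why such a table is sparse: far fewer values are stored than a full rows-by-columns grid would hold.

```python
# Each row stores only the columns that apply to that product.
products = {
    "book-1": {
        "name": "NoSQL Basics", "price": "30",
        "author": "J. Doe", "pages": "250",
    },
    "shirt-7": {
        "name": "T-Shirt", "price": "15",
        "size": "L", "color": "blue",
    },
}

# Every column name that appears in at least one row.
all_columns = sorted({column for row in products.values() for column in row})

stored_values = sum(len(row) for row in products.values())
grid_cells = len(products) * len(all_columns)  # |rows| * |columns|
```

Adding a new kind of product with new attributes needs no schema change; the row simply carries different column names.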

    Column Store e.g.: Cassandra

    Data model: Google's BigTable
    Infrastructure: Amazon Dynamo
    Incremental scalability
    Flexible schema
    No single point of failure (Distributed P2P)
    Optimistic replication (Gossip protocol)
    Written in: Java
    Developed by: Facebook
    Java client: e.g. Hector / Thrift

    Column e.g.: Cassandra
    Data Model

    Column
      Smallest increment of data: tuple of name, value, timestamp

    {
      name: "emailAddress",
      value: "[email protected]",
      timestamp: 123456789
    }

    Column e.g.: Cassandra

    SuperColumn
      A sorted, associative, unbounded array of columns

    { // this is a SuperColumn
      name: "homeAddress",
      // with an unbounded array of Columns
      value: {
        // the key is the name of the Column
        street: {name: "street", value: "s", timestamp:...},
        city: {name: "city", value: "c", timestamp:...},
        zip: {name: "zip", value: "z", timestamp:...}
      }
    }

    Column e.g.: Cassandra

    ColumnFamily
      A container (~Table) for columns sorted by their names
      Column Families are referenced and sorted by row keys

    Users = { // ColumnFamily
      john: { // key to row in CF
        "role" : "admin",
        "status" : "offline",
        "nick" : "dude1934"
      }, // end row
      fred: { // another row
        "nick" : "freddy",
        "email" : "[email protected]",
        "age" : "25",
        "gender" : "male",
      }, // more rows
    }

    Column e.g.: Cassandra

    Keyspace
      The outermost grouping of data (~DB Schema)
      Contains ColumnFamilies
      There is no imposed relationship between ColumnFamilies
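The nesting of Keyspace, ColumnFamily, row key and Column can be modelled with ordinary dictionaries. This is only a mental model of the data layout; the `insert` and `get_row` helpers are hypothetical and are not Cassandra's client API.

```python
import time


def column(name, value, timestamp=None):
    """Smallest unit of data: a (name, value, timestamp) tuple as a dict."""
    if timestamp is None:
        timestamp = int(time.time())
    return {"name": name, "value": value, "timestamp": timestamp}


# Keyspace -> ColumnFamily -> row key -> column name -> Column
keyspace = {"Users": {}, "Tweets": {}}


def insert(keyspace, column_family, row_key, name, value, timestamp=None):
    """Add or overwrite one column in a row, creating the row if needed."""
    row = keyspace[column_family].setdefault(row_key, {})
    row[name] = column(name, value, timestamp)


def get_row(keyspace, column_family, row_key):
    """Return the row's values with columns ordered by column name."""
    row = keyspace[column_family].get(row_key, {})
    return {name: row[name]["value"] for name in sorted(row)}


insert(keyspace, "Users", "john", "role", "admin", timestamp=1)
insert(keyspace, "Users", "john", "status", "offline", timestamp=2)
insert(keyspace, "Users", "john", "nick", "dude1934", timestamp=3)
john = get_row(keyspace, "Users", "john")
```

Nothing links the `Users` and `Tweets` families except the application code that reads both, which mirrors the "no imposed relationship" point.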

    Column e.g.: Cassandra
    Example

    [Diagram: a Keyspace containing the Tweets CF and the Timeline CF]


    Document Oriented Databases

    Document Store

    Store semi-structured documents (think JSON)
    Document versioning
    Map/Reduce based queries, sorting, aggregation, etc.
    DB is aware of internal structure
    E.g.
      MongoDB
      CouchDB
      JackRabbit (JCR JSR 170)

    Use Case: Blog with tagged posts and comments

    RDBMS:
      Table for each: posts, comments, tags
      Foreign relations
    NoSQL:
      Document storage
      Store post + tags + comments as a document

    Document Store e.g.: MongoDB

    MongoDB (from "humongous")
    Manages collections of JSON-like documents (BSON)
    Queries can return specific fields of documents
    Supports secondary indexes
    Atomic operations on single documents
    Developed by: 10gen
    Written in: C++
    Clients: Java, Scala and more

    Document e.g.: MongoDB
    Example: Blog posts

    Suppose you host a blog, where each post is tagged:

    db.posts.save({
      _id : 3,
      author : "john",
      title : "Apples, Oranges and NOSQL",
      text : "This article will",
      tags : ["database", "nosql"]
    });

    Notice how posts have an array of tags

    Document e.g.: MongoDB

    MongoDB supports secondary indexes and a query optimizer
    Compound indexes are also supported

    db.posts.ensureIndex({ tags: 1 });
    db.posts.ensureIndex({ author: 1 });
    db.posts.find({ author: "john", tags: "nosql" });

    // Result:
    {
      "_id" : 3,
      "author" : "john",
      "title" : "Apples, Oranges and NOSQL",
      "text" : "This article will",
      "tags" : ["database", "nosql", "mongodb" ]
    }
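The query above matches a scalar (`"nosql"`) against an array field (`tags`). The in-memory sketch below mimics that matching rule for simple equality queries only. The `matches` and `find` helpers and the second post are made up for illustration; this is not how MongoDB is implemented and it ignores indexes entirely.

```python
posts = [
    {"_id": 3, "author": "john", "title": "Apples, Oranges and NOSQL",
     "tags": ["database", "nosql"]},
    # Hypothetical second post, added so the filter has something to exclude.
    {"_id": 4, "author": "jane", "title": "Graphs", "tags": ["nosql"]},
]


def matches(document, query):
    """True if every queried field equals the value or, for arrays, contains it."""
    for field, wanted in query.items():
        actual = document.get(field)
        if isinstance(actual, list):
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True


def find(collection, query):
    """Return all documents in the collection that satisfy the query."""
    return [doc for doc in collection if matches(doc, query)]


result = find(posts, {"author": "john", "tags": "nosql"})
```

A real server would answer the same query through the `tags` and `author` indexes rather than by scanning every document.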

    Document e.g.: MongoDB

    Let's update our posts to include some comments:

    db.posts.update({ _id: 3 }, {
      $inc: { comments_count: 4 },
      $pushAll: {
        comments: [
          { text: "Comment 1" },
          { text: "Comment 2", author: "Mr. T" },
          { text: "Comment 3" },
          { text: "Comment 4" }
        ]
      }
    });
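To show what the two update operators do to the stored document, here is a small in-memory imitation. The `apply_update` helper is hypothetical and covers only `$inc` and `$pushAll` as used above; it says nothing about how the server applies them atomically.

```python
def apply_update(document, update):
    """Apply a tiny subset of update operators to a dict, in place."""
    # $inc: add to a numeric field, treating a missing field as 0.
    for field, amount in update.get("$inc", {}).items():
        document[field] = document.get(field, 0) + amount
    # $pushAll: append every listed item to an array field.
    for field, items in update.get("$pushAll", {}).items():
        document.setdefault(field, []).extend(items)
    return document


post = {"_id": 3, "author": "john", "tags": ["database", "nosql"]}

apply_update(post, {
    "$inc": {"comments_count": 4},
    "$pushAll": {
        "comments": [
            {"text": "Comment 1"},
            {"text": "Comment 2", "author": "Mr. T"},
            {"text": "Comment 3"},
            {"text": "Comment 4"},
        ]
    },
})
```

The post, its tags and its comments end up in one document, which is the denormalized shape the blog use case calls for.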

    Graph Databases

    Graph databases

    Inspired by mathematical graph theory: G=(E,V)
    Models the structure of data
    Navigational data model
    Scalability / data complexity
    Data model: Key-Value pairs on Edges / Nodes
    Relationships: Edges between Nodes
    E.g.
      Neo4j
      Pregel (Google's PageRank)
      AllegroGraph

    Use Case: Connected data - deep relationship links between users in a social network

    RDBMS:
      Complex recursive algorithm
      Multiple self joins
      Round trips to DB / bulk read and resolve in RAM
    NoSQL:
      Graph storage
      Network traversal
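A network traversal of the kind described here can be sketched as a breadth-first search over an adjacency list. The social graph and the `within_depth` helper are invented for illustration; a graph database runs this kind of walk natively instead of through repeated self joins.

```python
from collections import deque

# Hypothetical friendship graph: user -> list of direct friends.
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}


def within_depth(graph, start, max_depth):
    """Breadth-first traversal: {user: hops} for users within max_depth."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if distances[user] == max_depth:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(user, []):
            if friend not in distances:
                distances[friend] = distances[user] + 1
                queue.append(friend)
    del distances[start]  # the starting user is not their own friend
    return distances


friends_of_friends = within_depth(friends, "ann", 2)
```

Raising `max_depth` only changes one argument, whereas the relational version needs another self join per extra level.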


    Graph e.g.: Neo4j

    http://neo4j.org/

    Comparing Apples to Oranges

    Comparing Data Structures

    RDBMS
      Databases contain tables, columns and rows
      All rows have the same structure
      Inherent ORM mismatch
    NoSQL
      Choose your data structure
      Data is stored in its natural structure (e.g. Documents, Graphs, Objects)

    Comparing Schema Flexibility

    RDBMS
      Strict schema, difficult to evolve
      Maintains relations and forces data integrity
    NoSQL
      Structure of data can be changed dynamically
        e.g. Column stores: Cassandra
      Data can sometimes be completely opaque
        e.g. Key/Value: Project Voldemort

    Comparing Normalization & Relations

    RDBMS
      The data model is normalized to remove data duplication
      Normalization establishes table relations
    NoSQL
      Denormalization is not a dirty word
      Relations are not explicitly defined
      Related data is usually grouped and stored as one unit
        e.g. document, column

    Comparing Data Access

    RDBMS
      CRUD operations using SQL
      Access data from multiple tables using SQL joins
      Generic API such as JDBC
    NoSQL
      Proprietary APIs and DSLs (e.g. Pig / Hive / Gremlin)
      MapReduce, graph traversals
      REST APIs, portable serialization formats
        BSON, JSON, Apache Thrift, Memcached

    Comparing Reporting Capabilities

    RDBMS
      Slice and dice data, then reassemble any way you like
    NoSQL
      Hard to repurpose data for ad-hoc usage
      Plan ahead
        Think in advance
        How and what you store
        Data access patterns

    Summary

    Why NOSQL / BASE

    ACID ruled exclusively in the last 40 years
      Doesn't compromise on consistency
    Database industry neglected distributed DBs w/ availability
    Vacuum was filled with NoSQL BASE architectures
      Strict A and P, minimize C compromise
    Relational databases are now trying to catch up

    NoSQL Limitations

    Missing some query capabilities
      Joins / composite transactions
    Eventual consistency: not for every problem
    Not a drop-in replacement for RDBMS on ACID
    No standardization -> product lock-in
    Relatively immature (support, bugs, community)

    Choose the right tool for the job

    Relational databases and NoSQL databases are designed to meet different needs
    RDBMS-only should not be a default
    NOSQL databases outperform RDBMSs in their particular niche
    No one size fits all / silver bullet
    ...but you don't have to choose one

    Polyglot Persistence

    Poly: many
    Glot: language
    Meshing up persistence mechanisms to best meet requirements
    Good integration stories:
      E.g. Neo4j + JDBC using JTA