  • Big Data

    Gianluca Quercini, Stéphane Vialle

    Laboratoire de Recherche en Informatique, CentraleSupélec

    2020 – 2021

  • More on SQL Vs NoSQL

    More on SQL Vs NoSQL

    G. Quercini, S. Vialle (CentraleSupélec) Big Data 2020 – 2021 1 / 54

  • More on SQL Vs NoSQL Introduction

    Limitations of the Relational Model

    Relational DBMSs have two major limitations:

    Data distribution.
      - Consistency at any cost.
      - Sharding.
      - Joins across different machines.

    Impedance mismatch.
      - The DBMS always models data as tables.
      - An application models data in different ways, depending on their nature.

    Example

    The DBMS models a graph as a collection of tables.

    An application models a graph as an adjacency list.
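
    The contrast between the two models can be sketched in a few lines of Python. The three-edge graph and the name `to_adjacency_list` are made up for illustration; they do not come from the slides.

```python
from collections import defaultdict

# Relational view: the graph is a table with one (source, target) row per edge.
edge_table = [("a", "b"), ("a", "c"), ("b", "c")]


def to_adjacency_list(edges):
    """Group the rows of an edge table by source node."""
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)
    return dict(adjacency)


# Application view: each node maps directly to the list of its neighbours.
adjacency_list = to_adjacency_list(edge_table)
```

    Converting between the two representations on every read and write is the kind of friction that the term impedance mismatch refers to.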

    NoSQL databases were conceived to address these concerns.

      - Data distribution: Amazon Dynamo (2007), Google BigTable (2008).
      - Impedance mismatch: Neo4j (2007).

  • More on SQL Vs NoSQL Integrity and Consistency in SQL

    Integrity

    Definition (Integrity)

    Integrity is the property of a database where data satisfy all integrity constraints.

    Integrity constraints are defined by the database creator. Examples:
      - The salary of an employee is not negative.
      - If two employees have the same position, they have the same salary.
      - No two employees have the same numeric identifier (primary key constraint).

    Integrity constraints are checked at each update.

    If an update violates any integrity constraint, the update is rejected.

  • More on SQL Vs NoSQL Integrity and Consistency in SQL

    Consistency

    Definition (Consistency)

    A database is consistent if we cannot infer any contradiction from the data.

    Example of an inconsistency: an employee works in a non-existing department.

    Integrity implies consistency.

    But consistency does not imply integrity.

    Inconsistency might arise from:
      1. Violation of foreign key constraints.
      2. Execution of a sequence of updates outside of transactions.
      3. Insufficient data normalization.
      4. Data replication.

  • More on SQL Vs NoSQL Integrity and Consistency in SQL

    Consistency and Foreign Keys

    Suppose we have two tables:

    Employee(codeE, first, last, position, salary, codeD)

    Department(codeD, nameD, budget)

    Suppose that Employee contains the following tuple:

    23451, 'John', 'White', 'Secretary', 50000, 45

    If the table Department does not contain a department with codeD = 45, the database is in an inconsistent state.

    This inconsistency can be avoided by imposing a foreign key constraint.

  • More on SQL Vs NoSQL Integrity and Consistency in SQL

    Consistency and Foreign Keys

    CREATE TABLE Employee

    (

    codeE INTEGER PRIMARY KEY,

    first VARCHAR(50),

    last VARCHAR(50),

    position VARCHAR(50),

    salary INTEGER DEFAULT 30000,

    codeD INTEGER,

    FOREIGN KEY(codeD) REFERENCES Department (codeD)

    )
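
    The effect of the foreign key can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch, not the course's setup: the tables are trimmed to a few columns, the accepted employee row is invented, and `insert_employee` is a hypothetical helper. SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite ignores foreign key constraints unless this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE Department (codeD INTEGER PRIMARY KEY, nameD TEXT, budget INTEGER)"
)
conn.execute(
    """CREATE TABLE Employee (
           codeE INTEGER PRIMARY KEY,
           last_name TEXT,
           salary INTEGER DEFAULT 30000,
           codeD INTEGER,
           FOREIGN KEY(codeD) REFERENCES Department (codeD)
       )"""
)
conn.execute("INSERT INTO Department VALUES (14, 'Administration', 300000)")
conn.commit()


def insert_employee(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("INSERT INTO Employee VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


# Department 14 exists, so this (invented) employee is accepted.
accepted = insert_employee((23450, "Black", 50000, 14))
# Department 45 does not exist, so the tuple from the previous slide is rejected.
rejected = insert_employee((23451, "White", 50000, 45))
```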

  • More on SQL Vs NoSQL Integrity and Consistency in SQL

    Normalization and Transactions

    Consider the following table:

    Example

    BankAccount (acct_number, acct_type, balance, customer_id)

    Suppose that customer 1234 wants to transfer 200 euros from her account 4346 to her account 2334.

    Here are the SQL statements needed:

    UPDATE BankAccount SET balance=balance-200 WHERE acct_number=4346;

    UPDATE BankAccount SET balance=balance+200 WHERE acct_number=2334;

    Where’s the problem?

  • More on SQL Vs NoSQL Integrity and Consistency in SQL

    Transactions

    Definition (Transaction)

    A transaction is a sequence of read and/or write operations on a database that are executed as a single atomic operation: either all of them are executed or none is. Importantly, the values are stored only if the transaction is successful.

    In our previous example:

    Example

    START TRANSACTION;

    UPDATE BankAccount SET balance=balance-200 WHERE acct_number=4346;

    UPDATE BankAccount SET balance=balance+200 WHERE acct_number=2334;

    COMMIT;
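
    The all-or-nothing behaviour can be observed with Python's `sqlite3` module. The sketch below rests on several assumptions that are not in the slides: the starting balances, a `CHECK (balance >= 0)` constraint, and the helpers `transfer` and `balances`. The credit is deliberately executed before the debit, so that a failing debit shows the earlier update being undone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE BankAccount ("
    "acct_number INTEGER PRIMARY KEY, acct_type TEXT, "
    "balance INTEGER CHECK (balance >= 0), customer_id INTEGER)"
)
conn.executemany(
    "INSERT INTO BankAccount VALUES (?, ?, ?, ?)",
    [(4346, "checking", 500, 1234), (2334, "savings", 100, 1234)],
)
conn.commit()


def transfer(amount, source, target):
    """Move money between two accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE BankAccount SET balance = balance + ? WHERE acct_number = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE BankAccount SET balance = balance - ? WHERE acct_number = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances():
    """Return the current balance of every account, keyed by account number."""
    return dict(conn.execute("SELECT acct_number, balance FROM BankAccount"))


ok = transfer(200, 4346, 2334)       # both updates are applied
failed = transfer(1000, 4346, 2334)  # the debit violates the CHECK, so the credit is undone too
```

    After the second call the balances are the same as after the first one: the credit that had already been executed inside the failed transaction was rolled back.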

  • More on SQL Vs NoSQL Integrity and Consistency in SQL

    Transactions

    Transactions have the following properties (ACID):

    Atomicity (A). “All or nothing”.

    Consistency (C). A transaction takes the database from a consistent state to a consistent state, although some operations within the transaction may temporarily lead to inconsistencies.

    Isolation (I). Serializability of transactions.

    Durability (D). Upon commit, all the updates are permanent.

  • More on SQL Vs NoSQL Integrity and Consistency in SQL

    Consistency and Normalization

    Suppose you have the following table:

    Example

    Book (book_id, isbn, title, type, year, publisher_name, publisher_country)

    Suppose that the table contains the following tuples:

    Example

    (23412, '978-2253006312', 'De la terre à la lune', 'Paperback', 2001, 'Librairie Générale Française', 'France')

    (23413, '978-2253006329', 'Vingt mille lieues sous les mers', 'Paperback', 2001, 'Librairie Générale Française', 'Italie')

    Is the database consistent?

  • More on SQL Vs NoSQL Integrity and Consistency in SQL

    Consistency and Normalization

    Example

    Book (book_id, isbn, title, type, year, publisher_name, publisher_country)

    The table Book mixes information about two different entities, books and publishers.

    The database is not properly normalized.

    Normalizing the database reduces the risk of inconsistencies (see the Normalization section).

  • More on SQL Vs NoSQL Integrity and Consistency in SQL

    Consistency and Replication

    In a distributed database data is often replicated.

    Example: the table Employee is replicated on two different machines.

    This means that we have two identical copies of the same table.

    Any change on one copy must be propagated to the other, otherwise the database would be in an inconsistent state (see the Data Distribution section).

  • Normalization

    Normalization

  • Normalization Introduction

    Normalization

    Normalization: process by which redundancy is eliminated or reduced to a minimum.

    Redundancy wastes disk space.
      - Huge concern in the 70s (cost of 1 GB ≈ $300k).
      - Less of a concern today (cost of 1 GB ≈ $0.019).

    Redundancy might cause data inconsistencies.
      - Example: the telephone number of an employee is stored in two records.
      - When we update one record, we need to update the other as well.

    Normalization theory: based on the definition of six normal forms.
      - Each normal form adds some constraints to the previous one.

  • Normalization First Normal Form

    First Normal Form

    Definition (First Normal Form (1NF))

    A table is in first normal form if it meets the following conditions:

    1 Each tuple is unique (identified by a primary key).

    2 There are no duplicate columns.

    3 Each cell contains a single value (lists are not allowed).

  • Normalization First Normal Form

    First Normal Form

    The table Book is not in 1NF. Why?

    Example

    Book (book_id, isbn, title, type, year, authors,

    publisher_name, publisher_country)

    ("book-001", 978-0321197849, "An introduction to database systems",

    "Hardcover", 1999, "Christopher J. Date, UK", "Addison-Wesley", "US" )

    ("book-001", 978-8178082318, "An introduction to database systems",

    "Paperback", 1999, "Christopher J. Date, UK", "Addison-Wesley", "US" )

    ("book-002", 978-3446215337, "Data Mining",

    "Paperback", 2001, "Ian H. Witten, New Zealand"; "Frank Eibe, Germany",

    "Hanser Fachbuch", "Germany" )

  • Normalization First Normal Form

    First Normal Form

    We create a new table BookAuthor that contains the authors of the books.

    The two tables Book and BookAuthor are in 1NF.

    But there is still a lot of redundancy.

    Example

    Book (book_id, isbn, title, type, year, publisher_name,

    publisher_country)

    BookAuthor (isbn, year, aut_id, aut_name, aut_country)

  • Normalization Second Normal Form

    Functional Dependencies

    Definition (Functional Dependency)

    Given a relational table, a set of attributes Y = {Y1, ..., Yn} is functionally dependent on a set of attributes X = {X1, ..., Xm}, which is denoted as X → Y, if any value (x1, ..., xm) of X always implies the same value (y1, ..., yn) of Y.

    Meaning of a functional dependency X → Y: if any two tuples have the same values of X, then they have the same values of Y.
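
    The definition translates directly into a check over the rows of a table. The function `holds` below is a hypothetical helper written for illustration; the two rows reuse the Book tuples shown earlier, reduced to three attributes.

```python
def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows.

    rows is a list of dicts; lhs and rhs are lists of attribute names.
    """
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # Two tuples with the same X values must carry the same Y values.
        if seen.setdefault(x, y) != y:
            return False
    return True


books = [
    {"book_id": 23412, "publisher_name": "Librairie Générale Française",
     "publisher_country": "France"},
    {"book_id": 23413, "publisher_name": "Librairie Générale Française",
     "publisher_country": "Italie"},
]

publisher_fd = holds(books, ["publisher_name"], ["publisher_country"])
book_fd = holds(books, ["book_id"], ["publisher_name"])
```

    Here `publisher_fd` is False, because the same publisher appears with two countries, while `book_fd` is True. Such a check can only refute a dependency on the rows at hand. Whether a dependency holds in general is a design decision about the data, not something a sample can prove.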

  • Normalization Second Normal Form

    Functional Dependencies

    Example

    Book (book_id, isbn, title, type, year, publisher_name,

    publisher_country)

    BookAuthor (isbn, year, aut_id, aut_name, aut_country)

    Functional dependencies in table Book:

    book_id → title
    isbn → book_id, type, publisher_name
    publisher_name → publisher_country

    Functional dependencies in table BookAuthor:

    aut_id → aut_name, aut_country

  • Normalization Second Normal Form

    Second Normal Form

    Definition (Prime attribute)

    A prime column is one that belongs to at least one candidate key. Conversely, a non-prime column is one that does not belong to any candidate key.

    Definition (Second Normal Form (2NF))

    A table is in second normal form if it meets the following conditions:

    The table is in 1NF.

    Every non-prime column depends on all the columns of each candidate key, not just on a subset of them.

  • Normalization Second Normal Form

    Second Normal Form

    Table Book is not in 2NF. Why?

    Table BookAuthor is not in 2NF. Why?

    Example

    Book (book_id, isbn, title, type, year, publisher_name,

    publisher_country)

    BookAuthor (isbn, year, aut_id, aut_name, aut_country)

    How do we turn this database into 2NF?

  • Normalization Second Normal Form

    Second Normal Form

    The following tables are in 2NF.

    There are still redundant data (publisher_country).

    Example

    Book (book_id, isbn, title, type, publisher_name,

    publisher_country)

    PublicationYear (isbn, year)

    BookAuthor (isbn, aut_id)

    Author (aut_id, aut_name, aut_country)

  • Normalization Third Normal Form

    Third Normal Form

    Definition (Third Normal Form (3NF))

    A table is in third normal form if it meets the following conditions:

    The table is in 2NF.

    No non-prime column depends on another non-prime column.

    The table Book is not in 3NF. Why?

    Example

    Book (book_id, isbn, title, type, publisher_name,

    publisher_country)

  • Normalization Third Normal Form

    Third Normal Form

    Example

    Book (book_id, title)

    f.d.: book_id --> title

    BookEdition (isbn, book_id, type, publisher_name)

    f.d.: isbn --> book_id, type, publisher_name

    Publisher (publisher_name, publisher_country)

    f.d.: publisher_name --> publisher_country

    PublicationYear (isbn, year)

    f.d.: none

    BookAuthor (isbn, aut_id)

    f.d.: none

    Author (aut_id, aut_name, aut_country)

    f.d.: aut_id --> aut_name, aut_country
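
    To see what the decomposition buys, the sketch below loads two of the 3NF tables into an in-memory SQLite database and recovers the publisher's country through a join. The column types and the helper `countries_by_isbn` are assumptions, and the rows reuse the two editions from the earlier consistency example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Publisher (
        publisher_name TEXT PRIMARY KEY,
        publisher_country TEXT
    );
    CREATE TABLE BookEdition (
        isbn TEXT PRIMARY KEY,
        book_id INTEGER,
        type TEXT,
        publisher_name TEXT REFERENCES Publisher (publisher_name)
    );
    INSERT INTO Publisher VALUES ('Librairie Générale Française', 'France');
    INSERT INTO BookEdition VALUES
        ('978-2253006312', 23412, 'Paperback', 'Librairie Générale Française'),
        ('978-2253006329', 23413, 'Paperback', 'Librairie Générale Française');
    """
)


def countries_by_isbn():
    """Recover publisher_country for each edition through a join."""
    query = (
        "SELECT e.isbn, p.publisher_country "
        "FROM BookEdition e JOIN Publisher p USING (publisher_name) "
        "ORDER BY e.isbn"
    )
    return conn.execute(query).fetchall()


editions = countries_by_isbn()
```

    Because the country is stored once, in Publisher, two editions of the same publisher can no longer disagree on it. The price is the join needed to put the information back together.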

  • Data Distribution

    Data Distribution

  • Data Distribution Introduction

    Distributed Database

    Single-server database.
      - Database on only one machine.
      - Advantage: all data under the control of one DBMS.
      - Drawback: the performance of the DBMS decreases as data volume increases.
      - Best solution if the scale of data allows it.

    Distributed database.
      - Data reside on multiple machines (a.k.a. nodes).
      - Each machine is independent of the others (shared-nothing architecture).
      - Advantage: allows storage and management of large volumes of data.
      - Drawback: far more complex than a single-server database.
      - Scale out only if a single-server database is not viable.

  • Data Distribution Characteristics of Distributed Databases

    Characteristics of Distributed Databases

    Data distribution options: replication and sharding.

    Location transparency: the user queries the data without even knowing that they are distributed.

    Replication transparency: consistency of the data that are replicated.

    Data management functions: security and concurrent access control.

  • Data Distribution Data distribution

    Replication

    Table Department, replicated identically on nodes A, B and C:

    codeD  nameD            budget
    14     Administration   300,000
    25     Education        150,000
    62     Finance          600,000
    45     Human Resources  150,000

    + Scalability. Multiple nodes receive queries on the same tuple.

    + Latency. Worldwide database, replica close to the user.

    + Fault tolerance. If a node fails, the others can still answer queries.

    – Consistency. Keep all replicas up-to-date.

  • Data Distribution Data distribution

    Replication and Consistency

    Table Department on node A, after the budget of Administration has been updated:

    codeD  nameD            budget
    14     Administration   500,000
    25     Education        150,000
    62     Finance          600,000
    45     Human Resources  150,000

    Table Department on nodes B and C, not yet updated:

    codeD  nameD            budget
    14     Administration   300,000
    25     Education        150,000
    62     Finance          600,000
    45     Human Resources  150,000

    Synchronous update. The update is propagated immediately.
      + Short inconsistency window.
      – Not viable if updates are frequent.

    Asynchronous update. The update is propagated at regular intervals.
      + Best option when updates are frequent.
      – Large inconsistency window.
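
    The difference between the two policies can be mimicked with a toy model. The class `ReplicatedValue` and its method names are invented for this sketch. Real systems propagate updates over the network, which is what makes the synchronous variant costly.

```python
class ReplicatedValue:
    """One value stored on several replicas (toy model, no networking)."""

    def __init__(self, replicas, value, synchronous):
        self.copies = {name: value for name in replicas}
        self.synchronous = synchronous
        self.pending = []  # updates waiting for the next propagation round

    def write(self, replica, value):
        """Apply a write on one replica, propagating at once if synchronous."""
        self.copies[replica] = value
        if self.synchronous:
            self._propagate(value)
        else:
            self.pending.append(value)

    def propagate_pending(self):
        """Stands for the periodic propagation of the asynchronous mode."""
        for value in self.pending:
            self._propagate(value)
        self.pending.clear()

    def _propagate(self, value):
        for name in self.copies:
            self.copies[name] = value

    def is_consistent(self):
        """True when every replica stores the same value."""
        return len(set(self.copies.values())) == 1


budget = ReplicatedValue(["A", "B", "C"], 300000, synchronous=False)
budget.write("A", 500000)
inside_window = budget.is_consistent()  # replicas B and C still hold 300000
budget.propagate_pending()
after_window = budget.is_consistent()
```

    Between the write and the call to `propagate_pending` the replicas disagree: that interval is the inconsistency window.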

  • Data Distribution Data distribution

    Master-Slave Replication

    [Figure: one master node and two slave nodes. Clients send writes to the master and reads to any node; the master forwards the writes to the slaves.]

    Write operations: on the master. Writes are propagated to the slaves.

    Read operations: from the master and from the slaves.

    + No write conflicts.

    – Single point of failure.
      - If the master is down, write operations are not allowed.
      - Algorithms exist to elect a new master.

  • Data Distribution Data distribution

    Master-Slave Replication – Read conflicts

    [Figure: master-slave cluster while a write is being propagated, with steps numbered 1 to 4. Write: update (Department, budget=500,000). Read: select (Department, budget). Two copies of Department already store 500,000 as the budget of Administration (codeD 14) while one copy still stores 300,000, so reads on different nodes return 500,000, 500,000 and 300,000.]

  • Data Distribution Data distribution

    Peer-to-peer replication

    [Figure: three nodes A, B and C. Each node accepts writes and reads from clients and propagates its writes to the other nodes.]

    Write and read operations on any node.

    + No single point of failure (very high availability).

    – Write and read conflicts (low consistency).

  • Data Distribution Data distribution

    Sharding

    Department, shard on node A:

    codeD  nameD            budget
    14     Administration   300,000
    25     Education        150,000

    Department, shard on node B:

    codeD  nameD            budget
    62     Finance          600,000
    45     Human Resources  150,000

    Employee, first shard:

    codeE  last_name  codeD
    1      Bennet     14
    2      Doe        62
    3      Fisher     25
    4      Green      62

    Employee, second shard:

    codeE  last_name  codeD
    5      Russel     25
    6      Smith      62
    7      Watson     14
    8      Young      45

    Tuples partitioned into balanced shards. Shards distributed across the nodes.

    + Load balance.

    + No consistency problems.

    – Loss of data when a node fails.

    – Join across different nodes.

    – Updates entail changes in shards.
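
    One simple way to obtain balanced shards is to derive the node from the key. The slides do not say which rule is used, so the modulo rule below is an assumption, as are the names `shard_for` and `partition`. With this rule the Department tuples end up on different nodes than in the example above.

```python
def shard_for(key, nodes):
    """Pick the node that stores the tuple with the given integer key."""
    return nodes[key % len(nodes)]


def partition(rows, key_name, nodes):
    """Split rows (a list of dicts) into one shard per node."""
    shards = {node: [] for node in nodes}
    for row in rows:
        shards[shard_for(row[key_name], nodes)].append(row)
    return shards


departments = [
    {"codeD": 14, "nameD": "Administration", "budget": 300000},
    {"codeD": 25, "nameD": "Education", "budget": 150000},
    {"codeD": 62, "nameD": "Finance", "budget": 600000},
    {"codeD": 45, "nameD": "Human Resources", "budget": 150000},
]

shards = partition(departments, "codeD", ["A", "B"])
```

    A query on a single codeD only needs to contact `shard_for(codeD, nodes)`. A join between Employee and Department may have to contact every node, which is the cost listed above.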

  • Data Distribution Data distribution

    Combining Replication and Sharding

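    [Figure: the data are sharded into three partitions P1, P2 and P3, and each partition is replicated on three nodes: P1 on A1, A2 and A3; P2 on B1, B2 and B3; P3 on C1, C2 and C3.]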

  • Data Distribution Consistency in distributed databases

    Distributed Transactions

    Objective: guarantee ACID properties in distributed databases.

    Solution: distributed transactions.
      - A sequence of read/write operations that span multiple nodes.

    Two-phase commit protocol.
      - Prepare phase. The coordinator asks all the nodes to prepare to either commit or roll back.
      - Commit phase. The coordinator asks the other nodes to commit their operations.

    If even one node fails, the whole transaction is rolled back.

    Distributed transaction management involves a lot of coordination, and therefore network traffic.

    Ensuring strong consistency in distributed databases is not always a good idea.
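
    The two phases can be sketched as follows. This is a simplified, in-process model under strong assumptions: there are no timeouts, no logging and no coordinator failure, and the `Node` class with its `can_commit` flag is invented for the example.

```python
class Node:
    """A participant; can_commit simulates whether its local work succeeded."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        """Vote yes (True) or no (False) on committing the transaction."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(nodes):
    """Return True if the transaction committed on every node."""
    # Phase 1 (prepare): the coordinator collects a vote from every node.
    votes = [node.prepare() for node in nodes]
    # Phase 2: commit only if all nodes voted yes, otherwise roll everyone back.
    if all(votes):
        for node in nodes:
            node.commit()
        return True
    for node in nodes:
        node.rollback()
    return False


healthy = [Node("A"), Node("B"), Node("C")]
committed = two_phase_commit(healthy)

faulty = [Node("A"), Node("B", can_commit=False), Node("C")]
aborted = not two_phase_commit(faulty)
```

    Every transaction costs at least two rounds of messages between the coordinator and each node, which is where the coordination overhead comes from.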

  • Data Distribution Consistency in distributed databases

    Replication and Consistency

    Replication makes consistency problems even worse.

    An update on a replica takes time to propagate to the other replicas.

    Update consistency (write conflict, write-write conflict).
      - Pessimistic approach: use write locks.
      - Optimistic approach: use version control-like solutions.

    Read consistency. Two applications might read different values when they read different replicas.

    Only in the inconsistency window.
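
    One common reading of the optimistic approach is a version check at write time. The sketch below illustrates that reading only; the slides do not prescribe a mechanism, and `VersionedRecord` is a hypothetical class.

```python
class VersionedRecord:
    """A value paired with a version counter, checked on every write."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        """Return the value together with the version that was read."""
        return self.value, self.version

    def write(self, new_value, expected_version):
        """Apply the write only if nobody updated the record since it was read."""
        if expected_version != self.version:
            return False  # write-write conflict detected, the caller must retry
        self.value = new_value
        self.version += 1
        return True


budget = VersionedRecord(300000)
_, version_seen_by_app1 = budget.read()
_, version_seen_by_app2 = budget.read()

first = budget.write(500000, version_seen_by_app1)   # accepted
second = budget.write(400000, version_seen_by_app2)  # rejected: stale version
```

    No lock is held while the applications work. The conflict is only detected when the second write arrives, which suits workloads where conflicts are rare.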

  • Data Distribution Consistency in distributed databases

    Quorums (1/2)

    A transactional approach to replication consistency is inefficient. Why?

    Write quorum. Number of replicas QW to lock in an n-node cluster.

    QW > n/2

    Propagate the update to the remaining n − QW replicas.

    Strong write consistency.

    Weak read consistency.

  • Data Distribution Consistency in distributed databases

    Quorums (2/2)

    Read quorum. Number of replicas QR to read to get a consistent read.

    QR + QW > n

    Quorum assembly for replicas

    Assume n copies. Define a read quorum QR and a write quorum QW, the numbers of copies which must be locked for reading (QR) and writing (QW).

    QW > n/2

    QR + QW > n

    e.g. QW = n, QR = 1: lock all copies for writing, read from any copy.

    These conditions ensure that:
      - only one write quorum at a time can be assembled (deadlock? - see D10);
      - every QW and QR contain at least one up-to-date replica.

    After assembling a (write) quorum, make all replicas consistent, then do the operation.

    e.g. n = 7, QW = 5, QR = 3

    Optimisation: after making a write quorum consistent and performing the update, background-propagate to the other replicas not in the quorum.
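
    The two inequalities can be checked mechanically, and for small clusters the overlap property they guarantee can be verified by brute force. The function names are invented for this sketch.

```python
from itertools import combinations


def valid_quorums(n, qw, qr):
    """Check the two quorum conditions for n replicas."""
    return qw > n / 2 and qr + qw > n


def every_read_sees_a_write(n, qw, qr):
    """Brute-force check that any read quorum overlaps any write quorum."""
    replicas = range(n)
    return all(
        set(w) & set(r)  # an empty intersection is falsy
        for w in combinations(replicas, qw)
        for r in combinations(replicas, qr)
    )


example = valid_quorums(7, 5, 3)           # the n = 7, QW = 5, QR = 3 example
lock_all = valid_quorums(7, 7, 1)          # QW = n, QR = 1
too_small = valid_quorums(7, 4, 3)         # QR + QW = n, not > n
overlap = every_read_sees_a_write(7, 5, 3)
no_overlap = every_read_sees_a_write(7, 4, 3)
```

    With QW = 4 and QR = 3 a read quorum can be made of exactly the three replicas that the write quorum missed, so a read may return stale data. This is the case that the condition QR + QW > n rules out.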

  • Data Distribution The CAP Theorem

    The CAP Theorem

    Consistency. “Equivalent to having a single up-to-date copy of the data” (Brewer).

    Availability. A database can perform read/write operations even when some nodes fail.

    Partition tolerance. The database can operate when a network partition occurs as if the partition did not happen.

    Theorem (CAP, Brewer 1999)

    Given the three properties of consistency, availability and partition tolerance, a networked shared-data system can have at most two of these properties.

  • Data Distribution The CAP Theorem

    The CAP Theorem

    Sketch of the proof (Gilbert & Lynch, 2002).

    Suppose that a database is partition tolerant. Two cases when a network partition occurs.

    Write operations are allowed.
      - Consistency cannot be guaranteed.
      - AP database.

    Write operations are not allowed.
      - Database unavailable (write operations forbidden on reachable nodes).
      - CP database.

    Consistency and availability only when no network partition occurs.

    Assuming absence of network partitions means that a database is not partition tolerant.

  • Data Distribution The CAP Theorem

    The CAP Theorem

    The theorem has been largely misunderstood for years.

    Common interpretation:

    Network partitions occur.
    Therefore, the CAP theorem reduces to a choice between consistency and availability.

    However, network partitions are not that frequent.

    It makes no sense to give up either consistency or availability when no partition occurs.

    A distributed database should detect network partitions and operate in a partition mode with:
      - reduced availability (some operations forbidden), or
      - reduced consistency.

    Partition recovery to resolve the inconsistencies.

  • Data Distribution The CAP Theorem

    BASE Consistency Model

    ACID transactions are a pessimistic consistency model.

    An optimistic consistency model, used in distributed databases, is BASE.

    Basic Availability (BA). The database appears to work most of the time.

    Soft state (S). Write and read inconsistencies can occur.

    Eventually consistent (E). The database becomes consistent at some point in time, once updates have propagated.

  • NoSQL Databases

    NoSQL Databases

  • NoSQL Databases

    Characteristics of NoSQL databases

    NoSQL means Not Only SQL (term popularized in 2009).

    NoSQL databases are not based on the relational model.

    NoSQL databases are generally open-source.

    NoSQL databases are cluster-oriented.

    NoSQL databases tend to privilege availability over consistency.

    NoSQL databases are schemaless.

    NoSQL databases are classified into four families:

      - Key-value databases.
      - Document databases.
      - Column-family databases.
      - Graph databases.

    The first three families are based on the notion of aggregate.

  • NoSQL Databases Aggregates

    Aggregate

    Aggregate: data structure containing the description of an entity.

    All data in the same aggregate must stay in the same shard.

    Operations within the same aggregate are atomic.

    Schema is flexible and non-normalized.

    Aggregate

    {
      "codeE": 1,
      "first": "Joseph",
      "last": "Bennett",
      "position": "Office assistant",
      "salary": 55000,
      "department": [
        {
          "codeD": 14,
          "nameD": "Administration",
          "budget": 30000
        }
      ]
    }

    {
      "codeE": 2,
      "first": "John",
      "last": "Doe",
      "salary": 45000,
      "department": [
        {
          "codeD": 14,
          "nameD": "Administration",
          "budget": 30000
        }
      ]
    }

  • NoSQL Databases Aggregates

    Aggregate

    Different ways to model the data.

    Aggregate

    {

    "codeD": 14,

    "name": "Administration",

    "budget": 30000,

    "employees": [

    {

    "codeE": 1,

    "first": "Joseph",

    ....

    },

    {

    "codeE": 7,

    "first": "Michael",

    ....

    }

    ]

    }

  • NoSQL Databases Key-value Data Model

    Key-value Data Model

    Data are modeled as key-value pairs.
      - Key: alphanumeric string auto-generated by the database.
      - Value: an aggregate.
      - Query: get a value given its key.

    Fast read/write operations.

    Little to no checks on integrity constraints.

    Example: shopping cart.

    Key-value databases: Amazon Dynamo, Voldemort, Riak, MemcacheDB.

    key: 01029120334  →  product:3345, product:334561, product:234561

    key: 01029145522  →  product:221334, product:4533319, product:6734862
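
    The access pattern of a key-value database fits in a few lines. The class below is an in-memory stand-in written for illustration, not the API of any of the products listed above. Generating keys with `uuid` is an assumption.

```python
import uuid


class KeyValueStore:
    """In-memory stand-in for a key-value database (illustration only)."""

    def __init__(self):
        self._data = {}

    def put(self, value):
        """Store an aggregate and return the key generated for it."""
        key = uuid.uuid4().hex  # alphanumeric key generated by the store
        self._data[key] = value
        return key

    def get(self, key):
        """The only query: fetch the whole value for a given key."""
        return self._data.get(key)


store = KeyValueStore()
cart_key = store.put({"products": [3345, 334561, 234561]})
cart = store.get(cart_key)
```

    The store never looks inside the value, which is why reads and writes are fast and why integrity constraints on the content of the cart cannot be checked.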

  • NoSQL Databases Document Data Model

    Document Data Model

    Data are modeled as key-value pairs.
      - Key: alphanumeric string auto-generated by the database.
      - Value: an aggregate (called document).
      - Query: get documents by key and by the values of their properties.

    Example in detail: MongoDB.

  • NoSQL Databases Column-family Data Model

    Column-family Data Model

    Data are modeled as key-value pairs.
      - Key: row identifier.
      - Value: an aggregate, composed of one or more column families.
      - Query: get a row given its key and the values of its columns.

    Sharding unit: a row.

    Storage unit: a column family.

    Column-family databases: BigTable, HBase, Cassandra.

  • NoSQL Databases Column-family Data Model

    Column-family Data Model

    Definition of a small number of column families.

    As many columns as we need.

    The value of a column can be an aggregate.

  • NoSQL Databases Graph Databases

    Graph Databases

    DBMS specifically designed to manage and process graphs.

    Two components:

      - Storage engine: dictates how the graph is stored.
      - Processing engine: dictates how the graph is processed.

    Native storage engine. Storage is tailored to graphs.

    Native processing engine. Operations optimized for graphs.
