
Big Data

Gianluca Quercini, Stéphane Vialle

Laboratoire de Recherche en Informatique

CentraleSupélec

2020 – 2021

More on SQL Vs NoSQL


G. Quercini, S. Vialle (CentraleSupélec) Big Data 2020 – 2021 1 / 54

More on SQL Vs NoSQL Introduction

Limitations of the Relational Model

Relational DBMSs have two major limitations:

Data distribution.
  Consistency at any cost.
  Sharding. Joins across different machines.

Impedance mismatch.
  The DBMS always models data as tables.
  An application models data in different ways, depending on their nature.

Example

The DBMS models a graph as a collection of tables.

An application models a graph as an adjacency list.

NoSQL databases were conceived to address these concerns.

  Data distribution. Amazon Dynamo (2007), Google BigTable (2008).
  Impedance mismatch. Neo4j (2007).


More on SQL Vs NoSQL Integrity and Consistency in SQL

Integrity

Definition (Integrity)

Integrity is the property of a database where data satisfy all integrity constraints.

Integrity constraints are defined by the database creator. Examples:
  The salary of an employee is not negative.
  If two employees have the same position, they have the same salary.
  No two employees have the same numeric identifier (primary key constraint).

Integrity constraints are checked at each update.

If an update violates any integrity constraint, the update is rejected.
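As a minimal sketch of how a DBMS rejects an update that violates a constraint, the snippet below uses Python's standard sqlite3 module. The simplified Employee table, the CHECK clause and the helper name is_rejected are assumptions made for illustration only.

```python
import sqlite3

def is_rejected(statement):
    """Run one statement on a tiny Employee table; return True if the DBMS rejects it."""
    conn = sqlite3.connect(":memory:")
    # Two integrity constraints: unique identifier and non-negative salary.
    conn.execute(
        "CREATE TABLE Employee ("
        "codeE INTEGER PRIMARY KEY, "
        "salary INTEGER CHECK (salary >= 0))"
    )
    conn.execute("INSERT INTO Employee VALUES (1, 30000)")
    try:
        conn.execute(statement)
        return False
    except sqlite3.IntegrityError:
        # Raised for both CHECK and PRIMARY KEY violations.
        return True
    finally:
        conn.close()
```

For instance, is_rejected("UPDATE Employee SET salary = -5 WHERE codeE = 1") reports a rejection, while an update to a positive salary is accepted.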



Consistency

Definition (Consistency)

A database is consistent if we cannot infer any contradiction from thedata.

Example: an employee works in a non-existing department.

Integrity implies consistency.

But consistency does not imply integrity.

Inconsistency might arise from:1 Violation of foreign key constraints.2 Execution of a sequence of updates outside of transactions.3 Insufficient data normalization.4 Data replication.



Consistency and Foreign Keys

Suppose we have two tables:

Employee(codeE, first, last, position, salary, codeD)

Department(codeD, nameD, budget)

Suppose that Employee contains the following tuple:

23451, ’John’, ’White’, ’Secretary’, 50000, 45

If the table Department does not contain a department with codeD = 45, the database is in an inconsistent state.

This inconsistency can be avoided by imposing a foreign key constraint.



Consistency and Foreign Keys

CREATE TABLE Employee
(
    codeE INTEGER PRIMARY KEY,
    first VARCHAR(50),
    last VARCHAR(50),
    position VARCHAR(50),
    salary INTEGER DEFAULT 30000,
    codeD INTEGER,
    FOREIGN KEY(codeD) REFERENCES Department (codeD)
);
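The effect of the foreign key can be tried out with a small sketch based on Python's standard sqlite3 module. The helper name try_insert_employee and the single Department row are assumptions for illustration; note that SQLite only checks foreign keys after PRAGMA foreign_keys = ON, whereas other DBMSs may enforce them by default.

```python
import sqlite3

def try_insert_employee(code_d):
    """Insert John White with department code_d; return True if the insert is accepted."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
    conn.execute(
        "CREATE TABLE Department ("
        "codeD INTEGER PRIMARY KEY, nameD VARCHAR(50), budget INTEGER)"
    )
    # "first" and "last" are quoted because they are also SQL keywords.
    conn.execute(
        'CREATE TABLE Employee ('
        'codeE INTEGER PRIMARY KEY, "first" VARCHAR(50), "last" VARCHAR(50), '
        'position VARCHAR(50), salary INTEGER DEFAULT 30000, codeD INTEGER, '
        'FOREIGN KEY(codeD) REFERENCES Department (codeD))'
    )
    conn.execute("INSERT INTO Department VALUES (14, 'Administration', 300000)")
    try:
        conn.execute(
            "INSERT INTO Employee VALUES (23451, 'John', 'White', 'Secretary', 50000, ?)",
            (code_d,),
        )
        return True
    except sqlite3.IntegrityError:
        # The referenced department does not exist.
        return False
    finally:
        conn.close()
```

With only department 14 present, the tuple with codeD = 45 is rejected, while the same tuple with codeD = 14 is accepted.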



Normalization and Transactions

Consider the following table:

Example

BankAccount (acct_number, acct_type, balance, customer_id)

Suppose that customer 1234 wants to transfer 200 euros from her account 4346 to her account 2334.

Here are the SQL statements needed:

UPDATE BankAccount SET balance=balance-200 WHERE acct_number=4346;

UPDATE BankAccount SET balance=balance+200 WHERE acct_number=2334;

Where’s the problem?



Transactions

Definition (Transaction)

A transaction is a sequence of read and/or write operations on a database that are executed as a single atomic operation. Either all are executed or none. Importantly, the values are stored only if the transaction is successful.

In our previous example:

Example

START TRANSACTION;

UPDATE BankAccount SET balance=balance-200 WHERE acct_number=4346;

UPDATE BankAccount SET balance=balance+200 WHERE acct_number=2334;

COMMIT;
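A minimal sketch of the "all or nothing" behaviour, using Python's standard sqlite3 module. The function names transfer and demo, the initial balances and the non-existent account 9999 are assumptions for illustration; the column name acct_number follows the BankAccount schema.

```python
import sqlite3

def transfer(conn, source, target, amount):
    """Move amount from source to target atomically; return True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE BankAccount SET balance = balance - ? WHERE acct_number = ?",
                (amount, source),
            )
            credited = conn.execute(
                "UPDATE BankAccount SET balance = balance + ? WHERE acct_number = ?",
                (amount, target),
            )
            if credited.rowcount != 1:
                # The debit has already been applied: abort to undo it.
                raise ValueError("unknown target account")
        return True
    except ValueError:
        return False

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE BankAccount ("
        "acct_number INTEGER PRIMARY KEY, acct_type TEXT, "
        "balance INTEGER, customer_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO BankAccount VALUES (?, ?, ?, ?)",
        [(4346, "checking", 500, 1234), (2334, "savings", 100, 1234)],
    )
    conn.commit()
    ok = transfer(conn, 4346, 2334, 200)      # both updates are applied
    failed = transfer(conn, 4346, 9999, 200)  # the debit is rolled back
    balances = dict(conn.execute("SELECT acct_number, balance FROM BankAccount"))
    conn.close()
    return ok, failed, balances
```

In the failed transfer the debit is executed first and then undone by the rollback, so no money disappears from account 4346.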



Transactions

Transactions have the following properties (ACID):

Atomicity (A). “All or nothing”.

Consistency (C). A transaction takes the database from a consistent state to a consistent state, although some operations within the transaction may temporarily lead to inconsistencies.

Isolation (I). Serializability of transactions.

Durability (D). Upon commit, all the updates are permanent.



Consistency and Normalization

Suppose you have the following table:

Example

Book (book_id, isbn, title, type, year, publisher_name, publisher_country)

Suppose that the table contains the following tuples:

Example

(23412, ’978-2253006312’, ’De la terre à la lune’, ’Paperback’, 2001, ’Librairie Générale Française’, ’France’)

(23413, ’978-2253006329’, ’Vingt mille lieues sous les mers’, ’Paperback’, 2001, ’Librairie Générale Française’, ’Italie’)

Is the database consistent?



Consistency and Normalization

Example


Book (book_id, isbn, title, type, year, publisher_name, publisher_country)

The table Book mixes information about two different entities, books and publishers.

The database is not properly normalized.

Normalizing the database reduces the risk of inconsistencies (see below).



Consistency and Replication

In a distributed database data is often replicated.

Example: the table Employee is replicated on two different machines.

This means that we have two identical copies of the same table.

Any change on one copy must be propagated to the other, otherwise the database would be in an inconsistent state (see below).


Normalization



Normalization Introduction

Normalization

Normalization: process by which redundancy is eliminated or reduced to a minimum.

Redundancy wastes disk space.
  Huge concern in the 70s (cost of 1 GB ~$300k).
  Less of a concern today (cost of 1 GB ~$0.019).

Redundancy might cause data inconsistencies.
  The telephone number of an employee is stored in two records.
  When we update one record, we need to update the other one too.

Normalization theory: based on the definition of six normal forms.
  Each normal form adds some constraints to the previous one.


Normalization First Normal Form

First Normal Form

Definition (First Normal Form (1NF))

A table is in first normal form if it meets the following conditions:

1 Each tuple is unique (identified by a primary key).

2 There are no duplicate columns.

3 Each cell contains a single value (lists are not allowed).



First Normal Form

The table Book is not in 1NF. Why?

Example

Book (book_id, isbn, title, type, year, authors, publisher_name, publisher_country)

("book-001", 978-0321197849, "An introduction to database systems", "Hardcover", 1999, "Christopher J. Date, UK", "Addison-Wesley", "US")

("book-001", 978-8178082318, "An introduction to database systems", "Paperback", 1999, "Christopher J. Date, UK", "Addison-Wesley", "US")

("book-002", 978-3446215337, "Data Mining", "Paperback", 2001, "Ian H. Witten, New Zealand"; "Frank Eibe, Germany", "Hanser Fachbuch", "Germany")



First Normal Form

We create a new table BookAuthor that contains the authors of the books.

The two tables Book and BookAuthor are in 1NF.

But there is still a lot of redundancy.

Example

Book (book_id, isbn, title, type, year, publisher_name, publisher_country)

BookAuthor (isbn, year, aut_id, aut_name, aut_country)


Normalization Second Normal Form

Functional Dependencies

Definition (Functional Dependency)

Given a relational table, a set of attributes Y = {Y1, ..., Yn} is functionally dependent on a set of attributes X = {X1, ..., Xm}, which is denoted as X → Y, if any value (x1, ..., xm) of X always implies the same value (y1, ..., yn) of Y.

Meaning of a functional dependency X → Y: if any two tuples have the same values of X, then they have the same values of Y.
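The definition can be checked mechanically on a set of tuples. The sketch below is one possible way to do it in Python; the function name holds and the representation of tuples as dictionaries are assumptions, and the sample rows reuse the two Book tuples shown earlier. A check on sample data can only refute a dependency, it cannot prove that the dependency holds in general.

```python
def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows.

    rows is a list of dicts; lhs and rhs are lists of attribute names.
    """
    seen = {}  # maps each value of lhs to the value of rhs first seen with it
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # Two tuples with the same x but a different y violate the dependency.
        if seen.setdefault(x, y) != y:
            return False
    return True

books = [
    {"isbn": "978-2253006312",
     "publisher_name": "Librairie Générale Française",
     "publisher_country": "France"},
    {"isbn": "978-2253006329",
     "publisher_name": "Librairie Générale Française",
     "publisher_country": "Italie"},
]
```

On these rows, publisher_name → publisher_country is violated (same publisher, two countries), which is exactly the inconsistency discussed for the non-normalized Book table.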



Functional Dependencies

Example


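Book (book_id, isbn, title, type, year, publisher_name, publisher_country)

BookAuthor (isbn, year, aut_id, aut_name, aut_country)

Functional dependencies in table Book:

  book_id → title
  isbn → book_id, type, publisher_name
  publisher_name → publisher_country

Functional dependencies in table BookAuthor:

  aut_id → aut_name, aut_country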



Second Normal Form

Definition (Prime attribute)

A prime column is one that belongs to at least one candidate key. Conversely, a non-prime column is one that does not belong to any candidate key.

Definition (Second Normal Form (2NF))

A table is in second normal form if it meets the following conditions:

The table is in 1NF.

All non-prime columns depend on all the columns of each candidate key.



Second Normal Form

Table Book is not in 2NF. Why?

Table BookAuthor is not in 2NF. Why?

Example


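Book (book_id, isbn, title, type, year, publisher_name, publisher_country)

BookAuthor (isbn, year, aut_id, aut_name, aut_country)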


How do we turn this database into 2NF?



Second Normal Form

The following tables are in 2NF.

Still redundant data (publisher country).

Example

Book (book_id, isbn, title, type, publisher_name, publisher_country)

PublicationYear (isbn, year)

BookAuthor (isbn, aut_id)

Author (aut_id, aut_name, aut_country)


Normalization Third Normal Form

Third Normal Form

Definition (Third Normal Form (3NF))

A table is in third normal form if it meets the following conditions:

The table is in 2NF.

No non-prime column depends on non-prime columns.

The table Book is not in 3NF. Why?

Example

Book (book_id, isbn, title, type, publisher_name, publisher_country)



Third Normal Form

Example

Book (book_id, title)

f.d.: book_id --> title

BookEdition (isbn, book_id, type, publisher_name)

f.d.: isbn --> book_id, type, publisher_name

Publisher (publisher_name, publisher_country)

f.d.: publisher_name --> publisher_country

PublicationYear (isbn, year)

f.d.: none

BookAuthor (isbn, aut_id)

f.d.: none

Author (aut_id, aut_name, aut_country)

f.d.: aut_id --> aut_name, aut_country


Data Distribution



Data Distribution Introduction

Distributed Database

Single-server database.
  Database on only one machine.
  + All data under the control of one DBMS.
  – Performance of the DBMS decreases as data volume increases.
  Best solution if the scale of data allows it.

Distributed database.
  Data reside on multiple machines (a.k.a. nodes).
  Each machine is independent of the others (shared-nothing architecture).
  + Allows storage and management of large volumes of data.
  – Far more complex than a single-server database.
  Scale out only if a single-server database is not viable.


Data Distribution Characteristics of Distributed Databases

Characteristics of Distributed Databases

Data distribution options: replication and sharding.

Location transparency: the user queries the data without even knowing that they are distributed.

Replication transparency: consistency of the data that are replicated.

Data management functions: security and concurrent access control.


Data Distribution Data distribution

Replication

[Figure: the table Department is replicated on three nodes A, B and C; each node holds a full copy.]

Department
codeD  nameD            budget
14     Administration   300,000
25     Education        150,000
62     Finance          600,000
45     Human Resources  150,000




+ Scalability. Multiple nodes receive queries on the same tuple.

+ Latency. Worldwide database, replica close to the user.

+ Fault tolerance. If a node fails, the others can still answer queries.

– Consistency. Keep all replicas up-to-date.


Replication and Consistency

[Figure: the table Department replicated on three nodes A, B and C.]

Synchronous update. The update is propagated immediately.

+ Short inconsistency window.

– Not viable if updates are frequent.

Asynchronous update. The update is propagated at regular intervals.

+ Best option when updates are frequent.

– Large inconsistency window.



Master-Slave Replication

[Figure: one master node and two slave nodes. Clients write to the master and read from any node; the master propagates the writes to the slaves.]

Write operations: on the master. Writes are propagated to the slaves.

Read operations: from the master and from the slaves.




+ No write conflicts.

– Single point of failure.

  If the master is down, write operations are not allowed.
  Algorithms exist to elect a new master.



Master-Slave Replication – Read conflicts

[Figure: master-slave read conflict on the replicated table Department. Write: update (Department, budget=500,000). Read: select (Department, budget). The numbered steps show the write reaching the master and then the slaves; a read served by a replica that has not yet received the update returns the old value 300,000 instead of 500,000.]
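The read conflict can be reproduced with a toy simulation. The sketch below assumes asynchronous propagation and models each node as a Python dictionary; the class name MasterSlaveCluster and its methods are hypothetical and do not correspond to any real DBMS API.

```python
class MasterSlaveCluster:
    """Toy master-slave replication with asynchronous propagation."""

    def __init__(self, n_slaves):
        self.master = {}
        self.slaves = [{} for _ in range(n_slaves)]
        self.pending = []  # writes accepted by the master, not yet propagated

    def write(self, key, value):
        # All writes go to the master only.
        self.master[key] = value
        self.pending.append((key, value))

    def propagate(self):
        # Push every pending write to every slave, in order.
        for key, value in self.pending:
            for slave in self.slaves:
                slave[key] = value
        self.pending.clear()

    def read(self, key, slave_index=None):
        # Read from the master by default, or from the given slave.
        node = self.master if slave_index is None else self.slaves[slave_index]
        return node.get(key)
```

Between a write and the next call to propagate, a read on a slave returns the old value: this interval is the inconsistency window.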



Peer-to-peer replication

[Figure: three peer nodes A, B and C. Each node accepts both writes and reads and propagates its writes to the other two nodes.]

Write and read operations on any node.

+ No single point of failure (very high availability).

– Write and read conflicts (low consistency).



Sharding

[Figure: the tables Department and Employee are each split into two shards, stored on nodes A and B.]

Employee (one shard)
codeE  last_name  codeD
1      Bennet     14
2      Doe        62
3      Fisher     25
4      Green      62

Employee (other shard)
codeE  last_name  codeD
5      Russel     25
6      Smith      62
7      Watson     14
8      Young      45

Tuples partitioned into balanced shards.

Shards distributed across the nodes.




+ Load balance.

+ No consistency problems.

– Loss of data when a node fails.

– Join across different nodes.

– Updates entail changes in shards.
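A minimal sketch of how tuples could be assigned to shards. It assumes integer keys and a simple modulo rule, which is only one of several possible partitioning strategies (the figure rather suggests a split by key ranges); the function names shard_for and partition are hypothetical.

```python
def shard_for(key, n_shards):
    """Map an integer key to a shard index with a modulo rule."""
    return key % n_shards

def partition(rows, key_column, n_shards):
    """Split rows (a list of dicts) into n_shards lists according to key_column."""
    shards = [[] for _ in range(n_shards)]
    for row in rows:
        shards[shard_for(row[key_column], n_shards)].append(row)
    return shards
```

With consecutive employee codes 1 to 8 and two shards, this rule yields two balanced shards of four tuples each. A query that joins Employee and Department may still have to contact several nodes.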



Combining Replication and Sharding

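[Figure: the data is split into three shards P1, P2 and P3. Shard P1 is replicated on nodes A1, A2 and A3, shard P2 on nodes B1, B2 and B3, and shard P3 on nodes C1, C2 and C3.]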


Data Distribution Consistency in distributed databases

Distributed Transactions

Objective: guarantee ACID properties in distributed databases.

Solution: distributed transactions.
  A distributed transaction is a sequence of read/write operations that span multiple nodes.

Two-phase commit protocol.
  Prepare phase. The coordinator asks all the nodes to prepare to either commit or roll back.
  Commit phase. The coordinator asks the other nodes to commit their operations.

If even a single node fails, the whole transaction is rolled back.

Distributed transaction management involves a lot of coordination.
  Network traffic.

Ensuring strong consistency in distributed databases is not always a good idea.
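The two phases can be sketched in a few lines of Python. This is a simplified, in-memory illustration: the class Participant, its can_commit flag and the function two_phase_commit are hypothetical names, and real protocols also handle timeouts, logging and coordinator failures, which are ignored here.

```python
class Participant:
    """A node taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a node that fails to prepare
        self.state = "initial"

    def prepare(self):
        # Vote yes (True) or no (False) in the prepare phase.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Coordinator logic: commit only if every participant votes yes."""
    votes = [p.prepare() for p in participants]  # phase 1: prepare
    if all(votes):
        for p in participants:                   # phase 2: commit
            p.commit()
        return True
    for p in participants:                       # a single "no" aborts everything
        p.rollback()
    return False
```

Every transaction costs at least two rounds of messages between the coordinator and all the participants, which illustrates the coordination overhead mentioned above.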



Replication and Consistency

Replication makes consistency problems even worse.

An update on a replica takes time to propagate to the other replicas.

Update consistency (write conflict, write-write conflict).
  Pessimistic approach: use write locks.
  Optimistic approach: use version control-like solutions.

Read consistency. Two applications might read different values when they read different replicas.
  This happens only in the inconsistency window.
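One way to picture the optimistic approach is a version number attached to each value: a write is accepted only if the writer has seen the latest version. The sketch below is an illustrative assumption (class name VersionedRecord and its methods are hypothetical), not the mechanism of a specific database.

```python
class VersionedRecord:
    """A value protected by optimistic, version-based conflict detection."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        # Return the value together with the version the reader has seen.
        return self.value, self.version

    def write(self, new_value, expected_version):
        # Reject the write if someone else updated the record in the meantime.
        if expected_version != self.version:
            return False
        self.value = new_value
        self.version += 1
        return True
```

If two applications read version 0 and both try to write, only the first write succeeds; the second application must read again and retry, much like resolving a conflict in a version control system.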



Quorums (1/2)

A transactional approach to replication consistency is inefficient. Why?

Write quorum. Number of replicas QW to lock in an n-node cluster.

QW > n/2

Propagate the update to the remaining n − QW replicas.

Strong write consistency.

Weak read consistency.



Quorums (2/2)

Read quorum. Number of replicas QR to read to get a consistent read.

QR + QW > n

Quorum assembly for replicas

Assume n copies. Define a read quorum QR and a write quorum QW, i.e., the numbers of copies which must be locked for reading (QR) and writing (QW).

QW > n/2

QR + QW > n

E.g., QW = n, QR = 1 means: lock all copies for writing, read from any copy.

These conditions ensure that:
  only one write quorum at a time can be assembled;
  every QW and QR contain at least one up-to-date replica.

After assembling a (write) quorum, make all its replicas consistent, then do the operation.

E.g., n = 7, QW = 5, QR = 3.

Optimisation: after making a write quorum consistent and performing the update, background-propagate to the other replicas not in the quorum.
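The two quorum conditions are easy to check in code. The sketch below assumes that each replica is represented by a (version, value) pair, which is an illustrative choice; the function names is_valid_quorum and quorum_read are hypothetical.

```python
def is_valid_quorum(n, qw, qr):
    """Check the two conditions QW > n/2 and QR + QW > n."""
    return 2 * qw > n and qr + qw > n

def quorum_read(sample):
    """Return the value with the highest version among the replicas read.

    sample is a list of (version, value) pairs, one per replica in the read quorum.
    """
    return max(sample, key=lambda replica: replica[0])[1]
```

With n = 7, QW = 5 and QR = 3, at most two replicas can be stale after a write, so any three replicas include at least one up-to-date copy, and quorum_read returns the latest value whichever replicas are chosen.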


Data Distribution The CAP Theorem

The CAP Theorem

Consistency. “Equivalent to having a single up-to-date copy of the data” (Brewer).

Availability. A database can perform read/write operations even when some nodes fail.

Partition tolerance. The database can operate when a network partition occurs as if the partition did not happen.

Theorem (CAP, Brewer 1999)

Given the three properties of consistency, availability and partition tolerance, a networked shared-data system can have at most two of these properties.



The CAP Theorem

Sketch of the proof (Gilbert & Lynch, 2002).

Suppose that a database is partition tolerant. Two cases when a network partition occurs.

Write operations are allowed.
  Consistency cannot be guaranteed.
  AP database.

Write operations are not allowed.
  Database unavailable (write operations forbidden on reachable nodes).
  CP database.

Consistency and availability only when no network partition occurs.

Assuming absence of network partitions means that a database is not partition tolerant.



The CAP Theorem

The theorem has been largely misunderstood for years.

Common interpretation:

  Network partitions occur.
  Therefore, the CAP theorem reduces to a choice between consistency and availability.

However, network partitions are not that frequent.

It makes no sense to give up either consistency or availability.

A distributed database should detect network partitions and operate in a partition mode with:

  Reduced availability (some operations forbidden), or
  Reduced consistency.

Partition recovery to resolve the inconsistencies.



BASE Consistency Model

ACID transactions are a pessimistic consistency model.

An optimistic consistency model, used in distributed databases, is BASE.

Basic Availability (BA). The database appears to work most of the time.

Soft state (S). Write and read inconsistencies can occur.

Eventually consistent (E). The database becomes consistent at some point in time, once the updates have been propagated.


NoSQL Databases


Characteristics of NoSQL databases

NoSQL means Not Only SQL (term popularized in 2009).

NoSQL databases are not based on the relational model.

NoSQL databases are generally open-source.

NoSQL databases are cluster-oriented.

NoSQL databases tend to privilege availability over consistency.

NoSQL databases are schemaless.

NoSQL databases are classified into four families:

  Key-value databases.
  Document databases.
  Column-family databases.
  Graph databases.

The first three families are based on the notion of aggregate.


NoSQL Databases Aggregates

Aggregate

Aggregate: data structure containing the description of an entity.

All data in the same aggregate must stay in the same shard.

Operations within the same aggregate are atomic.

Schema is flexible and non-normalized.

Aggregate

{
  "codeE": 1,
  "first": "Joseph",
  "last": "Bennett",
  "position": "Office assistant",
  "salary": 55000,
  "department": [
    {
      "codeD": 14,
      "nameD": "Administration",
      "budget": 30000
    }
  ]
}

{
  "codeE": 2,
  "first": "John",
  "last": "Doe",
  "salary": 45000,
  "department": [
    {
      "codeD": 14,
      "nameD": "Administration",
      "budget": 30000
    }
  ]
}



Aggregate

Different ways to model the data.

Aggregate

{
  "codeD": 14,
  "name": "Administration",
  "budget": 30000,
  "employees": [
    {
      "codeE": 1,
      "first": "Joseph",
      ....
    },
    {
      "codeE": 7,
      "first": "Michael",
      ....
    }
  ]
}
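Moving from relational rows to the department-centric aggregate above amounts to nesting each employee inside its department. The sketch below shows one possible way in Python; the function name build_department_aggregates and the use of dictionaries for rows are assumptions for illustration.

```python
def build_department_aggregates(departments, employees):
    """Nest employee rows inside their department, one aggregate per department.

    departments and employees are lists of dicts; employees carry a codeD key.
    """
    aggregates = {d["codeD"]: {**d, "employees": []} for d in departments}
    for employee in employees:
        nested = {k: v for k, v in employee.items() if k != "codeD"}
        # The foreign key becomes implicit: the employee lives inside its department.
        aggregates[employee["codeD"]]["employees"].append(nested)
    return list(aggregates.values())
```

The employee-centric model shown earlier makes the opposite choice and copies the department into each employee, which duplicates the department data; which model is preferable depends on how the application accesses the data.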


NoSQL Databases Key-value Data Model

Key-value Data Model

Data are modeled as key-value pairs.
  Key: alphanumeric string auto-generated by the database.
  Value: an aggregate.
  Query: get a value given its key.

Fast read/write operations.

Little to no checks on integrity constraints.

Example: shopping cart.

Key-value databases: Amazon Dynamo, Voldemort, Riak, Memcached DB.

key: 01029120334 → product: 3345, product: 334561, product: 234561

key: 01029145522 → product: 221334, product: 4533319, product: 6734862
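A minimal in-memory sketch of the key-value model, using the shopping cart example. The class KeyValueStore and its put and get methods are hypothetical names; key generation with uuid is an assumption standing in for whatever scheme a real database uses.

```python
import uuid

class KeyValueStore:
    """Toy key-value store: the only query is a lookup by key."""

    def __init__(self):
        self._data = {}

    def put(self, value):
        # The store generates an alphanumeric key and returns it to the caller.
        key = uuid.uuid4().hex
        self._data[key] = value
        return key

    def get(self, key):
        # The value is opaque to the store: no query on its content is possible.
        return self._data.get(key)
```

A cart would be stored with key = store.put({"products": [3345, 334561, 234561]}) and read back with store.get(key); finding all carts that contain a given product would require scanning every value on the application side.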


NoSQL Databases Document Data Model

Document Data Model

Data are modeled as key-value pairs.
  Key: alphanumeric string auto-generated by the database.
  Value: an aggregate (called document).
  Query: get documents by key and by the values of their properties.

Example in detail: MongoDB.


NoSQL Databases Column-family Data Model

Column-family Data Model

Data are modeled as key-value pairs.
  Key: row identifier.
  Value: an aggregate, composed of one or more column families.
  Query: get a row given its key and the values of its columns.

Sharding unit: a row.

Storage unit: a column family.

Column-family databases: BigTable, HBase, Cassandra.



Column-family Data Model

Definition of a small number of column families.

As many columns as we need.

The value of a column can be an aggregate.
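The structure of a column-family table can be pictured as nested dictionaries: row key, then column family, then column. The sketch below is a simplified in-memory illustration with hypothetical names (ColumnFamilyTable, put, get); it does not reproduce the API of BigTable, HBase or Cassandra.

```python
class ColumnFamilyTable:
    """Toy column-family table: row key -> column family -> column -> value."""

    def __init__(self, families):
        # Column families are few and fixed when the table is defined.
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(family)
        # Columns are created on the fly: rows need not share the same columns.
        row = self.rows.setdefault(row_key, {f: {} for f in self.families})
        row[family][column] = value

    def get(self, row_key, family, column):
        return self.rows[row_key][family].get(column)
```

Only the column families are declared in advance; each row can then hold as many columns as needed inside each family.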


NoSQL Databases Graph Databases

Graph Databases

DBMSs specifically designed to manage and process graphs.

Two components:

  Storage engine: dictates how the graph is stored.
  Processing engine: dictates how the graph is processed.

Native storage engine. Storage is tailored to graphs.

Native processing engine. Operations optimized for graphs.

