Dynamo: Amazon’s Highly Available Key-value Store

Amir H. [email protected]

Amirkabir University of Technology (Tehran Polytechnic)

Database and Database Management System

I Database: an organized collection of data.

I Database Management System (DBMS): software that interacts with users, other applications, and the database itself to capture and analyze data.

History of Databases

I 1960s
  • Navigational data models: the hierarchical model (IMS) and the network model (CODASYL).
  • Disk-aware.

I 1970s
  • Relational data model: Edgar F. Codd's paper.
  • The logical data model is decoupled from physical storage.

I 1980s
  • Object databases: information is represented in the form of objects.

I 2000s
  • NoSQL databases: BASE instead of ACID.
  • NewSQL databases: the scalable performance of NoSQL plus ACID.

Relational Database Management Systems (RDBMSs)

I RDBMSs: the dominant technology for storing structured data in web and business applications.

I SQL is good:
  • Rich language
  • Easy to use and integrate
  • Rich toolset
  • Many vendors

I They promise: ACID

ACID Properties

I Atomicity
  • All statements in a transaction are executed, or the whole transaction is aborted without affecting the database (see the sketch below).

I Consistency
  • The database is in a consistent state before and after a transaction.

I Isolation
  • Transactions cannot see uncommitted changes in the database.

I Durability
  • Changes are written to disk before the database commits a transaction, so that committed data cannot be lost through a power failure.

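Atomicity and rollback are easy to demonstrate in code. Below is a minimal sketch using Python's built-in sqlite3 module; the accounts table and the transfer amount are invented for illustration. Either both updates of the transfer become visible, or the rollback leaves the database exactly as it was.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # opens a transaction: commit on success, rollback on exception
        conn.execute("UPDATE accounts SET balance = balance - 170 WHERE id = 'alice'")
        (balance,) = conn.execute("SELECT balance FROM accounts WHERE id = 'alice'").fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")  # abort the whole transfer
        conn.execute("UPDATE accounts SET balance = balance + 170 WHERE id = 'bob'")
except ValueError:
    pass  # the transaction was rolled back; no partial transfer is visible

print(dict(conn.execute("SELECT id, balance FROM accounts")))  # {'alice': 100, 'bob': 50}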
RDBMS is Good ...

RDBMS Challenges

I Web-based applications caused spikes in database load and data volume:
  • Internet-scale data sizes
  • High read-write rates
  • Frequent schema changes

Let’s Scale RDBMSs

I RDBMSs were not designed to be distributed.

I Possible solutions:
  • Replication
  • Sharding

Let’s Scale RDBMSs - Replication

I Master/Slave architecture

I Scales read operations

Let’s Scale RDBMSs - Sharding

I Dividing the database across many machines.

I It scales read and write operations.

I Cannot execute transactions across shards (partitions).

Scaling RDBMSs is Expensive and Inefficient

[http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]

NoSQL

I Avoidance of unneeded complexity

I High throughput

I Horizontal scalability and running on commodity hardware

I Compromising reliability for better performance

The NoSQL Movement

I It was first used in 1998 by Carlo Strozzi to name his relational database that did not expose the standard SQL interface.

I The term was picked up again in 2009 when a Last.fm developer, Johan Oskarsson, wanted to organize an event to discuss open source distributed databases.

I The name attempted to label the emergence of a growing number of non-relational, distributed data stores that often did not attempt to provide ACID.

NoSQL Cost and Performance

[http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]

RDBMS vs. NoSQL

[http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]

NoSQL Data Models

NoSQL Data Models

[http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques]

Key-Value Data Model

I Collection of key/value pairs.

I Ordered Key-Value: processing over key ranges.

I Dynamo, Scalaris, Voldemort, Riak, ...

Column-Oriented Data Model

I Similar to a key/value store, but the value can have multiple attributes (columns).

I Column: a set of data values of a particular type.

I Store and process data by column instead of row.

I BigTable, HBase, Cassandra, ...

Document Data Model

I Similar to a column-oriented store, but values can be complex documents instead of having a fixed format.

I Flexible schema.

I XML, YAML, JSON, and BSON.

I CouchDB, MongoDB, ...

{
  FirstName: "Bob",
  Address: "5 Oak St.",
  Hobby: "sailing"
}

{
  FirstName: "Jonathan",
  Address: "15 Wanamassa Point Road",
  Children: [
    {Name: "Michael", Age: 10},
    {Name: "Jennifer", Age: 8}
  ]
}

Graph Data Model

I Uses graph structures with nodes, edges, and properties to represent and store data.

I Neo4J, InfoGrid, ...

[http://en.wikipedia.org/wiki/Graph_database]

Consistency

Consistency

I Strong consistency
  • After an update completes, any subsequent access will return the updated value.

I Eventual consistency
  • Does not guarantee that subsequent accesses will return the updated value.
  • Inconsistency window.
  • If no new updates are made to the object, eventually all accesses will return the last updated value.

Quorum Model

I N: the number of nodes to which a data item is replicated.

I R: the number of nodes a value has to be read from to be accepted.

I W: the number of nodes a new value has to be written to before the write operation is finished.

I To enforce strong consistency: R + W > N (see the example below).

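The intuition behind R + W > N: any read quorum must then overlap any write quorum in at least one node, so every read sees at least one replica holding the latest write. A small illustrative check in Python (the configurations are just examples; (N, R, W) = (3, 2, 2) is the setting the Dynamo paper describes as typical):

def overlap_guaranteed(n: int, r: int, w: int) -> bool:
    # Pigeonhole argument: R + W > N forces read and write quorums to intersect.
    return r + w > n

configs = [
    (3, 2, 2),  # typical Dynamo-style setting: overlapping quorums, strong reads
    (3, 1, 1),  # fast but weak: a read may miss the latest write
    (3, 1, 3),  # read-one / write-all: strong, but writes block on every replica
]
for n, r, w in configs:
    print(f"N={n} R={r} W={w}: read/write quorums overlap = {overlap_guaranteed(n, r, w)}")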
Consistency vs. Availability

I Large-scale applications have to be reliable: availability + redundancy.

I These properties are difficult to achieve together with the ACID properties.

I The BASE approach forfeits the ACID properties of consistency and isolation in favor of availability, graceful degradation, and performance.

BASE Properties

I Basic Availability
  • Faults are possible, but they do not bring down the whole system.

I Soft-state
  • Copies of a data item may be inconsistent.

I Eventually consistent
  • Copies become consistent at some later time if there are no more updates to that data item.

CAP Theorem

I Consistency
  • Consistent state of the data after the execution of an operation.

I Availability
  • Clients can always read and write data.

I Partition Tolerance
  • The system continues to operate in the presence of network partitions.

I You can choose only two!

Dynamo

Dynamo

I Distributed key/value storage system

I Scalable

I Highly available

I CAP: favors availability and partition tolerance over strong consistency.

Design Consideration

I It sacrifices strong consistency for availability: always writable

I Incremental scalability

I Symmetry: every node should have the same set of responsibilities as its peers.

I Decentralization

I Heterogeneity

System Architecture

I Data partitioning

I Replication

I Data versioning

I Dynamo API

I Membership management

Data Partitioning

Partitioning

I If the size of the data exceeds the capacity of a single machine: partitioning.

I Sharding data (horizontal partitioning).

I Consistent hashing is one form of automatic sharding.

Consistent Hashing

I Hash both data and nodes using the same hash function into the same id space.

I partition = hash(d) mod n, where d is a data item and n is the number of nodes.

I Example: hash("Fatemeh") = 12, hash("Ahmad") = 2, hash("Seif") = 9, hash("Jim") = 14, hash("Sverker") = 4 (a ring sketch follows below).

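A rough sketch of the ring idea in Python (node names and the use of MD5 are illustrative, not Dynamo's exact scheme): nodes and keys are hashed into the same id space, and a key is stored on the first node whose position follows the key's hash when walking clockwise around the ring.

import bisect
import hashlib

def h(value: str, space: int = 2**32) -> int:
    # Hash a string into a fixed id space (MD5 is just for illustration).
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % space

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    def owner(self, key: str) -> str:
        # Walk clockwise from hash(key) to the first node position on the ring.
        idx = bisect.bisect_right(self.positions, h(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-A", "node-B", "node-C"])
for key in ["Fatemeh", "Ahmad", "Seif", "Jim", "Sverker"]:
    print(key, "->", ring.owner(key))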
Load Imbalance

I Consistent hashing may lead to imbalance:
  • Node identifiers may not be balanced.
  • Data identifiers may not be balanced.
  • Hot spots.
  • Heterogeneous nodes.

Load Balancing via Virtual Nodes

I Each physical node picks multiple random identifiers.

I Each identifier represents a virtual node.

I Each physical node runs multiple virtual nodes (see the sketch below).

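A minimal extension of the ring sketch above, assuming each physical node claims several tokens (virtual nodes); keys still go to the first token clockwise, but each physical node now owns many small, scattered ranges, which evens out the load, and a more powerful machine can simply claim more tokens.

import bisect
import hashlib
from collections import Counter

def h(value: str, space: int = 2**32) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % space

class VirtualNodeRing:
    def __init__(self, nodes, tokens_per_node: int = 16):
        # Each physical node takes several positions ("virtual nodes") on the ring.
        tokens = [(h(f"{node}#vnode{i}"), node)
                  for node in nodes for i in range(tokens_per_node)]
        self.tokens = sorted(tokens)
        self.positions = [pos for pos, _ in self.tokens]

    def owner(self, key: str) -> str:
        idx = bisect.bisect_right(self.positions, h(key)) % len(self.tokens)
        return self.tokens[idx][1]

ring = VirtualNodeRing(["node-A", "node-B", "node-C"])
load = Counter(ring.owner(f"key-{i}") for i in range(10000))
print(load)  # with enough virtual nodes, the per-node counts are roughly even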
Replication

Replication

I To achieve high availability and durability, data should be replicated on multiple nodes.

Data Versioning

Data Versioning (1/3)

I Updates are propagated asynchronously.

I Each update/modification of an item results in a new and immutable version of the data.
  • Multiple versions of an object may exist.

I Replicas eventually become consistent.

Data Versioning (2/3)

I Version branching can happen due to node/network failures.

I Use vector clocks to capture causality, in the form of (node, counter) pairs (see the sketch below).
  • If causal: the older version can be forgotten.
  • If concurrent: a conflict exists, requiring reconciliation.

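A rough sketch of vector-clock bookkeeping (dict-based; the node names are illustrative): a clock maps node to counter, one version supersedes another if its clock dominates the other component-wise, and otherwise the two versions are concurrent and must be reconciled.

def dominates(a: dict, b: dict) -> bool:
    # True if clock `a` has seen everything clock `b` has (component-wise >=).
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def relation(a: dict, b: dict) -> str:
    if dominates(a, b) and dominates(b, a):
        return "identical"
    if dominates(a, b):
        return "a supersedes b"   # b is an older version and can be forgotten
    if dominates(b, a):
        return "b supersedes a"
    return "concurrent"           # conflict: requires reconciliation

d2 = {"Sx": 2}
d3 = {"Sx": 2, "Sy": 1}
d4 = {"Sx": 2, "Sz": 1}
print(relation(d3, d2))  # a supersedes b: D3 descends from D2
print(relation(d3, d4))  # concurrent: branched versions, must be reconciled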
Data Versioning (3/3)

I Client C1 writes a new object via node Sx: version D1 with clock [(Sx, 1)].

I C1 updates the object via Sx: D2 with clock [(Sx, 2)], which supersedes D1.

I C1 updates the object via Sy: D3 with clock [(Sx, 2), (Sy, 1)].

I C2 reads D2 and updates the object via Sz: D4 with clock [(Sx, 2), (Sz, 1)]. D3 and D4 are concurrent.

I C3 reads both D3 and D4 via Sx.
  • The read context is a summary of the clocks of D3 and D4: [(Sx, 2), (Sy, 1), (Sz, 1)].

I Reconciliation: the client merges D3 and D4 and writes the result back via Sx as D5, with clock [(Sx, 3), (Sy, 1), (Sz, 1)].

Dynamo API

Dynamo API

I get(key)
  • Returns a single object, or a list of objects with conflicting versions, along with a context.

I put(key, context, object)
  • Stores the object and the context under the key.
  • The context encodes system metadata, e.g., the version number (see the sketch below).

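A toy, single-process stand-in for this interface (the class, its in-memory storage, and the node name are invented for illustration; the real system hides replication behind these two calls): get returns the stored versions plus an opaque context summarizing their clocks, and put hands that context back so the store can advance the version.

class ToyKVStore:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.data = {}                      # key -> list of (vector_clock, value)

    def get(self, key):
        versions = self.data.get(key, [])
        values = [value for _, value in versions]
        context = {}                        # summary of the clocks of all returned versions
        for clock, _ in versions:
            for node, counter in clock.items():
                context[node] = max(context.get(node, 0), counter)
        return values, context

    def put(self, key, context, value):
        clock = dict(context)               # advance this node's counter in the clock
        clock[self.node_id] = clock.get(self.node_id, 0) + 1
        self.data[key] = [(clock, value)]   # the new version supersedes the context

store = ToyKVStore("Sx")
_, ctx = store.get("cart")
store.put("cart", ctx, ["book"])
values, ctx = store.get("cart")
store.put("cart", ctx, values[0] + ["pen"])
print(store.get("cart"))                    # ([['book', 'pen']], {'Sx': 2})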
Execution of Operations

I The client can send the request:
  • Directly to the node responsible for the data (the coordinator): saves latency, but requires routing code on the client.
  • To a generic load balancer: no client-side knowledge, at the cost of an extra hop.

put Operation

I The coordinator generates a new vector clock and writes the new version locally.

I It sends the new version to the N replica nodes.

I It waits for responses from W nodes (see the sketch below).

I Using W = 1:
  • High availability for writes.
  • Low durability.

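A minimal sketch of that write path (replicas are in-process objects here, and N, W, and the failure simulation are invented for illustration): the coordinator bumps its own entry in the vector clock, writes locally, fans the version out to the other replicas, and acknowledges the client once W replicas have responded.

N, W = 3, 2

class Replica:
    def __init__(self, name, alive=True):
        self.name, self.alive, self.store = name, alive, {}

    def write(self, key, clock, value):
        if not self.alive:
            return False                 # simulate an unreachable node
        self.store[key] = (clock, value)
        return True

def coordinator_put(replicas, key, context, value):
    coordinator = replicas[0]
    clock = dict(context)
    clock[coordinator.name] = clock.get(coordinator.name, 0) + 1
    acks = 0
    for replica in replicas[:N]:         # the coordinator plus the other replicas
        if replica.write(key, clock, value):
            acks += 1
        if acks >= W:                    # reply to the client after W acks
            return True, clock
    return False, clock                  # fewer than W replicas were reachable

replicas = [Replica("Sx"), Replica("Sy", alive=False), Replica("Sz")]
print(coordinator_put(replicas, "cart", {}, ["book"]))  # (True, {'Sx': 1})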
get Operation

I The coordinator requests the existing versions from the N replica nodes.
  • It waits for responses from R nodes.

I If there are multiple versions, it returns all versions that are causally unrelated (see the sketch below).

I Divergent versions are then reconciled.

I The reconciled version is written back.

I Using R = 1:
  • High-performance read engine.

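A matching sketch of the read path (replica stores are plain dicts and R is illustrative; clocks reuse the dominance test from the vector-clock sketch): the coordinator collects versions from R reachable replicas and returns only those not superseded by another collected version, i.e., the causally unrelated ones.

R = 2

def dominates(a, b):
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def coordinator_get(replica_stores, key):
    collected = []
    for store in replica_stores:
        if store is None:                # simulate an unreachable node
            continue
        if key in store:
            collected.append(store[key])
        if len(collected) >= R:          # stop once R replicas have answered
            break
    # Keep only versions that no other collected version supersedes.
    return [(clock, value) for clock, value in collected
            if not any(dominates(other, clock) and other != clock
                       for other, _ in collected)]

stores = [
    {"cart": ({"Sx": 2, "Sy": 1}, ["book"])},           # holds D3
    None,                                               # unreachable replica
    {"cart": ({"Sx": 2, "Sz": 1}, ["book", "pen"])},    # holds D4
]
print(coordinator_get(stores, "cart"))
# Both D3 and D4 are returned: they are concurrent, so the client must reconcile them.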
Membership Management

Membership Management

I Administrator explicitly adds and removes nodes.

I Gossiping to propagate membership changes.
  • Eventually consistent view.
  • O(1)-hop overlay.

Adding Nodes

I A new node X is added to the system.
  • X is assigned key ranges w.r.t. its virtual servers.
  • For each key range, the nodes that currently store it transfer the data items to X.

Removing Nodes

I Reallocation of keys is the reverse of the process for adding nodes.

Failure Detection

I Passive failure detection.
  • Pings are used only to detect that a failed node has become alive again.

I In the absence of client requests, node A does not need to know whether node B is alive.

Handling Permanent Failure

I Anti-entropy for replica synchronization.

I Use Merkle trees for fast inconsistency detection and minimum transfer of data (see the sketch below).
  • Nodes maintain a Merkle tree for each key range.
  • They exchange the roots of the Merkle trees to check whether the key ranges are up to date.

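A compact sketch of the idea (the hashing and the tree layout are simplified; a real implementation hashes key ranges rather than a fixed list of keys): each replica builds a binary hash tree over its key range, the roots are compared first, and only the subtrees whose hashes differ are explored further, so very little data crosses the network.

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(leaves):
    # Return the tree as a list of levels: leaf hashes first, root last.
    level = [h(leaf) for leaf in leaves]
    tree = [level]
    while len(level) > 1:
        level = [h(b"".join(level[i:i + 2])) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree

def differing_leaves(tree_a, tree_b):
    # Compare two equally shaped trees top-down; return indexes of differing leaves.
    if tree_a[-1] == tree_b[-1]:
        return []                        # roots match: the replicas are in sync
    suspect = {0}                        # differing node indexes at the current level
    for depth in range(len(tree_a) - 2, -1, -1):
        suspect = {child for i in suspect for child in (2 * i, 2 * i + 1)
                   if child < len(tree_a[depth]) and tree_a[depth][child] != tree_b[depth][child]}
    return sorted(suspect)

keys = ["k0", "k1", "k2", "k3"]
replica_a = {"k0": b"v0", "k1": b"v1", "k2": b"v2", "k3": b"v3"}
replica_b = {"k0": b"v0", "k1": b"old", "k2": b"v2", "k3": b"v3"}
tree_a = build_merkle([replica_a[k] for k in keys])
tree_b = build_merkle([replica_b[k] for k in keys])
print(differing_leaves(tree_a, tree_b))  # [1] -> only key k1 needs to be synchronized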
Handling Transient Failures

I Due to partitions, quorums might not exist.
  • Sloppy quorum: create transient replicas on the first N healthy nodes from the preference list.
  • Reconcile after the partition heals (see the sketch below).

I Say A is unreachable.

I A put that would go to A will use D instead.

I Later, D detects that A is alive again.
  • It sends the replica to A.
  • It then removes its local copy of the replica.

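A toy sketch of this sloppy-quorum hand-back behaviour (node names, the preference list, and the hint bookkeeping are invented for illustration): when an intended replica is unreachable, the write spills over to the next healthy node, which remembers whom the data belongs to and hands it back once that node recovers.

N, W = 3, 2

def sloppy_put(preference_list, alive, key, value, stores, hints):
    # Write to the first N healthy nodes; record a hint on any stand-in node.
    written = []
    for node in preference_list:
        if len(written) == N:
            break
        if not alive[node]:
            continue
        stores[node][key] = value
        down_owners = [n for n in preference_list[:N] if not alive[n]]
        if node not in preference_list[:N] and down_owners:
            hints.setdefault(node, []).append((down_owners[0], key))
        written.append(node)
    return len(written) >= W

def handoff(node, alive, stores, hints):
    # When a skipped owner comes back, hand its data over and drop the local copy.
    for owner, key in hints.pop(node, []):
        if alive[owner]:
            stores[owner][key] = stores[node].pop(key)

stores = {name: {} for name in "ABCD"}
alive = {"A": False, "B": True, "C": True, "D": True}
hints = {}
print(sloppy_put(list("ABCD"), alive, "cart", ["book"], stores, hints))  # True
alive["A"] = True
handoff("D", alive, stores, hints)
print(sorted(stores["A"]), sorted(stores["D"]))  # ['cart'] [] -- D handed the data back to A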
Summary

Summary

I NoSQL data models: key-value, column-oriented, document-oriented, graph-based

I Sharding and consistent hashing

I ACID vs. BASE

I CAP (Consistency vs. Availability)

Summary

I Dynamo: key/value storage with put and get

I Data partitioning: consistent hashing

I Load balancing: virtual nodes

I Replication: several nodes, preference list

I Data versioning: vector clocks, conflicts resolved at read time by the application

I Membership management: join/leave by the admin, gossip-based updates of the nodes' views, pings to detect failures

I Handling transient failure: sloppy quorum

I Handling permanent failure: Merkle tree

Questions?
