Replication and Synchronization Algorithms for Distributed Databases - Lena Wiese

Georg-August-Universität Göttingen, Institut für Informatik

Replication and Synchronization Algorithms for Distributed Databases

Lena Wiese

Research Group Knowledge Engineering, Institut für Informatik

Uni Göttingen

Sep 19th, 2015

Lena Wiese Replication and Synchronization Algorithms for Distributed Databases 1 / 43

Short CV Dr. Lena Wiese

University of Göttingen (Research Group Leader Knowledge Engineering)

University of Hildesheim (Visiting Professor for Databases)

National Institute of Informatics, Tokyo, Japan

Robert Bosch India Ltd., Bangalore, India

Master/PhD: TU Dortmund

Teaching and Research

NoSQL databases (lecture, seminars, projects)
Database security (encryption for Cassandra and HBase)

Web: http://wiese.free.fr/

Book / Workshop

Master's level textbook (in English):
Lena Wiese: Advanced Data Management for SQL, NoSQL, Cloud and Distributed Databases
© 2015 De Gruyter/Oldenbourg

(several pictures in this talk taken from the book)

Workshop NoSQL-Net

organized jointly with Irena Holubova (Charles University, Prague) and Valentina Janev (Instituto Mihailo Pupin, Belgrade)
3rd edition to be organized at the DEXA conference 2016
We welcome industrial application papers

Outline

1 Distributed Database Systems

2 Epidemic Protocols

3 Data Allocation and Replication

4 Multiversion Concurrency Control

5 Version Vectors

6 Quorums and Consensus

7 Implementations

8 Conclusion

Distributed Database Systems

Scaling up (vertical scaling) vs Scaling out (horizontal scaling)

Scaling out on commodity hardware in a shared-nothing architecture

Distributed Database Systems

A distributed database is a collection of data records that are physically distributed on several servers in a network while logically they belong together.

Distribution Transparency

For a user it must be transparent how the DBMS internally handles data storage and query processing in a distributed manner.

Access transparency: access independent of the structure of the network or the storage organization

Location transparency: data distribution hidden from the user

Replication transparency: user may access any replica

Fragmentation transparency: fragmentation/sharding handled internally; user can query the database as if it contained the unfragmented data set

Migration transparency: data reorganization does not affect data access

Concurrency transparency: operations of multiple users must not interfere

Failure transparency: continue processing user requests even in the presence of failures

Epidemic Protocols

Propagation of information (such as membership lists or data updates) in the network of database servers is difficult:

message loss
high churn: change due to insertions or removals of servers

Epidemic protocols spread new information like an infection or like gossip:

infected nodes have received a new message that they want to pass on to others
susceptible nodes have so far not received the new message
removed nodes have received the message but are no longer willing to pass it on

Variants of Epidemic Protocols

Anti-entropy: periodic task (for example, run once every minute)

Anti-entropy is a simple epidemic: any server is either susceptible or infective (there are no removed servers)
the infection process does not degrade over time or due to some probabilistic decision

Rumor spreading: the infection can be triggered by the arrival of a new message (in which case the server becomes infective); or it can be run periodically

Rumor spreading is a complex epidemic: the infection proceeds in several rounds
The number of infections decreases with every round as the number of removed servers grows.
probabilistic: After each exchange with another server, the server stops being infective with a certain probability
counter-based: After a certain number k of exchanges, the server stops being infective

Alan J. Demers, Daniel H. Greene, Carl Hauser, Wes Irish, John Larson, Scott Shenker, Howard E. Sturgis, Daniel C. Swinehart, and Douglas B. Terry. Epidemic algorithms for replicated database maintenance. Operating Systems Review, 22(1):8-32, 1988.
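
The two rumor-spreading variants can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not part of the talk: servers are list indices, every infective server contacts one uniformly random peer per round, and the function name `rumor_spreading` and its parameters are hypothetical.

```python
import random

def rumor_spreading(n_servers, k=None, stop_prob=None, seed=0):
    """Simulate rumor spreading among n_servers (at least 2).

    Exactly one stopping rule should be given:
      k         -- counter-based: stop being infective after k exchanges
      stop_prob -- probabilistic: stop after each exchange with this probability
    Returns the number of servers that never received the message.
    """
    if k is None and stop_prob is None:
        raise ValueError("give k or stop_prob, otherwise servers never stop")
    rng = random.Random(seed)
    state = ["susceptible"] * n_servers
    exchanges = [0] * n_servers
    state[0] = "infective"  # server 0 receives the new message first

    while "infective" in state:
        for s in range(n_servers):
            if state[s] != "infective":
                continue
            # pick a random peer other than s and pass the message on
            peer = rng.choice([p for p in range(n_servers) if p != s])
            if state[peer] == "susceptible":
                state[peer] = "infective"
            exchanges[s] += 1
            # decide whether s loses interest and becomes removed
            if k is not None and exchanges[s] >= k:
                state[s] = "removed"
            elif stop_prob is not None and rng.random() < stop_prob:
                state[s] = "removed"
    return state.count("susceptible")
```

Calling `rumor_spreading(50, k=3)` or `rumor_spreading(50, stop_prob=0.5)` returns how many of the 50 servers were missed. A nonzero result is why rumor spreading is typically backed up by a periodic anti-entropy run.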

Hash Trees

Hash a list of messages level-wise

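Quick detection of identical message lists
Detection of different subtrees

Figure: hash tree over a message list. The leaf hashes Hash1 to Hash4 are computed from Message 1 to Message 4; Hash1,2 combines Hash1 and Hash2, Hash3,4 combines Hash3 and Hash4, and the top hash Hash1,2,3,4 combines Hash1,2 and Hash3,4.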

BUT: all servers need the same sorting order in their message lists

Ralph C. Merkle. A digital signature based on a conventional encryption function. In International Cryptology Conference (CRYPTO), pages 369-378. Springer, 1987.
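
A level-wise hash comparison might look like the following sketch. It assumes SHA-256 as the hash function, string messages, and the common convention of duplicating the last hash on odd-sized levels; none of these choices comes from the talk, and the function names are hypothetical.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(messages):
    """Return all tree levels: leaf hashes first, the top hash last."""
    if not messages:
        raise ValueError("message list must not be empty")
    level = [_h(m.encode()) for m in messages]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # odd size: pair the last hash with itself
            level = level + [level[-1]]
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def top_hash(messages):
    return build_levels(messages)[-1][0]

def differing_leaves(a, b):
    """Indices of differing messages, descending only into unequal subtrees.

    Assumes both lists have the same length and the same sorting order.
    """
    if len(a) != len(b):
        raise ValueError("lists must have the same length")
    la, lb = build_levels(a), build_levels(b)
    todo = [(len(la) - 1, 0)]              # start at the top hash
    diff = []
    while todo:
        lvl, i = todo.pop()
        if la[lvl][i] == lb[lvl][i]:
            continue                       # identical subtree: skip it entirely
        if lvl == 0:
            diff.append(i)
        else:
            for child in (2 * i, 2 * i + 1):
                if child < len(la[lvl - 1]):
                    todo.append((lvl - 1, child))
    return sorted(diff)
```

Two servers would first exchange only `top_hash`; equal top hashes mean identical lists. The sketch also makes the caveat about sorting order visible: the same messages in a different order produce a different top hash.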

Data Allocation and Replication

Consistent Hashing

Allocation of servers and data items on a hash ring

Based on hash of server name and file name or key/ID

Allocate file to next (clock-wise) server on hash ring

Can handle churn

David R. Karger, Eric Lehman, Frank Thomson Leighton, Rina Panigrahy, Matthew S. Levine, and Daniel Lewin. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In Frank Thomson Leighton and Peter W. Shor, editors, Twenty-Ninth Annual ACM Symposium on the Theory of Computing, pages 654-663, 1997.

Figure: hash ring with servers server1 (hash 2), server2 (hash 7), server3 (hash 11) and files file1 (hash 0), file5 (hash 1), file3 (hash 3), file7 (hash 5), file6 (hash 6), file4 (hash 8), file2 (hash 10).

Figure: hash ring with servers server1 (hash 2), server4 (hash 4), server2 (hash 7), server3 (hash 11), where server4 is newly added, and files file1 (hash 0), file5 (hash 1), file3 (hash 3), file7 (hash 5), file6 (hash 6), file4 (hash 8), file2 (hash 10).

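Consistent Hashing with Virtual Servers

Figure: hash ring with servers server1 (hash 2), server2 (hash 7), server3 (hash 11), and server4 placed twice as the virtual servers server4a (hash 4) and server4b (hash 9); files file1 (hash 0), file5 (hash 1), file3 (hash 3), file7 (hash 5), file6 (hash 6), file4 (hash 8), file2 (hash 10).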
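
The allocation rule can be traced with the ring positions from the figures. The sketch below makes a few assumptions that the slides do not state: positions are plain integers rather than real hash values, "clockwise" means increasing position with wrap-around, and no two servers share a position. The class `HashRing` and its methods are hypothetical names.

```python
import bisect

class HashRing:
    def __init__(self):
        self._positions = []   # sorted ring positions of (virtual) servers
        self._owner = {}       # ring position -> physical server name

    def add_server(self, name, position):
        """Place a server on the ring; call several times for virtual servers."""
        bisect.insort(self._positions, position)
        self._owner[position] = name

    def lookup(self, key_hash):
        """Return the next server clockwise from key_hash."""
        i = bisect.bisect_left(self._positions, key_hash)
        if i == len(self._positions):      # past the last server: wrap around
            i = 0
        return self._owner[self._positions[i]]

# File hashes as shown in the figures.
FILES = {"file1": 0, "file5": 1, "file3": 3, "file7": 5,
         "file6": 6, "file4": 8, "file2": 10}

ring = HashRing()
for name, pos in [("server1", 2), ("server2", 7), ("server3", 11)]:
    ring.add_server(name, pos)
before = {f: ring.lookup(h) for f, h in FILES.items()}

ring.add_server("server4", 4)              # new server joins at hash 4
after_join = {f: ring.lookup(h) for f, h in FILES.items()}

ring.add_server("server4", 9)              # second virtual position at hash 9
with_virtual = {f: ring.lookup(h) for f, h in FILES.items()}
```

Under these assumptions only file3 moves (from server2 to server4) when server4 joins, and the second virtual position additionally takes over file4 from server3. All other files stay where they were, which is the sense in which consistent hashing handles churn.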

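Master-Slave Replication

Figure: one master and two slaves. Writes go to the master, reads are served by the master and by the slaves, and the master sends updates to both slaves.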

Multi-Master Replication

Figure: three masters. Each master accepts writes and serves reads, and the masters synchronize with one another.

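Multiversion Concurrency Control

Non-blocking reads

Figure: timeline t0 to t3 for User 1 and User 2 on a record with versions v=0 ("abc"), v=1 ("uvw") and v=2 ("xyz"). The current version advances from v-current=v0 to v-current=v1 to v-current=v2. One sequence of operations is read v0, read v0, read v0, read v1, write v2; the other is read v0, write v1.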

Multiversion Concurrency Control

Abort transaction on concurrent write

Figure: timeline t0 to t3 for User 1 and User 2 on a record with version v=0 ("abc"). Both users read v0 (v-read=v0). One user's write creates v=1-1 ("xyz"), so that v-current=v1-1. The other user's write of v=1-2 ("ijk") then causes a conflict because v-read < v-current.
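
The conflict rule v-read < v-current can be sketched as an optimistic version check. The code below is an illustration under assumptions not taken from the talk: versions are consecutive integers, all versions are kept in a list, and the names `MVCCRecord` and `VersionConflict` are hypothetical.

```python
class VersionConflict(Exception):
    """Raised when a write is based on an outdated version."""

class MVCCRecord:
    def __init__(self, value):
        self.versions = [value]            # list index = version number

    @property
    def current(self):
        return len(self.versions) - 1      # v-current

    def read(self, version=None):
        """Non-blocking read: returns (version, value); old versions stay readable."""
        v = self.current if version is None else version
        return v, self.versions[v]

    def write(self, value, v_read):
        """Append a new version unless another write happened since v_read."""
        if v_read < self.current:
            raise VersionConflict(f"v-read={v_read} < v-current={self.current}")
        self.versions.append(value)
        return self.current

rec = MVCCRecord("abc")
v_user1, _ = rec.read()                    # both users read v0
v_user2, _ = rec.read()
rec.write("xyz", v_user1)                  # first write succeeds, v-current becomes 1
try:
    rec.write("ijk", v_user2)              # second write is based on stale v0
    outcome = "committed"
except VersionConflict:
    outcome = "aborted"
```

After this run `outcome` is "aborted", and `rec.read(0)` still returns the old value "abc", which is what makes reads non-blocking.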

Version Vectors

Logical counter for each client

Option 1: Automatic synchronization with union semantics (e.g. shopping cart)

Figure: Replica 1 and Replica 2 with Client a, Client b and Client c. Replica 1 shows the version vectors (a:0, b:0, c:0), (a:1, b:0, c:0), (a:1, b:1, c:0), (a:1, b:1, c:1), (a:1, b:1, c:1); Replica 2 shows (a:0, b:0, c:0), (a:0, b:1, c:0), (a:1, b:1, c:0), (a:1, b:1, c:1). The operations shown are read, write, union and synch.

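Version Vectors

Logical counter for each client

Option 2: Sibling versions and reconciliation by client context

Figure: Replica 1 and Replica 2 with Client a, Client b and Client c. Starting from (a:0, b:0, c:0), the writes produce (a:1, b:0, c:0) on Replica 1 and (a:0, b:1, c:0) on Replica 2. After synchronization the two versions (a:1, b:0, c:0) and (a:0, b:1, c:0) are kept as siblings. A read followed by a reconciling write yields (a:1, b:1, c:1), which a further synch propagates. The operations shown are read, write, siblings, synch and reconciling write.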

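Version Vectors

Caution: Logical counter for each replica leads to lost updates!

Figure: Replica 1 and Replica 2 with Client a and Client b. The version vectors shown are (rep1:0, rep2:0), (rep1:1, rep2:0) and (rep1:2, rep2:0), and for the write based on the outdated read (rep1:0, rep2:0) the undetermined vector (rep1:?, rep2:0). The operations shown are read and write for each client.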
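
The comparison that distinguishes a newer version from concurrent siblings can be sketched as follows. Vectors are modeled as dictionaries from client name to counter, with missing entries treated as 0; the function names `compare` and `merge` are hypothetical and not taken from any particular database.

```python
def compare(v1, v2):
    """Return 'equal', 'before', 'after' or 'concurrent' for two version vectors."""
    keys = set(v1) | set(v2)
    le = all(v1.get(k, 0) <= v2.get(k, 0) for k in keys)
    ge = all(v1.get(k, 0) >= v2.get(k, 0) for k in keys)
    if le and ge:
        return "equal"
    if le:
        return "before"        # v2 supersedes v1
    if ge:
        return "after"         # v1 supersedes v2
    return "concurrent"        # neither dominates: siblings

def merge(v1, v2):
    """Entry-wise maximum, used when two versions are synchronized."""
    return {k: max(v1.get(k, 0), v2.get(k, 0)) for k in set(v1) | set(v2)}

start = {"a": 0, "b": 0, "c": 0}
by_a = {"a": 1, "b": 0, "c": 0}    # client a wrote on Replica 1
by_b = {"a": 0, "b": 1, "c": 0}    # client b wrote on Replica 2
relation = compare(by_a, by_b)
merged = merge(by_a, by_b)
```

Here `relation` is "concurrent", so the two writes are either merged automatically (Option 1) or kept as siblings for the client to reconcile (Option 2); `merged` is the vector (a:1, b:1, c:0).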

Quorums and Consensus

Write and Read quorums

Use quorums to ensure consistency

A read quorum is defined as a subset of replicas that have to be contacted when reading data; for a successful read, all replicas in a read quorum have to return the same answer value.

A write quorum is defined as a subset of replicas that have to be contacted when writing data; for a successful write, all replicas in the write quorum have to acknowledge the write request.

Quorum rules for N replicas, read quorum size R, write quorum size W

R + W > N (consistent reads)
2W > N (strongly consistent writes)

Partial quorums only ensure weak consistency (for example, during failures)
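
The two quorum rules can be checked mechanically. The sketch below evaluates both inequalities and, as a brute-force sanity check for small N, verifies that every read quorum of size R shares at least one replica with every write quorum of size W. The function names are hypothetical.

```python
from itertools import combinations

def check_quorum(n, r, w):
    """Evaluate the two quorum rules for n replicas."""
    return {"consistent_reads": r + w > n, "consistent_writes": 2 * w > n}

def quorums_always_overlap(n, r, w):
    """True if every size-r read quorum intersects every size-w write quorum."""
    replicas = range(n)
    return all(set(rq) & set(wq)
               for rq in combinations(replicas, r)
               for wq in combinations(replicas, w))

rowa = check_quorum(5, 1, 5)        # read one, write all
majority = check_quorum(5, 3, 3)    # majority quorums
partial = check_quorum(5, 2, 3)     # R + W = N: reads may miss the latest write
```

For N = 5 both ROWA and majority quorums satisfy both rules, while R = 2, W = 3 violates R + W > N, and `quorums_always_overlap(5, 2, 3)` is false accordingly.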

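Quorums

Example 1: ROWA, read one write all (left)
Example 2: Majority quorums (right)

Figure: two diagrams, each with five servers S1 to S5 and a marked read quorum and write quorum. Left (ROWA): the read quorum is a single server and the write quorum contains all servers. Right (majority): read quorum and write quorum each contain a majority of the servers.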

Hinted Handoff and Read Repair

Hinted handoff: handle temporary failures

if a replica is unavailable, write requests are delegated to another available server
hint that these requests should be relayed to the replica server as soon as possible

Read repair: a coordinator node sends out the read request to a couple of replicas

coordinator contacts a set of replicas larger than the needed quorum
returns the majority response to the client
sends repair (update) instructions to those replicas that are not yet synchronized
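
The read repair steps above can be sketched in a few lines. The sketch assumes that replicas are plain dictionaries, that values are hashable, and that the majority answer is the right one; real systems usually compare versions or timestamps as well. The function name `read_with_repair` is hypothetical.

```python
from collections import Counter

def read_with_repair(replicas, key):
    """Read key from all given replicas, return the majority value, repair the rest.

    replicas -- list of dicts standing in for replica stores
    Returns (majority_value, number_of_repaired_replicas).
    """
    answers = [rep.get(key) for rep in replicas]       # None if the key is missing
    majority_value, _ = Counter(answers).most_common(1)[0]
    repaired = 0
    for rep, answer in zip(replicas, answers):
        if answer != majority_value:
            rep[key] = majority_value                  # repair (update) instruction
            repaired += 1
    return majority_value, repaired

replicas = [{"k": "new"}, {"k": "new"}, {"k": "old"}]
value, repaired = read_with_repair(replicas, "k")
```

In this run the client receives "new" and the third replica is updated in passing, so a later read finds all three replicas synchronized.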

2-Phase Commit

Execution of a distributed transaction where all agents have to acknowledge a successful finalization of the transaction

2PC has a voting phase and a decision phase

Failure of a single agent will lead to a global abort of the transaction

The state between the two phases, before the coordinator sends its decision to all agents, is called the in-doubt state

If the coordinator irrecoverably fails before sending its decision, the agents can neither commit nor abort the transaction.

Jim Gray. Notes on Data Base Operating Systems. In: M. J. Flynn, J. N. Gray, A. K. Jones, K. Lagally, H. Opderbeck, G. J. Popek, B. Randell, J. H. Saltzer, H. R. Wehle: Operating Systems: An Advanced Course. Lecture Notes in Computer Science, volume 60, pp. 393-481. Springer, 1978.

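Figure: message sequence between the coordinator and agent 1, agent 2, agent 3 for a successful 2-phase commit. Phase 1: the coordinator sends prepare and each agent answers ready. Phase 2: the coordinator sends commit and each agent answers acknowledge commit.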

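Figure: message sequence between the coordinator and agent 1, agent 2, agent 3 for an aborted 2-phase commit. Phase 1: the coordinator sends prepare; one agent answers failed, the other two answer ready. Phase 2: the coordinator sends abort and the agents answer acknowledge abort.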
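
The voting and decision phases can be condensed into a small coordinator function. This is a sketch only: agents are modeled as callables that return True for ready and False for failed, an exception counts as a failed vote, and timeouts, logging and coordinator recovery (the in-doubt state) are left out. The function name `two_phase_commit` is hypothetical.

```python
def two_phase_commit(agents):
    """Run 2PC over agents, a dict mapping agent name to a prepare() callable.

    Returns (decision, votes, acknowledgements).
    """
    # Phase 1 (voting): send prepare to every agent and collect the votes.
    votes = {}
    for name, prepare in agents.items():
        try:
            votes[name] = bool(prepare())
        except Exception:
            votes[name] = False            # an unreachable agent counts as failed

    # Phase 2 (decision): commit only if every agent voted ready.
    decision = "commit" if all(votes.values()) else "abort"
    acks = {name: f"acknowledge {decision}" for name in agents}
    return decision, votes, acks

all_ready = {"agent 1": lambda: True, "agent 2": lambda: True, "agent 3": lambda: True}
one_failed = {"agent 1": lambda: False, "agent 2": lambda: True, "agent 3": lambda: True}

commit_run = two_phase_commit(all_ready)
abort_run = two_phase_commit(one_failed)
```

The first run decides commit, the second decides abort although two of three agents were ready, matching the rule that a single failing agent aborts the whole transaction.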

Paxos

The Paxos algorithm can be applied to keep a distributed DBMS in a consistent state under certain failures

A client can issue a read request
Database servers have to come to a consensus on what the current state (and hence the most recent value) of the record is

Database servers as agents

Proposer
Leader
Acceptor
Learner

Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2):133–169, 1998.
Leslie Lamport. Generalized consensus and Paxos. Technical Report MSR-TR-2005-33, Microsoft Research, 2005.

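Figure: Paxos message sequence among client, leader, acceptor 1, acceptor 2, acceptor 3 and learner. The client sends request to the leader. Phase 1: the leader sends prepare(propNum) and the acceptors answer promise(propNum, maxAcceptPropNum, value). Phase 2: the leader sends accept(propNum, chosenValue) and the acceptors send accepted(propNum, value). Finally return(value) is sent to the client.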

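Paxos with minority of failing acceptors

Figure: message sequence among client, leader, acceptor 1, acceptor 2, acceptor 3 and learner when a minority of the acceptors fails. The messages are request, prepare(propNum) and promise(propNum, maxAcceptPropNum, value) in Phase 1, accept(propNum, chosenValue) and accepted(propNum, value) in Phase 2, and finally return(value).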

Paxos with majority of failing acceptors

Figure: message sequence among client, leader, acceptor 1 to acceptor 5 and learner. After request, a first round with proposal number p1 runs Phase 1 with prepare(p1) and promise(p1, maxAcceptPropNum, value) and Phase 2 with accept(p1, chosenValue) and accepted(p1, value). A second round with p1 < p2 repeats Phase 1 with prepare(p2) and promise(p2, maxAcceptPropNum, value) and Phase 2 with accept(p2, chosenValue) and accepted(p2, value), and ends with return(value).

Paxos with dueling proposers

Figure: message sequence among client, leader1, leader2, acceptor 1, acceptor 2, acceptor 3 and learner, with proposal numbers p1 < p2 < p3 < p4. The messages in order are: request; prepare(p1); promise(p1, maxAcceptPropNum, value); prepare(p2); promise(p2, maxAcceptPropNum, value); accept(p1, chosenValue); Notaccepted(p1); prepare(p3); promise(p3, maxAcceptPropNum, value); accept(p2, chosenValue); Notaccepted(p2); prepare(p4); promise(p4, maxAcceptPropNum, value).
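
The acceptor behaviour behind these message sequences can be sketched for a single value (single-decree Paxos). The sketch runs in one process without a network, learners or failures, and the names `Acceptor` and `propose` as well as the tuple-shaped replies are hypothetical; only the message names follow the figures.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1           # highest proposal number promised so far
        self.accepted_num = -1       # maxAcceptPropNum (-1: nothing accepted yet)
        self.accepted_value = None

    def prepare(self, prop_num):
        """Phase 1: promise to ignore proposals numbered below prop_num."""
        if prop_num > self.promised:
            self.promised = prop_num
            return ("promise", prop_num, self.accepted_num, self.accepted_value)
        return ("reject", prop_num)

    def accept(self, prop_num, value):
        """Phase 2: accept unless a higher-numbered promise was given meanwhile."""
        if prop_num >= self.promised:
            self.promised = prop_num
            self.accepted_num = prop_num
            self.accepted_value = value
            return ("accepted", prop_num, value)
        return ("notaccepted", prop_num)

def propose(acceptors, prop_num, value):
    """Run both phases as leader; return the chosen value, or None on failure."""
    majority = len(acceptors) // 2 + 1
    promises = [r for r in (a.prepare(prop_num) for a in acceptors)
                if r[0] == "promise"]
    if len(promises) < majority:
        return None
    # Adopt the value of the highest-numbered proposal already accepted, if any.
    prior = max(promises, key=lambda p: p[2])
    chosen = prior[3] if prior[2] >= 0 else value
    replies = [a.accept(prop_num, chosen) for a in acceptors]
    accepted = sum(1 for r in replies if r[0] == "accepted")
    return chosen if accepted >= majority else None
```

With three fresh acceptors, `propose(acceptors, 1, "x")` returns "x", and a later `propose(acceptors, 2, "y")` also returns "x" because the value already accepted is adopted. The dueling situation arises when a second leader's prepare with a higher number arrives between the first leader's two phases: the first leader's accept is then answered with notaccepted, and the leaders can keep preempting each other indefinitely.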

Implementations

Amazon Dynamo

Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. In Symposium on Operating Systems Principles (SOSP), pages 205–220. ACM, 2007.

Riak

Uses SHA1 for consistent hashing with virtual servers to distribute the load

Uses “dotted” version vectors (no locks) with sibling semantics

Exposes them to the users as a context for client-side conflict resolution

Uses hinted handoff (temporary and permanent “ownership transfer”) and read repair

Uses anti-entropy with hash trees as a background task to detect inconsistent data

http://basho.com/posts/technical/why-riak-just-works/

MongoDB

Uses master-slave replication

Replica set with one primary and several secondaries

If the primary fails, a new one is elected from the replica set

Heartbeat timeout: by default 10 seconds

Write concern is used to configure the write quorum (0,1,majority,...)

Cassandra

“A true masterless architecture”

Supports consistent hashing with RandomPartitioner

Uses read repair and hinted handoff

Write quorums can be configured (“tunable consistency”: ONE, QUORUM, or ALL)

Conclusion

Modern distributed databases provide a revival / an implementation of established ideas on novel hardware.

Thank you!
