View
692
Download
1
Category
Preview:
Citation preview
How Raft consensus algorithm will make replication even better in MongoDB 3.2
Speaker: Henrik Ingo, @h_ingoAuthor: Spencer T Brody, @stbrody
Agenda
• Introduction to consensus• Leader-based replicated state machine• Elections and data replication in MongoDB• Improvements coming in MongoDB 3.2
What is Consensus?
• Getting multiple processes/servers to agree on state machine state
• Must handle a wide range of failure modes• Disk failure• Network partitions• Machine freezes• Clock skews• Bugs?
Leader Based Consensus
State Machine
X 3
Y 2
Z 7
Replicated Log
X 1⬅️
Y 2⬅️
X 3⬅️
State Machine
X 3Y 2
Z 7
Replicated Log
X 1⬅️Y 2⬅️
X 3⬅️
State Machine
X 3Y 2
Z 7
Replicated Log
X 1⬅️Y 2⬅️
X 3⬅️
Agenda
• Introduction to consensus• Leader-based replicated state machine• Elections and data replication in MongoDB• Improvements coming in MongoDB 3.2
Elections
X 3
Y 2
Z 7
X 1⬅️
Y 2⬅️
X 3⬅️
X 3
Y 2
Z 7
X 1⬅️
Y 2⬅️
X 3⬅️X 3
Y 2
Z 7
X 1⬅️
Y 2⬅️
X 3⬅️
X 3
Y 2
Z 7
X 1⬅️
Y 2⬅️
X 3⬅️
X 3
Y 2
Z 7
X 1⬅️
Y 2⬅️
X 3⬅️
Elections
X 3
Y 2
Z 7
X 3
Y 2
Z 7X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 1⬅️
Y 2⬅️
X 3⬅️
X 1⬅️
Y 2⬅️
X 3⬅️X 1⬅️
Y 2⬅️
X 3⬅️
X 1⬅️
Y 2⬅️
X 3⬅️
X 1⬅️
Y 2⬅️
X 3⬅️
Elections
X 3
Y 2
Z 7
X 3
Y 2
Z 7X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 1⬅️
Y 2⬅️
X 3⬅️
X 1⬅️
Y 2⬅️
X 3⬅️X 1⬅️
Y 2⬅️
X 3⬅️
X 1⬅️
Y 2⬅️
X 3⬅️
X 1⬅️
Y 2⬅️
X 3⬅️
Elections
X 3
Y 2
Z 7
X 3
Y 2
Z 7X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 1⬅️
Y 2⬅️
X 3⬅️
X 1⬅️
Y 2⬅️
X 3⬅️X 1⬅️
Y 2⬅️
X 3⬅️
X 1⬅️
Y 2⬅️
X 3⬅️
X 1⬅️
Y 2⬅️
X 3⬅️
Elections
X 3
Y 2
Z 7
X 3
Y 2
Z 7X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 1⬅️
Y 2⬅️
X 3⬅️
X 1⬅️
Y 2⬅️
X 3⬅️X 1⬅️
Y 2⬅️
X 3⬅️
X 1⬅️
Y 2⬅️
X 3⬅️
X 1⬅️
Y 2⬅️
X 3⬅️
Key requirements for elections
• There must be exactly 1 leader at any time• Majority election: there can only be 1 majority
• Note: 50% is not a majority!• Old leader must know to step down, even if
disconnected
Agenda
• Introduction to consensus• Leader-based replicated state machine• Elections and data replication in MongoDB• Improvements coming in MongoDB 3.2
Agenda
• Introduction to consensus• Leader-based replicated state machine• Elections and data replication in MongoDB• Improvements coming in MongoDB 3.2
• Goals and inspiration from Raft Consensus Algorithm
• Preventing double voting• Monitoring node status• Calling for elections
Goals for MongoDB 3.2
• Decrease failover time• Speed up detection and resolution of false
primary situations
Finding Inspiration in Raft
• “In Search of an Understandable Consensus Algorithm” by Diego Ongaro: https://ramcloud.stanford.edu/raft.pdf
• Designed to address the shortcomings of Paxos• Easier to understand• Easier to implement in real applications
• Provably correct• Remarkably similar to what we’re doing already
Raft Concepts
• Term (election) IDs• Heartbeats using existing data replication
channel• Asymmetric election timeouts
Preventing Double Voting
• Can’t vote for 2 nodes in the same election• Pre-3.2: 30 second vote timeout• Post-3.2: Term IDs• Term:
• Monotonically increasing ID• Incremented on every election *attempt*• Lets voters distinguish elections so they can vote
twice quickly in different elections• Logical clock
MongoDB 3.0 OpLog entry
> db.oplog.rs.find().sort( { $natural : -1 } ).limit(1).pretty(){
"ts" : Timestamp(1444466011, 1),"h" : NumberLong("-6240522391332325619"),"v" : 2,"op" : "i","ns" : "test.nulltest","o" : {
"_id" : 3,"a" : 3
}}
millisecond + counter
Random hash = gtid
MongoDB 3.2 OpLog entry> db.oplog.rs.find().sort( { $natural : -1 } ).limit(1).pretty(){
"ts" : Timestamp(1445632081, 861),"t" : NumberLong(42),"h" : NumberLong("5466055178864103715"),"v" : 2,"op" : "u","ns" : "ycsb.usertable","o2" : {
"_id" : "user645414720104877157"},"o" : {
"$set" : {"field4" :
BinData(0,"KT5sN1wrPl...Ny9wIg==")}
}}
millisecond + Raft counter
Random hash = gtid
Raft Term
Heartbeats
• MongoDB 3.0: Heartbeats• Sent every two seconds from every node to every
other node• Volume increases quadratically as nodes are added
to the replica set• Odd behaviors with asymmetric network partitions,
corner cases• MongoDB 3.2: Extra metadata sent
via existing data replication channel• Utilizes chained replication• Faster heartbeats = faster elections
and detection of false primaries
Daisy chaining
X 3Y 2
Z 7
X 1⬅️
Y 2⬅️
X 3⬅️
X 3
Y 2
Z 7
X 1⬅️
Y 2⬅️
X 3⬅️
X 3Y 2
Z 7
X 1⬅️
Y 2⬅️
X 3⬅️
X 3Y 2
Z 7
X 1⬅️Y 2⬅️
X 3⬅️
X 3
Y 2
Z 7
X 1⬅️
Y 2⬅️
X 3⬅️
Election timeouts
• Tradeoff between failover time and spurious failovers
• Node calls for an election when it hasn’t heard
from the primary within the election timeout• Starting in 3.2:
• Single election timeout• Election timeout is configurable• Election timeout is varied randomly
for each node• Varying timeouts help reduce tied votes• Fewer tied votes = faster failover
rs.config().settings.
electionTimeoutMillis
Conclusion
• MongoDB 3.2 will have• Faster failovers• Faster error detection• More control to prevent spurious failovers• Robust also in corner cases
• This means your systems are• More resilient to failure• More uptime• Easier to maintain
Recommended