Advanced Replication
Internals
Design and Goals
Goals § Highly Available § Consistent Data § Automatic Failover § Multi-Region/DC § Dynamic Reads
Design • All DBs, each node • Quorum/Election • Smart clients • Source selection • Read Preferences • Record operations • Asynchronous • Write/Replication acknowledgements
Goal: High Availability • Node Redundancy: Duplicate Data • Record Write Operations • Apply Write Operations • Use a capped collection called the "oplog"
Replication Operations: Insert • oplog entry fields:
  o op (operation type), o (the inserted document)
{ "ns" : "test.gamma", "op" : "i", "v" : 2, "ts" : Timestamp(1350504342, 5), "o" : { "_id" : 2, "x" : "hi" } }
Replication Operations: Update • oplog entry fields:
  o o = update document, o2 = match query
{ "ns" : "test.tags", "op" : "u", "v" : 2,
  "ts" : Timestamp(1368049619, 1), "o2" : { "_id" : 1 },
  "o" : { "$set" : { "tags.4" : "e" } } }
Operation Transformation • Idempotent (update by _id) • Multi-update/delete (results in many ops) • Array modifications (replacement)
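The idempotency point above can be illustrated with a small sketch (plain Python dictionaries, not mongod code): the primary records the *result* of a write as a "$set" keyed by _id, so a secondary can replay the same oplog entry any number of times without drifting.

```python
# Toy sketch of idempotent oplog entries: an $inc is recorded as a $set
# of the resulting value, so applying the entry twice is harmless.

def apply_op(doc, entry):
    """Apply an idempotent ($set-only) oplog-style update to a document."""
    for field, value in entry["o"]["$set"].items():
        doc[field] = value
    return doc

# The primary executes {$inc: {x: 1}} on {_id: 1, x: 5} ...
primary_doc = {"_id": 1, "x": 6}
# ... but logs the idempotent result, not the $inc itself:
entry = {"op": "u", "o2": {"_id": 1}, "o": {"$set": {"x": 6}}}

secondary_doc = {"_id": 1, "x": 5}
apply_op(secondary_doc, entry)
apply_op(secondary_doc, entry)   # replaying changes nothing
assert secondary_doc == primary_doc
```

The same transformation is why a multi-update fans out into many single-document ops: each op must name its target by _id to stay replayable.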
Interchangeable • All members maintain oplog + dbs • All able to take over, or be used for the same functions
Replication Process • Record oplog entry on write • Idempotent entries • Pulled by replicas
1. Read over network 2. Buffer locally 3. Apply in batch 4. Repeat
Read + Apply Decoupled • Background oplog reader thread • Pool of oplog applier threads (by collection)
[Diagram: oplog entries flow from the Repl Source over the Network into a local Buffer; an Applier Thread Pool (16 threads) applies them to DB1, DB2, DB3, DB4 and the Local Oplog]
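The decoupled pipeline above can be sketched with plain Python queues and threads (a toy model, not mongod internals): one background reader buffers entries locally while a pool of applier threads drains the buffer. A real mongod also partitions applier work (e.g. by collection) to preserve ordering, which this sketch ignores.

```python
# Toy reader/applier pipeline: read over network -> buffer locally ->
# apply from the buffer with a pool of worker threads.
import queue
import threading

NUM_APPLIERS = 4
buffer = queue.Queue(maxsize=256)        # bounded local buffer
applied = []
applied_lock = threading.Lock()

def reader(entries):
    """'Read over network' and buffer locally; then signal end of stream."""
    for entry in entries:
        buffer.put(entry)
    for _ in range(NUM_APPLIERS):
        buffer.put(None)                 # one sentinel per applier thread

def applier():
    """Drain the buffer and apply ops until the sentinel arrives."""
    while True:
        entry = buffer.get()
        if entry is None:
            return
        with applied_lock:
            applied.append(entry)        # "apply" = just record it here

oplog = [{"ts": i, "op": "i"} for i in range(100)]
appliers = [threading.Thread(target=applier) for _ in range(NUM_APPLIERS)]
for t in appliers:
    t.start()
reader(oplog)                            # reader runs inline for simplicity
for t in appliers:
    t.join()
assert len(applied) == 100
```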
Replication Metrics
"network": {
    "bytes": 103830503805,
    "readersCreated": 2248,
    "getmores": { "totalMillis": 257461206, "num": 2152267 },
    "ops": 7285440 },
"buffer": {
    "sizeBytes": 0,
    "maxSizeBytes": 268435456,
    "count": 0 },
"preload": {
    "docs": { "totalMillis": 0, "num": 0 },
    "indexes": { "totalMillis": 23142318, "num": 14560667 } },
"apply": {
    "batches": { "totalMillis": 231847, "num": 1797105 },
    "ops": 7285440 },
"oplog": {
    "insertBytes": 106866610253,
    "insert": { "totalMillis": 1756725, "num": 7285440 } }
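A little arithmetic on the sample figures above shows what these counters are good for: dividing applied ops by batches gives the average batch size, and batch millis by batches gives the average apply latency.

```python
# Derived stats from the sample "apply" metrics shown above.
apply_ops = 7285440
batches_num = 1797105
batches_millis = 231847

ops_per_batch = apply_ops / batches_num      # ~4 ops per applied batch
ms_per_batch = batches_millis / batches_num  # well under a millisecond

assert round(ops_per_batch, 2) == 4.05
assert round(ms_per_batch, 3) == 0.129
```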
Good Replication States
• Initial Sync
  o Record oplog start position
  o Clone/copy all dbs
  o Set minvalid, apply oplog since start
  o Build indexes
• Replication Batch: MinValid
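A toy, runnable sketch (plain Python, not mongod code) of the initial-sync ordering: the key point is recording the oplog start position *before* cloning, so writes that land during the clone are replayed afterwards.

```python
# Toy initial sync: record start position, clone, then replay the oplog.
class Source:
    def __init__(self):
        self.docs, self.oplog = {}, []
    def write(self, k, v):
        self.docs[k] = v
        self.oplog.append((k, v))

class SyncingNode:
    def __init__(self):
        self.docs = {}
    def initial_sync(self, src):
        start = len(src.oplog)                  # 1. record oplog start position
        self.docs = dict(src.docs)              # 2. clone/copy all dbs
        src.write("b", 2)                       #    (a write lands mid-clone)
        minvalid = len(src.oplog)               # 3. set minvalid ...
        for k, v in src.oplog[start:minvalid]:  #    ... apply oplog since start
            self.docs[k] = v
        # 4. build indexes (omitted in this sketch)

src = Source()
src.write("a", 1)
node = SyncingNode()
node.initial_sync(src)
assert node.docs == {"a": 1, "b": 2}   # mid-clone write was caught up
```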
Goal: Consistent Data • Single Master • Quorum (majority) • Ordered Oplog
Consistent Data: Why a single master?
Election Events Election events: • Primary failure • Stepdown (manual) • Reconfigure • Quorum loss
Election Nomination Disqualifications
A replica will nominate itself unless:
• Priority:0 or arbiter
• Not freshest
• Just stepped down (in unelectable state)
• Would be vetoed by anyone because
  o There is a Primary already
  o They don't have us in their config
  o Higher priority member out there
  o Higher config version out there
The Election
Nomination:
• If it looks like a tie, sleep a random time (unless first node)
Voting:
• If all goes well, only one nominee
• All voting members vote for one nominee
• Majority of votes wins
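The majority rule at the end is the whole ballgame, and it can be stated in one line of Python (a toy illustration of just the counting, not the nomination/veto protocol): a nominee must win a *strict* majority of all voting members, so a tie elects no one.

```python
# Strict-majority check for a single nominee.
def election(voting_members, votes_for_nominee):
    """True only if the nominee wins a strict majority of voting members."""
    return votes_for_nominee > voting_members // 2

assert election(5, 3) is True    # 3 of 5 is a majority
assert election(4, 2) is False   # 2 of 4 is a tie, not a majority
```

This is also why even-sized voting sets are fragile: losing half the members means no majority and therefore no primary.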
Goal: Automatic Failover • Single Master • Smart Clients • Discovery
Discovery
isMaster command returns:
{ setName: <name>, ismaster: true, secondary: false, arbiterOnly: <bool>, hosts: [ <visible nodes> ], passives: [ <prio:0 nodes> ], arbiters: [ <nodes> ], primary: <active primary>, tags: { <tags> }, me: <me> }
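What a "smart client" does with these replies can be sketched in a few lines (simplified sample responses, not a real driver): probe the hosts you know, and route writes to whichever member reports ismaster: true.

```python
# Simplified driver-side discovery over sample isMaster replies.
responses = {
    "n1": {"ismaster": False, "secondary": True,
           "hosts": ["n1", "n2", "n3"], "primary": "n2"},
    "n2": {"ismaster": True, "secondary": False,
           "hosts": ["n1", "n2", "n3"], "primary": "n2"},
}

def find_primary(responses):
    """Return the host that claims to be primary, if any."""
    for host, reply in responses.items():
        if reply.get("ismaster"):
            return host
    return None

assert find_primary(responses) == "n2"
```

A real driver also cross-checks the `primary` and `hosts` fields to expand its seed list and detect stale replies; the failover scenario below is this loop re-run after the primary disappears.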
Failover Scenario
[Diagrams: (1) the Client discovers the active Primary P and two Secondaries S via isMaster; (2) the Primary fails and a Secondary is elected the new active Primary; (3) the Client re-runs discovery via isMaster and finds the new Primary]
Replication Source Selection
• Select closest source
  o Limit to non-hidden, non-slave-delayed members
  o If nothing, try again with hidden/slave-delayed members
  o Select node with fastest "ping" time
  o Must be fresher
• Choose source when
  o Starting
  o Any error with existing source (network, query)
  o Any member is 30s ahead of current source
• Manual override
  o replSetSyncFrom -- good until we choose again
Goal: Datacenter Aware • Dynamic replication topologies • Beachhead data center server
Goal: Dynamic Reads
Controls for consistency
• Default to Primary
• Non-primary allowed
• Based on
  o Locality (ping/tags)
  o Tags
[Diagram: a Client's reads are routed to the Primary or to a Secondary tagged {A, B} or {B, C}, according to the read preference]
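Tag-based routing reduces to a subset test, sketched here with tags simplified to plain sets (real MongoDB tags are key:value documents, and real drivers also weigh ping locality): a member is eligible only if it carries every tag the read preference asks for.

```python
# Toy tag matching for read preferences.
def matches(member_tags, required_tags):
    """Member is eligible if it has every required tag."""
    return required_tags.issubset(member_tags)

members = [
    {"host": "s1", "tags": {"A", "B"}},
    {"host": "s2", "tags": {"B", "C"}},
]
eligible = [m for m in members if matches(m["tags"], {"B", "C"})]
assert [m["host"] for m in eligible] == ["s2"]
```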
Asynchronous Replication • Important considerations • Additional requirements • System/Application controls
Write Propagation • Write Concern • Replication requirements • Timing • Dynamic requirements
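Because replication is asynchronous, write concern is how the application bolts a replication requirement onto a write: acknowledge only once enough members have replicated past the write's optime. A toy check (sample optimes assumed, not the real protocol):

```python
# Toy write-concern check: a write at optime T satisfies w=N once at
# least N members (primary included) have replicated through T.
def satisfied(member_optimes, write_optime, w):
    return sum(1 for t in member_optimes if t >= write_optime) >= w

optimes = [105, 103, 99]           # primary and two secondaries
assert satisfied(optimes, 100, w=2) is True    # majority of 3 has it
assert satisfied(optimes, 100, w=3) is False   # one secondary still lags
```

The "timing" and "dynamic requirements" bullets fall out of the same check: the caller chooses w (and how long to wait) per write.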
Exceptional Conditions • Multiple Primaries • Rollback • Too stale
Thanks! Questions?