Advanced Replication Internals

Page 1: Advanced Replication Internals

Advanced Replication Internals

Page 2: Advanced Replication Internals

Design and Goals

Goals
  • Highly Available
  • Consistent Data
  • Automatic Failover
  • Multi-Region/DC
  • Dynamic Reads

Design
  • All DBs, each node
  • Quorum/Election
  • Smart clients
  • Source selection
  • Read Preferences
  • Record operations
  • Asynchronous
  • Write/Replication acknowledgements

Page 3: Advanced Replication Internals

Goal: High Availability
  • Node Redundancy: Duplicate Data
  • Record Write Operations
  • Apply Write Operations
  • Use capped collection called "oplog"
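The oplog lives in the local database as a capped collection named oplog.rs. A quick way to inspect the newest entry from the mongo shell (an illustrative aside, not from the slides):

    use local
    // newest entry first, by natural (insertion) order
    db.oplog.rs.find().sort({ $natural: -1 }).limit(1).pretty()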

Page 4: Advanced Replication Internals

Replication Operations: Insert
  • oplog entry (fields): op, o

  { "ns" : "test.gamma", "op" : "i", "v" : 2,
    "ts" : Timestamp(1350504342, 5),
    "o" : { "_id" : 2, "x" : "hi" } }

Page 5: Advanced Replication Internals

Replication Operations: Update
  • oplog entry (fields): o = update, o2 = query

  { "ns" : "test.tags", "op" : "u", "v" : 2,
    "ts" : Timestamp(1368049619, 1),
    "o2" : { "_id" : 1 },
    "o" : { "$set" : { "tags.4" : "e" } } }

Page 6: Advanced Replication Internals

Operation Transformation
  • Idempotent (update by _id)
  • Multi-update/delete (results in many ops)
  • Array modifications (replacement)
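A sketch of the fan-out (the oplog entries in the comments are illustrative, with made-up _id values): one multi-update becomes one idempotent entry per matched document, each keyed by _id, with the modifier rewritten to a $set of the resulting value:

    // one client operation...
    db.gamma.update({ x: "hi" }, { $inc: { n: 1 } }, { multi: true })

    // ...is recorded as one idempotent entry per matched document:
    //   { "op": "u", "o2": { "_id": 2 }, "o": { "$set": { "n": 1 } } }
    //   { "op": "u", "o2": { "_id": 7 }, "o": { "$set": { "n": 1 } } }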

Page 7: Advanced Replication Internals

Interchangeable
  • All members maintain oplog + dbs
  • All able to take over, or be used for the same functions

Page 8: Advanced Replication Internals

Replication Process
  • Record oplog entry on write
  • Idempotent entries
  • Pulled by replicas

  1. Read over network
  2. Buffer locally
  3. Apply in batch
  4. Repeat
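The "pull" half of that loop can be mimicked from the legacy mongo shell with a tailable, awaitData cursor on the oplog (a minimal sketch; the real applier is internal server code, and the starting point here is simply wherever the oplog currently ends):

    var local = db.getSiblingDB("local");
    // start from the newest existing entry
    var last = local.oplog.rs.find().sort({ $natural: -1 }).limit(1).next().ts;
    var cur = local.oplog.rs.find({ ts: { $gt: last } })
                  .addOption(DBQuery.Option.tailable | DBQuery.Option.awaitData);
    while (cur.hasNext()) {
        printjson(cur.next());  // a real secondary buffers these, then applies in batches
    }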

Page 9: Advanced Replication Internals

Read + Apply Decoupled
  • Background oplog reader thread
  • Pool of oplog applier threads (by collection)

[Diagram: oplog entries flow from the Repl Source over the Network into a local Buffer; an Applier Thread Pool (16 threads) drains the buffer into DB1–DB4 and the Local Oplog]

Page 10: Advanced Replication Internals

Replication Metrics "network": {

"bytes": 103830503805,

"readersCreated": 2248,

"getmores": {

"totalMillis": 257461206,

"num": 2152267 },

"ops": 7285440 }

"buffer": {

"sizeBytes": 0,

"maxSizeBytes": 268435456,

"count": 0},

"preload": { "docs": {

"totalMillis":0,"num":0},

"indexes": {

"totalMillis": 23142318,

"num": 14560667 } }, "apply": {

"batches": {

"totalMillis": 231847,

"num": 1797105},

"ops": 7285440 },

"oplog": {

"insertBytes": 106866610253,

"insert": {

"totalMillis": 1756725,

"num": 7285440 } }

Page 11: Advanced Replication Internals

Good Replication States
  • Initial Sync
    o Record oplog start position
    o Clone/copy all dbs
    o Set minvalid, apply oplog since start
    o Build indexes
  • Replication Batch: MinValid

Page 12: Advanced Replication Internals

Goal: Consistent Data
  • Single Master
  • Quorum (majority)
  • Ordered Oplog

Page 13: Advanced Replication Internals

Consistent Data
Why a single master?

Page 14: Advanced Replication Internals

Election Events
  • Primary failure
  • Stepdown (manual)
  • Reconfigure
  • Quorum loss
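A manual stepdown is triggered from the shell; the primary relinquishes its role and stays unelectable for the given number of seconds:

    rs.stepDown(60)   // step down; refuse re-election for 60 seconds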

Page 15: Advanced Replication Internals

Election Nomination Disqualifications
A replica will nominate itself unless:
  • Priority:0 or arbiter
  • Not freshest
  • Just stepped down (in unelectable state)
  • Would be vetoed by anyone because
    o There is a Primary already
    o They don't have us in their config
    o Higher priority member out there
  • Higher config version out there
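Making a member unelectable is a config change (member index 2 is arbitrary here):

    // a priority:0 member holds data but never nominates itself
    cfg = rs.conf()
    cfg.members[2].priority = 0
    rs.reconfig(cfg)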

Page 16: Advanced Replication Internals

The Election
Nomination:
  • If it looks like a tie, sleep random time (unless first node)
Voting:
  • If all goes well, only one nominee
  • All voting members vote for one nominee
  • Majority of votes wins

Page 17: Advanced Replication Internals

Goal: Automatic Failover
  • Single Master
  • Smart Clients
  • Discovery

Page 18: Advanced Replication Internals

Discovery
isMaster command:

  setName: <name>,
  ismaster: true,
  secondary: false,
  arbiterOnly:
  hosts: [ <visible nodes> ],
  passives: [ <prio:0 nodes> ],
  arbiters: [ <nodes> ],
  primary: <active primary>,
  tags: { <tags> },
  me: <me>
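Clients run this themselves; from the shell, db.isMaster() is the built-in helper:

    db.isMaster()   // equivalent to: db.runCommand({ isMaster: 1 })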

Page 19: Advanced Replication Internals

Failover Scenario

[Diagram: the client runs discovery (isMaster) against the set and finds the active primary P and two secondaries S]

Page 20: Advanced Replication Internals

Failover Scenario

[Diagram: the original primary fails; a secondary is elected and becomes the new active primary P]

Page 21: Advanced Replication Internals

Failover Scenario

[Diagram: the client's connection to the failed node errors out, so it re-runs discovery (isMaster) and switches to the new primary P]

Page 22: Advanced Replication Internals

Replication Source Selection
  • Select closest source
    o Limit to non-hidden, non-slave-delayed members
    o If nothing, try again with hidden/slave-delayed
    o Select node with fastest "ping" time
    o Must be fresher
  • Choose source when
    o Starting
    o Any error with existing source (network, query)
    o Any member is 30s ahead of current source
  • Manual override
    o replSetSyncFrom -- good until we choose again

Page 23: Advanced Replication Internals

Goal: Datacenter Aware
  • Dynamic replication topologies
  • Beachhead data center server


Page 24: Advanced Replication Internals

Goal: Dynamic Reads
Controls for consistency
  • Default to Primary
  • Non-primary allowed
  • Based on
    o Locality (ping/tags)
    o Tags

[Diagram: a client choosing among the primary P and two secondaries S, tagged "A, B" and "B, C"]
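Read preferences are stated client-side; in the shell, cursor.readPref takes a mode and an optional tag-set list (the tag name and value below are hypothetical, since the diagram's A/B/C are abstract labels):

    // read from the nearest member whose config tags include dc: "east"
    db.gamma.find().readPref("nearest", [ { "dc": "east" } ])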

Page 25: Advanced Replication Internals

Asynchronous Replication
  • Important considerations
  • Additional requirements
  • System/Application controls

Page 26: Advanced Replication Internals

Write Propagation
  • Write Concern
  • Replication requirements
  • Timing
  • Dynamic requirements
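Write concern is how an application expresses its replication requirement per write; in the shell of this era that meant getLastError after the write (the w and wtimeout values are illustrative):

    db.gamma.insert({ _id: 3, x: "bye" })
    // block until a majority of members have the write, or fail after 5s
    db.runCommand({ getLastError: 1, w: "majority", wtimeout: 5000 })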

Page 27: Advanced Replication Internals

Exceptional Conditions
  • Multiple Primaries
  • Rollback
  • Too stale

Page 28: Advanced Replication Internals

Design and Goals

Goals
  • Highly Available
  • Consistent Data
  • Automatic Failover
  • Multi-Region/DC
  • Dynamic Reads

Design
  • All DBs, each node
  • Quorum/Election
  • Smart clients
  • Source selection
  • Read Preferences
  • Record operations
  • Asynchronous
  • Write/Replication acknowledgements

Page 29: Advanced Replication Internals

Thanks
Questions?