Advanced Replication Internals


DESCRIPTION

Internals of replication in MongoDB: sync source selection, the replication process, elections (and their rules), and oplog operation transformation. This presentation was given at the MongoDB San Francisco conference.


Page 1: Advanced Replication Internals

Advanced Replication Internals

Page 2: Advanced Replication Internals

Design and Goals

Goals
■ Highly Available
■ Consistent Data
■ Automatic Failover
■ Multi-Region/DC
■ Dynamic Reads

Design
● All DBs, each node
● Quorum/Election
● Smart clients
● Source selection
● Read Preferences
● Record operations
● Asynchronous
● Write/Replication acknowledgements

Page 3: Advanced Replication Internals

Goal: High Availability
● Node Redundancy: Duplicate Data
● Record Write Operations
● Apply Write Operations
● Use capped collection called "oplog"
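
For reference, the oplog lives in the local database on every replica set member; a minimal shell sketch to inspect it:

    use local
    db.oplog.rs.isCapped()        // true: old entries age out once the cap is reached
    db.printReplicationInfo()     // configured oplog size and the time window it covers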

Page 4: Advanced Replication Internals

Replication Operations: Insert
● oplog entry (fields):
  ○ op, o

{
  "ns" : "test.gamma",
  "op" : "i",
  "v"  : 2,
  "ts" : Timestamp(1350504342, 5),
  "o"  : { "_id" : 2, "x" : "hi" }
}

Page 5: Advanced Replication Internals

Replication Operations: Update
● oplog entry (fields):
  ○ o = update, o2 = query

{
  "ns" : "test.tags",
  "op" : "u",
  "v"  : 2,
  "ts" : Timestamp(1368049619, 1),
  "o2" : { "_id" : 1 },
  "o"  : { "$set" : { "tags.4" : "e" } }
}

Page 6: Advanced Replication Internals

Operation Transformation
● Idempotent (update by _id)
● Multi-update/delete (results in many ops)
● Array modifications (replacement)
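
As an illustration of the transformation, a multi-update fans out into one per-_id entry, with operators like $inc rewritten to the resulting $set values (a sketch; the collection name is hypothetical and the exact logged form varies by version):

    // One statement on the primary...
    db.counters.update({}, { $inc: { n: 1 } }, { multi: true })
    // ...may be logged as one idempotent oplog entry per matched document, e.g.:
    //   { "op" : "u", "ns" : "test.counters", "o2" : { "_id" : 1 }, "o" : { "$set" : { "n" : 5 } } }
    //   { "op" : "u", "ns" : "test.counters", "o2" : { "_id" : 2 }, "o" : { "$set" : { "n" : 8 } } }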

Page 7: Advanced Replication Internals

Interchangeable
● All members maintain oplog + dbs
● All are able to take over, or be used for the same functions

Page 8: Advanced Replication Internals

Replication Process
● Record oplog entry on write
● Idempotent entries
● Pulled by replicas

1. Read over network
2. Buffer locally
3. Apply in batch
4. Repeat
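
The pull in step 1 can be approximated from the shell with a tailable cursor on the source's oplog (a sketch only; a real secondary buffers and batch-applies what it reads):

    var local = db.getSiblingDB("local");
    var last  = local.oplog.rs.find().sort({ $natural: -1 }).limit(1).next().ts;
    var cur   = local.oplog.rs.find({ ts: { $gt: last } })
                     .addOption(DBQuery.Option.tailable)
                     .addOption(DBQuery.Option.awaitData);
    while (cur.hasNext()) printjson(cur.next());   // read; then buffer/apply/repeat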

Page 9: Advanced Replication Internals

Read + Apply Decoupled
● Background oplog reader thread
● Pool of oplog applier threads (by collection)

[Diagram: the repl source feeds oplog entries over the network into a local buffer; an applier thread pool (16 threads) applies them across DB1-DB4 and records them in the local oplog before signaling "batch complete"]

Page 10: Advanced Replication Internals

Replication Metrics"network": { "bytes": 103830503805, "readersCreated": 2248, "getmores": { "totalMillis": 257461206, "num": 2152267 }, "ops": 7285440 }"buffer": { "sizeBytes": 0, "maxSizeBytes": 268435456, "count": 0},

"preload": { "docs": { "totalMillis":0,"num":0}, "indexes": { "totalMillis": 23142318, "num": 14560667 } }, "apply": { "batches": { "totalMillis": 231847, "num": 1797105}, "ops": 7285440 },"oplog": { "insertBytes": 106866610253, "insert": { "totalMillis": 1756725, "num": 7285440 } }

Page 11: Advanced Replication Internals

Good Replication States
● Initial Sync
  ○ Record oplog start position
  ○ Clone/copy all dbs
  ○ Set minvalid, apply oplog since start
  ○ Build indexes
● Replication Batch: MinValid
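
A member still in initial sync is visible from rs.status(); a quick shell check (sketch, with hypothetical host names):

    rs.status().members.map(function (m) { return m.name + ": " + m.stateStr; })
    // e.g. [ "a:27017: PRIMARY", "b:27017: SECONDARY", "c:27017: STARTUP2" ]
    // STARTUP2 = cloning/applying during initial sync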

Page 12: Advanced Replication Internals

Goal: Consistent Data
● Single Master
● Quorum (majority)
● Ordered Oplog

Page 13: Advanced Replication Internals

Consistent Data

Why a single master?

Page 14: Advanced Replication Internals

Election Events
● Primary failure
● Stepdown (manual)
● Reconfigure
● Quorum loss
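
Two of these events can be triggered deliberately from the shell:

    rs.stepDown(60)               // primary relinquishes; stays unelectable for 60s
    var cfg = rs.conf();
    cfg.members[2].priority = 2;  // raise a member's priority...
    rs.reconfig(cfg);             // ...reconfigure, which can trigger an election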

Page 15: Advanced Replication Internals

Election Nomination Disqualifications
A replica will nominate itself unless:
● Priority:0 or arbiter
● Not freshest
● Just stepped down (in unelectable state)
● Would be vetoed by anyone because
  ○ There is a Primary already
  ○ They don't have us in their config
  ○ Higher priority member out there
● Higher config version out there
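
Priority:0 and temporary unelectability can both be set explicitly (shell sketch):

    var cfg = rs.conf();
    cfg.members[1].priority = 0;  // this member will never nominate itself
    rs.reconfig(cfg);
    rs.freeze(120)                // make the current member unelectable for 120s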

Page 16: Advanced Replication Internals

The Election
Nomination:
● If it looks like a tie, sleep random time (unless first node)
Voting:
● If all goes well, only one nominee
● All voting members vote for one nominee
● Majority of votes wins
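
Majority here means a strict majority of voting members; a one-liner makes the arithmetic concrete:

    // votes needed to win, for n voting members:
    function majorityOf(n) { return Math.floor(n / 2) + 1; }
    majorityOf(5)   // 3 -- so a 5-member set can still elect after 2 failures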

Page 17: Advanced Replication Internals

Goal: Automatic Failover
● Single Master
● Smart Clients
● Discovery

Page 18: Advanced Replication Internals

Discovery

isMaster command:

setName: <name>,
ismaster: true,
secondary: false,
arbiterOnly: <bool>,
hosts: [ <visible nodes> ],
passives: [ <prio:0 nodes> ],
arbiters: [ <nodes> ],
primary: <active primary>,
tags: { <tags> },
me: <me>
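
Any member answers this; from the shell:

    db.runCommand({ isMaster: 1 })   // or the helper: db.isMaster()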

Page 19: Advanced Replication Internals

Failover Scenario

[Diagram: the client runs discovery (isMaster) and finds the active primary P; two secondaries S replicate from it]

Page 20: Advanced Replication Internals

Failover Scenario

[Diagram: the primary fails; one of the secondaries becomes the new active primary P, leaving the failed primary behind]

Page 21: Advanced Replication Internals

Failover Scenario

[Diagram: the client's connection to the failed node errors out; re-running discovery (isMaster) points it at the new active primary P]

Page 22: Advanced Replication Internals

Replication Source Selection
● Select closest source
  ○ Limit to members that are not hidden or slave-delayed
  ○ If nothing, try again including hidden/slave-delayed
  ○ Select node with fastest "ping" time
  ○ Must be fresher
● Choose a source when
  ○ Starting
  ○ Any error with existing source (network, query)
  ○ Any member is 30s ahead of current source
● Manual override (shell form below)
  ○ replSetSyncFrom -- good until we choose again
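
The manual override from the shell (host name is hypothetical):

    rs.syncFrom("host2.example.net:27017")
    // equivalent to:
    db.adminCommand({ replSetSyncFrom: "host2.example.net:27017" })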

Page 23: Advanced Replication Internals

Goal: Datacenter Aware
● Dynamic replication topologies
● Beachhead data center server


Page 24: Advanced Replication Internals

Goal: Dynamic Reads
Controls for consistency
● Default to Primary
● Non-primary allowed
● Based on
  ○ Locality (ping/tags)
  ○ Tags

[Diagram: a client reads from a tagged member; the primary P and two secondaries S carry tags A, B and B, C]
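
From the shell, a read preference with tag sets looks like this (a sketch; real tags are key/value pairs, so the diagram's A/B labels are shown under a hypothetical key):

    db.getMongo().setReadPref("nearest", [ { "tag" : "B" } ])   // hypothetical tag document
    db.getSiblingDB("test").things.find()   // may now be served by a matching secondary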

Page 25: Advanced Replication Internals

Asynchronous Replication
● Important considerations
● Additional requirements
● System/Application controls

Page 26: Advanced Replication Internals

Write Propagation
● Write Concern
● Replication requirements
● Timing
● Dynamic requirements
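
In the shell of this era, write concern was expressed via getLastError after the write (sketch; collection is hypothetical):

    db.orders.insert({ _id: 1, total: 9 })
    db.runCommand({ getLastError: 1, w: "majority", wtimeout: 5000 })
    // blocks until a majority of members have the write, or 5 seconds pass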

Page 27: Advanced Replication Internals

Exceptional Conditions
● Multiple Primaries
● Rollback
● Too stale
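
Two of these are visible from the shell (sketch; exact messages vary by version):

    rs.status()   // a too-stale member sits in RECOVERING with an errmsg about being too stale
    // after a rollback, the losing primary's unreplicated writes are saved as
    // .bson files under the dbpath's rollback/ directory for manual inspection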

Page 28: Advanced Replication Internals

Design and Goals

Goals
■ Highly Available
■ Consistent Data
■ Automatic Failover
■ Multi-Region/DC
■ Dynamic Reads

Design
● All DBs, each node
● Quorum/Election
● Smart clients
● Source selection
● Read Preferences
● Record operations
● Asynchronous
● Write/Replication acknowledgements

Page 29: Advanced Replication Internals

Thanks!
Questions?