Advanced Replication Internals


DESCRIPTION

Internals of replication in MongoDB: sync source selection, the replication process, elections (and their rules), and oplog operation transformation. This presentation was given at the MongoDB San Francisco conference.


Page 1: Advanced Replication Internals

Advanced Replication Internals

Page 2: Advanced Replication Internals

Design and Goals

Goals
■ Highly Available
■ Consistent Data
■ Automatic Failover
■ Multi-Region/DC
■ Dynamic Reads

Design
● All DBs, each node
● Quorum/Election
● Smart clients
● Source selection
● Read Preferences
● Record operations
● Asynchronous
● Write/Replication acknowledgements

Page 3: Advanced Replication Internals

Goal: High Availability
● Node Redundancy: Duplicate Data
● Record Write Operations
● Apply Write Operations
● Use capped collection called "oplog"
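
For reference, the oplog lives in the local database on every replica set member; a minimal shell sketch to inspect it:

    use local
    db.oplog.rs.isCapped()        // true: old entries age out once the cap is reached
    db.printReplicationInfo()     // configured oplog size and the time window it covers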

Page 4: Advanced Replication Internals

Replication Operations: Insert
● oplog entry (fields):
  ○ op, o

{
  "ns" : "test.gamma",
  "op" : "i",
  "v"  : 2,
  "ts" : Timestamp(1350504342, 5),
  "o"  : { "_id" : 2, "x" : "hi" }
}

Page 5: Advanced Replication Internals

Replication Operations: Update
● oplog entry (fields):
  ○ o = update, o2 = query

{
  "ns" : "test.tags",
  "op" : "u",
  "v"  : 2,
  "ts" : Timestamp(1368049619, 1),
  "o2" : { "_id" : 1 },
  "o"  : { "$set" : { "tags.4" : "e" } }
}

Page 6: Advanced Replication Internals

Operation Transformation
● Idempotent (update by _id)
● Multi-update/delete (results in many ops)
● Array modifications (replacement)
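
As an illustration of the transformation, a multi-update fans out into one per-_id entry, with operators like $inc rewritten to the resulting $set values (a sketch; the collection name is hypothetical and the exact logged form varies by version):

    // One statement on the primary...
    db.counters.update({}, { $inc: { n: 1 } }, { multi: true })
    // ...may be logged as one idempotent oplog entry per matched document, e.g.:
    //   { "op" : "u", "ns" : "test.counters", "o2" : { "_id" : 1 }, "o" : { "$set" : { "n" : 5 } } }
    //   { "op" : "u", "ns" : "test.counters", "o2" : { "_id" : 2 }, "o" : { "$set" : { "n" : 8 } } }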

Page 7: Advanced Replication Internals

Interchangeable
● All members maintain oplog + dbs
● All are able to take over, or be used for the same functions

Page 8: Advanced Replication Internals

Replication Process
● Record oplog entry on write
● Idempotent entries
● Pulled by replicas

1. Read over network
2. Buffer locally
3. Apply in batch
4. Repeat
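
The pull in step 1 can be approximated from the shell with a tailable cursor on the source's oplog (a sketch only; a real secondary buffers and batch-applies what it reads):

    var local = db.getSiblingDB("local");
    var last  = local.oplog.rs.find().sort({ $natural: -1 }).limit(1).next().ts;
    var cur   = local.oplog.rs.find({ ts: { $gt: last } })
                     .addOption(DBQuery.Option.tailable)
                     .addOption(DBQuery.Option.awaitData);
    while (cur.hasNext()) printjson(cur.next());   // read; then buffer/apply/repeat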

Page 9: Advanced Replication Internals

Read + Apply Decoupled
● Background oplog reader thread
● Pool of oplog applier threads (by collection)

[Diagram: the repl source feeds oplog entries over the network into a local buffer; an applier thread pool (16 threads) applies them across DB1-DB4 and records them in the local oplog before signaling "batch complete"]

Page 10: Advanced Replication Internals

Replication Metrics"network": { "bytes": 103830503805, "readersCreated": 2248, "getmores": { "totalMillis": 257461206, "num": 2152267 }, "ops": 7285440 }"buffer": { "sizeBytes": 0, "maxSizeBytes": 268435456, "count": 0},

"preload": { "docs": { "totalMillis":0,"num":0}, "indexes": { "totalMillis": 23142318, "num": 14560667 } }, "apply": { "batches": { "totalMillis": 231847, "num": 1797105}, "ops": 7285440 },"oplog": { "insertBytes": 106866610253, "insert": { "totalMillis": 1756725, "num": 7285440 } }

Page 11: Advanced Replication Internals

Good Replication States
● Initial Sync
  ○ Record oplog start position
  ○ Clone/copy all dbs
  ○ Set minvalid, apply oplog since start
  ○ Build indexes
● Replication Batch: MinValid
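
A member still in initial sync is visible from rs.status(); a quick shell check (sketch, with hypothetical host names):

    rs.status().members.map(function (m) { return m.name + ": " + m.stateStr; })
    // e.g. [ "a:27017: PRIMARY", "b:27017: SECONDARY", "c:27017: STARTUP2" ]
    // STARTUP2 = cloning/applying during initial sync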

Page 12: Advanced Replication Internals

Goal: Consistent Data
● Single Master
● Quorum (majority)
● Ordered Oplog

Page 13: Advanced Replication Internals

Consistent Data

Why a single master?

Page 14: Advanced Replication Internals

Election Events
● Primary failure
● Stepdown (manual)
● Reconfigure
● Quorum loss
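
Two of these events can be triggered deliberately from the shell:

    rs.stepDown(60)               // primary relinquishes; stays unelectable for 60s
    var cfg = rs.conf();
    cfg.members[2].priority = 2;  // raise a member's priority...
    rs.reconfig(cfg);             // ...reconfigure, which can trigger an election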

Page 15: Advanced Replication Internals

Election Nomination Disqualifications
A replica will nominate itself unless:
● Priority:0 or arbiter
● Not freshest
● Just stepped down (in unelectable state)
● Would be vetoed by anyone because
  ○ There is a Primary already
  ○ They don't have us in their config
  ○ Higher priority member out there
● Higher config version out there
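
Priority:0 and temporary unelectability can both be set explicitly (shell sketch):

    var cfg = rs.conf();
    cfg.members[1].priority = 0;  // this member will never nominate itself
    rs.reconfig(cfg);
    rs.freeze(120)                // make the current member unelectable for 120s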

Page 16: Advanced Replication Internals

The Election
Nomination:
● If it looks like a tie, sleep random time (unless first node)
Voting:
● If all goes well, only one nominee
● All voting members vote for one nominee
● Majority of votes wins
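
Majority here means a strict majority of voting members; a one-liner makes the arithmetic concrete:

    // votes needed to win, for n voting members:
    function majorityOf(n) { return Math.floor(n / 2) + 1; }
    majorityOf(5)   // 3 -- so a 5-member set can still elect after 2 failures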

Page 17: Advanced Replication Internals

Goal: Automatic Failover
● Single Master
● Smart Clients
● Discovery

Page 18: Advanced Replication Internals

Discovery

isMaster command:

setName: <name>,
ismaster: true,
secondary: false,
arbiterOnly: <bool>,
hosts: [ <visible nodes> ],
passives: [ <prio:0 nodes> ],
arbiters: [ <nodes> ],
primary: <active primary>,
tags: { <tags> },
me: <me>
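
Any member answers this; from the shell:

    db.runCommand({ isMaster: 1 })   // or the helper: db.isMaster()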

Page 19: Advanced Replication Internals

Failover Scenario

[Diagram: the client runs discovery (isMaster) and finds the active primary P; two secondaries S replicate from it]

Page 20: Advanced Replication Internals

Failover Scenario

[Diagram: the primary fails; one of the secondaries becomes the new active primary P, leaving the failed primary behind]

Page 21: Advanced Replication Internals

Failover Scenario

[Diagram: the client's connection to the failed node errors out; re-running discovery (isMaster) points it at the new active primary P]

Page 22: Advanced Replication Internals

Replication Source Selection
● Select closest source
  ○ Limit to members that are not hidden or slave-delayed
  ○ If nothing, try again including hidden/slave-delayed
  ○ Select node with fastest "ping" time
  ○ Must be fresher
● Choose a source when
  ○ Starting
  ○ Any error with existing source (network, query)
  ○ Any member is 30s ahead of current source
● Manual override (shell form below)
  ○ replSetSyncFrom -- good until we choose again
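
The manual override from the shell (host name is hypothetical):

    rs.syncFrom("host2.example.net:27017")
    // equivalent to:
    db.adminCommand({ replSetSyncFrom: "host2.example.net:27017" })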

Page 23: Advanced Replication Internals

Goal: Datacenter Aware
● Dynamic replication topologies
● Beachhead data center server


Page 24: Advanced Replication Internals

Goal: Dynamic Reads
Controls for consistency
● Default to Primary
● Non-primary allowed
● Based on
  ○ Locality (ping/tags)
  ○ Tags

[Diagram: a client reads from a tagged member; the primary P and two secondaries S carry tags A, B and B, C]
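
From the shell, a read preference with tag sets looks like this (a sketch; real tags are key/value pairs, so the diagram's A/B labels are shown under a hypothetical key):

    db.getMongo().setReadPref("nearest", [ { "tag" : "B" } ])   // hypothetical tag document
    db.getSiblingDB("test").things.find()   // may now be served by a matching secondary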

Page 25: Advanced Replication Internals

Asynchronous Replication
● Important considerations
● Additional requirements
● System/Application controls

Page 26: Advanced Replication Internals

Write Propagation
● Write Concern
● Replication requirements
● Timing
● Dynamic requirements
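
In the shell of this era, write concern was expressed via getLastError after the write (sketch; collection is hypothetical):

    db.orders.insert({ _id: 1, total: 9 })
    db.runCommand({ getLastError: 1, w: "majority", wtimeout: 5000 })
    // blocks until a majority of members have the write, or 5 seconds pass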

Page 27: Advanced Replication Internals

Exceptional Conditions
● Multiple Primaries
● Rollback
● Too stale
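
Two of these are visible from the shell (sketch; exact messages vary by version):

    rs.status()   // a too-stale member sits in RECOVERING with an errmsg about being too stale
    // after a rollback, the losing primary's unreplicated writes are saved as
    // .bson files under the dbpath's rollback/ directory for manual inspection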

Page 28: Advanced Replication Internals

Design and Goals

Goals
■ Highly Available
■ Consistent Data
■ Automatic Failover
■ Multi-Region/DC
■ Dynamic Reads

Design
● All DBs, each node
● Quorum/Election
● Smart clients
● Source selection
● Read Preferences
● Record operations
● Asynchronous
● Write/Replication acknowledgements

Page 29: Advanced Replication Internals

Thanks!
Questions?