scott-hernandez
DESCRIPTION
Internals of replication in MongoDB, covering replication source selection, the replication process, elections (and their rules), and oplog transformation. This presentation was given at the MongoDB San Francisco conference.
Advanced Replication
Internals
Design and Goals

Goals
■ Highly Available
■ Consistent Data
■ Automatic Failover
■ Multi-Region/DC
■ Dynamic Reads
Design
● All DBs, each node
● Quorum/Election
● Smart clients
● Source selection
● Read Preferences
● Record operations
● Asynchronous
● Write/Replication acknowledgements
Goal: High Availability
● Node Redundancy: Duplicate Data
● Record Write Operations
● Apply Write Operations
● Use a capped collection called the "oplog"
Replication Operations: Insert
● oplog entry (fields): op, o

{
  "ns" : "test.gamma",
  "op" : "i",
  "v"  : 2,
  "ts" : Timestamp(1350504342, 5),
  "o"  : { "_id" : 2, "x" : "hi" }
}
Replication Operations: Update
● oplog entry (fields): o = update, o2 = query

{
  "ns" : "test.tags",
  "op" : "u",
  "v"  : 2,
  "ts" : Timestamp(1368049619, 1),
  "o2" : { "_id" : 1 },
  "o"  : { "$set" : { "tags.4" : "e" } }
}
Operation Transformation
● Idempotent (update by _id)
● Multi-update/delete (results in many ops)
● Array modifications (replacement)
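A minimal sketch of the transformation idea above (not MongoDB's actual code): one multi-document update fans out into one idempotent oplog entry per matched document, each matched strictly by _id and carrying the computed result as a $set rather than the original operator. The function name, namespace, and document shapes here are hypothetical.

```javascript
// Sketch: rewrite a multi-update ({ $inc: { n: 1 } } over a query)
// into per-document, idempotent oplog entries. Re-applying any entry
// leaves the document unchanged, since the final value is $set.
function transformMultiUpdate(docs, query, incField) {
  return docs
    .filter(d => Object.keys(query).every(k => d[k] === query[k]))
    .map(d => ({
      op: "u",
      ns: "test.coll",                               // hypothetical namespace
      o2: { _id: d._id },                            // match by _id only
      o: { $set: { [incField]: d[incField] + 1 } }   // result, not the $inc
    }));
}

const docs = [
  { _id: 1, status: "a", n: 5 },
  { _id: 2, status: "a", n: 7 },
  { _id: 3, status: "b", n: 9 }
];
// Two documents match { status: "a" }, so two oplog entries result.
const ops = transformMultiUpdate(docs, { status: "a" }, "n");
```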
Interchangeable
● All members maintain the oplog + dbs
● All able to take over, or be used for the same functions
Replication Process
● Record oplog entry on write
● Idempotent entries
● Pulled by replicas

1. Read over network
2. Buffer locally
3. Apply in batch
4. Repeat
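The four-step pull loop above can be sketched as follows; this is a simulation with an in-memory stand-in for the sync source (the fetchBatch name and entry shape are hypothetical, not a real driver API):

```javascript
// Sketch of the secondary's pull loop: read entries over the network,
// buffer them locally, apply the batch, and repeat from the last
// applied timestamp until caught up.
function runPullLoop(source, applyEntry) {
  let lastTs = 0;
  const buffer = [];
  let batch;
  while ((batch = source.fetchBatch(lastTs)).length > 0) { // 1. read
    buffer.push(...batch);                                 // 2. buffer locally
    while (buffer.length > 0) {                            // 3. apply in batch
      const entry = buffer.shift();
      applyEntry(entry);
      lastTs = entry.ts;
    }
  }                                                        // 4. repeat
  return lastTs;
}

// Hypothetical in-memory "primary" oplog, served two entries at a time.
const primaryOplog = [{ ts: 1, op: "i" }, { ts: 2, op: "u" }, { ts: 3, op: "d" }];
const source = {
  fetchBatch: afterTs => primaryOplog.filter(e => e.ts > afterTs).slice(0, 2)
};
const applied = [];
const caughtUpTo = runPullLoop(source, e => applied.push(e.ts));
```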
Read + Apply Decoupled
● Background oplog reader thread
● Pool of oplog applier threads (by collection)
[Diagram: oplog entries flow over the network from the repl source into a local buffer; a pool of 16 applier threads drains the buffer into DB1–DB4 and the local oplog, signaling when each batch is complete.]
Replication Metrics

"network" : {
  "bytes" : 103830503805,
  "readersCreated" : 2248,
  "getmores" : { "totalMillis" : 257461206, "num" : 2152267 },
  "ops" : 7285440
},
"buffer" : {
  "sizeBytes" : 0,
  "maxSizeBytes" : 268435456,
  "count" : 0
},
"preload" : {
  "docs" : { "totalMillis" : 0, "num" : 0 },
  "indexes" : { "totalMillis" : 23142318, "num" : 14560667 }
},
"apply" : {
  "batches" : { "totalMillis" : 231847, "num" : 1797105 },
  "ops" : 7285440
},
"oplog" : {
  "insertBytes" : 106866610253,
  "insert" : { "totalMillis" : 1756725, "num" : 7285440 }
}
Good Replication States
● Initial Sync
  ○ Record oplog start position
  ○ Clone/copy all dbs
  ○ Set minvalid, apply oplog since start
  ○ Build indexes
● Replication Batch: MinValid
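The initial-sync sequence above can be sketched with in-memory stand-ins for the source node and the local copy (the object shapes and method names here are hypothetical): record the oplog start position, clone the data, mark minvalid, then replay the oplog entries recorded since the start position.

```javascript
// Sketch of initial sync. Writes that arrive during the clone are
// captured by the oplog and replayed afterward; minvalid marks the
// point before which the cloned data is not yet consistent.
function initialSync(sourceNode, local) {
  const startTs = sourceNode.lastOplogTs();    // record oplog start position
  local.docs = sourceNode.cloneDocs();         // clone/copy all dbs
  local.minValid = sourceNode.lastOplogTs();   // set minvalid
  for (const entry of sourceNode.oplogSince(startTs)) {
    if (entry.op === "i") local.docs.push(entry.o); // apply oplog since start
  }
  local.indexesBuilt = true;                   // build indexes
  return local;
}

// Hypothetical source whose clone step races with a concurrent insert.
const oplog = [{ ts: 1, op: "i", o: { _id: 1 } }];
const sourceNode = {
  lastOplogTs: () => oplog[oplog.length - 1].ts,
  cloneDocs: () => {
    oplog.push({ ts: 2, op: "i", o: { _id: 2 } }); // write during the clone
    return [{ _id: 1 }];
  },
  oplogSince: ts => oplog.filter(e => e.ts > ts)
};
const synced = initialSync(sourceNode, { docs: [], minValid: 0, indexesBuilt: false });
```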
Goal: Consistent Data
● Single Master
● Quorum (majority)
● Ordered Oplog
Consistent Data

Why a single master?
Election Events
● Primary failure
● Stepdown (manual)
● Reconfigure
● Quorum loss
Election Nomination Disqualifications
A replica will nominate itself unless:
● Priority:0 or arbiter
● Not freshest
● Just stepped down (in unelectable state)
● Would be vetoed by anyone because:
  ○ There is a Primary already
  ○ They don't have us in their config
  ○ There is a higher-priority member out there
  ○ There is a higher config version out there
The Election
Nomination:
● If it looks like a tie, sleep a random time (unless first node)
Voting:
● If all goes well, only one nominee
● All voting members vote for one nominee
● Majority of votes wins
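The nomination and voting rules above can be sketched like this; the member shape (priority, lastOpTs, and so on) is a hypothetical simplification, not the actual replica-set config format:

```javascript
// Sketch: a member self-disqualifies from nomination if any of the
// listed conditions hold; a nominee wins only with a majority of the
// voting members.
function canNominate(self, members) {
  if (self.priority === 0 || self.arbiterOnly) return false; // prio:0 / arbiter
  const freshest = Math.max(...members.map(m => m.lastOpTs));
  if (self.lastOpTs < freshest) return false;                // not freshest
  if (self.justSteppedDown) return false;                    // unelectable state
  if (members.some(m => m.isPrimary)) return false;          // primary exists
  if (members.some(m => m.priority > self.priority)) return false; // higher prio
  return true;
}

function electionWon(votesFor, totalVotingMembers) {
  return votesFor > totalVotingMembers / 2;                  // majority wins
}

const members = [
  { id: 0, priority: 1, lastOpTs: 10, isPrimary: false },
  { id: 1, priority: 1, lastOpTs: 12, isPrimary: false },
  { id: 2, priority: 0, lastOpTs: 12, isPrimary: false }  // priority:0
];
```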
Goal: Automatic Failover
● Single Master
● Smart Clients
● Discovery
Discovery

isMaster command:
{
  setName: <name>,
  ismaster: true,
  secondary: false,
  arbiterOnly: <bool>,
  hosts: [ <visible nodes> ],
  passives: [ <prio:0 nodes> ],
  arbiters: [ <nodes> ],
  primary: <active primary>,
  tags: { <tags> },
  me: <me>
}
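A smart client's use of these responses can be sketched as follows; the function name and the two-node response set are hypothetical, but the fields read (ismaster, primary, me) are the ones shown above:

```javascript
// Sketch of client-side discovery: ask each seed node for its
// isMaster-style document and follow the "primary" field to find
// where writes should be routed.
function discoverPrimary(isMasterResponses) {
  for (const resp of isMasterResponses) {
    if (resp.ismaster) return resp.me;       // asked the primary directly
    if (resp.primary) return resp.primary;   // a secondary points at it
  }
  return null;                               // no primary visible (election?)
}

// Hypothetical responses from one secondary and one primary.
const responses = [
  { me: "b:27017", ismaster: false, secondary: true, primary: "a:27017" },
  { me: "a:27017", ismaster: true, secondary: false, primary: "a:27017" }
];
const primary = discoverPrimary(responses);
```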
[Diagram sequence: Failover Scenario — (1) the client discovers the active primary via isMaster; (2) the primary fails; (3) a secondary is elected as the new active primary and the client rediscovers it via isMaster.]
Replication Source Selection
● Select closest source
  ○ Limit to non-hidden, non-slave-delayed members
  ○ If nothing, try again with hidden/slave-delayed members
  ○ Select node with fastest "ping" time
  ○ Must be fresher
● Choose source when:
  ○ Starting
  ○ Any error with existing source (network, query)
  ○ Any member is 30s ahead of current source
● Manual override
  ○ replSetSyncFrom -- good until we choose again
Goal: Datacenter Aware
● Dynamic replication topologies
● Beachhead data center server
Goal: Dynamic Reads
Controls for consistency:
● Default to Primary
● Non-primary allowed
● Based on:
  ○ Locality (ping/tags)
  ○ Tags
[Diagram: a client routing reads across a primary and two secondaries, one tagged A, B and the other tagged B, C.]
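Tag-based read routing as in the diagram can be sketched as follows (the member/tag shapes and the function name are hypothetical):

```javascript
// Sketch: with no tags, reads default to the primary; with tags,
// restrict to members carrying all requested tags and pick the
// closest by ping time.
function selectReadTarget(members, tags) {
  if (!tags || tags.length === 0) {
    return members.find(m => m.isPrimary) || null;   // default to primary
  }
  const tagged = members.filter(m => tags.every(t => m.tags.includes(t)));
  if (tagged.length === 0) return null;
  return tagged.reduce((a, b) => (a.pingMs <= b.pingMs ? a : b)); // locality
}

// Hypothetical set matching the diagram: secondaries tagged A,B and B,C.
const members = [
  { host: "p",  isPrimary: true,  tags: [],         pingMs: 20 },
  { host: "s1", isPrimary: false, tags: ["A", "B"], pingMs: 8 },
  { host: "s2", isPrimary: false, tags: ["B", "C"], pingMs: 3 }
];
```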
Asynchronous Replication
● Important considerations
● Additional requirements
● System/Application controls
Write Propagation
● Write Concern
● Replication requirements
● Timing
● Dynamic requirements
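The write-concern idea can be sketched as a simple count over per-member applied timestamps (the function and the array of timestamps are hypothetical illustrations, not driver API):

```javascript
// Sketch: a write at timestamp writeTs is acknowledged for w:N once
// at least N members (counting the primary) have applied up to it;
// "majority" means strictly more than half the members.
function writeConcernSatisfied(memberAppliedTs, writeTs, w, totalMembers) {
  const acked = memberAppliedTs.filter(ts => ts >= writeTs).length;
  if (w === "majority") return acked > totalMembers / 2;
  return acked >= w;
}

// Primary and one secondary have applied ts 10; one secondary lags at 7.
const appliedTs = [10, 10, 7];
```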
Exceptional Conditions
● Multiple Primaries
● Rollback
● Too stale
Design and Goals (recap)

Goals
■ Highly Available
■ Consistent Data
■ Automatic Failover
■ Multi-Region/DC
■ Dynamic Reads

Design
● All DBs, each node
● Quorum/Election
● Smart clients
● Source selection
● Read Preferences
● Record operations
● Asynchronous
● Write/Replication acknowledgements
Thanks!

Questions?