Javantura v2 - Replication with MongoDB - what could go wrong... - Philipp Krenn

MongoDB Replication

Philipp Krenn

@xeraa

https://twitter.com/xeraa

Motivation

Availability & data safety

Read scalability

Helping backups

Data migration

Delayed members

Oplog tailing (Meteor.js)

https://meteorhacks.com/mongodb-oplog-and-meteor.html


Basics

Terminology

Primary + Secondaries

Master + Slaves: problematic terminology, since renamed

Arbiter

http://docs.mongodb.org





> rs.addArb("arbiter.example.com:3000")




Limits

50 replica set members

12 before 2.7.8

7 voting members
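
The two limits can be checked before a reconfiguration. A minimal sketch in plain JavaScript, assuming member documents shaped like the entries of rs.conf().members; the helper name checkReplicaSetLimits and the example hosts are hypothetical, and a missing votes field is assumed to count as one vote.

```javascript
// Limits quoted on the slide.
const MAX_MEMBERS = 50;
const MAX_VOTING_MEMBERS = 7;

// Hypothetical helper: returns a list of problems, empty if the config fits the limits.
// `members` mimics rs.conf().members; a missing `votes` field is treated as 1 (assumption).
function checkReplicaSetLimits(members) {
  const problems = [];
  const voting = members.filter((m) => (m.votes === undefined ? 1 : m.votes) > 0);
  if (members.length > MAX_MEMBERS) {
    problems.push(`too many members: ${members.length} > ${MAX_MEMBERS}`);
  }
  if (voting.length > MAX_VOTING_MEMBERS) {
    problems.push(`too many voting members: ${voting.length} > ${MAX_VOTING_MEMBERS}`);
  }
  return problems;
}

// Example: 9 members, of which only the first 7 vote.
const members = [];
for (let i = 0; i < 9; i++) {
  members.push({ _id: i, host: `host${i}.example.com:27017`, votes: i < 7 ? 1 : 0 });
}
console.log(checkReplicaSetLimits(members)); // []
```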

Example

Single instance

$ mkdir 1
$ mongod --dbpath 1 --port 27001 --logpath log1
$ mongo --port 27001
> db.test.insert({ name: "Philipp", city: "Wien" })
> db.test.find()

Stop instance

Add replication

$ mkdir 2
$ mkdir 3
$ mongod --replSet javantura --dbpath 1 --port 27001 --logpath log1 --oplogSize 20
$ mongod --replSet javantura --dbpath 2 --port 27002 --logpath log2 --oplogSize 20
$ mongod --replSet javantura --dbpath 3 --port 27003 --logpath log3 --oplogSize 20

Connect

$ hostname
$ mongo --port 27001
> db.test.find()

Configure replication

Start on the old instance, otherwise its data is lost

rs.initiate()
rs.status()
rs.add("PK-MBP:27002")
rs.add("PK-MBP:27003")
rs.status()
db.isMaster()
db.test.find()
db.test.insert({ name: "Peter", city: "Steyr" })
db.test.find()

Read from secondaries

$ mongo --port 27002
> db.test.find()
> rs.slaveOk()
> db.test.find()
> db.test.insert({ name: "Dieter", city: "Graz" }) // fails: secondaries do not accept writes

slaveOk is only valid for the current connection

Failover

Kill primary with [Ctrl]+[C]

Write to new primary

> rs.status()
> db.test.insert({ name: "Dieter", city: "Graz" })
> db.test.find()

Restart old primary

$ mongod --replSet javantura --dbpath 1 --port 27001 --logpath log1 --oplogSize 20
$ mongo --port 27001
> rs.status()
> rs.slaveOk()
> db.test.find()

Inner details

Capped collection oplog.rs in the local database

> use local
> show collections
me 0.000MB / 0.008MB
oplog.rs 0.000MB / 20.000MB
replset.minvalid 0.000MB / 0.008MB
slaves 0.000MB / 0.008MB
startup_log 0.003MB / 10.000MB
system.indexes 0.001MB / 0.008MB
system.replset 0.000MB / 0.008MB

Inner details

> db.oplog.rs.find()
{
  "h": NumberLong("-265486071808715859"),
  "ns": "test.test",
  "o": {
    "_id": ObjectId("541a8ed285ea5f8ae059d530"),
    "name": "Dieter",
    "city": "Graz"
  },
  "op": "i",
  "ts": Timestamp(1411026642, 1),
  "v": 2
}
...

Election

Heartbeat

2s interval

10s until election
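
The two numbers combine into a simple failure detector. A minimal sketch in plain JavaScript, assuming timestamps in milliseconds; the helper name isUnreachable is hypothetical and this is not MongoDB's actual heartbeat code.

```javascript
// Values from the slide.
const HEARTBEAT_INTERVAL_MS = 2000; // members ping each other every 2s
const ELECTION_TIMEOUT_MS = 10000;  // no reply for 10s: treat the member as unreachable

// Hypothetical helper: decide whether a member should be treated as down,
// given when its last heartbeat reply arrived.
function isUnreachable(lastHeartbeatReplyMs, nowMs) {
  return nowMs - lastHeartbeatReplyMs >= ELECTION_TIMEOUT_MS;
}

// Number of consecutive heartbeats that go unanswered before the timeout fires.
const missedHeartbeats = ELECTION_TIMEOUT_MS / HEARTBEAT_INTERVAL_MS;

console.log(isUnreachable(0, 4000));  // false
console.log(isUnreachable(0, 10000)); // true
console.log(missedHeartbeats);        // 5
```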

Election rules

1. Priority

2. Optime

3. Connections
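
One simplified reading of the three rules, sketched in plain JavaScript. This is an illustration only, not MongoDB's election code: the member shape (priority, optime, reachable), the hosts, and the function name pickCandidate are all assumptions.

```javascript
// Hypothetical member shape: { host, priority, optime, reachable }
//   priority:  0 means "never primary"; higher values are preferred
//   optime:    time of the last applied oplog entry (larger = fresher)
//   reachable: number of set members this node can connect to, itself included
function pickCandidate(members) {
  const majority = Math.floor(members.length / 2) + 1;
  // Rule 3 (connections): the node must see a majority of the set.
  const eligible = members.filter((m) => m.priority > 0 && m.reachable >= majority);
  // Rule 1 (priority) first, rule 2 (optime) as the tie breaker.
  eligible.sort((a, b) => b.priority - a.priority || b.optime - a.optime);
  return eligible.length > 0 ? eligible[0].host : null;
}

const candidates = [
  { host: "a:27001", priority: 0, optime: 105, reachable: 3 }, // priority 0: never primary
  { host: "b:27002", priority: 1, optime: 105, reachable: 3 },
  { host: "c:27003", priority: 2, optime: 100, reachable: 1 }, // cut off from the others
];
console.log(pickCandidate(candidates)); // b:27002
```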

Priority

cfg = rs.conf()
cfg.members[0].priority = 0
cfg.members[1].priority = 1
cfg.members[2].priority = 2
rs.reconfig(cfg)

Optime

Connections

Election

Candidate node asks for a vote

Others can veto

Election

Each member votes yes for at most one candidate within 30s

Majority yes elects a new primary
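
The majority rule as arithmetic, in plain JavaScript. A minimal sketch that ignores vetoes; the function names are hypothetical.

```javascript
// Votes needed for a majority among `votingMembers` voters.
function majorityNeeded(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// A candidate wins only with a majority of ALL voters in the set,
// not just of the voters that are currently reachable.
function winsElection(yesVotes, votingMembers) {
  return yesVotes >= majorityNeeded(votingMembers);
}

console.log(majorityNeeded(3));  // 2
console.log(majorityNeeded(4));  // 3
console.log(winsElection(2, 5)); // false
```

With 4 voters a majority is 3, so an even-sized set tolerates no more failures than the next smaller odd-sized one.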

Issues

CAP

Choose availability or consistency

Partition-tolerance is a prerequisite for distributed systems

"The network is reliable": http://aphyr.com/posts/288-the-network-is-reliable

Rollback

Old primary rolls back unreplicated changes once it rejoins the replica set
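
Which changes get rolled back can be sketched by comparing two oplogs. A minimal simulation in plain JavaScript, assuming simplified oplog entries of the form { ts, doc }; the function name findRollbackEntries and the sample data (including "Hans") are hypothetical, and the real server logic is more involved.

```javascript
// Hypothetical oplog entries: { ts, doc }, ordered by ts.
// Returns the entries the old primary holds that the new primary never received.
function findRollbackEntries(oldPrimaryOplog, newPrimaryOplog) {
  const replicated = new Set(newPrimaryOplog.map((e) => e.ts));
  // Walk backwards to the last entry both nodes share (the common point).
  let common = oldPrimaryOplog.length - 1;
  while (common >= 0 && !replicated.has(oldPrimaryOplog[common].ts)) {
    common--;
  }
  // Everything after the common point exists only on the old primary.
  return oldPrimaryOplog.slice(common + 1);
}

const oldPrimary = [
  { ts: 1, doc: { name: "Philipp" } },
  { ts: 2, doc: { name: "Peter" } },
  { ts: 3, doc: { name: "Dieter" } }, // written just before the crash, never replicated
];
const newPrimary = [
  { ts: 1, doc: { name: "Philipp" } },
  { ts: 2, doc: { name: "Peter" } },
  { ts: 4, doc: { name: "Hans" } }, // written after the failover
];

console.log(findRollbackEntries(oldPrimary, newPrimary).map((e) => e.ts)); // [ 3 ]
```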

Rollback file

rollback/ in data folder

File name: <database>.<collection>.<timestamp>.bson

Election time

At times 5 to 7 minutes

http://www.tokutek.com/2014/07/explaining-ark-part-2-how-elections-and-failover-currently-work/



Missing synchronization during election

Old primary sends last changes to a single node

If that node does not become the new primary: rollback

Remember

Replication is asynchronous

Multiple primaries

Unlikely but possible

Bugs: https://jira.mongodb.org/browse/SERVER-9765

Test script with no replies: https://groups.google.com/forum/#!topic/mongodb-dev/-mH6BOYyzeI


Kyle Kingsbury @aphyr: Call Me Maybe

http://aphyr.com/tags/jepsen

PostgreSQL, Redis, MongoDB, Riak, Zookeeper, RabbitMQ, etcd + Consul, ElasticSearch

http://aphyr.com/posts/284-call-me-maybe-mongodb

05/2013 version 2.4

Up to 42% data lost

Data written to old primary: rollback



WriteConcern

Configure durability vs performance

https://github.com/mongodb/mongo-java-driver/blob/master/src/main/com/mongodb/WriteConcern.java

WriteConcern.UNACKNOWLEDGED

w=0, j=0

Fire and forget

Default until 11/2012

WriteConcern.ACKNOWLEDGED

w=1, j=0

Current default

Operation completed successfully in memory

WriteConcern.JOURNALED

w=1, j=1

Operation written to the journal file

Since 1.8, single server durability

WriteConcern.FSYNCED

w=1, fsync=true

Operation written to disk

WriteConcern.REPLICA_ACKNOWLEDGED

w=2, j=0

Acknowledged by primary and at least one secondary

w is the number of servers that must acknowledge the write

WriteConcern.MAJORITY

w=majority, j=0

Acknowledgement by the majority of nodes

wtimeout recommended

WriteConcern.MAJORITY

Nearly no data lost, but high overhead
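
The levels above, collected into one lookup table in plain JavaScript. A minimal sketch: the w and j values are the ones from the slides, FSYNCED is left out because it uses a separate fsync flag, and the helper name writeConcernFor and the 5000 ms timeout are assumptions for illustration.

```javascript
// w / j values as listed on the slides for the Java driver constants.
const WRITE_CONCERNS = {
  UNACKNOWLEDGED: { w: 0, j: false },
  ACKNOWLEDGED: { w: 1, j: false },
  JOURNALED: { w: 1, j: true },
  REPLICA_ACKNOWLEDGED: { w: 2, j: false },
  MAJORITY: { w: "majority", j: false },
};

// Hypothetical helper: build a write concern document, adding wtimeout (ms)
// whenever the write waits for other members, so it cannot block forever.
function writeConcernFor(name, wtimeoutMs) {
  const base = WRITE_CONCERNS[name];
  if (!base) throw new Error(`unknown write concern: ${name}`);
  const waitsForSecondaries = base.w === "majority" || base.w > 1;
  return waitsForSecondaries ? { ...base, wtimeout: wtimeoutMs } : { ...base };
}

console.log(writeConcernFor("MAJORITY", 5000));
// In the mongo shell (2.6 and later) such a document could be passed as, for example:
// db.test.insert({ name: "Dieter", city: "Graz" }, { writeConcern: { w: "majority", wtimeout: 5000 } })
```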

Write concern performance

https://blog.serverdensity.com/mongodb-on-google-compute-engine-tips-and-benchmarks/

3 x 1,000 inserts on GCE

Local 10GB system disk

Dedicated 200GB disk

Dedicated 200GB for data and journal


n1-standard-2

n1-highmem-8

Thanks! Questions?

Now, later today, or @xeraa

https://twitter.com/xeraa

Backup Slides

Oplog

Replication via logs

MongoDB: Operations log (Oplog)

MySQL: Binary log (Binlog)

Naive approach: transmit the original query

Statement-Based Replication (SBR)

DELETE FROM test.table WHERE quantity > 20 LIMIT 1

db.collection.remove({ quantity: { $gt: 20 }}, true) // justOne: true

Unambiguous representation

Row-Based Replication (RBR): Oplog
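
Why logging the affected document is unambiguous while replaying the query is not, sketched with in-memory arrays in plain JavaScript. The data, the namespace test.collection, and the helper names removeOneAndLog and applyDelete are hypothetical; the entry only mimics the shape of an oplog delete (op "d", document identified by _id).

```javascript
// In-memory stand-in for a collection on the primary (hypothetical data).
const primary = [
  { _id: 1, quantity: 10 },
  { _id: 2, quantity: 25 },
  { _id: 3, quantity: 30 },
];

// Primary: remove ONE matching document and log WHICH document was removed
// (by _id) instead of logging the query itself.
function removeOneAndLog(collection, predicate, oplog, ns) {
  const index = collection.findIndex(predicate);
  if (index === -1) return;
  const [removed] = collection.splice(index, 1);
  oplog.push({ op: "d", ns, o: { _id: removed._id } });
}

// Secondary: apply the logged operation; the query is never re-evaluated.
function applyDelete(collection, entry) {
  const index = collection.findIndex((d) => d._id === entry.o._id);
  if (index !== -1) collection.splice(index, 1);
}

const oplog = [];
// The secondary stores the same documents in a different physical order,
// so re-running "remove one where quantity > 20" there would hit _id 3.
const secondary = [primary[2], primary[0], primary[1]].map((d) => ({ ...d }));

removeOneAndLog(primary, (d) => d.quantity > 20, oplog, "test.collection");
oplog.forEach((entry) => applyDelete(secondary, entry));

console.log(oplog); // a single delete entry for _id 2
```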

MongoDB

Asynchronous replication

Secondaries can get the Oplog from...

their primary

another secondary with more recent data

Oplog size

32bit: 48MB

64bit OS X: 183MB

64bit *nix, Windows: 1GB to 50GB (5% free disk)
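
The 64bit *nix and Windows default as a formula, in plain JavaScript. A minimal sketch of the rule on this slide (5% of free disk, clamped to 1GB and 50GB); the function name defaultOplogSizeBytes is hypothetical and GB is taken as 1024^3 bytes.

```javascript
const GB = 1024 * 1024 * 1024;

// Default from the slide for 64bit *nix and Windows:
// 5% of free disk space, but at least 1GB and at most 50GB.
function defaultOplogSizeBytes(freeDiskBytes) {
  const fivePercent = Math.floor(freeDiskBytes / 20); // 5% = 1/20
  return Math.min(50 * GB, Math.max(1 * GB, fivePercent));
}

console.log(defaultOplogSizeBytes(10 * GB) / GB);   // 1
console.log(defaultOplogSizeBytes(200 * GB) / GB);  // 10
console.log(defaultOplogSizeBytes(2000 * GB) / GB); // 50
```

The demo instances in this deck override the default with --oplogSize 20, which is why oplog.rs shows a 20MB cap.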
