Ensuring Consistency in a Replicated World Josh Snyder 20140930

Ensuring Consistency in a Replicated World

Josh Snyder, 2014-09-30

what is Yelp?

goals

• we operate in a bunch of markets
• aim to be globally distributed
• our users should never see stale content
• our developers should be able to design an application resilient to replication delay

a sample architecture

our toolset

• a small set of moving parts
• enables us to do more with fewer shards
• masks geographic traffic split from users and developers
• enhanced tolerance to replication delay
• ability to
  – perform online replication hierarchy changes
  – batch-load data

cookies

cookies

• give the client a short-lived “dirty session” cookie
• encode the time of the latest interaction between you and them
• expire or ignore the cookie after replicas have caught up
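A minimal sketch of the dirty session cookie, under stated assumptions: the cookie value is a timestamp taken from a single reference clock (for example the master's heartbeat at the time of the user's write), and the webapp can obtain the lowest heartbeat seen across replicas. The function names and the value format are hypothetical; the slides do not show the real encoding.

```python
from typing import Optional


def encode_dirty_cookie(write_heartbeat: float) -> str:
    """Encode the time of the user's latest write as the cookie value."""
    return f"{write_heartbeat:.6f}"


def is_dirty(cookie_value: Optional[str], lowest_replica_heartbeat: float) -> bool:
    """Return True while some replica may still lack the user's write."""
    if not cookie_value:
        return False  # no cookie: the user has no recent write to protect
    try:
        write_heartbeat = float(cookie_value)
    except ValueError:
        return False  # malformed cookie: ignore it (an assumption of this sketch)
    # The cookie can be ignored only once every replica has passed the write.
    return lowest_replica_heartbeat <= write_heartbeat
```

Comparing with `<=` rather than `<` errs on the side of treating the session as dirty for slightly too long, which costs some latency but never shows the user stale content.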

request routing

• load balancer:
  – POST?
  – GET? -> cookie?
• routes the request into the appropriate datacenter
• adds headers to requests
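The routing decision might look roughly like the sketch below. The slide lists only the questions the load balancer asks (POST? GET? cookie?), not the outcomes, so the outcomes here are assumptions: writes and requests carrying a dirty session cookie go to the datacenter holding the master, everything else stays local. The datacenter labels, the cookie name, and the header name are all hypothetical.

```python
from typing import Dict, Tuple

LOCAL = "local"    # hypothetical label: nearest datacenter, served from replicas
MASTER = "master"  # hypothetical label: datacenter that holds the master database
DIRTY_COOKIE = "dirty_session"  # hypothetical cookie name


def route_request(method: str, cookies: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Pick a datacenter and the extra headers to attach to the request."""
    if method.upper() != "GET":
        # Assumption: every non-GET method is treated like POST (a write).
        return MASTER, {"X-Route-Reason": "write"}
    if DIRTY_COOKIE in cookies:
        # The user wrote recently; replicas may not have their data yet.
        return MASTER, {"X-Route-Reason": "dirty-session"}
    return LOCAL, {}
```

For example, `route_request("GET", {})` stays in the local datacenter, while the same request with a `dirty_session` cookie is sent to the master's datacenter.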

tradeoffs

• users get read-after-write consistency
• routing a user’s request between datacenters increases latency
• getting it wrong: increased load on the master database

tradeoffs

• we need to be assured that a user’s request falls back to a datacenter that has all of their data

replication delay

• we need a clear picture of it
• never underestimate replication delay, always overestimate

Seconds_Behind_Master

• made of lies (for this purpose)
• underestimates most of the time
• overestimates some of the time

http://bugs.mysql.com/bug.php?id=66921

heartbeats

heartbeats

• insert known data on the master
• wait until you see it on the slave
• time waited is replication delay
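The three steps above can be sketched with a single observer that writes a token, polls for it, and times the wait on its own monotonic clock. `FakeReplica` is a hypothetical in-memory stand-in for a real master/slave pair, used only so the sketch runs on its own; a real check would issue an INSERT on the master and a SELECT on the slave.

```python
import threading
import time


class FakeReplica:
    """Hypothetical stand-in: a write becomes visible after a fixed lag."""

    def __init__(self, lag_seconds: float) -> None:
        self.lag_seconds = lag_seconds
        self.rows = set()

    def replicate(self, token: str) -> None:
        # Simulate replication by applying the write after the lag elapses.
        threading.Timer(self.lag_seconds, self.rows.add, args=(token,)).start()


def measure_delay(write_to_master, visible_on_slave, token,
                  poll_interval=0.01, timeout=5.0) -> float:
    """Insert known data, wait until the slave has it, return the time waited."""
    start = time.monotonic()  # one clock, used only to measure elapsed time
    write_to_master(token)
    while not visible_on_slave(token):
        if time.monotonic() - start > timeout:
            raise TimeoutError("token never appeared on the slave")
        time.sleep(poll_interval)
    return time.monotonic() - start
```

Because the observer only notices the token at the next poll, the result can exceed the true delay by up to `poll_interval`, so this measurement errs toward overestimating.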

clocks are evil

clocks are evil (2)

pt-heartbeat

yelp_heartbeat

the secret sauce

what does that get us? (pt 1)

• A sensu check:

why that way?

time is hard

http://bugs.mysql.com/bug.php?id=48326

repl_delay_reporter

• aggregates heartbeat information
• provides it to the webapp
• determines when to expire the dirty session cookie
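As an illustration of the aggregation role, the sketch below keeps the latest heartbeat reported for each replica and exposes the lowest one. The class and method names are hypothetical, and it assumes each heartbeat is the master-written value most recently seen on that replica; the slides do not describe the real service's interface.

```python
from typing import Dict


class ReplDelayReporter:
    """Hypothetical in-memory aggregator of per-replica heartbeats."""

    def __init__(self) -> None:
        self._heartbeats: Dict[str, float] = {}

    def record(self, replica: str, heartbeat: float) -> None:
        """Store the most recent heartbeat value observed on a replica."""
        self._heartbeats[replica] = heartbeat

    def lowest_heartbeat(self) -> float:
        """Heartbeat of the most delayed replica."""
        if not self._heartbeats:
            raise LookupError("no heartbeats recorded yet")
        return min(self._heartbeats.values())

    def all_replicas_past(self, heartbeat: float) -> bool:
        """True once every replica has replicated beyond the given heartbeat."""
        return self.lowest_heartbeat() > heartbeat
```

Reporting the minimum means a single lagging replica holds the answer back, which matches the rule of overestimating delay rather than underestimating it.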

operations

• Wait for replication:
  – “I inserted some data; when will it be available on all replicas?”
• Throttle to replication:
  – “I want to bulk insert data. Will doing so cause too much replication delay?”

wait for replication

• insert some data
• ask the master database “what’s the heartbeat right now?”
• ask the repl_delay_reporter “what’s the lowest heartbeat right now?”
• wait a bit
• loop until the lowest heartbeat exceeds the original master heartbeat
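The loop described above can be sketched as follows. The two callables are hypothetical hooks: one reads the current heartbeat from the master, the other asks the repl_delay_reporter for the lowest heartbeat across replicas. The polling interval and timeout are illustrative values, not ones from the talk.

```python
import time
from typing import Callable


def wait_for_replication(master_heartbeat: Callable[[], float],
                         lowest_replica_heartbeat: Callable[[], float],
                         poll_interval: float = 0.5,
                         timeout: float = 60.0,
                         sleep: Callable[[float], None] = time.sleep) -> float:
    """Call after inserting data; returns once all replicas have caught up."""
    # The master's heartbeat right after our write marks the point to reach.
    target = master_heartbeat()
    deadline = time.monotonic() + timeout
    # Loop until the lowest heartbeat exceeds the original master heartbeat.
    while lowest_replica_heartbeat() <= target:
        if time.monotonic() >= deadline:
            raise TimeoutError("replicas did not catch up in time")
        sleep(poll_interval)  # wait a bit
    return target
```

Both compared values originate from heartbeats written on the master, so the comparison never depends on the replicas' clocks agreeing with each other.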

wait for replication

• determines when to expire the dirty session cookie
• relies on only 1 clock, and only for monotonicity
• used heavily by batches
  – provides read-after-write consistency

throttle to replication

• prevents batches from causing excessive replication delay
• operates before the beginning of each transaction
  – batches ask “is replication delay low enough for me to write right now?”
• batches are required to keep their transactions reasonably sized
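A minimal sketch of throttling, assuming a hypothetical `current_delay` callable that returns the current replication delay in seconds and a caller-chosen `max_delay` threshold. The batch checks the delay before each transaction and writes in small chunks; the chunking and function names are illustrative.

```python
import time
from typing import Callable, Iterable, List


def throttle_to_replication(current_delay: Callable[[], float],
                            max_delay: float,
                            poll_interval: float = 1.0,
                            sleep: Callable[[float], None] = time.sleep) -> None:
    """Block until replication delay is low enough to write right now."""
    while current_delay() > max_delay:
        sleep(poll_interval)


def run_batch(chunks: Iterable[List[str]],
              write_chunk: Callable[[List[str]], None],
              current_delay: Callable[[], float],
              max_delay: float,
              sleep: Callable[[float], None] = time.sleep) -> None:
    """Write each chunk in its own transaction, throttling before each one."""
    for chunk in chunks:
        throttle_to_replication(current_delay, max_delay, sleep=sleep)
        write_chunk(chunk)  # one reasonably sized transaction
```

The check runs only between transactions, so a single oversized chunk can still create delay that the throttle cannot prevent; this is why chunk size matters.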

gotchas

• load on masters
• laggards
• over-throttling

what this gets us

• batch data can reside on the same shards that serve OLTP requests
• support databases with heterogeneous SLAs
• automatic load-shedding when there is a replication issue

what this gets us

• shunting of nearly ALL reading and reporting off of the master
• better mileage out of Percona Toolkit
• online replication hierarchy changes
