Ensuring Consistency in a Replicated World
Josh Snyder 2014-09-30
2
what is Yelp?
• we operate in a bunch of markets
• aim to be globally distributed
• our users should never see stale content
• our developers should be able to design an application resilient to replication delay
3
goals
4
a sample architecture
• a small set of moving parts
• enables us to do more with fewer shards
• masks geographic traffic split from users and developers
• enhanced tolerance to replication delay
• ability to
  – perform online replication hierarchy changes
  – batch-load data
5
our toolset
6
cookies
• give the client a short-lived “dirty session” cookie
• encode the time of the latest interaction between you and them
• expire or ignore the cookie after replicas have caught up
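As a rough illustration of the cookie bullets above, here is a minimal sketch of a dirty session cookie that stores the time of the latest write. The function names, the TTL value, and the `replicas_caught_up_to` input are all hypothetical; the slides do not show how the cookie is actually encoded or checked.

```python
import time

# Hypothetical lifetime; the slides only say the cookie is "short-lived".
DIRTY_COOKIE_TTL_SECONDS = 300


def make_dirty_cookie(now=None):
    """Return a cookie value encoding the time of the latest write."""
    now = time.time() if now is None else now
    return f"{now:.3f}"


def cookie_is_dirty(cookie_value, replicas_caught_up_to, now=None):
    """Decide whether the session still needs to avoid stale replicas.

    replicas_caught_up_to is an assumed input: the timestamp up to which
    every replica is known to have applied the master's writes.
    """
    now = time.time() if now is None else now
    try:
        last_write = float(cookie_value)
    except (TypeError, ValueError):
        return False  # missing or malformed cookie: treat as clean
    if now - last_write > DIRTY_COOKIE_TTL_SECONDS:
        return False  # expired: ignore the cookie
    # Still dirty only while some replica may lack the user's last write.
    return last_write > replicas_caught_up_to
```

Treating a missing or malformed cookie as clean is one possible design choice; it keeps unnecessary traffic off the master at the cost of trusting the client to send the cookie back.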
7
cookies
• load balancer:
  – POST?
  – GET? -> cookie?
• routes the request into the appropriate datacenter
• adds headers to requests
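The load balancer decision above could be sketched roughly as follows. The slide only lists the questions asked (POST? GET? cookie?), so the outcomes here are an assumption: writes and dirty sessions go to the datacenter holding the master, and everything else stays local. `choose_datacenter` and its parameters are hypothetical names.

```python
def choose_datacenter(method, has_dirty_cookie, local_dc, master_dc):
    """Pick a datacenter for one request (assumed routing policy).

    Assumption: any non-GET request may write, so it goes to the
    datacenter that hosts the master database.
    """
    if method.upper() != "GET":
        return master_dc
    # A GET from a user with a dirty session must see their own writes.
    if has_dirty_cookie:
        return master_dc
    # Clean reads can be served from the nearest replicas.
    return local_dc
```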
8
request routing
• users get read-after-write consistency
• routing a user’s request between datacenters increases latency
• getting it wrong: increased load on the master database
9
tradeoffs
• we need to be assured that a user’s request falls back to a datacenter that has all of their data
10
tradeoffs
• we need a clear picture of replication delay
• never underestimate replication delay; always overestimate it
11
replication delay
• made of lies (for this purpose)
• underestimates most of the time
• overestimates some of the time
12
Seconds_Behind_Master
http://bugs.mysql.com/bug.php?id=66921
13
heartbeats
• insert known data on the master
• wait until you see it on the slave
• time waited is replication delay
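The three steps above can be sketched as a small polling loop. This is only an illustration: `write_to_master` and `read_from_replica` are hypothetical callables standing in for real database access, and the `clock` and `sleep` parameters exist so the loop can be exercised without a real database.

```python
import time
import uuid


def measure_replication_delay(write_to_master, read_from_replica,
                              poll_interval=0.05, timeout=30.0,
                              clock=time.monotonic, sleep=time.sleep):
    """Estimate replication delay by timing one round trip of known data."""
    token = uuid.uuid4().hex        # the "known data"
    start = clock()
    write_to_master(token)          # insert known data on the master
    while read_from_replica() != token:   # wait until the replica shows it
        if clock() - start > timeout:
            raise TimeoutError("replica did not catch up within timeout")
        sleep(poll_interval)
    return clock() - start          # time waited is the replication delay
```

Because the result is rounded up to the next poll, this sketch tends to overestimate rather than underestimate the delay.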
14
heartbeats
15
clocks are evil
16
clocks are evil (2)
17
pt-heartbeat
18
yelp_heartbeat
19
the secret sauce
• A sensu check:
20
what does that get us? (pt 1)
21
why that way?
• aggregates heartbeat information
• provides it to the webapp
• determines when to expire the dirty session cookie
23
repl_delay_reporter
• Wait for replication:
  – “I inserted some data; when will it be available on all replicas?”
• Throttle to replication:
  – “I want to bulk insert data. Will doing so cause too much replication delay?”
24
operations
• insert some data
• ask the master database “what’s the heartbeat right now?”
• ask the repl_delay_reporter “what’s the lowest heartbeat right now?”
• wait a bit
• loop until the lowest heartbeat exceeds the original master heartbeat
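A minimal sketch of the loop described above, to be called right after the insert. The two callables are hypothetical stand-ins for querying the master's heartbeat and asking repl_delay_reporter for the lowest replica heartbeat; the real interfaces are not shown in the slides. The timeout is an added safeguard, not something the slides mention.

```python
import time


def wait_for_replication(get_master_heartbeat, get_lowest_heartbeat,
                         poll_interval=0.5, timeout=60.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Block until every replica has passed the master's current heartbeat."""
    target = get_master_heartbeat()   # heartbeat at the time of our write
    deadline = clock() + timeout
    while True:
        lowest = get_lowest_heartbeat()
        if lowest > target:           # slowest replica is past our write
            return lowest
        if clock() >= deadline:
            raise TimeoutError("replicas did not catch up within timeout")
        sleep(poll_interval)          # wait a bit, then ask again
```

Only heartbeat values are compared with each other, so the caller's own wall clock never enters the consistency decision.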
25
wait for replication
• determines when to expire the dirty session cookie
• relies on only 1 clock, and only for monotonicity
• used heavily by batches
  – provides read-after-write consistency
26
wait for replication
• prevents batches from causing excessive replication delay
• operates before the beginning of each transaction
  – batches ask “is replication delay low enough for me to write right now?”
• batches are required to keep their transactions reasonably sized
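The throttling behaviour above might look something like the following in a batch job. This is a sketch under assumptions: `get_replication_delay` and `write_chunk` are hypothetical callables, and the threshold and chunk size are placeholder values, not figures from the talk.

```python
import time


def throttle_to_replication(get_replication_delay, max_delay_seconds=5.0,
                            poll_interval=1.0, sleep=time.sleep):
    """Block until replication delay is low enough to start writing."""
    while get_replication_delay() > max_delay_seconds:
        sleep(poll_interval)


def run_batch(rows, write_chunk, get_replication_delay, chunk_size=100,
              max_delay_seconds=5.0, poll_interval=1.0, sleep=time.sleep):
    """Write rows in small transactions, throttling before each one."""
    chunks = 0
    for i in range(0, len(rows), chunk_size):
        # Ask "is replication delay low enough for me to write right now?"
        throttle_to_replication(get_replication_delay, max_delay_seconds,
                                poll_interval, sleep)
        write_chunk(rows[i:i + chunk_size])   # one reasonably sized transaction
        chunks += 1
    return chunks
```

Keeping each chunk small matters because the check runs only before a transaction begins; a single huge transaction could still push the replicas far behind.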
27
throttle to replication
• load on masters
• laggards
• over-throttling
28
gotchas
• batch data can reside on the same shards that serve OLTP requests
• support databases with heterogeneous SLAs
• automatic load-shedding when there is a replication issue
29
what this gets us
• shunting of nearly ALL reading and reporting off of the master
• better mileage out of the Percona Toolkit
• on-line replication hierarchy changes
30
what this gets us