Paxos Quorum Leases
Sayed Hadi Hashemi
BACKGROUND
• Setting
  – System: Key-Value Storage
  – Commands: Read / Write / Batch (Read, Write)
  – Goal: minimize WAN delay
• Original Paxos
  – Read: at least 2 RT (more in case of dueling leaders)
  – Write: at least 2 RT
[Figure: Basic Paxos message flow between a Client and Replicas 0–2 — Request (CMD), Prepare / OK (1 RT), Accept (CMD) / Accept (OK) (1 RT), Committed (CMD), Result / ACK.]

Can we do any better?
• Multi-Paxos
  – A temporary stable leader replica lets the protocol skip the Prepare (election) phase
  – Read: 1 RT from the leader
  – Write: the same as a read (1 RT)
  – A replica becomes the stable leader by running the Prepare phase for a large number of instances at the same time, taking ownership of all of them.
• Google's Megastore
  – All replicas are leaders!
  – Read: 0 RT from any replica! (reading locally)
  – Write: at least 1 RT, to all replicas
[Figure: Steady-state interaction in Multi-Paxos — the Client sends Request (CMD) to the leader (Replica 0), which runs Accept (CMD) / Accept (OK) with the other replicas, then Committed (CMD) and Result/ACK. The asynchronous messages are represented as dashed arrows.]

[Figure: Megastore — a Request (Read) is answered locally by any replica (0 RT); a Request (Write) requires Accept (CMD) / Accept (OK) at all replicas before the ACK.]
Can we have the benefits of both?
• Quorum Leases
  – A middle ground
  – Read: 0 RT most of the time (80% of reads in the experiments), 1 RT otherwise
  – Write: almost the same as in Multi-Paxos
QUORUM LEASES
Overview
• The idea is to maintain multiple leases, each covering a different set of objects
• Each lease is granted to its lease holders by a majority of grantors
• Read:
  – Lease holders can read locally while the lease is active
  – Anyone else uses Multi-Paxos
• Write:
  – Notify lease holders synchronously through the lease grantors (a majority)
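The read/write decision above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class names, the `paxos_read`/`paxos_write` stubs, and the way notifications are modeled are all assumptions.

```python
import time

class MultiPaxosLeader:
    """Stand-in for the Multi-Paxos leader. The real leader runs the
    Accept/Commit protocol and synchronously notifies lease holders
    (through the grantors) before committing a write."""

    def __init__(self):
        self.store = {}

    def paxos_read(self, key):
        return self.store.get(key)      # costs >= 1 RT from a remote replica

    def paxos_write(self, key, value):
        self.store[key] = value         # Accept/Commit, then notify holders
        return True

class QuorumLeaseReplica:
    """Sketch of the quorum-lease read/write paths (hypothetical API)."""

    def __init__(self, leader):
        self.leader = leader
        self.store = {}                 # local copy, kept fresh by write notifications
        self.lease_objects = set()      # objects this replica holds a lease on
        self.lease_expiry = 0.0         # when the current lease expires locally

    def holds_active_lease(self, key):
        return key in self.lease_objects and time.time() < self.lease_expiry

    def read(self, key):
        if self.holds_active_lease(key):
            return self.store.get(key)      # 0 RT: serve locally
        return self.leader.paxos_read(key)  # otherwise fall back to Multi-Paxos

    def write(self, key, value):
        # All writes go through the leader; before committing, the leader
        # must synchronously notify every lease holder of this object.
        ok = self.leader.paxos_write(key, value)
        self.store[key] = value             # this holder was notified synchronously
        return ok
```

The key invariant is that a local read is safe only because every committed write was synchronously pushed to all active lease holders first.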
• Lease Configuration
  – Describes which set of objects each quorum lease covers
  – A replica is added to a lease if it reads an object frequently
  – A replica is removed from a lease if it fails, or if it stops reading an object frequently
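The membership rule above can be written as a small function. This is only a sketch: the per-interval read counters and the `min_reads` threshold are assumptions, not values from the paper.

```python
def update_lease_membership(read_counts, failed_replicas, min_reads=10):
    """Recompute which replicas should hold the lease on one object.

    read_counts: {replica_id: reads of the object in the last
    configuration interval}. A replica stays in the lease only if it
    read the object frequently enough (>= min_reads, a hypothetical
    threshold) and has not failed.
    """
    return {r for r, n in read_counts.items()
            if n >= min_reads and r not in failed_replicas}

# Example: replica C has failed; replica B rarely reads the object.
holders = update_lease_membership({"A": 120, "B": 2, "C": 95}, {"C"})
```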
• Granting and Refreshing Leases
  – A majority of grantors, ⌈(N+1)/2⌉, activate a lease for a set of holders
  – Each grantor promises each holder r to:
    • Notify r synchronously before committing any update
    • Acknowledge "Accept" and "Prepare" messages for writes only on the condition that the proposer notifies r synchronously
[Figure: Granting a lease — the grantor sends Guard and then Grant Lease; the holder replies with Guard ACK and Promise; the grantor answers with Promise ACK (times T1–T5). The guard period t_guard precedes the lease; the holder's lease duration is t_lease1 if the Promise ACK was received, t_lease2 if it was not.]
[Figure: Renewing a lease — after the initial Guard / Guard ACK and Promise / Promise ACK exchange, each further Promise / Promise ACK round extends the lease by another t_lease (times T1–T6). As before, the holder's duration is t_lease1 if the Promise ACK was received, t_lease2 otherwise.]
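The holder's side of the timelines above can be summarized with a small helper. This is a sketch of the general idea, not the paper's exact rule; the duration parameters correspond to the two cases (t_lease1 / t_lease2) in the timing figures.

```python
def holder_lease_expiry(t_promise_sent, promise_ack_received,
                        t_lease_acked, t_lease_unacked):
    """Holder-side lease expiry (sketch; durations are parameters).

    The holder counts from the moment it SENT its Promise, so its local
    expiry cannot be later than the grantor's view of the lease (only
    bounded clock drift is assumed, not synchronized clocks). Which
    duration applies depends on whether the grantor's Promise ACK
    arrived, matching the two cases in the timing figures.
    """
    duration = t_lease_acked if promise_ack_received else t_lease_unacked
    return t_promise_sent + duration
```

Measuring from the send time rather than the receive time is the standard trick that makes leases safe without synchronized clocks.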
EVALUATION
Evaluation
• Ran implementations of quorum leases, classic leader leases, and Megastore-type leases
• Geo-distributed Amazon EC2 cluster
• 5 Multi-Paxos replicas: Virginia, Northern California, Oregon, Ireland, and Japan
• 10 clients co-located with each replica
• Workloads:
  – YCSB key-value workload (Zipf distribution)
  – Uniform key-value workload
Northern California is selected as the leader because of its low RTT to the other replicas.
Test 1: Latency Evaluation
• Multi-Paxos leader: Northern California
• Each client sends 10,000 requests to its co-located replica
• Request mixes:
  – 1:1 read-write
  – 9:1 read-write
• Parameters:
  – Lease duration: 2 s; renew duration: 500 ms; lease configuration update: every 10 s
Leader leases (LL) perform best for writes; Megastore-type leases (ML) perform best for reads.
Test 2: Recovering from a Replica Failure
• Shut down a (non-leader) replica 10 s after starting the test (right after a lease configuration update)
• Parameters:
  – Guard duration: 2 s; grace delay: 5 s; lease duration: 2 s; renew duration: 500 ms; lease configuration update: every 10 s
• Recovery time:
  – Update + Grace + Guard + Lease
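Plugging in the parameters of this test gives a quick sanity check on the recovery bound (a worked computation, not code from the paper; the failure may strike just after a configuration update, so the update term is worst-case a full period):

```python
def worst_case_recovery_s(update_period, grace, guard, lease):
    """Upper bound on time to recover from a replica failure:
    wait up to a full configuration-update period for the failure to be
    reflected, then the grace delay, then a guard duration, and finally
    for the old lease to expire."""
    return update_period + grace + guard + lease

# Test 2 parameters: update every 10 s, grace 5 s, guard 2 s, lease 2 s.
print(worst_case_recovery_s(10, 5, 2, 2))  # 19 (seconds), i.e. roughly 20 s
```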
Test 3: Throughput in a Cluster
• Run in a single local cluster (not geo-distributed)
• Requests are generated open-loop, by one client per replica
• Two situations:
  – (1) Different objects are popular at different replicas
  – (2) Clients direct their reads uniformly at random across all replicas
• Batching is used to commit writes (the leader batches up to 5,000 updates at a time)
REVIEW
Pro
• Strong consistency
• Acceptable availability
• Combines the best of the two approaches
• Leases are per object, instead of per replica
• Separates "Lease Configuration Updates" from the other operations
• Compatible with Multi-Paxos (and other implementations)
Cons
• What is the messaging overhead?
  – Lease renewal
  – Lease configuration updates
• The experiments cover only 1:1 vs. 9:1 read-write ratios
• Recovery time in practice:
  – Update + Grace + Guard + Lease
  – Worst case: about 20 s
QUESTIONS?
Thanks for your attention!