Consensus as a Network Service
Huynh Tu Dang, Pietro Bressana, Han Wang, Ki Suh Lee, Hakim Weatherspoon, Marco Canini, Fernando Pedone, and Robert Soulé
Università della Svizzera italiana (USI), Cornell University, and KAUST
Consensus is a Fundamental Problem
Many distributed problems can be reduced to consensus
E.g., Atomic broadcast, atomic commit
Consensus protocols are the foundation for fault-tolerant systems
E.g., OpenReplica, Ceph, Chubby
Any improvement in performance would have HUGE impact
Key Idea: Move Consensus Into Network Hardware
This work focuses on Paxos
One of the most widely used consensus protocols
It has been proven correct
Enabling technology trends:
Hardware is becoming more flexible: e.g. PISA, FlexPipe, NFP-6xxx
Hardware is becoming more programmable: e.g., POF, PX, and P4
Outline of This Talk
Introduction
Consensus Background
Design, Implementation & Evaluation
Conclusions
Paxos Roles and Communication
Proposers propose values
A distinct proposer assumes the role of Coordinator
Acceptors accept a proposal and promise not to accept proposals from lower rounds
Learners require a quorum of messages from Acceptors before they “deliver” a value
[Figure: Paxos message flow — the Proposer sends a proposal to the Coordinator, the Coordinator sends Phase 2A messages to Acceptors 1–3, and the Acceptors send Phase 2B messages to the Learners (up to n)]
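The learner's quorum rule above can be sketched in C. This is a model of ours, not code from the paper; the names (`on_phase2b`, `instance_state`) and the three-acceptor setup are assumptions for illustration:

```c
#include <assert.h>

/* Minimal sketch of a learner's quorum check: a value is delivered once
 * a majority of the acceptors have sent matching Phase 2B messages for
 * the same instance and round. */
#define NUM_ACCEPTORS 3
#define QUORUM ((NUM_ACCEPTORS / 2) + 1)

struct instance_state {
    int rnd;        /* round of the 2B messages counted so far */
    int votes;      /* matching 2B messages received */
    int delivered;  /* value already delivered? */
};

/* Returns 1 exactly once, when the quorum is first reached. */
int on_phase2b(struct instance_state *st, int msg_rnd) {
    if (msg_rnd > st->rnd) {  /* newer round: restart the count */
        st->rnd = msg_rnd;
        st->votes = 0;
        st->delivered = 0;
    }
    if (msg_rnd < st->rnd || st->delivered)
        return 0;             /* stale round or already delivered */
    st->votes++;
    if (st->votes >= QUORUM) {
        st->delivered = 1;
        return 1;
    }
    return 0;
}
```

With three acceptors the quorum is two, so the second matching 2B message triggers delivery and later duplicates are ignored.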
Design
Design Goals 1: Be a Drop-In Replacement
István et al. [NSDI ’16] implement ZAB in FPGAs, but require that the application be written in a hardware description language
High-level languages make hardware development easier
Implementing LevelDB in P4 might still be tricky…
Standard Paxos API
void submit(struct paxos_ctx *ctx, char *value, int size);

void (*deliver)(struct paxos_ctx *ctx, int instance, char *value, int size);

void recover(struct paxos_ctx *ctx, int instance, char *value, int size);
Figure 4: CAANS application-level API.
paxos_ctx struct. When a learner learns a value, it calls the application-specific deliver function. The deliver function returns a buffer containing the learned value, the size of the buffer, and the instance number for the learned value.
The recover function is used by the application to discover a previously agreed upon value for a particular instance of consensus. The recover function results in the same sequence of Paxos messages as the submit function. The difference in the API, though, is that the application must pass the consensus instance number as a parameter, as well as an application-specific no-op value. The resulting deliver callback will either return the accepted value, or the no-op value if no value had been previously accepted for the particular instance number.

Hardware/Software divide. An important question for offering consensus as a network service is: exactly what logic should be implemented in network hardware, and what logic should be implemented in software?
In the CAANS architecture, network hardware executes the logic of coordinators and acceptors. This choice allows CAANS to address the bottlenecks identified in Section 2. Moreover, since the proposer and learner code are implemented in software, the design facilitates the simple application-level interface described above. The logic of each of the roles is neatly encapsulated by communication boundaries.
Figure 3 illustrates the CAANS architecture for a switch-based deployment. In the figure, switch hardware is shaded grey, and commodity servers are colored white. Note that a backup coordinator can execute on either a second switch, or a commodity server, as we’ll discuss below. We should also point out that CAANS could be deployed on other devices, such as the programmable NICs that we use in the evaluation.

Paxos header. Network hardware is optimized to process packet headers. Since CAANS targets network hardware, it is a natural choice to map Paxos messages into a Paxos-protocol header. The Paxos header follows the transport protocol header (e.g., UDP), allowing CAANS messages
struct paxos_t {
    uint8_t msgtype;
    uint8_t inst[INST_SIZE];
    uint8_t rnd;
    uint8_t vrnd;
    uint8_t swid[8];
    uint8_t value[VALUE_SIZE];
};
Figure 5: Paxos packet header.
to co-exist with standard network hardware.

In a traditional Paxos implementation, each participant receives messages of a particular type (e.g., Phase 1A, 2A), executes some processing logic, and then synthesizes a new message that it sends to the next participant in the protocol.
However, network hardware, in general, cannot craft new messages; it can only modify fields in the header of the packet that it is currently processing. Therefore, a network-based Paxos needs to map participant logic into forwarding and header-rewriting decisions (e.g., the message from proposer to coordinator is transformed into a message from coordinator to each acceptor by rewriting certain fields). Because the message size cannot be changed at the switch, each packet must contain the union of all fields in all Paxos messages, which fortunately are still a small set.
Figure 5 shows the CAANS packet header for Paxos messages, written as a C struct. To keep the header small, the semantics of some of the fields change depending on which participant sends the message. The fields are as follows: (i) msgtype distinguishes the various Paxos messages (e.g., phase 1A, 2A); (ii) inst is the consensus instance number; (iii) rnd is either the round number computed by the proposer or the round number for which the acceptor has cast a vote; (iv) vrnd is the round number in which an acceptor has cast a vote; (v) swid identifies the sender of the message; and (vi) value contains the request from the proposer or the value for which an acceptor has cast a vote.
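To make the on-the-wire layout concrete, the header can be packed into and unpacked from a byte buffer field by field, so that host struct padding never leaks into the wire format. The sizes below (INST_SIZE, VALUE_SIZE) and the helper names are our own illustrative choices, not values fixed by the paper:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sizes; the paper leaves INST_SIZE and VALUE_SIZE open. */
#define INST_SIZE  4
#define VALUE_SIZE 8

struct paxos_t {
    uint8_t msgtype;
    uint8_t inst[INST_SIZE];
    uint8_t rnd;
    uint8_t vrnd;
    uint8_t swid[8];
    uint8_t value[VALUE_SIZE];
};

#define PAXOS_WIRE_LEN (1 + INST_SIZE + 1 + 1 + 8 + VALUE_SIZE)

/* Serialize field by field so the wire layout is fixed regardless of
 * how the compiler pads the struct. */
void paxos_pack(const struct paxos_t *h, uint8_t *buf) {
    buf[0] = h->msgtype;
    memcpy(buf + 1, h->inst, INST_SIZE);
    buf[1 + INST_SIZE] = h->rnd;
    buf[2 + INST_SIZE] = h->vrnd;
    memcpy(buf + 3 + INST_SIZE, h->swid, 8);
    memcpy(buf + 11 + INST_SIZE, h->value, VALUE_SIZE);
}

void paxos_unpack(const uint8_t *buf, struct paxos_t *h) {
    h->msgtype = buf[0];
    memcpy(h->inst, buf + 1, INST_SIZE);
    h->rnd = buf[1 + INST_SIZE];
    h->vrnd = buf[2 + INST_SIZE];
    memcpy(h->swid, buf + 3 + INST_SIZE, 8);
    memcpy(h->value, buf + 11 + INST_SIZE, VALUE_SIZE);
}
```

In a deployment this buffer would sit directly after the UDP header, which is what lets the switch parse it as an ordinary header.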
A CAANS proposer differs from a standard Paxos proposer because before forwarding messages to the coordinator, it must first encapsulate the message in a Paxos header. Through standard sockets, the Paxos header is then encapsulated inside a UDP datagram, and we rely on the UDP checksum to ensure data integrity.

Memory limitations. CAANS aims to support practical systems that use Paxos as a building block to achieve fault tolerance. A prominent example of these are services that rely on a replicated log to persistently record the sequence of all consensus values. The Paxos algorithm does not specify how to handle the ever-growing, replicated log that is stored at acceptors. On any system, this can cause problems, as the log would require unbounded disk
Send a value
Deliver a value
Discover prior value
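As a sketch of how an application might drive this API, the toy program below wires submit directly to a deliver callback. The layout of struct paxos_ctx and the immediate local "decision" are stand-ins of ours: in a real deployment the value would travel through the network-level coordinator and acceptors before deliver fires.

```c
#include <string.h>

/* Stand-in context: only the pieces needed to show the callback shape. */
struct paxos_ctx {
    void (*deliver)(struct paxos_ctx *ctx, int instance, char *value, int size);
    int next_instance;
};

/* Toy submit: immediately "decides" the value in the next instance.
 * A real submit would kick off the Paxos message exchange instead. */
void submit(struct paxos_ctx *ctx, char *value, int size) {
    ctx->deliver(ctx, ctx->next_instance++, value, size);
}

/* Application-specific deliver callback: record what was learned. */
static char last_value[64];
static int last_instance = -1;

void on_deliver(struct paxos_ctx *ctx, int instance, char *value, int size) {
    (void)ctx;
    memcpy(last_value, value, size);
    last_value[size] = '\0';
    last_instance = instance;
}
```

The point of the three-call API is exactly this shape: the replicated application only submits commands and reacts to an ordered stream of deliver callbacks.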
Design Goals 2: Alleviate Bottlenecks
[Figure: CPU utilization of each Paxos role (proposer, coordinator, acceptor, learner) as the number of learners grows from 4 to 20]
Coordinator and acceptors are to blame!
Hardware/Software
[Figure: CAANS architecture — proposers and learners run in software on servers; the coordinator (with a backup coordinator) and three acceptors run in network hardware]
Challenge: map Paxos logic into stateful forwarding decisions
Facilitate software API
Alleviate bottlenecks
NetPaxos: Header Definition & Parser
header_type paxos_t {
    fields {
        msgtype  : 16;
        inst     : 32;
        rnd      : 16;
        vrnd     : 16;
        acptid   : 16;
        paxosval : 256;
    }
}
parser parse_ethernet {
    extract(ethernet);
    return parse_ipv4;
}
parser parse_ipv4 {
    extract(ipv4);
    return parse_udp;
}
parser parse_udp {
    extract(udp);
    return select(udp.dstPort) {
        PAXOS_PROTOCOL : parse_paxos;
        default : ingress;
    };
}
parser parse_paxos {
    extract(paxos);
    return ingress;
}
Acceptor Control Flow
[Figure: acceptor pipeline — Ingress → isIPv4? (drop if not) → forward_tbl → isPaxos? (drop if not) → round_tbl (loads the acceptor’s rnd from registers) → packet’s rnd >= acceptor’s rnd? (drop if not) → acceptor_tbl (updates register state and rewrites msgtype, acptid, and the UDP dst port) → Egress]
control ingress {
    if (valid(ipv4)) {
        apply(forward_tbl);
    }
    if (valid(paxos)) {
        apply(round_tbl);
        if (paxos.rnd >= current.rnd) {
            apply(acceptor_tbl);
        }
    }
}
Acceptor Control Flow
round_tbl:

// uint16_t rounds_reg[64000];
register rounds_reg {
    width : 16;
    instance_count : 64000;
}

action read_round() {
    // uint16_t current.round = rounds_reg[paxos.inst]
    register_read(current.round, rounds_reg, paxos.inst);
}

table round_tbl {
    actions { read_round; }
    size : 1;
}
Acceptor Control Flow
acceptor_tbl:

action handle_2a(learner_port) {
    // rounds_reg[paxos.inst] = paxos.rnd
    register_write(rounds_reg, paxos.inst, paxos.rnd);
    // vrounds_reg[paxos.inst] = paxos.rnd
    register_write(vrounds_reg, paxos.inst, paxos.rnd);
    // values_reg[paxos.inst] = paxos.paxosval
    register_write(values_reg, paxos.inst, paxos.paxosval);
    register_read(paxos.acptid, acceptor_id, 0);
    modify_field(paxos.msgtype, PAXOS_2B);
    modify_field(udp.dstPort, learner_port);
}

table acceptor_tbl {
    reads { paxos.msgtype : exact; }
    actions { handle_1a; handle_2a; }
}
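For readers more comfortable with C than P4, the handle_2a action above can be modeled as follows. This is our own illustrative model, not code from the paper: register arrays become plain arrays, "forwarding" becomes rewriting the message in place into a Phase 2B reply, and the PAXOS_2B code and array sizes are assumptions.

```c
/* Per-instance register arrays, as in the P4 registers above. */
#define NUM_INSTANCES 64000
#define PAXOS_2B 4  /* assumed message-type code */

static int rounds_reg[NUM_INSTANCES];
static int vrounds_reg[NUM_INSTANCES];
static int values_reg[NUM_INSTANCES];

struct paxos_msg {
    int msgtype;
    int inst;
    int rnd;
    int acptid;
    int paxosval;
    int udp_dst_port;
};

/* Returns 1 if the 2A message was accepted and rewritten into a 2B,
 * 0 if it would be dropped because its round is stale. */
int handle_2a(struct paxos_msg *m, int acceptor_id, int learner_port) {
    if (m->rnd < rounds_reg[m->inst])  /* round_tbl comparison */
        return 0;
    rounds_reg[m->inst]  = m->rnd;       /* register_write(rounds_reg, ...)  */
    vrounds_reg[m->inst] = m->rnd;       /* register_write(vrounds_reg, ...) */
    values_reg[m->inst]  = m->paxosval;  /* register_write(values_reg, ...)  */
    m->acptid = acceptor_id;             /* register_read(paxos.acptid, ...) */
    m->msgtype = PAXOS_2B;               /* modify_field(paxos.msgtype, ...) */
    m->udp_dst_port = learner_port;      /* modify_field(udp.dstPort, ...)   */
    return 1;
}
```

The key property the model highlights: the acceptor never creates a new packet, it only updates registers and rewrites fields of the packet it is already holding.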
Implementation
Source code
Proposer and learner written in C
Coordinator and acceptor written in P4
4 Compilers
P4C
P4FPGA
Xilinx SDNet
Netronome SDK
4 Hardware target platforms
NetFPGA SUME (4x10G)
Netronome Agilio-CX (1x40G)
Alpha Data ADM-PCIE-KU3 (2x40G)
Xilinx VCU109 (4x100G)
2 Software target platforms
Bmv2
DPDK (work in progress)
P4 Compilers
Compiler          | Target          | Remark
------------------|-----------------|------------------------------------------------------------------
P4C               | Software switch | Supports most of the P4 constructs
P4@ELTE           | DPDK            | Does not support register operations; limits field length to 32 bits
P4FPGA            | FPGAs           | Must write modules for unsupported P4 constructs
Xilinx SDNet      | FPGAs           | Does not support register operations; requires a wrapper for the packet stream
Netronome SDK     | Netronome ISAs  | Works only with Netronome devices; custom actions can be written in Micro-C
Barefoot Capilano | Barefoot Tofino | Tbps switch
Evaluation
Experiment: What is the Absolute Performance?
Run Coordinator / Acceptor in isolation
Testbed:
NetFPGA SUME board in a SuperMicro Server
A packet generator for offering load
Absolute Performance
[Figure: per-packet latency (µs, 0–0.8 scale) for plain forwarding, the coordinator, and the acceptor]
Measured on NetFPGA SUME using P4FPGA
Throughput is over 9 million consensus messages / second (close to line rate)
Little latency overhead compared to simply forwarding packets
Experiment: What is the End-to-End Performance?
Comparing NetPaxos to a software-based Paxos (Libpaxos)
Testbed:
4 NetFPGA SUME boards in SuperMicro Servers
An OpenFlow-enabled 10 Gbps switch (Pica8 P-3922)
End-to-End Performance
[Figure: latency (µs) vs. throughput (msgs/s) for CAANS and Libpaxos]
2.24x throughput improvement over software implementation
75% reduction in latency
Similar results when replicating LevelDB as application
Next Steps
We make consensus great again!
The ball is now in the application developer’s court
Suggests direction for future work
[Figure: CPU utilization of the proposer and the learner — the roles that remain in software]
Lessons Learned
Outlook
The performance of consensus protocols has a dramatic impact on the performance of data center applications
Moving consensus logic into network hardware results in significant performance improvements
“a HUGE wave of consensus messages is approaching”
Questions & Answers
Performance After Failure
[Figure: throughput (msgs/s) over time (s) under two failure scenarios — coordinator failure with software backup, and acceptor failure]
End-to-End Experiment NetPaxos Setup
[Figure: end-to-end setup — application clients and application servers on both sides of a programmable device running the Paxos protocol]