Università della Svizzera italiana

USI Technical Report Series in Informatics

P4xos: Consensus as a Network Service

Huynh Tu Dang1, Pietro Bressana1, Han Wang2,5, Ki Suh Lee2, Marco Canini3, Noa Zilberman4, Fernando Pedone1, Robert Soulé1,5

1 Faculty of Informatics, Università della Svizzera italiana, Switzerland
2 Cornell University
3 University of Cambridge
4 KAUST
5 Barefoot Networks

Abstract

In this paper, we explore how a programmable forwarding plane offered by a new breed of network switches might naturally accelerate consensus protocols, specifically focusing on Paxos. The performance of consensus protocols has long been a concern; by implementing Paxos in the forwarding plane, we are able to significantly increase throughput and reduce latency. Our P4-based implementation running on an ASIC in isolation can process over 2.5 billion consensus messages per second, a four orders of magnitude improvement in throughput over a widely-used software implementation. This effectively removes consensus as a bottleneck for distributed applications in data centers. Beyond sheer performance, our approach offers several other important benefits: it readily lends itself to formal verification; it does not rely on any additional network hardware; and as a full Paxos implementation, it makes only very weak assumptions about the network.

Report Info

Published: May 2018

Number: USI-INF-TR-2018-01

Institution: Faculty of Informatics, Università della Svizzera italiana, Lugano, Switzerland

Online Access: www.inf.usi.ch/techreports

1 Introduction

In the past, we thought of the network as being simple, fixed, and providing little functionality besides communication. However, this appears to be changing, as a new breed of programmable switches matches the performance of fixed function devices [48, 4]. If this trend continues—as has happened in other areas of the industry, such as GPUs, DSPs, TPUs—then fixed function switches will soon be replaced by programmable ones.

Leveraging this trend, several recent projects have explored ways to improve the performance of consensus protocols by folding functionality into the network. Consensus protocols are a natural target for network offload since they are both essential to a broad range of distributed systems and services (e.g., [6, 5, 46]), and widely recognized as a performance bottleneck [13, 17].

Beyond the motivation for better performance, there is also a clear opportunity. Since consensus protocols critically depend on assumptions about the network [22, 27, 36, 37], they can clearly benefit from tighter network integration. Most prior work optimizes consensus protocols by strengthening basic assumptions about the behavior of the network, e.g., expecting that the network provides reliable delivery (e.g., Reed and Junqueira [41]) or ordered delivery (e.g., Fast Paxos [22], Speculative Paxos [40], and NoPaxos [24]). István et al. [16], which demonstrated consensus acceleration on an FPGA, assumed lossless and strongly ordered communication. Somewhat similarly, Eris [23] uses programmable switches to sequence transactions, thereby avoiding aborts.

This paper proposes an alternative approach. Recognizing that strong assumptions about delivery may not hold in practice, or may require undesirable restrictions (e.g., enforcing a particular topology [40]), we demonstrate how a programmable forwarding plane can naturally accelerate consensus protocols without strengthening assumptions
about the behavior of the network. Our approach is to execute Paxos [18] logic directly in switch ASICs. We call our approach P4xos.

Despite many attempts to optimize consensus [29, 1, 21, 35, 22, 40, 27, 41], performance remains a problem in real-world systems [17]. There are at least two challenges for performance. One is the protocol latency, which seems to be fundamental: Lamport proved that it takes at least 3 communication steps to order messages in a distributed setting [20], where a step means server-to-server communication. The second is high packet rate, since Paxos roles must quickly process a large number of messages to achieve high throughput.

P4xos addresses both of these problems. First, P4xos improves latency by processing consensus messages in the forwarding plane as they pass through the network, reducing the number of hops, both through the network and through hosts, that messages must travel. Moreover, processing packets in hardware helps reduce tail latency. Trying to curtail tail latency in software is quite difficult, and often depends on kludges (e.g., deliberately failing I/O operations). Second, P4xos improves throughput, as ASICs are designed for high-performance message processing. In contrast, server hardware is inefficient in terms of memory throughput and latency, and there is additional software overhead due to memory management and safety features, such as the separation of kernel and user-space memory [11].

P4xos is a network service deployed in existing switching devices. It does not require dedicated hardware. There are no additional cables or power requirements. Consensus is offered as a service to the system without adding additional hardware beyond what would already be deployed in a data center. Moreover, using a small shim library, applications can immediately use P4xos without modifying their application code, or porting the application to an FPGA [16]. P4xos is a drop-in replacement for software-based consensus libraries, offering a complete Paxos implementation.

P4xos provides significant performance improvements compared with traditional implementations. In a data center network, P4xos reduces latency by ×3, regardless of the ASIC used. In a small-scale, FPGA-based deployment, P4xos reduced the 99th percentile latency by ×10, for a given throughput. In terms of throughput, our implementation on the Barefoot Networks Tofino ASIC [4] can process over 2.5 billion consensus messages per second, a four orders of magnitude improvement. An unmodified instance of LevelDB, running on our small-scale deployment, achieved a ×4 throughput improvement.

Besides sheer performance gains, our use of P4 offers an additional benefit. By construction, P4 is not a Turing-complete language—it excludes looping constructs, which are undesirable in hardware pipelines—and as a result is particularly amenable to verification by bounded model checking. We have verified our implementation using the SPIN model checker, giving us confidence in the correctness of the protocol.

In short, this paper makes the following contributions:

• It describes a re-interpretation of the Paxos protocol, as an example of how one can map consensus protocol logic into stateful forwarding decisions, without imposing any constraints on the network.

• It presents an open-source implementation of Paxos with at least a ×3 latency improvement and a four orders of magnitude throughput improvement vs. host-based consensus in data centers.

• It shows that the service can run in parallel to traditional network operations, while using minimal resources and without incurring hardware overheads (e.g., accelerator boards, more cables, etc.), leading to a more efficient usage of the network.

Overall, P4xos effectively removes consensus as a bottleneck for replicated, fault-tolerant services in a data center, and shifts the limiting factor for overall performance to the application. Moreover, it shows how distributed systems can become distributed networked systems, with the network performing services traditionally running on the host.

2 Paxos Performance Bottlenecks

We focus on Paxos [18] for three reasons. First, it makes very few assumptions about the network (e.g., point-to-point packet delivery and the election of a non-faulty leader), making it widely applicable to a number of deployment scenarios. Second, it has been proven correct (e.g., safe under asynchronous assumptions, live under weak synchronous assumptions, and resilience-optimum [18]). And, third, it is deployed in numerous real-world systems, including Microsoft Azure Storage [6], Ceph [46], and Chubby [5].

Figure 1: Leader bottleneck (CPU utilization, 0%–100%, for the proposer, leader, acceptor, and learner roles).

2.1 Paxos Background

Paxos is used to solve a fundamental problem for distributed systems – getting a group of participants to reliably agree on some value (e.g., the next valid application state). Paxos distinguishes the following roles that a process can play: proposers, acceptors, and learners (leaders are introduced later). Clients of a replicated service are typically proposers, and propose commands that need to be ordered by Paxos before they are learned and executed by the replicated state machines. Replicas typically play the roles of acceptors (i.e., the processes that actually agree on a value) and learners. Paxos is resilience-optimum in the sense that it tolerates the failure of up to f acceptors from a total of 2f + 1 acceptors—to ensure progress—where a quorum of f + 1 acceptors must be non-faulty [20]. In practice, replicated services run multiple executions of the Paxos protocol to achieve consensus on a sequence of values [7] (i.e., multi-Paxos). An execution of Paxos is called an instance.

An instance of Paxos proceeds in two phases. During Phase 1, a proposer that wants to submit a value selects a unique round number and sends a prepare request to at least a quorum of acceptors. Upon receiving a prepare request with a round number bigger than any previously received round number, the acceptor responds to the proposer promising that it will reject any future requests with smaller round numbers. If the acceptor already accepted a request for the current instance, it will return the accepted value to the proposer, together with the round number received when the request was accepted. When the proposer receives answers from a quorum of acceptors, the second phase begins.

In Phase 2, the proposer selects a value according to the following rule. If no value is returned in the responses, the proposer can select a new value for the instance; however, if any of the acceptors returned a value in the first phase, the proposer must select the value with the highest round number among the responses. The proposer then sends an accept request with the round number used in the first phase and the value selected to the same quorum of acceptors. When receiving such a request, the acceptors acknowledge it by sending the accepted value to the learners, unless the acceptors have already acknowledged another request with a higher round number. When a quorum of acceptors accepts a value, consensus is reached.
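To make the selection rule concrete, the following C sketch (illustrative only; the struct and function names are ours and not part of any P4xos or libpaxos API) computes the value a proposer must use in Phase 2 from a quorum of Phase 1 responses:

#include <stddef.h>

struct phase1b_reply {
    int   vrnd;    /* round in which this acceptor last voted; -1 if it never voted */
    char *value;   /* the value it voted for, or NULL */
};

/* Returns the value the proposer must propose in Phase 2: the value with the
 * highest vrnd among the replies, or NULL if the proposer is free to pick its
 * own value. */
static char *phase2_value(struct phase1b_reply *replies, size_t quorum_size)
{
    char *chosen = NULL;
    int highest = -1;
    for (size_t i = 0; i < quorum_size; i++) {
        if (replies[i].value != NULL && replies[i].vrnd > highest) {
            highest = replies[i].vrnd;
            chosen = replies[i].value;
        }
    }
    return chosen;
}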

If multiple proposers simultaneously execute the procedure above for the same instance, then no proposer may be able to execute the two phases of the protocol and reach consensus. To avoid scenarios in which proposers compete indefinitely, a leader process can be elected. Proposers submit values to the leader, which executes the first and second phases of the protocol. If the leader fails, another process takes over its role. Paxos ensures (i) consistency despite concurrent leaders and (ii) progress in the presence of a single leader.

2.2 Performance Obstacles

Given the central role that Paxos plays in fault-tolerant, distributed systems, improving the performance of the protocol has been an intense area of study. At a high level, there are performance obstacles that impact both latency and throughput.

Protocol Latency. The performance of Paxos is typically measured in "communication steps", where a communication step corresponds to a server-to-server communication in an abstract distributed system. Lamport proved that it takes at least 3 steps to order messages in a distributed setting [20]. This means that there is not much hope for significant performance improvements, unless one revisits the model (e.g., Charron-Bost and Schiper [9]) or assumptions (e.g., spontaneous message ordering [22, 36, 37]).

These communication steps have become the dominant factor in Paxos latency overhead. Our experiments show that the Paxos logic execution time is around 2.5 µs, without I/O. Using kernel-bypass [11], a packet can be sent out of a host in 5 µs (median) [50]. The one-way delay in the data center is 100 µs (median) [39], more than 10× the host latency! Implementing Paxos in switch ASICs as "bumps-in-the-wire" processing allows consensus to be reached in sub-round-trip time (RTT).


Figure 2: Contrasting propagation time for P4xos with a server-based deployment. P4xos time to reach consensus: RTT/2; Paxos time to reach consensus: RTT × 3/2.

Figure 2 illustrates the difference in the number of hops needed by P4xos and traditional deployments: while in a standard Paxos implementation every communication step requires traversing the network (e.g., a fat-tree), in P4xos it is possible for each network device to fill a role in achieving a consensus: the spine as a leader, the aggregate as an acceptor, the last Top of Rack (ToR) switch as a learner, and the hosts serving as proposers and replicated applications. Note that the number of uplinks from a ToR does not need to reach a quorum of aggregators, as the aggregator may also reside at the fourth hop. In this manner, P4xos saves two traversals of the network compared to Paxos, meaning a ×3 latency improvement. As shown in §5.2.2, eliminating the hosts' latency from each communication step also significantly improves the tail latency. The latency saving is not device dependent: the same relative improvement will be achieved with any (programmable) chipset and set of hosts.

Throughput Bottlenecks. Beyond protocol latency, there are additional challenges to improving the performance of consensus [28]. To investigate the performance bottleneck for a typical Paxos deployment, we measured the CPU utilization for each of the Paxos roles when transmitting messages at peak throughput. As a representative implementation of Paxos, we used the open-source libpaxos library [25]. There are, naturally, many Paxos implementations, so it is difficult to make generalizations about their collective behavior. However, libpaxos is a faithful implementation of Paxos that distinguishes all the Paxos roles. It has been extensively tested and is often used as a reference implementation (e.g., [16, 26, 38, 42]).

In the initial configuration, there are seven processes spread across three machines, each running on a separate core, distributed as follows: Server 1 hosts 1 proposer, 1 acceptor, and 1 learner. Server 2 hosts 1 leader and 1 acceptor. And Server 3 hosts 1 acceptor and 1 learner.

We chose this distribution after experimentally verifying that it produced the best performance for libpaxos. The details of the hardware setup are explained in Section 5.

The client application sends 64-byte messages to the proposer at a peak throughput rate of 64,949 values/sec. The results, which show the average CPU utilization per role, are plotted in Figure 1. They show that the leader is the bottleneck, as it becomes CPU bound.

This is as expected. The leader must process the most messages of any role in the protocol, and as a result, becomes the first bottleneck. The bottleneck is largely due to the overhead of handling a large number of network interrupts, and copying data from kernel space into user space for the application to process. The other components in the protocol are similarly afflicted. A second experiment, not shown here for brevity, showed that once the leader bottleneck is removed, the acceptor becomes the next bottleneck.

3 P4xos Design

P4xos is designed to address the two main obstacles to achieving high performance: (i) it reduces end-to-end latency by executing consensus logic as messages pass through the network, and (ii) it avoids network I/O bottlenecks in software implementations by executing Paxos logic in the forwarding hardware.

In a network implementation of Paxos, protocol messages are encoded in a custom packet header; the data plane executes the logic of leaders, acceptors, and learners; and the logic of each of the roles is partitioned by communication boundaries. A shim library provides the interface between the application and the network.

We expect that P4xos would be deployed in a data center in Top-of-Rack, Aggregate, and Spine switches, as shown in Figure 2. Each role in the protocol is deployed on a separate switch. We note, though, that the P4xos roles are interchangeable with software equivalents. So, for example, a backup leader could be deployed on a standard server (with a reduction in performance).

3.1 Paxos with Match-Action

Paxos is a notoriously complex and subtle protocol [19, 28, 43, 7]. Typical descriptions of Paxos [18, 19, 28, 43, 7] describe the sequence of actions during the two phases of the protocol. In this paper, we provide an alternative view of the algorithm, in which the protocol is described by the actions that Paxos agents take in response to different messages. In other words, we re-interpret the algorithm as a set of stateful forwarding decisions. This presentation of the algorithm may provide a different perspective on the protocol and aid in its understanding.

Notation. Our pseudocode roughly corresponds to P4 statements. The Initialize blocks identify state stored in registers. id[N] indicates a register named id with N cells. The notation ":= {0}" indicates that every cell element in the register should be initialized to 0. The match blocks correspond to table matches on a packet header, and the case blocks correspond to P4 actions. We distinguish updates to the local state (":=") from writes to a packet header ("←"). We also distinguish between unicast (forward) and multicast (multicast).

Paxos header. P4xos encodes Paxos messages in packet headers. The header is encapsulated by a UDP packet header, allowing P4xos packets to co-exist with standard (non-programmable) network hardware. Moreover, we use the UDP checksum to ensure data integrity.

Since current network hardware lacks the ability to generate packets, P4xos participants must respond to input messages by rewriting fields in the packet header (e.g., the message from proposer to leader is transformed into a message from leader to each acceptor).

The P4xos packet header includes six fields. To keep the header small, the semantics of some of the fields change depending on which participant sends the message. The fields are as follows: (i) msgtype distinguishes the various Paxos messages (e.g., REQUEST, PHASE1A, PHASE2A, etc.); (ii) inst is the consensus instance number; (iii) rnd is either the round number computed by the proposer or the round number for which the acceptor has cast a vote; (iv) vrnd is the round number in which an acceptor has cast a vote; (v) swid identifies the sender of the message; and (vi) value contains the request from the proposer or the value for which an acceptor has cast a vote. Our prototype requires that the entire Paxos header, including the value, be less than the maximum transmission unit.
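As an illustration only, the header could be represented on a host roughly as the following C struct; the field widths shown here are assumptions for the sketch, not the widths used by the prototype:

#include <stdint.h>

/* Sketch of the P4xos header carried in a UDP payload (field widths assumed). */
struct p4xos_header {
    uint8_t  msgtype;     /* REQUEST, PHASE1A, PHASE1B, PHASE2A, PHASE2B, ... */
    uint32_t inst;        /* consensus instance number */
    uint16_t rnd;         /* round proposed, or round in which the acceptor voted */
    uint16_t vrnd;        /* round in which the acceptor cast its vote */
    uint16_t swid;        /* identifier of the sender (e.g., switch id) */
    uint8_t  value[64];   /* proposed or voted value; header plus value must fit in the MTU */
} __attribute__((packed));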

void submit(struct paxos_ctx *ctx, char *value, int size);

Listing 1: P4xos proposer API.

Proposer. A P4xos proposer mediates client requests, and encapsulates the request in a Paxos header. Ideally, this logic could be implemented by an operating system kernel network stack, allowing it to add Paxos headers in the same way that transport protocol headers are added today. As a proof-of-concept, we have implemented the proposer as a user-space library that exposes a small API to client applications.

The P4xos proposer library is a drop-in replacement for existing software libraries. The API consists of a single submit function, shown in Listing 1. The submit function is called by an application that uses Paxos to send a value. The application simply passes a character buffer containing the value, and the buffer size. The paxos_ctx struct maintains Paxos-related state across invocations (e.g., socket file descriptors).
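As an illustration of how an application might call this API, consider the following hedged sketch; the context-creation step is omitted and propose_command is a hypothetical helper, not part of the library:

#include <string.h>

struct paxos_ctx;   /* opaque; created and owned by the P4xos proposer library */

void submit(struct paxos_ctx *ctx, char *value, int size);

/* Hypothetical helper: propose a command for ordering by P4xos. */
void propose_command(struct paxos_ctx *ctx, const char *cmd)
{
    char buf[64];
    size_t len = strlen(cmd);
    if (len > sizeof(buf))
        len = sizeof(buf);
    memcpy(buf, cmd, len);
    submit(ctx, buf, (int)len);   /* the value is later handed to the learners */
}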

Leader. A leader brokers requests on behalf of proposers. The leader ensures that only one process submits a message to the protocol for a particular instance (thus ensuring that the protocol terminates), and imposes an ordering of messages. When there is a single leader, a monotonically increasing sequence number can be used to order the messages. This sequence number is written to the inst field of the header.

Algorithm 1 shows the pseudocode for the primary leader implementation. The leader receives REQUEST messages from the proposer. REQUEST messages only contain a value. The leader must perform the following: write the current instance number and an initial round number into the message header; increment the instance number for the next invocation; store the value of the new instance number; and broadcast the packet to acceptors.

Algorithm 1 Leader logic.
 1: Initialize State:
 2:   instance[1] := {0}
 3: upon receiving pkt(msgtype, inst, rnd, vrnd, swid, value)
 4:   match pkt.msgtype:
 5:     case REQUEST:
 6:       pkt.msgtype ← Phase2A
 7:       pkt.rnd ← 0
 8:       pkt.inst ← instance[0]
 9:       instance[0] := instance[0] + 1
10:       multicast pkt
11:     default:
12:       drop pkt

Algorithm 2 Acceptor logic.
 1: Initialize State:
 2:   round[MaxInstances] := {0}
 3:   value[MaxInstances] := {0}
 4:   vround[MaxInstances] := {0}
 5: upon receiving pkt(msgtype, inst, rnd, vrnd, swid, value)
 6:   if pkt.rnd ≥ round[pkt.inst] then
 7:     match pkt.msgtype:
 8:       case Phase1A:
 9:         round[pkt.inst] := pkt.rnd
10:         pkt.msgtype ← Phase1B
11:         pkt.vrnd ← vround[pkt.inst]
12:         pkt.value ← value[pkt.inst]
13:         pkt.swid ← swid
14:         forward pkt
15:       case Phase2A:
16:         round[pkt.inst] := pkt.rnd
17:         vround[pkt.inst] := pkt.rnd
18:         value[pkt.inst] := pkt.value
19:         pkt.msgtype ← Phase2B
20:         pkt.swid ← swid
21:         forward pkt
22:       default:
23:         drop pkt
24:   else
25:     drop pkt

P4xos uses a well-known Paxos optimization [14], where each instance is reserved for the primary leader at initialization (i.e., round number zero). Thus, the primary leader does not need to execute Phase 1 before submitting a value (in a REQUEST message) to the acceptors. Since this optimization only works for one leader, the backup leader must reserve an instance before submitting a value to the acceptors. To reserve an instance, the backup leader must send a unique round number in a PHASE1A message to the acceptors. For brevity, we omit the backup leader algorithm since it essentially follows the Paxos protocol.

Acceptor. Acceptors are responsible for choosing a single value for a particular instance. For each instance of consensus, each individual acceptor must "vote" for a value. Acceptors must maintain and access the history of proposals for which they have voted. This history ensures that acceptors never vote for different values for a particular instance, and allows the protocol to tolerate lost or duplicate messages.

Algorithm 2 shows the logic for an acceptor. Acceptors can receive either PHASE1A or PHASE2A messages. Phase 1A messages are used during initialization, and Phase 2A messages trigger a vote. The logic for handling both messages, when expressed as stateful routing decisions, involves: (i) reading persistent state, (ii) modifying packet header fields, (iii) updating the persistent state, and (iv) forwarding the modified packets. The logic differs in which header fields are involved.

Algorithm 3 Learner logic.
 1: Initialize State:
 2:   history2B[MaxInstances][NumAcceptor] := {0}
 3:   value[MaxInstances] := {0}
 4:   vround[MaxInstances] := {−1}
 5:   count[MaxInstances] := {0}
 6: upon receiving pkt(msgtype, inst, rnd, vrnd, swid, value)
 7:   match pkt.msgtype:
 8:     case Phase2B:
 9:       if (pkt.rnd > vround[pkt.inst] or vround[pkt.inst] = −1) then
10:         history2B[pkt.inst][0] := 0
11:           ...
12:         history2B[pkt.inst][NumAcceptor−1] := 0
13:         history2B[pkt.inst][pkt.swid] := 1
14:         vround[pkt.inst] := pkt.rnd
15:         value[pkt.inst] := pkt.value
16:         count[pkt.inst] := 1
17:       else if (pkt.rnd = vround[pkt.inst]) then
18:         if (history2B[pkt.inst][pkt.swid] = 0) then
19:           count[pkt.inst] := count[pkt.inst] + 1
20:           history2B[pkt.inst][pkt.swid] := 1
21:       else
22:         drop pkt
23:       if (count[pkt.inst] = Majority) then
24:         forward pkt.value
25:     default:
26:       drop pkt

Learner. Learners are responsible for replicating a value for a given consensus instance. Learners receive votes from the acceptors, and "deliver" a value if a majority of the votes are the same (i.e., there is a quorum). Algorithm 3 shows the pseudocode for the learner logic. Learners should only receive PHASE2B messages.

When a message arrives, each learner extracts the instance number, switch id, and value. The learner maintains a mapping from a pair of instance number and switch id to a value. Each time a new value arrives, the learner checks for a majority-quorum of acceptor votes. A majority is equal to f + 1, where f is the number of faulty acceptors that can be tolerated.
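For illustration, the same vote-counting logic, as it might look in a host-side (e.g., DPDK-style) learner, is sketched below in C. This mirrors Algorithm 3 and is not the actual P4xos implementation; the constants and the fixed 64-byte value size are assumptions:

#include <stdbool.h>
#include <string.h>

#define MAX_INSTANCES 65536
#define NUM_ACCEPTORS 3
#define MAJORITY      2   /* f + 1 with f = 1 */

struct learner_state {
    bool voted[MAX_INSTANCES][NUM_ACCEPTORS]; /* acceptors heard from in the current round */
    int  vround[MAX_INSTANCES];               /* round being counted; initialize to -1 */
    int  count[MAX_INSTANCES];                /* votes counted so far in that round */
    char value[MAX_INSTANCES][64];            /* value being voted on (fixed size, as in the P4 registers) */
};

/* Process one PHASE2B message; returns true when a quorum is reached,
 * in which case *out points to the chosen value. */
static bool on_phase2b(struct learner_state *st, int inst, int rnd,
                       int swid, const char val[64], const char **out)
{
    if (rnd > st->vround[inst] || st->vround[inst] == -1) {
        /* Vote from a newer round: restart the count for this instance. */
        memset(st->voted[inst], 0, sizeof(st->voted[inst]));
        st->voted[inst][swid] = true;
        st->vround[inst] = rnd;
        memcpy(st->value[inst], val, 64);
        st->count[inst] = 1;
    } else if (rnd == st->vround[inst] && !st->voted[inst][swid]) {
        st->voted[inst][swid] = true;
        st->count[inst]++;
    } else {
        return false;   /* stale round or duplicate vote: drop */
    }
    if (st->count[inst] == MAJORITY) {
        *out = st->value[inst];   /* quorum reached: deliver the value */
        return true;
    }
    return false;
}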

The learner provides the interface between the network consensus and the replicated application. The behavior is split between the network, which listens for a quorum of messages, and a library, which is linked to the application. To compute a quorum, the learner counts the number of PHASE2B messages it receives from different acceptors in a round. If there is no quorum of PHASE2B messages in an instance (e.g., because the primary leader fails), the learner may need to recount PHASE2B messages in a quorum (e.g., after the backup leader re-executes the instance). Once a quorum is received, it delivers the value to the client by sending the message to the user-space library that exposes the application-facing API. The API, shown in Listing 2, provides two functions.

To receive delivered values, the application registers a callback function with the type signature of deliver. To register the function, the application sets a function pointer in the paxos_ctx struct. When a learner learns a value, it calls the application-specific deliver function. The deliver callback is passed a buffer containing the learned value, the size of the buffer, and the instance number for the learned value.

The recover function is used by the application to discover a previously agreed upon value for a particular instance of consensus. The recover function results in the same exchange of messages as the submit function. The difference in the API, though, is that the application must pass the consensus instance number as a parameter, as well as an application-specific no-op value. The resulting deliver callback will either return the accepted value, or the no-op value if no value had been previously accepted for the particular instance number.

void (*deliver)(struct paxos_ctx *ctx, int instance, char *value, int size);

void recover(struct paxos_ctx *ctx, int instance, char *value, int size);

Listing 2: P4xos learner API.
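As a hedged usage sketch (the name of the function-pointer field inside paxos_ctx and the surrounding setup are assumptions for illustration; only the two signatures above come from Listing 2):

#include <stdio.h>

struct paxos_ctx {
    /* ...other library state... */
    void (*deliver)(struct paxos_ctx *ctx, int instance, char *value, int size);
};

void recover(struct paxos_ctx *ctx, int instance, char *value, int size);

/* Application callback: invoked by the learner library for each learned value. */
static void on_deliver(struct paxos_ctx *ctx, int instance, char *value, int size)
{
    printf("instance %d: learned %.*s\n", instance, size, value);
    /* apply the command to the replicated application state here */
}

static void setup(struct paxos_ctx *ctx)
{
    ctx->deliver = on_deliver;   /* register the callback (field name assumed) */

    /* After a restart, ask what was decided in instance 42; if nothing was
     * decided, the callback receives the application-specific no-op value. */
    char noop[] = "NOOP";
    recover(ctx, 42, noop, (int)(sizeof(noop) - 1));
}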

3.2 Correctness

Given this alternative interpretation of the Paxos algorithm, it is natural to question whether it is a faithful implementation of the original protocol [18]. In this respect, we are aided by our P4 specification. In comparison to an HDL or a general-purpose programming language, P4 is high-level and declarative. By design, P4 is not a Turing-complete language, as it excludes looping constructs, which are undesirable in hardware pipelines. Consequently, it is particularly amenable to verification by bounded model checking.

We have mapped the P4 specification to Promela, and verified the correctness using the SPIN model checker. Specifically, we verify the safety property of agreement: the learners never decide on two separate values for a single instance of consensus.
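Stated slightly more formally (this is the property in the form we describe it, not the literal Promela assertion): for every consensus instance i, if one learner delivers value v and another learner delivers value v' in instance i, then v = v'.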

4 Discussion

The design outlined in the previous section raises several questions, which we expand on below.

Isn't this just Paxos? Yes! In fact, that is the central premise of our work: you don't need to change a fundamental building block of distributed systems in order to gain performance. This thesis is quite different from the prevailing wisdom. There have been many optimizations proposed for consensus protocols. These optimizations typically rely on changes in the underlying assumptions about the network, e.g., that the network provides ordered [22, 40, 24] or reliable [41] delivery. Consensus protocols, in general, are easy to get wrong. Strong assumptions about network behavior may not hold in practice. Incorrect implementations of consensus yield unexpected behavior in applications that is hard to debug.

In contrast, Paxos is widely considered to be the "gold standard". It has been proven safe under asynchronous assumptions, live under weak synchronous assumptions, and resilience-optimum [18].

Isn't this just faster hardware? The latency savings across a data center are not hardware dependent: if you change the switches used in your network, or the CPU used in your servers, the relative latency improvement will be maintained. In the experiments described in Section 5 (Table 2), the P4xos implementation on FPGA operates at 250 MHz, while libpaxos runs on a host operating at 1.6 GHz, yet the throughput of P4xos on the FPGA is forty times higher. It is therefore clear that fast hardware is not the sole reason for the throughput improvement.

Isn't offload useful only when the network is heavily loaded? P4xos fits the changing conditions in data centers, where operators often increase the size of their network over time. As the software and hardware components are interchangeable, P4xos allows starting with all components running on hosts, and gradually shifting load to the network as the data center grows and the consensus message rate increases. As §5.2.1 shows, even a moderate packet rate is sufficient to overload libpaxos.

Doesn't the storage need to be durable? Paxos usually requires that acceptor storage be persistent. In other words, if an acceptor fails and restarts, it should be able to recover its durable state. Our prototype uses non-persistent SRAM, which means that there must always be a majority of processes that never fail. As we move the functionality into the network, this is equivalent to expecting a majority of aggregate switches not to fail.

Providing persistent storage for network deployments of P4xos can be addressed in a number of ways. Prior work on implementing consensus in FPGAs used on-chip RAM, and suggested that the memory could be made persistent with a battery [16]. Alternatively, a switch could access non-volatile memory (e.g., an SSD drive) directly via PCI Express [12].

Is there enough storage available? The Paxos algorithm does not specify how to handle the ever-growing, replicated acceptor log. On any system, including P4xos, this can cause problems, as the log would require unbounded storage space, and recovering replicas might need unbounded recovery time to replay the log. We note that in a P4xos deployment, the number of instance messages that can be stored is bounded by the size of the inst field of the Paxos header. Users of P4xos will have to set the value to an appropriate size for a particular deployment. To cope with the ever-growing log size, an application using P4xos must implement a checkpoint mechanism [7]. Since the decisions of when and how to checkpoint are application-specific, we do not include these as part of the P4xos implementation. The amount of memory available on a Tofino chip is confidential, but a top-of-the-line FPGA has 64Gb of RAM [44].

5 Evaluation

Our evaluation of P4xos explores three questions: (i) What is the absolute performance of individual P4xos components? (ii) What is the end-to-end performance of P4xos as a system for providing consensus? And, (iii) What
is the performance under failure?

To evaluate end-to-end performance, we compare P4xos with a software-based implementation, the open-source libpaxos library [25]. Overall, the evaluation shows that P4xos dramatically increases throughput and reduces latency for end-to-end performance, when compared to traditional software implementations.

Implementation. We have implemented a prototype of P4xos in P4 [3]. We have also written C implementations of the leader, acceptor, and learner using DPDK. The DPDK and P4 versions of the code are interchangeable, allowing, for example, a P4-based hardware deployment of a leader to use a DPDK implementation as a backup. Because P4 is portable across devices, we have used several compilers [45, 32, 47, 33, 34] to run P4xos on a variety of hardware devices, including a re-configurable ASIC, numerous FPGAs, an NPU, and a CPU with and without kernel-bypass software. A total of 6 different implementations were tested. An appendix includes performance measurements from all hardware. All source code, other than the version that targets Barefoot Network's Tofino chip, is publicly available with an open-source license (https://github.com/P4xos).

5.1 ASIC Deployment

The first set of experiments evaluates the performance of individual P4xos components deployed on a programmable ASIC. Since we only had access to a single platform, we only report absolute latency and throughput numbers for the individual Paxos components.

Experimental setup. We compiled P4xos to run on a 64-port, 40G ToR switch with Barefoot Network's Tofino ASIC [4]. To generate traffic, we used a 2×40Gb Ixia XGS12-H as packet sender and receiver. We used a technique known as "snake testing" to verify that we can achieve full line rate across 64 ports: all external ports were connected in a sequence, using 40Gb SFP+ direct-attached cables. Thus, the output of one port becomes the input of another. The use of all ports as part of the experiments was validated, e.g., using per-port counters. We similarly checked equal load across ports and potential packet loss (which did not occur).

Our implementation concatenated the P4xos pipeline and the switching pipeline. This enabled demonstrating the co-existence of Paxos and forwarding operations at the same time. While the coexistence of both functions added an additional match-action unit to the latency, it allowed us to concurrently implement forwarding rules on Tofino and verify P4xos operation.

Latency and throughput. We measured the throughput for all Paxos roles to be 41 million 102-byte consensus msgs/sec per port. In the Tofino architecture, which implements pipelines of 16 ports each [15], a single instance of P4xos reached 656 million consensus messages per second. We deployed 4 instances in parallel on a 64-port × 40GE switch, processing over 2.5 billion consensus msgs/sec. Moreover, our measurements indicate that P4xos should be able to scale up to 6.5 Tb/second of consensus messages on a single switch, using 100GE ports.

We used Barefoot's compiler to report the precise theoretical latency for the packet processing pipeline. The latency is less than 0.1 µs. To be clear, this number does not include the SerDes, MAC, or packet parsing components. Hence, the wire-to-wire latency would be slightly higher. We discuss latency evaluation further in §5.2.1. These experiments show that moving Paxos into the forwarding plane can substantially improve performance. Furthermore, using devices such as Tofino means that moving stateful applications to the network requires software updates, rather than hardware upgrades, therefore diminishing the effect on network operators.

Resources and coexisting with other traffic. The P4xos pipeline uses less than 5% of the available SRAM on Tofino, and no TCAM. Thus, adding P4xos to an existing switch pipeline on a re-configurable ASIC would use few of the available resources, and have a minimal effect on other switch functionality (e.g., the number of fine-grain rules in tables).

Moreover, the P4xos-on-Tofino experiment demonstrates that consensus operation can coexist with standard network switching operation, as the peak throughput is measured while the device runs traffic at the full line rate of 6.5 Tbps. This is a clear indication that network devices can be used more efficiently, implementing consensus services in parallel with network operations. Using network devices for more than just network operations reduces the load on the host while not affecting network performance or adding more hardware.

5.2 FPGA Deployment

To explore P4xos beyond a single device and within a distributed system, we ran a set of experiments demonstrating a proof-of-concept of P4xos using different hardware. The leader and acceptors run on NetFPGA SUME [49], and the learners use a DPDK implementation running on general-purpose CPUs.

Figure 3: FPGA test bed for the evaluation. Server 1 hosts the proposer, the application, and an FPGA leader; Servers 2–4 each host an FPGA acceptor, a DPDK learner, and the application; all servers connect through a switch.

The reason for not running the learners on an FPGA is, again, that we did not have access to additional boards. The experiments also show the interchangeability of the P4xos design.

Experimental setup. Our testbed includes four Supermicro 6018U-TRTP+ servers and a Pica8 P-3922 10 Gbps Ethernet switch, connected in the topology shown in Figure 3. The servers have dual-socket Intel Xeon E5-2603 CPUs, with a total of 12 cores running at 1.6 GHz, 16 GB of 1600 MHz DDR4 memory, and two Intel 82599 10 Gbps NICs. The NetFPGA SUME boards operated at 250 MHz. We installed one NetFPGA SUME board in each server using a PCIe x8 slot, though the NetFPGAs behave as stand-alone systems in our testbed. Two of the four SFP+ interfaces of each NetFPGA SUME board and one of the four SFP+ interfaces provided by the 10 Gbps NICs are connected to the Pica8 switch, with a total of 12 SFP+ copper cables. The servers were running Ubuntu 14.04 with Linux kernel version 3.19.0.

For the DPDK learner, we dedicated one of the two sockets and two NICs to the DPDK application. All RAM banks on the server were moved to the slots managed by that socket. Virtualization, frequency scaling, and power management were disabled in the BIOS. The RAM frequency was set to the maximum value. Additionally, two CPU cores in the selected socket were isolated, and the NICs coupled to the socket were bound to the DPDK drivers. Finally, 1024 hugepages (2 MB each) were reserved for the DPDK application.

5.2.1 Absolute performance

The FPGA experiments measure the latency, throughput, and resource utilization of the individual P4xos components. Overall, our evaluation shows that P4xos can saturate a network link, and that the latency overhead for Paxos logic is little more than the forwarding delay.

Single-packet latency. To quantify the processing overhead added by executing Paxos logic, we measured the pipeline latency of forwarding with and without Paxos. In particular, we computed the difference between two timestamps, one taken when the first word of a consensus message entered the pipeline and the other when the first word of the message left the pipeline. For DPDK, the CPU timestamp counter (TSC) was used.

Table 1 shows the latency for DPDK and for P4xos running on NetFPGA SUME and on Tofino. The first row shows the results for forwarding without Paxos logic. The latency was measured from the beginning of the packet parser until the end of the packet deparser. The remaining rows show the pipeline latency for the various Paxos components. The higher latency for acceptors reflects the complexity of the operations of that role. Note that the latency of the FPGA- and ASIC-based targets is constant, as their pipelines use a constant number of stages. Overall, the experiments show that P4xos adds little latency beyond simply forwarding packets: around 0.15 µs (38 clock cycles) on the FPGA and less than 0.1 µs on the ASIC.

We wanted to compare compiled P4 code to a native implementation in Verilog or VHDL. The closest related work in this area is by István et al. [16], who implemented Zookeeper Atomic Broadcast on an FPGA. It is difficult to make a direct comparison, because (i) they implement a different protocol, and (ii) they timestamp the packet at different places in their hardware. But, as best we can tell, the latency numbers are similar.

Role         DPDK      P4xos (NetFPGA)   P4xos (Tofino)
Forwarding   n/a       0.370 µs          n/a
Leader       2.3 µs    0.520 µs          less than 0.1 µs
Acceptor     2.6 µs    0.550 µs          less than 0.1 µs
Learner      2.8 µs    0.540 µs          less than 0.1 µs

Table 1: P4xos latency. The latency accounts only for the packet processing within each implementation.

Role       libpaxos   DPDK    P4xos (NetFPGA)   P4xos (Tofino)
Leader     241K       5.5M    10M               656M × 4 = 2.5B
Acceptor   178K       950K    10M               656M × 4 = 2.5B
Learner    189K       650K    10M               656M × 4 = 2.5B

Table 2: Throughput in messages/s. NetFPGA uses a single 10Gb link. Tofino uses 40Gb links. On Tofino, we ran 4 deployments in parallel, each using 16 ports.

Measured maximum achievable throughput. We first measured the maximum rate at which P4xos components can process consensus messages. As a baseline comparison, we also include the measurements for the libpaxos and DPDK implementations. To generate traffic, we used a P4FPGA-based [45] hardware packet generator and capturer to send 102-byte¹ consensus messages to each component, then captured and timestamped each message, measuring the maximum receiving rate.

The results in Table 2 show that on the FPGA, the acceptor, leader, and learner can all process close to 10 million consensus messages per second, an order of magnitude improvement over libpaxos, and almost double the best DPDK throughput. The ASIC deployment allows two additional orders of magnitude improvement.

Resource utilization. To evaluate the cost of implementing Paxos logic on FPGAs, we report the resource utilization on NetFPGA SUME using P4FPGA [45]. An FPGA contains a large number of programmable logic blocks: look-up tables (LUTs), registers, and Block RAM (BRAM). In NetFPGA SUME, we implemented P4 stateful memory with on-chip BRAM to store the consensus history. As shown in Table 3, the current implementation uses 54% of the available BRAMs, out of which 35% are used for stateful memory². We could scale up the current implementation in NetFPGA SUME by using large, off-chip DRAM at the cost of higher memory access latency. Prior work suggests that the increased DRAM latency should not impact throughput [16]. We note that the cost may differ with other compilers and targets.

5.2.2 End-to-End Performance

Our first end-to-end evaluation uses a simple echo server as an application on the testbed illustrated in Figure 3. Server 1 runs a multi-threaded client process and a single proposer process. Servers 2, 3, and 4 run single-threaded learner processes and the echo server. The deployment for libpaxos is similar, except that the leader and acceptor processes run in software on their servers, instead of running on the FPGA boards.

Each client thread submits a message with the current timestamp written in the value. When the value is delivered by the learner, a server program retrieves the message via a deliver callback function, and then returns the message back to the client. When the client gets a response, it immediately submits another message. The latency is measured at the client as the round-trip time for each message. Throughput is measured at the learner as the number of deliver invocations over time.

To push the system toward a higher message throughput, we increased the number of threads running in parallel at the client. The number of threads, N, ranged from 1 to 24 in increments of 1. We stopped measuring at 24 threads because the CPU utilization on the application reached 100%. For each value of N, the client sent a total of 2 million messages. We repeated this for three runs, and report the 99th percentile latency and mean throughput.

Latency and predictability. We measure the latency and predictability of P4xos as a system, and show the latency distribution in Figure 4a. Since applications typically do not run at maximum throughput, we report the results for when the application is sending traffic at a rate of 24k messages/second, which favors the libpaxos implementation. This rate is far below what P4xos can achieve. We see that P4xos shows lower latency and exhibits better predictability than libpaxos: its median latency is 57 µs, compared with 146 µs, and the

¹ Ethernet header (14B), IP header (20B), UDP header (8B), Paxos header (44B), and Paxos payload (16B).
² On newer FPGAs [44], the resource utilization will be an order of magnitude lower.

Resource    Utilization
LUTs        84674 / 433200 (19.5%)
Registers   103921 / 866400 (11.9%)
BRAMs       801 / 1470 (54.4%)

Table 3: Resource utilization on NetFPGA SUME with P4FPGA, with 65,535 Paxos instance numbers.

Figure 4: The end-to-end performance of P4xos compared to libpaxos: (a) latency CDF, (b) throughput vs. latency, (c) latency CDF (LevelDB), and (d) throughput vs. latency (LevelDB). P4xos has higher throughput and lower latency.

difference between the 25% and 75% quantiles is less than 3 µs, compared with 30 µs for libpaxos. Note that the higher tail latencies are attributed to the proposers and learners, which run on the host.

Throughput and 99th percentile latency. Figure 4b shows that P4xos results in significant improvements in latency and throughput. While libpaxos is only able to achieve a maximum throughput of 63,099 messages per second, P4xos reaches 275,341 messages per second, at which point the application becomes CPU-bound. This is a 4.3× improvement. Given that P4xos on Tofino can support four orders of magnitude more messages, and that the application is CPU-bound, cross-traffic will have a small effect on overall P4xos performance. The lowest 99th percentile latency for libpaxos occurs at the lowest throughput rate, and is 183 µs. However, the latency increases significantly as the throughput increases, reaching 774 µs. In contrast, the latency for P4xos starts at only 51 µs, and is 117 µs at the maximum throughput, mostly due to the server.

Case Study: Replicated LevelDB. As an end-to-end performance experiment, we measured the latency and throughput of consensus messages for our replicated LevelDB example application. The LevelDB instances were deployed on the three servers running the learners. We followed the same methodology as described above, but rather than sending dummy values, we sent an equal mix of get and put requests. The latency distribution is shown in Figure 4c. We report the results for a light workload rate of 24k messages/second for both systems. For P4xos, the round-trip time (RTT) of 99% of the requests is 50 µs, including the client's latency. In contrast, for libpaxos, the RTT ranges from 100 µs to 200 µs.

Figure 5: P4xos performance (a) when an acceptor fails, and (b) when the FPGA leader is replaced by the DPDK backup.

This demonstrates that P4xos latency is both lower and more predictable, even when used for replicating a relatively more complex application.

The 99th percentile latency and throughput when replicating LevelDB are shown in Figure 4d.

The limiting factor for performance is the application itself, as the CPUs of the servers are fully utilized. P4xos removes consensus as a bottleneck. The maximum throughput achieved here by P4xos is 157,589 messages per second. In contrast, for the libpaxos deployment, we measured a maximum throughput of only 54,433 messages per second.

Note that LevelDB was unmodified, i.e., there were no changes to the application. We expect that, given a high-performance implementation of Paxos, applications could be modified to take advantage of the increased message throughput, for example, by using multi-core architectures to process requests in parallel [26].

5.2.3 Performance Under Failure

To evaluate the performance of P4xos after failures, we repeated the latency and throughput measurements from Section 5.2.2 under two different scenarios. In the first, one of the three P4xos acceptors fails. In the second, the P4xos leader fails, and the leader FPGA is temporarily replaced with a DPDK leader. In both graphs in Figure 5, the vertical line indicates the failure point.

Figure 5a shows a throughput increase after the loss of one acceptor; the learner is the bottleneck, and it processes fewer messages after a failure. To handle the loss of a leader, we re-route traffic to the backup. Figure 5b shows that P4xos is resilient to a failure, and that after a very short recovery period, it continues to provide high throughput. Note that P4xos could fail over to a backup libpaxos leader, as they provide the same API.

6 Related Work

Consensus is a well-studied problem [18, 30, 31, 8]. Many have proposed consensus optimizations, including exploiting application semantics (e.g., EPaxos [29], Generalized Paxos [21], Generic Broadcast [35]), restricting the protocol (e.g., Zookeeper atomic broadcast [41]), or careful engineering (e.g., Gaios [2]).

Figure 6 compares related work along two axes. The y-axis plots the strength of the assumptions that a protocol makes about network behavior (e.g., reliable delivery, ordered delivery, etc.). The x-axis plots the level of support that network devices need to provide (e.g., quality-of-service queues, support for adding sequence numbers, maintaining persistent state, etc.).

Lamport's basic Paxos protocol falls in the lower left quadrant, as it only assumes packet delivery in point-to-point fashion and the election of a non-faulty leader. It also requires no modification to network forwarding devices. Fast Paxos [22] optimizes the protocol by optimistically assuming a spontaneous message ordering [22, 36, 37]. However, if that assumption is violated, Fast Paxos reverts to the basic Paxos protocol.

NetPaxos [10] assumes ordered delivery, without enforcing the assumption, which is likely unrealistic. Speculative Paxos [40] and NoPaxos [24] use programmable hardware to increase the likelihood of in-order delivery, and leverage that assumption to optimize consensus à la Fast Paxos [22]. In contrast, P4xos makes few assumptions about the network behavior, and uses the programmable data plane to provide high performance.

(Figure 6 plots Paxos, Fast Paxos, NetPaxos, Speculative Paxos, NoPaxos, and P4xos along two axes: strength of network assumptions, from weak to strong, and required hardware support, from none to full.)

Figure 6: Design space for consensus/network interaction.

7 Conclusion

P4xos significantly improves the performance of consensus without strengthening assumptions about the network and without additional hardware. This is a first step towards a more holistic approach to designing distributed systems, in which the network can accelerate services traditionally running on the host.

References

[1] Benz, S., Marandi, P. J., Pedone, F., and Garbinato, B. Building global and scalable systems with atomic multicast. In Middleware (Dec. 2014).

[2] Bolosky, W. J., Bradshaw, D., Haagens, R. B., Kusters, N. P., and Li, P. Paxos replicated state machines as the basis of a high-performance data store. In USENIX NSDI (Mar. 2011).

[3] Bosshart, P., Daly, D., Gibb, G., Izzard, M., McKeown, N., Rexford, J., Schlesinger, C., Talayco, D., Vahdat, A., Varghese, G., and Walker, D. P4: Programming protocol-independent packet processors. SIGCOMM Comput. Commun. Rev. 44, 3 (July 2014), 87–95.

[4] Bosshart, P., Gibb, G., Kim, H.-S., Varghese, G., McKeown, N., Izzard, M., Mujica, F., and Horowitz, M. Forwarding metamorphosis: Fast programmable match-action processing in hardware for SDN. SIGCOMM Comput. Commun. Rev. 43, 4 (Aug. 2013), 99–110.

[5] Burrows, M. The chubby lock service for loosely-coupled distributed systems. In USENIX OSDI (Nov. 2006).

[6] Calder, B., Wang, J., Ogus, A., Nilakantan, N., Skjolsvold, A., McKelvie, S., Xu, Y., Srivastav, S., Wu, J., Simitci, H., Haridas, J., Uddaraju, C., Khatri, H., Edwards, A., Bedekar, V., Mainali, S., Abbasi, R., Agarwal, A., Haq, M. F. u., Haq, M. I. u., Bhardwaj, D., Dayanand, S., Adusumilli, A., McNett, M., Sankaran, S., Manivannan, K., and Rigas, L. Windows Azure storage: A highly available cloud storage service with strong consistency. In ACM SOSP (Oct. 2011).

[7] Chandra, T. D., Griesemer, R., and Redstone, J. Paxos made live: An engineering perspective. In ACM PODC (Aug. 2007).

[8] Chandra, T. D., and Toueg, S. Unreliable failure detectors for reliable distributed systems. J. ACM 43, 2 (Mar. 1996), 225–267.

[9] Charron-Bost, B., and Schiper, A. Uniform consensus is harder than consensus. J. Algorithms 51, 1 (Apr. 2004), 15–37.

[10] Dang, H. T., Sciascia, D., Canini, M., Pedone, F., and Soulé, R. NetPaxos: Consensus at network speed. In ACM SOSR (June 2015).

[11] DPDK. http://dpdk.org/, 2015.

[12] FPGA Drive FMC. https://opsero.com/product/fpga-drive-fmc/, 2016.

[13] Friedman, R., and Birman, K. Using group communication technology to implement a reliable and scalable distributed IN coprocessor. In TINA Conference (Sept. 1996).

[14] Gray, J., and Lamport, L. Consensus on transaction commit. ACM Trans. Database Syst. 31, 1 (Mar. 2006), 133–160.

[15] Gurevich, V. Barefoot Networks, programmable data plane at terabit speeds. In DXDD (2016), Open-NFP.

[16] István, Z., Sidler, D., Alonso, G., and Vukolic, M. Consensus in a box: Inexpensive coordination in hardware. In USENIX NSDI (Mar. 2016).

[17] Kulkarni, S., Bhagat, N., Fu, M., Kedigehalli, V., Kellogg, C., Mittal, S., Patel, J. M., Ramasamy, K., and Taneja, S. Twitter Heron: Stream processing at scale. In ACM SIGMOD (May 2015).



[18] Lamport, L. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133–169.

[19] Lamport, L. Paxos made simple. ACM SIGACT News 32, 4 (Dec. 2001), 18–25.

[20] Lamport, L. Lower bounds for asynchronous consensus. Distributed Computing 19, 2 (2003), 104–125.

[21] Lamport, L. Generalized Consensus and Paxos. Tech. Rep. MSR-TR-2005-33, Microsoft Research, 2004.

[22] Lamport, L. Fast paxos. Distributed Computing 19 (Oct. 2006), 79–103.

[23] Li, J., Michael, E., and Ports, D. R. Eris: Coordination-free consistent transactions using in-network concurrency control.In ACM SOSP (2017), pp. 104–120.

[24] Li, J., Michael, E., Sharma, N. K., Szekeres, A., and Ports, D. R. K. Just say no to paxos overhead: Replacing consensus with network ordering. In USENIX OSDI (Nov. 2016).

[25] libpaxos, 2013. https://bitbucket.org/sciascid/libpaxos.

[26] Marandi, P. J., Benz, S., Pedone, F., and Birman, K. P. The performance of paxos in the cloud. In IEEE SRDS (Oct. 2014).

[27] Marandi, P. J., Primi, M., Schiper, N., and Pedone, F. Ring paxos: A high-throughput atomic broadcast protocol. In DSN (June 2010).

[28] Mazieres, D. Paxos Made Practical. Unpublished manuscript, Jan. 2007.

[29] Moraru, I., Andersen, D. G., and Kaminsky, M. There is more consensus in egalitarian parliaments. In ACM SOSP (Nov. 2013).

[30] Oki, B. M., and Liskov, B. H. Viewstamped replication: A new primary copy method to support highly-available distributed systems. In ACM PODC (Jan. 1988).

[31] Ongaro, D., and Ousterhout, J. In search of an understandable consensus algorithm. In USENIX ATC (Aug. 2014).

[32] Open-NFP. http://open-nfp.org/, 2015.

[33] P4@ELTE. http://p4.elte.hu/, 2015.

[34] P4.org. http://p4.org, 2015.

[35] Pedone, F., and Schiper, A. Generic broadcast. In DISC (Sept. 1999).

[36] Pedone, F., and Schiper, A. Optimistic atomic broadcast: A pragmatic viewpoint. Theor. Comput. Sci. 291, 1 (Jan. 2003), 79–101.

[37] Pedone, F., Schiper, A., Urbán, P., and Cavin, D. Solving agreement problems with weak ordering oracles. In European Dependable Computing Conference (Oct. 2002).

[38] Poke, M., and Hoefler, T. DARE: High-performance state machine replication on RDMA networks. In ACM HPDC (June 2015).

[39] Popescu, D. A., and Moore, A. W. PTPmesh: Data center network latency measurements using PTP. In IEEE MASCOTS(2017), pp. 73–79.

[40] Ports, D. R. K., Li, J., Liu, V., Sharma, N. K., and Krishnamurthy, A. Designing distributed systems using approximate synchrony in data center networks. In USENIX NSDI (May 2015).

[41] Reed, B., and Junqueira, F. P. A simple totally ordered broadcast protocol. In ACM/SIGOPS LADIS (Sept. 2008).

[42] Sciascia, D., and Pedone, F. Geo-Replicated Storage with Scalable Deferred Update Replication. In DSN (June 2013).

[43] Van Renesse, R., and Altinbuken, D. Paxos made moderately complex. ACM Comput. Surv. 47, 3 (Feb. 2015), 1–36.

[44] Virtex UltraScale+, 2017. https://www.xilinx.com/products/silicon-devices/fpga/virtex-ultrascale-plus.html#productTable.

[45] Wang, H., Soulé, R., Dang, H. T., Lee, K. S., Shrivastav, V., Foster, N., and Weatherspoon, H. P4FPGA: A rapid prototyping framework for P4. In ACM SOSR (Apr. 2017).

[46] Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D. E., and Maltzahn, C. Ceph: A scalable, high-performance distributed file system. In USENIX OSDI (Nov. 2006).

[47] Xilinx SDNet Development Environment. www.xilinx.com/sdnet, 2014.

[48] XPliant Ethernet Switch Product Family. www.cavium.com/XPliant-Ethernet-Switch-Product-Family.html, 2014.

[49] Zilberman, N., Audzevich, Y., Covington, G. A., and Moore, A. W. NetFPGA SUME: Toward 100 Gbps as research commodity. IEEE Micro 34, 5 (Sept. 2014), 32–41.

[50] Zilberman, N., Grosvenor, M., Popescu, D. A., Manihatty-Bojan, N., Antichi, G., Wójcik, M., and Moore, A. W. Where has my time gone? In PAM (2017), pp. 201–214.
