
Source: systems.cs.columbia.edu/files/wpid-modist-nsdi09.pdf

MODIST: Transparent Model Checking of Unmodified Distributed Systems

Junfeng Yang◦∗, Tisheng Chen‡, Ming Wu‡, Zhilei Xu‡, Xuezheng Liu‡,
Haoxiang Lin‡, Mao Yang‡, Fan Long†, Lintao Zhang∗, Lidong Zhou∗

◦ Columbia University, ‡ Microsoft Research Asia, ∗ Microsoft Research Silicon Valley, † Tsinghua University

Abstract

MODIST is the first model checker designed for transparently checking unmodified distributed systems running on unmodified operating systems. It achieves this transparency via a novel architecture: a thin interposition layer exposes all actions in a distributed system and a centralized, OS-independent model checking engine explores these actions systematically. We made MODIST practical through three techniques: an execution engine to simulate consistent, deterministic executions and failures; a virtual clock mechanism to avoid false positives and false negatives; and a state exploration framework to incorporate heuristics for efficient error detection.

We implemented MODIST on Windows and applied it to three well-tested distributed systems: Berkeley DB, a widely used open-source database; MPS, a deployed Paxos implementation; and PACIFICA, a primary-backup replication protocol implementation. MODIST found 35 bugs in total. Most importantly, it found protocol-level bugs (i.e., flaws in the core distributed protocols) in every system checked: 10 in total, including 2 in Berkeley DB, 2 in MPS, and 6 in PACIFICA.

1 Introduction

Despite their growing popularity and importance, distributed systems remain difficult to get right. These systems have to cope with a practically infinite number of network conditions and failures, resulting in complex protocols and even more complex implementations. This complexity often leads to corner-case errors that are difficult to test and, once detected in the field, impossible to reproduce.

Model checking has been shown effective at detecting subtle bugs in real distributed system implementations [19, 27]. These tools systematically enumerate the possible execution paths of a distributed system by starting from an initial state and repeatedly performing all possible actions to this state and its successors. This state-space exploration makes rare actions such as network failures appear as often as common ones, thereby quickly driving the target system (i.e., the system we check) into corner cases where subtle bugs surface.

To make model checking effective, it is crucial to expose the actions a distributed system can perform, and to do so at an appropriate level. Previous model checkers for distributed systems tended to place this burden on users, who have to either write (or rewrite) their systems in a restricted language that explicitly annotates event handlers [19], or heavily modify their systems to shoehorn them into a model checker [27].

This paper presents MODIST, a system that checks unmodified distributed systems running on unmodified operating systems. It simulates a variety of network conditions and failures, such as message reordering, network partitions, and machine crashes. The only effort required to start checking a distributed system is to provide a simple configuration file specifying how to start it. MODIST spawns this system in the native environment the system runs within, infers what actions the system can do by transparently interposing between the application and the operating system (OS), and systematically explores these actions with a centralized, OS-independent model checking engine. We have carefully engineered MODIST to ensure the executions MODIST explores and the failures it injects are consistent and deterministic: inconsistency creates false positives that are painful to diagnose; non-determinism makes it hard to reproduce detected errors.

Real distributed systems tend to rely on timeouts for failure detection (e.g., leases [14]); many of these timeouts hide in branch statements (e.g., “if(now > t + timeout)”). To find bugs in the rarely tested timeout-handling code, MODIST provides a virtual clock mechanism to explore timeouts systematically using a novel static symbolic analysis technique. Compared to state-of-the-art symbolic analysis techniques [3, 4, 13, 31], our method reduces analysis complexity using the following two insights: (1) programmers use time values in simple ways (e.g., arithmetic operations), and (2) programmers check timeouts soon after they query the current time (e.g., by calling gettimeofday()).

We implemented MODIST on Windows. We applied it to three well-tested distributed systems: Berkeley DB, a widely used open-source database; MPS, a Paxos implementation that has managed production data centers with more than 100K machines for over two years; and PACIFICA, a primary-backup replication protocol implementation. MODIST found 35 bugs in total. In particular, it found protocol-level bugs (i.e., flaws in the core protocols) in every system checked: 10 in total, including 2 in Berkeley DB, 2 in MPS, and 6 in PACIFICA. We measured the speed of MODIST and found that (1) MODIST incurs reasonable overhead (up to 56.5%) as a checking tool and (2) it can speed up a checked execution (up to 159 times faster) using its virtual clock.

MODIST provides a customizable framework for incorporating various state-space exploration strategies. Using this framework, we implemented dynamic partial order reduction (DPOR) [9], random exploration, depth-first exploration, and their variations. Among these, DPOR is a strategy well known in the model checking community for avoiding redundancy in exploration. To evaluate these strategies, we measured their protocol-level coverage (i.e., unique protocol states explored). The results show that, while DPOR achieves good coverage for a small bounded state space, it scales poorly as the state space grows; a more balanced variation of DPOR, with a set of randomly selected paths as starting points, achieves the best coverage.

This paper is organized as follows. We present an overview of MODIST (§2), then describe its implementation (§3) and evaluation (§4). Next we discuss related work (§5) and conclude (§6).

2 Overview

A typical distributed system that MODIST checks has multiple processes,∗ each running multiple threads. These processes communicate with each other by sending and receiving messages through socket connections. MODIST can reorder messages and inject failures to simulate an asynchronous and unreliable network. The processes may write data to disk, and MODIST will generate different possible crash scenarios by permuting these disk writes.

The remainder of this section gives an overview of MODIST, covering its architecture (§2.1), its checking process (§2.2), the checks it enables (§2.3), and its user interface (§2.4).

2.1 Architecture

Figure 1 illustrates the architecture of MODIST applied to a 4-node distributed system. The master node runs multiple threads (the curved lines in the figure) and might send or receive messages (the solid boxes). For each process in the target system, MODIST inserts an interposition frontend between the process and its native operating system to intercept and control non-deterministic decisions involving thread and network operations.

MODIST further employs a backend that runs in a different address space and communicates with the frontends via RPC. This design minimizes MODIST's perturbation of the target system, allows us to build a generic backend that runs on a POSIX-compliant operating system, and makes it possible to build frontends for MODIST on different operating systems. The backend consists of five components: a dependency tracker, a failure simulator, a virtual clock manager, a model checking engine, and a global assertion checker.

∗ In this paper we use node and process interchangeably.

[Figure 1 diagram: the MODIST backend (dependency tracking, failure simulation, virtual clock, model checking, global assertion) connects via RPC to an interposition frontend in each process (Master, Node1, Node2, Client), which sits between the process and its OS.]

Figure 1: MODIST architecture. All MODIST components are shaded. The target system consists of one master, two replication nodes, and one client. MODIST's frontend interposes between each process in the target system and the operating system to intercept and control non-deterministic actions, such as message interleaving and thread interleaving. MODIST's backend runs in a separate address space to schedule these actions.

Interposition. MODIST's interposition frontend is a thin layer that exposes what actions a distributed system can do and lets MODIST's backend deterministically schedule them. Specifically, it does so in two steps: (1) when the target system is about to execute an action, the frontend pauses it and reports it to the backend; and (2) upon the backend's command, the frontend either resumes or fails the paused action, turning the target system into a “puppet” of the backend.

We place the interposition layer at the OS-application boundary to avoid modifying either the target system or the underlying operating system. In addition, despite variations across OS-application interfaces, these interfaces provide similar functions, allowing us to build a generic backend.

Since the interposition layer runs inside the target system, we explicitly design it to be simple and mostly stateless, and leave the logic and the state in the backend, thereby reducing the perturbation of the target system.

Dependency Tracking. MODIST's dependency tracker oversees how actions interfere with each other. It uses these dependencies to compute the set of enabled actions, i.e., the actions that, if executed, will not block in the OS. For example, a recv() is enabled if there is a message to receive, and disabled otherwise. The model checking engine (described below) only schedules enabled actions, because scheduling a disabled action would deadlock the target system (analogous to a cooperative thread scheduler scheduling a blocked thread).

# command         working dir   inject failure?
master.exe        ./master/     1
node.exe          ./node1/      1
node.exe          ./node2/      1
client.exe test1  ./client/     0

Figure 2: A configuration file that spawns the distributed system in Figure 1. We used this file to check PACIFICA.

Failure Simulation. MODIST's failure simulator orchestrates many rare events that may occur in a distributed system, including message reordering, message loss, network partition, and machine crashes; these events can expose bugs in the often-untested failure handling code. The failure simulator lets MODIST inject these failures as needed, consistently to avoid false positives, and deterministically to let users reliably reproduce errors (cf. §2.2).

Virtual Clock. MODIST's virtual clock manager has two main functions: (1) to discover timers in the target system and fire them as requested by MODIST's model checking engine to trigger more bugs, and (2) to ensure that all processes in the target system observe a consistent clock to avoid false positives. Since the clock is virtual, MODIST can “fast forward” the clock as needed, often making a checked execution faster than a real one.
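The fast-forward behavior can be sketched as follows; this is a minimal illustration with hypothetical names, not MODIST's actual implementation. Virtual time never advances on its own; it jumps straight to the earliest pending timer deadline when the model checker decides to fire it, which is why a checked execution can run faster than real time.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical sketch of a virtual clock: timers register deadlines, and
// "fast forward" jumps the clock to the earliest deadline and fires it.
class VirtualClock {
    uint64_t now_ = 0;
    // min-heap of pending timer deadlines (virtual milliseconds)
    std::priority_queue<uint64_t, std::vector<uint64_t>,
                        std::greater<uint64_t>> timers_;
public:
    uint64_t now() const { return now_; }
    void register_timer(uint64_t deadline) { timers_.push(deadline); }
    // Jump to the next deadline and fire that timer; false if none pending.
    bool fire_next() {
        if (timers_.empty()) return false;
        now_ = std::max(now_, timers_.top());  // time only moves forward
        timers_.pop();
        return true;
    }
};
```

A 30-second lease timeout thus costs nothing to explore: the clock simply jumps 30 virtual seconds ahead.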

Model Checking. MODIST's model checking engine acts as an “omnipresent” scheduler of the target system. It systematically explores a distributed system's executions by enumerating the actions, failures, and timers exposed by the other MODIST components. It uses a set of search heuristics and state-space reduction techniques to improve the efficiency of its exploration. We elaborate on the model checking process in the next section and the search strategies in §3.6.

Global Assertion. MODIST's global assertion mechanism lets users check distributed properties on consistent global snapshots; these properties cannot be checked by observing only the local states at each individual node. Its implementation leverages our previous work [25].

2.2 Checking Process

With all MODIST's components in place, we now describe MODIST's checking process. To begin checking a distributed system, the user only needs to prepare a simple configuration file that specifies how to start the target system. Figure 2 shows a configuration file for the 4-node replication system shown in Figure 1; it is a real configuration that we used to check PACIFICA. Each line in the configuration tells MODIST how to start a process in the target system. A typical configuration consists of 2 to 10 processes. The “inject failure” flag is useful when users do not want to check failures for a process. For example, client.exe is an internal test program that does not handle any failures, so we turned off failure checking for this process.

init_state = checkpoint(create_init_state());
q.enqueue(init_state, init_state.actions);
while(!q.empty()) {
  <state, action> = q.dequeue();
  try {
    next_state = checkpoint(action(restore(state)));
    global_assert(next_state); // check user-provided global assertions
    if (next_state has never been seen before)
      q.enqueue(next_state, next_state.actions);
  } catch (Error e) {
    // save trace and report error
    ...
  }
}

Figure 3: Model checking pseudo-code.

With a configuration file, users can readily start checking their systems by running modist <config>. MODIST then instruments the executables referred to in the configuration file to interpose between the application and the operating system, and starts its model checking loop to explore the possible states and actions in the target system: a state is an instantaneous snapshot of the target system, while an action can be to resume a paused WinAPI function via the interposition layer, to inject a failure via the failure simulator, or to fire a timer via the virtual clock manager.

Figure 3 shows the pseudo-code of MODIST's model checking loop. MODIST first spawns the processes specified in the configuration to create an initial state, and adds all 〈initial state, action〉 pairs to a state queue, where action is an action that the target system can do in the initial state. Next, MODIST takes a 〈state, action〉 pair off the state queue, restores the system to state, and performs action. If the action generates an error, MODIST will save a trace and report the error. Otherwise, MODIST invokes the user-provided global assertions on the resultant global state. MODIST further adds new state/action pairs to the state queue based on one of MODIST's search strategies (cf. §3.6 for details). Then, it takes another 〈state, action〉 pair off the queue and repeats.

To implement the above process, MODIST needs to checkpoint and restore states. It uses a stateless approach [12]: it checkpoints a state by remembering the actions that created the state, and restores it by redoing all the actions. Compared to a stateful approach that checkpoints a state by saving all the relevant memory bits, a stateless approach requires few modifications to the target system, as previous work has shown [12, 19, 28, 39].
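The stateless approach can be sketched in a few lines (a toy illustration with hypothetical names, using an int as a stand-in for system state): a checkpoint is simply the trace of actions that produced a state, and restore replays that trace from the initial state.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch of stateless checkpointing: no memory snapshot is
// saved; a checkpoint records the actions that led to a state, and restore
// deterministically re-executes them from the initial state.
using Action = std::function<void(int&)>;  // mutates a toy "system state"

struct Checkpoint { std::vector<Action> trace; };

int restore(const Checkpoint& cp) {
    int state = 0;                            // create_init_state()
    for (const auto& a : cp.trace) a(state);  // redo all recorded actions
    return state;
}

Checkpoint extend(Checkpoint cp, Action a) {
    cp.trace.push_back(std::move(a));         // checkpoint = old trace + action
    return cp;
}
```

Replay only recreates the same state if every action is deterministic, which is exactly why MODIST's execution engine enforces determinism.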


2.3 Checks

The checks that MODIST performs include generic checks that require no user intervention as well as user-written system-specific checks.

Currently, MODIST detects two classes of generic errors. The first is “fail-stop” errors, which manifest themselves when the target system unexpectedly crashes in the absence of an injected crash from MODIST. These crashes can be segmentation faults due to memory errors or program aborts because MODIST has brought the target system into an erroneous state. MODIST detects these unexpected crashes by catching the corresponding signals. The second is “divergence” errors [12], which manifest themselves when the target system deadlocks or goes into an infinite loop. MODIST catches these errors using timeouts: when MODIST schedules one of the actions of the target system, it waits up to a user-specified timeout interval (10 seconds by default) for the target system to get back to it; if the timeout expires, MODIST flags a divergence error.
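The divergence check can be sketched roughly as follows (hypothetical names; the real mechanism is RPC-based): after scheduling an action, the checker waits a bounded time for the target to report back, and flags a divergence if the wait times out.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of divergence detection: the target's wrapper calls
// report_back() when the scheduled action completes; the checker waits for
// that report with a timeout and suspects a deadlock/infinite loop otherwise.
class DivergenceDetector {
    std::mutex m_;
    std::condition_variable cv_;
    bool reported_ = false;
public:
    void report_back() {  // called from the target's interposition wrapper
        std::lock_guard<std::mutex> g(m_);
        reported_ = true;
        cv_.notify_all();
    }
    // Returns true if a divergence is suspected (no report within timeout).
    bool wait_and_check(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(m_);
        return !cv_.wait_for(lk, timeout, [&]{ return reported_; });
    }
};
```

A generous timeout trades detection latency for fewer spurious divergence reports on slow machines.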

Because MODIST checks the target system by executing it, MODIST can easily check the effects of real executions and find errors. Thus, we can always combine MODIST with other dynamic error detection tools (e.g., Purify [16] and Valgrind [29]) to check more generic properties; we leave these checks for future work.

In addition to generic checks, MODIST can perform system-specific checks via user-provided assertions, including local assertions (via assert() statements) inserted into the target system and global assertions that run in the centralized model checking engine. Given these assertions, MODIST will amplify them by driving the target code into many possible states where these assertions may fail. In general, the more assertions users add, the more effective MODIST will be.

2.4 Advanced User Interface

As with most other automatic error detection tools, the more system-specific knowledge MODIST has, the more effective it will be. For users who want to check their system more thoroughly, MODIST provides the following methods for incorporating domain knowledge.

Users can add more program assertions in the code for a more thorough check. In addition to these local assertions, users can enrich the set of checks by specifying global assertions in MODIST. These assertions check distributed properties on any consistent global snapshot.

Users can make MODIST more effective by reducing their system's state space. A simple trick is to bound the number of failures MODIST injects per execution. Our previous work [38, 39] showed that tricky bugs are often caused by a small number of failures at critical moments. Obviously, without bounds on the number of failures, a distributed system may keep failing without making any progress. In addition, developers tend to find bugs triggered by convoluted failures uninteresting [38].
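A per-execution failure budget of this kind can be sketched in a few lines (illustrative names only):

```cpp
// Hypothetical sketch: cap the number of injected failures per execution,
// since tricky bugs are usually triggered by only a few failures at
// critical moments.
class FailureBudget {
    int remaining_;
public:
    explicit FailureBudget(int max_failures) : remaining_(max_failures) {}
    // Called at each failure injection point; returns whether to inject.
    bool try_inject() {
        if (remaining_ <= 0) return false;  // budget exhausted: run normally
        --remaining_;
        return true;
    }
};
```

With a budget of, say, 2, the checker still explores where those two failures land, but the system always eventually makes progress.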

Users can provide hints to let MODIST focus on the states (among an infinite number of states) that users consider most interesting. Users can do so in two ways: (1) extend one of MODIST's search algorithms through the well-defined state queue interface, and (2) construct a test case to test some unusual parts of the state space.

3 Implementation

We implemented MODIST on Windows by intercepting calls to WinAPI [36], the Windows Application Programming Interface. We chose WinAPI because it is the predominant programming interface used by almost all Windows applications and libraries, including the default POSIX implementation on Windows. While we built MODIST on Windows, we expect that porting it to other operating systems, such as Linux, BSD, and Solaris, should be easy because WinAPI is more complicated than the POSIX API provided by most other operating systems. For example, WinAPI has several times as many functions as POSIX. Moreover, many WinAPI functions operate in both synchronous and asynchronous modes, and the completion notifications of asynchronous IO (AIO) may be delivered through several mechanisms, such as events, select, or IO completion ports [36].

When we implemented MODIST we tried to adhere to the following two goals:

1. Consistent and deterministic execution. The executions MODIST explores and the failures it injects should be consistent and deterministic to avoid difficult-to-diagnose false positives and non-deterministic errors.

2. Tailor for distributed systems. We explicitly designed MODIST to check distributed systems. With this goal in mind, we customized our implementation for distributed systems and avoided being overly general.

These goals were reflected in many places in our implementation. In the rest of this section, we describe MODIST's implementation in detail, highlighting the decisions entailed by these goals.

3.1 Interposition

MODIST’s interposition layer transparently interceptsthe WinAPI functions in the target system and allowsMODIST’s backend to control it deterministically. Thereare two main issues regarding interposition. First, inter-position complexity: since the interposition layer runs in-side the address space of the target system, it should be assimple as possible to avoid perturbing the target system,or introducing inconsistent or non-deterministic execu-tions. Second, IO abstraction: as previously mentioned,WinAPI is a wide interface with rich semantics; Win-dows networking IO is particularly complex. To avoid

Page 5: MODIST: Transparent Model Checking of …systems.cs.columbia.edu/files/wpid-modist-nsdi09.pdfMODIST, covering its architecture (x2.1), its checking process(x 2.2), the checks it enables

Category # of functions # of LOC

Network 28 1816Time 7 161File System 9 640Mem 5 126Thread 33 1433

Shared 1290

Total 82 5466

Table 1: Interposition complexity. This table shows thelines of code for WinAPI wrappers, broken down by cat-egories. The “Shared” row refers to the code sharedamong all API categories. Most wrappers are fairly small(67 lines on average).

excessive complexity in MODIST’s backend, the interpo-sition layer should abstract out the semantics irrelevant tochecking and abstract the WinAPI networking interfaceto a simpler form.

Interposition complexity. To reduce the interposition complexity, we implemented the interposition layer using the binary instrumentation toolkit from our previous work [25]. This toolkit takes a list of annotated WinAPI functions we want to hook and automatically generates much of the wrapper code for interposition. Under the hood, it intercepts calls to dynamically linked libraries by overwriting the function addresses in relocation tables (import tables in Windows terminology).

Since we check distributed systems, we only need to intercept the WinAPIs relevant to these systems. Table 1 shows the categories of WinAPIs we currently hook: (1) networking APIs, such as WSARecv() (receiving a message), for exploring network conditions; (2) time APIs, such as GetSystemTime(), for discovering timers; (3) file system APIs, such as WriteFile() and FlushFileBuffers(), for injecting disk failures and simulating crashes; (4) memory APIs, such as malloc(), for injecting memory failures; and (5) thread APIs, such as CreateThread() and SetEvent(), for scheduling threads.

Most WinAPI wrappers are simple: they notify MODIST's backend about the WinAPI calls using an RPC call, wait for the reply from the backend, and, upon receiving the reply, either call the underlying WinAPIs or inject failures. Table 1 shows the total lines of code in all manually written wrappers; each wrapper consists of only 67 lines of code on average.
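The common wrapper pattern described above can be sketched as follows; the names and the RPC stand-in are hypothetical, and real wrappers also marshal call-specific arguments:

```cpp
#include <functional>
#include <string>

// Hypothetical skeleton of a wrapper: report the intercepted call to the
// backend (stand-in for the real RPC layer), block until it replies, then
// either invoke the real API or inject a failure as commanded.
enum class Verdict { PROCEED, FAIL };

int run_wrapper(const std::function<Verdict(const std::string&)>& rpc_to_backend,
                const std::string& api_name,
                const std::function<int()>& real_call,
                int injected_error) {
    // (1) notify the backend and pause until it replies
    Verdict v = rpc_to_backend(api_name);
    // (2) on the backend's command, either fail or resume the real call
    if (v == Verdict::FAIL) return injected_error;
    return real_call();
}
```

Because the decision lives entirely in the backend, the wrapper itself stays thin and mostly stateless, matching the design goal stated in §2.1.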

IO abstraction. Controlling the Windows networking IO interface is complex for three reasons: (1) there are many networking functions; (2) these functions heavily use AIO, whose executions are hidden inside the kernel and not exposed to MODIST; and (3) these functions may produce non-deterministic results due to failures in the network. We addressed these issues using three methods: (1) abstracting similar network operations into one generic operation to narrow the networking IO interface, (2) exposing AIO to MODIST by running it synchronously in a proxy thread, and (3) carefully placing error injection points to avoid non-determinism.

To demonstrate our methods, we show in Figure 4 the wrapper for WSARecv(), a WinAPI function to synchronously or asynchronously receive data from a socket. For simplicity, we omit error-handling code and assume AIO completion is delivered using events only (events are similar to binary semaphores).

Our wrapper first checks whether the network connection represented by the socket argument s has already been broken by MODIST (line 5–8). If so, it simply returns an error to avoid inconsistently returning success on a broken socket. It then handles AIO (line 9–24) by creating a generic network IO structure net_io (line 10–14), hijacking the application's IO completion event (line 16–18), spawning a proxy thread (line 21), and issuing the AIO to the OS (line 23). The proxy thread will invoke the function mc::net_io::run() (line 29–55). This function first notifies MODIST about the IO (line 34). Upon MODIST's reply, it either injects a failure (line 36–40) or waits for the OS to complete the IO (line 40–51). Function run() then reports the IO result to MODIST, which in this example is the length of the data received (line 47–50). Finally, it calls the SetEvent() wrapper to wake up any real threads in the target system that are waiting for the IO to complete.

This wrapper example demonstrates the abstraction we use between MODIST's interposition frontend and the backend. A network IO is split into an io_issue and an io_result RPC. The first RPC, io_issue, expresses the IO intent of the target system to MODIST before it proceeds to a potentially blocking IO, letting MODIST avoid scheduling a disabled (i.e., blocked) IO. Its second purpose is to serve as a failure injection point. The second RPC, io_result, lets MODIST update the control information it tracks.

These RPC methods take the message sizes and the network connections as arguments, but not the specific message buffers or sockets, which may change across different executions. This approach ensures that MODIST's backend sees the same RPC calls when it replays the actions to recreate a state as when it initially created that state. If MODIST detects a non-deterministic replay (e.g., a WSARecv() receives fewer bytes than expected), it will retry the IO by default.

There are two additional nice features about our IOabstraction: (1) it allows wrapper code sharing andtherefore reduces the interposition complexity (Table 1,

Page 6: MODIST: Transparent Model Checking of …systems.cs.columbia.edu/files/wpid-modist-nsdi09.pdfMODIST, covering its architecture (x2.1), its checking process(x 2.2), the checks it enables

1 : // the OS uses lpOverlap to deliver IO completion
2 : int mc_WSARecv(SOCKET s, LPWSABUF buf, DWORD nbuf,
3 :       ..., LPWSAOVERLAPPED lpOverlap, ...) {
4 :   // check if MODIST has broken this connection
5 :   if(mc_socket_is_broken(s)) {
6 :     ::WSASetLastError(WSAENETRESET);
7 :     return SOCKET_ERROR;
8 :   }
9 :   if(lpOverlap) { // Asynchronous mode
10:     mc::net_io *io = ...;
11:     io->orig_lpOverlap = lpOverlap;
12:     io->op = mc::RECV_MESSAGE; // set IO type
13:     io->connection = ...; // Identify connection using
14:     // source <ip, port> and destination <ip, port>
15:
16:     // Hijack application's IO completion notification event
17:     io->orig_event = lpOverlap->hEvent;
18:     lpOverlap->hEvent = io->proxy_event;
19:
20:     // Create a proxy thread and run mc::net_io::run
21:     io->start_proxy_thread();
22:     // Issue asynchronous receive to the OS
23:     return ::WSARecv(s, buf, nbuf, ..., io->proxy_lpOverlap, ...);
24:   }
25:   // Synchronous mode
26:   ...
27: }
28: // mc::net_io code is shared among all networking IO
29: void mc::net_io::run() { // called by proxy thread
30:   mc::rpc_client *rpc = mc::current_thread_rpc_client();
31:
32:   // This RPC blocks this thread. It returns only when MODIST
33:   // wants to (1) inject a failure, or (2) complete the IO
34:   int ret = rpc->io_issue(this->op, this->connection);
35:
36:   if(ret == mc::FAILURE) {
37:     // MODIST wants to inject a failure
38:     this->orig_lpOverlap->Internal // Fake an IO failure
39:         = STATUS_CONNECTION_RESET;
40:     ... // Ask the OS to cancel the IO
41:   } else { // MODIST wants to complete this IO
42:     // Wait for the OS to actually complete the IO, because the
43:     // data to receive may still be in the real network.
44:     // This wait will not block forever, since MODIST's
45:     // dependency tracker knows there are bytes to receive
46:     ::WaitForSingleObject(this->proxy_event, INFINITE);
47:
48:     // Report the bytes actually sent or received, so MODIST's
49:     // dep. tracker knows how many bytes are in the network.
50:     int msg_size = this->orig_lpOverlap->InternalHigh;
51:     rpc->io_result(this->op, this->connection, msg_size);
52:   }
53:   // deliver IO notification to application. mc_SetEvent is
54:   // a wrapper to WinAPI SetEvent.
55:   mc_SetEvent(this->orig_event);
56: }

Figure 4: Simplified WSARecv() wrapper.

“Shared” row), (2) it abstracts away the OS-specific features and enables the backend to be OS-agnostic.

3.2 Dependency Tracking

MODIST’s dependency tracker monitors how actions might affect each other. The notion of dependency is from [12]: two actions are dependent if one can enable or disable the other or if executing them in a different order leads to a different state. MODIST uses these dependencies to avoid false deadlocks (described below), to simulate failures (§3.3), and to reduce the state space (§3.6).

To avoid false deadlocks, MODIST needs to compute the set of enabled actions that will not block in the OS. For determinism, MODIST schedules one action at a time and pauses all other actions (cf. §2.2). If MODIST incorrectly schedules a disabled action (such as a blocking WSARecv()), it will deadlock the target system, because the scheduled action is blocked in the OS while all other actions are paused by MODIST.

Since the dependency tracker tries to infer whether the OS scheduler would block a thread in a WinAPI call (recall that the interposition layer exposes AIOs as threads), it unsurprisingly resembles an OS scheduler and replicates a small amount of the control data in the OS and the network. To illustrate how it works, consider the WSARecv() wrapper in Figure 4. The dependency tracker tracks precisely how many bytes are sent and received on each network connection using the io_result RPC (line 51). If a thread tries to receive a message (line 34) when none is available, the dependency tracker will mark this thread as disabled and place it on the wait queue of the connection. Later, when a WSASend() occurs at the other end of the connection, the dependency tracker will remove this thread from the wait queue and mark it as enabled. When MODIST schedules this thread by replying to its io_issue RPC, the thread will not block at line 46, because there is data to receive. In addition to network control data, the dependency tracker also tracks threads, locks, and semaphores.
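To make the bookkeeping concrete, the following is a minimal, hypothetical sketch (all names are ours, not MODIST's) of the per-connection state the paragraph above describes: the tracker mirrors how many bytes are in the network, keeps a wait queue per connection, and flips threads between enabled and disabled as sends and receives occur.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>

// Hypothetical sketch of MODIST-style per-connection dependency
// tracking: the tracker mirrors how many bytes are in flight on each
// connection and uses that to decide which receivers are enabled.
struct ConnTracker {
    std::map<int, std::size_t> bytes_in_net;     // conn id -> undelivered bytes
    std::map<int, std::deque<int>> waiters;      // conn id -> blocked thread ids
    std::map<int, bool> enabled;                 // thread id -> schedulable?

    // Called when io_result reports a completed send of `n` bytes.
    void on_send(int conn, std::size_t n) {
        bytes_in_net[conn] += n;
        // Data is now available: wake one blocked receiver, if any.
        if (!waiters[conn].empty()) {
            enabled[waiters[conn].front()] = true;
            waiters[conn].pop_front();
        }
    }

    // Called when a thread issues a receive; returns true if the thread
    // may be scheduled without blocking in the OS.
    bool on_recv(int conn, int tid) {
        if (bytes_in_net[conn] > 0) {
            enabled[tid] = true;
            return true;
        }
        enabled[tid] = false;            // would block: mark disabled
        waiters[conn].push_back(tid);
        return false;
    }

    // Called when a scheduled receive actually consumes `n` bytes.
    void on_recv_done(int conn, std::size_t n) {
        bytes_in_net[conn] -= n;
    }
};
```

The real tracker also covers threads, locks, and semaphores; this sketch only shows the network-bytes case discussed in the text.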

3.3 Failure Simulation

When requested by the model checking engine, MODIST’s failure simulator injects five categories of failures: API failures (e.g., WriteFile() returns “disk error”), message reordering,∗ message loss, network partitions, and machine crashes. Simulating API failures is the easiest: MODIST simply tells the interposition layer to return an error code. Reordering messages is also easy, since the model checking engine already explores different orders of actions. To simulate different crash scenarios, we used techniques from our previous work [38, 39] to permute the disk writes that a system issues.

Simulating network failures is more complicated due to the consistency and determinism requirement. We first tried a naïve approach: simply closing sockets to simulate connection failures. This approach did not work well because we frequently experienced inconsistent failures:

∗Message reordering is not a failure, but since it is often caused by abnormal network delay, for convenience we consider it a failure.


the “macro” failures we want to inject (e.g., a network partition) map not to one but to a set of “micro” failures we can inject through the interposition layer (e.g., a failed WSARecv()). For example, to break a TCP connection, we must carefully fail all pending asynchronous IOs associated with the connection at both endpoints. Otherwise, the target system may see an inconsistent connection status and crash, generating a false positive.

We also frequently experienced non-deterministic failures because the OS detects failures using non-deterministic timeouts. Consider the following actions:

1. Process P1 calls WSASend(P2, message).
2. Process P2 calls asynchronous WSARecv(P1).
3. MODIST breaks the connection between P1 and P2.

P2 may or may not receive the message, depending on when P2’s OS times out the broken connection.

Our current approach ensures that failure simulation is consistent and deterministic as follows. We know the exact set of real or proxy threads that are paused by MODIST in rpc->io_issue() (Figure 4, line 34). To simulate a network failure, we inject failures into all these threads, and we do so immediately to avoid any non-deterministic kernel timeouts. Note that doing so in the example above will not cause us to miss the scenario where P2 receives the message before the connection breaks; MODIST will simply explore this scenario in a different execution where it completes P2’s asynchronous WSARecv() first (by replying to P2’s io_issue() RPC) and then breaks the connection between P1 and P2.
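The approach above can be sketched in a few lines; this is a hypothetical simplification (types and names are ours), showing only the core idea that breaking a connection immediately fails every IO pending on it, while a different execution may complete an IO first.

```cpp
#include <cassert>
#include <map>
#include <set>

// Hypothetical sketch of consistent connection-failure injection: every
// proxy thread blocked in io_issue() is recorded per connection, and
// breaking a connection fails *all* of them at once, at both endpoints,
// instead of waiting for non-deterministic kernel timeouts.
enum Reply { PENDING, COMPLETE, FAILURE };

struct FailureSimulator {
    // connection id -> proxy threads currently blocked in io_issue()
    std::map<int, std::set<int>> blocked;
    std::map<int, Reply> reply;   // thread id -> reply delivered

    void io_issue(int conn, int tid) {
        blocked[conn].insert(tid);
        reply[tid] = PENDING;
    }

    // Break a connection: deterministically fail every pending IO on it.
    void break_connection(int conn) {
        for (int tid : blocked[conn]) reply[tid] = FAILURE;
        blocked[conn].clear();
    }

    // Complete one pending IO (e.g., deliver a message first, in a
    // different execution, before breaking the connection).
    void complete(int conn, int tid) {
        blocked[conn].erase(tid);
        reply[tid] = COMPLETE;
    }
};
```

The two test scenarios below mirror the P1/P2 example: one execution breaks the connection while the receive is pending, the other completes the receive first.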

3.4 Virtual Clock

MODIST’s virtual clock manager injects timeouts when requested by the model checking engine and provides a consistent view of the clock to the target system. A side benefit of the virtual clock is that the target system may run faster, because the virtual clock manager can fast-forward time. For example, when the target system calls sleep(1000), the virtual clock manager can add 1000 to its current virtual clock and let the target system wake up immediately.
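The fast-forward idea fits in a few lines; this is a hypothetical sketch (names ours), not MODIST's actual interface: since every sleeping thread is under the model checker's control, a sleep can simply advance the virtual clock rather than wait in real time.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of virtual-clock fast-forwarding: a sleep call
// advances the clock immediately instead of blocking in real time.
struct VirtualClock {
    uint64_t now_ms = 0;

    uint64_t gettime() const { return now_ms; }

    // mc_sleep(1000) returns immediately; the caller merely observes
    // that 1000 virtual milliseconds have elapsed.
    void mc_sleep(uint64_t ms) { now_ms += ms; }
};
```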

Discovering Timeouts. To detect bugs in rarely tested timeout-handling code, we want to discover as many timers as possible. This task is made difficult because system code extensively uses implicit timers, where the code first gets the current time (e.g., by calling gettimeofday()), then checks whether a timeout has occurred (e.g., using an if-statement). Figure 5 shows a real example from Berkeley DB.

Since implicit timers do not use OS APIs to check timeouts, they are difficult for a model checker to discover. Previous work [19, 27, 38] requires users to manually annotate implicit timers.

// db-4.7.25.NC/repmgr/repmgr_sel.c
int __repmgr_compute_timeout(ENV *env, db_timespec *timeout)
{
    db_timespec now, t;
    ... // Set t to the first due time.
    if (have_timeout) {
        __os_gettime(env, &now, 1); // Query current time.
        if (now >= t) // Timeout check, immediately follows the query.
            *timeout = 0; // Timeout occurs.
        else
            *timeout = t - now; // No timeout.
    }
    ...
}

Figure 5: An implicit timer in Berkeley DB (after macro expansion and minor editing).

To discover implicit timers automatically, we developed a static symbolic analysis technique. It is based on the following two observations:

1. Programmers use time in simple ways. For example, they explicitly label time values (e.g., db_timespec in Figure 5), they do simple arithmetic on time values, and they generally do not cast time values to pointers or other unusual types. This observation implies that simple static analysis is sufficient to track how a time value flows.

2. Programmers check timeouts soon after they query the current time. The intuition is that programmers want the current time to be “fresh” when they check timeouts. This observation implies that our analysis only needs to track a short flow of a time value (e.g., within three function calls) and may stop when the flow becomes long.

We analyzed how time values are used in Berkeley DB version 4.7.25. We found that Berkeley DB mostly uses “+,” “-,” and occasionally “*” and “/” (for conversions, e.g., from seconds to milliseconds). In 12 out of 13 implicit timers, the time query and the time check are within a few lines.

Our analysis resembles symbolic execution [3, 4, 13, 31]. It has three steps: (1) statically analyze the code of the target system and find all system calls that return time values; (2) track how the time values flow to variables; and (3) upon a branch statement involving a tracked time value, use a simple constraint solver to generate symbolic values that make both branches true. To show the idea, we use a source code instrumentation example. In Figure 5, our analysis can track how time flows from “__os_gettime” to “if (now >= t),” and replace the “__os_gettime” line in Figure 5 with

mc::rpc_client *rpc = mc::current_thread_rpc_client();
now = rpc->gettime(/*timer=*/t);


This RPC call tells the virtual clock manager that a timer fires at t; the virtual clock manager can then return a time value smaller than t for one execution, and one greater than t for another execution, to explore both possible execution paths. We implemented our analysis using the Phoenix compiler framework [30].
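The branch-exploration idea can be shown in miniature; the following hypothetical sketch (our names, not MODIST's RPC interface) reduces the instrumented check of Figure 5 to its branch and lets a per-execution choice drive both sides of it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of how the virtual clock manager can drive both
// sides of a discovered implicit timer: for a timer due at t, one
// execution is given a time just before t (no timeout) and another a
// time at t (timeout fires).
struct ClockManager {
    bool fire;  // chosen by the model checking engine per execution

    uint64_t gettime(uint64_t timer_due) const {
        return fire ? timer_due : timer_due - 1;
    }
};

// The instrumented check from Figure 5, reduced to its branch.
bool timed_out(const ClockManager& mgr, uint64_t t) {
    uint64_t now = mgr.gettime(/*timer=*/t);
    return now >= t;
}
```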

Since our analysis is static, it avoids the runtime overhead of instrumenting each load and store to track symbolic values, and is thus much simpler than dynamic symbolic execution tools [3, 4, 13, 31], which often take iterations to become stable [3, 4]. Note that our analysis is unsound, as are other symbolic analysis tools, in that it may miss some timers and thus miss bugs. However, it will not introduce false positives, because the virtual clock manager ensures the consistency of time.

Ensuring Consistent Clock. A consistent clock is crucial to avoid false positives. For example, the safety of the lease mechanism [14] requires that the lessee time out before the lessor; reversing the order may trigger “bugs” that never occur in practice. We actually encountered a painful false positive due to a violation of this safety requirement when checking PACIFICA.

To maintain consistent time, the virtual clock manager sorts all timers in the target system from earliest to latest based on when they will fire. When the model checking engine decides to fire a timer, it systematically chooses one of the timers that fall in the range [T, T + E], where T is the earliest timer and E is a configurable clock error allowed by the target system. This mechanism lets MODIST explore interesting timer behaviors without deviating too much from real timer-triggered executions.
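The candidate-selection rule above is simple enough to state directly; this is a hypothetical sketch (function name ours) of selecting which timers are allowed to fire next under a clock error E.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of consistent timer firing: only timers whose due
// time falls within [T, T+E] of the earliest timer T are candidates, so
// the firing order never deviates from a real execution by more than
// the allowed clock error E.
std::vector<uint64_t> fireable(const std::vector<uint64_t>& timers,
                               uint64_t e) {
    std::vector<uint64_t> out;
    if (timers.empty()) return out;
    uint64_t t = *std::min_element(timers.begin(), timers.end());
    for (uint64_t due : timers)
        if (due <= t + e) out.push_back(due);  // candidate to fire next
    return out;
}
```

With E = 0 this degenerates to always firing the earliest timer, i.e., a perfectly synchronized clock.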

3.5 Global Assertion

We have implemented global assertions leveraging our previous work, D3S [25]. D3S enables transparent predicate checking of a running distributed system. It provides a simple programming interface for developers to specify global assertions, interposes on both user-level functions and OS system calls in the target system to expose its runtime state as state tuples, and collects such tuples as globally consistent snapshots for evaluating assertions. To use D3S, developers need to specify the functions to interpose, the state tuples to retrieve from function parameters, and a sequential program that takes a complete state snapshot as input and evaluates the predicate. D3S compiles such assertions into a state-exposing module, which is injected into all processes of the target system, and a checking module, which contains the evaluation programs and outputs checking results for every constructed snapshot.

MODIST incorporates D3S to enable global assertions, with two notable modifications. First, we simplify D3S by letting each node transmit state tuples synchronously to MODIST’s checking process, which verifies assertions immediately. Previously, because nodes may transmit state tuples concurrently, D3S had to buffer each received tuple, before checking assertions, until all tuples causally dependent before that tuple had been received. Since MODIST runs one action at a time, it no longer needs to buffer tuples. Second, while D3S uses a Lamport clock [23] to totally order state tuples into snapshots, MODIST uses a vector clock [26] to check more global snapshots.
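The reason a vector clock checks more snapshots than a Lamport clock is that it distinguishes concurrent tuples from ordered ones. The following is a hypothetical sketch (our names) of the standard vector-clock comparison involved; it is not MODIST's actual snapshot code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of vector-clock comparison: a happens-before b
// iff a <= b componentwise with at least one strict inequality; two
// clocks with no order between them are concurrent, and concurrent
// tuples can be grouped into additional global snapshots to check.
using VClock = std::vector<int>;

bool happens_before(const VClock& a, const VClock& b) {
    bool strictly_less = false;
    for (std::size_t i = 0; i < a.size(); i++) {
        if (a[i] > b[i]) return false;
        if (a[i] < b[i]) strictly_less = true;
    }
    return strictly_less;
}

bool concurrent(const VClock& a, const VClock& b) {
    return !happens_before(a, b) && !happens_before(b, a);
}
```

A Lamport clock would force a total order on the concurrent pair below, hiding one of the two candidate snapshots.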

3.6 State Space Exploration

MODIST maintains a queue of the state/action pairs to be explored. Due to the complexity of a distributed system, it is often infeasible for MODIST to exhaust the state space. Thus, it is key to decide which state/action pairs to add to the queue and the order in which they are explored.

MODIST tags each action with a vector clock and implements a customizable, modular framework for exploring the state space, so that different reduction techniques and heuristics can be incorporated. This design is largely inspired by our observation that the effectiveness of various strategies and heuristics is often application-dependent.

The basic state exploration process is simple: MODIST takes the first state/action pair 〈s,a〉 from the queue, steers the system execution to state s if that is not the current state, applies the action a, reaches a new state s′, and examines the resulting state for errors. It then calls a customizable function explore, which takes the entire path from the initial state to s and then s′, where each state is tagged with its vector clock and each state transition is tagged with the corresponding action. For s′, all enabled actions are provided to the function. The function then produces a list of state/action pairs and indicates whether the list should be added to the front of the queue or the back. MODIST then inserts the list into the queue and repeats these steps.

MODIST has a natural bias toward exploring 〈s,a〉 pairs where s is the state MODIST is currently in. This default strategy saves the cost of replaying the trace to reach the state in the selected state/action pair.
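The loop described above can be sketched with a pluggable explore callback; this is a hypothetical miniature (states and actions are plain strings, names are ours), intended only to show the queue discipline, not MODIST's real engine.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of the exploration framework: a pluggable
// `explore` callback decides which new state/action pairs to enqueue
// and at which end of the queue (front = depth-first bias).
using Pair = std::pair<std::string, std::string>;

struct Explorer {
    std::deque<Pair> queue;
    // explore(new state s', enabled actions) -> (pairs, push_front?)
    std::function<std::pair<std::vector<Pair>, bool>(
        const std::string&, const std::vector<std::string>&)> explore;

    // One step of the basic process: take a pair, apply the action,
    // then let `explore` extend the frontier.
    std::string step(const std::vector<std::string>& enabled) {
        Pair p = queue.front(); queue.pop_front();
        std::string s2 = p.first + "." + p.second;  // "apply" the action
        auto r = explore(s2, enabled);
        for (const Pair& q : r.first)
            if (r.second) queue.push_front(q); else queue.push_back(q);
        return s2;
    }
};
```

The test below plugs in a DFS-like explore function that enqueues every enabled action at the front of the queue.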

We now show how various state exploration strategies and heuristics can be implemented in the MODIST framework.

Random. Random exploration with a bounded maximum path length explores a random path up to the bound and then starts from the initial state for another random path. The explore function works as follows: if the current path has not exceeded the bound, the function randomly picks an enabled action a′ at the new state s′ and has 〈s′,a′〉 inserted at the end of the queue (note that the queue is empty). If the current path has reached the bound, the function randomly chooses an enabled action a0 in the initial state s0 and has 〈s0,a0〉 inserted at the end of the queue.

DFS and BFS. For Depth First Search (DFS) and Breadth First Search (BFS), the explore function simply inserts 〈s′,a′〉 for every enabled action a′ in state s′. For DFS, the new list is inserted at the front of the queue; for BFS, at the back. Clearly, DFS is more attractive, since MODIST does not have to replay traces often to recreate states.

DPOR. For dynamic partial order reduction (DPOR), the explore function works as follows. Let a be the last action, causing the transition from s to s′. The function looks at every state sp before s on the path and the action ap taken at that state. If a is enabled at sp (i.e., if s and sp are concurrent as judged by the vector clocks) and a does not commute with ap (i.e., different orders of the two actions could lead to different executions), we record 〈sp,a〉 in the list of pairs to explore. Once all states are examined, the function returns the list and has MODIST insert it into the queue.

By specifying how the list is inserted, the function can choose to use DFS or BFS on top of DPOR. Also, by ordering the pairs in the list differently, MODIST can be instructed to explore the newly added branches in different orders (e.g., top-down or bottom-up). The default is again DFS, to avoid the cost of recreating states. We further introduce Bounded DPOR to refer to the variation of DPOR that bounds the DFS for a more balanced state-space exploration.
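The DPOR backtracking rule admits a compact sketch. The following is a hypothetical miniature (our names; the commutativity test is a stand-in that compares connection tags, not MODIST's real dependency test): walk back along the path and, wherever the last action is concurrent with an earlier transition and the two do not commute, schedule the last action at that earlier state.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical DPOR sketch: record <sp, a> for every earlier state sp
// whose action is concurrent with, and dependent on, the last action a.
struct Step {
    std::string state;
    std::string action;
    std::vector<int> vclock;  // vector clock of the action
};

bool concurrent(const std::vector<int>& a, const std::vector<int>& b) {
    bool a_le_b = true, b_le_a = true;
    for (std::size_t i = 0; i < a.size(); i++) {
        if (a[i] > b[i]) a_le_b = false;
        if (b[i] > a[i]) b_le_a = false;
    }
    return !a_le_b && !b_le_a;
}

// Stand-in dependency test: two actions commute iff they touch
// different connections (encoded as the first character of the name).
bool commutes(const std::string& x, const std::string& y) {
    return x[0] != y[0];
}

std::vector<std::pair<std::string, std::string>>
dpor_backtrack(const std::vector<Step>& path, const Step& last) {
    std::vector<std::pair<std::string, std::string>> out;
    for (const Step& sp : path)
        if (concurrent(sp.vclock, last.vclock) &&
            !commutes(sp.action, last.action))
            out.push_back({sp.state, last.action});
    return out;
}
```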

The explore function can also be constructed to favor certain actions (e.g., crash events) over others, to bound the exploration in various ways (e.g., by the path length or the number of certain actions on the path), and to focus on a subset of possible actions.

4 Evaluation

We have applied MODIST to three distributed systems: (1) Berkeley DB, a widely used open-source database (a version with replication); (2) MPS, a closed-source Paxos [22] implementation built by a Microsoft product team and deployed in commercial data centers for more than two years; and (3) PACIFICA, a mature implementation of a primary-backup replication protocol we developed. We picked Berkeley DB and MPS because of their wide deployment and importance, and PACIFICA because it provides an interesting case study where developers apply model checking to their own system.

Table 2 summarizes the errors we found, all of which are previously unknown bugs. We found a total of 35 errors, 10 of which are protocol-level bugs that occur only under rare interleavings of messages and crashes; these bugs reflect flaws in the underlying communication protocols of the systems. Implementation bugs are

System        KLOC   Protocol  Impl.  Total

Berkeley DB   172.1      2       5      7
MPS            53.5      2      11     13
PACIFICA       12         6       9     15

Total         237.6     10      25     35

Table 2: Summary of errors found. The KLOC (thousand lines of code) column shows the sizes of the systems we checked. We separate protocol-level bugs (Protocol) and implementation-level bugs (Impl.), in addition to reporting the total (Total). 31 of the 35 bugs have been confirmed by the developers.

those that can be caused by injecting API failures. All MPS and PACIFICA bugs were confirmed by the developers. Three of the seven Berkeley DB bugs, including one protocol-level bug, were confirmed by Berkeley DB developers; we are having the rest confirmed. These unconfirmed bugs are likely real bugs, because we can reproduce them without MODIST by manually tweaking the executions and killing processes according to the traces from MODIST.

While other tools (e.g., a static analyzer) can also find implementation bugs, MODIST has the advantage of not generating false positives. In addition, it can expose the effects of these bugs, helping prioritize fixes.

In the rest of this section, we describe our error detection methodology, the bugs we found, MODIST’s coverage results and runtime overhead, and the lessons we have learned.

4.1 Experimental Methodology

Test driver. Model checking is most effective at checking complicated interactions between a small number of objects. Thus, in all tests we run, we use several processes servicing a bounded number of requests. Since the systems we check came with test cases, we simply use them with minor modifications.

Global assertions. By default, MODIST checks for fail-stop errors. To check the correctness properties of a distributed system, MODIST supports user-supplied global assertions (§2). For the replication systems we checked, we added two types of assertions. The first type was global predicates for the safety properties; for example, all replicas must agree on the same sequence of commands. The second type of predicates checks for liveness. True liveness conditions cannot be checked by execution monitoring, so we instead approximate them by checking for progress in the system: we expect the target system to make progress in the absence of failures. In the end, we did not find any bug that violated the safety properties in any of the systems, probably reflecting the relative maturity of these systems. However, we did find bugs that violated the liveness global assertions in every system.
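The two assertion types can be illustrated concretely. The following is a hypothetical sketch (names and representation are ours, not the D3S predicate interface): a safety predicate over one global snapshot, requiring every pair of replica logs to agree position-by-position (so each is a prefix of the other up to its length), and a progress check across snapshots.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of the two global-assertion types described in
// the text, over replica command logs captured in a snapshot.
using Log = std::vector<std::string>;

// Safety: all replicas agree on the same sequence of commands, i.e.,
// no two logs disagree at any shared position.
bool safety(const std::vector<Log>& replicas) {
    for (const Log& r : replicas)
        for (std::size_t i = 0; i < r.size(); i++)
            for (const Log& other : replicas)
                if (i < other.size() && other[i] != r[i])
                    return false;  // two replicas diverge at position i
    return true;
}

// Progress approximation of liveness: absent injected failures, the
// number of committed commands must grow between snapshots.
bool progress(std::size_t committed_before, std::size_t committed_after,
              bool failures_injected) {
    return failures_injected || committed_after > committed_before;
}
```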


Search strategy. MODIST has a set of built-in search strategies; no single strategy works best. We have combined these strategies in our experiments to discover bugs effectively. For example, we can first perform random executions (Random) on the system and inject API failures randomly to find the shallow implementation bugs. We can then use the DPOR strategy with randomly chosen initial paths to explore message orders systematically. We can further add crash and recovery events on top of the message interleavings, starting from a single crash and gradually increasing the number of crashes, to exercise the system’s handling of crash and recovery. We can run these experiments concurrently and fine-tune the strategies.

Terminology. Distributed systems use different terminologies to describe the roles the nodes play in the systems. In this paper, we use primary and secondary to distinguish the replicas in the systems. They are called master and client, respectively, in the Berkeley DB documentation. In the Paxos literature, a primary is also called a leader.

4.2 Berkeley DB: a Replicated Database

Berkeley DB is a widely used open-source transactional storage engine. Its latest version supports replication for applications that must be highly available. In a Berkeley DB replication group, the primary supports both reads and writes, while secondaries support reads only. New replicas can join the replication group at any time.

We checked the latest Berkeley DB production release: 4.7.25.NC. We use ex_rep_mgr, an example application that comes with Berkeley DB, as the test driver. This application manages its data using the Berkeley DB Replication Manager. Our test setup has 3 to 5 processes. They first run an election. Once the election completes, the elected primary inserts data into the replicated database, reads it back, and verifies that it matches the data inserted.

Results and Discussions. We found seven bugs in Berkeley DB: four were triggered by injecting API failures, one was a dangling-pointer error triggered by the primary waiting for multiple ACK messages simultaneously from the secondaries, and the remaining two were protocol-level bugs, which we describe below.

The first protocol-level bug causes a replica to crash due to an “unexpected” message. The timing diagram of this bug is depicted in Figure 6. Replica C is the original primary. Suppose a new election is launched, resulting in replica A becoming the new primary. Replica A will broadcast a REP_NEWMASTER message, which means “I am the new primary.” After replica B receives this message, it tries to synchronize with the new primary and sends A a REP_UPDATE_REQ message to get the up-to-date data. Meanwhile, C

[Timing diagram omitted. Events: A enters a primary state; C enters a primary state; A broadcasts “I am new primary”; B sends an update request to A; C broadcasts a “duplicate primary detected” message; A gives up its primary role; A panics.]

Figure 6: Timing Diagram of Message Exchanges in a Berkeley DB Replication Bug.

processes REP_NEWMASTER by first broadcasting a REP_DUPMASTER message, which means “duplicate primary detected,” and then demoting itself to a secondary. Broadcasting a REP_DUPMASTER message is necessary to ensure that all other replicas know that C is no longer the primary. When A processes REP_DUPMASTER, it has to give up its primary role, because it cannot be sure that it is the latest primary. Soon A receives the delayed but not-outdated REP_UPDATE_REQ message from B. Replica A panics at once, because such a message should only be received by a primary. Such panics occur whenever a delayed REP_UPDATE_REQ message arrives at a recently demoted primary.

The second protocol-level bug is more severe: it causes permanent failure of leader election after a primary crash, when all secondaries believe they cannot be primaries. Suppose replica A is the original primary and is synchronizing data with secondaries B and C. Normally, synchronization works as follows. A sends a REP_PAGE message with the modified database page to B and C. Upon receipt of this message, B and C transition to the log-recovery state by setting the REP_F_RECOVER_LOG flag. A then sends a REP_LOG message with the updated log records. However, if A crashes before it sends REP_LOG, B and C will never be able to elect a new primary because, in Berkeley DB’s replication protocol, a replica in log recovery is not allowed to become a primary.

4.3 MPS: Replicated State Machine Library

MPS is a practical implementation of a replicated state machine library. The library has been used for over two years in production clusters of more than 100K machines to maintain important system metadata consistently and reliably. It consists of 8.5K lines of C++ code for the communication protocol, and 45K for utilities such as networking and storage.

At the core of MPS is a distributed Paxos protocol for consensus [22]. The protocol is executed on a set of machines called replicas. The goal of the protocol is to have replicas agree on a sequence of deterministic commands

Page 11: MODIST: Transparent Model Checking of …systems.cs.columbia.edu/files/wpid-modist-nsdi09.pdfMODIST, covering its architecture (x2.1), its checking process(x 2.2), the checks it enables

[Timing diagram omitted. Events: at t1, A enters a learning state; A sends status queries to B and C; status acks return; A enters a stable state; at t2 > t1 + grace period, A enters a preparing state.]

Figure 7: The Timing Diagram of Message Exchange in MPS Bug 1.

and execute the commands in the same order. Because all replicas start with the same initial state and execute the same sequence of commands, consistency among replicas is guaranteed.

The MPS consensus protocol is leader- (primary-) based. While the protocol ensures safety despite the existence of multiple primaries, a single primary is needed for the protocol to make progress. A replica can act as a primary using a certain ballot number. A primary accepts requests from clients and proposes those requests as decrees, where decree numbers indicate the positions of the requests in the sequence of commands to be executed by the replicated state machine. A decree is considered committed when the primary gets acknowledgments from a quorum (often a majority) of replicas indicating that they have accepted and persistently stored the decree.

If a replica receives a message indicating that a decree unknown to the replica has been committed, the replica enters a learning phase, in which it learns the missing decrees from other replicas.

When an existing primary is considered to have failed, a new primary can be elected. The new primary will use a higher ballot number and carry out a prepare phase to learn the decrees that could have been committed and to ensure that no conflicting decrees are proposed. For each replica, a proposal with a higher ballot number overwrites any previous proposal with a lower ballot number.

Our test setup consists of 3 replicas proposing a small number of decrees.

Results and Discussions. We found 13 bugs in MPS: 11 are implementation bugs that crash replicas, and the other two are protocol-level bugs.

The first protocol-level bug reveals a scenario that leads to state transitions not expected by the developers (as demonstrated by the assertion that rules out the transition). MPS has a simple set of states and state transitions. A replica is normally in a stable state. When it gets an indication that its state is falling behind (i.e., it is missing decrees), it enters a learning state. In the learning state, it fetches the decrees from a quorum of replicas. Once it brings its state up to date with what it receives

[Timing diagram omitted. Events: A enters a learning state; status queries and acks between A, B, and C; A enters a preparing state and sends Prepare; B.decree > A.decree, so B rejects A’s Prepare request; A enters a learning state; A.ballot > B.ballot, so A learns nothing; A enters a preparing state again and sends Prepare.]

Figure 8: The Timing Diagram of Message Exchange in MPS Bug 2.

from a quorum of replicas, it checks whether it should become a primary: if the primary lease has expired, the replica will compete to become a primary by entering a preparing state; otherwise, it will return to a stable state. There is an assertion in the code (and also in the design document for MPS) that the state transition from stable to preparing is impossible.

Figure 7 shows the MODIST-generated scenario that triggers the assertion failure. The following is the list of steps that lead to the violation. Consider a system of three replicas A, B, and C, where any two of them form a quorum. Replica A enters the learning state because it realizes that it does not have the information related to some decree numbers. This could be due to the receipt of a message indicating that the last committed decree number is at least k, while A knows only up to some decree number less than k. A then sends a status query to B and C. A receives the response from B and learns all the missing decrees. Since A and B form a quorum, A enters the stable state. C was the primary. C’s response to A’s status query was delayed, and the primary lease expires on A. At some later point, C’s response arrives. The implementation handles that message as if A were still in the learning state. After A is done, it notices that the primary lease has expired and transitions into the preparing state, causing the unexpected state transition. As a result, A crashes and reboots.

The second protocol-level bug is a violation of a global liveness assertion. It is triggered during primary election under the following scenario: replica A has accepted a decree with ballot number 2 and decree number 1, while replica B has only ballot number 1, but has accepted a decree with decree number 2.

The following series of events leads to this problematic scenario. B is a primary with ballot number 1; it proposes a decree with decree number 1, and the decree is accepted by all replicas, including A and B. It then proposes another decree with decree number 2, which is accepted only on B. B fails before A gets the proposal. A then becomes a primary with ballot number 2, learns the decree with decree number 1, and re-proposes it with ballot number 2.

Figure 8 shows the timing diagram continuing from this scenario. B comes back, receives the prepare request from A, and sends a rejection to A because B thinks A is not up to date, given that B has a higher decree number. After getting the rejection, A enters a learning state. In the learning state, even if B returns the decree with decree number 2, A will reject it because it has a lower ballot number. A will consider itself up to date and enter the preparing state again with a yet higher ballot number. This continues as A keeps increasing its ballot number but is unable to have new decrees committed, triggering a liveness violation.

The problem in this scenario is due to the inconsistency between the preparing phase and the learning phase in their views of what constitutes a newer state: one view uses a higher ballot number, while the other uses a higher decree number. The inconsistency is exposed when one replica has a higher decree number, but a lower ballot number, than the other.
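The inconsistency fits in a few lines. This is a hypothetical sketch (our names, not MPS code) of the two notions of "newer" that the bug exposes; with A = (ballot 2, decree 1) and B = (ballot 1, decree 2), each phase favors a different replica, so neither ever yields.

```cpp
#include <cassert>

// Hypothetical sketch of the inconsistency behind MPS bug 2: the
// learning phase and the prepare phase each use a different notion of
// "newer", and the two disagree when one replica has a higher decree
// number but a lower ballot number than the other.
struct ReplicaState { int ballot; int decree; };

// Learning phase: a proposal is newer if its ballot is higher.
bool newer_by_ballot(ReplicaState a, ReplicaState b) {
    return a.ballot > b.ballot;
}

// Prepare phase: a replica judges another out of date by decree number.
bool newer_by_decree(ReplicaState a, ReplicaState b) {
    return a.decree > b.decree;
}
```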

4.4 PACIFICA: a Primary-Backup Replication Protocol

PACIFICA [24] is a large-scale storage system for semi-structured data. It implements a primary-backup protocol for data replication. We used MODIST to check an implementation of PACIFICA’s replication protocol. This implementation consists of 5K lines of C++ code for the communication protocol and 7K for utilities.

PACIFICA uses a variety of familiar components, including two-phase commit for consistent replica updates, perfect failure detection, replica-group reconfiguration to handle node failures, and replica reconciliation for nodes rejoining a replica group.

Our test setup for PACIFICA has 4 processes: 1 master that maintains global metadata, 2 replica nodes that implement the replication protocol, and 1 client that updates the system and drives the checking process. Figure 2 shows the configuration file.

Results and Discussions. We found 15 bugs in PACIFICA: 9 are implementation bugs that cause crashes and 6 are protocol-level bugs. We managed to find more protocol-level bugs in PACIFICA than in the other systems for two reasons: (1) since we built the system, we could quickly fix the bugs MODIST found and then re-run MODIST to go after other bugs; and (2) we could check more global assertions for PACIFICA.

Figure 9: Partial order state coverage of different exploration strategies.

The most interesting bug we found in PACIFICA prevents PACIFICA from making progress. It is triggered by a node crash followed by a replication group reconfiguration. A primary replica keeps a list of prepared updates (i.e., updates that have been prepared on all replicas, but not yet committed); a secondary replica does not have this data structure. When a primary crashes, a secondary will try to take over and become the new primary. If the crash happens in the middle of a commit operation that leaves some commands prepared but not yet committed, the new primary will try to re-commit all prepared updates by sending the "prepare" messages to the remaining secondary replicas. Unfortunately, PACIFICA did not put these newly prepared updates into the prepared update list. This prevents all the following updates from getting committed because of a hole in the prepared update list.

4.5 State Coverage

To evaluate the state-space exploration strategies described in §3.6, we measured state coverage: the number of unique states a strategy could explore after running a fixed number of execution paths. We examined the coverage of two types of states:

1. Partial order traces [12]. Since two paths with the same partial order are equivalent, the number of different partial order traces provides an upper bound on the number of unique behaviors a strategy can explore.

2. Protocol states. These states capture the more important protocol behaviors of a distributed system.

We did two experiments, both on MPS: one with a small partial order state space and the other with a nearly unbounded state space. These two state spaces give an idea of how sensitive the strategies are to state-space sizes. No crash was injected during the evaluation.

In the first experiment, we made the state space small using a configuration of two nodes, each receiving up to two messages. Figure 9 shows the number of unique partial order traces with respect to the number of paths explored. (Note that both axes are in log scale.) DPOR shows a clear advantage: it exhausted all 115,425 traces after 134,627 paths (the small redundancy was due to an approximation in our DPOR implementation). The Random strategy explored 6,614 unique traces, or 5.7% of the entire state space, after 200,000 paths. DFS is the worst: all 200,000 paths were partial order equivalent and corresponded to only one partial order trace.

Figure 10: Protocol state coverage of different exploration strategies.

In the second experiment, we used a nearly unbounded partial order state space with three MPS nodes sending and receiving an unbounded number of messages. We bounded the maximum decree (two decrees) and the maximum path length (40,000 actions) to make the execution paths finite. Since the state space was large, it was unlikely that Random ever explored a partial order trace twice. As a result, DPOR behaved the same as Random. (This result is not shown.)

While partial order state coverage provides an upper bound on the unique behaviors a strategy explores, different partial order traces may still be redundant and map to the same protocol state. Thus, we further measured the protocol state coverage of different exploration strategies. We defined the protocol state of MPS as a tuple 〈state, ballot, decree〉, where the state could be initializing, learning, stable primary, or stable secondary.∗

Figure 10 shows the protocol states covered by the first 50,000 paths explored in each strategy, using the MPS configuration from the second experiment. DFS had the worst coverage: it found no new states after exploring the first path. The reason is, when the state space is large, DFS tends to explore a large number of paths that differ only at the final few steps; these paths are often partial-order equivalent. DPOR performed almost equally badly: it found fewer than 30 protocol states. This result is not surprising for two reasons: (1) different partial order traces might correspond to the same protocol state and (2) DPOR is DFS-based, and thus suffers the same problem as DFS when the state space is large.

∗We also measured the coverage of global protocol states, which consist of the protocol states of each node in a consistent global snapshot. The results were similar and are not shown.

In Bounded DPOR, protocol-level redundancy is partially conquered by the bounds on backtracks. As shown in Figure 10, the protocol-level state coverage of Bounded DPOR was larger than that of DPOR by an order of magnitude in the first 50,000 paths.

Surprisingly, the Random strategy yielded better coverage than DFS, DPOR, and even Bounded DPOR. The reason is that Random is more balanced: it explores actions anywhere along a path uniformly, and therefore has a better chance to jump to a new path early on and explore a different area of the state space.

These results prompted us to develop a hybrid Random + Bounded DPOR search strategy that works as follows. It starts with a random path and explores the state space with Bounded DPOR. We further bound the total number of backtracks so that the Bounded DPOR exploration ends. Then, a new round of Bounded DPOR exploration starts with a new random path. Random + Bounded DPOR inherits both the balance of Random and the thoroughness of DPOR to cover the corner cases. Both the number of DPOR rounds and the bound on the total number of backtracks are customizable, reflecting a bias towards Random or towards DPOR. As shown in Figure 10, the Random + Bounded DPOR strategy with a round number of 100 performed the best.

4.6 Performance

In our performance measurements, we focused on three metrics: (1) MODIST's path exploration speed; (2) the speedup due to the virtual clock fast-forward; and (3) the runtime overhead MODIST adds to the target system, including interposition, RPC, and backend scheduling.

We set up our experiments as follows. We ran MODIST with two different search strategies: RANDOM and DPOR. For each search strategy, we let MODIST explore 1K execution paths and recorded the running times. We repeated this experiment 50 times and took the average. We used Berkeley DB and MPS as our benchmarks, using configurations identical to those used for error detection. We ran our experiments on a 64-bit Windows Server 2003 machine with dual Intel Xeon 5130 CPUs and 4GB of memory. We measured all time values using QueryPerformanceCounter(), a high-resolution performance counter.

System      | Strategy | Real (s)  | Sleep (s)      | Speedup  | Overhead (absolute and relative)
Berkeley DB | RANDOM   | 1,717±14  | 38,204±193     | 22.3±0.2 | 302±1 s (17.7±0.1%)
Berkeley DB | DPOR     | 1,658±24  | 36,402±5,137   | 22.1±3.2 | 301±17 s (18.2±0.9%)
MPS         | RANDOM   | 1,661±20  | 240,568±1,405  | 145±2    | 825±11 s (49.9±0.2%)
MPS         | DPOR     | 1,853±116 | 295,435±45,659 | 159±19   | 1,048±108 s (56.5±2.6%)

Table 3: MODIST's performance. All numbers are of the form average ± standard deviation.

It appears that we should measure MODIST's overhead by comparing a system's executions with MODIST to those without. However, due to nondeterminism, we cannot compare these two directly: the executions without MODIST may run different program paths than those with MODIST. Moreover, repeated executions of the same testcase without MODIST may differ; we did observe a large variance in MPS's execution times and final protocol states. Thus, we evaluated MODIST's overhead by running a system with MODIST and measuring the time spent in MODIST's components.

Table 3 shows the performance results. The Real column shows the time it took MODIST to explore 1K paths of Berkeley DB and MPS with the RANDOM and DPOR strategies; the exploration speed is roughly two seconds per path and does not change much between the two search strategies. The Sleep column shows the time MODIST saved using its virtual clock when the target systems were asleep; we would have spent this amount of extra time had we run the same executions without MODIST. As shown in the table, the real execution time is much smaller than the sleep time, translating into significant speedups (Column Speedup, computed as Sleep/Real). The Overhead column shows the time spent in MODIST's interposition, RPC, and backend scheduling. For Berkeley DB, MODIST accounts for about 18% of the real execution time. For MPS, MODIST accounts for a higher percentage of execution time (up to 56.5%) because the MPS testcase we used is almost the worst case for MODIST: it only exercises the underlying communication protocol and does no real message processing. Nonetheless, we believe such overhead is reasonable for an error detection tool.

4.7 Lessons

This section discusses the lessons we learned.

Real distributed protocols are buggy. We found many protocol-level bugs, and we found them in every system we targeted, suggesting that real distributed protocols are buggy. Amusingly, these protocols are based on theoretically sound protocols; the bugs were introduced when developers filled in the parts the protocols left unspecified in practice.

Controlling all non-determinism is hard. Systematic checking requires control of the non-determinism in the target system. This task is very hard given the non-determinism in the OS and network, the wide API interface, the many possible failures and their combinations, and MODIST's goal of reducing intrusiveness to the target system. We have had bitter experiences debugging non-deterministic errors in Berkeley DB, which uses process id, memory address, and time to generate random numbers, and in MPS, which randomly interferes with the default Windows firewall. Of all of these, making the Windows socket APIs deterministic was the most difficult; the interface shown in §3 went through several iterations. Our own experience shows that controlling all non-determinism is much harder than merely capturing it as in replay-debugging tools.

Avoid false positives at all costs. False positives may take several days to diagnose. Thus, we want to avoid them, even at the risk of missing errors.

Leverage domain knowledge. In a sense, this entire paper boils down to leveraging the domain knowledge of distributed systems to better model-check them. The core idea of model checking is simple: explore all possible executions; the much more difficult task is to implement this idea effectively in an application domain.

When in doubt, reboot. When we checked MPS, we were surprised by how robust it was. MPS uses a defensive programming technique that works particularly well in the context of distributed replication protocols. MPS extensively uses local assertions, reboots when any assertion fails, and relies on the replication protocol to recover from these eager reboots. This recovery mechanism makes MPS robust against a wide range of failures. Of course, rebooting is not without penalty: if a primary reboots, there can be noticeable performance degradation, and the system also becomes less fault tolerant.

5 Related Work

5.1 Model Checking

Model checkers have previously been used to find errors in both the design and the implementation of software [1, 6, 12, 17–19, 27, 28, 34, 38, 39]. Traditional model checkers require users to write an abstract model of the target system, which often incurs a large up-front cost when checking large systems. In contrast, MODIST is an implementation-level model checker that checks code directly, thus avoiding this cost. Below we compare MODIST to other implementation-level model checkers.


Model checkers for distributed systems. MODIST is most related to model checkers that check real distributed system implementations. CMC [27] is a stateful model checker that checks C code directly. It has been used to check network protocol implementations [27] and file systems [38]. However, to check a system, CMC requires invasive modifications to run the system inside CMC's address space [39]. MaceMC [19] uses bounded depth-first search combined with random walk to find safety and liveness bugs in a number of network protocol implementations written in a domain-specific language. Compared to these two checkers, MODIST directly checks live, unmodified distributed systems running in their native execution environments, thus avoiding the invasive modifications required by CMC and the language restrictions [20] enforced by MaceMC.

CrystalBall [37] detects and avoids errors in deployed distributed systems using an efficient global state collection and exploration technique. While CrystalBall is based on MaceMC and thus checks only systems written in the Mace language [20], its core technique may be portable to MODIST's model checking framework to improve the reliability of general distributed systems.

Other software model checkers. We compare MODIST to other closely related implementation-level model checkers. Our transparent checking approach is motivated by our previous work EXPLODE [39]. However, EXPLODE focuses on storage systems and does not check distributed systems.

To the best of our knowledge, VeriSoft [12] is the first implementation-level model checker. It systematically explores the interleavings of concurrent C programs, and uses partial order reduction to soundly reduce the number of states it explores. It has been used to check industrial-strength programs [5].

Chess [28] is a stateless model checker for exploring the interleavings of multi-threaded programs. To avoid perturbing the target system, it also interposes on WinAPIs. In addition, Chess uses a context-bounding heuristic and a starvation-free scheduler to make its checking more efficient. It has been applied to several industry-scale systems and found many bugs.

ISP [35] is an implementation-level model checker for MPI programs. It controls an MPI program by intercepting calls to MPI methods and reduces the state space it explores using new partial order reduction algorithms.

All three systems focus on checking interleavings of concurrent programs, and thus do not address the issues in checking real distributed systems, such as providing a transparent, distributed checking architecture and enabling consistent and deterministic failure simulation.

5.2 Replay-based debugging

A number of systems [11, 21, 32], including our previous work [15, 25], use deterministic replay to debug distributed systems. These approaches attack a different problem: when a bug occurs, how to capture its manifestation so that developers can reproduce the bug. Combined with fault injection, these tools can be used to detect bugs. Like these systems, MODIST also provides reproducibility of errors. Unlike these systems, MODIST aims to proactively drive the target system into corner cases to find errors in the testing phase, before the system is deployed. MODIST uses the instrumentation library from our previous work [25] to interpose on WinAPIs.

5.3 Other error detection techniques

We view testing as complementary to our approach. Testing is usually less comprehensive than our approach, but works “out of the box.” Thus, there is no reason not to use both testing and MODIST together.

There has been much recent work on static bug finding (e.g., [1, 2, 7, 8, 10, 33]). Roughly speaking, because dynamic checking runs code, it is limited to just the executed paths, but it can more effectively check deeper properties implied by the code (e.g., that two replicas are consistent). The protocol-level errors we found would be difficult to find statically. We view static analysis as complementary: it is easy enough to apply that there is no reason not to use it together with MODIST.

Recently, symbolic execution [3, 4, 13, 31] has been used to detect errors in real systems. This technique is good at detecting bugs caused by tricky input values, whereas our approach is good at detecting bugs caused by non-deterministic events in the environment.

6 Conclusions

MODIST represents an important step toward the ideal of model checking unmodified distributed systems in a transparent and effective way. Its effectiveness has been demonstrated by the subtle bugs it uncovered in well-tested production and deployed systems.

Our experience shows that this requires a combination of art, science, and engineering. It is an art because various heuristics must be developed to find delicate bugs effectively, taking into account the peculiarities of complex distributed systems; it is a science because a systematic, modular approach with a carefully designed architecture is a key enabler; and it involves heavy engineering effort to interpose between the application and the OS, to model and control low-level system behavior, and to handle system-level non-determinism.

Acknowledgements

We thank Huayang Guo and Liying Tang for their help in evaluating MODIST, Jian Tang for his contributions to an earlier version of the system, and our colleagues at Microsoft Research Silicon Valley and the System Research Group at Microsoft Research Asia for their comments and support. We especially thank Stephen A. Edwards for extensive edits and Stelios Sidiroglou-Douskos for detailed comments. We are also grateful to the anonymous reviewers for their valuable feedback and to our shepherd Steven Hand for his guidance.

References

[1] T. Ball and S. K. Rajamani. Automatically validating temporal safety properties of interfaces. In Proceedings of the Eighth International SPIN Workshop on Model Checking of Software (SPIN ’01), pages 103–122, May 2001.

[2] W. Bush, J. Pincus, and D. Sielaff. A static analyzer for finding dynamic programming errors. Software: Practice and Experience, 30(7):775–802, 2000.

[3] C. Cadar, V. Ganesh, P. M. Pawlowski, D. L. Dill, and D. R. Engler. EXE: automatically generating inputs of death. In Proceedings of the 13th ACM Conference on Computer and Communications Security (CCS ’06), pages 322–335, Oct.–Nov. 2006.

[4] C. Cadar, D. Dunbar, and D. Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the Eighth Symposium on Operating Systems Design and Implementation (OSDI ’08), pages 209–224, Dec. 2008.

[5] S. Chandra, P. Godefroid, and C. Palm. Software model checking in practice: an industrial case study. In Proceedings of the 24th International Conference on Software Engineering (ICSE ’02), pages 431–441, May 2002.

[6] J. C. Corbett, M. B. Dwyer, J. Hatcliff, S. Laubach, C. S. Pasareanu, Robby, and H. Zheng. Bandera: Extracting finite-state models from Java source code. In Proceedings of the 22nd International Conference on Software Engineering (ICSE ’00), pages 439–448, June 2000.

[7] M. Das, S. Lerner, and M. Seigle. ESP: path-sensitive program verification in polynomial time. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation (PLDI ’02), pages 57–68, June 2002.

[8] D. Engler, B. Chelf, A. Chou, and S. Hallem. Checking system rules using system-specific, programmer-written compiler extensions. In Proceedings of the Fourth Symposium on Operating Systems Design and Implementation (OSDI ’00), Sept. 2000.

[9] C. Flanagan and P. Godefroid. Dynamic partial-order reduction for model checking software. In Proceedings of the 32nd Annual Symposium on Principles of Programming Languages (POPL ’05), pages 110–121, Jan. 2005.

[10] C. Flanagan, K. Leino, M. Lillibridge, G. Nelson, J. Saxe, and R. Stata. Extended static checking for Java. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation (PLDI ’02), pages 234–245, June 2002.

[11] D. Geels, G. Altekar, P. Maniatis, T. Roscoe, and I. Stoica. Friday: Global comprehension for distributed replay. In Proceedings of the Fourth Symposium on Networked Systems Design and Implementation (NSDI ’07), Apr. 2007.

[12] P. Godefroid. Model checking for programming languages using VeriSoft. In Proceedings of the 24th Annual Symposium on Principles of Programming Languages (POPL ’97), pages 174–186, Jan. 1997.

[13] P. Godefroid, N. Klarlund, and K. Sen. DART: Directed automated random testing. In Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation (PLDI ’05), pages 213–223, June 2005.

[14] C. Gray and D. Cheriton. Leases: An efficient fault-tolerant mechanism for distributed file cache consistency. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (SOSP ’89), pages 202–210, Dec. 1989.

[15] Z. Guo, X. Wang, J. Tang, X. Liu, Z. Xu, M. Wu, M. F. Kaashoek, and Z. Zhang. R2: An application-level kernel for record and replay. In Proceedings of the Eighth Symposium on Operating Systems Design and Implementation (OSDI ’08), pages 193–208, Dec. 2008.

[16] R. Hastings and B. Joyce. Purify: Fast detection of memory leaks and access errors. In Proceedings of the USENIX Winter Technical Conference (USENIX Winter ’92), pages 125–138, Dec. 1992.

[17] G. J. Holzmann. The model checker SPIN. Software Engineering, 23(5):279–295, 1997.

[18] G. J. Holzmann. From code to models. In Proceedings of the Second International Conference on Applications of Concurrency to System Design (ACSD ’01), June 2001.

[19] C. Killian, J. W. Anderson, R. Jhala, and A. Vahdat. Life, death, and the critical transition: Finding liveness bugs in systems code. In Proceedings of the Fourth Symposium on Networked Systems Design and Implementation (NSDI ’07), pages 243–256, Apr. 2007.

[20] C. E. Killian, J. W. Anderson, R. Braud, R. Jhala, and A. M. Vahdat. Mace: language support for building distributed systems. In Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation (PLDI ’07), pages 179–188, June 2007.

[21] R. Konuru, H. Srinivasan, and J.-D. Choi. Deterministic replay of distributed Java applications. In Proceedings of the 14th International Symposium on Parallel and Distributed Processing (IPDPS ’00), pages 219–228, May 2000.

[22] L. Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133–169, 1998.

[23] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Comm. ACM, 21(7):558–565, 1978.

[24] W. Lin, M. Yang, L. Zhang, and L. Zhou. PacificA: Replication in log-based distributed storage systems. Technical report.

[25] X. Liu, Z. Guo, X. Wang, F. Chen, X. Lian, J. Tang, M. Wu, M. F. Kaashoek, and Z. Zhang. D3S: Debugging deployed distributed systems. In Proceedings of the Fifth Symposium on Networked Systems Design and Implementation (NSDI ’08), pages 423–437, Apr. 2008.

[26] F. Mattern. Virtual time and global states of distributed systems. In Proceedings of the International Workshop on Parallel and Distributed Algorithms, pages 215–226, 1988.

[27] M. Musuvathi, D. Y. Park, A. Chou, D. R. Engler, and D. L. Dill. CMC: A pragmatic approach to model checking real code. In Proceedings of the Fifth Symposium on Operating Systems Design and Implementation (OSDI ’02), pages 75–88, Dec. 2002.

[28] M. Musuvathi, S. Qadeer, T. Ball, G. Basler, P. A. Nainar, and I. Neamtiu. Finding and reproducing heisenbugs in concurrent programs. In Proceedings of the Eighth Symposium on Operating Systems Design and Implementation (OSDI ’08), pages 267–280, Dec. 2008.

[29] N. Nethercote and J. Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. In Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation (PLDI ’07), pages 89–100, June 2007.

[30] Phoenix. http://research.microsoft.com/Phoenix/.

[31] K. Sen, D. Marinov, and G. Agha. CUTE: A concolic unit testing engine for C. In Fifth Joint Meeting of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE ’05), pages 263–272, Sept. 2005.

[32] S. M. Srinivasan, S. Kandula, C. R. Andrews, and Y. Zhou. Flashback: A lightweight extension for rollback and deterministic replay for software debugging. In Proceedings of the USENIX Annual Technical Conference (USENIX ’04), pages 29–44, June 2004.

[33] The Coverity Software Analysis Toolset. http://coverity.com.

[34] W. Visser, K. Havelund, G. Brat, S. Park, and F. Lerda. Model checking programs. Automated Software Engineering, 10(2):203–232, 2003.

[35] A. Vo, S. Vakkalanka, M. DeLisi, G. Gopalakrishnan, R. M. Kirby, and R. Thakur. Formal verification of practical MPI programs. In Principles and Practices of Parallel Programming (PPoPP), pages 261–270, Feb. 2009.

[36] Windows API. http://msdn.microsoft.com/.

[37] M. Yabandeh, N. Knezevic, D. Kostic, and V. Kuncak. CrystalBall: Predicting and preventing inconsistencies in deployed distributed systems. In Proceedings of the Sixth Symposium on Networked Systems Design and Implementation (NSDI ’09), Apr. 2009.

[38] J. Yang, P. Twohey, D. Engler, and M. Musuvathi. Using model checking to find serious file system errors. In Proceedings of the Sixth Symposium on Operating Systems Design and Implementation (OSDI ’04), pages 273–288, Dec. 2004.

[39] J. Yang, C. Sar, and D. Engler. EXPLODE: a lightweight, general system for finding serious storage system errors. In Proceedings of the Seventh Symposium on Operating Systems Design and Implementation (OSDI ’06), pages 131–146, Nov. 2006.