15
A Sense of Time for Node.js: Timeouts as a Cure for Event Handler Poisoning Anonymous Abstract—The software development community has begun to adopt the Event-Driven Architecture (EDA) to provide scalable web services. Though the Event-Driven Architecture can offer better scalability than the One Thread Per Client Architecture, its use exposes service providers to a Denial of Service attack that we call Event Handler Poisoning (EHP). This work is the first to define EHP attacks. After examining EHP vulnerabilities in the popular Node.js EDA framework and open-source npm modules, we explore various solutions to EHP- safety. For a practical defense against EHP attacks, we propose Node.cure, which defends a large class of Node.js applications against all known EHP attacks by making timeouts a first-class member of the JavaScript language and the Node.js framework. Our evaluation shows that Node.cure is effective, broadly applicable, and offers strong security guarantees. First, we show that Node.cure defeats all known language- and framework- level EHP attacks, including real reported EHP attacks. Second, we evaluate its costs, finding that Node.js incurs reasonable overheads in micro-benchmarks, and low overheads ranging from 1.02x (real servers) to 1.24x (middleware). Third, we show that Node.cure offers strong security guarantees against EHP attacks for the majority of the Node.js ecosystem. I. I NTRODUCTION Web services are the lifeblood of the modern Internet. To handle the demands of millions of clients, service providers replicate their services across many servers. Since every server costs time and money, service providers want to maximize the number of clients each server can handle. Over the past decade, this goal has led the software community to seriously consider a paradigm shift in their software architecture — from the One Thread Per Client Architecture (OTPCA) used in Apache to the Event-Driven Architecture (EDA) championed by Node.js 1 . Although using the EDA may increase the number of clients each server can handle, it also exposes the service to a new class of Denial of Service (DoS) attack unique to the EDA: Event Handler Poisoning (EHP). The OTPCA and the EDA differ in the resources they dedicate to each client. In the OTPCA, each client is assigned a thread, and the overhead associated with each thread limits the number of concurrent clients the service can handle. In the EDA, a small pool of Event Handler threads is shared by all clients. The per-client overhead is thus smaller in the EDA, improving scalability. Though researchers have investigated application- and language-level security issues that affect many services, we are the first to examine the security implications of using the EDA from a software architecture perspective. In §III we describe a 1 The EDA paradigm is also known as the reactor pattern. new Denial of Service attack that can be used against EDA- based services. Our Event Handler Poisoning attack exploits the most important limited resource in the EDA: the Event Handlers themselves. The source of the EDA’s scalability is also its Achilles’ heel. Multiplexing unrelated work onto the same thread re- duces overhead, but it also moves the burden of time sharing out of the thread library or operating system and into the application itself. Where OTPCA-based services can rely on preemptive multitasking to ensure that resources are shared fairly, using the EDA requires the service to enforce its own cooperative multitasking [89]. An EHP attack identifies a way to defeat the cooperative multitasking used by an EDA-based service, ensuring Event Handler (s) remain dedicated to the attacker rather than handling all clients fairly. Our analysis identifies many EHP vulnerabilities among previously-reported npm vulnerabilities, and a sample of npm modules suggests that EHP vulnerabilities may be an epidemic (§IV). In §V we define criteria for EHP-safety, and use these criteria to evaluate different EHP defenses (§VI). The general solution we propose is to relax but not eliminate the coop- erative multitasking constraint of the EDA. Our prototype, Node.cure (§VII), makes timeouts a first-class member of the Node.js framework, and is the first framework we know of that does so. If an Event Handler in Node.cure blocks, Node.cure will emit a TimeoutError. Although cooperative multitasking is ultimately maintained, Node.cure thus informs every Event Handler when it ought to yield. Our findings on EHP-safety have been corroborated by the Node.js community (§IX). We have developed a guide for practitioners on building EHP-proof systems (PR preliminarily approved), updated the Node.js documentation to warn devel- opers about the perils of particular APIs (two PRs merged), and submitted a pull request to improve the safety of these APIs (PR preliminarily approved). In our evaluation of Node.cure (§VIII), we describe the strong security guarantees against EHP that Node.cure offers, and demonstrate this by showing that Node.cure defeats rep- resentatives of each of the classes of EHP attack. Our cost evaluation on micro- and application benchmarks shows that Node.cure introduces acceptable monitoring overhead, as low as 0% on real server applications. The first steps toward this paper were presented at a workshop 2 . There we introduced the idea of EHP attacks and argued that some previously-reported security vulnerabilities were actually EHP vulnerabilities. We also proposed the “C- WCET Principle”, which we have evaluated and found wanting 2 We omit the reference to meet the double-blind requirement.

A Sense of Time for Node.js: Timeouts as a Cure for Event ...people.cs.vt.edu/davisjam/downloads/blog/ehp/... · the Event-Driven Architecture (EDA) championed by Node.js1. Although

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A Sense of Time for Node.js: Timeouts as a Cure for Event ...people.cs.vt.edu/davisjam/downloads/blog/ehp/... · the Event-Driven Architecture (EDA) championed by Node.js1. Although

A Sense of Time for Node.js:Timeouts as a Cure for Event Handler Poisoning

Anonymous

Abstract—The software development community has begun toadopt the Event-Driven Architecture (EDA) to provide scalableweb services. Though the Event-Driven Architecture can offerbetter scalability than the One Thread Per Client Architecture,its use exposes service providers to a Denial of Service attackthat we call Event Handler Poisoning (EHP).

This work is the first to define EHP attacks. After examiningEHP vulnerabilities in the popular Node.js EDA framework andopen-source npm modules, we explore various solutions to EHP-safety. For a practical defense against EHP attacks, we proposeNode.cure, which defends a large class of Node.js applicationsagainst all known EHP attacks by making timeouts a first-classmember of the JavaScript language and the Node.js framework.

Our evaluation shows that Node.cure is effective, broadlyapplicable, and offers strong security guarantees. First, we showthat Node.cure defeats all known language- and framework-level EHP attacks, including real reported EHP attacks. Second,we evaluate its costs, finding that Node.js incurs reasonableoverheads in micro-benchmarks, and low overheads ranging from1.02x (real servers) to 1.24x (middleware). Third, we show thatNode.cure offers strong security guarantees against EHP attacksfor the majority of the Node.js ecosystem.

I. INTRODUCTION

Web services are the lifeblood of the modern Internet. Tohandle the demands of millions of clients, service providersreplicate their services across many servers. Since every servercosts time and money, service providers want to maximize thenumber of clients each server can handle. Over the past decade,this goal has led the software community to seriously considera paradigm shift in their software architecture — from the OneThread Per Client Architecture (OTPCA) used in Apache tothe Event-Driven Architecture (EDA) championed by Node.js1.Although using the EDA may increase the number of clientseach server can handle, it also exposes the service to a newclass of Denial of Service (DoS) attack unique to the EDA:Event Handler Poisoning (EHP).

The OTPCA and the EDA differ in the resources theydedicate to each client. In the OTPCA, each client is assigneda thread, and the overhead associated with each thread limitsthe number of concurrent clients the service can handle. In theEDA, a small pool of Event Handler threads is shared by allclients. The per-client overhead is thus smaller in the EDA,improving scalability.

Though researchers have investigated application- andlanguage-level security issues that affect many services, we arethe first to examine the security implications of using the EDAfrom a software architecture perspective. In §III we describe a

1The EDA paradigm is also known as the reactor pattern.

new Denial of Service attack that can be used against EDA-based services. Our Event Handler Poisoning attack exploitsthe most important limited resource in the EDA: the EventHandlers themselves.

The source of the EDA’s scalability is also its Achilles’heel. Multiplexing unrelated work onto the same thread re-duces overhead, but it also moves the burden of time sharingout of the thread library or operating system and into theapplication itself. Where OTPCA-based services can rely onpreemptive multitasking to ensure that resources are sharedfairly, using the EDA requires the service to enforce its owncooperative multitasking [89]. An EHP attack identifies a wayto defeat the cooperative multitasking used by an EDA-basedservice, ensuring Event Handler (s) remain dedicated to theattacker rather than handling all clients fairly. Our analysisidentifies many EHP vulnerabilities among previously-reportednpm vulnerabilities, and a sample of npm modules suggeststhat EHP vulnerabilities may be an epidemic (§IV).

In §V we define criteria for EHP-safety, and use thesecriteria to evaluate different EHP defenses (§VI). The generalsolution we propose is to relax but not eliminate the coop-erative multitasking constraint of the EDA. Our prototype,Node.cure (§VII), makes timeouts a first-class member of theNode.js framework, and is the first framework we know of thatdoes so. If an Event Handler in Node.cure blocks, Node.curewill emit a TimeoutError. Although cooperative multitaskingis ultimately maintained, Node.cure thus informs every EventHandler when it ought to yield.

Our findings on EHP-safety have been corroborated bythe Node.js community (§IX). We have developed a guide forpractitioners on building EHP-proof systems (PR preliminarilyapproved), updated the Node.js documentation to warn devel-opers about the perils of particular APIs (two PRs merged),and submitted a pull request to improve the safety of theseAPIs (PR preliminarily approved).

In our evaluation of Node.cure (§VIII), we describe thestrong security guarantees against EHP that Node.cure offers,and demonstrate this by showing that Node.cure defeats rep-resentatives of each of the classes of EHP attack. Our costevaluation on micro- and application benchmarks shows thatNode.cure introduces acceptable monitoring overhead, as lowas 0% on real server applications.

The first steps toward this paper were presented at aworkshop2. There we introduced the idea of EHP attacks andargued that some previously-reported security vulnerabilitieswere actually EHP vulnerabilities. We also proposed the “C-WCET Principle”, which we have evaluated and found wanting

2We omit the reference to meet the double-blind requirement.

Page 2: A Sense of Time for Node.js: Timeouts as a Cure for Event ...people.cs.vt.edu/davisjam/downloads/blog/ehp/... · the Event-Driven Architecture (EDA) championed by Node.js1. Although

(§VI-C). Here we offer the complete package: we refineour previous definition of EHP, increase our survey of EHPvulnerabilities in the wild, and describe a prototype that offersstrong security guarantees using timeouts.

In summary, our work makes the following contributions:

1) We are the first to define Event Handler Poisoning (EHP)(§III), a DoS attack against the EDA. EHP attacks con-sume Event Handlers, a limited resource in the EDA. Ourdefinition of EHP accommodates all known EHP attacks.

2) After evaluating the solution space for EHP defenses (§VI),we describe a general antidote for Event Handler Poisoningattacks: first-class timeouts (§VII). We demonstrate itseffectiveness with a prototype for the popular Node.js EDAframework. To the best of our knowledge, our prototype,Node.cure, is the first EDA framework to incorporate first-class timeouts.

3) We evaluate the security guarantees of timeouts (strong),their costs (small), and their effectiveness (quite) in theNode.js ecosystem (§VIII).

4) We released a guide on EHP-safe techniques for newdevelopers in the EDA, which has been accepted. Our PRimproving the safety of Node.js APIs has been acceptedand our PRs documenting unsafe APIs were merged (§IX).

II. BACKGROUND

In this section we contrast the EDA with the OTPCA(§II-A). We then explain our choice of EDA framework forstudy (§II-B), and finish by introducing formative work onAlgorithmic Complexity attacks (§II-C).

A. Overview of the EDA

There are two main architectures for scalable web servers,the established One Thread Per Client Architecture (OTPCA)and the upstart Event-Driven Architecture (EDA). In the OT-PCA style, exemplified by the Apache Web Server [61], eachclient is assigned a thread or a process, with a large maximumnumber of concurrent clients. The OTPCA model isolatesclients from one another, reducing the risk of interference.However, each additional client incurs memory and contextswitching overhead. In contrast, by multiplexing multiple clientconnections onto a single-threaded Event Loop and offering asmall Worker Pool for asynchronous tasks, the EDA approachreduces a server’s per-client resource use. While the OTPCAand EDA architectures have the same expressive power [71],in practice EDA-based servers can offer greater scaling [84].

Although EDA approaches have been discussed and im-plemented in the academic and professional communities fordecades (e.g. [94], [40], [63], [56], [98], [87]), historically theEDA was only applied in user interface settings like GUIs andweb browsers. The EDA is increasingly relevant in systemsprogramming today thanks largely to the explosion of interestin Node.js, an event-driven server-side JavaScript framework.

The most common version of the EDA on the serverside is the Asymmetric Multi-Process Event-Driven (AMPED)architecture [83], used by all mainstream general-purpose EDAframeworks. The AMPED architecture (hereafter “the EDA”)is illustrated in Figure 1. In the EDA the operating system or aframework places events in a queue (e.g. through Linux’s poll

system call3), and pending events are handled sequentiallyby an Event Loop. The Event Loop may offload expensiveTasks to the queue of a small, fixed-size Worker Pool, whoseWorkers execute Tasks and generate “Task Done” events forthe Event Loop when they finish. One common use of theWorker Pool is to provide quasi-asynchronous (blocking butconcurrent) access to the file system.

Fig. 1. This is the AMPED Event-Driven Architecture. Incoming events fromclients A and B are stored in the event queue, and the associated Callbacks(CBs) will be executed sequentially by the Event Loop. The Event Loop maygenerate new events, to be handled by itself later or as Tasks offloaded to theWorker Pool. Alas, B has performed an EHP attack on this service. His evilinput (third in the event queue) will eventually poison the Event Loop, andthe events after his input will not be handled (§III-D).

Application developers must adopt the EDA paradigmwhen developing EDA-based software. Developers define theset of possible Event Loop Events for which they supply aset of Callbacks [60]. During the execution of an EDA-basedapplication, events are generated by external activity or internalprocessing and routed to the appropriate Callbacks.

We refer to the Event Loop and the Workers as EventHandlers. Events for the Event Loop are supplied by theoperating system or the framework, while events for theWorker Pool (i.e. Tasks) are supplied by the Event Loop. Wesay that the Event Loop “executes a Callback” in response toan event, and that a Worker “performs a Task”. A Callbackis executed synchronously on the Event Loop, and a Task isperformed synchronously on a Worker, asynchronously fromthe Event Loop’s perspective.

In the EDA, each Callback and Task is guaranteed atom-icity: once scheduled, it runs to completion. Because there arerelatively few Event Handlers, cooperative multitasking [89]is a must. Developers therefore partition the generation ofresponses into multiple stages (forming its Lifeline [38], aDAG) and regularly yield to other requests by deferringwork until the next stage [52]. Unfortunately, in Figure 1 anattacker B carried out an EHP attack, defeating the cooperatingmultitasking and preventing other requests from being handled.

B. Node.js among other EDA frameworks

An EDA approach can be (and has been) implemented inmost programming languages. Examples of EDA frameworksinclude Node.js (JavaScript) [17], libuv (C/C++) [11], Event-Machine (Ruby) [5], Vert.x (Java) [28], Zotonic (Erlang) [31],

3“[P]oll...shall identify those file descriptors on which an application canread or write data, or on which certain events have occurred” [35].

2

Page 3: A Sense of Time for Node.js: Timeouts as a Cure for Event ...people.cs.vt.edu/davisjam/downloads/blog/ehp/... · the Event-Driven Architecture (EDA) championed by Node.js1. Although

Twisted (Python4) [27], and Microsoft’s P# [55]. These frame-works have been used to build a wide variety of industry andopen-source services, at companies including LinkedIn [78],PayPal [66], eBay [82], and IBM [68], and for projects likeUbuntu One [33], Apple’s Calendar and Contacts Server [32],and IoT frameworks [9], [4].

We selected Node.js for further investigation based on itscommunity interest and maturity. To measure this, we surveyedeach framework’s GitHub [7] repository for the number ofcontributors and forks (community interest), and the numberof lifetime and recent commits (maturity). As of November 14,2017, Node.js had the most community interest (1,782 contrib-utors against libuv’s 269; 8,775 forks against Vert.x’s 1,177),and strong maturity (19,832 lifetime commits to Twisted’s22,161; 2,382 commits in the last six months against Twisted’s1,106). Thus, we selected Node.js (JavaScript) for closerinvestigation, though as described in §III, EHP attacks applyto any EDA-based service.

Node.js is a server-side event-driven framework forJavaScript that uses the architecture shown in Figure 1. Node.jsexecutes an application’s JavaScript using Google’s V8 en-gine [3], and offers a set of core JavaScript modules withC++ bindings that implement system calls using libuv [11].Introduced in 2009, Node.js claimed 3.5 million users in2016 [34], and as of August 2017 its open-source packageecosystem, npm, is the largest of any programming languageor framework [13], with 550,000 modules and counting.

C. Algorithmic complexity attacks

Our work is inspired by Algorithmic Complexity (AC)attacks, a form of Denial of Service attack ([76], [51]). In anAC attack, attackers force victims to take longer than expectedto perform an operation by incurring the worst-case complexityof the victim’s algorithms. Rather than overwhelming thevictim with volume, the attacker turns the victim’s vulnerablealgorithms against him. AC vulnerabilities can therefore beexploited to launch DoS attacks, when the attacker forces thevictim to spend a long time processing his request at theexpense of legitimate clients, or when the attacker increasesthe cost even of legitimate requests.

Well-known examples of AC attacks include attacks onhash tables, whose worst-case O(N) behavior can be triggeredby attackers [51], and Regular expression Denial of Service(ReDoS) attacks, which exploit the worst-case exponential-time behavior typical of regular expression (regexp) engines([50], [88], [93], [70], [86], [95]).

As will be made clear in §III, our work is not simply theapplication of AC attacks to the EDA. The first distinctionis a matter of perspective: where AC attacks focus on thealgorithms the service employs, EHP attacks target the soft-ware architecture used by the service. The second distinctionis one of definition: AC attacks rely on CPU-bound activitiesto achieve DoS, while EHP attacks may use either CPU-boundor I/O-bound activities to poison an Event Handler.

4In addition, Python 3.4 introduced native EDA support.

1 const cb = (req, resp) => {2 let f = requestedFile(req);3 if (f.match(/(\/.+)+$/)) { // ReDoS4 FS.readFile(f, (err, d) => { // ReadDoS5 resp.end(’Finished read’);6 });7 }8 else { resp.end(’Invalid file’); }9 }

10 const srv = _HTTP.createServer(cb).listen(3000);

Fig. 2. Gist of our simple server vulnerable to EHP attacks. Line 3 isvulnerable to ReDoS. Line 4 is vulnerable to ReadDoS.

III. EVENT HANDLER POISONING ATTACKS

In this section we provide our threat model (§III-A),define Event Handler Poisoning (EHP) attacks more formally(§III-B), and present a taxonomy for EHP attacks (§III-C).To illustrate the problem, we demonstrate simple EHP attacksusing ReDoS and ReadDoS (§III-D).

A. Threat model

Our threat model is straightforward: the attacker crafts evilinput that triggers an expensive operation, and convinces theEDA-based victim to feed it to a Vulnerable API (§III-C).An ideal EHP attack is asymmetric, such that a small evilinput triggers a large amount of work on the part of thepoisoned Event Handler. Depending on how much work theevil input triggers, the attacker can achieve DoS outright, orcan repeatedly apply poison in a Slow DoS attack [44]. Serversare in general vulnerable to DDoS attacks, whether OTPCAor EDA, and we exclude these from our model.

This model is reasonable because our victim is an EDA-based server; servers exist solely to process client input usingAPIs. If the server uses a Vulnerable API, and if it invokesit with unsanitized client input, the server can be poisoned.In §III-C we show examples of Vulnerable APIs in Node.jsand other EDA frameworks, at the language, framework, andapplication levels.

B. What happens in an EHP attack?

Total/synchronous complexity and time. Before we candefine EHP attacks, we must introduce a few definitions. First,recall the standard EDA formulation illustrated in Figure 1.Each client request is the initial event of a Lifeline [38]partitioned into one or more events and Tasks handled by theappropriate Event Handler. A Lifeline is thus a graph of eventCallbacks and Tasks, whose vertices represent a Callback orTask and whose edges indicate events or Task submission.

To illustrate a Lifeline, consider the server in Figure 2. Thearrival of an HTTP request on port 3000 is the initial event.The corresponding Callback cb (lines 1-9) is handled by theEvent Loop, which synchronously invokes an application-levelAPI (line 2) and a language-level regexp API (line 3), and thencreates another vertex in the Lifeline graph by submitting anasynchronous file system request. The Worker Pool handlesthis Task, and on completion it adds another vertex to theLifeline by emitting a “Task Complete” event to the EventLoop (Callback defined on lines 4-6).

3

Page 4: A Sense of Time for Node.js: Timeouts as a Cure for Event ...people.cs.vt.edu/davisjam/downloads/blog/ehp/... · the Event-Driven Architecture (EDA) championed by Node.js1. Although

We define the total complexity of a Lifeline as the com-bined complexity (cost) of each vertex (Callbacks and Tasks)as a function of the cumulative input to all vertices. Thesynchronous complexity of a Lifeline is the greatest individualcomplexity among all vertices in the Lifeline. Similarly, wedefine a Lifeline’s total time and synchronous time as thecumulative and greatest-individual time taken, respectively,by the vertices in the Lifeline. Since the EDA relies oncooperative multitasking, a Lifeline’s synchronous complexityand synchronous time provide theoretical and practical upperbounds on how effective an EHP attack against it might be. Wealso refer to the synchronous complexity and time of individualCallbacks and Tasks, defined as the individual complexity andtime of the corresponding Lifeline vertex.

EHP attacks. Remember that each Callback and Task ishandled in its turn by one of the Event Handlers, and thatthe Event Handlers are not dedicated one to each client, butshared among all of the clients. This sharing means that theresponsiveness of the server to its clients depends not on OSpreemption, but on cooperative multitasking; each Callbackand Task must be handled quickly by its Event Handler togive Callbacks and Tasks from other clients a turn.

An EHP attack exploits this architecture by poisoning oneor more Event Handlers with evil input. After identifying a vul-nerable Lifeline, one with a large worst-case synchronous com-plexity, an attacker submits evil input that triggers this worst-case complexity. The Lifeline’s large synchronous complexityleads to large synchronous time, causing the affected EventHandler to block, perhaps indefinitely, while handling theexpensive event or Task. While an Event Handler is blocked,it will not continue to process other events or Tasks, and theLifelines of pending events and Tasks will starve. EHP attacksthus exploit a fundamental property of the EDA, namely theneed to correctly implement fair cooperative multitasking.

An EHP attack can be carried out against either the EventLoop or the Workers in the Worker Pool. If the Event Loop ispoisoned, neither pending nor future events will be handled. Onthe other hand, for each poisoned Worker the responsiveness ofthe Worker Pool will degrade, until every Worker is poisonedand all pending and future requests whose Lifelines submit aTask to the Worker Pool will receive no responses. Thus, anattacker’s aim is to poison either the Event Loop or enough ofthe Worker Pool to harm the throughput of the server.

You might be surprised that we name the Worker Pool asa potential target for EHP attacks. Remember, however, thatin the EDA the Worker Pool is deliberately quite small, aspart of the EDA’s low-overhead philosophy. For example, ina Node.js process the Worker Pool defaults to 4 Workers,with a maximum of 128 Workers [20]. Though a smallWorker Pool is in keeping with the spirit of the EDA, its sizemakes it a realistic target. In contrast, though typical OTPCA-based servers also maintain a pool of threads (Workers) tobound the total number of concurrent connections, this pool iscomparatively enormous. Apache’s httpd server has a defaultconnection pool of size 256-400, with a maximum around20,000 [61]. Due to the size of an OTPCA server’s “WorkerPool”, attempts to poison it would presumably be detectedby an IDS or addressed through rate limiting before theypoisoned enough of the pool to have an effect. By comparison,the small number of requests needed to poison a vulnerable

Vuln. APIs Event Loop (§VII-B) Worker Pool (§VII-A)CPU-bound I/O-bound CPU-bound I/O-bound

Language Regexp, JSON N/A N/A N/AFramework Crypto, zlib FS Crypto, zlib FS, DNSApplication while(1) DB query Regexp [15] DB query

Table I. Taxonomy of Vulnerable APIs in Node.js, with examples. AnEHP attack through a Vulnerable API poisons the Event Loop or a Worker,

and its synchronous time is due to CPU-bound or I/O-bound activity. AVulnerable API might be part of the language, framework, or application,and might be largely synchronous (Event Loop) or asynchronous (Worker

Pool). zlib is the Node.js compression library. N/A: JavaScript has no nativeWorker Pool nor any I/O APIs.

EDA server’s Worker Pool is too few to draw attention fromnetwork-level defenses, placing the burden of defense on thelanguage, framework, and application developers — or, as wepropose in §VI-D, the runtime system.

We believe EHP attacks on the EDA have not previouslybeen explored because the EDA has only recently seen popularadoption by the server-side community. On the client side EHPattacks are not a concern, as misbehaving users will hurt onlythemselves. On the server side, the stakes are rather higher,especially as the EDA is adopted in safety-critical settings likeIoT controllers [9] and robotics [4].

C. What makes an API Vulnerable?

Thus far we have relied on the intuitive notion of aVulnerable API. We turn now to a more precise taxonomy.Using the terms from §III-B, a Vulnerable API begins aLifeline that can have large synchronous time, due either toits large synchronous complexity or to the size of the input.A Vulnerable API poisons the Event Loop or a Worker. Itssynchronous time can be attributed to heavy computation orthe need for I/O. Note that an API that creates a Lifelinewith large total time is not vulnerable so long as each vertex(Callback/Task) has a small synchronous time; it is largesynchronous time that leads to EHP.

Table I classifies Vulnerable APIs along two axes. Alongthe first axis, a Vulnerable API affects one of the two types ofEvent Handlers, and it might be CPU-bound or I/O-bound.Along the second axis, a Vulnerable API can be found inone of three places: the language, the framework, or theapplication. Table I lists examples of Vulnerable APIs forapplications written in JavaScript for the Node.js framework.In our evaluation we provide an exhaustive list of VulnerableAPIs at the language and framework levels (§VIII-A).

Although the framework component of Table I is specific toNode.js, the same general classification can be applied to otherEDA frameworks. For example, most mainstream languages(Java, Python, Perl, Ruby, etc.) support regular expressionsand are vulnerable to ReDoS; and unless care is taken, filesystem or DNS requests to slow resources will poison EventHandlers in many EDA frameworks.

D. Two simple EHP attacks

To illustrate EHP attacks, we developed a small vulnerablefile server. a simplified version is shown in Figure 2. Thisserver has two EHP vulnerabilities. First, it sanitizes input witha regexp vulnerable to ReDoS. This regular expression ensuresthat paths are /-delimited, but the inclusive “.” will also matcha “/”. On the evil input “/.../\n” (50 /’s and a newline that “.”

4

Page 5: A Sense of Time for Node.js: Timeouts as a Cure for Event ...people.cs.vt.edu/davisjam/downloads/blog/ehp/... · the Event-Driven Architecture (EDA) championed by Node.js1. Although

does not match), this input triggers exponential-time behaviorin Node.js’s regexp engine while it backtracks in search of adivision of the /’s that will match, poisoning the Event Loop ina CPU-bound EHP attack and blocking all subsequent requests.

The second EHP vulnerability might be more surprising.While the sanitization regexp fulfills its primary purpose, pre-venting readFile from throwing a “malformed path” error, itpermits a client to read an arbitrary file — a directory traversalvulnerability. In the EDA, directory traversal vulnerabilitiescan be parlayed into I/O-bound EHP attacks, “ReadDoS”,provided the attacker can identify a Slow File5 or a LargeFile from which to read. Since line 4 uses the asynchronousframework API FS.readFile, each ReadDoS attack on thisserver will poison a Worker.

In Figure 3 you can see the baseline server throughput(circles) and the throughput while undergoing two attacks(ReDoS in triangles, ReadDoS in squares), averaged over fiveruns. In each condition we configured a Worker Pool of size4, and used the ab benchmark [1] to make 2500 legitimaterequests per second divided among 80 clients. We measuredthe instantaneous throughput of the server at 200 ms intervals.

Exploiting the ReDoS vulnerability has an immediate ef-fect. After 1 second the attacker submits the evil input, poi-soning the Event Loop and blocking all subsequent requests.In contrast, the ReadDoS attack only becomes noticeable afterthree evil requests because the Worker Pool is not saturatedin this example. At seconds 1, 2, 3, and 4 an attacker submitsone read request to a Slow File. Each evil input poisons oneof the four Workers. The throughput dips after three maliciousrequests, as the legitimate requests saturate the sole survivingWorker. The fourth evil input entirely poisons the Worker Pool,flatlining the throughput.

The architecture-level behavior of the ReDoS attack, anEHP attack on the Event Loop, is illustrated in Figure 1. Afterclient A’s benign request is sanitized (CBA1), the file readTask goes to the Worker Pool (TaskA1), and when the readcompletes the Callback returns the file content to A (CBA2).Then client B’s malicious request arrives, and the service’sinput sanitization triggers ReDoS (CBB1).

IV. EVENT HANDLER POISONING ATTACKS IN THE WILD

With a definition of EHP attacks in hand, we now assessthe presence of EHP vulnerabilities in the Node.js ecosystem.We found that the Node.js documentation does not carefullyoutline the need for cooperative multitasking (§IV-A), whichmight contribute to the fact that many popular npm modulesexpose Vulnerable APIs and rely on the application to sanitizeinput (§IV-B), and that 35% of known npm module vulnera-bilities can be used for EHP attacks (§IV-C).

A. What do the Node.js docs say about EHP attacks?

Users in search of Node.js documentation can look in threeplaces. First, there is the JavaScript language specification [14].

5In addition to files exposed on network file systems, /dev/random isa good example of a Slow File. According to the Linux manual, “[R]eadsfrom /dev/random may block...users will usually want to open it innonblocking mode (or perform a read with timeout)” [36]; by default, Node.json Linux does neither.

Fig. 3. This figure shows the effect of evil input on the throughput of asimple server based on Figure 2, using either Baseline Node.js (grey) or ourproposed runtime, Node.cure (black). For ReDoS (triangles), evil input wasinjected after three seconds, poisoning the Baseline Event Loop. For ReadDoS(circles), evil input was injected four times at one second intervals beginningafter three seconds, eventually poisoning the Baseline Worker Pool. The linesfor Node.cure shows its effectiveness against these EHP attacks. Node.cure’sthroughput on ReDoS dips until the first Callback times out (§VII-B), afterwhich it catches up to pending requests from clients. Because the WorkerPool is not saturated, Node.cure does not suffer while it waits for the timeoutof the first Task containing a ReadDoS request for /dev/random (§VII-A),and marks /dev/random’s inode Slow. Subsequent ReadDoS requests time outimmediately (§VII-D).

Second, Node.js maintains documentation for the APIs of itscore modules [21]. Third, the nodejs.org website offers a setof guides for new developers [19].

Although there are potential EHP vulnerabilities in animplementation of JavaScript, like ReDoS or large JSONmanipulation, language specifications do not always discusstheir risks. For example, in MDN’s documentation only evalhas warnings.

The Node.js documentation does not indicate any vulner-abilities in its JavaScript engine. However, it does give hintsabout avoiding EHP attacks on its framework APIs. Node.jsAPIs include several modules with Vulnerable APIs, and de-velopers are warned away from the synchronous versions. Theasynchronous versions, which could poison the Worker Pool,bear the warning: “[These] APIs use libuv’s threadpool, whichcan have surprising and negative performance implications forsome applications,” and refer to another page reading “Becauselibuv’s threadpool has a fixed size, it means that if for whateverreason any of these APIs takes a long time, other (seeminglyunrelated) APIs that run in libuv’s threadpool will experiencedegraded performance.”

Missing from the Node.js new developer guides is a discus-sion about how to design a Node.js application to avoid EHPvulnerabilities. Later we describe the guide we have preparedon these topics (§IX); it has been preliminarily approved forplacement on nodejs.org.

B. Do npm modules have Vulnerable APIs?

In December 2016 we examined 80 npm modules todetermine how they handle potentially computationally expen-sive operations, i.e. application-level CPU-bound potentially-Vulnerable APIs (Table I). We manually inspected both theirdocumentation and their implementation. Using npm’s “Bestoverall” sorting scheme, we examined the first 20 (unique)modules returned by npm searches for string, string

5

Page 6: A Sense of Time for Node.js: Timeouts as a Cure for Event ...people.cs.vt.edu/davisjam/downloads/blog/ehp/... · the Event-Driven Architecture (EDA) championed by Node.js1. Although

manipulation, math, and algorithm, in hopes of strikinga balance between popularity and potential complexity. Thesemodules were downloaded over 200 million times that month.

We found that many of these modules had VulnerableAPIs, performing operations with high synchronous complex-ity and thus leaving callers vulnerable to EHP attack. Althoughknowing the synchronous complexity of an API is criticalto preventing EHP, the documentation for the majority ofthe APIs did not state their running times. Nearly all of thehundreds of APIs we examined executed synchronously onthe Event Loop; though two were asynchronous, one of thesesynchronously executed an exponential-time algorithm beforeyielding control. During our analysis of the source code weidentified algorithms with complexities including O(1), O(n),O(n2), O(n3), and O(2n).

In summary, in our survey of 80 popular npm moduleswe found examples of Vulnerable APIs in the majority of themodules. Module and application developers who use theseAPIs must weigh the risks of calling these APIs, but this isdifficult without knowing the synchronous complexity of theAPI (usually undocumented) or studying its implementation tolearn what input might be evil (impractical).

Given the prevalence of Vulnerable APIs in these modules,we did not open bug reports against each of them. We believethe problem of EHP attacks is systemic in the Node.js commu-nity. Rather than pursuing ad hoc bug reports and patches, wetook fundamental steps to improve the state of affairs, with adetailed guide for new developers (§IX) and a comprehensivedefense against EHP in the form of Node.cure (§VII).

C. Are there examples of EHP vulnerabilities?

We identified two databases describing vulnerabilities innpm modules, one maintained by the Node Security Platform(NSP) [16] and the other by Snyk.io [25]. Our evaluationof the first several hundred vulnerabilities in each suggestedthat NSP’s database (355 entries) is a subset of Snyk.io’sdatabase (713 entries), so we performed a detailed analysisonly of Snyk.io’s database. In this section we summarizethe vulnerabilities in the Snyk.io database and analyze themthrough the lens of EHP attacks.

First, we divided the vulnerabilities by category. Each rawSnyk.io database entry includes a title and description, andsome have additional material like CWE labels and links toGitHub patches. Since the titles indicate the high-level exploit,e.g. “Directory Traversal in servergmf...”, we sorted thenpm vulnerabilities into 17 common categories and one Othercategory for the remaining 40. Figure 4 shows the distribution,absorbing categories with fewer than 10 vulnerabilities intoOther. A high-level CWE number is given next to each class.

Then, we studied each vulnerability to see if it could beemployed in an EHP attack. The dark bars in Figure 4 indicatethe 186 vulnerabilities (35%) that could be used to launch anEHP under our threat model (§III-A). The 145 EHP-relevantDirectory Traversal vulnerabilities are exploitable because theyallow arbitrary file reads, which can poison the Worker Poolthrough ReadDoS. The 32 EHP-relevant Denial of Servicevulnerabilities poison the Event Loop; 28 are ReDoS, 3 leadto infinite loops, and 1 allows an attacker to trigger worst-case algorithm behavior in input processing. In Other are 9

Fig. 4. Classification of the 713 npm module vulnerabilities, by categoryand by usefulness in EHP attacks. Vulnerabilities were iteratively categorizedusing regexp. We analyzed the contents of the database as of 8 August 2017.

Arbitrary File Write vulnerabilities that, like ReadDoS, can beused for EHP attacks by writing to Slow Files.

V. WHAT DOES IT MEAN TO BE EHP-SAFE?

Having defined the EDA (§II), explained EHP attacks(§III), and surveyed their reality (§IV), we now ponder whatit means to be EHP-Safe. The fundamental cause of EHPvulnerabilities should be clear: EHP vulnerabilities stem fromVulnerable APIs that fail to honor cooperative multitasking.

As categorized in Table I, Vulnerable APIs can poison theEvent Loop or the Worker Pool, they can be CPU-bound orI/O-bound, and they occur at any level of the application stack:language (vulnerable constructs), framework (vulnerable coremodules), or application (vulnerable code in the application orthe library modules on which it depends).

To be EHP-safe, an application must use APIs whosesynchronous time is bounded. If an application cannot boundthe synchronous time of its APIs, it is vulnerable to EHPattack; conversely, if an application can bound the synchronoustime of its APIs, then it is EHP-safe. This application-levelguarantee must span the application’s full stack: the languageconstructs it uses, the framework APIs it invokes, and theapplication code itself.

In §VI, we outline several approaches to EHP-safety. Howshall we choose between them? We suggest three criteria:

1) General. A developer must be able to use existing APIs orslight modifications thereof. This ensures that applicationsare general, that they can solve meaningful problems.

2) Developer-friendly. A developer should not need expertisein topics outside of their application domain.

3) Composable. If a developer uses only EHP-safe APIs, theyshould be unable to create a Vulnerable API. In a com-posably EHP-safe system, an API containing while(1);would not be Vulnerable, because it is composed onlyof non-Vulnerable “language APIs”. In contrast, a non-composably EHP-safe system has only ad hoc EHP safety,because in addition to making language- and framework-level APIs non-Vulnerable, application-level APIs com-posed of them must also be assessed.

VI. ANTIDOTES FOR EVENT HANDLER POISONING

Here we discuss four schemes by which to guarantee EHP-safety: removing Vulnerable APIs (§VI-A), only calling Vul-nerable APIs with safe input (§VI-B), refactoring Vulnerable

6

Page 7: A Sense of Time for Node.js: Timeouts as a Cure for Event ...people.cs.vt.edu/davisjam/downloads/blog/ehp/... · the Event-Driven Architecture (EDA) championed by Node.js1. Although

APIs into safe ones (§VI-C), and strictly bounding an EventHandler’s synchronous time at runtime (§VI-D). We evaluatethese schemes using the criteria from §V.

A. Remove calls to Vulnerable APIs: The Blacklist Approach

To become EHP-safe, an application might eliminate callsto Vulnerable APIs. Because this Blacklist Approach fails thegenerality criterion, it does not merit further discussion.

B. Sanitize input to Vulnerable APIs: The Sanitary Approach

Many Vulnerable APIs are only problematic if they arecalled with unsafe input. Another approach to EHP-safety,then, is the Sanitary Approach: ensure that every VulnerableAPI is always called with a safe input. The Sanitary Ap-proach is developer-unfriendly, because it requires developersto identify evil input for the APIs they call. For example,Vulnerable APIs like regexp are themselves commonly used toidentify invalid input. If the sanitization regexps are vulnerableto ReDoS, quis custodiet ipsos custodes?

C. Refactor Vulnerable APIs: The C-WCET Principle

A more promising path to EHP-safety is to refactor Vul-nerable APIs into safe ones. Since the goal of EHP-safety is tobound the synchronous time of a Vulnerable API, a good ruleof thumb might be to restrict the cost of every Callback/Taskvertex in a Lifeline to Constant Worst-Case Execution Time(C-WCET) by partitioning the generation of a response into(ideally constant-time) chunks wherever possible. Such a C-WCET partitioning would bound the synchronous complexityof the Lifeline rooted at this Vulnerable API for all inputs.If all Vulnerable APIs were replaced with APIs followingthe C-WCET Principle, the application would be EHP-safe.The C-WCET Principle is the logical extreme of cooperativemultitasking, extending input sanitization from the realm ofdata into the realm of the algorithms used to process it.

Evaluating the C-WCET Principle. The C-WCET Prin-ciple has two benefits. First, by bounding the synchronouscomplexity of all Callbacks and Tasks, every client requestcould be handled whether or not its input was evil. Second,the C-WCET Principle can be applied without changing thelanguage specification. For example, as has previously beenproposed [49], a non-exponential-time regexp engine couldbe used in place of Node.js’s exponential-time Irregexpengine [48], and the application would be none the wiser.

The C-WCET Principle meets the generality criterion;every API can be partitioned towards C-WCET. It is alsosomewhat developer-friendly: language- and framework-levelAPIs can be C-WCET partitioned by framework developers,leaving developers responsible for C-WCET partitioning theirown APIs. However, since application developers rely heav-ily on third-party npm modules that might not have beenC-WCET-partitioned, they might be called on to C-WCETpartition modules outside their own expertise. Critically, likethe Blacklist and Sanitary Approaches, the C-WCET Principleis not composable. Since Vulnerable APIs can always becomposed from non-Vulnerable APIs6, developers would need

6As noted in §V, an API consisting of a while(1); loop is Vulnerablein the Blacklist, Sanitary, and C-WCET Approaches, even though it only callsthe constant-time “language APIs” while(1) and ;.

to take responsibility for applying the C-WCET Principle toevery API at the application and library levels.

Nevertheless, because the C-WCET Principle has attractivebenefits, we developed a Node.C-WCET prototype to evaluateits practicality at the language and framework levels. Node.C-WCET aimed to defeat ReDoS (language-level) and ReadDoS(framework-level) EHP attacks using the C-WCET Princi-ple. To combat ReDoS, wherever possible Node.C-WCETuses the linear-time RE2 regexp engine [22] instead of V8’sexponential-time regexp engine. And to combat ReadDoS, wereplaced the file system read API’s synchronous-on-a-Workerimplementation with truly asynchronous kernel AIO7 so thata Worker would not be poisoned by a Slow File, and furtherensured that all reads of Large Files are partitioned. Due tospace limitations we omit a detailed description of Node.C-WCET and our experiments.

Ultimately, we deemed Node.C-WCET a failure, and re-jected the C-WCET Principle as a solution to EHP attacks.From an engineering perspective, we found that Node.C-WCET introduced significant overheads (e.g. Linux kernel AIOrequires O_DIRECT), and that the effort to understand andfully refactor all language- and framework-level VulnerableAPIs would be considerable. But more importantly, from asecurity perspective Node.C-WCET remained unsound in twoways. First, Node.C-WCET was still vulnerable to real ReDoSand ReadDoS attacks from known vulnerabilities, becausenot all regexps can be evaluated in linear time, and Linux’skernel AIO does not support device files like the Slow File/dev/random. Second, the C-WCET Principle is ad hoc, notcomposable: even if Node.C-WCET partitioned all lower-levelVulnerable APIs, application-level APIs composed of theseAPIs could still be Vulnerable.

D. Dynamically bound running time: The Timeout Approach

The goal of the C-WCET Principle was to bound an API’ssynchronous complexity as a way to bound its synchronoustime. Instead of statically bounding an API’s synchronouscomplexity, what if we could dynamically bound its syn-chronous time? Then the complexity of each Callback andTask would become irrelevant, because it would be unableto take more than the quantum provided it by the runtime. Inthis Timeout Approach, the runtime detects and aborts long-running Callbacks and Tasks by emitting a TimeoutError,thrown from Callbacks and returned from Tasks.

The Timeout Approach provides EHP-safety. Like the C-WCET Principle, the Timeout Approach is general. It ismore developer-friendly because it only requires developersto handle TimeoutErrors in their own APIs. And wherethe C-WCET Principle was ad hoc, the Timeout Approachis composable. If the EDA runtime had a sense of time, thena long-running Callback/Task could be interrupted no matterwhere it was, whether in the language, the framework, orthe application. An application that calls only APIs from thelanguage and the framework would always be EHP-safe.

Although the Timeout Approach fulfills all of theEHP-safety criteria, like C-WCET Partitioning it requires

7Unlike Posix asynchronous I/O, which is implemented by libc usinga threadpool, true Linux kernel AIO via io_submit gives the kernelresponsibility for managing the I/O.

7

Page 8: A Sense of Time for Node.js: Timeouts as a Cure for Event ...people.cs.vt.edu/davisjam/downloads/blog/ehp/... · the Event-Driven Architecture (EDA) championed by Node.js1. Although

application-level refactoring. Because the TimeoutErrorsemitted by the runtime are not part of the language andframework specification, existing applications would need tobe ported. We do not anticipate this to be a heavy burden forwell-written applications, which should already be exception-aware to catch errors like buffer overflow (RangeError) andnull pointer exceptions (ReferenceError), and which shouldtest the result of a Task for errors.

Though the C-WCET Principle and the Timeout Approachboth meet the generality criterion, the Timeout Approach ismore developer-friendly, and is the only EDA-centric solutionwe analyzed that is composable8. In §VII we describe ourNode.cure prototype, which applies the Timeout Approach toall levels of the Node.js stack.

VII. NODE.CURE: TIMEOUTS FOR NODE.JS

Our Node.cure prototype implements the Timeout Ap-proach for Node.js by bounding the synchronous time of everyCallback and Task. To do so, Node.cure makes timeouts afirst-class member of Node.js, extending the JavaScript speci-fication by introducing a TimeoutError to abort long-runningTasks and Callbacks. A long-running Task poisons its Worker.Such a Task is aborted and fulfilled with a TimeoutError(§VII-A). A long-running Callback poisons the Event Loop;in Node.cure these throw a TimeoutError whether they arein JavaScript or in a Node.js C++ binding (§VII-B). As aresult, Node.cure meets the composability criterion of §V: theNode.js language and framework APIs are timeout-aware, andan application whose APIs are composed of these APIs will beEHP-safe (§VII-C). Node.cure also has an exploratory SlowResource Policy to avoid unnecessary timeouts (§VII-D).

We describe the Timeout Approach for Tasks first, becauseit introduces machinery used to apply the Timeout Approachto Callbacks.

A. Timeout-aware Tasks

Node.cure defends against all EHP attacks that target theWorker Pool by bounding the synchronous time of Tasks. Tomake Tasks timeout-aware, we modified both the Worker Pooland the Node.js APIs that use it.

Timeout-aware Worker Pool. Asynchronous VulnerableAPIs submit long-running Tasks that can poison a Worker onthe Worker Pool. We modified the libuv Worker Pool to betimeout-aware, replacing libuv’s Workers with Executors thatcombine a permanent Manager with a disposable Worker. If aTask times out, the Executor creates a new Worker to resumehandling Tasks and creates a Hangman to pthread_cancelthe poisoned Worker. These roles are illustrated in Figure 5.

Here is a slightly more detailed description of our timeout-aware Worker Pool. Node.js’s Worker Pool is implementedin libuv, exposing a Task submission API uv_queue_work,which we extended as shown in Table II. The original Workersof the Worker Pool would simply invoke work and then notifythe Event Loop to invoke done. This is also the behavior of ourtimeout-aware Workers on the “good path”. When a Task takestoo long, however, the potentially-poisoned Worker’s Manager

8We discuss other approaches to EHP-safety in §IX.

Fig. 5. This figure illustrates Node.cure’s timeout-aware Worker Pool,including the roles of Event Loop, Executors (Worker Pool and Priority),and Hangman. Grey entities were present in the original Worker Pool, blackare new. The Event Loop can synchronously access the Priority Executor, orasynchronously offload Tasks to the Worker Pool. If an Executor’s Managersees its Worker time out, it creates a replacement Worker and passes theDangling Worker to a Hangman.

Callback Descriptionvoid work Perform Task.

int timed out* When Task has timed out. Can request extension.void done When Task is done. Special error code for timeout.

void killed* When a timed out Task’s thread has been killed.

Table II. Summary of the Worker Pool API. work is invoked on theWorker. done is invoked on the Event Loop. The new callbacks,timed_out and killed, are invoked on the Manager and the

Hangman, respectively. On a timeout, work, timed_out, and doneare invoked, in that order; there is no ordering between the done and

killed callbacks, which sometimes requires reference counting for safememory cleanup. *New callbacks.

invokes the Task’s timed_out callback. If the submitter doesnot request an extension, the Manager creates a replacementWorker so that it can continue to process subsequent Tasks,creates a Hangman thread for the poisoned Worker, andnotifies the Event Loop that the Task is complete. The EventLoop then invokes its done Callback with a TimeoutError.Concurrently, once the Hangman successfully kills the Workerthread, it invokes the Task’s killed callback and returns.

One difficulty with concurrently creating a replacementWorker and assigning the poisoned Worker to a Hangman isthat the poisoned Worker may not have been canceled beforeits TimeoutError is delivered to the Event Loop. Such aDangling Worker may be in a system call like write thatmodifies externally-visible state. It is therefore critical that if atimeout occurs, a Dangling Worker cannot corrupt shared state.In general this can be dealt with using transactional techniqueslike undo-logging or write buffering, but we investigate asimpler alternative for the file system in §VII-D. In additionto external shared state like the file system, memory safety isalso a consideration for Dangling Workers because the Node.jsWorker Pool is implemented using threads. Workers cannot usememory allocated by the Task submitter because a DanglingWorker may exist after its Task’s done is invoked.

Although the new uv_queue_work API is more complexthan the previous one, only experts like framework developerswill need to use it. This and other libuv-level changes are onlyvisible to the average application-level developer in the formof TimeoutErrors.

Timeout-aware Worker Pool Customers. The Node.jsAPIs affected by this change (i.e. that create Tasks) are fromthe file system, DNS, encryption, and compression modules.

Node.js implements its file system and DNS APIs byrelying on libuv’s file system and DNS support, which on

8

Page 9: A Sense of Time for Node.js: Timeouts as a Cure for Event ...people.cs.vt.edu/davisjam/downloads/blog/ehp/... · the Event-Driven Architecture (EDA) championed by Node.js1. Although

Linux make the appropriate calls to libc. When asynchronousfile system and DNS calls time out in libc, we allow theManager to dispose of its Dangling Worker. We modified thelibuv file system and DNS implementations to use bufferedmemory for memory safety of Dangling Workers. We usedpthread_setcancelstate to ensure that threads will not becanceled while in a non-cancelable libc API [8]. Although thisrequires the Hangman to await the completion of the blockedAPI, the Manager concurrently replaces its poisoned Worker.

It remains to handle Node.js’s asynchronous encryption andcompression APIs, implemented in the Node.js C++ bindingsby invoking APIs from openssl and zlib, respectively. Ifthe Worker Pool notifies these APIs of a timeout, they allowthe Task to be canceled and wait for it to be killed beforereturning, to ensure that the Worker is no longer modifyingstate in these “black-box” libraries nor accessing memory thatmight be released after done is invoked.

Finally, when the Event Loop sees that a Task timed out,it invokes the application’s Callback with a TimeoutError.

B. Timeouts for Callbacks

Node.cure defends against EHP attacks that target theEvent Loop by bounding the synchronous time of Callbacks.To make Callbacks timeout-aware, we introduce a Timeout-Watchdog that monitors the start and end of each Callbackand ensures that no Callback exceeds the timeout threshold.We time out JavaScript instructions using V8’s interrupt mech-anism (§VII-B1), and we modify Node.js’s C++ bindings toensure that Callbacks that enter these bindings will also betimed out (§VII-B2).

1) Timeouts for JavaScript:

TimeoutWatchdog. Our TimeoutWatchdog instrumentsevery Callback using the experimental Node.js async-hooksmodule [18]. These hooks cannot be disabled by an applica-tion.

Before a Callback begins, our TimeoutWatchdog startsa timer. If the Callback completes before the timer expires(“good path”), we erase the timer. If the timer expires, thewatchdog signals V8 to interrupt JavaScript execution bythrowing a TimeoutError. The watchdog then starts anothertimer, ensuring that timeouts while handling the previousTimeoutError(s) are also detected.

Although a precise TimeoutWatchdog can be implementedto carefully honor the timeout threshold, the resulting com-munication overhead between the Event Loop and the Time-outWatchdog can be non-trivial (§VIII-B). A cheaper alter-native is a lazy TimeoutWatchdog, which simply wakes upat intervals of the timeout threshold and checks whether thesame Callback it last observed is still executing; if so, itemits a TimeoutError. A lazy TimeoutWatchdog reduces theoverhead of making a Callback, but decreases the precision ofthe TimeoutError threshold: it can only guarantee to catchCallbacks that take at least twice the threshold.

V8 interrupts. To handle the TimeoutWatchdog’s requestfor a TimeoutError, Node.cure extends the interrupt in-frastructure of Node.js’s V8 JavaScript engine [3] to supporttimeouts. In V8, low priority interrupts like a pending garbage

collection are checked regularly (e.g. each loop iteration, func-tion call, etc.), but no earlier than after the current JavaScriptinstruction finishes. High priority interrupts take effect “imme-diately”, and can interrupt a running JavaScript instruction. Theonly high priority interrupt is TerminateExecution, whichthrows an uncatchable termination exception. Timeouts requirethe use of a high priority interrupt because they must be ableto interrupt long-running individual JavaScript instructions likestr.match(regexp) (possible ReDoS).

To support a TimeoutError, we modified V8 as follows:(1) We added the definition of a TimeoutError into theError class hierarchy; (2) We added a TimeoutInterruptinto the list of high-priority interrupts; and (3) We exposed anAPI by which Node.js can trigger a TimeoutInterrupt. TheTimeoutWatchdog calls this API, which interrupts the currentJavaScript stack by throwing a TimeoutError.

The only JavaScript instructions that V8 instruments to beinterruptible are regexp matching and JSON parsing; these arethe language-level Vulnerable APIs. Other JavaScript instruc-tions are treated as effectively constant-time, so these interruptsare evaluated e.g. at the end of basic blocks. We agreed withthe V8 developers in this9, and did not attempt to instrumentother JavaScript instructions to poll for pending interrupts.

2) Timeouts for the Node.js C++ bindings:

The TimeoutWatchdog described in §VII-B1 will interruptany Vulnerable APIs implemented in JavaScript, includinglanguage-level APIs like regexp and application-level APIs thatcontain blocking code like while(1);. It remains to give asense of time to the Node.js C++ bindings that allow Node.jsapplications to interface with the broader world.

Asynchronous C++ bindings are already timeout aware(§VII-A), because these perform a fixed amount of work tolaunch a Task and then return. However, Node.js also hassynchronous C++ bindings that complete the entire operationbefore returning. Node.js includes an asynchronous API for thesynchronous operations deemed Vulnerable by the developers,e.g. fs.readSync/fs.read; we made the synchronous ver-sions of these APIs timeout-aware.

In Node.js, the synchronous versions of these APIs executeon the Event Loop, either by calling libuv APIs without acallback (file system), or by invoking the openssl and zlibAPIs directly (encryption and compression). Node.js does notexpose a synchronous DNS API. We reviewed the rest of theNode.js APIs and determined that execSync also appeared tobe Vulnerable. We did not make it timeout-aware because thisAPI is not intended for use in Callbacks in the server context,per our threat model.

Because the Event Loop holds the state of all pend-ing clients, we cannot time it out using pthread_cancel,as this would result in the DoS the attacker desired. An-other alternative is to offload the request to the WorkerPool and await its completion, but this would incur highrequest latencies when the Worker Pool’s queue is not empty.Instead, we extended the Worker Pool paradigm with adedicated Priority Executor whose queue is exposed via a

9For example, we found that string operations complete in millisecondseven when a string is hundreds of MBs long.

9

Page 10: A Sense of Time for Node.js: Timeouts as a Cure for Event ...people.cs.vt.edu/davisjam/downloads/blog/ehp/... · the Event-Driven Architecture (EDA) championed by Node.js1. Although

new API: uv_queue_work_prio. This Executor follows thesame Manager-Worker-Hangman paradigm as the Executorsin Node.cure’s Worker Pool. To make these Vulnerable syn-chronous APIs timeout-aware, we offloaded them to the Prior-ity Executor using the existing asynchronous version, and hadthe Event Loop await the result. Because these synchronousAPIs are performed on the Event Loop as part of a Callback,when using a precise TimeoutWatchdog we propagate the Call-back’s remaining time to this Executor’s Manager to ensurethat the TimeoutWatchdog’s timer is honored.

C. Timeouts for application-level Vulnerable APIs

Let us revisit Node.cure’s composability guarantee. As de-scribed above, Node.cure makes Tasks (§VII-A) and Callbacks(§VII-B) timeout-aware to defeat EHP attacks against languageand framework APIs. The Timeout Approach is composable,so an application whose APIs are composed of calls to theseAPIs will be EHP-safe.

However, applications can still escape the reach of theTimeout Approach by defining their own C++ bindings. Thesebindings would need to be made timeout-aware, followingthe example we set while making Node.js’s Vulnerable C++bindings timeout-aware (file system, DNS, encryption, andcompression). Without refactoring, applications with their ownC++ bindings will not be EHP-safe. In our evaluation we foundthat application-defined C++ bindings are rare (§VIII-C).

D. Exploring policies for slow resources

Our Node.cure runtime detects and aborts long-runningCallbacks and Tasks executing on Node.js’s Event Handlers.For unique evil input this is the best we can do, as accuratelypredicting whether a not-yet-seen input will time out is diffi-cult. If an attacker might re-use the same evil input multipletimes, however, we can track whether or not an input led toa timeout and short-circuit subsequent requests that use thisinput with an early timeout.

While evil input memoization could in principle be appliedto any API, the size of the input space to track is a limitingfactor. The evil inputs that trigger CPU-bound EHP attacks —e.g. ReDoS— exploit properties of the vulnerable algorithmand are thus usually not unique. In contrast, the evil inputsthat trigger I/O-bound EHP attacks like ReadDoS must name aparticularly slow resource, presenting an opportunity to short-circuit requests on this slow resource.

In Node.cure we implemented a slow resource manage-ment policy for libuv’s file system APIs, targeting thosethat reference a single resource (e.g. open, read, close).Node.cure instruments libuv’s open and close to maintaina mapping from file descriptors to inode numbers. When oneof the APIs we manage times out, we mark the associatedinode number as slow, as well as the file descriptor if onewas provided. We took the simple approach of permanentlyblacklisting these aliases, immediately aborting all subsequentaccesses10. This policy is appropriate for the file system, whereaccess times are not likely to change11. For DNS queries,

10To avoid fd leaks, we do not eagerly abort close.11Of course, if the slow resource is in a networked file system like NFS

or GPFS, slowness might be due to a network hiccup, and incorporatingtemporary device-level blacklisting might be more appropriate.

timeouts are likely due to a network hiccup, and a temporaryblacklist might be more appropriate.

Blacklisting slow resources simplifies the management ofDangling Workers that were accessing these resources. Themoment a Worker times out, its Manager marks the resourceit was accessing as slow, and subsequent accesses will be timedout without interacting with the resource. Any modificationsmade by the Dangling Worker are thus invisible to the rest ofthe application.

E. Prototype details

Node.cure is built on top of Node.js v8.8.112, the mostrecent long-term support version of Node.js. Our prototype isfor Linux, and added 4,000 lines of C, C++, and JavaScriptcode across 50 files spanning V8, libuv, the Node.js C++bindings, and the Node.js JavaScript libraries.

With an appropriate choice of timeouts, Node.cure passesall but 42 of the 1,966 Node.js core tests. Six of the failures aredue to rarely-used file system APIs we did not make timeout-aware. The rest are due to interactions between experimentalor deprecated features and the experimental async-hooksmodule on which our TimeoutWatchdog relies, rather thanfundamental limitations of our prototype.

In Node.cure, timeouts for Callbacks and Tasks are con-trolled by separate environment variables. Our implementationwould readily accommodate a fine-grained assignment of time-outs for individual Callbacks and Tasks.

VIII. EVALUATING NODE.CURE

We evaluated Node.cure in terms of its effectiveness(§VIII-A), costs (§VIII-B), and security guarantees (§VIII-C).With a lazy TimeoutWatchdog, Node.cure defeats all knownEHP attacks with low overhead on the “good path”, estimatedat 1.3x using micro-benchmarks and manifesting at 1.0x-1.24xusing real applications. Node.cure offers EHP-safety to allNode.js applications that do not define their own C++ bindings,a rare practice.

All measurements provided in this section were obtainedon an otherwise-idle desktop running Ubuntu 16.04.1 (Linux4.8.0-56-generic), 16GB RAM, Intel i7 @3.60GHz, 4 physicalcores with 2 threads per core. For a baseline we used Node.jsLTS v8.8.1 (dc6bbb44da) from which Node.cure was derived,compiled with the same flags. We ran Node.js with 4 Workers.

A. Effectiveness

To evaluate the effectiveness of Node.cure, we developedan EHP test suite that makes EHP attacks of all types shownin Table I. Our suite is comprehensive and conducts EHPattacks using every Vulnerable API we identified, includingthe language level (regexp, JSON), framework level (all rel-evant APIs from the file system, DNS, cryptography, andcompression modules), and application level (infinite loops,long string operations, array sorting, etc.). This test suiteincludes each type of real EHP from our study of EHPvulnerabilities in npm modules (§IV-C). Node.cure defeats all92 EHP attacks in this suite: each Vulnerable synchronous API

12Specifically, Node.js v8.8.1 commit dc6bbb44da on Oct. 25, 2017.

10

Page 11: A Sense of Time for Node.js: Timeouts as a Cure for Event ...people.cs.vt.edu/davisjam/downloads/blog/ehp/... · the Event-Driven Architecture (EDA) championed by Node.js1. Although

throws a TimeoutError, and each Vulnerable asynchronousAPI returns a TimeoutError.

To finish our suite, we ported a vulnerable npm module toNode.cure. We selected the node-oniguruma [15] module,which offers asynchronous regexp APIs. Since the Onigurumaregexp engine is vulnerable to ReDoS, its APIs are Vulnerable,but they evaluate the regexps in a Task rather than in aCallback, poisoning a Worker rather than the Event Loop. Wemade node-oniguruma timeout-aware by using the modifiedWorker Pool uv_queue_work API (Table II). Node.cure nowdefends ReDoS attacks against this module’s Vulnerable APIs.

B. Costs

The Timeout Approach introduces runtime overheads,which we evaluated using micro-benchmarks and macro-benchmarks. It also may require some refactoring.

Overhead: Micro-benchmarks. Whether or not they timeout, Node.cure introduces several sources of overheads tomonitor Callbacks and Tasks. We evaluated the followingoverheads using micro-benchmarks:

1) Every time V8 checks for interrupts, it now tests for apending timeout as well.

2) Both the lazy and precise versions of the TimeoutWatch-dog must instrument every asynchronous Callback usingasync-hooks, with noticeable overhead dependent on thecomplexity of the Callback.

3) To ensure memory safety for Dangling Workers, Workersoperate on buffered data that must be allocated when theTask is submitted. For example, Workers must copy the I/Obuffers supplied to read and write twice.

New V8 interrupt. The overhead of our V8 Timeout inter-rupt was negligible, simply a test for one more interrupt inV8’s interrupt infrastructure.

TimeoutWatchdog’s async hooks. We measured the addi-tional cost of invoking a Callback due to TimeoutWatchdog’sasync hooks. A lazy TimeoutWatchdog increases the cost ofinvoking a Callback by 2.4x, and a precise TimeoutWatchdogincreases the cost by 7.9x. Of course, this overhead is mostnoticeable in empty Callbacks. As the number of instructionsin a Callback increases, the cost of the async-hooks amortizes.For example, if the Callback executes 500 empty loop itera-tions, the lazy overhead drops to 1.3x and the precise overheaddrops to 2.7x; at 10,000 empty loop iterations, the lazy andprecise overheads are 1.01x and 1.15x, respectively.

Worker buffering. Our timeout-aware Worker Pool requiresbuffering data to accommodate Dangling Workers, affectingDNS queries and file system I/O. While this overhead willvary from API to API, our micro-benchmark indicated a 1.3xoverhead using read and write calls with a 64KB buffer.

Overhead: Macro-benchmarks. Our micro-benchmarkssuggested that the overhead introduced by Node.cure mayvary widely depending on what an application is doing. Ap-plications that make little use of the Worker Pool will paythe overhead of the additional V8 interrupt (minimal) andthe TimeoutWatchdog’s async hooks, whose cost is stronglydependent on the size of the Callbacks. Applications that use

Benchmark Description Overheads (lazy, precise)LokiJS [12] Server, Key-value store 1.00, 1.00

Node Acme-Air [2] Server, Airline simulation 1.02, 1.03webtorrent [29] Server, P2P torrenting 1.02, 1.02

ws [30] Utility, websockets 1.00, 1.00*Three.js [26] Utility, graphics library 1.08, 1.09Express [6] Middleware 1.06, 1.24Sails [24] Middleware 1.14, 1.23*

Restify [23] Middleware 1.14, 1.63*Koa [10] Middleware 1.24, 1.60

Table III. Results of our macro-benchmark evaluation of Node.cure’soverhead. Where available, we used the benchmarks defined by the project

itself. Otherwise, we ran its test suite. Overheads are the ratio ofNode.cure’s performance to that of the baseline Node.js, averaged overseveral steady-state runs. We report the average overhead because we

observed no more than 3% standard deviation in all but LokiJS, whichaveraged 8% standard deviation across our samples of its sub-benchmarks.

*: Median of sub-benchmark overheads.

the Worker Pool will also pay the overhead of Worker buffering(variable, perhaps 1.3x).

We chose macro-benchmarks using a GitHub potpourritechnique: we searched GitHub for “language:JavaScript”,sorted by “Most starred”, and identified server-side projectsfrom the first 50 results. To add additional complete servers,we also included LokiJS [12], a popular Node.js-based key-value store, and IBM’s Acme-Air airline simulation [2].

Table III lists the macro-benchmarks we used and the per-formance overhead for each type of TimeoutWatchdog. Theseresults show that Node.cure introduces minimal overhead onreal server applications, and they confirm the value of a lazyTimeoutWatchdog. Matching our micro-benchmark assessmentof the TimeoutWatchdog’s async-hooks overhead, the overheadfrom Node.cure increased as the complexity of the Callbacksused in the macro-benchmarks decreased — the middlewarebenchmarks sometimes used empty Callbacks to handle clientrequests. In non-empty Callbacks like those of the real servers,this overhead is amortized.

Refactoring. Node.cure adds a TimeoutError to theJavaScript and Node.js specifications. See §IX.

C. Security Guarantees

We discussed the security guarantees of the TimeoutApproach in §VI-D. Here we discuss the precise securityguarantees of our Node.cure prototype. As described in §VII,we gave a sense of time to JavaScript (including interruptinglong-running regexp and JSON operations before their comple-tion) as well as to framework APIs identified by the Node.jsdevelopers as long-running (file system, DNS, cryptography,and compression). If any other language or framework APIsmight be long-running, Node.cure could be extended to givethem a sense of time as well.

Because the Timeout Approach is composable, application-level APIs composed of these timeout-aware language andframework APIs are also timeout-aware. However, Node.js alsopermits applications to add their own C++ bindings, and thesewill not be timeout-aware without refactoring.

To evaluate the extent of this limitation, we measuredhow frequently developers use C++ bindings. To this end,we examined a subset of the 550,155 modules in the npmecosystem as of 22 November 2017. We npm install’d125,173 modules (about 20%) using node v8.9.1 (npm v5.5.1),

11

Page 12: A Sense of Time for Node.js: Timeouts as a Cure for Event ...people.cs.vt.edu/davisjam/downloads/blog/ehp/... · the Event-Driven Architecture (EDA) championed by Node.js1. Although

resulting in 23,070 successful installations13. Of these, 695(3%) had C++ bindings14.

From this we conclude that the use of user-defined C++bindings is uncommon in npm modules, not surprising sincemany developers in this community come from client-sideJavaScript contexts and may not have strong C++ skills. Thisimplies, but does not directly indicate, the popularity of C++bindings in Node.js applications, as there may be applicationsthat use C++ bindings not published to npm. Nevertheless, weconclude that the need to refactor C++ bindings to be timeout-aware would not place a significant burden on the community.As a template, we performed several such refactorings our-selves in the Node.js framework and node-oniguruma.

IX. DISCUSSION

Questions abound: What should the EDA community dowithout first-class timeouts? Conversely, what might timeout-aware programs look like? How generalizable is the TimeoutApproach? And are there other ways to achieve EHP-safety?

What should the EDA community do without timeouts?Although they lack first-class timeouts, developers in the EDAcommunity must still be concerned about EHP attacks. Webelieve their best approach is the C-WCET Principle (§VI-C).

We have pursued a two-pronged approach to help develop-ers in the Node.js community follow the C-WCET Principle15.First, we have prepared a guide for the nodejs.org websiteproviding guidelines for building EHP-safe EDA-based appli-cations. This guide has received interest from key players inthe Node.js community and been preliminarily approved. Ourguide will sit on the nodejs.org website, where new Node.jsdevelopers go to learn best practices for Node.js development,and we believe that it will give developers important insightsinto secure Node.js programming practices.

Second, we studied the Node.js implementation andidentified several unnecessarily Vulnerable APIs inNode.js (fs.readFile, crypto.randomFill, andcrypto.randomBytes, all of which submit a singleunpartitioned Task). Our PRs warning developers aboutthe risks of EHP attacks using these APIs have beenmerged. We also submitted a trial pull request repairingthe simplest of these, by partitioning fs.readFile, whichhas been preliminarily approved after a discussion on theperformance-security tradeoff involved. In summary, theNode.js community seems open to learning more about EHPattacks and to seeing their framework and documentationhardened against them.

What might timeout-aware programs look like? Asdiscussed in §VII, applying the Timeout Approach whilepreserving the EDA changes the language and framework spec-ifications. In particular, all synchronous APIs may now throwa TimeoutError, and all asynchronous APIs may be fulfilledwith a TimeoutError. These changes have two implicationsfor application developers: Timeout-aware applications must

13Module installations failed for reasons including “No such module”,“Unsupported operating system”, and “Unsupported Node.js version”.

14We determined the presence of C++ bindings by searching a module’sdirectory tree for files with endings like “.h” and “.c[pp]”.

15We omit references to meet the double-blind requirement.

be able to tolerate arbitrary TimeoutErrors, and developersshould pursue low-variance Callbacks and Tasks.

Respond to TimeoutErrors. When a Callback or Taskdue to a client request times out, the application can attemptto recover, or can return an error to the client. The easiestpath is to return an error; any resources held on behalf ofthe client must be released. Refactoring along these lines willbe easier for applications that have minimal, well-structuredinteractions between client requests. Appropriate try-catchblocks should be added to Callbacks, and timeout-handlingcode should be added to completion Callbacks for Tasks.

Minimize variance. The Timeout Approach reduces the costof handling an EHP’s evil input from “arbitrarily long” to“the difference between the timeout threshold and the cost oflegitimate client requests”. A “tight” timeout threshold, oneclose to the average cost of legitimate client requests, willminimize the effect of an EHP attack: malicious requests willbe afforded roughly the same time as legitimate ones.

An ideal application will thus have a tight distribution ofLifelines’ synchronous times for legitimate client requests.If the standard deviation is small, with few or no longoutliers, then developers can configure a tight timeout withoutdiscomfiting legitimate users. On the other hand, applicationsin which the distribution of Lifelines’ synchronous times has along tail require configuring a more generous timeout, allowingmalicious users to occupy more of the server’s time anddegrade its throughput.

Developers can draw on existing tools for performancemeasurement to identify the distribution of synchronous times.They can use this to minimize variation in client requests, e.g.by applying the C-WCET Principle.

How generalizable is the Timeout Approach? TheTimeout Approach can be applied to any EDA framework.Callbacks must be made interruptible, and Tasks must be madeabortable. While these properties are more readily obtained inan interpreted language like JavaScript (applicable to frame-works in Python, Ruby, etc.), they could in principle be en-forced in compiled or VM-driven languages using techniqueslike dynamic instrumentation and binary rewriting.

Are there other avenues toward EHP-safety? In §VI wediscussed several EHP-safe designs. There are of course otherapproaches, like significantly increasing the size of the WorkerPool, performing speculative concurrent execution [46], orusing explicitly preemptable Callbacks and Tasks. However,each of these is a variation on the same theme: dedicatingresources to every client, i.e. the One Thread Per ClientArchitecture. The OTPCA is generally safe from EHP attacks,but only at the cost of scalability. If the server communitywishes to use the EDA, which offers high responsiveness andscalability through the use of cooperative multitasking, webelieve the Timeout Approach is a good path to EHP-safety.

X. RELATED WORK

Throughout the paper we have referred to specific relatedwork. Here we place our work in its broader context, situatedat the intersection of the Denial of Service literature and theEvent-Driven Architecture community.

12

Page 13: A Sense of Time for Node.js: Timeouts as a Cure for Event ...people.cs.vt.edu/davisjam/downloads/blog/ehp/... · the Event-Driven Architecture (EDA) championed by Node.js1. Although

Denial of Service attacks. Research on DoS can bebroadly divided into network-level attacks (e.g. distributed DoSattacks) and application-level attacks [37]. Since EHP attacksexploit the semantics of the application, they are application-level attacks, not easily defeated by network-level defenses.

DoS attacks seek to exhaust the resources critical to theproper operation of a server, and various kinds of exhaustionhave been considered. The brunt of the literature has focusedon exhausting the CPU. Some have examined the gap betweencommon-case and worst-case performance in algorithms anddata structures, for example in quicksort [76], hashing [51],and exponentially backtracking algorithms like ReDoS [50],[88] and rule matching [90]. Other CPU-centric DoS attackshave targeted more general programming issues, showing howto trigger infinite recursion [47] and how to trigger [91]or detect [42] infinite loops. But the CPU is not the onlyvulnerable resource, as shown by Olivo et al.’s work pollutingdatabases to increase the cost of queries [80]. We are not awareof prior research work that incurs DoS using the file system,as do our Large/Slow File ReadDoS attacks, though we havefound a handful of CVE reports to this effect16.

Our work identifies and shows how to exploit the mostlimited resource of the EDA: Event Handlers. The OTPCAapproach ensures that for production-grade web services, at-tackers must submit enough queries to overwhelm the operat-ing system’s preemptive scheduler, no mean feat. In contrast,in the EDA there are only a few Event Handlers that needbe poisoned. Although we prove our point using previously-reported attacks like ReDoS, the underlying resource we areexhausting is the small, fixed-size set of Event Handlersdeployed in EDA-based services.

Our work also takes a somewhat unusual tack in theapplication-level DoS space by providing a defense. Becauseapplication-level vulnerabilities rely on the semantics of theservice being exploited, defenses must often be preparedon a case-by-case basis. In consequence, many works onapplication-level DoS focus on providing detectors (e.g. [47],[91]), not defenses. When the application-level exploits takeadvantage of an underlying library, however, researchers haveoffered replacement libraries (e.g. [51]). Our Node.cure pro-totype shows how to achieve strong EHP safety by makingtimeouts a first-class member of Node.js.

Security in the EDA. Though researchers have studiedsecurity vulnerabilities in different EDA frameworks, mostprominently client-side applications, the security propertiesof the EDA itself have not been researched in detail. Someexamples include security studies on client-side JavaScrip-t/Web [72], [69], [53], [77] and Java/Android [58], [57], [41],[67] applications. These have focused on language issues (e.g.,eval in JavaScript [92]) or platform-specific issues (e.g.,DOM in web browsers [72]). However, there has been littleacademic engagement with the security aspects of server-sideEDA frameworks. Ojamaa et al. [79] performed a Node.js-specific high-level survey, and DeGroef et al. [54] proposeda method to securely integrate third-party modules from npm.Ours is the first to identify the EHP problem, and we have

16For Large Files, CVE-2001-0834, CVE-2008-1353, CVE-2011-1521, andCVE-2015-5295 mention DoS by ENOMEM using /dev/zero. For Slow Files(/dev/random), see CVE-2012-1987 and CVE-2016-6896.

taken a systems perspective to resolve it.

Synchronous EHP attacks. The server-side EDA practionercommunity is aware of the risk of DoS due to EHP on theEvent Loop. A common rule of thumb is “Don’t block theEvent Loop”, advised by many tutorials as well as recent booksabout EDA programming for Node.js [96], [45]. Wandschnei-der suggests worst-case linear-time partitioning on the EventLoop [96], while Casciaro advises developers to partition anycomputation on the Event Loop, and to offload computationallyexpensive tasks to the Worker Pool [45]. Our work offers amore rigorous taxonomy and evaluation of EHP attacks, andin particular we extend the idea of “Don’t block the EventLoop” to the Worker Pool.

We are not aware of research work discussing the useof the general EDA on the server side and the resultingEHP vulnerability. However, EHP attacks on the server sidecorrespond to performance bugs in client side EDA programs.Liu et al. studied such bugs [74], and Lin et al. proposedan automatic refactoring to offload expensive tasks from theEvent Loop to the Worker Pool [73]. This approach is soundon a client-side system like Android, where the demand forthreads is limited, but is not applicable to the server-side EDAbecause the small Worker Pool is shared among all clients. Forthe server-side EDA, Lin et al.’s approach would need to beextended to incorporate automatic C-WCET partitioning.

As future work, we believe that research into computationalcomplexity measurement ([85], [62], [43]) and estimation([81], [65]) might be adapted to the Node.js context for EHPvulnerability detection and automatic refactoring using the C-WCET Principle. This is made difficult, however, because ofthe complexity of statically analyzing JavaScript and the EDA([75], [39], [64], [97], [59]).

XI. CONCLUSION

The Event-Driven Architecture holds great promise forscalable web services, and it is increasingly popular in thesoftware development community. In this paper we defined anEvent Handler Poisoning attack, a novel class of DoS attackagainst the EDA that exploits the cooperative multitasking atits heart. We showed that there are EHP attacks hidden amongpreviously-reported vulnerabilities, and observed that the EDAcommunity is incautious about the risks of EHP attacks.

We proposed and evaluated two prototype defenses againstEHP attacks: the C-WCET Principle and the Timeout Ap-proach. Though we deemed the C-WCET Principle imprac-tical for strong security guarantees, our Timeout Approachprototype, Node.cure, can readily defend Node.js applicationsagainst all known EHP attacks. Everything needed to repro-duce our results is available on GitHub17.

Our findings can be directly applied by the EDA com-munity, and we hope they influence the design of existingand future EDA frameworks. With the rise of popularity inthe EDA, EHP attacks will become an increasingly criticalDoS vector, a vector that can be defeated using the TimeoutApproach.

17We omit the reference to meet the double-blind requirement.

13

Page 14: A Sense of Time for Node.js: Timeouts as a Cure for Event ...people.cs.vt.edu/davisjam/downloads/blog/ehp/... · the Event-Driven Architecture (EDA) championed by Node.js1. Although

REFERENCES

[1] “ab – apache http server benchmarking tool,”https://httpd.apache.org/docs/2.4/programs/ab.html.

[2] “acmeair-node,” https://github.com/acmeair/acmeair-nodejsh.[3] “Chrome V8.” [Online]. Available: https://developers.google.com/v8/[4] “Cylon.js,” https://cylonjs.com/.[5] “EventMachine.” [Online]. Available: http://rubyeventmachine.com/[6] “express,” https://github.com/expressjs/express.[7] “Github,” https://github.com/.[8] “Gnu libc – posix safety concepts,”

https://www.gnu.org/software/libc/manual/html node/POSIX-Safety-Concepts.html.

[9] “iot-nodejs,” https://github.com/ibm-watson-iot/iot-nodejs.[10] “Koa,” https://github.com/koajs/koa.[11] “libuv,” https://github.com/libuv/libuv.[12] “Lokijs,” https://github.com/techfort/LokiJS.[13] “Module Counts.” [Online]. Available: http://www.modulecounts.com/[14] “Mozilla developer network: Javascript,”

https://developer.mozilla.org/en-US/docs/Web/JavaScript.[15] “Node-oniguruma regexp library,” https://github.com/atom/node-

oniguruma.[16] “Node security platform,” https://nodesecurity.io/advisories.[17] “Node.js,” http://nodejs.org/.[18] “Nodejs async hooks,” https://nodejs.org/api/async hooks.html.[19] “Node.js developer guides,” https://nodejs.org/en/docs/guides/.[20] “Node.js thread pool documentation,”

http://docs.libuv.org/en/v1.x/threadpool.html.[21] “Node.js v8.9.1 documentation,” https://nodejs.org/dist/latest-

v8.x/docs/api/.[22] “Re2 regular expression engine,” https://github.com/google/re2.[23] “restify,” https://github.com/restify/node-restify.[24] “sails,” https://github.com/balderdashy/sails.[25] “Snyk.io,” https://snyk.io/vuln/.[26] “three.js,” https://github.com/mrdoob/three.js.[27] “Twisted.” [Online]. Available: https://twistedmatrix.com/trac/

https://github.com/twisted/twisted[28] “Vert.x,” http://vertx.io/.[29] “webtorrent,” https://github.com/webtorrent/webtorrent.[30] “ws: a node.js websocket library,” https://github.com/websockets/ws.[31] “Zotonic,” http://zotonic.com/.[32] “The Calendar and Contacts Server,” https://github.com/Apple/Ccs-

calendarserver, 2007.[33] “Ubuntu One: Technical Details,” https://wiki.ubuntu.com/UbuntuOne/

TechnicalDetails, 2012.[34] “New Node.js Foundation Survey Reports New “Full Stack” In

Demand Among Enterprise Developers,” 2016. [Online]. Available:https://nodejs.org/en/blog/announcements/nodejs-foundation-survey/

[35] “Posix poll specification,” http://pubs.opengroup.org/onlinepubs/9699919799/functions/poll.html, 2016.

[36] “Random(4),” http://man7.org/linux/man-pages/man4/random.4.html,2017.

[37] M. Abliz, “Internet Denial of Service Attacks and Defense Mecha-nisms,” Tech. Rep., 2011.

[38] S. Alimadadi, A. Mesbah, and K. Pattabiraman, “Understanding Asyn-chronous Interactions in Full-Stack JavaScript,” ICSE, 2016.

[39] S. Arzt, S. Rasthofer, C. Fritz, E. Bodden, A. Bartel, J. Klein,Y. Le Traon, D. Octeau, and P. McDaniel, “Flowdroid: Precise context,flow, field, object-sensitive and lifecycle-aware taint analysis for androidapps,” in PLDI, 2014.

[40] B. Atkinson, Hypercard. Apple Computer, 1988.[41] D. Barrera, H. G. Kayacik, P. C. van Oorschot, and A. Somayaji, “A

methodology for empirical analysis of permission-based security modelsand its application to android,” in CCS, 2010.

[42] J. Burnim, N. Jalbert, C. Stergiou, and K. Sen, “Looper: Lightweightdetection of infinite loops at runtime,” in International Conference onAutomated Software Engineering (ASE), 2009.

[43] J. Burnim, S. Juvekar, and K. Sen, “WISE: Automated Test Generationfor Worst-Case Complexity,” in International Conference on SoftwareEngineering (ICSE), 2009.

[44] E. Cambiaso, G. Papaleo, and M. Aiello, “Taxonomy of Slow DoSAttacks to Web Applications,” Recent Trends in Computer Networksand Distributed Systems Security, pp. 195–204, 2012.

[45] M. Casciaro, Node.js Design Patterns, 1st ed., 2014.[46] G. Chadha, S. Mahlke, and S. Narayanasamy, “Accelerating Asyn-

chronous Programs Through Event Sneak Peek,” in International Sym-posium on Computer Architecture (ISCA), 2015.

[47] R. Chang, G. Jiang, F. Ivancic, S. Sankaranarayanan, and V. Shmatikov,“Inputs of coma: Static detection of denial-of-service vulnerabilities,”in IEEE Computer Security Foundations Symposium (CSF), 2009.

[48] E. Corry, C. P. Hansen, and L. R. H. Nielsen,“Irregexp, Google Chrome’s New Regexp Implementation,”2009. [Online]. Available: https://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html

[49] R. Cox, “Regular Expression Matching Can Be Simple And Fast(but is slow in Java, Perl, PHP, Python, Ruby, ...),” 2007. [Online].Available: https://swtch.com/ rsc/regexp/regexp1.html

[50] S. Crosby, “Denial of service through regular expressions,” USENIXSecurity work in progress report, 2003.

[51] S. A. Crosby and D. S. Wallach, “Denial of Service via AlgorithmicComplexity Attacks,” in USENIX Security, 2003.

[52] J. Davis, A. Thekumparampil, and D. Lee, “Node.fz: Fuzzing theServer-Side Event-Driven Architecture,” in European Conference onComputer Systems (EuroSys), 2017.

[53] W. De Groef, D. Devriese, N. Nikiforakis, and F. Piessens, “Flowfox:A web browser with flexible and precise information flow control,” ser.CCS, 2012.

[54] W. De Groef, F. Massacci, and F. Piessens, “NodeSentry: Least-privilegelibrary integration for server-side JavaScript,” in Annual ComputerSecurity Applications Conference (ACSAC), 2014.

[55] A. Desai, V. Gupta, E. Jackson, S. Qadeer, S. Rajamani, and D. Zufferey,“P: Safe asynchronous event-driven programming,” in ACM SIGPLANConference on Programming Language Design and Implementation(PLDI), 2013.

[56] S. C. Drye and W. C. Wake, Java Swing reference. ManningPublications Company, 1999.

[57] W. Enck, D. Octeau, P. McDaniel, and S. Chaudhuri, “A study ofandroid application security,” in USENIX Security, 2011.

[58] W. Enck, M. Ongtang, and P. McDaniel, “Understanding androidsecurity,” IEEE Security and Privacy, 2009.

[59] Y. Feng, S. Anand, I. Dillig, and A. Aiken, “Apposcopy: Semantics-based detection of android malware through static analysis,” in FSE,2014.

[60] S. Ferg, “Event-driven programming: introduction, tutorial, history,”2006. [Online]. Available: http://eventdrivenpgm.sourceforge.net/

[61] A. S. Foundation, “The Apache web server.” [Online]. Available:http://www.apache.org

[62] S. F. Goldsmith, A. S. Aiken, and D. S. Wilkerson, “Measuring Empiri-cal Computational Complexity,” in Foundations of Software Engineering(FSE), 2007.

[63] D. Goodman and P. Ferguson, Dynamic HTML: The Definitive Refer-ence, 1st ed. O’Reilly, 1998.

[64] M. I. Gordon, D. Kim, J. Perkins, L. Gilhamy, N. Nguyenz, , and M. Ri-nard, “Information flow analysis of android applications in droidsafe.”in NDSS, 2015.

[65] S. Gulwani, K. K. Mehra, and T. Chilimbi, “SPEED: Precise andEfficient Static Estimation of Program Computational Complexity,” inPrinciples of Programming Languages (POPL), 2009.

[66] J. Harrell, “Node.js at PayPal,” 2013. [Online]. Available:https://www.paypal-engineering.com/2013/11/22/node-js-at-paypal/

14

Page 15: A Sense of Time for Node.js: Timeouts as a Cure for Event ...people.cs.vt.edu/davisjam/downloads/blog/ehp/... · the Event-Driven Architecture (EDA) championed by Node.js1. Although

[67] S. Heuser, A. Nadkarni, W. Enck, and A.-R. Sadeghi, “Asm: Aprogrammable interface for extending android security,” in Proceedingsof the 23rd USENIX Conference on Security Symposium, ser. SEC’14,2014.

[68] IBM, “Node-RED.” [Online]. Available: https://nodered.org/[69] X. Jin, X. Hu, K. Ying, W. Du, H. Yin, and G. N. Peri, “Code injection

attacks on html5-based mobile apps: Characterization, detection andmitigation,” in CCS, 2014.

[70] J. Kirrage, A. Rathnayake, and H. Thielecke, “Static Analysis forRegular Expression Denial-of-Service Attacks,” Network and SystemSecurity, vol. 7873, pp. 35–148, 2013.

[71] H. C. Lauer and R. M. Needham, “On the duality of operating systemstructures,” ACM SIGOPS Operating Systems Review, vol. 13, no. 2,pp. 3–19, 1979.

[72] S. Lekies, B. Stock, and M. Johns, “25 million flows later: Large-scaledetection of dom-based xss,” in CCS, 2013.

[73] Y. Lin, C. Radoi, and D. Dig, “Retrofitting Concurrency for AndroidApplications through Refactoring,” in ACM International Symposiumon Foundations of Software Engineering (FSE), 2014.

[74] Y. Liu, C. Xu, and S.-C. Cheung, “Characterizing and detecting perfor-mance bugs for smartphone applications,” in International Conferenceon Software Engineering (ICSE), 2014.

[75] M. Madsen, F. Tip, and O. Lhotak, “Static analysis of event-drivenNode. js JavaScript applications,” in ACM SIGPLAN InternationalConference on Object-Oriented Programming, Systems, Languages, andApplications (OOPSLA). ACM, 2015.

[76] M. D. Mcilroy, “Killer adversary for quicksort,” Software - Practiceand Experience, vol. 29, no. 4, pp. 341–344, 1999.

[77] N. Nikiforakis, L. Invernizzi, A. Kapravelos, S. Van Acker, W. Joosen,C. Kruegel, F. Piessens, and G. Vigna, “You are what you include:Large-scale evaluation of remote javascript inclusions,” in CCS, 2012.

[78] J. O’Dell, “Exclusive: How LinkedIn used Node.js and HTML5to build a better, faster app,” 2011. [Online]. Available:http://venturebeat.com/2011/08/16/linkedin-node/

[79] A. Ojamaa and K. Duuna, “Assessing the security of Node.js platform,”in 7th International Conference for Internet Technology and SecuredTransactions (ICITST), 2012, pp. 348–355.

[80] O. Olivo, I. Dillig, and C. Lin, “Detecting and Exploiting SecondOrder Denial-of-Service Vulnerabilities in Web Applications,” ACMConference on Computer and Communications Security (CCS), 2015.

[81] ——, “Static Detection of Asymptotic Performance Bugs in CollectionTraversals,” PLDI, 2015.

[82] S. Padmanabhan, “How We Built eBaysFirst Node.js Application,” 2013. [Online]. Avail-

able: http://www.ebaytechblog.com/2013/05/17/how-we-built-ebays-first-node-js-application/

[83] V. S. Pai, P. Druschel, and W. Zwaenepoel, “Flash: An Efficient andPortable Web Server,” in USENIX Annual Technical Conference (ATC),1999.

[84] D. Pariag, T. Brecht, A. Harji, P. Buhr, A. Shukla, and D. R. Cheriton,“Comparing the performance of web server architectures,” in EuropeanConference on Computer Systems (EuroSys). ACM, 2007.

[85] P. P. Puschner and C. Koza, “Calculating the Maximum Execution Timeof Real-Time Programs,” Real-Time Systems, vol. 1, no. 2, pp. 159–176,1989.

[86] A. Rathnayake and H. Thielecke, “Static Analysis for Regular Expres-sion Exponential Runtime via Substructural Logics,” CoRR, 2014.

[87] R. Rogers, J. Lombardo, Z. Mednieks, and B. Meike, Android applica-tion development: Programming with the Google SDK, 1st ed. O’ReillyMedia, Inc., 2009.

[88] A. Roichman and A. Weidman, “VAC - ReDoS: Regular ExpressionDenial Of Service,” Open Web Application Security Project (OWASP),2009.

[89] A. Silberschatz, P. B. Galvin, and G. Gagne, Operating System Con-cepts, 9th ed. Wiley Publishing, 2012.

[90] R. Smith, C. Estan, and S. Jha, “Backtracking Algorithmic ComplexityAttacks Against a NIDS,” in ACSAC, 2006, pp. 89–98.

[91] S. Son and V. Shmatikov, “SAFERPHP Finding Semantic Vulnerabil-ities in PHP Applications,” in Workshop on Programming Languagesand Analysis for Security (PLAS), 2011, pp. 1–13.

[92] C.-A. Staicu, M. Pradel, and B. Livshits, “Understanding and Automat-ically Preventing Injection Attacks on Node.js,” Tech. Rep., 2016.

[93] B. Sullivan, “Security Briefs - Regular Expression Denial of ServiceAttacks and Defenses,” Tech. Rep., 2010. [Online]. Available:https://msdn.microsoft.com/en-us/magazine/ff646973.aspx

[94] R. E. Sweet, “The Mesa programming environment,” ACM SIGPLANNotices, vol. 20, no. 7, pp. 216–229, 1985.

[95] B. Van Der Merwe, N. Weideman, and M. Berglund, “TurningEvil Regexes Harmless,” in SAICSIT, 2017. [Online]. Available:https://doi.org/10.1145/3129416.3129440

[96] M. Wandschneider, Learning Node.js: A Hands-on Guide to BuildingWeb Applications in JavaScript. Pearson Education, 2013.

[97] F. Wei, S. Roy, X. Ou, and Robby, “Amandroid: A precise and generalinter-component data flow analysis framework for security vetting ofandroid apps,” in Proceedings of the 2014 ACM SIGSAC Conferenceon Computer and Communications Security, ser. CCS ’14, 2014.

[98] M. Welsh, D. Culler, and E. Brewer, “SEDA : An Architecture for Well-Conditioned, Scalable Internet Services,” in Symposium on OperatingSystems Principles (SOSP), 2001.

15